Similar Documents
20 similar documents found (search time: 171 ms)
1.
Document similarity search finds documents similar to a given query document and returns a ranked list of similar documents to users; it is widely used in many text and Web systems, such as digital libraries and search engines. Traditional retrieval models, including Okapi's BM25 model and Smart's vector space model with length normalization, can handle this problem to some extent by treating the query document as a long query. In practice, the Cosine measure is considered the best model for document similarity search because of its good ability to measure the similarity between two documents. In this paper, the quantitative performances of the above models are compared experimentally. Because the Cosine measure cannot reflect the structural similarity between documents, a new retrieval model based on TextTiling is proposed. The proposed model takes the subtopic structure of documents into account: it first splits the documents into text segments with TextTiling and computes the similarities of different pairs of text segments; the overall similarity between two documents is then obtained by combining the segment-pair similarities with the optimal matching method. Experimental results show that: 1) the popular retrieval models (Okapi's BM25 model and Smart's vector space model with length normalization) do not perform well for document similarity search; 2) the proposed TextTiling-based model is effective and outperforms the other models, including the Cosine measure; 3) the methods chosen for the three components of the proposed model are appropriate.
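The Cosine measure discussed above can be sketched in a few lines. This is a minimal illustration over raw term-frequency vectors; a real system would use TF-IDF weighting and proper tokenization:

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine measure between two documents represented as term-frequency vectors."""
    tf_a, tf_b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    # Dot product only needs the terms the two documents share.
    dot = sum(tf_a[t] * tf_b[t] for t in set(tf_a) & set(tf_b))
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Identical documents score 1.0 and documents with no shared terms score 0.0, which is exactly the property that makes the measure attractive for whole-document comparison but blind to subtopic structure.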

2.
EDCMS: A Content Management System for Engineering Documents   (Cited by: 1; self-citations: 0; external citations: 1)
Engineers often need to look for the right pieces of information by sifting through long engineering documents, which is a tiring and time-consuming job. To address this issue, researchers are increasingly devoting their attention to new ways to help information users, including engineers, access and retrieve document content. The research reported in this paper explores how to use the key technologies of document decomposition (study of document structure), document markup (with the eXtensible Markup Language (XML), HyperText Markup Language (HTML), and Scalable Vector Graphics (SVG)), and a faceted classification mechanism. Document content extraction is implemented in Java. An Engineering Document Content Management System (EDCMS) developed in this research demonstrates that, as information providers, we can make document content more accessible to information users, including engineers. The main features of the EDCMS are: 1) it enables users, especially engineers, to access and retrieve information at the content rather than the document level; in other words, it provides the right pieces of information that answer specific questions, so engineers need not waste time sifting through a whole document to obtain a required piece of information; 2) users can access engineering document content via both the data and the metadata of a document; 3) users can access and retrieve content objects, i.e., text, images, and graphics (including engineering drawings), via multiple views and at different granularities based on decomposition schemes. Experiments with the EDCMS have been conducted on semi-structured documents: a CADCAM textbook and a set of project posters in the engineering design domain. Experimental results show that the system provides information users with a powerful way to access document content.
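Document decomposition of the kind described, splitting marked-up content into separately addressable objects, can be illustrated with a small sketch. The element-path scheme below is illustrative only, not EDCMS's actual decomposition scheme:

```python
import xml.etree.ElementTree as ET

def extract_content_objects(xml_text):
    """Decompose an XML document into (element-path, text) content objects."""
    root = ET.fromstring(xml_text)
    objects = []

    def walk(node, path):
        p = f"{path}/{node.tag}"
        if node.text and node.text.strip():
            objects.append((p, node.text.strip()))
        for child in node:
            walk(child, p)

    walk(root, "")
    return objects
```

Each recovered object can then be indexed and retrieved on its own, which is what allows access at the content rather than the document level.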

3.
A Workflow Process Mining Algorithm Based on Synchro-Net   (Cited by: 5; self-citations: 0; external citations: 5; free PDF available)
Historic information about workflow execution is sometimes needed to analyze business processes. Process mining aims at extracting information from event logs to capture a business process as it is executed. In this paper, a process mining algorithm is proposed based on Synchro-Net, a synchronization-based model of workflow logic and workflow semantics. With this mining algorithm, problems such as invisible tasks and short loops can be handled with ease. A process mining example is presented to illustrate the algorithm, together with an evaluation.
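The Synchro-Net algorithm itself is not reproduced here, but the first step of most process mining algorithms, extracting direct-succession relations from an event log, can be sketched as:

```python
from collections import defaultdict

def direct_succession(log):
    """Collect a -> b relations where task b directly follows task a in some trace."""
    follows = defaultdict(set)
    for trace in log:
        for a, b in zip(trace, trace[1:]):
            follows[a].add(b)
    return dict(follows)
```

From such relations a mining algorithm reconstructs the control-flow model; it is precisely constructs like invisible tasks and short loops, which leave ambiguous traces in these relations, that the Synchro-Net-based approach is designed to handle.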

4.
With the rapid growth of information and knowledge, automatic text document classification is becoming a hotspot of knowledge management. A critical capability of knowledge management systems is classifying text documents into categories that are meaningful to users. In this paper, a text topic classification model based on a domain ontology and the vector space model is proposed. The eigenvectors that serve as input to the vector space model are constructed from the concepts and hierarchical structure of the ontology, which also provides the domain knowledge. However, a limited-vocabulary problem is encountered when mapping keywords to their corresponding ontology concepts. A synonymy lexicon is therefore used to extend the ontology and compress the eigenvector, which solves the problem that eigenvectors are too large and complex to be computed by traditional methods. Finally, combining the degree of concept support, a top-down method following the ontology structure completes the topic classification. An experimental system is implemented and the model is applied to this practical system. Test results show that the model is feasible.
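The keyword-to-concept mapping through a synonymy lexicon can be sketched as follows; the toy lexicon and concept names are invented for illustration, and the paper's actual concept weighting may differ:

```python
def map_to_concepts(keywords, concept_synonyms):
    """Map document keywords to ontology concepts through a synonymy lexicon.

    The resulting vector has one dimension per concept, compressing the
    keyword eigenvector down to the (much smaller) concept space.
    """
    return {
        concept: sum(1 for k in keywords if k in synonyms)
        for concept, synonyms in concept_synonyms.items()
    }
```

Because many surface keywords collapse into one concept dimension, the vector stays small even for a large vocabulary, which is the compression effect the abstract describes.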

5.
《信息安全学报》(Journal of Cyber Security), 2017: 96-108
Compared with macro-based malicious Office documents, malicious Office documents based on vulnerability exploitation often require no target interaction during the attack and can complete the attack without the target's awareness; they have become an important means of Advanced Persistent Threat (APT) attacks. Therefore, detecting malicious documents based on vulnerability exploitation, especially exploitation of unknown vulnerabilities, plays an important role in discovering APT attacks. Current malicious-document detection methods focus mainly on PDF documents and fall into two categories: static analysis and dynamic analysis. Static analysis is easily evaded by attackers and cannot discover exploits triggered by remote payloads. Dynamic analysis considers only the behavior of JavaScript in the PDF or of the document reader's process, ignoring indirect attacks against other processes of the system, which leads to a detection blind spot. To solve these problems, we analyze the attack surface of malicious Office documents, propose a threat model, and implement an unknown-malicious-document detection method based on global behavior features. During document processing, whole-system behavior features are extracted, and only benign document samples are used for training to build a behavior feature database for malicious-document detection. To reduce the false alarm rate, sensitive behavior features are introduced into the detection. In this paper, 522 benign documents, including DOCX, RTF, and DOC files, are used to build the behavior feature database, and 2088 benign document samples and 211 malicious document samples are then tested; 10 of the malicious samples are manually crafted to simulate several typical attack scenarios. The experimental results show that the method detects all malicious samples with a very low false positive rate (0.14%) and is able to detect malicious documents that exploit unknown vulnerabilities. Further experiments show that the method can also detect malicious documents targeting WPS Office software. © 2023 Chinese Academy of Sciences. All rights reserved.

6.
Entity linking (EL) systems aim to link entity mentions in a document to their corresponding entity records in a reference knowledge base. Existing EL approaches usually ignore the semantic correlation between the mentions in the text and are limited by the scale of the local knowledge base. In this paper, we propose a novel graph-ranking collective Chinese entity linking (GRCCEL) algorithm, which can take advantage of both the structured relationships between entities in the local knowledge base and the additional background information offered by external knowledge sources. With an improved weighted word2vec textual similarity and an improved PageRank algorithm, more semantic and structural information can be captured in the document. With an incremental evidence-mining process, more powerful discrimination between similar entities can be obtained. We evaluate the performance of our algorithm on several open-domain corpora. Experimental results show the effectiveness of our method on the Chinese entity linking task and demonstrate its superiority over state-of-the-art methods.
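The abstract does not specify GRCCEL's improvements to PageRank, but the plain PageRank iteration it builds on can be sketched as:

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Plain PageRank over an adjacency dict {node: [out-neighbors]}."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, outs in graph.items():
            if outs:
                share = damping * rank[n] / len(outs)
                for m in outs:
                    new[m] += share
            else:  # dangling node: distribute its rank evenly
                for m in nodes:
                    new[m] += damping * rank[n] / len(nodes)
        rank = new
    return rank
```

In a collective linking setting the nodes would be candidate entities and the edges their knowledge-base relations, so that candidates supported by many well-ranked neighbors rise to the top.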

7.
The performance of smart structures in trajectory tracking at the sub-micron level is hindered by rate-dependent hysteresis nonlinearity. In this paper, a Hammerstein-like model based on support vector machines (SVM) is proposed to capture the rate-dependent hysteresis nonlinearity. We show that it is possible to construct a unique dynamic model in a given frequency range for a rate-dependent hysteresis system by using sinusoidal scanning signals as the training set for the linear dynamic subsystem of the Hammerstein-like model. Subsequently, a two-degree-of-freedom (2DOF) H∞ robust control scheme for the rate-dependent hysteresis nonlinearity is implemented on a smart structure with a piezoelectric actuator (PEA) for real-time precision trajectory tracking. Simulations and experiments on the structure verify both the effectiveness and the practicality of the proposed modeling and control methods.

8.
Semantic-based searching in peer-to-peer (P2P) networks has drawn significant attention recently. A number of semantic searching schemes, such as GES proposed by Zhu Y et al., employ search models in Information Retrieval (IR). All these IR-based schemes use one vector to summarize semantic contents of all documents on a single node. For example, GES derives a node vector based on the IR model: VSM (Vector Space Model). A topology adaptation algorithm and a search protocol are then designed according to the similarity between node vectors of different nodes. Although the single semantic vector is suitable when the distribution of documents in each node is uniform, it may not be efficient when the distribution is diverse. When there are many categories of documents at each node, the node vector representation may be inaccurate. We extend the idea of GES and present a new class-based semantic searching scheme (CSS) specifically designed for unstructured P2P networks with heterogeneous single-node document collections. It makes use of a state-of-the-art data clustering algorithm, online spherical k-means clustering (OSKM), to cluster all documents on a node into several classes. Each class can be viewed as a virtual node. Virtual nodes are connected through virtual links. As a result, the class vector replaces the node vector and plays an important role in the class-based topology adaptation and search process. This makes CSS very efficient. Our simulation using the IR benchmark TREC collection demonstrates that CSS outperforms GES in terms of higher recall, higher precision, and lower search cost.
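A minimal sketch of online spherical k-means (unit-length vectors, cosine assignment, incremental centroid updates); the initialization and learning-rate schedule of the actual OSKM algorithm are more refined than this:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def cos(u, v):
    return sum(a * b for a, b in zip(u, v))

def online_spherical_kmeans(docs, k, passes=5, eta=0.05):
    """Online spherical k-means: keep vectors on the unit sphere, assign each
    document to the centroid with the highest cosine, then nudge that
    centroid toward the document and renormalize it."""
    docs = [normalize(d) for d in docs]
    centroids = [list(d) for d in docs[:k]]  # simple deterministic init
    for _ in range(passes):
        for d in docs:
            best = max(range(k), key=lambda i: cos(centroids[i], d))
            centroids[best] = normalize([c + eta * x for c, x in zip(centroids[best], d)])
    labels = [max(range(k), key=lambda i: cos(centroids[i], d)) for d in docs]
    return labels, centroids
```

Each resulting class centroid then serves as the "class vector" of a virtual node in the CSS topology.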

9.
A Fuzzy Approach to Classification of Text Documents   (Cited by: 1; self-citations: 0; external citations: 1; free PDF available)
This paper discusses classification problems for text documents. Based on the concept of the proximity degree, the set of words is partitioned into equivalence classes. In particular, the concepts of the semantic field and the association degree are introduced. Based on these concepts, the paper presents a fuzzy classification approach for document categorization. Furthermore, by applying the concept of information entropy, approaches are obtained for selecting key words from the set of words covering the classification of documents and for constructing the hierarchical structure of key words.
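Entropy-based key-word selection of the kind described can be sketched as follows: a word spread evenly across classes has high entropy and is discarded, while a word concentrated in a few classes is kept. The threshold value is illustrative:

```python
import math

def entropy(distribution):
    """Shannon entropy of a word's occurrence counts across document classes."""
    total = sum(distribution)
    probs = [c / total for c in distribution if c]
    return -sum(p * math.log2(p) for p in probs)

def select_keywords(word_class_counts, max_entropy):
    """Keep words whose class distribution is concentrated (low entropy)."""
    return [w for w, counts in word_class_counts.items() if entropy(counts) <= max_entropy]
```

A function word like "the" appears uniformly everywhere (entropy near the maximum), while a topical word clusters in one class, so an entropy cutoff cheaply separates the two.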

10.
Object-oriented modeling with declarative equation based languages often unconsciously leads to structural inconsistencies. Component-based debugging is a new structural analysis approach that addresses this problem by analyzing the structure of each component in a model to separately locate faulty components. The analysis procedure is performed recursively based on the depth-first rule. It first generates fictitious equations for a component to establish a debugging environment, and then detects structural defects by using graph theoretical approaches to analyzing the structure of the system of equations resulting from the component. The proposed method can automatically locate components that cause the structural inconsistencies, and show the user detailed error messages. This information can be a great help in finding and localizing structural inconsistencies, and in some cases pinpoints them immediately.

11.
Kernel-Based Automatic Classification of XML Documents   (Cited by: 3; self-citations: 0; external citations: 3)
Yang Jianwu, 《计算机学报》 (Chinese Journal of Computers), 2011, 34(2): 353-359
The support vector machine (SVM) method solves classifier construction by mapping data into a feature space through a kernel function and constructing an optimal separating hyperplane; it has clear advantages in automatic text classification. An XML document combines textual content with structural information and, as a new form of data, has become a current research focus. Building on the structured link vector model, this paper studies SVM-based automatic classification of XML documents and proposes a kernel suited to XML document classification…

12.
In information processing, most existing text classification algorithms are based on the vector space model, which cannot effectively express the structural information of a document and therefore cannot fully express its semantic information. To express document semantics more effectively, this paper first proposes a new document representation model, a graph model, in which a weighted labeled graph expresses a document's feature terms and their positional relationships. On this basis, the paper then proposes a new document similarity measure and applies it to Chinese text classification. Experimental results show that this graph-based document representation is effective and feasible.

13.
Biblio is an adaptive system that automatically extracts metadata from semi-structured and structured scanned documents. Instead of using hand-coded templates or other methods manually customized for each given document format, it uses example-based machine learning to adapt to customer-defined document and metadata types. We provide results from experiments on the recognition of document information in two document corpora: a set of scanned journal articles and a set of scanned legal documents. The first set is semi-structured, as the different journals use a variety of flexible layouts. The second set is largely free-form text based on poor-quality scans of FAX-quality legal documents. We demonstrate accuracy on the semi-structured document set roughly comparable to hand-coded systems, and much worse performance on the legal documents.

14.
A document structural similarity measure based on mining frequent subtrees with the TreeMiner algorithm is proposed, addressing the high computational cost of traditional edit-distance methods and the inability of path-matching methods to handle repeated labels. The method builds a new retrieval model, the frequent-structure vector model; gives the structural vector representation of documents and a weight function; and constructs a formula for computing structural similarity between XML documents. The TreeMiner algorithm is also improved in both its data structures and its mining procedure, making it better suited to structure mining over large document sets. Experimental results show that the method achieves high precision and accuracy.

15.
Text Similarity Computation Based on Compressed Sparse Matrix-Vector Multiplication   (Cited by: 4; self-citations: 0; external citations: 4)
Building on the vector model of information retrieval, a text similarity computation method based on compressed sparse matrix-vector multiplication is proposed; it retains the vector model's simplicity and speed. The method stores data in a compressed sparse matrix-vector space, so zero elements of the text vector matrix need not be considered during similarity computation or storage, greatly reducing both computation and storage space and thus significantly improving the efficiency of the information retrieval system. Simulation experiments show that the method saves 38% of the storage space required by the traditional inverted-index mechanism based on the vector model.
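The gain comes from skipping zero entries entirely. A minimal matrix-vector multiply over the compressed sparse row (CSR) layout, one common compressed-sparse format, looks like this:

```python
def csr_matvec(data, indices, indptr, x):
    """Multiply a CSR-stored sparse matrix by a dense vector.

    data    -- nonzero values, row by row
    indices -- column index of each nonzero value
    indptr  -- indptr[r]:indptr[r+1] delimits row r's nonzeros
    Only the stored nonzeros are ever touched.
    """
    y = []
    for row in range(len(indptr) - 1):
        start, end = indptr[row], indptr[row + 1]
        y.append(sum(data[i] * x[indices[i]] for i in range(start, end)))
    return y
```

For term-document matrices, which are overwhelmingly sparse, this makes both the storage and the per-query similarity computation proportional to the number of nonzeros rather than the full matrix size.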

16.
Ontology-Based Web Text Mining and Information Retrieval   (Cited by: 1; self-citations: 0; external citations: 1)
Ai Wei, Sun Siming, Zhang Feng, 《计算机工程》 (Computer Engineering), 2010, 36(22): 75-77
To address the lack of semantic understanding in traditional Web text mining, an ontology-based Web text mining model is proposed and implemented: a vector space model built on the ontology's concept hierarchy replaces the traditional vector space model for representing documents, Web text mining is performed on this basis, and a design for integrated semantic information retrieval is given. Experimental results provide an initial validation of the feasibility of applying the ontology model to Web text mining.

17.
Learning element similarity matrix for semi-structured document analysis   (Cited by: 3; self-citations: 3; external citations: 0)
Capturing latent structural and semantic properties in semi-structured documents (e.g., XML documents) is crucial for improving the performance of related document analysis tasks. The Structured Link Vector Model (SLVM) is a recently proposed representation for modeling semi-structured documents. It uses an element similarity matrix to capture the latent relationships between XML elements, the building blocks of an XML document. In this paper, instead of applying heuristics to define the element similarity matrix, we propose to compute the matrix with a machine learning approach. In addition, we incorporate term semantics into SLVM using latent semantic indexing to enhance model accuracy, while preserving the learnability of the element similarity matrix. For performance evaluation, we applied the similarity learning to k-nearest-neighbor search and similarity-based clustering, and tested the performance on two different XML document collections. The SLVM obtained via learning was found to significantly outperform the conventional vector space model and edit-distance-based methods. The similarity matrix, obtained as a by-product, can also provide higher-level knowledge of the semantic relationships between XML elements.
Xiaoou Chen

18.
To address the problem that inefficient search and use of knowledge in emergency documents prevents emergency decision makers from making decisions quickly and effectively, this paper models emergency document knowledge structurally from a knowledge systems engineering perspective, combined with knowledge element theory, providing a new way for decision makers to use emergency document knowledge quickly and effectively. Metadata are extracted and documents are structured through analysis of the physical structure; a knowledge element extraction method is proposed through analysis of the logical structure; knowledge element navigation links establish associations between knowledge and the structured documents for knowledge reasoning and retrieval; and fine-grained knowledge mining patterns for emergency documents are discussed in depth. Finally, a prototype emergency decision knowledge support system is developed and validated; the results show that the modeling method effectively solves the problem of inefficient search and use of emergency document knowledge.

19.
A PCA-Based Feature Extraction Method for XML Documents   (Cited by: 1; self-citations: 0; external citations: 1)
Guo Lihong, Wang Jian, 《计算机工程与设计》 (Computer Engineering and Design), 2011, 32(11): 3894-3896, 3911
To better support classification and clustering of XML documents, guided by the theory of principal component analysis and building on a study of various text representation models, two methods are proposed for vectorizing XML documents and extracting their features, which also achieve effective dimensionality reduction. Experimental results show that both methods effectively represent the principal features of XML documents, but the full-path feature vector extraction method describes XML information better, laying a good foundation for further processing of XML documents and showing research value.
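The PCA step can be illustrated with a pure-Python power-iteration sketch that finds the leading principal direction of mean-centered data; the paper's full pipeline, including the XML vectorization itself, is not reproduced here:

```python
import math

def first_principal_component(rows, iterations=200):
    """Leading PCA direction via power iteration on X^T X of centered data."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    v = [1.0] * d
    for _ in range(iterations):
        # w = (X^T X) v, computed cheaply as X^T (X v)
        xv = [sum(r[j] * v[j] for j in range(d)) for r in centered]
        w = [sum(centered[i][j] * xv[i] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v
```

Projecting the document vectors onto the top few such directions yields the reduced-dimension features used for the subsequent classification or clustering.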

20.
In text classification research, the vector space model has a simple representation but captures only term frequencies, ignoring structural information and word order among feature terms, so different documents may be represented by the same vector. To address this problem, this paper represents text with a graph-structured model: a text is represented as a directed graph (a text graph), which effectively compensates for the missing structural information. Graph kernel techniques are applied to text classification; a graph kernel algorithm suited to computing similarity between text graphs, the margin path kernel (间隔通路核), is proposed; and a support vector machine is then used to classify the texts. Experiments on text collections show that the margin path kernel achieves higher classification accuracy than the vector space model and other kernel functions; it is thus a good algorithm for measuring graph-structure similarity and can be widely applied in text classification.
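The margin path kernel itself is not specified in the abstract; as a toy stand-in, a kernel that counts shared directed word-adjacency edges already distinguishes texts that a bag-of-words vector would conflate:

```python
def word_digraph(text):
    """Directed graph of consecutive-word edges (a simple text graph)."""
    words = text.lower().split()
    return set(zip(words, words[1:]))

def edge_kernel(text_a, text_b):
    """Toy graph kernel: shared directed edges, cosine-style normalized."""
    ga, gb = word_digraph(text_a), word_digraph(text_b)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / (len(ga) * len(gb)) ** 0.5
```

Because the edges are directed, "a b" and "b a" share no structure under this kernel even though their word frequencies are identical, which is exactly the word-order information the vector space model loses.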

