首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
Efficient phrase-based document indexing for Web document clustering   总被引:4,自引:0,他引:4  
Document clustering techniques mostly rely on single term analysis of the document data set, such as the vector space model. To achieve more accurate document clustering, more informative features including phrases and their weights are particularly important in such scenarios. Document clustering is particularly useful in many applications such as automatic categorization of documents, grouping search engine results, building a taxonomy of documents, and others. This article presents two key parts of successful document clustering. The first part is a novel phrase-based document index model, the document index graph, which allows for incremental construction of a phrase-based index of the document set with an emphasis on efficiency, rather than relying on single-term indexes only. It provides efficient phrase matching that is used to judge the similarity between documents. The model is flexible in that it could revert to a compact representation of the vector space model if we choose not to index phrases. The second part is an incremental document clustering algorithm based on maximizing the tightness of clusters by carefully watching the pair-wise document similarity distribution inside clusters. The combination of these two components creates an underlying model for robust and accurate document similarity calculation that leads to much improved results in Web document clustering over traditional methods.  相似文献   

2.
Document Similarity Using a Phrase Indexing Graph Model   总被引:3,自引:1,他引:2  
Document clustering techniques mostly rely on single term analysis of text, such as the vector space model. To better capture the structure of documents, the underlying data model should be able to represent the phrases in the document as well as single terms. We present a novel data model, the Document Index Graph, which indexes Web documents based on phrases rather than on single terms only. The semistructured Web documents help in identifying potential phrases that when matched with other documents indicate strong similarity between the documents. The Document Index Graph captures this information, and finding significant matching phrases between documents becomes easy and efficient with such model. The model is flexible in that it could revert to a compact representation of the vector space model if we choose not to index phrases. However, using phrase indexing yields more accurate document similarity calculations. The similarity between documents is based on both single term weights and matching phrase weights. The combined similarities are used with standard document clustering techniques to test their effect on the clustering quality. Experimental results show that our phrase-based similarity, combined with single-term similarity measures, gives a more accurate measure of document similarity and thus significantly enhances Web document clustering quality.  相似文献   

3.
In this paper, we describe a document clustering method called novelty-based document clustering. This method clusters documents based on similarity and novelty. The method assigns higher weights to recent documents than old ones and generates clusters with the focus on recent topics. The similarity function is derived probabilistically, extending the conventional cosine measure of the vector space model by incorporating a document forgetting model to produce novelty-based clusters. The clustering procedure is a variation of the K-means method. An additional feature of our clustering method is an incremental update facility, which is applied when new documents are incorporated into a document repository. Performance of the clustering method is examined through experiments. Experimental results show the efficiency and effectiveness of our method.  相似文献   

4.
文本聚类是自然语言处理中的一项重要研究课题,主要应用于信息检索和Web挖掘等领域。其中的关键是文本的表示和聚类算法。在层次聚类的基础上,提出了一种新的基于边界距离的层次聚类算法,该方法通过选择两个类间边缘样本点的距离作为类间距离,有效地利用类的边界信息,提高类间距离计算的准确性。综合考虑不同词性特征对文本的贡献,采用多向量模型对文本进行表示。不同文本集上的实验表明,基于边界距离的多向量文本聚类算法取得了较好的性能。  相似文献   

5.
一种基于词共现的文档聚类算法   总被引:1,自引:0,他引:1       下载免费PDF全文
常鹏  冯楠  马辉 《计算机工程》2012,38(2):213-214
为解决文本主题表达存在的信息缺失问题,提出一种基于词共现的文档聚类算法。利用文档集上的频繁共现词建立文档主题向量表示模型,将其应用于层次聚类算法中,并通过聚类熵寻找最优的层次划分,从而准确反映文档之间的主题相关关系。实验结果表明,该算法所获得的结果优于其他基于短语的文档层次聚类算法。  相似文献   

6.
该文提出了改进的维吾尔语Web文本后缀树聚类算法STCU,其中后缀树的构建以维吾尔语句子为基本单位。针对维吾尔语语言和Web文本特点,文中对词语进行词干提取,构建了维吾尔语绝对停用词表和相对停用词表,采用文档频率和词性结合的方法提取关键短语,改进了合并基类的二进制方法,根据语料类别数自动调整聚类类别阈值,利用最一般短语对聚类类别进行描述,有效地改善了文本聚类的质量。与传统的后缀树聚类算法相比,聚类全面率提高了44.51%,聚类准确率提高了11.74%,错误率降低了0.94%。实验结果表明 改进的后缀树算法在Web文本聚类的精度和效率方面具有较强的优越性。  相似文献   

7.
Many studies on developing technologies have been published as articles, papers, or patents. We use and analyze these documents to find scientific and technological trends. In this paper, we consider document clustering as a method of document data analysis. In general, we have trouble analyzing documents directly because document data are not suitable for statistical and machine learning methods of analysis. Therefore, we have to transform document data into structured data for analytical purposes. For this process, we use text mining techniques. The structured data are very sparse, and hence, it is difficult to analyze them. This study proposes a new method to overcome the sparsity problem of document clustering. We build a combined clustering method using dimension reduction and K-means clustering based on support vector clustering and Silhouette measure. In particular, we attempt to overcome the sparseness in patent document clustering. To verify the efficacy of our work, we first conduct an experiment using news data from the machine learning repository of the University of California at Irvine. Second, using patent documents retrieved from the United States Patent and Trademark Office, we carry out patent clustering for technology forecasting.  相似文献   

8.
目前,搜索结果聚类方法大多数采用基于文档的方法,不能生成有意义的聚类标签。为了解决这个问题,提出一种基于关键名词短语聚类的中文搜索结果聚类方法,该方法将名词短语、相关搜索词作为候选聚类标签,利用C-Value算法、IDF值筛选标签,然后使用Chameleon算法将标签聚类,最后将搜索结果划分到最相关的聚类簇。实验证明,该方法把关键名词短语和相关搜索词作为聚类标签,有效地提高了标签的描述性,降低了聚类算法的时间复杂度。  相似文献   

9.
一种基于群体智能的Web文档聚类算法   总被引:31,自引:0,他引:31  
将群体智能聚类模型运用于文档聚类,提出了一种基于群体智能的Web文档聚类算法,首先运用向量空间模型表示Web文档信息,采用常规方法如消除无用词和特征词条约简法则得到文本特征集,然后将文档的向量随机分布到一个平面上,运用基于群体智能的聚类方法进行文档聚类,最后从平面上采用递归算法收集聚类结果,为了改善算法的实用性,将原算法与k均值算法结合提出一种混合聚类算法,通过实验比较,结果表明基于群体智能的Web文档聚类算法具有较好的聚类特性,它能将与一个主题相关的Web文档较完全而准确地聚成一类。  相似文献   

10.
一种基于LDA的潜在语义区划分及Web文档聚类算法   总被引:2,自引:0,他引:2  
该文应用LDA模型进行文档的潜在语义分析,将语义分布划分成低频、中频、高频语义区,以低频语义区的语义进行Web游离文档检测,以中、高频语义区的语义作为文档特征进行文档聚类,采用文档类别与语义互作用机制对聚类结果进行修正。与相关工作比较,该文不仅应用LDA模型表示文档,而且进行了深入的语义分布区域划分,并将分析结果应用于Web文档聚类。实验表明,该文提出的基于LDA的文档类别与语义互作用聚类算法获得了更好的聚类结果。  相似文献   

11.
Text document clustering plays an important role in providing better document retrieval, document browsing, and text mining. Traditionally, clustering techniques do not consider the semantic relationships between words, such as synonymy and hypernymy. To exploit semantic relationships, ontologies such as WordNet have been used to improve clustering results. However, WordNet-based clustering methods mostly rely on single-term analysis of text; they do not perform any phrase-based analysis. In addition, these methods utilize synonymy to identify concepts and only explore hypernymy to calculate concept frequencies, without considering other semantic relationships such as hyponymy. To address these issues, we combine detection of noun phrases with the use of WordNet as background knowledge to explore better ways of representing documents semantically for clustering. First, based on noun phrases as well as single-term analysis, we exploit different document representation methods to analyze the effectiveness of hypernymy, hyponymy, holonymy, and meronymy. Second, we choose the most effective method and compare it with the WordNet-based clustering method proposed by others. The experimental results show the effectiveness of semantic relationships for clustering are (from highest to lowest): hypernymy, hyponymy, meronymy, and holonymy. Moreover, we found that noun phrase analysis improves the WordNet-based clustering method.  相似文献   

12.
This paper details a modular, self-contained web search results clustering system that enhances search results by (i) performing clustering on lists of web documents returned by queries to search engines, and (ii) ranking the results and labeling the resulting clusters, by using a calculated relevance value as a degree of membership to clusters. In addition, we demonstrate an external evaluation method based on precision for comparing fuzzy clustering techniques, as well as internal measures suitable for working on non-training data. The built-in label generator uses the membership degrees and relevance values to weight the most relevant results more heavily. The membership degrees of documents to fuzzy clusters also facilitate effective detection and removal of overly similar clusters. To achieve this, our transduction-based clustering algorithm (TCA) and its fuzzy counterpart (FTCA) employ a transduction-based relevance model (TRM) to consider local relationships between each web document. Results from testing on five different real-world and synthetic datasets results show favorable results compared to established label-based clustering algorithms Suffix Tree Clustering (STC) and Lingo.  相似文献   

13.
用于文本聚类的模糊谱聚类算法   总被引:1,自引:0,他引:1       下载免费PDF全文
谱聚类方法的应用已经开始从图像分割领域扩展到文本挖掘领域中,并取得了一定的成果。在自动确定聚类数目的基础上,结合模糊理论与谱聚类算法,提出了一种应用在多文本聚类中的模糊聚类算法,该算法主要描述了如何实现单个文本同时属于多个文本类的模糊谱聚类方法。实验仿真结果表明该算法具有很好的聚类效果。  相似文献   

14.
Efficient Phrase-Based Document Similarity for Clustering   总被引:1,自引:0,他引:1  
In this paper, we propose a phrase-based document similarity to compute the pair-wise similarities of documents based on the Suffix Tree Document (STD) model. By mapping each node in the suffix tree of STD model into a unique feature term in the Vector Space Document (VSD) model, the phrase-based document similarity naturally inherits the term tf-idf weighting scheme in computing the document similarity with phrases. We apply the phrase-based document similarity to the group-average Hierarchical Agglomerative Clustering (HAC) algorithm and develop a new document clustering approach. Our evaluation experiments indicate that, the new clustering approach is very effective on clustering the documents of two standard document benchmark corpora OHSUMED and RCV1. The quality of the clustering results significantly surpass the results of traditional single-word textit{tf-idf} similarity measure in the same HAC algorithm, especially in large document data sets. Furthermore, by studying the property of STD model, we conclude that the feature vector of phrase terms in the STD model can be considered as an expanded feature vector of the traditional single-word terms in the VSD model. This conclusion sufficiently explains why the phrase-based document similarity works much better than the single-word tf-idf similarity measure.  相似文献   

15.
In this paper, we introduce variability of syntactic phrases and propose a new retrieval approach reflecting the variability of syntactic phrase representation. With variability measure of a phrase, we can estimate how likely a phrase in a given query would appear in relevant documents and control the impact of syntactic phrases in a retrieval model. Various experimental results over different types of queries and document collections show that our retrieval model based on variability of syntactic phrases is very effective in terms of retrieval performance, especially for long natural language queries.  相似文献   

16.
Web文本表示方法作为所有Web文本分析的基础工作,对文本分析的结果有深远的影响。提出了一种多维度的Web文本表示方法。传统的文本表示方法一般都是从文本内容中提取特征,而文档的深层次特征和外部特征也可以用来表示文本。本文主要研究文本的表层特征、隐含特征和社交特征,其中表层特征和隐含特征可以由文本内容中提取和学习得到,而文本的社交特征可以通过分析文档与用户的交互行为得到。所提出的多维度文本表示方法具有易用性,可以应用于各种文本分析模型中。在实验中,改进了两种常用的文本聚类算法——K-means和层次聚类算法,并命名为多维度K-means MDKM和多维度层次聚类算法MDHAC。通过大量的实验表明了本方法的高效性。此外,我们在各种特征的结合实验结果中还有一些深层次的发现。  相似文献   

17.
In this paper, an automatic image–text alignment algorithm is developed to achieve more effective indexing and retrieval of large-scale web images by aligning web images with their most relevant auxiliary text terms or phrases. First, a large number of cross-media web pages (which contain web images and their auxiliary texts) are crawled and segmented into a set of image–text pairs (informative web images and their associated text terms or phrases). Second, near-duplicate image clustering is used to group large-scale web images into a set of clusters of near-duplicate images according to their visual similarities. The near-duplicate web images in the same cluster share similar semantics and are simultaneously associated with a same or similar set of auxiliary text terms or phrases which co-occur frequently in the relevant text blocks, thus performing near-duplicate image clustering can significantly reduce the uncertainty on the relatedness between the semantics of web images and their auxiliary text terms or phrases. Finally, random walk is performed over a phrase correlation network to achieve more precise image–text alignment by refining the relevance scores between the web images and their auxiliary text terms or phrases. Our experiments on algorithm evaluation have achieved very positive results on large-scale cross-media web pages.  相似文献   

18.
基于关联规则的Web文档聚类算法   总被引:32,自引:1,他引:32  
宋擒豹  沈钧毅 《软件学报》2002,13(3):417-423
Web文档聚类可以有效地压缩搜索空间,加快检索速度,提高查询精度.提出了一种Web文档的聚类算法.该算法首先采用向量空间模型VSM(vector space model)表示主题,根据主题表示文档;再以文档为事务,以主题为事务项,将文档和主题间的关系看作事务的形式,采用关联规则挖掘算法发现主题频集,相应的文档集即为初步文档类;然后依据类间距离和类内连接强度阈值合并、拆分类,最终实现文档聚类.实验结果表明,该算法是有效的,能处理文档类间固有的重叠情况,具有一定的实用价值.  相似文献   

19.
一种用于文章推荐系统中的用户模型表示方法   总被引:2,自引:0,他引:2  
分析了现有文章推荐系统中基于关键词向量的用户模型表示方法存在的不足,提出了基于聚类兴趣点的用户模型表示方法。该方法可通过文章聚类形成兴趣点。由于传统的基于划分的聚类算法存在的不足,提出了基于复杂网络特征的文章聚类算法。实验结果表明该用户模型的表示方法较好地反映了用户多方面的兴趣,提高了文章推荐系统的性能。  相似文献   

20.
We propose a parallelization scheme for an existing algorithm for constructing a web‐directory, that contains categories of web documents organized hierarchically. The clustering algorithm automatically infers the number of clusters using a quality function based on graph cuts. A parallel implementation of the algorithm has been developed to run on a cluster of multi‐core processors interconnected by an intranet. The effect of the well‐known Latent Semantic Indexing on the performance of the clustering algorithm is also considered. The parallelized graph‐cut based clustering algorithm achieves an F‐measure in the range [0.69,0.91] for the generated leaf‐level clusters while yielding a precision‐recall performance in the range [0.66,0.84] for the entire hierarchy of the generated clusters. As measured via empirical observations, the parallel algorithm achieves an average speedup of 7.38 over its sequential variant, at the same time yielding a better clustering performance than the sequential algorithm in terms of F‐measure. Copyright © 2015 John Wiley & Sons, Ltd.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号