首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 46 毫秒
1.
《Knowledge》2007,20(7):607-613
Discovering topics from large amount of documents has become an important task recently. Most of the topic models treat document as a word sequence, whether in discrete character or term frequency form. However, the number of words in a document is greatly different from that in other documents. This will lead to several problems for current topic models in dealing with topics analysis. On the other hand, it is difficult to perform topic transition analysis based on current topic models. In an attempt to overcome these deficiencies, a variable space hidden Markov model (VSHMM) is proposed to represent the topics, and several operations based on space computation are presented. A hierarchical clustering algorithm with dynamically changing of the component number in topic model is proposed to demonstrate the effectiveness of the VSHMM. Method of document partition based on topic transition is also present. Experiments on a real-world dataset show that the VSHMM can improve the accuracy while decreasing the algorithm’s time complexity greatly compared with the algorithm based on current mixture model.  相似文献   

2.
为将文档聚类划分的分布式检索方法直接应用于视觉检索领域,提出一种基于潜在主题的分布式视觉检索模型。给出模型框架,包括图像视觉单词的数据集划分方法和图像子集选择方法,以此优化图像分布式检索性能。实验结果表明,该模型在不降低检索准确率的前提下,能优先选择少量的图像子集进行检索,并提高查询的吞吐量。  相似文献   

3.
文健  李舟军 《中文信息学报》2008,22(1):61-66,122
近年来研究表明使用主题语言模型增强了信息检索的性能,但是仍然不能解决信息检索存在的一些难点问题,如数据稀疏问题,同义词问题,多义词问题,对文档中不可见项和可见项的平滑问题。这些问题在一些领域相关文献检索中显得尤其重要,比如大规模的生物文献检索。本文提出了一种新的基于聚类的主题语言模型方法进行生物文献检索,这主要包括两个方面工作,一是采用本体库中的概念表示文档,并在此基础上进行模糊聚类,把聚类的结果作为数据集中的主题,文档属于某个主题的概率由文档与聚类的模糊相似度决定。二是采用EM算法来估计主题产生项的概率。把上述方法集成到语言模型中就得到本文的语言模型。本文的语言模型能够准确描述项在不同主题中的分布概率,以及文档属于某个主题的概率,并且利用本体中概念部分地解决了同义词问题,而且项可以由不同的主题产生,这也能够部分解决词的多义问题。本文的方法在TREC 2004/05 Genomics Track数据集上进行了测试,与简单语言模型以及现有主题语言模型相比,检索性能得到一定的提高。  相似文献   

4.
基于增量型聚类的自动话题检测研究   总被引:1,自引:0,他引:1  
张小明  李舟军  巢文涵 《软件学报》2012,23(6):1578-1587
随着网络信息飞速的发展,收集并组织相关信息变得越来越困难.话题检测与跟踪(topic detection and tracking,简称TDT)就是为解决该问题而提出来的研究方向.话题检测是TDT中重要的研究任务之一,其主要研究内容是把讨论相同话题的故事聚类到一起.虽然话题检测已经有了多年的研究,但面对日益变化的网络信息,它具有了更大的挑战性.提出了一种基于增量型聚类的和自动话题检测方法,该方法旨在提高话题检测的效率,并且能够自动检测出文本库中话题的数量.采用改进的权重算法计算特征的权重,通过自适应地提炼具有较强的主题辨别能力的文本特征来提高文档聚类的准确率,并且在聚类过程中利用BIC来判断话题类别的数目,同时利用话题的延续性特征来预聚类文档,并以此提高话题检测的速度.基于TDT-4语料库的实验结果表明,该方法能够大幅度提高话题检测的效率和准确率.  相似文献   

5.
This paper addresses automatic image annotation problem and its application to multi-modal image retrieval. The contribution of our work is three-fold. (1) We propose a probabilistic semantic model in which the visual features and the textual words are connected via a hidden layer which constitutes the semantic concepts to be discovered to explicitly exploit the synergy among the modalities. (2) The association of visual features and textual words is determined in a Bayesian framework such that the confidence of the association can be provided. (3) Extensive evaluation on a large-scale, visually and semantically diverse image collection crawled from Web is reported to evaluate the prototype system based on the model. In the proposed probabilistic model, a hidden concept layer which connects the visual feature and the word layer is discovered by fitting a generative model to the training image and annotation words through an Expectation-Maximization (EM) based iterative learning procedure. The evaluation of the prototype system on 17,000 images and 7736 automatically extracted annotation words from crawled Web pages for multi-modal image retrieval has indicated that the proposed semantic model and the developed Bayesian framework are superior to a state-of-the-art peer system in the literature.  相似文献   

6.
Biomedical time series clustering that automatically groups a collection of time series according to their internal similarity is of importance for medical record management and inspection such as bio-signals archiving and retrieval. In this paper, a novel framework that automatically groups a set of unlabelled multichannel biomedical time series according to their internal structural similarity is proposed. Specifically, we treat a multichannel biomedical time series as a document and extract local segments from the time series as words. We extend a topic model, i.e., the Hierarchical probabilistic Latent Semantic Analysis (H-pLSA), which was originally developed for visual motion analysis to cluster a set of unlabelled multichannel time series. The H-pLSA models each channel of the multichannel time series using a local pLSA in the first layer. The topics learned in the local pLSA are then fed to a global pLSA in the second layer to discover the categories of multichannel time series. Experiments on a dataset extracted from multichannel Electrocardiography (ECG) signals demonstrate that the proposed method performs better than previous state-of-the-art approaches and is relatively robust to the variations of parameters including length of local segments and dictionary size. Although the experimental evaluation used the multichannel ECG signals in a biometric scenario, the proposed algorithm is a universal framework for multichannel biomedical time series clustering according to their structural similarity, which has many applications in biomedical time series management.  相似文献   

7.
基于分级神经网络的Web文档模糊聚类技术   总被引:2,自引:1,他引:1  
给出了一种多层向量空间模型,该模型将一篇文档的相关信息从逻辑上划分为多个相对独立的文本段,按照不同位置的文本段确定相应的索引项权重.然后提出了一种简明而有效的基于分级神经网络的模糊聚类算法.与现有方法不同,该模糊聚类方法采用自组织神经网络和模糊聚类网络两部分组成的3层神经网络来实现.首先采用自组织神经网络从原始数据产生一个初始聚类结果,然后运用FCM方法对初始聚类的数目进行优化.实验结果表明,提出的Web文档聚类算法具有较好的聚类特性,它能将与一个主题相关的web文档较完全和准确地聚成一类.  相似文献   

8.
In this paper, we extend the work of Kraft et al. to present a new method for fuzzy information retrieval based on fuzzy hierarchical clustering and fuzzy inference techniques. First, we present a fuzzy agglomerative hierarchical clustering algorithm for clustering documents and to get the document cluster centers of document clusters. Then, we present a method to construct fuzzy logic rules based on the document clusters and their document cluster centers. Finally, we apply the constructed fuzzy logic rules to modify the user's query for query expansion and to guide the information retrieval system to retrieve documents relevant to the user's request. The fuzzy logic rules can represent three kinds of fuzzy relationships (i.e., fuzzy positive association relationship, fuzzy specialization relationship and fuzzy generalization relationship) between index terms. The proposed fuzzy information retrieval method is more flexible and more intelligent than the existing methods due to the fact that it can expand users' queries for fuzzy information retrieval in a more effective manner.  相似文献   

9.
层次主题模型是构建主题层次的重要工具. 现有的层次主题模型大多通过在主题模型中引入nCRP构造方法, 为文档主题提供树形结构的先验分布, 但无法生成具有明确领域涵义的主题层次结构, 即领域主题层次. 同时, 领域主题不仅存在层次关系, 而且不同父主题下的子主题之间还存在子领域方面共享的关联关系, 在现有主题关系研究中没有合适的模型来生成这种领域主题层次. 为了从领域文本中自动、有效地挖掘出领域主题的层次关系和关联关系, 在4个方面进行创新研究. 首先, 通过主题共享机制改进nCRP构造方法, 提出nCRP+层次构造方法, 为主题模型中的主题提供具有分层主题方面共享的树形先验分布; 其次, 结合nCRP+和HDP模型构建重分层的Dirichlet过程, 提出rHDP (reallocated hierarchical Dirichlet processes)层次主题模型; 第三, 结合领域分类信息、词语语义和主题词的领域代表性, 定义领域知识, 包括基于投票机制的领域隶属度、词语与领域主题的语义相关度和层次化的主题-词语贡献度; 最后, 通过领域知识改进rHDP主题模型中领域主题和主题词的分配过程, 提出结合领域知识的层次主题模型rHDP_DK (rHDP with domain knowledge), 并改进采样过程. 实验结果表明, 基于nCRP+的层次主题模型在评价指标方面均优于基于nCRP的层次主题模型(hLDA, nHDP)和神经主题模型(TSNTM); 通过rHDP_DK模型生成的主题层次结构具有领域主题层次清晰、关联子主题的主题词领域差异明确的特点. 此外, 该模型将为领域主题层次提供一个通用的自动挖掘框架.  相似文献   

10.
视频镜头聚类是基于内容的视频分析和检索领域中的一个重要问题.提出了一种对视频镜头的半监督聚类算法(SSCA),该算法首先在初始化时对已知的成对实例约束集进行聚类,利用在初始化时生成的簇来指导高维空间中其他视频镜头数据的聚类.由于高维空间中不同的维度存在着不同的相关性,所以为每一个簇引入权重向量.之后提出了一种基于最大距离的聚类中心分割策略,来解决聚类中心的选取问题.最后,考虑到对于聚类个数的选择往往对最终的结果有很大的影响,算法中采用贝叶斯信息准则来评估给定范围的聚类个数.实验结果表明,提出的算法有效地提高了聚类算法的准确性并减少了算法的响应时间.  相似文献   

11.
Frequent itemset originates from association rule mining. Recently, it has been applied in text mining such as document categorization, clustering, etc. In this paper, we conduct a study on text clustering using frequent itemsets. The main contribution of this paper is three manifolds. First, we present a review on existing methods of document clustering using frequent patterns. Second, a new method called Maximum Capturing is proposed for document clustering. Maximum Capturing includes two procedures: constructing document clusters and assigning cluster topics. We develop three versions of Maximum Capturing based on three similarity measures. We propose a normalization process based on frequency sensitive competitive learning for Maximum Capturing to merge cluster candidates into predefined number of clusters. Third, experiments are carried out to evaluate the proposed method in comparison with CFWS, CMS, FTC and FIHC methods. Experiment results show that in clustering, Maximum Capturing has better performances than other methods mentioned above. Particularly, Maximum Capturing with representation using individual words and similarity measure using asymmetrical binary similarity achieves the best performance. Moreover, topics produced by Maximum Capturing distinguished clusters from each other and can be used as labels of document clusters.  相似文献   

12.
Trajectory clustering and behavior pattern extraction are the foundations of research into activity perception of objects in motion. In this paper, a new framework is proposed to extract behavior patterns through trajectory analysis. Firstly, we introduce directional trimmed mean distance (DTMD), a novel method used to measure similarity between trajectories. DTMD has the attributes of anti-noise, self-adaptation and the capability to determine the direction for each trajectory. Secondly, we use a hierarchical clustering algorithm to cluster trajectories. We design a length-weighted linkage rule to enhance the accuracy of trajectory clustering and reduce problems associated with incomplete trajectories. Thirdly, the motion model parameters are estimated for each trajectory’s classification, and behavior patterns for trajectories are extracted. Finally, the difference between normal and abnormal behaviors can be distinguished.  相似文献   

13.
为有效地弥补全文搜索引擎的不足,提出了一种动态求解的最优密度聚类算法并加以实现.该算法构造了一颗簇关系树,将两种典型聚类算法:密度聚类算法DBSCAN和层次聚类算法BIRCH进行有效结合,对聚类参数ε进行动态求解,以达到参数ε的最优.与其它文本聚类算法相比,该算法的查询结果与用户感兴趣的主题相关度较大,对具有二义性的关键词有较高的查准率,能有效提升搜索引擎的查询效率,加快用户搜索信息的速度.  相似文献   

14.
针对传统的Single-Pass聚类算法对数据输入顺序过于敏感和准确率较低的问题,提出一种以子话题为粒度,考虑新闻文本动态性、时效性和上下文语义特征的增量文本聚类算法(SP-HTD).首先通过解析LDA2Vec主题模型,联合训练文档向量和词向量,获得上下文向量,充分挖掘文本的语义特征及重要性关系.然后在SinglePass算法基础上,根据提取到的热点主题特征词,划分子话题,并设置时间阈值,来确认类簇中心的时效性,将挖掘的语义特征和任务相结合,动态更新类簇中心.最后以时间特性为辅,更新话题质心向量,提高文本相似度计算的准确性.结果表明,所提方法的F值最高可达89.3%,且在保证聚类精度的前提下,在漏检率和误检率上较传统算法有明显改善,能够有效提高话题检测的准确性.  相似文献   

15.
With the rapid growth of text documents, document clustering technique is emerging for efficient document retrieval and better document browsing. Recently, some methods had been proposed to resolve the problems of high dimensionality, scalability, accuracy, and meaningful cluster labels by using frequent itemsets derived from association rule mining for clustering documents. In order to improve the quality of document clustering results, we propose an effective Fuzzy Frequent Itemset-based Document Clustering (F2IDC) approach that combines fuzzy association rule mining with the background knowledge embedded in WordNet. A term hierarchy generated from WordNet is applied to discover generalized frequent itemsets as candidate cluster labels for grouping documents. We have conducted experiments to evaluate our approach on Classic4, Re0, R8, and WebKB datasets. Our experimental results show that our proposed approach indeed provide more accurate clustering results than prior influential clustering methods presented in recent literature.  相似文献   

16.
Visual motion segmentation (VMS) is an important and key part of many intelligent crowd systems. It can be used to figure out the flow behavior through a crowd and to spot unusual life-threatening incidents like crowd stampedes and crashes, which pose a serious risk to public safety and have resulted in numerous fatalities over the past few decades. Trajectory clustering has become one of the most popular methods in VMS. However, complex data, such as a large number of samples and parameters, makes it difficult for trajectory clustering to work well with accurate motion segmentation results. This study introduces a spatial-angular stacked sparse autoencoder model (SA-SSAE) with l2-regularization and softmax, a powerful deep learning method for visual motion segmentation to cluster similar motion patterns that belong to the same cluster. The proposed model can extract meaningful high-level features using only spatial-angular features obtained from refined tracklets (a.k.a ‘trajectories’). We adopt l2-regularization and sparsity regularization, which can learn sparse representations of features, to guarantee the sparsity of the autoencoders. We employ the softmax layer to map the data points into accurate cluster representations. One of the best advantages of the SA-SSAE framework is it can manage VMS even when individuals move around randomly. This framework helps cluster the motion patterns effectively with higher accuracy. We put forward a new dataset with its manual ground truth, including 21 crowd videos. Experiments conducted on two crowd benchmarks demonstrate that the proposed model can more accurately group trajectories than the traditional clustering approaches used in previous studies. The proposed SA-SSAE framework achieved a 0.11 improvement in accuracy and a 0.13 improvement in the F-measure compared with the best current method using the CUHK dataset.  相似文献   

17.
基于潜在语义索引和句子聚类的中文自动文摘   总被引:2,自引:0,他引:2  
自动文摘是自然语言处理领域的一项重要的研究课题.提出一种基于潜在语义索引和句子聚类的中文自动文摘方法.该方法的特色在于:使用潜在语义索引计算句子的相似度,并将层次聚类算法和K-中心聚类算法相结合进行句子聚类,这样提高了句子相似度计算和主题划分的准确性,有利于生成的文摘在全面覆盖文档主题的同时减少自身的冗余.实验结果验证了该文提出的方法的有效性,对比传统的基于聚类的自动文摘方法,该方法生成的文摘质量获得了显著的提高.  相似文献   

18.
In this paper, we consider the problem of clustering and re-ranking web image search results so as to improve diversity at high ranks. We propose a novel ranking framework, namely cluster-constrained conditional Markov random walk (CCCMRW), which has two key steps: first, cluster images into topics, and then perform Markov random walk in an image graph conditioned on constraints of image cluster information. In order to cluster the retrieval results of web images, a novel graph clustering model is proposed in this paper. We explore the surrounding text to mine the correlations between words and images and therefore the correlations are used to improve clustering results. Two kinds of correlations, namely word to image and word to word correlations, are mainly considered. As a standard text process technique, tf-idf method cannot measure the correlation of word to image directly. Therefore, we propose to combine tf-idf method with a novel feature of word, namely visibility, to infer the word-to-image correlation. By latent Dirichlet allocation model, we define a topic relevance function to compute the weights of word-to-word correlations. Taking word to image correlations as heterogeneous links and word-to-word correlations as homogeneous links, graph clustering algorithms, such as complex graph clustering and spectral co-clustering, are respectively used to cluster images into topics in this paper. In order to perform CCCMRW, a two-layer image graph is constructed with image cluster nodes as upper layer added to a base image graph. Conditioned on the image cluster information from upper layer, Markov random walk is constrained to incline to walk across different image clusters, so as to give high rank scores to images of different topics and therefore gain the diversity. Encouraging clustering and re-ranking outputs on Google image search results are reported in this paper.  相似文献   

19.
Based on the analysis of temporal slices, we propose novel approaches for clustering and retrieval of video shots. Temporal slices are a set of two-dimensional (2-D) images extracted along the time dimension of an image volume. They encode rich set of visual patterns for similarity measure. In this paper, we first demonstrate that tensor histogram features extracted from temporal slices are suitable for motion retrieval. Subsequently, we integrate both tensor and color histograms for constructing a two-level hierarchical clustering structure. Each cluster in the top level contains shots with similar color while each cluster in bottom level consists of shots with similar motion. The constructed structure is then used for the cluster-based retrieval. The proposed approaches are found to be useful particularly for sports games, where motion and color are important visual cues when searching and browsing the desired video shots.  相似文献   

20.
由于利用全局特征的图像检索方法在很大程度上受到背景的影响,提出了一种基于显著区域和pLSA相结合的图像检索方法。该方法首先通过谱残差和多分辨率分析提取图像的显著目标区域,其次计算所有图像显著区域的颜色和纹理特征并利用K-均值聚类生成视觉词汇表,然后将每幅图像表示成若干视觉词汇的集合。最后利用概率潜在语义分析(pLSA)来提取区域潜在语义特征,并使用该特征构建SVM分类器模型进行图像检索。将本方法和基于全局特征的图像检索方法比较,实验结果表明,基于显著区域的图像检索结果更加准确。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号