1.
Qi Xiang, Huang Yu, Chen Ziyan, Liu Xiaoyan, Tian Jing, Huang Tinglei, Wang Hongqi 《电子科学学刊(英文版)》2014,(6):565-575
Topic models such as Latent Dirichlet Allocation (LDA) have been successfully applied to many text mining tasks for extracting topics embedded in corpora. However, existing topic models generally cannot discover bursty topics, that is, topics that experience a sudden increase during a period of time. In this paper, we propose a new topic model named Burst-LDA, which simultaneously discovers topics and reveals their burstiness by explicitly modeling each topic's burst states with a first-order Markov chain and using the chain to generate the topic proportions of documents in a logistic-normal fashion. A Gibbs sampling algorithm is developed for posterior inference in the proposed model. Experimental results on a news data set show that our model can efficiently discover bursty topics, outperforming the state-of-the-art method.
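The generative idea in this abstract (a first-order Markov chain over burst states, followed by a logistic-normal draw of topic proportions) can be illustrated with a minimal sketch. This is not the authors' Burst-LDA: the two-state chain, the parameters `p_stay`, `boost` and `sigma`, and both function names are assumptions chosen only for illustration.

```python
import math
import random

def sample_burst_chain(num_steps, p_stay=0.9, rng=None):
    """Sample a 0/1 burst-state sequence from a first-order Markov chain."""
    if rng is None:
        rng = random.Random(0)
    states = [0]  # assumption: each topic starts in the non-bursty state
    for _ in range(num_steps - 1):
        stay = rng.random() < p_stay
        states.append(states[-1] if stay else 1 - states[-1])
    return states

def topic_proportions(burst_states, boost=2.0, sigma=0.5, rng=None):
    """Logistic-normal draw: one Gaussian score per topic, then a softmax.

    A bursty topic (state 1) gets its mean raised by `boost` (hypothetical).
    """
    if rng is None:
        rng = random.Random(0)
    scores = [rng.gauss(boost * s, sigma) for s in burst_states]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scores]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(42)
num_topics, num_steps = 3, 5
chains = [sample_burst_chain(num_steps, rng=rng) for _ in range(num_topics)]
for t in range(num_steps):
    states_at_t = [chain[t] for chain in chains]
    theta = topic_proportions(states_at_t, rng=rng)
    print(t, states_at_t, [round(p, 3) for p in theta])
```

In this sketch a topic in the burst state tends to receive a larger share of the document's topic proportions at that time step, which is the qualitative behavior the abstract describes.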
2.
3.
4.
Zanoni Dias, Siome Goldenstein, Anderson Rocha 《Journal of Visual Communication and Image Representation》2013,24(7):1124-1134
Currently, multimedia objects can be easily created, stored, (re)-transmitted, and edited for good or bad. In this sense, there has been an increasing interest in finding the structure of temporal evolution within a set of documents and how documents are related to one another over time. This process, also known in the literature as Multimedia Phylogeny, aims at finding the phylogeny tree(s) that best explain the creation process of a set of near-duplicate documents (e.g., images/videos) and their ancestry relationships. Solutions to this problem have direct applications in forensics, security, copyright enforcement, news tracking services and other areas. In this paper, we explore one heuristic and one optimum branching algorithm for reconstructing the evolutionary tree associated with a set of image documents. This can be useful for aiding experts in tracking the source of child pornography image broadcasting or the chain of image distribution in time, for instance. We compare the algorithms with the state-of-the-art solution on 350,000 test cases and discuss the advantages and disadvantages of each one in a real scenario.
5.
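The surge in the number of Web services places higher demands on service discovery, and service clustering is an important technique for facilitating it. However, existing service clustering methods cluster only a single type of service document and give little consideration to the domain characteristics of services or to the use of service tags. To address these problems, this paper first builds a feature content vector for each service using an ontology-assisted support vector machine and a domain-oriented technique for reducing the dimensionality of service features. It then applies T-LDA, a tag-aided topic-based service clustering method, to build a latent topic representation that incorporates tag information, and uses normalization to eliminate the influence of general topics. Combining these methods yields DTWSC, a domain-oriented, tag-aided Web service clustering method. Experimental results show that the framework improves clustering performance across different types of service documents. Compared with methods such as LDA and K-Means, it achieves better results on clustering purity, entropy, and F-Measure.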
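Purity and entropy, two of the clustering metrics named in this abstract, can be computed as in the sketch below. It uses the common textbook definitions (majority-label fraction, and size-weighted label entropy per cluster); the abstract does not state which exact variants the paper uses, so treat the definitions and the function name `purity_and_entropy` as assumptions.

```python
import math
from collections import Counter, defaultdict

def purity_and_entropy(cluster_ids, true_labels):
    """Return (purity, entropy) for a clustering against reference labels.

    Purity: fraction of items that carry their cluster's majority label.
    Entropy: label entropy of each cluster (base 2), weighted by cluster size.
    """
    clusters = defaultdict(list)
    for cluster, label in zip(cluster_ids, true_labels):
        clusters[cluster].append(label)

    n = len(true_labels)
    majority_total = 0
    weighted_entropy = 0.0
    for members in clusters.values():
        counts = Counter(members)
        size = len(members)
        majority_total += max(counts.values())
        cluster_entropy = -sum(
            (k / size) * math.log2(k / size) for k in counts.values()
        )
        weighted_entropy += (size / n) * cluster_entropy
    return majority_total / n, weighted_entropy

# Toy data (hypothetical): two clusters, two service categories.
purity, entropy = purity_and_entropy(
    [0, 0, 0, 1, 1, 1],
    ["map", "map", "mail", "mail", "mail", "mail"],
)
print(f"purity={purity:.3f} entropy={entropy:.3f}")
```

Higher purity and lower entropy both indicate clusters that align more closely with the reference categories.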
6.
This paper presents a new Bayesian sparse learning approach to select salient lexical features for sparse topic modeling. Bayesian learning based on latent Dirichlet allocation (LDA) is performed by incorporating spike-and-slab priors. In this sparse LDA (sLDA), the spike distribution is used to select salient words, while the slab distribution is applied to establish the latent topic model based on those selected relevant words. A variational inference procedure is developed to estimate the prior parameters for sLDA. In experiments on document modeling using LDA and sLDA, we find that the proposed sLDA not only reduces the model perplexity but also reduces the memory and computation costs. The Bayesian feature selection method effectively identifies relevant topic words for building a sparse topic model.
7.
This paper focuses on semantic knowledge acquisition from blogs with the proposed tag-topic model. The model extends the Latent Dirichlet Allocation (LDA) model by adding a tag layer between the document and the topic. Each document is represented by a mixture of tags; each tag is associated with a multinomial distribution over topics, and each topic is associated with a multinomial distribution over words. After parameter estimation, the tags are used to describe the underlying topics, so the latent semantic knowledge within the topics can be represented explicitly. The tags are treated as concepts, and the top-N words from the top topics are selected as related words of the concepts. PMI-IR is then employed to compute the relatedness between each tag-word pair, and noisy words with low correlation are removed to improve the quality of the semantic knowledge. Experimental results show that the proposed method can effectively capture semantic knowledge, especially polysemes and synonyms.
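The filtering step described above (scoring each tag-word pair by pointwise mutual information and dropping weakly related words) might look like the following sketch. PMI-IR estimates the probabilities from search-engine hit counts; here the hit counts, the corpus size, and the threshold are made-up numbers, and `pmi` and `filter_related_words` are hypothetical helper names.

```python
import math

def pmi(hits_a, hits_b, hits_ab, total_docs):
    """Pointwise mutual information (base 2) estimated from document counts."""
    if min(hits_a, hits_b, hits_ab) <= 0:
        return float("-inf")  # no evidence of co-occurrence
    return math.log2(hits_ab * total_docs / (hits_a * hits_b))

def filter_related_words(tag_hits, word_stats, total_docs, threshold):
    """Keep words whose PMI with the tag is at least `threshold`.

    word_stats maps word -> (hits for the word, hits for tag AND word).
    """
    kept = []
    for word, (hits_word, hits_both) in word_stats.items():
        if pmi(tag_hits, hits_word, hits_both, total_docs) >= threshold:
            kept.append(word)
    return kept

# Hypothetical counts for the tag "python" in a 10,000-document collection.
stats = {"django": (200, 50), "weather": (400, 4)}
print(filter_related_words(100, stats, 10_000, threshold=1.0))
```

With these illustrative counts, "django" scores log2(25) and is kept, while "weather" scores 0 and is dropped as a noisy word.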
8.
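When extracting features, the LDA topic model lacks an understanding of word associations and related word pairs, which reduces the accuracy of sentiment polarity classification. To address this problem, this paper proposes a new model that introduces a feature-sentiment word pair extraction method into the LDA topic model to improve the extraction of such pairs. A recognition method for feature-sentiment word pairs is designed using dependency parsing, and this method is then introduced into the LDA model as a constraint for extracting the pairs. Parameters are computed by Gibbs sampling, and the generative process of the model is given. Finally, a random forest classifier is used to classify the sentiment polarity of texts. To verify the effectiveness of the model, it is evaluated experimentally alongside two other models. With 20 topics, the proposed model achieves a precision, recall, and F value of 81.54%, 83.13%, and 82.33%, respectively, significantly higher than the other two models.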
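As a quick consistency check on the reported figures, and assuming the F value is the standard balanced F-measure (the harmonic mean of precision P and recall R, which the abstract does not state explicitly), the three numbers agree:

```latex
F = \frac{2PR}{P + R}
  = \frac{2 \times 0.8154 \times 0.8313}{0.8154 + 0.8313}
  = \frac{1.35568}{1.6467}
  \approx 0.8233
```

This matches the reported F value of 82.33%.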
9.
10.
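Traditional text keyword extraction methods ignore contextual semantic information and cannot resolve polysemy, so their extraction results are unsatisfactory. Building on the LDA and BERT models, this paper proposes the LDA-BERT-LightGBM (LB-LightGBM) model. The method uses the LDA topic model to obtain the topics of each review and their word distributions, filters candidate keywords by a threshold, and concatenates the selected words with the original review text as input to the BERT model for word vector training, producing word vectors that incorporate the text's topics. The keyword extraction problem is thereby converted into a binary classification problem solved with the LightGBM algorithm. Experiments compare the TextRank algorithm, the LDA algorithm, the LightGBM algorithm, and the proposed LB-LightGBM model on precision P, recall R, and F1 for text keyword extraction. The results show that when Top N ranges from 3 to 6, the average F1 is 3.5% higher than that of the best comparison method. Overall, the proposed method outperforms the comparison methods selected in the experiments and identifies text keywords more accurately.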
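The first two stages of this pipeline (threshold-based candidate selection from a topic's word distribution, then concatenation with the review text) can be sketched without any model code. The abstract does not give the threshold or the concatenation format, so the `[SEP]` separator, the threshold value, and both function names are assumptions; the BERT encoding and LightGBM classification stages are deliberately left out.

```python
def candidate_keywords(topic_word_probs, threshold):
    """Keep words whose topic probability is at least `threshold`,
    ordered from most to least probable."""
    ranked = sorted(topic_word_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, prob in ranked if prob >= threshold]

def build_encoder_inputs(review, candidates, sep=" [SEP] "):
    """Pair each candidate word with the full review text.

    Each string is one example for a downstream binary classifier
    (keyword vs. not keyword); the separator is a hypothetical choice.
    """
    return [f"{word}{sep}{review}" for word in candidates]

# Hypothetical word distribution for the dominant topic of one review.
probs = {"battery": 0.30, "screen": 0.20, "the": 0.01}
review = "The battery lasts long but the screen is dim."
candidates = candidate_keywords(probs, threshold=0.05)
for example in build_encoder_inputs(review, candidates):
    print(example)
```

Framing each (candidate, review) pair as a separate example is what turns keyword extraction into the binary classification problem the abstract mentions.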