Similar literature
 20 similar documents found (search time: 234 ms)
1.
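Topic detection and tracking based on an improved vector space model   Total citations: 4 (self-citations: 0, citations by others: 4)
Topic detection and tracking aims to develop a family of event-based techniques for organizing information, automatically detecting new topics in a stream of news media and dynamically tracking known ones. This paper presents a detection and tracking method built on an improved vector space model. Instead of the single vector used in the traditional vector space model, feature terms are divided by meaning into four groups (person, time, place, content), which form four vector spaces. Weights and similarities are computed independently in each space. Experiments show that the method is effective.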
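The four-space representation described in this abstract can be illustrated with a minimal sketch. The abstract does not say how the four per-space similarities are combined, so the weighted sum below, the `group_weights` parameter, and the function names are assumptions made for illustration only.

```python
import math

# The four semantic groups named in the abstract.
GROUPS = ("person", "time", "place", "content")


def cosine(a, b):
    """Cosine similarity between two sparse term -> weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def story_similarity(s1, s2, group_weights):
    """Compare two stories group by group, then combine the results.

    Each story is a dict mapping a group name to a term -> weight dict.
    The weighted sum is an assumed combination rule, not the paper's.
    """
    return sum(
        group_weights[g] * cosine(s1.get(g, {}), s2.get(g, {}))
        for g in GROUPS
    )
```

With equal weights of 0.25 per group, two stories that share their person and place terms but differ in time and content would score about 0.5, which shows how each space contributes independently.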

2.
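Topic tracking aims to dynamically track known topics in a stream of news media. Building on existing vector space model classification algorithms, the paper proposes a topic tracking algorithm based on topic updating and evaluates it experimentally.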

3.
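Based on a theoretical analysis of news stories and experimental verification, this paper proposes a multi-vector representation model that partitions the feature set as finely as possible while losing as little information as possible. On top of this model, the paper designs a fuzzy matching method for computing the relatedness between named-entity sub-vectors; these relatedness scores and the similarities of the multiple vectors are combined by a support vector machine into a similarity between story models. Using the TDT4 Chinese corpus as test data, the model and the fuzzy matching technique are applied to story link detection. Experiments show that the multi-vector model improves story link detection performance and that fuzzy matching partly compensates for the performance loss caused by exact matching.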

4.
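This paper studies the application of Bayesian belief networks to topic detection and proposes a new topic detection model. The model's topology has four layers of nodes (new story, story terms, event terms, topic), with arcs marking the index relations. Under Bayesian probability and conditional independence assumptions, the model uses conditional probabilities to compute the similarity between a new story and existing topic clusters, and thereby detects topics. To reflect the importance of core stories and core events, the weight computation at the different layers is adjusted. The model is evaluated with DET curves; the results show that the adjusted weight computation improves the new model's performance to some extent, and that at the same threshold the new model has lower miss and false alarm rates than the vector space model.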

5.
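The Internet is full of user-generated documents such as forum posts. Monitoring this disorderly content is a key concern of security departments, and Topic Detection and Tracking (TDT) is one effective means of doing so. However, forum replies are short and topics shift quickly, which makes forum-oriented TDT exceptionally difficult. To address these characteristics, three TDT models are given: first, a baseline model; second, to mitigate "topic drift", an improved model that represents a topic as a seed vector plus a subsequent vector; third, the improved model combined with a recent named entity (NE) weight adjustment strategy. A feature extraction method is also proposed to handle the irregular format of forum posts and the processing speed required of a TDT system. Finally, experimental results for the TDT models on a real data set confirm the effectiveness of the models and the feature extraction method.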
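One way to picture the seed-vector-plus-subsequent-vector idea is sketched below. The abstract gives no formulas, so the `Topic` class, the mixing parameter `alpha`, and the additive update of the subsequent vector are all hypothetical choices made only to show how a fixed seed might limit drift.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term -> weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class Topic:
    """Hypothetical topic model: a fixed seed plus an evolving follow-up vector."""

    def __init__(self, seed):
        self.seed = dict(seed)  # set once from the opening post, never updated
        self.follow = {}        # accumulated term weights of later posts

    def similarity(self, post, alpha=0.5):
        # alpha (assumed) balances loyalty to the seed against later posts.
        # While no post has been absorbed, the follow-up term contributes 0.
        return alpha * cosine(post, self.seed) + (1 - alpha) * cosine(post, self.follow)

    def absorb(self, post):
        # Only the follow-up vector changes, so the seed keeps anchoring the topic.
        for term, weight in post.items():
            self.follow[term] = self.follow.get(term, 0.0) + weight
```

Because the seed never changes, a reply that matches only recent posts can reach at most `1 - alpha` in this sketch, which is one simple way to damp drift.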

6.
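Topic detection can promptly discover hot spots of online public opinion and breaking events, and can keep tracking a topic so that the development of an opinion event is followed in real time. This paper proposes an improved clustering-based topic detection and tracking algorithm. First, the feature vector of a text is improved by adding a backbone vector based on the main clause structure of its sentences. Then two center vectors are extracted for each detected topic: a basic center vector and a backbone center vector distilled from the backbone vectors. Texts are then clustered by computing the distance between each text and the center vectors, which keeps the texts within a topic cohesive. In addition, topic words are extracted and used to compute the topical relatedness between topics, which effectively supports sub-topic detection and improves the accuracy of topic detection and tracking. Tests on more than two weeks of data from 5 channels of 10 major websites show that the method improves the accuracy of topic detection and tracking to some extent and has a degree of adaptability and generality.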

7.
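Research on burst topic detection in microblogs   Total citations: 1 (self-citations: 0, citations by others: 1)
Qiu Yunfei (邱云飞), Cheng Liang (程亮). Computer Engineering (《计算机工程》), 2012, 38(9): 288-290
Topic detection and tracking models do not handle microblog short messages well, since these are highly casual and use non-standard language. To address this, a microblog burst topic detection method based on a dynamic sliding window is proposed. The window is used to extract potentially bursty information; a normalized term frequency-inverse document frequency function combined with semantics computes the feature weights; a semantics-aware vector space model is built; and the idea of the Single-Pass clustering algorithm is applied, with improvements, to generate the final clusters. Experimental results show that the algorithm yields fairly accurate burst topic detection results.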
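For readers unfamiliar with Single-Pass clustering, the sketch below shows the plain baseline algorithm, not the improved variant of this paper (which also involves a dynamic sliding window and semantic weighting). The `threshold` parameter, the running-mean centroid update, and the function names are assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term -> weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def single_pass(docs, threshold):
    """Cluster weighted term vectors in one pass over the input.

    Each document joins the most similar existing cluster if the similarity
    reaches `threshold`; otherwise it starts a new cluster.
    Returns a list of clusters, each a list of document indices.
    """
    clusters = []  # each entry: {"centroid": dict, "members": [indices]}
    for i, vec in enumerate(docs):
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(i)
            n = len(best["members"])
            old = best["centroid"]
            # Update the centroid as the running mean of its members.
            best["centroid"] = {
                t: (old.get(t, 0.0) * (n - 1) + vec.get(t, 0.0)) / n
                for t in set(old) | set(vec)
            }
        else:
            clusters.append({"centroid": dict(vec), "members": [i]})
    return [cluster["members"] for cluster in clusters]
```

The result depends on input order and on the threshold, which is the usual trade-off of one-pass methods: they suit streaming text but cannot revise early assignments.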

8.
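Story link detection is one of the basic tasks in Topic Detection and Tracking (TDT) research. Based on the main elements of a news topic (time, place, person, content, and so on), the paper proposes a story representation model built on topic elements and gives a story link detection method based on computing similarity between topic elements. Experiments show that the method is particularly suited to story link detection among different topics under the same theme.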

9.
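The data sparsity of microblog text means that traditional topic tracking techniques capture only part of a topic's microblogs, with low accuracy. Topics also drift during tracking. To address these two problems, a microblog hot topic tracking method based on cascaded conditional random fields is proposed. The method first uses a labeling model to mark microblogs that may be relevant; the source hot microblogs and the marked microblogs then serve as the observation sequence and the state sequence, respectively, of a classification model that computes relevance classes. Next, an adaptive model is constructed to update the recognition model and reduce the data sparsity problem; new observation sequences are selected from the relevant microblogs, and the rest serve as new state sequences for iterative classification. Experiments show that the method improves the overall F-measure by 4.13% on average over traditional methods.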

10.
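A spam email identification method based on the LZ complexity similarity between symbol sequences is proposed. Unlike email identification based on the vector space model, computing the LZ complexity similarity between email texts requires no text preprocessing or feature extraction. In addition, the lazy learning nature of the k-nearest-neighbor rule suits application settings in which the spam samples need to be adjusted dynamically. The method is evaluated with ten-fold cross-validation on the Ling-Spam email corpus, and its overall identification performance is better than that of some statistical and machine learning methods based on the vector space model.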

11.
A large body of research analyzes the runtime execution of a system to extract abstract behavioral views. Those approaches primarily analyze control flow by tracing method execution events or they analyze object graphs of heap memory snapshots. However, they do not capture how objects are passed through the system at runtime. We refer to the exchange of objects as the object flow, and we claim that it is necessary to analyze object flows if we are to understand the runtime of an object-oriented application. We propose and detail object flow analysis, a novel dynamic analysis technique that takes this new information into account. To evaluate its usefulness, we present a visual approach that allows a developer to study classes and components in terms of how they exchange objects at runtime. We illustrate our approach on three case studies.

12.
Abstract

This paper is motivated by the following question: Can one axiomatize information first and then probability in terms of information, rather than vice versa as suggested by information theory?

The emphasis here is on a new methodological approach toward a conceptualization of behavioral information which might be better suited for inferences involving nonrepeatable events or an insufficient number of repeatable events, based on the assumption that information is prior to probability statements.

The main idea is to generate (via a Boolean homomorphism) a Boolean algebra of events by an appropriate information structure and to utilize the notion of a topogeneous order similar to that of a Boolean order.

13.
Craven, Mark; Slattery, Seán. Machine Learning, 2001, 43(1-2): 97-119
We present a new approach to learning hypertext classifiers that combines a statistical text-learning method with a relational rule learner. This approach is well suited to learning in hypertext domains because its statistical component allows it to characterize text in terms of word frequencies, whereas its relational component is able to describe how neighboring documents are related to each other by hyperlinks that connect them. We evaluate our approach by applying it to tasks that involve learning definitions for (i) classes of pages, (ii) particular relations that exist between pairs of pages, and (iii) locating a particular class of information in the internal structure of pages. Our experiments demonstrate that this new approach is able to learn more accurate classifiers than either of its constituent methods alone.

14.
Unlike the usual definition of measures of information in terms of the relation between information and uncertainty, a different approach is followed in this paper, in which the relation between certainty and information plays a central role. Information measures are introduced with the help of certainty measures within this framework. This approach leads to three generalized classes of information measures and unifies the known measures of information into one generalized probabilistic theory of discrete information measures.

15.
Numerous paper-based newspapers have been transformed into a digital format and published on the Internet. Digital newspapers are gradually becoming a popular electronic medium for conveying information immediately. Google developed a powerful news service, Google news alert, based on the Google news aggregator, for tracking news events of interest to users by keyword matching. However, because this service monitors and tracks news events using only the keyword-matching scheme, it retrieves many irrelevant news events and sends them to users. In other words, the current service cannot monitor news events via a specific news topic; although the recall rate is high, the precision rate is low when tracking news events of interest to users. Thus, this study presents a novel personalized e-news monitoring agent system that employs a topic-tracking-based approach, addressing the flaw of the keyword-based approach, for tracking news events of interest to users on the Google News site. The proposed scheme simultaneously considers both similarities and the semantic relationships among news topics to track news events. Additionally, to further improve the accuracy rate in tracking Chinese news events of interest to users, the Chinese word segmentation system ECScanner (An Extension Chinese Lexicon Scanner) with new word extension is proposed for the Chinese word segmentation process. Experimental results demonstrated that the proposed topic-based scheme is superior to the keyword-based approach used by Google news alert in terms of precision rate, and retains a high recall rate when tracking news events of interest to users. Compared with the conventional Chinese word segmentation system CKIP (Chinese Knowledge Information Processing), experimental results also confirmed that using the proposed ECScanner with its novel extension mechanism for new words improves the accuracy rate in tracking news events of interest to users.

16.
The massive volume of web videos creates an urgent demand for efficiently grasping the major events. However, the distinct characteristics of web videos, such as the limited number of features, the noisy text information, and the unavoidable error in near-duplicate keyframes (NDKs) detection, make web video event mining a challenging task. In this paper, we propose a novel four-stage framework to improve the performance of web video event mining. Data preprocessing is the first stage. Multiple Correspondence Analysis (MCA) is then applied to explore the correlation between terms and classes, aiming to bridge the gap between NDKs and high-level semantic concepts. Next, co-occurrence information is used to detect the similarity between NDKs and classes using the NDK-within-video information. Finally, both of them are integrated for web video event mining through negative NDK pruning and positive NDK enhancement. Moreover, both NDKs and terms with relatively low frequencies are treated as useful information in our experiments. Experimental results on large-scale web videos from YouTube demonstrate that the proposed framework outperforms several existing mining methods and obtains good results for web video event mining.

17.
The rapidly growing amount of newswire stories stored in electronic devices raises new challenges for information retrieval technology. Traditional query-driven retrieval is not suitable for generic queries. It is desirable to have an intelligent system to automatically locate topically related events or topics in a continuous stream of newswire stories. This is the goal of automatic event detection. We propose a new approach to performing event detection from multilingual newswire stories. Unlike traditional methods which employ simple keyword matching, our method makes use of concept terms and named entities such as person, location, and organization names. Concept terms of a story are derived from statistical context analysis between sentences in the news story and stories in the concept database. We have conducted a set of experiments to study the effectiveness of our approach. The results show that the performance of detection using concept terms together with story keywords is better than traditional methods which only use keyword representation. © 2001 John Wiley & Sons, Inc.

18.
A robust automatic classification system is critical for polarimetric synthetic aperture radar (POLSAR) terrain processing. In most conventional classification methods, the number of classes cannot be calculated before classification. In this article, we present a new unsupervised classification algorithm with an adaptive number of classes for POLSAR data, which is capable of estimating the number of classes automatically. The approach is mainly composed of three operations. First, a region-based feature map combining the polarimetric statistical and spatial information is constructed based on the turbopixel method. This is followed by a clustering step performed through an improved affinity propagation clustering with Wishart distance. Finally, the result of the improved affinity propagation clustering is classified using a Wishart classifier. The proposed approach takes the spatial information into consideration and makes good use of the inherent statistical characteristics of POLSAR data. The performance of the proposed classification approach on three real datasets is presented and analysed, and the experimental results show that the approach provides more accurate estimation under the condition of various numbers of classes compared with existing methods.

19.
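Pre-retrieval of graphic and image information based on tolerance rough sets   Total citations: 8 (self-citations: 0, citations by others: 8)
Early uses of rough set theory for information retrieval were all based on the "equivalence rough set model", but the properties of equivalence rough sets limit the range of application of that approach. Some researchers therefore proposed a new concept of information retrieval in which the "tolerance rough set model" replaces the "equivalence rough set model"; the keys to this concept are the "co-occurrence of keywords" and a "matching algorithm" based on tolerance rough inclusion. This paper proposes a new method that uses tolerance rough set theory for the pre-retrieval of graphics and images, that is, pre-retrieval carried out in the approximation space of tolerance classes. To verify the effectiveness of the new method, several experiments were run on face graphics and image databases. The results show that the method effectively overcomes the limitations of equivalence rough sets in graphics and image retrieval and helps to improve retrieval efficiency.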

20.
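This paper proposes a method for detecting hot events in Flickr data based on aging theory. The method first merges the visual words extracted from Flickr images with the images' textual information, by weighting, into documents. An LDA model is then trained to obtain each document's topic distribution, which serves as its final vector representation. On this basis an improved Single-Pass algorithm is proposed for event detection; the algorithm takes the geographic location of the pictures into account and, based on Aging Theory, models the life cycle of each detected event so that the event's energy value in each time period can be computed. Finally, events are ranked by energy value to obtain the hot events in a given time period. Experiments on a real Flickr data set show that the proposed method outperforms traditional event detection methods in precision, recall, and F1 measure.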
