Similar literature
1.
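Text mining techniques are founded on the statistical analysis of text. Typically, they estimate the importance of a word or phrase in a document by computing how often it occurs. In such statistical analysis, however, the meaning assigned to a term may not be its exact meaning in the sentence. To address this problem, this paper proposes a new concept-based model framework that can effectively identify matching and related concepts across documents.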

2.
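Xu Jianmin, Zhu Song, Chen Fujie 《计算机应用》 (Journal of Computer Applications) 2007,27(12):3013-3015
This paper discusses how to exploit relationships between terms to improve the performance of information retrieval systems. It analyzes the feasibility of fusing term similarity with term relatedness, designs a new method for mining relationships between terms, and proposes a correction factor by which term relatedness adjusts term similarity. The factor controls the degree of influence on term relationships and solves the problem of matching semantic concepts between terms more accurately. Experiments applying the method to two retrieval models show that it retrieves more effectively than methods that use term similarity or term relatedness alone.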

3.
In this paper we present context matching, a novel context-based technique for the ad-hoc retrieval of web documents. The aim of the technique is to dynamically generate a measure of document term significance during retrieval that can be used as a substitute for, or co-contributor to, the term frequency measure. Unlike term frequency, which relies on a term occurring multiple times in a document to be considered significant, context matching is based on the notion that if a term in a given document occurs in the context of the query, then that term is deemed to be significant. Context matching can therefore determine a term to be significant even if it occurs only once in a document. Conversely, it can determine a term to be insignificant even if it occurs frequently within a document. We show how expanded terms generated by a typical query expansion technique can be used effectively as query context for context matching. The technique is ideally suited to the nature of web information retrieval, and we show through experimental results on TREC web benchmark data that context matching significantly improves retrieval accuracy.

4.
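In text categorization, current research on feature weighting has two shortcomings. First, in feature weighting algorithms based on document frequency, the document frequency often ignores the feature's term frequency information. Second, the relationship between features and categories is not expressed accurately or fully enough. To address both shortcomings, a new term-frequency-based, category-relevant feature weighting algorithm (CDF-AICF) is proposed. When measuring feature weight, the algorithm considers the document frequency of a feature at each term frequency. To express the relationship between features and categories accurately, two new concepts are introduced: category-relevant document frequency (CDF) and average inverse class frequency (AICF), which represent a feature's representational power and discriminative power for a category, respectively. Finally, classification experiments on three datasets compare the algorithm with five other feature weighting methods. The results show that CDF-AICF outperforms the other five.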

5.
Wang Tao, Cai Yi, Leung Ho-fung, Lau Raymond Y. K., Xie Haoran, Li Qing 《Knowledge and Information Systems》2021,63(9):2313-2346

In text categorization, Vector Space Model (VSM) has been widely used for representing documents, in which a document is represented by a vector of terms. Since different terms contribute to a document’s semantics in various degrees, a number of term weighting schemes have been proposed for VSM to improve text categorization performance. Much evidence shows that the performance of a term weighting scheme often varies across different text categorization tasks, while the mechanism underlying variability in a scheme’s performance remains unclear. Moreover, existing schemes often weight a term with respect to a category locally, without considering the global distribution of a term’s occurrences across all categories in a corpus. In this paper, we first systematically examine pros and cons of existing term weighting schemes in text categorization and explore the reasons why some schemes with sound theoretical bases, such as chi-square test and information gain, perform poorly in empirical evaluations. By measuring the concentration that a term distributes across all categories in a corpus, we then propose a series of entropy-based term weighting schemes to measure the distinguishing power of a term in text categorization. Through extensive experiments on five different datasets, the proposed term weighting schemes consistently outperform the state-of-the-art schemes. Moreover, our findings shed new light on how to choose and develop an effective term weighting scheme for a specific text categorization task.
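
The abstract does not give the proposed formulas, so the following is only a minimal sketch of the general idea of measuring how concentrated a term is across categories. The function names and the normalisation (one minus entropy divided by its maximum) are assumptions made for illustration, not the authors' schemes.

```python
import math


def category_entropy(counts):
    """Shannon entropy (in bits) of a term's occurrence counts over categories."""
    total = sum(counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)


def distinguishing_power(counts):
    """Hypothetical weight in [0, 1]: 1 minus the normalised entropy.

    A term concentrated in one category scores 1; a term spread evenly
    over all categories scores 0. This normalisation is an assumption.
    """
    k = len(counts)
    if k < 2 or sum(counts) == 0:
        return 0.0
    return 1.0 - category_entropy(counts) / math.log2(k)
```

Under these assumptions, `distinguishing_power([10, 0, 0, 0])` is 1 and `distinguishing_power([5, 5, 5, 5])` is 0, which matches the intuition that evenly spread terms carry little category information.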


6.
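Effective Use of Syntactic Information in Document Retrieval   Total citations: 1, self-citations: 0, citations by others: 1
Using term dependencies to improve the bag-of-words model has long been a popular topic in text retrieval. Existing methods for defining term dependencies fall into two main classes: lexical-level dependencies, defined from statistical proximity information, and syntactic-level dependencies, defined from syntactic structure. Although prior research shows that term dependencies can significantly improve retrieval performance over the bag-of-words model, the two classes have not been compared systematically. How to use syntactic information effectively to improve document and query representation, and which syntactic information is most useful for text retrieval, remain open questions. To this end, for document representation, this work compares the performance of term dependencies defined by proximity information with those defined by syntactic information; for query representation, it compares term dependencies defined by syntactic information at different levels. To compare the effect of these term dependencies on retrieval performance systematically, a retrieval model is proposed that builds on language modeling, takes smoothing as its guiding idea, and can readily incorporate both classes of term dependency. Experiments on TREC corpora show that, for document representation, syntactic relations do not differ markedly from statistical proximity relations. For query representation, partial syntactic information based on noun and proper-noun phrases is more effective than other syntactic information.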

7.
An information retrieval system has to retrieve all and only those documents that are relevant to a user query, even if index terms and query terms are not matched exactly. However, term mismatches between index terms and query terms have been a serious obstacle to the enhancement of retrieval performance. In this article, we discuss automatic term normalization between words and phrases in text corpora and its application to a Korean information retrieval system. We perform three new types of term normalization: transliterated word normalization, noun phrase normalization, and context-based term normalization. Transliterated words are normalized into equivalence classes by using contextual similarity to alleviate lexical term mismatches. Then, noun phrases are normalized into phrasal terms by segmenting compound nouns as well as normalizing noun phrases. Moreover, context-based terms are normalized by using a combination of mutual information and word context to establish word similarities. Next, unsupervised clustering is done by using the K-means algorithm, and co-occurrence clusters are identified to alleviate semantic term mismatches. These term normalizations are used in both the indexing and the retrieval system. The experimental results show that our proposed system can alleviate three types of term mismatches and can also provide appropriate similarity measurements. As a result, our system can improve the retrieval effectiveness of the information retrieval system.

8.
One of the most important research topics in Information Retrieval is term weighting for document ranking and retrieval, such as TF-IDF and BM25. We propose a term weighting method that utilizes past retrieval results consisting of the queries that contain a particular term, the retrieved documents, and their relevance judgments. A term's Discrimination Power (DP) is based on the difference between the term's average weights in the relevant and the non-relevant retrieved document sets. The difference-based DP performs better than the ratio-based DP introduced in previous research. Our experimental results show that a term weighting scheme based on the discrimination power method outperforms a TF-IDF based scheme.
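
As a rough illustration of the difference-based idea described here, the sketch below subtracts a term's mean weight in non-relevant retrieved documents from its mean weight in relevant ones. The function name, the input format, and the handling of empty sets are assumptions; the abstract does not specify them.

```python
def discrimination_power(relevant_weights, nonrelevant_weights):
    """Difference of a term's average weights in two retrieved document sets.

    Both arguments are lists of the term's weights, one per past retrieved
    document (hypothetical input format). An empty set contributes a mean
    of 0.0, which is an assumption made for this sketch.
    """
    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return mean(relevant_weights) - mean(nonrelevant_weights)
```

A positive value suggests the term tends to weigh more in relevant documents; a value near zero suggests it does not separate the two sets.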

9.
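Research on Feature Optimization Techniques in Latent Semantic Indexing   Total citations: 3, self-citations: 0, citations by others: 3
Latent semantic indexing (LSI) is widely used in information retrieval, text categorization, automatic question answering, and other fields. LSI is a dimensionality reduction method that maps co-occurring features onto the same dimension and non-co-occurring features onto different dimensions. In the LSI semantic space, co-occurring features are obtained through feature transitivity within and between documents. This paper argues that such transitivity introduces co-occurrences that do not actually exist and thereby degrades LSI performance, so the transitive relations should be filtered to remove the spurious co-occurrence information. The paper applies document-frequency-based feature selection to the document collection and runs three experiments on two public corpora with the Complete-Link clustering algorithm. The results show that retaining 10% to 15% by document frequency raises the F1 value by 6.5770%, 1.9928%, and 3.3614%, respectively.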

10.
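Text categorization is an important tool in text data mining and information retrieval research, and computing the weights of text feature terms is the key step in text categorization algorithms. To address shortcomings of the classic TF-IDF feature weighting method, a dynamic adaptive feature weighting method (DATW) is proposed. The algorithm considers not only how often a feature term occurs in a text and how many training-set texts contain that term, but also the term's dispersion and the gradient difference of feature vectors, so as to adapt to the classification of dynamic text. Experimental results show that computing feature weights with DATW effectively improves text categorization performance.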
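
For reference, the classic TF-IDF baseline that this abstract sets out to improve can be sketched as follows. This is one common textbook variant (length-normalised term frequency and an unsmoothed natural-log IDF), chosen here as an assumption; it is not DATW, whose dispersion and gradient terms the abstract does not define.

```python
import math


def tf_idf(term_count, doc_length, docs_with_term, n_docs):
    """Classic TF-IDF weight of one term in one document.

    term_count     : occurrences of the term in the document
    doc_length     : total number of terms in the document
    docs_with_term : number of training documents containing the term (>= 1)
    n_docs         : total number of training documents
    """
    tf = term_count / doc_length
    idf = math.log(n_docs / docs_with_term)
    return tf * idf
```

A term that appears in every training document gets weight 0 under this variant, which is one of the behaviours that motivates refinements of the baseline.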

11.
Active contours are a popular class of variational models used in computer vision for tracking and segmentation. The variational model consists of a data-fitting term and a regularisation term. Depending on the data-fitting term, active contour models are classified as either gradient based or region based. An often overlooked but crucial aspect of these models is that the two terms are weighted by a manually set constant weight. This constant weight often leads to incorrect segmentation, particularly for gradient based energies. The failure rate is high when strong gradients lie near the target or when the object gradient is not uniformly strong. In such circumstances, setting the weight becomes a critical and often unsatisfying task. In this work, we propose a new spatially varying and dynamic curve evolution term for robust segmentation with gradient based models. In contrast to the majority of the existing work in the literature, which focuses on defining new data-fitting terms, the evolution term proposed here is related to the regularisation of the evolution. The intuition is that, although object boundaries in images are generally continuous, the magnitude of the resulting gradient map is not uniformly strong. Therefore, any energy formulation that fixes the weights of the data-fitting and regularisation terms will run into the problems mentioned above. We propose an energy term that defines the regularisation term in a spatially varying manner. The advantage of this term is that it is independent of the image based data-fitting energy term and hence can be plugged into the wide variety of existing gradient based active contour models.

12.
In the vector space model (VSM), text representation is the task of transforming the content of a textual document into a vector in the term space so that the document can be recognized and classified by a computer or a classifier. Different terms (i.e. words, phrases, or any other indexing units used to identify the contents of a text) have different importance in a text. Term weighting methods assign appropriate weights to the terms to improve the performance of text categorization. In this study, we investigate several widely used unsupervised (traditional) and supervised term weighting methods on benchmark data collections in combination with SVM and kNN algorithms. In consideration of the distribution of relevant documents in the collection, we propose a new, simple supervised term weighting method, tf.rf, to improve the terms' discriminating power for the text categorization task. In the controlled experiments, the supervised term weighting methods show mixed performance. Specifically, our proposed method, tf.rf, performs consistently better than the other term weighting methods, while the other supervised methods, based on information theory or statistical metrics, perform the worst in all experiments. On the other hand, the popular tf.idf method does not perform uniformly well across different data sets.
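
The abstract names tf.rf without stating its formula. The sketch below uses the relevance frequency as it is commonly stated in the literature, rf = log2(2 + a / max(1, c)), where a and c are the numbers of positive-category and negative-category documents containing the term. Treat the formula and the parameter names as assumptions to be checked against the paper.

```python
import math


def relevance_frequency(pos_docs_with_term, neg_docs_with_term):
    """rf = log2(2 + a / max(1, c)), as commonly stated for tf.rf (assumed here).

    pos_docs_with_term (a): positive-category documents containing the term
    neg_docs_with_term (c): negative-category documents containing the term
    """
    return math.log2(2 + pos_docs_with_term / max(1, neg_docs_with_term))


def tf_rf(term_frequency, pos_docs_with_term, neg_docs_with_term):
    """Weight of a term in a document: raw term frequency times rf."""
    return term_frequency * relevance_frequency(pos_docs_with_term,
                                                neg_docs_with_term)
```

Under this form, a term that never appears in the positive category gets rf = 1, so its weight falls back to the plain term frequency, while terms concentrated in the positive category are boosted.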

13.
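Based on the variational level set method, a general variational model for surface diffusion is proposed. Its data term is the squared difference between the Heaviside functions of the level set functions of the evolving surface and of the original surface. Its regularization term is a general function of the total curvature, designed by analogy between the total variation in image diffusion models and the total curvature in this model, so as to accomplish surface diffusion. To avoid reinitializing the level set function, a penalty term that forces the level set function to be a signed distance function is added to the energy functional. The resulting evolution equation is a fourth-order partial differential equation; its convection term is discretized with the classical upwind difference scheme and its diffusion term with a central difference scheme. Finally, numerical examples verify the feasibility of the model for surface smoothing, edge preservation, and edge enhancement.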

14.
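A text topic word extraction algorithm based on incremental word-set frequency is proposed. Its core idea is to compute the increment in the frequency of the topic word set. When extracting topic words from the candidate set, the algorithm computes the increment that a single candidate contributes to the frequency of the topic word set. If the increment is below a given threshold, extraction stops; otherwise the candidate is added to the topic word set and the next candidate is examined. Experimental results show that the algorithm performs well and that the extracted topic words reflect the main content of an article more accurately.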
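
The stopping rule described above can be sketched as a short greedy loop. The abstract does not define "word-set frequency", so this sketch assumes it is the fraction of sentences containing at least one word of the set, and it assumes the candidates arrive already ranked. Both assumptions, and all names, are illustrative only.

```python
def set_frequency(words, sentences):
    """Assumed definition: share of sentences containing any word of the set."""
    if not sentences:
        return 0.0
    hits = sum(1 for tokens in sentences if any(w in tokens for w in words))
    return hits / len(sentences)


def extract_topic_words(candidates, sentences, threshold):
    """Add ranked candidates until one raises the set frequency by < threshold."""
    topic_words = []
    current = 0.0
    for candidate in candidates:
        updated = set_frequency(topic_words + [candidate], sentences)
        if updated - current < threshold:
            break  # increment too small: extraction ends here
        topic_words.append(candidate)
        current = updated
    return topic_words
```

For example, with sentences given as token sets `[{"a", "b"}, {"a"}, {"c"}, {"d"}]`, candidates `["a", "c", "b", "d"]`, and a threshold of 0.2, the loop keeps `"a"` and `"c"` and stops at `"b"`, which adds no new coverage.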

15.
An efficient term mining method to build a general term network is presented. The resulting term network can be used for entity relation visualization and exploration, which is useful in many text-mining applications such as crime exploration and investigation from vast piles of crime news or official criminal records. In the proposed method, terms from each document in a text collection are first identified. They are then analyzed for pairwise association weights. The weights are accumulated over all the documents to obtain the final similarity for each term pair. Based on the resulting term similarity, a general term network for the collection is built with terms as nodes and non-zero similarities as links. In application, a list of predefined terms having similar attributes is selected to extract the desired sub-network from the general term network for entity relation visualization. This text analysis scenario, based on collective terms of a similar type or from the same topic, enables evidence-based relation exploration. Some practical instances of crime exploration and investigation are demonstrated. Our application examples show that term relations, be it causality, subordination, coupling, or others, can be effectively revealed by our method and easily verified against the underlying text collection. This work contributes by presenting an integrated term-relationship mining and exploration approach and by demonstrating the feasibility of the term network for the increasingly important application of crime exploration and investigation.
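
The accumulate-then-extract pipeline described above can be sketched as follows. The abstract does not say how the pairwise association weight is computed, so this sketch assumes a weight of 1 for each document in which two terms co-occur; the function names and input format are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_term_network(documents):
    """Accumulate pairwise weights over all documents.

    documents: iterable of term lists, one list per document.
    Returns {(term_a, term_b): weight} with term_a < term_b; only pairs
    with non-zero accumulated weight appear, i.e. the links of the network.
    """
    weights = defaultdict(float)
    for terms in documents:
        for a, b in combinations(sorted(set(terms)), 2):
            weights[(a, b)] += 1.0  # assumed weight: one per co-occurrence
    return dict(weights)


def sub_network(network, selected_terms):
    """Keep only the links whose two endpoints are both in the selected list."""
    selected = set(selected_terms)
    return {pair: w for pair, w in network.items()
            if pair[0] in selected and pair[1] in selected}
```

Calling `sub_network` with a predefined list of terms of one type (for instance, person names) mirrors the sub-network extraction step used for relation visualization.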

17.
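Structural Analysis and Extraction of Chinese Term Definitions   Total citations: 13, self-citations: 2, citations by others: 13
The work presented here is applied research built on Chinese syntactic parsing, and it examines theoretically how terms are defined. The forms that term definitions take provide template structures and patterns of composition in Chinese grammar; they can serve as a data foundation for knowledge discovery research and as a grammatical knowledge system for a specific domain. Corpora from electronics and computer science were word-segmented and part-of-speech tagged, and a syntactic parser was then used to identify the phrase constituents of each sentence. Based on Chinese sentence patterns, the structural characteristics of term definitions were summarized and definition templates were extracted automatically. Finally, drawing on the data and concept descriptions thus established, an algorithm for term discovery is given.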

18.
When navigating in crowds, humans are able to move efficiently between people. They look ahead to know which path would reduce the complexity of their interactions with others. Current navigation systems for virtual agents consider long‐term planning to find a path in the static environment and short‐term reactions to avoid collisions with close obstacles. Recently, some mid‐term considerations have been added to avoid high density areas. However, there is no mid‐term planning among static and dynamic obstacles that would enable the agent to look ahead and avoid difficult paths or find easy ones as humans do. In this paper, we present a system for such mid‐term planning. This system is added to the navigation process between pathfinding and local avoidance to improve the navigation of virtual agents. We show the capacities of such a system using several case studies. Finally, we use an energy criterion to compare trajectories computed with and without the mid‐term planning.

19.
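Within the abstract matching flow framework, a color image registration model is constructed that can overcome large color differences. In the model, the data term uses the cross-correlation function as the similarity measure between the two images in order to handle large color differences. The regularization term uses an anisotropic diffusion filter to constrain image evolution, so that image features are effectively preserved during evolution. The diffusion coefficient of the filter is defined as a function of the color structure tensor, so that image evolution integrates information from all channels, which resolves the color aliasing caused by inconsistent displacement fields across channels. Experimental results show that the model registers color images with large color differences effectively.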

20.
J. Jaffre 《Calcolo》1984,21(2):171-197
We analyze a numerical scheme for scalar diffusion-convection equations. The convective term is approximated by an upwind scheme for discontinuous finite elements and the diffusion term is approximated by a mixed finite element method. Studying large convection problems, we calculate estimates which remain valid when the diffusion term vanishes. Since the error analysis shows that the convection term is approximated less precisely than the diffusion term, the initial formulation is modified in order to balance errors from these two terms.
