Similar literature
Found 20 similar documents (search time: 31 ms)
1.
Field Association (FA) Terms, words or phrases that serve to identify document fields, are effective in document classification, similar file retrieval and passage retrieval. The problem is the lack of an effective method to extract and select relevant FA Terms in order to build a comprehensive dictionary of FA Terms. This paper presents a new method to extract, select and rank FA Terms from domain-specific corpora using part-of-speech (POS) pattern rules, corpora comparison and modified tf-idf weighting. Experimental evaluation on 21 fields using 306 MB of domain-specific corpora obtained from English Wikipedia dumps selected up to 2,517 FA Terms (single and compound) per field at precision and recall of 74–97% and 65–98%, respectively. This is better than the traditional methods. The FA Terms dictionary constructed using this method achieved an average accuracy of 97.6% in identifying the fields of 10,077 test documents collected from Wikipedia, the Reuters RCV1 corpus and the 20 Newsgroups data set.

2.
We define Markoff words as certain factors appearing in bi-infinite words satisfying the Markoff condition. We prove that these words coincide with central words, yielding a new characterization of Christoffel words.

3.
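ZHANG Yong, YANG Hao. Journal of Computer Applications (《计算机应用》), 2017, 37(8): 2244-2247
To address the problem that an overly large visual dictionary in the bag-of-visual-words (BOV) model makes image classification too time-consuming, a weighted maximal relevance minimal similarity (W-MR-MS) criterion for optimizing the visual dictionary is proposed. First, scale-invariant feature transform (SIFT) features are extracted from the images and clustered with the K-Means algorithm to generate the original visual dictionary. Then, the relevance between each visual word and the image categories and the semantic similarity among visual words are computed, and a weighting coefficient is introduced to balance their importance for image classification. Finally, based on this trade-off, visual words that are weakly relevant to the image categories and highly semantically similar to other visual words are removed, thereby optimizing the visual dictionary. Experimental results show that, with the same dictionary size, the classification accuracy of the proposed method is 5.30% higher than that of the traditional K-Means-based method; at the same classification accuracy, its time cost is 32.18% lower than that of the traditional K-Means approach. The proposed method therefore has high classification efficiency and is suitable for image classification.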

4.
Recently the second two authors characterized quasiperiodic Sturmian words, proving that a Sturmian word is non-quasiperiodic if and only if it is an infinite Lyndon word. Here we extend this study to episturmian words (a natural generalization of Sturmian words) by describing all the quasiperiods of an episturmian word, which yields a characterization of quasiperiodic episturmian words in terms of their directive words. Even further, we establish a complete characterization of all episturmian words that are Lyndon words. Our main results show that, unlike the Sturmian case, there is a much wider class of episturmian words that are non-quasiperiodic, besides those that are infinite Lyndon words. Our key tools are morphisms and directive words, in particular normalized directive words, which we introduced in an earlier paper. Also of importance is the use of return words to characterize quasiperiodic episturmian words, since such a method could be useful in other contexts.

5.
This paper concerns a specific class of strict standard episturmian words whose directive words resemble those of characteristic Sturmian words. In particular, we explicitly determine all integer powers occurring in such infinite words, extending recent results of Damanik and Lenz [D. Damanik, D. Lenz, Powers in Sturmian sequences, European J. Combin. 24 (2003) 377–390, doi:10.1016/S0195-6698(03)00026-X], who studied powers in Sturmian words. The key tools in our analysis are canonical decompositions and a generalization of singular words, which were originally defined for the ubiquitous Fibonacci word. Our main results are demonstrated via some examples, including the k-bonacci word, a generalization of the Fibonacci word to a k-letter alphabet (k≥2).
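For readers unfamiliar with the k-bonacci word, here is a minimal sketch of how its prefixes can be generated. It assumes the usual defining morphism on the alphabet {0, ..., k-1}, which sends i to 0(i+1) for i < k-1 and sends k-1 to 0; the abstract does not spell this out, and the function name `kbonacci_prefix` is purely illustrative.

```python
def kbonacci_prefix(k: int, iterations: int) -> list[int]:
    """Iterate the (assumed) k-bonacci morphism starting from the letter 0.

    Morphism: i -> 0 (i+1) for i < k-1, and (k-1) -> 0.
    For k = 2 this is the Fibonacci morphism 0 -> 01, 1 -> 0.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    word = [0]
    for _ in range(iterations):
        image = []
        for letter in word:
            if letter < k - 1:
                image.extend((0, letter + 1))  # i -> 0 (i+1)
            else:
                image.append(0)                # k-1 -> 0
        word = image
    return word


fibonacci = kbonacci_prefix(2, 4)   # prefix of the Fibonacci word
tribonacci = kbonacci_prefix(3, 3)  # prefix of the Tribonacci word
```

Each iterate is a prefix of the next, so increasing `iterations` yields longer prefixes of the same infinite word.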

6.
The use of the computing with words paradigm for the problem of automatic text document categorization is discussed. This specific problem of information retrieval (IR) is becoming more and more important, notably in view of the fast proliferation of textual information available on the Internet. The main issues that have to be addressed here are document representation and classification. The use of fuzzy logic for both problems has already been quite deeply studied, though for the latter, i.e., classification, generally not in an IR context. Our approach is based mainly on the classical calculus of linguistically quantified propositions proposed by Zadeh. Moreover, we employ results related to fuzzy (linguistic) queries in IR, notably various interpretations of the weights of query terms. Some preliminary results on widely adopted text corpora are presented.

7.
Sturmian sequences are well known as the ones having minimal complexity over a 2-letter alphabet. They are also the balanced sequences over a 2-letter alphabet and the sequences describing discrete lines. They are famous and have been extensively studied since the 18th century. One of the extensions of these sequences over a k-letter alphabet, with k≥3, is the episturmian sequences, which generalize a construction of Sturmian sequences using the palindromic closure operation. There exists a finite version of the Sturmian sequences called the Christoffel words. They have been known since the works of Christoffel and have interested many mathematicians. In this paper, we introduce a generalization of Christoffel words for an alphabet with 3 letters or more, using the episturmian morphisms. We call them the epichristoffel words. We define this new class of finite words and show how some of the properties of the Christoffel words can be generalized naturally or not for this class.
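As a concrete illustration of the classical (binary) Christoffel words that this abstract generalizes, the following is a minimal sketch. It assumes the standard arithmetic definition of the lower Christoffel word with p zeros and q ones (p and q coprime): the i-th letter is 0 when i·q mod (p+q) increases at the next step, and 1 otherwise. The function name `christoffel` is illustrative, and the epichristoffel construction of the paper is not reproduced here.

```python
from math import gcd


def christoffel(p: int, q: int) -> str:
    """Lower Christoffel word with p zeros and q ones (assumed standard definition)."""
    if p < 1 or q < 1 or gcd(p, q) != 1:
        raise ValueError("p and q must be positive and coprime")
    n = p + q
    # Letter i is '0' if (i*q mod n) < ((i+1)*q mod n), else '1'.
    return "".join(
        "0" if (i * q) % n < ((i + 1) * q) % n else "1" for i in range(n)
    )


word = christoffel(3, 2)
central = word[1:-1]  # drop the first and last letters
```

Removing the first and last letters gives the central word of the Christoffel word, which is a palindrome.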

8.
The stabilizer of an infinite word over a finite alphabet Σ is the monoid of morphisms over Σ that fix this word. In this paper we study various problems related to stabilizers and their generators. We show that over a binary alphabet, there exist stabilizers with at least n generators for all n. Over a ternary alphabet, the monoid of morphisms generating a given infinite word by iteration can be infinitely generated, even when the word is generated by iterating an invertible primitive morphism. Stabilizers of strict epistandard words are cyclic when non-trivial, while stabilizers of ultimately strict epistandard words are always non-trivial. For this latter family of words, we give a characterization of stabilizer elements.

9.
In this paper, we propose a novel scene categorization method based on contextual visual words. In the proposed method, we extend the traditional ‘bags of visual words’ model by introducing contextual information from the coarser scale and neighborhood regions to the local region of interest based on unsupervised learning. The introduced contextual information provides useful cues about the region of interest, which can reduce the ambiguity when employing visual words to represent the local regions. The improved visual words representation of the scene image is capable of enhancing the categorization performance. The proposed method is evaluated over three scene classification datasets, with 8, 13 and 15 scene categories, respectively, using 10-fold cross-validation. The experimental results show that the proposed method achieves 90.30%, 87.63% and 85.16% recognition success for Datasets 1, 2 and 3, respectively, which significantly outperforms methods based on visual words that represent only the local information in a statistical manner. We also compared the proposed method with three representative scene categorization methods. The result confirms the superiority of the proposed method.

10.
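Text feature selection and weight calculation scheme based on key terms   Total citations: 2, self-citations: 3, citations by others: 2
The formal representation of text has long been a major difficulty in text classification. In the widely used vector space model, the weight of each feature dimension of a text is its TFIDF value, which makes it hard to highlight the features that play a key role in the text's content. A feature selection and weight calculation scheme based on key terms is proposed. It uses the structural information of the text together with mutual information theory to extract the terms that play a key role in the content; the weight calculation combines term position, term relations and term frequency, highlighting the contribution of key terms and compensating for the shortcomings of TFIDF. Experiments with a support vector machine (SVM) classifier show that the proposed Score weighting method achieves an average classification accuracy about 5% higher than the traditional TFIDF method.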
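The baseline TFIDF weighting referred to in this abstract can be sketched as follows. This is an illustration only: it assumes the common variant tf × ln(N/df) with length-normalized term frequency, which the abstract does not specify, and it is not the Score weighting proposed by the authors. The function name `tfidf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: weight} mapping per tokenized document.

    Assumed variant: weight = (count / doc_length) * ln(N / df),
    where df is the number of documents containing the term.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    weights = []
    for doc in docs:
        term_counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return weights


weights = tfidf([["a", "b", "a"], ["b", "c"]])
```

In this toy input the term "b" occurs in every document, so its weight is zero regardless of how important it is to the content, which is the kind of limitation the abstract attributes to TFIDF.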

11.
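Application of the naive Bayes classifier to terrain evaluation   Total citations: 3, self-citations: 0, citations by others: 3
In view of the shortcomings of currently popular evaluation methods and the specifics of the practical problem, applying the naive Bayes classifier to terrain evaluation is proposed. Training samples are extracted from a database evaluated with an expert function; features are reduced based on the minimum distribution entropy principle; attributes are then discretized based on an optimality condition; and finally parameters are learned based on conjugate distributions to obtain a classifier. Samples to be classified can be labeled directly by the Bayes classifier and, following incremental learning theory, the classification results can serve as training samples for a new classifier, further improving classification accuracy. Experiments show that the method reduces evaluation time and that its classification accuracy is also satisfactory.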

12.
Using a combinatorial characterization of digital convexity based on words, one defines the language of convex words. The complement of this language forms an ideal whose minimal elements, with respect to the factorial ordering, appear to have a particular combinatorial structure very close to the Christoffel words. In this paper, those words are completely characterized as those of the form u w^k v where k≥1, w=uv and u, v, w are Christoffel words. Also, by considering the most balanced among the unbalanced words, we obtain a second characterization for a special class of minimal non-convex words that are of the form u^2 v^2, corresponding to the case k=1 in the previous form.

13.
14.
Originally introduced and studied by the third and fourth authors together with J. Justin and S. Widmer (2008), rich words constitute a new class of finite and infinite words characterized by containing the maximal number of distinct palindromes. Several characterizations of rich words have already been established. A particularly nice characteristic property is that all ‘complete returns’ to palindromes are palindromes. In this note, we prove that rich words are also characterized by the property that each factor is uniquely determined by its longest palindromic prefix and its longest palindromic suffix.
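The notion of "maximal number of distinct palindromes" can be made concrete for finite words with a minimal sketch. It relies on the known bound that a word of length n has at most n + 1 distinct palindromic factors, counting the empty word, a fact not stated in the abstract itself. The brute-force check below is illustrative only; the function names are hypothetical.

```python
def palindromic_factors(word: str) -> set[str]:
    """Collect all distinct palindromic factors of word, including the empty word."""
    found = {""}
    n = len(word)
    for start in range(n):
        for end in range(start + 1, n + 1):
            factor = word[start:end]
            if factor == factor[::-1]:
                found.add(factor)
    return found


def is_rich(word: str) -> bool:
    """A finite word is rich if it attains the bound of len(word) + 1 palindromic factors."""
    return len(palindromic_factors(word)) == len(word) + 1


rich_example = is_rich("abba")      # palindromes: "", a, b, bb, abba
non_rich_example = is_rich("abca")  # palindromes: "", a, b, c only
```

The check is cubic in the word length, which is adequate for small examples but not for long words.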

15.
Most of the text categorization algorithms in the literature represent documents as collections of words. An alternative which has not been sufficiently explored is the use of word meanings, also known as senses. In this paper, using several algorithms, we compare the categorization accuracy of classifiers based on words to that of classifiers based on senses. The document collection on which this comparison takes place is a subset of the annotated Brown Corpus semantic concordance. A series of experiments indicates that the use of senses does not result in any significant categorization improvement.

16.
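Knowledge graph analysis of logic words in natural language processing   Total citations: 1, self-citations: 1, citations by others: 0
Knowledge graphs are a new method of knowledge representation. From an ontological perspective, this paper compares the ontology of knowledge graphs with the three knowledge-representation ontologies of Aristotle, Kant and Peirce, showing the validity and primitiveness of the knowledge graph method and indicating that knowledge graphs are a more general method of knowledge representation. From the viewpoint of knowledge graph ontology, the knowledge graph representations of various classes of logic words are studied. Taking the characteristics of Chinese into account, the paper studies and reveals, from a structural perspective, the commonalities and regularities of logic words, further elucidating the knowledge graph idea that "structure is meaning". The knowledge graph analysis of logic words will lay a foundation for building dictionaries for natural language analysis.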

17.
A set of words is factorially balanced if the set of all the factors of its words is balanced. We prove that if all words of a factorially balanced set have a finite index, then this set is a subset of the set of factors of a Sturmian word. Moreover, characterizing the set of factors of a given length n of a Sturmian word by the left special factor of length n−1 of this Sturmian word, we provide an enumeration formula for the number of sets of words that correspond to some set of factors of length n of a Sturmian word.

18.
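With the goal of automatically extracting general-purpose Kazakh vocabulary, an improved term domain generality formula, built on the traditional term domain usage measure, is used to compute the generality of Kazakh words, so that the improved formula has a greater influence on the ranking position of Kazakh general-purpose words. Based on the three characteristics of general-purpose vocabulary (generality across domains, across regions and over time), a statistical method is used to examine how general Kazakh words are, and generality statistics are implemented on top of Kazakh word frequency statistics. Experimental results show that, when extracting Kazakh general-purpose vocabulary, the improved term domain generality formula influences word ranking more strongly than the traditional term domain usage formula.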

19.
A balanced word is one in which any two factors of the same length contain the same number of each letter of the alphabet up to one. Finite binary balanced words are called Sturmian words. A Sturmian word is bispecial if it can be extended to the left and to the right with both letters while remaining a Sturmian word. There is a deep relation between bispecial Sturmian words and Christoffel words, which are the digital approximations of Euclidean segments in the plane. In 1997, J. Berstel and A. de Luca proved that palindromic bispecial Sturmian words are precisely the maximal internal factors of primitive Christoffel words. We extend this result by showing that bispecial Sturmian words are precisely the maximal internal factors of all Christoffel words. Our characterization allows us to give an enumerative formula for bispecial Sturmian words. We also investigate the minimal forbidden words for the language of Sturmian words.
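The definition of a balanced word given in this abstract translates directly into a brute-force check. The sketch below is illustrative only and follows the definition literally (compare letter counts across all factors of each length); the function name `is_balanced` is hypothetical, and no attempt is made at efficiency.

```python
def is_balanced(word: str) -> bool:
    """Check that, for every length, any two factors of that length differ
    by at most one in the number of occurrences of each letter."""
    alphabet = set(word)
    n = len(word)
    for length in range(1, n + 1):
        factors = {word[i:i + length] for i in range(n - length + 1)}
        for letter in alphabet:
            counts = [factor.count(letter) for factor in factors]
            if max(counts) - min(counts) > 1:
                return False
    return True


balanced_example = is_balanced("aabaab")  # a finite Sturmian word
unbalanced_example = is_balanced("aabb")  # factors "aa" and "bb" differ by two in each letter
```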

20.
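Chinese short-text classification involves a large number of low-frequency words, and making good use of the information they carry can effectively improve classification. To address the problem that low-frequency words cannot be used effectively in word-vector-based text classification, a data augmentation method targeting low-frequency words is proposed. First, data produced by a constrained text generation model is used to fine-tune the word vectors of low-frequency words, and a word vector construction algorithm then transfers the update information of high-frequency words to low-frequency words, so that low-frequency words obtain word vector representations that are more accurate and consistent with the training set distribution. Second, prior knowledge such as similar words and entity concepts is introduced to supplement contextual information. Finally, an improved chi-square statistic is used to remove obvious noise words, and a word attention layer is designed to weight each word, reducing the impact of irrelevant noise on classification. Experiments on several baseline classification models show that every baseline improves markedly after the modification, demonstrating the effectiveness of the proposed method and also indicating that low-frequency words can improve classification in short-text classification tasks.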
