首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
在文本分类中,互信息是一种被广泛应用的特征选择方法,但是该方法仅考虑了特征的文档频而没有考虑特征的词频,导致它经常倾向于选择出现频率较低的特征。为此,提出了一个新的文档频并把它引入到互信息方法中,从而获得了一种优化的互信息方法。该优化的互信息方法不但考虑了特征的文档频而且还考虑了特征出现的词频。实验结果表明该优化的互信息方法性能良好。  相似文献   

2.
熊忠阳  付玲玲  张玉芳  蒋健 《计算机应用》2010,30(10):2621-2623
传统的基于词频统计的特征选择方法忽略了特征项本身的语义信息,特征项之间存在冗余使得维数有限的特征空间无法容纳更多的对分类有用的特征项。为此,利用《知网》(HowNet)的中英双语知识词典构建“概念—领域”表,对每个词语查询该表,如果在表中,则把该词语映射到“领域”;否则保留原词。这样不仅可以将较低层概念泛化到较高层概念,还能在一定程度上消除特征项之间的冗余,而且从语义上加强它对所在“领域”的分类贡献度。分别应用信息增益和χ2统计利用该方法进行文本分类实验,结果表明该方法可以有效地提高分类准确率。  相似文献   

3.
Text categorization plays an important role in applications where information is filtered, monitored, personalized, categorized, organized or searched. Feature selection remains as an effective and efficient technique in text categorization. Feature selection metrics are commonly based on term frequency or document frequency of a word. We focus on relative importance of these frequencies for feature selection metrics. The document frequency based metrics of discriminative power measure and GINI index were examined with term frequency for this purpose. The metrics were compared and analyzed on Reuters 21,578 dataset. Experimental results revealed that the term frequency based metrics may be useful especially for smaller feature sets. Two characteristics of term frequency based metrics were observed by analyzing the scatter of features among classes and the rate at which information in data was covered. These characteristics may contribute toward their superior performance for smaller feature sets.  相似文献   

4.
Text categorization presents unique challenges to traditional classification methods due to the large number of features inherent in the datasets from real-world applications of text categorization, and a great deal of training samples. In high-dimensional document data, the classes are typically categorized only by subsets of features, which are typically different for the classes of different topics. This paper presents a simple but effective classifier for text categorization using class-dependent projection based method. By projecting onto a set of individual subspaces, the samples belonging to different document classes are separated such that they are easily to be classified. This is achieved by developing a new supervised feature weighting algorithm to learn the optimized subspaces for all the document classes. The experiments carried out on common benchmarking corpuses showed that the proposed method achieved both higher classification accuracy and lower computational costs than some distinguishing classifiers in text categorization, especially for datasets including document categories with overlapping topics.  相似文献   

5.
Most of text categorization techniques are based on word and/or phrase analysis of the text. Statistical analysis of a term frequency captures the importance of the term within a document only. However, two terms can have the same frequency in there documents, but one term contributes more to the meaning of its sentences than the other term. Thus, the underlying model should identify terms that capture the semantics of text. In this case, the model can capture terms that present the concepts of the sentence, which leads to discovering the topic of the document. A new concept‐based model that analyzes terms on the sentence, document, and corpus levels rather than the traditional analysis of document only is introduced. The concept‐based model can effectively discriminate between nonimportant terms with respect to sentence semantics and terms which hold the concepts that represent the sentence meaning. A set of experiments using the proposed concept‐based model on different datasets in text categorization is conducted in comparison with the traditional models. The results demonstrate the substantial enhancement of the categorization quality using the sentence‐based, document‐based and corpus‐based concept analysis.  相似文献   

6.
Text categorization assigns predefined categories to either electronically available texts or those resulting from document image analysis. A generic system for text categorization is presented which is based on statistical analysis of representative text corpora. Significant features are automatically derived from training texts by selecting substrings from actual word forms and applying statistical information and general linguistic knowledge. The dimension of the feature vectors is then reduced by linear transformation, keeping the essential information. The classification is a minimum least-squares approach based on polynomials. The described system can be efficiently adapted to new domains or different languages. In application, the adapted text categorizers are reliable, fast, and completely automatic. Two example categorization tasks achieve recognition scores of approximately 80% and are very robust against recognition or typing errors.  相似文献   

7.
基于特征投票机制设计一种线性文本分类方法,运用信任机制理论分析文档类别对特征的信任关系,给出具体特征信任度的模型,并在Newsgroup、复旦中文分类语料、Reuters-21578 3个广泛使用且具有不同特性的语料集上与传统方法进行比较。实验结果表明,该方法分类性能优于传统方法且稳定、高效,适用于大规模文本分类任务。  相似文献   

8.
基于主题词频数特征的文本主题划分   总被引:4,自引:1,他引:4  
康恺  林坤辉  周昌乐 《计算机应用》2006,26(8):1993-1995
目前文本分类所采用的文本—词频矩阵具有词频维数过大和过于稀疏两个特点,给计算造成了一定困难。为解决这一问题,从用户使用搜索引擎时选择所需文本的心理出发,提出了一种基于主题词频数特征的文本主题划分方法。该方法首先根据统计方法筛选各文本类的主题词,然后以主题词类替代单个词作为特征采用模糊C 均值(FCM)算法施行文本聚类。实验获得了较好的主题划分效果,并与一种基于词聚类的文本聚类方法进行了过程及结果中多个方面的比较,得出了一些在实施要点和应用背景上较有意义的结论。  相似文献   

9.
Text categorization is continuing to be one of the most researched NLP problems due to the ever-increasing amounts of electronic documents and digital libraries. In this paper, we present a new text categorization method that combines the distributional clustering of words and a learning logic technique, called Lsquare, for constructing text classifiers. The high dimensionality of text in a document has not been fruitful for the task of categorization, for which reason, feature clustering has been proven to be an ideal alternative to feature selection for reducing the dimensionality. We, therefore, use distributional clustering method (IB) to generate an efficient representation of documents and apply Lsquare for training text classifiers. The method was extensively tested and evaluated. The proposed method achieves higher or comparable classification accuracy and {rm F}_1 results compared with SVM on exact experimental settings with a small number of training documents on three benchmark data sets WebKB, 20Newsgroup, and Reuters-21578. The results prove that the method is a good choice for applications with a limited amount of labeled training data. We also demonstrate the effect of changing training size on the classification performance of the learners.  相似文献   

10.
杨为民  李龙澍 《微机发展》2007,17(2):135-137
信息检索的一个核心问题是自动文本分类。基于分类体系的文本分类需要全文抽取主题词、计算权重,再根据分类体系对文献进行分类。文中构建一种基于Agent技术的文本自动分类系统,仅需要对文档头进行信息处理就可以进行快速文本分类,有效地减少了文本分类过程中的时间和空间的消耗。  相似文献   

11.
基于深度信念网络的文本分类算法   总被引:2,自引:0,他引:2  
随着网络的迅猛发展,文本分类成为处理和组织大量文档数据的关键技术.目前已经有许多不同类型的神经网络应用于文本分类,并且取得良好的效果.但是,大部分模型仅采用文档的少量特征作为输入,没有考虑到足够的信息量;而当考虑到足够的特征时,又会发生维数灾难,导致模型难以训练或者训练时间大幅增加.利用深度信念网络从文本中抽取特征,并利用softmax回归分类器对抽取后的特征分类.深度信念网络不仅具有强大的学习能力,同时还能从高维的原始特征中抽取低维度高度可区分的低维特征,因此利用深度信念网络来对文本分类,不仅能够考虑到文档的足够的信息量,而且能够快速的训练.并且实验结果也表明利用深度信念网络实现文本分类的性能很好.  相似文献   

12.
为了提高文本分类算法的效率和精度,必须使用特征选择算法来降低特征空间的维数。然而许多常用特征选择算法在选择属性时,只是利用特征的权重而并没有考虑特征之间的隐含关系,使得得到的特征集存在一定的冗余,并不具备较好的代表性。首先给出了一个基于最小词频的文档频方法,并用它过滤掉一些词条以降低文本矩阵的稀疏性,然后使用LSA进行词语间的语义分析,消除同义词和多义词的影响,提高了文本分类的速度与精确度。实验结果表明此种特征选择方法效果良好。  相似文献   

13.
特征选择是文本分类的关键步骤之一,所选特征子集的优劣直接影响文本分类的结果.分析了词频法和文档频法并总结了其缺陷,给出了一个改进的文档频方法;引进粗糙集理论,提出了一个属性约简算法;最后提出了一个新的特征选择方法.该特征选择方法使用改进的文档频初选特征并用所提属性约简算法消除冗余.仿真结果表明该特征选择方法性能较好.  相似文献   

14.
基于句类特征的作者写作风格分类研究   总被引:1,自引:1,他引:0       下载免费PDF全文
不同作家的作品有自己的特点,这些特点体现在词汇、句型、修辞手法等各个方面,尝试使用句类特征进行作者写作风格分类,进一步可以用于作者的识别。利用向量空间模型,以句类作为特征,并通过混合句类分解等技术对句类向量空间降维,使用itc算法对特征项进行权重计算,KNN算法进行分类并利用集成判决技术,形成作者写作风格分类器。本分类器的性能在近现代小说的按作者写作风格的分类和鉴别方面的性能是可以接受的,并有进一步提升的可能。  相似文献   

15.
Most of the text categorization algorithms in the literature represent documents as collections of words. An alternative which has not been sufficiently explored is the use of word meanings, also known as senses. In this paper, using several algorithms, we compare the categorization accuracy of classifiers based on words to that of classifiers based on senses. The document collection on which this comparison takes place is a subset of the annotated Brown Corpus semantic concordance. A series of experiments indicates that the use of senses does not result in any significant categorization improvement.  相似文献   

16.
文本的表示与文本的特征提取是文本分类需要解决的核心问题,基于此,提出了基于改进的连续词袋模型(CBOW)与ABiGRU的文本分类模型。该分类模型把改进的CBOW模型所训练的词向量作为词嵌入层,然后经过卷积神经网络的卷积层和池化层,以及结合了注意力(Attention)机制的双向门限循环单元(BiGRU)神经网络充分提取了文本的特征。将文本特征向量输入到softmax分类器进行分类。在三个语料集中进行的文本分类实验结果表明,相较于其他文本分类算法,提出的方法有更优越的性能。  相似文献   

17.
流派分类和基于主题的文本分类最大的区别之处就在于文本的特征。流派分类需要能够描述文档风格的、表达更强语义信息的特征,基于特征情感色彩的分类方法是将情感色彩这种语义信息附加到特征上。首先介绍了文档流派分类的概念及其应用,然后分析了流派分类的文本特征和词汇的情感倾向权值的几种计算方法,论述了基于特征情感色彩的文档流派分类过程,最后对几种分类方法进行了实验结果分析和比较。  相似文献   

18.
Text Retrieval from Document Images Based on Word Shape Analysis   总被引:2,自引:1,他引:2  
In this paper, we propose a method of text retrieval from document images using a similarity measure based on word shape analysis. We directly extract image features instead of using optical character recognition. Document images are segmented into word units and then features called vertical bar patterns are extracted from these word units through local extrema points detection. All vertical bar patterns are used to build document vectors. Lastly, we obtain the pair-wise similarity of document images by means of the scalar product of the document vectors. Four corpora of news articles were used to test the validity of our method. During the test, the similarity of document images using this method was compared with the result of ASCII version of those documents based on the N-gram algorithm for text documents.  相似文献   

19.
随着互联网的普及,人类获取特定信息需求的增加,如何快速获取特定类别信息是当前搜索引擎,门户网站等必须解决的问题。当前网页分类的任务都由机器学习的文本分类算法完成,但传统的机器学习分类方法基本没有考虑文本数据特征,提供无差别的分类服务。该系统充分考虑网页文本数据的特征,以文本标题为突破口实现快速分类以及依据SVM的普通分类。快速分类依据文本标题通过分词模型训练快速对应到分类标签上,完成快速分类。如果快速分类不成功则将文本内容通过结巴分词器分词,word2vec进行分词向量的训练,再根据分类要求通过SVM进行分类,完成普通的分类。通过提供两种不同的服务来完成不同的需求。  相似文献   

20.
文本特征表示是在文本自动分类中最重要的一个环节。在基于向量空间模型(VSM)的文本表示中特征单元粒度的选择直接影响到文本分类的效果。对于基于词袋模型(BOW)的维吾尔文文本分类效果不理想的问题,提出了一种基于统计方法的维吾尔语短语抽取算法并将抽取到的短语作为文本特征项,采用支持向量机(SVM)算法对维吾尔文文本进行了分类实验。实验结果表明,与以词为特征的文本分类相比,短语作为文本特征能够提高维吾尔文文本分类的准确率和召回率。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号