首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到19条相似文献,搜索用时 421 毫秒
1.
针对文本情感分类准确率不高的问题,提出基于CCA-VSM分类器和KFD的多级文本情感分类方法。采用典型相关性分析对文档的权重特征向量和词性特征向量进行降维,在约简向量集上构建向量空间模型,根据模型之间的差异度设计VSM分类器,筛选出与测试文档差异度较小的R个模型作为核Fisher判别的输入,最终判别出文档的情感观点。实验结果表明:该方法比传统支持向量机有较高的分类准确率和较快的分类速度,权重特征和词性特征对分类准确率的影响较大。  相似文献   

2.
文本分类在采用向量空间模型(VSM)表达文本特征时,容易出现特征向量高维且稀疏的现象,为了对原始的文本特征向量进行有效简化,提出了一种基于粒子群(PSO)优化独立分量分析(ICA)进行降维的方法,并将其运用到文本分类中。在该算法中,以负熵作为粒子群算法的适应度函数,依据其高斯性原理作为独立性判别标准对分离矩阵进行自适应更新。实验结果表明,相比于传统的特征降维方法,该方法可以解决高维度文本特征向量降维困难的问题,使得文本分类的效率、准确率显著提升。  相似文献   

3.
对高维特征集的降维是文本分类的一个主要问题。在分析现有特征降维方法的基础上,借助《知网》提出一种新的二次降维方法:采用传统的特征选择方法提取一个候选特征集合;利用《知网》对候选集合中的特征项进行概念映射,把大量底层分散的原始特征项替换成少量的高层概念进行第二次特征降维。实验表明,这种方法可以在减少文本语义信息丢失的前提下,有效地降低特征空间维数,提升文本分类的准确度。  相似文献   

4.
为了获得更好的文本分类准确率和更快的执行效率, 研究了多种Web文本的特征提取方法, 通过对互信息(MI)、文档频率(DF)、信息增益(IG)和χ2统计(CHI)算法的研究, 利用其各自的优势互补, 提出一种基于主成分分析(PCA)的多重组合特征提取算法(PCA-CFEA)。通过PCA算法的正交变换快速地将文本特征空间降维, 再通过多重组合特征提取算法在降维后的特征空间中快速提取出更具代表性的特征项, 过滤掉一些代表性较弱的特征项, 最后使用SVM分类器对文本进行分类。实验结果表明, PCA-CFEA能有效地提高文本分类的正确率和执行效率。  相似文献   

5.
基于核方法的Web挖掘研究   总被引:2,自引:0,他引:2  
基于词空间的分类方法很难处理文本的高维特性和捕获文本语义概念.利用核主成分分析和支持向量机。提出一种通过约简文本数据维数抽取语义概念、基于语义概念进行文本分类的新方法.首先将文档映射到高维线性特征空间消除非线性特征,然后在映射空间中通过主成分分析消除变量之间的相关性,实现降维和语义概念抽取,得到文档的语义概念空间,最后在语义概念空间中采用支持向量机进行分类.通过新定义的核函数,不必显式实现到语义概念空间的映射,可在原始文档向量空间中直接实现基于语义概念的分类.利用核化的GHA方法自适应迭代求解核矩阵的特征向量和特征值,适于求解大规模的文本分类问题.试验结果表明该方法对于改进文本分类的性能具有较好的效果.  相似文献   

6.
文本分类中特征向量空间是高维和稀疏的,降维处理是分类的关键步骤。针对传统特征提取方法的不足,提出采用基于迭代的CCIPCA和ICA特征提取方法处理大规模文本分类问题,实验结果表明降维提高了分类效果。在CCIPCA、ICA及ICA与IG组合降维的方法中,基于ICA降维的分类效果是最好的。  相似文献   

7.
自动文本分类是指在给定的分类体系下,让计算机根据文本的内容确定与它相关联的类别。现有的文本分类算法大都基于向量空间模型,因而不能充分表达文档的语义特征信息,从而影响了分类器性能。针对此问题,本文通过训练文档构造相似矩阵,从中获得每个类别的主题信息,由此构造分类器,最后与经典的分类器进行组合以确定文本类别。实验系统证明本文提出的分类方法较大改进了分类器性能。  相似文献   

8.
针对传统向量空间模型中的特征项孤立处理问题,首先通过χ2统计和特征聚类相结合的模式实现特征降维,然后使用图模型来建立词和词之间相互关联信息,最后运用KNN方法进行文档分类测试。该算法提高了稀有词对分类的贡献,强化了关联词的分类效果,并降低了文档向量的维数。实验证明,该算法提高了分类的准确率和召回率。  相似文献   

9.
王自强  钱旭 《计算机应用》2009,29(2):416-418
为了高效地解决Web文档分类问题,提出了一种基于核鉴别分析方法KDA和SVM的文档分类算法。该算法首先利用KDA对训练集中的高维Web文档空间进行降维,然后在降维后的低维特征空间中利用乘性更新规则优化的SVM进行分类预测。采用了文档分类领域两个著名的数据集Reuters-21578和20-Newsgroup进行实验,实验结果表明该算法不仅获得了更高的分类准确率,而且具有较少的运行时间。  相似文献   

10.
结合LSA的中文谱聚类算法研究   总被引:2,自引:2,他引:0  
传统的文本谱聚类需要的文本相似矩阵依赖于向量空间模型,忽略了词与词之间的语义关系,存在词频维数过高、计算代价高等问题。针对这些问题,提出了一种基于潜在语义分析(latent semantic analysis,LSA)的文本相似矩阵构造方法,利用奇异值分解(singular value decomposition,SVD)降维,在低维的语义空间表示文本,以此来提高同类文本间的语义相似度,并进行了相关对比实验。在该实验中,改进方法的聚类效果要好于传统的方法,从而验证了改进方法的有效性和可行性。  相似文献   

11.
A Fuzzy Approach to Classification of Text Documents   总被引:1,自引:0,他引:1       下载免费PDF全文
This paper discusses the classification problems of text documents. Based on the concept of the proximity degree, the set of words is partitioned into some equivalence classes.Particularly, the concepts of the semantic field and association degree are given in this paper.Based on the above concepts, this paper presents a fuzzy classification approach for document categorization. Furthermore, applying the concept of the entropy of information, the approaches to select key words from the set of words covering the classification of documents and to construct the hierarchical structure of key words are obtained.  相似文献   

12.
《Pattern recognition》2014,47(2):758-768
Sentiment analysis, which detects the subjectivity or polarity of documents, is one of the fundamental tasks in text data analytics. Recently, the number of documents available online and offline is increasing dramatically, and preprocessed text data have more features. This development makes analysis more complex to be analyzed effectively. This paper proposes a novel semi-supervised Laplacian eigenmap (SS-LE). The SS-LE removes redundant features effectively by decreasing detection errors of sentiments. Moreover, it enables visualization of documents in perceptible low dimensional embedded space to provide a useful tool for text analytics. The proposed method is evaluated using multi-domain review data set in sentiment visualization and classification by comparing other dimensionality reduction methods. SS-LE provides a better similarity measure in the visualization result by separating positive and negative documents properly. Sentiment classification models trained over reduced data by SS-LE show higher accuracy. Overall, experimental results suggest that SS-LE has the potential to be used to visualize documents for the ease of analysis and to train a predictive model in sentiment analysis. SS-LE can also be applied to any other partially annotated text data sets.  相似文献   

13.
针对文本分类中传统特征选择方法卡方统计量和信息增益的不足进行了分析,得出文本分类中的特征选择关键在于选择出集中分布于某类文档并在该类文档中均匀分布且频繁出现的特征词。因此,综合考虑特征词的文档频、词频以及特征词的类间集中度、类内分散度,提出一种基于类内类间文档频和词频统计的特征选择评估函数,并利用该特征选择评估函数在训练集每个类别中选取一定比例的特征词组成该类别的特征词库,而训练集的特征词库则为各类别特征词库的并集。通过基于SVM的中文文本分类实验表明,该方法与传统的卡方统计量和信息增益相比,在一定程度上提高了文本分类的效果。  相似文献   

14.
Text categorization is continuing to be one of the most researched NLP problems due to the ever-increasing amounts of electronic documents and digital libraries. In this paper, we present a new text categorization method that combines the distributional clustering of words and a learning logic technique, called Lsquare, for constructing text classifiers. The high dimensionality of text in a document has not been fruitful for the task of categorization, for which reason, feature clustering has been proven to be an ideal alternative to feature selection for reducing the dimensionality. We, therefore, use distributional clustering method (IB) to generate an efficient representation of documents and apply Lsquare for training text classifiers. The method was extensively tested and evaluated. The proposed method achieves higher or comparable classification accuracy and {rm F}_1 results compared with SVM on exact experimental settings with a small number of training documents on three benchmark data sets WebKB, 20Newsgroup, and Reuters-21578. The results prove that the method is a good choice for applications with a limited amount of labeled training data. We also demonstrate the effect of changing training size on the classification performance of the learners.  相似文献   

15.
In case of spatial multi spectral images, such as remotely sensed earth cover, there could be many classes in one entire frame covering a large spatial stretch, because of which meaningful dimensionality reduction cannot perhaps be realizable without trading off with the quality of classification. However most often one would encounter in such images, presence of only a few classes in a small neighborhood, which would enable to devise a very effective dimensionality reduction around that small neighborhood identified as a block. Based on this theme a new method for dimensionality reduction is proposed in this paper.

The method proposed divides the image into uniform non-overlapping windows/blocks. The few features that are essential in discriminating classes in a window are identified. Clustering is performed independently on each of the blocks with the reduced set of features. These clusters in the blocks are later merged to obtain an overall classification of the entire image. The efficacy of the method is corroborated experimentally.  相似文献   


16.
自动文本分类特征选择方法研究   总被引:4,自引:4,他引:4  
文本分类是指根据文本的内容将大量的文本归到一个或多个类别的过程,文本表示技术是文本分类的核心技术之一,而特征选择又是文本表示技术的关键技术之一,对分类效果至关重要。文本特征选择是最大程度地识别和去除冗余信息,提高训练数据集质量的过程。对文本分类的特征选择方法,包括信息增益、互信息、X^2统计量、文档频率、低损降维和频率差法等做了详细介绍、分析、比较研究。  相似文献   

17.
Feature selection is known as a good solution to the high dimensionality of the feature space and mostly preferred feature selection methods for text classification are filter-based ones. In a common filter-based feature selection scheme, unique scores are assigned to features depending on their discriminative power and these features are sorted in descending order according to the scores. Then, the last step is to add top-N features to the feature set where N is generally an empirically determined number. In this paper, an improved global feature selection scheme (IGFSS) where the last step in a common feature selection scheme is modified in order to obtain a more representative feature set is proposed. Although feature set constructed by a common feature selection scheme successfully represents some of the classes, a number of classes may not be even represented. Consequently, IGFSS aims to improve the classification performance of global feature selection methods by creating a feature set representing all classes almost equally. For this purpose, a local feature selection method is used in IGFSS to label features according to their discriminative power on classes and these labels are used while producing the feature sets. Experimental results on well-known benchmark datasets with various classifiers indicate that IGFSS improves the performance of classification in terms of two widely-known metrics namely Micro-F1 and Macro-F1.  相似文献   

18.
提出了一种没有训练集情况下实现对未标注类别文本文档进行分类的问题。类关联词是与类主体相关、能反映类主体的单词或短语。利用类关联词提供的先验信息,形成文档分类的先验概率,然后组合利用朴素贝叶斯分类器和EM迭代算法,在半监督学习过程中加入分类约束条件,用类关联词来监督构造一个分类器,实现了对完全未标注类别文档的分类。实验结果证明,此方法能够以较高的准确率实现没有训练集情况下的文本分类问题,在类关联词约束下的分类准确率要高于没有约束情况下的分类准确率。  相似文献   

19.
文本分类的研究者一直在提高文本的分类精度方面做着不懈的努力,在实验中发现,相似主题的文档的分类错误率比较高,该文尝试着提出了一种二次权重分配的新的特征权值分配策略,构造了一种计算难以区分的主题类别的特征辨别能力的权值函数,目的是减少相似主题类别的文档的分类错误。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号