Similar Documents
 Found 20 similar documents (search time: 109 ms)
1.
Automatic Classification of XML Documents Based on Kernel Methods   (total citations: 3; self-citations: 0; citations by others: 3)
杨建武 《计算机学报》2011,34(2):353-359
The support vector machine (SVM) solves classifier construction by mapping the input space through a kernel function and building an optimal separating hyperplane, which gives it clear advantages in automatic text classification. An XML document combines textual content with structural information and, as a relatively new data form, has become a current research focus. Building on the structured link vector model, this paper studies SVM-based automatic classification of XML documents and proposes a kernel suited to XML document classification…
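The general idea of combining a content kernel with a structure kernel for XML documents can be sketched as follows. This is a minimal illustration with toy documents, a hypothetical bag-of-words-plus-element-path representation, and an illustrative mixing weight `alpha`; it is not the kernel proposed in the paper.

```python
# Sketch: composite kernel over (term-weight dict, element-path list) pairs.
# The representation and the 0.7/0.3 mixing weight are assumptions.

def content_kernel(d1, d2):
    """Dot product over shared term weights (bag-of-words)."""
    return sum(w * d2.get(t, 0.0) for t, w in d1.items())

def structure_kernel(p1, p2):
    """Jaccard overlap of XML element paths."""
    s1, s2 = set(p1), set(p2)
    return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 0.0

def xml_kernel(doc1, doc2, alpha=0.7):
    """Convex combination of content and structure similarity."""
    terms1, paths1 = doc1
    terms2, paths2 = doc2
    return (alpha * content_kernel(terms1, terms2)
            + (1 - alpha) * structure_kernel(paths1, paths2))

doc_a = ({"svm": 0.5, "kernel": 0.5}, ["article/title", "article/body"])
doc_b = ({"svm": 0.4, "text": 0.6}, ["article/title", "article/abstract"])
print(round(xml_kernel(doc_a, doc_b), 2))  # → 0.24
```

A kernel of this shape can be plugged into any kernelized learner that accepts a precomputed Gram matrix.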

2.
In text classification based on the vector space model, the high dimensionality of text vectors makes classifiers inefficient. To address this, a new text classification method based on cluster partitioning is proposed. Its main idea is to split the training documents into clusters according to the distances between vectors in the vector space, so that documents in the same cluster share the same class; at test time, a document's class is determined by the cluster it falls into. The method is compared with the traditional k-NN text classification method, and experimental results show that it generalizes well in high-dimensional spaces and has very good time performance.
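The cluster-partition idea can be sketched in a few lines: group training documents into clusters (here simplified to one centroid per class) and label a test document by its nearest cluster. The toy 2-D vectors are our own illustration, not the paper's data.

```python
# Sketch: assign a test document to the label of the nearest cluster centroid.
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def nearest_cluster(doc, clusters):
    """clusters: list of (centroid, label); return label of closest centroid."""
    return min(clusters, key=lambda c: math.dist(doc, c[0]))[1]

train = {"sports": [[1.0, 0.1], [0.9, 0.2]], "tech": [[0.1, 1.0], [0.2, 0.8]]}
clusters = [(centroid(vecs), label) for label, vecs in train.items()]
print(nearest_cluster([0.95, 0.15], clusters))  # → sports
```

Because a test document is compared only against a handful of centroids rather than every training vector, classification time no longer grows with the training-set size, which is the speed advantage the abstract claims over k-NN.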

3.
Web Document Classification Based on Association Rules   (total citations: 5; self-citations: 2; citations by others: 5)
Among existing Web document classifiers, some produce fairly accurate results while others produce more interpretable classification models, but no classifier yet combines both advantages. This paper therefore proposes an association-rule-based method for Web document classification. Adopting a transaction view, it addresses two problems: (1) discovering the best term association rules in the training document set, and (2) using these rules to build a Web document classifier. Experiments show that the classifier performs well, trains quickly, and produces rules that are easy for humans to understand, update, and adjust.
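The rule-mining step (1) can be sketched with plain support/confidence counting over term sets, treating each document as a transaction. The thresholds, toy documents, and restriction to 1- and 2-term antecedents are our own simplifications, not the paper's algorithm.

```python
# Sketch: mine "term set -> class" rules by support and confidence.
from collections import Counter
from itertools import combinations

def mine_rules(docs, min_support=2, min_conf=0.6):
    """docs: list of (term_set, label). Returns {antecedent_tuple: label}."""
    counts, rule_counts = Counter(), Counter()
    for terms, label in docs:
        for r in range(1, 3):  # antecedents of 1 or 2 terms
            for combo in combinations(sorted(terms), r):
                counts[combo] += 1
                rule_counts[(combo, label)] += 1
    rules = {}
    for (combo, label), c in rule_counts.items():
        if c >= min_support and c / counts[combo] >= min_conf:
            rules[combo] = label
    return rules

docs = [({"ball", "goal"}, "sports"), ({"ball", "team"}, "sports"),
        ({"cpu", "chip"}, "tech"), ({"cpu", "ram"}, "tech")]
rules = mine_rules(docs)
print(rules[("ball",)])  # → sports
```

Step (2) would then classify a new document by the rules whose antecedents it matches; interpretability comes from the rules being readable term-to-class statements.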

4.
A Multi-Level Text Classification Method Based on the Vector Space Model   (total citations: 37; self-citations: 2; citations by others: 37)
This paper studies and improves the term-weighting scheme of the classic vector space model (VSM) and, on that basis, proposes a multi-level text classification method. The classes are organized into a tree according to their hierarchical relations, and all training documents of a class are merged into a single class document; when extracting class models, comparisons are made only among class documents under the same parent node at the same level. To classify a document automatically, the method first finds the matching top-level class starting from the root and then recurses downward until it reaches the matching leaf class. Experiments and a deployed system show that the method achieves high precision and recall.
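The root-to-leaf descent can be sketched with cosine similarity against per-class profile vectors at each tree level. The toy hierarchy and term weights are our own illustration; the paper's improved weighting scheme is not reproduced here.

```python
# Sketch: recursive top-down classification over a class tree.
import math

def cosine(a, b):
    """Cosine similarity between sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify_hier(doc, node):
    """node = (profiles, children): profiles maps class -> term weights,
    children maps class -> subtree node. Descend until a leaf class."""
    profiles, children = node
    best = max(profiles, key=lambda c: cosine(doc, profiles[c]))
    return classify_hier(doc, children[best]) if best in children else best

# toy hierarchy: root -> {science -> {physics, biology}, arts}
tree = (
    {"science": {"atom": 1.0, "cell": 1.0}, "arts": {"paint": 1.0}},
    {"science": ({"physics": {"atom": 1.0}, "biology": {"cell": 1.0}}, {})},
)
print(classify_hier({"atom": 1.0}, tree))  # → physics
```

Comparing only siblings under one parent keeps each decision cheap, which is what makes the hierarchical scheme scale with many classes.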

5.
Centroid-based classification is highly efficient, but it needs a large number of training (labeled) documents to train the classifier and guarantee accuracy. Because classifying documents by hand costs considerable labor, training documents are limited in number, while many unlabeled documents exist on the Web. This paper therefore improves centroid-based classification with the ONUC and OFFUC algorithms, which remedy the sharp drop in performance when training documents are insufficient. Since centroid-based classification is sensitive to outliers, an edge-removal step is also applied. Experiments show that, compared with plain centroid-based classification and other classic semi-supervised algorithms, the proposed algorithms achieve better performance when training documents are very few.

6.
Automatic text classification means having a computer determine, under a given classification scheme, the category associated with a text according to its content. Most existing text classification algorithms are based on the vector space model and therefore cannot fully express a document's semantic features, which degrades classifier performance. To address this, this paper builds a similarity matrix from the training documents, extracts topic information for each class from it, constructs a classifier from that information, and finally combines it with a classic classifier to determine the text category. Experiments show that the proposed method considerably improves classifier performance.

7.
To address the low accuracy of text sentiment classification, a multi-level method based on a CCA-VSM classifier and kernel Fisher discriminant (KFD) analysis is proposed. Canonical correlation analysis reduces the dimensionality of a document's weight feature vector and part-of-speech feature vector; a vector space model is built on the reduced vectors, and a VSM classifier is designed from the dissimilarity between models. The R models least dissimilar to the test document are selected as input to the kernel Fisher discriminant, which finally decides the document's sentiment. Experimental results show that the method achieves higher classification accuracy and faster classification than a traditional support vector machine, and that the weight and part-of-speech features strongly affect accuracy.

8.
王自强  钱旭 《计算机应用》2009,29(2):416-418
To solve Web document classification efficiently, a document classification algorithm based on kernel discriminant analysis (KDA) and SVM is proposed. The algorithm first uses KDA to reduce the dimensionality of the high-dimensional Web document space of the training set, and then performs classification in the reduced low-dimensional feature space with an SVM optimized by multiplicative update rules. Experiments on two well-known document classification datasets, Reuters-21578 and 20-Newsgroup, show that the algorithm achieves higher classification accuracy with less running time.

9.
Existing document vector representations suffer from noise words and from incomplete semantics for important words. To address this, a new document representation is proposed that fuses word contribution scores with Word2Vec word vectors. A Word2Vec model is trained on the dataset, the contribution of each word in the dataset is computed, a contribution threshold is set, and the words whose contribution exceeds the threshold form a word set. The words a document shares with this set are then found, their word vectors retrieved, and the document vector generated by fusing in the word contributions. Experiments show that, on the Sogou Chinese text corpus and the Fudan University Chinese text classification corpus, the method outperforms traditional approaches such as TF-IDF, averaged Word2Vec, and PTF-IDF-weighted Word2Vec in average precision, recall, and F1, and that it also classifies English text effectively.
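The fusion step can be sketched as a contribution-weighted average of word vectors, dropping words below the threshold. The toy vectors, contribution scores, and threshold value are our own assumptions, not the paper's trained values.

```python
# Sketch: contribution-weighted average of word vectors -> document vector.

def doc_vector(doc_terms, word_vecs, contribution, threshold=0.1):
    """Average the vectors of words whose contribution exceeds the threshold,
    weighting each vector by its contribution; noise words are filtered out."""
    total, acc = 0.0, None
    for t in doc_terms:
        w = contribution.get(t, 0.0)
        if t in word_vecs and w > threshold:
            v = word_vecs[t]
            acc = [w * x for x in v] if acc is None else [a + w * x for a, x in zip(acc, v)]
            total += w
    return [a / total for a in acc] if acc else None

word_vecs = {"good": [1.0, 0.0], "movie": [0.0, 1.0], "the": [0.5, 0.5]}
contribution = {"good": 0.8, "movie": 0.6, "the": 0.01}
dv = doc_vector(["the", "good", "movie"], word_vecs, contribution)
print([round(x, 4) for x in dv])  # → [0.5714, 0.4286]
```

The low-contribution stopword "the" is excluded entirely, so it cannot drag the document vector toward an uninformative region of the embedding space.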

10.
Web document classification is one of the most basic techniques in Web mining, but building a classifier that categorizes by interest requires extensive preprocessing to collect positive and negative training examples, and negative examples are very hard to collect. This paper proposes a learning model that uses only positive examples and no negative ones, based mainly on repeatedly running an SVM. Experiments show that the model achieves very satisfactory accuracy and speed for Web document classification.
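The positive-only loop can be sketched as follows: treat all unlabeled documents as negatives, train, discard the unlabeled documents the model calls positive, and retrain on the shrinking "reliable negative" set. A nearest-centroid scorer stands in here for the paper's iterated SVM; the toy data and round count are assumptions.

```python
# Sketch: PU (positive-unlabeled) learning with a centroid stand-in model.
import math

def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def pu_learn(positives, unlabeled, rounds=5):
    """Iteratively shrink the unlabeled pool to reliable negatives."""
    reliable_neg = list(unlabeled)
    for _ in range(rounds):
        pc, nc = centroid(positives), centroid(reliable_neg)
        kept = [d for d in reliable_neg if math.dist(d, nc) <= math.dist(d, pc)]
        if not kept or len(kept) == len(reliable_neg):
            reliable_neg = kept or reliable_neg
            break
        reliable_neg = kept
    return centroid(positives), centroid(reliable_neg)

pos = [[1.0, 1.0], [0.9, 0.8]]
unl = [[0.95, 0.9], [0.1, 0.0], [0.0, 0.2]]
pc, nc = pu_learn(pos, unl)
test = [0.2, 0.1]
print("pos" if math.dist(test, pc) < math.dist(test, nc) else "neg")  # → neg
```

The unlabeled document near the positive cluster is expelled from the negative pool in the first round, so the final negative centroid is not contaminated by hidden positives.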

11.
Classifying the demands in medical questions is a text classification problem and a fundamental task in natural language processing. This paper proposes a reinforcement-learning-based method for classifying the demands in medical questions. First, reinforcement learning automatically identifies the keywords in a medical question, and keywords and non-keywords are assigned different values to form a vector. This vector then serves as the weight vector of an attention mechanism: the hidden-state sequence produced by a Bi-LSTM model is combined by weighted summation into a question representation. Finally, a Softmax classifier classifies the representation. Experimental results show that the method improves classification accuracy by 1.49% over a Bi-LSTM baseline.
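The weighted-summation step can be sketched independently of the Bi-LSTM: given per-token hidden states and per-token keyword scores, softmax-normalize the scores and take the weighted sum. The toy hidden states and scores are assumptions; the real scores would come from the RL keyword identifier.

```python
# Sketch: keyword-score attention pooling over token hidden states.
import math

def attention_pool(hidden_states, keyword_scores):
    """Softmax the per-token scores into attention weights, then return the
    weighted sum of the hidden states (the question representation)."""
    exps = [math.exp(s) for s in keyword_scores]
    z = sum(exps)
    alphas = [e / z for e in exps]
    dim = len(hidden_states[0])
    return [sum(a * h[i] for a, h in zip(alphas, hidden_states)) for i in range(dim)]

# token 0 is a keyword (score 2.0), token 1 is not (score 0.0)
out = attention_pool([[1.0, 0.0], [0.0, 1.0]], [2.0, 0.0])
print([round(x, 4) for x in out])  # → [0.8808, 0.1192]
```

The keyword token dominates the pooled representation, which is the intended effect of feeding RL-derived keyword values into the attention weights.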

12.
Our objective here is to provide an extension of the naive Bayesian classifier that gives us more parameters for matching data. We first describe the naive Bayesian classifier and then discuss the ordered weighted averaging (OWA) aggregation operators. We introduce a new class of OWA operators based on combining OWA operators with t-norm operators, and show that the naive Bayesian classifier can be seen as a special case of this. We use this to suggest an extended version of the naive Bayesian classifier that involves a weighted summation of products of the probabilities. An algorithm is suggested to obtain the weights associated with this extended naive Bayesian classifier.
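The basic OWA operator is easy to state: sort the arguments in descending order, then take a weighted sum with weights tied to rank positions rather than to arguments. The probability values below are illustrative only.

```python
# Sketch: ordered weighted averaging (OWA) aggregation.

def owa(values, weights):
    """Weighted sum where weights apply to the values sorted descending,
    so the weight vector controls how 'and'-like or 'or'-like the
    aggregation is."""
    return sum(w * v for w, v in zip(weights, sorted(values, reverse=True)))

probs = [0.9, 0.6, 0.3]
print(owa(probs, [0.0, 0.0, 1.0]))  # → 0.3  (pure min, a t-norm-like "and")
print(owa(probs, [1.0, 0.0, 0.0]))  # → 0.9  (pure max, an "or")
print(owa(probs, [1/3, 1/3, 1/3]))  #   plain arithmetic mean
```

Sliding the weight mass between these extremes gives the extra matching parameters the abstract mentions; the product aggregation used by naive Bayes sits at the conjunctive ("and") end of this spectrum.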

13.
Reliable pedestrian detection is of great importance in visual surveillance. In this paper, we propose a novel multiplex classifier model composed of two multiplex cascade parts: a Haar-like cascade classifier and a shapelet cascade classifier. The Haar-like cascade classifier filters out most of the irrelevant image background, while the shapelet cascade classifier intensively detects head-shoulder features. A weighted linear regression model is introduced to train its weak classifiers. We also introduce a structure table that labels foreground pixels by means of background differences. The experimental results illustrate that our classifier model provides satisfying detection accuracy; in particular, our detection approach also performs well on low-resolution images and relatively complicated backgrounds.

14.
This paper presents a novel method for the differential diagnosis of erythemato-squamous disease, based on fuzzy weighted pre-processing, k-NN (nearest neighbor) based weighted pre-processing, and a decision tree classifier. The method consists of three parts. In the first part, a decision tree classifier alone diagnoses erythemato-squamous disease. In the second part, fuzzy weighted pre-processing, a new method improved by us, is applied to the inputs of the erythemato-squamous disease dataset, and the resulting weighted inputs are classified with the decision tree classifier. In the third part, k-NN based weighted pre-processing, likewise a new method improved by us, is applied to the inputs, and the weighted inputs are again classified with the decision tree classifier. The plain decision tree classifier, the fuzzy weighted pre-processing decision tree classifier, and the k-NN based weighted pre-processing decision tree classifier reached classification accuracies of 86.18%, 97.57%, and 99.00%, respectively, using 20-fold cross-validation.

15.
We propose a feature selection approach, Patterned Keyword in Phrase (PKIP), for the text categorization of item banks. An item bank is a collection of textual question items that are short sentences; each sentence does not contain enough relevant words for direct categorization by traditional approaches such as "bag-of-words." PKIP was therefore designed to categorize such question items using only the available keywords and their patterns. PKIP identifies the appropriate keywords by computing the weight of all words. In this paper, two keyword selection strategies are suggested to ensure PKIP's categorization accuracy. PKIP was implemented and tested on an item bank of Thai upper-primary mathematics questions. The test results show that PKIP categorizes the question items correctly and that the two keyword selection strategies extract highly informative keywords.

16.
XML Structural Query Expansion Based on Weighted Query Terms   (total citations: 9; self-citations: 0; citations by others: 9)
万常选  鲁远 《软件学报》2008,19(10):2611-2619
A main cause of poor retrieval quality in text document retrieval is that users find it hard to formulate query expressions that accurately describe their intent. XML documents have structural features in addition to the content features of text documents, which makes accurate query expressions even harder to formulate. To solve this problem, a query expansion method based on relevance feedback is proposed that helps users build "content + structure" query expressions matching their intent. The method first performs term expansion, finding the weighted expansion terms that best represent the user's intent; it then performs structural query expansion on top of the expanded terms, finally forming a complete "content + structure" expanded query expression. Experimental results show that, compared with no query expansion, the expanded queries improve average precision at prec@10 and prec@20 by more than 30%.

17.
With the flood and popularity of multimedia content on the Internet, searching for appropriate content and representing it effectively has become essential for user satisfaction, and many content recommendation systems have been proposed for this purpose. A popular approach is to select hot or popular content for recommendation using some popularity metric. Recently, social network services (SNSs) such as Facebook and Twitter have become a widespread social phenomenon owing to the smartphone boom; considering their popularity and user participation, SNSs can be a good source for finding social interests or trends. In this study, we propose a platform called TrendsSummary for retrieving trendy multimedia content and summarizing it. To identify trendy multimedia content, we select candidate keywords from raw data collected from Twitter using a syntactic feature-based filtering method, and merge keyword variants based on several heuristics. We then select trend keywords and their related keywords from the merged candidates based on term frequency and expand them semantically by referencing portal sites such as Wikipedia and Google. Based on the expanded trend keywords, we collect four types of relevant multimedia content from various websites: TV programs, videos, news articles, and images. The most appropriate media type for the trend keywords is determined by a naive Bayes classifier, and appropriate content is then selected from the chosen media type. Finally, both the trend keywords and their related multimedia content are displayed for effective browsing. We implemented a prototype system and experimentally demonstrated that our scheme provides satisfactory results.

18.
How to effectively predict financial distress is an important problem in corporate financial management. Although much attention has been paid to financial distress prediction methods based on a single classifier, the uncertainty inherent in single classifiers and the benefit of combining multiple classifiers have been neglected. This paper puts forward a financial distress prediction method based on the weighted majority voting combination of multiple classifiers. The framework of the multiple classifier combination system, the weighted majority voting combination model, the voting-weight model for the base classifiers, and the principles for selecting base classifiers are discussed in detail. An empirical experiment with real-world data on Chinese listed companies indicates that this method greatly improves average prediction accuracy and stability, and that it is more suitable for financial distress prediction than single classifiers.
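Weighted majority voting itself is compact enough to sketch directly: each base classifier's vote is counted with its assigned weight, and the label with the largest weighted tally wins. The classifier names, labels, and weights below are illustrative assumptions.

```python
# Sketch: weighted majority voting over base classifier predictions.

def weighted_vote(predictions, weights):
    """predictions: {classifier: label}; weights: {classifier: voting weight}.
    Returns the label with the largest total weight."""
    tally = {}
    for clf, label in predictions.items():
        tally[label] = tally.get(label, 0.0) + weights.get(clf, 1.0)
    return max(tally, key=tally.get)

preds = {"tree": "distress", "svm": "healthy", "nn": "distress"}
w = {"tree": 0.5, "svm": 0.9, "nn": 0.3}
print(weighted_vote(preds, w))  # → healthy
```

Note that the high-weight classifier overrules the unweighted 2-to-1 majority here, which is exactly how the voting-weight model lets more reliable base classifiers dominate the combination.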

19.
Traditional approaches to text data stream classification usually require the manual labeling of a number of documents, which is an expensive and time-consuming process. In this paper, to overcome this limitation, we propose classifying text streams by keywords without labeled documents, reducing the burden of manual labeling. We build our base text classifiers with the help of keywords and unlabeled documents, and use classifier ensemble algorithms to cope with concept drift in text data streams. Experimental results demonstrate that the proposed method builds good classifiers from keywords without manual labeling, and that with the ensemble-based algorithm the concept drift in the streams is well detected and adapted to, outperforming the single-window algorithm.

20.
An Improved Minimum Distance Classifier: the Weighted Minimum Distance Classifier   (total citations: 12; self-citations: 0; citations by others: 12)
任靖  李春平 《计算机应用》2005,25(5):992-994
The minimum distance classifier is a simple and effective classification method. The main way to improve its performance is to choose a more effective distance metric. By analyzing the classification principles of the multiple-constraint classifier and the decision tree classifier, a weighted minimum distance classifier based on standardized Euclidean distance is proposed. By defining weighted distances for nominal and string attributes and adding range constraints on attribute values, the classifier broadens the applicability of the standardized-Euclidean minimum distance classifier while improving its classification accuracy. Experimental results show that the weighted minimum distance classifier achieves high classification accuracy.
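The numeric core, a standardized Euclidean distance with per-attribute weights, can be sketched as follows. The class means, standard deviations, and weights are toy assumptions; the paper's handling of nominal/string attributes and range constraints is not reproduced.

```python
# Sketch: weighted minimum distance classifier on standardized attributes.
import math

def weighted_normalized_distance(x, y, stds, weights):
    """Euclidean distance where each attribute is divided by its standard
    deviation and its squared term is scaled by an attribute weight."""
    return math.sqrt(sum(w * ((a - b) / s) ** 2
                         for a, b, s, w in zip(x, y, stds, weights)))

def classify_min_dist(doc, class_means, stds, weights):
    """Return the class whose mean vector is nearest under the weighted
    standardized distance."""
    return min(class_means,
               key=lambda c: weighted_normalized_distance(doc, class_means[c],
                                                          stds, weights))

means = {"a": [0.0, 0.0], "b": [10.0, 10.0]}
stds, weights = [1.0, 5.0], [1.0, 0.5]
print(classify_min_dist([3.0, 8.0], means, stds, weights))  # → a
```

Dividing by the per-attribute standard deviation keeps large-scale attributes from dominating, and the weights then express how discriminative each attribute is.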


Copyright©北京勤云科技发展有限公司  京ICP备09084417号