Similar Documents
20 similar documents found (search time: 46 ms).
1.
Automatic text classification is usually based on models constructed through learning from training examples. However, as text document repositories grow rapidly, the storage requirements and computational cost of model learning become ever higher. Instance selection is one solution to this limitation: it reduces the amount of data by filtering noisy instances out of a given training dataset. A number of instance selection algorithms have been proposed in the literature, such as ENN, IB3, ICF, and DROP3. However, all of these methods were developed for the k-nearest neighbor (k-NN) classifier, and their performance has not been examined in the text classification domain, where the dimensionality of the dataset is usually very high. Support vector machines (SVM) are a core text classification technique. In this study, a novel instance selection method, called Support Vector Oriented Instance Selection (SVOIS), is proposed. First, a regression plane in the original feature space is identified using a threshold distance between the given training instances and their class centers. Then, another threshold distance, between the identified data (forming the regression plane) and the regression plane itself, is used to decide the support vectors among the selected instances. Experimental results on the TechTC-100 dataset show the superior performance of SVOIS over other state-of-the-art algorithms. In particular, using SVOIS to select text documents allows the k-NN and SVM classifiers to perform better than without instance selection.
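The first SVOIS step, filtering out instances that lie too far from their class centers, can be sketched as follows. This is a simplified illustration of the thresholding idea only, not the paper's full algorithm; the toy data and threshold value are invented:

```python
from math import sqrt

def class_centers(X, y):
    """Mean vector of each class (label -> centroid)."""
    centers = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centers[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centers

def euclidean(a, b):
    return sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def select_instances(X, y, threshold):
    """Keep only instances within `threshold` of their own class center."""
    centers = class_centers(X, y)
    return [(x, lab) for x, lab in zip(X, y)
            if euclidean(x, centers[lab]) <= threshold]

# Toy 2-D data: two tight clusters, each with one noisy point.
X = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0],   # class "a" (last point is noise)
     [4.0, 4.0], [4.1, 4.0], [0.0, 9.0]]   # class "b" (last point is noise)
y = ["a", "a", "a", "b", "b", "b"]
kept = select_instances(X, y, threshold=3.0)  # the two noisy points are dropped
```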

2.
Dimensionality is one of the major problems affecting the quality of learning in most machine learning and data mining tasks. Training a classification model on a high-dimensional dataset may lead to overfitting of the learned model to the training data. Overfitting reduces the generalization of the model and therefore causes poor classification accuracy on new test instances. Another disadvantage of high dimensionality is the high CPU time required for learning and testing the model. Applying feature selection to the dataset before the learning process is essential to improve the performance of the classification task. In this study, a new hybrid method that combines artificial bee colony optimization with the differential evolution algorithm is proposed for feature selection in classification tasks. The developed hybrid method is evaluated on fifteen datasets from the UCI Repository that are commonly used in classification problems. For a complete evaluation, the proposed hybrid feature selection method is compared with artificial bee colony optimization and differential evolution based feature selection methods, as well as with three of the most popular feature selection techniques: information gain, chi-square, and correlation feature selection. In addition, the performance of the proposed method is compared with studies in the literature that use the same datasets. The experimental results show that the developed hybrid method selects good features for classification tasks, improving the run-time performance and accuracy of the classifier. The proposed hybrid method may also be applied to other search and optimization problems, as its feature selection performance is better than that of pure artificial bee colony optimization and differential evolution.
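One of the filter baselines named above, information gain, is straightforward to sketch: score each feature by how much knowing its value reduces label entropy. The toy features and labels below are invented for illustration:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG of a discrete feature: H(y) minus the conditional entropy H(y | feature)."""
    n = len(labels)
    cond = 0.0
    for v in set(feature_values):
        subset = [lab for fv, lab in zip(feature_values, labels) if fv == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

# Toy dataset: f1 perfectly predicts the label, f2 carries no information.
labels = ["spam", "spam", "ham", "ham"]
f1 = [1, 1, 0, 0]
f2 = [1, 0, 1, 0]
ranking = sorted([("f1", information_gain(f1, labels)),
                  ("f2", information_gain(f2, labels))],
                 key=lambda t: t[1], reverse=True)
```

A filter method would keep the top-ranked features; the hybrid method in the abstract instead searches the space of feature subsets directly.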

3.
Automatic classification of text documents, one of the essential techniques for Web mining, has always been a hot topic due to the explosive growth of digital documents available on-line. In the text classification community, k-nearest neighbor (kNN) is a simple and yet effective classifier. However, being a lazy learning method without pre-modelling, kNN has a high cost when classifying new documents if the training set is large. The Rocchio algorithm is another well-known and widely used technique for text classification. One drawback of the Rocchio classifier is that it restricts the hypothesis space to the set of linearly separable hyperplane regions; when the data do not fit this underlying assumption well, the Rocchio classifier suffers. In this paper, a hybrid algorithm based on the variable precision rough set is proposed to combine the strengths of both the kNN and Rocchio techniques and overcome their weaknesses. An experimental evaluation of the different methods is carried out on two common text corpora, the Reuters-21578 collection and the 20-newsgroup collection. The experimental results indicate that the novel algorithm achieves a significant performance improvement.
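In its simplest form, the Rocchio classifier described above reduces to a nearest-centroid rule over term vectors. A minimal sketch with an invented toy vocabulary (the paper's hybrid adds kNN and variable precision rough sets on top of this):

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rocchio_train(X, y):
    """One centroid per class: the mean of that class's document vectors."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def rocchio_predict(centroids, x):
    """Assign the class whose centroid is most similar to the document."""
    return max(centroids, key=lambda lab: cosine(centroids[lab], x))

# Toy term-frequency vectors over the vocabulary ["ball", "game", "stock", "market"].
X = [[3, 2, 0, 0], [2, 3, 0, 1],   # sports documents
     [0, 0, 3, 2], [0, 1, 2, 3]]   # finance documents
y = ["sports", "sports", "finance", "finance"]
model = rocchio_train(X, y)
pred = rocchio_predict(model, [1, 2, 0, 0])
pred2 = rocchio_predict(model, [0, 0, 1, 1])
```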

4.
The rapid development of information technology has produced a massive accumulation of text data, a large part of which is short text. Text classification techniques are important for automatically acquiring knowledge from these huge volumes of short texts. However, because keywords occur only a few times in a short text and labeled training samples are usually scarce, existing general text mining algorithms struggle to reach acceptable accuracy. Some semantics-based classification methods achieve better accuracy but are too inefficient to scale to massive data. This paper proposes a novel short-text classification algorithm that is based on semantic feature graphs of the text and classifies with a kNN-like method. Experiments show that the algorithm outperforms other algorithms in both accuracy and performance when classifying massive short texts.

5.
This paper addresses the problem of classifying documents with unlabeled classes when no training set is available. Class-associated words are words or phrases that relate to and reflect the subject of a class. The prior information provided by class-associated words is used to form prior probabilities for document classification; a naive Bayes classifier is then combined with the EM iterative algorithm, classification constraints are added to the semi-supervised learning process, and the class-associated words supervise the construction of a classifier, achieving classification of completely unlabeled documents. Experimental results show that this method can classify text with high accuracy in the absence of a training set, and that the classification accuracy under class-associated word constraints is higher than without such constraints.
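The idea of using class-associated words to seed labels for unlabeled documents can be sketched as below. The class names and seed words are invented; the method in the abstract then refines these priors with naive Bayes and EM rather than stopping here:

```python
# Invented class-associated (seed) words for two illustrative classes.
seed_words = {
    "sports": {"game", "score", "team"},
    "finance": {"stock", "market", "profit"},
}

def prior_label(doc_tokens, seed_words):
    """Initial pseudo-label: the class whose seed words overlap the document most.
    Returns None when no seed word appears (document stays unlabeled)."""
    overlap = {c: len(seed_words[c] & set(doc_tokens)) for c in seed_words}
    best = max(overlap, key=overlap.get)
    return best if overlap[best] > 0 else None

docs = ["the team played a great game".split(),
        "stock market hits record profit".split(),
        "weather is sunny today".split()]
labels = [prior_label(d, seed_words) for d in docs]
```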

6.
Automatic keyword extraction is an important research direction in text mining, natural language processing, and information retrieval. Keyword extraction enables us to represent text documents in a condensed way. The compact representation of documents can be helpful in several applications, such as automatic indexing, automatic summarization, automatic classification, clustering, and filtering. For instance, text classification is a domain with a high-dimensional feature space challenge; hence, extracting the most important/relevant words about the content of a document and using these keywords as the features can be extremely useful. In this regard, this study examines the predictive performance of five statistical keyword extraction methods (most-frequent-measure-based keyword extraction, term frequency-inverse sentence frequency based keyword extraction, co-occurrence statistical information based keyword extraction, eccentricity-based keyword extraction, and the TextRank algorithm) with classification algorithms and ensemble methods for scientific text document classification (categorization). A comprehensive comparison of base learning algorithms (Naïve Bayes, support vector machines, logistic regression, and Random Forest) with five widely utilized ensemble methods (AdaBoost, Bagging, Dagging, Random Subspace, and Majority Voting) is conducted. To the best of our knowledge, this is the first empirical analysis that evaluates the effectiveness of statistical keyword extraction methods in conjunction with ensemble learning algorithms. The classification schemes are compared in terms of classification accuracy, F-measure, and area under curve values. To validate the empirical analysis, a two-way ANOVA test is employed. The experimental analysis indicates that the Bagging ensemble of Random Forest with the most-frequent-measure-based keyword extraction method yields promising results for text classification. For the ACM document collection, the highest average predictive performance (93.80%) is obtained with the most-frequent-measure-based keyword extraction method and the Bagging ensemble of the Random Forest algorithm. In general, Bagging and Random Subspace ensembles of Random Forest yield promising results. The empirical analysis indicates that keyword-based representation of text documents in conjunction with ensemble learning can enhance the predictive performance and scalability of text classification schemes, which is of practical importance in application fields of text classification.
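The simplest of the five methods, most-frequent-measure keyword extraction, can be sketched in a few lines. The stopword list and the example document are invented for illustration:

```python
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "in", "is", "to", "for"}

def most_frequent_keywords(text, k=3):
    """Most-frequent-measure extraction: the top-k non-stopword terms."""
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    return [word for word, _ in Counter(tokens).most_common(k)]

doc = ("text classification is the task of assigning text documents "
       "to predefined classes and text mining uses classification widely")
keywords = most_frequent_keywords(doc, k=2)
```

Using only these extracted keywords as features is what shrinks the feature space before the ensemble classifiers are trained.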

7.
Sentiment classification has played an important role in various research areas, including e-commerce applications, and a number of advanced computational intelligence techniques, including machine learning and computational linguistics, have been proposed in the literature for improved sentiment classification results. While such studies focus on improving performance with new techniques or on extending existing algorithms on previously used datasets, few studies give practitioners insight into which techniques suit datasets with different properties. This paper applies four different sentiment classification techniques, from machine learning (Naïve Bayes, SVM, and Decision Tree) to a sentiment orientation approach, to datasets obtained from various sources (the IMDB, Twitter, Hotel review, and Amazon review datasets) to learn how data properties, including dataset size, length of the target documents, and subjectivity of the data, affect the performance of those techniques. The results of computational experiments confirm the sensitivity of the techniques to data properties, including training data size, document length, and the subjectivity of the training/test data. The theoretical and practical implications of the findings are discussed.
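A sentiment orientation approach, unlike the trained classifiers, is lexicon-based: it counts positive and negative cue words. A minimal sketch with an invented word list and reviews:

```python
# Invented (tiny) polarity lexicons; real lexicons contain thousands of entries.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "terrible", "poor", "hate"}

def orientation(text):
    """Sentiment orientation: positive-word count minus negative-word count."""
    tokens = text.lower().split()
    score = (sum(t in POSITIVE for t in tokens)
             - sum(t in NEGATIVE for t in tokens))
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

reviews = ["great movie i love it",
           "terrible plot and poor acting",
           "it was a movie"]
results = [orientation(r) for r in reviews]
```

Because it needs no training data, such an approach is insensitive to training set size, one of the data properties the study varies.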

8.
Text categorization continues to be one of the most researched NLP problems due to the ever-increasing amounts of electronic documents and digital libraries. In this paper, we present a new text categorization method that combines the distributional clustering of words with a learning logic technique, called Lsquare, for constructing text classifiers. The high dimensionality of text has hindered the categorization task, for which reason feature clustering has proven to be an ideal alternative to feature selection for reducing dimensionality. We therefore use the distributional clustering method (IB) to generate an efficient representation of documents and apply Lsquare to train text classifiers. The method was extensively tested and evaluated. The proposed method achieves higher or comparable classification accuracy and F1 results compared with SVM under identical experimental settings with a small number of training documents on three benchmark datasets: WebKB, 20Newsgroup, and Reuters-21578. The results prove that the method is a good choice for applications with a limited amount of labeled training data. We also demonstrate the effect of changing the training size on the classification performance of the learners.

9.
康厚良  杨玉婷 《图学学报》2022,43(5):865-874
Deep learning techniques, represented by convolutional neural networks (CNN), have shown excellent performance in image classification and recognition. However, there is no standard, public dataset of Dongba pictographs, so existing deep learning algorithms cannot be directly borrowed or applied. To quickly build an authoritative and effective Dongba character database, the first task is to analyze the layout of published Dongba documents and extract text lines and Dongba characters from them. Therefore, drawing on the structural characteristics of Dongba pictograph document images, an automatic text-line segmentation algorithm for Dongba document images is presented. First, a k-means clustering algorithm based on density and distance determines the number of text-line classes and the classification criteria; then, a second pass over the character blocks corrects segmentation errors and improves the algorithm's accuracy. The algorithm makes full use of the structural features of Dongba documents while retaining the objectivity of a machine learning model, free of subjective experience. Experiments show that the algorithm can be used for text-line segmentation of Dongba document images, offline handwritten Chinese characters, and Dongba scriptures, as well as for segmenting Dongba and Chinese characters within text lines. It is simple to implement, highly accurate, and adaptable, laying a foundation for building the Dongba character database.
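The paper's k-means based segmentation is more elaborate, but the underlying idea of splitting a page at low-ink rows can be illustrated with a plain horizontal projection profile. The toy binary image below is invented:

```python
def text_line_rows(image, ink_threshold=0):
    """Split a binary image (list of rows of 0/1 pixels) into text-line bands
    wherever the per-row ink count drops to `ink_threshold` or below."""
    profile = [sum(row) for row in image]     # ink pixels per row
    bands, start = [], None
    for i, ink in enumerate(profile):
        if ink > ink_threshold and start is None:
            start = i                         # a text band begins
        elif ink <= ink_threshold and start is not None:
            bands.append((start, i - 1))      # the band ends at the blank row
            start = None
    if start is not None:                     # band running to the last row
        bands.append((start, len(profile) - 1))
    return bands

# Toy 8x6 binary page: two "text lines" separated by blank rows.
page = [[0, 0, 0, 0, 0, 0],
        [1, 1, 0, 1, 1, 0],
        [1, 1, 1, 1, 1, 1],
        [0, 0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 1, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0]]
bands = text_line_rows(page)
```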

10.
李钊  李晓  王春梅  李诚  杨春 《计算机科学》2016,43(1):246-250, 269
In text clustering, the similarity measure is an important factor affecting clustering quality. Commonly used measures, such as Euclidean distance and correlation coefficients, can only describe low-order correlations between texts, while the relationships between texts are very complex, so clustering based on low-order measures is not very effective. Text clustering methods based on more complex measures have been proposed, but as data sizes grow, the computational cost of text clustering keeps increasing, and traditional clustering methods are no longer suitable for large-scale text clustering. To address these problems, a distributed clustering method based on MapReduce is proposed; it improves the traditional K-means algorithm and adopts a similarity measure based on information loss. To further improve clustering efficiency, the method is combined with a MapReduce-based principal component analysis to reduce the dimensionality of the text feature vectors. A case study shows that the proposed large-scale text clustering method achieves better clustering performance than existing methods.
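One K-means iteration maps naturally onto MapReduce: the map step assigns each point to its nearest center, and the reduce step averages each group. A single-process sketch with invented toy points (using Euclidean distance, not the paper's information-loss measure):

```python
from math import dist  # Python 3.8+

def kmeans_map(point, centers):
    """Map step: emit (index of the nearest center, point)."""
    idx = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
    return idx, point

def kmeans_reduce(pairs, old_centers):
    """Reduce step: each new center is the mean of its assigned points."""
    groups = {i: [] for i in range(len(old_centers))}
    for idx, p in pairs:
        groups[idx].append(p)
    new_centers = []
    for i, old in enumerate(old_centers):
        g = groups[i]
        new_centers.append(tuple(sum(c) / len(g) for c in zip(*g)) if g else old)
    return new_centers

points = [(0.0, 0.0), (0.0, 1.0), (9.0, 9.0), (10.0, 10.0)]
centers = [(0.0, 0.0), (10.0, 10.0)]
for _ in range(3):  # a few iterations; converges immediately on this toy data
    pairs = [kmeans_map(p, centers) for p in points]
    centers = kmeans_reduce(pairs, centers)
```

In a real MapReduce job the pairs would be shuffled by key across reducers; here the grouping dict stands in for the shuffle.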

11.
Self-organizing maps (SOM) have been applied to numerous data clustering and visualization tasks and have received much attention for their success. One major shortcoming of the classical SOM learning algorithm is the need for a predefined map topology; furthermore, hierarchical relationships among the data are difficult to discover. Several approaches have been devised to overcome these deficiencies. In this work, we propose a novel SOM learning algorithm that incorporates several text mining techniques to expand the map both laterally and hierarchically. Given a set of training text documents, the proposed algorithm first clusters them using the classical SOM algorithm. We then identify the topics of each cluster, and these topics are used to evaluate the criteria for expanding the map. The major characteristic of the proposed approach is that it combines the learning process with the text mining process, making it suitable for automatic organization of text documents. We applied the algorithm to the Reuters-21578 dataset in text clustering and categorization tasks. Our method outperforms two competing models in hierarchy quality according to users' evaluation, and it also achieves better F1-scores than two other models in the text categorization task.
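The classical fixed-topology SOM that the proposed method starts from can be sketched as a 1-D map trained with a hard neighborhood function. The toy data, map size, and schedule are invented; a full SOM would use a Gaussian neighborhood with a shrinking radius:

```python
import random

def train_som(data, n_units, epochs=50, lr=0.5, radius=1, seed=0):
    """Classical 1-D SOM: pull the best-matching unit (BMU) and its map
    neighbors toward each sample, with a decaying learning rate."""
    rng = random.Random(seed)
    dim = len(data[0])
    w = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for _ in range(epochs):
        for x in data:
            bmu = min(range(n_units),
                      key=lambda i: sum((a - b) ** 2 for a, b in zip(w[i], x)))
            for i in range(n_units):
                if abs(i - bmu) <= radius:        # hard neighborhood on the map
                    for d in range(dim):
                        w[i][d] += lr * (x[d] - w[i][d])
        lr *= 0.9
    return w

def quantization_error(data, w):
    """Mean squared distance from each sample to its nearest unit."""
    return sum(min(sum((a - b) ** 2 for a, b in zip(u, x)) for u in w)
               for x in data) / len(data)

data = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.9, 1.0)]
trained = train_som(data, n_units=4)
```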

12.
Text Classification from Labeled and Unlabeled Documents using EM
This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available. We introduce an algorithm for learning from labeled and unlabeled documents based on the combination of Expectation-Maximization (EM) and a naive Bayes classifier. The algorithm first trains a classifier using the available labeled documents and probabilistically labels the unlabeled documents. It then trains a new classifier using the labels for all the documents, and iterates to convergence. This basic EM procedure works well when the data conform to the generative assumptions of the model. However, these assumptions are often violated in practice, and poor performance can result. We present two extensions to the algorithm that improve classification accuracy under these conditions: (1) a weighting factor to modulate the contribution of the unlabeled data, and (2) the use of multiple mixture components per class. Experimental results, obtained using text from three different real-world tasks, show that the use of unlabeled data reduces classification error by up to 30%.
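The basic EM procedure described above can be sketched for a tiny bag-of-words problem. The vocabulary, classes, and documents are invented; labeled documents keep their labels fixed while unlabeled ones get soft labels re-estimated each round:

```python
from math import log, exp
from collections import defaultdict

CLASSES = ["sports", "finance"]
VOCAB = {"game", "team", "score", "stock", "market", "profit"}

def train_nb(docs, weights):
    """M-step: Laplace-smoothed multinomial naive Bayes from soft class weights."""
    n = len(docs)
    prior, cond = {}, {}
    for c in CLASSES:
        prior[c] = (1 + sum(w[c] for w in weights)) / (len(CLASSES) + n)
        counts = defaultdict(float)
        for doc, w in zip(docs, weights):
            for t in doc:
                if t in VOCAB:
                    counts[t] += w[c]
        total = sum(counts.values())
        cond[c] = {t: (1 + counts[t]) / (len(VOCAB) + total) for t in VOCAB}
    return prior, cond

def posterior(doc, prior, cond):
    """E-step helper: P(class | doc) under the current model."""
    logp = {c: log(prior[c]) + sum(log(cond[c][t]) for t in doc if t in VOCAB)
            for c in CLASSES}
    m = max(logp.values())
    unnorm = {c: exp(lp - m) for c, lp in logp.items()}
    z = sum(unnorm.values())
    return {c: p / z for c, p in unnorm.items()}

labeled = [(["game", "team"], "sports"), (["stock", "market"], "finance")]
unlabeled = [["team", "score"], ["market", "profit"]]

docs = [d for d, _ in labeled] + unlabeled
weights = ([{c: float(c == lab) for c in CLASSES} for _, lab in labeled]
           + [{c: 0.5 for c in CLASSES} for _ in unlabeled])

for _ in range(5):                                   # EM iterations
    prior, cond = train_nb(docs, weights)            # M-step
    for i, doc in enumerate(unlabeled):              # E-step (labeled docs fixed)
        weights[len(labeled) + i] = posterior(doc, prior, cond)

probs = posterior(["team", "score"], prior, cond)
```

The word "score" never appears in a labeled document, yet EM still pushes the unlabeled document containing it toward "sports" via its co-occurrence with "team".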

13.
The incredible increase in the amount of information on the World Wide Web has given birth to topic-specific crawling of the Web. During a focused crawling process, an automatic Web page classification mechanism is needed to determine whether the page being considered is on topic or not. In this study, a genetic algorithm (GA) based automatic Web page classification system is developed which uses both HTML tags and the terms belonging to each tag as classification features and learns an optimal classifier from the positive and negative Web pages in the training dataset. Our system classifies Web pages by simply computing the similarity between the learned classifier and new Web pages. Existing GA-based classifiers use only HTML tags or terms as features; in this study both are taken together, and optimal weights for the features are learned by our GA. It was found that using both HTML tags and the terms in each tag as separate features improves classification accuracy. The number of documents in the training dataset also affects accuracy: when the number of negative documents is larger than the number of positive documents, the classification accuracy of our system increases up to 95% and becomes higher than that of the well-known Naïve Bayes and k-nearest neighbor classifiers.

14.
The maximum entropy principle has been successfully applied in various natural language processing areas, such as machine translation, speech recognition, and automatic text classification; this paper proposes applying it to the classification of anomalous Internet traffic. Because the maximum entropy model uses binary feature functions to express and process symbolic features, while the KDD99 dataset contains many continuous features, an entropy-based discretization method is used to preprocess the dataset, and the CFS algorithm selects a suitable feature subset to form the training set. Finally, the BLVM algorithm is used for parameter estimation, yielding an exponential-form probability model that satisfies the maximum entropy constraints. Experiments comparing the precision, recall, and F-measure of the maximum entropy model with the Naive Bayes, Bayes Net, SVM, and C4.5 decision tree methods show that the maximum entropy model has good overall performance; in particular, it maintains high classification accuracy even when the number of training samples is limited, giving it broad prospects in practical applications.

15.
Text categorization (TC) is the automated assignment of text documents to predefined categories based on document contents. TC has been an application domain for many learning approaches, which have proved effective. Nevertheless, TC poses many challenges to machine learning. In this paper, we suggest, for text categorization, the integration of external WordNet lexical information to supplement training data for a semi-supervised clustering algorithm which (i) uses a finite design set of labeled data to (ii) help agglomerative hierarchical clustering (AHC) algorithms partition a finite set of unlabeled data and then (iii) terminates without the capacity to classify other objects. This algorithm is the “semi-supervised agglomerative hierarchical clustering algorithm” (ssAHC). Our experiments use the Reuters-21578 database and consist of binary classifications for categories selected from the 89 TOPICS classes of the Reuters collection. Using the vector space model (VSM), each document is represented by its original feature vector augmented with an external feature vector generated using WordNet. We verify experimentally that the integration of WordNet helps ssAHC improve its performance, effectively addresses the classification of documents into categories with few training documents, and does not interfere with the use of training data. © 2001 John Wiley & Sons, Inc.
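The agglomerative hierarchical clustering at the core of ssAHC can be sketched in its plain unsupervised single-linkage form. The toy 2-D points are invented; the paper's version additionally exploits labeled data and WordNet-augmented features:

```python
from math import dist  # Python 3.8+

def single_linkage(points, n_clusters):
    """Naive AHC: repeatedly merge the two clusters whose closest members
    are nearest (single linkage), until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]   # merge j into i
        del clusters[j]
    return clusters

pts = [(0, 0), (0, 1), (5, 5), (5, 6), (0.5, 0.5)]
clusters = single_linkage(pts, 2)
sizes = sorted(len(c) for c in clusters)
```

This naive version is O(n^3); production implementations use a distance matrix with incremental updates.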

16.
To address the shortcomings of the classical support vector machine in incremental learning, an incremental learning algorithm for the proximal support vector machine based on the cloud model is proposed. The method uses the fast learning ability of the proximal support vector machine to generate an initial separating hyperplane, reduces the full training set together with the k-nearest-neighbor method, and builds a cloud-model classifier on the resulting smaller reduced set to perform classification directly. The algorithm has a simple model, requires no iterative solving, has low time complexity and good noise tolerance, and reflects the distribution of newly added samples well. Simulation experiments show that the algorithm maintains good classification accuracy and generalization ability while running fast.

17.
Whether a learning algorithm supports incremental learning is an important measure of its suitability for real-world problems. Incremental learning keeps an algorithm's time and space consumption at a manageable and controllable level, and it has been widely applied to large-scale dataset problems. For text classification, this paper formulates the general problem of incremental learning algorithms. Based on the idea of a push-pull strategy, an incremental learning algorithm for text classification, ICCDP, is proposed, and the formulated general problem is analyzed with this algorithm. Experiments show that the algorithm trains quickly and classifies accurately, making it highly practical.

18.
Image classification is of great importance for digital photograph management. In this paper, we propose a general statistical learning method based on a boosting algorithm to perform image classification for photograph annotation and management. The proposed method employs both features extracted from image content (i.e., color moments and edge direction histograms) and features from the EXIF metadata recorded by digital cameras. To fully exploit potential feature correlations and improve classification accuracy, feature combination is needed. We incorporate the linear discriminant analysis (LDA) algorithm to form linear combinations of selected features and generate new combined features. The combined features are used along with the original features in the boosting algorithm to improve classification performance. To make the proposed learning algorithm more efficient, we present two heuristics for selective feature combination, which significantly reduce training computation without losing performance. The proposed image classification method has several advantages: a small model size, computational efficiency, and improved classification performance based on LDA feature combination.
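For two classes, the LDA combination amounts to projecting features onto the Fisher direction w = Sw^-1 (m1 - m2). A 2-D sketch with invented points, inverting the 2x2 pooled scatter matrix in closed form:

```python
def fisher_direction(X1, X2):
    """Two-class Fisher LDA in 2-D: w = Sw^-1 (m1 - m2), where Sw is the
    pooled within-class scatter matrix (2x2, inverted in closed form)."""
    def mean(X):
        return [sum(c) / len(X) for c in zip(*X)]
    m1, m2 = mean(X1), mean(X2)
    s00 = s01 = s11 = 0.0
    for X, m in ((X1, m1), (X2, m2)):
        for x in X:
            d0, d1 = x[0] - m[0], x[1] - m[1]
            s00 += d0 * d0
            s01 += d0 * d1
            s11 += d1 * d1
    det = s00 * s11 - s01 * s01
    dm0, dm1 = m1[0] - m2[0], m1[1] - m2[1]
    return [(s11 * dm0 - s01 * dm1) / det,
            (-s01 * dm0 + s00 * dm1) / det]

# Invented toy feature pairs for two classes.
X1 = [(1, 2), (2, 3), (3, 3)]
X2 = [(6, 5), (7, 8), (8, 7)]
w = fisher_direction(X1, X2)
```

The scalar projection w·x is the new "combined feature" that a booster can then use alongside the original two.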

19.
As one of the low-level tasks in natural language processing, text classification provides valuable support for downstream tasks. With the recent trend of applying deep learning broadly across NLP tasks, deep learning has performed well on the downstream task of text classification. However, current deep-network-based models still fall short in capturing long-range contextual semantic information in text sequences, and they do not introduce linguistic information to assist the classifier. To address these problems, a novel English text classification model combining BERT and Bi-LSTM is proposed. The model not only introduces linguistic information through the BERT pre-trained language model to improve classification accuracy, but also uses a Bi-LSTM network to capture bidirectional contextual semantic dependencies and model the text explicitly. Specifically, the model consists of an input layer, a BERT pre-trained language model layer, a Bi-LSTM layer, and a classifier layer. Experimental results show that, compared with existing classification models, the proposed BERT-Bi-LSTM model achieves the highest classification accuracy on the MR, SST-2, and CoLA datasets, at 86.2%, 91.5%, and 83.2% respectively, greatly improving the performance of English text classification models.

20.
Medical texts contain complex technical terminology and offer few training samples in this vertical domain, so traditional classification methods cannot meet practical needs. A few-shot text classification model based on meta-learning is proposed to improve the efficiency of medical text classification. Built on the idea of transfer learning, the model adds an attention mechanism to give the words in a sentence different weights, uses two competing neural networks to play the roles of domain discriminator and meta-knowledge generator, strengthens the adaptability of meta-learning to new datasets through an adaptive network, and finally obtains the classification of a dataset with ridge regression. Comparative experiments verify that the model classifies several public text datasets and medical text data well; the meta-learning-based few-shot text classification model can be successfully applied to medical text classification.
