1.

Comparison of term frequency and document frequency based feature selection metrics in text categorization

Nouman Azam JingTao Yao 《Expert systems with applications》2012,39(5):4760-4768

Text categorization plays an important role in applications where information is filtered, monitored, personalized, categorized, organized or searched. Feature selection remains an effective and efficient technique in text categorization. Feature selection metrics are commonly based on the term frequency or document frequency of a word. We focus on the relative importance of these frequencies for feature selection metrics. The document frequency based metrics of discriminative power measure and GINI index were examined with term frequency for this purpose. The metrics were compared and analyzed on the Reuters-21578 dataset. Experimental results revealed that the term frequency based metrics may be useful, especially for smaller feature sets. Two characteristics of term frequency based metrics were observed by analyzing the scatter of features among classes and the rate at which information in the data was covered. These characteristics may contribute toward their superior performance for smaller feature sets.
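The difference between the two counts can be made concrete with a minimal Python sketch. The toy corpus and the helper name `tf_and_df` are illustrative assumptions, not taken from the paper, and the discriminative power measure and GINI index that the paper evaluates are not reproduced here.

```python
from collections import Counter

def tf_and_df(docs):
    """Return (term_frequency, document_frequency) for tokenized documents."""
    tf, df = Counter(), Counter()
    for tokens in docs:
        tf.update(tokens)        # every occurrence of a term counts
        df.update(set(tokens))   # each document counts at most once per term
    return tf, df

# Hypothetical toy corpus: "oil" is frequent overall but confined to one document.
docs = [
    ["oil", "price", "oil", "oil"],
    ["price", "market"],
    ["market", "price", "trade"],
]
tf, df = tf_and_df(docs)
print(tf.most_common())  # ranking by term frequency
print(df.most_common())  # ranking by document frequency
```

In this toy corpus "oil" and "price" tie on term frequency, while document frequency separates them, which is the kind of disagreement between the two families of metrics that the abstract studies.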

2.

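A text classification system based on a category-based feature selection algorithm  Total citations: 1, self-citations: 0, citations by others: 1

蒋伟贞 陶宏才 《计算机应用》2005,25(11):2658-2660

Most current index-term selection algorithms are based on term frequency and make no use of the category information in the training samples. To address this, a new category-based feature selection algorithm is proposed. The algorithm measures how the presence or absence of a word in a document changes the similarity among documents of a class, uses this difference to determine the word's power to discriminate between documents, and takes that discriminative power as the importance score for selecting keywords. Based on this algorithm, an automatic classification system for English text was designed, tested, and its results analyzed.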

3.

A feature selection method based on document frequency  Total citations: 1, self-citations: 1, citations by others: 0

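杨凯峰 张毅坤 李燕 《计算机工程》2010,36(17):33-35,38

The traditional document frequency (DF) method considers only the DF of a feature term within a category when selecting features, and ignores the term frequency (TF) of the feature term within each document. To address this, a new feature selection algorithm is proposed that combines the TF of a feature term in each document with its DF in the category, and a support vector machine is used to train the classifier. Experimental results show that taking into account the contribution of high-frequency feature terms to a category during feature selection improves the classification performance of the traditional DF method.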

4.

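Simultaneous feature selection and SVM parameter optimization based on a binary PSO algorithm  Total citations: 3, self-citations: 0, citations by others: 3

任江涛 赵少东 许盛灿 印鉴 《计算机科学》2007,34(6):179-182

Feature selection and classifier parameter optimization are two important ways to improve classifier performance, and traditionally the two problems have been solved separately. In recent years, as evolutionary optimization techniques have been widely applied in pattern recognition, their flexibility in encoding has made the simultaneous optimization of feature selection and parameters both feasible and a trend. To this end, this paper uses a binary PSO algorithm to perform feature selection and SVM parameter optimization simultaneously, and proposes a PSO-SVM algorithm. Experiments show that the method effectively finds a suitable feature subset and SVM parameters and achieves good classification results; compared with the GA-SVM algorithm proposed in reference [4], it reduces the feature set more and runs more efficiently.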
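As a rough illustration of how a binary PSO can search over feature masks, here is a minimal Python sketch. It is not the paper's PSO-SVM: the fitness function `toy_fitness`, the index set `USEFUL`, and all hyperparameter values are hypothetical, and the SVM parameters that the paper encodes alongside the feature bits are left out. In a real setting the fitness would be something like cross-validated classifier accuracy on the selected features.

```python
import math
import random

def binary_pso(fitness, n_bits, n_particles=10, n_iters=30,
               w=0.7, c1=1.5, c2=1.5, seed=0):
    """Maximize `fitness` over bit lists with a basic sigmoid-transfer binary PSO."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(n_particles)]
    vel = [[0.0] * n_bits for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(n_bits):
                # Velocity is pulled toward the personal and global best bits.
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                vel[i][d] = max(-6.0, min(6.0, vel[i][d]))  # clamp velocity
                # The sigmoid turns velocity into the probability of the bit being 1.
                prob_one = 1.0 / (1.0 + math.exp(-vel[i][d]))
                pos[i][d] = 1 if rng.random() < prob_one else 0
            fit = fitness(pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit

USEFUL = {0, 2}  # hypothetical indices of the informative features

def toy_fitness(bits):
    """Reward selecting informative features, penalize every selected feature."""
    return sum(bits[i] for i in USEFUL) - 0.1 * sum(bits)

best_bits, best_fit = binary_pso(toy_fitness, n_bits=6)
print(best_bits, best_fit)
```

Each bit marks whether a feature is kept, so the same encoding can be extended with extra bits for classifier parameters, which is the flexibility in encoding that the abstract refers to.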

5.

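A study of a feature selection method for text classification

赵婧 邵雄凯 刘建舟 王春枝 《计算机应用研究》2019,36(8)

The shortcomings of two traditional feature selection methods in text classification, the chi-square statistic and information gain, are analyzed, leading to the conclusion that the key to feature selection in text classification is to select feature terms that are concentrated in one class of documents, distributed evenly within that class, and occur frequently in it. Accordingly, taking into account the document frequency and term frequency of feature terms together with their inter-class concentration and intra-class dispersion, a feature selection evaluation function based on intra-class and inter-class document frequency and term frequency statistics is proposed. The function is used to select a fixed proportion of feature terms from each category of the training set to form that category's feature lexicon, and the feature lexicon of the training set is the union of the category lexicons. SVM-based Chinese text classification experiments show that, compared with the traditional chi-square statistic and information gain, the method improves text classification performance to some extent.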
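For reference, the chi-square statistic used as a baseline above can be sketched in Python as follows. This is the standard term/class 2x2 formulation, not the evaluation function proposed in the paper; the function name and the example counts are illustrative assumptions.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic of a term/class 2x2 contingency table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that do not contain the term
    d: documents outside the class that do not contain the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # a marginal total is empty, so no association can be measured
    return n * (a * d - c * b) ** 2 / denom

# Hypothetical counts: the term appears in 40 of 50 in-class documents
# and in 10 of 50 out-of-class documents.
print(chi_square(40, 10, 10, 40))
```

The statistic depends only on whether a document contains the term, not on how often the term occurs in it, which is one reason a method might add term frequency and within-class distribution on top of it.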

6.

Detecting near-duplicate documents using sentence-level features and supervised learning

Yung-Shen Lin Ting-Yi Liao Shie-Jue Lee 《Expert systems with applications》2013,40(5):1467-1476

We present a novel method for detecting near-duplicates in a large collection of documents. Our method involves three major parts: feature selection, similarity measure, and discriminant derivation. To find near-duplicates of an input document, each sentence of the input document is fetched and preprocessed, the weight of each term is calculated, and the heavily weighted terms are selected to be the feature of the sentence. As a result, the input document is turned into a set of such features. A similarity measure is then applied and the similarity degree between the input document and each document in the given collection is computed. A support vector machine (SVM) is adopted to learn a discriminant function from a training pattern set, which is then employed to determine whether a document is a near-duplicate of the input document based on the similarity degree between them. The sentence-level features we adopt can better reveal the characteristics of a document. In addition, learning the discriminant function with an SVM avoids the trial-and-error effort required by conventional methods. Experimental results show that our method is effective in near-duplicate document detection.

7.

Text feature selection with a robust weight scheme and dynamic dimension reduction to text document clustering

《Expert systems with applications》2017

This paper proposes three feature selection algorithms with a feature weight scheme and dynamic dimension reduction for the text document clustering problem. Text document clustering is a new trend in text mining; in this process, text documents are separated into several coherent clusters according to carefully selected informative features by using a proper evaluation function, which usually depends on term frequency. Informative features in each document are selected using feature selection methods. Genetic algorithm (GA), harmony search (HS) algorithm, and particle swarm optimization (PSO) algorithm are the most successful feature selection methods established using a novel weighting scheme, namely, length feature weight (LFW), which depends on term frequency and the appearance of features in other documents. A new dynamic dimension reduction (DDR) method is also provided to reduce the number of features used in clustering and thus improve the performance of the algorithms. Finally, k-means, which is a popular clustering method, is used to cluster the set of text documents based on the terms (or features) obtained by dynamic reduction. Seven benchmark text datasets of different sizes and complexities are evaluated. Analysis with k-means shows that particle swarm optimization with length feature weight and dynamic reduction produces the best outcomes for almost all datasets tested. This paper provides new alternatives for the text mining community to cluster text documents by using cohesive and informative features.

8.

Learning Semi-Structured Document Categorization Using Bounded-Length Spectrum Sub-Sequence Kernels

Olivier de Vel 《Data mining and knowledge discovery》2006,13(3):309-334

In this paper we report an investigation into the learning of semi-structured document categorization. We automatically discover low-level, short-range byte data structure patterns from a document data stream by extracting all byte sub-sequences within a sliding window to form an augmented (or bounded-length) string spectrum feature map, and we use a modified suffix trie data structure (called the coloured generalized suffix tree or CGST) to efficiently store and manipulate the feature map. Using the CGST we are able to efficiently compute the stream's bounded-length sequence spectrum kernel. We compare the performance of two classifier algorithms for categorizing the data streams, namely, the SVM and Naive Bayes (NB) classifiers. Experiments yielded good classification performance on a variety of document byte streams, particularly when using the NB classifier under certain parameter settings. Results indicate that the bounded-length kernel is superior to the standard fixed-length kernel for semi-structured documents.

9.

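Simultaneous-optimization feature selection based on an improved spotted hyena optimization algorithm

贾鹤鸣 姜子超 李瑶 孙康健 《计算机应用》2021,41(5):1290-1298

To address the low classification accuracy, redundant feature subsets, and poor computational efficiency of the traditional support vector machine (SVM) in wrapper feature selection, a metaheuristic optimization algorithm is used to optimize the SVM and feature selection simultaneously. To improve SVM classification performance and the ability to select feature subsets, first, the spotted hyena optimization (SHO) algorithm is improved with an adaptive differential evolution (DE) algorithm, chaotic initialization, and a tournament selection strategy, strengthening its local search ability and raising its optimization efficiency and solution accuracy. Second, the improved algorithm is applied to the simultaneous optimization of feature selection and SVM parameter tuning. Finally, feature selection simulation experiments are run on UCI datasets, and classification accuracy, number of selected features, fitness value, and running time are used together to evaluate the optimization performance of the proposed algorithm. The experimental results show that the simultaneous optimization mechanism of the improved algorithm reduces the number of selected features while maintaining high classification accuracy, and that the algorithm is better suited than traditional algorithms to wrapper feature selection and has good application value.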

10.

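Simultaneous feature selection and SVM parameter optimization with an improved MEA

丁胜 张进 李波 《计算机工程与应用》2017,53(8):120-125

Feature selection and parameter optimization are two important means of improving the classification performance of support vector machines (SVM), and optimizing both simultaneously can raise a classifier's accuracy. Using the mind evolutionary algorithm (MEA) for simultaneous feature selection and SVM parameter optimization achieves good classification results, but it converges slowly and easily falls into local optima, which prevents further gains in accuracy. To address this, an improved mind evolutionary algorithm for classifier optimization (RMEA-SVM) is proposed. It adds "learning" and "reflection" mechanisms to the traditional MEA: subpopulations learn by sharing information, and reflect by comparing fitness values. This preserves population diversity, speeds up convergence, and further improves classification accuracy. Experimental results demonstrate the effectiveness of the algorithm.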

11.

A new feature selection algorithm based on binomial hypothesis testing for spam filtering

Jieming Yang Yuanning Liu Zhen Liu Xiaodong Zhu Xiaoxu Zhang 《Knowledge》2011,24(6):904-914

Content-based spam filtering is a binary text categorization problem. Feature selection, an important and indispensable part of text categorization, therefore also plays an important role in improving the performance of spam filtering. We propose a new method, named Bi-Test, which uses binomial hypothesis testing to estimate whether the probability of a feature belonging to spam satisfies a given threshold. We have evaluated Bi-Test on six benchmark spam corpora (pu1, pu2, pu3, pua, lingspam and CSDMC2010), using two classification algorithms, Naïve Bayes (NB) and Support Vector Machines (SVM), and compared it with four well-known feature selection algorithms (information gain, χ²-statistic, improved Gini index and Poisson distribution). The experiments show that Bi-Test performs significantly better than χ²-statistic and Poisson distribution, and produces performance comparable to information gain and improved Gini index in terms of the F1 measure when the Naïve Bayes classifier is used; it achieves performance comparable to the other methods when the SVM classifier is used. Moreover, Bi-Test executes faster than the other four algorithms.
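To show what a binomial hypothesis test on a single feature looks like, here is a minimal Python sketch of a generic one-sided test. The abstract does not give Bi-Test's exact scoring rule, so the function name, the null probability of 0.5, and the example counts are assumptions for illustration only.

```python
from math import comb

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical feature: it occurs in 20 messages, 18 of which are spam.
# Null hypothesis (assumed): a message containing the feature is spam
# with probability at most 0.5.
p_value = binomial_upper_tail(18, 20, 0.5)
print(p_value)
```

A small tail probability means the observed spam count would be unlikely under the null hypothesis, so the feature is strongly associated with spam and is a candidate for selection.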

12.

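A comparative study of feature selection methods in Chinese text categorization  Total citations: 99, self-citations: 9, citations by others: 99

代六玲 黄河燕 陈肇雄 《中文信息学报》2004,18(1):27-33

This paper presents a comparative study of how feature selection methods affect classification performance in Chinese text categorization. Four feature selection methods are examined: document frequency (DF), information gain (IG), mutual information (MI), and the χ² statistic (CHI). Two classifiers, support vector machines (SVM) and KNN, are used to assess the effectiveness of the different methods. Experimental results show that feature selection methods that perform well on English text categorization (IG, MI, and CHI) are not suited to Chinese text categorization without modification. The paper analyzes the causes of the difference theoretically and discusses possible remedies, including using very large training corpora and combining feature selection methods. Finally, experiments verify the effectiveness of the combined feature selection method.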
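Two of the metrics named above, information gain and mutual information, can be sketched in Python from the same four document counts. The sketch assumes the common class-versus-rest formulation with natural logarithms; the function names and example counts are illustrative and are not taken from the paper. Document frequency itself is simply `a + b` in this notation.

```python
from math import log

def pointwise_mi(a, b, c, d):
    """Pointwise mutual information between 'term present' and 'in class'.

    a: in-class docs with the term      b: out-of-class docs with the term
    c: in-class docs without the term   d: out-of-class docs without the term
    Requires a > 0.
    """
    n = a + b + c + d
    return log(a * n / ((a + c) * (a + b)))

def information_gain(a, b, c, d):
    """Information gain of the term for the binary class-vs-rest split."""
    n = a + b + c + d
    total = 0.0
    # Each triple is (joint count, term-side marginal, class-side marginal).
    for joint, row, col in ((a, a + b, a + c), (b, a + b, b + d),
                            (c, c + d, a + c), (d, c + d, b + d)):
        if joint > 0:  # empty cells contribute nothing
            total += (joint / n) * log(joint * n / (row * col))
    return total

print(pointwise_mi(40, 10, 10, 40), information_gain(40, 10, 10, 40))
```

Pointwise mutual information looks only at the cell where the term and the class co-occur, whereas information gain averages over all four cells, so the two can rank rare terms very differently.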

13.

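Feature subset selection and parameter optimization for support vector machines based on IGA

郝艳友 迟忠先 李克秋 张永 《计算机工程与应用》2008,44(22):35-38

Feature subset selection and the optimization of training parameters have long been two important aspects of SVM research. Choosing suitable features and sensible training parameters can improve SVM classifier performance, and earlier work solved the two problems separately. With the application of natural computing techniques such as genetic optimization in artificial intelligence, research on optimizing feature selection and parameters at the same time has begun to appear. This study uses an immune genetic algorithm (IGA) to optimize feature selection and SVM parameters simultaneously, and proposes an IGA-SVM algorithm. Experiments show that the method finds a suitable feature subset and SVM parameters and achieves good classification results, demonstrating the effectiveness of the algorithm.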

14.

Unsupervised feature selection based on kernel fisher discriminant analysis and regression learning

Shang Ronghua Meng Yang Liu Chiyang Jiao Licheng Esfahani Amir M. Ghalamzan Stolkin Rustam 《Machine Learning》2019,108(4):659-686

In this paper, we propose a new feature selection method called kernel fisher discriminant analysis and regression learning based algorithm for unsupervised feature selection. The existing feature selection methods are based on either manifold learning or discriminative techniques, each of which has some shortcomings. Although some studies show the advantages of two-step methods that benefit from both manifold learning and discriminative techniques, a joint formulation has been shown to be more efficient. To do so, we construct a global discriminant objective term of a clustering framework based on the kernel method. We add another term of regression learning into the objective function, which drives the optimization to select a low-dimensional representation of the original dataset. We use the L2,1-norm of the features to impose a sparse structure upon features, which can result in more discriminative features. We propose an algorithm to solve the optimization problem introduced in this paper. We further discuss convergence, parameter sensitivity, computational complexity, as well as the clustering and classification accuracy of the proposed algorithm. In order to demonstrate the effectiveness of the proposed algorithm, we perform a set of experiments with different available datasets. The results obtained by the proposed algorithm are compared against the state-of-the-art algorithms. These results show that our method outperforms the existing state-of-the-art methods in many cases on different datasets, but the improved performance comes at the cost of increased time complexity.
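The L2,1-norm mentioned in the abstract is the sum of the Euclidean norms of a matrix's rows. A minimal Python sketch follows; the convention that each row corresponds to one feature, and the example matrix, are assumptions for illustration rather than details from the paper.

```python
from math import sqrt

def l21_norm(w):
    """Sum of the Euclidean (L2) norms of the rows of matrix `w`."""
    return sum(sqrt(sum(x * x for x in row)) for row in w)

# Hypothetical weight matrix: one row per feature. The all-zero row
# corresponds to a feature that has been dropped entirely.
w = [[3.0, 4.0],
     [0.0, 0.0],
     [0.0, 5.0]]
print(l21_norm(w))
```

Because the penalty is applied per row, minimizing it tends to push whole rows to zero, which is what makes it useful for selecting or discarding features as a unit.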

15.

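A keyword-based scheme for text feature selection and weight calculation  Total citations: 2, self-citations: 3, citations by others: 2

刘里 何中市 《计算机工程与设计》2006,27(6):934-936

The formal representation of text has long been a major difficulty in text classification. In the widely used vector space model, the weight of each feature dimension of a text is its TFIDF value, which makes it hard to highlight the features that play a key role in the text's content. A keyword-based scheme for feature selection and weight calculation is proposed. It uses the structural information of the text together with mutual information theory to extract the words that play a key role in the content; the weight calculation combines word position, word relations, and word frequency, highlighting the contribution of the key words and compensating for the shortcomings of TFIDF. In experiments with a support vector machine (SVM) classifier, the proposed Score weighting method achieves an average classification accuracy about 5% higher than the traditional TFIDF method.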
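The TFIDF baseline that the abstract compares against can be sketched in Python as below. This uses one common variant, raw term count times ln(N/df); the paper's exact TFIDF variant and its Score method are not specified in the abstract, and the helper name and toy documents are illustrative assumptions.

```python
from collections import Counter
from math import log

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # document frequency of each term
    weights = []
    for tokens in docs:
        tf = Counter(tokens)    # raw term counts in this document
        weights.append({t: tf[t] * log(n / df[t]) for t in tf})
    return weights

# Hypothetical toy corpus.
docs = [["svm", "text", "text"], ["svm", "kernel"]]
weights = tfidf(docs)
print(weights)
```

The weight depends only on counts, so a word in a title and the same word deep in the body are treated alike, which is the limitation that position-aware and relation-aware weighting aims to address.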

16.

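An improved joint optimization algorithm for feature selection and parameters of support vector machines based on particle swarm optimization

张进 丁胜 李波 《计算机应用》2016,36(5):1330-1335

Because feature selection and parameter optimization have a large effect on the classification accuracy of support vector machines (SVM), an improved particle swarm optimization (PSO) based algorithm for joint SVM feature selection and parameter optimization (GPSO-SVM) is proposed, which aims to raise classification accuracy while selecting as few features as possible. To overcome the tendency of traditional PSO to fall into local optima and converge prematurely, the algorithm introduces the crossover and mutation operators of the genetic algorithm (GA) into PSO, so that particles undergo crossover and mutation after each iteration's update. Crossover pairing between particles is decided by an uncorrelation index between them, and each particle's mutation probability is determined by its fitness value; the new particles produced this way join the swarm. This lets particles escape the local optimum found so far, increases swarm diversity, and searches for better values globally. In experiments on different datasets, compared with PSO-based and GA-based joint feature selection and SVM parameter optimization algorithms, GPSO-SVM improves classification accuracy by 2% to 3% on average and reduces the number of selected features by 3% to 15%. The results show that the proposed algorithm performs better at feature selection and parameter optimization.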

17.

Best terms: an efficient feature-selection algorithm for text categorization

Dimitris Fragoudis Dimitris Meretakis Spiridon Likothanassis 《Knowledge and Information Systems》2005,8(1):16-33

In this paper, we propose a new feature-selection algorithm for text classification, called best terms (BT). The complexity of BT is linear with respect to the number of training-set documents and is independent of both the vocabulary size and the number of categories. We evaluate BT on two benchmark document collections, Reuters-21578 and 20-Newsgroups, using two classification algorithms, naive Bayes (NB) and support vector machines (SVM). Our experimental results, comparing BT with an extensive and representative list of feature-selection algorithms, show that (1) BT is faster than the existing feature-selection algorithms; (2) BT leads to a considerable increase in the classification accuracy of NB and SVM as measured by the F1 measure; (3) BT leads to a considerable improvement in the speed of NB and SVM; in most cases, the training time of SVM dropped by an order of magnitude; (4) in most cases, the combination of BT with the simple but very fast NB algorithm leads to classification accuracy comparable with SVM, and sometimes it is even more accurate.

18.

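A novel discriminative feature selection method

吴锦华 左开中 接标 丁新涛 《计算机应用》2015,35(10):2752-2756

As a common means of data preprocessing, feature selection can both improve a classifier's performance and make classification results easier to interpret. Because feature selection methods based on sparse learning sometimes ignore useful discriminative information and so hurt classification performance, a new discriminative feature selection method, D-LASSO, is proposed to select more discriminative features. First, the D-LASSO model contains an L₁-norm regularization term that yields a sparse solution. Second, to induce more discriminative features, a new discriminative regularization term is added to the model to preserve the geometric distribution information of same-class samples and of different-class samples. Experimental results on a series of benchmark datasets show that, compared with existing methods, D-LASSO not only further improves classification accuracy but is also fairly robust to its parameters.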

19.

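A feature selection method for cross-domain sentiment classification

张玉红 周全 胡学钢 《模式识别与人工智能》2013,26(11):1068-1072

Because labeled data are hard to obtain, cross-domain adaptation has become an effective approach. Sentiment classification, however, is strongly domain dependent: a feature space built in the source domain with traditional feature selection methods does not reflect what the domains have in common and transfers poorly to the target domain. To address this, a feature selection method for cross-domain sentiment classification (LLRTF) is proposed. It uses the log-likelihood ratio to select features that are highly discriminative in the source domain and, by comparing the statistics of the two domains, picks out those that have a large influence in the target domain. The common feature space built with this method reduces the difference in data distribution between domains. Experimental results show that LLRTF outperforms the baseline algorithms.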
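The log-likelihood ratio used to score features can be sketched in Python with the usual G² statistic over a 2x2 table of document counts. This is only the generic statistic; how LLRTF combines it with target-domain statistics is not described in the abstract, and the function name and example counts below are illustrative assumptions.

```python
from math import log

def log_likelihood_ratio(a, b, c, d):
    """G^2 statistic for a term/label 2x2 table.

    a: positive docs with the term      b: negative docs with the term
    c: positive docs without the term   d: negative docs without the term
    """
    n = a + b + c + d
    observed = (a, b, c, d)
    # Expected counts if the term and the label were independent.
    expected = ((a + b) * (a + c) / n, (a + b) * (b + d) / n,
                (c + d) * (a + c) / n, (c + d) * (b + d) / n)
    return 2.0 * sum(o * log(o / e) for o, e in zip(observed, expected) if o > 0)

# Hypothetical counts: the term occurs in 40 of 50 positive reviews
# and in 10 of 50 negative reviews.
print(log_likelihood_ratio(40, 10, 10, 40))
```

A score near zero means the term is spread evenly over the two sentiment labels, while a large score marks a term whose presence is strongly tied to one label in the source domain.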

20.

Motor Imagery Electroencephalograph Classification Based on Optimized Support Vector Machine by Magnetic Bacteria Optimization Algorithm

Hongwei Mo Yanyan Zhao 《Neural Processing Letters》2016,44(1):185-197

In this paper, a support vector machine (SVM) optimized by a new bio-inspired method, the magnetic bacteria optimization algorithm, is proposed to construct a high-performance classifier for motor imagery electroencephalograph based brain–computer interfaces (BCI). A Butterworth band-pass filter and an artifact removal technique are combined to extract the frequency-band features of the ERD/ERS. Common spatial pattern is used to extract the feature vectors, which are later fed to the classifier. The optimization mechanism involves setting the kernel parameters in the SVM training procedure, which significantly influences the classification accuracy. Our approach aims to optimize the penalty factor C and the kernel parameter g of the SVM. The experimental results on the BCI Competition IV dataset II-a show that the proposed method outperforms competing methods in the literature, such as the genetic algorithm, particle swarm algorithm, artificial bee colony, and biogeography based optimization.