首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 0 毫秒
1.
In this paper, a supervised feature selection approach is presented, which is based on metric applied on continuous and discrete data representations. This method builds a dissimilarity space using information theoretic measures, in particular conditional mutual information between features with respect to a relevant variable that represents the class labels. Applying a hierarchical clustering, the algorithm searches for a compression of the information contained in the original set of features. The proposed technique is compared with other state of art methods also based on information measures. Eventually, several experiments are presented to show the effectiveness of the features selected from the point of view of classification accuracy.  相似文献   

2.
针对多维数据集,为得到一个最优特征子集,提出一种基于特征聚类的封装式特征选择算法。在初始阶段,利用三支决策理论动态地将原始特征集划分为若干特征子空间,通过特征聚类算法对每个特征子空间内的特征进行聚类;从每个特征类簇里挑选代表特征,利用邻域互信息对剩余特征进行降序排序并依次迭代选择,使用封装器评估该特征是否应该被选择,可得到一个具有最低分类错误率的最优特征子集。在UCI数据集上的实验结果表明,相较于其它特征选择算法,该算法能有效地提高各数据集在libSVM、J48、Nave Bayes以及KNN分类器上的分类准确率。  相似文献   

3.
A new algorithm for ranking the input features and obtaining the best feature subset is developed and illustrated in this paper. The asymptotic formula for mutual information and the expectation maximisation (EM) algorithm are used to developing the feature selection algorithm in this paper. We not only consider the dependence between the features and the class, but also measure the dependence among the features. Even for noisy data, this algorithm still works well. An empirical study is carried out in order to compare the proposed algorithm with the current existing algorithms. The proposed algorithm is illustrated by application to a variety of problems.  相似文献   

4.
为解决连续值特征条件互信息计算困难和对多值特征偏倚的问题,提出了一种基于 Parzen 窗条件互信息计算的特征选择方法。该方法通过 Parzen 窗估计出连续值特征的概率密度函数,进而方便准确地计算出条件互信息;同时在评价准则中引入特征离散度作为惩罚因子,克服了条件互信息计算对于多值特征的偏倚,实现了对连续型数据的特征选择。实验证明,该方法能够达到与现有方法相当甚至更好的效果,是一种有效的特征选择方法。  相似文献   

5.
Feature selection plays an important role in data mining and pattern recognition, especially for large scale data. During past years, various metrics have been proposed to measure the relevance between different features. Since mutual information is nonlinear and can effectively represent the dependencies of features, it is one of widely used measurements in feature selection. Just owing to these, many promising feature selection algorithms based on mutual information with different parameters have been developed. In this paper, at first a general criterion function about mutual information in feature selector is introduced, which can bring most information measurements in previous algorithms together. In traditional selectors, mutual information is estimated on the whole sampling space. This, however, cannot exactly represent the relevance among features. To cope with this problem, the second purpose of this paper is to propose a new feature selection algorithm based on dynamic mutual information, which is only estimated on unlabeled instances. To verify the effectiveness of our method, several experiments are carried out on sixteen UCI datasets using four typical classifiers. The experimental results indicate that our algorithm achieved better results than other methods in most cases.  相似文献   

6.
Feature selection is one of the major problems in an intrusion detection system (IDS) since there are additional and irrelevant features. This problem causes incorrect classification and low detection rate in those systems. In this article, four feature selection algorithms, named multivariate linear correlation coefficient (MLCFS), feature grouping based on multivariate mutual information (FGMMI), feature grouping based on linear correlation coefficient (FGLCC), and feature grouping based on pairwise MI, are proposed to solve this problem. These algorithms are implementable in any IDS. Both linear and nonlinear measures are used in the sense that the correlation coefficient and the multivariate correlation coefficient are linear, whereas the MI and the multivariate MI are nonlinear. Least Square Support Vector Machine (LS-SVM) as an intrusion classifier is used to evaluate the selected features. Experimental results on the KDDcup99 and Network Security Laboratory-Knowledge Discovery and Data Mining (NSL) datasets showed that the proposed feature selection methods have a higher detection and accuracy and lower false-positive rate compared with the pairwise linear correlation coefficient and the pairwise MI employed in several previous algorithms.  相似文献   

7.
提出了一种优化互信息文本特征选择方法。针对互信息模型的不足之处主要从三方面进行改进:用权重因子对正、负相关特征加以区分;以修正因子的方式在MI中引入词频信息对低频词进行抑制;针对特征项在文本里的位置差异进行基于位置的特征加权。该方法改善了MI模型的特征选择效率。文本分类实验结果验证了提出的优化互信息特征选择方法的合理性与有效性。  相似文献   

8.
特征选择是数据挖掘和机器学习领域中一种常用的数据预处理技术。在无监督学习环境下,定义了一种特征平均相关度的度量方法,并在此基础上提出了一种基于特征聚类的特征选择方法 FSFC。该方法利用聚类算法在不同子空间中搜索簇群,使具有较强依赖关系(存在冗余性)的特征被划分到同一个簇群中,然后从每一个簇群中挑选具有代表性的子集共同构成特征子集,最终达到去除不相关特征和冗余特征的目的。在 UCI 数据集上的实验结果表明,FSFC 方法与几种经典的有监督特征选择方法具有相当的特征约减效果和分类性能。  相似文献   

9.
Exploratory data analysis methods are essential for getting insight into data. Identifying the most important variables and detecting quasi-homogenous groups of data are problems of interest in this context. Solving such problems is a difficult task, mainly due to the unsupervised nature of the underlying learning process. Unsupervised feature selection and unsupervised clustering can be successfully approached as optimization problems by means of global optimization heuristics if an appropriate objective function is considered. This paper introduces an objective function capable of efficiently guiding the search for significant features and simultaneously for the respective optimal partitions. Experiments conducted on complex synthetic data suggest that the function we propose is unbiased with respect to both the number of clusters and the number of features.  相似文献   

10.
基于类信息的文本聚类中特征选择算法   总被引:2,自引:0,他引:2  
文本聚类属于无监督的学习方法,由于缺乏类信息还很难直接应用有监督的特征选择方法,因此提出了一种基于类信息的特征选择算法,此算法在密度聚类算法的聚类结果上使用信息增益特征选择法重新选择最有分类能力的特征,实验验证了算法的可行性和有效性。  相似文献   

11.
Artificial neural networks (ANNs) have been widely used to model environmental processes. The ability of ANN models to accurately represent the complex, non-linear behaviour of relatively poorly understood processes makes them highly suited to this task. However, the selection of an appropriate set of input variables during ANN development is important for obtaining high-quality models. This can be a difficult task when considering that many input variable selection (IVS) techniques fail to perform adequately due to an underlying assumption of linearity, or due to redundancy within the available data.This paper focuses on a recently proposed IVS algorithm, based on estimation of partial mutual information (PMI), which can overcome both of these issues and is considered highly suited to the development of ANN models. In particular, this paper addresses the computational efficiency and accuracy of the algorithm via the formulation and evaluation of alternative techniques for determining the significance of PMI values estimated during selection. Furthermore, this paper presents a rigorous assessment of the PMI-based algorithm and clearly demonstrates the superior performance of this non-linear IVS technique in comparison to linear correlation-based techniques.  相似文献   

12.
自动文本分类特征选择方法研究   总被引:4,自引:4,他引:4  
文本分类是指根据文本的内容将大量的文本归到一个或多个类别的过程,文本表示技术是文本分类的核心技术之一,而特征选择又是文本表示技术的关键技术之一,对分类效果至关重要。文本特征选择是最大程度地识别和去除冗余信息,提高训练数据集质量的过程。对文本分类的特征选择方法,包括信息增益、互信息、X^2统计量、文档频率、低损降维和频率差法等做了详细介绍、分析、比较研究。  相似文献   

13.
基于互信息可信度的贝叶斯网络入侵检测研究   总被引:2,自引:0,他引:2  
传统贝叶斯入侵检测算法没有考虑不同属性和属性权值对入侵检测结果的影响,因此分类准确率不够高.针对传统贝叶斯入侵检测算法存在的不足,提出基于互信息可信度的贝叶斯网络入侵检测算法.在综合考虑网络入侵检测数据特点和传统贝叶斯分类算法优点的基础上,用互信息相对可信度进行特征选择,删除一些冗余属性,把互信息相对可信度作为权值引进贝叶斯分类算法中,得到优化的贝叶斯网络入侵检测算法(MI-NB).实验结果表明,MI-NB算法能大大降低分类数据的维数,比传统贝叶斯入侵检测算法及改进算法有更高的分类准确率.  相似文献   

14.
This paper provides a unifying view of three discriminant linear feature extraction methods: linear discriminant analysis, heteroscedastic discriminant analysis and maximization of mutual information. We propose a model-independent reformulation of the criteria related to these three methods that stresses their similarities and elucidates their differences. Based on assumptions for the probability distribution of the classification data, we obtain sufficient conditions under which two or more of the above criteria coincide. It is shown that these conditions also suffice for Bayes optimality of the criteria. Our approach results in an information-theoretic derivation of linear discriminant analysis and heteroscedastic discriminant analysis. Finally, regarding linear discriminant analysis, we discuss its relation to multidimensional independent component analysis and derive suboptimality bounds based on information theory.  相似文献   

15.
随着网络信息的迅猛发展,信息处理已经成为人们获取有用信息不可缺少的工具,文本自动分类系统是信息处理的重要研究方向.对文本分类关键技术中的特征选择算法进行了探讨,并结合网页特性,对特征权重算法及互信息算法进行了改进.实验结果证明,改进算法是可行的.  相似文献   

16.
利用互信息确定延迟时间的新算法   总被引:1,自引:0,他引:1  
对比研究混沌时间序列相空间重构中对延迟时间选取的各种算法,根据混沌吸引子所具有的伪周期性与各态历经的性质,提出采用逐步细化的方法寻找系统互信息函数的第一局部最小值,作为最佳的延迟时间。克服了传统互信息函数计算繁琐、难以编程实现的缺点,兼具精确性与高效性。通过Risssler和Lorenz系统的数值仿真结果验证了算法的可靠性。  相似文献   

17.
A new mutual information based algorithm is introduced for term selection in spatio-temporal models. A generalised cross validation procedure is also introduced for model length determination and examples based on cellular automata, coupled map lattice and partial differential equations are described.  相似文献   

18.
分析了传统信息增益(IG)特征选择方法忽略了特征项在类间、类内分布信息的缺点,引入类内分散度、类间集中度等因素,区分与类强相关的特征;针对传统信息增益(IG)特征选择方法没有很好组合正相关特征和负相关特征的问题,引入比例因子来平衡特征出现和不出现时的信息量,降低在不平衡语料集上负相关特征的比例,提高分类效果.通过实验证明了改进的信息增益特征选择方法的有效性和可行性.  相似文献   

19.
针对动态话题追踪模型高误报率的现象,提出了动态追踪中的误报检测来判断追踪到的相关报道是否误报,进而降低动态模型的误报率。考虑到新报道是否和话题相关,除了依据两者的相似度外,还涉及时间距离、差值关系、分布关系、追踪到的报道和话题核心报道的相似度四方面内容,给出了误报检测因子计算式。实验采用TDT4测试集合和DET曲线进行评测,通过反复实验获得了误报检测因子δ的阈值,与基于信念网络的动态话题追踪模型相比,使用误报检测后模型的最优(Cdet)norm降低了5.032%。  相似文献   

20.
采取了3种必要的措施提高了聚类质量:考虑到各维数据特征属性对聚类效果影响不同,采用了基于统计方法的维度加权的方法进行特征选择;对于和声搜索算法的调音概率进行了改进,将改进的和声搜索算法和模糊聚类相结合用于快速寻找最优的聚类中心;循环测试各种中心数情况下的聚类质量以获得最佳的类中心数。该算法被应用于并行计算性能分析中,用于识别并行程序运行时各处理器运行性能瓶颈的类别。实验结果表明该算法较其他算法更优,这样的性能分析方法可以提高并行程序的运行效率。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号