Similar Literature
 Found 20 similar documents (search time: 12 ms)
1.
DNA microarray analysis is a very active area of research in the molecular diagnosis of cancer. Microarray data comprise many thousands of features but only tens to hundreds of instances, which makes analysis and diagnosis very complex. Gene (feature) selection therefore becomes an essential task in data classification. In this paper, we propose a complete cancer diagnostic process based on kernel-based learning and feature selection. First, support vector machine recursive feature elimination (SVM-RFE) is used to prefilter the genes. Second, SVM-RFE is enhanced with binary dragonfly (BDF), a recently developed metaheuristic that has never been benchmarked in the context of feature selection. The objective function is the average classification accuracy of three kernel-based learning methods. We conducted a series of experiments on six microarray datasets often used in the literature. Experimental results demonstrate that this approach is efficient and provides higher classification accuracy using a reduced number of genes.
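
The recursive elimination loop behind RFE can be sketched in a few lines. This is only an illustration under stated assumptions: a plain perceptron stands in for the support vector machine, the squared weight is used as the ranking criterion, the binary dragonfly stage is not shown, and the function names and toy data are hypothetical rather than taken from the paper.

```python
def train_perceptron(rows, labels, epochs=20):
    """Train a plain perceptron (labels are +1/-1) and return its weight vector.

    Assumption: the perceptron is a stand-in for the linear SVM used by SVM-RFE.
    """
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for row, label in zip(rows, labels):
            activation = bias + sum(w * x for w, x in zip(weights, row))
            if label * activation <= 0:  # misclassified: move the boundary
                weights = [w + label * x for w, x in zip(weights, row)]
                bias += label
    return weights


def recursive_feature_elimination(rows, labels, n_keep):
    """Repeatedly refit the model and drop the feature with the smallest squared weight."""
    active = list(range(len(rows[0])))  # indices of the features still in play
    while len(active) > n_keep:
        reduced = [[row[i] for i in active] for row in rows]
        weights = train_perceptron(reduced, labels)
        weakest = min(range(len(active)), key=lambda pos: weights[pos] ** 2)
        del active[weakest]
    return active


# Hypothetical toy data: feature 0 carries the class signal, feature 1 is
# small noise, and feature 2 is constant.
rows = [[2.0, 0.1, 0.0], [1.5, -0.1, 0.0], [-2.0, 0.1, 0.0], [-1.5, -0.1, 0.0]]
labels = [1, 1, -1, -1]
kept = recursive_feature_elimination(rows, labels, n_keep=1)
```

On this toy data the constant feature is dropped first and the noisy one second, so only the informative feature remains. Real microarray data would need a proper SVM and a step size larger than one feature per round.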

2.
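Research on phishing website identification increasingly demands higher recognition speed, so a fast phishing website identification method based on a hybrid feature selection model is proposed. The hybrid feature selection model consists of three main parts: initial feature selection, secondary feature selection, and classification. The model is built using a combination of information gain and the chi-square test together with a random-forest-based recursive feature elimination algorithm, and it uses a distribution function and gradient to obtain the optimal cutoff threshold, yielding...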
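
Information gain, one of the filter criteria named in this abstract, can be computed directly from label counts. The sketch below covers only that one criterion; the chi-square test, the random-forest-based elimination, and the threshold search are not shown, and the feature names and data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the entropy that remains after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(group) / total * entropy(group) for group in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: 1 = phishing site, 0 = legitimate site.
labels = [1, 1, 1, 0, 0, 0]
features = {
    "has_ip_in_url": [1, 1, 1, 0, 0, 0],  # separates the two classes perfectly
    "uses_https": [1, 0, 1, 0, 1, 0],     # only weakly related to the class
}
ranking = sorted(features, key=lambda name: information_gain(features[name], labels), reverse=True)
```

Features would then be kept or discarded according to a cutoff on the score; how the paper derives that cutoff is not recoverable from the truncated abstract.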

3.
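Click fraud has been one of the most common forms of cybercrime in recent years, and the Internet advertising industry suffers heavy losses from it every year. To detect fraudulent clicks effectively among massive numbers of clicks, a variety of features that fully combine the relationship between ad clicks and time attributes are constructed, and an ensemble learning framework for click fraud detection, the CAT-RFE ensemble learning framework, is proposed. The framework comprises three parts: a base classifier, recursive feature elimination (RFE), and voting ensemble learning. CatBoost (categorical boosting), a gradient boosting model suited to categorical features, serves as the base classifier; RFE is a feature selection method based on a greedy strategy that can pick good feature combinations from multiple groups of features; and voting ensemble learning combines the results of multiple base classifiers by voting. The framework uses CatBoost and RFE to obtain several good feature combinations in the feature space, then integrates the results of training under these feature combinations through voting to obtain the ensemble click fraud detection result. Because the framework uses the same base classifier and ensemble learning method throughout, it overcomes both the poor ensemble results caused by very different classifiers constraining one another and the tendency of RFE to become trapped in local optima when selecting features, giving it better detection capability. Performance evaluation and comparative experiments on a real-world Internet click fraud dataset show that the detection capability of the CAT-RFE ensemble learning framework exceeds that of the CatBoost model, the combined CatBoost and RFE model, and other machine learning models, demonstrating that the framework is competitive. The framework offers a feasible solution for click fraud detection in Internet advertising.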
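
The voting step of such a framework can be illustrated with hard (majority) voting over the predictions of several base classifiers. This is a minimal sketch, assuming hard voting and three already-trained models; the abstract does not say whether hard or soft voting is used, and the prediction lists are hypothetical.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label list by hard (majority) voting."""
    combined = []
    for sample_votes in zip(*predictions_per_model):
        # most_common(1) returns the label with the highest count;
        # ties go to the label that was seen first.
        combined.append(Counter(sample_votes).most_common(1)[0][0])
    return combined


# Hypothetical outputs of three base classifiers, each trained on a different
# feature subset (1 = fraudulent click, 0 = normal click).
model_a = [1, 0, 1, 0]
model_b = [1, 1, 0, 0]
model_c = [0, 1, 1, 0]
ensemble = majority_vote([model_a, model_b, model_c])
```

Each sample receives the label that at least two of the three models agree on, so a single model that was misled by its feature subset is outvoted.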

4.
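The discernibility of feature subsets (DFS) criterion does not account for the influence of feature measurement scales on the discriminative ability of feature subsets. To address this defect, the coefficient of variation is introduced and the generalized discernibility of feature subsets (GDFS) criterion is proposed. Combining four search strategies (sequential forward, sequential backward, sequential forward floating, and sequential backward floating) with the extreme learning machine as the classifier yields four hybrid feature selection algorithms. Experiments on UCI datasets and gene datasets, together with experimental comparisons against DFS, Relief, DRJMIM, mRMR, LLE Score, AVC, SVM-RFE, VMInaive, AMID, AMID-DWSFS, CFR, and FSSC-SD and statistical significance tests, show that the proposed GDFS outperforms DFS and selects feature subsets with better classification ability.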
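
Why a coefficient of variation helps with measurement scales can be seen in a small experiment: variance changes when the unit of a feature changes, while the coefficient of variation (standard deviation divided by the mean) does not. The sketch shows only this property; how GDFS incorporates the coefficient into its criterion is not given in the abstract, and the data are hypothetical.

```python
from statistics import mean, pstdev, pvariance


def coefficient_of_variation(values):
    """Population standard deviation divided by the absolute mean (dimensionless)."""
    return pstdev(values) / abs(mean(values))


heights_m = [1.60, 1.70, 1.80]
heights_cm = [160.0, 170.0, 180.0]  # the same measurements in a different unit

# Variance grows by a factor of 100**2 when metres become centimetres ...
variance_ratio = pvariance(heights_cm) / pvariance(heights_m)
# ... while the coefficient of variation is unchanged.
cv_m = coefficient_of_variation(heights_m)
cv_cm = coefficient_of_variation(heights_cm)
```

A criterion built on raw variances would therefore favor whichever feature happens to be recorded in the smaller unit, which is the kind of scale dependence the abstract describes.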

5.
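How to perform feature selection on semi-supervised datasets using incomplete supervision information has become a research hotspot in pattern recognition and machine learning. To help researchers systematically understand the current state and development trends of semi-supervised feature selection, this paper surveys semi-supervised feature selection methods. It first discusses the taxonomy of these methods, dividing them by theoretical basis into graph-based methods, pseudo-label-based methods, support-vector-machine-based methods, and other methods; it then describes and compares typical methods in each category in detail, summarizes popular applications of semi-supervised feature selection, and finally looks ahead to future research directions.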

6.
Tabular knowledge-based systems are known to be extremely versatile for verification and validation of knowledge bases. However, a major disadvantage of these systems is the combinatorial explosion that accompanies the addition of new attributes or condition entries in the table. One means of alleviating this problem in tabular knowledge-based systems is modularization, which is the process of breaking a big comprehensive table into smaller tables that are easy to deal with. In this study, we propose and illustrate another means of dealing with this problem through the use of feature selection methodology. The proposed method can be used synergistically with modularization to alleviate problems associated with combinatorial explosion in tabular knowledge bases.

7.
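A Survey of Feature Dimensionality Reduction for High-Dimensional Data   Cited 6 times in total (self-citations: 2; citations by others: 6)
胡洁 (Hu Jie), 《计算机应用研究》 (Application Research of Computers), 2008, 25(9): 2601-2606
Feature dimensionality reduction can effectively improve the efficiency of machine learning; the search process over feature subsets and the feature evaluation criterion are its two core problems. This paper reviews international research on feature dimensionality reduction and summarizes and proposes a fairly complete definition of a feature dimensionality reduction model. By listing various solutions to important problems in feature dimensionality reduction, it compares the characteristics, strengths, and weaknesses of the algorithms, and discusses unresolved problems and development trends in this direction.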

8.
In this paper we use genetic programming for changing the representation of the input data for machine learners. In particular, the topic of interest here is feature construction in the learning-from-examples paradigm, where new features are built based on the original set of attributes. The paper first introduces the general framework for GP-based feature construction. Then, an extended approach is proposed where the useful components of representation (features) are preserved during an evolutionary run, as opposed to the standard approach where valuable features are often lost during search. Finally, we present and discuss the results of an extensive computational experiment carried out on several reference data sets. The outcomes show that classifiers induced using the representation enriched by the GP-constructed features provide better classification accuracy on the test set. In particular, the extended approach proposed in the paper was able to outperform the standard approach on some benchmark problems at a statistically significant level.

9.
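Coal mine water inrush is influenced by many strongly correlated factors, which degrades model prediction accuracy, and data collection is laborious and costly; how to select features scientifically to improve prediction accuracy is therefore the focus of this paper. Stability selection is first used to choose the 7 most important of the 22 known influencing factors. Three typical machine learning classification models (random forest, neural network, and support vector machine) are then built to predict on the data before and after feature selection. Experimental results show that after feature selection the prediction models are very stable, prediction accuracy reaches up to 100%, and the cost of collecting sample data is reduced.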

10.
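Improvement for the Multivariate Collinearity Problem in Support Vector Machine Classification Algorithms   Cited 3 times in total (self-citations: 2; citations by others: 3)
Combining principal factor extraction by kernel principal component analysis with the classification mechanism of support vector machines, a combined modeling algorithm is proposed. Using kernel principal component analysis as a preprocessor merges collinear multivariate variables into a few principal factors with essentially no loss of useful information. Classification modeling and prediction are then performed with a support vector machine. This not only prevents collinear variables from harming the model but also reduces data dimensionality, lowering the complexity and computational cost of support vector machine classification. Finally, the trained model is evaluated experimentally, and an example illustrates the effectiveness of the proposed method.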

11.
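Numerical Attribute Reduction Algorithm Based on the Similarity Relation Rough Set Model   Cited 1 time in total (self-citations: 0; citations by others: 1)
吴敏 (Wu Min), 《计算机应用》 (Journal of Computer Applications), 2010, 30(1): 156-158
Numerical attribute data contain a large amount of noise, and classical rough set methods are easily disturbed by noise. To address this, an attribute measure is proposed that jointly evaluates the dissimilarity and similarity of attributes over the samples. Using this measure as a heuristic, a numerical attribute reduction algorithm is designed within the similarity relation rough set framework and extended to classical rough sets. Comparisons with commonly used reduction algorithms on a license plate character set and the UCI handwritten digit set show that the reduced attributes produced by this method yield rule sets with fewer rules and good classification ability.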

12.
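Accurate network traffic classification is both an important foundation for much network research and a research hotspot in network measurement. Six classification algorithms based on flow features are compared and analyzed. Experimental results show that, with feature selection applied, the SVM algorithm achieves high overall accuracy and good computational performance, making it suitable for network traffic classification.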

13.
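Automatic text classification is the technique of assigning texts to one or more categories according to a given strategy. Text classification is the foundation of text mining, and feature selection is the core of text classification. This paper analyzes the shortcoming of earlier feature selection methods, in which an excessive number of features leads to long classification times and low accuracy, and proposes a rough-set-based feature selection method that selects features according to their importance in text classification. Finally, experiments validate the algorithm and show that the method is feasible.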

14.
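With the continued spread of installment payment on e-commerce platforms and of P2P lending platforms, mining latent user models from massive user credit data and assessing the credit risk of unknown users, so as to reduce the risk of lending, has become a mainstream research topic. Because existing methods cannot handle high-dimensional credit data efficiently, a series of data preprocessing methods and XGBFS (XGBoost Feature Selection), a feature selection method based on the embedded approach, are used to reduce the dimensionality of user credit data and train an XGBoost assessment model, ultimately achieving user credit risk assessment. Experiments show that, compared with existing methods, this method can select important attributes from high-dimensional data, and the classifier performs notably well in precision, recall, and other measures.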

15.
This paper presents a novel approach for classifying sleep apneas into one of three basic types: obstructive, central, and mixed. The goal is to overcome the problems encountered in previous work and improve classification accuracy. The proposed model uses a new classification approach based on the characteristics that each type of apnea presents in different segments of the signal. The model is based on error-correcting output codes and is formed by a combination of artificial neural network experts whose inputs are the coefficients obtained by a discrete wavelet decomposition applied to the raw samples of the apnea in the thoracic effort signal. The input coefficients received by each network were determined by a feature selection method (support vector machine recursive feature elimination). To train and test the systems, 120 events from six different patients were used. The true error rate was estimated using 10-fold cross-validation. The results presented in this work were averaged over 10 different simulations, and a multiple comparison procedure was used for model selection. The mean test accuracy obtained was 90.27% ± 0.79, and the values for each apnea class were 94.62% (obstructive), 95.47% (central), and 90.45% (mixed). To the authors' knowledge, the proposed classifier surpasses all previous results.
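
The 10-fold cross-validation protocol mentioned here amounts to partitioning the 120 events into ten disjoint test folds and training on the remaining events each time. The sketch below is a minimal, unshuffled version with a hypothetical function name; the abstract does not say whether the folds were shuffled, stratified by class, or grouped by patient.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k (train, test) index pairs with near-equal test folds."""
    # The first (n_samples % k) folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds


# 120 events and 10 folds, as in the abstract: each test fold holds 12 events.
folds = k_fold_indices(120, 10)
```

The error estimate is then the average test error over the ten folds. With only six patients, grouping folds by patient would be a stricter protocol, though that is a design suggestion and not something the abstract reports.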

16.
High sensitivity to irrelevant features is arguably the main shortcoming of simple lazy learners. In response to it, many feature selection methods have been proposed, including forward sequential selection (FSS) and backward sequential selection (BSS). Although they often produce substantial improvements in accuracy, these methods select the same set of relevant features everywhere in the instance space, and thus represent only a partial solution to the problem. In general, some features will be relevant only in some parts of the space; deleting them may hurt accuracy in those parts, but selecting them will have the same effect in parts where they are irrelevant. This article introduces RC, a new feature selection algorithm that uses a clustering-like approach to select sets of locally relevant features (i.e., the features it selects may vary from one instance to another). Experiments in a large number of domains from the UCI repository show that RC almost always improves accuracy with respect to FSS and BSS, often with high significance. A study using artificial domains confirms the hypothesis that this difference in performance is due to RC's context sensitivity, and also suggests conditions where this sensitivity will and will not be an advantage. Another feature of RC is that it is faster than FSS and BSS, often by an order of magnitude or more.
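
Forward sequential selection, the baseline this abstract compares against, can be sketched with a 1-nearest-neighbour classifier as the lazy learner. This is an illustration of FSS only, not of RC; the stopping rule (stop when accuracy no longer improves), the leave-one-out evaluation, and the toy data are assumptions.

```python
def loo_1nn_accuracy(rows, labels, feature_idx):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier restricted to feature_idx."""
    correct = 0
    for i, row in enumerate(rows):
        best_dist, best_label = None, None
        for j, other in enumerate(rows):
            if i == j:
                continue  # never use the held-out sample as its own neighbour
            dist = sum((row[k] - other[k]) ** 2 for k in feature_idx)
            if best_dist is None or dist < best_dist:
                best_dist, best_label = dist, labels[j]
        correct += best_label == labels[i]
    return correct / len(rows)


def forward_sequential_selection(rows, labels):
    """Greedily add the feature that most improves accuracy; stop when nothing helps."""
    selected, best_score = [], 0.0
    remaining = list(range(len(rows[0])))
    while remaining:
        score, feature = max(
            (loo_1nn_accuracy(rows, labels, selected + [f]), f) for f in remaining
        )
        if score <= best_score:
            break
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score


# Hypothetical toy data: feature 0 separates the classes, feature 1 is large-scale noise.
rows = [[0.0, 5.0], [0.1, 1.0], [0.2, 9.0], [1.0, 1.2], [1.1, 8.8], [1.2, 5.1]]
labels = [0, 0, 0, 1, 1, 1]
selected, score = forward_sequential_selection(rows, labels)
```

The procedure returns one feature set for the whole instance space, which is exactly the limitation the abstract attributes to FSS and BSS.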

17.
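In machine learning research, feature selection is important for improving the performance and efficiency of learning machines. The continual proposal and application of feature selection algorithms has greatly helped research in many fields, but current algorithms generally have highly independent implementations and poor extensibility, which makes it hard for users to compare and evaluate multiple algorithms in a uniform way and makes replacing or extending algorithms laborious. Guided by object-oriented design and based on the Strategy design pattern, this paper proposes the design of FSLS (Feature Selection Library based on Strategy-pattern), a feature selection algorithm library. It wraps commonly used feature selection algorithms according to the Strategy pattern, making them convenient for users of machine learning algorithms while keeping the library itself highly extensible.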
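
The Strategy pattern applied to feature selection can be sketched as a shared interface with interchangeable algorithm classes. All class names, the `select` signature, and the two toy scoring rules below are hypothetical; they illustrate the pattern and are not the actual FSLS design.

```python
from abc import ABC, abstractmethod
from statistics import pvariance


class SelectionStrategy(ABC):
    """Common interface that every feature selection algorithm implements."""

    @abstractmethod
    def select(self, columns, k):
        """Return the names of the k features to keep; columns maps name -> values."""


class VarianceStrategy(SelectionStrategy):
    """Keep the k features with the largest variance."""

    def select(self, columns, k):
        return sorted(columns, key=lambda name: pvariance(columns[name]), reverse=True)[:k]


class RangeStrategy(SelectionStrategy):
    """Keep the k features with the largest value range."""

    def select(self, columns, k):
        def value_range(name):
            return max(columns[name]) - min(columns[name])

        return sorted(columns, key=value_range, reverse=True)[:k]


class FeatureSelector:
    """Context object: client code depends on this class, not on a concrete algorithm."""

    def __init__(self, strategy):
        self.strategy = strategy

    def run(self, columns, k):
        return self.strategy.select(columns, k)


columns = {
    "flat": [1, 1, 1, 1, 1, 1],
    "spiky": [5, 5, 5, 5, 5, 11],  # largest range, moderate variance
    "wavy": [0, 5, 0, 5, 0, 5],    # largest variance, smaller range
}
selector = FeatureSelector(VarianceStrategy())
by_variance = selector.run(columns, 1)
selector.strategy = RangeStrategy()  # swap the algorithm without touching client code
by_range = selector.run(columns, 1)
```

Adding a new algorithm means adding one subclass of the interface, and comparing algorithms means running the same client code with a different strategy object, which matches the extensibility and uniform-evaluation goals the abstract describes.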

18.
This study examines the available literature on the effects of serious games on people with intellectual disabilities or autism spectrum disorder. The studies were categorized based on the skill limitations that they address. Fifty-four studies were selected from different data sources. These studies address limitations in intellectual functioning and adaptive behaviour. The results showed that the majority of studies on the effects of serious games for people with intellectual disabilities or autism spectrum disorder reported a positive impact. Also, most studies for people with autism aim to improve social and communication skills, whereas conceptual and cognitive skills were mainly observed in studies for people with intellectual disabilities. Although this study covers serious games on all platforms or delivery systems, the overwhelming majority of the presented studies involve computer serious games. Computer-assisted learning through serious games is considered quite promising for people with intellectual disabilities or autism spectrum disorder.

19.
As Machine Learning (ML) is widely applied in security-critical fields, the requirements for the interpretability of ML also increase. Interpretability aims to help people understand the internal operating principles and decision principles of models, so as to improve the models' credibility. However, research on the interpretability of ML models such as Random Forest (RF) is still in its infancy. Considering the strict and standardized characteristics of formal methods and their wide application in the field of ML in recent years, this study leverages formal methods and logical reasoning to develop an ML interpretability method for interpreting the predictions of RF. Specifically, the decision-making process of RF is encoded into a first-order logical formula, and a Minimal Unsatisfiable Core (MUC) is taken as the core. Local interpretation of feature importance and counterfactual sample generation methods are provided. Experimental results on several public datasets illustrate the high quality of the proposed feature importance measurement, and the counterfactual sample generation method outperforms the existing state-of-the-art methods. Moreover, from the perspective of user-friendliness, a user report can be generated according to the analysis results of counterfactual samples, which can provide suggestions for users to improve their situation in real-life applications.

20.
Autism spectrum disorders (ASD) are a diverse group of conditions characterized by difficulty with social interaction and communication. ASD is expected to be a high-risk disease. Recent studies have focused on diagnosis based on sociodemographic and family characteristic factors. Diagnosis models, primarily based on machine learning methods, have been developed to ease the detection of autism. However, they neglected the importance of ASD features in a training dataset, especially because some features contribute to the processed data at different levels and are more relevant to the classification information than others. Such limitations are addressed with preprocessing techniques when constructing the machine learning model, but the role of the physician's experience in assessing feature contributions remains limited. For certain autism datasets, however, the relevance of sociodemographic and family characteristic features to the given class labels should be considered. Accordingly, this study developed a new machine learning model for the diagnosis of ASD based on multi-criteria decision-making (MCDM). Across three methodology phases, the model combines two representative theories, namely, MCDM and machine learning. The identification phase for the imbalanced ASD dataset, together with preprocessing stages that impute missing values, select sociodemographic and family characteristic features, and apply a data imbalance approach, resulted in a balanced ASD dataset of 107,573 cases. The development phase for the new model was achieved by the proposed complex T-spherical fuzzy-weighted zero-inconsistency (CT-SFWZIC) method. CT-SFWZIC was developed based on a new fuzzy set (i.e., complex T-spherical fuzzy) for weighting the affected features, and then applied for training and testing the machine learning model considering various complex T-spherical fuzzy membership functions (i.e., T = 1, 2, 3, 5, 7, and 10).
The results obtained from a 10-fold cross-validation test for all T values using nine machine learning classifiers were measured under seven evaluation metrics, namely AUC, accuracy, F1, precision, recall, training time (s), and test time (s). Performance evaluation results reveal that AdaBoost is the best machine learning algorithm for all T values on all metrics and can be used to improve ASD diagnosis based on the physician's assessment. Under the most extreme evaluation metric, which is accuracy, the AdaBoost classifiers for T = 1, 2, 3, 5, 7, and 10 obtained 0.99948, 0.99934, 0.99930, 0.99939, 0.99910, and 0.99930, respectively.
