Similar articles
1.

This article describes the development of a machine vision application for automatic process assessment by image analysis and machine learning. The system is required to differentiate between the various stages of a mashing process and determine the termination point. A large library of 835 histogram, Haralick, and Gabor features was extracted from 275 training images. Three feature selection algorithms (wrapper, consistency filter, and correlation filter) were then applied to the training data, resulting in feature sets of size 29, 15, and 11, respectively. A number of decision tree, rule induction, and nearest neighbor classification algorithms were then applied to the reduced data set. For discriminating seven stages of the mashing process, the highest accuracy obtained was 71.6%. For the binary problem of differentiating the finished state from all of the other states, the accuracy was 92.0%. This accuracy is good enough for deployment. The results indicate that using a large library of features and machine-learning methods for removing redundant features can significantly reduce development times for vision systems by eliminating the time-consuming manual search for the best discriminating features.

2.
Attribute selection is one of the important problems encountered in pattern recognition, machine learning, data mining, and bioinformatics. It refers to the problem of selecting those input attributes or features that are most effective for predicting the sample categories. In this regard, rough set theory has been shown to be successful for selecting relevant and nonredundant attributes from a given data set. However, classical rough sets are unable to handle real-valued noisy features. This problem can be addressed by fuzzy-rough sets, which are a generalization of classical rough sets. A feature selection method based on fuzzy-rough sets is presented here that maximizes both the relevance and the significance of the selected features. This paper also presents different feature evaluation criteria, such as dependency, relevance, redundancy, and significance, for the attribute selection task using fuzzy-rough sets. The performance of different rough set models is compared with that of some existing feature evaluation indices based on the predictive accuracy of the nearest neighbor rule, support vector machine, and decision tree. The effectiveness of the fuzzy-rough set based attribute selection method, along with a comparison with existing feature evaluation indices and different rough set models, is demonstrated on a set of benchmark and microarray gene expression data sets.

3.
The degree of malignancy in brain glioma is assessed based on magnetic resonance imaging (MRI) findings and clinical data before surgery. These data contain irrelevant features, as well as uncertainties and missing values. Rough set theory can deal with vagueness and uncertainty in data analysis, and can efficiently remove redundant information. In this paper, a rough set method is applied to predict the degree of malignancy. As feature selection can improve the classification accuracy effectively, rough set feature selection algorithms are employed to select features. The selected feature subsets are used to generate decision rules for the classification task. A rough set attribute reduction algorithm that employs a search method based on particle swarm optimization (PSO) is proposed in this paper and compared with other rough set reduction algorithms. Experimental results show that reducts found by the proposed algorithm are more efficient and can generate decision rules with better classification performance. The rough set rule-based method can achieve higher classification accuracy than other intelligent analysis methods such as neural networks, decision trees and a fuzzy rule extraction algorithm based on Fuzzy Min-Max Neural Networks (FRE-FMMNN). Moreover, the decision rules induced by the rough set rule induction algorithm can reveal regular and interpretable patterns in the relations between glioma MRI features and the degree of malignancy, which are helpful for medical experts.

4.
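The number of mobile malware samples is growing explosively, and new variants emerge continually. Ever-larger signature databases make it harder for security vendors to process malware samples, and traditional detection methods can no longer handle big data of software behavior samples promptly and effectively. Machine-learning-based mobile malware detection methods suffer from large numbers of features, low detection accuracy, and imbalanced data. To address these problems, this paper proposes a feature selection method based on mean and variance to reduce the features that do not contribute to classification, and implements an ensemble classification method based on different feature extraction techniques, including principal component analysis, the Karhunen-Loève transform, and independent component analysis, to improve detection accuracy. For the imbalanced software-sample data, the paper proposes a multi-level classification ensemble model based on decision trees. Experimental results show that all three proposed detection methods can effectively detect malware samples on the Android platform, improving accuracy by 6.41%, 3.96%, and 3.36%, respectively.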

5.
This paper presents a hybrid approach to medical decision support systems based on feature selection, fuzzy weighted pre-processing, and the artificial immune recognition system (AIRS). We used the heart disease and hepatitis disease datasets from the UCI machine learning database as medical datasets. AIRS has shown effective performance on several problems, such as machine learning benchmark problems and medical classification problems like breast cancer, diabetes, and liver disorders classification. The proposed approach consists of three stages. In the first stage, the dimensions of the heart disease and hepatitis disease datasets are reduced to 9 from 13 and 19, respectively, in the feature selection (FS) sub-program by means of the C4.5 decision tree algorithm (CBA program). In the second stage, the heart disease and hepatitis disease datasets are normalized to the range [0,1] and weighted via fuzzy weighted pre-processing. In the third stage, the weighted input values obtained from fuzzy weighted pre-processing are classified using the AIRS classifier system. The classification accuracies obtained by our system are 92.59% and 81.82% using a 50-50% training-test split for the heart disease and hepatitis disease datasets, respectively. With these results, the proposed method can be used in medical decision support systems.
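
The normalization step in the second stage can be illustrated with a minimal sketch. It assumes plain min-max scaling per feature, which the abstract does not spell out; the function name `min_max_normalize` and the sample values are hypothetical, and the fuzzy weighting itself is not reproduced here.

```python
def min_max_normalize(rows):
    """Scale each column of a list of numeric rows into the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # A constant column has no spread; map it to 0.0 to avoid dividing by zero.
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ])
    return scaled


# Hypothetical records with two features on very different scales.
records = [[120.0, 1.0], [140.0, 3.0], [160.0, 2.0]]
print(min_max_normalize(records))
```

After scaling, every feature contributes on the same scale, so later weighting or distance computations are not dominated by the feature with the largest raw values.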

6.
The aim of this paper is to propose a new hybrid data mining model based on a combination of various feature selection and ensemble learning classification algorithms, in order to support the decision making process. The model is built through several stages. In the first stage, the initial dataset is preprocessed and, apart from applying different preprocessing techniques, we paid great attention to feature selection. Five different feature selection algorithms were applied and their results, based on ROC and accuracy measures of a logistic regression algorithm, were combined using different voting types. We also proposed a new voting method, called if_any, that outperformed all other voting methods, as well as the results of any single feature selection algorithm. In the next stage, four different classification algorithms, including generalized linear model, support vector machine, naive Bayes and decision tree, were applied to the dataset obtained from the feature selection process. These classifiers were combined into eight different ensemble models using the soft voting method. Using a real dataset, the experimental results show that the hybrid model based on features selected by the if_any voting method and the ensemble GLM + DT model achieves the highest performance and outperforms all other ensemble and single classifier models.

7.
Feature selection is the process of choosing the relevant subset of features from a high-dimensional dataset to enhance the performance of the classifier. Much research has been carried out on feature selection. Algorithms such as Naïve Bayes (NB), decision tree, and genetic algorithm are applied to high-dimensional datasets to select the relevant features and also to increase the computational speed. The proposed model presents a solution for the selection of features using ensemble classifier algorithms. The proposed algorithm is a combination of minimum redundancy and maximum relevance (mRMR) and the forest optimization algorithm (FOA). Ensemble-based algorithms such as support vector machine (SVM), K-nearest neighbor (KNN), and NB are further used to enhance the performance of the classifier algorithm. The mRMR-FOA is used to select the relevant features from the various datasets, and a 21% to 24% improvement is recorded in the feature selection. The ensemble classifier algorithms further improve the performance of the algorithm and provide an accuracy of 96%.

8.
Decision trees have been widely used in data mining and machine learning as a comprehensible knowledge representation. While ant colony optimization (ACO) algorithms have been successfully applied to extract classification rules, decision tree induction with ACO algorithms remains an almost unexplored research area. In this paper we propose a novel ACO algorithm to induce decision trees, combining commonly used strategies from both traditional decision tree induction algorithms and ACO. The proposed algorithm is compared against three decision tree induction algorithms, namely C4.5, CART and cACDT, on 22 publicly available data sets. The results show that the predictive accuracy of the proposed algorithm is statistically significantly higher than the accuracy of both C4.5 and CART, which are well-known conventional algorithms for decision tree induction, and the accuracy of the ACO-based cACDT decision tree algorithm.

9.
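张晓龙, 骆名剑. 《计算机应用》 (Journal of Computer Applications), 2005, 25(9): 1986-1988
Decision trees are a basic learning method in machine learning and data mining. This paper analyzes the C4.5 algorithm and its shortcomings and proposes a decision tree pruning algorithm that uses the information content of rules as the pruning criterion. Experimental results show that the method improves the predictive accuracy of the final model and copes well with noise in the data.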

10.
CAIM discretization algorithm (cited 8 times in total: 0 self-citations, 8 by others)
The task of extracting knowledge from databases is quite often performed by machine learning algorithms. The majority of these algorithms can be applied only to data described by discrete numerical or nominal attributes (features). In the case of continuous attributes, there is a need for a discretization algorithm that transforms continuous attributes into discrete ones. We describe such an algorithm, called CAIM (class-attribute interdependence maximization), which is designed to work with supervised data. The goal of the CAIM algorithm is to maximize the class-attribute interdependence and to generate a (possibly) minimal number of discrete intervals. The algorithm does not require the user to predefine the number of intervals, as opposed to some other discretization algorithms. The tests performed using CAIM and six other state-of-the-art discretization algorithms show that discrete attributes generated by the CAIM algorithm almost always have the lowest number of intervals and the highest class-attribute interdependence. Two machine learning algorithms, the CLIP4 rule algorithm and the decision tree algorithm, are used to generate classification rules from data discretized by CAIM. For both the CLIP4 and decision tree algorithms, the accuracy of the generated rules is higher and the number of rules is lower for data discretized using the CAIM algorithm when compared to data discretized using the six other discretization algorithms. The highest classification accuracy was achieved for data sets discretized with the CAIM algorithm, as compared with the other six algorithms.
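
The abstract does not give the interdependence criterion itself. As a rough sketch, the snippet below assumes the form in which the CAIM criterion is commonly stated: for each interval, the squared count of the dominant class divided by the interval's total count, averaged over the intervals. The function name `caim_value` and the example counts are hypothetical, and the search over interval boundaries is not shown.

```python
def caim_value(quanta):
    """Compute the (assumed) CAIM criterion for a quanta matrix.

    quanta[i][r] is the number of samples of class i that fall into interval r.
    """
    n_intervals = len(quanta[0])
    total = 0.0
    for r in range(n_intervals):
        column = [row[r] for row in quanta]
        interval_count = sum(column)
        if interval_count:  # an empty interval contributes nothing
            total += max(column) ** 2 / interval_count
    # Dividing by the number of intervals favors discretizations with few intervals.
    return total / n_intervals


# Hypothetical two-class, two-interval discretization of 20 samples.
mixed = [[8, 1],
         [2, 9]]
pure = [[10, 0],
        [0, 10]]
print(caim_value(mixed), caim_value(pure))
```

Under this assumed form, a discretization whose intervals each contain a single class scores higher than one with mixed intervals, and adding intervals without improving purity lowers the score.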

11.
This work is motivated by the interest in forensic steganalysis, which is aimed at detecting the presence of secret messages transmitted through a subliminal channel. A critical part of the steganalyser design depends on the choice of stego-sensitive features and an efficient machine learning paradigm. The goals of this paper are: (1) to demonstrate that the higher-order statistics of the Hausdorff distance, a dissimilarity metric, offer potential ability to discriminate between clean and stego audio and (2) to achieve promising classification accuracy by realizing the proposed steganalyser with an evolving decision tree classifier. The stego-sensitive feature selection process is handled by the genetic algorithm (GA) component and the construction of the rule base is facilitated by the decision tree module. The objective function is designed to maximize the Precision and Recall measures of the classifier, thereby enhancing the detection accuracy of the system with low-dimensional and informative features. An extensive experimental evaluation of the proposed system was conducted on a database containing 4800 clean and stego audio files (generated by using six different embedding schemes), with the family of six GA decision trees. The observed detection rate of over 90%, a promising score for a blind steganalyser, shows that the proposed scheme, with the Hausdorff distance statistics as features and the evolving decision tree as classifier, is a state-of-the-art steganalyser that outperforms many of the previous steganalytic methods.

12.
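Anomaly detection systems play a vital role in cyberspace security, providing effective protection for networks. For complex network traffic data, a traditional single classifier often cannot achieve both high detection accuracy and strong generalization ability. In addition, anomaly detection models built on the full feature set are often disturbed by redundant features, which reduces detection efficiency and accuracy. To address these problems, this paper proposes a model that combines ensemble learning with feature selection based on average feature importance. Decision tree (DT), random forest (RF), and extra trees (ET) are chosen as base classifiers to build a voting ensemble model, and feature selection is performed using the average feature importance of the base classifiers, computed from the Gini index. Experimental evaluation on multiple data sets shows that the proposed ensemble model outperforms classical ensemble learning models and other well-known anomaly detection ensemble models. Moreover, the proposed feature selection method further improves the accuracy of the ensemble model by about 0.13% on average and reduces training time by about 30% on average.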
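
The idea of averaging feature importances across base classifiers can be sketched as follows. This is an illustration under assumptions: the importance vectors are hypothetical placeholders standing in for the Gini-based importances of the three base classifiers, the feature names are invented, and keeping the top `keep` features is an assumed selection rule that the abstract does not specify.

```python
def average_importance(importances_per_model):
    """Average the importance of each feature over all base classifiers."""
    n_models = len(importances_per_model)
    return [sum(values) / n_models for values in zip(*importances_per_model)]


def select_features(names, importances_per_model, keep):
    """Return the `keep` feature names with the highest average importance."""
    averaged = average_importance(importances_per_model)
    ranked = sorted(zip(names, averaged), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:keep]]


# Hypothetical importances reported by three base classifiers (DT, RF, ET).
feature_names = ["duration", "src_bytes", "dst_bytes"]
importances = [
    [0.6, 0.3, 0.1],  # decision tree
    [0.5, 0.1, 0.4],  # random forest
    [0.7, 0.1, 0.2],  # extra trees
]
print(select_features(feature_names, importances, keep=2))
```

Averaging smooths out the preferences of any single model, so a feature ranked highly by only one classifier is less likely to be retained.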

13.
The decision tree method has grown fast in the past two decades and its performance in classification is promising. Tree-based ensemble algorithms have been used to improve the performance of an individual tree. In this study, we compared four basic ensemble methods, that is, bagging tree, random forest, AdaBoost tree and AdaBoost random tree, in terms of tree size, ensemble size, band selection (BS), random feature selection, classification accuracy and efficiency in ecological zone classification in Clark County, Nevada, using multi-temporal multi-source remote-sensing data. Furthermore, two BS schemes based on the feature importance of the bagging tree and AdaBoost tree were also considered and compared. We conclude that random forest or AdaBoost random tree can achieve accuracies at least as high as bagging tree or AdaBoost tree with higher efficiency; and although bagging tree and random forest can be more efficient, AdaBoost tree and AdaBoost random tree can provide significantly higher accuracy. All ensemble methods provided significantly higher accuracies than the single decision tree. Finally, our results showed that the classification accuracy could increase dramatically by combining multi-temporal and multi-source data sets.

14.
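Analog circuit fault identification method based on wavelet analysis and hierarchical decision-making (cited 1 time in total: 1 self-citation, 0 by others)
To address the classification overlap that tends to arise when diagnosing analog circuits with many fault modes, a fault identification method based on wavelet analysis and hierarchical decision-making is proposed. First, two kinds of fault features are extracted from the circuit using the wavelet transform, the fuzzy C-means algorithm is used to analyze the distribution of the fault feature data, and the fault subclasses are partitioned in the form of a decision tree. Optimized selection of the features at each decision tree node maximizes the separation between fault subclasses. Finally, a hierarchical fault decision system is built following the decision tree structure, with support vector machines and neural networks, respectively, serving as the tree-node classifiers, which effectively improves the fault recognition rate. Applied to fault identification in a high-pass filter circuit, the method achieves an accuracy above 99% and better diagnostic performance than the classical support vector machine multi-class method.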

15.
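Diabetes is a chronic metabolic disease that cannot be cured; early detection and early treatment can reduce its risk. Machine learning models can predict the disease effectively and support diagnosis and treatment. To this end, a GA_Xgboost model is proposed for diabetes risk prediction. Built on the Xgboost algorithm, it uses the good global search capability of the genetic algorithm to compensate for the slow convergence of Xgboost, and an elitist selection strategy ensures the best result in each round of evolution. Experimental results show that the GA_Xgboost model achieves a mean squared error of 0.606 in diabetes prediction, with better prediction accuracy than linear regression, decision tree, support vector machine, and neural network algorithms; parameter tuning takes 152 s, less than the grid search and random walk methods.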

16.
Support vector machine (SVM) is a state-of-the-art classification tool with good accuracy due to its ability to generate nonlinear models. However, the nonlinear models generated are typically regarded as incomprehensible black-box models. This lack of explanatory ability is a serious problem for practical SVM applications which require comprehensibility. Therefore, this study applies a C5 decision tree (DT) to extract rules from SVM results. In addition, a metaheuristic algorithm is employed for feature selection. Both SVM and C5 DT require expensive computation, and applying these two algorithms simultaneously to high-dimensional data increases the computational cost. This study therefore applies the artificial bee colony optimization (ABC) algorithm to select the important features. The proposed algorithm, ABC–SVM–DT, is applied to extract comprehensible rules from SVMs. The ABC algorithm is used to implement feature selection and parameter optimization before SVM–DT. The proposed algorithm is evaluated on eight datasets to demonstrate its effectiveness. The results show that the classification accuracy and complexity of the final decision tree can be improved simultaneously by the proposed ABC–SVM–DT algorithm, compared with the genetic algorithm and the particle swarm optimization algorithm.

17.
In this work, a data set describing phone interactions arising in a multichannel and multiskill contact centre is considered with the aim of classifying inbound sessions into those that will eventually be managed by an agent and those that will instead be abandoned beforehand. More precisely, the goal of the work is to extract interpretable pieces of information that allow us to predict whether a user will or will not abandon a call, which may turn out to be very useful for contact centre management. To this end, the performance of two well‐known, state‐of‐the‐art evolutionary algorithms for feature selection (evolutionary nondominated radial slots based algorithm and nondominated sorted genetic algorithm) is compared for the task of feature selection, under the criteria of accuracy and cardinality of the selection, as well as for the task of fuzzy rule extraction, under the criteria of interpretability, accuracy, and hypervolume test. The best obtained fuzzy classifier, chosen after a decision making process, is validated and interpreted by domain experts.

18.
In this paper, we propose a new feature selection method, called class dependency based feature selection, for dimensionality reduction of a macular disease dataset derived from pattern electroretinography (PERG) signals. To diagnose macular disease, we use class dependency based feature selection as the feature selection step, fuzzy weighted pre-processing as the weighting step, and a decision tree classifier for decision making. The proposed system consists of three parts. First, the number of features in the macular disease dataset is reduced from 63 to 9 using class dependency based feature selection, which was first developed by us. Second, the 9-feature macular disease dataset is weighted using fuzzy weighted pre-processing. Finally, a decision tree classifier is applied to the PERG signals to distinguish between healthy and diseased eyes (macular disease). The combination of class dependency based feature selection, fuzzy weighted pre-processing and decision tree classifier reached classification accuracies of 96.22%, 96.27% and 96.30% using 5-, 10- and 15-fold cross-validation, respectively. The results confirm that the medical decision making system based on class dependency based feature selection, fuzzy weighted pre-processing and a decision tree classifier has potential for detecting macular disease, and that the proposed method could serve as the basis for a new intelligent diagnostic assistance system.

19.
Feature selection and feature weighting are useful techniques for improving the classification accuracy of the K-nearest-neighbor (K-NN) rule. The term feature selection refers to algorithms that select the best subset of the input feature set. In feature weighting, each feature is multiplied by a weight value proportional to the ability of the feature to distinguish pattern classes. In this paper, a novel hybrid approach based on the Tabu Search (TS) heuristic is proposed for simultaneous feature selection and feature weighting for the K-NN rule. The proposed TS heuristic in combination with the K-NN classifier is compared with several classifiers on various available data sets. The results indicate a significant improvement in classification accuracy. The proposed TS heuristic is also compared with various feature selection algorithms. The experiments revealed that the proposed hybrid TS heuristic is superior to both simple TS and sequential search algorithms. We also present results for the classification of prostate cancer using multispectral images, an important problem in biomedicine.
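
How feature weights enter the K-NN rule can be shown with a small sketch. It assumes Euclidean distance and majority voting, and the weights are supplied by hand rather than found by Tabu Search; the function name `weighted_knn_predict` and the data are hypothetical. A weight of 0 removes a feature entirely, which is how selection and weighting can share one representation.

```python
import math
from collections import Counter


def weighted_knn_predict(train, labels, weights, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples.

    Each feature difference is multiplied by its weight before the
    Euclidean distance is computed.
    """
    def distance(a, b):
        return math.sqrt(sum((w * (x - y)) ** 2 for w, x, y in zip(weights, a, b)))

    nearest = sorted(range(len(train)), key=lambda i: distance(train[i], query))[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical data: feature 0 separates the classes, feature 1 is large-scale noise.
train = [[1.0, 50.0], [1.2, 10.0], [3.0, 48.0], [3.1, 12.0]]
labels = ["A", "A", "B", "B"]
query = [1.1, 49.0]

print(weighted_knn_predict(train, labels, [1.0, 1.0], query))  # noise dominates the distance
print(weighted_knn_predict(train, labels, [1.0, 0.0], query))  # noisy feature switched off
```

In this toy data the unweighted rule is misled by the noisy second feature, while setting its weight to 0 lets the informative feature decide the vote.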

20.
Credit risk assessment is a crucial issue, as it forecasts whether or not an individual will default on a loan. Classifying an applicant as a good or bad debtor helps the lender make a wise decision. Modern data mining and machine learning techniques have been found to be very useful and accurate for credit risk prediction and correct decision making. Classification is one of the most widely used techniques in machine learning. To increase the prediction accuracy of standalone classifiers while keeping overall cost to a minimum, feature selection techniques have been utilized, as feature selection removes redundant and irrelevant attributes from the dataset. This paper first introduces Bolasso (Bootstrap-Lasso), which selects consistent and relevant features from a pool of features. Consistent feature selection is defined as the robustness of the selected features with respect to changes in the dataset. The shortlisted features generated by Bolasso are then applied to various classification algorithms, namely Random Forest (RF), Support Vector Machine (SVM), Naïve Bayes (NB) and K-Nearest Neighbors (K-NN), to test their predictive accuracy. It is observed that the Bolasso-enabled Random Forest algorithm (BS-RF) provides the best results for credit risk evaluation. The classifiers are built on a training and test data partition (70:30) of three datasets (Lending Club's peer-to-peer dataset, Kaggle's Bank loan status dataset and the German credit dataset obtained from UCI). The performance of the Bolasso-enabled classification algorithms is then compared with that of other baseline feature selection methods, such as Chi Square, Gain Ratio and ReliefF, and with stand-alone classifiers (no feature selection method applied). The experimental results show that Bolasso provides markedly more stable features than the other algorithms. The Jaccard Stability Measure (JSM) is used to assess the stability of the feature selection methods. Moreover, BS-RF has good classification accuracy and is better than the other methods in terms of AUC and accuracy, effectively improving the decision making process of lenders.
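
The abstract names the Jaccard Stability Measure without defining it. The sketch below assumes the common formulation: the average pairwise Jaccard index between the feature subsets selected on different resamples of the data. The function names and the example subsets are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two feature sets: |intersection| / |union|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def jaccard_stability(subsets):
    """Average pairwise Jaccard index; requires at least two subsets."""
    pairs = list(combinations(subsets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical feature subsets selected on three bootstrap samples.
runs = [
    {"income", "age"},
    {"income", "age"},
    {"income", "term"},
]
print(round(jaccard_stability(runs), 3))
```

A value of 1.0 means every run selected exactly the same features; values near 0 mean the selection changes almost completely from one resample to the next.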
