Similar literature
 Found 20 similar documents; search took 31 ms
1.
The National Incident-Based Reporting System (NIBRS) is used by law enforcement to record a detailed picture of crime incidents, including data on offenses, victims and suspected arrestees. Such incident data lends itself to data mining, which can uncover hidden patterns that provide meaningful insights to law enforcement and policy makers. In this paper we analyze all homicide data recorded over one year in the NIBRS database, and use classification to predict the relationships between murder victims and offenders. We evaluate different ways of formulating classification problems for this prediction and compare four classification methods: decision tree, random forest, support vector machine and neural network. Our results show that setting up binary classification problems that discriminate each type of victim–offender relationship from all others yields good classification accuracy, especially with the support vector machine and random forest methods. Furthermore, our results show that interesting structural insight can be obtained by performing attribute selection and by using transparent decision tree models.
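The one-versus-all-others formulation mentioned in this abstract can be sketched roughly as follows. This is a minimal Python illustration, not the authors' code; the function name `one_vs_rest_targets` and the relationship categories in the example are hypothetical placeholders.

```python
def one_vs_rest_targets(labels):
    """Turn one multi-class label list into one binary target list per class."""
    return {
        cls: [1 if y == cls else 0 for y in labels]
        for cls in sorted(set(labels))
    }


# Hypothetical relationship labels, for illustration only.
relationships = ["family", "stranger", "acquaintance", "family"]
targets = one_vs_rest_targets(relationships)
# Each entry of `targets` defines one binary problem: "this class" vs. all others.
```

Each binary target list could then be handed to any binary classifier, such as the four methods the abstract compares.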

2.
Objective: Manually evaluating machine learning algorithms and selecting a suitable classifier from the available candidates is a highly time-consuming and challenging task. If the selection is not done carefully and accurately, the resulting classification model will not deliver the expected performance. In this study, we present an accurate multi-criteria decision making methodology (AMD) which empirically evaluates and ranks classifiers and allows end users or experts to choose the top-ranked classifier to learn and build classification models for their applications.
Methods and material: Existing methodologies for classifier performance analysis and recommendation lack (a) an appropriate method for selecting suitable evaluation criteria, (b) a relative consistent weighting mechanism, (c) fitness assessment of the classifiers' performance, and (d) satisfaction of various constraints during the analysis process. To assist machine learning practitioners in selecting suitable classifier(s), the AMD methodology is proposed. It comprises an expert group-based criteria selection method; a relative consistent weighting scheme; a new ranking method, called optimum performance ranking criteria, based on multiple evaluation metrics, statistical significance and fitness assessment functions; and implicit and explicit constraint satisfaction at the time of analysis. To rank classifier performance, the proposed ranking method integrates the Wgt.Avg.F-score, CPUTimeTesting, CPUTimeTraining, and Consistency measures using the technique for order performance by similarity to ideal solution (TOPSIS). The final relative closeness scores produced by TOPSIS are ranked, and practitioners select the best-performing (top-ranked) classifier for the problem at hand.
Findings: In extensive experiments on 15 publicly available UCI and OpenML datasets using 35 classification algorithms from heterogeneous families of classifiers, an average Spearman's rank correlation coefficient of 0.98 is observed. The AMD method's average Spearman's rank correlation coefficient of 0.98 improves on the correlation coefficients of 0.83 and 0.045 obtained by the state-of-the-art ranking methods, performance of algorithms (PAlg) and adjusted ratio of ratio (ARR).
Conclusion and implication: The evaluation, the empirical analysis of results and the comparison with state-of-the-art methods demonstrate the feasibility of the AMD methodology, especially the selection and weighting of the right evaluation criteria and the accurate ranking and selection of optimum-performance classifier(s) for the user's application data. AMD reduces experts' time and effort and improves system performance by using a suitable classifier recommended by the methodology.
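The TOPSIS ranking step named in this abstract can be sketched as below. This is a generic textbook TOPSIS in Python, not the AMD implementation; the function name `topsis`, the candidate names, the weights and all numbers are hypothetical, and the sketch assumes no criterion column is all zeros and that the alternatives are not all identical.

```python
import math


def topsis(matrix, weights, benefit):
    """Return the relative closeness score (0..1) of each alternative.

    matrix  -- rows are alternatives, columns are criteria
    weights -- one weight per criterion
    benefit -- True where larger is better, False where smaller is better
    """
    n_crit = len(weights)
    # Vector-normalize each criterion column, then apply the weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_crit)]
    weighted = [
        [weights[j] * row[j] / norms[j] for j in range(n_crit)] for row in matrix
    ]
    # Ideal and anti-ideal points, criterion by criterion.
    cols = list(zip(*weighted))
    ideal = [max(c) if benefit[j] else min(c) for j, c in enumerate(cols)]
    anti = [min(c) if benefit[j] else max(c) for j, c in enumerate(cols)]
    scores = []
    for row in weighted:
        d_pos = math.dist(row, ideal)  # distance to the ideal solution
        d_neg = math.dist(row, anti)   # distance to the anti-ideal solution
        scores.append(d_neg / (d_pos + d_neg))
    return scores


# Hypothetical values: F-score, test time (s), training time (s), consistency.
candidates = ["clf_a", "clf_b", "clf_c"]
matrix = [
    [0.90, 0.5, 2.0, 0.8],
    [0.85, 0.1, 0.5, 0.9],
    [0.60, 1.0, 5.0, 0.5],
]
weights = [0.4, 0.2, 0.2, 0.2]
benefit = [True, False, False, True]

scores = topsis(matrix, weights, benefit)
ranking = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
```

In this toy data `clf_c` is worst on every criterion, so it coincides with the anti-ideal point and lands at the bottom of `ranking`.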

3.
Ensemble methods aim at combining multiple learning machines to improve the efficacy in a learning task in terms of prediction accuracy, scalability, and other measures. These methods have been applied to evolutionary machine learning techniques including learning classifier systems (LCSs). In this article, we first propose a conceptual framework that allows us to appropriately categorize ensemble-based methods for fair comparison and highlights the gaps in the corresponding literature. The framework is generic and consists of three sequential stages: a pre-gate stage concerned with data preparation; the member stage to account for the types of learning machines used to build the ensemble; and a post-gate stage concerned with the methods to combine ensemble output. A taxonomy of LCS-based ensembles is then presented using this framework. The article then focuses on comparing LCS ensembles that use feature selection in the pre-gate stage. An evaluation methodology is proposed to systematically analyze the performance of these methods. Specifically, random feature sampling and rough set feature selection-based LCS ensemble methods are compared. Experimental results show that the rough set-based approach performs significantly better than the random subspace method in terms of classification accuracy in problems with high numbers of irrelevant features. The performance of the two approaches is comparable in problems with high numbers of redundant features.

4.
In the last few years, machine learning techniques have been successfully applied to solve engineering problems. However, owing to certain complexities found in real-world problems, such as class imbalance, classical learning algorithms may not reach a prescribed performance. There can be situations where a good result on different conflicting objectives is desirable, such as true positive and true negative ratios, or where it is important to balance a model's complexity and prediction score. To address such issues, multi-objective optimization design procedures can be used to analyze various trade-offs and build more robust machine learning models. Thus, the creation of ensembles of predictive models using such procedures is addressed in this work. First, a set of diverse predictive models is built by employing a multi-objective evolutionary algorithm. Next, a second multi-objective optimization step selects the previous models as ensemble members, resulting in several non-dominated solutions. A final multi-criteria decision making stage is applied to rank and visualize the resulting ensembles. To analyze the proposed methodology, two different experiments are conducted for binary classification. The first case study is a famous classification problem through which the proposed procedure is illustrated. The second one is a challenging real-world problem related to water quality monitoring, where the proposed procedure is compared to four classical ensemble learning algorithms. Results on this second experiment show that the proposed technique is able to create robust ensembles that can outperform other ensemble methods. Overall, the authors conclude that the proposed methodology for ensemble generation creates competitive models for real-world engineering problems.
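The notion of non-dominated solutions used in this abstract can be illustrated with a small Pareto filter. This Python sketch only shows the dominance test for objectives that are all maximized; it is not the evolutionary algorithm from the paper, and the (true positive rate, true negative rate) pairs are hypothetical.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated(points):
    """Keep the points that no other point dominates (the Pareto front)."""
    return [
        p for p in points
        if not any(dominates(q, p) for q in points if q is not p)
    ]


# Hypothetical (true positive rate, true negative rate) of candidate models.
points = [(0.9, 0.6), (0.8, 0.8), (0.6, 0.9), (0.7, 0.7), (0.5, 0.5)]
front = non_dominated(points)
```

Here `(0.7, 0.7)` and `(0.5, 0.5)` are both dominated by `(0.8, 0.8)`, so only the first three points remain as trade-off candidates for a later decision-making stage.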

5.
High-altitude regions with complicated terrain are characterized by varied and complex geomorphic types, crisscrossing networks of ravines, and broken terrain, so it is important to find a rapid and effective land use/land cover classification method for obtaining and promptly updating land use information. Taking the Huangshui River basin, located in the transitional zone between the Loess Plateau and the Qinghai-Tibet Plateau, as a case study area, this study compares four machine learning methods in order to identify an effective information extraction method for complicated terrain regions. Based on Landsat 8 OLI satellite data and a DEM, combined with various thematic features and a geographical division of the study area, four machine learning methods (artificial neural network, decision tree, support vector machine and random forest) were used to extract land use data, and a confusion matrix was constructed to evaluate classification accuracy. The results showed that the classification accuracies of random forest and decision tree are clearly higher than those of support vector machine and artificial neural network. The random forest method has the highest classification accuracy, with an overall classification accuracy of 85.65% and a Kappa coefficient of 0.84. On this basis, the random forest method was chosen to further classify Landsat 8 data fused from the 15-meter panchromatic and 30-meter multispectral images; the overall classification accuracy is 86.49% and the Kappa coefficient is 0.85. This indicates that the random forest classification method can obtain higher classification efficiency while ensuring classification accuracy, and that it is very effective for extracting land use information in complicated terrain regions. Data fusion can improve the classification accuracy to a certain extent.
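Overall accuracy and the Kappa coefficient, the two figures reported in this abstract, are both derived from a confusion matrix. The Python sketch below uses the standard definition of Cohen's kappa; the function name and the 2x2 matrix are hypothetical and unrelated to the study's data.

```python
def overall_accuracy_and_kappa(cm):
    """Compute overall accuracy and Cohen's kappa from a square confusion matrix.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    total = sum(sum(row) for row in cm)
    observed = sum(cm[i][i] for i in range(len(cm))) / total
    row_totals = [sum(row) for row in cm]
    col_totals = [sum(col) for col in zip(*cm)]
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical two-class confusion matrix, for illustration only.
cm = [
    [45, 5],
    [10, 40],
]
accuracy, kappa = overall_accuracy_and_kappa(cm)
```

For this matrix the observed agreement is 0.85 and the chance agreement is 0.5, so kappa works out to (0.85 - 0.5) / (1 - 0.5) = 0.7.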

6.
The decision tree method has developed rapidly in the past two decades and its performance in classification is promising. Tree-based ensemble algorithms have been used to improve the performance of an individual tree. In this study, we compared four basic ensemble methods, that is, bagging tree, random forest, AdaBoost tree and AdaBoost random tree, in terms of tree size, ensemble size, band selection (BS), random feature selection, classification accuracy and efficiency in ecological zone classification in Clark County, Nevada, using multi-temporal multi-source remote-sensing data. Furthermore, two BS schemes based on feature importance of the bagging tree and AdaBoost tree were also considered and compared. We conclude that random forest or AdaBoost random tree can achieve accuracies at least as high as bagging tree or AdaBoost tree with higher efficiency; and although bagging tree and random forest can be more efficient, AdaBoost tree and AdaBoost random tree can provide a significantly higher accuracy. All ensemble methods provided significantly higher accuracies than the single decision tree. Finally, our results showed that the classification accuracy could increase dramatically by combining multi-temporal and multi-source data sets.

7.

Dementia is one of the leading causes of severe cognitive decline; it induces memory loss and impairs the daily life of millions of people worldwide. In this work, we consider the classification of dementia using magnetic resonance (MR) imaging and clinical data with machine learning models. We adapt univariate feature selection in the MR data pre-processing step as a filter-based feature selection. Bagged decision trees are also implemented to estimate the important features for achieving good classification accuracy. Several ensemble learning-based machine learning approaches, namely gradient boosting (GB), extreme gradient boost (XGB), voting-based, and random forest (RF) classifiers, are considered for the diagnosis of dementia. Moreover, we propose voting-based classifiers that train on an ensemble of numerous basic machine learning models, such as the extra trees classifier, RF, GB, and XGB. The implementation of a voting-based approach is one of the important contributions, and the performance of the different classifiers is evaluated in terms of precision, accuracy, recall, and F1 score. Moreover, the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC) are used as metrics for comparing these classifiers. Experimental results show that the voting-based classifiers often perform better than RF, GB, and XGB in terms of precision, recall, and accuracy, thereby indicating the promise of differentiating dementia from imaging and clinical data.


8.
Improving the accuracy of machine learning algorithms is vital in designing high-performance computer-aided diagnosis (CADx) systems. Research has shown that the performance of a base classifier may be enhanced by ensemble classification strategies. In this study, we construct rotation forest (RF) ensemble classifiers of 30 machine learning algorithms and evaluate their classification performance using Parkinson's, diabetes and heart disease datasets from the literature. In the experiments, the feature dimension of the three datasets is first reduced using the correlation-based feature selection (CFS) algorithm. Second, the classification performance of the 30 machine learning algorithms is calculated for the three datasets. Third, 30 classifier ensembles are constructed based on the RF algorithm to assess the performance of the respective classifiers on the same disease data. All experiments are carried out with a leave-one-out validation strategy, and the performance of the 60 algorithms is evaluated using three metrics: classification accuracy (ACC), kappa error (KE) and area under the receiver operating characteristic (ROC) curve (AUC). The base classifiers achieved average accuracies of 72.15%, 77.52% and 84.43% for the diabetes, heart and Parkinson's datasets, respectively. The RF classifier ensembles produced average accuracies of 74.47%, 80.49% and 87.13% for the respective diseases. RF, a newly proposed classifier ensemble algorithm, might be used to improve the accuracy of miscellaneous machine learning algorithms when designing advanced CADx systems.

9.
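The decision tree is a predictive model widely used in machine learning and data mining, and its output is easy to understand and interpret. To address the huge volume of streaming data from on-board intelligent equipment in high-speed railways, the complexity of equipment faults and the low efficiency of diagnosis, the CVFDT decision tree algorithm is adopted. By applying machine learning to normalized streaming data from train control equipment, an intelligent fault prediction model for on-board equipment is built (with the classes low probability of occurrence, high probability of occurrence, and fault already occurred). The model enables potential equipment faults to be eliminated in advance, improves the accuracy of fault classification, localization and diagnosis, and safeguards the operational safety and transport efficiency of high-speed railways.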

10.
There are many adaptive learning systems that adapt learning materials to student properties, preferences, and activities. This study focuses on designing such a learning system by relating combinations of different learning styles to preferred types of multimedia materials. We explore a decision model aimed at proposing learning material of an appropriate multimedia type. This study includes 272 student participants. The resulting decision model shows that students prefer well-structured learning texts with color discrimination, and that the hemispheric learning style model is the most important criterion in deciding student preferences for different multimedia learning materials. To provide a more accurate and reliable model for recommending different multimedia types, more learning style models must be combined. Kolb's classification and the VAK classification allow us to learn whether students prefer an active role in the learning process, and what multimedia type they prefer.

11.
Ribonucleic acid (RNA) hybridization is widely used in popular RNA simulation software in bioinformatics. However, limited by the exponential computational complexity of combinatorial problems, it is challenging to decide, within an acceptable time, whether a specific RNA hybridization is effective. We hereby introduce a machine learning based technique to address this problem. Sample machine learning (ML) models tested in the training phase include algorithms based on the boosted tree (BT), random forest (RF), decision tree (DT) and logistic regression (LR), and the corresponding models are obtained. Given the RNA molecular coding training and testing sets, the trained machine learning models are applied to predict the classification of RNA hybridization results. The experimental results show that, under the strong constraint condition, the optimal predictive accuracies are 96.2%, 96.6%, 96.0% and 69.8% for the RF, BT, DT and LR-based approaches, respectively, compared with traditional representative methods. Furthermore, the average computational efficiencies of the RF, BT, DT and LR-based approaches are 208 679, 269 756, 184 333 and 187 458 times higher than that of the existing approach, respectively. Given an RNA design, the BT-based approach demonstrates high computational efficiency and better predictive accuracy in determining the biological effectiveness of molecular hybridization.

12.
Image collections are currently widely available and are being generated at a fast pace thanks to mobile and accessible equipment. In principle, this is a good scenario for the design of successful visual pattern recognition systems. However, in particular for classification tasks, one may need to choose which examples are more relevant in order to build a training set that represents the data well, since such tasks often require representative and sufficient observations to be accurate. In this paper we investigate three methods for selecting relevant examples from image collections based on learning models from small portions of the available data. We consider supervised methods that need labels to allow selection, and an unsupervised method that is agnostic to labels. The image datasets studied were described using both handcrafted and deep learning features. A general-purpose algorithm is proposed which uses learning methods as subroutines. We show that our relevance selection algorithm outperforms random selection, in particular when using unlabelled data in an unsupervised approach, significantly reducing the size of the training set with little decrease in test accuracy.

13.
Remote sensing is the main means of extracting land cover types, which is of great significance for monitoring land use change and developing national policies. Object-based classification methods can provide more accurate data than pixel-based methods by using spectral, shape and texture information. In this study, we use GF-1 satellite imagery and propose a method that automatically calculates the optimal segmentation scale. Object-based methods for classifying four typical land cover types are compared using multi-scale segmentation and three supervised machine learning algorithms. The relationship between the accuracy of the classification results and the training sample proportion is analyzed, and the results show that object-based methods can achieve high classification accuracy even with a small training sample ratio, with overall accuracies higher than 94%. Overall, the classification accuracy of the support vector machine is higher than that of the neural network and the decision tree in object-oriented classification.

14.
This paper presents a hybrid approach to medical decision support systems based on feature selection, fuzzy weighted pre-processing and the artificial immune recognition system (AIRS). We have used the heart disease and hepatitis disease datasets taken from the UCI machine learning database as medical datasets. The artificial immune recognition system has shown effective performance on several problems, such as machine learning benchmark problems and medical classification problems like breast cancer, diabetes, and liver disorders classification. The proposed approach consists of three stages. In the first stage, the dimensions of the heart disease and hepatitis disease datasets are reduced from 13 and 19, respectively, to 9 in the feature selection (FS) sub-program by means of the C4.5 decision tree algorithm (CBA program). In the second stage, the heart disease and hepatitis disease datasets are normalized to the range [0,1] and are weighted via fuzzy weighted pre-processing. In the third stage, the weighted input values obtained from fuzzy weighted pre-processing are classified using the AIRS classifier system. The obtained classification accuracies of our system are 92.59% and 81.82% using a 50-50% training-test split for the heart disease and hepatitis disease datasets, respectively. With these results, the proposed method can be used in medical decision support systems.

15.
Zhang Hongpo, Cheng Ning, Zhang Yang, Li Zhanbo. Applied Intelligence, 2021, 51(7): 4503-4514

A label flipping attack is a poisoning attack that flips the labels of training samples to reduce the classification performance of the model. Robustness is used to measure the applicability of machine learning algorithms under adversarial attacks. The Naive Bayes (NB) algorithm is an anti-noise and robust machine learning technique. It shows good robustness when dealing with tasks such as document classification and spam filtering. Here we propose two novel label flipping attacks to evaluate the robustness of NB under label noise. For the three datasets Spambase, TREC 2006c and TREC 2007 in the spam classification domain, our attack goal is to increase the false negative rate of NB under the influence of label noise without affecting normal mail classification. Our evaluation shows that at a noise level of 20%, the false negative rate on Spambase and TREC 2006c increased by about 20%, and the test error on the TREC 2007 dataset increased to nearly 30%. We compared the classification accuracy of five classic machine learning algorithms (random forest (RF), support vector machine (SVM), decision tree (DT), logistic regression (LR), and NB) and two deep learning models (AlexNet, LeNet) under the proposed label flipping attacks. The experimental results show that the two label flipping attacks are applicable to various classification models and effectively reduce their accuracy.
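As a point of reference for the label flipping idea, the Python sketch below poisons a training label list by flipping randomly chosen spam labels to ham. It is a plain random baseline, not either of the two attacks proposed in the paper; the function name `flip_labels`, the encoding (1 = spam, 0 = ham) and the reading of "noise level" as a fraction of all training labels are assumptions made for illustration.

```python
import random


def flip_labels(labels, noise_level, source=1, target=0, seed=0):
    """Flip randomly chosen `source` labels to `target`.

    noise_level is taken here as a fraction of ALL training labels
    (an assumption); at most all `source` labels can be flipped.
    """
    rng = random.Random(seed)
    source_idx = [i for i, y in enumerate(labels) if y == source]
    n_flip = min(int(round(noise_level * len(labels))), len(source_idx))
    poisoned = list(labels)  # leave the caller's list untouched
    for i in rng.sample(source_idx, n_flip):
        poisoned[i] = target
    return poisoned


# Hypothetical training labels: 50 spam (1) followed by 50 ham (0).
clean = [1] * 50 + [0] * 50
poisoned = flip_labels(clean, noise_level=0.2)
```

Flipping only spam-to-ham leaves the ham labels intact, which mirrors the stated goal of raising the false negative rate without disturbing normal mail classification.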


16.
This paper presents an effective machine learning-based depth selection algorithm for the CTU (Coding Tree Unit) in HEVC (High Efficiency Video Coding). Existing machine learning methods are limited in their ability to handle the initial depth decision of the CU (Coding Unit) and to select a proper set of input features for the depth selection model. In this paper, we first propose a new classification approach for predicting the initial division depth. In particular, we study the correlation between texture complexity, QPs (quantization parameters) and the depth decision of the CUs to forecast the original partition depth of the current CUs. Second, we determine the input features of the classifier by analysing the correlation between the depth decision of the CUs, picture distortion and the bit-rate. Using the relationships found, we also study a decision method for the end partition depth of the current CUs that takes bit-rate and picture distortion as input. Finally, we formulate the depth division of the CUs as a binary classification problem and use the nearest neighbor classifier to conduct classification. Our proposed method can significantly improve the efficiency of inter-frame coding by avoiding the cost of traversing the division depths. Experiments show that the method reduces encoding time by 34.56% compared with HM-16.9 while keeping the partition depth of the CUs correct.

17.
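A traffic classification framework based on ensemble clustering   Total citations: 1, self-citations: 0, citations by others: 1
Lu Gang, Yu Xiangzhan, Zhang Hongli, Guo Ronghua. Journal of Software (软件学报), 2016, 27(11): 2870-2883
Traffic classification is the foundation and key to optimizing network quality of service. Machine learning algorithms classify traffic using statistical features of data flows, which is of great significance for identifying encrypted traffic of proprietary protocols. However, feature bias and class imbalance are two major challenges facing machine learning-based traffic classification. Feature bias means that some statistical flow features improve the identification accuracy for some applications while reducing it for others. Class imbalance means that machine learning traffic classifiers identify applications with few samples less accurately. To solve these problems, a traffic classification framework based on ensemble clustering (TCFEC) is proposed. TCFEC consists of multiple base classifiers, each built by clustering in a different feature subspace, and an optimal decision component, and it can improve the accuracy of traffic classification. Specifically, compared with traditional machine learning traffic classifiers, TCFEC improves the average flow accuracy by up to 5% and the byte accuracy by up to 6%.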

18.
An ensemble in machine learning is defined as a set of models (such as classifiers or predictors) that are induced individually from data by using one or more machine learning algorithms for a given task and then work collectively in the hope of generating improved decisions. In this paper we investigate the factors that influence ensemble performance, which mainly include the accuracy of individual classifiers, the diversity between classifiers, the number of classifiers in an ensemble and the decision fusion strategy. Among them, diversity is believed to be a key factor, but it is more complex and difficult to measure quantitatively; it was thus chosen as the focus of this study, together with the relationships between the other factors. A technique was devised to build ensembles with decision trees that are induced with randomly selected features. Three sets of experiments were performed using 12 benchmark datasets, and the results indicate that (i) a high level of diversity indeed makes an ensemble more accurate and robust compared with individual models; (ii) small ensembles can produce results as good as, or better than, large ensembles provided that appropriate (e.g. more diverse) models are selected for inclusion. This implies that, when scaling up to larger databases, the increased efficiency of smaller ensembles becomes more significant and beneficial. As a test case study, ensembles were built based on these findings for a real-world application, osteoporosis classification; in each of the three datasets used, the ensembles outperformed individual decision trees consistently and reliably.
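Two of the factors discussed in this abstract, decision fusion and diversity, can be made concrete with a small Python sketch. It uses majority voting as the fusion rule and the mean pairwise disagreement rate as one common diversity measure; neither is confirmed to be the measure used in the paper, and the predictions below are hypothetical.

```python
from collections import Counter
from itertools import combinations


def majority_vote(predictions):
    """Fuse per-model prediction lists by taking the most common label per sample."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


def mean_disagreement(predictions):
    """Average fraction of samples on which two models disagree, over all pairs."""
    pairs = list(combinations(predictions, 2))
    rates = [sum(a != b for a, b in zip(p, q)) / len(p) for p, q in pairs]
    return sum(rates) / len(rates)


# Hypothetical binary predictions from three models on five samples.
truth = [1, 0, 1, 1, 0]
predictions = [
    [1, 0, 1, 0, 0],  # wrong on sample 3
    [1, 0, 0, 1, 0],  # wrong on sample 2
    [0, 0, 1, 1, 0],  # wrong on sample 0
]
fused = majority_vote(predictions)
diversity = mean_disagreement(predictions)
```

Each model here is right on four of five samples, yet because their errors fall on different samples the fused vote matches `truth` on all five. With an odd number of binary voters no ties arise; other settings would need an explicit tie-breaking rule.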

19.
In our work, we review and empirically evaluate five different raw methods of text representation that allow automatic processing of Wikipedia articles. The main contribution of the article, an evaluation of approaches to text representation for machine learning tasks, indicates that the text representation is fundamental for achieving good categorization results. The analysis of the representation methods creates a baseline that cannot be compensated for even by sophisticated machine learning algorithms. It confirms the thesis that proper data representation is a prerequisite for achieving high-quality results of data analysis. Evaluation of the text representations was performed within the Wikipedia repository by examination of classification parameters observed during automatic reconstruction of human-made categories. For that purpose, we use a classifier based on the support vector machine method, extended with multilabel and multiclass functionalities. During classifier construction we observed parameters such as learning time, representation size, and classification quality that allow us to draw conclusions about text representations. For the experiments presented in the article, we use data sets created from Wikipedia dumps. We describe our software, called Matrix’u, which allows a user to build computational representations of Wikipedia articles. The software is the second contribution of our research, because it is a universal tool for converting Wikipedia from a human-readable form to a form that can be processed by a machine. Results generated using Matrix’u can be used in a wide range of applications that involve usage of Wikipedia data.

20.
We describe approaches for positive data modeling and classification using both finite inverted Dirichlet mixture models and support vector machines (SVMs). Inverted Dirichlet mixture models are used to tackle an outstanding challenge in SVMs, namely the generation of accurate kernels. The kernel generation approaches that we consider, grounded on ideas from information theory, allow the incorporation of data structure and its structural constraints. Inverted Dirichlet mixture models are learned within a principled Bayesian framework using both the Gibbs sampler and Metropolis-Hastings for parameter estimation and the Bayes factor for model selection (i.e., determining the number of mixture components). Our Bayesian learning approach uses priors over the model parameters, which we derive by showing that the inverted Dirichlet distribution belongs to the family of exponential distributions, and then combines these priors with information from the data to build posterior distributions. We illustrate the merits and the effectiveness of the proposed method with two challenging real-world applications, namely object detection and visual scene analysis and classification.
