1.

Feature transfer learning by reinforcement learning for detecting software defect

Shikai Guo Jiahui Wang Zhihao Xu Lin Huang Hui Li Rong Chen 《Software》2023,53(2):366-389

Software defects, produced inevitably in software projects, seriously affect the efficiency of software testing and maintenance. An appealing solution is software defect prediction (SDP), which has achieved good performance in many software projects. However, the difference between features and the difference of the same feature between training data and test data may degrade defect prediction performance if such differences violate the model's assumption. To address this issue, we propose an SDP method based on feature transfer learning (FTL), which performs a transformation sequence for each feature in order to map the original features to another feature space. Specifically, FTL first uses a reinforcement learning scheme that automatically learns a strategy for transferring the potential feature knowledge from the training data. Then, we use the learned feature knowledge to guide the transformation of the test data. The classifier is trained on the transformed training data and predicts defects for the transformed test data. We evaluate the validity of FTL on 43 projects from PROMISE and NASA MDP using three classifiers: logistic regression, random forest, and Naive Bayes (NB). Experimental results indicate that FTL is better than the original classifiers and has the best performance on the NB classifier. For PROMISE, after using FTL, the average results of F1-score, AUC, and MCC are 0.601, 0.757, and 0.350 respectively, which are 24.9%, 2.6%, and 16.7% higher than the original NB classifier results. Projects with improved performance account for 83.87%, 83.87%, and 64.52% of the total, respectively. Similarly, FTL performs well on NASA MDP. Besides, compared with four feature engineering (FE) methods, FTL achieves an excellent improvement on most projects and the average performance is also better than or close to the FE methods.
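Two of the metrics reported above, F1-score and MCC, can be computed directly from binary confusion counts. The following is a minimal sketch using the standard definitions; the function name and the example counts are hypothetical and are not taken from the paper.

```python
import math


def f1_and_mcc(tp: int, fp: int, fn: int, tn: int) -> tuple[float, float]:
    """Return (F1-score, MCC) for binary confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives
    and true negatives. Degenerate denominators yield 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    # Matthews correlation coefficient: correlation between the
    # predicted and the actual binary labels.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, mcc


# Hypothetical counts for one project: 8 defects found, 2 missed,
# 2 false alarms, 8 clean modules correctly passed.
f1, mcc = f1_and_mcc(tp=8, fp=2, fn=2, tn=8)
print(f1, mcc)
```

Unlike F1, MCC also rewards correct predictions on the non-defective class, which is one reason the two metrics can improve by different amounts on the same data.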

2.

Image mining by spectral features: A case study of scenery image classification

《Expert systems with applications》2007,32(1):135-142

Spectral features of images, such as Gabor filters and wavelet transform, can be used for texture image classification. That is, a classifier is trained based on some labeled texture features as the training set to classify unlabeled texture features of images into some pre-defined classes. The aim of this paper is twofold. First, it investigates the classification performance of using Gabor filters, wavelet transform, and their combination respectively, as the texture feature representation of scenery images (such as mountain, castle, etc.). A k-nearest neighbor (k-NN) classifier and support vector machine (SVM) are also compared. Second, three k-NN classifiers and three SVMs are combined respectively, in which each of the combined three classifiers uses one of the above three texture feature representations, to see whether combining multiple classifiers can outperform the single classifier in terms of scenery image classification. The result shows that a single SVM using Gabor filters provides higher classification accuracy than the other two spectral features and than the combined three k-NN classifiers and three SVMs.

3.

A comparative study for content-based dynamic spam classification using four machine learning algorithms (total citations: 1; self-citations: 0; citations by others: 1)

Bo Yu Zong-ben Xu 《Knowledge》2008,21(4):355-362

The growth of email users has resulted in a dramatic increase in spam emails during the past few years. In this paper, four machine learning algorithms, namely Naïve Bayesian (NB), neural network (NN), support vector machine (SVM) and relevance vector machine (RVM), are proposed for spam classification. An empirical evaluation of them on the benchmark spam filtering corpora is presented. The experiments are performed with different training set sizes and extracted feature sizes. Experimental results show that the NN classifier is unsuitable for use alone as a spam rejection tool. Generally, the performances of the SVM and RVM classifiers are clearly superior to that of the NB classifier. Compared with SVM, RVM is shown to provide similar classification results with fewer relevance vectors and much faster testing time. Despite the slower learning procedure, RVM is more suitable than SVM for spam classification in applications that require low complexity.

4.

Learning Semi-Structured Document Categorization Using Bounded-Length Spectrum Sub-Sequence Kernels

Olivier de Vel 《Data mining and knowledge discovery》2006,13(3):309-334

In this paper we report an investigation into the learning of semi-structured document categorization. We automatically discover low-level, short-range byte data structure patterns from a document data stream by extracting all byte sub-sequences within a sliding window to form an augmented (or bounded-length) string spectrum feature map and using a modified suffix trie data structure (called the coloured generalized suffix tree or CGST) to efficiently store and manipulate the feature map. Using the CGST we are able to efficiently compute the stream's bounded-length sequence spectrum kernel. We compare the performance of two classifier algorithms to categorize the data streams, namely, the SVM and Naive Bayes (NB) classifiers. Experiments have provided good classification performance results on a variety of document byte streams, particularly when using the NB classifier under certain parameter settings. Results indicate that the bounded-length kernel is superior to the standard fixed-length kernel for semi-structured documents.
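The idea of a bounded-length spectrum feature map can be illustrated with a naive sketch. It assumes that "sub-sequences" means contiguous byte runs of length 1 up to a maximum length, and it stores counts in a plain dictionary rather than in the CGST described in the abstract, so it shows the kernel's definition and not the paper's efficient implementation. The function names are hypothetical.

```python
from collections import Counter


def bounded_spectrum(data: bytes, max_len: int) -> Counter:
    """Count every contiguous byte run of length 1..max_len in data."""
    counts: Counter = Counter()
    for n in range(1, max_len + 1):
        # Slide a window of width n over the byte stream.
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1
    return counts


def spectrum_kernel(a: bytes, b: bytes, max_len: int) -> int:
    """Inner product of the two bounded-length spectrum feature maps."""
    fa = bounded_spectrum(a, max_len)
    fb = bounded_spectrum(b, max_len)
    if len(fa) > len(fb):
        fa, fb = fb, fa  # iterate over the smaller map
    # Counter returns 0 for missing keys, so unshared runs add nothing.
    return sum(count * fb[run] for run, count in fa.items())


print(spectrum_kernel(b"abab", b"ab", max_len=2))
```

Setting `max_len` to a single fixed length and counting only runs of exactly that length would give the fixed-length spectrum kernel that the abstract uses as its comparison point.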

5.

Comparison of feature selection and classification algorithms in identifying malicious executables

D. Michael Cai Maya Gokhale 《Computational statistics & data analysis》2007,51(6):3156-3172

Malicious executables, often spread as email attachments, pose serious security threats to computer systems and associated networks. We investigated the use of byte sequence frequencies as a way to automatically distinguish malicious from benign executables without actually executing them. In a series of experiments, we compared classification accuracies over seven feature selection methods, four classification algorithms, and variable byte sequence lengths. We found that single-byte patterns provided surprisingly reliable features to separate malicious executables from benign ones. Between classifiers and feature selection methods, the overall performance of the models depended more on the choice of classifier than on the method of feature selection. Support vector machine (SVM) classifiers were found to be superior in terms of prediction accuracy, training time, and aversion to overfitting.

6.

Design of an NB combination text classifier based on multiple feature selection methods

Fan Kangxin (樊康新) 《Computer Engineering》(计算机工程) 2009,35(24):191-193

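To address shortcomings of the naive Bayes (NB) classifier, such as the sensitivity of the classification model to the training samples and the difficulty of improving classification accuracy, this paper proposes an NB combination text classifier based on multiple feature selection methods. Following the Boosting classification algorithm, several different feature selection methods are used to build feature term sets for the text, NB classifiers are trained as the base classifiers of the Boosting iterations, and the final NB combination text classifier is produced by weighted voting over the base classifiers. Experimental results show that the combination classifier achieves better classification performance than a single NB text classifier.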
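The weighted-voting step can be sketched as follows. The abstract only says "Boosting", so the AdaBoost-style weight formula used here, 0.5 * ln((1 - err) / err), is an assumption rather than the paper's confirmed rule, and the function names, labels and error rates are hypothetical.

```python
import math
from collections import defaultdict


def classifier_weight(error_rate: float) -> float:
    """AdaBoost-style voting weight for a base classifier (assumed rule).

    Lower training error gives a larger weight; an error of 0.5
    (chance level for two classes) gives a weight of zero.
    """
    eps = 1e-10  # keep the logarithm finite at error rates of 0 or 1
    e = min(max(error_rate, eps), 1 - eps)
    return 0.5 * math.log((1 - e) / e)


def weighted_vote(predictions: list[str], weights: list[float]) -> str:
    """Return the label with the largest total weight."""
    scores: dict[str, float] = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    return max(scores, key=scores.get)


# Three hypothetical NB base classifiers, each trained on a feature
# set chosen by a different feature selection method.
errors = [0.10, 0.40, 0.40]
weights = [classifier_weight(e) for e in errors]
print(weighted_vote(["sports", "politics", "politics"], weights))
```

In this example the single accurate classifier outvotes the two weaker ones, which is the behaviour that distinguishes weighted voting from a simple majority vote.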

7.

ECG arrhythmia classification based on optimum-path forest

Eduardo José da S. Luz Thiago M. Nunes Victor Hugo C. de Albuquerque João P. Papa David Menotti 《Expert systems with applications》2013,40(9):3561-3573

An important tool for heart disease diagnosis is the analysis of electrocardiogram (ECG) signals, owing to the non-invasive nature and simplicity of the ECG exam. Depending on the application, ECG data analysis consists of steps such as preprocessing, segmentation, feature extraction and classification, aiming to detect cardiac arrhythmias (i.e., cardiac rhythm abnormalities). Aiming to make the cardiac arrhythmia signal classification process fast and accurate, we apply and analyze a recent and robust supervised graph-based pattern recognition technique, the optimum-path forest (OPF) classifier. To the best of our knowledge, this is the first time that the OPF classifier has been applied to the ECG heartbeat signal classification task. We then compare the performance (in terms of training and testing time, accuracy, specificity, and sensitivity) of the OPF classifier to that of three other well-known expert system classifiers, i.e., support vector machine (SVM), Bayesian and multilayer artificial neural network (MLP), using features extracted from six main approaches considered in the literature for ECG arrhythmia analysis. In our experiments, we use the MIT-BIH Arrhythmia Database and the evaluation protocol recommended by The Association for the Advancement of Medical Instrumentation. A discussion of the obtained results shows that the OPF classifier presents a robust performance, i.e., there is no need for parameter setup, as well as high accuracy at an extremely low computational cost. Moreover, on average, the OPF classifier yielded greater performance than the MLP and SVM classifiers in terms of classification time and accuracy, and produced quite similar performance to the Bayesian classifier, showing itself to be a promising technique for ECG signal analysis.

8.

Decision support system for fatty liver disease using GIST descriptors extracted from ultrasound images

《Information Fusion》2016

9.

Credit scoring with a data mining approach based on support vector machines (total citations: 3; self-citations: 0; citations by others: 3)

Cheng-Lung Huang Mu-Chen Chen Chieh-Jen Wang 《Expert systems with applications》2007,33(4):847-856

The credit card industry has been growing rapidly recently, and thus huge numbers of consumers’ credit data are collected by the credit department of the bank. The credit scoring manager often evaluates the consumer’s credit with intuitive experience. However, with the support of the credit classification model, the manager can accurately evaluate the applicant’s credit score. Support Vector Machine (SVM) classification is currently an active research area and successfully solves classification problems in many domains. This study used three strategies to construct the hybrid SVM-based credit scoring models to evaluate the applicant’s credit score from the applicant’s input features. Two credit datasets in the UCI database are selected as the experimental data to demonstrate the accuracy of the SVM classifier. Compared with neural networks, genetic programming, and decision tree classifiers, the SVM classifier achieved an identical classificatory accuracy with relatively few input features. Additionally, by combining genetic algorithms with the SVM classifier, the proposed hybrid GA-SVM strategy can simultaneously perform the feature selection task and model parameter optimization. Experimental results show that SVM is a promising addition to the existing data mining methods.

10.

A new weighted naive Bayes method based on information diffusion for software defect prediction

Ji Haijin Huang Song Wu Yaning Hui Zhanwei Zheng Changyou 《Software Quality Journal》2019,27(3):923-968

Software defect prediction (SDP) plays a significant part in identifying the most defect-prone modules before software testing and allocating limited testing resources. One of the most commonly used classifiers in SDP is naive Bayes (NB). Despite the simplicity of the NB classifier, it can often perform better than more complicated classification models. In NB, the features are assumed to be equally important, and the numeric features are assumed to have a normal distribution. However, the features often do not contribute equally to the classification, and a Kolmogorov-Smirnov test usually shows that they do not follow a normal distribution; this may harm the performance of the NB classifier. Therefore, this paper proposes a new weighted naive Bayes method based on information diffusion (WNB-ID) for SDP. More specifically, for the equal importance assumption, we investigate six weight assignment methods for setting the feature weights and then choose the most suitable one based on the F-measure. For the normal distribution assumption, we apply the information diffusion model (IDM) to compute the probability density of each feature instead of the default probability density function of the normal distribution. We carry out experiments on 10 software defect data sets of three types of projects in three different programming languages provided by the PROMISE repository. Several well-known classifiers and ensemble methods are included for comparison. The final experimental results demonstrate the effectiveness and practicability of the proposed method.
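A feature-weighted naive Bayes score can be sketched as below. This uses a common formulation in which each feature's log-density is multiplied by its weight; whether WNB-ID combines weights in exactly this way is an assumption. The sketch also keeps the Gaussian density that the abstract describes as the usual NB assumption, not the information diffusion model the paper substitutes for it. All names and numbers are hypothetical.

```python
import math


def gaussian_log_pdf(x: float, mean: float, std: float) -> float:
    """Log-density of a normal distribution N(mean, std^2) at x."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def weighted_nb_log_score(x, prior, means, stds, weights) -> float:
    """Class score: log P(c) + sum_i w_i * log p(x_i | c).

    With all weights equal to 1 this reduces to standard Gaussian NB;
    a weight of 0 removes the feature from the decision entirely.
    """
    score = math.log(prior)
    for xi, mean, std, weight in zip(x, means, stds, weights):
        score += weight * gaussian_log_pdf(xi, mean, std)
    return score


# Hypothetical module with two metrics, scored against two classes.
x = [12.0, 3.0]
weights = [1.5, 0.5]  # assumed feature weights
defective = weighted_nb_log_score(x, 0.3, [15.0, 4.0], [5.0, 2.0], weights)
clean = weighted_nb_log_score(x, 0.7, [6.0, 2.0], [3.0, 1.0], weights)
print("defective" if defective > clean else "clean")
```

Replacing `gaussian_log_pdf` with another density estimator would change only one function, which mirrors how the abstract treats the weighting and the density assumption as two separate fixes.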

11.

Feature subset selection for dynamic naive Bayesian network classifiers

Yu Minjie (余民杰) Wang Shuangcheng (王双成) Du Ruijie (杜瑞杰) 《Computer Applications and Software》(计算机应用与软件) 2012,(2):57-59

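Classification accuracy is the most important performance measure of a classifier, and feature subset selection is an effective way to improve it. Existing feature subset selection methods mainly target static classifiers, and research on feature subset selection for dynamic classifiers is lacking. This paper first presents a dynamic naive Bayesian network classifier with continuous attributes and an evaluation criterion for dynamic classification accuracy, then builds on these to establish a feature subset selection method for dynamic naive Bayesian network classifiers, and carries out experiments and analysis on real macroeconomic time-series data.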

12.

Credit scoring with a data mining approach based on support vector machines

《Expert systems with applications》2008,34(4):847-856

The credit card industry has been growing rapidly recently, and thus huge numbers of consumers’ credit data are collected by the credit department of the bank. The credit scoring manager often evaluates the consumer’s credit with intuitive experience. However, with the support of the credit classification model, the manager can accurately evaluate the applicant’s credit score. Support Vector Machine (SVM) classification is currently an active research area and successfully solves classification problems in many domains. This study used three strategies to construct the hybrid SVM-based credit scoring models to evaluate the applicant’s credit score from the applicant’s input features. Two credit datasets in the UCI database are selected as the experimental data to demonstrate the accuracy of the SVM classifier. Compared with neural networks, genetic programming, and decision tree classifiers, the SVM classifier achieved an identical classificatory accuracy with relatively few input features. Additionally, by combining genetic algorithms with the SVM classifier, the proposed hybrid GA-SVM strategy can simultaneously perform the feature selection task and model parameter optimization. Experimental results show that SVM is a promising addition to the existing data mining methods.

13.

Development of Efficient Classification Systems for the Diagnosis of Melanoma

S. Palpandi T. Meeradevi 《Computer Systems Science and Engineering》(计算机系统科学与工程) 2022,42(1):361-371

Skin cancer is usually classified as melanoma and non-melanoma. Melanoma now accounts for 75% of skin cancer deaths worldwide and is one of the most aggressive types of cancer. Previous studies did not focus mainly on feature extraction for melanoma, which limited classification accuracy. In this work, histogram of oriented gradients and local binary patterns feature extraction procedures are used to extract important features such as asymmetry, symmetry, boundary irregularity, color, and diameter from both melanoma and non-melanoma images. The proposed Efficient Classification Systems for the Diagnosis of Melanoma (ECSDM) framework consists of stages such as preprocessing, segmentation, feature extraction, and classification. We used Machine Learning (ML) and Deep Learning (DL) classifiers in the classification framework. The ML classifiers are Naïve Bayes (NB) and Support Vector Machines (SVM). In addition, a DL classification framework based on the Convolution Neural Network (CNN) is used to classify the melanoma and benign images. The results show that the Neural Network (NNET) classifier achieves 97.17% accuracy when compared with the ML classifiers.

14.

A new hybrid ensemble credit scoring model based on classifiers consensus system approach

《Expert systems with applications》2016

During the last few years there has been marked attention towards hybrid and ensemble systems development, having proved their ability to be more accurate than single classifier models. However, among the hybrid and ensemble models developed in the literature there has been little consideration given to: 1) combining data filtering and feature selection methods; 2) combining classifiers of different algorithms; and 3) exploring different classifier output combination techniques other than the traditional ones found in the literature. In this paper, the aim is to improve predictive performance by presenting a new hybrid ensemble credit scoring model through the combination of two data pre-processing methods based on Gabriel Neighbourhood Graph editing (GNG) and Multivariate Adaptive Regression Splines (MARS) in the hybrid modelling phase. In addition, a new classifier combination rule based on the consensus approach (ConsA) of different classification algorithms during the ensemble modelling phase is proposed. Several comparisons will be carried out in this paper, as follows: 1) Comparison of individual base classifiers with the GNG and MARS methods applied separately and combined in order to choose the best results for the ensemble modelling phase; 2) Comparison of the proposed approach with all the base classifiers and ensemble classifiers with the traditional combination methods; and 3) Comparison of the proposed approach with recent related studies in the literature. Five well-known base classifiers are used, namely, neural networks (NN), support vector machines (SVM), random forests (RF), decision trees (DT), and naïve Bayes (NB). The experimental results, analysis and statistical tests prove the ability of the proposed approach to improve prediction performance against all the base classifiers, hybrid and the traditional combination methods in terms of average accuracy, the area under the curve (AUC), H-measure and the Brier Score. The model was validated over seven real world credit datasets.

15.

Associated evolution of a support vector machine-based classifier for pedestrian detection

X.B. Cao Y.W. Xu D. Chen H. Qiao 《Information Sciences》2009,179(8):1070-4877

Support vector machine (SVM) has become a dominant classification technique used in pedestrian detection systems. In such systems, classifiers are used to detect pedestrians in some input frames. The performance of an SVM classifier is mainly influenced by two factors: the selected features and the parameters of the kernel function. These two factors are highly related, and it is therefore desirable to analyze them simultaneously, which is usually not the case in previous work. In this paper, we propose an evolutionary method to simultaneously optimize the feature set and the parameters for the SVM classifier. Specifically, adaptive genetic operators were designed to be suitable for the feature selection and parameter tuning. The proposed method is used to train an SVM classifier for pedestrian detection. Experiments in real city traffic scenes show that the proposed approach leads to higher detection accuracy and shorter detection time.

16.

Using DragPushing to Refine Concept Index for Text Categorization

Xueqi Cheng Songbo Tan and Lilian Tang 《Journal of Computer Science and Technology》(计算机科学技术学报) 2006,21(4):592-596

Concept index (CI) is a very fast and efficient feature extraction (FE) algorithm for text classification. The key approach in the CI scheme is to express each document as a function of various concepts (centroids) present in the collection. However, the representative ability of centroids for categorizing a corpus is often influenced by so-called model misfit caused by a number of factors in the FE process, ranging from feature selection to similarity measure. In order to address this issue, this work employs the "DragPushing" strategy to refine the centroids that are used for concept index. We present an extensive experimental evaluation of refined concept index (RCI) on two English collections and one Chinese corpus using a state-of-the-art Support Vector Machine (SVM) classifier. The results indicate that in each case, RCI-based SVM yields a much better performance than the normal CI-based SVM but at lower computation cost during the training and classification phases.

17.

An improved method of early diagnosis of smoking-induced respiratory changes using machine learning algorithms

Jorge L.M. Amaral Agnaldo J. Lopes José M. Jansen Alvaro C.D. Faria Pedro L. Melo 《Computer methods and programs in biomedicine》2013

The purpose of this study was to develop an automatic classifier to increase the accuracy of the forced oscillation technique (FOT) for diagnosing early respiratory abnormalities in smoking patients. The data consisted of FOT parameters obtained from 56 volunteers, 28 healthy and 28 smokers with low tobacco consumption. Many supervised learning techniques were investigated, including logistic linear classifiers, k nearest neighbor (KNN), neural networks and support vector machines (SVM). To evaluate performance, the ROC curve of the most accurate parameter was established as baseline. To determine the best input features and classifier parameters, we used genetic algorithms and a 10-fold cross-validation using the average area under the ROC curve (AUC). In the first experiment, the original FOT parameters were used as input. We observed a significant improvement in accuracy (KNN = 0.89 and SVM = 0.87) compared with the baseline (0.77). The second experiment performed a feature selection on the original FOT parameters. This selection did not cause any significant improvement in accuracy, but it was useful in identifying more adequate FOT parameters. In the third experiment, we performed a feature selection on the cross products of the FOT parameters. This selection resulted in a further increase in AUC (KNN = SVM = 0.91), which allows for high diagnostic accuracy. In conclusion, machine learning classifiers can help identify early smoking-induced respiratory alterations. The use of FOT cross products and the search for the best features and classifier parameters can markedly improve the performance of machine learning classifiers.
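The area under the ROC curve used as the selection criterion above equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. A minimal sketch of that pairwise definition follows; the function name and the scores are hypothetical and are not data from the study.

```python
def pairwise_auc(pos_scores: list[float], neg_scores: list[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This is the Mann-Whitney U
    statistic divided by the number of pairs; it runs in O(n * m) time,
    which is fine for small samples.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one score in each group")
    correct = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos_scores) * len(neg_scores))


# Hypothetical classifier outputs for one validation fold.
smokers = [0.9, 0.8, 0.4]
healthy = [0.7, 0.3]
print(pairwise_auc(smokers, healthy))
```

In a 10-fold cross-validation such as the one described, this value would be computed once per held-out fold and then averaged across the ten folds.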

18.

Design of pattern recognition system for static security assessment and classification

S. Kalyani K. Shanti Swarup 《Pattern Analysis & Applications》2012,15(3):299-311

Static security analysis is an important study carried out in the control centers of electric utilities. Static security assessment (SSA) is the process of determining whether the current operational state is in a secure or emergency (insecure) state. The conventional method of security evaluation involves performing continuous load flow analysis, which is highly time consuming and infeasible for real-time applications. This led to the application of the pattern recognition (PR) approach for static security analysis. This paper presents a more efficient design of a PR system suitable for on-line SSA. The feature selection stage in the PR system uses many algorithms to select the optimal feature set. This paper proposes the use of Support Vector Machine (SVM), a recently introduced machine learning tool, in the classifier design stage of the PR system. The developed PR system is implemented in IEEE standard test systems for SSA and classification. The performance of the SVM classifier is compared with the conventional K-nearest neighbor, method of least squares and neural network classifiers. Simulation results show that the SVM-PR classifier outperforms other equivalent classifier algorithms, giving high classification accuracy and a lower misclassification rate. The feasibility of the SVM-PR classifier for the on-line security assessment process is also presented.

19.

A feature selection enabled hybrid‐bagging algorithm for credit risk evaluation

Shashi Dahiya S.S. Handa N.P. Singh 《Expert Systems》2017,34(6)

Hybrid models based on feature selection and machine learning techniques have significantly enhanced the accuracy of standalone models. This paper presents a feature selection‐based hybrid‐bagging algorithm (FS‐HB) for improved credit risk evaluation. Two feature selection methods, chi‐square and principal component analysis, were used for ranking and selecting the important features from the datasets. The classifiers were built on 5 training and test data partitions of the input data set. The performance of the hybrid algorithm was compared with that of the standalone classifiers: feature selection‐based classifiers and bagging. The hybrid FS‐HB algorithm performed best for the qualitative dataset with fewer features and a tree‐based unstable base classifier. Its performance on numeric data was also better than that of the other standalone classifiers and comparable to that of bagging with only selected features. Its performance was better on the 70:30 data partition, and the type II error, which is very significant in risk evaluation, was also reduced significantly. The improved performance of FS‐HB is attributed to the important features used for developing the classifier, thereby reducing the complexity of the algorithm, and to the use of ensemble methodology, which added to the classical bias variance trade‐off and performed better than standalone classifiers.

20.

Towards a comprehensive evaluation of V-I-S sub-pixel fractions and land surface temperature for urban land-use classification in the USA

Quan Tang Lei Wang Bin Li Jaehyung Yu 《International journal of remote sensing》2013,34(19):5996-6019

Remote-sensing image classification based on the vegetation–impervious surface–soil (V-I-S) model and land-surface temperature (LST) has proved to be more efficient in characterizing the urban landscape than conventional spectral-based classification. However, current literature emphasizes discussion of the classifier's accuracy improvement achieved by the input of V-I-S fractions and LST over conventional spectral-based classification while ignoring the stability evaluation. Hence, this study proposes an evaluation framework for exploring the superiority of the input features and the stability of classifiers by integrating statistical randomization techniques and a kappa-error diagram. The evaluation framework was applied to case studies for demonstrating the superiority of V-I-S fractions and LST in the context of urban land-use classification with five different types of classifiers, including the maximum likelihood classifier (MLC), the tree classifier, the Bagging classifier, the random forest (RF) and the support vector machine (SVM). It followed that the use of V-I-S fractions and LST (1) could alleviate the ‘salt and pepper’ effect; (2) is preferred by tree and tree-based ensembles for branch splitting; (3) could produce classification trees with less complexity; (4) could benefit the stability of classifiers in addition to the accuracy improvement; and (5) could allow histograms following nearly normal distribution in its feature space, which boosts the performance of MLC. It is shown that MLC becomes comparable with modern classifiers when trained with V-I-S fractions and LST combination. Because of its adequacy and simplicity, MLC is recommended for urban land-use classification when V-I-S fractions and LST are used as the only input features. However, replacing them with, or including, the band reflectance might degrade MLC. A direct use of spectral band reflectance is not recommended for any of the classification approaches being considered in this study, except for SVM, which is the most robust classifier as it has a consistently high performance for all the input feature combinations. We recommend using tree-based ensemble classifiers or SVM when V-I-S fractions and LST as well as the band reflectance are all used in the classification. The proposed evaluation framework can also be applied to the assessment of input features and classifiers in other remote-sensing classification endeavours.