
1.

Toward an efficient and scalable feature selection approach for internet traffic classification

Adil Fahad Zahir Tari Ibrahim Khalil Ibrahim Habib Hussein Alnuweiri 《Computer Networks》2013,57(9):2040-2057

There is significant interest in the network management and industrial security community in identifying the “best” and most relevant features of network traffic in order to properly characterize user behaviour and predict future traffic. The ability to eliminate redundant features is an important Machine Learning (ML) task because it helps to identify the best features, which improves classification accuracy and reduces the computational complexity of constructing the classifier. In practice, feature selection (FS) techniques can be used as a preprocessing step to eliminate irrelevant features and as a knowledge discovery tool to reveal the “best” features in many soft computing applications. In this paper, we investigate the advantages and disadvantages of such FS techniques with newly proposed metrics (namely goodness, stability and similarity). We continue our efforts toward developing an integrated FS technique that is built on the key strengths of existing FS techniques. A novel way is proposed to identify the “best” features efficiently and accurately by first combining the results of some well-known FS techniques to find consistent features, and then using the proposed concept of support to select the smallest set of features that optimally covers the data. The empirical study over ten high-dimensional network traffic data sets demonstrates a significant gain in accuracy and improved run-time performance of a classifier compared to the individual results produced by some well-known FS techniques.
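The idea of combining several FS techniques to find consistent features can be illustrated with a simple vote count. This is only a minimal sketch: the abstract does not define the paper's "support" concept, so the `min_support` threshold, the function name `consensus_features` and the feature names below are all hypothetical stand-ins.

```python
from collections import Counter


def consensus_features(selections, min_support):
    """Return features chosen by at least `min_support` of the selectors.

    `selections` holds one collection of feature names per FS technique
    (hypothetical inputs; not the paper's definition of support).
    """
    votes = Counter()
    for selected in selections:
        votes.update(set(selected))  # each technique votes once per feature
    # Order by descending vote count, then by name for a stable result.
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, count in ranked if count >= min_support]


# Hypothetical outputs of three selectors on flow-level traffic features.
selections = [
    ["pkt_size_mean", "duration", "port_dst"],
    ["pkt_size_mean", "port_dst", "iat_var"],
    ["pkt_size_mean", "duration", "iat_var"],
]
print(consensus_features(selections, min_support=3))  # ['pkt_size_mean']
print(consensus_features(selections, min_support=2))
```

Raising `min_support` trades coverage for agreement: only features that every technique selects survive the strictest setting.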

2.

Higher order feature selection for text classification

Jan Bakus Mohamed S. Kamel 《Knowledge and Information Systems》2006,9(4):468-491

In this paper, we present the MIFS-C variant of the mutual information feature-selection algorithms. We present an algorithm to find the optimal value of the redundancy parameter, which is a key parameter in the MIFS-type algorithms. Furthermore, we present an algorithm that speeds up the execution time of all the MIFS variants. Overall, the presented MIFS-C has comparable classification accuracy (in some cases even better) compared with other MIFS algorithms, while its running time is faster. We compared this feature selector with other feature selectors, and found that it performs better in most cases. The MIFS-C performed especially well for the breakeven and F-measure because the algorithm can be tuned to optimise these evaluation measures.

Jan Bakus received the B.A.Sc. and M.A.Sc. degrees in electrical engineering from the University of Waterloo, Waterloo, ON, Canada, in 1996 and 1998, respectively, and the Ph.D. degree in systems design engineering in 2005. He is currently working at Maplesoft, Waterloo, ON, Canada as an applications engineer, where he is responsible for the development of application-specific toolboxes for the Maple scientific computing software. His research interests are in the area of feature selection for text classification, text classification, text clustering, and information retrieval. He is the recipient of the Carl Pollock Fellowship award from the University of Waterloo and the Datatel Scholars Foundation scholarship from Datatel.

Mohamed S. Kamel holds a Ph.D. in computer science from the University of Toronto, Canada. He is at present Professor and Director of the Pattern Analysis and Machine Intelligence Laboratory in the Department of Electrical and Computer Engineering, University of Waterloo, Canada. Professor Kamel holds a Canada Research Chair in Cooperative Intelligent Systems. Dr. Kamel's research interests are in machine intelligence, neural networks and pattern recognition with applications in robotics and manufacturing. He has authored and coauthored over 200 papers in journals and conference proceedings, 2 patents and numerous technical and industrial project reports. Under his supervision, 53 Ph.D. and M.A.Sc. students have completed their degrees. Dr. Kamel is a member of ACM, AAAI, CIPS and APEO and has been named a Fellow of IEEE (2005). He is the editor-in-chief of the International Journal of Robotics and Automation, Associate Editor of the IEEE SMC, Part A, the International Journal of Image and Graphics, Pattern Recognition Letters and is a member of the editorial board of Intelligent Automation and Soft Computing. He has served as a consultant to many companies, including NCR, IBM, Nortel, VRP and CSA. He is a member of the board of directors and cofounder of Virtek Vision International in Waterloo.
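For readers unfamiliar with the MIFS family, the role of the redundancy parameter can be sketched with the basic greedy MIFS criterion, which scores a candidate feature by its mutual information with the class minus `beta` times its summed mutual information with the features already selected. The sketch below is an assumption-laden illustration of that basic criterion only: it is not the MIFS-C variant, it does not include the paper's parameter-tuning or speed-up algorithms, and the toy data and names (`mifs`, `beta`) are made up.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mifs(features, labels, k, beta):
    """Greedily pick `k` feature names with the basic MIFS criterion.

    `features` maps a feature name to its list of discrete values.
    """
    relevance = {f: mutual_information(v, labels) for f, v in features.items()}
    selected, remaining = [], set(features)
    while remaining and len(selected) < k:
        def score(f):
            redundancy = sum(
                mutual_information(features[f], features[s]) for s in selected
            )
            return relevance[f] - beta * redundancy
        best = max(sorted(remaining), key=score)  # sorted() makes ties stable
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: "a_dup" duplicates "a"; "b" is weakly relevant but independent of "a".
features = {
    "a":     [0, 0, 0, 0, 1, 1, 1, 1],
    "a_dup": [0, 0, 0, 0, 1, 1, 1, 1],
    "b":     [0, 0, 1, 1, 0, 0, 1, 1],
}
labels = [0, 0, 0, 1, 1, 1, 1, 1]
print(mifs(features, labels, k=2, beta=0.0))  # ['a', 'a_dup']
print(mifs(features, labels, k=2, beta=1.0))  # ['a', 'b']
```

With `beta=0` the duplicate feature is picked because only relevance counts; with `beta=1` the redundancy penalty pushes the selector toward the complementary feature, which is why the choice of this parameter matters.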

3.

Neighborhood based sample and feature selection for SVM classification learning (Total citations: 3; self-citations: 0; citations by others: 3)

Qiang He Zongxia Xie Qinghua Hu Congxin Wu 《Neurocomputing》2011,74(10):1585-1594

Support vector machines (SVMs) are a popular class of classification algorithms because of their high generalization ability. However, it is time-consuming to train SVMs with a large set of learning samples. Improving learning efficiency is one of the most important research tasks on SVMs. It is known that although there are many candidate training samples in some learning tasks, only the samples near the decision boundary, which are called support vectors, have an impact on the optimal classification hyperplanes. Finding these samples and training SVMs with them will greatly decrease training time and space complexity. Based on this observation, we introduce a neighborhood-based rough set model to search for boundary samples. Using the model, we first divide the sample space into three subsets: positive region, boundary and noise. Furthermore, we partition the input features into four subsets: strongly relevant features, weakly relevant and indispensable features, weakly relevant and superfluous features, and irrelevant features. Then we train SVMs only with the boundary samples in the relevant and indispensable feature subspaces, so feature and sample selection are conducted simultaneously with the proposed model. A set of experimental results shows that the model can select very few features and samples for training; in the meantime, classification performance is preserved or even improved.

4.

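Research on traffic features in Internet traffic classification

Liu Zhen Wang Ruoyu Cai Xianfa Tang Deyu 《Application Research of Computers》2017,34(1)

Internet traffic features are used to describe and measure network traffic and are an important basis for traffic classification. To analyze Internet traffic features systematically, this paper first studies how to categorize traffic features according to the statistical object or statistical perspective, and then reviews each category of features. Regarding the stability of traffic features, it analyzes the impact of packet sampling, network environment and obfuscation techniques on traffic features. It evaluates the strengths and weaknesses of traffic features in terms of classification ability, stability, timeliness and classification granularity, offering guidance for applying statistical traffic features. Finally, it summarizes future research directions for traffic features.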

5.

Distributed feature selection: An application to microarray data classification

《Applied Soft Computing》2015

Feature selection is often required as a preliminary step for many pattern recognition problems. However, most of the existing algorithms only work in a centralized fashion, i.e. using the whole dataset at once. In this research a new method for distributing the feature selection process is proposed. It distributes the data by features, i.e. according to a vertical distribution, and then performs a merging procedure which updates the feature subset according to improvements in the classification accuracy. The effectiveness of our proposal is tested on microarray data, which poses a difficult challenge for researchers due to the large number of gene expressions it contains and its small sample size. The results on eight microarray datasets show that the execution time is considerably shortened whereas the performance is maintained or even improved compared to the standard algorithms applied to the non-partitioned datasets.

6.

An adaptive feature fusion framework for multi-class classification based on SVM

Peipei Yin Fuchun Sun Chao Wang Huaping Liu 《Soft Computing - A Fusion of Foundations, Methodologies and Applications》2008,12(7):685-691

An adaptive feature fusion framework is proposed for multi-class classification based on SVM. In a manner similar to one-versus-all (OVA), one of the multi-class SVM schemes, the proposed approach decomposes a multi-class classification into several binary classifications. The main difference is that each classifier is created with the feature vectors most suitable for discriminating one class from all the other classes. The feature vectors of the unknown samples are selected by each classifier adaptively such that recognition is fulfilled accordingly. In addition, novel evaluation criteria are defined to deal with the frequent small-sample problem. A writer recognition experiment is carried out to evaluate this framework with three kinds of feature vectors: texture, structure and morphological features. Finally, the performance of the proposed approach is compared with that of OVA applied with the same feature vectors for all classes.

7.

Evolutionary feature and instance selection for traffic sign recognition

《Computers in Industry》2015

The problem of traffic sign recognition is generally approached by first constructing a classifier, which is trained on relevant image features extracted from traffic signs, to recognize new unknown traffic signs. Feature selection and instance selection are two important data preprocessing steps in data mining, with the former aimed at removing irrelevant and/or redundant features from a given dataset and the latter at discarding faulty data. However, there has thus far been no study examining the impact of performing feature and instance selection on traffic sign recognition performance. Given that genetic algorithms (GA) have been widely used for these types of data preprocessing tasks in related studies, we introduce a novel genetic-based biological algorithm (GBA). GBA fits “biological evolution” into the evolutionary process, where the most streamlined process also complies with reasonable rules. In other words, after long-term evolution, organisms find the most efficient way to allocate resources and evolve. Similarly, we closely simulate the natural evolution of an algorithm to find an option that will be both efficient and effective. Experiments are carried out comparing the performance of the GBA and a GA based on the German Traffic Sign Recognition Benchmark. The results show that the GBA outperforms the GA in terms of the reduction rate, classification accuracy, and computational cost.

8.

Information-theoretic feature selection for functional data classification

Vanessa Michel Jérôme 《Neurocomputing》2009,72(16-18):3580

The classification of functional or high-dimensional data requires selecting a reduced subset of features from the initial set, both to help fight the curse of dimensionality and to help interpret the problem and the model. The mutual information criterion may be used in that context, but it is difficult to estimate from a finite set of samples. Efficient estimators are not designed specifically to be applied in a classification context, and thus suffer from further drawbacks and difficulties. This paper presents an estimator of mutual information that is specifically designed for classification tasks, including multi-class ones. It is combined with a recently published stopping criterion in a traditional forward feature selection procedure. Experiments on both traditional benchmarks and an industrial functional classification problem show the added value of this estimator.

9.

An expert model for self-care problems classification using probabilistic neural network and feature selection approach

《Applied Soft Computing》2019

Self-care problems classification is one of the important challenges for occupational therapists. The extent and variety of disorders make the self-care problems classification process complex and time-consuming. To overcome this challenge, an expert model is proposed in this research. The proposed model is based on a Probabilistic Neural Network (PNN) and a Genetic Algorithm (GA) for classifying self-care problems of children with physical and motor disability. In this model, the PNN is employed as a classifier and the GA is applied for feature selection. The PNN is trained using a standard ICF-CY dataset. Based on ICF-CY, occupational therapists must evaluate many features to diagnose self-care problems. According to the experience of occupational therapists, these features have different effects on classification. Hence, the GA is employed to select relevant and important features in self-care problems classification. Since the classification rules are important for occupational therapists, the self-care problems classification rules are additionally extracted using the CART algorithm. The experimental results show that, with the feature selection algorithm, the accuracy and time complexity of classification are improved in comparison to other models. The proposed model can classify self-care problems of children with 94.28% accuracy using only 16.5% of all features.

10.

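Semi-supervised network traffic feature selection algorithm based on class label extension

Lin Rongqiang Li Ou Li Qing Li Linlin 《Journal of Computer Applications》2014,34(11):3206-3209

To address the sample-labeling bottleneck in network traffic feature selection, and the inability of existing semi-supervised methods to select strongly relevant features, a multi-class semi-supervised feature selection algorithm based on class label extension (SFSEL) is proposed. Starting from a small number of labeled samples, the algorithm first extends class labels to the unlabeled samples using the K-means algorithm; it then performs feature selection on the multi-class data using a doubly regularized support vector machine (MDrSVM) algorithm. In comparative experiments on the Moore dataset against the semi-supervised feature selection algorithms Spectral, PCFRSC and SEFR, SFSEL achieved clearly higher classification accuracy and recall than the other algorithms while selecting clearly fewer features. The experimental results show that SFSEL effectively improves the relevance of the selected features and yields better network traffic classification performance.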

11.

An improved boosting based on feature selection for corporate bankruptcy prediction

《Expert systems with applications》2014,41(5):2353-2361

With the recent financial crisis and European debt crisis, corporate bankruptcy prediction has become an increasingly important issue for financial institutions. Many statistical and intelligent methods have been proposed; however, no overall best method has emerged for predicting corporate bankruptcy. Recent studies suggest ensemble learning methods may have potential applicability in corporate bankruptcy prediction. In this paper, a new and improved Boosting method, FS-Boosting, is proposed to predict corporate bankruptcy. By injecting a feature selection strategy into Boosting, FS-Boosting can achieve better performance because its base learners gain accuracy and diversity. For testing and illustration purposes, two real-world bankruptcy datasets were selected to demonstrate the effectiveness and feasibility of FS-Boosting. Experimental results reveal that FS-Boosting could be used as an alternative method for corporate bankruptcy prediction.

12.

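A heuristic feature selection algorithm based on attribute significance

Sun Xingbo Yang Pingxian Gan Shuchuan 《Automation & Instrumentation》2005,(5):13-14,17

Databases usually contain many redundant features; finding the important features is called feature extraction. This paper proposes a heuristic feature selection algorithm based on attribute significance. The algorithm uses attribute significance as the iteration criterion to obtain a minimal reduct of the attribute set.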

13.

An incremental feature selection approach based on scatter matrices for classification of cancer microarray data

Manju Sardana R.K. Agrawal Baljeet Kaur 《International Journal of Computer Mathematics》2015,92(2):277-295

Microarray data are often characterized by high dimension and small sample size. There is a need to reduce its dimension for better classification performance and computational efficiency of the learning model. The minimum redundancy and maximum relevance (mRMR), which is widely explored to reduce the dimension of the data, requires discretization and setting of external parameters. We propose an incremental formulation of the trace of ratio of the scatter matrices to determine a relevant set of genes which does not involve discretization and external parameter setting. It is analytically shown that the proposed incremental formulation is computationally efficient in comparison to its batch formulation. Extensive experiments on 14 well-known available microarray cancer datasets demonstrate that the performance of the proposed method is better in comparison to the well-known mRMR method. Statistical tests also show that the proposed method is significantly better when compared to the mRMR method.

14.

Evolutionary feature subsets selection based on interaction information for high dimensional imbalanced data classification

《Applied Soft Computing》2019

With the advent of technology in various scientific fields, high dimensional data are becoming abundant. A general approach to tackle the resulting challenges is to reduce data dimensionality through feature selection. Traditional feature selection approaches concentrate on selecting relevant features and ignoring irrelevant or redundant ones. However, most of these approaches neglect feature interactions. On the other hand, some datasets have imbalanced classes, which may result in biases towards the majority class. The main goal of this paper is to propose a novel feature selection method based on the interaction information (II) to provide higher level interaction analysis and improve the search procedure in the feature space. In this regard, an evolutionary feature subset selection algorithm based on interaction information is proposed, which consists of three stages. At the first stage, candidate features and candidate feature pairs are identified using traditional feature weighting approaches such as symmetric uncertainty (SU) and bivariate interaction information. In the second phase, candidate feature subsets are formed and evaluated using multivariate interaction information. Finally, the best candidate feature subsets are selected using dominant/dominated relationships. The proposed algorithm is compared with some other feature selection algorithms including mRMR, WJMI, IWFS, IGFS, DCSF, IWFS, K_OFSD, WFLNS, Information Gain and ReliefF in terms of the number of selected features, classification accuracy, F-measure and algorithm stability using three different classifiers, namely KNN, NB, and CART. The results justify the improvement of classification accuracy and the robustness of the proposed method in comparison with the other approaches.

15.

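Semi-supervised network traffic feature selection algorithm based on pairwise constraint extension

Li Pinghong Wang Yong Tao Xiaoling 《Transducer and Microsystem Technologies》2013,32(5)

To address the lack of supervision information in network traffic feature selection, a semi-supervised network traffic feature selection algorithm based on pairwise constraint extension is proposed. The algorithm considers a small number of pairwise constraints together with a large number of unlabeled samples, and uses the correlation and autocorrelation among sample sets to extend the pairwise constraint set to the unlabeled samples, producing more highly reliable pairwise constraints that reveal the distribution of the sample space. Finally, the extended pairwise constraint set is used for feature selection. Experiments show that, compared with algorithms without pairwise constraint extension, the proposed algorithm achieves better classification performance with only a few initial pairwise constraints.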

16.

An Exponential Monte-Carlo algorithm for feature selection problems

《Computers & Industrial Engineering》2014


17.

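Optimal route selection method based on a dynamic traffic simulation model* (Total citations: 1; self-citations: 0; citations by others: 1)

Yu Yanfang Lu Jun 《Application Research of Computers》2010,27(5):1662-1664

A dynamic traffic simulation platform was built using the dynamic traffic simulation model INTEGRATION, and a component-based ant colony algorithm was applied to solve the optimal route selection problem under dynamic traffic information guidance. A case study shows that the optimal route selection method based on the dynamic traffic simulation model is feasible, correct and effective. The method is easy to understand and use, is highly reusable and extensible, and provides a sustainable framework for solving various optimization problems.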

18.

Combined VMD-SVM based feature selection method for classification of power quality events

《Applied Soft Computing》2016

Power quality (PQ) issues have become more important than before due to increased use of sensitive electrical loads. In this paper, a new hybrid algorithm is presented for PQ disturbances detection in electrical power systems. The proposed method is constructed based on four main steps: simulation of PQ events, extraction of features, selection of dominant features, and classification of selected features. By using two powerful signal processing tools, i.e. variational mode decomposition (VMD) and S-transform (ST), some potential features are extracted from different PQ events. VMD as a new tool decomposes signals into different modes and ST also analyzes signals in both time and frequency domains. In order to avoid large dimension of feature vector and obtain a detection scheme with optimum structure, sequential forward selection (SFS) and sequential backward selection (SBS) as wrapper based methods and Gram–Schmidt orthogonalization (GSO) based feature selection method as filter based method are used for elimination of redundant features. In the next step, PQ events are discriminated by support vector machines (SVMs) as classifier core. Obtained results of the extensive tests prove the satisfactory performance of the proposed method in terms of speed and accuracy even in noisy conditions. Moreover, the start and end points of PQ events can be detected with high precision.
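Sequential forward selection, one of the wrapper methods named in this abstract, can be sketched as a greedy loop that keeps adding the single feature that most improves a classifier's score. The sketch below is a generic illustration under stated assumptions: it swaps in a leave-one-out 1-nearest-neighbour score instead of the SVM used in the paper, and the data and function names are hypothetical.

```python
def loo_1nn_accuracy(rows, labels, cols):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier on `cols`."""
    correct = 0
    for i, row in enumerate(rows):
        nearest = min(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: sum((row[c] - rows[j][c]) ** 2 for c in cols),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(rows)


def sequential_forward_selection(rows, labels, score=loo_1nn_accuracy):
    """Greedily add the column that most improves `score`; stop when none does."""
    selected, best = [], 0.0
    candidates = list(range(len(rows[0])))
    while candidates:
        trial = {c: score(rows, labels, selected + [c]) for c in candidates}
        top = max(candidates, key=trial.get)
        if trial[top] <= best:
            break  # no remaining feature improves the wrapper score
        selected.append(top)
        candidates.remove(top)
        best = trial[top]
    return selected, best


# Toy data: column 0 separates the classes, column 1 is misleading.
rows = [
    [0.0, 5.0], [0.1, 1.0], [0.2, 9.0],
    [1.0, 5.1], [1.1, 1.1], [1.2, 9.1],
]
labels = [0, 0, 0, 1, 1, 1]
print(sequential_forward_selection(rows, labels))  # ([0], 1.0)
```

Sequential backward selection runs the same loop in reverse, starting from all features and removing the one whose removal hurts the score least.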

19.

Building lightweight intrusion detection system using wrapper-based feature selection mechanisms

Yang Jun-Li Wang Zhi-Hong Tian Tian-Bo Lu Chen Young 《Computers & Security》2009,28(6):466-475

An Intrusion Detection System (IDS) is an important and necessary component in ensuring network security and protecting network resources and network infrastructures. How to build a lightweight IDS is a hot topic in network security. Moreover, feature selection is a classic research topic in data mining and it has attracted much interest from researchers in many fields such as network security, pattern recognition and data mining. In this paper, we introduce feature selection methods to the intrusion detection domain. We propose a wrapper-based feature selection algorithm aimed at building a lightweight intrusion detection system, using modified random mutation hill climbing (RMHC) as the search strategy to specify a candidate subset for evaluation, and a modified linear Support Vector Machines (SVMs) iterative procedure as the wrapper approach to obtain the optimum feature subset. We verify the effectiveness and the feasibility of our feature selection algorithm by several experiments on the KDD Cup 1999 intrusion detection dataset. The experimental results strongly show that our approach is not only able to speed up the process of selecting important features but also to yield high detection rates. Furthermore, our experimental results indicate that an intrusion detection system with the feature selection algorithm performs better than one without it, both in detection performance and computational cost.
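The search strategy mentioned here can be illustrated with plain random mutation hill climbing over a boolean feature mask: flip one random bit, keep the flip if the score does not get worse, otherwise revert it. This is a minimal sketch of the textbook procedure, not the paper's modified RMHC, and `toy_fitness` is a hypothetical stand-in for the SVM-based wrapper score.

```python
import random


def rmhc(n_features, fitness, iterations=500, seed=0):
    """Random mutation hill climbing over boolean feature masks."""
    rng = random.Random(seed)
    mask = [rng.random() < 0.5 for _ in range(n_features)]
    best = fitness(mask)
    for _ in range(iterations):
        i = rng.randrange(n_features)
        mask[i] = not mask[i]        # mutate one randomly chosen bit
        candidate = fitness(mask)
        if candidate >= best:
            best = candidate         # keep the mutation
        else:
            mask[i] = not mask[i]    # revert a harmful mutation
    return mask, best


# Hypothetical fitness: reward two "useful" features, penalise subset size.
USEFUL = {0, 3}


def toy_fitness(mask):
    hits = sum(1 for i, on in enumerate(mask) if on and i in USEFUL)
    return hits - 0.1 * sum(mask)


mask, score = rmhc(n_features=6, fitness=toy_fitness)
print([i for i, on in enumerate(mask) if on], score)
```

In a real wrapper the fitness call would train and validate a classifier on the masked features, so the number of iterations directly bounds the computational cost.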

20.

A unified approach to optimal feature selection

Salvatore D Morgera 《Pattern recognition letters》1983,2(2):61-68

The optimum finite set of linear observables for discriminating two Gaussian stochastic processes is derived using classical methods and distribution function theory. The results offer a new, accurate information-theoretic strategy and are superior to well-known conventional methods using statistical distance measures.