Similar Literature
 20 similar documents found (search time: 15 ms)
1.
There is significant interest in the network management and industrial security communities in identifying the “best” and most relevant features of network traffic in order to properly characterize user behaviour and predict future traffic. Eliminating redundant features is an important Machine Learning (ML) task because it helps to identify the best features, which improves classification accuracy and reduces the computational cost of constructing the classifier. In practice, feature selection (FS) techniques can be used as a preprocessing step to eliminate irrelevant features and as a knowledge discovery tool to reveal the “best” features in many soft computing applications. In this paper, we investigate the advantages and disadvantages of such FS techniques using newly proposed metrics (namely goodness, stability and similarity). We continue our efforts toward developing an integrated FS technique that builds on the key strengths of existing FS techniques. A novel approach is proposed to identify the “best” features efficiently and accurately by first combining the results of some well-known FS techniques to find consistent features, and then using the proposed concept of support to select the smallest set of features that covers data optimality. An empirical study over ten high-dimensional network traffic data sets demonstrates a significant gain in accuracy and improved run-time performance of a classifier compared to the individual results produced by some well-known FS techniques.
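The idea of combining several FS techniques to find consistent features can be illustrated with a small sketch. The abstract does not define its notion of support, so the code below substitutes plain vote counting: a feature is kept when at least `min_votes` selectors chose it. The function name, the feature names and the threshold are all hypothetical.

```python
from collections import Counter

def consistent_features(selections, min_votes):
    """Keep features picked by at least `min_votes` selectors.

    `selections` is a list of feature lists, one per FS technique.
    Vote counting is an assumed stand-in for the paper's "support".
    """
    # Count each feature once per selector, even if a selector lists it twice.
    votes = Counter(f for chosen in selections for f in set(chosen))
    kept = [f for f, v in votes.items() if v >= min_votes]
    # Most-voted features first; ties are broken alphabetically.
    return sorted(kept, key=lambda f: (-votes[f], f))

# Hypothetical outputs of three different FS techniques.
selections = [
    ["pkt_size", "duration", "port"],
    ["pkt_size", "port", "ttl"],
    ["pkt_size", "duration", "flags"],
]
print(consistent_features(selections, min_votes=2))
```

Raising `min_votes` shrinks the result toward the features every technique agrees on, which is one simple way to trade coverage for compactness.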

2.
In this paper, we present the MIFS-C variant of the mutual information feature-selection algorithms. We present an algorithm to find the optimal value of the redundancy parameter, which is a key parameter in the MIFS-type algorithms. Furthermore, we present an algorithm that speeds up the execution of all the MIFS variants. Overall, the presented MIFS-C has classification accuracy comparable to (in some cases even better than) that of other MIFS algorithms, while running faster. We compared this feature selector with other feature selectors and found that it performs better in most cases. MIFS-C performed especially well on the breakeven and F-measure because the algorithm can be tuned to optimise these evaluation measures. Jan Bakus received the B.A.Sc. and M.A.Sc. degrees in electrical engineering from the University of Waterloo, Waterloo, ON, Canada, in 1996 and 1998, respectively, and the Ph.D. degree in systems design engineering in 2005. He is currently working at Maplesoft, Waterloo, ON, Canada as an applications engineer, where he is responsible for the development of application-specific toolboxes for the Maple scientific computing software. His research interests are in the area of feature selection for text classification, text classification, text clustering, and information retrieval. He is the recipient of the Carl Pollock Fellowship award from the University of Waterloo and the Datatel Scholars Foundation scholarship from Datatel. Mohamed S. Kamel holds a Ph.D. in computer science from the University of Toronto, Canada. He is at present Professor and Director of the Pattern Analysis and Machine Intelligence Laboratory in the Department of Electrical and Computing Engineering, University of Waterloo, Canada. Professor Kamel holds a Canada Research Chair in Cooperative Intelligent Systems. Dr. Kamel's research interests are in machine intelligence, neural networks and pattern recognition with applications in robotics and manufacturing.
He has authored and coauthored over 200 papers in journals and conference proceedings, 2 patents and numerous technical and industrial project reports. Under his supervision, 53 Ph.D. and M.A.Sc. students have completed their degrees. Dr. Kamel is a member of ACM, AAAI, CIPS and APEO and has been named a Fellow of IEEE (2005). He is the editor-in-chief of the International Journal of Robotics and Automation, Associate Editor of the IEEE SMC, Part A, the International Journal of Image and Graphics, Pattern Recognition Letters and is a member of the editorial board of Intelligent Automation and Soft Computing. He has served as a consultant to many companies, including NCR, IBM, Nortel, VRP and CSA. He is a member of the board of directors and cofounder of Virtek Vision International in Waterloo.
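To make the role of the redundancy parameter concrete, here is a minimal sketch of the basic MIFS greedy criterion, which scores a candidate feature f as I(C; f) minus beta times the sum of I(f; s) over already selected features s. This is plain MIFS on discrete toy data, not the MIFS-C variant, whose details the abstract does not give; the data and names are made up for illustration.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Empirical mutual information (in bits) of two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

def mifs(features, labels, k, beta):
    """Greedy MIFS over `features`, a dict mapping name -> discrete column."""
    relevance = {name: mutual_information(col, labels)
                 for name, col in features.items()}
    selected, remaining = [], set(features)

    def score(name):
        # Relevance to the class minus beta-weighted redundancy with picks so far.
        redundancy = sum(mutual_information(features[name], features[s])
                         for s in selected)
        return relevance[name] - beta * redundancy

    while remaining and len(selected) < k:
        # Sorting makes ties deterministic: the alphabetically first name wins.
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: the class is determined by a and b together; a_copy duplicates a.
features = {
    "a":      [0, 0, 0, 0, 1, 1, 1, 1],
    "a_copy": [0, 0, 0, 0, 1, 1, 1, 1],
    "b":      [0, 0, 1, 1, 0, 0, 1, 1],
}
labels = [0, 0, 1, 1, 2, 2, 3, 3]
print(mifs(features, labels, k=2, beta=1.0))
print(mifs(features, labels, k=2, beta=0.0))
```

With `beta=0.0` the redundancy term vanishes and the duplicate column can be picked; with `beta=1.0` it is penalised and the complementary feature is preferred, which is why the choice of beta matters.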

3.
Support vector machines (SVMs) are a popular class of classification algorithms because of their high generalization ability. However, training SVMs on a large set of learning samples is time-consuming, and improving learning efficiency is one of the most important research tasks on SVMs. It is known that although there are many candidate training samples in some learning tasks, only the samples near the decision boundary, called support vectors, affect the optimal classification hyper-planes. Finding these samples and training SVMs with them greatly decreases training time and space complexity. Based on this observation, we introduce a neighborhood-based rough set model to search for boundary samples. Using the model, we first divide the sample space into three subsets: positive region, boundary and noise. Furthermore, we partition the input features into four subsets: strongly relevant features, weakly relevant and indispensable features, weakly relevant and superfluous features, and irrelevant features. We then train SVMs only with the boundary samples in the relevant and indispensable feature subspaces, so that feature and sample selection are conducted simultaneously with the proposed model. A set of experimental results shows that the model can select very few features and samples for training, while classification performance is preserved or even improved.
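A rough illustration of the boundary-sample idea: in a neighborhood rough set, a sample whose delta-neighborhood contains only its own class lies in the positive region, while a sample with mixed-class neighbors lies on the boundary. The sketch below implements only that two-way split; the paper's third subset (noise) and its feature partition are not reproduced because the abstract does not define them. The data, the Euclidean metric and `delta` are assumptions.

```python
from math import dist  # Euclidean distance, available from Python 3.8

def split_by_neighborhood(points, labels, delta):
    """Return (positive, boundary) lists of sample indices."""
    positive, boundary = [], []
    for i, p in enumerate(points):
        # Classes seen within distance delta of p (p itself is included).
        neighbor_labels = {labels[j] for j, q in enumerate(points)
                           if dist(p, q) <= delta}
        if neighbor_labels == {labels[i]}:
            positive.append(i)   # pure neighborhood: far from the boundary
        else:
            boundary.append(i)   # mixed neighborhood: candidate support vector
    return positive, boundary

# Two classes on a line; only the samples facing each other are boundary.
points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (2.0, 0.0), (2.5, 0.0), (3.0, 0.0)]
labels = [0, 0, 0, 1, 1, 1]
positive, boundary = split_by_neighborhood(points, labels, delta=1.2)
print(positive, boundary)
```

Only the `boundary` indices would then be passed to the SVM trainer, which is where the reduction in training cost would come from.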

4.
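Internet traffic features are used to describe and measure network traffic and form an important basis for traffic classification. To analyse Internet traffic features systematically, this survey first studies how to categorize traffic features according to the statistical object or statistical perspective, and then reviews each category of traffic features. Regarding the stability of traffic features, it analyses the impact of packet sampling, the network environment and obfuscation techniques on traffic features. It evaluates the strengths and weaknesses of traffic features in terms of classification capability, stability, timeliness and classification granularity, offering guidance for applying statistical traffic features. Finally, it summarizes future research directions for traffic features.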

5.
Feature selection is often required as a preliminary step for many pattern recognition problems. However, most existing algorithms only work in a centralized fashion, i.e. using the whole dataset at once. In this research a new method for distributing the feature selection process is proposed. It distributes the data by features, i.e. according to a vertical distribution, and then performs a merging procedure that updates the feature subset according to improvements in classification accuracy. The effectiveness of our proposal is tested on microarray data, which poses a difficult challenge for researchers because of the high number of gene expression values it contains and its small sample size. The results on eight microarray datasets show that the execution time is considerably shortened, whereas the performance is maintained or even improved compared to the standard algorithms applied to the non-partitioned datasets.

6.
An adaptive feature fusion framework is proposed for multi-class classification based on SVM. In a manner similar to one-versus-all (OVA), one of the multi-class SVM schemes, the proposed approach decomposes a multi-class classification into several binary classifications. The main difference is that each classifier is created with the feature vectors most suitable for discriminating one class from all the other classes. The feature vectors of unknown samples are selected adaptively by each classifier, and recognition is performed accordingly. In addition, novel evaluation criteria are defined to deal with the frequent small-sample problem. A writer recognition experiment is carried out to demonstrate this framework with three kinds of feature vectors: texture, structure and morphological features. Finally, the performance of the proposed approach is compared with that of OVA applying the same feature vectors to all classes.

7.
The problem of traffic sign recognition is generally approached by first constructing a classifier, trained on relevant image features extracted from traffic signs, to recognize new unknown traffic signs. Feature selection and instance selection are two important data preprocessing steps in data mining, with the former aimed at removing irrelevant and/or redundant features from a given dataset and the latter at discarding faulty data. However, there has thus far been no study examining the impact of performing feature and instance selection on traffic sign recognition performance. Given that genetic algorithms (GA) have been widely used for these types of data preprocessing tasks in related studies, we introduce a novel genetic-based biological algorithm (GBA). GBA fits “biological evolution” into the evolutionary process, where the most streamlined process also complies with reasonable rules. In other words, after long-term evolution, organisms find the most efficient way to allocate resources and evolve. Similarly, we closely simulate natural evolution in the algorithm to find an option that is both efficient and effective. Experiments are carried out comparing the performance of the GBA and a GA on the German Traffic Sign Recognition Benchmark. The results show that the GBA outperforms the GA in terms of reduction rate, classification accuracy, and computational cost.

8.
Vanessa  Michel  Jérôme. Neurocomputing, 2009, 72(16-18): 3580
The classification of functional or high-dimensional data requires selecting a reduced subset of features from the initial set, both to help fight the curse of dimensionality and to help interpret the problem and the model. The mutual information criterion may be used in that context, but it is difficult to estimate from a finite set of samples. Efficient estimators are not designed specifically for a classification context, and thus suffer from further drawbacks and difficulties. This paper presents an estimator of mutual information that is specifically designed for classification tasks, including multi-class ones. It is combined with a recently published stopping criterion in a traditional forward feature selection procedure. Experiments on both traditional benchmarks and an industrial functional classification problem show the added value of this estimator.

9.
Classifying self-care problems is one of the important challenges for occupational therapists. The extent and variety of disorders make the classification process complex and time-consuming. To overcome this challenge, a novel expert model is proposed in this research. The proposed model is based on a Probabilistic Neural Network (PNN) and a Genetic Algorithm (GA) for classifying the self-care problems of children with physical and motor disability. In this model, the PNN is employed as a classifier and the GA is applied for feature selection. The PNN is trained on a standard ICF-CY dataset. Based on ICF-CY, occupational therapists must evaluate many features to diagnose self-care problems. According to the experience of occupational therapists, these features have different effects on classification. Hence, the GA is employed to select relevant and important features for self-care problems classification. Since classification rules are important for occupational therapists, the self-care problems classification rules are additionally extracted using the CART algorithm. The experimental results show that, with the feature selection algorithm, the accuracy and time complexity of classification are improved in comparison to other models. The proposed model can classify the self-care problems of children with 94.28% accuracy using only 16.5% of all features.

10.
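Lin Rongqiang, Li Ou, Li Qing, Li Linlin. Journal of Computer Applications (计算机应用), 2014, 34(11): 3206-3209
To address the sample-labelling bottleneck in network traffic feature selection, and the inability of existing semi-supervised methods to select strongly relevant features, a multi-class semi-supervised feature selection algorithm based on class-label extension (SFSEL) is proposed. Starting from a small number of labelled samples, the algorithm first extends class labels to the unlabelled samples with the K-means algorithm; it then performs feature selection on the multi-class data using the doubly regularized support vector machine (MDrSVM) algorithm. In comparison experiments on the Moore data set against the semi-supervised feature selection algorithms Spectral, PCFRSC and SEFR, SFSEL achieved clearly higher classification accuracy and recall than the other algorithms, and it selected clearly fewer features. The experimental results show that SFSEL effectively improves the relevance of the selected features and yields better network traffic classification performance.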

11.
With the recent financial crisis and European debt crisis, corporate bankruptcy prediction has become an increasingly important issue for financial institutions. Many statistical and intelligent methods have been proposed; however, no single method has proved best overall for predicting corporate bankruptcy. Recent studies suggest that ensemble learning methods may be applicable to corporate bankruptcy prediction. In this paper, a new and improved Boosting method, FS-Boosting, is proposed to predict corporate bankruptcy. By injecting a feature selection strategy into Boosting, FS-Boosting can achieve better performance because its base learners gain accuracy and diversity. For testing and illustration purposes, two real-world bankruptcy datasets were selected to demonstrate the effectiveness and feasibility of FS-Boosting. Experimental results reveal that FS-Boosting could be used as an alternative method for corporate bankruptcy prediction.

12.
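Databases usually contain many redundant features; finding the important ones is called feature extraction. This paper proposes a heuristic feature selection algorithm based on attribute significance. The algorithm uses attribute significance as its iteration criterion to obtain a minimal reduct of the attribute set.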
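A heuristic of this kind can be sketched with the standard rough-set dependency degree: gamma_B(D) is the fraction of rows whose equivalence class under attributes B has a single decision value, and the significance of an attribute a relative to B is gamma_{B ∪ {a}}(D) − gamma_B(D). The greedy loop below is an assumed reading of the abstract, not the paper's algorithm; it returns an attribute subset that preserves the full dependency, which is not guaranteed to be minimal in general. The table and attribute names are hypothetical.

```python
from collections import defaultdict

def dependency(rows, attrs, decision):
    """Dependency degree: share of rows in decision-consistent classes."""
    classes = defaultdict(list)
    for row in rows:
        classes[tuple(row[a] for a in attrs)].append(row[decision])
    consistent = sum(len(ds) for ds in classes.values() if len(set(ds)) == 1)
    return consistent / len(rows)

def greedy_reduct(rows, attrs, decision):
    """Add the most significant attribute until full dependency is reached."""
    target = dependency(rows, attrs, decision)
    reduct = []
    while dependency(rows, reduct, decision) < target:
        candidates = [a for a in attrs if a not in reduct]
        # The current dependency is the same for every candidate, so maximising
        # the new dependency is the same as maximising the significance.
        best = max(candidates,
                   key=lambda a: dependency(rows, reduct + [a], decision))
        reduct.append(best)
    return reduct

# Toy decision table: d is fixed by a except when a == 2, where b decides;
# c only flags a == 2 and adds nothing once a is known.
rows = [
    {"a": 0, "b": 0, "c": 0, "d": "no"},
    {"a": 0, "b": 1, "c": 0, "d": "no"},
    {"a": 1, "b": 0, "c": 0, "d": "yes"},
    {"a": 1, "b": 1, "c": 0, "d": "yes"},
    {"a": 2, "b": 0, "c": 1, "d": "no"},
    {"a": 2, "b": 1, "c": 1, "d": "yes"},
]
print(greedy_reduct(rows, ["a", "b", "c"], "d"))
```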

13.
Microarray data are often characterized by high dimension and small sample size. Their dimension needs to be reduced for better classification performance and computational efficiency of the learning model. The minimum redundancy and maximum relevance (mRMR) method, which is widely used to reduce the dimension of the data, requires discretization and the setting of external parameters. We propose an incremental formulation of the trace of ratio of the scatter matrices to determine a relevant set of genes, which involves neither discretization nor external parameter setting. It is shown analytically that the proposed incremental formulation is computationally efficient in comparison to its batch formulation. Extensive experiments on 14 well-known microarray cancer datasets demonstrate that the proposed method performs better than the well-known mRMR method. Statistical tests also show that the proposed method is significantly better than the mRMR method.

14.
With the advent of technology in various scientific fields, high-dimensional data are becoming abundant. A general approach to tackling the resulting challenges is to reduce data dimensionality through feature selection. Traditional feature selection approaches concentrate on selecting relevant features and ignoring irrelevant or redundant ones. However, most of these approaches neglect feature interactions. In addition, some datasets have imbalanced classes, which may result in biases towards the majority class. The main goal of this paper is to propose a novel feature selection method based on interaction information (II) to provide higher-level interaction analysis and improve the search procedure in the feature space. In this regard, an evolutionary feature subset selection algorithm based on interaction information is proposed, which consists of three stages. In the first stage, candidate features and candidate feature pairs are identified using traditional feature weighting approaches such as symmetric uncertainty (SU) and bivariate interaction information. In the second stage, candidate feature subsets are formed and evaluated using multivariate interaction information. Finally, the best candidate feature subsets are selected using dominant/dominated relationships. The proposed algorithm is compared with other feature selection algorithms, including mRMR, WJMI, IWFS, IGFS, DCSF, K_OFSD, WFLNS, Information Gain and ReliefF, in terms of the number of selected features, classification accuracy, F-measure and algorithm stability using three different classifiers, namely KNN, NB, and CART. The results confirm the improved classification accuracy and robustness of the proposed method in comparison with the other approaches.
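Symmetric uncertainty, the feature weighting used in the first stage, is the mutual information normalised to the range [0, 1]: SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)). A minimal sketch for discrete variables follows; the helper names and the toy sequences are illustrative, and the paper's interaction-information stages are not reproduced here.

```python
from collections import Counter
from math import log2

def entropy(xs):
    """Shannon entropy (in bits) of a discrete sequence."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())

def symmetric_uncertainty(xs, ys):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), with I = H(X) + H(Y) - H(X, Y)."""
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:
        return 0.0  # both variables are constant, so there is nothing to share
    mutual_info = hx + hy - entropy(list(zip(xs, ys)))
    return 2.0 * mutual_info / (hx + hy)

x = [0, 0, 1, 1]
z = [0, 1, 0, 1]
print(symmetric_uncertainty(x, x))  # identical variables
print(symmetric_uncertainty(x, z))  # independent variables
```

Identical variables score 1 and independent ones score 0, so SU can rank both feature-class relevance and feature-feature redundancy on a common scale.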

15.
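To address the lack of supervision information in network traffic feature selection, a semi-supervised network traffic feature selection algorithm based on pairwise constraint extension is proposed. The algorithm considers both a small number of pairwise constraints and a large number of unlabelled samples. Using the correlation and autocorrelation between sample sets, it extends the pairwise constraint set to the unlabelled samples, generating additional, highly reliable pairwise constraints that reveal the distribution of the sample space. Finally, feature selection is performed with the extended pairwise constraint set. Experiments show that, compared with algorithms without pairwise constraint extension, the proposed algorithm achieves better classification performance when only a few initial pairwise constraints are available.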

16.
17.
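Optimal route selection method based on a dynamic traffic simulation model*   Total citations: 1 (self-citations: 0, citations by others: 1)
A dynamic traffic simulation platform was built with the dynamic traffic simulation model INTEGRATION, and a component-based ant colony algorithm was applied to solve the optimal route selection problem under dynamic traffic information guidance. A case study shows that the optimal route selection method based on the dynamic traffic simulation model is feasible, correct and effective. The method is easy to understand and use, is highly reusable and extensible, and provides a sustainable framework for solving various optimization problems.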

18.
Power quality (PQ) issues have become more important than before because of the increased use of sensitive electrical loads. In this paper, a new hybrid algorithm is presented for detecting PQ disturbances in electrical power systems. The proposed method consists of four main steps: simulation of PQ events, extraction of features, selection of dominant features, and classification of selected features. Using two powerful signal processing tools, i.e. variational mode decomposition (VMD) and the S-transform (ST), potential features are extracted from different PQ events. VMD, a new tool, decomposes signals into different modes, and ST analyzes signals in both the time and frequency domains. To avoid a large feature vector and obtain a detection scheme with optimum structure, sequential forward selection (SFS) and sequential backward selection (SBS) as wrapper-based methods, and Gram–Schmidt orthogonalization (GSO) based feature selection as a filter-based method, are used to eliminate redundant features. In the next step, PQ events are discriminated by support vector machines (SVMs) as the classifier core. Results of extensive tests demonstrate the satisfactory performance of the proposed method in terms of speed and accuracy, even in noisy conditions. Moreover, the start and end points of PQ events can be detected with high precision.
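As an illustration of the filter step, the following sketch ranks features by Gram–Schmidt orthogonalization in its common form: pick the feature whose squared cosine with the target is largest, then project the remaining features and the target onto the subspace orthogonal to the picked feature, and repeat. Whether the paper uses exactly this variant is not stated in the abstract, and the feature names and values are made up.

```python
EPS = 1e-12

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def remove_component(v, q):
    """Return v minus its projection onto q (v unchanged if q is ~zero)."""
    qq = dot(q, q)
    if qq <= EPS:
        return list(v)
    scale = dot(v, q) / qq
    return [a - scale * b for a, b in zip(v, q)]

def gram_schmidt_rank(features, target, k):
    """Rank up to k features; `features` maps name -> numeric column."""
    cols = {name: list(col) for name, col in features.items()}
    y = list(target)
    ranked = []

    def cos2(name):
        # Squared cosine of the angle between the feature and the residual target.
        x = cols[name]
        denom = dot(x, x) * dot(y, y)
        return dot(x, y) ** 2 / denom if denom > EPS else 0.0

    while cols and len(ranked) < k:
        best = max(sorted(cols), key=cos2)
        q = cols.pop(best)
        ranked.append(best)
        # Orthogonalise the target and the remaining features against the pick.
        y = remove_component(y, q)
        cols = {name: remove_component(col, q) for name, col in cols.items()}
    return ranked

# f1_near almost repeats f1; f2 carries the information that f1 lacks.
features = {
    "f1":      [1.0, 1.0, -1.0, -1.0],
    "f1_near": [1.0, 1.0, -1.0, -0.8],
    "f2":      [1.0, -1.0, 1.0, -1.0],
}
target = [4.0, 2.0, -2.0, -4.0]  # equals 3 * f1 + f2
print(gram_schmidt_rank(features, target, k=2))
```

Ranking each feature on its own would place `f1_near` second because it correlates strongly with the target; after orthogonalization its remaining contribution is small, so `f2` is chosen instead.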

19.
An Intrusion Detection System (IDS) is an important and necessary component in ensuring network security and protecting network resources and infrastructure. How to build a lightweight IDS is a hot topic in network security. Moreover, feature selection is a classic research topic in data mining that has attracted much interest from researchers in many fields, such as network security, pattern recognition and data mining. In this paper, we introduce feature selection methods to the intrusion detection domain. We propose a wrapper-based feature selection algorithm aimed at building a lightweight intrusion detection system, using modified random mutation hill climbing (RMHC) as the search strategy to specify a candidate subset for evaluation and a modified linear Support Vector Machine (SVM) iterative procedure as the wrapper approach to obtain the optimum feature subset. We verify the effectiveness and feasibility of our feature selection algorithm through several experiments on the KDD Cup 1999 intrusion detection dataset. The experimental results show that our approach not only speeds up the selection of important features but also yields high detection rates. Furthermore, our experimental results indicate that an intrusion detection system with the feature selection algorithm outperforms one without it in both detection performance and computational cost.
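The search strategy can be sketched as plain random mutation hill climbing over a feature mask: flip one randomly chosen bit and keep the change if the score does not get worse. The paper's modifications to RMHC and its SVM-based evaluation are not described in the abstract, so the sketch uses the unmodified procedure and a hypothetical `toy_score` in place of a real wrapper evaluation.

```python
import random

def rmhc(n_features, score, iterations, rng):
    """Random mutation hill climbing over boolean feature masks."""
    current = [rng.random() < 0.5 for _ in range(n_features)]
    current_score = score(current)
    for _ in range(iterations):
        candidate = list(current)
        i = rng.randrange(n_features)
        candidate[i] = not candidate[i]      # mutate exactly one position
        candidate_score = score(candidate)
        if candidate_score >= current_score:  # accept equal or better masks
            current, current_score = candidate, candidate_score
    return current, current_score

# Hypothetical stand-in for a wrapper evaluation (e.g. validation accuracy):
# useful features add 1.0 each, every other selected feature costs 0.5.
USEFUL = {0, 3, 5}

def toy_score(mask):
    return sum((1.0 if i in USEFUL else -0.5)
               for i, on in enumerate(mask) if on)

best_mask, best_score = rmhc(8, toy_score, iterations=500, rng=random.Random(0))
selected = [i for i, on in enumerate(best_mask) if on]
print(selected, best_score)
```

In a real wrapper, `score` would train and validate a classifier on the masked features, so each iteration is expensive and the iteration budget becomes the main tuning knob.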

20.
The optimum finite set of linear observables for discriminating two Gaussian stochastic processes is derived using classical methods and distribution function theory. The results offer a new, accurate information-theoretic strategy and are superior to well-known conventional methods using statistical distance measures.

