20 similar documents found (search time: 0 ms)
1.
In this paper, we present the MIFS-C variant of the mutual information feature selection (MIFS) algorithms. We present an algorithm
to find the optimal value of the redundancy parameter, which is a key parameter in the MIFS-type algorithms. Furthermore,
we present an algorithm that speeds up the execution time of all the MIFS variants. Overall, the presented MIFS-C achieves classification accuracy comparable to (and in some cases better than) that of other MIFS algorithms, while running faster.
We compared this feature selector with other feature selectors and found that it performs better in most cases. The MIFS-C performed especially well on the breakeven and F-measure metrics because the algorithm can be tuned to optimise these evaluation measures.
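As a rough illustration of the MIFS-family criterion the abstract refers to, the sketch below greedily selects features maximizing relevance minus beta-weighted redundancy, where beta is the redundancy parameter the MIFS-C variant tunes. The feature names and mutual-information values are hypothetical toy numbers, not from the paper.

```python
# Sketch of MIFS-style greedy selection: pick the feature maximizing
# I(f;C) - beta * sum over selected s of I(f;s). All values are toy numbers.

def mifs_select(relevance, redundancy, beta, k):
    """Greedily pick k features by relevance minus beta-weighted redundancy."""
    selected, remaining = [], set(relevance)
    while remaining and len(selected) < k:
        def score(f):
            return relevance[f] - beta * sum(
                redundancy[frozenset((f, s))] for s in selected)
        best = max(remaining, key=score)
        selected.append(best)
        remaining.discard(best)
    return selected

# "f1" is the most relevant feature, but "f2" is highly redundant with it,
# so the second pick falls to the less redundant "f3".
relevance = {"f1": 0.9, "f2": 0.85, "f3": 0.4}
redundancy = {frozenset(("f1", "f2")): 0.8,
              frozenset(("f1", "f3")): 0.1,
              frozenset(("f2", "f3")): 0.1}
picked = mifs_select(relevance, redundancy, beta=0.7, k=2)
```

With beta = 0 the criterion degenerates to plain relevance ranking, which is why tuning the redundancy parameter matters.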
Jan Bakus received the B.A.Sc. and M.A.Sc. degrees in electrical engineering from the University of Waterloo, Waterloo, ON, Canada,
in 1996 and 1998, respectively, and the Ph.D. degree in systems design engineering in 2005. He is currently working at Maplesoft, Waterloo, ON, Canada, as an applications engineer, where he is responsible for the development of application-specific toolboxes for the Maple scientific computing software.
His research interests are in the area of feature selection for text classification, text classification, text clustering,
and information retrieval. He is the recipient of the Carl Pollock Fellowship award from the University of Waterloo and the
Datatel Scholars Foundation scholarship from Datatel.
Mohamed S. Kamel holds a Ph.D. in computer science from the University of Toronto, Canada. He is at present Professor and Director of the
Pattern Analysis and Machine Intelligence Laboratory in the Department of Electrical and Computing Engineering, University
of Waterloo, Canada. Professor Kamel holds a Canada Research Chair in Cooperative Intelligent Systems.
Dr. Kamel's research interests are in machine intelligence, neural networks and pattern recognition with applications in robotics
and manufacturing. He has authored and coauthored over 200 papers in journals and conference proceedings, 2 patents and numerous
technical and industrial project reports. Under his supervision, 53 Ph.D. and M.A.Sc. students have completed their degrees.
Dr. Kamel is a member of ACM, AAAI, CIPS and APEO and has been named a Fellow of the IEEE (2005). He is the editor-in-chief of
the International Journal of Robotics and Automation, Associate Editor of the IEEE SMC, Part A, the International Journal
of Image and Graphics, Pattern Recognition Letters and is a member of the editorial board of the Intelligent Automation and
Soft Computing. He has served as a consultant to many companies, including NCR, IBM, Nortel, VRP and CSA. He is a member of the board of directors and cofounder of Virtek Vision International in Waterloo.
2.
Feature selection is known as a good solution to the high dimensionality of the feature space, and the most preferred feature selection methods for text classification are filter-based ones. In a common filter-based feature selection scheme, unique scores are assigned to features depending on their discriminative power, and these features are sorted in descending order according to the scores. The last step is then to add the top-N features to the feature set, where N is generally an empirically determined number. In this paper, an improved global feature selection scheme (IGFSS), in which the last step of a common feature selection scheme is modified to obtain a more representative feature set, is proposed. Although the feature set constructed by a common feature selection scheme successfully represents some of the classes, a number of classes may not be represented at all. Consequently, IGFSS aims to improve the classification performance of global feature selection methods by creating a feature set representing all classes almost equally. For this purpose, a local feature selection method is used in IGFSS to label features according to their discriminative power on classes, and these labels are used while producing the feature sets. Experimental results on well-known benchmark datasets with various classifiers indicate that IGFSS improves classification performance in terms of two widely known metrics, namely Micro-F1 and Macro-F1.
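The contrast between the common top-N last step and an IGFSS-like class-balanced last step can be sketched as follows; the scores, feature names, and class labels are hypothetical, and the round-robin fill is only one simple way to realize "representing all classes almost equally".

```python
# Common filter scheme: sort by score, keep the global top-N.
def top_n(scores, n):
    return sorted(scores, key=lambda f: -scores[f])[:n]

# IGFSS-like last step: features carry a class label (from a local method);
# the set is filled class-by-class so every class is represented.
def igfss_like(scores, feature_class, n, classes):
    per_class = {c: sorted((f for f in scores if feature_class[f] == c),
                           key=lambda f: -scores[f]) for c in classes}
    out = []
    while len(out) < n and any(per_class.values()):
        for c in classes:
            if per_class[c] and len(out) < n:
                out.append(per_class[c].pop(0))
    return out

scores = {"ball": 2.1, "goal": 1.9, "election": 1.7, "vote": 1.2}
feature_class = {"ball": "sports", "goal": "sports",
                 "election": "politics", "vote": "politics"}
flat = top_n(scores, 2)        # picks only sports terms
balanced = igfss_like(scores, feature_class, 2, ["sports", "politics"])
```

Here the plain top-2 set contains no politics feature at all, which is exactly the failure mode the abstract describes.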
3.
4.
Feature selection is an important step in large-scale image data analysis, which has proved to be difficult due to the large size in both dimensions and samples. Feature selection first eliminates redundant and irrelevant features and then chooses a subset of features that performs as efficiently as the complete set. Generally, supervised feature selection yields better performance than unsupervised feature selection because of the use of label information. However, labeled data samples are always expensive to obtain, which constrains the performance of supervised feature selection, especially for large web image datasets. In this paper, we propose a semi-supervised feature selection algorithm based on a hierarchical regression model. Our contribution can be highlighted as follows: (1) our algorithm uses a statistical approach to exploit both labeled and unlabeled data, which preserves the manifold structure of each feature type; (2) the predicted label matrix of the training data and the feature selection matrix are learned simultaneously, so that the two mutually benefit. Extensive experiments are performed on three large-scale image datasets. Experimental results demonstrate the better performance of our algorithm compared with state-of-the-art algorithms.
5.
Razieh Asgarnezhad, S. Amirhassan Monadjemi, Mohammadreza Soltanaghaei, The Journal of Supercomputing, 2021, 77(6): 5806-5839
The Journal of Supercomputing - Due to extensive web applications, sentiment classification (SC) has become a relevant issue of interest among text mining experts. The extensive online reviews...
6.
Stefano Baccianella, Andrea Esuli, Fabrizio Sebastiani, Expert Systems with Applications, 2013, 40(11): 4687-4696
Most popular feature selection methods for text classification such as information gain (also known as “mutual information”), chi-square, and odds ratio, are based on binary information indicating the presence/absence of the feature (or “term”) in each training document. As such, these methods do not exploit a rich source of information, namely, the information concerning how frequently the feature occurs in the training document (term frequency). In order to overcome this drawback, when doing feature selection we logically break down each training document of length k into k training “micro-documents”, each consisting of a single word occurrence and endowed with the same class information of the original training document. This move has the double effect of (a) allowing all the original feature selection methods based on binary information to be still straightforwardly applicable, and (b) making them sensitive to term frequency information. We study the impact of this strategy in the case of ordinal text classification, a type of text classification dealing with classes lying on an ordinal scale, and recently made popular by applications in customer relationship management, market research, and Web 2.0 mining. We run experiments using four recently introduced feature selection functions, two learning methods of the support vector machines family, and two large datasets of product reviews. The experiments show that the use of this strategy substantially improves the accuracy of ordinal text classification.
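The micro-document transformation described above fits in a few lines: each document of length k becomes k one-word documents sharing its class label, so presence/absence statistics become frequency-sensitive. The toy corpus is hypothetical.

```python
# Each training document of length k is broken into k one-word
# "micro-documents" that inherit the original class label.

def to_micro_documents(docs):
    """docs: list of (token_list, label) -> list of ([token], label)."""
    return [([tok], label) for tokens, label in docs for tok in tokens]

corpus = [(["great", "great", "camera"], "pos"),
          (["bad", "battery"], "neg")]
micro = to_micro_documents(corpus)
# "great" now contributes two positive micro-documents instead of one,
# so a binary presence/absence selection function sees its frequency.
```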
7.
Multimedia Tools and Applications - The problem of text detection and localization in scene images has always been challenging for the researchers over the years due to diversities present in these...
8.
A good feature selection method should take into account both category information and high-frequency information to select useful features that can effectively display the information of a target. Because basic mutual information (BMI) prefers low-frequency features and ignores high-frequency features, clustering mutual information is proposed; it is based on clustering, makes effective high-frequency features distinctive, and better integrates category information with useful high-frequency information. Time is an important factor in topic detection and tracking (TDT). In order to improve the performance of TDT, the time difference is integrated into clustering mutual information to dynamically adjust the mutual information, yielding another algorithm called dynamic clustering mutual information (DCMI). In order to obtain optimal subsets that display topic information, an objective function is proposed, based on the idea that a good feature subset should have the smallest within-class distance and the largest across-class distance. Experiments on the TDT4 corpora using this objective function compare the performance of BMI, DCMI, and the only existing topic feature selection algorithm, Incremental Term Frequency-Inverted Document Frequency (ITF-IDF); the results are presented in four figures. The computation time of DCMI is lower than that of BMI and ITF-IDF. The optimal normalized detection performance (Cdet)norm of DCMI is decreased by 0.3044 and 0.0970 compared with those of BMI and ITF-IDF, respectively.
9.
This paper proposes a framework for selecting the Laplacian eigenvalues of 3D shapes that are more relevant for shape characterization
and classification. We demonstrate the redundancy of the information coded by the shape spectrum and discuss the shape characterization
capability of the selected eigenvalues. The feature selection methods used to demonstrate our claim are the AdaBoost algorithm
and the Support Vector Machine. The efficacy of the selection is shown by comparing the results of the selected eigenvalues on shape characterization and classification with those obtained using the first k eigenvalues, varying k over the cardinality of the spectrum. Our experiments, performed on 3D objects represented either as triangle meshes or point clouds, show that working directly with point clouds provides classification results comparable to those obtained with surface-based representations. Finally, we discuss the stability of the computation of the Laplacian spectrum under matrix perturbations.
10.
《微型机与应用》2019,(5):48-52
In recent years, big-data-centered artificial intelligence has developed rapidly, and natural language processing has become one of the most prominent frontier research areas of the AI era. However, in short-text classification within natural language processing, different feature extraction methods combined with different machine learning algorithms yield markedly different results. To address the low accuracy of short-text classification, this work combines different feature extraction methods with different machine learning algorithms and, using preset evaluation metrics, explores their effectiveness on ultra-short text in order to find the optimal combined model and thus the best classification performance. Experimental results show that, among the four optimal combinations examined, the model combining term frequency-inverse document frequency (TF-IDF) feature extraction with logistic regression achieved the best results on a public dataset, with a precision of 92.13% and a recall of 90.12%, making it well suited to ultra-short-text classification scenarios.
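A minimal from-scratch TF-IDF computation, standing in for the feature-extraction half of the winning TF-IDF + logistic regression combination; the classifier itself is omitted and the toy corpus is hypothetical.

```python
# Per-document TF-IDF weights: term frequency within the document times
# the log inverse document frequency across the corpus.
import math

def tf_idf(docs):
    n = len(docs)
    df = {}                       # document frequency per term
    for d in docs:
        for t in set(d):
            df[t] = df.get(t, 0) + 1
    out = []
    for d in docs:
        tf = {t: d.count(t) / len(d) for t in set(d)}
        out.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return out

docs = [["cheap", "phone", "deal"], ["weather", "today"], ["phone", "review"]]
vecs = tf_idf(docs)
# "phone" appears in 2 of 3 documents, so its idf is log(3/2) and its
# weight is lower than that of corpus-rare terms like "weather".
```

Production systems typically add smoothing and sublinear TF scaling, but the weighting idea is the same.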
11.
Hierarchical feature selection is a new research area in machine learning/data mining, which consists of performing feature selection by exploiting dependency relationships among hierarchically structured features. This paper evaluates four hierarchical feature selection methods, i.e., HIP, MR, SHSEL and GTD, used together with four types of lazy learning-based classifiers, i.e., Naïve Bayes, Tree Augmented Naïve Bayes, Bayesian Network Augmented Naïve Bayes and k-Nearest Neighbors classifiers. These four hierarchical feature selection methods are compared with each other and with a well-known “flat” feature selection method, i.e., Correlation-based Feature Selection. The adopted bioinformatics datasets consist of aging-related genes used as instances and Gene Ontology terms used as hierarchical features. The experimental results reveal that the HIP (Select Hierarchical Information Preserving Features) method performs best overall, in terms of predictive accuracy and robustness when coping with data where the instances’ classes have a substantially imbalanced distribution. This paper also reports a list of the Gene Ontology terms that were most often selected by the HIP method.
12.
Huan Liu, Lei Yu, IEEE Transactions on Knowledge and Data Engineering, 2005, 17(4): 491-502
This paper introduces concepts and algorithms of feature selection, surveys existing feature selection algorithms for classification and clustering, groups and compares different algorithms with a categorizing framework based on search strategies, evaluation criteria, and data mining tasks, reveals unattempted combinations, and provides guidelines for selecting feature selection algorithms. With the categorizing framework, we continue our efforts toward building an integrated system for intelligent feature selection. A unifying platform is proposed as an intermediate step. An illustrative example is presented to show how existing feature selection algorithms can be integrated into a meta algorithm that can take advantage of individual algorithms. An added advantage of doing so is to help a user employ a suitable algorithm without knowing the details of each algorithm. Some real-world applications are included to demonstrate the use of feature selection in data mining. We conclude this work by identifying trends and challenges of feature selection research and development.
13.
With the advent of Big Data, data is being collected at an unprecedented fast pace, and it needs to be processed in a short time. To deal with data streams that flow continuously, classical batch learning algorithms cannot be applied and it is necessary to employ online approaches. Online learning consists of continuously revising and refining a model by incorporating new data as they arrive, and it allows important problems such as concept drift or management of extremely high-dimensional datasets to be solved. In this paper, we present a unified pipeline for online learning which covers online discretization, feature selection and classification. Three classical methods—the k-means discretizer, the χ2 filter and a one-layer artificial neural network—have been reimplemented to be able to tackle online data, showing promising results on both synthetic and real datasets.
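An online χ2 filter of the kind this pipeline describes might be sketched as below: contingency counts are updated as instances arrive, and the statistic can be queried at any moment. The discretizer and the neural classifier from the paper are omitted, and the toy stream is hypothetical.

```python
# Online chi-square statistic for one discrete feature against the class.
from collections import defaultdict

class OnlineChi2:
    def __init__(self):
        self.joint = defaultdict(int)   # (feature value, label) -> count
        self.value = defaultdict(int)   # feature value -> count
        self.label = defaultdict(int)   # class label -> count
        self.n = 0

    def update(self, value, label):
        """Incorporate one arriving instance."""
        self.joint[(value, label)] += 1
        self.value[value] += 1
        self.label[label] += 1
        self.n += 1

    def chi2(self):
        """Chi-square of the contingency table seen so far."""
        stat = 0.0
        for v in list(self.value):
            for c in list(self.label):
                expected = self.value[v] * self.label[c] / self.n
                stat += (self.joint[(v, c)] - expected) ** 2 / expected
        return stat

filt = OnlineChi2()
for v, c in [(0, "a"), (0, "a"), (1, "b"), (1, "b")]:
    filt.update(v, c)
# the feature value perfectly predicts the class in this toy stream
```

A real online filter would keep one such accumulator per feature and rank features by the statistic at prediction time.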
14.
This paper is concerned with a two-phase procedure to select salient features (variables) for classification committees. Both filter and wrapper approaches to feature selection are combined in this work. In the first phase, definitely redundant features are eliminated based on a paired t-test that compares the saliency of each candidate feature with that of a noise feature. In the second phase, a genetic search is employed. The search integrates the steps of training, aggregation of committee members, selection of hyper-parameters, and selection of salient features into the same learning process. A characteristic feature of the developed genetic search procedure is that only a small number of genetic iterations is needed to find a solution. Experimental tests performed on five real-world problems have shown that significant improvements in classification accuracy can be obtained in a small number of iterations compared with using all the available features.
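Phase one's noise-comparison idea can be sketched with a hand-rolled paired t statistic: a feature whose saliency is statistically indistinguishable from that of a noise feature is discarded. The saliency samples below are hypothetical.

```python
# Paired t statistic comparing a candidate feature's saliency samples
# (e.g. across cross-validation runs) with those of a noise feature.
import math

def paired_t(xs, ys):
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)

feature_saliency = [0.42, 0.45, 0.40, 0.44, 0.43]
noise_saliency = [0.10, 0.12, 0.09, 0.11, 0.13]
t = paired_t(feature_saliency, noise_saliency)
# a large t means the feature's saliency clearly exceeds noise: keep it;
# a t near zero would mark the feature as definitely redundant.
```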
15.
Marios Kyperountas, Anastasios Tefas, Pattern Recognition, 2010, 43(3): 972-986
A novel facial expression classification (FEC) method is presented and evaluated. The classification process is decomposed into multiple two-class classification problems, a choice that is analytically justified, and unique sets of features are extracted for each classification problem. Specifically, for each two-class problem, an iterative feature selection process that utilizes a class separability measure is employed to create salient feature vectors (SFVs), where each SFV is composed of a selected feature subset. Subsequently, two-class discriminant analysis is applied on the SFVs to produce salient discriminant hyper-planes (SDHs), which are used to train the corresponding two-class classifiers. To properly integrate the two-class classification results and produce the FEC decision, a computationally efficient and fast classification scheme is developed. During each step of this scheme, the most reliable classifier is identified and utilized, thus, a more accurate final classification decision is produced. The JAFFE and the MMI databases are used to evaluate the performance of the proposed salient-feature-and-reliable-classifier selection (SFRCS) methodology. Classification rates of 96.71% and 93.61% are achieved under the leave-one-sample-out evaluation strategy, and 85.92% under the leave-one-subject-out evaluation strategy.
16.
Qiang He, Zongxia Xie, Qinghua Hu, Congxin Wu, Neurocomputing, 2011, 74(10): 1585-1594
Support vector machines (SVMs) are a class of popular classification algorithms owing to their high generalization ability. However, it is time-consuming to train SVMs with a large set of learning samples, and improving learning efficiency is one of the most important research tasks on SVMs. It is known that although there are many candidate training samples in some learning tasks, only the samples near the decision boundary, which are called support vectors, have an impact on the optimal classification hyper-planes. Finding these samples and training SVMs with them greatly decreases training time and space complexity. Based on this observation, we introduce a neighborhood-based rough set model to search for boundary samples. Using the model, we first divide the sample space into three subsets: positive region, boundary, and noise. Furthermore, we partition the input features into four subsets: strongly relevant features, weakly relevant and indispensable features, weakly relevant and superfluous features, and irrelevant features. We then train SVMs only with the boundary samples in the relevant and indispensable feature subspaces; thus feature and sample selection are conducted simultaneously with the proposed model. A set of experimental results shows that the model selects very few features and samples for training, while the classification performance is preserved or even improved.
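The boundary-sample idea can be illustrated with a crude 1-D, radius-based neighborhood: a sample whose neighborhood contains points of another class is treated as a boundary sample, and only those would be fed to the SVM. The paper's neighborhood rough set model is more general; the data below are hypothetical.

```python
# Keep only samples whose radius-neighborhood is class-heterogeneous.

def boundary_samples(points, radius):
    """points: list of (x, label); returns points with mixed neighborhoods."""
    out = []
    for x, y in points:
        neighbors = [(ox, oy) for ox, oy in points if abs(ox - x) <= radius]
        if any(oy != y for _, oy in neighbors):
            out.append((x, y))
    return out

data = [(0.0, "a"), (1.0, "a"), (2.0, "a"),
        (2.5, "b"), (3.5, "b"), (5.0, "b")]
border = boundary_samples(data, radius=1.0)
# only the samples near x = 2.25, where the two classes meet, survive
```

Training on `border` instead of `data` is the sample-reduction step; the interior points cannot become support vectors anyway.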
17.
This paper presents a novel ensemble classifier framework for improved classification of mammographic lesions in Computer-aided Detection (CADe) and Diagnosis (CADx) systems. Compared to previously developed classification techniques in mammography, the main novelty of the proposed method is twofold: (1) the combined use of different feature representations (of the same instance) and data resampling to generate more diverse and accurate base classifiers as ensemble members, and (2) the incorporation of a novel ensemble selection mechanism to further maximize the overall classification performance. In addition, as opposed to conventional ensemble learning, the proposed ensemble framework has the advantage of working well with both weak and strong classifiers, which are extensively used in mammography CADe and/or CADx systems. Extensive experiments have been performed using a benchmark mammogram dataset to test the proposed method on two classification applications: (1) false-positive (FP) reduction using classification between masses and normal tissues, and (2) diagnosis using classification between malignant and benign masses. Results showed that the proposed method (area under the ROC curve (AUC) of 0.932 and 0.878, obtained for the two aforementioned classification applications, respectively) clearly outperforms the most commonly used single neural network (AUC = 0.819 and AUC = 0.754) and support vector machine (AUC = 0.849 and AUC = 0.773) based classification approaches. In addition, the feasibility of the method has been successfully demonstrated through comparison with other state-of-the-art ensemble classification techniques such as the Gentle AdaBoost and Random Forest learning algorithms.
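A greedy ensemble-selection step of the kind the abstract mentions can be sketched as follows: members are added one at a time only if majority-voting accuracy on a validation set does not drop. The base classifiers are represented by hypothetical fixed prediction vectors, not actual trained models.

```python
# Greedy ensemble selection by validation accuracy under majority voting.

def majority_vote(members, i):
    preds = [m[i] for m in members]
    # ties break toward the smallest label, purely for determinism
    return max(sorted(set(preds)), key=preds.count)

def accuracy(members, truth):
    return sum(majority_vote(members, i) == y
               for i, y in enumerate(truth)) / len(truth)

def select_ensemble(pool, truth):
    chosen = [pool[0]]
    for cand in pool[1:]:
        if accuracy(chosen + [cand], truth) >= accuracy(chosen, truth):
            chosen.append(cand)
    return chosen

truth = [1, 0, 1, 1, 0]
pool = [[1, 0, 1, 0, 0],   # decent base classifier
        [1, 0, 1, 1, 1],   # decent base classifier
        [0, 1, 0, 0, 1]]   # poor classifier, should be rejected
ens = select_ensemble(pool, truth)
```

The poor third member is rejected because adding it lowers validation accuracy, which is the whole point of the selection mechanism.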
18.
In this paper a quite general formulation of sequential pattern recognition processes is presented. Within the framework of this formulation, a procedure is obtained for the simultaneous optimization of the stopping rule and the stage-by-stage ordering of features as the process proceeds. This optimization procedure is based on dynamic programming and uses as an index of performance the expected cost of the process, including both the cost of feature measurement and the cost of classification errors. A simple example illustrates the important computational aspects of the procedure and indicates the form of the solution.
19.
A neuro-fuzzy scheme for simultaneous feature selection and fuzzy rule-based classification
Most methods of classification either ignore feature analysis or do it in a separate phase, offline prior to the main classification task. This paper proposes a neuro-fuzzy scheme for designing a classifier along with feature selection. It is a four-layered feed-forward network for realizing a fuzzy rule-based classifier. The network is trained by error backpropagation in three phases. In the first phase, the network learns the important features and the classification rules. In the subsequent phases, the network is pruned to an "optimal" architecture that represents an "optimal" set of rules. Pruning is found to drastically reduce the size of the network without degrading the performance. The pruned network is further tuned to improve performance. The rules learned by the network can be easily read from the network. The system is tested on both synthetic and real data sets and found to perform quite well.
20.
In this study, a hierarchical electroencephalogram (EEG) classification system for epileptic seizure detection is proposed. The system comprises three stages: (i) representation of the original EEG signals by wavelet packet coefficients and feature extraction using the best-basis-based wavelet packet entropy method; (ii) in the training stage, a cross-validation (CV) method together with a k-Nearest Neighbor (k-NN) classifier used to construct a hierarchical knowledge base (HKB); and (iii) in the testing stage, computation of classification accuracy and rejection rate using the top-ranked discriminative rules from the HKB. The data set is taken from a publicly available EEG database whose aim is to differentiate healthy subjects from subjects suffering from epilepsy. Experimental results show the efficiency of the proposed system. The best classification accuracy is about 100% via 2-, 5-, and 10-fold cross-validation, indicating that the proposed method has potential for designing a new intelligent EEG-based assistive diagnosis system for early detection of electroencephalographic changes.
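The k-NN-plus-cross-validation training stage can be sketched with a 1-nearest-neighbor classifier and simple k-fold splitting; the 1-D toy features stand in for the wavelet-packet entropy features and are hypothetical.

```python
# 1-NN classification evaluated by k-fold cross-validation on 1-D features.

def nn_predict(train, x):
    """Label of the training point closest to x."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def k_fold_accuracy(data, k):
    folds = [data[i::k] for i in range(k)]        # simple interleaved folds
    correct = total = 0
    for i, test in enumerate(folds):
        train = [p for j, f in enumerate(folds) if j != i for p in f]
        for x, y in test:
            correct += nn_predict(train, x) == y
            total += 1
    return correct / total

# toy entropy-like features: low values for healthy, high for epileptic
data = [(0.1, "healthy"), (0.2, "healthy"), (0.15, "healthy"),
        (0.8, "epileptic"), (0.9, "epileptic"), (0.85, "epileptic")]
acc = k_fold_accuracy(data, 2)
```

On this cleanly separated toy set cross-validated accuracy is perfect, mirroring the near-100% figure the abstract reports for well-separated EEG feature distributions.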