1.
The aim of this paper is to propose a new hybrid data mining model that combines various feature selection and ensemble learning classification algorithms in order to support the decision-making process. The model is built in several stages. In the first stage, the initial dataset is preprocessed; apart from applying different preprocessing techniques, we paid great attention to feature selection. Five different feature selection algorithms were applied, and their results, based on the ROC and accuracy measures of a logistic regression algorithm, were combined using different voting types. We also propose a new voting method, called if_any, that outperformed all other voting methods as well as the results of any single feature selection algorithm. In the next stage, four different classification algorithms (generalized linear model, support vector machine, naive Bayes, and decision tree) were run on the dataset obtained in the feature selection process. These classifiers were combined into eight different ensemble models using the soft voting method. Experimental results on a real dataset show that the hybrid model based on features selected by the if_any voting method and the GLM + DT ensemble achieves the highest performance and outperforms all other ensemble and single-classifier models.
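The if_any rule amounts to a logical OR over the selectors' feature masks. A minimal sketch of the idea, assuming three off-the-shelf scikit-learn selectors and k = 8 as purely illustrative stand-ins for the paper's five algorithms and scoring setup:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X = X - X.min(axis=0)                      # chi2 needs non-negative features

# Each selector votes with a boolean mask over the features.
selectors = [SelectKBest(f_classif, k=8),
             SelectKBest(mutual_info_classif, k=8),
             SelectKBest(chi2, k=8)]
votes = np.array([s.fit(X, y).get_support() for s in selectors])

majority = votes.sum(axis=0) > len(selectors) / 2   # classic majority voting
if_any = votes.any(axis=0)          # if_any: keep a feature if ANY selector picked it

print(f"majority keeps {majority.sum()} features, if_any keeps {if_any.sum()}")
```

The selected mask could then feed the soft-voting stage (e.g., scikit-learn's VotingClassifier with voting="soft"); if_any is the most permissive combination rule, trading a larger feature set for a lower risk of discarding a relevant feature.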
2.
Rough set based approaches to feature selection for Case-Based Reasoning classifiers
This paper investigates feature selection based on rough sets for dimensionality reduction in Case-Based Reasoning classifiers. To be useful, Case-Based Reasoning systems should be able to manage imprecise, uncertain and redundant data in order to retrieve the most relevant information from a potentially overwhelming quantity of data. Rough Set Theory has been shown to be an effective tool for data mining and for uncertainty management. This paper makes two central contributions: (1) it develops three strategies for feature selection, and (2) it proposes several measures for estimating attribute relevance based on Rough Set Theory. Although we concentrate on Case-Based Reasoning classifiers, the proposals are general enough to be applicable to a wide range of learning algorithms. We applied these proposals to twenty data sets from the UCI repository and examined the impact of feature selection on classification performance. Our evaluation shows that all three proposals benefit the basic Case-Based Reasoning system and that they are robust in comparison with well-known feature selection strategies.
3.
Databases usually contain many redundant features, and finding the important features is called feature extraction. This paper proposes a heuristic feature selection algorithm based on attribute significance. The algorithm uses attribute significance as its iteration criterion to obtain a minimal reduct of the attribute set.
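A sketch of that iteration on categorical data, assuming the standard rough-set dependency degree gamma_B(D) = |POS_B(D)| / |U| as the significance measure (the helper names are illustrative, not from the paper):

```python
import numpy as np

def dependency(X, y, attrs):
    """gamma_B(D) = |POS_B(D)| / |U|: fraction of objects whose equivalence
    class under the attributes in B is pure with respect to the decision y."""
    if not attrs:
        return 0.0
    keys = [tuple(row) for row in X[:, attrs]]
    labels = {}
    for key, label in zip(keys, y):
        labels.setdefault(key, set()).add(label)
    return sum(len(labels[key]) == 1 for key in keys) / len(y)

def greedy_reduct(X, y):
    """Grow the reduct by the attribute with the highest significance
    (dependency gain) until the full-set dependency is reached."""
    full = dependency(X, y, list(range(X.shape[1])))
    reduct = []
    while dependency(X, y, reduct) < full:
        gains = {a: dependency(X, y, reduct + [a])
                 for a in range(X.shape[1]) if a not in reduct}
        reduct.append(max(gains, key=gains.get))
    return reduct
```

Because dependency is monotone non-decreasing as attributes are added, the loop always terminates with a subset that preserves the full-set dependency.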
4.
Feature selection plays a vital role in many areas of pattern recognition and data mining. Effective computation of feature selection is important for improving classification performance. In rough set theory, many feature selection algorithms have been proposed to process static incomplete data. However, feature values in an incomplete data set may vary dynamically in real-world applications, and for such dynamic incomplete data a classic (non-incremental) approach to feature selection is usually computationally time-consuming. To overcome this disadvantage, we propose an incremental approach that accelerates the feature selection process on dynamic incomplete data. We first employ an incremental scheme to compute the new positive region when feature values of an object set vary dynamically. Based on the updated positive region, two efficient incremental feature selection algorithms are developed, for a single object and for multiple objects with varying feature values, respectively. We then conduct a series of experiments on 12 real UCI data sets to evaluate the efficiency and effectiveness of the proposed algorithms. The experimental results show that the proposed algorithms compare favorably with the existing non-incremental methods.
5.
Qiang He, Zongxia Xie, Qinghua Hu, Congxin Wu 《Neurocomputing》2011,74(10):1585-1594
Support vector machines (SVMs) are a class of popular classification algorithms owing to their high generalization ability. However, training SVMs on a large set of learning samples is time-consuming, so improving learning efficiency is one of the most important research tasks on SVMs. It is known that although there are many candidate training samples in some learning tasks, only the samples near the decision boundary, called support vectors, influence the optimal classification hyperplanes. Finding these samples and training SVMs with them greatly decreases training time and space complexity. Based on this observation, we introduce a neighborhood-based rough set model to search for boundary samples. Using the model, we first divide the sample space into three subsets: positive region, boundary and noise. Furthermore, we partition the input features into four subsets: strongly relevant features, weakly relevant and indispensable features, weakly relevant and superfluous features, and irrelevant features. We then train SVMs only with the boundary samples in the relevant and indispensable feature subspaces; thus feature and sample selection are conducted simultaneously with the proposed model. Experimental results show that the model selects very few features and samples for training, while classification performance is preserved or even improved.
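A rough sketch of the boundary-search step, using a k-nearest-neighbor surrogate for the paper's distance-based neighborhood and leaving out the feature partition (k and the dataset are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Approximate each sample's neighborhood by its k nearest neighbors.
k = 10
_, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)

# Boundary samples have mixed labels in their neighborhood; samples whose
# neighborhoods are label-pure fall in the positive region and contribute
# little to the SVM margin.
boundary = np.array([len(np.unique(y[i])) > 1 for i in idx])

svm = SVC(kernel="rbf").fit(X[boundary], y[boundary])
print(f"trained on {boundary.sum()} of {len(X)} samples")
```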
6.
Feature subset selection is viewed as an important preprocessing step for pattern recognition, machine learning and data mining. Most research focuses on homogeneous feature selection, namely on purely numerical or purely categorical features. In this paper, we introduce a neighborhood rough set model to deal with the problem of heterogeneous feature subset selection. As the classical rough set model can only evaluate categorical features, we generalize it with neighborhood relations and introduce a neighborhood rough set model; the proposed model degrades to the classical one if the neighborhood size is set to zero. The neighborhood model reduces numerical and categorical features by assigning different thresholds to different kinds of attributes. In this model, the sizes of the neighborhood lower and upper approximations of the decisions reflect the discriminating capability of feature subsets, and the size of the lower approximation is computed as the dependency between the decision and condition attributes. We use this neighborhood dependency to evaluate the significance of a subset of heterogeneous features and to construct forward feature subset selection algorithms. The proposed algorithms are compared with some classical techniques, and experimental results show that the neighborhood-model-based method is more flexible for dealing with heterogeneous data.
7.
Time series forecasting with a non-linear model and the scatter search meta-heuristic
Carlos Gomes da Silva 《Information Sciences》2008,178(16):3288-3299
Forecasting the behavior of variables (e.g., economic, financial, physical) is of strategic value for organizations, and this sustains practical interest in the development of alternative models and resolution procedures. This paper presents a non-linear model that combines radial basis functions with the ARMA(p, q) structure. The optimal set of parameters for such a model is difficult to find; in this paper, a scatter search meta-heuristic is used to find it. Five time series are analyzed to assess and illustrate the pertinence of the proposed meta-heuristic method.
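One plausible form of such a hybrid, with the linear ARMA(p, q) part augmented by m Gaussian radial basis functions (the paper's exact parameterization may differ):

\[
\hat{y}_t = c + \sum_{i=1}^{p} \phi_i\, y_{t-i} + \sum_{j=1}^{q} \theta_j\, \varepsilon_{t-j} + \sum_{k=1}^{m} w_k \exp\!\left(-\frac{\lVert \mathbf{x}_t - \mathbf{c}_k \rVert^2}{2\sigma_k^2}\right)
\]

Scatter search then tunes the centers \(\mathbf{c}_k\), widths \(\sigma_k\) and weights jointly, a task that is hard for gradient-based methods because the objective is non-convex in the centers and widths.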
8.
Ligang Zhou, Kin Keung Lai, Lean Yu 《Soft Computing - A Fusion of Foundations, Methodologies and Applications》2009,13(2):149-155
Support vector machines (SVMs) are an effective tool for building good credit scoring models. However, the performance of the model depends on its parameter settings. In this study, we use a direct search method to optimize the SVM-based credit scoring model and compare it with three other parameter optimization methods: grid search, a method based on design of experiments (DOE), and a genetic algorithm (GA). Two real-world credit datasets are selected to demonstrate the effectiveness and feasibility of the method. The results show that the direct search method can find an effective model with high classification accuracy and good robustness while remaining less dependent on the initial search space or starting point.
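A sketch of the direct-search idea using Nelder-Mead (a standard derivative-free direct search) over the SVM's C and gamma; the dataset, starting point and iteration cap are illustrative assumptions, not the paper's setup:

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=15, random_state=0)

def neg_cv_accuracy(log_params):
    C, gamma = np.exp(log_params)        # search in log-space keeps C, gamma > 0
    return -cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=5).mean()

# Nelder-Mead adapts its simplex steps instead of scanning a fixed grid.
res = minimize(neg_cv_accuracy, x0=np.log([1.0, 0.1]),
               method="Nelder-Mead", options={"maxiter": 50})
C, gamma = np.exp(res.x)
print(f"C={C:.3g}, gamma={gamma:.3g}, CV accuracy={-res.fun:.3f}")
```

Unlike grid search, the simplex adjusts its own step sizes, which is why the result depends far less on a pre-specified set of candidate values.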
9.
Wei-Chou Chen, Shian-Shyong Tseng, Tzung-Pei Hong 《Expert systems with applications》2008,34(4):2858-2869
Feature selection is about finding useful (relevant) features to describe an application domain. Selecting enough relevant features to effectively represent and index a given dataset is an important task for solving classification and clustering problems intelligently. This task is, however, quite difficult to carry out, since it usually requires a very time-consuming search to obtain the desired features. This paper proposes a bit-based feature selection method to find the smallest feature set that represents the indexes of a given dataset. The proposed approach originates from bitmap indexing and rough set techniques and consists of two phases. In the first phase, the given dataset is transformed into a bitmap indexing matrix with some additional data information. In the second phase, a set of relevant and sufficient features is selected and used to represent the classification indexes of the given dataset. After these features are selected, they can be judged by domain experts, and the final feature set of the given dataset is thus proposed. Experimental results on different data sets show the efficiency and accuracy of the proposed approach.
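The first phase can be pictured as one-hot bitmap encoding. A minimal sketch on a toy categorical table (the second phase, selecting the bit columns that preserve the classification indexes, is omitted):

```python
import numpy as np

def bitmap_index(X):
    """Phase 1: turn each (attribute, value) pair into a bit column."""
    bits, names = [], []
    for j in range(X.shape[1]):
        for v in np.unique(X[:, j]):
            bits.append((X[:, j] == v).astype(np.uint8))
            names.append((j, v))
    return np.column_stack(bits), names

# Toy categorical dataset: 4 objects, 2 attributes.
X = np.array([["a", "x"], ["a", "y"], ["b", "x"], ["b", "y"]])
B, names = bitmap_index(X)
print(B)        # one bit column per attribute value
print(names)    # [(0,'a'), (0,'b'), (1,'x'), (1,'y')]
```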
10.
Md. Monirul Kabir 《Neurocomputing》2011,74(17):2914-2928
This paper presents a new hybrid genetic algorithm (HGA) for feature selection (FS), called HGAFS. The vital aspect of this algorithm is the selection of a salient feature subset of reduced size. HGAFS incorporates a new local search operation, devised and embedded in the HGA, to fine-tune the search in the FS process. The local search technique works on the basis of the distinct and informative nature of the input features, computed from their correlation information. The aim is to guide the search process so that newly generated offspring can be adjusted using the less correlated (distinct) features that capture both the general and special characteristics of a given dataset. The proposed HGAFS thus reduces the redundancy of information among the selected features. In addition, HGAFS emphasizes selecting a subset of salient features of reduced size using a subset-size determination scheme. We tested HGAFS on 11 real-world classification datasets with dimensions varying from 8 to 7129 and compared its performance with the results of ten other well-known FS algorithms. HGAFS consistently performs better at selecting subsets of salient features, resulting in better classification accuracy.
11.
Songyot Nakariyakul, David P. Casasent 《Pattern recognition》2009,42(9):1932-1940
A new improved forward floating selection (IFFS) algorithm for selecting a subset of features is presented. Our proposed algorithm improves on the state-of-the-art sequential forward floating selection algorithm by adding an additional search step, called "replacing the weak feature", which checks whether removing any feature in the currently selected feature subset and adding a new one at each sequential step can improve the current feature subset. Our method provides optimal or quasi-optimal (close to optimal) solutions for many selected subsets and requires significantly less computation than optimal feature selection algorithms. Experimental results on four different databases demonstrate that our algorithm consistently selects better subsets than other suboptimal feature selection algorithms, especially when the original number of features in the database is large.
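The extra "replacing the weak feature" step can be sketched as a single swap search around the current subset; the criterion function J is stood in for here by cross-validated k-NN accuracy, which is an assumption rather than the paper's choice:

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def score(X, y, subset):
    # Criterion J(subset); cross-validated k-NN accuracy is a stand-in.
    return cross_val_score(KNeighborsClassifier(), X[:, sorted(subset)], y, cv=3).mean()

def replace_weak_feature(X, y, subset, all_features):
    """Try swapping each selected feature for an unselected one; keep the
    best single swap if it improves the criterion."""
    best_subset, best_score = subset, score(X, y, subset)
    for f_out in subset:
        for f_in in all_features - subset:
            trial = (subset - {f_out}) | {f_in}
            s = score(X, y, trial)
            if s > best_score:
                best_subset, best_score = trial, s
    return best_subset, best_score
```

Both arguments are plain Python sets of column indices; the swap is accepted only when it strictly improves J, which is what lets IFFS escape subsets where pure forward or backward steps stall.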
12.
Credit scoring analysis is an important activity, especially now that a huge number of defaults has been one of the main causes of the financial crisis. Among the many different tools used to model credit risk, the recent development of rough set models has proved effective. Rough set theory has been widely generalized and combined with other approaches to uncertain reasoning, especially probability and fuzzy set theories. Since coherent conditional probability assessments cope well with the problem of unifying these different approaches, a merging of fuzzy rough set theory with this subjectivist approach is proposed. Specifically, expert partial probabilistic evaluations are encompassed inside a gradual decision-rule structure, with coherence of the conclusion as a guideline. In line with Bayesian rough set models, credibility degrees of multiple premises are introduced through conditional probability assessments. Nonetheless, discernibility with this method remains too fine. Therefore, the basic partition is coarsened by equivalence classes based on the arity of positively, negatively and neutrally related criteria. A membership function that grades the likelihood of default is introduced through a particular choice of t-norms and t-conorms. Real data on a sample of firms are used to build and test the model.
13.
The main objective of feature selection is to improve learning performance by selecting concise and informative feature subsets, which presents a challenging task for machine learning or pattern recognition applications due to the large and complex search space involved. This paper provides an in-depth examination of nature-inspired metaheuristic methods for the feature selection problem, with a focus on representation and search algorithms, as they have drawn significant interest from the feature selection community due to their potential for global search and simplicity. An analysis of various advanced approach types, along with their advantages and disadvantages, is presented in this study, with the goal of highlighting important issues and unanswered questions in the literature. The article provides advice for conducting future research more effectively to benefit this field of study, including guidance on identifying appropriate approaches to use in different scenarios.
14.
Diwakar Tripathi, Damodar Reddy Edla, Ramalingaswamy Cheruku, Venkatanareshbabu Kuppili 《Computational Intelligence》2019,35(2):371-394
Credit scoring focuses on the development of empirical models to support the financial decision‐making processes of financial institutions and credit industries. It makes use of applicants' historical data and statistical or machine learning techniques to assess the risk associated with an applicant. However, the historical data may contain redundant and noisy features that affect the performance of credit scoring models. The main focus of this paper is to develop a hybrid model, combining feature selection and a multilayer ensemble classifier framework, to improve the predictive performance of credit scoring. The proposed hybrid credit scoring model is built in three phases. The initial phase constitutes preprocessing and assigns ranks and weights to classifiers. In the next phase, the ensemble feature selection approach is applied to the preprocessed dataset. In the final phase, the dataset with the selected features is used in a multilayer ensemble classifier framework. In addition, a classifier placement algorithm based on the Choquet integral value is designed, as classifier placement affects the predictive performance of the ensemble framework. The proposed hybrid credit scoring model is validated on real‐world credit scoring datasets, namely the Australian, Japanese, German‐categorical and German‐numerical datasets.
15.
Feature selection has become an increasingly important field of research. It aims at finding optimal feature subsets that achieve better generalization on unseen data, which can be very challenging when dealing with large feature sets; hence, a search strategy is needed that explores only a relatively small portion of the search space to find "semi-optimal" subsets. Many search strategies have been proposed in the literature, but most of them do not take relationships between features into consideration. Because features usually have different degrees of dependency on one another, we propose in this paper a new search strategy that uses the dependency between feature pairs to guide the search through the feature space. When compared with other well-known search strategies, the proposed method prevailed.
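One way to picture dependency-guided search, assuming mutual information as the relevance score and absolute correlation as the pairwise dependency (both choices are illustrative, not the paper's exact measures):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

relevance = mutual_info_classif(X, y, random_state=0)
corr = np.abs(np.corrcoef(X, rowvar=False))   # pairwise feature dependency

selected = [int(np.argmax(relevance))]
while len(selected) < 8:
    remaining = [f for f in range(X.shape[1]) if f not in selected]
    # Prefer relevant features that are weakly dependent on those already chosen.
    scores = [relevance[f] - corr[f, selected].mean() for f in remaining]
    selected.append(remaining[int(np.argmax(scores))])
print(selected)
```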
17.
A novel local search algorithm with configuration checking and scoring mechanism for the set k‐covering problem
Yiyuan Wang, Minghao Yin, Dantong Ouyang, Liming Zhang 《International Transactions in Operational Research》2017,24(6):1463-1485
The set k‐covering problem, an extension of the classical set covering problem, is an important NP‐hard combinatorial optimization problem with extensive applications, including computational biology and wireless networks. The aim of this paper is to design a new local search algorithm for this problem. First, to overcome the cycling problem in local search, the set k‐covering configuration checking (SKCC) strategy is proposed. Second, we use the cost scheme of the elements to define a scoring mechanism so that our algorithm can find different possible good‐quality solutions. Combining the SKCC strategy with the scoring mechanism, we design a subset selection strategy that decides which subset should be selected as a candidate solution component. We then propose a novel local search framework, which we call DLLccsm (diversion local search based on configuration checking and scoring mechanism). DLLccsm is evaluated against two state‐of‐the‐art algorithms, and the experimental results show that it produces better-quality solutions than its competitors in most classical instances.
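The anti-cycling idea behind configuration checking can be sketched independently of the k-covering details; the data structures below (a set-valued solution, a neighbor map keyed by subset, and per-subset boolean flags) are assumed for illustration:

```python
def allowed(candidates, conf_changed):
    """CC anti-cycling rule: a subset may re-enter the candidate solution
    only if its configuration (the state of its neighboring subsets) has
    changed since it was last flipped."""
    return [s for s in candidates if conf_changed[s]]

def flip(s, solution, neighbors, conf_changed):
    """Add or remove subset s, then update the configuration flags."""
    solution.symmetric_difference_update({s})
    conf_changed[s] = False          # s must wait for a neighborhood change
    for n in neighbors[s]:
        conf_changed[n] = True       # flipping s changes its neighbors' configurations
```

Because a just-flipped subset is frozen until something in its neighborhood moves, the search cannot immediately undo its own step, which is the cycling pattern plain local search suffers from.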
20.
An ant colony feature selection method based on fuzzy rough set information entropy
Most heuristic algorithms proposed for feature selection on high-dimensional data tend to get trapped in local optima and cannot search the whole feature space effectively. To improve the parallel search capability over the feature domain, this paper improves the search strategy, pheromone update and state-transition rules of the ant colony model based on the information entropy principle of fuzzy rough sets, and proposes an ant colony feature selection method. Experiments on UCI data verify that the algorithm achieves better selection results than traditional feature selection algorithms and is effective.
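A bare-bones version of such an ant colony loop; the subset quality function is a stub standing in for the fuzzy rough set information entropy criterion, and all constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_ants, n_iters, k = 20, 10, 30, 8
pheromone = np.ones(n_features)

def subset_quality(subset):
    # Stub: the paper scores subsets by fuzzy rough set information entropy.
    return rng.random()

for _ in range(n_iters):
    best_subset, best_q = None, -1.0
    for _ in range(n_ants):
        p = pheromone / pheromone.sum()   # simplified state-transition rule
        subset = rng.choice(n_features, size=k, replace=False, p=p)
        q = subset_quality(subset)
        if q > best_q:
            best_subset, best_q = subset, q
    pheromone *= 0.9                      # pheromone evaporation
    pheromone[best_subset] += best_q      # reinforce the iteration-best subset

print(np.argsort(pheromone)[-k:])         # features accumulating the most pheromone
```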