Similar Documents
20 similar documents found.
1.
The aim of this paper is to propose a new hybrid data mining model that combines various feature selection and ensemble learning classification algorithms in order to support the decision-making process. The model is built in several stages. In the first stage, the initial dataset is preprocessed; apart from applying different preprocessing techniques, we paid particular attention to feature selection. Five different feature selection algorithms were applied, and their results, assessed by the ROC and accuracy measures of a logistic regression algorithm, were combined using different voting types. We also proposed a new voting method, called if_any, that outperformed all other voting methods as well as the results of every single feature selection algorithm. In the next stage, four different classification algorithms (generalized linear model, support vector machine, naive Bayes, and decision tree) were trained on the dataset obtained in the feature selection process. These classifiers were combined into eight different ensemble models using the soft voting method. Experimental results on a real dataset show that the hybrid model based on the features selected by the if_any voting method and the ensemble GLM + DT model achieves the best performance, outperforming all other ensemble and single-classifier models.
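A minimal sketch of the vote-combination step, assuming the if_any rule keeps a feature as soon as at least one selector votes for it (an interpretation of the name, not a detail stated in the abstract); the boolean-mask representation and the comparison rules are illustrative:

```python
# Combining several feature-selection results by voting over boolean masks.
import numpy as np

def combine_masks(masks, method="if_any"):
    """masks: list of boolean arrays, one per feature-selection algorithm."""
    votes = np.sum(masks, axis=0)            # how many selectors chose each feature
    if method == "if_any":                   # union: at least one vote
        return votes >= 1
    if method == "majority":                 # more than half of the selectors
        return votes > len(masks) / 2
    if method == "if_all":                   # intersection: unanimous vote
        return votes == len(masks)
    raise ValueError(method)

# Example: 3 selectors over 5 features.
masks = [np.array([1, 0, 1, 0, 0], bool),
         np.array([1, 1, 0, 0, 0], bool),
         np.array([1, 0, 0, 1, 0], bool)]
print(combine_masks(masks, "if_any"))    # [ True  True  True  True False]
print(combine_masks(masks, "majority"))  # [ True False False False False]
```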

2.
This paper investigates feature selection based on rough sets for dimensionality reduction in Case-Based Reasoning classifiers. In order to be useful, Case-Based Reasoning systems should be able to manage imprecise, uncertain and redundant data so as to retrieve the most relevant information from a potentially overwhelming quantity of data. Rough Set Theory has been shown to be an effective tool for data mining and for uncertainty management. This paper has two central contributions: (1) it develops three strategies for feature selection, and (2) it proposes several measures for estimating attribute relevance based on Rough Set Theory. Although we concentrate on Case-Based Reasoning classifiers, the proposals are general enough to be applicable to a wide range of learning algorithms. We applied these proposals to twenty data sets from the UCI repository and examined the impact of feature selection on classification performance. Our evaluation shows that all three proposals benefit the basic Case-Based Reasoning system and are robust in comparison to well-known feature selection strategies.
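For background, the standard Rough Set Theory quantities that relevance measures of this kind typically build on are the positive region and the dependency degree; the formulas below are the textbook definitions, not necessarily the paper's exact measures:

```latex
% B-positive region of decision D over universe U, and dependency degree:
\mathrm{POS}_B(D) \;=\; \bigcup_{X \in U/D} \underline{B}X,
\qquad
\gamma_B(D) \;=\; \frac{\lvert \mathrm{POS}_B(D) \rvert}{\lvert U \rvert}
```

where \underline{B}X is the B-lower approximation of decision class X; an attribute is deemed relevant when dropping it from B reduces \gamma_B(D).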

3.
Databases often contain many redundant features; identifying the important features is known as feature extraction. This paper proposes a heuristic feature selection algorithm based on attribute significance. The algorithm uses attribute significance as its iterative criterion to obtain a minimal reduct of the attribute set.
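A minimal sketch of such a significance-guided search on categorical data, assuming significance is measured as the gain in the rough-set dependency degree (one common choice; the abstract does not fix the measure):

```python
# Greedy reduct search: repeatedly add the attribute with the highest
# dependency gain until the full dependency degree is reached.
from collections import defaultdict

def dependency(data, labels, attrs):
    """Fraction of objects whose equivalence class (w.r.t. attrs) is label-pure."""
    if not attrs:
        return 0.0
    classes = defaultdict(list)
    for i, row in enumerate(data):
        classes[tuple(row[a] for a in attrs)].append(i)
    pos = sum(len(ix) for ix in classes.values()
              if len({labels[i] for i in ix}) == 1)
    return pos / len(data)

def greedy_reduct(data, labels):
    all_attrs, reduct = set(range(len(data[0]))), []
    full = dependency(data, labels, all_attrs)
    while dependency(data, labels, reduct) < full:
        best = max(all_attrs - set(reduct),
                   key=lambda a: dependency(data, labels, reduct + [a]))
        reduct.append(best)
    return reduct

data = [[0, 1, 0], [0, 1, 1], [1, 0, 0], [1, 1, 1]]
labels = [0, 0, 1, 1]
print(greedy_reduct(data, labels))  # [0]: attribute 0 alone separates the classes
```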

4.
Feature selection plays a vital role in many areas of pattern recognition and data mining, and its effective computation is important for improving classification performance. In rough set theory, many feature selection algorithms have been proposed for static incomplete data. However, feature values in an incomplete data set may vary dynamically in real-world applications, and for such dynamic incomplete data a classic (non-incremental) feature selection approach is usually computationally time-consuming. To overcome this disadvantage, we propose an incremental approach that accelerates the feature selection process on dynamic incomplete data. We first employ an incremental scheme to compute the new positive region when the feature values of an object set vary dynamically. Based on the computed positive region, two efficient incremental feature selection algorithms are developed, for a single object and for multiple objects with varying feature values, respectively. We then conduct a series of experiments on 12 real UCI data sets to evaluate the efficiency and effectiveness of the proposed algorithms. The experimental results show that the proposed algorithms compare favorably with the existing non-incremental methods.
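A simplified sketch of the incremental idea: when one object's feature values change, only its old and its new equivalence class need re-examination, so the positive region can be patched locally rather than recomputed from scratch. This sketch assumes complete categorical data; the paper's handling of incomplete data (e.g., via tolerance relations) is not modeled:

```python
from collections import defaultdict

class IncrementalPos:
    def __init__(self, data, labels):
        self.data, self.labels = [list(r) for r in data], labels
        self.classes = defaultdict(set)          # description tuple -> object ids
        for i, row in enumerate(self.data):
            self.classes[tuple(row)].add(i)
        self.pos = {i for ix in self.classes.values() if self._pure(ix) for i in ix}

    def _pure(self, idx):
        return len({self.labels[i] for i in idx}) == 1

    def update(self, obj, new_row):
        old_key, new_key = tuple(self.data[obj]), tuple(new_row)
        for key in (old_key, new_key):            # retract both affected classes
            self.pos -= self.classes[key]
        self.classes[old_key].discard(obj)
        self.data[obj] = list(new_row)
        self.classes[new_key].add(obj)
        for key in (old_key, new_key):            # re-admit the ones still pure
            if self.classes[key] and self._pure(self.classes[key]):
                self.pos |= self.classes[key]

inc = IncrementalPos([[0, 1], [0, 1], [1, 0]], [0, 1, 1])
print(sorted(inc.pos))   # [2]: only object 2's class is label-pure
inc.update(1, [1, 0])    # object 1's feature values change
print(sorted(inc.pos))   # [0, 1, 2]: both classes are now pure
```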

5.
Support vector machines (SVMs) are a class of classification algorithms popular for their high generalization ability. However, it is time-consuming to train SVMs on a large set of learning samples, so improving learning efficiency is one of the most important research tasks on SVMs. It is known that although a learning task may offer many candidate training samples, only the samples near the decision boundary, called support vectors, affect the optimal classification hyperplanes. Finding these samples and training SVMs with them greatly decreases training time and space complexity. Based on this observation, we introduce a neighborhood-based rough set model to search for boundary samples. Using the model, we first divide the sample space into three subsets: positive region, boundary, and noise. Furthermore, we partition the input features into four subsets: strongly relevant features; weakly relevant and indispensable features; weakly relevant and superfluous features; and irrelevant features. We then train SVMs only with the boundary samples in the relevant and indispensable feature subspaces; thus feature and sample selection are conducted simultaneously with the proposed model. Experimental results show that the model selects very few features and samples for training while the classification performance is preserved or even improved.
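A minimal sketch of the sample-side of this idea: points whose delta-neighborhood is class-pure fall in the positive region, the rest are treated as boundary, and the SVM is trained on the boundary only. The radius delta and the omission of the noise category are simplifying assumptions:

```python
import numpy as np
from sklearn.svm import SVC

def split_regions(X, y, delta=0.3):
    pos, boundary = [], []
    for i, x in enumerate(X):
        nbrs = np.linalg.norm(X - x, axis=1) <= delta     # neighborhood of x
        (pos if np.all(y[nbrs] == y[i]) else boundary).append(i)
    return np.array(pos), np.array(boundary)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
_, boundary = split_regions(X, y)
clf = SVC(kernel="rbf").fit(X[boundary], y[boundary])     # train on boundary only
print(f"{len(boundary)} of {len(X)} samples used, acc={clf.score(X, y):.2f}")
```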

6.
Neighborhood rough set based heterogeneous feature subset selection
Feature subset selection is viewed as an important preprocessing step for pattern recognition, machine learning and data mining. Most research has focused on homogeneous feature selection, that is, on purely numerical or purely categorical features. In this paper, we introduce a neighborhood rough set model to deal with the problem of heterogeneous feature subset selection. As the classical rough set model can only evaluate categorical features, we generalize it with neighborhood relations and introduce a neighborhood rough set model, which degrades to the classical one if the neighborhood size is set to zero. The neighborhood model reduces numerical and categorical features by assigning different thresholds to different kinds of attributes. In this model, the sizes of the neighborhood lower and upper approximations of the decisions reflect the discriminating capability of feature subsets; the size of the lower approximation is computed as the dependency between the decision and condition attributes. We use this neighborhood dependency to evaluate the significance of a subset of heterogeneous features and construct forward feature subset selection algorithms. The proposed algorithms are compared with some classical techniques. Experimental results show that the neighborhood model based method is more flexible in dealing with heterogeneous data.
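A sketch of forward selection driven by neighborhood dependency, under the simplifying assumptions of a single numeric radius delta for all attributes and of dependency measured as the fraction of samples with a class-pure neighborhood in the selected subspace:

```python
import numpy as np

def nb_dependency(X, y, feats, delta=0.2):
    """Fraction of samples whose delta-neighborhood in the feats-subspace is pure."""
    Z = X[:, feats]
    pure = [np.all(y[np.linalg.norm(Z - z, axis=1) <= delta] == y[i])
            for i, z in enumerate(Z)]
    return np.mean(pure)

def forward_select(X, y, delta=0.2):
    selected, remaining, best_dep = [], list(range(X.shape[1])), 0.0
    while remaining:
        gains = {f: nb_dependency(X, y, selected + [f], delta) for f in remaining}
        f, dep = max(gains.items(), key=lambda kv: kv[1])
        if dep <= best_dep:               # no feature improves dependency: stop
            break
        selected.append(f); remaining.remove(f); best_dep = dep
    return selected, best_dep

X = np.array([[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]])
y = np.array([0, 0, 1, 1])
print(forward_select(X, y))  # ([0], 1.0): one feature already gives full dependency
```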

7.
Forecasting the behavior of variables (e.g., economic, financial, physical) is of strategic value for organizations, which sustains practical interest in the development of alternative models and resolution procedures. This paper presents a non-linear model that combines radial basis functions with the ARMA(p, q) structure. Because the optimal set of parameters for such a model is difficult to find, a scatter search metaheuristic is used to find it. Five time series are analyzed to assess and illustrate the pertinence of the proposed metaheuristic method.
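One plausible form of such a hybrid, written as a sketch (the abstract does not give the exact parameterization): an RBF expansion of the inputs plus ARMA(p, q) terms,

```latex
\hat{y}_t \;=\; \sum_{i=1}^{m} w_i \,
    \exp\!\Big(-\tfrac{\lVert \mathbf{x}_t - \mathbf{c}_i \rVert^2}{2\sigma_i^2}\Big)
  \;+\; \sum_{j=1}^{p} \phi_j\, y_{t-j}
  \;+\; \sum_{k=1}^{q} \theta_k\, \varepsilon_{t-k}
```

where the centers c_i, widths sigma_i, weights w_i and the ARMA coefficients phi_j, theta_k together form the parameter set that the scatter search metaheuristic explores.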

8.
Support vector machines (SVMs) are an effective tool for building good credit scoring models. However, the performance of the model depends on the setting of its parameters. In this study, we use a direct search method to optimize the SVM-based credit scoring model and compare it with three other parameter optimization methods: grid search, a method based on design of experiments (DOE), and a genetic algorithm (GA). Two real-world credit datasets are selected to demonstrate the effectiveness and feasibility of the method. The results show that the direct search method can find an effective model with high classification accuracy and good robustness, while depending less on the initial search space or starting point.
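A sketch of the approach with Nelder-Mead, a classic derivative-free direct search, standing in for the paper's direct-search variant; the synthetic data and the (log C, log gamma) search space are assumptions:

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

def neg_cv_accuracy(theta):
    C, gamma = np.exp(theta)                       # search in log space
    clf = SVC(C=C, gamma=gamma, kernel="rbf")
    return -cross_val_score(clf, X, y, cv=5).mean()

res = minimize(neg_cv_accuracy, x0=np.log([1.0, 0.1]), method="Nelder-Mead")
print("best C=%.3f gamma=%.4f acc=%.3f" % (*np.exp(res.x), -res.fun))
```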

9.
Feature selection is about finding useful (relevant) features to describe an application domain. Selecting enough relevant features to effectively represent and index a given dataset is an important task in solving classification and clustering problems intelligently. This task is, however, quite difficult to carry out, since it usually requires a very time-consuming search to obtain the desired features. This paper proposes a bit-based feature selection method that finds the smallest feature set to represent the indexes of a given dataset. The proposed approach originates from bitmap indexing and rough set techniques and consists of two phases. In the first phase, the given dataset is transformed into a bitmap indexing matrix with some additional data information. In the second phase, a set of relevant and sufficient features is selected and used to represent the classification indexes of the given dataset. The selected features can then be judged by domain experts, yielding the final feature set for the dataset. Finally, experimental results on different data sets show the efficiency and accuracy of the proposed approach.
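A minimal sketch of the first phase, assuming the bitmap indexing matrix carries one bit column per (attribute, value) pair; the paper's additional data information and its second-phase selection heuristics are not reproduced:

```python
import numpy as np

def bitmap_index(rows):
    """rows: list of tuples of categorical values -> (bit matrix, column names)."""
    cols = sorted({(j, v) for row in rows for j, v in enumerate(row)})
    matrix = np.array([[int(row[j] == v) for (j, v) in cols] for row in rows])
    return matrix, cols

rows = [("red", "small"), ("red", "large"), ("blue", "small")]
M, cols = bitmap_index(rows)
print(cols)   # [(0, 'blue'), (0, 'red'), (1, 'large'), (1, 'small')]
print(M)      # each row is the bitmap encoding of one record
```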

10.
A new local search based hybrid genetic algorithm for feature selection
This paper presents a new hybrid genetic algorithm (HGA) for feature selection (FS), called HGAFS. The vital aspect of this algorithm is the selection of a salient feature subset of reduced size. HGAFS incorporates a new local search operation, devised and embedded in the HGA to fine-tune the search in the FS process. The local search technique works on the basis of how distinct and informative the input features are, as computed from their correlation information. The aim is to guide the search process so that newly generated offspring can be adjusted using the less correlated (distinct) features that capture both general and special characteristics of a given dataset. The proposed HGAFS thus reduces the redundancy of information among the selected features. In addition, HGAFS emphasizes selecting a subset of salient features of reduced size using a subset-size determination scheme. We have tested HGAFS on 11 real-world classification datasets with dimensions ranging from 8 to 7129, and compared its performance with the results of ten other well-known FS algorithms. HGAFS consistently performs better at selecting subsets of salient features, resulting in better classification accuracies.

11.
A new improved forward floating selection (IFFS) algorithm for selecting a subset of features is presented. The proposed algorithm improves on the state-of-the-art sequential forward floating selection algorithm by adding a search step called "replacing the weak feature", which checks whether removing any feature in the currently selected feature subset and adding a new one at each sequential step can improve the current subset. Our method provides optimal or quasi-optimal (close to optimal) solutions for many selected subsets and requires significantly less computation than optimal feature selection algorithms. Experimental results on four different databases demonstrate that our algorithm consistently selects better subsets than other suboptimal feature selection algorithms, especially when the original number of features in the database is large.
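A sketch of the added step in isolation, with the evaluation criterion J left abstract (the paper uses classifier-based criteria); swaps are accepted greedily while they improve J:

```python
def replace_weak_feature(selected, candidates, J):
    """Try swapping each selected feature for each unselected one; keep improvements."""
    best, improved = set(selected), True
    while improved:
        improved = False
        for f_out in list(best):
            for f_in in candidates - best:
                trial = (best - {f_out}) | {f_in}
                if J(trial) > J(best):
                    best, improved = trial, True
                    break
            if improved:
                break
    return best

# Toy criterion: features 0 and 2 together are ideal, with a size penalty.
J = lambda s: len(s & {0, 2}) - 0.1 * len(s)
print(replace_weak_feature({0, 1}, {0, 1, 2, 3}, J))  # {0, 2}
```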

12.
Credit scoring analysis using a fuzzy probabilistic rough set model
Credit scoring analysis is an important activity, especially now that a huge number of defaults has been one of the main causes of the financial crisis. Among the many tools used to model credit risk, recently developed rough set models have proved effective. The original rough set theory has been widely generalized and combined with other approaches to uncertain reasoning, especially probability and fuzzy set theories. Since coherent conditional probability assessments cope well with the problem of unifying these different approaches, a merging of fuzzy rough set theory with this subjectivist approach is proposed. Specifically, expert partial probabilistic evaluations are encompassed within a gradual decision rule structure, with coherence of the conclusion as a guideline. In line with Bayesian rough set models, credibility degrees of multiple premises are introduced through conditional probability assessments. Nonetheless, discernibility with this method remains too fine, so the basic partition is coarsened into equivalence classes based on the arity of positively, negatively and neutrally related criteria. A membership function, which grades the likelihood of default, is introduced through a particular choice of t-norms and t-conorms. To build and test the model, real data on a sample of firms are used.

13.
The main objective of feature selection is to improve learning performance by selecting concise and informative feature subsets, a challenging task for machine learning and pattern recognition applications because of the large and complex search space involved. This paper provides an in-depth examination of nature-inspired metaheuristic methods for the feature selection problem, focusing on representation and search algorithms, as they have drawn significant interest from the feature selection community owing to their potential for global search and their simplicity. An analysis of various advanced approaches, along with their advantages and disadvantages, is presented with the goal of highlighting important issues and open questions in the literature. The article also provides advice for conducting future research more effectively, including guidance on identifying appropriate approaches to use in different scenarios.

14.
Credit scoring focuses on the development of empirical models to support the financial decision-making processes of financial institutions and credit industries. It makes use of applicants' historical data and statistical or machine learning techniques to assess the risk associated with an applicant. However, the historical data may contain redundant and noisy features that degrade the performance of credit scoring models. The main focus of this paper is to develop a hybrid model, combining feature selection with a multilayer ensemble classifier framework, to improve the predictive performance of credit scoring. The proposed hybrid credit scoring model is built in three phases. The initial phase performs preprocessing and assigns ranks and weights to classifiers. In the next phase, an ensemble feature selection approach is applied to the preprocessed dataset. Finally, in the last phase, the dataset with the selected features is used in a multilayer ensemble classifier framework. In addition, a classifier placement algorithm based on the Choquet integral value is designed, since classifier placement affects the predictive performance of the ensemble framework. The proposed hybrid credit scoring model is validated on real-world credit scoring datasets, namely the Australian, Japanese, German-categorical, and German-numerical datasets.
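A sketch of the aggregation primitive involved, the discrete Choquet integral over classifier scores; the fuzzy measure mu below is a toy assignment, whereas the paper derives classifier ranks and weights in its preprocessing phase:

```python
def choquet(scores, mu):
    """scores: {classifier: score in [0,1]}; mu: {frozenset: measure in [0,1]}."""
    order = sorted(scores, key=scores.get)          # ascending by score
    total, prev = 0.0, 0.0
    for i, clf in enumerate(order):
        coalition = frozenset(order[i:])            # classifiers scoring >= this one
        total += (scores[clf] - prev) * mu[coalition]
        prev = scores[clf]
    return total

scores = {"svm": 0.9, "nb": 0.6, "dt": 0.7}
mu = {frozenset(s): m for s, m in [
    ({"svm", "nb", "dt"}, 1.0), ({"svm", "dt"}, 0.8), ({"svm"}, 0.5)]}
print(choquet(scores, mu))  # 0.6*1.0 + 0.1*0.8 + 0.2*0.5 = 0.78
```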

15.
Feature selection has become an increasingly important field of research. It aims at finding optimal feature subsets that achieve better generalization on unseen data. This can be a very challenging task, however, especially when dealing with large feature sets, so a search strategy is needed that explores a relatively small portion of the search space in order to find "semi-optimal" subsets. Many search strategies have been proposed in the literature, but most of them do not take relationships between features into consideration. Because features usually exhibit different degrees of dependency on one another, we propose in this paper a new search strategy that utilizes the dependency between feature pairs to guide the search in the feature space. In comparisons with other well-known search strategies, the proposed method prevailed.

16.
Research on a feature selection method based on rough sets and ant colony optimization
Existing feature selection methods based on ant colony optimization start from random points and search for the optimal feature combination. This paper discusses and analyzes the feature core concept of rough set theory and combines it with the global optimization capability of ant colony optimization, using feature significance as the heuristic search information. A feature selection algorithm based on rough set theory and ant colony optimization is proposed that starts from the feature core, which reduces the scale of the ants' complete-graph search. Tests on standard UCI data sets verify the effectiveness of the new algorithm for feature selection.

17.
The set k-covering problem, an extension of the classical set covering problem, is an important NP-hard combinatorial optimization problem with extensive applications, including computational biology and wireless networks. The aim of this paper is to design a new local search algorithm for this problem. First, to overcome the cycling problem in local search, the set k-covering configuration checking (SKCC) strategy is proposed. Second, we use the cost scheme of elements to define a scoring mechanism so that our algorithm can find different potentially good-quality solutions. Combining the SKCC strategy with the scoring mechanism, a subset selection strategy is designed to decide which subset should be selected as a candidate solution component. After that, a novel local search framework, which we call DLLccsm (diversion local search based on configuration checking and scoring mechanism), is proposed. DLLccsm is evaluated against two state-of-the-art algorithms. The experimental results show that DLLccsm outperforms its competitors in terms of solution quality on most classical instances.
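A minimal sketch of the configuration-checking principle that SKCC adapts: an element may re-enter the candidate pool only after the state of its neighborhood has changed since it was last flipped, which breaks local-search cycling. The encoding of the set k-covering problem itself is abstracted away:

```python
def make_cc(neighbors):
    """neighbors: {element: list of neighboring elements}."""
    conf_changed = {v: True for v in neighbors}      # all eligible initially
    def on_flip(v):                                  # v enters/leaves the solution
        conf_changed[v] = False                      # forbid immediate re-flip of v
        for u in neighbors[v]:                       # its neighbors become eligible
            conf_changed[u] = True
    return conf_changed, on_flip

neighbors = {1: [2, 3], 2: [1], 3: [1]}
allowed, on_flip = make_cc(neighbors)
on_flip(1)
print(allowed)  # {1: False, 2: True, 3: True}
```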

18.
杨胜  施鹏飞  顾钧 《控制与决策》2004,19(11):1208-1212
This paper analyzes the attribute reduction problem of rough set theory from the perspective of the mutual information of attribute sets. First, a new measure of the redundancy and synergy of an attribute subset, the redundancy-synergy coefficient, is defined on the basis of mutual information. This coefficient is then taken as the attribute reduction measure, and a rough set attribute reduction algorithm based on beam search is proposed. Experiments show that the attribute reduction algorithm performs well.
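A rough sketch of the ingredients on discrete data: mutual information, pairwise MI as the internal redundancy of a subset, and joint MI with the decision as its relevance. The paper's exact redundancy-synergy coefficient is not reproduced; the difference used below is an illustrative assumption:

```python
from collections import Counter
from math import log2

def mi(xs, ys):
    """Mutual information I(X;Y) in bits for two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * log2((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

def redundancy_synergy(data, labels, subset):
    cols = [[row[a] for row in data] for a in subset]
    joint = [tuple(row[a] for a in subset) for row in data]
    redundancy = sum(mi(cols[i], cols[j])
                     for i in range(len(cols)) for j in range(i + 1, len(cols)))
    return mi(joint, labels) - redundancy     # joint relevance minus redundancy

data = [[0, 0], [0, 1], [1, 0], [1, 1]]
labels = [0, 1, 1, 0]                         # XOR: only the pair is informative
print(redundancy_synergy(data, labels, [0, 1]))  # 1.0 (bits)
```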

19.
20.
An ant colony feature selection method based on fuzzy rough set information entropy
赵军阳  张志利 《计算机应用》2009,29(1):109-111,
Most heuristic algorithms proposed so far for feature selection on high-dimensional data tend to become trapped in local optima and cannot search the entire feature space effectively. To improve the parallel search capability over the feature domain, this paper improves the search strategy, pheromone update, and state transition rules of the ant colony model based on the information entropy principle of fuzzy rough sets, and proposes an ant colony feature selection method. Experiments on UCI data verify that the algorithm achieves better selection results than traditional feature selection algorithms and is effective.
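For reference, the classic ACO state-transition rule that such methods modify (a sketch; per the abstract, the heuristic desirability eta would here be derived from fuzzy rough set information entropy):

```latex
p_{ij}^{k} \;=\;
\frac{\left[\tau_{ij}\right]^{\alpha}\left[\eta_{ij}\right]^{\beta}}
     {\sum_{l \in \mathrm{allowed}_k}\left[\tau_{il}\right]^{\alpha}\left[\eta_{il}\right]^{\beta}},
\qquad j \in \mathrm{allowed}_k
```

where tau is the pheromone level, eta the heuristic information, and alpha, beta weight their relative influence.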

