Similar Literature
20 similar documents found (search time: 31 ms)
1.
We propose an extension of an entropy-based heuristic for constructing a decision tree from a large database with many numeric attributes. When it comes to handling numeric attributes, conventional methods are inefficient if any numeric attributes are strongly correlated. Our approach offers one solution to this problem. For each pair of numeric attributes with strong correlation, we compute a two-dimensional association rule with respect to these attributes and the objective attribute of the decision tree. In particular, we consider a family R of grid regions in the plane associated with the pair of attributes. For R ∈ R, the data can be split into two classes: data inside R and data outside R. We compute the region R_opt ∈ R that minimizes the entropy of the splitting, and add the splitting associated with R_opt (for each pair of strongly correlated attributes) to the set of candidate tests in an entropy-based heuristic. We give efficient algorithms for cases in which R is (1) x-monotone connected regions, (2) base-monotone regions, (3) rectangles, and (4) rectilinear convex regions. The algorithm has been implemented as a subsystem of SONAR (System for Optimized Numeric Association Rules) developed by the authors. We have confirmed that we can compute the optimal region efficiently, and diverse experiments show that our approach can create compact trees whose accuracy is comparable with or better than that of conventional trees. More importantly, we can grasp non-linear correlations among numeric attributes that could not be found without our region splitting.
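The entropy criterion by which a candidate region R is judged can be sketched in a few lines. This is a hedged illustration of the general idea, not the authors' SONAR implementation; the class labels and the inside/outside partition below are invented:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def split_entropy(inside, outside):
    """Weighted entropy of splitting the data into points inside / outside a region R."""
    n = len(inside) + len(outside)
    return (len(inside) / n) * entropy(inside) + (len(outside) / n) * entropy(outside)

# Objective-attribute labels of points falling inside vs. outside a candidate region R
inside = ["yes", "yes", "yes", "no"]
outside = ["no", "no", "no", "yes"]
print(split_entropy(inside, outside))  # the heuristic keeps the region minimizing this value
```

The optimization over the region family then amounts to minimizing `split_entropy` over all R in the family, which is where the paper's efficient algorithms come in.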

2.
Different multi-attribute decision-making (MADM) methods often produce different outcomes for selecting or ranking a set of decision alternatives involving multiple attributes. This paper presents a new approach to the selection of compensatory MADM methods for a specific cardinal ranking problem via sensitivity analysis of attribute weights. In line with the context-dependent concept of informational importance, the approach examines the consistency degree between the relative degree of sensitivity of individual attributes using an MADM method and the relative degree of influence of the corresponding attributes indicated by Shannon's entropy concept. The approach favors the method that has the highest consistency degree as it best reflects the decision information embedded in the problem data set. An empirical study of a scholarship student selection problem is used to illustrate how the approach can validate the ranking outcome produced by different MADM methods. The empirical study shows that different problem data sets may result in a different method being selected. This approach is particularly applicable to large-scale cardinal ranking problems where the ranking outcome of different methods differs significantly.
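Shannon's entropy concept, used here to gauge how much decision information each attribute carries, is commonly turned into objective attribute weights as follows. This is one standard formulation, not necessarily the paper's exact one, and the decision matrix is a made-up example:

```python
from math import log

def entropy_weights(matrix):
    """Objective attribute weights from Shannon's entropy.
    matrix[i][j] >= 0 is the value of alternative i on attribute j."""
    m, n = len(matrix), len(matrix[0])
    k = 1.0 / log(m)  # normalizes entropy to [0, 1]
    diversification = []
    for j in range(n):
        col = [row[j] for row in matrix]
        total = sum(col)
        p = [v / total for v in col]
        e = -k * sum(pi * log(pi) for pi in p if pi > 0)  # entropy of attribute j
        diversification.append(1.0 - e)  # high when the attribute discriminates
    s = sum(diversification)
    return [d / s for d in diversification]

# An attribute whose values are identical across alternatives carries no
# information and gets (numerically) zero weight.
print(entropy_weights([[1, 5], [1, 1], [1, 3]]))
```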

3.
Mining optimized gain rules for numeric attributes (cited 7 times: 0 self-citations, 7 by others)
Association rules are useful for determining correlations between attributes of a relation and have applications in the marketing, financial, and retail sectors. Furthermore, optimized association rules are an effective way to focus on the most interesting characteristics involving certain attributes. Optimized association rules are permitted to contain uninstantiated attributes and the problem is to determine instantiations such that either the support, confidence, or gain of the rule is maximized. In this paper, we generalize the optimized gain association rule problem by permitting rules to contain disjunctions over uninstantiated numeric attributes. Our generalized association rules enable us to extract more useful information about seasonal and local patterns involving the uninstantiated attribute. For rules containing a single numeric attribute, we present an algorithm with linear complexity for computing optimized gain rules. Furthermore, we propose a bucketing technique that can result in a significant reduction in input size by coalescing contiguous values without sacrificing optimality. We also present an approximation algorithm based on dynamic programming for two numeric attributes. Using recent results on binary space partitioning trees, we show that the approximations are within a constant factor of the optimal optimized gain rules. Our experimental results with synthetic data sets for a single numeric attribute demonstrate that our algorithm scales up linearly with the attribute's domain size as well as the number of disjunctions. In addition, we show that applying our optimized rule framework to a population survey real-life data set enables us to discover interesting underlying correlations among the attributes.
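For intuition about the single-attribute case: when the rule body is a single interval over a bucketed numeric attribute, maximizing a gain of the form (records in the interval satisfying the consequent) minus minconf times (records in the interval) reduces to the classic maximum-sum segment problem, solvable in one linear scan. This sketch uses that reduction under the stated gain definition; the bucket counts are invented, and the paper's full algorithm (handling disjunctions of intervals) is not reproduced here:

```python
def optimized_gain_interval(buckets, min_conf):
    """Find the bucket interval [l, u] maximizing
    gain = (#records in interval satisfying the consequent)
           - min_conf * (#records in interval).
    Per-bucket contributions turn this into a maximum-sum segment (Kadane) scan."""
    contrib = [hit - min_conf * total for total, hit in buckets]
    best, best_range = 0.0, None
    cur, start = 0.0, 0
    for i, c in enumerate(contrib):
        if cur <= 0.0:
            cur, start = c, i  # restart the segment at bucket i
        else:
            cur += c
        if cur > best:
            best, best_range = cur, (start, i)
    return best, best_range

# (total records, records satisfying the consequent) per bucket of the numeric attribute
buckets = [(10, 2), (10, 9), (10, 9), (10, 1)]
print(optimized_gain_interval(buckets, min_conf=0.5))  # best interval spans buckets 1..2
```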

4.
For high dimensional data, if no preprocessing is carried out before inputting patterns to classifiers, the computation required may be too heavy. For example, the number of hidden units of a radial basis function (RBF) neural network can be too large. This is not suitable for some practical applications due to speed and memory constraints. In many cases, some attributes are not relevant to concepts in the data at all. In this paper, we propose a novel separability-correlation measure (SCM) to rank the importance of attributes. According to the attribute ranking results, different attribute subsets are used as inputs to a classifier, such as an RBF neural network. Those attributes that increase the validation error are deemed irrelevant and are deleted. The complexity of the classifier can thus be reduced and its classification performance improved. Computer simulations show that our method for attribute importance ranking leads to smaller attribute subsets with higher accuracies compared with the existing SUD and Relief-F methods. We also propose a modified method for efficient construction of an RBF classifier. In this method we allow for large overlaps between clusters corresponding to the same class label. Our approach significantly reduces the structural complexity of the RBF network and improves the classification performance.

5.
Business operation performance is related to corporation profitability and directly affects the choices of investment in the stock market. This paper proposes a hybrid method, which combines the ordered weighted averaging (OWA) operator and rough set theory after an attribute selection procedure, to deal with multi-attribute forecasting problems with respect to the revenue growth rate of the electronics industry. In the attribute selection step, the four most important of 12 attributes collected from the related literature are determined via five attribute selection methods as the input to the subsequent procedure of the proposed method. The OWA operator can adjust the weight of an attribute based on the situation of a decision-maker and aggregate different attribute values into a single aggregated value for each instance; the single aggregated values are then used to generate classification rules by rough sets for forecasting operation performance. To verify the proposed method, this research collects the financial data of 629 electronic firms publicly listed on the TSE (Taiwan Stock Exchange) and OTC (Over-the-Counter) market in 2004 and 2005 to forecast the revenue growth rate. The results show that the proposed method outperforms the listed methods.
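The OWA operator mentioned above weights the ordered (sorted) attribute values rather than particular attributes, which is what lets a decision-maker tune between optimistic and pessimistic aggregation. A minimal sketch with made-up values:

```python
def owa(values, weights):
    """Ordered weighted averaging: weights apply to the values sorted in
    descending order, not to specific attributes."""
    assert abs(sum(weights) - 1.0) < 1e-9, "OWA weights must sum to 1"
    ordered = sorted(values, reverse=True)
    return sum(w * v for w, v in zip(weights, ordered))

# Weights (1,0,0) recover the max, (0,0,1) the min, and equal weights the mean.
print(owa([0.2, 0.9, 0.5], [1, 0, 0]))  # 0.9
```

Varying the weight vector is how the operator reflects "the situation of a decision-maker" before the aggregated values are fed to the rough-set rule generator.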

6.
A fuzzy support vector machine (FSVM) is an improvement on SVMs for dealing with data sets containing outliers. In an FSVM, a key step is to compute the membership of every training sample. Existing approaches to computing the membership of a sample are motivated by the existence of outliers in data sets and do not take into account the inconsistency between conditional attributes and decision classes. However, this kind of inconsistency can affect the membership of every sample and has been considered in fuzzy rough set theory. In this paper, we develop a new method to compute memberships for FSVMs by using a Gaussian kernel-based fuzzy rough set. Furthermore, we employ a technique of attribute reduction using Gaussian kernel-based fuzzy rough sets to perform feature selection for FSVMs. Based on these discussions, we combine the FSVM and fuzzy rough set methods. The experimental results show that the proposed approaches are feasible and effective.
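The paper derives memberships from Gaussian kernel-based fuzzy rough approximations; the simplified sketch below only conveys the underlying intuition — kernel similarity to one's own class is low for outliers — and is not the authors' exact formula. The sample points are invented:

```python
from math import exp

def gaussian_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), a similarity in (0, 1]."""
    return exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2.0 * sigma ** 2))

def membership(sample, same_class, sigma=1.0):
    """Illustrative FSVM membership: average kernel similarity of a sample to
    the samples of its own class; outliers receive small values and thus
    contribute less to the SVM objective."""
    return sum(gaussian_kernel(sample, s, sigma) for s in same_class) / len(same_class)

cluster = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
print(membership((0.05, 0.05), cluster))  # near the cluster: close to 1
print(membership((5.0, 5.0), cluster))    # outlier: close to 0
```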

7.
This paper deals with the problem of supervised wrapper-based feature subset selection in datasets with a very large number of attributes. Recently the literature has contained numerous references to the use of hybrid selection algorithms: based on a filter ranking, they perform an incremental wrapper selection over that ranking. Although these methods work well, they still have problems: (1) depending on the complexity of the wrapper search method, the number of wrapper evaluations can still be too large; and (2) they rely on a univariate ranking that does not take into account interaction between the variables already included in the selected subset and the remaining ones. Here we propose a new approach whose main goal is to drastically reduce the number of wrapper evaluations while maintaining good performance (e.g. accuracy and size of the obtained subset). To do this we propose an algorithm that iteratively alternates between filter ranking construction and wrapper feature subset selection (FSS). Thus, the FSS only uses the first block of ranked attributes, and the ranking method uses the currently selected subset in order to build a new ranking in which this knowledge is considered. The algorithm terminates when no new attribute is selected in the last call to the FSS algorithm. The main advantage of this approach is that only a few blocks of variables are analyzed, and so the number of wrapper evaluations decreases drastically. The proposed method is tested over eleven high-dimensional datasets (2400-46,000 variables) using different classifiers. The results show an impressive reduction in the number of wrapper evaluations without degrading the quality of the obtained subset.

8.
Benchmarking attribute selection techniques for discrete class data mining (cited 9 times: 0 self-citations, 9 by others)
Data engineering is generally considered to be a central issue in the development of data mining applications. The success of many learning schemes, in their attempts to construct models of data, hinges on the reliable identification of a small set of highly predictive attributes. The inclusion of irrelevant, redundant, and noisy attributes in the model-building phase can result in poor predictive performance and increased computation. Attribute selection generally involves a combination of search and attribute utility estimation plus evaluation with respect to specific learning schemes. This leads to a large number of possible permutations and has led to a situation where very few benchmark studies have been conducted. This paper presents a benchmark comparison of several attribute selection methods for supervised classification. All the methods produce an attribute ranking, a useful device for isolating the individual merit of an attribute. Attribute selection is achieved by cross-validating the attribute rankings with respect to a classification learner to find the best attributes. Results are reported for a selection of standard data sets and two diverse learning schemes, C4.5 and naive Bayes.

9.
The problem of identifying meaningful patterns in a database lies at the very heart of data mining. A core objective of data mining processes is the recognition of inter-attribute correlations. Not only are correlations necessary for predictions and classifications – since rules would fail in the absence of pattern – but the identification of groups of mutually correlated attributes also expedites the selection of a representative subset of attributes, from which existing mappings allow others to be derived. In this paper, we describe a scalable, effective algorithm to identify groups of correlated attributes. This algorithm can handle non-linear correlations between attributes, and is not restricted to a specific family of mapping functions, such as the set of polynomials. We show the results of our evaluation of the algorithm applied to synthetic and real world datasets, and demonstrate that it is able to spot the correlated attributes. Moreover, the execution time of the proposed technique is linear in the number of elements and of correlations in the dataset.

10.
Social choice deals with aggregating the preferences of a number of voters into a collective preference. We use this idea for software project effort estimation, substituting project attributes for the voters. Therefore, instead of supplying numeric values for various project attributes that are then used in regression or similar methods, a new project only needs to be placed into one ranking per attribute, necessitating only ordinal values. Using the resulting aggregate ranking, the new project is placed among other projects whose actual expended effort can be used to derive an estimate. In this paper we present this method and extensions using weightings derived from genetic algorithms. We detail a validation based on several well-known data sets and show that estimation accuracy similar to classic methods can be achieved with considerably lower demands on input data.
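One classical social-choice aggregation rule that fits the scheme described — each attribute "votes" with a ranking of projects — is the Borda count. The abstract does not say which rule the paper uses, and its GA-derived weighting is not reproduced here; the projects and attribute rankings below are invented:

```python
def borda_aggregate(rankings):
    """Aggregate several rankings (lists of project ids, best first) into one
    collective ranking by Borda count: a project in position p of a ranking of
    length n scores n - 1 - p points."""
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for pos, proj in enumerate(ranking):
            scores[proj] = scores.get(proj, 0) + (n - 1 - pos)
    return sorted(scores, key=lambda p: -scores[p])

# One ordinal ranking per project attribute (no numeric values needed)
by_size  = ["A", "B", "C"]
by_team  = ["B", "A", "C"]
by_reuse = ["A", "C", "B"]
print(borda_aggregate([by_size, by_team, by_reuse]))  # ['A', 'B', 'C']
```

A new project's position in the aggregate ranking then locates it between projects with known effort, from which an estimate can be interpolated.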

11.
An RBF neural network attribute selection method based on separability-criterion ranking (cited 2 times: 0 self-citations, 2 by others)
Wen Zhuan, Wang Zheng'ou. Computer Engineering (《计算机工程》), 2004, 30(23): 40-42
We propose a neural-network attribute selection method based on ranking the importance of data attributes. The method needs to train on only some of the attributes to reduce dimensionality, overcoming the drawback of existing neural-network dimensionality-reduction methods, which must train on all attributes, and thus greatly improving the efficiency of attribute selection. The method first ranks the attributes by importance using a simple separability criterion proposed in this paper, and then performs attribute selection with an RBF neural network in order of importance. Simulation examples show that the method works well.

12.
Attribute reduction is one of the key research topics in rough set theory, but it is an NP-hard problem that must be tackled with heuristic knowledge. This paper proposes a method that uses the discernibility matrix to compute the positive region of different combinations of condition attributes relative to the decision attributes, and gives a new method for finding core attributes. On this basis, a new attribute-reduction algorithm based on the discernibility matrix is proposed; it quickly finds a minimal attribute set, is simple to implement, and performs attribute reduction and rule extraction simultaneously. Finally, an example demonstrates its correctness.

13.
Cost-constrained data acquisition for intelligent data preparation (cited 1 time: 0 self-citations, 1 by others)
Real-world data is noisy and can often suffer from corruptions or incomplete values that may impact the models created from the data. To build accurate predictive models, data acquisition is usually adopted to prepare the data and complete missing values. However, due to the significant cost of doing so and the inherent correlations in the data set, acquiring correct information for all instances is prohibitive and unnecessary. An interesting and important problem that arises here is to select what kinds of instances to complete so the model built from the processed data can receive the "maximum" performance improvement. This problem is complicated by the reality that the costs associated with the attributes are different, and fixing the missing values of some attributes is inherently more expensive than others. Therefore, the problem becomes that given a fixed budget, what kinds of instances should be selected for preparation, so that the learner built from the processed data set can maximize its performance? In this paper, we propose a solution for this problem, and the essential idea is to combine attribute costs and the relevance of each attribute to the target concept, so that the data acquisition can pay more attention to those attributes that are cheap in price but informative for classification. To this end, we will first introduce a unique economical factor (EF) that seamlessly integrates the cost and the importance (in terms of classification) of each attribute. Then, we will propose a cost-constrained data acquisition model, where active learning, missing value prediction, and impact-sensitive instance ranking are combined for effective data acquisition. Experimental results and comparative studies from real-world data sets demonstrate the effectiveness of our method.

14.
In this paper, we propose a new objective weighting method that employs intuitionistic fuzzy (IF) entropy measures to solve multiple-attribute decision-making problems in the context of intuitionistic fuzzy sets. Instead of traditional fuzzy entropy, which uses the probabilistic discrimination of attributes to obtain attribute weights, we utilize the IF entropy to assess objective weights based on the credibility of the input data. We examine various measures for IF entropy with respect to hesitation degree, probability, non-probability, and geometry to calculate the attribute weights. A comparative analysis of the different measures' attribute rankings is illustrated with both computational experiments and analyses of Pearson correlations, Spearman rank correlations, contradiction rates, inversion rates, and consistency rates. The experimental results indicate that the attribute-ranking outcomes depend not only on the type of IF entropy measure but also on the number of attributes and the number of alternatives.

15.
16.
Set-valued information systems (cited 2 times: 0 self-citations, 2 by others)
Set-valued information systems are generalized models of single-valued information systems. Incomplete information systems can be viewed as disjunctively interpreted set-valued information systems. Since some objects in set-valued information systems may have more than one value for an attribute, we define a tolerance relation and use the maximal tolerance classes to classify the universe of discourse. In order to derive optimal decision rules from set-valued decision information systems, we propose the concept of the relative reduct of maximal tolerance classes, and define a kind of discernibility function to compute the relative reduct by Boolean reasoning techniques. Finally, we define three kinds of relative reducts for set-valued information systems and use them to evaluate the significance of attributes.
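The tolerance relation can be sketched directly from its usual definition — two objects are tolerant when their value sets overlap on every attribute. This is a hedged toy illustration; the objects and attribute values below are invented:

```python
def tolerant(x, y):
    """Two objects of a set-valued information system are tolerant iff their
    value sets intersect on every attribute.  Each object is a list with one
    value set per attribute."""
    return all(a & b for a, b in zip(x, y))

o1 = [{"a", "b"}, {"1"}]
o2 = [{"b"}, {"1", "2"}]
o3 = [{"c"}, {"2"}]
print(tolerant(o1, o2), tolerant(o1, o3))  # True False
```

Maximal tolerance classes are then the largest sets of objects that are pairwise tolerant, which is what the paper classifies the universe by.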

17.
A statistical basis for decision-table analysis (cited 2 times: 0 self-citations, 2 by others)
A nonparametric statistical test for condition-attribute reduction of decision tables is given. First, the contingency table corresponding to the decision table is constructed and a significance test of the correlation between each condition attribute and the decision attribute is performed; at a given significance level, an attribute is judged redundant with respect to the decision according to whether its correlation is significant, thereby yielding an attribute reduct. The Lambda coefficient is then used to measure the correlation of the attributes significantly correlated with the decision attribute, indicating the proportional reduction in error achieved when the condition attributes are used to predict the decision attribute. On the basis of the contingency table, first-level rules of the decision table are also obtained. Experiments on a medical-case decision table show that the method is simple and effective.

18.
Mining association rules over multi-valued attributes (cited 45 times: 1 self-citation, 45 by others)
Zhang Zhaohui, Lu Yuchang, Zhang Bo. Journal of Software (《软件学报》), 1998, 9(11): 801-805
Attribute values may be Boolean or multi-valued. Mature systems and methods exist for mining association rules from data described by Boolean attributes, but not for multi-valued attributes. Converting multi-valued data into Boolean data is a convenient and effective route. We propose an algorithm that determines the partitioning of a multi-valued attribute from the data itself and then maps the resulting intervals to Boolean values, from which effective association rules that are easy to understand and suitably general can be mined.

19.
To address the problem that principal component analysis - Bayesian discriminant analysis (PCA-BDA) supports safety evaluation but cannot identify risk factors, the concept of attribute importance is introduced and an improved PCA-BDA algorithm is proposed and applied to oil-drilling safety evaluation. First, the original PCA-BDA method is used to evaluate the safety grade of each record; then attribute importance is computed from the eigenvector matrix of the principal component analysis (PCA) step, the discriminant-function matrix of the Bayesian discriminant analysis (BDA) step, and the weights of the safety grades; finally, attributes are adjusted with reference to their importance. In comparative experiments on safety-evaluation accuracy, the improved PCA-BDA method reached 96.7%, clearly higher than the analytic hierarchy process (AHP) and fuzzy comprehensive evaluation (FCE). In simulation experiments on attribute adjustment, adjusting the three attributes with the highest importance improved the safety grade of more than 70% of the drilling wells; in contrast, adjusting the three attributes with the lowest importance left the safety grades almost unchanged. The experimental results show that the improved PCA-BDA method not only performs safety evaluation accurately but also identifies the key attributes, making oil-drilling safety management more targeted.

20.
High dimensional data contain many redundant or irrelevant attributes, which make data mining and many pattern-recognition tasks difficult. Before implementing data mining or pattern recognition on a high dimensional space, it is necessary to reduce its dimensionality. In this paper, a new attribute importance measure and a selection method based on attribute ranking are proposed. In the proposed attribute selection method, input-output correlation (IOC) is applied to calculate attribute importance, and the attributes are then sorted in descending order. A hybrid of the Back Propagation Neural Network (BPNN) and Particle Swarm Optimization (PSO) algorithms is also proposed: PSO is used to optimize the weights and thresholds of the BPNN to overcome its inherent shortcomings. The experimental results show that the proposed attribute selection method is an effective preprocessing technique.


Copyright©北京勤云科技发展有限公司  京ICP备09084417号