Found 20 similar documents; search time: 31 ms.
1.
Yasuhiko Morimoto, Takeshi Fukuda, Shinichi Morishita, Takeshi Tokuyama 《Constraints》1997,2(3-4):401-427
We propose an extension of an entropy-based heuristic for constructing a decision tree from a large database with many numeric attributes. When it comes to handling numeric attributes, conventional methods are inefficient if any numeric attributes are strongly correlated. Our approach offers one solution to this problem. For each pair of numeric attributes with strong correlation, we compute a two-dimensional association rule with respect to these attributes and the objective attribute of the decision tree. In particular, we consider a family ℛ of grid regions in the plane associated with the pair of attributes. For R ∈ ℛ, the data can be split into two classes: data inside R and data outside R. We compute the region R_opt ∈ ℛ that minimizes the entropy of the splitting, and add the splitting associated with R_opt (for each pair of strongly correlated attributes) to the set of candidate tests in an entropy-based heuristic. We give efficient algorithms for the cases in which ℛ is (1) x-monotone connected regions, (2) based-monotone regions, (3) rectangles, and (4) rectilinear convex regions. The algorithm has been implemented as a subsystem of SONAR (System for Optimized Numeric Association Rules) developed by the authors. We have confirmed that we can compute the optimal region efficiently, and diverse experiments show that our approach can create compact trees whose accuracy is comparable with or better than that of conventional trees. More importantly, we can grasp non-linear correlations among numeric attributes that could not be found without our region splitting.
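The idea of an entropy-minimizing region split can be illustrated with a brute-force sketch: enumerate axis-aligned rectangles with corners on data coordinates and pick the one minimizing the weighted entropy of the inside/outside split. This is a toy stand-in for the paper's efficient algorithms — only the rectangle family is handled, and all names (`best_rectangle`, `split_entropy`) are ours:

```python
import math
from itertools import combinations

def entropy(labels):
    """Shannon entropy of a label multiset."""
    n = len(labels)
    if n == 0:
        return 0.0
    ent = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        ent -= p * math.log2(p)
    return ent

def split_entropy(points, rect):
    """Weighted entropy after splitting points into inside/outside a rectangle.
    rect = (x1, x2, y1, y2); points = [(x, y, label), ...]."""
    x1, x2, y1, y2 = rect
    inside = [l for x, y, l in points if x1 <= x <= x2 and y1 <= y <= y2]
    outside = [l for x, y, l in points if not (x1 <= x <= x2 and y1 <= y <= y2)]
    n = len(points)
    return (len(inside) / n) * entropy(inside) + (len(outside) / n) * entropy(outside)

def best_rectangle(points):
    """Exhaustive search over rectangles with corners on data coordinates
    (illustrative only; the paper covers richer region families efficiently)."""
    xs = sorted({p[0] for p in points})
    ys = sorted({p[1] for p in points})
    best, best_rect = float("inf"), None
    for x1, x2 in combinations(xs, 2):
        for y1, y2 in combinations(ys, 2):
            e = split_entropy(points, (x1, x2, y1, y2))
            if e < best:
                best, best_rect = e, (x1, x2, y1, y2)
    return best_rect, best
```

On linearly inseparable but rectangle-separable data, the optimal split reaches zero entropy, which is the kind of non-linear structure the abstract says axis-parallel tests alone would miss.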
2.
Chung-Hsing Yeh 《International Transactions in Operational Research》2002,9(2):169-181
Different multi-attribute decision-making (MADM) methods often produce different outcomes for selecting or ranking a set of decision alternatives involving multiple attributes. This paper presents a new approach to the selection of compensatory MADM methods for a specific cardinal ranking problem via sensitivity analysis of attribute weights. In line with the context-dependent concept of informational importance, the approach examines the consistency degree between the relative degree of sensitivity of individual attributes using an MADM method and the relative degree of influence of the corresponding attributes indicated by Shannon's entropy concept. The approach favors the method that has the highest consistency degree as it best reflects the decision information embedded in the problem data set. An empirical study of a scholarship student selection problem is used to illustrate how the approach can validate the ranking outcome produced by different MADM methods. The empirical study shows that different problem data sets may result in a different method being selected. This approach is particularly applicable to large-scale cardinal ranking problems where the ranking outcome of different methods differs significantly.
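The Shannon-entropy notion of attribute influence can be sketched with the standard entropy weight method: attributes whose values vary little across alternatives carry little decision information and get low weight. This is a common textbook formulation, not necessarily the paper's exact computation:

```python
import math

def entropy_weights(matrix):
    """Objective attribute weights via the entropy weight method.
    matrix[i][j] = performance of alternative i on attribute j (positive values)."""
    m, n = len(matrix), len(matrix[0])
    k = 1.0 / math.log(m)  # normalizes entropy into [0, 1]
    diversification = []
    for j in range(n):
        col = [row[j] for row in matrix]
        total = sum(col)
        p = [v / total for v in col]
        e = -k * sum(pi * math.log(pi) for pi in p if pi > 0)
        diversification.append(1.0 - e)  # low entropy => high influence
    s = sum(diversification)
    return [d / s for d in diversification]
```

An attribute that is identical for every alternative gets weight 0, since it cannot discriminate between them.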
3.
Mining optimized gain rules for numeric attributes (total citations: 7; self-citations: 0; citations by others: 7)
Brin S., Rastogi R., Kyuseok Shim 《Knowledge and Data Engineering, IEEE Transactions on》2003,15(2):324-338
Association rules are useful for determining correlations between attributes of a relation and have applications in the marketing, financial, and retail sectors. Furthermore, optimized association rules are an effective way to focus on the most interesting characteristics involving certain attributes. Optimized association rules are permitted to contain uninstantiated attributes, and the problem is to determine instantiations such that either the support, confidence, or gain of the rule is maximized. In this paper, we generalize the optimized gain association rule problem by permitting rules to contain disjunctions over uninstantiated numeric attributes. Our generalized association rules enable us to extract more useful information about seasonal and local patterns involving the uninstantiated attribute. For rules containing a single numeric attribute, we present an algorithm with linear complexity for computing optimized gain rules. Furthermore, we propose a bucketing technique that can result in a significant reduction in input size by coalescing contiguous values without sacrificing optimality. We also present an approximation algorithm based on dynamic programming for two numeric attributes. Using recent results on binary space partitioning trees, we show that the approximations are within a constant factor of the optimal gain rules. Our experimental results with synthetic data sets for a single numeric attribute demonstrate that our algorithm scales up linearly with the attribute's domain size as well as the number of disjunctions. In addition, we show that applying our optimized rule framework to a real-life population survey data set enables us to discover interesting underlying correlations among the attributes.
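The linear-time case can be understood through a simplification: once the numeric domain is bucketed, the gain of an interval rule with minimum-confidence threshold θ is a sum of per-bucket terms (positives minus θ times support), so the optimal single interval is a maximum-sum run of buckets, findable by Kadane's algorithm. A minimal sketch under that framing (the bucket layout and names are our assumptions):

```python
def optimized_gain_interval(buckets, theta):
    """Find the contiguous bucket range maximizing gain.
    Each bucket is (support, positives): rows falling in the bucket, and
    those among them that also satisfy the rule's consequent.
    Gain of a range = sum(pos - theta * sup), so the problem reduces to
    a maximum-sum subarray (Kadane's algorithm)."""
    values = [pos - theta * sup for sup, pos in buckets]
    best, best_range = float("-inf"), (0, 0)
    cur, start = 0.0, 0
    for i, v in enumerate(values):
        if cur <= 0:          # restart the run when the prefix only hurts
            cur, start = v, i
        else:
            cur += v
        if cur > best:
            best, best_range = cur, (start, i)
    return best_range, best
```

Rules with k disjuncts (the paper's generalization) would instead pick k non-overlapping runs, which the paper handles with dynamic programming.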
4.
Data dimensionality reduction with application to simplifying RBF network structure and improving classification performance (total citations: 7; self-citations: 0; citations by others: 7)
Xiuju Fu, Lipo Wang 《IEEE transactions on systems, man, and cybernetics. Part B, Cybernetics》2003,33(3):399-409
For high dimensional data, if no preprocessing is carried out before inputting patterns to classifiers, the computation required may be too heavy. For example, the number of hidden units of a radial basis function (RBF) neural network can be too large. This is not suitable for some practical applications due to speed and memory constraints. In many cases, some attributes are not relevant to concepts in the data at all. In this paper, we propose a novel separability-correlation measure (SCM) to rank the importance of attributes. According to the attribute ranking results, different attribute subsets are used as inputs to a classifier, such as an RBF neural network. Those attributes that increase the validation error are deemed irrelevant and are deleted. The complexity of the classifier can thus be reduced and its classification performance improved. Computer simulations show that our method for attribute importance ranking leads to smaller attribute subsets with higher accuracies compared with the existing SUD and Relief-F methods. We also propose a modified method for efficient construction of an RBF classifier. In this method we allow for large overlaps between clusters corresponding to the same class label. Our approach significantly reduces the structural complexity of the RBF network and improves the classification performance.
5.
Jing-Wei Liu, Ching-Hsue Cheng, Yao-Hsien Chen, Tai-Liang Chen 《Expert systems with applications》2010,37(1):610-617
Business operation performance is related to corporate profitability and directly affects investment choices in the stock market. This paper proposes a hybrid method that combines the ordered weighted averaging (OWA) operator and rough set theory, after an attribute selection procedure, to deal with multi-attribute forecasting problems concerning the revenue growth rate of the electronics industry. In the attribute selection step, the four most important of 12 attributes collected from the related literature are determined via five attribute selection methods and used as the input to the subsequent procedure. The OWA operator can adjust the weight of an attribute based on the decision-maker's situation and aggregate the different attribute values of each instance into a single aggregated value; the aggregated values are then used to generate classification rules by rough sets for forecasting operation performance. To verify the proposed method, this research collects the financial data of 629 electronic firms publicly listed on the TSE (Taiwan Stock Exchange) and the OTC (Over-the-Counter) market in 2004 and 2005 to forecast the revenue growth rate. The results show that the proposed method outperforms the other listed methods.
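The OWA operator at the core of the method applies its weights to values by rank (largest first) rather than by attribute position, which is how it encodes the decision-maker's attitude. A minimal sketch:

```python
def owa(values, weights):
    """Ordered weighted averaging: weights apply to the values sorted in
    descending order, not to fixed attribute positions.
    weights must be nonnegative and sum to 1."""
    ordered = sorted(values, reverse=True)
    return sum(w * v for w, v in zip(weights, ordered))
```

With weights (1, 0, ..., 0) OWA returns the maximum (fully optimistic); with (0, ..., 0, 1) the minimum (fully pessimistic); uniform weights give the plain mean.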
6.
Membership evaluation and feature selection for fuzzy support vector machine based on fuzzy rough sets (total citations: 1; self-citations: 1; citations by others: 0)
Qiang He, Congxin Wu 《Soft Computing - A Fusion of Foundations, Methodologies and Applications》2011,15(6):1105-1114
A fuzzy support vector machine (FSVM) is an improvement on SVMs for dealing with data sets containing outliers. In an FSVM, a key step is to compute the membership of every training sample. Existing approaches to computing the membership of a sample are motivated by the existence of outliers in data sets and do not take account of the inconsistency between conditional attributes and decision classes. However, this kind of inconsistency can affect the membership of every sample and has been considered in fuzzy rough set theory. In this paper, we develop a new method to compute membership for FSVMs by using a Gaussian kernel-based fuzzy rough set. Furthermore, we employ a technique of attribute reduction using Gaussian kernel-based fuzzy rough sets to perform feature selection for FSVMs. Based on these results, we combine the FSVM and fuzzy rough set methods. The experimental results show that the proposed approaches are feasible and effective.
7.
This paper deals with the problem of supervised wrapper-based feature subset selection in datasets with a very large number of attributes. Recently the literature has contained numerous references to the use of hybrid selection algorithms: based on a filter ranking, they perform an incremental wrapper selection over that ranking. Although these methods work well, they still have problems: (1) depending on the complexity of the wrapper search method, the number of wrapper evaluations can still be too large; and (2) they rely on a univariate ranking that does not take into account interaction between the variables already included in the selected subset and the remaining ones. Here we propose a new approach whose main goal is to drastically reduce the number of wrapper evaluations while maintaining good performance (e.g. accuracy and size of the obtained subset). To do this we propose an algorithm that iteratively alternates between filter ranking construction and wrapper feature subset selection (FSS). Thus, the FSS only uses the first block of ranked attributes, and the ranking method uses the currently selected subset to build a new ranking in which this knowledge is taken into account. The algorithm terminates when no new attribute is selected in the last call to the FSS algorithm. The main advantage of this approach is that only a few blocks of variables are analyzed, so the number of wrapper evaluations decreases drastically. The proposed method is tested over eleven high-dimensional datasets (2,400-46,000 variables) using different classifiers. The results show an impressive reduction in the number of wrapper evaluations without degrading the quality of the obtained subset.
8.
Data engineering is generally considered to be a central issue in the development of data mining applications. The success of many learning schemes, in their attempts to construct models of data, hinges on the reliable identification of a small set of highly predictive attributes. The inclusion of irrelevant, redundant, and noisy attributes in the model building process can result in poor predictive performance and increased computation. Attribute selection generally involves a combination of search and attribute utility estimation plus evaluation with respect to specific learning schemes. This leads to a large number of possible permutations and has led to a situation where very few benchmark studies have been conducted. This paper presents a benchmark comparison of several attribute selection methods for supervised classification. All the methods produce an attribute ranking, a useful device for isolating the individual merit of an attribute. Attribute selection is achieved by cross-validating the attribute rankings with respect to a classification learner to find the best attributes. Results are reported for a selection of standard data sets and two diverse learning schemes: C4.5 and naive Bayes.
9.
Elaine P. M. de Sousa, Caetano Traina Jr., Agma J. M. Traina, Leejay Wu, Christos Faloutsos 《Data mining and knowledge discovery》2007,14(3):367-407
The problem of identifying meaningful patterns in a database lies at the very heart of data mining. A core objective of data mining processes is the recognition of inter-attribute correlations. Not only are correlations necessary for predictions and classifications – since rules would fail in the absence of patterns – but the identification of groups of mutually correlated attributes also expedites the selection of a representative subset of attributes, from which existing mappings allow the others to be derived. In this paper, we describe a scalable, effective algorithm to identify groups of correlated attributes. This algorithm can handle non-linear correlations between attributes, and is not restricted to a specific family of mapping functions, such as the set of polynomials. We show the results of our evaluation of the algorithm applied to synthetic and real-world datasets, and demonstrate that it is able to spot the correlated attributes. Moreover, the execution time of the proposed technique is linear in the number of elements and of correlations in the dataset.
10.
Social choice deals with aggregating the preferences of a number of voters into a collective preference. We use this idea for software project effort estimation, substituting project attributes for the voters. Therefore, instead of supplying numeric values for various project attributes that are then used in regression or similar methods, a new project only needs to be placed into one ranking per attribute, necessitating only ordinal values. Using the resulting aggregate ranking, the new project is placed among other projects whose actual expended effort can be used to derive an estimate. In this paper we present this method and extensions using weightings derived from genetic algorithms. We detail a validation based on several well-known data sets and show that estimation accuracy similar to that of classic methods can be achieved with considerably lower demands on input data.
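The aggregation step can be illustrated with a Borda count, a classic social-choice rule in which each position in a ranking contributes a score; the collective ranking orders projects by total score. This is only a sketch of the idea — the paper's own aggregation (and its GA-derived weightings) may differ:

```python
def borda_aggregate(rankings):
    """Aggregate several per-attribute rankings into one collective ranking
    by Borda count: in a ranking of n projects, position i (0-based, best
    first) scores n - 1 - i points.
    rankings: list of lists of project ids, best first."""
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for pos, proj in enumerate(ranking):
            scores[proj] = scores.get(proj, 0) + (n - 1 - pos)
    return sorted(scores, key=lambda p: -scores[p])  # highest total first
```

A new project placed into each per-attribute ranking then lands at some position in the aggregate ranking, between projects with known effort.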
11.
An RBF neural network attribute selection method based on separability criterion ranking (total citations: 2; self-citations: 0; citations by others: 2)
We propose a neural network attribute selection method based on ranking the importance of data attributes. The method needs to train on only a subset of the attributes to perform dimensionality reduction, overcoming the drawback of existing neural network dimensionality reduction methods that must train on all attributes, and thus greatly improves the efficiency of attribute selection. The method first ranks the attributes by importance using a simple separability criterion proposed in this paper, and then performs attribute selection with an RBF neural network in order of importance. Simulation examples show that the method works well.
12.
Attribute reduction is one of the important research topics in rough set theory, but it is an NP-hard problem that must be tackled with heuristic knowledge. This paper proposes a method that uses the discernibility matrix to compute the positive region of different condition-attribute combinations relative to the decision attribute, and gives a new method for finding the core attributes. On this basis, a new attribute reduction algorithm based on the discernibility matrix is proposed; it quickly finds a minimal set of attributes, is simple to implement, and performs attribute reduction and rule extraction simultaneously. Finally, an example demonstrates its correctness.
13.
Real-world data is noisy and can often suffer from corruptions or incomplete values that may impact the models created from the data. To build accurate predictive models, data acquisition is usually adopted to prepare the data and complete missing values. However, due to the significant cost of doing so and the inherent correlations in the data set, acquiring correct information for all instances is prohibitive and unnecessary. An interesting and important problem that arises here is to select which instances to complete so that the model built from the processed data receives the "maximum" performance improvement. This problem is complicated by the reality that the costs associated with the attributes differ, and fixing the missing values of some attributes is inherently more expensive than others. Therefore, the problem becomes: given a fixed budget, which instances should be selected for preparation so that the learner built from the processed data set maximizes its performance? In this paper, we propose a solution for this problem, and the essential idea is to combine attribute costs and the relevance of each attribute to the target concept, so that the data acquisition pays more attention to those attributes that are cheap in price but informative for classification. To this end, we first introduce a unique economical factor (EF) that seamlessly integrates the cost and the importance (in terms of classification) of each attribute. Then, we propose a cost-constrained data acquisition model, where active learning, missing value prediction, and impact-sensitive instance ranking are combined for effective data acquisition. Experimental results and comparative studies on real-world data sets demonstrate the effectiveness of our method.
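The economical factor's intent — favor attributes that are cheap but informative — can be sketched as a simple importance-to-cost ratio. The paper's actual EF definition is not given in the abstract and is surely more elaborate; the ratio and all names here are our assumptions:

```python
def rank_by_economical_factor(attributes):
    """Rank attributes for data acquisition by an illustrative economical
    factor EF = importance / cost, preferring cheap-but-informative
    attributes (a stand-in for the paper's EF).
    attributes: dict name -> (cost, importance)."""
    ef = {name: imp / cost for name, (cost, imp) in attributes.items()}
    return sorted(ef, key=lambda name: -ef[name])
```

Under a fixed budget, acquisition would then fill missing values greedily down this ranking until the budget is exhausted.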
14.
In this paper, we propose a new objective weighting method that employs intuitionistic fuzzy (IF) entropy measures to solve multiple-attribute decision-making problems in the context of intuitionistic fuzzy sets. Instead of traditional fuzzy entropy, which uses the probabilistic discrimination of attributes to obtain attribute weights, we utilize IF entropy to assess objective weights based on the credibility of the input data. We examine various measures of IF entropy with respect to hesitation degree, probability, non-probability, and geometry to calculate the attribute weights. A comparative analysis of different measures for generating attribute rankings is illustrated with computational experiments as well as analyses of Pearson correlations, Spearman rank correlations, contradiction rates, inversion rates, and consistency rates. The experimental results indicate that the attribute ranking outcomes depend not only on the type of IF entropy measure but also on the number of attributes and the number of alternatives.
15.
16.
Set-valued information systems (total citations: 2; self-citations: 0; citations by others: 2)
Set-valued information systems are generalized models of single-valued information systems. Incomplete information systems can be viewed as disjunctively interpreted set-valued information systems. Since some objects in set-valued information systems may have more than one value for an attribute, we define a tolerance relation and use the maximal tolerance classes to classify the universe of discourse. In order to derive optimal decision rules from set-valued decision information systems, we propose the concept of the relative reduct of maximal tolerance classes, and define a kind of discernibility function to compute the relative reduct by Boolean reasoning techniques. Finally, we define three kinds of relative reducts for set-valued information systems and use them to evaluate the significance of attributes.
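The tolerance relation behind this classification can be sketched directly: two objects are tolerant when their value sets overlap on every attribute. A minimal sketch with our own object representation (one Python set per attribute):

```python
def tolerant(x, y):
    """Two objects of a set-valued information system are tolerant iff
    their value sets intersect on every attribute.
    x, y: lists of sets, one set of possible values per attribute."""
    return all(vx & vy for vx, vy in zip(x, y))

def tolerance_class(universe, i):
    """Indices of all objects tolerant with object i (includes i itself,
    since the relation is reflexive)."""
    return [j for j in range(len(universe)) if tolerant(universe[i], universe[j])]
```

Unlike an equivalence relation, tolerance is not transitive, which is why the paper works with maximal tolerance classes rather than a partition.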
17.
The statistical basis of decision table analysis (total citations: 2; self-citations: 0; citations by others: 2)
A nonparametric statistical test method is presented for condition-attribute reduction of decision tables. First, the contingency table corresponding to the decision table is constructed and a significance test of the correlation between each condition attribute and the decision attribute is performed; at a given significance level, whether the attribute is redundant with respect to the decision is judged according to whether the correlation is significant, yielding an attribute reduction. Then, the Lambda coefficient is used to measure the correlation of the attributes significantly correlated with the decision attribute, indicating the proportional reduction in error achieved when predicting the decision attribute from the condition attributes. In addition, first-level rules of the decision table are obtained on the basis of the contingency table. Experiments on a medical-case decision table show that the method is simple and effective.
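The significance test described in this abstract rests on the Pearson chi-square statistic of the contingency table relating a condition attribute to the decision attribute; a statistic below the critical value at the chosen significance level marks the attribute as redundant. A minimal sketch of the statistic itself:

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table
    (rows: condition-attribute values, columns: decision classes).
    chi2 = sum over cells of (observed - expected)^2 / expected,
    with expected = row_total * col_total / grand_total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            chi2 += (observed - expected) ** 2 / expected
    return chi2
```

A perfectly independent table scores 0 (the attribute carries no decision information), while a perfectly predictive one scores the table's grand total.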
18.
19.
To address the problem that the principal component analysis-Bayesian discriminant analysis (PCA-BDA) method supports safety evaluation but cannot identify risk factors, the concept of attribute importance is introduced and an improved PCA-BDA algorithm is proposed and applied to oil-drilling safety evaluation. First, the original PCA-BDA method is used to evaluate the safety level of each record; then attribute importance is computed from the eigenvector matrix of the principal component analysis (PCA) step, the discriminant function matrix of the Bayesian discriminant analysis (BDA) step, and the weights of the safety levels; finally, attributes are adjusted with reference to their importance. In comparative experiments on safety-evaluation accuracy, the improved PCA-BDA method achieved 96.7% accuracy, clearly higher than the analytic hierarchy process (AHP) and fuzzy comprehensive evaluation (FCE). In simulation experiments on attribute adjustment, adjusting the three attributes with the highest importance improved the safety level of more than 70% of the drilling wells, whereas adjusting the three attributes with the lowest importance left the safety levels almost unchanged. The results show that the improved PCA-BDA method not only performs safety evaluation accurately but also identifies the key attributes, making oil-drilling safety management more targeted.
20.
High-dimensional data contain many redundant or irrelevant attributes, which make data mining and many pattern recognition tasks difficult. When performing data mining or pattern recognition on a high-dimensional space, it is necessary to reduce its dimension. In this paper, a new attribute importance measure and a selection method based on attribute ranking are proposed. In the proposed attribute selection method, input-output correlation (IOC) is applied to calculate each attribute's importance, and the attributes are then sorted in descending order. A hybrid of the Back Propagation Neural Network (BPNN) and Particle Swarm Optimization (PSO) algorithms is also proposed: PSO is used to optimize the weights and thresholds of the BPNN to overcome its inherent shortcomings. The experimental results show that the proposed attribute selection method is an effective preprocessing technique.