Similar Documents (20 results)
1.
The discretization of values plays a critical role in data mining and knowledge discovery. Representing information through intervals is more concise and, at certain levels of knowledge, easier to understand than representing it through continuous values. In this paper, we propose a method for discretizing continuous attributes by means of fuzzy sets, which constitute a fuzzy partition of the domains of these attributes. The method carries out a fuzzy discretization of continuous attributes in two stages: a fuzzy decision tree is used in the first stage to propose an initial set of crisp intervals, and a genetic algorithm is used in the second stage to define the membership functions and the cardinality of the partitions. After defining the fuzzy partitions, we evaluate them and compare them with previously published ones.
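As a concrete illustration of where the second stage starts, the crisp cut points proposed by the decision tree can be turned into an initial triangular (Ruspini-style) fuzzy partition, which the genetic algorithm would then tune. The sketch below is illustrative only; the function names and the triangular shape are assumptions, not the paper's exact construction:

```python
def tri(l, c, r):
    """Triangular membership function peaking at c; degenerate at the domain ends."""
    def mu(x):
        if x == c:
            return 1.0
        if x < c:
            return (x - l) / (c - l) if l < x < c else 0.0
        return (r - x) / (r - c) if c < x < r else 0.0
    return mu

def triangular_partition(cuts, lo, hi):
    """One membership function per fuzzy set, peaking at lo, each cut, and hi.
    Adjacent triangles overlap so memberships sum to 1 on [lo, hi]."""
    centres = [lo] + sorted(cuts) + [hi]
    mfs = []
    for i, c in enumerate(centres):
        l = centres[i - 1] if i > 0 else c
        r = centres[i + 1] if i < len(centres) - 1 else c
        mfs.append(tri(l, c, r))
    return mfs
```

For example, `triangular_partition([4.0], 0.0, 10.0)` yields three fuzzy sets over [0, 10] whose membership degrees sum to 1 at every point, a common starting point before membership functions are optimized.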

2.
A Study of Two Rough Set-Based Discretization Algorithms
With the rapid development of knowledge discovery and data mining, many methods have emerged, most of which depend on discrete data. However, most real-world data contain attributes with continuous values. For data mining techniques to be applied to such data, discretization is required. This paper investigates discretization methods based on rough sets. Experiments comparing a local and a global discretization algorithm show that both algorithms are sensitive to the data set.

3.

The effective extraction of continuous features from ocean optical remote sensing imagery is key to the automatic detection and identification of marine vessel targets. Since many existing data mining algorithms can only handle discrete attributes, continuous features must be transformed into discrete ones to suit these algorithms. However, most current discretization methods do not consider the mutual exclusion within the attribute set when selecting breakpoints, and cannot guarantee that the indiscernibility relation of the information system is preserved; they are therefore ill-suited to ocean optical remote sensing data with multiple features. To address this problem, a multivariate optical remote sensing image feature discretization method for marine vessel target recognition is presented in this paper. First, an information-equivalent model of the remote sensing image is established based on information entropy and rough set theory. Second, the extent to which the indiscernibility relation in the model changes before and after discretization is evaluated. Third, each band is scanned repeatedly until a termination condition is satisfied, generating the optimal number of intervals. Finally, we carry out a simulation analysis of high-resolution remote sensing image data collected near the coast of the South China Sea, and compare the proposed method with current mainstream discretization algorithms. Experiments validate that the proposed method has better overall performance in terms of interval number, data consistency, running time, prediction accuracy and recognition rate.

4.
解亚萍 《计算机应用》2011,31(5):1409-1412
Many data mining methods can only handle attributes with discrete values, so continuous attributes must be discretized. This paper proposes a data discretization method based on the statistical correlation coefficient, which, grounded in statistical correlation theory, effectively captures the interdependence between classes and attributes to select optimal cut points. In addition, the variable precision rough set (VPRS) model is incorporated into the discretization to effectively control the loss of information in the data. The proposed method was applied to breast cancer diagnosis data and data from other domains; experimental results show that it significantly improves the classification accuracy of the See5 decision tree.

5.
《Knowledge》2007,20(4):419-425
Many classification algorithms require that training examples contain only discrete values. To use these algorithms when some attributes have continuous numeric values, the numeric attributes must be converted into discrete ones. This paper describes a new way of discretizing numeric values using information theory. Our method is context-sensitive in the sense that it takes into account the value of the target attribute. The amount of information each interval gives about the target attribute is measured using Hellinger divergence, and the interval boundaries are chosen so that each interval contains as nearly equal an amount of information as possible. To compare our discretization method with current methods, several popular classification data sets were selected for discretization. We use the naive Bayesian classifier and C4.5 to compare the accuracy of our discretization method with that of other methods.
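The core quantity in such a method is the divergence between each interval's class distribution and the overall class distribution. A minimal sketch is shown below; it uses one common form of the Hellinger distance and hard-coded cut points, and all names are illustrative rather than the paper's API:

```python
import math
from collections import Counter

def hellinger(p, q):
    # One common form of the Hellinger distance between two discrete
    # distributions given as {label: probability} dicts.
    return math.sqrt(0.5 * sum(
        (math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2
        for k in set(p) | set(q)))

def dist(labels):
    n = len(labels)
    return {k: v / n for k, v in Counter(labels).items()}

def interval_information(values, labels, cuts):
    """Information each interval gives about the class, measured as the
    divergence of its class distribution from the overall (prior) one."""
    prior = dist(labels)
    bounds = [-float("inf")] + sorted(cuts) + [float("inf")]
    infos = []
    for lo, hi in zip(bounds, bounds[1:]):
        sub = [l for v, l in zip(values, labels) if lo <= v < hi]
        infos.append(hellinger(dist(sub), prior) if sub else 0.0)
    return infos
```

An equal-information boundary search would then move the cut points until these per-interval values are as close to equal as possible.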

6.
Discretization techniques have played an important role in machine learning and data mining, as most methods in these areas require that the training data set contain only discrete attributes. Data discretization unification (DDU), one of the state-of-the-art discretization techniques, trades off classification error against the number of discretized intervals and unifies existing discretization criteria. However, it suffers from two deficiencies. First, DDU is inefficient: it searches over a large number of parameters to find good results, yet still does not guarantee an optimal solution. Second, DDU does not take into account the number of inconsistent records produced by discretization, which leads to unnecessary information loss. To overcome these deficiencies, this paper presents a universal discretization technique, UniDis. We first develop a non-parametric normalized discretization criterion that avoids the effect of the relatively large difference between classification error and the number of discretized intervals on discretization results. In addition, we define a new entropy-based measure of inconsistency for multi-dimensional variables to effectively control information loss while producing a concise summarization of continuous variables. Finally, we propose a heuristic algorithm that guarantees better discretization based on the non-parametric normalized criterion and the entropy-based inconsistency measure. Beyond the theoretical analysis, experimental results with the J4.8 decision tree and the naive Bayes classifier demonstrate that our approach is statistically comparable to DDU under a popular statistical test, and that it yields a discretization scheme which significantly improves classification accuracy over the other previously known discretization methods.

7.
In data mining, many datasets are described by both discrete and numeric attributes. Most Ant Colony Optimization based classifiers can only deal with discrete attributes and need a pre-processing discretization step for numeric attributes. We propose an adaptation of AntMiner+ for rule mining that intrinsically handles numeric attributes. We describe the new approach and compare it to existing algorithms. The proposed method achieves results comparable to existing methods on UCI datasets, but has advantages on datasets with strong interactions between numeric attributes. We analyse the effect of the parameters on classification accuracy and propose sensible defaults. Finally, we describe an application of the new method to a real-world medical domain, where it achieves results comparable to the existing method.

8.
《Computers & Fluids》1999,28(4-5):573-602
A new method for the acceleration of linear and nonlinear time-dependent calculations is presented. It is based on the large discretization step (LDS) approximation, defined in this work, which employs an extended system of low-accuracy schemes to approximate a high-accuracy discrete approximation to a time-dependent differential operator. These approximations are efficiently implemented in the LDS methods for linear and nonlinear hyperbolic equations presented here. In these algorithms the high- and low-accuracy schemes are interpreted as the same discretization of a time-dependent operator on fine and coarse grids, respectively. A system of correction terms and corresponding equations is derived and solved on the coarse grid to yield fine-grid accuracy; these terms are initialized by visiting the fine grid once every many coarse-grid time steps. The resulting methods are very general, simple to implement, and may be used to accelerate many existing time-marching schemes. The efficiency of an LDS algorithm is defined as the cost of computing the fine-grid solution relative to the cost of obtaining the same accuracy with the LDS method. The LDS method's typical efficiency is 16 for two-dimensional problems and 28 for three-dimensional problems, for both linear and nonlinear equations. For a particularly good discretization of a linear equation, efficiencies of 25 in two dimensions and 66 in three dimensions were obtained.

9.
IDD: A Supervised Interval Distance-Based Method for Discretization
This article introduces a new method for supervised discretization based on interval distances, using a novel concept of neighbourhood in the target's space. The proposed method takes into consideration the order of the class attribute, when it exists, so that it can be used with ordinal discrete classes as well as continuous classes in the case of regression problems. The method has proved to be very efficient in terms of accuracy and faster than the supervised discretization methods most commonly used in the literature. It is illustrated through several examples, and a comparison with other standard discretization methods is performed on three public data sets using two different learning tasks: a decision tree algorithm and SVM for regression.

10.
11.
Relief is a measure of attribute quality which is often used for feature subset selection. Its use in the induction of classification trees and rules, in discretization, and in other methods has, however, been hindered by its inability to suggest subsets of values of discrete attributes and thresholds for splitting continuous attributes into intervals. We present efficient algorithms for both tasks.
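For context, the basic Relief estimator (Kira & Rendell) that this work extends weights each attribute by how well it separates an instance from its nearest different-class neighbour (miss) relative to its nearest same-class neighbour (hit). A minimal sketch for continuous attributes, with illustrative names:

```python
import random

def relief(X, y, n_iter=100, seed=0):
    """Basic Relief: for sampled instances, reward attributes that differ
    on the nearest miss and penalize those that differ on the nearest hit.
    diff() is the range-normalized attribute difference."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    lo = [min(row[a] for row in X) for a in range(d)]
    hi = [max(row[a] for row in X) for a in range(d)]
    span = [(h - l) or 1.0 for l, h in zip(lo, hi)]
    diff = lambda a, u, v: abs(u[a] - v[a]) / span[a]
    dist = lambda u, v: sum(diff(a, u, v) for a in range(d))
    w = [0.0] * d
    for _ in range(n_iter):
        i = rng.randrange(n)
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        h = min(hits, key=lambda j: dist(X[i], X[j]))
        m = min(misses, key=lambda j: dist(X[i], X[j]))
        for a in range(d):
            w[a] += (diff(a, X[i], X[m]) - diff(a, X[i], X[h])) / n_iter
    return w
```

The limitation the abstract refers to is visible here: the weights rank whole attributes, but say nothing about which value subsets or split thresholds to use.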

12.
A discretization algorithm based on Class-Attribute Contingency Coefficient
Discretization algorithms have played an important role in data mining and knowledge discovery. They not only produce a concise summarization of continuous attributes that helps experts understand the data more easily, but also make learning more accurate and faster. In this paper, we propose a static, global, incremental, supervised, top-down discretization algorithm based on the Class-Attribute Contingency Coefficient. An empirical evaluation of seven discretization algorithms on 13 real and four artificial datasets shows that the proposed algorithm generates a better discretization scheme that improves classification accuracy. In terms of discretization time, the number of generated rules, and the training time of C5.0, our approach also achieves promising results.

13.
MODL: A Bayes optimal discretization method for continuous attributes
While real data often come in mixed format, discrete and continuous, many supervised induction algorithms require discrete data. Efficient discretization of continuous attributes is an important problem that affects the speed, accuracy and understandability of induction models. In this paper, we propose a new discretization method, MODL¹, founded on a Bayesian approach. We introduce a space of discretization models and a prior distribution defined on this model space. This yields a Bayes-optimal evaluation criterion for discretizations. We then propose a new super-linear optimization algorithm that manages to find near-optimal discretizations. Extensive comparative experiments on both real and synthetic data demonstrate the high inductive performance of the new discretization method. (Editor: Tom Fawcett. ¹French patent No. 04 00179.)
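The Bayes-optimal criterion reduces to a cost (negative log posterior) that penalizes the number of intervals, the per-interval class distributions, and the likelihood of the labels. The sketch below follows the commonly cited statement of Boullé's criterion; the exact formula should be treated as an assumption, not a verbatim copy:

```python
import math
from math import comb, lgamma

def log_fact(k):
    return lgamma(k + 1)   # log(k!)

def modl_cost(intervals, J):
    """MODL-style evaluation (lower is better) for a discretization given
    as a list of per-interval class-count lists; J = number of classes.
    cost = log n + log C(n+I-1, I-1)
         + sum_i [ log C(n_i+J-1, J-1) + log(n_i! / prod_j n_ij!) ]"""
    I = len(intervals)
    n = sum(map(sum, intervals))
    cost = math.log(n) + math.log(comb(n + I - 1, I - 1))
    for counts in intervals:
        ni = sum(counts)
        cost += math.log(comb(ni + J - 1, J - 1))                # class-distribution prior
        cost += log_fact(ni) - sum(log_fact(c) for c in counts)  # label likelihood
    return cost
```

On clearly separable data the cost of a two-interval pure split is lower than that of a single mixed interval, which is exactly what makes the criterion usable as an optimization target.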

14.
The problem of the discretization of continuous linear systems is considered. (A particular case is the problem of digital filter design for a given analog prototype.) Common and distinctive features of the discrete systems produced by a number of different discretization methods for one and the same continuous system are analysed. The sampling period is assumed to be sufficiently small.
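Two of the standard discretization methods such a comparison would cover are zero-order hold (exact under piecewise-constant input) and forward Euler. A minimal sketch for the scalar system x' = a·x + b·u; for small sampling periods T the two agree to first order, which is the regime the abstract assumes:

```python
import math

def zoh_discretize(a, b, T):
    """Zero-order-hold discretization of x' = a*x + b*u:
    x[k+1] = ad*x[k] + bd*u[k], exact for piecewise-constant u."""
    ad = math.exp(a * T)
    bd = (ad - 1.0) / a * b if a != 0 else b * T
    return ad, bd

def euler_discretize(a, b, T):
    # Forward-Euler approximation; agrees with ZOH to O(T^2).
    return 1.0 + a * T, b * T
```

For matrix-valued systems the same idea applies with the matrix exponential in place of `math.exp`; Tustin (bilinear) discretization is another common choice, used notably in digital filter design.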

15.
A discretization algorithm based on a heterogeneity criterion
Discretization, as a preprocessing step for data mining, is the process of converting the continuous attributes of a data set into discrete ones so that they can be treated as nominal features by machine learning algorithms. The various discretization methods that use entropy-based criteria form a large class of algorithms. However, as a measure of class homogeneity, entropy cannot always accurately reflect the degree of class homogeneity of an interval. In this paper, we therefore propose a new measure of the class heterogeneity of intervals based on class probability itself. Building on this definition of heterogeneity, we present a new criterion for evaluating a discretization scheme and analyze its properties theoretically. We also propose a heuristic method to find an approximately optimal discretization scheme. Finally, our method is compared, in terms of predictive error rate and tree size, with Ent-MDLC, a representative entropy-based discretization method well known for its good performance. Our method produces better results than Ent-MDLC, although the improvement is not significant; it can be a good alternative to entropy-based discretization methods.

16.
徐盈盈  钟才明 《计算机应用》2014,34(8):2184-2187
Some algorithms in pattern recognition and machine learning can only handle discrete attribute values, yet much real-world data has continuous attribute values. This paper proposes an unsupervised method for data discretization. First, the K-means method partitions the data set to obtain class information; then a supervised discretization method is applied to the partitioned data. This process is repeated to obtain multiple discretization results, which are then combined in an ensemble. Finally, the minimal sub-intervals obtained from the ensemble are merged, where the dimension and the adjacent intervals to merge first are chosen according to the neighbour relations among the data; the number of sub-intervals is determined automatically from these neighbour relations so as to preserve the intrinsic structure as far as possible. The discretized data are then fed to clustering algorithms, such as spectral clustering, and the clustering results are evaluated. Experimental results show that the clustering accuracy of the proposed algorithm is on average about 33% higher than that of four other methods, demonstrating its feasibility and effectiveness. The discretized data obtained by this algorithm can be used in data mining algorithms such as the ID3 decision tree algorithm.

17.
CAIM discretization algorithm
The task of extracting knowledge from databases is quite often performed by machine learning algorithms. The majority of these algorithms can be applied only to data described by discrete numerical or nominal attributes (features). For continuous attributes, a discretization algorithm is needed to transform them into discrete ones. We describe such an algorithm, called CAIM (class-attribute interdependence maximization), which is designed to work with supervised data. The goal of the CAIM algorithm is to maximize the class-attribute interdependence and to generate a (possibly) minimal number of discrete intervals. The algorithm does not require the user to predefine the number of intervals, as some other discretization algorithms do. Tests using CAIM and six other state-of-the-art discretization algorithms show that discrete attributes generated by CAIM almost always have the lowest number of intervals and the highest class-attribute interdependency. Two machine learning algorithms, the CLIP4 rule algorithm and a decision tree algorithm, were used to generate classification rules from data discretized by CAIM. For both algorithms, the accuracy of the generated rules is higher and the number of rules is lower for data discretized with CAIM than for data discretized using the six other discretization algorithms.
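The CAIM criterion itself is compact: for a quanta matrix (class counts per interval), each interval contributes the square of its dominant class count divided by its total, averaged over the number of intervals. A minimal sketch of the scoring function (names are illustrative):

```python
def caim_score(quanta):
    """CAIM for a quanta matrix (rows = classes, cols = intervals):
    caim = (1/n) * sum over intervals r of max_r^2 / M_+r,
    where max_r is the largest class count in interval r and
    M_+r is the total count in interval r."""
    n = len(quanta[0])
    score = 0.0
    for r in range(n):
        col = [row[r] for row in quanta]
        total = sum(col)
        if total:
            score += max(col) ** 2 / total
    return score / n
```

CAIM greedily adds the boundary that maximizes this value and, because the score is divided by the number of intervals, it naturally favours few intervals, matching the behaviour reported in the abstract.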

18.
We investigate the discretization of continuous variables for classification problems in a high-dimensional framework. As the goal of classification is to correctly predict the class membership of an observation, we suggest a discretization method that optimizes the discretization procedure using the misclassification probability as the measure of classification accuracy. Our method is compared to several other discretization methods as well as to results for continuous data. To compare performance we consider three supervised classification methods, and to capture the effect of high dimensionality we vary the number of feature variables for a fixed number of observations. Since discretization is a data transformation procedure, we also investigate how it affects the dependence structure. Our method performs well, and lower misclassification can be obtained in a high-dimensional framework for both simulated and real data if the continuous feature variables are first discretized. The dependence structure is well maintained by some discretization methods. © 2012 Wiley Periodicals, Inc.

19.
A very simple model of train stopping is used as a vehicle for investigating how the development of a control system, initially designed in the continuous domain and subsequently discretized, can be captured within a formal development process compatible with standard model-based refinement methodologies. Starting with a formalized requirements analysis using KAOS, an abstract model of the continuous system is created in the ASM formalism. This requires extensions of the KAOS and ASM formalisms capable of dealing with quantities that evolve continuously over real time, which are developed here. After considering how the continuous system, described as a continuous control system in the state-space framework, can be discretized, a discrete control system is created in the state-space framework and re-expressed in the ASM formalism. The rigorous results on the relationship between continuous and discrete control system models that are needed to establish provable properties of the discretization then become the ingredients of a retrenchment between the continuous and discrete ASM models, and are thus fully integrated into the formal development. The discrete ASM model can then be further refined towards implementation.

20.
Feature selection plays a crucial role in machine learning and data mining. Relief, an efficient filter-style feature selection algorithm, can handle many types of data and is fairly tolerant of noise, and is therefore widely used. However, the classical Relief algorithm evaluates discrete features rather simplistically and, in practice, does not fully exploit the latent relations between features and class labels, leaving considerable room for improvement. To address the classical Relief algorithm's…

