首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 0 毫秒
1.
Class-attribute interdependence maximization (CAIM) is one of the state-of-the-art algorithms for discretizing data for which classes are known. However, it may take a long time when run on high-dimensional large-scale data, with large number of attributes and/or instances. This paper presents a solution to this problem by introducing a graphic processing unit (GPU)-based implementation of the CAIM algorithm that significantly speeds up the discretization process on big complex data sets. The GPU-based implementation is scalable to multiple GPU devices and enables the use of concurrent kernels execution capabilities of modern GPUs. The CAIM GPU-based model is evaluated and compared with the original CAIM using single and multi-threaded parallel configurations on 40 data sets with different characteristics. The results show great speedup, up to 139 times faster using four GPUs, which makes discretization of big data efficient and manageable. For example, discretization time of one big data set is reduced from 2 h to \(<\) 2 min.  相似文献   

2.
3.
Nguyen S.H离散化算法中定义的初始断点集由于可能包含了部分对决策系统的分辨关系并无贡献的断点而影响到算法的效率.通过定义分界点来对该算法中定义的初始断点以对决策系统的分辨关系是否有贡献来进行区分,并仅取分界点集作为初始断点集,使得初始断点数目较大幅度地降低,提出了一种改进的启发式离散化算法.此算法较大程度地减小了算法空间复杂性和时间复杂性,对比实验结果表明了改进算法的正确性和有效性.  相似文献   

4.
In this paper we propose a new discretization technique generating enclosures of the solution of a weakly nonlinear 2-point boundary value problem. By means of available extimations lower and upper solutions are generated. The monotonicity of the method is guaranteed by principles of monotone iterations. The convergence properties of the proposed algorithm are analyzed.  相似文献   

5.
离散格的一种启发式搜索算法   总被引:1,自引:0,他引:1  
通过定义离散化方案之间的偏序关系以及交、并运算,将各种离散化方案组织成离散格。提出一种搜索离散格的启发式算法,实验表明该算法得到的一致决策表的断点数比已有解更少。  相似文献   

6.
A modified Chi2 algorithm for discretization   总被引:8,自引:0,他引:8  
Since the ChiMerge algorithm was first proposed by Kerber (1992), it has become a widely used and discussed discretization method. The Chi2 algorithm is a modification to the ChiMerge method. It automates the discretization process by introducing an inconsistency rate as the stopping criterion and it automatically selects the significance value. In addition, it adds a finer phase aimed at feature selection to broaden the applications of the ChiMerge algorithm. However, the Chi2 algorithm does not consider the inaccuracy inherent in ChiMerge's merging criterion. The user-defined inconsistency rate also brings about inaccuracy to the discretization process. These two drawbacks are first discussed in this paper and modifications to overcome them are then proposed. By comparison, results with the original Chi2 algorithm using C4.5, the modified Chi2 algorithm, performs better than the original Chi2 algorithm. It becomes a completely automatic discretization method  相似文献   

7.
A discretization algorithm based on a heterogeneity criterion   总被引:5,自引:0,他引:5  
Discretization, as a preprocessing step for data mining, is a process of converting the continuous attributes of a data set into discrete ones so that they can be treated as the nominal features by machine learning algorithms. Those various discretization methods, that use entropy-based criteria, form a large class of algorithm. However, as a measure of class homogeneity, entropy cannot always accurately reflect the degree of class homogeneity of an interval. Therefore, in this paper, we propose a new measure of class heterogeneity of intervals from the viewpoint of class probability itself. Based on the definition of heterogeneity, we present a new criterion to evaluate a discretization scheme and analyze its property theoretically. Also, a heuristic method is proposed to find the approximate optimal discretization scheme. Finally, our method is compared, in terms of predictive error rate and tree size, with Ent-MDLC, a representative entropy-based discretization method well-known for its good performance. Our method is shown to produce better results than those of Ent-MDLC, although the improvement is not significant. It can be a good alternative to entropy-based discretization methods.  相似文献   

8.
A discretization algorithm based on Class-Attribute Contingency Coefficient   总被引:1,自引:0,他引:1  
Discretization algorithms have played an important role in data mining and knowledge discovery. They not only produce a concise summarization of continuous attributes to help the experts understand the data more easily, but also make learning more accurate and faster. In this paper, we propose a static, global, incremental, supervised and top-down discretization algorithm based on Class-Attribute Contingency Coefficient. Empirical evaluation of seven discretization algorithms on 13 real datasets and four artificial datasets showed that the proposed algorithm could generate a better discretization scheme that improved the accuracy of classification. As to the execution time of discretization, the number of generated rules, and the training time of C5.0, our approach also achieved promising results.  相似文献   

9.
关于决策表离散化贪心算法的进一步改进   总被引:2,自引:2,他引:0  
对决策表属性离散化的贪心算法做了进一步改进。以分开实例时的多少确定断点的重要性,当出现同样重要的断点时,用下一个断点的重要性来决定选用哪一个断点,在一定程度上消除了不确定性,使算法更加完善。最后给出的实例表明,该算法能够减少断点数和未知区域。  相似文献   

10.
基于改进遗传算法的连续属性离散化方法   总被引:1,自引:0,他引:1  
粗糙集中的离散化要求在保持原有决策系统的不可分辩关系情况下,用尽量少的断点进行离散化,而求取连续属性值的最优断点集合是一个NP难题.把连续属性值离散化问题作为一种约束优化问题,采用一种改进的遗传算法来获得最优解,并针对离散化问题设计了相应的编码方式和交叉方法.实验结果表明,采用改进的遗传算法求解连续属性值最优断点集合是可行的.  相似文献   

11.
An extended Chi2 algorithm for discretization of real value attributes   总被引:11,自引:0,他引:11  
The variable precision rough sets (VPRS) model is a powerful tool for data mining, as it has been widely applied to acquire knowledge. Despite its diverse applications in many domains, the VPRS model unfortunately cannot be applied to real-world classification tasks involving continuous attributes. This requires a discretization method to preprocess the data. Discretization is an effective technique to deal with continuous attributes for data mining, especially for the classification problem. The modified Chi2 algorithm is one of the modifications to the Chi2 algorithm, replacing the inconsistency check in the Chi2 algorithm by using the quality of approximation, coined from the rough sets theory (RST), in which it takes into account the effect of degrees of freedom. However, the classification with a controlled degree of uncertainty, or a misclassification error, is outside the realm of RST. This algorithm also ignores the effect of variance in the two merged intervals. In this study, we propose a new algorithm, named the extended Chi2 algorithm, to overcome these two drawbacks. By running the software of See5, our proposed algorithm possesses a better performance than the original and modified Chi2 algorithms.  相似文献   

12.
In the domain of association rules mining (ARM) discovering the rules for numerical attributes is still a challenging issue. Most of the popular approaches for numerical ARM require a priori data discretization to handle the numerical attributes. Moreover, in the process of discovering relations among data, often more than one objective (quality measure) is required, and in most cases, such objectives include conflicting measures. In such a situation, it is recommended to obtain the optimal trade-off between objectives. This paper deals with the numerical ARM problem using a multi-objective perspective by proposing a multi-objective particle swarm optimization algorithm (i.e., MOPAR) for numerical ARM that discovers numerical association rules (ARs) in only one single step. To identify more efficient ARs, several objectives are defined in the proposed multi-objective optimization approach, including confidence, comprehensibility, and interestingness. Finally, by using the Pareto optimality the best ARs are extracted. To deal with numerical attributes, we use rough values containing lower and upper bounds to show the intervals of attributes. In the experimental section of the paper, we analyze the effect of operators used in this study, compare our method to the most popular evolutionary-based proposals for ARM and present an analysis of the mined ARs. The results show that MOPAR extracts reliable (with confidence values close to 95%), comprehensible, and interesting numerical ARs when attaining the optimal trade-off between confidence, comprehensibility and interestingness.  相似文献   

13.
Data discretization unification   总被引:1,自引:1,他引:1  
  相似文献   

14.
Crisp discretization is one of the most widely used methods for handling continuous attributes. In crisp discretization, each attribute is split into several intervals and handled as discrete numbers. Although crisp discretization is a convenient tool, it is not appropriate in some situations (e.g., when there is no clear boundary and we cannot set a clear threshold). To address such a problem, several discretizations with fuzzy sets have been proposed. In this paper we examine the effect of fuzzy discretization derived from crisp discretization. The fuzziness of fuzzy discretization is controlled by a fuzzification grade F. We examine two procedures for the setting of F. In one procedure, we set F beforehand and do not change it through training rule-based classifiers. In the other procedure, first we set F and then change it after training. Through computational experiments, we show that the accuracy of rule-based classifiers is improved by an appropriate setting of the grade of fuzzification. Moreover, we show that increasing the grade of fuzzification after training classifiers can often improve generalization ability. This work was presented in part at the 13th International Symposium on Artificial Life and Robotics, Oita, Japan, January 31–February 2, 2008  相似文献   

15.
16.
In this paper we propose a new static, global, supervised, incremental and bottom-up discretization algorithm based on coefficient of dispersion and skewness of data range. It automates the discretization process by introducing the number of intervals and stopping criterion. The results obtained using this discretization algorithm show that the discretization scheme generated by the algorithm almost has minimum number of intervals and requires smallest discretization time. The feedforward neural network with conjugate gradient training algorithm is used to compute the accuracy of classification from the data discretized by this algorithm. The efficiency of the proposed algorithm is shown in terms of better discretization scheme and better accuracy of classification by implementing it on six different real data sets.  相似文献   

17.
Toward unsupervised correlation preserving discretization   总被引:2,自引:0,他引:2  
Discretization is a crucial preprocessing technique used for a variety of data warehousing and mining tasks. In this paper, we present a novel PCA-based unsupervised algorithm for the discretization of continuous attributes in multivariate data sets. The algorithm leverages the underlying correlation structure in the data set to obtain the discrete intervals and ensures that the inherent correlations are preserved. Previous efforts on this problem are largely supervised and consider only piecewise correlation among attributes. We consider the correlation among continuous attributes and, at the same time, also take into account the interactions between continuous and categorical attributes. Our approach also extends easily to data sets containing missing values. We demonstrate the efficacy of the approach on real data sets and as a preprocessing step for both classification and frequent itemset mining tasks. We show that the intervals are meaningful and can uncover hidden patterns in data. We also show that large compression factors can be obtained on the discretized data sets. The approach is task independent, i.e., the same discretized data set can be used for different data mining tasks. Thus, the data sets can be discretized, compressed, and stored once and can be used again and again.  相似文献   

18.
19.
Semi-supervised classification methods aim to exploit labeled and unlabeled examples to train a predictive model. Most of these approaches make assumptions on the distribution of classes. This article first proposes a new semi-supervised discretization method, which adopts very low informative prior on data. This method discretizes the numerical domain of a continuous input variable, while keeping the information relative to the prediction of classes. Then, an in-depth comparison of this semi-supervised method with the original supervised MODL approach is presented. We demonstrate that the semi-supervised approach is asymptotically equivalent to the supervised approach, improved with a post-optimization of the intervals bounds location.  相似文献   

20.
An algorithm was obtained for restoration of a continuous process from a finite number of its equispaced discrete samples. The necessary and sufficient sampling conditions were formulated.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号