Similar Documents
1.
Our objective is a comparison of two data mining approaches to dealing with imbalanced data sets. The first approach is based on saving the original rule set, induced by the LEM2 (Learning from Example Module) algorithm, and changing the rule strength for all rules for the smaller class (concept) during classification. In the second approach, rule induction is split: the rule set for the larger class is induced by LEM2, while the rule set for the smaller class is induced by EXPLORE, another data mining algorithm. Results of our experiments show that both approaches increase the sensitivity compared to the original LEM2. However, the difference in performance of both approaches is statistically insignificant. Thus the appropriate approach for dealing with imbalanced data sets should be selected individually for a specific data set.
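The first approach above (scaling rule strength for the minority class during classification) can be sketched as follows; the rule representation, the strength-voting scheme, and the scaling factor are illustrative assumptions, not the LEM2 internals:

```python
# Hedged sketch: keep the induced rule set, but multiply the strength of
# every minority-class rule by a factor before strength-based voting.

def classify(example, rules, minority_class=None, factor=1.0):
    """Vote by summed rule strength; minority-class rules are scaled by `factor`."""
    votes = {}
    for conditions, label, strength in rules:
        if all(example.get(a) == v for a, v in conditions.items()):
            w = strength * (factor if label == minority_class else 1.0)
            votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get) if votes else None

rules = [
    ({"a": 1}, "big", 3.0),    # rule for the larger class
    ({"b": 2}, "small", 1.0),  # rule for the smaller class
]
# Unscaled, the majority-class rule wins; scaling minority strength flips it.
print(classify({"a": 1, "b": 2}, rules))                                    # big
print(classify({"a": 1, "b": 2}, rules, minority_class="small", factor=4))  # small
```

Increasing the factor trades specificity for sensitivity, which matches the reported effect of both approaches.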

2.
A Rule Acquisition Algorithm Based on Asymmetric Similarity Rough Sets
To address the problems that rule inference combining the rough-set similarity relation with the LEM2 algorithm extracts few rules and simplifies them poorly, this paper proposes a method for computing asymmetric similarity relations and approximation sets in rough sets, and improves and extends the rule-acquisition process of the existing LEM2 algorithm. The result is a new rule acquisition algorithm based on asymmetric similarity rough sets, designed to extract more latent rules from incomplete information. Finally, both algorithms were tested on a practical example and their results compared; the simulations show that the new algorithm optimizes better without changing the structure or content of the original information set.

3.
A probabilistic approximation is a generalization of the standard idea of lower and upper approximations, defined for equivalence relations. Recently, probabilistic approximations were additionally generalized to an arbitrary binary relation so that they may be applied to incomplete data. We discuss two ways to induce rules from incomplete data using probabilistic approximations: applying the true MLEM2 algorithm and an emulated MLEM2 algorithm. In this paper we report novel research comparing both approaches: new results of experiments on incomplete data with three interpretations of missing attribute values. Our results show that the two approaches do not differ much.
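A minimal sketch of a probabilistic approximation for an arbitrary binary relation, assuming the usual formulation in which K(x) is the characteristic set of x and the approximation keeps every K(x), x in X, whose conditional probability Pr(X | K(x)) reaches a threshold alpha; the toy characteristic sets are invented:

```python
def prob_approx(X, K, alpha):
    """Union of characteristic sets K[x], x in X, with Pr(X | K[x]) >= alpha."""
    result = set()
    for x in X:
        kx = K[x]
        if len(X & kx) / len(kx) >= alpha:   # Pr(X | K[x]) as a frequency
            result |= kx
    return result

# Characteristic sets for a toy incomplete table, and a concept X.
K = {1: {1, 2}, 2: {2}, 3: {2, 3, 4}}
X = {1, 2, 3}
print(sorted(prob_approx(X, K, alpha=1.0)))  # [1, 2]
print(sorted(prob_approx(X, K, alpha=0.5)))  # [1, 2, 3, 4]
```

At alpha = 1 this reduces to a lower-approximation-like set, and at small alpha to an upper-approximation-like set, which is what makes the probabilistic version a generalization of both.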

4.
THE USEFULNESS OF A MACHINE LEARNING APPROACH TO KNOWLEDGE ACQUISITION
This paper presents results of experiments showing how machine learning methods are useful for rule induction in the process of knowledge acquisition for expert systems. Four machine learning methods were used: ID3, ID3 with dropping conditions, and two options of the system LERS (Learning from Examples based on Rough Sets): LEM1 and LEM2. Two knowledge acquisition options of LERS were used as well. All six methods were used for rule induction from six real-life data sets. The main objective was to test how an expert system, supplied with these rule sets, performs without information on a few attributes. Thus an expert system attempts to classify examples with all missing values of some attributes. As a result of the experiments, it is clear that all machine learning methods performed much worse than the knowledge acquisition options of LERS. Thus, machine learning methods used for knowledge acquisition should be replaced by other methods of rule induction that will generate complete sets of rules. The knowledge acquisition options of LERS are examples of such appropriate ways of inducing rules for building knowledge bases.

5.
Taking the rule base as its starting point, this paper proposes a batch incremental update algorithm for decision rules. An equivalence-class table is built for all newly added objects and efficiently matched against the existing rule base, and rules are then updated according to each new object's match type. The algorithm applies to both complete and incomplete data, and it needs only two passes over the rule base to complete the update. Both theoretical analysis and comparative experiments on UCI data show that the method outperforms traditional approaches.

6.
A new relational learning system using novel rule selection strategies
Mahmut Uludag, Mehmet R. Tolun. 《Knowledge》, 2006, 19(8): 765-771
This paper describes a new rule induction system, rila, which can extract frequent patterns from multiple connected relations. The system supports two different rule selection strategies, namely the select-early and select-late strategies. Pruning heuristics are used to control the number of hypotheses generated during the learning process. Experimental results are provided on the mutagenesis and segmentation data sets. The present rule induction algorithm is also compared to similar relational learning algorithms; the results show that it is comparable to them.

7.
The two key stages of decision-tree induction are simplifying the data representation space and generating the tree. Under the constraint that the training set's inconsistency rate stays below a given threshold, reducing the number of attributes and the number of values per attribute keeps the decision-tree method feasible and effective. Building on the Chi2 algorithm, this paper applies a variant of it for attribute-value discretization and attribute selection, and then uses arithmetic operators to merge adjacent attributes that have two or three values. The decision tree generated on this basis shows good accuracy. The experiments use a data set donated by an insurance company.

8.
Word sense induction is an important research topic in word-sense knowledge acquisition, and clustering is currently the most widely used approach to it. After comparing the respective strengths of the K-Means and EM clustering algorithms in word sense induction models, this paper proposes a new clustering algorithm that fuses a distance measure with a Gaussian mixture model, aiming to combine the two algorithms' advantages in distance measurement and in modelling the data distribution, and to exploit both the geometric characteristics of the data and its Gaussian-distribution information for word sense clustering, thereby improving the performance of the induction model. Experimental results show that the proposed hybrid clustering algorithm is very effective at improving word sense induction.

9.
A New Unsupervised Discretization Method for Continuous Attributes
This paper proposes an unsupervised discretization algorithm for continuous attributes based on clustering, called CAMNA (Clustering and Merging on Numerical Attribute). CAMNA first partitions the numeric value range into multiple discrete intervals through clustering, and then merges adjacent intervals under the guidance of the class distribution to reach a good discretization scheme. Experiments show that, while the algorithm remains efficient, its discretizations are more reasonable: the generated decision trees are simpler in structure, fewer classification rules are produced, and classification accuracy improves.
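The clustering-then-merging idea can be illustrated with a toy sketch: cluster a numeric attribute in one dimension, then place interval boundaries midway between adjacent cluster centres. The tiny 1-D k-means below is an illustrative stand-in, not the CAMNA algorithm itself:

```python
def kmeans_1d(values, k, iters=20):
    """Tiny 1-D k-means with quantile-based seeding; returns sorted centres."""
    s = sorted(values)
    centres = [s[int((i + 0.5) * len(s) / k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in centres]
        for v in values:
            nearest = min(range(len(centres)), key=lambda j: abs(v - centres[j]))
            groups[nearest].append(v)
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return sorted(centres)

def cut_points(values, k):
    """Interval boundaries: midpoints between adjacent cluster centres."""
    c = kmeans_1d(values, k)
    return [(a + b) / 2 for a, b in zip(c, c[1:])]

data = [1.0, 1.2, 0.9, 5.0, 5.1, 4.9, 9.0, 9.2]
print(cut_points(data, 3))  # two boundaries, near 3.0 and 7.0
```

The subsequent CAMNA merging step would then combine adjacent intervals whose class distributions are similar, which this sketch omits.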

10.
The cAnt-Miner algorithm is an Ant Colony Optimization (ACO) based technique for classification rule discovery in problem domains which include continuous attributes. In this paper, we propose several extensions to cAnt-Miner. The main extension is based on the use of multiple pheromone types, one for each class value to be predicted. In the proposed μcAnt-Miner algorithm, an ant first selects a class value to be the consequent of a rule and the terms in the antecedent are selected based on the pheromone levels of the selected class value; pheromone update occurs on the corresponding pheromone type of the class value. The pre-selection of a class value also allows the use of more precise measures for the heuristic function and the dynamic discretization of continuous attributes, and further allows for the use of a rule quality measure that directly takes into account the confidence of the rule. Experimental results on 20 benchmark datasets show that our proposed extension improves classification accuracy to a statistically significant extent compared to cAnt-Miner, and has classification accuracy similar to the well-known Ripper and PART rule induction algorithms.
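The multiple-pheromone mechanism can be illustrated with a toy roulette-wheel selection over per-class pheromone tables; the data layout and levels are invented for illustration and are not the μcAnt-Miner internals:

```python
import random

def pick_term(pheromone, cls, rng):
    """Roulette-wheel choice of an antecedent term using the pheromone
    table of the class value the ant selected as the rule consequent."""
    row = pheromone[cls]
    r = rng.random() * sum(row.values())
    for term, level in row.items():
        r -= level
        if r <= 0:
            return term
    return term  # guard against floating-point underrun

# One pheromone table per class value: reinforcing a term for one class
# leaves the other class's selection probabilities untouched.
pheromone = {
    "yes": {"a=1": 9.0, "b=0": 1.0},
    "no":  {"a=1": 1.0, "b=0": 1.0},
}
rng = random.Random(0)
picks = [pick_term(pheromone, "yes", rng) for _ in range(100)]
print(picks.count("a=1") > 60)  # True: the reinforced term dominates
```

Pheromone update would then deposit only on the table of the rule's class value, which is what keeps the class-specific heuristics independent.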

11.
In the field of rotating-machinery fault diagnosis, continuous features usually need to be discretized as a preprocessing step for subsequent diagnostic analysis. After analyzing the ChiMerge discretization method and two of its shortcomings, this paper proposes a new multi-feature discretization method based on conflict level. The method discretizes multiple features automatically and converges to a preset conflict level. A worked example demonstrates the method's effectiveness.
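As a baseline for what this abstract improves on, ChiMerge's core loop can be sketched: repeatedly merge the adjacent interval pair with the lowest chi-square statistic until every remaining pair exceeds a threshold. The conflict-level extension itself is not reproduced here:

```python
def chi2(a, b):
    """Chi-square statistic between two intervals' per-class count lists."""
    total = sum(a) + sum(b)
    stat = 0.0
    for row in (a, b):
        rsum = sum(row)
        for j in range(len(a)):
            expected = rsum * (a[j] + b[j]) / total
            if expected:
                stat += (row[j] - expected) ** 2 / expected
    return stat

def chimerge(intervals, threshold):
    """intervals: per-interval class-count lists, in attribute-value order."""
    while len(intervals) > 1:
        stats = [chi2(intervals[i], intervals[i + 1]) for i in range(len(intervals) - 1)]
        i = min(range(len(stats)), key=stats.__getitem__)
        if stats[i] > threshold:
            break
        merged = [x + y for x, y in zip(intervals[i], intervals[i + 1])]
        intervals = intervals[:i] + [merged] + intervals[i + 2:]
    return intervals

# Two pure intervals of the same class merge; the different class stays apart
# (threshold 2.7 is roughly the 90% chi-square quantile for 1 degree of freedom).
print(chimerge([[4, 0], [3, 0], [0, 5]], threshold=2.7))  # [[7, 0], [0, 5]]
```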

12.
In this paper, we study the potential of adaptive sparse grids for multivariate numerical quadrature in the moderate or high dimensional case, i.e. for a number of dimensions beyond three and up to several hundred. There, conventional methods typically suffer from the curse of dimensionality or are unsatisfactory with respect to accuracy. Our sparse grid approach, based upon a direct higher order discretization on the sparse grid, overcomes this dilemma to some extent and introduces additional flexibility with respect to both the order of the 1D quadrature rule applied (in the sense of Smolyak's tensor product decomposition) and the placement of grid points. The presented algorithm is applied to some test problems and compared with other existing methods.
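Smolyak's combination formula, which this approach builds on, can be sketched for the unit cube with nested trapezoid rules; this toy version (node counts, level bookkeeping) is an illustrative assumption, not the paper's adaptive algorithm:

```python
from itertools import product
from math import comb

def trap_nodes(i):
    """Univariate trapezoid rule on [0,1]: midpoint at level 1, 2**(i-1)+1 points after."""
    if i == 1:
        return [(0.5, 1.0)]
    n = 2 ** (i - 1)
    h = 1.0 / n
    return [(j * h, h if 0 < j < n else h / 2) for j in range(n + 1)]

def smolyak(f, d, L):
    """Smolyak combination: sum tensor rules over multi-indices with L-d+1 <= |i| <= L."""
    total = 0.0
    for idx in product(range(1, L + 1), repeat=d):
        s = sum(idx)
        if not (L - d + 1 <= s <= L):
            continue
        coef = (-1) ** (L - s) * comb(d - 1, L - s)
        for pts in product(*(trap_nodes(i) for i in idx)):
            w = 1.0
            for _, wi in pts:
                w *= wi
            total += coef * w * f([x for x, _ in pts])
    return total

# Exact for the bilinear integrand, since each univariate trapezoid rule
# integrates linear functions exactly and the combination coefficients sum to 1.
print(round(smolyak(lambda x: x[0] * x[1], d=2, L=4), 10))  # 0.25
```

The point of the construction is that the number of nodes grows far more slowly with d than the full tensor grid, at a modest cost in accuracy for smooth integrands.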

13.
This paper proposes an interval fuzzy C-means clustering algorithm based on matrix-weighted association rules. Association rules are constructed from the matrix according to support and confidence, and interval fuzzy C-means clustering is then performed on top of these rules. The interval influence factor a is adjusted according to the sample size to achieve optimal clustering. The algorithm is more accurate than traditional algorithms (such as k-means) on small text collections, and reasonably efficient on multidimensional data. Theoretical analysis and experiments show that it can improve both the quality of the clustering results and the efficiency of the algorithm.

14.
曹峰, 唐超, 张婧. 《计算机科学》, 2017, 44(9): 222-226
Discretization is an important data preprocessing step that is widely used in rule extraction, knowledge discovery, classification, and related research areas. This paper proposes a discretization algorithm for continuous attributes that combines a binary ant colony with rough sets. The algorithm builds a binary ant-colony network over the space of candidate cut points of multiple continuous attributes, defines the colony's fitness function from the rough-set approximation classification accuracy, and searches for the globally optimal set of cut points. The algorithm is validated on UCI data sets, and the experimental results show that it achieves good discretization performance.

15.
Cardiovascular diseases are associated with high mortality rates worldwide. The development of new drugs, new medical equipment and non-invasive techniques for the heart demands multidisciplinary efforts towards the characterization of cardiac anatomy and function from the molecular to the organ level. Computational modeling has proved to be a useful tool for the investigation and comprehension of the complex biophysical processes that underlie cardiac function. The set of Bidomain equations is currently one of the most complete mathematical models for simulating the electrical activity in cardiac tissue. Unfortunately, large-scale simulations, such as those resulting from the discretization of an entire heart, remain a computational challenge. In order to reduce simulation execution times, parallel implementations have traditionally exploited data parallelism via numerical schemes based on domain decomposition. However, it has been verified that the parallel efficiency of these implementations severely degrades as the number of processors increases. In this work we propose and implement a new parallel algorithm for the solution of cardiac models. By relaxing the coherence of the execution, a new level of parallelism could be identified and exploited: pipelining. A synchronous parallel algorithm that uses both pipelining and data decomposition techniques was implemented using the MPI library for communication. Numerical tests were performed in two different cluster configurations. Our preliminary results indicate that the proposed algorithm is able to increase the parallel efficiency by up to 20% on an 8-core cluster. On a 32-core cluster the multi-level algorithm was 1.7 times faster than the traditional domain decomposition algorithm. In addition, the numerical precision was kept under control (relative errors under 6%) when the relaxed-coherence execution was adopted.

16.
17.
A theorem-proving system has been programmed for automating mildly complex proofs by structural induction. One can see the formal system as a generalization of number theory: the formal language is typed and the induction rule is valid for all types. Proofs are generated by working backward from the goal. The induction strategy splits into two parts: (1) the selection of induction variables, which is claimed to be linked to the useful generalization of terms to variables, and (2) the generation of induction subgoals, in particular the selection and specialization of relevant hypotheses. Other strategies include a fast simplification algorithm. The prover can cope with situations as complex as the definition and correctness proof of a simple compiling algorithm for expressions.

18.
We inferred business rules for business/ICT alignment by applying a novel rule induction algorithm to a data set containing rich alignment information polled from 641 organisations in 7 European countries. The alignment rule set was created using AntMiner+, a rule induction technique with a reputation for inducing accurate, comprehensible, and intuitive predictive models from data. Our data set consisted of 18 alignment practices distilled from an analysis of relevant publications and validated by a Delphi panel of experts. The goal of our study was to derive practical guidelines for managers on obtaining better alignment of ICT investments with business requirements. The obtained rule set showed the multi-disciplinary nature of B/ICT alignment. We discuss the implications of the alignment rules for practitioners.

19.
Multivariate Discretization for Set Mining
Many algorithms in data mining can be formulated as a set-mining problem where the goal is to find conjunctions (or disjunctions) of terms that meet user-specified constraints. Set-mining techniques have largely been designed for categorical or discrete data where variables can only take on a fixed number of values. However, many datasets also contain continuous variables, and a common way of dealing with these is to discretize them by breaking them into ranges. Most discretization methods are univariate and consider only a single feature at a time (sometimes in conjunction with a class variable). We argue that this is a suboptimal approach for knowledge discovery, since univariate discretization can destroy hidden patterns in the data. Discretization should consider the effects on all variables in the analysis: two regions X and Y should be placed in the same interval after discretization only if the instances in those regions have similar multivariate distributions (F_X ≈ F_Y) across all variables and combinations of variables. We present a bottom-up merging algorithm to discretize continuous variables based on this rule. Our experiments indicate that the approach is feasible, that it will not destroy hidden patterns, and that it will generate meaningful intervals. Received 14 November 2000 / Revised 1 February 2001 / Accepted in revised form 1 May 2001

20.
Knowledge-based systems such as expert systems are of particular interest in medical applications, as the extracted if-then rules provide interpretable results. Various rule induction algorithms have been proposed to effectively extract knowledge from data, and they can be combined with classification methods to form rule-based classifiers. However, most rule-based classifiers cannot directly handle numerical data such as blood pressure; a data preprocessing step called discretization is required to convert such numerical data into a categorical format. Existing discretization algorithms do not take into account the multimodal class densities of numerical variables in datasets, which may degrade the performance of rule-based classifiers. In this paper, a new Gaussian Mixture Model based Discretization algorithm (GMBD) is proposed that preserves the most frequent patterns of the original dataset by taking into account the multimodal distribution of the numerical variables. The effectiveness of the GMBD algorithm was verified using six publicly available medical datasets. According to the experimental results, the GMBD algorithm outperformed five other static discretization methods in terms of both the number of generated rules and the classification accuracy of the associative classification algorithm. Consequently, our proposed approach has the potential to enhance the performance of rule-based classifiers used in clinical expert systems.
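One concrete piece of such a mixture-based discretizer can be sketched: once a 1-D Gaussian mixture has been fitted, a natural cut point between two adjacent components is where their weighted densities intersect. The closed-form solution below assumes already-fitted components (the EM fitting step is omitted) and is an illustration, not the GMBD algorithm itself:

```python
import math

def gaussian_cut(w1, m1, s1, w2, m2, s2):
    """Solve w1*N(x; m1, s1) == w2*N(x; m2, s2) for the crossing point
    between the two means (take logs of both sides, solve the quadratic)."""
    a = 1 / (2 * s2**2) - 1 / (2 * s1**2)
    b = m1 / s1**2 - m2 / s2**2
    c = m2**2 / (2 * s2**2) - m1**2 / (2 * s1**2) + math.log(w1 * s2 / (w2 * s1))
    if abs(a) < 1e-12:                 # equal variances: the equation is linear
        return -c / b
    disc = math.sqrt(b * b - 4 * a * c)
    roots = [(-b + disc) / (2 * a), (-b - disc) / (2 * a)]
    return min(roots, key=lambda r: abs(r - (m1 + m2) / 2))

# Equal weights and variances: the cut point is midway between the means.
print(gaussian_cut(0.5, 0.0, 1.0, 0.5, 4.0, 1.0))  # 2.0
```

Placing one cut point between each pair of adjacent components is what lets the resulting intervals follow the multimodal class densities the abstract refers to.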
