Similar Documents
20 similar documents found.
1.
In knowledge discovery and data mining, many measures of interestingness have been proposed in order to measure the relevance and utility of the discovered patterns. Among these measures, an important role is played by Bayesian confirmation measures, which express the degree to which a premise confirms a conclusion. In this paper, we consider knowledge patterns in the form of “if…, then…” rules with a fixed conclusion. We investigate a monotone link between Bayesian confirmation measures and the classic dimensions of rule support and confidence. In particular, we formulate and prove conditions under which two confirmation measures enjoying some desirable properties depend monotonically on rule support and confidence. As the confidence measure is unable to identify and eliminate non-interesting rules, for which a premise does not confirm a conclusion, we propose to replace confidence with one of the considered confirmation measures when mining Pareto-optimal rules. We also provide general conclusions for the monotone link between any confirmation measure enjoying the desirable properties and rule support and confidence. Finally, we propose to mine rules maximizing rule support and minimizing rule anti-support, i.e. the number of examples that satisfy the premise of the rule but not its conclusion (the counter-examples of the rule). We prove that in this way we are able to mine all the rules maximizing any confirmation measure enjoying the desirable properties. We also prove that this Pareto-optimal set includes all the rules from the previously considered Pareto-optimal borders.
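The three quantities this abstract builds on are easy to state concretely. Below is a minimal sketch (not taken from the paper) computing support, confidence, and anti-support for a rule over a list of transactions represented as Python sets; the function name and toy data are illustrative only.

```python
def rule_measures(transactions, premise, conclusion):
    """Support, confidence and anti-support for a rule premise -> conclusion.

    transactions: list of sets of items; premise/conclusion: sets of items.
    """
    n = len(transactions)
    prem = sum(1 for t in transactions if premise <= t)
    both = sum(1 for t in transactions if premise <= t and conclusion <= t)
    anti = prem - both              # counter-examples: premise holds, conclusion does not
    support = both / n
    confidence = both / prem if prem else 0.0
    return support, confidence, anti

# toy usage
txns = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
print(rule_measures(txns, {"a"}, {"b"}))   # (0.5, 0.666..., 1)
```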

2.
Association rule mining and classification are important tasks in data mining. Using association rules has proved to be a good approach for classification. In this paper, we propose an accurate classifier based on class association rules (CARs), called CAR-IC, which introduces a new pruning strategy for mining CARs that allows building specific rules with high confidence. Moreover, we propose and prove three propositions that support the use of a confidence threshold for computing rules, which avoids ambiguity at the classification stage. This paper also presents a new way of ordering the set of CARs based on rule size and confidence. Finally, we define a new coverage strategy, which reduces the number of non-covered unseen transactions during the classification stage. Results over several datasets show that CAR-IC beats the best classifiers based on CARs reported in the literature.
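The abstract names rule size and confidence as the ordering criteria but not their priority. The sketch below assumes confidence first, with larger (more specific) rules breaking ties, which is one common convention for CAR-based classifiers; the paper's exact ordering may differ.

```python
from typing import NamedTuple

class CAR(NamedTuple):
    antecedent: frozenset   # items on the left-hand side
    label: str              # predicted class
    confidence: float

def order_cars(cars):
    # Sort by descending confidence; break ties by larger rule size
    # so that more specific rules are tried first.
    return sorted(cars, key=lambda r: (-r.confidence, -len(r.antecedent)))

cars = [CAR(frozenset({"a"}), "pos", 0.90),
        CAR(frozenset({"a", "b"}), "neg", 0.90),
        CAR(frozenset({"c"}), "pos", 0.95)]
print(order_cars(cars))   # highest confidence first; ties -> larger rules first
```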

3.
Association rules are one of the most widely used techniques to discover correlations among attributes in a database. So far, several efficient methods have been proposed to obtain these rules with respect to an optimal goal, such as maximizing the number of large itemsets and interesting rules, or the values of support and confidence of the discovered rules. This paper first introduces optimized fuzzy association rule mining in terms of three important criteria: strongness, interestingness, and comprehensibility. Then, it proposes multi-objective Genetic Algorithm (GA) based approaches for discovering these optimized rules. Optimization according to a given criterion may take one of two forms: the first tries to determine the appropriate fuzzy sets of quantitative attributes in a prespecified rule, also called a certain rule; the second deals with finding both uncertain rules and their appropriate fuzzy sets. Experimental results conducted on a real data set show the effectiveness and applicability of the proposed approach.

4.
Since the suggestion of a computing procedure for multiple Pareto-optimal solutions in multi-objective optimization problems in the early Nineties, researchers have been on the lookout for a procedure which is computationally fast and simultaneously capable of finding a well-converged and well-distributed set of solutions. Most multi-objective evolutionary algorithms (MOEAs) developed in the past decade are either good at achieving a well-distributed set of solutions at the expense of a large computational effort, or computationally fast at the expense of a not-so-good distribution of solutions. For example, although the Strength Pareto Evolutionary Algorithm or SPEA (Zitzler and Thiele, 1999) produces a much better distribution than the elitist non-dominated sorting GA or NSGA-II (Deb et al., 2002a), the computational time needed to run SPEA is much greater. In this paper, we evaluate a recently proposed steady-state MOEA (Deb et al., 2003) which was developed based on the epsilon-dominance concept introduced earlier (Laumanns et al., 2002) and uses efficient parent and archive update strategies for achieving a well-distributed and well-converged set of solutions quickly. Based on an extensive comparative study with four other state-of-the-art MOEAs on a number of two-, three-, and four-objective test problems, it is observed that the steady-state MOEA is a good compromise in terms of convergence near the Pareto-optimal front, diversity of solutions, and computational time. Moreover, the epsilon-MOEA is a step closer towards making MOEAs pragmatic, particularly by allowing a decision-maker to control the achievable accuracy in the obtained Pareto-optimal solutions.
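The core of epsilon-dominance is that objective space is partitioned into boxes of side epsilon and the archive keeps at most one solution per non-dominated box. The sketch below (minimization, simplified from the idea in Laumanns et al., 2002) illustrates the archive update; within-box tie-breaking here uses plain dominance rather than distance to the box corner, which is a simplifying assumption.

```python
import math

def eps_box(f, eps):
    """Identification vector: the epsilon-box a solution falls into (minimization)."""
    return tuple(math.floor(v / eps) for v in f)

def dominates(a, b):
    """Usual Pareto dominance for minimization."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def update_archive(archive, f, eps):
    """Accept objective vector f if its box is non-dominated; one point per box."""
    box = eps_box(f, eps)
    for g in list(archive):
        gbox = eps_box(g, eps)
        if gbox == box:
            if dominates(g, f):
                return archive      # an equal-or-better point already owns this box
            archive.remove(g)       # otherwise f replaces it
        elif dominates(gbox, box):
            return archive          # f's box is epsilon-dominated: reject
        elif dominates(box, gbox):
            archive.remove(g)       # f's box dominates g's box: evict g
    archive.append(f)
    return archive

arc = []
for pt in [(4.0, 4.0), (3.9, 4.1), (1.0, 5.0), (0.5, 0.5)]:
    update_archive(arc, pt, eps=0.5)
print(arc)   # only (0.5, 0.5) survives: its box dominates the others
```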

5.
《Knowledge》2006,19(6):438-444
One major goal of data mining is to understand data. Rule-based methods are better than other methods at making mining results comprehensible. However, current rule-based classifiers make use of a small number of rules and a default prediction to build a concise predictive model, which reduces the explanatory ability of the classifier. In this paper, we propose using multiple and negative target rules to improve the explanatory ability of rule-based classifiers. We show experimentally that this understandability does not come at the cost of accuracy.

6.
In this paper we consider induction of rule-based classifiers from imbalanced data, where one class (the minority class) is under-represented in comparison to the remaining majority classes. The minority class is usually of primary interest; however, most rule-based classifiers are biased towards the majority classes and have difficulty correctly recognizing the minority class. In this paper we discuss the sources of these difficulties, related either to data characteristics or to the algorithm itself. Among the problems related to the data distribution we focus on the role of small disjuncts, overlapping of classes, and the presence of noisy examples. We then show that standard techniques for the induction of rule-based classifiers, such as sequential covering, top-down induction of rules, or classification strategies, were created under the assumption of a balanced data distribution, and we explain why they are biased towards the majority classes. Some modifications of rule-based classifiers have already been introduced, but they usually concentrate on individual problems. We therefore propose a novel algorithm, BRACID, which addresses the issues associated with imbalanced data more comprehensively. Its main characteristics include a hybrid representation of rules and single examples, bottom-up learning of rules, and a local classification strategy using nearest rules. The usefulness of BRACID has been evaluated in experiments on several imbalanced datasets. The results show that BRACID significantly outperforms the well-known rule-based classifiers C4.5rules, RIPPER, PART, CN2, and MODLEM, as well as related classifiers such as RISE or k-NN. Moreover, it is comparable to or better than the studied approaches specialized for imbalanced data, such as generalizations of rule algorithms or combinations of SMOTE + ENN preprocessing with PART. Finally, it improves the support of minority-class rules, leading to better recognition of minority-class examples.

7.
The focus of this paper is to develop a solution framework for equilibrium transportation network design problems with multiple objectives that are mutually commensurate. Objective parameterization, or scalarization, forms the core of this approach, by which a multi-objective problem can be equivalently addressed by solving a series of single-objective problems. In particular, we develop a parameterization-based heuristic that resembles an iterative divide-and-conquer strategy to locate a Pareto-optimal solution in each divided range of commensurate parameters. Unlike its previous counterparts, the heuristic is capable of asymptotically exhausting the complete Pareto-optimal solution set and of identifying parameter ranges that exclude any Pareto-optimal solution. Its algorithmic effectiveness and solution characteristics are demonstrated on a set of numerical examples, from which we also gain additional insights into its solution-generation behavior and the tradeoff between computational cost and solution quality.
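Scalarization itself is simple to demonstrate: each value of the commensuration parameter yields one single-objective problem whose optimum is a Pareto-optimal point. The sketch below sweeps a weight grid on a toy bi-objective design problem; all names and objective functions are illustrative, and a plain weighted sum only recovers the convex part of the front, whereas the paper's divide-and-conquer heuristic is more refined.

```python
def scalarize(designs, f1, f2, weights):
    """Trace Pareto-optimal designs of a bi-objective problem by solving one
    single-objective (weighted-sum) problem per weight -- the core idea of
    objective parameterization. Recovers the convex part of the front.
    """
    front = {}
    for w in weights:
        best = min(designs, key=lambda d: w * f1(d) + (1 - w) * f2(d))
        front[best] = (f1(best), f2(best))
    return front

# toy usage: designs are capacity levels; cost rises, travel time falls
designs = range(1, 11)
front = scalarize(designs,
                  f1=lambda d: d ** 2,     # construction cost
                  f2=lambda d: 100 / d,    # congestion / travel time
                  weights=[i / 10 for i in range(11)])
print(front)
```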

8.
Greedy approaches suffer from a restricted search space, which can lead to suboptimal classifiers in terms of performance and classifier size. This study discusses exhaustive search as an alternative to greedy search for learning short and accurate decision rules. The Exhaustive Procedure for LOgic-Rule Extraction (EXPLORE) algorithm is presented to induce decision rules in disjunctive normal form (DNF) in a systematic and efficient manner. We propose a method based on subsumption to reduce the number of values considered for instantiation in the literals, by taking into account the relational operator, without loss of performance. Furthermore, we describe a branch-and-bound approach that makes optimal use of user-defined performance constraints. To improve generalizability, we use a validation set to determine the optimal length of the DNF rule. The performance and size of the DNF rules induced by EXPLORE are compared to those of eight well-known rule learners. Our results show that an exhaustive approach to rule learning in DNF results in significantly smaller classifiers than those of the other rule learners, while securing comparable or even better performance. Clearly, exhaustive search is computationally intensive and may not always be feasible. Nevertheless, based on this study, we believe that exhaustive search should be considered an alternative to greedy search in many problems.

9.
Application of the AR-Markov model in dynamic association rule mining
To capture the way rules change over time, meta-rules are built for each rule as models for analyzing and predicting the trends of its support and confidence. By introducing two additional rule-evaluation indicators, the support vector and the confidence vector, a formal definition of meta-rules for dynamic association rules is given. An autoregressive (AR) Markov model is then used to mine the meta-rules of dynamic association rules, and a worked example demonstrates the effectiveness of the method.
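The key idea here is treating a rule's support and confidence as time series to be forecast. The paper's AR-Markov formulation is not reproduced; below is only a minimal AR(1) least-squares stand-in showing how a trend in a support vector could be extrapolated (data and function names are illustrative).

```python
import numpy as np

def ar1_forecast(series):
    """Fit s[t] = a * s[t-1] + b by least squares and forecast the next value --
    a stand-in for the trend-prediction role the AR component plays here."""
    s = np.asarray(series, dtype=float)
    x, y = s[:-1], s[1:]
    a, b = np.polyfit(x, y, 1)      # slope and intercept
    return a * s[-1] + b

support_vector = [0.30, 0.32, 0.35, 0.37, 0.40]   # rule support per period
print(round(ar1_forecast(support_vector), 3))      # extrapolated next support
```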

10.
Association rule mining can provide genuine insight into the data being analysed; however, rule sets can be extremely large, and therefore difficult and time-consuming for the user to interpret. We propose reducing the size of Apriori rule sets by removing overlapping rules, and compare this approach with two standard methods for reducing rule set size: increasing the minimum confidence parameter and increasing the minimum antecedent support parameter. We evaluate the rule sets in terms of confidence and coverage, as well as two rule-interestingness measures that favour rules with antecedent conditions that are poor individual predictors of the target class, as we assume that these represent potentially interesting rules. We also examine the distribution of the rules graphically to assess whether particular classes of rules are eliminated. We show that removing overlapping rules substantially reduces rule set size in most cases, and alters the character of a rule set less than if the standard parameters are used to constrain the rule set to the same size. Based on our results, we aim to extend the Apriori algorithm to incorporate the suppression of overlapping rules.
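The abstract does not spell out its overlap criterion, so the sketch below uses one plausible reading: drop a rule when another rule with the same consequent has a strict-subset antecedent and at least the same confidence. The paper's actual definition may differ.

```python
def remove_overlapping(rules):
    """rules: list of (antecedent frozenset, consequent, confidence).
    Keep a rule only if no other rule with the same consequent has a
    strict-subset antecedent and at least the same confidence."""
    kept = []
    for ant, cons, conf in rules:
        redundant = any(
            c2 == cons and a2 < ant and conf2 >= conf
            for a2, c2, conf2 in rules
        )
        if not redundant:
            kept.append((ant, cons, conf))
    return kept

rules = [(frozenset({"milk"}), "bread", 0.8),
         (frozenset({"milk", "eggs"}), "bread", 0.7)]
print(remove_overlapping(rules))   # the longer, weaker rule is dropped
```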

11.
陈柳  冯山 《计算机应用》2018,38(5):1315-1319
Traditional methods for setting confidence thresholds for positive and negative association rules struggle to control the number of low-credibility rules and easily miss interesting ones. To address this, a two-level confidence-threshold setting method combined with itemset correlation (PNMC-TWO) is proposed. First, with rule non-contradiction, validity, and interestingness in mind, and within a correlation-support-confidence framework, the method starts from the computational relationship between rule confidence and itemset support to systematically analyze how the confidence of positive and negative association rules varies with itemset support. Then, combining this with users' practical need for highly credible and interesting rules, a new threshold-setting model is proposed that avoids the blindness and arbitrariness of traditional threshold setting. Finally, the proposed method is compared experimentally with the original double-threshold method in terms of both rule quantity and rule quality. The results show that the proposed method not only better ensures that the extracted association rules are valid and interesting, but also significantly reduces the number of low-credibility association rules.

12.
This paper proposes a precise candidate selection method for large-character-set recognition based on confidence evaluation of distance-based classifiers. The proposed method is applicable to a wide variety of distance metrics, and experiments with Euclidean distance and city-block distance have achieved promising results. In the confidence evaluation, the distribution of distances is analyzed to derive class probabilities in two steps: output probability evaluation and input probability inference. Using the input probabilities as confidences, several selection rules were tested, and the rule that selects the classes whose confidence ratio to the first-ranked class is high produced the best results. The experiments were carried out on the ETL9B database, and the results show that the proposed method selects about one-fourth as many candidates as the conventional method of selecting a fixed number of candidates, with accuracy preserved.
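To illustrate the selection rule that won in the experiments: convert class distances into confidences, then keep every class whose confidence is within a fixed ratio of the top class. The softmax over negative distances below is a simple stand-in for the paper's two-step probability evaluation, not its actual model; the ratio value and toy data are invented.

```python
import numpy as np

def candidates_by_confidence(distances, labels, ratio=0.1):
    """Select candidate classes whose confidence is within `ratio` of the top
    class. Confidences come from a softmax over negative distances -- a
    simple stand-in for the paper's two-step probability evaluation."""
    d = np.asarray(distances, dtype=float)
    conf = np.exp(-(d - d.min()))     # shift by the min distance for stability
    conf /= conf.sum()
    top = conf.max()
    return [lab for lab, c in zip(labels, conf) if c >= ratio * top]

print(candidates_by_confidence([1.0, 1.2, 3.5, 4.0], ["A", "B", "C", "D"]))
# ['A', 'B'] -- far-away classes are pruned instead of keeping a fixed number
```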

13.
Association rule extraction based on item attributes
Association rules are one of the important methods in the field of database knowledge discovery, used to find rules satisfying user-specified minimum support and minimum confidence thresholds: the minimum support threshold determines the scale of the dataset under study, while the minimum confidence threshold measures a rule's reliability. Under the usual support/confidence framework, the user can only give a single pair of minimum support and minimum confidence thresholds, so all data items are handled by a uniform standard; in real databases, however, items have their own characteristics. Based on items' attribute features, this paper uses fuzzy evaluation to decide a reasonable minimum support threshold for each item, and thus determines each item's support interval, so that both frequent and rare rules can be discovered in a single mining run. Since rule extraction based on minimum confidence alone produces redundancy, the paper proposes comparing the relative importance of a rule's antecedent and consequent, using subjective judgment to remove redundant rules, and thereby mines complete rules that are as close to natural as possible.

14.
In this paper, a new approach based on Differential Evolution (DE) for the automatic classification of items in medical databases is proposed. Based on it, a tool called DEREx is presented, which automatically extracts explicit knowledge from the database in the form of IF-THEN rules containing AND-connected clauses on the database variables. Each DE individual codes for a set of rules. For each class, more than one rule can be contained in the individual, and these rules can be seen as logically connected by OR. Furthermore, all the classifying rules for all the classes are found at once, in a single step. DEREx is intended as a useful support to decision making whenever explanations of why an item is assigned to a given class should be provided, as is the case for diagnosis in the medical domain. The major contribution of this paper is that DEREx is the first classification tool in the literature that is based on DE and automatically extracts sets of IF-THEN rules without the intervention of any other mechanism. All other DE-based classification tools in the literature either simply find centroids for the classes rather than extracting rules, or are hybrid systems in which DE merely optimizes some parameters while the classification capabilities are provided by other mechanisms. For the experiments, eight databases from the medical domain have been considered. First, among ten classical DE variants, the most effective in terms of classification accuracy in a ten-fold cross-validation was identified. Secondly, the tool was compared over the same eight databases against a set of fifteen classifiers widely used in the literature. The results have proven the effectiveness of the proposed approach, since DEREx turns out to be the best-performing tool in terms of classification accuracy. Statistical analysis has also confirmed that DEREx is the best classifier. When compared to the other rule-based classification tools used here, DEREx needs the lowest average number of rules to tackle a problem, and the average number of clauses per rule is not very high. In conclusion, the tool presented here is preferable to the other classifiers because it shows good classification accuracy, automatically extracts knowledge, and provides users with it in an easily comprehensible form.

15.
Association rule pruning algorithms based on the support-confidence model mine many uninteresting rules. To address this problem, a pruning algorithm guided by positive correlation is proposed: a positive-correlation evaluation function is constructed from all-confidence and lift, and frequent itemsets are pruned with it. Experimental results show that the algorithm reduces the number of uninteresting association rules, improves the quality of the mining results, and shortens mining time.
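Both measures named in the abstract are standard and easy to state; how the paper combines them into a single evaluation function is not given, so the combination below is an assumption (positive correlation via lift plus a minimum all-confidence).

```python
def lift(sup_xy, sup_x, sup_y):
    """Lift > 1 indicates positive correlation between itemsets X and Y."""
    return sup_xy / (sup_x * sup_y)

def all_confidence(sup_xy, sup_x, sup_y):
    """All-confidence: the smaller of the two directional confidences."""
    return sup_xy / max(sup_x, sup_y)

def positively_correlated(sup_xy, sup_x, sup_y, min_allconf=0.5):
    """One plausible combined test: keep an itemset only if it is positively
    correlated (lift > 1) and cohesive enough (all-confidence threshold)."""
    return lift(sup_xy, sup_x, sup_y) > 1 and \
           all_confidence(sup_xy, sup_x, sup_y) >= min_allconf

print(positively_correlated(0.30, 0.40, 0.50))   # True: lift 1.5, all-conf 0.6
```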

16.
Credit-risk evaluation is a very challenging and important problem in the domain of financial analysis. Many classification methods have been proposed in the literature to tackle this problem; statistical and neural-network-based approaches are among the most popular paradigms. However, most of these methods produce so-called “hard” classifiers, which generate decisions without any accompanying confidence measure. In contrast, “soft” classifiers, such as those designed using a fuzzy set-theoretic approach, produce a measure of support for the decision (and also for alternative decisions) that provides the analyst with greater insight. In this paper, we propose a method of building credit-scoring models using fuzzy rule-based classifiers. First, the rule base is learned from the training data using a SOM-based method. Then the fuzzy k-NN rule is combined with it to design a contextual classifier that integrates the context information from the training set for more robust and qualitatively better classification. Further, a method of seamlessly integrating business constraints into the model is also demonstrated.

17.
《Applied Soft Computing》2008,8(1):646-656
In this paper, a Pareto-based multi-objective differential evolution (DE) algorithm is proposed as a search strategy for mining accurate and comprehensible numeric association rules (ARs) which are optimal in the wider sense that no other rules are superior to them when all objectives are considered simultaneously. The proposed DE guides the search for ARs toward the global Pareto-optimal set while maintaining adequate population diversity to capture as many high-quality ARs as possible. The AR mining problem is formulated as a four-objective optimization problem: support, confidence, and the comprehensibility of the rule are maximization objectives, while the amplitude of the intervals that form the itemset and rule is a minimization objective. The algorithm is designed to simultaneously search for the intervals of numeric attributes and discover the ARs these intervals form, in a single run of DE. Unlike the usual methods, ARs are mined directly, without generating frequent itemsets. The proposed DE takes a database-independent approach that does not rely on the minimum support and minimum confidence thresholds, which are hard to determine for each database. The efficiency of the proposed DE is validated on synthetic and real databases.
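The four-objective formulation can be made concrete as an objective vector per candidate rule. The exact formulas below are assumptions (the paper does not state them in this abstract): comprehensibility favors rules over fewer attributes, and amplitude is the normalized width of the rule's numeric intervals.

```python
def rule_objectives(sup, conf, n_attrs_rule, n_attrs_db, amplitude, attr_range):
    """Objective vector for a numeric association rule, loosely following the
    paper's four criteria (exact formulas are assumptions):
      maximize support, confidence, comprehensibility; minimize amplitude."""
    comprehensibility = 1.0 - n_attrs_rule / n_attrs_db
    norm_amplitude = amplitude / attr_range
    # Return as a minimization vector (negate the maximized objectives),
    # ready for a Pareto-dominance comparison inside the DE loop.
    return (-sup, -conf, -comprehensibility, norm_amplitude)

print(rule_objectives(0.2, 0.8, 2, 10, 15.0, 100.0))
# (-0.2, -0.8, -0.8, 0.15)
```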

18.
Knowledge-based systems such as expert systems are of particular interest in medical applications, as the extracted if-then rules can provide interpretable results. Various rule induction algorithms have been proposed to effectively extract knowledge from data, and they can be combined with classification methods to form rule-based classifiers. However, most rule-based classifiers cannot directly handle numerical data such as blood pressure; a data preprocessing step called discretization is required to convert such numerical data into a categorical format. Existing discretization algorithms do not take into account the multimodal class densities of numerical variables in datasets, which may degrade the performance of rule-based classifiers. In this paper, a new Gaussian Mixture Model based Discretization Algorithm (GMBD) is proposed that preserves the most frequent patterns of the original dataset by taking into account the multimodal distribution of the numerical variables. The effectiveness of the GMBD algorithm was verified using six publicly available medical datasets. According to the experimental results, the GMBD algorithm outperformed five other static discretization methods in terms of the number of generated rules and the classification accuracy of the associative classification algorithm. Consequently, our proposed approach has the potential to enhance the performance of rule-based classifiers used in clinical expert systems.
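To make the idea concrete: fit a one-dimensional Gaussian mixture to a numeric variable and place a cut wherever the most likely component changes, so that each resulting interval follows one mode of the distribution. This is a generic sketch of GMM-based cut-point selection, not the GMBD algorithm itself; it assumes scikit-learn is available, and the blood-pressure data are simulated.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_cut_points(values, n_components=3, grid=1000):
    """Derive discretization cut points from a 1-D Gaussian mixture: scan a
    fine grid and cut wherever the most likely component changes."""
    x = np.asarray(values, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=n_components, random_state=0).fit(x)
    grid_x = np.linspace(x.min(), x.max(), grid).reshape(-1, 1)
    comp = gmm.predict(grid_x)                   # most likely component per point
    changes = np.where(np.diff(comp) != 0)[0]    # where the component switches
    return grid_x[changes].ravel()               # cut points

rng = np.random.default_rng(0)
bp = np.concatenate([rng.normal(80, 5, 200), rng.normal(120, 8, 200)])
print(gmm_cut_points(bp, n_components=2))        # one cut near the valley ~100
```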

19.
In data mining applications, it is important to develop evaluation methods for selecting high-quality and profitable rules. This paper utilizes a non-parametric approach, Data Envelopment Analysis (DEA), to estimate and rank the efficiency of association rules under multiple criteria. The interestingness of association rules is conventionally measured based on support and confidence; for specific applications, domain knowledge can further be encoded as measures to evaluate the discovered rules. For example, in market basket analysis, the product value and cross-selling profit associated with an association rule can serve as essential measures of rule interestingness. In this paper, these domain measures are also included in the rule-ranking procedure for selecting valuable rules for implementation. An example of market basket analysis illustrates the DEA-based methodology for measuring the efficiency of association rules under multiple criteria.
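As an illustration of DEA-style rule ranking: treat each rule as a decision-making unit with a single unit input and its interestingness measures as outputs, then solve one small linear program per rule. This is a textbook input-oriented CCR envelopment formulation, not necessarily the exact model used in the paper; the toy rules and their scores are invented.

```python
import numpy as np
from scipy.optimize import linprog

def dea_scores(outputs):
    """Input-oriented CCR DEA with one unit input per rule (DMU) and the
    rule's interestingness measures as outputs. Score 1.0 = efficient.
    outputs: (n_rules, n_measures) array; larger values are better."""
    Y = np.asarray(outputs, dtype=float)
    n, m = Y.shape
    scores = []
    for o in range(n):
        # variables: theta, lambda_1..lambda_n ; minimize theta
        c = np.zeros(n + 1); c[0] = 1.0
        # input constraint: sum(lambda) - theta <= 0   (unit inputs)
        A = [np.concatenate(([-1.0], np.ones(n)))]
        b = [0.0]
        # output constraints: -sum(lambda_j * y_rj) <= -y_ro
        for r in range(m):
            A.append(np.concatenate(([0.0], -Y[:, r])))
            b.append(-Y[o, r])
        res = linprog(c, A_ub=np.array(A), b_ub=np.array(b),
                      bounds=[(0, None)] * (n + 1), method="highs")
        scores.append(res.fun)
    return scores

# toy: three rules scored on (support, confidence, profit)
rules = [(0.4, 0.9, 12.0), (0.5, 0.6, 8.0), (0.2, 0.5, 5.0)]
print([round(s, 3) for s in dea_scores(rules)])   # rule 3 is dominated
```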

20.
Sequential rule mining is an important data mining task with a wide range of applications. However, current algorithms for discovering sequential rules common to several sequences use very restrictive definitions of sequential rules, which make them unable to recognize that similar rules can describe the same phenomenon. This can have many undesirable effects, such as (1) similar rules that are rated differently, (2) rules that are not found because they are considered uninteresting when taken individually, and (3) rules that are too specific, which makes them less likely to be used for making predictions. In this paper, we address these problems by proposing a more general form of sequential rules in which the items in the antecedent and in the consequent of each rule are unordered. We propose an algorithm named CMRules for mining this form of rules. The algorithm proceeds by first finding association rules to prune the search space for items that occur jointly in many sequences. It then eliminates association rules that do not meet the minimum confidence and support thresholds under the sequential ordering. We evaluate the performance of CMRules in three ways. First, we provide an analysis of its time complexity. Second, we compare its performance (in terms of execution time, memory usage, and scalability) with an adaptation of an algorithm from the literature that we name CMDeo. For this comparison, we use three real-life public datasets which have different characteristics and represent three kinds of data. In many cases, the results show that CMRules is faster and scales better for low support thresholds than CMDeo. Lastly, we report a successful application of the algorithm in a tutoring agent.
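The generalized rule form can be checked with a small predicate: a partially-ordered rule X => Y holds in a sequence if all items of X appear (in any order) before all items of Y. The sketch below is one plausible reading of that definition, with invented helper names and toy data; the paper's formal definition may add further conditions.

```python
def rule_holds(sequence, antecedent, consequent):
    """X => Y holds in a sequence if all items of X appear (in any order)
    before all items of Y -- a plausible reading of the proposed rule form."""
    X, Y = set(antecedent), set(consequent)
    for split in range(1, len(sequence)):
        if X <= set(sequence[:split]) and Y <= set(sequence[split:]):
            return True
    return False

def seq_support_confidence(sequences, antecedent, consequent):
    """Sequential support/confidence of X => Y over a sequence database."""
    has_x = [s for s in sequences if set(antecedent) <= set(s)]
    holds = sum(rule_holds(s, antecedent, consequent) for s in has_x)
    support = holds / len(sequences)
    confidence = holds / len(has_x) if has_x else 0.0
    return support, confidence

db = [["a", "b", "c", "d"], ["b", "a", "d"], ["a", "c"], ["c", "b", "d"]]
print(seq_support_confidence(db, {"a", "b"}, {"d"}))   # (0.5, 1.0)
```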
