Found 20 similar documents (search time: 0 ms)
1.
Association rules are one of the most frequently used tools for finding relationships between different attributes in a database. There are various techniques for obtaining these rules, the most common of which produce categorical association rules. However, when we need to relate attributes that are numeric and discrete, we turn to methods that generate quantitative association rules, which have been studied far less. In addition, when the database is extremely large, many of these tools cannot be used. In this paper, we present an evolutionary tool for finding association rules in databases (both small and large) comprising quantitative and categorical attributes, without the need for an a priori discretization of the domain of the numeric attributes. Finally, we evaluate the tool using both real and synthetic databases.
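To make the idea of a quantitative rule without prior discretization concrete, here is a minimal sketch of how such a rule might be scored. It assumes a rule is a pair of interval conditions over numeric attributes; the names `covers` and `support_confidence` and the toy records are illustrative and do not come from the paper.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]  # closed interval [low, high]; an assumed encoding


def covers(record: Dict[str, float], conditions: Dict[str, Interval]) -> bool:
    """Return True if every attribute in `conditions` falls inside its interval."""
    return all(lo <= record[attr] <= hi for attr, (lo, hi) in conditions.items())


def support_confidence(
    records: List[Dict[str, float]],
    antecedent: Dict[str, Interval],
    consequent: Dict[str, Interval],
) -> Tuple[float, float]:
    """Compute support and confidence of the rule antecedent -> consequent."""
    n_ante = sum(1 for r in records if covers(r, antecedent))
    n_both = sum(
        1 for r in records if covers(r, antecedent) and covers(r, consequent)
    )
    support = n_both / len(records) if records else 0.0
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence


# Toy data (hypothetical): rule "age in [20, 35] -> income in [25, 50]"
records = [
    {"age": 25, "income": 30},
    {"age": 32, "income": 45},
    {"age": 41, "income": 80},
    {"age": 29, "income": 38},
]
support, confidence = support_confidence(
    records, {"age": (20, 35)}, {"income": (25, 50)}
)
```

An evolutionary search could, under these assumptions, mutate the interval bounds directly and use such scores as part of its fitness, which is one way to avoid fixing bins in advance.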
2.
The process of automatically extracting novel, useful and ultimately comprehensible information from large databases, known as data mining, has become of great importance due to the ever-increasing amounts of data collected by large organizations. In particular, emphasis is placed on heuristic search methods able to discover patterns that are hard or impossible to detect using standard query mechanisms and classical statistical techniques. In this paper, an evolutionary system capable of extracting explicit classification rules is presented. Particular attention is given to finding easily interpretable rules that may be used to make crucial decisions. The results are compared with those achieved by other methods on a real problem, breast cancer diagnosis.
3.
In the domain of association rule mining (ARM), discovering rules for numerical attributes is still a challenging issue. Most of the popular approaches to numerical ARM require a priori data discretization to handle the numerical attributes. Moreover, in the process of discovering relations among data, more than one objective (quality measure) is often required, and in most cases such objectives include conflicting measures. In such a situation, it is recommended to obtain the optimal trade-off between objectives. This paper deals with the numerical ARM problem from a multi-objective perspective by proposing a multi-objective particle swarm optimization algorithm (MOPAR) for numerical ARM that discovers numerical association rules (ARs) in a single step. To identify more efficient ARs, several objectives are defined in the proposed multi-objective optimization approach, including confidence, comprehensibility, and interestingness. Finally, the best ARs are extracted using Pareto optimality. To deal with numerical attributes, we use rough values containing lower and upper bounds to represent the intervals of attributes. In the experimental section of the paper, we analyze the effect of the operators used in this study, compare our method to the most popular evolutionary-based proposals for ARM, and present an analysis of the mined ARs. The results show that MOPAR extracts reliable (with confidence values close to 95%), comprehensible, and interesting numerical ARs when attaining the optimal trade-off between confidence, comprehensibility and interestingness.
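The final Pareto-based extraction step can be illustrated with a small sketch. It assumes each rule has already been scored as a tuple (confidence, comprehensibility, interestingness) with higher values being better; the function names and scores are hypothetical, and this shows only generic Pareto filtering, not the MOPAR algorithm itself.

```python
from typing import List, Sequence, Tuple

Scores = Tuple[float, float, float]  # (confidence, comprehensibility, interestingness)


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b if it is at least as good everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: List[Scores]) -> List[Scores]:
    """Keep only the score tuples that no other tuple dominates."""
    return [s for s in scores if not any(dominates(o, s) for o in scores)]


# Hypothetical rule scores; the third rule is dominated by the first.
rule_scores = [
    (0.95, 0.60, 0.30),
    (0.90, 0.70, 0.20),
    (0.80, 0.50, 0.10),
]
front = pareto_front(rule_scores)
```

The quadratic scan is fine for a small rule set; a swarm-based method would typically maintain such a non-dominated archive incrementally.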
4.
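In the knowledge discovery process, classification rule mining is one of the main mining tasks. To address the shortcomings of traditional mining algorithms based on statistical analysis in ensuring that the discovered knowledge is interesting, this paper proposes using evolutionary computation, an intelligent computing model whose global search and entirely fitness-driven characteristics allow classification knowledge to be mined and processed automatically, without prior knowledge, so as to ensure the interestingness of the knowledge. The high-level IF-THEN form of knowledge representation is proposed to improve comprehensibility. Design principles and methods are also given for several components that play an important role in evolutionary algorithms, including individual representation, genetic operators, and fitness evaluation.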
5.
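In Web mining, classifying Web documents by content is a crucial step. A common approach to Web document classification is hierarchical classification, which assigns documents to the appropriate category of a classification tree in a top-down manner. However, hierarchical classification often suffers from blocking: a document to be classified is wrongly rejected by a classifier at a higher level of the tree. To address this, a classifier-centric blocking factor is used to measure the degree of blocking, and two new hierarchical classification methods are introduced, one based on threshold reduction and one based on restricted voting, to reduce the wrongful blocking of documents in Web document classification.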
6.
The prototype selection problem consists of reducing the size of databases by removing samples that are considered noisy or not influential in nearest neighbour classification tasks. Evolutionary algorithms have recently been used for prototype selection with good results. However, because the complexity of this problem grows with the size of the database, the behaviour of evolutionary algorithms can deteriorate considerably because of a lack of convergence. This additional problem is known as the scaling up problem. Memetic algorithms are approaches to heuristic search in optimization problems that combine a population-based algorithm with a local search. In this paper, we propose a memetic algorithm model that incorporates an ad hoc local search specifically designed for optimizing the properties of the prototype selection problem, with the aim of tackling the scaling up problem. To check its performance, we have carried out an empirical study including a comparison between our proposal and previous evolutionary and non-evolutionary approaches studied in the literature. The results have been contrasted using non-parametric statistical procedures and show that our approach outperforms previously studied methods, especially when the database scales up.
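As a rough illustration of what an evolutionary or memetic search over prototype subsets might optimize, the sketch below scores a binary selection mask by combining 1-NN accuracy with the reduction rate. The weighted-sum form, the `alpha` weight, and all names are assumptions made for illustration; the abstract does not specify the fitness function used.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Tuple[float, ...], str]  # (feature vector, class label)


def one_nn_predict(prototypes: List[Sample], x: Sequence[float]) -> str:
    """Return the label of the prototype closest to x (squared Euclidean distance)."""
    nearest = min(
        prototypes, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x))
    )
    return nearest[1]


def fitness(mask: Sequence[int], data: List[Sample], alpha: float = 0.5) -> float:
    """Assumed fitness: alpha * accuracy + (1 - alpha) * reduction rate.

    Accuracy is measured on the full data set, including the selected
    prototypes themselves (no leave-one-out), to keep the sketch short.
    """
    prototypes = [d for d, keep in zip(data, mask) if keep]
    if not prototypes:
        return 0.0
    correct = sum(one_nn_predict(prototypes, x) == y for x, y in data)
    accuracy = correct / len(data)
    reduction = 1 - len(prototypes) / len(data)
    return alpha * accuracy + (1 - alpha) * reduction


# Toy one-dimensional data set (hypothetical).
data = [((0.0,), "a"), ((0.1,), "a"), ((1.0,), "b"), ((1.1,), "b")]
score = fitness([1, 0, 1, 0], data)
```

A local search in this setting could, for example, try flipping single bits of the mask and keep changes that raise the score.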
7.
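This paper analyzes the shortcomings of data mining on samples with continuous attributes and proposes an algorithm that mines classification rules directly from such samples. It classifies instances based on split points of attribute values, uses the ability of a split point to classify the instances as the criterion for selecting split points, and takes as its stopping condition that all consistent samples have been partitioned into subsets with identical class attribute values, thereby fully automating classification rule mining for continuous-attribute samples. Taking the goals and requirements of data mining into account, it makes full use of the dependence between attributes and classes and of the complementarity among attributes, achieving few split points, simple classification rules, and attribute reduction. Finally, the algorithm is validated on examples and compared with the C4.5 algorithm.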
8.
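Data mining is the process of extracting hidden knowledge from large amounts of raw data. Most data mining tools use rule discovery and decision tree classification techniques to discover data patterns and rules, and at their core are induction algorithms. Compared with traditional statistical methods, classification results obtained with machine learning techniques are more interpretable. When mining a particular data set, a user or decision maker who lacks the relevant domain knowledge will find it hard to decide which induction algorithm to choose, and therefore needs to try various algorithms. With MLC++, decision makers can easily compare the effectiveness of different classification algorithms on a particular data set and so choose a suitable one. System developers can also use MLC++ to design various hybrid algorithms.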
9.
There exist several methods for binary classification of gene expression data sets. However, in the majority of published methods, little effort has been made to minimize classifier complexity. In view of the small number of samples available in most gene expression data sets, there is a strong motivation for minimizing the number of free parameters that must be fitted to the data. In this paper, a method is introduced for evolving (using an evolutionary algorithm) simple classifiers involving a minimal subset of the available genes. The classifiers obtained by this method perform well, reaching 97% correct classification of clinical outcome on training samples from the breast cancer data set published by van't Veer, and up to 89% correct classification on validation samples from the same data set, easily outperforming previously published results.
10.
Multi-objective optimization has played a major role in solving problems where two or more conflicting objectives need to be simultaneously optimized. This paper presents a Multi-Objective grammar-based genetic programming (MOGGP) system that automatically evolves complete rule induction algorithms, which in turn produce both accurate and compact rule models. The system was compared with a single objective GGP and three other rule induction algorithms. In total, 20 UCI data sets were used to generate and test generic rule induction algorithms, which can be now applied to any classification data set. Experiments showed that, in general, the proposed MOGGP finds rule induction algorithms with competitive predictive accuracies and more compact models than the algorithms it was compared with.
11.
In this article, a new evolutionary algorithm, the Forest Optimization Algorithm (FOA), suitable for continuous nonlinear optimization problems is proposed. It is inspired by the few trees in a forest that can survive for several decades, while other trees live only for a limited period. FOA simulates the seeding procedure of trees, in which some seeds fall just under the trees while others are distributed over wide areas by natural processes and by the animals that feed on the seeds or fruits. Applying the proposed algorithm to some benchmark functions demonstrated its good capability in comparison with the Genetic Algorithm (GA) and Particle Swarm Optimization (PSO). We also tested the performance of FOA on feature weighting as a real optimization problem, and the experimental results showed good performance of FOA on some data sets from the UCI repository.
12.
Evolutionary approaches are among the most accurate types of prototype selection algorithms, which are preprocessing techniques that select a subset of instances from the data before nearest neighbor classification is applied to it. These algorithms achieve very high accuracy and reduction rates, but unfortunately come at a substantial computational cost. In this paper, we introduce a framework that makes efficient use of the intermediary results of the prototype selection algorithms to further increase their accuracy. First, instead of only using the fittest prototype subset generated by the evolutionary algorithm, we use multiple prototype subsets in an ensemble setting. Secondly, in order to classify a test instance, we only use prototype subsets that accurately classify training instances in the neighborhood of that test instance. In an experimental evaluation, we apply our new framework to four state-of-the-art prototype selection algorithms and show that, by using our framework, more accurate results are obtained after fewer evaluations of the prototype selection method. We also present a case study with a prototype generation algorithm, showing that our framework is easily extended to other preprocessing paradigms as well.
13.
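As data mining technology matures, it plays an increasingly important role in everyday life. This paper first introduces data mining, cluster analysis, and classification analysis, and then applies hierarchical clustering to classification rule mining.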
14.
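This paper first proposes PP, an efficient algorithm for mining frequent itemsets. It uses a tree-based representation of pattern support sets, which avoids repeated database scans and the recursive construction of as many pattern support sets as there are frequent patterns; it is 1 to 3 orders of magnitude more efficient than Apriori and FPGrowth. PP is then extended into CRM-PP, an effective algorithm for discovering classification rules. CRM-PP integrates multiple-support pruning into the frequent itemset discovery phase, turning the two-phase mining approach into a single-phase one. CRM-PP is likewise 1 to 3 orders of magnitude more efficient than two-phase algorithms based on Apriori and FPGrowth.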
15.
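How to prevent private information or sensitive knowledge from being disclosed during mining while still obtaining reasonably accurate mining results has become a significant research topic in data mining. This paper surveys representative algorithms in privacy-preserving data mining, organized by data distribution, and describes in detail their data modification methods, data mining algorithms, and data or rule hiding techniques. It analyzes and compares their respective strengths and weaknesses and summarizes the characteristics of each algorithm. In addition, based on this comparison, it proposes evaluation criteria for privacy-preserving data mining algorithms, namely confidentiality, rule utility, algorithmic complexity, and scalability, to support the development of new, effective algorithms in future research.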
16.
Data mining is a powerful method for extracting knowledge from data. Raw data poses various challenges that make traditional methods unsuitable for knowledge extraction, and data mining is expected to handle various data types in all formats. The relevance of this paper is underlined by the fact that data mining is an object of research in many different areas. In this paper, we review previous work on knowledge extraction from medical data. The main idea is to describe key papers and provide some guidelines to help medical practitioners. Medical data mining is a multidisciplinary field with contributions from both medicine and data mining, so previous work should be classified in a way that covers the requirements of users from various fields. For this reason, we have studied papers, published between 1999 and 2013, that aim to extract knowledge from structured medical data. We clarify what medical data mining is and what its main goals are. Each paper is then studied in terms of six medical tasks: screening, diagnosis, treatment, prognosis, monitoring and management. For each task, five data mining approaches are considered: classification, regression, clustering, association and hybrid. At the end of each task, a brief summary and discussion are given. A standard framework based on CRISP-DM is additionally adapted to manage all activities. By way of discussion, current issues and future trends are mentioned. The amount of work published in this area is substantial, and it is impossible to discuss all of it in a single paper. We hope this paper will make it possible to explore previous work and identify interesting areas for future research.
17.
Data mining is an important real-life application for businesses. It is critical to find efficient ways of mining large data sets. In order to benefit from the experience with relational databases, a set-oriented approach to mining data is needed. In such an approach, the data mining operations are expressed in terms of relational or set-oriented operations. Query optimization technology can then be used for efficient processing. In this paper, we describe set-oriented algorithms for mining association rules. Such algorithms imply performing multiple joins and thus may appear to be inherently less efficient than special-purpose algorithms. We develop new algorithms that can be expressed as SQL queries, and discuss optimization of these algorithms. After analytical evaluation, an algorithm named SETM emerges as the algorithm of choice. Algorithm SETM uses only simple database primitives, viz., sorting and merge-scan join. Algorithm SETM is simple, fast, and stable over the range of parameter values. It is easily parallelized and we suggest several additional optimizations. The set-oriented nature of Algorithm SETM makes it possible to develop extensions easily and its performance makes it feasible to build interactive data mining tools for large databases.
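The set-oriented idea can be sketched loosely in Python: patterns are kept as (transaction id, itemset) rows, counted by grouping, filtered by minimum support, and extended by joining back to the transaction table on the transaction id. This is an in-memory illustration under those assumptions, not the paper's SQL formulation; the function name `setm_like` and the toy data are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pattern = Tuple[str, ...]


def setm_like(
    transactions: Dict[int, List[str]], min_support: int
) -> Dict[Pattern, int]:
    """Return frequent itemsets (as sorted tuples) with their support counts."""
    # "Sales" table: one sorted item list per transaction id.
    by_tid = {tid: sorted(set(items)) for tid, items in transactions.items()}
    # Level 1 rows: (tid, 1-item pattern).
    rows = [(tid, (item,)) for tid, items in by_tid.items() for item in items]
    frequent: Dict[Pattern, int] = {}
    while rows:
        # Group by pattern and count (the sort-and-count step).
        counts = Counter(pattern for _, pattern in rows)
        keep = {p for p, c in counts.items() if c >= min_support}
        frequent.update({p: counts[p] for p in keep})
        # Drop rows whose pattern is below the support threshold.
        rows = [(tid, p) for tid, p in rows if p in keep]
        # Join with the sales table on tid, extending each pattern with a
        # lexicographically larger item from the same transaction.
        rows = [
            (tid, p + (item,))
            for tid, p in rows
            for item in by_tid[tid]
            if item > p[-1]
        ]
    return frequent


# Toy transactions (hypothetical).
transactions = {1: ["a", "b", "c"], 2: ["a", "b"], 3: ["a", "c"], 4: ["b", "c"]}
result = setm_like(transactions, min_support=2)
```

In a relational setting the grouping and the join would be expressed as queries, which is what lets a query optimizer take over the processing.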
18.
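With the development of data mining technology, the number of clustering algorithms keeps growing. Data mining places certain typical requirements on clustering algorithms, and verifying whether a clustering algorithm meets these requirements has become a problem that needs to be solved. Because real sample sets are hard to obtain, and many cannot be used for testing clustering algorithms, a tool was designed and implemented. The paper discusses using constructed sample sets to evaluate loaded clustering algorithms and displaying the clustering results.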
19.
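The traditional Apriori association rule algorithm must scan the database many times to generate candidate itemsets, which makes it inefficient. An improved CBA (Classification Based Apriori) algorithm is proposed. The algorithm needs to scan the database only once: after preprocessing the database, it classifies the transaction database and saves the classification results, so that comparisons need not be made against every transaction record. This reduces the number of database scans and the comparison time while still ensuring the completeness and correctness of the mining results.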
20.
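Based on gene expression programming (GEP), a new automatic programming method, the original GEP algorithm is effectively improved by designing the fitness function, optimizing the initialization of the population, adding new genetic operators, and adopting the (λ+μ) elimination strategy from evolution strategies, yielding a new data mining algorithm. The algorithm was tested on data sets from the UCI machine learning repository, and its accuracy was verified by comparison with C4.5 and reference [3].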