1.
In the domain of association rule mining (ARM), discovering rules for numerical attributes is still a challenging issue. Most of the popular approaches for numerical ARM require a priori data discretization to handle the numerical attributes. Moreover, in the process of discovering relations among data, often more than one objective (quality measure) is required, and in most cases such objectives include conflicting measures. In such a situation, it is recommended to obtain the optimal trade-off between objectives. This paper deals with the numerical ARM problem from a multi-objective perspective by proposing a multi-objective particle swarm optimization algorithm (MOPAR) for numerical ARM that discovers numerical association rules (ARs) in a single step. To identify more efficient ARs, several objectives are defined in the proposed multi-objective optimization approach, including confidence, comprehensibility, and interestingness. Finally, the best ARs are extracted using Pareto optimality. To deal with numerical attributes, we use rough values containing lower and upper bounds to represent the intervals of attributes. In the experimental section of the paper, we analyze the effect of the operators used in this study, compare our method to the most popular evolutionary-based proposals for ARM, and present an analysis of the mined ARs. The results show that MOPAR extracts reliable (with confidence values close to 95%), comprehensible, and interesting numerical ARs when attaining the optimal trade-off between confidence, comprehensibility, and interestingness.
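The Pareto-based extraction step mentioned in this abstract can be illustrated with a small sketch. It is not MOPAR itself (the particle swarm search is omitted); it only shows how non-dominated rules could be filtered once each rule has been scored on the three objectives. The `Rule` type, the rule names, and the scores are hypothetical, and all three objectives are assumed to be maximized.

```python
from typing import List, NamedTuple


class Rule(NamedTuple):
    """Hypothetical container: a rule label plus three objective scores."""
    name: str
    confidence: float
    comprehensibility: float
    interestingness: float


def dominates(a: Rule, b: Rule) -> bool:
    """True if `a` is at least as good as `b` on every objective
    and strictly better on at least one (all objectives maximized)."""
    pairs = list(zip(a[1:], b[1:]))
    return all(x >= y for x, y in pairs) and any(x > y for x, y in pairs)


def pareto_front(rules: List[Rule]) -> List[Rule]:
    """Keep only the rules that no other rule dominates."""
    return [r for r in rules if not any(dominates(o, r) for o in rules)]


# Made-up scores, for illustration only.
rules = [
    Rule("r1", 0.95, 0.40, 0.30),
    Rule("r2", 0.90, 0.60, 0.20),
    Rule("r3", 0.85, 0.30, 0.10),  # worse than r1 on every objective
]
front = pareto_front(rules)
print([r.name for r in front])
```

The pairwise filter is quadratic in the number of rules, which is acceptable for a small archive of candidate rules; a real multi-objective optimizer would typically maintain the non-dominated archive incrementally.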
2.
Association rules are one of the most frequently used tools for finding relationships between different attributes in a database. There are various techniques for obtaining these rules, the most common of which are those which give categorical association rules. However, when we need to relate attributes which are numeric and discrete, we turn to methods which generate quantitative association rules, a far less studied method than the above. In addition, when the database is extremely large, many of these tools cannot be used. In this paper, we present an evolutionary tool for finding association rules in databases (both small and large) comprising quantitative and categorical attributes without the need for an a priori discretization of the domain of the numeric attributes. Finally, we evaluate the tool using both real and synthetic databases.
3.
Data mining involves the nontrivial process of extracting knowledge or patterns from large databases. Genetic algorithms are efficient and robust search and optimization methods that are used in data mining. In this paper we propose a Self-Adaptive Migration Model GA (SAMGA), in which the population size, the number of crossover points, and the mutation rate for each population are adaptively fixed. Further, the migration of individuals between populations is decided dynamically. This paper gives a mathematical schema analysis of the method, stating and showing that the algorithm exploits previously discovered knowledge for a more focused and concentrated search of heuristically high-yielding regions while simultaneously performing a highly explorative search on the other regions of the search space. The effective performance of the algorithm is then shown using standard testbed functions and a set of actual classification data mining problems. A Michigan-style classifier was used to build the classifier, and the system was tested with machine learning databases including the Pima Indian Diabetes database, the Wisconsin Breast Cancer database, and a few others. The performance of our algorithm is better than that of the others.
4.
We present a new distributed association rule mining (D-ARM) algorithm that demonstrates superlinear speed-up with the number of computing nodes. The algorithm is the first D-ARM algorithm to perform a single scan over the database. As such, its performance is unmatched by any previous algorithm. Scale-up experiments over standard synthetic benchmarks demonstrate stable run time regardless of the number of computers. Theoretical analysis reveals a tighter bound on error probability than the one shown in the corresponding sequential algorithm. As a result of this tighter bound and by utilizing the combined memory of several computers, the algorithm generates far fewer candidates than comparable sequential algorithms, the same order of magnitude as the optimum.
5.
Data mining techniques can be used to find potentially useful patterns from data and to ease the knowledge acquisition bottleneck in building prototype rule-based systems. Based on the partition methods presented in the simple-fuzzy-partition-based method (SFPBM) proposed by Hu et al. (Comput. Ind. Eng. 43(4) (2002) 735), the aim of this paper is to propose a new fuzzy data mining technique consisting of two phases to find fuzzy if–then rules for classification problems: one to find frequent fuzzy grids by using a pre-specified simple fuzzy partition method to divide each quantitative attribute, and the other to generate fuzzy classification rules from frequent fuzzy grids. To improve the classification performance of the proposed method, we incorporate adaptive rules proposed by Nozaki et al. (IEEE Trans. Fuzzy Syst. 4(3) (1996) 238) into our method to adjust the confidence of each classification rule. Regarding classification generalization ability, the simulation results from the iris data demonstrate that the proposed method may effectively derive fuzzy classification rules from training samples.
6.
A crucial issue related to data mining on time series is that of training period duration. The training horizon used impacts the nature of the rules obtained and their predictability over time. Longer training horizons are generally sought, in order to discern sustained patterns with robust training data performance that extends well into the predictive period. However, in dynamic environments patterns that persist over time may be unavailable, and shorter-term patterns may hold higher predictive ability, albeit with shorter predictive periods. Such potentially useful shorter-term patterns may be lost when the training duration covers much longer periods. Too short a training duration can, of course, be susceptible to over-fitting to noise. We conduct experiments using different training horizons with daily data for the S&P500 index and report the sensitivity of the performance of the obtained rules with respect to the training durations. We show that while the performance of the rules in the training period is important for inducing the “best” rules, it is not indicative of their performance in the test period, and we propose alternative measures that can be used to help identify the appropriate training durations.
7.
Data mining is an important real-life application for businesses. It is critical to find efficient ways of mining large data sets. In order to benefit from the experience with relational databases, a set-oriented approach to mining data is needed. In such an approach, the data mining operations are expressed in terms of relational or set-oriented operations. Query optimization technology can then be used for efficient processing. In this paper, we describe set-oriented algorithms for mining association rules. Such algorithms imply performing multiple joins and thus may appear to be inherently less efficient than special-purpose algorithms. We develop new algorithms that can be expressed as SQL queries, and discuss optimization of these algorithms. After analytical evaluation, an algorithm named SETM emerges as the algorithm of choice. Algorithm SETM uses only simple database primitives, viz., sorting and merge-scan join. Algorithm SETM is simple, fast, and stable over the range of parameter values. It is easily parallelized and we suggest several additional optimizations. The set-oriented nature of Algorithm SETM makes it possible to develop extensions easily and its performance makes it feasible to build interactive data mining tools for large databases.
8.
The process of automatically extracting novel, useful, and ultimately comprehensible information from large databases, known as data mining, has become of great importance due to the ever-increasing amounts of data collected by large organizations. In particular, emphasis is placed on heuristic search methods able to discover patterns that are hard or impossible to detect using standard query mechanisms and classical statistical techniques. In this paper an evolutionary system capable of extracting explicit classification rules is presented. Special attention is given to finding easily interpretable rules that may be used to make crucial decisions. A comparison with the findings achieved by other methods on a real problem, breast cancer diagnosis, is performed.
9.
Biclustering consists of the simultaneous partitioning of the set of samples and the set of their attributes (features) into subsets (classes). Samples and features classified together are supposed to have a high relevance to each other. In this paper we review the most widely used and successful biclustering techniques and their related applications. This survey is written from a theoretical viewpoint, emphasizing the mathematical concepts found in existing biclustering techniques.
10.
Rule learning is one of the most common tasks in knowledge discovery. In this paper, we investigate the induction of fuzzy classification rules for data mining purposes, and propose a hybrid genetic algorithm for learning approximate fuzzy rules. A novel niching method is employed to promote coevolution within the population, which enables the algorithm to discover multiple rules by means of a coevolutionary scheme in a single run. In order to improve the quality of the learned rules, a local search method was devised to perform fine-tuning on the offspring generated by genetic operators in each generation. After the GA terminates, a fuzzy classifier is built by extracting a rule set from the final population. The proposed algorithm was tested on datasets from the UCI repository, and the experimental results verify its validity in learning rule sets and comparative advantage over conventional methods.
11.
This paper presents an evolutionary algorithm for Discriminative Pattern (DP) mining that focuses on high-dimensional data sets. DP mining aims to identify the sets of characteristics that best differentiate a target group from the others (e.g. successful vs. unsuccessful medical treatments). Extracting information from high-dimensional data sets becomes more natural as the volume of data stored in the world increases (30 GB/s on the Internet alone). There are several evolutionary approaches for DP mining, but none focuses on high-dimensional data. We propose an evolutionary approach with features that reduce the cost of memory and processing in the context of high-dimensional data. The new algorithm thus seeks the best (top-k) patterns and hides from the user many parameters common in other evolutionary heuristics, such as population size, mutation and crossover rates, and the number of evaluations. We carried out experiments with real-world high-dimensional data and traditional low-dimensional data. The results showed that the proposed algorithm was superior to other approaches in the literature on high-dimensional data sets and competitive on the traditional data sets.
12.
Prediction of liquefaction is an important subject in geotechnical engineering. It is also a complex problem, as it depends on many different physical factors, and the relations between these factors are highly non-linear and complex. Several approaches have been proposed in the literature for modeling and prediction of liquefaction. Most of these approaches are based on classical statistical approaches and neural networks. In this paper a new approach based on classification data mining is proposed for the first time in the literature for liquefaction prediction. The proposed approach is based on extracting accurate classification rules from neural networks via ant colony optimization. The extracted classification rules are in the form of IF–THEN rules, which can be easily understood by humans. The proposed algorithm is also compared with several other data mining algorithms. It is shown that the proposed algorithm is very effective and accurate in prediction of liquefaction.
13.
The prototype selection problem consists of reducing the size of databases by removing samples that are considered noisy or not influential on nearest neighbour classification tasks. Evolutionary algorithms have been used recently for prototype selection, showing good results. However, due to the complexity of this problem when the size of the databases increases, the behaviour of evolutionary algorithms could deteriorate considerably because of a lack of convergence. This additional problem is known as the scaling up problem. Memetic algorithms are approaches for heuristic searches in optimization problems that combine a population-based algorithm with a local search. In this paper, we propose a model of memetic algorithm that incorporates an ad hoc local search specifically designed for optimizing the properties of the prototype selection problem, with the aim of tackling the scaling up problem. In order to check its performance, we have carried out an empirical study including a comparison between our proposal and previous evolutionary and non-evolutionary approaches studied in the literature. The results have been contrasted with the use of non-parametric statistical procedures and show that our approach outperforms previously studied methods, especially when the database scales up.
14.
In this paper we introduce a method called CL.E.D.M. (CLassification through ELECTRE and Data Mining), which employs aspects of the methodological framework of the ELECTRE I outranking method and aims at increasing the accuracy of existing data mining classification algorithms. In particular, the method chooses the best decision rules extracted from the training process of the data mining classification algorithms, and then assigns the classes that correspond to these rules to the objects that must be classified. Three well-known data mining classification algorithms are tested on five different widely used databases to verify the robustness of the proposed method.
15.
Multi-relational data mining has become popular due to the limitations of propositional problem definition in structured domains and the tendency to store data in relational databases. Several relational knowledge discovery systems have been developed employing various search strategies, heuristics, language pattern limitations, and hypothesis evaluation criteria, in order to cope with an intractably large search space and to be able to generate high-quality patterns. In this work, an ILP-based concept discovery method, namely Confidence-based Concept Discovery (C²D), is described in which strong declarative biases and user-defined specifications are relaxed. Moreover, this new method works directly on relational databases. In addition, a new confidence-based pruning is used in this technique. We also describe how to define and use aggregate predicates as background knowledge in the proposed method. In order to use aggregate predicates, we show how to handle numerical attributes by using comparison operators on them. Finally, we analyze the effect of incorporating unrelated facts for generating transitive rules on the proposed method. A set of experiments is conducted on real-world problems to test the performance of the proposed method.
16.
Many classification tasks can be viewed as ordinal. Use of numeric information usually provides possibilities for more powerful analysis than ordinal data. On the other hand, ordinal data allows more powerful analysis when compared to nominal data. It is therefore important not to overlook knowledge about ordinal dependencies in data sets used in data mining. This paper investigates data mining support available from ordinal data. The effect of considering ordinal dependencies in the data set on the overall results of constructing decision trees and induction rules is illustrated. The degree of improved prediction of ordinal over nominal data is demonstrated. When data was very representative and consistent, use of ordinal information reduced the number of final rules with a lower error rate. Data treatment alternatives are presented to deal with data sets having greater imperfections.
17.
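Existing research on decision point rule mining considers only the attributes of objects external to the process, such as business objects, and ignores the influence that the internal attributes of the business process (the structural relations among decision points) have on the branch selection made at each decision point. Building on an in-depth study of methods for extracting internal process attributes, this paper proposes a decision rule discovery algorithm based on process mining. When mining decision rules, the algorithm considers both the attributes of external objects and the internal attributes of the business process, so that the decision rules of decision points can be mined more comprehensively and accurately. Experimental results show that the algorithm can effectively mine business process decision rules, helping users to better analyze and understand actual business processes.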
18.
In this paper, the problem of discovering association rules between items in a large database of sales transactions is discussed, and a novel algorithm, BitMatrix, is proposed. The proposed algorithm is fundamentally different from the known algorithms Apriori and AprioriTid. Empirical evaluation shows that the algorithm outperforms the known ones for large databases. Scale-up experiments show that the algorithm scales linearly with the number of transactions.
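The abstract does not describe how BitMatrix works internally, so the following is not a reconstruction of it. It is only a generic sketch of one way bit-level representations are commonly used in association rule mining: each item is mapped to a bitmap with one bit per transaction, and the support count of an itemset is the number of set bits in the bitwise AND of its items' bitmaps. The function names and the sample transactions are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def build_bitmaps(transactions: List[Set[str]]) -> Dict[str, int]:
    """Map each item to an integer whose bit `tid` is set
    when transaction number `tid` contains the item."""
    bitmaps: Dict[str, int] = {}
    for tid, items in enumerate(transactions):
        for item in items:
            bitmaps[item] = bitmaps.get(item, 0) | (1 << tid)
    return bitmaps


def support_count(bitmaps: Dict[str, int], itemset: Iterable[str],
                  n_transactions: int) -> int:
    """Count the transactions that contain every item in `itemset`."""
    mask = (1 << n_transactions) - 1  # start with all transactions selected
    for item in itemset:
        mask &= bitmaps.get(item, 0)  # unknown items match no transaction
    return bin(mask).count("1")


# Made-up sales transactions, for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
bitmaps = build_bitmaps(transactions)
print(support_count(bitmaps, {"bread", "milk"}, len(transactions)))
```

With this layout a single pass over the transactions builds the bitmaps, and each support query afterwards is a handful of integer AND operations rather than another scan of the data.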
19.
In data mining applications, it is important to develop evaluation methods for selecting high-quality and profitable rules. This paper utilizes a non-parametric approach, Data Envelopment Analysis (DEA), to estimate and rank the efficiency of association rules with multiple criteria. The interestingness of association rules is conventionally measured based on support and confidence. For specific applications, domain knowledge can be further designed as measures to evaluate the discovered rules. For example, in market basket analysis, the product value and cross-selling profit associated with the association rule can serve as essential measures of rule interestingness. In this paper, these domain measures are also included in the rule ranking procedure for selecting valuable rules for implementation. An example of market basket analysis is used to illustrate the DEA-based methodology for measuring the efficiency of association rules with multiple criteria.
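For readers unfamiliar with the two conventional measures named in this abstract, the sketch below computes support and confidence for a rule of the form antecedent ⇒ consequent using their standard definitions. It does not implement the DEA ranking itself. The function names and the sample baskets are hypothetical, and the transaction list is assumed to be non-empty.

```python
from typing import Iterable, List, Set


def support(transactions: List[Set[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions containing every item in `itemset`.
    Assumes `transactions` is non-empty."""
    wanted = set(itemset)
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)


def confidence(transactions: List[Set[str]], antecedent: Iterable[str],
               consequent: Iterable[str]) -> float:
    """support(antecedent and consequent) / support(antecedent);
    returns 0.0 when the antecedent never occurs."""
    a = set(antecedent)
    s_a = support(transactions, a)
    if s_a == 0:
        return 0.0
    return support(transactions, a | set(consequent)) / s_a


# Made-up market baskets, for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
print(support(baskets, {"bread", "milk"}))
print(confidence(baskets, {"bread"}, {"milk"}))
```

Domain measures such as product value or cross-selling profit would be additional per-rule columns alongside these two scores before any multi-criteria ranking is applied.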
20.
Ant colony optimization (ACO) algorithms have been successfully applied in data classification, where they aim at discovering a list of classification rules. However, due to the essentially random search in ACO algorithms, the lists of classification rules constructed by ACO-based classification algorithms are not fixed and may be distinctly different even when using the same training set. Those differences are generally ignored and some beneficial information cannot be extracted from the different data sets, which may lower the predictive accuracy. To overcome this shortcoming, this paper proposes a novel classification rule discovery algorithm based on ACO, named AntMiner mbc, in which a new model of multiple rule sets is presented to produce multiple lists of rules. Multiple base classifiers are built in AntMiner mbc, and each base classifier is expected to remedy the weakness of other base classifiers, which can improve the predictive accuracy by exploiting the useful information from various base classifiers. A new heuristic function for ACO is also designed in our algorithm, which considers both correlation and coverage in order to avoid deceptively high accuracy. The performance of our algorithm is studied experimentally on 19 publicly available data sets and further compared to several state-of-the-art classification approaches. The experimental results show that the predictive accuracy obtained by our algorithm is statistically higher than that of the compared targets.