1.
《Expert systems with applications》2014,41(5):2259-2268
Business rules are an effective way to control data quality. Business experts can directly enter the rules into appropriate software without error-prone communication with programmers. However, not all business situations and possible data quality problems can be considered in advance. In situations where business rules have not been defined yet, patterns of data handling may arise in practice. We apply data mining to accounting transactions in order to discover such patterns. The discovered patterns are represented in the form of association rules. Then, deviations from discovered patterns can be marked as potential data quality violations that need to be examined by humans. Data quality breaches can be expensive, but manual examination of many transactions is also expensive. Therefore, the goal is to find a balance between marking too many and too few transactions as being potentially erroneous. We apply appropriate procedures to evaluate the classification accuracy of developed association rules and support the decision on the number of deviations to be manually examined based on economic principles.
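The idea of marking deviations from a discovered pattern can be sketched in a few lines of Python. This is only an illustration, not the procedure from the paper: the function `flag_deviations`, the field names `account` and `tax_code`, and the sample records are all hypothetical.

```python
def flag_deviations(records, antecedent, consequent):
    """Return indices of records that match `antecedent` but violate `consequent`.

    `antecedent` and `consequent` are dicts of field -> expected value
    (a hypothetical encoding of one association rule).
    """
    flagged = []
    for i, rec in enumerate(records):
        matches = all(rec.get(k) == v for k, v in antecedent.items())
        conforms = all(rec.get(k) == v for k, v in consequent.items())
        if matches and not conforms:
            flagged.append(i)
    return flagged


# Hypothetical accounting transactions; the rule "account=travel -> tax_code=T1"
# is assumed to have been discovered by a mining step that is not shown here.
records = [
    {"account": "travel", "tax_code": "T1"},
    {"account": "travel", "tax_code": "T1"},
    {"account": "travel", "tax_code": "T9"},
    {"account": "rent", "tax_code": "T0"},
]
suspicious = flag_deviations(records, {"account": "travel"}, {"tax_code": "T1"})
```

Only the third record is returned for manual review; records that do not match the antecedent are ignored rather than flagged.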
2.
The number of ontologies and semantic annotations available on the Web is constantly growing. This new type of complex and heterogeneous graph-structured data raises new challenges for the data mining community. In this paper, we present a novel method for mining association rules from semantic instance data repositories expressed in RDF/(S) and OWL. We take advantage of the schema-level (i.e., Tbox) knowledge encoded in the ontology to derive appropriate transactions, which will later feed traditional association rule algorithms. This process is guided by the analyst's requirements, expressed in the form of query patterns. Initial experiments performed on semantic data of a biomedical application show the usefulness and efficiency of the approach.
3.
Alfonso Iodice D’Enza Michael Greenacre 《Computational statistics & data analysis》2008,52(6):3269-3281
Association rules (AR) represent one of the most powerful and widely used approaches to detect the presence of regularities and paths in large databases. Rules express the relations (in terms of co-occurrence) between pairs of items and are characterized by two measures: support and confidence. Most techniques for finding AR scan the whole data set, evaluate all possible rules and retain only rules whose support and confidence exceed given thresholds, which should be set so that trivial rules are not the only ones retained and interesting rules are not discarded. A multistep approach aims at identifying potentially interesting items by exploiting well-known techniques of multidimensional data analysis. In particular, interesting pairs of items have a well-defined degree of association: an item pair is well defined if its degree of co-occurrence is very high with respect to one or more subsets of the considered set of transactions.
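The two measures named in this abstract, support and confidence, can be illustrated with a minimal Python sketch. It uses the standard textbook definitions rather than anything specific to the paper; the function names and the sample baskets are assumptions made for the example.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from co-occurrence counts."""
    s_antecedent = support(antecedent, transactions)
    if s_antecedent == 0:
        return 0.0  # rule never applies; report 0 instead of dividing by zero
    both = frozenset(antecedent) | frozenset(consequent)
    return support(both, transactions) / s_antecedent


# Hypothetical market baskets used only for illustration.
transactions = [
    frozenset({"milk", "cookies"}),
    frozenset({"milk", "bread"}),
    frozenset({"milk", "cookies", "bread"}),
    frozenset({"bread"}),
]
s = support({"milk", "cookies"}, transactions)       # 2 of 4 baskets
c = confidence({"milk"}, {"cookies"}, transactions)  # 2 of the 3 milk baskets
```

A rule is typically retained only when both values exceed user-chosen thresholds, which is the filtering step the abstract refers to.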
4.
5.
It is well recognized that the main factor hindering the application of Association Rules (ARs) is the huge number of ARs returned by the mining process. In this paper, we propose an effective solution that presents concise mining results by eliminating the redundancy in the set of ARs. We adopt the concept of δ-tolerance to define the set of δ-Tolerance ARs (δ-TARs), which is a concise representation for the set of ARs. The notion of δ-tolerance is a relaxation of the closure defined on the support of frequent itemsets, thus allowing us to effectively prune the redundant ARs. We devise a set of inference rules, with which we prove that the set of δ-TARs is a non-redundant representation of ARs. In addition, we prove that the set of ARs that is derived from the δ-TARs by the inference rules is sound and complete. We also develop a compact tree structure called the δ-TAR tree, which facilitates the efficient generation of the δ-TARs and derivation of other ARs. Experimental results verify the efficiency of using the δ-TAR tree to generate the δ-TARs and to query the ARs. The set of δ-TARs is shown to be significantly smaller than the state-of-the-art concise representations of ARs. In addition, the approximations of the support and confidence of the ARs derived from the δ-TARs are highly accurate.
6.
Mining association rules is widely studied in the data mining community. In this paper, we analyze the support–confidence framework for mining association rules and find that it tends to mine many redundant or unrelated rules besides the interesting ones. To improve the criterion, we propose a new measure, match, as a substitute for confidence. We analyze in detail the properties of the proposed measure. Experimental results show that the rules generated by the improved method reveal a high correlation between the antecedent and the consequent when compared with those produced by the support–confidence framework. Furthermore, the improved method decreases the generation of redundant rules.
7.
Association rules (AR) represent a consolidated tool in data mining applications, as they are able to discover regularities in large data sets. The information mined by the rules is very often difficult to exploit because too many associations are present, among which the truly relevant logical implications must be detected. In this framework, AR post-analysis tools are proposed that combine methodological and graphical pruning techniques. The methodological techniques ensure the statistical significance of the AR that are not pruned, while the graphical ones provide interactive and powerful visualization tools.
8.
Emerging applications introduce the requirement for novel association-rule mining algorithms that will be scalable not only with respect to the number of records (number of rows) but also with respect to the domain's size (number of columns). In this paper, we focus on the cases where the items of a large domain correlate with each other in a way that small worlds are formed, that is, the domain is clustered into groups with a large number of intra-group and a small number of inter-group correlations. This property appears in several real-world cases, e.g., in bioinformatics, e-commerce applications, and bibliographic analysis, and can help to significantly prune the search space so as to perform efficient association-rule mining. We develop an algorithm that partitions the domain of items according to their correlations and we describe a mining algorithm that carefully combines partitions to improve the efficiency. Our experiments show the superiority of the proposed method over existing algorithms, and that it overcomes the problems (e.g., increase in CPU cost and possible I/O thrashing) caused by existing algorithms due to the combination of a large domain and a large number of records.
9.
Chun-Hao Chen Guo-Cheng Lan Tzung-Pei Hong Yui-Kai Lin 《Expert systems with applications》2013,40(16):6531-6537
Data mining has been studied for a long time. Its goal is to help market managers find relationships among items from large databases and thus increase sales volume. Association-rule mining is one of the well-known and commonly used techniques for this purpose. The Apriori algorithm is an important method for such a task. Based on the Apriori algorithm, many mining approaches have been proposed for diverse applications. Many of these data mining approaches focus on positive association rules such as “if milk is bought, then cookies are bought”. Such rules may, however, be misleading, since there may be customers who buy milk but do not buy cookies. This paper thus takes the properties of propositional logic into consideration and proposes an algorithm for mining highly coherent rules. The derived association rules are expected to be more meaningful and reliable for business. Experiments on two datasets are also conducted to show the performance of the proposed approach.
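For readers unfamiliar with the Apriori algorithm mentioned in this abstract, the following is a minimal Python sketch of its level-wise search for frequent itemsets. It is a generic illustration and not the coherent-rule algorithm proposed in the paper; the function name `apriori`, the baskets, and the 0.5 threshold are assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    tx = [frozenset(t) for t in transactions]
    n = len(tx)
    # Level 1 candidates: every single item that occurs in the data.
    current = {frozenset([item]) for t in tx for item in t}
    frequent = {}
    k = 1
    while current:
        # Count each candidate and keep only the frequent ones at this level.
        level = {}
        for cand in current:
            sup = sum(1 for t in tx if cand <= t) / n
            if sup >= min_support:
                level[cand] = sup
        frequent.update(level)
        k += 1
        # Join step: merge frequent itemsets into candidates of size k.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Hypothetical market baskets used only for illustration.
baskets = [
    {"milk", "cookies"},
    {"milk", "bread"},
    {"milk", "cookies", "bread"},
    {"bread"},
]
frequent = apriori(baskets, min_support=0.5)
```

With these baskets the pair {cookies, bread} falls below the threshold, so the three-item candidate is pruned without being counted, which is the property that makes the level-wise search practical.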
10.
The basic goal of scene understanding is to organize the video into sets of events and to find the associated temporal dependencies. Such systems aim to automatically interpret activities in the scene, as well as detect unusual events that could be of particular interest, such as traffic violations and unauthorized entry. The objective of this work, therefore, is to learn behaviors of multi-agent actions and interactions in a semi-supervised manner. Using tracked object trajectories, we organize similar motion trajectories into clusters using the spectral clustering technique. This set of clusters depicts the different paths/routes, i.e., the distinct events taking place at various locations in the scene. A temporal mining algorithm is used to mine interval-based frequent temporal patterns occurring in the scene. A temporal pattern indicates a set of events that are linked based on their relationship with other events in the set, and we use Allen's interval-based temporal logic to describe these relations. The resulting frequent patterns are used to generate temporal association rules, which convey the semantic information contained in the scene. Our overall aim is to generate rules that govern the dynamics of the scene and perform anomaly detection. We apply the proposed approach to two publicly available complex traffic datasets and demonstrate considerable improvements over the existing techniques.
11.
12.
Mining fuzzy association rules for classification problems (total citations: 3; self-citations: 0; citations by others: 3)
The effective development of data mining techniques for the discovery of knowledge from training samples for classification problems in industrial engineering is necessary in applications such as group technology. This paper proposes a learning algorithm, which can be viewed as a knowledge acquisition tool, to effectively discover fuzzy association rules for classification problems. The consequent part of each rule is one class label. The proposed learning algorithm consists of two phases: one to generate large fuzzy grids from training samples by fuzzy partitioning in each attribute, and the other to generate fuzzy association rules for classification problems from large fuzzy grids. The proposed learning algorithm is implemented by scanning training samples stored in a database only once and applying a sequence of Boolean operations to generate fuzzy grids and fuzzy rules; therefore, it can be easily extended to discover other types of fuzzy association rules. The simulation results from the iris data demonstrate that the proposed learning algorithm can effectively derive fuzzy association rules for classification problems.
13.
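Constrained association mining is association mining in which items or itemsets are restricted by one or more user-specified conditions; it is an important type of association mining with many practical applications. Because most existing algorithms handle only a single type of constraint, a multi-constraint association mining algorithm is proposed. The algorithm is based on FP-growth and builds conditional databases for itemsets. Exploiting the properties of non-monotone and monotone constraints, it applies several pruning strategies to locate constraint points quickly. Experiments show that the algorithm effectively mines association rules under multiple constraints and scales well.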
14.
Yue Xu Yuefeng Li Gavin Shaw 《Data & Knowledge Engineering》2011,70(6):555-575
Association rule mining has contributed to many advances in the area of knowledge discovery. However, the quality of the discovered association rules is a big concern and has drawn more and more attention recently. One problem with the quality of the discovered association rules is the huge size of the extracted rule set. Often for a dataset, a huge number of rules can be extracted, but many of them can be redundant to other rules and thus useless in practice. Mining non-redundant rules is a promising approach to solve this problem. In this paper, we first propose a definition for redundancy, then propose a concise representation, called a Reliable basis, for representing non-redundant association rules. The Reliable basis contains a set of non-redundant rules which are derived using frequent closed itemsets and their generators instead of using frequent itemsets that are usually used by traditional association rule mining approaches. An important contribution of this paper is that we propose to use the certainty factor as the criterion to measure the strength of the discovered association rules. Using this criterion, we can ensure the elimination of as many redundant rules as possible without reducing the inference capacity of the remaining extracted non-redundant rules. We prove that the redundancy elimination, based on the proposed Reliable basis, does not reduce the strength of belief in the extracted rules. We also prove that all association rules, their supports and confidences, can be retrieved from the Reliable basis without accessing the dataset. Therefore the Reliable basis is a lossless representation of association rules. Experimental results show that the proposed Reliable basis can significantly reduce the number of extracted rules. We also conduct experiments on the application of association rules to the area of product recommendation. 
The experimental results show that the non-redundant association rules extracted using the proposed method retain the same inference capacity as the entire rule set. This result indicates that the non-redundant rules alone are sufficient to solve real problems, without needing the entire rule set.
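The abstract names the certainty factor as its measure of rule strength but does not define it. The Python sketch below uses one commonly cited formulation, which compares the rule's confidence with the prior support of the consequent; whether the paper uses exactly this form is an assumption, and the function name and sample values are hypothetical.

```python
def certainty_factor(conf, supp_consequent):
    """Certainty factor of a rule X -> Y under one commonly cited definition.

    conf            -- confidence of X -> Y
    supp_consequent -- support of Y on its own
    The result lies in [-1, 1]: positive when X raises belief in Y,
    negative when X lowers it, and 0 when X carries no information.
    """
    if conf > supp_consequent:
        return (conf - supp_consequent) / (1.0 - supp_consequent)
    if conf < supp_consequent:
        return (conf - supp_consequent) / supp_consequent
    return 0.0


cf_positive = certainty_factor(0.8, 0.5)   # confidence above the prior
cf_negative = certainty_factor(0.25, 0.5)  # confidence below the prior
cf_neutral = certainty_factor(0.5, 0.5)    # confidence equals the prior
```

Unlike raw confidence, this value is zero for a rule whose consequent is simply frequent everywhere, which is one reason such a measure is used to rank rule strength.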
15.
Daniel Sánchez José María Serrano Ignacio Blanco Maria Jose Martín-Bautista María-Amparo Vila 《Data mining and knowledge discovery》2008,16(3):313-348
In this paper we deal with the problem of mining for approximate dependencies (AD) in relational databases. We introduce a definition of AD based on the concept of association rule, by means of suitable definitions of the concepts of item and transaction. This definition allows us to measure both the accuracy and support of an AD. We provide an interpretation of the new measures based on the complexity of the theory (set of rules) that describes the dependence, and we employ this interpretation to compare the new measures with existing ones. A methodology to adapt existing association rule mining algorithms to the task of discovering ADs is introduced. The adapted algorithms obtain the set of ADs that hold in a relation with accuracy and support greater than user-defined thresholds. The experiments we have performed show that our approach performs reasonably well over large databases with real-world data.
16.
Mining association rules using inverted hashing and pruning (total citations: 2; self-citations: 0; citations by others: 2)
John D. Holt Soon M. Chung 《Information Processing Letters》2002,83(4):211-220
In this paper, we propose a new algorithm named Inverted Hashing and Pruning (IHP) for mining association rules between items in transaction databases. The performance of the IHP algorithm was evaluated for various cases and compared with those of two well-known mining algorithms, the Apriori algorithm [Proc. 20th VLDB Conf., 1994, pp. 487-499] and the Direct Hashing and Pruning algorithm [IEEE Trans. on Knowledge Data Engrg. 9 (5) (1997) 813-825]. It has been shown that the IHP algorithm has better performance for databases with long transactions.
17.
The integration of data mining techniques with data warehousing is gaining popularity because the two disciplines complement each other in extracting knowledge from large datasets. However, the majority of approaches focus on applying data mining as a front-end technology to mine data warehouses. Surprisingly, little progress has been made in incorporating mining techniques in the design of data warehouses. While methods such as data clustering applied on multidimensional data have been shown to enhance the knowledge discovery process, a number of fundamental issues remain unresolved with respect to the design of multidimensional schema. These relate to automated support for the selection of informative dimension and fact variables in high-dimensional and data-intensive environments, an activity which may challenge the capabilities of human designers on account of the sheer scale of data volume and variables involved. In this research, we propose a methodology that selects a subset of informative dimension and fact variables from an initial set of candidates. Our experiments, conducted on three real-world datasets taken from the UCI machine learning repository, show that the knowledge discovered from the schema that we generated was more diverse and informative than that obtained by the standard approach of mining the original data without our multidimensional structure imposed on it.
18.
An increasing number of data applications, such as weather data monitoring, data streaming, web data logs, and cloud data, are going online and are playing a vital role in our everyday life. The underlying data of such applications change very frequently, especially in the cloud environment. Many interesting events can be detected by discovering such data from different distributed sources and analyzing it for specific purposes (e.g., car accident detection or market analysis). However, several isolated events could be erroneous due to the fact that important data sets are either discarded or improperly analyzed as they contain missing data. Such events therefore need to be monitored globally and be detected jointly in order to understand their patterns and correlated relationships. In the context of current cloud computing infrastructure, no solutions exist for enabling the correlations between multi-source events in the presence of missing data. This paper addresses the problem of capturing the underlying latent structure of the data with missing entries based on association rules. This necessitates factorizing the data set with missing data. The paper proposes a novel model to handle large amounts of data in a cloud environment. It is a model of aggregated data that are confidences of association rules. We first propose a method to discover the association rules locally on each node of a cloud in the presence of missing rules. Afterward, we provide a tensor-based model to perform a global correlation between all the local models of each node of the network. The proposed approach, based on tensor decomposition, deals with a multi-modal network where missing association rules are detected and their confidences are approximated. The approach is scalable in terms of factorizing multi-way arrays (i.e., tensors) in the presence of missing association rules. It is validated through experimental results which show its significance and viability in terms of detecting missing rules.
19.
In order to allow for the analysis of data sets including numerical attributes, several generalizations of association rule mining based on fuzzy sets have been proposed in the literature. While the formal specification of fuzzy associations is more or less straightforward, the assessment of such rules by means of appropriate quality measures is less obvious. Particularly, it assumes an understanding of the semantic meaning of a fuzzy rule. This aspect has been ignored by most existing proposals, which must therefore be considered as ad-hoc to some extent. In this paper, we develop a systematic approach to the assessment of fuzzy association rules. To this end, we proceed from the idea of partitioning the data stored in a database into examples of a given rule, counterexamples, and irrelevant data. Evaluation measures are then derived from the cardinalities of the corresponding subsets. The problem of finding a proper partition has a rather obvious solution for standard association rules but becomes less trivial in the fuzzy case. Our results not only provide a sound justification for commonly used measures but also suggest a means for constructing meaningful alternatives.
Henri Prade
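The partition into examples, counterexamples, and irrelevant data described in this abstract can be made concrete with a small Python sketch. It assumes the product is used to combine membership degrees, which is only one of several possible choices and is not necessarily the one the paper recommends; the function names and the sample membership pairs are hypothetical.

```python
def partition_degrees(a, b):
    """Split one record's weight across the three classes of the rule A -> B.

    a -- membership degree of the record in the antecedent A (0..1)
    b -- membership degree of the record in the consequent B (0..1)
    With the product operator (an assumption) the three parts sum to 1.
    """
    example = a * b
    counterexample = a * (1.0 - b)
    irrelevant = 1.0 - a
    return example, counterexample, irrelevant


def fuzzy_support_confidence(memberships):
    """Derive support and confidence from the fuzzy cardinalities of the parts."""
    parts = [partition_degrees(a, b) for a, b in memberships]
    n_examples = sum(p[0] for p in parts)
    n_counter = sum(p[1] for p in parts)
    support = n_examples / len(memberships)
    relevant = n_examples + n_counter
    confidence = n_examples / relevant if relevant > 0 else 0.0
    return support, confidence


# Hypothetical (antecedent, consequent) membership pairs for four records.
memberships = [(1.0, 1.0), (0.5, 0.5), (0.0, 1.0), (1.0, 0.0)]
sup, conf = fuzzy_support_confidence(memberships)
```

When every membership degree is 0 or 1, the same code reduces to the usual crisp counts, which matches the abstract's remark that the partition is obvious for standard association rules.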
20.
Elicitation of classification rules by fuzzy data mining (total citations: 1; self-citations: 0; citations by others: 1)
Yi-Chung Hu Gwo-Hshiung Tzeng 《Engineering Applications of Artificial Intelligence》2003,16(7-8):709-716
Data mining techniques can be used to find potentially useful patterns from data and to ease the knowledge acquisition bottleneck in building prototype rule-based systems. Based on the partition methods presented in the simple-fuzzy-partition-based method (SFPBM) proposed by Hu et al. (Comput. Ind. Eng. 43(4) (2002) 735), the aim of this paper is to propose a new fuzzy data mining technique consisting of two phases to find fuzzy if–then rules for classification problems: one to find frequent fuzzy grids by using a pre-specified simple fuzzy partition method to divide each quantitative attribute, and the other to generate fuzzy classification rules from frequent fuzzy grids. To improve the classification performance of the proposed method, we specifically incorporate adaptive rules proposed by Nozaki et al. (IEEE Trans. Fuzzy Syst. 4(3) (1996) 238) into our methods to adjust the confidence of each classification rule. In terms of classification generalization ability, the simulation results from the iris data demonstrate that the proposed method may effectively derive fuzzy classification rules from training samples.