Similar literature
1.
2.
Finding interesting patterns using user expectations   Total citations: 6, self-citations: 0, citations by others: 6
One of the major problems in the field of knowledge discovery (or data mining) is the interestingness problem. Past research and applications have found that, in practice, it is all too easy to discover a huge number of patterns in a database. Most of these patterns are actually useless or uninteresting to the user. But due to the huge number of patterns, it is difficult for the user to comprehend them and to identify those interesting to him/her. To prevent the user from being overwhelmed by the large number of patterns, techniques are needed to rank them according to their interestingness. In this paper, we propose such a technique, called the user-expectation method. In this technique, the user is first asked to provide his/her expected patterns according to his/her past knowledge or intuitive feelings. Given these expectations, the system uses a fuzzy matching technique to match the discovered patterns against the user's expectations, and then ranks the discovered patterns according to the matching results. A variety of rankings can be performed for different purposes, such as to confirm the user's knowledge and to identify unexpected patterns, which are by definition interesting. The proposed technique is general and interactive.

3.
Discovering knowledge from data means finding useful patterns in data; this process has increased both the opportunities and the challenges for businesses in the big data era. Meanwhile, improving the quality of the discovered knowledge is important for making correct decisions in an unpredictable environment. Various models have been developed in the past; however, few used both data quality and prior knowledge to control the quality of the discovery processes and results. In this paper, a multi-objective model of knowledge discovery in databases is developed, which aids the discovery process by utilizing prior process knowledge and different measures of data quality. To illustrate the model, association rule mining is considered and formulated as a multi-objective problem, rather than a single-objective problem, that takes into account data quality measures and prior process knowledge. Measures such as confidence, support, comprehensibility and interestingness are used. A Pareto-based integrated multi-objective Artificial Bee Colony (IMOABC) algorithm is developed to solve the problem. Using well-known and publicly available databases, experiments are carried out to compare the performance of IMOABC with the NSGA-II, MOPSO and Apriori algorithms. The computational results show that IMOABC outperforms NSGA-II, MOPSO and Apriori on different measures, and that it can easily be customized or tailored to user requirements while still generating high-quality association rules.

4.
Numerous interestingness measures have been proposed in statistics and data mining to assess object relationships. This is especially important in recent studies of association or correlation pattern mining. However, it is still not clear whether there is any intrinsic relationship among many proposed measures, and which one is truly effective at gauging object relationships in large data sets. Recent studies have identified a critical property, null-(transaction) invariance, for measuring associations among events in large data sets, but many measures do not have this property. In this study, we re-examine a set of null-invariant interestingness measures and find that they can be expressed as the generalized mathematical mean, leading to a total ordering of them. Such a unified framework provides insights into the underlying philosophy of the measures and helps us understand and select the proper measure for different applications. Moreover, we propose a new measure called Imbalance Ratio to gauge the degree of skewness of a data set. We also discuss the efficient computation of interesting patterns of different null-invariant interestingness measures by proposing an algorithm, GAMiner, which complements previous studies. Experimental evaluation verifies the effectiveness of the unified framework and shows that GAMiner speeds up the state-of-the-art algorithm by an order of magnitude.
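As a rough illustration of the generalized-mean view described in this abstract, the sketch below computes four commonly cited null-invariant measures as power means of the two conditional probabilities P(B|A) and P(A|B). The function names, the choice of these four measures, and the Imbalance Ratio formula used here are assumptions based on standard textbook definitions, not details taken from the abstract itself.

```python
import math


def generalized_mean(a, b, k):
    """Power mean of two non-negative numbers with exponent k."""
    if k == float("-inf"):
        return min(a, b)
    if k == float("inf"):
        return max(a, b)
    if k == 0:
        # Limit of the power mean as k -> 0 is the geometric mean.
        return math.sqrt(a * b)
    return ((a ** k + b ** k) / 2) ** (1 / k)


def null_invariant_measures(sup_a, sup_b, sup_ab):
    """Measures built only from sup(A), sup(B), sup(AB).

    None of them uses the total number of transactions, so adding
    transactions that contain neither A nor B leaves them unchanged.
    """
    p_b_given_a = sup_ab / sup_a
    p_a_given_b = sup_ab / sup_b
    return {
        "all_confidence": generalized_mean(p_b_given_a, p_a_given_b, float("-inf")),
        "cosine": generalized_mean(p_b_given_a, p_a_given_b, 0),
        "kulczynski": generalized_mean(p_b_given_a, p_a_given_b, 1),
        "max_confidence": generalized_mean(p_b_given_a, p_a_given_b, float("inf")),
    }


def imbalance_ratio(sup_a, sup_b, sup_ab):
    """Assumed definition: |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(AB))."""
    return abs(sup_a - sup_b) / (sup_a + sup_b - sup_ab)


if __name__ == "__main__":
    # Hypothetical counts: A occurs 1000 times, B 100 times, both together 100 times.
    print(null_invariant_measures(1000, 100, 100))
    print(imbalance_ratio(1000, 100, 100))
```

Because the power mean is non-decreasing in its exponent, the four values always come out in the order all-confidence, cosine, Kulczynski, max-confidence, which is the kind of total ordering the abstract refers to.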

5.
Market basket analysis is one of the typical applications of association rule mining. The valuable information discovered through data mining can be used to support decision making. Generally, the objective measures support and confidence are used to evaluate the interestingness of association rules. However, in some cases, rules discovered with these two measures may be neither profitable nor actionable (that is, not interesting) to enterprises. Therefore, how to discover patterns by considering both objective measures (e.g. probability) and subjective measures (e.g. profit) is a challenge in data mining, particularly in marketing applications. This paper focuses on pattern evaluation in the process of knowledge discovery by using the concept of profit mining. Data Envelopment Analysis is utilized to calculate the efficiency of discovered association rules with multiple objective and subjective measures. After the efficiency of the association rules has been evaluated, they are categorized into two classes, relatively efficient (interesting) and relatively inefficient (uninteresting). To distinguish these two classes, a Decision Tree (DT)-based classifier is built by using the attributes of association rules. The DT classifier can be used to find out the characteristics of interesting association rules, and to classify unknown (new) association rules.
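For readers unfamiliar with the two objective measures named in this abstract, the following minimal sketch computes support and confidence for a rule over a small set of baskets. The basket contents and function names are hypothetical, and the sketch does not attempt the Data Envelopment Analysis or decision-tree steps that the paper describes.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = support(transactions, set(antecedent) | set(consequent))
    ante = support(transactions, antecedent)
    return both / ante if ante > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical market baskets, for illustration only.
    baskets = [
        {"bread", "milk"},
        {"bread", "butter"},
        {"bread", "milk", "butter"},
        {"milk"},
    ]
    print(support(baskets, {"bread", "milk"}))
    print(confidence(baskets, {"bread"}, {"milk"}))
```

A rule can score well on both measures and still involve low-margin items, which is the gap that the profit-based evaluation in the abstract is meant to address.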

6.
The analysis of research data plays a key role in data-driven areas of science. Varieties of mixed research data sets exist and scientists aim to derive or validate hypotheses to find undiscovered knowledge. Many analysis techniques identify relations of an entire dataset only. This may level the characteristic behavior of different subgroups in the data. Like automatic subspace clustering, we aim at identifying interesting subgroups and attribute sets. We present a visual-interactive system that supports scientists in exploring interesting relations between aggregated bins of multivariate attributes in mixed data sets. The abstraction of data to bins enables the application of statistical dependency tests as the measure of interestingness. An overview matrix view shows all attributes, ranked with respect to the interestingness of bins. Complementarily, a node-link view reveals multivariate bin relations by positioning dependent bins close to each other. The system supports information drill-down based on both expert knowledge and algorithmic support. Finally, visual-interactive subset clustering assigns multivariate bin relations to groups. A list-based cluster result representation enables the scientist to communicate multivariate findings at a glance. We demonstrate the applicability of the system with two case studies from the earth observation domain and the prostate cancer research domain. In both cases, the system enabled us to identify the most interesting multivariate bin relations, to validate already published results, and, moreover, to discover unexpected relations.

7.
We present a new technique for interactively mining patterns and generating explanations by harnessing the expertise of domain experts. Key to the approach is the distinction between what is unexpected from the perspective of the computational data mining process and what is surprising to the domain experts and interesting relative to their needs. We demonstrate the potential of the approach for discovering patterns and generating rich explanations in a clinical domain. Discovering interesting facts in clinical data is a grand challenge, because medical practitioners and clinicians generally have exceptional knowledge in the problem domain in which they work; however, this knowledge is typically difficult to isolate computationally. To identify the desired surprising patterns, we formally record user knowledge and use that knowledge to filter and constrain the output from an objective data mining technique, with the user making the final judgement about whether a rule is surprising. Specifically, we introduce an unexpectedness algorithm based on association rule mining and Bayesian Networks and a ?-explanations technique for explanation generation to identify unexpected patterns. An implemented prototype is successfully demonstrated using a large clinical database recording incidence, prevalence, and outcome of dialysis and kidney transplant patients.

8.
9.
What makes patterns interesting in knowledge discovery systems   Total citations: 6, self-citations: 0, citations by others: 6
One of the central problems in the field of knowledge discovery is the development of good measures of interestingness of discovered patterns. Such measures of interestingness are divided into objective measures (those that depend only on the structure of a pattern and the underlying data used in the discovery process) and subjective measures (those that also depend on the class of users who examine the pattern). The focus of the paper is on studying subjective measures of interestingness. These measures are classified into actionable and unexpected, and the relationship between them is examined. The unexpected measure of interestingness is defined in terms of the belief system that the user has. Interestingness of a pattern is expressed in terms of how it affects the belief system. The paper also discusses how this unexpected measure of interestingness can be used in the discovery process.

10.
Polygons provide natural representations for many types of geospatial objects, such as countries, buildings, and pollution hotspots. Thus, polygon-based data mining techniques are particularly useful for mining geospatial datasets. In this paper, we propose a polygon-based clustering and analysis framework for mining multiple geospatial datasets that have inherently hidden relations. In this framework, polygons are first generated from multiple geospatial point datasets by using a density-based contouring algorithm called DCONTOUR. Next, a density-based clustering algorithm called Poly-SNN with novel dissimilarity functions is employed to cluster polygons to create meta-clusters of polygons. Finally, post-processing analysis techniques are proposed to extract interesting patterns and user-guided summarized knowledge from meta-clusters. These techniques employ plug-in reward functions that capture a domain expert’s notion of interestingness to guide the extraction of knowledge from meta-clusters. The effectiveness of our framework is tested in a real-world case study involving ozone pollution events in Texas. The experimental results show that our framework can reveal interesting relationships between different ozone hotspots represented by polygons; it can also identify interesting hidden relations between ozone hotspots and several meteorological variables, such as outdoor temperature, solar radiation, and wind speed.

11.
A number of studies, theoretical, empirical, or both, have been conducted to provide insight into the properties and behavior of interestingness measures for association rule mining. While each has value in its own right, most are either limited in scope or, more importantly, ignore the purpose for which interestingness measures are intended, namely the ultimate ranking of discovered association rules. This paper, therefore, focuses on an analysis of the rule-ranking behavior of 61 well-known interestingness measures tested on the rules generated from 110 different datasets. By clustering based on ranking behavior, we highlight, and formally prove, previously unreported equivalences among interestingness measures. We also show that there appear to be distinct clusters of interestingness measures, but that there remain differences among clusters, confirming that domain knowledge is essential to the selection of an appropriate interestingness measure for a particular task and business objective.

12.
Recent research has shown that association rules are useful in gene expression data analysis. Interestingness measures play an important role in association rule mining on gene expression data, which are characterized by small sample size, high dimensionality, and noise. This work introduces two interestingness measures by exploring prior knowledge contained in open biological databases. They are Max-Pathway-Distance (MaxPD), which explores the relatedness of genes in Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways, and Max-Chromosomal-Distance (MaxCD), which makes use of the distance among genes in the chromosome. The properties of the proposed interestingness measures are also explored in order to mine the interesting rules efficiently. Experimental results on four real-life gene expression datasets show the effectiveness of MaxPD and MaxCD in both classification accuracy and biological interpretability.

13.
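Data mining is the discovery of hidden structures and patterns in data. However, many of the discovered patterns may already be known to the user, which makes them meaningless and uninteresting. The literature mostly emphasizes the accuracy and comprehensibility of classification rules, but discovering interesting rules remains a daunting challenge for data mining algorithms. This paper adopts a genetic data mining method that measures the interestingness of classification rules while they are being generated, and thus produces interesting rules directly. Experiments show that the method is feasible and efficient.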

14.
Recent research has highlighted the practical benefits of subjective interestingness measures, which quantify the novelty or unexpectedness of a pattern when contrasted with any prior information of the data miner (Silberschatz and Tuzhilin, Proceedings of the 1st ACM SIGKDD international conference on Knowledge discovery and data mining (KDD95), 1995; Geng and Hamilton, ACM Comput Surv 38(3):9, 2006). A key challenge here is the formalization of this prior information in a way that lends itself to the definition of a subjective interestingness measure that is both meaningful and practical. In this paper, we outline a general strategy of how this could be achieved, before working out the details for a use case that is important in its own right. Our general strategy is based on considering prior information as constraints on a probabilistic model representing the uncertainty about the data. More specifically, we represent the prior information by the maximum entropy (MaxEnt) distribution subject to these constraints. We briefly outline various measures that could subsequently be used to contrast patterns with this MaxEnt model, thus quantifying their subjective interestingness. We demonstrate this strategy for rectangular databases with knowledge of the row and column sums. This situation has been considered before using computation-intensive approaches based on swap randomizations, allowing for the computation of empirical p-values as interestingness measures (Gionis et al., ACM Trans Knowl Discov Data 1(3):14, 2007). We show how the MaxEnt model can be computed remarkably efficiently in this situation, and how it can be used for the same purpose as swap randomizations but computationally more efficiently. More importantly, being an explicitly represented distribution, the MaxEnt model can additionally be used to define analytically computable interestingness measures, as we demonstrate for tiles (Geerts et al., Proceedings of the 7th international conference on Discovery science (DS04), 2004) in binary databases.
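The idea in this abstract can be sketched for a small binary matrix. The sketch assumes that the row and column sums are constrained in expectation, in which case the MaxEnt distribution factorizes into independent Bernoulli cells with success probability sigmoid(lambda_i + mu_j). The plain gradient-ascent fitting loop, the function names, and the tile probability used as an example measure are illustrative assumptions, not the authors' algorithm.

```python
import math


def fit_maxent(row_sums, col_sums, steps=5000, lr=0.1):
    """Fit cell probabilities p[i][j] = sigmoid(lam[i] + mu[j]) so that the
    expected row and column sums match the given ones (simple gradient ascent
    on the dual; assumes no row or column is entirely 0 or entirely 1)."""
    m, n = len(row_sums), len(col_sums)
    lam = [0.0] * m
    mu = [0.0] * n

    def probs():
        return [[1 / (1 + math.exp(-(lam[i] + mu[j]))) for j in range(n)]
                for i in range(m)]

    for _ in range(steps):
        p = probs()
        for i in range(m):
            # Gradient for row i: target row sum minus expected row sum.
            lam[i] += lr * (row_sums[i] - sum(p[i]))
        for j in range(n):
            # Gradient for column j: target column sum minus expected column sum.
            mu[j] += lr * (col_sums[j] - sum(p[i][j] for i in range(m)))
    return probs()


def tile_probability(p, rows, cols):
    """Probability that every cell of the tile rows x cols equals 1,
    using independence of the cells under the fitted model."""
    prob = 1.0
    for i in rows:
        for j in cols:
            prob *= p[i][j]
    return prob


if __name__ == "__main__":
    # Hypothetical margins of a 3 x 4 binary database.
    p = fit_maxent([3, 2, 1], [2, 2, 1, 1])
    print([round(sum(row), 3) for row in p])
    print(tile_probability(p, rows=[0, 1], cols=[0, 1]))
```

A tile that is fully present in the data but has a low probability under the fitted model would be surprising given the margins alone, which is one way such a model can support an analytically computable interestingness score.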

15.
On account of the enormous number of rules that can be produced by data mining algorithms, knowledge post-processing is a difficult stage in an association rule discovery process. In order to find relevant knowledge for decision making, the user (a decision maker specialized in the data studied) needs to rummage through the rules. To assist him/her in this task, we here propose the rule-focusing methodology, an interactive methodology for the visual post-processing of association rules. It allows the user to explore large sets of rules freely by focusing his/her attention on limited subsets. This new approach relies on rule interestingness measures, on a visual representation, and on interactive navigation among the rules. We have implemented the rule-focusing methodology in a prototype system called ARVis. It exploits the user's focus to guide the generation of the rules by means of a specific constraint-based rule-mining algorithm. Julien Blanchard earned the Ph.D. in 2005 from Nantes University (France) and is currently an assistant professor at the Polytechnic School of Nantes University. He is the author of a book chapter and seven journal and international conference papers in the field of visualization and interestingness measures for data mining. Fabrice Guillet is currently a member of the LINA laboratory (CNRS 2729) at the Polytechnic Graduate School of Nantes University (France). He received the Ph.D. degree in computer science in 1995 from the Ecole Nationale Supérieure des Télécommunications de Bretagne. He is the author of 35 international publications in data mining and knowledge management. He is a founder and a permanent member of the Steering Committee of the annual EGC French-speaking conference. Henri Briand received the Ph.D. degree in 1983 from Paul Sabatier University in Toulouse (France) and has published over 100 works in database systems and database mining. He was the head of the Computer Engineering Department at the Polytechnic School of Nantes University. He was in charge of a research team in the data mining domain. He is responsible for the organization of the Data Mining Master in Nantes University.

16.
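The same association mining algorithm performs differently on data with different properties. To address this problem, a method for mining interesting association patterns is proposed. Interestingness measures for patterns are introduced, an interestingness preprocessing step is added, and the data are divided into two types, each of which is mined with a different algorithm. Examples show that the method effectively improves the quality of the output patterns.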

17.
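1 Introduction. Data mining is a new technology for processing business information. Its main feature is the extraction, transformation, analysis, and other model-based processing of large volumes of business data in commercial databases, in order to extract key data that support business decisions. Typically, after mining with certain data mining tools, for example the fast algorithm given in [1], we obtain a large number of association rules. For the user, finding the rules of interest among so many rules is very difficult, and, moreover,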

18.
Knowledge discovery in databases is used to discover useful and understandable knowledge from large databases. A process of knowledge discovery consists of two steps, the data mining step and the evaluation step. In this paper, evaluating and ranking the interestingness of summaries generated from databases, which is a part of the second step, is studied using diversity measures. Sixteen previously analyzed diversity measures of interestingness are used along with three not previously considered ones, brought from different well-known areas. The latter three measures are evaluated theoretically according to five principles that a measure must satisfy to be considered acceptable for ranking summaries. A theoretical correlation study of the eight measures that satisfy all five principles is presented based on mathematical proofs. An empirical evaluation is conducted using three real databases. Then, a classification of the eight measures is deduced. The resulting classification is used to reduce the number of measures to only two, which are the best over all criteria and which produce non-similar results. This helps the user interpret the most important discovered knowledge in his or her decision-making process.

19.
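Research on a multi-strategy approach to mining rules of interest   Total citations: 20, self-citations: 1, citations by others: 19
Data mining discovers a large number of rules from large databases, and how to select the rules of interest is an important research topic in knowledge discovery. This paper studies methods that use domain knowledge to measure the subjective interestingness of rules, gives a function for computing an objective interestingness that measures the simplicity and novelty of rules, and proposes a multi-strategy approach to selecting the rules that interest the user.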

20.
The discovery of interesting patterns in relational databases is an important data mining task. This paper is concerned with the development of a search algorithm for first-order hypothesis spaces, adopting an important pruning technique from association rule mining (termed subset pruning here) in a first-order setting. The basic search algorithm is extended by so-called requires and excludes constraints, which allow prior knowledge about the data, such as mutual exclusion or generalization relationships among attributes, to be declared so that it can be exploited for further structuring and restricting the search space. Furthermore, it is illustrated how to process taxonomies and numerical attributes in the search algorithm. Several task settings using different interestingness criteria and search modes with corresponding pruning criteria are described. Three settings serve as test beds for evaluation of the proposed approach. The experimental evaluation shows that the impact of subset pruning is significant, since it reduces the number of hypothesis evaluations in many cases by about 50%. The impact of generalization relationships is shown to be less effective in our experimental set-up.
