Similar literature
1.
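Wu Jun, Ouyang Aijia, Zhang Lin. 《计算机应用》 (Journal of Computer Applications), 2022, 42(9): 2713-2721
To address two problems in traditional sequential pattern mining algorithms, namely that support does not faithfully reflect how interesting a sequential pattern is and that the quality of reported patterns is not assessed, this paper proposes ISSPM, an influence-based algorithm for mining statistically significant sequential patterns. First, all sequential patterns satisfying the interestingness constraint are mined recursively. Then, an itemset permutation method is used to construct the permutation-test null distribution for these patterns. Finally, the statistical measure of each pattern under evaluation is computed from this null distribution, and all statistically significant sequential patterns are identified among the mined patterns. Experiments on real sequence record sets show that ISSPM mines fewer but more interesting sequential patterns than the PSPM, SPDL, and PSDSP algorithms. Experiments on simulated sequence record sets show that false-positive sequential patterns account for 3.39% of ISSPM's reported results on average, and that its discovery rate for embedded patterns is never below 66.7%, clearly better than the three comparison algorithms. The statistically significant sequential patterns reported by ISSPM thus capture more valuable information in the sequence record sets, and further analysis and decisions based on this information are more reliable.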
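The permutation-test step in this abstract can be illustrated with a minimal sketch. It is not the ISSPM algorithm: the abstract does not give the itemset permutation scheme or the influence measure, so the sketch assumes plain support as the test statistic and a null model that shuffles the item order inside each sequence. The function names (contains, support, permutation_p_value) are hypothetical.

```python
import random


def contains(seq, pattern):
    """True if `pattern` occurs in `seq` as a (possibly non-contiguous) subsequence."""
    it = iter(seq)
    # `item in it` advances the iterator, so order is respected.
    return all(item in it for item in pattern)


def support(sequences, pattern):
    """Fraction of sequences that contain the pattern."""
    return sum(contains(s, pattern) for s in sequences) / len(sequences)


def permutation_p_value(sequences, pattern, n_perm=1000, seed=0):
    """Empirical p-value of the observed support under a shuffled-order null.

    Assumption: the null distribution is built by randomly reordering the
    items of every sequence, which destroys order but keeps item frequencies.
    """
    rng = random.Random(seed)
    observed = support(sequences, pattern)
    hits = 0
    for _ in range(n_perm):
        shuffled = [rng.sample(s, len(s)) for s in sequences]
        if support(shuffled, pattern) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    data = [["a", "b", "c"]] * 20
    # "a before c" holds in every sequence, which is unlikely under shuffling.
    print(permutation_p_value(data, ["a", "c"]))
    # A single item is unaffected by shuffling, so its p-value is 1.0.
    print(permutation_p_value(data, ["a"], n_perm=50))
```

A pattern would be reported as statistically significant when its p-value falls below a chosen threshold; controlling false positives across many patterns needs a multiple-testing correction that this sketch leaves out.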

2.
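Data mining is the discovery of hidden structures and patterns in data. Many discovered patterns, however, may already be known to the user, which makes them meaningless and uninteresting. The literature mostly emphasizes the accuracy and comprehensibility of classification rules, but discovering interesting rules remains a daunting challenge for data mining algorithms. This paper adopts a genetic data mining method that measures the interestingness of classification rules while generating them, and thus produces interesting rules directly. Experiments show that the method is feasible and efficient.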

3.
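Mining interesting association rules based on statistical correlation   Total citations: 8, self-citations: 0, citations by others: 8
This paper first analyzes the shortcomings of the support-confidence framework for association rules, then introduces the notion of rule interestingness, uses it to constrain the generation of redundant association rules so as to improve the usefulness of the mined knowledge, and gives a description of the algorithm.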
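The weakness of the support-confidence framework mentioned in this abstract can be shown with a small sketch. The abstract does not define its interestingness measure, so the sketch assumes lift (confidence divided by the consequent's support) as a stand-in. The function name rule_measures and the toy transactions are hypothetical.

```python
def rule_measures(transactions, antecedent, consequent):
    """Return (support, confidence, lift) of the rule antecedent -> consequent.

    `transactions` is a list of sets of items. Lift is used here only as an
    illustrative interestingness measure; it is an assumption, not the
    measure defined in the paper.
    """
    a, c = set(antecedent), set(consequent)
    n = len(transactions)
    n_a = sum(a <= t for t in transactions)          # transactions containing A
    n_c = sum(c <= t for t in transactions)          # transactions containing C
    n_ac = sum((a | c) <= t for t in transactions)   # transactions containing both
    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift


if __name__ == "__main__":
    transactions = (
        [{"tea", "coffee"}] * 3 + [{"coffee"}] * 5 + [{"tea"}] + [{"milk"}]
    )
    s, conf, lift = rule_measures(transactions, {"tea"}, {"coffee"})
    # Support 0.3 and confidence 0.75 look strong, yet lift is below 1:
    # coffee is bought less often with tea than it is overall.
    print(s, conf, round(lift, 2))
```

A rule such as tea -> coffee passes typical support and confidence thresholds here, while an interestingness constraint (lift above 1 in this sketch) would filter it out.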

4.
Mining multiple-level association rules in large databases   Total citations: 2, self-citations: 0, citations by others: 2
A top-down progressive deepening method is developed for efficient mining of multiple-level association rules from large transaction databases based on the a priori principle. A group of variant algorithms is proposed based on the ways of sharing intermediate results, with the relative performance tested and analyzed. The enforcement of different interestingness measurements to find more interesting rules, and the relaxation of rule conditions for finding “level-crossing” association rules, are also investigated. The study shows that efficient algorithms can be developed from large databases for the discovery of interesting and strong multiple-level association rules.

5.
6.
Major challenges of clustering geo-referenced data include identifying arbitrarily shaped clusters, properly utilizing spatial information, coping with diverse extrinsic characteristics of clusters and supporting region discovery tasks. The goal of region discovery is to identify interesting regions in geo-referenced datasets based on a domain expert’s notion of interestingness. Almost all agglomerative clustering algorithms only focus on the first challenge. The goal of the proposed work is to develop agglomerative clustering frameworks that deal with all four challenges. In particular, we propose a generic agglomerative clustering framework for geo-referenced datasets (GAC-GEO) generalizing agglomerative clustering by allowing for three plug-in components. GAC-GEO agglomerates neighboring clusters maximizing a plug-in fitness function that captures the notion of interestingness of clusters. It enhances typical agglomerative clustering algorithms in two ways: fitness functions support task-specific clustering, whereas generic neighboring relationships increase the number of merging candidates. We also demonstrate that existing agglomerative clustering algorithms can be considered as specific cases of GAC-GEO. We evaluate the proposed framework on an artificial dataset and two real-world applications involving region discovery. The experimental results show that GAC-GEO is capable of identifying arbitrarily shaped hotspots for different data mining tasks.

7.
8.
Many studies have shown the limits of the support/confidence framework used in Apriori-like algorithms to mine association rules. There are a lot of efficient implementations based on the antimonotony property of the support, but candidate set generation (e.g., frequent item set mining) is still costly. In addition, many rules are uninteresting or redundant and one can miss interesting rules like nuggets. We are thus facing a complexity issue and a quality issue. One solution is to not use frequent itemset mining and to focus as soon as possible on interesting rules using additional interestingness measures. We present here a formal framework that allows us to make a link between analytic and algorithmic properties of interestingness measures. We introduce the notion of optimonotony in relation to the optimal rule discovery framework. We then demonstrate a necessary and sufficient condition for the existence of optimonotony. This result can thus be applied to classify the measures. We study the case of 39 classical measures and show that 31 of them are optimonotone. These optimonotone measures can thus be used with an underlying pruning strategy. Empirical evaluations show that the pruning strategy is efficient and leads to the discovery of nuggets using an optimonotone measure and without the support constraint.

9.
Because clinical research is carried out in complex environments, prior domain knowledge, constraints, and expert knowledge can enhance the capabilities and performance of data mining. In this paper we propose an unexpected pattern mining model that uses decision trees to compare recovery rates of two different treatments, and to find patterns that contrast with the prior knowledge of domain users. In the proposed model we define interestingness measures to determine whether the patterns found are interesting to the domain. By applying the concept of domain-driven data mining, we repeatedly utilize decision trees and interestingness measures in a closed-loop, in-depth mining process to find unexpected and interesting patterns. We use retrospective data from transvaginal ultrasound-guided aspirations to show that the proposed model can successfully compare different treatments using a decision tree, which is a new usage of that tool. We believe that unexpected, interesting patterns may provide clinical researchers with different perspectives for future research.

10.
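Credible association rules and a maximal-clique-based mining algorithm   Total citations: 4, self-citations: 1, citations by others: 3
Xiao Bo, Xu Qianfang, Lin Zhiqing, Guo Jun, Li Chunguang. 《软件学报》 (Journal of Software), 2008, 19(10): 2597-2610
Current association rule mining algorithms rely mainly on support-based pruning to reduce the combinatorial search space. This strategy is not effective when mining potentially interesting patterns with low support. To address this, a new kind of association pattern, the credible association rule (CAR), is proposed: the supports of all items in a rule are of the same order of magnitude, and the rule's confidence directly reflects its credibility, so the traditional support no longer needs to be considered. The MaxcliqueMining algorithm is also proposed; it uses an adjacency matrix to generate 2-item credible sets and then applies the maximal clique idea to generate all credible association rules. Several related propositions are stated and proved to illustrate the properties of such rules and the feasibility and effectiveness of the algorithm. Experiments on an alarm dataset and the Pumsb dataset show that the algorithm mines CARs with high efficiency and accuracy.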

11.
This paper presents a framework for exact discovery of the top-k sequential patterns under Leverage. It combines (1) a novel definition of the expected support for a sequential pattern (a concept on which most interestingness measures directly rely) with (2) Skopus: a new branch-and-bound algorithm for the exact discovery of top-k sequential patterns under a given measure of interest. Our interestingness measure employs the partition approach. A pattern is interesting to the extent that it is more frequent than can be explained by assuming independence between any of the pairs of patterns from which it can be composed. The larger the support compared to the expectation under independence, the more interesting is the pattern. We build on these two elements to exactly extract the k sequential patterns with highest leverage, consistent with our definition of expected support. We conduct experiments on both synthetic data with known patterns and real-world datasets; both experiments confirm the consistency and relevance of our approach with regard to the state of the art.
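The idea of scoring a pattern by how far its support exceeds the expectation under independence can be sketched in a much simpler setting. The sketch below is not Skopus: it assumes unordered item pairs instead of sequential patterns, uses the classical leverage supp(X ∪ Y) − supp(X)·supp(Y) rather than the paper's partition-based expected support, and enumerates all pairs instead of using branch-and-bound. The names leverage and top_k_pairs are hypothetical.

```python
from heapq import nlargest
from itertools import combinations


def leverage(transactions, x, y):
    """Observed joint support minus the support expected under independence."""
    n = len(transactions)

    def supp(items):
        return sum(items <= t for t in transactions) / n

    return supp(x | y) - supp(x) * supp(y)


def top_k_pairs(transactions, k):
    """The k item pairs with the highest leverage, as (score, pair) tuples.

    Exhaustive enumeration; fine for a toy example, not for large data.
    """
    items = sorted(set().union(*transactions))
    scored = [
        (leverage(transactions, {a}, {b}), (a, b))
        for a, b in combinations(items, 2)
    ]
    return nlargest(k, scored)


if __name__ == "__main__":
    transactions = [{"a", "b"}] * 4 + [{"c"}] * 4 + [{"a", "c"}] * 2
    for score, pair in top_k_pairs(transactions, 2):
        print(pair, round(score, 2))
```

In this toy data the pair (a, b) co-occurs more often than independence predicts and ranks first, while (a, c) and (b, c) receive negative leverage.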

12.
Most incremental mining and online mining algorithms concentrate on finding association rules or patterns consistent with entire current sets of data. Users cannot easily obtain results from only the interesting portion of the data. This may prevent the use of mining for online decision support on multidimensional data. To provide ad-hoc, query-driven, and online mining support, we first propose a relation called the multidimensional pattern relation to structurally and systematically store context and mining information for later analysis. Each tuple in the relation comes from an inserted dataset in the database. We then develop an online mining approach called three-phase online association rule mining (TOARM) based on this proposed multidimensional pattern relation to support online generation of association rules under multidimensional considerations. The TOARM approach consists of three phases during which final sets of patterns satisfying various mining requests are found. It first selects and integrates related mining information in the multidimensional pattern relation, and then if necessary, re-processes itemsets without sufficient information against the underlying datasets. Some implementation considerations for the algorithm are also stated in detail. Experiments on homogeneous and heterogeneous datasets were made and the results show the effectiveness of the proposed approach.

13.
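Concept-guided mining of association rules   Total citations: 4, self-citations: 0, citations by others: 4
Association rules are an effective way to describe data dependencies and an important topic in knowledge discovery research. Traditional association rule mining algorithms lack focus, mine slowly, produce results that are hard to understand, and generate a huge number of rules that must be heavily filtered to extract the useful ones. This paper proposes incorporating concepts into the mining process to improve mining efficiency and focus, and presents the concept-guided association rule mining algorithm CGARM together with a method for interactively generating concepts in large databases. CGARM extends classification-based mining algorithms. Experimental results show that algorithm CGA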

14.
Discovering injective episodes with general partial orders   Total citations: 1, self-citations: 1, citations by others: 0
Frequent episode discovery is a popular framework for temporal pattern discovery in event streams. An episode is a partially ordered set of nodes with each node associated with an event type. Currently algorithms exist for episode discovery only when the associated partial order is a total order (serial episode) or trivial (parallel episode). In this paper, we propose efficient algorithms for discovering frequent episodes with unrestricted partial orders when the associated event-types are unique. These algorithms can be easily specialized to discover only serial or parallel episodes. Also, the algorithms are flexible enough to be specialized for mining in the space of certain interesting subclasses of partial orders. We point out that frequency alone is not a sufficient measure of interestingness in the context of partial order mining. We propose a new interestingness measure for episodes with unrestricted partial orders which, when used along with frequency, results in an efficient scheme of data mining. Simulations are presented to demonstrate the effectiveness of our algorithms.

15.
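To reduce human intervention in preference measurement while improving the efficiency and accuracy of preference measurement algorithms, a belief-system-based collaborative preference measurement framework is proposed. First, concepts such as the distance between rules and the internal distance of a rule set are introduced to make the relationships between rules concrete. On this basis, a rule set aggregation algorithm, PRA, based on the average internal distance of a rule set is proposed; it aims to select the most representative common preferences of all users, i.e., consensus preferences, while losing as little information as possible. Next, the concept of Common belief and an improved belief system are proposed; consensus preferences serve as the evidence of the belief system, which accounts for consistency among users while allowing them to keep personalized information. Under the belief system, a belief-system-based interestingness measure is proposed, and the belief degree and deviation degree of a preference are quantified to describe how far a user preference agrees with or contradicts the belief system. User preferences are classified as generalized or personalized, and interestingness is finally derived from the belief degree and deviation degree so as to find the most interesting rules. For computing interestingness, an extensible computational framework is proposed that can use different belief degree formulas. To further verify the accuracy and effectiveness of the measurement framework, the IMCos and IMCov algorithms are proposed, taking the weighted cosine similarity formula and the correlation coefficient formula as examples. Experimental results show that the belief degree and deviation degree effectively reflect different characteristics of preferences, and that, compared with two recent algorithms, CONTENUM and TKO, the Top-K rules found by the framework are better on recall, precision, and F1-Measure.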

16.
Knowledge discovery in databases is used to discover useful and understandable knowledge from large databases. A process of knowledge discovery consists of two steps, the data mining step and the evaluation step. In this paper, evaluating and ranking the interestingness of summaries generated from databases, which is a part of the second step, is studied using diversity measures. Sixteen previously analyzed diversity measures of interestingness are used along with three not previously considered ones, brought from different well-known areas. The latter three measures are evaluated theoretically according to five principles that a measure must satisfy to qualify as acceptable for ranking summaries. A theoretical correlation study between the eight measures that satisfy all five principles is presented based on mathematical proofs. An empirical evaluation is conducted using three real databases. Then, a classification of the eight measures is deduced. The resulting classification is used to reduce the number of measures to only two, which are the best over all criteria and produce non-similar results. This helps users interpret the most important discovered knowledge in their decision-making process.

17.
Market basket analysis is one of the typical applications in mining association rules. The valuable information discovered from data mining can be used to support decision making. Generally, support and confidence (objective) measures are used to evaluate the interestingness of association rules. However, in some cases, by using these two measures, the discovered rules may not be profitable or actionable (not interesting) to enterprises. Therefore, how to discover the patterns by considering both objective measures (e.g. probability) and subjective measures (e.g. profit) is a challenge in data mining, particularly in marketing applications. This paper focuses on pattern evaluation in the process of knowledge discovery by using the concept of profit mining. Data Envelopment Analysis is utilized to calculate the efficiency of discovered association rules with multiple objective and subjective measures. After evaluating the efficiency of association rules, they are categorized into two classes, relatively efficient (interesting) and relatively inefficient (uninteresting). To distinguish these two classes, a Decision Tree (DT)-based classifier is built by using the attributes of association rules. The DT classifier can be used to find out the characteristics of interesting association rules, and to classify the unknown (new) association rules.

18.
The aim of this paper is to propose a new hybrid data mining model based on a combination of various feature selection and ensemble learning classification algorithms, in order to support the decision making process. The model is built through several stages. In the first stage, the initial dataset is preprocessed and, apart from applying different preprocessing techniques, we paid great attention to feature selection. Five different feature selection algorithms were applied and their results, based on ROC and accuracy measures of a logistic regression algorithm, were combined based on different voting types. We also proposed a new voting method, called if_any, that outperformed all other voting methods, as well as the results of any single feature selection algorithm. In the next stage, four different classification algorithms, including generalized linear model, support vector machine, naive Bayes and decision tree, were run on the dataset obtained in the feature selection process. These classifiers were combined in eight different ensemble models using the soft voting method. Using a real dataset, the experimental results show that the hybrid model based on features selected by the if_any voting method and the ensemble GLM + DT model achieves the highest performance and outperforms all other ensemble and single classifier models.

19.
A number of studies, theoretical, empirical, or both, have been conducted to provide insight into the properties and behavior of interestingness measures for association rule mining. While each has value in its own right, most are either limited in scope or, more importantly, ignore the purpose for which interestingness measures are intended, namely the ultimate ranking of discovered association rules. This paper, therefore, focuses on an analysis of the rule-ranking behavior of 61 well-known interestingness measures tested on the rules generated from 110 different datasets. By clustering based on ranking behavior, we highlight, and formally prove, previously unreported equivalences among interestingness measures. We also show that there appear to be distinct clusters of interestingness measures, but that there remain differences among clusters, confirming that domain knowledge is essential to the selection of an appropriate interestingness measure for a particular task and business objective.
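What it means for two interestingness measures to be equivalent in ranking behavior can be shown with a textbook case. The sketch below compares the odds ratio with Yule's Q; since Q = (OR − 1)/(OR + 1) is strictly increasing in OR, the two always order rules identically. This pair is chosen only for illustration and is not claimed to be one of the equivalences reported in the paper. The rule names and contingency counts are made up, and the helper ranking is hypothetical.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio from a 2x2 table: a = A and B, b = A only, c = B only, d = neither."""
    return (a * d) / (b * c)


def yules_q(a, b, c, d):
    """Yule's Q from the same 2x2 table; equals (OR - 1) / (OR + 1)."""
    return (a * d - b * c) / (a * d + b * c)


def ranking(rules, measure):
    """Rule names ordered from most to least interesting under `measure`."""
    return sorted(rules, key=lambda name: measure(*rules[name]), reverse=True)


if __name__ == "__main__":
    # Hypothetical rules with illustrative contingency counts (b and c nonzero).
    rules = {
        "r1": (30, 10, 5, 55),
        "r2": (20, 20, 20, 40),
        "r3": (10, 5, 40, 45),
        "r4": (25, 5, 25, 45),
    }
    print(ranking(rules, odds_ratio))
    print(ranking(rules, yules_q))
```

The two printed orderings coincide even though the raw scores differ, which is the sense in which a study of ranking behavior would place both measures in the same cluster.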

20.
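Associative classification offers high classification accuracy and strong adaptability; however, because the classifier consists of a set of high-confidence rules, it sometimes suffers from overfitting. This paper proposes associative classification based on rule interestingness (ACIR). It extends the TD-FP-growth algorithm to mine the training set efficiently and generate interesting rules that satisfy minimum support and minimum confidence. A small rule set is selected by pruning to build the classifier. During rule pruning, rule interestingness is used to evaluate rule quality, taking into account both the predictive accuracy of a rule and the interestingness of the items in it. Experimental results show that the method outperforms See5, CBA, and CMAR in classification accuracy and has good comprehensibility and scalability.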
