Mining very large databases   Total citations: 1 (self-citations: 0, by others: 1)
Ganti V., Gehrke J., Ramakrishnan R. Computer, 1999, 32(8): 38-45
Established companies have had decades to accumulate masses of data about their customers, suppliers, products and services, and employees. Data mining, also known as knowledge discovery in databases, gives organizations the tools to sift through these vast data stores to find the trends, patterns, and correlations that can guide strategic decision making. Traditionally, algorithms for data analysis assume that the input data contains relatively few records. Current databases, however, are much too large to be held in main memory. To be efficient, the data mining techniques applied to very large databases must be highly scalable. An algorithm is said to be scalable if, given a fixed amount of main memory, its runtime increases linearly with the number of records in the input database. Recent work has focused on scaling data mining algorithms to very large data sets. The authors describe a broad range of algorithms that address three classical data mining problems: market basket analysis, clustering, and classification.

Mining multiple-level association rules in large databases   Total citations: 2 (self-citations: 0, by others: 2)
A top-down progressive deepening method is developed for efficient mining of multiple-level association rules from large transaction databases based on the Apriori principle. A group of variant algorithms is proposed based on the ways of sharing intermediate results, with the relative performance tested and analyzed. The enforcement of different interestingness measurements to find more interesting rules, and the relaxation of rule conditions for finding “level-crossing” association rules, are also investigated. The study shows that efficient algorithms can be developed for the discovery of interesting and strong multiple-level association rules from large databases.

Although many methods have been proposed to enhance the efficiency of data mining, little research has been devoted to the issue of scalability, that is, the problem of mining frequent itemsets when the size of the database is very large. This study proposes a methodology, hierarchical partitioning, for mining frequent itemsets in large databases, based on a novel data structure called the Frequent Pattern List (FPL). One of the major features of the FPL is its ability to partition the database, and thus transform the database into a set of sub-databases of manageable sizes. As a result, a divide-and-conquer approach can be developed to perform the desired data-mining tasks. Experimental results show that hierarchical partitioning is capable of mining frequent itemsets and frequent closed itemsets in very large databases.

Incorporating constraints into frequent itemset mining not only improves data mining efficiency, but also leads to concise and meaningful results. In this paper, a framework for closed constrained gradient itemset mining in retail databases is proposed by introducing the concept of gradient constraint into closed itemset mining. LCLOSET, a tailored version of CLOSET+ designed for efficient closed itemset mining from sparse databases, is first briefly introduced. Then a weaker but antimonotone measure, the top-X average measure, is proposed and can be adopted to prune the search space effectively. Experiments show that a combination of LCLOSET and the top-X average pruning provides an efficient approach to mining frequent closed gradient itemsets.


A causal rule between two variables, X → Y, captures the relationship that the presence of X causes the appearance of Y. Because of its usefulness (compared to association rules), techniques for mining causal rules are beginning to be developed. However, the effectiveness of existing methods (such as the LCD and CU-path algorithms) is limited to mining causal rules among simple variables, and these methods are inadequate to discover and represent causal rules among multi-value variables. In this paper, we propose that the causality between variables X and Y be represented in the form X → Y with conditional probability matrix M_{Y|X}. We also propose a new approach to discover causality in large databases based on partitioning. The approach partitions the items into item variables by decomposing "bad" item variables and composing "not-good" item variables. In particular, we establish a method to optimize causal rules that merges the "useless" information in conditional probability matrices of extracted causal rules.
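To make the idea of a conditional probability matrix for multi-value variables concrete, the following is a minimal sketch that estimates P(Y = y | X = x) from observed (x, y) pairs by relative frequency. The function name, the row-per-X-value layout, and the sample data are assumptions made for illustration; the abstract does not specify how the authors construct or orient the matrix.

```python
from collections import Counter


def conditional_probability_matrix(observations):
    """Estimate P(Y=y | X=x) by relative frequency.

    `observations` is a list of (x, y) pairs. The result maps each observed
    x value to a row {y: probability}; this layout is an illustrative choice,
    not one taken from the paper.
    """
    pair_counts = Counter(observations)
    x_counts = Counter(x for x, _ in observations)
    y_values = sorted({y for _, y in observations})
    matrix = {}
    for x in sorted(x_counts):
        # Counter returns 0 for unseen (x, y) pairs, so every row is complete.
        matrix[x] = {y: pair_counts[(x, y)] / x_counts[x] for y in y_values}
    return matrix


# Hypothetical observations of two multi-value variables.
sample = [("low", "no"), ("low", "no"), ("low", "yes"), ("high", "yes")]
matrix = conditional_probability_matrix(sample)
for x, row in matrix.items():
    print(x, row)
```

Each row sums to 1, so a row can be read as the distribution of Y given one value of X.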

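Mining dynamic association rules in databases   Total citations: 7 (self-citations: 0, by others: 7)
Association rules can reveal dependencies among variables, but they cannot reflect how the rules themselves change over time. This paper therefore proposes dynamic association rules. The full data set to be mined is first divided by time into several subsets. Each rule mined yields one support value and one confidence value on each subset, so that over the full data set each rule corresponds to a support vector and a confidence vector. Analyzing these two vectors reveals how a rule changes over time and also makes it possible to predict its future trend. The paper also proposes two algorithms for mining dynamic association rules and compares them, presents two methods for analyzing the two vectors (histograms and time series), and finally gives an application example of mining dynamic association rules.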
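The support vector and confidence vector described above can be sketched as follows. This is a minimal illustration, not either of the two algorithms the paper proposes: the function name `rule_vectors`, the representation of each time slice as a list of item sets, and the convention of reporting 0.0 when a slice is empty or never contains the antecedent are all assumptions.

```python
def rule_vectors(partitions, antecedent, consequent):
    """Return (support_vector, confidence_vector) for the rule
    antecedent -> consequent, with one entry per time partition."""
    antecedent = frozenset(antecedent)
    both = antecedent | frozenset(consequent)
    supports, confidences = [], []
    for transactions in partitions:
        n = len(transactions)
        n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
        n_both = sum(1 for t in transactions if both <= set(t))
        # Assumed convention: 0.0 when the denominator would be zero.
        supports.append(n_both / n if n else 0.0)
        confidences.append(n_both / n_antecedent if n_antecedent else 0.0)
    return supports, confidences


# Hypothetical data: three time slices of a transaction database.
slices = [
    [{"a", "b"}, {"a"}, {"b"}, {"a", "b"}],
    [{"a", "b"}, {"a", "b"}],
    [{"c"}, {"a"}],
]
support_vector, confidence_vector = rule_vectors(slices, {"a"}, {"b"})
print(support_vector)
print(confidence_vector)
```

Plotting either vector against the slice index gives the kind of time-series view the abstract mentions for spotting trends in a rule.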

Query-by-example and query-by-keyword both suffer from the problem of “aliasing,” meaning that example-images and keywords potentially have variable interpretations or multiple semantics. For discerning which semantic is appropriate for a given query, we have established that combining active learning with kernel methods is a very effective approach. In this work, we first examine active-learning strategies, and then focus on addressing the challenges of two scalability issues: scalability in concept complexity and in dataset size. We present remedies, explain limitations, and discuss future directions that research might take.

In this paper, we examine a new data mining issue: mining association rules from customer databases and transaction databases. The problem is decomposed into two subproblems: identifying all the large itemsets from the transaction database, and mining association rules from the customer database and the large itemsets identified. For the first subproblem, we propose an efficient algorithm to discover all the large itemsets from the transaction database. Experimental results show that with our approach, the total execution time can be reduced significantly. For the second subproblem, a relationship graph is constructed according to the identified large itemsets from the transaction database and the priorities of condition attributes from the customer database. Based on the relationship graph, we present an efficient graph-based algorithm to discover interesting association rules embedded in the transaction database and the customer database.

In this paper, we propose an efficient algorithm, called PCP-Miner (Pointset Closed Pattern Miner), for mining frequent closed patterns from a pointset database, where a pointset contains a set of points. Our proposed algorithm consists of two phases. First, we find all frequent patterns of length two in the database. Second, for each pattern found in the first phase, we recursively generate frequent closed patterns by a frequent pattern tree in a depth-first search manner. Since PCP-Miner does not generate unnecessary candidates, it is more efficient and scalable than the modified Apriori, SASMiner and MaxGeo. The experimental results show that the PCP-Miner algorithm outperforms the compared algorithms by more than one order of magnitude.

In this paper, we propose an efficient algorithm, called CMP-Miner, to mine closed patterns in a time-series database where each record in the database, also called a transaction, contains multiple time-series sequences. Our proposed algorithm consists of three phases. First, we transform each time-series sequence in a transaction into a symbolic sequence. Second, we scan the transformed database to find frequent patterns of length one. Third, for each frequent pattern found in the second phase, we recursively enumerate frequent patterns by a frequent pattern tree in a depth-first search manner. During the process of enumeration, we apply several efficient pruning strategies to remove frequent but non-closed patterns. Thus, the CMP-Miner algorithm can efficiently mine the closed patterns from a time-series database. The experimental results show that our proposed algorithm outperforms the modified Apriori and BIDE algorithms.

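Traditional association rule mining studies the associations among the items contained in transactions, whereas negative association rule mining must consider not only the items a transaction contains but also the items it does not contain. This paper defines completely negative association rules and proposes a tree-based algorithm, Free-PNP, which mines the negative frequent patterns in a database and derives the completely negative association rules from them. Experiments verify the effectiveness of the algorithm.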

Mining frequent trajectory patterns in spatial-temporal databases   Total citations: 1 (self-citations: 0, by others: 1)
In this paper, we propose an efficient graph-based mining (GBM) algorithm for mining the frequent trajectory patterns in a spatial-temporal database. The proposed method comprises two phases. First, we scan the database once to generate a mapping graph and trajectory information lists (TI-lists). Then, we traverse the mapping graph in a depth-first search manner to mine all frequent trajectory patterns in the database. By using the mapping graph and TI-lists, the GBM algorithm can localize support counting and pattern extension in a small number of TI-lists. Moreover, it utilizes the adjacency property to reduce the search space. Therefore, our proposed method can efficiently mine the frequent trajectory patterns in the database. The experimental results show that it outperforms the Apriori-based and PrefixSpan-based methods by more than one order of magnitude.

Mining spatial association rules in image databases   Total citations: 2 (self-citations: 0, by others: 2)
In this paper, we propose a novel spatial mining algorithm, called 9DLT-Miner, to mine the spatial association rules from an image database, where every image is represented by the 9DLT representation. The proposed method consists of two phases. First, we find all frequent patterns of length one. Next, we use frequent k-patterns (k ≥ 1) to generate all candidate (k + 1)-patterns. For each candidate pattern generated, we scan the database to count the pattern’s support and check if it is frequent. The steps in the second phase are repeated until no more frequent patterns can be found. Since our proposed algorithm prunes most impossible candidates, it is more efficient than the Apriori algorithm. The experimental results show that 9DLT-Miner runs 2-5 times faster than the Apriori algorithm.
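The two-phase, level-wise search described above can be illustrated on plain itemsets. The sketch below is a generic Apriori-style loop, not 9DLT-Miner itself: it ignores the 9DLT image representation and the paper's own pruning rules, and the function name, the absolute `min_support` threshold, and the sample transactions are assumptions.

```python
from itertools import combinations


def levelwise_frequent_itemsets(transactions, min_support):
    """Return {itemset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    # Phase 1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    # Phase 2: build (k + 1)-candidates from frequent k-itemsets, count, repeat.
    k = 1
    while frequent:
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            # Keep a candidate only if every k-subset is itself frequent.
            if len(union) == k + 1 and all(
                frozenset(sub) in frequent for sub in combinations(union, k)
            ):
                candidates.add(union)
        frequent = {}
        for candidate in candidates:
            support = sum(1 for t in transactions if candidate <= t)
            if support >= min_support:
                frequent[candidate] = support
        result.update(frequent)
        k += 1
    return result


# Hypothetical transaction database.
database = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
for itemset, support in sorted(
    levelwise_frequent_itemsets(database, 3).items(),
    key=lambda kv: (len(kv[0]), sorted(kv[0])),
):
    print(sorted(itemset), support)
```

The loop stops at the first level that yields no frequent pattern, which mirrors the termination condition stated in the abstract.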

Spatiotemporal co-occurrence patterns (STCOPs) represent the subsets of feature types whose instances frequently co-occur in both space and time. Spatiotemporal co-occurrences reflect the spatiotemporal overlap relationships among two or more spatiotemporal instances in both the spatial and temporal dimensions. STCOPs can potentially be used to predict and understand the generation and evolution of different types of interacting phenomena in various scientific fields such as astronomy, meteorology, biology, and geosciences. Meaningful and statistically significant data analysis for these scientific fields requires processing sufficiently large datasets. Due to the computationally expensive nature of spatiotemporal operations required for mining spatiotemporal co-occurrences, it is increasingly difficult to identify spatiotemporal co-occurrences and discover STCOPs in centralized system settings. As a solution, we developed a cloud-based distributed mining system for discovering STCOPs. Our system uses Accumulo, a column-oriented non-relational database management system, as its backbone. In order to efficiently mine the STCOPs, we propose three data models for managing trajectory-based spatiotemporal data in Accumulo. We introduce an in-memory join-index structure and a join algorithm for effectively performing spatiotemporal join operations on spatiotemporal trajectories in non-relational databases. Lastly, through experiments on artificial and real-life datasets, we evaluate the performance of the proposed models for STCOP mining.

With the increasing use of wireless communication devices and the ability to track people and objects cheaply and easily, the amount of spatio-temporal data is growing substantially. Many of these applications cannot easily locate the exact position of objects, but they can determine the region in which each object is contained. Furthermore, the regions are fixed and may vary greatly in size. Examples include mobile/cell phone networks, RFID tag readers and satellite tracking. This demands techniques to mine such data. These techniques must also correct for the bias produced by different sized regions. We provide a comprehensive definition of Spatio-Temporal Association Rules (STARs) that describe how objects move between regions over time. We also present other patterns that are useful for mobility data: stationary regions and high traffic regions. The latter consist of sources, sinks and thoroughfares. These patterns describe important temporal characteristics of regions and we show that they can be considered as special STARs. We define spatial support to effectively deal with the problem of different sized regions. We provide an efficient algorithm, STAR-Miner, to find these patterns by exploiting several pruning properties. Responsible editors: Charles Perng and Tao Li.

Mining itemset utilities from transaction databases   Total citations: 4 (self-citations: 0, by others: 4)
The rationale behind mining frequent itemsets is that only itemsets with high frequency are of interest to users. However, the practical usefulness of frequent itemsets is limited by the significance of the discovered itemsets. A frequent itemset only reflects the statistical correlation between items, and it does not reflect the semantic significance of the items. In this paper, we propose a utility-based itemset mining approach to overcome this limitation. The proposed approach permits users to quantify their preferences concerning the usefulness of itemsets using utility values. The usefulness of an itemset is characterized as a utility constraint. That is, an itemset is interesting to the user only if it satisfies a given utility constraint. We show that the pruning strategies used in previous itemset mining approaches cannot be applied to utility constraints. In response, we identify several mathematical properties of utility constraints. Then, two novel pruning strategies are designed. Two algorithms for utility-based itemset mining are developed by incorporating these pruning strategies. The algorithms are evaluated by applying them to synthetic and real-world databases. Experimental results show that the proposed algorithms are effective on the databases tested.
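A small example helps show why frequency-style pruning does not carry over to utility constraints. The sketch below assumes one common formulation, in which the utility of an itemset is the sum of quantity times per-unit utility over every transaction that contains the whole itemset; the abstract does not give the paper's exact definition, and the function name, data layout, and numbers here are hypothetical.

```python
def itemset_utility(itemset, transactions, unit_utility):
    """Sum quantity * unit utility of the itemset's items over all
    transactions that contain every item of the itemset.

    Each transaction is a dict {item: quantity}. This definition is an
    assumed, commonly used formulation rather than the paper's own.
    """
    items = set(itemset)
    total = 0
    for t in transactions:
        if items.issubset(t):  # iterating a dict yields its keys
            total += sum(t[i] * unit_utility[i] for i in items)
    return total


# Hypothetical purchase quantities and per-unit utilities (e.g. profit).
transactions = [{"a": 1, "b": 2}, {"a": 2}, {"b": 1, "c": 3}]
unit_utility = {"a": 5, "b": 1, "c": 2}

print(itemset_utility({"b"}, transactions, unit_utility))
print(itemset_utility({"b", "c"}, transactions, unit_utility))
```

With these numbers the superset {b, c} has a higher utility than its subset {b}, even though it occurs in fewer transactions. Utility is therefore not antimonotone, so an itemset cannot be discarded just because one of its subsets fails the threshold.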

Knowledge and Information Systems - This paper considers the problem of sequential pattern mining (SPM) in probabilistic databases. Specifically, we consider SPM in situations where there is...
