首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
The problem of mining collections of trees to identify common patterns, called frequent subtrees (FSTs), arises often when trying to interpret the results of phylogenetic analysis. FST mining generalizes the well-known maximum agreement subtree problem. Here we present EvoMiner, a new algorithm for mining frequent subtrees in collections of phylogenetic trees. EvoMiner is an Apriori-like levelwise method, which uses a novel phylogeny-specific constant-time candidate generation scheme, an efficient fingerprinting-based technique for downward closure, and a lowest-common-ancestor-based support counting step that requires neither costly subtree operations nor database traversal. Our algorithm achieves speedups of up to 100 times or more over Phylominer, the current state-of-the-art algorithm for mining phylogenetic trees. EvoMiner can also work in depth-first enumeration mode to use less memory at the expense of speed. We demonstrate the utility of FST mining as a way to extract meaningful phylogenetic information from collections of trees when compared to maximum agreement subtrees and majority-rule trees—two commonly used approaches in phylogenetic analysis for extracting consensus information from a collection of trees over a common leaf set.  相似文献   

2.
Toward multidatabase mining: identifying relevant databases   总被引:2,自引:0,他引:2  
Various tools and systems for knowledge discovery and data mining have been developed and are available for applications. However, when there are many databases, an immediate question is where one should start mining. It is not true that data mining is better the more databases there are. It is only true when the databases involved are relevant to the task at hand. By breaking away from the conventional data mining assumption that many databases should be joined into one, we argue that the first step for multidatabase mining is to identify databases that are most relevant to an application; without doing so, the mining process can be lengthy, aimless, and ineffective. A measure of relevance is thus proposed for mining tasks with the objective of finding patterns or regularities of certain attributes. An efficient algorithm for identifying relevant databases is described. Experiments are conducted to verify the measure's performance and to exemplify its application  相似文献   

3.
Set-oriented data mining in relational databases   总被引:2,自引:0,他引:2  
Data mining is an important real-life application for businesses. It is critical to find efficient ways of mining large data sets. In order to benefit from the experience with relational databases, a set-oriented approach to mining data is needed. In such an approach, the data mining operations are expressed in terms of relational or set-oriented operations. Query optimization technology can then be used for efficient processing.

In this paper, we describe set-oriented algorithms for mining association rules. Such algorithms imply performing multiple joins and thus may appear to be inherently less efficient than special-purpose algorithms. We develop new algorithms that can be expressed as SQL queries, and discuss optimization of these algorithms. After analytical evaluation, an algorithm named SETM emerges as the algorithm of choice. Algorithm SETM uses only simple database primitives, viz., sorting and merge-scan join. Algorithm SETM is simple, fast, and stable over the range of parameter values. It is easily parallelized and we suggest several additional optimizations. The set-oriented nature of Algorithm SETM makes it possible to develop extensions easily and its performance makes it feasible to build interactive data mining tools for large databases.  相似文献   


4.
Based on the approach implementing a deductive object-oriented database system through the underlying relational database,this paper presents and object reasoning language O-Datalog,which is the extension of Datalog in form and can deal with object-oriented data.For any O-Datalog program,an dquivalent Datalog program can be built to help evaluate the original program.This paper focuses on the syntax,semantics and evaluation of O-Datalog.  相似文献   

5.
Very little research in knowledge discovery has studied how to incorporate statistical methods to automate linear correlation discovery (LCD). We present an automatic LCD methodology that adopts statistical measurement functions to discover correlations from databases’ attributes. Our methodology automatically pairs attribute groups having potential linear correlations, measures the linear correlation of each pair of attribute groups, and confirms the discovered correlation. The methodology is evaluated in two sets of experiments. The results demonstrate the methodology’s ability to facilitate linear correlation discovery for databases with a large amount of data.  相似文献   

6.
Incorporating constraints into frequent itemset mining not only improves data mining efficiency, but also leads to concise and meaningful results. In this paper, a framework for closed constrained gradient itemset mining in retail databases is proposed by introducing the concept of gradient constraint into closed itemset mining. A tailored version of CLOSET+, LCLOSET, is first briefly introduced, which is designed for efficient closed itemset mining from sparse databases. Then, a newly proposed weaker but antimonotone measure, top-X average measure, is proposed and can be adopted to prune search space effectively. Experiments show that a combination of LCLOSET and the top-X average pruning provides an efficient approach to mining frequent closed gradient itemsets.  相似文献   

7.
Visualization techniques for mining large databases: a comparison   总被引:9,自引:0,他引:9  
Visual data mining techniques have proven to be of high value in exploratory data analysis, and they also have a high potential for mining large databases. In this article, we describe and evaluate a new visualization-based approach to mining large databases. The basic idea of our visual data mining techniques is to represent as many data items as possible on the screen at the same time by mapping each data value to a pixel of the screen and arranging the pixels adequately. The major goal of this article is to evaluate our visual data mining techniques and to compare them to other well-known visualization techniques for multidimensional data: the parallel coordinate and stick-figure visualization techniques. For the evaluation of visual data mining techniques, the perception of data properties counts most, while the CPU time and the number of secondary storage accesses are only of secondary importance. In addition to testing the visualization techniques using real data, we developed a testing environment for database visualizations similar to the benchmark approach used for comparing the performance of database systems. The testing environment allows the generation of test data sets with predefined data characteristics which are important for comparing the perceptual abilities of visual data mining techniques  相似文献   

8.
Temporally uncertain data widely exist in many real-world applications. Temporal uncertainty can be caused by various reasons such as conflicting or missing event timestamps, network latency, granularity mismatch, synchronization problems, device precision limitations, data aggregation. In this paper, we propose an efficient algorithm to mine sequential patterns from data with temporal uncertainty. We propose an uncertain model in which timestamps are modeled by random variables and then design a new approach to manage temporal uncertainty. We integrate it into the pattern-growth sequential pattern mining algorithm to discover probabilistic frequent sequential patterns. Extensive experiments on both synthetic and real datasets prove that the proposed algorithm is both efficient and scalable.  相似文献   

9.
The goal of analyzing a time series database is to find whether and how frequent a periodic pattern is repeated within the series. Periodic pattern mining is the problem that regards temporal regularity. However, most of the existing algorithms have a major limitation in mining interesting patterns of users interest, that is, they can mine patterns of specific length with all the events sequentially one after another in exact positions within this pattern. Though there are certain scenarios where a pattern can be flexible, that is, it may be interesting and can be mined by neglecting any number of unimportant events in between important events with variable length of the pattern. Moreover, existing algorithms can detect only specific type of periodicity in various time series databases and require the interaction from user to determine periodicity. In this paper, we have proposed an algorithm for the periodic pattern mining in time series databases which does not rely on the user for the period value or period type of the pattern and can detect all types of periodic patterns at the same time, indeed these flexibilities are missing in existing algorithms. The proposed algorithm facilitates the user to generate different kinds of patterns by skipping intermediate events in a time series database and find out the periodicity of the patterns within the database. It is an improvement over the generating pattern using suffix tree, because suffix tree based algorithms have weakness in this particular area of pattern generation. Comparing with the existing algorithms, the proposed algorithm improves generating different kinds of interesting patterns and detects whether the generated pattern is periodic or not. We have tested the performance of our algorithm on both synthetic and real life data from different domains and found a large number of interesting event sequences which were missing in existing algorithms and the proposed algorithm was efficient enough in generating and detecting periodicity of flexible patterns on both types of data.  相似文献   

10.
首先比较了现有的两种挖掘方法,提出了一种改进技术.综合考虑例外的局部和全局兴趣度,剔除非真正有趣的局部例外;增加两种客观度量并按模式重要度排序.实验表明该方法不仅可以有效挖掘多数据库中例外模式,而且还大大减少了用户负担.  相似文献   

11.
针对PFIM算法中频繁概率计算方法的局限性,且挖掘时需要多次扫描数据库和生成大量候选集的不足,提出EPFIM(efficient probabilistic frequent itemset mining)算法。新提出的频繁概率计算方法能适应数据流等项集的概率发生变化时的情况;通过不确定数据库存储在概率矩阵中,以及利用项集的有序性和逐步删除无用事物来提高挖掘效率。理论分析和实验结果证明了EPFIM算法的性能更优。  相似文献   

12.
Pattern Analysis and Applications - Rare association rule mining is an imperative field of data mining that attempts to identify rare correlations among the items in a database. Although numerous...  相似文献   

13.
Frequent itemset mining (FIM) is a fundamental research topic, which consists of discovering useful and meaningful relationships between items in transaction databases. However, FIM suffers from two important limitations. First, it assumes that all items have the same importance. Second, it ignores the fact that data collected in a real-life environment is often inaccurate, imprecise, or incomplete. To address these issues and mine more useful and meaningful knowledge, the problems of weighted and uncertain itemset mining have been respectively proposed, where a user may respectively assign weights to items to specify their relative importance, and specify existential probabilities to represent uncertainty in transactions. However, no work has addressed both of these issues at the same time. In this paper, we address this important research problem by designing a new type of patterns named high expected weighted itemset (HEWI) and the HEWI-Uapriori algorithm to efficiently discover HEWIs. The HEWI-Uapriori finds HEWIs using an Apriori-like two-phase approach. The algorithm introduces a property named high upper-bound expected weighted downward closure (HUBEWDC) to early prune the search space and unpromising itemsets. Substantial experiments on real-life and synthetic datasets are conducted to evaluate the performance of the proposed algorithm in terms of runtime, memory consumption, and number of patterns found. Results show that the proposed algorithm has excellent performance and scalability compared with traditional methods for weighted-itemset mining and uncertain itemset mining.  相似文献   

14.
A computational approach is shown for unsupervised, reactive, database mining. This approach is dependent on soft computing techniques. Database mining seeks to discover noteworthy, unrecognized associations between database items. A novel approach is suggested for unsupervised search controlled by dissonance reduction. Both crisp and noncrisp data are subject to discovery. Another aspect of uncertainty is the metric that controls discovery. Issues involve: coherence measures, granularization, user intelligible results, unsupervised recognition of interesting results, and concept equivalent formation. © 1997 John Wiley & Sons, Inc.  相似文献   

15.
Data mining for case-based reasoning in high-dimensional biological domains   总被引:1,自引:0,他引:1  
Case-based reasoning (CBR) is a suitable paradigm for class discovery in molecular biology, where the rules that define the domain knowledge are difficult to obtain and the number and the complexity of the rules affecting the problem are too large for formal knowledge representation. To extend the capabilities of CBR, we propose the mixture of experts for case-based reasoning (MOE4CBR), a method that combines an ensemble of CBR classifiers with spectral clustering and logistic regression. Our approach not only achieves higher prediction accuracy, but also leads to the selection of a subset of features that have meaningful relationships with their class labels. We evaluate MOE4CBR by applying the method to a CBR system called TA3 - a computational framework for CBR systems. For two ovarian mass spectrometry data sets, the prediction accuracy improves from 80 percent to 93 percent and from 90 percent to 98.4 percent, respectively. We also apply the method to leukemia and lung microarray data sets with prediction accuracy improving from 65 percent to 74 percent and from 60 percent to 70 percent, respectively. Finally, we compare our list of discovered biomarkers with the lists of selected biomarkers from other studies for the mass spectrometry data sets.  相似文献   

16.
Mining periodic patterns in time series databases is an important data mining problem with many applications. Previous studies have considered synchronous periodic patterns where misaligned occurrences are not allowed. However, asynchronous periodic pattern mining has received less attention and only been discussed for a sequence of symbols where each time point contains one event. In this paper, we propose a more general model of asynchronous periodic patterns from a sequence of symbol sets where a time slot can contain multiple events. Three parameters min/spl I.bar/rep, max/spl I.bar/dis, and global/spl I.bar/rep are employed to specify the minimum number of repetitions required for a valid segment of nondisrupted pattern occurrences, the maximum allowed disturbance between two successive valid segments, and the total repetitions required for a valid sequence. A 4-phase algorithm is devised to discover periodic patterns from a time series database presented in vertical format. The experiments demonstrate good performance and scalability with large frequent patterns.  相似文献   

17.
A configuration management database (CMDB) can be considered to be a large graph representing the IT infrastructure entities and their interrelationships. Mining such graphs is challenging because they are large, complex, and multi-attributed and have many repeated labels. These characteristics pose challenges for graph mining algorithms, due to the increased cost of subgraph isomorphism (for support counting) and graph isomorphism (for eliminating duplicate patterns). The notion of pattern frequency or support is also more challenging in a single graph, since it has to be defined in terms of the number of its (potentially, exponentially many) embeddings. We present CMDB-Miner, a novel two-step method for mining infrastructure patterns from CMDB graphs. It first samples the set of maximal frequent patterns and then clusters them to extract the representative infrastructure patterns. We demonstrate the effectiveness of CMDB-Miner on real-world CMDB graphs, as well as synthetic graphs.  相似文献   

18.
19.
Metric databases are databases where a metric distance function is defined for pairs of database objects. In such databases, similarity queries in the form of range queries or k-nearest-neighbor queries are the most important query types. In traditional query processing, single queries are issued independently by different users. In many data mining applications, however, the database is typically explored by iteratively asking similarity queries for answers of previous similarity queries. We introduce a generic scheme for such data mining algorithms and we investigate two orthogonal approaches, reducing I/O cost as well as CPU cost, to speed-up the processing of multiple similarity queries. The proposed techniques apply to any type of similarity query and to an implementation based on an index or using a sequential scan. Parallelization yields an additional impressive speed-up. An extensive performance evaluation confirms the efficiency of our approach  相似文献   

20.
The task-oriented nature of data mining (DM) has already been dealt successfully with the employment of intelligent agent systems that distribute tasks, collaborate and synchronize in order to reach their ultimate goal, the extraction of knowledge. A number of sophisticated multi-agent systems (MAS) that perform DM have been developed, proving that agent technology can indeed be used in order to solve DM problems. Looking into the opposite direction though, knowledge extracted through DM has not yet been exploited on MASs. The inductive nature of DM imposes logic limitations and hinders the application of the extracted knowledge on such kind of deductive systems. This problem can be overcome, however, when certain conditions are satisfied a priori. In this paper, we present an approach that takes the relevant limitations and considerations into account and provides a gateway on the way DM techniques can be employed in order to augment agent intelligence. This work demonstrates how the extracted knowledge can be used for the formulation initially, and the improvement, in the long run, of agent reasoning.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号