20 similar documents found
1.
Floriana Esposito, Donato Malerba, Vincenza Ripa, Giovanni Semeraro 《Applied Artificial Intelligence》2013,27(1):71-84
This article explores the combined application of inductive learning algorithms and causal inference techniques to the problem of discovering causal rules among the attributes of a relational database. Given some relational data, each field can be considered a random variable, and a hybrid graph can be built by detecting conditional independencies among the variables. The induced graph represents genuine and potential causal relations as well as spurious associations. When the variables are discrete, or have been discretized to test conditional independencies, supervised induction algorithms can be used to learn causal rules, that is, conditional statements in which causes appear as antecedents and effects as consequents. The approach is illustrated by means of experiments conducted on different data sets.
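As a concrete illustration, the sketch below performs only the first, unconditional step of such a constraint-based search: it starts from a fully connected graph over the fields and drops edges between pairs that pass an independence test. The toy data, the chi-square test, and the 0.05 threshold are illustrative assumptions; the article's procedure additionally tests conditional independencies.

```python
# A minimal sketch, not the authors' algorithm: drop edges between
# fields that pass an (unconditional) chi-square independence test.
from itertools import combinations

import pandas as pd
from scipy.stats import chi2_contingency

def independent(df, x, y, alpha=0.05):
    """Chi-square test of independence between two discrete columns."""
    table = pd.crosstab(df[x], df[y])
    return chi2_contingency(table)[1] > alpha  # [1] is the p-value

def skeleton(df):
    """Start fully connected; keep only edges between dependent fields."""
    return {e for e in combinations(df.columns, 2)
            if not independent(df, *e)}

df = pd.DataFrame({  # hypothetical discretized fields
    "smoker": [1, 1, 0, 0, 1, 0, 1, 0] * 10,
    "tar":    [1, 1, 0, 0, 1, 1, 1, 0] * 10,
    "cough":  [1, 1, 0, 0, 1, 1, 0, 0] * 10,
})
print(skeleton(df))  # surviving edges are candidate causal relations
```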
2.
The discovery of dependencies between attributes in databases is an important problem in data mining and can be applied to facilitate future decision making. In the present paper some properties of branching dependencies are examined. We define a minimal branching dependency and propose an algorithm for finding all minimal branching dependencies between a given set of attributes and a given attribute in a relation of a database. Our examination of branching dependencies is motivated by their application in a database storing realized sales of products. For example, finding out that any p products together have attracted at most q new users can prove crucial in supporting decision making. In addition, we consider the fractional and the fractional branching dependencies and examine some of their properties. An algorithm for finding all fractional dependencies between a given set of attributes and a given attribute in a database relation is proposed. We examine the general case of an arbitrary relation, as well as a particular case in which the problem of discovering the fractional dependencies is considerably simplified.
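One plausible reading of such a dependency can be checked by brute force, as in the sketch below: a (p, q) constraint holds if every combination of p products has attracted at most q users in total. The tiny sales relation and this formalization are illustrative assumptions, not the paper's definitions.

```python
# Sketch: brute-force check of a (p, q) branching dependency in a sales
# relation, read as "any p products together attract at most q users".
from itertools import combinations

sales = [  # (product, user) pairs; hypothetical data
    ("pen", "ann"), ("pen", "bob"),
    ("ink", "bob"), ("ink", "cat"),
    ("pad", "ann"), ("pad", "cat"),
]

def users_of(product):
    return {u for prod, u in sales if prod == product}

def branching_holds(p, q):
    products = {prod for prod, _ in sales}
    return all(
        len(set().union(*(users_of(prod) for prod in combo))) <= q
        for combo in combinations(products, p)
    )

print(branching_holds(2, 3))  # True: any 2 products attract at most 3 users
print(branching_holds(2, 2))  # False: pen+ink attract ann, bob, and cat
```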
3.
Sequential pattern mining is one of the most important data mining techniques. Previous research on mining sequential patterns discovered patterns from point-based event data, interval-based event data, and hybrid event data. In many real-life applications, however, an event may involve many statuses; it might not occur only at a certain point in time or over a single period of time. In this work, we propose a generalized representation of temporal events. We treat events as multi-label events with many statuses, and introduce an algorithm called MLTPM to discover multi-label temporal patterns from temporal databases. The experimental results show that the efficiency and scalability of the MLTPM algorithm are satisfactory. We also discuss interesting multi-label temporal patterns discovered when MLTPM was applied to historical Nasdaq data.
4.
Applied Intelligence - Since periodic events are common everywhere, periodic pattern mining is increasingly important in today's data mining domain. However, there is currently no...
5.
Yen-Liang Chen, Tony Cheng-Kui Huang 《IEEE transactions on systems, man, and cybernetics. Part B, Cybernetics》2005,35(5):959-972
Given a sequence database and a minimum support threshold, the task of sequential pattern mining is to discover the complete set of sequential patterns in databases. From the discovered sequential patterns, we can know what items are frequently bought together and in what order they appear. However, they cannot tell us the time gaps between successive items in patterns. Accordingly, Chen et al. have proposed a generalization of sequential patterns, called time-interval sequential patterns, which reveals not only the order of items but also the time intervals between successive items. An example of a time-interval sequential pattern has the form (A, I2, B, I1, C), meaning that we buy A first, then after an interval of I2 we buy B, and finally after an interval of I1 we buy C, where I2 and I1 are predetermined time ranges. Although this new type of pattern can alleviate the above concern, it causes the sharp boundary problem. That is, when a time interval is near the boundary of two predetermined time ranges, we either ignore or overemphasize it. Therefore, this paper uses the concept of fuzzy sets to extend the original research so that fuzzy time-interval sequential patterns are discovered from databases. Two efficient algorithms, the fuzzy time interval (FTI)-Apriori algorithm and the FTI-PrefixSpan algorithm, are developed for mining fuzzy time-interval sequential patterns. In our simulation results, we find that the second algorithm outperforms the first, not only in computing time but also in scalability with respect to various parameters.
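The sharp boundary problem, and the fuzzy remedy, can be illustrated with triangular membership functions over the gap between successive purchases. The sketch below is a hypothetical illustration; the range definitions are not taken from the paper.

```python
# Sketch: a 6.9-day gap falls crisply in range I1 = [0, 7) even though it
# almost reaches I2 = [7, 14); fuzzy membership lets it belong to both.
def triangular(x, a, b, c):
    """Membership of x in a triangle peaking at b, zero outside (a, c)."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def memberships(gap_days):  # triangles centered on the range midpoints
    return {
        "I1": triangular(gap_days, -3.5, 3.5, 10.5),
        "I2": triangular(gap_days, 3.5, 10.5, 17.5),
    }

print(memberships(6.9))  # roughly {'I1': 0.51, 'I2': 0.49}: a split vote
```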
6.
Nguyen Xuan Vinh, Jeffrey Chan, Simone Romano, James Bailey, Christopher Leckie, Kotagiri Ramamohanarao, Jian Pei 《Data mining and knowledge discovery》2016,30(6):1520-1555
We address the problem of outlying aspects mining: given a query object and a reference multidimensional data set, how can we discover what aspects (i.e., subsets of features or subspaces) make the query object most outlying? Outlying aspects mining can be used to explain any data point of interest, which itself might be an inlier or an outlier. In this paper, we investigate several open challenges faced by existing outlying aspects mining techniques and propose novel solutions, including (a) how to design effective scoring functions that are unbiased with respect to dimensionality and yet computationally efficient, and (b) how to efficiently search through the exponentially large space of all possible subspaces. We formalize the concept of dimensionality unbiasedness, a desirable property of outlyingness measures. We then characterize existing scoring measures, as well as our novel proposed ones, in terms of efficiency, dimensionality unbiasedness, and interpretability. Finally, we evaluate the effectiveness of different methods for outlying aspects discovery and demonstrate the utility of our proposed approach on both large real and synthetic data sets.
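A rank-based score is one simple way to make outlyingness comparable across dimensionalities, since a rank does not shrink as dimensions are added the way raw density does. The sketch below enumerates one- and two-dimensional subspaces and ranks the query's kernel density in each; the data, the rank score, and the exhaustive enumeration are illustrative assumptions rather than the paper's measures or search strategy.

```python
# Sketch: score the query in each small subspace by its density *rank*
# (1 = sparsest point), a dimensionality-comparable stand-in score.
from itertools import combinations

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 4))
data[:, 3] = data[:, 2] + 0.1 * rng.normal(size=200)  # dims 2, 3 correlated
data[0, 2], data[0, 3] = 2.0, -2.0  # the query breaks that correlation
query = data[0]

def density_rank(dims):
    """Query's kernel-density rank among all points (1 = most outlying)."""
    sub = data[:, list(dims)].T  # gaussian_kde expects shape (d, n)
    kde = gaussian_kde(sub)
    q = kde(query[list(dims)].reshape(-1, 1))[0]
    return 1 + int(np.sum(kde(sub) < q))

subspaces = [d for k in (1, 2) for d in combinations(range(4), k)]
print(min(subspaces, key=density_rank))  # expected: (2, 3), the outlying aspect
```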
7.
Mining very large databases
Established companies have had decades to accumulate masses of data about their customers, suppliers, products and services, and employees. Data mining, also known as knowledge discovery in databases, gives organizations the tools to sift through these vast data stores to find the trends, patterns, and correlations that can guide strategic decision making. Traditionally, algorithms for data analysis assume that the input data contains relatively few records. Current databases, however, are much too large to be held in main memory. To be efficient, the data mining techniques applied to very large databases must be highly scalable. An algorithm is said to be scalable if, given a fixed amount of main memory, its runtime increases linearly with the number of records in the input database. Recent work has focused on scaling data mining algorithms to very large data sets. The authors describe a broad range of algorithms that address three classical data mining problems: market basket analysis, clustering, and classification.
8.
Query-by-example and query-by-keyword both suffer from the problem of “aliasing,” meaning that example-images and keywords potentially have variable interpretations or multiple semantics. For discerning which semantic is appropriate for a given query, we have established that combining active learning with kernel methods is a very effective approach. In this work, we first examine active-learning strategies, and then focus on addressing the challenges of two scalability issues: scalability in concept complexity and in dataset size. We present remedies, explain limitations, and discuss future directions that research might take.
9.
Mining constrained gradients in large databases
Dong, G., Han, J., Lam, J.W.M., Pei, J., Wang, K., Zou, W. 《Knowledge and Data Engineering, IEEE Transactions on》2004,16(8):922-938
Many data analysis tasks can be viewed as search or mining in a multidimensional space (MDS). In such MDSs, dimensions capture potentially important factors for given applications, and cells represent combinations of values for the factors. To systematically analyze data in an MDS, an interesting notion called "cubegrade" was recently introduced by Imielinski et al. [2002], which focuses on notable changes in measures in the MDS by comparing a cell (which we refer to as the probe cell) with its gradient cells, namely its ancestors, descendants, and siblings. We call such queries gradient analysis queries (GQs). Since an MDS can contain billions of cells, it is important to answer GQs efficiently. We focus on developing efficient methods for mining GQs constrained by certain (weakly) antimonotone constraints. Instead of conducting an independent gradient-cell search once per probe cell, which is inefficient due to much repeated work, we propose an efficient algorithm, LiveSet-Driven. This algorithm finds all good gradient-probe cell pairs in one search pass. It utilizes measure-value analysis and dimension-match analysis in a set-oriented manner to achieve bidirectional pruning between the sets of hopeful probe cells and hopeful gradient cells. Moreover, it adopts a hypertree structure and an H-cubing method to compress data and to maximize sharing of computation. Our performance study shows that this algorithm is efficient and scalable. In addition to data cubes, we extend our study to another important scenario: mining constrained gradients in transactional databases where each item is associated with some measures such as price. Such transactional databases can be viewed as sparse MDSs where items represent dimensions, although their characteristics differ significantly from those of data cubes. We outline efficient mining methods for this problem.
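The cubegrade idea can be seen in miniature by aggregating a few rows into a cube and comparing a probe cell's measure against its more general ancestor cells, as in the sketch below. The rows, the average-sale measure, and the 1.5x gradient threshold are illustrative assumptions, and the sketch performs none of the paper's pruning.

```python
# Sketch: compare a probe cell's measure with its ancestor (more general)
# cells in a tiny data cube and flag notable changes ("cubegrades").
from itertools import product

rows = [  # (city, product, avg_sale); hypothetical data
    ("NY", "pen", 10.0), ("NY", "ink", 30.0),
    ("LA", "pen", 12.0), ("LA", "ink", 14.0),
]

def cube():
    """Average avg_sale over every combination of kept/generalized dims."""
    cells = {}
    for city, prod, sale in rows:
        for c in product((city, "*"), (prod, "*")):
            cells.setdefault(c, []).append(sale)
    return {c: sum(v) / len(v) for c, v in cells.items()}

def ancestors(cell):
    return [a for a in product((cell[0], "*"), (cell[1], "*")) if a != cell]

cells = cube()
probe = ("NY", "ink")
for anc in ancestors(probe):
    ratio = cells[probe] / cells[anc]
    if ratio >= 1.5:  # gradient constraint: a notable increase
        print(probe, "vs", anc, f"ratio={ratio:.2f}")
```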
10.
Peng Yan 《Information Sciences》2005,173(4):319-336
This paper extends the work on discovering fuzzy association rules with degrees of support and implication (ARsi). The effort is twofold: one is to discover ARsi with hierarchy, so as to express more semantics, since hierarchical relationships usually exist among the fuzzy sets associated with the attribute concerned; the other is to generate a “core” set of rules, namely the rule cover set, which is of more interest in the sense that all other rules can be derived from it. Corresponding algorithms for ARsi with hierarchy and for the cover set are proposed, with pruning strategies incorporated to improve computational efficiency. Data experiments are conducted as well to show the effectiveness of the approach.
11.
Mining multiple-level association rules in large databases
Jiawei Han, Yongjian Fu 《Knowledge and Data Engineering, IEEE Transactions on》1999,11(5):798-805
A top-down progressive deepening method is developed for efficient mining of multiple-level association rules from large transaction databases, based on the Apriori principle. A group of variant algorithms is proposed based on the ways of sharing intermediate results, and their relative performance is tested and analyzed. The enforcement of different interestingness measurements to find more interesting rules, and the relaxation of rule conditions for finding “level-crossing” association rules, are also investigated. The study shows that efficient algorithms can be developed for the discovery of interesting and strong multiple-level association rules from large databases.
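The top-down deepening idea is that item categories are counted first, and only the children of frequent categories are examined at the next level, each level with its own support threshold. The sketch below is a minimal illustration under assumed transactions, a two-level hierarchy, and made-up thresholds, not the paper's algorithms.

```python
# Sketch: level-wise mining over a concept hierarchy; leaves are counted
# only under parents that are already frequent (Apriori-style pruning).
transactions = [
    {"2%-milk", "white-bread"}, {"2%-milk", "wheat-bread"},
    {"skim-milk", "white-bread"}, {"2%-milk", "white-bread"},
]
hierarchy = {  # leaf item -> top-level category
    "2%-milk": "milk", "skim-milk": "milk",
    "white-bread": "bread", "wheat-bread": "bread",
}

def support(item, level):
    """Count transactions containing the item at the given level."""
    key = (lambda i: hierarchy[i]) if level == 1 else (lambda i: i)
    return sum(any(key(i) == item for i in t) for t in transactions)

min_sup = {1: 3, 2: 2}  # per-level minimum support thresholds
frequent_top = [c for c in set(hierarchy.values())
                if support(c, 1) >= min_sup[1]]
frequent_leaf = [i for i in hierarchy  # descend only under frequent parents
                 if hierarchy[i] in frequent_top and support(i, 2) >= min_sup[2]]
print(sorted(frequent_top), sorted(frequent_leaf))
```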
12.
Structural equation models with latent variables are used widely in psychometrics, econometrics, and sociology to explore the causal relations among latent variables. Since such models often involve dozens of variables, the number of theoretically feasible alternatives can be astronomical. Without computational aids with which to search such a space, researchers can only explore a handful of alternative models. We describe a procedure that can find information about the causal structure among latent, or unmeasured, variables. The procedure is asymptotically reliable, feasible on data sets with as many as a hundred variables, and has already proved useful in modeling an empirical data set collected by the U.S. Navy. © 1992 John Wiley & Sons, Inc.
13.
In this paper, we present an innovative system, coined DISTROD (a.k.a. DISTRibuted Outlier Detector), for detecting outliers, namely abnormal instances or observations, from multiple large distributed databases. DISTROD is able to effectively detect so-called global outliers from distributed databases, consistent with those produced by the centralized detection paradigm. DISTROD is equipped with a number of optimization/boosting strategies that empower it to significantly enhance its speed performance and reduce its communication overhead. Experimental evaluation demonstrates the good performance of DISTROD in terms of speed and communication overhead.
14.
International Journal of Parallel Programming - A time decomposition technique is suggested for large-database (DB) models. The problem of network aggregation is studied and the results used to...
15.
《Pattern recognition》2014,47(2):588-602
Fingerprint matching has emerged as an effective tool for human recognition due to the uniqueness, universality and invariability of fingerprints. Many different approaches have been proposed in the literature to determine faithfully whether two fingerprint images belong to the same person. Among them, minutiae-based matchers stand out as the most relevant techniques because of their discriminative capabilities, providing precise results. However, performing a fingerprint identification over a large database can be an inefficient task due to the lack of scalability and high computing times of fingerprint matching algorithms. In this paper, we propose a distributed framework for fingerprint matching to tackle large databases in a reasonable time. It provides a general scheme for any kind of matcher, so that its precision is preserved and its time of response can be reduced. To test the proposed system, we conduct an extensive study that involves both synthetic and captured fingerprint databases, which have different characteristics, analyzing the performance of three well-known minutiae-based matchers within the designed framework. With the available hardware resources, our distributed model is able to address up to 400 000 fingerprints in approximately half a second. Additional details are provided at http://sci2s.ugr.es/ParallelMatching.
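The general scheme, partition the template database, run the plugged-in matcher on each partition in parallel, and merge the per-partition winners, can be sketched as below. The toy one-dimensional "matcher", the process pool, and the partition count are illustrative assumptions; the paper's framework targets real minutiae-based matchers on distributed hardware.

```python
# Sketch: partition-and-merge identification with a pluggable matcher.
from concurrent.futures import ProcessPoolExecutor

def matcher_score(query, template):
    """Stand-in for a minutiae-based matcher; higher is more similar."""
    return -abs(query - template)

def best_match(args):
    """Run the matcher over one partition; keep the best (score, id)."""
    query, partition = args
    return max((matcher_score(query, t), tid) for tid, t in partition)

def identify(query, database, n_parts=4):
    parts = [database[i::n_parts] for i in range(n_parts)]
    with ProcessPoolExecutor(max_workers=n_parts) as pool:
        results = pool.map(best_match, [(query, p) for p in parts])
    return max(results)  # merge step: the matcher's precision is preserved

if __name__ == "__main__":
    db = list(enumerate(range(400_000)))  # (template_id, template) pairs
    print(identify(123_456.2, db))        # best score and its template id
```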
16.
Nowadays, most fingerprint sensors capture partial fingerprint images. Identification of incomplete, fragmentary, or partial fingerprints in large databases is an attractive research topic and remains an important and challenging problem, since conventional fingerprint identification systems are not capable of providing convincing results on such inputs. To overcome this problem, we need a fast and accurate identification strategy. In this context, fingerprint indexing is commonly used to speed up the identification process. This paper proposes a robust and fast identification system that combines two indexing algorithms. One of the indexing algorithms uses minutiae triplets, and the other uses the orientation field (OF) to index and retrieve fingerprints. Furthermore, the proposal applies partial fingerprint matching methods to the final candidate list obtained from the indexing stage. The proposal is evaluated over two National Institute of Standards and Technology (NIST) datasets and four Fingerprint Verification Competition (FVC) datasets, leading to low identification times with no accuracy loss.
17.
The RAPID-1 (relational access processor for intelligent data), an associative accelerator that recognizes tuples and logical formulas, is presented. It evaluates logical formulas instantiated by the current tuple, or record, and operates on whole relations or on hashing buckets. RAPID-1 uses a reduced instruction set and hardwired control, and executes all comparisons in a bit-parallel mode. It speeds up the database by a significant factor and will adapt to future generations of microprocessors. The principal design issues, data structures, instruction set, architecture, environments and performance are discussed.
18.
The world is increasingly full of data. Organisations, governments and individuals are creating increasingly large data sources, and in many cases making them publicly available. This offers massive potential for interaction and mutual collaboration. But using this data often creates problems. Those creating the data will use their own terminology, structure and formats, meaning that data from one source will be incompatible with data from another source. When presented with a large, unknown data source, it is very difficult to ascribe meaning to the terms of that data source and to understand what is being conveyed. Much effort has been invested in data interpretation prior to run-time, with large data sources being matched against each other off-line. But data is often used dynamically, and so to maximise the value of the data it is necessary to extract meaning from it dynamically. We therefore postulate that an essential component of utilising the world of data in which we increasingly live is the development of the ability to discover meaning on the go in large, heterogeneous data. This paper provides an overview of the current state of the art, reviewing the aims and achievements in different fields which can be applied to this problem. We take a brief look at cutting-edge research in this field, summarising four papers published in the special issue of the AI Review on Discovering Meaning on the go in Large Heterogenous Data, and conclude with our thoughts about where research in this field is going, and what our priorities must be to enable us to move closer to achieving this goal.
19.
Databases for data mining often have missing values. Missing data are frequently mistreated in data mining, and valuable knowledge related to missing data is overlooked. This study discusses patterns of missing data in survey databases. It proposes a framework of rough set rule induction that enables the data miner to obtain association rules describing patterns of missing data in a survey database. Through an experiment on a real-world data set, we demonstrate the approach to discovering knowledge about missing data.
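The raw material for such rules is an indicator encoding of which answers are missing in each record, with support counted over the missingness patterns. The sketch below shows only that encoding and counting step under a hypothetical survey table; the paper's rough set induction itself is not reproduced.

```python
# Sketch: encode each survey row by which attributes are missing, then
# count pattern support as raw material for rules about missing data.
from collections import Counter
from itertools import combinations

rows = [  # (age, income, party) survey answers; None means missing
    (34, None, None), (41, 52_000, "A"), (29, None, None),
    (55, 61_000, None), (47, None, None),
]
names = ("age", "income", "party")

def missing_set(row):
    return frozenset(n for n, v in zip(names, row) if v is None)

patterns = Counter()
for row in rows:
    miss = sorted(missing_set(row))
    for k in range(1, len(miss) + 1):
        for sub in combinations(miss, k):
            patterns[sub] += 1

# Rule-style reading: income missing => party missing (confidence 3/3)
print(patterns[("income",)], patterns[("income", "party")])
```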
20.
Similarity matching in video databases is of growing importance in many new applications such as video clustering and digital video libraries. In order to provide efficient access to relevant data in large databases, there have been many research efforts in video indexing with diverse spatial and temporal features. However, most previous works relied on sequential matching methods or memory-based inverted file techniques, making them unsuitable for large video databases. To resolve this problem, this paper proposes an effective and scalable indexing technique using a trie, originally proposed for string matching, as an index structure. For building an index, we convert each frame into a symbol sequence using a window order heuristic and build a disk-resident trie from the set of symbol sequences. For query processing, we perform a depth-first traversal on the trie and execute a temporal segmentation. To verify the superiority of our approach, we perform several experiments with real and synthetic data sets. The results reveal that our approach consistently outperforms the sequential scan method, and the performance gain is maintained even with large video databases.
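The core data structure can be sketched with an in-memory trie over symbolized frames: each clip contributes one path, and a query is answered by walking its symbol sequence down the trie. The symbolization, the toy clips, and the in-memory (rather than disk-resident) structure are illustrative assumptions.

```python
# Sketch: trie index over symbol sequences; one path per clip, lookup by
# walking the query's symbols. "$ids" marks clip ids at a path's end.
def insert(trie, sequence, clip_id):
    node = trie
    for symbol in sequence:
        node = node.setdefault(symbol, {})
    node.setdefault("$ids", set()).add(clip_id)

def lookup(trie, sequence):
    node = trie
    for symbol in sequence:
        if symbol not in node:
            return set()
        node = node[symbol]
    return node.get("$ids", set())

trie = {}
clips = {"news": "ABAC", "sports": "ABBD", "promo": "ABAC"}
for cid, symbols in clips.items():
    insert(trie, symbols, cid)

print(lookup(trie, "ABAC"))  # {'news', 'promo'}: clips sharing the sequence
```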