Similar documents
 Found 10 similar documents; search time 156 ms
1.
A Taxonomy of Dirty Data   Total citations: 3, self-citations: 0, citations by others: 3
Today large corporations are constructing enterprise data warehouses from disparate data sources in order to run enterprise-wide data analysis applications, including decision support systems, multidimensional online analytical applications, data mining, and customer relationship management systems. A major problem that is only beginning to be recognized is that the data in data sources are often dirty. Broadly, dirty data include missing data, wrong data, and non-standard representations of the same data. The results of analyzing a database/data warehouse of dirty data can be damaging and at best be unreliable. In this paper, a comprehensive classification of dirty data is developed for use as a framework for understanding how dirty data arise, manifest themselves, and may be cleansed to ensure proper construction of data warehouses and accurate data analysis. The impact of dirty data on data mining is also explored.

2.
Exploratory data mining and analysis requires a computing environment which provides facilities for the user-friendly expression and rapid execution of scientific queries. In this paper, we address research issues in the parallelization of scientific queries containing complex user-defined operations. In a parallel query execution environment, parallelizing a query execution plan involves determining how input data streams to evaluators implementing logical operations can be divided to be processed by clones of the same evaluator in parallel. We introduced the concept of a relevance window that characterizes data lineage and data partitioning opportunities available for a user-defined evaluator. In addition, we developed a query parallelization framework by extending relational parallel query optimization algorithms to allow the parallelization characteristics of user-defined evaluators to guide the process of query parallelization in an extensible query processing environment. We demonstrated the utility of our system by performing experiments mining cyclonic activity, blocking events, and the upward wave-energy propagation features from several observational and model simulation datasets.

3.
Bursty and Hierarchical Structure in Streams   Total citations: 10, self-citations: 1, citations by others: 9
A fundamental problem in text data mining is to extract meaningful structure from document streams that arrive continuously over time. E-mail and news articles are two natural examples of such streams, each characterized by topics that appear, grow in intensity for a period of time, and then fade away. The published literature in a particular research field can be seen to exhibit similar phenomena over a much longer time scale. Underlying much of the text mining work in this area is the following intuitive premise: that the appearance of a topic in a document stream is signaled by a burst of activity, with certain features rising sharply in frequency as the topic emerges. The goal of the present work is to develop a formal approach for modeling such bursts, in such a way that they can be robustly and efficiently identified, and can provide an organizational framework for analyzing the underlying content. The approach is based on modeling the stream using an infinite-state automaton, in which bursts appear naturally as state transitions; it can be viewed as drawing an analogy with models from queueing theory for bursty network traffic. The resulting algorithms are highly efficient, and yield a nested representation of the set of bursts that imposes a hierarchical structure on the overall stream. Experiments with e-mail and research paper archives suggest that the resulting structures have a natural meaning in terms of the content that gave rise to them.

4.
Concept Decompositions for Large Sparse Text Data Using Clustering   Total citations: 27, self-citations: 0, citations by others: 27
Unlabeled document collections are becoming increasingly common and available; mining such data sets represents a major contemporary challenge. Using words as features, text documents are often represented as high-dimensional and sparse vectors: a few thousand dimensions and a sparsity of 95 to 99% is typical. In this paper, we study a certain spherical k-means algorithm for clustering such document vectors. The algorithm outputs k disjoint clusters each with a concept vector that is the centroid of the cluster normalized to have unit Euclidean norm. As our first contribution, we empirically demonstrate that, owing to the high-dimensionality and sparsity of the text data, the clusters produced by the algorithm have a certain fractal-like and self-similar behavior. As our second contribution, we introduce concept decompositions to approximate the matrix of document vectors; these decompositions are obtained by taking the least-squares approximation onto the linear subspace spanned by all the concept vectors. We empirically establish that the approximation errors of the concept decompositions are close to the best possible, namely, to truncated singular value decompositions. As our third contribution, we show that the concept vectors are localized in the word space, are sparse, and tend towards orthonormality. In contrast, the singular vectors are global in the word space and are dense. Nonetheless, we observe the surprising fact that the linear subspaces spanned by the concept vectors and the leading singular vectors are quite close in the sense of small principal angles between them. In conclusion, the concept vectors produced by the spherical k-means algorithm constitute a powerful sparse and localized basis for text data sets.
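
The clustering loop described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the choice of seeding the concept vectors with the first k documents, and the fixed iteration count are all assumptions made for illustration. It shows only the two steps the abstract names, namely assigning each unit-length document vector to its most similar concept vector and recomputing each concept vector as the cluster centroid normalized to unit Euclidean norm.

```python
import math

def normalize(v):
    """Scale v to unit Euclidean norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def dot(a, b):
    """Dot product; equals cosine similarity when both inputs have unit norm."""
    return sum(x * y for x, y in zip(a, b))

def spherical_kmeans(docs, k, iters=20):
    """Cluster docs by cosine similarity; returns (labels, concept_vectors)."""
    docs = [normalize(d) for d in docs]
    # Assumption: seed the concept vectors with the first k documents.
    concepts = [list(d) for d in docs[:k]]
    labels = [0] * len(docs)
    for _ in range(iters):
        # Assignment step: each document goes to its most similar concept vector.
        labels = [max(range(k), key=lambda j: dot(d, concepts[j])) for d in docs]
        # Update step: concept vector = cluster centroid, renormalized to unit length.
        for j in range(k):
            members = [d for d, lab in zip(docs, labels) if lab == j]
            if members:
                centroid = [sum(col) / len(members) for col in zip(*members)]
                concepts[j] = normalize(centroid)
    return labels, concepts
```

A call such as `spherical_kmeans([[1, 0, 0], [0, 1, 0], [2, 0.1, 0], [0, 3, 0.2]], k=2)` groups the two vectors near the first axis together and the two near the second axis together. Real document vectors are sparse and high-dimensional, so a practical version would use a sparse matrix representation rather than Python lists.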

5.
The proliferation of large masses of data has created many new opportunities for those working in science, engineering and business. The field of data mining (DM) and knowledge discovery from databases (KDD) has emerged as a new discipline in engineering and computer science. In the modern sense of DM and KDD the focus tends to be on extracting information characterized as knowledge from data that can be very complex and in large quantities. Industrial engineering, with the diverse areas it comprises, presents unique opportunities for the application of DM and KDD, and for the development of new concepts and techniques in this field. Many industrial processes are now automated and computerized in order to ensure the quality of production and to minimize production costs. A computerized process records large masses of data during its functioning. This real-time data, which is recorded to ensure the ability to trace production steps, can also be used to optimize the process itself. A French truck manufacturer decided to exploit the data sets of measures recorded during the test of diesel engines manufactured on their production lines. The goal was to discover knowledge in the data of the test engine process in order to significantly reduce (by about 25%) the processing time. This paper presents the study of knowledge discovery utilizing the KDD method. All the steps of the method have been used and two additional steps have been needed. The study allowed us to develop two systems: the discovery application is implemented giving a real-time prediction model (with a real reduction of 28%) and the discovery support environment now allows those who are not experts in statistics to extract their own knowledge for other processes.

6.
7.
Beyond Market Baskets: Generalizing Association Rules to Dependence Rules   Total citations: 11, self-citations: 2, citations by others: 9
One of the more well-studied problems in data mining is the search for association rules in market basket data. Association rules are intended to identify patterns of the type: A customer purchasing item A often also purchases item B. Motivated partly by the goal of generalizing beyond market basket data and partly by the goal of ironing out some problems in the definition of association rules, we develop the notion of dependence rules that identify statistical dependence in both the presence and absence of items in itemsets. We propose measuring significance of dependence via the chi-squared test for independence from classical statistics. This leads to a measure that is upward-closed in the itemset lattice, enabling us to reduce the mining problem to the search for a border between dependent and independent itemsets in the lattice. We develop pruning strategies based on the closure property and thereby devise an efficient algorithm for discovering dependence rules. We demonstrate our algorithm's effectiveness by testing it on census data, text data (wherein we seek term dependence), and synthetic data.
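
As a rough illustration of the chi-squared test for independence mentioned in this abstract, the sketch below computes the statistic for a single pair of items from a 2x2 presence/absence table. It is not the paper's mining algorithm: the function name `chi_squared_2x2` and the representation of baskets as Python sets are assumptions, and the lattice search and pruning strategies are not shown.

```python
def chi_squared_2x2(baskets, a, b):
    """Chi-squared statistic for independence of items a and b over baskets."""
    n = len(baskets)
    # Observed counts for the four presence/absence combinations of (a, b).
    observed = {(x, y): 0 for x in (True, False) for y in (True, False)}
    for basket in baskets:
        observed[(a in basket, b in basket)] += 1
    count_a = observed[(True, True)] + observed[(True, False)]
    count_b = observed[(True, True)] + observed[(False, True)]
    stat = 0.0
    for (has_a, has_b), obs in observed.items():
        # Expected count for this cell if a and b were independent.
        marginal_a = count_a if has_a else n - count_a
        marginal_b = count_b if has_b else n - count_b
        expected = marginal_a * marginal_b / n
        if expected > 0:
            stat += (obs - expected) ** 2 / expected
    return stat
```

With one degree of freedom, a statistic above roughly 3.84 rejects independence at the 95% level; using that particular cutoff here is an assumption of this sketch. Because absence counts enter the table alongside presence counts, the statistic reflects dependence in both directions, which is the property the abstract emphasizes.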

8.
Incremental Scheduling of Mixed Workloads in Multimedia Information Servers   Total citations: 2, self-citations: 0, citations by others: 2
In contrast to pure video servers, advanced multimedia applications such as digital libraries or teleteaching exhibit a mixed workload with massive access to conventional, discrete data such as text documents, images and indexes as well as requests for continuous data, like video and audio data. In addition to the service quality guarantees for continuous data requests, quality-conscious applications require that the response time of the discrete data requests stay below some user-tolerance threshold. In this paper, we study the impact of different disk scheduling policies on the service quality for both continuous and discrete data. We provide a framework for describing various policies in terms of a few parameters, and we develop a novel policy that is experimentally shown to outperform all other policies.

9.
This paper proposes the use of accessible information (data/knowledge) to infer inaccessible data in a distributed database system. Inference rules are extracted from databases by means of knowledge discovery techniques. These rules can derive inaccessible data due to a site failure or network partition in a distributed system. Such query answering requires combining incomplete and partial information from multiple sources. The derived answer may be exact or approximate. Our inference process involves two phases to reason with reconstructed information. One phase involves using local rules to infer inaccessible data. A second phase involves merging information from different sites. We shall call such reasoning processes cooperative data inference. Since the derived answer may be incomplete, new algebraic tools are developed for supporting operations on incomplete information. A weak criterion called toleration is introduced for evaluating the inferred results. The conditions that assure the correctness of combining partial results, known as sound inference paths, are developed. A solution is presented for terminating an iterative reasoning process on derived data from multiple knowledge sources. The proposed approach has been implemented on a cooperative distributed database testbed, CoBase, at UCLA. 
The experimental results validate the feasibility of the proposed concept and indicate that it can significantly improve the availability of distributed knowledge base/database systems.
List of notation: Mapping; --< Logical implication; = Symbolic equality; ==< Inference path; Satisfaction; Toleration; Undefined (does not exist); Variable-null (may or may not exist); * Subtuple relationship; * s-membership; s-containment; Open subtuple; Open s-membership; Open s-containment; P Open base; P Program; I Interpretation; DIP Data inference program; t Tuples; R Relations; Ø Empty interpretation; Open s-union; Open s-interpretation; Set of mapping from the set of objects to the set of closed objects; W Set of attributes; W Set of sound inference paths on the set of attributes W; Set of relational schemas in a DB that satisfy MVD; + Range closure of W wrt

10.
Regions-of-Interest and Spatial Layout for Content-Based Image Retrieval   Total citations: 1, self-citations: 0, citations by others: 1
To date, most content-based image retrieval (CBIR) techniques rely on global attributes such as color or texture histograms which tend to ignore the spatial composition of the image. In this paper, we present an alternative image retrieval system based on the principle that it is the user who is most qualified to specify the query content and not the computer. With our system, the user can select multiple regions-of-interest and can specify the relevance of their spatial layout in the retrieval process. We also derive similarity bounds on histogram distances for pruning the database search. This experimental system was found to be superior to global indexing techniques as measured by statistical sampling of multiple users' satisfaction ratings.
