Similar Documents
20 similar documents found (search time: 46 ms)
1.
A Taxonomy of Dirty Data   (cited: 3; self-citations: 0; citations by others: 3)
Today large corporations are constructing enterprise data warehouses from disparate data sources in order to run enterprise-wide data analysis applications, including decision support systems, multidimensional online analytical applications, data mining, and customer relationship management systems. A major problem that is only beginning to be recognized is that the data in these sources are often dirty. Broadly, dirty data include missing data, wrong data, and non-standard representations of the same data. The results of analyzing a database/data warehouse of dirty data can be damaging, or at best unreliable. In this paper, a comprehensive classification of dirty data is developed for use as a framework for understanding how dirty data arise, manifest themselves, and may be cleansed to ensure proper construction of data warehouses and accurate data analysis. The impact of dirty data on data mining is also explored.
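The three broad classes named above are concrete enough to check mechanically. Below is a minimal sketch, assuming hypothetical field names and validation rules (this is not the paper's taxonomy), of how each class might be flagged in a record:

```python
# Minimal sketch of the three broad classes of dirty data named in the
# abstract: missing data, wrong data, and non-standard representations.
# Field names, ranges, and the canonical map are hypothetical.

CANONICAL_COUNTRY = {"usa": "US", "u.s.a.": "US", "united states": "US"}

def classify_dirtiness(record: dict) -> list[str]:
    issues = []
    # 1. Missing data: a required field is absent or empty.
    if not record.get("customer_id"):
        issues.append("missing: customer_id")
    # 2. Wrong data: a value is present but violates a domain constraint.
    age = record.get("age")
    if age is not None and not (0 <= age <= 130):
        issues.append(f"wrong: age={age} out of valid range")
    # 3. Non-standard representation: same real-world value, different spelling.
    country = record.get("country", "")
    if country.lower() in CANONICAL_COUNTRY:
        issues.append(f"non-standard: '{country}' -> '{CANONICAL_COUNTRY[country.lower()]}'")
    return issues

print(classify_dirtiness({"customer_id": "", "age": 212, "country": "U.S.A."}))
```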

2.
This paper proposes the use of accessible information (data/knowledge) to infer inaccessible data in a distributed database system. Inference rules are extracted from databases by means of knowledge discovery techniques. These rules can derive data made inaccessible by a site failure or network partition in a distributed system. Such query answering requires combining incomplete and partial information from multiple sources, and the derived answer may be exact or approximate. Our inference process involves two phases: one phase uses local rules to infer inaccessible data; a second phase merges information from different sites. We call such reasoning processes cooperative data inference. Since the derived answer may be incomplete, new algebraic tools are developed for supporting operations on incomplete information. A weak criterion called toleration is introduced for evaluating the inferred results. The conditions that assure the correctness of combining partial results, known as sound inference paths, are developed. A solution is presented for terminating an iterative reasoning process on derived data from multiple knowledge sources. The proposed approach has been implemented on a cooperative distributed database testbed, CoBase, at UCLA. The experimental results validate the feasibility of the proposed concept and show that it can significantly improve the availability of distributed knowledge base/database systems.
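As a rough illustration of the two phases, here is a minimal sketch with hypothetical rules and attribute names; it is not CoBase's actual algorithm, and the toleration criterion for resolving conflicts is elided:

```python
# Phase 1: local rules of the form "if attr == val, other_attr is probably val2",
# as might be discovered by mining. Derived values are tagged approximate,
# mirroring the paper's exact-vs-approximate distinction.
RULES = [("dept", "radiology", "building", "B2")]

def infer_locally(tuple_: dict) -> dict:
    out = dict(tuple_)
    for attr, val, derived_attr, derived_val in RULES:
        if out.get(attr) == val and derived_attr not in out:
            out[derived_attr] = (derived_val, "approximate")
    return out

# Phase 2: merge partial tuples for the same key from reachable sites.
def merge(t1: dict, t2: dict) -> dict:
    merged = dict(t1)
    for k, v in t2.items():
        existing = merged.get(k)
        # An exact value overrides an approximate one; genuine conflicts
        # would need the paper's toleration criterion, elided here.
        if existing is None or (isinstance(existing, tuple) and not isinstance(v, tuple)):
            merged[k] = v
    return merged

site_a = infer_locally({"emp": "lee", "dept": "radiology"})
site_b = {"emp": "lee", "building": "B2"}  # exact value from another site
print(merge(site_a, site_b))
```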

3.
The proliferation of large masses of data has created many new opportunities for those working in science, engineering and business. The field of data mining (DM) and knowledge discovery from databases (KDD) has emerged as a new discipline in engineering and computer science. In the modern sense of DM and KDD, the focus tends to be on extracting information, characterized as knowledge, from data that can be very complex and in large quantities. Industrial engineering, with the diverse areas it comprises, presents unique opportunities for the application of DM and KDD, and for the development of new concepts and techniques in this field. Many industrial processes are now automated and computerized in order to ensure the quality of production and to minimize production costs. A computerized process records large masses of data during its operation. These real-time data, recorded to ensure the traceability of production steps, can also be used to optimize the process itself. A French truck manufacturer decided to exploit the measurement data sets recorded during tests of diesel engines manufactured on its production lines. The goal was to discover knowledge in the engine-test process data in order to significantly reduce (by about 25%) the processing time. This paper presents the study of knowledge discovery using the KDD method. All the steps of the method have been used, and two additional steps were needed. The study allowed us to develop two systems: the discovery application is implemented as a real-time prediction model (with a real reduction of 28%), and the discovery support environment now allows those who are not experts in statistics to extract their own knowledge for other processes.
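The listing does not specify the form of the real-time prediction model; a minimal sketch of the general idea, with made-up measurements and a plain least-squares model standing in for whatever the study actually used, might look like this:

```python
# Hedged sketch: predict the final test result from measurements recorded
# early in an engine test, so the test can be cut short. Features, model
# form, and data are hypothetical, not from the paper.
import numpy as np

# Rows: engines; columns: measurements from the first minutes of the test.
X_early = np.array([[1.0, 0.8], [1.2, 0.9], [0.9, 1.1], [1.1, 1.0]])
y_final = np.array([10.2, 11.5, 9.8, 10.9])  # normally known only at test end

# Least-squares fit of the final value from early measurements (with intercept).
A = np.hstack([X_early, np.ones((len(X_early), 1))])
coef, *_ = np.linalg.lstsq(A, y_final, rcond=None)

new_engine = np.array([1.05, 0.95, 1.0])  # last entry is the intercept term
print("predicted final value:", new_engine @ coef)
```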

4.
Pushing Convertible Constraints in Frequent Itemset Mining   (cited: 1; self-citations: 0; citations by others: 0)
Recent work has highlighted the importance of the constraint-based mining paradigm in the context of frequent itemsets, associations, correlations, sequential patterns, and many other interesting patterns in large databases. Constraint pushing techniques have been developed for mining frequent patterns and associations with antimonotonic, monotonic, and succinct constraints. In this paper, we study constraints which cannot be handled with existing theory and techniques in frequent pattern mining. For example, avg(S) θ v, median(S) θ v, and sum(S) θ v (where S can contain items of arbitrary values, θ ∈ {<, ≤, ≥, >}, and v is a real number) are customarily regarded as tough constraints in that they cannot be pushed inside an algorithm such as Apriori. We develop a notion of convertible constraints and systematically analyze, classify, and characterize this class. We also develop techniques which enable them to be readily pushed deep inside the recently developed FP-growth algorithm for frequent itemset mining. Results from our detailed experiments show the effectiveness of the techniques developed.
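To see why such a constraint becomes usable once items are ordered, consider avg(S) ≥ v: if items are enumerated in value-descending order, extending an itemset can only lower its average. A minimal sketch with hypothetical item values (the paper's FP-growth integration is elided):

```python
# Pushing avg(S) >= v as a convertible constraint: enumerate items in
# value-descending order, so extending a prefix can only lower its average,
# making the constraint behave antimonotonically. Values are hypothetical.

values = {"a": 90, "b": 70, "c": 40, "d": 10}
v = 50

def enumerate_with_pruning(items, prefix=(), total=0.0):
    for i, item in enumerate(items):
        new_prefix = prefix + (item,)
        new_total = total + values[item]
        if new_total / len(new_prefix) < v:
            # Later siblings have even smaller values, so they and all
            # their extensions fail too: prune the rest of this level.
            break
        print(new_prefix, "avg =", new_total / len(new_prefix))
        enumerate_with_pruning(items[i + 1:], new_prefix, new_total)

# The value-descending order is what makes avg(S) >= v prunable.
enumerate_with_pruning(sorted(values, key=values.get, reverse=True))
```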

5.
Learning to Recognize Volcanoes on Venus   (cited: 1; self-citations: 0; citations by others: 0)
Burl, Michael C.; Asker, Lars; Smyth, Padraic; Fayyad, Usama; Perona, Pietro; Crumpler, Larry; Aubele, Jayne. Machine Learning, 1998, 30(2-3): 165-194.
Dramatic improvements in sensor and image acquisition technology have created a demand for automated tools that can aid in the analysis of large image databases. We describe the development of JARtool, a trainable software system that learns to recognize volcanoes in a large data set of Venusian imagery. A machine learning approach is used because it is much easier for geologists to identify examples of volcanoes in the imagery than to specify domain knowledge as a set of pixel-level constraints. This approach can also provide portability to other domains without explicit reprogramming: the user simply supplies the system with a new set of training examples. We show how the development of such a system requires a completely different set of skills from those required to apply machine learning to toy-world domains. This paper discusses important aspects of the application process not commonly encountered in such domains, including obtaining labeled training data, the difficulties of working with pixel data, and the automatic extraction of higher-level features.

6.
The design of the database is crucial to the process of designing almost any Information System (IS) and involves two clearly identifiable key concepts: the schema and the data model, the latter allowing us to define the former. Nevertheless, the term "model" is commonly applied indistinctly to both, the confusion arising from the fact that in Software Engineering (SE), unlike in the formal or empirical sciences, the notion of model has a double meaning of which we are not always aware. If we take our idea of model directly from the empirical sciences, then the schema of a database would actually be a model, whereas the data model would be a set of tools allowing us to define such a schema.

The present paper discusses the meaning of "model" in the area of Software Engineering from a philosophical point of view, an important topic since the confusion that arises directly affects other debates where "model" is a key concept. We also suggest that the need for a philosophical discussion of the concept of data model is a further argument in favour of institutionalizing a new area of knowledge, which could be called the Philosophy of Engineering.

7.
Beyond Market Baskets: Generalizing Association Rules to Dependence Rules   (cited: 11; self-citations: 2; citations by others: 9)
One of the more well-studied problems in data mining is the search for association rules in market basket data. Association rules are intended to identify patterns of the type "a customer purchasing item A often also purchases item B." Motivated partly by the goal of generalizing beyond market basket data, and partly by the goal of ironing out some problems in the definition of association rules, we develop the notion of dependence rules, which identify statistical dependence in both the presence and the absence of items in itemsets. We propose measuring the significance of dependence via the chi-squared test for independence from classical statistics. This leads to a measure that is upward-closed in the itemset lattice, enabling us to reduce the mining problem to the search for a border between dependent and independent itemsets in the lattice. We develop pruning strategies based on the closure property and thereby devise an efficient algorithm for discovering dependence rules. We demonstrate our algorithm's effectiveness by testing it on census data, text data (wherein we seek term dependence), and synthetic data.
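A minimal sketch of the significance test for one item pair, using SciPy's chi-squared test of independence on the 2×2 presence/absence contingency table. The counts are illustrative, and the paper's lattice search for the dependent/independent border (enabled by upward closure) is elided:

```python
# Chi-squared test of independence on presence/absence counts for items A and B.
from scipy.stats import chi2_contingency

observed = [[120, 30],   # A present:  B present / B absent
            [ 40, 210]]  # A absent:   B present / B absent

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.1f}, p = {p:.3g}")
if p < 0.05:
    print("items are statistically dependent -> candidate dependence rule")
```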

8.
Statistical Models for Data Mining   (cited: 1; self-citations: 0; citations by others: 0)
We review the background to the papers presented in this special issue and give a short introduction to each. We also briefly describe the workshop "Statistical Models for Data Mining", held in Pavia, Italy, in October 2000, where the papers were presented.

9.
In this note we introduce purification for a pair consisting of a quantum state and a channel, which allows in particular a natural extension of the properties of the related information quantities (mutual and coherent information) to channels with arbitrary input and output spaces. PACS: 03.67.Hk
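For reference, the standard definitions of these quantities via a purification, in common notation that may differ from the note's own (here ρ is the input state, Φ the channel, |ψ⟩ a purification of ρ, and S the von Neumann entropy):

```latex
% Standard definitions (common notation; the note's notation may differ).
\begin{align*}
  S_e(\rho,\Phi) &= S\bigl((\Phi\otimes\mathrm{Id})\,|\psi\rangle\langle\psi|\bigr)
      && \text{(entropy exchange)} \\
  I(\rho,\Phi)   &= S(\rho) + S(\Phi(\rho)) - S_e(\rho,\Phi)
      && \text{(mutual information)} \\
  I_c(\rho,\Phi) &= S(\Phi(\rho)) - S_e(\rho,\Phi)
      && \text{(coherent information)}
\end{align*}
```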

10.
Recently, constraint-based mining of itemsets for questions like "find all frequent itemsets whose total price is at least $50" has attracted much attention. Two classes of constraints, monotone and antimonotone, have been very useful in this area. Algorithms exist that efficiently take advantage of either one of these two classes, but no previous algorithm could efficiently handle both types of constraints simultaneously. In this paper, we present DualMiner, the first algorithm that efficiently prunes its search space using both monotone and antimonotone constraints. We complement a theoretical analysis and proof of correctness of DualMiner with an experimental study that shows its efficacy compared to previous work.
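A minimal sketch of dual pruning (not the DualMiner implementation): sum-of-price constraints stand in for the two constraint classes, with made-up prices and thresholds. Violating the antimonotone constraint kills all supersets; failing the monotone constraint even with every remaining item added kills the whole subtree:

```python
# Prune the itemset search space with one antimonotone constraint
# (sum(price) <= BUDGET) and one monotone constraint (sum(price) >= TARGET).
# Prices and thresholds are hypothetical.

price = {"a": 30, "b": 25, "c": 20, "d": 5}
BUDGET, TARGET = 60, 50

def search(in_set, free):
    total_in = sum(price[i] for i in in_set)
    if total_in > BUDGET:                       # antimonotone violated:
        return                                  # every superset fails too
    if total_in + sum(price[i] for i in free) < TARGET:
        return                                  # monotone unreachable in subtree
    if TARGET <= total_in and in_set:
        print(sorted(in_set), "sum =", total_in)
    for k, item in enumerate(free):
        search(in_set | {item}, free[k + 1:])

search(frozenset(), list(price))
```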

11.
Recently, researchers have mainly been interested in searching for data content that is globally similar to the query, not in searching within data items. This paper presents an algorithm, called the generalized virtual node (GVN) algorithm, to search for data items of which parts (subdatatypes) are similar to the incoming query. We call this subdatatype-based multimedia retrieval. Each multimedia datatype, such as image or audio, is represented in this paper as a k-dimensional signal in the spatio-temporal domain. A k-dimensional signal is transformed into characteristic features, and these features are stored in a hierarchical multidimensional structure called the k-tree. Each node of the k-tree contains partial content corresponding to spatial and/or temporal positions in the data. The k-tree structure allows us to build a unified retrieval model for all types of multimedia data. It also eliminates unnecessary comparisons in cross-media querying. Experimental results using the new GVN algorithm for subaudio and subimage retrieval show that it takes much less retrieval time than earlier algorithms such as brute-force and partial-matching, while its accuracy remains acceptable.
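A hedged sketch of the hierarchical idea: a binary tree over a 1-D signal, a mean-value feature, and a distance threshold all stand in for the k-tree and GVN details, which the listing does not specify:

```python
# Each node summarizes a sub-range of a signal, so a query can be matched
# against parts of the data, not only the whole. Feature and threshold
# are hypothetical stand-ins.
import numpy as np

def build_tree(signal, lo, hi, min_len=4):
    node = {"lo": lo, "hi": hi, "feature": float(np.mean(signal[lo:hi])), "children": []}
    if hi - lo > min_len:
        mid = (lo + hi) // 2
        node["children"] = [build_tree(signal, lo, mid, min_len),
                            build_tree(signal, mid, hi, min_len)]
    return node

def search(node, query_feature, tol, hits):
    if abs(node["feature"] - query_feature) <= tol:
        hits.append((node["lo"], node["hi"]))
    for child in node["children"]:
        search(child, query_feature, tol, hits)
    return hits

signal = np.concatenate([np.zeros(8), np.ones(8) * 5.0])
tree = build_tree(signal, 0, len(signal))
print(search(tree, query_feature=5.0, tol=0.5, hits=[]))  # finds the sub-range of 5s
```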

12.
In the last decade there has been an explosion of interest in mining time series data. Literally hundreds of papers have introduced new algorithms to index, classify, cluster and segment time series. In this work we make the following claim: much of this work has very little utility, because the contribution made (speed in the case of indexing, accuracy in the case of classification and clustering, model accuracy in the case of segmentation) offers an amount of improvement that would have been completely dwarfed by the variance observed when testing on many real-world datasets, or by the variance observed when changing minor (unstated) implementation details.

To illustrate our point, we have undertaken the most exhaustive set of time series experiments ever attempted, re-implementing the contributions of more than two dozen papers and testing them on 50 real-world, highly diverse datasets. Our empirical results strongly support our assertion and suggest the need for a set of time series benchmarks and more careful empirical evaluation in the data mining community.
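The statistical point can be illustrated in a few lines: a small mean gain is uninformative when its spread across datasets is an order of magnitude larger. The numbers below are synthetic, not from the paper:

```python
# Synthetic illustration: a 0.5% mean accuracy gain dwarfed by its
# across-dataset variance.
import numpy as np

rng = np.random.default_rng(0)
n_datasets = 50
# Per-dataset accuracy gain of a "new" method over a baseline.
gain = 0.005 + rng.normal(0.0, 0.05, n_datasets)

print(f"mean gain: {gain.mean():+.4f}")
print(f"std across datasets: {gain.std():.4f}")
print(f"worse than baseline on {np.sum(gain < 0)} of {n_datasets} datasets")
```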

13.
HDM: A Client/Server/Engine Architecture for Real-Time Web Usage Mining   (cited: 1; self-citations: 1; citations by others: 0)
The behavior of the users of a website may change so quickly that it becomes a real challenge to make predictions from the frequent patterns found by analyzing an access log file. To reduce the obsolescence of behavioral patterns as much as possible, the ideal method would provide frequent patterns in real time, making the result immediately available. In this paper, we propose a method for finding frequent behavioral patterns in real time, whatever the number of connected users. Because it reflects how user behavior may have changed since the access log file was last analyzed, this result provides fully up-to-date navigation schemata for predicting user behavior. Based on a distributed heuristic, our method also tackles several problems within the framework of data mining: discovering interesting zones (a large number of frequent patterns concentrated over a period of time, or super-frequent patterns), discovering very long sequential patterns, and interactive data mining (on-the-fly modification of the minimum support).
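A hedged sketch of the real-time flavor (not the paper's distributed heuristic): pattern counts are updated as clicks arrive, so "which patterns are frequent?" is answerable at any moment, and the minimum support can be changed on the fly:

```python
# Incremental counting of navigation bigrams per user session; the support
# threshold is a query-time parameter, so it can be modified on the fly.
from collections import Counter, defaultdict

window = defaultdict(list)   # user -> recent pages
pattern_counts = Counter()   # (page_i, page_j) bigram patterns

def on_click(user, page, max_len=2):
    window[user].append(page)
    if len(window[user]) >= max_len:
        pattern_counts[tuple(window[user][-max_len:])] += 1

def frequent(min_support):   # adjustable at any time
    return [p for p, c in pattern_counts.items() if c >= min_support]

for user, page in [("u1", "/home"), ("u1", "/docs"), ("u2", "/home"),
                   ("u2", "/docs"), ("u1", "/faq")]:
    on_click(user, page)
print(frequent(min_support=2))   # [('/home', '/docs')]
```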

14.
3-D interpretation of optical flow by renormalization   (cited: 5; self-citations: 2; citations by others: 3)
This article studies the 3-D interpretation of optical flow induced by a general camera motion relative to a surface of general shape. First, we describe, using the image sphere representation, an analytical procedure that yields an exact solution when the data are exact: we solve the epipolar equation written in terms of the essential parameters and the twisted optical flow. Introducing a simple model of noise, we then show that the solution is statistically biased. In order to remove the statistical bias, we propose an algorithm called renormalization, which automatically adjusts to unknown image noise. A brief discussion is also given of the critical surface that yields ambiguous 3-D interpretations and of the use of the image plane representation.

15.
Incremental Scheduling of Mixed Workloads in Multimedia Information Servers   (cited: 2; self-citations: 0; citations by others: 2)
In contrast to pure video servers, advanced multimedia applications such as digital libraries or teleteaching exhibit a mixed workload, with massive access to conventional, discrete data such as text documents, images and indexes, as well as requests for continuous data such as video and audio. In addition to the service-quality guarantees for continuous-data requests, quality-conscious applications require that the response time of discrete-data requests stay below some user-tolerance threshold. In this paper, we study the impact of different disk scheduling policies on the service quality for both continuous and discrete data. We provide a framework for describing various policies in terms of a few parameters, and we develop a novel policy that is experimentally shown to outperform all other policies.
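A minimal sketch of the policy space studied here, not the paper's novel policy: each round serves deadline-ordered continuous requests first and gives leftover slack to discrete requests. The 10 ms per-request service time and round length are assumptions:

```python
# Round-based mixed-workload scheduling: continuous requests by nearest
# deadline, then discrete requests in leftover slack (FCFS).
import heapq

def schedule_round(continuous, discrete, round_time):
    """continuous: [(deadline, req)], discrete: [(arrival, req)]; times in ms."""
    served, t = [], 0.0
    heapq.heapify(continuous)
    while continuous and t + 10 <= round_time:   # assume 10 ms per disk read
        _, req = heapq.heappop(continuous)
        served.append(req); t += 10
    for _, req in sorted(discrete):              # leftover slack, FCFS order
        if t + 10 > round_time:
            break
        served.append(req); t += 10
    return served

print(schedule_round([(30, "video-A"), (20, "video-B")],
                     [(1, "index-read"), (2, "image")], round_time=40))
```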

16.
We present a deep X-ray mask with an integrated bent-beam electrothermal actuator for the fabrication of 3D microstructures with curved surfaces. The mask absorber is electroplated on the shuttle mass, which is supported by a pair of 20-µm-thick single-crystal silicon bent-beam electrothermal actuators and oscillates rectilinearly due to the thermal expansion of the bent beams. The width of each bent-beam is 10 µm or 20 µm, the length and bending angle are 1 mm and 0.1 rad, respectively, and the shuttle mass size is 1 mm × 1 mm. For 10-µm-wide bent-beams, the shuttle mass displacement is around 15 µm at 180 mW (3.6 V) DC input power. For 20-µm-wide bent-beams, the shuttle mass displacement is around 19 µm at 336 mW (4.2 V) DC input power. Sinusoidal cross-sectional PMMA microstructures with a pitch of 40 µm and a height of 20 µm are fabricated by 0.5 Hz, 20-µm-amplitude sinusoidal shuttle-mass oscillation.

This research, under contract project code MS-02-338-01, has been supported by the Intelligent Microsystem Center, which carries out one of the 21st century's Frontier R&D Projects sponsored by the Korea Ministry of Science & Technology. Experiments at PLS were supported in part by MOST and POSCO.

17.
When verifying concurrent systems described by transition systems, state explosion is one of the most serious problems. If quantitative temporal information (expressed by clock ticks) is considered, state explosion is even more severe. We present a notion of abstraction of transition systems, where the abstraction is driven by the formulae of a quantitative temporal logic, called qu-mu-calculus, defined in the paper. The abstraction is based on a notion of bisimulation equivalence parameterized by a set of actions and a natural number n. It is proved that two transition systems are equivalent in this sense iff they give the same truth value to all qu-mu-calculus formulae whose modal operators mention only actions from the given set and whose time constraints have values less than or equal to n. We present a non-standard (abstract) semantics for a timed process algebra able to produce reduced transition systems for checking formulae. The abstract semantics, parametric with respect to a set of actions and a natural number n, produces a reduced transition system equivalent, in the above sense, to the standard one. A transformational method is also defined, by means of which a program can be syntactically transformed into a smaller one that still preserves this equivalence.

18.
Distributed mutual exclusion algorithms have mainly been compared using the number of messages exchanged per critical section execution. In such algorithms, no attention has been paid to the serialization order of the requests; indeed, they adopt the FCFS discipline. Conversely, inserting priority serialization disciplines, such as Shortest-Job-First, Head-Of-Line, Shortest-Remaining-Job-First, etc., can be useful in many applications to optimize some performance indices. However, such priority disciplines are prone to starvation. The goal of this paper is to investigate and evaluate the impact of inserting a priority discipline in Maekawa-type algorithms. Priority serialization disciplines are inserted by means of a gated batch mechanism, which avoids starvation. In a distributed algorithm, such a mechanism needs synchronization among the processes. To highlight the usefulness of a priority-based serialization discipline, we show how it can be used to improve the average response time compared to the FCFS discipline. The gated batch approach exhibits other advantages: algorithms are inherently deadlock-free and messages do not need to piggyback timestamps. We also show that, under heavy demand, algorithms using gated batches exchange fewer messages per critical section execution than Maekawa-type algorithms.

Roberto Baldoni was born in Rome on February 1, 1965. He received the Laurea degree in electronic engineering in 1990 from the University of Rome "La Sapienza" and the Ph.D. degree in computer science from the University of Rome "La Sapienza" in 1994. Currently, he is a researcher in computer science at IRISA, Rennes (France). His research interests include operating systems, distributed algorithms, network protocols and real-time multimedia applications.

Bruno Ciciani received the Laurea degree in electronic engineering in 1980 from the University of Rome "La Sapienza". From 1983 to 1991 he was a researcher at the University of Rome "Tor Vergata". He is currently a full professor in computer science at the University of Rome "La Sapienza". His research activities include distributed computer systems, fault-tolerant computing, languages for parallel processing, and computer system performance and reliability evaluation. He has published in IEEE Transactions on Computers, IEEE Transactions on Knowledge and Data Engineering, IEEE Transactions on Software Engineering and IEEE Transactions on Reliability. He is the author of a book titled Manufacturing Yield Evaluation of VLSI/WSI Systems, to be published by IEEE Computer Society Press.

This research was supported in part by the Consiglio Nazionale delle Ricerche under grant 93.02294.CT12. This author is also supported by a grant of the Human Capital and Mobility project of the European Community under contract No. 3702 CABERNET.
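A minimal sketch of the gated-batch idea in a centralized setting (the paper's distributed synchronization is elided): requests arriving while a batch is being served wait behind the gate, so a priority discipline such as Shortest-Job-First applied inside the batch cannot starve anyone, because every request in a batch is served before the next batch opens:

```python
# Gated batching with SJF inside each batch. Service times are hypothetical.
def gated_batches(arrivals):
    """arrivals: list of (arrival_time, service_time, name).
    Yields (name, completion_time) in service order."""
    pending = sorted(arrivals)
    clock = 0.0
    while pending:
        # Gate closes: take everything that has arrived by `clock`;
        # if nothing has arrived yet, jump to the next arrival.
        batch = [r for r in pending if r[0] <= clock] or [pending[0]]
        pending = [r for r in pending if r not in batch]
        batch.sort(key=lambda r: r[1])     # SJF inside the batch only
        for arr, svc, name in batch:
            clock = max(clock, arr) + svc
            yield name, clock

for name, done in gated_batches([(0, 5, "long"), (1, 1, "short-1"),
                                 (2, 1, "short-2"), (6, 1, "short-3")]):
    print(name, "done at", done)
```

Note that the long request is served in its own batch before the short ones that arrived later; under pure (ungated) SJF a stream of short requests could delay it indefinitely.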

19.
This paper discusses some of the key drivers that will enable businesses to operate effectively on-line, and looks at how the notion of a website will evolve into that of an on-line presence supporting the main activities of an organisation. This is placed in the context of the development of the information society, which will allow individuals, as consumers or employees, quick, inexpensive, on-demand access to vast quantities of entertainment, services and information. The paper draws on an example of these developments in Australasia.

20.
In this paper, we consider the linear interval tolerance problem, which consists of finding the largest interval vector included in the tolerance solution set { x ∈ ℝⁿ | ∀A ∈ [A], ∃b ∈ [b], Ax = b }. We describe two different polyhedrons that represent subsets of all possible interval vectors in this set, and we provide a new definition of the optimality of an interval vector included in it. Finally, we show how the Simplex algorithm can be applied to find an optimal interval vector in this set.
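The tolerance set has the standard containment characterization: x belongs to it exactly when the interval product [A]x fits inside [b]. A small sketch checking membership for a point x with endpoint interval arithmetic (the example system is made up, and the paper's polyhedral/Simplex machinery for finding the largest interval vector is elided):

```python
# Membership test for the tolerance set via [A] @ x ⊆ [b].
import numpy as np

A_lo = np.array([[2.0, -1.0], [0.5, 1.0]])
A_hi = np.array([[3.0,  0.0], [1.5, 2.0]])
b_lo = np.array([-1.0, 0.0])
b_hi = np.array([ 6.0, 5.0])

def in_tolerance_set(x):
    # Endpoints of [A] @ x: for x_j >= 0 the lower endpoint uses A_lo[:, j],
    # for x_j < 0 it uses A_hi[:, j] (and symmetrically for the upper endpoint).
    prod_lo = np.where(x >= 0, A_lo, A_hi) @ x
    prod_hi = np.where(x >= 0, A_hi, A_lo) @ x
    return bool(np.all(prod_lo >= b_lo) and np.all(prod_hi <= b_hi))

print(in_tolerance_set(np.array([1.0, 1.0])))   # True for this example
```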
