Similar Documents
20 similar documents were retrieved.
1.
Bit transposition for very large scientific and statistical databases   (Total citations: 2; self-citations: 0; citations by others: 2)
Conventional access methods cannot be effectively used in large Scientific/Statistical Database (SSDB) applications. A file structure (called bit transposed file (BTF)) is proposed which offers several attractive features that are better suited for the special characteristics that SSDBs exhibit. This file structure is an extreme version of the (attribute) transposed file. The data are stored by vertical bit partitions. The bit patterns of attributes are assigned using one of several data encoding methods. Each of these encoding methods is appropriate for different query types. The bit partitions can also be compressed using a version of the run length encoding scheme. Efficient operators on compressed bit vectors have been developed and form the basis of a query language. Because of the simplicity of the file structure and query language, optimization problems for database design, query evaluation, and common subexpression removal can be formalized, and efficient exact or near-optimal solutions can be achieved. In addition to selective power with low overheads for SSDBs, the BTF is also amenable to special parallel hardware. Results from experiments with the file structure suggest that this approach may be a reasonable alternative file structure for large SSDBs. Supported by the Office of Energy Research, U.S. DOE under Contract No. DE-AC03-76SF00098. On leave from the Department of Computer Science, Heilongjiang University, China.
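To make the vertical bit-partition idea concrete, the following is a minimal sketch, not the paper's code: each bit position of an equality-encoded attribute is stored as its own bit vector, and an exact-match selection reduces to bitwise AND/NOT over those partitions. The encoding, the function names, and the absence of run-length compression are simplifying assumptions.

```python
import numpy as np

# Toy bit-transposed storage: bit j of every record's encoded attribute value
# is kept as its own vertical partition (a boolean vector over all records).
def transpose_bits(encoded_values, k):
    v = np.asarray(encoded_values, dtype=np.int64)
    return [((v >> j) & 1).astype(bool) for j in range(k)]

# An exact-match selection touches only the k bit partitions of one attribute
# and uses nothing but bitwise AND / NOT over the bit vectors.
def select_equal(partitions, target):
    mask = np.ones_like(partitions[0], dtype=bool)
    for j, bits in enumerate(partitions):
        mask &= bits if (target >> j) & 1 else ~bits
    return np.flatnonzero(mask)

parts = transpose_bits([3, 7, 3, 1, 7, 0], k=3)   # a 3-bit encoded attribute
print(select_equal(parts, target=7))               # -> [1 4]
```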

2.
Mining very large databases   (Total citations: 1; self-citations: 0; citations by others: 1)
Ganti V., Gehrke J., Ramakrishnan R. Computer, 1999, 32(8): 38-45
Established companies have had decades to accumulate masses of data about their customers, suppliers, products and services, and employees. Data mining, also known as knowledge discovery in databases, gives organizations the tools to sift through these vast data stores to find the trends, patterns, and correlations that can guide strategic decision making. Traditionally, algorithms for data analysis assume that the input data contains relatively few records. Current databases, however, are much too large to be held in main memory. To be efficient, the data mining techniques applied to very large databases must be highly scalable. An algorithm is said to be scalable if, given a fixed amount of main memory, its runtime increases linearly with the number of records in the input database. Recent work has focused on scaling data mining algorithms to very large data sets. The authors describe a broad range of algorithms that address three classical data mining problems: market basket analysis, clustering, and classification.

3.
Query-by-example and query-by-keyword both suffer from the problem of “aliasing,” meaning that example-images and keywords potentially have variable interpretations or multiple semantics. For discerning which semantic is appropriate for a given query, we have established that combining active learning with kernel methods is a very effective approach. In this work, we first examine active-learning strategies, and then focus on addressing the challenges of two scalability issues: scalability in concept complexity and in dataset size. We present remedies, explain limitations, and discuss future directions that research might take.

4.
In the Big Data Era, the management of energy consumption by servers and data centers has become a challenging issue for companies, institutions, and countries. In data-centric applications, Database Management Systems are one of the major energy consumers when executing complex queries involving very large databases. Several initiatives have been proposed to deal with this issue, covering both the hardware and software dimensions. They can be classified into two main approaches assuming that either (a) the database is already deployed on a given platform, or (b) it is not yet deployed. In this study, we focus on the first set of initiatives with a particular interest in physical design, where optimization structures (e.g., indexes, materialized views) are selected to satisfy a given set of non-functional requirements such as query performance for a given workload. In this paper, we first propose an initiative, called Eco-Physic, which integrates the energy dimension into the physical design when selecting materialized views, one of the redundant optimization structures. Secondly, we provide a multi-objective formalization of the materialized view selection problem, considering two non-functional requirements: query performance and energy consumption while executing a given workload. Thirdly, an evolutionary algorithm is developed to solve the problem. This algorithm differs from existing ones by being interactive, so that database administrators can adjust some energy-sensitive parameters at the final stage of the algorithm execution according to their specifications. Finally, intensive experiments are conducted using our mathematical cost model and a real device for energy measurements. Results underscore the value of our approach as an effective way to save energy while optimizing queries through materialized view structures.
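As a rough illustration of the bi-objective view-selection formulation (not Eco-Physic itself), the sketch below encodes a configuration of materialized views as a bit vector and evolves a small population against assumed toy cost figures for query time and energy. All numbers, names, and the crude replacement rule are placeholders, and the paper's interactive final stage is omitted.

```python
import random

# Hypothetical per-view figures: energy overhead of maintaining the view and
# the query-time saving it brings for the workload (toy numbers, not measured).
ENERGY = [5.0, 2.0, 7.0, 1.5, 3.0]
SAVING = [9.0, 3.0, 8.0, 1.0, 4.0]
BUDGET = 3                                   # at most 3 views materialized
BASELINE_QUERY_COST = 100.0                  # assumed workload cost with no views

def objectives(bits):
    """Return (query_cost, energy_cost); both objectives are minimized."""
    energy = sum(e for e, b in zip(ENERGY, bits) if b)
    saving = sum(s for s, b in zip(SAVING, bits) if b)
    return BASELINE_QUERY_COST - saving, energy

def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and a != b

def random_config():
    chosen = random.sample(range(len(ENERGY)), k=random.randint(0, BUDGET))
    return [1 if i in chosen else 0 for i in range(len(ENERGY))]

def evolve(pop_size=20, generations=200):
    pop = [random_config() for _ in range(pop_size)]
    for _ in range(generations):
        child = [b ^ (random.random() < 0.2) for b in random.choice(pop)]
        if sum(child) <= BUDGET:                 # respect the storage budget
            pop.append(child)
            worst = max(pop, key=lambda ind: sum(objectives(ind)))
            pop.remove(worst)                    # crude replacement step
    # Report the non-dominated (Pareto) configurations found.
    return [ind for ind in pop
            if not any(dominates(objectives(o), objectives(ind)) for o in pop)]

print(evolve()[:3])
```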

5.
6.
Clustering in very large databases based on distance and density   (Total citations: 8; self-citations: 0; citations by others: 8)
Clustering in very large databases or data warehouses, with many applications in areas such as spatial computation, web information collection, pattern recognition, and economic analysis, is a huge task that challenges data mining researchers. Current clustering methods suffer from three problems: 1) scanning the whole database leads to high I/O cost and expensive maintenance (e.g., R*-tree); 2) the uncertain parameter k must be pre-specified, so clustering can only be refined by repeated trial and test; 3) they lack efficiency in handling clusters of arbitrary shape over very large data sets. In this paper, we first present a new hybrid-clustering algorithm to solve these problems. This new algorithm, which combines both distance and density strategies, can handle clusters of any arbitrary shape effectively. It makes full use of statistical information gathered during mining to greatly reduce the time complexity while keeping good clustering quality. Furthermore, the algorithm can easily eliminate noise and identify outliers. An experimental evaluation is performed on a spatial database with this method and other popular clustering algorithms (CURE and DBSCAN). The results show that our algorithm outperforms them in terms of efficiency and cost, and its speedup grows as the data size scales up.
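The following is a drastically simplified, grid-based sketch of the distance-plus-density idea, not the paper's algorithm: per-cell counts provide the density statistics, sparse cells are discarded as noise/outliers, and dense cells are merged into arbitrary-shape clusters whenever they lie within a distance threshold. Cell size and thresholds are assumed parameters.

```python
import numpy as np
from collections import defaultdict

def hybrid_cluster(points, cell_size=1.0, min_pts=5, merge_dist=1.5):
    """Toy hybrid clustering: density filters grid cells, distance merges them."""
    cells = defaultdict(list)
    for i, p in enumerate(points):
        cells[tuple((np.asarray(p) // cell_size).astype(int))].append(i)
    dense = [c for c, idx in cells.items() if len(idx) >= min_pts]

    clusters, cluster_of = {}, {}
    for seed in dense:
        if seed in cluster_of:
            continue
        cid, queue = len(clusters), [seed]
        clusters[cid], cluster_of[seed] = [], cid
        while queue:                                   # single-link merge (BFS)
            c = queue.pop()
            clusters[cid].extend(cells[c])
            for other in dense:
                if other not in cluster_of and np.linalg.norm(
                        (np.array(c) - np.array(other)) * cell_size) <= merge_dist:
                    cluster_of[other] = cid
                    queue.append(other)
    noise = [i for c, idx in cells.items() if c not in cluster_of for i in idx]
    return clusters, noise                             # point indices per cluster
```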

7.
8.
Estimating the disclosure risk of a Statistical Disclosure Control (SDC) protection method by means of (distance-based) record linkage techniques is a very popular approach to analyze the privacy level offered by such a method. When databases are very large, particular record linkage techniques such as blocking or partitioning are usually applied to make this process reasonably efficient. However, in this case the record linkage process is not exact, which means that the disclosure risk of an SDC protection method may be underestimated. In this paper we propose the use of kd-tree techniques to apply exact yet very efficient record linkage when (protected) datasets are very large. We describe some experiments showing that this approach achieves better results, in terms of both accuracy and running time, than more classical approaches such as record linkage based on a sliding window. We also discuss and experiment on the use of these techniques not to link a whole protected record with its original one, but just to guess the value of some confidential attribute(s) of the record(s). This fact leads to concepts such as k-neighbor l-diversity or k-neighbor p-sensitivity, a generalization (to any SDC protection method) of l-diversity or p-sensitivity, which have been defined for SDC protection methods ensuring k-anonymity, such as microaggregation.
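Below is a minimal sketch of exact, distance-based record linkage with a kd-tree, i.e., the general technique rather than the authors' implementation. It assumes numeric datasets whose i-th rows describe the same individual; the noise-addition "protection" in the usage example is only a stand-in for a real SDC method.

```python
import numpy as np
from scipy.spatial import cKDTree

def linkage_risk(original, protected):
    """Fraction of protected records whose exact nearest original record
    (Euclidean distance, kd-tree search) is their true counterpart."""
    tree = cKDTree(original)                  # built once over the original data
    _, nearest = tree.query(protected, k=1)   # exact nearest neighbour per record
    return float(np.mean(nearest == np.arange(len(protected))))

# Toy usage: "protection" = additive noise, standing in for a real SDC method.
rng = np.random.default_rng(0)
orig = rng.normal(size=(1000, 4))
prot = orig + rng.normal(scale=0.1, size=orig.shape)
print(linkage_risk(orig, prot))               # disclosure-risk estimate in [0, 1]
```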

9.
The development and investigation of efficient methods for parallel processing of very large databases using a columnar data representation designed for computer clusters is discussed. An approach that combines the advantages of relational and column-oriented DBMSs is proposed. A new type of distributed column index, fragmented based on the domain-interval principle, is introduced. The column indexes are auxiliary structures that are permanently stored in the distributed main memory of a computer cluster. To match the elements of a column index to the tuples of the original relation, surrogate keys are used. Resource-hungry relational operations are performed on the corresponding column indexes rather than on the original relations of the database. As a result, a precomputation table is obtained. Using this table, the DBMS reconstructs the resulting relation. For the basic relational operations on column indexes, methods for their parallel decomposition that do not require massive data exchanges between the processor nodes are proposed. This approach improves the performance of OLAP-class queries by hundreds of times.
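A simplified sketch of the domain-interval fragmentation idea follows; the class, the data structures, and the boundary semantics are assumptions for illustration. A column index of (surrogate key, value) pairs is split into fragments by value intervals, one per processor node, so a range selection scans only the fragments whose interval overlaps the predicate and needs no cross-node data exchange.

```python
from bisect import bisect_right

class DomainIntervalColumnIndex:
    """Column index fragmented by value intervals (one fragment per node)."""
    def __init__(self, pairs, boundaries):
        # pairs: (surrogate_key, value) tuples; boundaries: sorted interval edges.
        self.boundaries = boundaries
        self.fragments = [[] for _ in range(len(boundaries) + 1)]
        for key, value in pairs:
            self.fragments[bisect_right(boundaries, value)].append((key, value))

    def select_range(self, lo, hi):
        """Surrogate keys with lo <= value <= hi; only overlapping fragments
        are scanned, mimicking node-local processing of the selection."""
        first = bisect_right(self.boundaries, lo)
        last = bisect_right(self.boundaries, hi)
        return [k for frag in self.fragments[first:last + 1]
                for k, v in frag if lo <= v <= hi]

idx = DomainIntervalColumnIndex([(1, 5), (2, 42), (3, 17), (4, 99)],
                                boundaries=[10, 50])
print(idx.select_range(10, 60))    # -> [2, 3]: only fragments 1 and 2 are touched
```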

10.
Many applications require the management of spatial data in a multidimensional feature space. Clustering large spatial databases is an important problem, which tries to find the densely populated regions in the feature space to be used in data mining, knowledge discovery, or efficient information retrieval. A good clustering approach should be efficient and detect clusters of arbitrary shape. It must be insensitive to the noise (outliers) and the order of input data. We propose WaveCluster, a novel clustering approach based on wavelet transforms, which satisfies all the above requirements. Using the multiresolution property of wavelet transforms, we can effectively identify arbitrarily shaped clusters at different degrees of detail. We also demonstrate that WaveCluster is highly efficient in terms of time complexity. Experimental results on very large datasets are presented, which show the efficiency and effectiveness of the proposed approach compared to the other recent clustering methods. Received June 9, 1998 / Accepted July 8, 1999
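As a toy rendering of the wavelet-based idea (not the actual WaveCluster code), the sketch below quantizes 2-D points onto a grid, applies a Haar low-pass step (2x2 averaging) as one level of the wavelet transform, thresholds the smoothed density, and takes connected components of the dense region as clusters at that resolution. Grid size and threshold are assumed parameters.

```python
import numpy as np
from scipy.ndimage import label

def wavelet_clusters(points, grid=64, threshold=1.0):
    """points: (n, 2) array with coordinates in [0, 1). Returns a label per
    point (0 = noise) and the number of clusters found at this resolution."""
    cells = np.minimum((np.asarray(points) * grid).astype(int), grid - 1)
    density, _, _ = np.histogram2d(cells[:, 0], cells[:, 1],
                                   bins=grid, range=[[0, grid], [0, grid]])
    # One Haar low-pass level: average each 2x2 block of the density grid.
    coarse = density.reshape(grid // 2, 2, grid // 2, 2).mean(axis=(1, 3))
    labeled, n_clusters = label(coarse > threshold)   # connected dense regions
    # Map every point back to the cluster of its coarse cell.
    return labeled[cells[:, 0] // 2, cells[:, 1] // 2], n_clusters
```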

11.
Most scientific databases consist of datasets (or sources) which in turn include samples (or files) with an identical structure (or schema). In many cases, samples are associated with rich metadata, describing the process that leads to building them (e.g., the experimental conditions used during sample generation). Metadata are typically used in scientific computations just for the initial data selection; at most, metadata about query results is recovered after executing the query and associated with its results by post-processing. In this way, a large body of information that could be relevant for interpreting query results goes unused during query processing. In this paper, we present ScQL, a new algebraic relational language, whose operations apply to objects consisting of data–metadata pairs, preserving such one-to-one correspondence throughout the computation. We formally define each operation and we describe an optimization, called meta-first, that may significantly reduce the query processing overhead by anticipating the use of metadata for selectively loading into the execution environment only those input samples that contribute to the result samples. In ScQL, metadata have the same relevance as data, and contribute to building query results; in this way, the resulting samples are systematically associated with metadata about either the specific input samples involved or about query processing, thereby yielding a new form of metadata provenance. We present many examples of the use of ScQL, relative to several application domains, and we demonstrate the effectiveness of the meta-first optimization.
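A bare-bones sketch of the meta-first idea under assumed in-memory structures (plain dictionaries and lambdas standing in for datasets and lazy sample loaders): the metadata predicate is evaluated before any sample data is loaded, so only contributing samples enter the execution environment, and each result stays paired with the metadata of the sample it came from.

```python
def meta_first_query(metadata, load_sample, meta_predicate, data_operation):
    """metadata: {sample_id: metadata dict}; load_sample(sample_id) -> sample data.
    Only samples passing the metadata predicate are ever loaded and processed;
    results are (metadata, data) pairs, preserving the one-to-one correspondence."""
    selected = [sid for sid, meta in metadata.items() if meta_predicate(meta)]
    return {sid: (metadata[sid], data_operation(load_sample(sid)))
            for sid in selected}

# Toy usage with hypothetical genomic samples: s2 is never loaded.
meta = {"s1": {"assembly": "hg38", "tissue": "liver"},
        "s2": {"assembly": "hg19", "tissue": "brain"}}
loaders = {"s1": lambda: [("chr1", 100, 200)], "s2": lambda: [("chr2", 5, 50)]}
result = meta_first_query(meta, lambda sid: loaders[sid](),
                          lambda m: m["assembly"] == "hg38",
                          lambda rows: [r for r in rows if r[1] < 150])
print(result)
```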

12.
Statistical databases have traditionally been stored as flat files approximating relations. We propose that by storing statistical data in an object-oriented database, enhanced with knowledge of statistical theory, a more natural and powerful interface to statistical data can be created. A formalism is proposed for dealing with and combining data that have random components by making statistics first-class citizens in the database world. Entities in the databases are classified according to whether they are observations or statistics. Estimates are a special type of statistics which are moored to observation entities. Statistics entities are classified by their statistical properties. A hierarchical structure of random features is provided, with distributions at its leaves. This structure is a DAG (Directed Acyclic Graph), which may be extended or redefined for different applications and contains information used to compare and manipulate statistics.
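The sketch below is an assumed, minimal object-oriented rendering of that classification (class and attribute names are illustrative, not the paper's formalism): observations and statistics are distinct entity types, and an estimate is a statistic moored to the observation entity it was computed from.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Observation:            # a data entity with no random component
    values: list

@dataclass
class Statistic:              # an entity with random components
    distribution: str         # a leaf of the random-feature DAG, e.g. "normal"

@dataclass
class Estimate(Statistic):    # a statistic moored to observation entities
    moored_to: Observation = None
    value: float = None

def sample_mean(obs: Observation) -> Estimate:
    """Deriving an estimate keeps the mooring back to its observations."""
    return Estimate(distribution="approx-normal", moored_to=obs,
                    value=mean(obs.values))

obs = Observation(values=[2.0, 4.0, 6.0])
est = sample_mean(obs)
print(est.value, est.moored_to is obs)   # 4.0 True
```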

13.
Set controls are inference controls for statistical databases that couple their reaction to the composition of the query set. They may work by suppressing or by perturbing answers. We show that set controls are often insecure.

14.
International Journal of Parallel Programming - A time decomposition technique is suggested for large-database (DB) models. The problem of network aggregation is studied and the results used to...

15.
The evaluability of queries on a statistical database containing joinable tables connected by an intersection hypergraph is considered. A characterization of evaluable queries is given, which allows one to define polynomial-time procedures both for testing evaluability and for evaluating queries. These results are useful in designing an "informed query system" for statistical databases which promotes an integrated use of stored information. Such a query system allows the user to formulate a query involving attributes from several joinable tables as if they were all contained in a single universal table.

16.
Pattern Recognition, 2014, 47(2): 588-602
Fingerprint matching has emerged as an effective tool for human recognition due to the uniqueness, universality and invariability of fingerprints. Many different approaches have been proposed in the literature to determine faithfully whether two fingerprint images belong to the same person. Among them, minutiae-based matchers stand out as the most relevant techniques because of their discriminative capabilities, providing precise results. However, performing a fingerprint identification over a large database can be an inefficient task due to the lack of scalability and high computing times of fingerprint matching algorithms. In this paper, we propose a distributed framework for fingerprint matching to tackle large databases in a reasonable time. It provides a general scheme for any kind of matcher, so that its precision is preserved and its response time can be reduced. To test the proposed system, we conduct an extensive study that involves both synthetic and captured fingerprint databases, which have different characteristics, analyzing the performance of three well-known minutiae-based matchers within the designed framework. With the available hardware resources, our distributed model is able to address up to 400 000 fingerprints in approximately half a second. Additional details are provided at http://sci2s.ugr.es/ParallelMatching.
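A schematic sketch of the distribution scheme follows (not the authors' framework, which targets a real cluster): the template database is split into partitions, an arbitrary matcher scores each partition in a separate process, and the per-partition best matches are reduced to a global answer. The `match_score` function is a deliberately dumb placeholder for any minutiae-based matcher.

```python
from concurrent.futures import ProcessPoolExecutor

def match_score(query, template):
    """Placeholder for any minutiae-based matcher returning a similarity score."""
    return len(set(query) & set(template))       # dummy, deterministic score

def best_in_partition(args):
    query, partition = args                      # partition: [(id, template), ...]
    return max(((tid, match_score(query, tpl)) for tid, tpl in partition),
               key=lambda t: t[1])

def identify(query, database, n_workers=4):
    """database: list of (template_id, template). Precision is preserved because
    every template is still scored; only the work is spread over processes."""
    chunk = max(1, len(database) // n_workers)
    partitions = [database[i:i + chunk] for i in range(0, len(database), chunk)]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        local_best = list(pool.map(best_in_partition,
                                   [(query, p) for p in partitions]))
    return max(local_best, key=lambda t: t[1])   # global best (id, score)

if __name__ == "__main__":
    db = [(i, f"synthetic-template-{i}") for i in range(1000)]
    print(identify("synthetic-template-89", db))
```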

17.
Nowadays, most fingerprint sensors capture partial fingerprint images. Incomplete, fragmentary, or partial fingerprint identification in large databases is an attractive research topic and remains an important and challenging problem. Accordingly, conventional fingerprint identification systems are not capable of providing convincing results. To overcome this problem, we need a fast and accurate identification strategy. In this context, fingerprint indexing is commonly used to speed up the identification process. This paper proposes a robust and fast identification system that combines two indexing algorithms. One of the indexing algorithms uses minutiae triplets, and the other uses the orientation field (OF) to index and retrieve fingerprints. Furthermore, the proposal applies partial fingerprint matching methods to the final candidate list obtained from the indexing stage. The proposal is evaluated over two National Institute of Standards and Technology (NIST) datasets and four Fingerprint Verification Competition (FVC) datasets, leading to low identification times with no loss of accuracy.
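An illustrative sketch (not the paper's system) of combining two indexing algorithms: each index independently proposes a ranked candidate list, the lists are fused by summed rank, and the expensive partial-fingerprint matcher is applied only to the short fused list. The `retrieve` method, the `partial_match` function, and the threshold are assumed placeholders.

```python
def fuse_candidates(query, indexes, shortlist=50):
    """Fuse the ranked candidate lists of several indexes (e.g. a minutiae-triplet
    index and an orientation-field index) by summed rank; identities missing
    from a list receive the default rank `shortlist`."""
    lists = [index.retrieve(query)[:shortlist] for index in indexes]
    candidates = set().union(*lists)
    def fused_rank(cand):
        return sum(lst.index(cand) if cand in lst else shortlist for lst in lists)
    return sorted(candidates, key=fused_rank)[:shortlist]

def identify(query, indexes, templates, partial_match, threshold=0.4):
    """Run the costly partial-fingerprint matcher only on the fused candidates,
    returning the first identity whose score clears the acceptance threshold."""
    for cand in fuse_candidates(query, indexes):
        score = partial_match(query, templates[cand])
        if score >= threshold:
            return cand, score
    return None, 0.0
```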

18.
Faudemay P., Mhiri M. IEEE Micro, 1991, 11(6): 22-34
The RAPID-1 (relational access processor for intelligent data), an associative accelerator that recognizes tuples and logical formulas, is presented. It evaluates logical formulas instantiated by the current tuple, or record, and operates on whole relations or on hashing buckets. RAPID-1 uses a reduced instruction set and hardwired control and executes all comparisons in a bit-parallel mode. It speeds up the database by a significant factor and will adapt to future generations of microprocessors. The principal design issues, data structures, instruction set, architecture, environments and performance are discussed.

19.
Gerardo, Bice. Data & Knowledge Engineering, 2009, 68(11): 1187-1205
The paper proposes a novel approach for on-line max and min query auditing, in which a Bayesian network addresses disclosures based on probabilistic inferences that can be drawn from released data. In the literature, on-line max and min auditing has been addressed under some restrictive assumptions, primarily that sensitive values must all be distinct and that the sensitive field has a uniform distribution. We remove these limitations and propose a model able to: provide a graphical representation of user knowledge; deal with the implicit delivery of information that derives from denying the answer to a query; and capture user background knowledge. Finally, we discuss the results of experiments aimed at assessing the scalability of the approach, in terms of response time and size of the conditional probability table, and the usefulness of the auditing system, in terms of the probability of denial.

20.
We propose an active image information system based upon the concept of smart images. A smart image is an image with an associated knowledge structure, consisting of protocols, hot spots, active indexes and attributes. This active image information system enables us to accomplish the objectives of timely delivery and easy accessibility by handling long-duration operations and supporting unitary image information usage. The experimental prototype of the smart image system is described.
