Similar literature
Found 20 similar documents; search took 15 ms.
1.
In statistical databases and data warehousing applications it is commonly the case that aggregate views are maintained as an underlying mechanism for summarising information. Where the databases or applications are distributed, or arise from independent data collections or system developments, there may be incompatibility, heterogeneity, and data inconsistency. These challenges need to be overcome if federations of aggregated databases are to be successfully incorporated into systems for database management, querying, retrieval, and knowledge discovery. In this paper we address the issue of integrating aggregate views that have semantically heterogeneous classification schemes. In previous work we have developed a methodology that is efficient but that cannot easily handle data inconsistencies. Our previous approach is therefore not particularly well-suited to very large databases or federations of large numbers of databases. We now address these scalability issues by introducing a methodology for heterogeneous aggregate view integration that constructs a dynamic shared ontology to which each of the aggregate views can be explicitly related. A maximum likelihood technique, implemented using the EM (Expectation-Maximisation) algorithm, is used to inherently handle data inconsistencies in the computation of integrated aggregates that are described in terms of the dynamic shared ontology.

2.
Aggregate views are commonly used for summarizing information held in very large databases such as those encountered in data warehousing, large scale transaction management, and statistical databases. Such applications often involve distributed databases that have developed independently and therefore may exhibit incompatibility, heterogeneity, and data inconsistency. We are here concerned with the integration of aggregates that have heterogeneous classification schemes where local ontologies, in the form of such classification schemes, may be mapped onto a common ontology. In previous work, we have developed a method for the integration of such aggregates; the method previously developed is efficient, but cannot handle innate data inconsistencies that are likely to arise when a large number of databases are being integrated. In this paper, we develop an approach that can handle data inconsistencies and is thus inherently much more scalable. In our new approach, we first construct a dynamic shared ontology by analyzing the correspondence graph that relates the heterogeneous classification schemes; the aggregates are then derived by minimization of the Kullback-Leibler information divergence using the EM (Expectation-Maximization) algorithm. Thus, we may assess whether global queries on such aggregates are answerable, partially answerable, or unanswerable in advance of computing the aggregates themselves.
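The EM step described in abstracts 1 and 2 can be illustrated with a minimal sketch. It assumes the shared ontology has already been reduced to a set of fine-grained categories numbered from 0, and that each local class is simply a set of those categories with an observed count. The function name `em_integrate`, the input layout, and the fixed iteration count are illustrative assumptions, not the authors' implementation; this is the textbook EM for coarsened multinomial counts.

```python
def em_integrate(views, num_categories, iterations=200):
    """Estimate proportions over shared fine-grained categories.

    views: list of (category_set, count) pairs pooled from all databases
           (hypothetical input layout, chosen for this sketch).
    """
    # Start from a uniform distribution over the shared categories.
    p = [1.0 / num_categories] * num_categories
    total = sum(count for _, count in views)
    for _ in range(iterations):
        allocated = [0.0] * num_categories
        for cats, count in views:
            if count == 0:
                continue  # an empty class contributes nothing
            mass = sum(p[k] for k in cats)
            # E-step: split the class count over its fine categories
            # in proportion to the current estimates.
            for k in cats:
                allocated[k] += count * p[k] / mass
        # M-step: renormalise the expected counts into proportions.
        p = [a / total for a in allocated]
    return p


# Two databases classify the same population differently.
views = [
    ({0, 1}, 60), ({2}, 40),   # database A merges categories 0 and 1
    ({0}, 30), ({1, 2}, 70),   # database B merges categories 1 and 2
]
proportions = em_integrate(views, num_categories=3)
```

In this example the two views are mutually consistent, so the estimates approach 0.3, 0.3 and 0.4; with inconsistent counts the same iteration settles on a compromise rather than failing.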

3.
A global schema is a single, connected view of heterogeneous databases. Past research into the problem of global schema design has demonstrated the use of generalization for connecting disjoint schemas. Before generalization can be applied, the common attributes of the local schemas must be identified. We use the term attribute equivalence for the identification of such common attributes. This paper defines four types of attribute equivalences. The distinction between local and global equivalence is explained, and special attention is given to key attributes. We also discuss the placement of locally equivalent attributes in a global schema.

4.
This paper presents a query processing algorithm, formulated and developed in support of the prototype architecture of the Distributed Access View Integrated Database (DAVID), which is a heterogeneous distributed database management system. The objective of the proposed query processing algorithm is to produce an inexpensive strategy for a given query. The inexpensive query strategy is obtained primarily by computing the most profitable semi-joins and by determining the best sequence of join operations per processing site. The latter is obtained by applying a zero-one integer linear program that uses a non-parametric statistical estimation technique to compute the sizes of the temporary clusters. A cluster is a subset of the cartesian product of a list of atomic and non-atomic domains and is the structure that can represent in a uniform way data stored in relational, hierarchical and network databases. Following some background information on the development of the DAVID prototype, this paper introduces the schema architecture. The schema architecture describes the mechanism by which the component heterogeneous database schemata are mapped into the uniform global schema. This is followed by the formulation of the query processing algorithm, its implementation and an illustration of its use in the context of NASA's Astrophysics Data System. Recommended by: Y. Breitbart

5.
Building Finder uses semantic Web technologies to integrate different data types from various online data sources. The application's use of RDF and the RDF data query language makes it usable by computer agents as well as human users. An agent would send a query, expressed in terms of its preferred ontology (schema), to a system that would then find and integrate the relevant data from multiple sources and return it using the agent's ontology. We discuss retrieving and semantically integrating heterogeneous data from the Web.

6.
Range aggregate processing in spatial databases   Cited by: 3 (self-citations: 0, citations by others: 3)
A range aggregate query returns summarized information about the points falling in a hyper-rectangle (e.g., the total number of these points instead of their concrete ids). This paper studies spatial indexes that solve such queries efficiently and proposes the aggregate Point-tree (aP-tree), which achieves logarithmic cost to the data set cardinality (independently of the query size) for two-dimensional data. The aP-tree requires only small modifications to the popular multiversion structural framework and, thus, can be implemented and applied easily in practice. We also present models that accurately predict the space consumption and query cost of the aP-tree and are therefore suitable for query optimization. Extensive experiments confirm that the proposed methods are efficient and practical.

7.
8.
The problem of finding an optimal distribution of a database over a computer network to facilitate parallel searching for a set of database queries is analysed in this paper. The parallel searching of multiple segments required by the queries lowers the response time considerably. Procedures for finding the optimal distributions in a network to maximally exploit the parallel search capability with or without redundancy of segment types are proposed.

9.
Allocating fragments in distributed databases   Cited by: 2 (self-citations: 0, citations by others: 2)
For a distributed database system to function efficiently, the fragments of the database need to be located judiciously at various sites across the relevant communications network. The problem of allocating these fragments to the most appropriate sites is a difficult one to solve, however, with most approaches available relying on heuristic techniques. Optimal approaches are usually based on mathematical programming, and formulations available for this problem are based on the linearization of nonlinear binary integer programs and have been observed to be ineffective except on very small problems. This paper presents new integer programming formulations for the nonredundant version of the fragment allocation problem. This formulation is extended to address problems which have both storage and processing capacity constraints; the approach is observed to be particularly effective in the presence of capacity restrictions. Extensive computational tests conducted over a variety of parameter values indicate that the reformulations are very effective even on relatively large problems, thereby reducing the need for heuristic approaches.

10.
A DAG (directed acyclic graph) is an important data structure which requires efficient support in CAD (computer-aided design) databases. It typically arises from the design hierarchy, which describes complex designs in terms of subdesigns. A study is made of the properties of the three types of clustered sequences of nodes for hierarchies and DAGs, and algorithms are developed for generating the clustered sequences, retrieving the descendants of a given node, and inserting new nodes into existing clustered sequences of nodes which preserve their clustering properties. The performance of the clustering sequences is compared.

11.
A reduced cover set of the set of full reducer semijoin programs for an acyclic query graph for a distributed database system is given. An algorithm is presented that determines the minimum cost full reducer program. The computational complexity of finding the optimal full reducer for a single relation is of the same order as that of finding the optimal full reducer for all relations. The optimization algorithm is able to handle query graphs where more than one attribute is common between the relations. A method for determining the optimum profitable semijoin program is presented. A low-cost algorithm which determines a near-optimal profitable semijoin program is outlined. This is done by converting a semijoin program into a partial order graph. This graph also allows one to maximize the concurrent processing of the semijoins. It is shown that the minimum response time is given by the largest cost path of the partial order graph. This reducibility is used as a post optimizer for the SSD-1 query optimization algorithm. It is shown that the least upper bound on the length of any profitable semijoin program is N(N-1) for a query graph of N nodes.
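For readers unfamiliar with the semijoin operation that abstracts 4 and 11 rely on, the following minimal sketch shows a single semijoin step and a naive profitability check. The row layout, the function names, and the size-based cost model (`key_size`, `row_size`) are assumptions made for illustration only; they are not taken from either paper.

```python
def semijoin(r_rows, s_rows, attr):
    """Return the rows of R that have a join partner in S on `attr`."""
    # Only the distinct join values of S have to be shipped to R's site.
    s_keys = {row[attr] for row in s_rows}
    return [row for row in r_rows if row[attr] in s_keys]


def is_profitable(r_rows, s_rows, attr, key_size=1, row_size=10):
    """Hypothetical cost model: the semijoin pays off when the data removed
    from R outweighs the cost of shipping S's distinct join values."""
    shipped = len({row[attr] for row in s_rows}) * key_size
    saved = (len(r_rows) - len(semijoin(r_rows, s_rows, attr))) * row_size
    return saved > shipped


orders = [
    {"cust": 1, "item": "disk"},
    {"cust": 2, "item": "tape"},
    {"cust": 4, "item": "cpu"},
]
customers = [{"cust": 1, "city": "Oslo"}, {"cust": 4, "city": "Rome"}]

reduced = semijoin(orders, customers, "cust")
worth_it = is_profitable(orders, customers, "cust")
```

A semijoin program in the sense of the abstract is a sequence of such steps across sites; choosing and ordering them is the optimization problem the paper addresses.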

12.
In this paper, we present an innovative system, coined as DISTROD (a.k.a. DISTRibuted Outlier Detector), for detecting outliers, namely abnormal instances or observations, from multiple large distributed databases. DISTROD is able to effectively detect the so-called global outliers from distributed databases that are consistent with those produced by the centralized detection paradigm. DISTROD is equipped with a number of optimization/boosting strategies which empower it to significantly enhance its speed performance and reduce its communication overhead. Experimental evaluation demonstrates the good performance of DISTROD in terms of speed and communication overhead.

13.
The skyline-join operator, as an important variant of skylines, plays an important role in multi-criteria decision making problems. However, as the data scale increases, previous methods of skyline-join queries cannot be applied to new applications. Therefore, in this paper, we make the first attempt to propose a scalable method to process skyline-join queries in distributed databases. First, a tailored distributed framework is presented to facilitate the computation of skyline-join queries. Second, the distributed skyline-join query algorithm (DSJQ) is designed to process skyline-join queries. DSJQ contains two phases. In the first phase, two filtering strategies are used to filter out unpromising tuples from the original tables. The remaining tuples are transmitted to the corresponding data nodes according to a partition function, which can guarantee that the tuples with the same join value are transferred to the same node. In the second phase, we design a scheduling plan based on rotations to calculate the final skyline-join result. The scheduling plan can ensure that calculations are equally assigned to all the data nodes, and the calculations on each data node can be processed in parallel without creating a bottleneck node. Finally, the effectiveness of DSJQ is evaluated through a series of experiments.
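The two ingredients named in this abstract, a partition function that sends equal join values to the same node and a skyline over the joined tuples, can be sketched as follows. This is a simplified illustration, not DSJQ: it omits the filtering strategies and the rotation-based scheduling, assumes smaller attribute values are better, and all function names and the tuple layout are hypothetical.

```python
import zlib
from collections import defaultdict


def dominates(a, b):
    """a dominates b if it is no worse everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def node_for(join_value, num_nodes):
    # Deterministic hash, so equal join values always map to the same node.
    return zlib.crc32(repr(join_value).encode()) % num_nodes


def distributed_skyline_join(left, right, num_nodes):
    """left/right: lists of (join_value, attribute_tuple) pairs."""
    nodes = defaultdict(lambda: ([], []))
    for key, attrs in left:
        nodes[node_for(key, num_nodes)][0].append((key, attrs))
    for key, attrs in right:
        nodes[node_for(key, num_nodes)][1].append((key, attrs))
    local_results = []
    for l_part, r_part in nodes.values():
        # Each node joins its own partitions and keeps only its local skyline.
        joined = [la + ra for lk, la in l_part for rk, ra in r_part if lk == rk]
        local_results.extend(skyline(joined))
    # A final pass removes tuples dominated by results from other nodes.
    return skyline(local_results)


left = [("a", (1,)), ("a", (3,)), ("b", (2,))]
right = [("a", (4,)), ("b", (1,)), ("c", (0,))]
result = distributed_skyline_join(left, right, num_nodes=4)
```

The final merge is valid because any tuple in the global skyline is also in the local skyline of the node that produced it.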

14.
An approach is presented for managing distributed database systems in the face of communication failures and network partitions. The approach is based on the idea of dividing the database into fragments and assigning each fragment a controlling entity called an agent. The goals achieved by this approach include high data availability and the ability to operate without promptly and correctly detecting partitions. A correctness criterion for transaction execution, called fragmentwise serializability, is introduced. It is less strict than the conventional serializability, but provides a valuable alternative for some applications.

15.
In many distributed databases locality of reference is crucial to achieve acceptable performance. However, the purpose of data distribution is to spread the data among several remote sites. One way to solve this contradiction is to use partitioned data techniques. Instead of accessing the entire data, a site works on a fraction that is made locally available, thereby increasing the site's autonomy. We present a theory of partitioned data that formalizes the concept and establishes the basis to develop a correctness criterion and a concurrency control protocol for partitioned databases. Set-serializability is proposed as a correctness criterion and we suggest an implementation that integrates partitioned and non-partitioned data. To complete this study, the policies required in a real implementation are also analyzed. Recommended by: Hector Garcia-Molina

16.
The state of the art of searching for non-text data (e.g., images) is to use extracted metadata annotations or text, which might be available as related information. However, supporting real content-based audiovisual search, based on similarity search on features, is significantly more expensive than searching for text. Moreover, such search exhibits linear scalability with respect to the dataset size, so parallel query execution is needed. In this paper, we present a Distributed Incremental Nearest Neighbor algorithm (DINN) for finding closest objects in an incremental fashion over data distributed among computer nodes, each able to perform its local Incremental Nearest Neighbor (local-INN) algorithm. We prove that our algorithm is optimum with respect to both the number of involved nodes and the number of local-INN invocations. An implementation of our DINN algorithm, on a real P2P system called MCAN, was used for conducting an extensive experimental evaluation on a real-life dataset. The proposed algorithm is being used in two running projects: SAPIR and NeP4B.
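The idea of combining per-node incremental nearest-neighbor streams into one global stream can be sketched with a lazy merge. This is only an illustration of the incremental interface: it uses one-dimensional integer objects, a pre-sorted stand-in for local-INN, and `heapq.merge`, which reads the first result from every node up front. It therefore does not reproduce DINN's optimality with respect to the number of involved nodes; all names are hypothetical.

```python
import heapq
from itertools import islice


def local_inn(objects, query):
    """Stand-in for a node's local-INN: yield (distance, object) pairs
    in order of increasing distance from the query."""
    for obj in sorted(objects, key=lambda o: abs(o - query)):
        yield abs(obj - query), obj


def distributed_inn(nodes, query):
    """Lazily merge the per-node streams into one globally ordered stream."""
    streams = [local_inn(objs, query) for objs in nodes]
    return heapq.merge(*streams)


nodes = [[1, 9, 20], [4, 11], [7]]
# Ask for the three globally nearest objects to the query point 8.
nearest = list(islice(distributed_inn(nodes, query=8), 3))
```

Because the merged result is a generator, a caller can keep requesting further neighbors without restarting the search, which is the incremental behavior the abstract refers to.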

17.
The execution of logic queries in a distributed database environment is studied. Conventional optimization strategies, such as the early evaluation of selection conditions and the clustering of processing to manipulate and exchange large sets of tuples, are redefined in view of the additional difficulties due to logic queries, in particular to recursive rules. In order to allow efficient processing of these logic queries, several program transformation techniques that attempt to minimize distribution costs based on the idea of semijoins and generalized semijoins in conventional databases are presented. Although local computation of semijoins is not possible for the general case, classes of programs are indicated for which these transformations succeed in producing set-oriented computation. Processes evaluating the recursive program in a distributed network are described, and an efficient method for testing the termination of the computation is developed. The approach is compared with sequential as well as dataflow-oriented evaluation.

18.
We show that some relational queries, which we call quantified queries, are not well supported in distributed environments. We give a formal definition of quantified queries, propose a language in which to express said queries and provide a procedure to compute answers in this new language in the context of distributed databases. The proposed language is made up of high-level, declarative operators (called generalised quantifiers), and therefore it can be used in combination with several distributed frameworks. Our approach is designed to be as general as possible; it assumes horizontally partitioned relations, but nothing else, so no data placement or replication is used. We present an implementation and algorithms for the new language, propose some basic optimisations and give experimental results which show that the new approach is indeed quite efficient and scales well.

19.
Spectral databases constitute one of the components of a complete observing system, storing in situ spectroscopic measurements plus associated metadata and providing data for the validation, calibration, and simulation of imaging spectrometer products. Such databases may be employed by physically or organisationally separate entities. Consequently, methods for data exchange between distributed spectral databases are required, allowing the transfer of defined subsets of spectral data including their full metadata context from a source to a target system. The data exchange comprises generic approaches to the sequential steps of ordered table row export, relational storage in XML files, and nonconflicting import into the target database. The SPECCHIO spectral database system was used as a test bed for the data exchange between databases of identical schemata, and corresponding import/export functionality has been added to the SPECCHIO application. Import and export speeds were assessed using test cases of different metadata space densities, a score for the density with which associated metadata are detailed, and the potential utility as a quantitative rating for quality. Future spectral databases should allow the exchange between heterogeneous systems, ideally implementing a common subset of metadata parameters and thus supporting the long-term usability and data sharing between research partners.

20.
M. J. R. Shave. Software, 1980, 10(2): 135-147
A relational model of a small distributed database is used to illustrate problems of consistency and integrity with specific reference to distributed systems, and to discuss methods for their solution. It is shown that some limitation may be necessary on the freedom to make modifications on a purely local basis, and that questions of time consistency due to the network have a partial but limiting solution.
