Similar Documents
20 similar documents were retrieved.
1.
Repositories of unstructured data types, such as free text, images, audio, and video, have recently been emerging in various fields. A general approach to searching such data is similarity search, where the search is for similar objects and similarity is modeled by a metric distance function. In this article we propose a new dynamic, paged, and balanced access method for similarity search in metric data sets, named the CM-tree (Clustered Metric tree). It fully supports dynamic insertions and deletions, both of single objects and in bulk. Unlike other methods, it is specifically designed to achieve a structure of tight, low-overlapping clusters via its primary construction algorithms (rather than by post-processing), yielding significantly improved performance. Several new methods are introduced to achieve this: a strategy for selecting representative objects of nodes, a clustering-based node-split algorithm with criteria for triggering a split, and an improved sub-tree pruning method used during search. To facilitate these methods, the pairwise distances between the objects of a node are maintained within each node. Results from an extensive experimental study show that the CM-tree outperforms the M-tree and the Slim-tree, improving search performance by up to 312% in I/O costs and 303% in CPU costs.
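The abstract notes that the pairwise distances between the objects of a node are stored inside the node so that sub-trees can be pruned cheaply. A minimal Python sketch of that idea, using the triangle inequality on one computed query-to-pivot distance plus the stored pairwise distances; the Node layout and names here are illustrative assumptions, not the CM-tree's actual structure.

    import math

    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    class Node:
        """Hypothetical internal node: child pivots, covering radii, stored pairwise distances."""
        def __init__(self, pivots, radii, pairwise):
            self.pivots = pivots        # representative objects of the child clusters
            self.radii = radii          # covering radius of each child cluster
            self.pairwise = pairwise    # pairwise[i][j] = d(pivots[i], pivots[j]), precomputed

    def candidate_children(node, query, search_radius, dist=euclidean):
        """Return the children that cannot be pruned for a range query.

        Only d(query, pivots[0]) is actually computed; for every other pivot the
        stored pairwise distance gives a triangle-inequality lower bound, so most
        expensive distance computations are avoided."""
        d_q_ref = dist(query, node.pivots[0])
        survivors = []
        for i, radius in enumerate(node.radii):
            lower_bound = abs(d_q_ref - node.pairwise[0][i]) - radius
            if lower_bound <= search_radius:
                survivors.append(i)
        return survivors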

2.
This work focuses on fast nearest neighbor (NN) search algorithms that work in any metric space (not just Euclidean distance) and where the distance computation is very time consuming. One of the best-known methods in this field is the AESA algorithm, used as a baseline for performance measurement for over twenty years. AESA repeats two steps: first it selects a promising NN candidate and computes its distance (approximation step), then it eliminates all NN candidates that are no longer viable in view of the information acquired by that computation (elimination step). This work introduces the PiAESA algorithm, which improves on AESA by splitting the approximation criterion: in the first iterations, when there is not yet enough information to find good NN candidates, it uses a list of pivots (objects in the database) to obtain a cheap approximation of the distance function; once a good approximation is obtained, it switches to the usual AESA behavior. Since the pivot list is built at preprocessing time, the running time of PiAESA is almost the same as that of AESA. We report experiments comparing the method with several competing approaches. Our empirical results show that this new approach achieves a significant reduction in distance computations with no execution-time penalty.
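A rough Python sketch of the approximation/elimination loop described above, with the PiAESA twist of drawing early candidates from a precomputed pivot list. All names are illustrative, and the pairwise distance matrix is assumed to be precomputed, as AESA requires; this is not the authors' implementation.

    import math

    def dist(a, b):
        # Any metric distance works; Euclidean is used only to keep the example runnable.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def piaesa_like_nn(query, database, pivot_ids, pairwise):
        """pairwise[i][j] = d(database[i], database[j]), computed at preprocessing time.

        Early iterations take their candidate from the pivot list (the PiAESA idea);
        later iterations fall back to AESA's rule: the alive object with the smallest
        lower bound on its distance to the query."""
        alive = set(range(len(database)))
        lower = {i: 0.0 for i in alive}
        best_i, best_d = None, float("inf")
        pivot_queue = list(pivot_ids)
        while alive:
            if pivot_queue and pivot_queue[0] in alive:
                cand = pivot_queue.pop(0)                    # approximation via the pivot list
            else:
                pivot_queue = [p for p in pivot_queue if p in alive]
                cand = min(alive, key=lambda i: lower[i])    # usual AESA behaviour
            d = dist(query, database[cand])                  # the only expensive operation
            alive.discard(cand)
            if d < best_d:
                best_i, best_d = cand, d
            for i in list(alive):                            # elimination step
                lower[i] = max(lower[i], abs(d - pairwise[cand][i]))
                if lower[i] >= best_d:
                    alive.discard(i)
        return best_i, best_d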

3.
Little work has been reported in the literature on supporting k-nearest neighbor (k-NN) searches/queries in hybrid data spaces (HDS). An HDS is composed of a combination of continuous and non-ordered discrete dimensions. This combination presents new challenges in data organization and search ordering. In this paper, we present an algorithm for k-NN searches using a multidimensional index structure in hybrid data spaces. We examine the concept of search stages and use the properties of an HDS to derive a new search heuristic that greatly reduces the number of disk accesses in the initial stage of searching. Further, we present a performance model for our algorithm that estimates the cost of performing such searches. Our experimental results demonstrate the effectiveness of our algorithm and the accuracy of our performance estimation model.
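The abstract does not spell out the distance function, but a hybrid-space distance is typically a combination of a continuous part and a non-ordered discrete part. A small illustrative sketch follows; the equal-weight combination and the brute-force baseline are assumptions for illustration, not the paper's index or cost model.

    import math

    def hybrid_distance(a_cont, a_disc, b_cont, b_disc, w_cont=1.0, w_disc=1.0):
        """Distance over a hybrid data space: Euclidean on the continuous dimensions,
        mismatch count on the non-ordered discrete dimensions. The weights are
        illustrative; a real index would normalise the two parts carefully."""
        cont = math.sqrt(sum((x - y) ** 2 for x, y in zip(a_cont, b_cont)))
        disc = sum(1 for x, y in zip(a_disc, b_disc) if x != y)
        return w_cont * cont + w_disc * disc

    def knn(query_cont, query_disc, records, k=3):
        """Brute-force k-NN over hybrid records (cont_tuple, disc_tuple), as a baseline."""
        ranked = sorted(records, key=lambda r: hybrid_distance(query_cont, query_disc, r[0], r[1]))
        return ranked[:k]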

4.
Using a distributed quadtree index in peer-to-peer networks
Peer-to-peer (P2P) networks have become a powerful means for online data exchange. Currently, users primarily use these networks to perform exact-match queries and retrieve complete files. However, future, more data-intensive applications, such as P2P auction networks, P2P job-search networks, and P2P multiplayer games, will require the capability to respond to more complex queries, such as range queries involving numerous data types, including those with a spatial component. In this paper, we describe a distributed quadtree index that adapts the MX-CIF quadtree to enable more powerful access to data in P2P networks. This index has been implemented for various prototype P2P applications, and results of experiments are presented. Our index is easy to use, scalable, and exhibits good load-balancing properties. Similar indices can be constructed for various multidimensional data types with both spatial and non-spatial components.
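As a toy illustration of the general idea of a distributed quadtree (not the paper's MX-CIF construction), each object can be assigned the code of the quadtree cell that contains it, and the cell code can be hashed to pick the peer responsible for indexing it:

    import hashlib

    def quadtree_cell(x, y, depth, world=(0.0, 0.0, 1.0, 1.0)):
        """Return the code (string of quadrant digits) of the depth-level cell containing (x, y)."""
        x0, y0, x1, y1 = world
        code = ""
        for _ in range(depth):
            mx, my = (x0 + x1) / 2, (y0 + y1) / 2
            quad = (1 if x >= mx else 0) + (2 if y >= my else 0)
            code += str(quad)
            x0, x1 = (mx, x1) if x >= mx else (x0, mx)
            y0, y1 = (my, y1) if y >= my else (y0, my)
        return code

    def responsible_peer(cell_code, peers):
        """Hash the cell code onto one of the peers (consistent hashing omitted for brevity)."""
        h = int(hashlib.sha1(cell_code.encode()).hexdigest(), 16)
        return peers[h % len(peers)]

    # Example: the object at (0.7, 0.2) is indexed under its level-3 cell on some peer.
    peers = ["peer-a", "peer-b", "peer-c", "peer-d"]
    cell = quadtree_cell(0.7, 0.2, depth=3)
    print(cell, "->", responsible_peer(cell, peers))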

5.
Similarity searching in metric spaces has a vast number of applications in several fields, such as multimedia databases, text retrieval, computational biology, and pattern recognition. In this context, one of the most important similarity queries is the k nearest neighbor (k-NN) search. The standard best-first k-NN algorithm uses a lower bound on the distance to prune objects during the search. Although the method is optimal in several respects, its disadvantage is that the space required for the priority queue of unprocessed clusters can be linear in the database size. Most of the optimizations used in spatial access methods (for example, pruning using MinMaxDist) cannot be applied in metric spaces, due to the lack of geometric properties. We propose a new k-NN algorithm that uses distance estimators, aiming to reduce the storage requirements of the search algorithm. The method stays optimal, yet it can significantly prune the priority queue without altering the output of the query. Experimental results with synthetic and real datasets confirm the reduction in storage space of our proposed algorithm, showing savings of up to 80% of the original space requirement.
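For reference, the standard best-first k-NN loop that the abstract starts from can be sketched as follows: clusters are drawn from a priority queue ordered by a lower bound on their distance to the query, and the loop stops once no queued cluster can improve the current k-th distance. The Node layout is a hypothetical stand-in for a generic metric tree, not the paper's structure.

    import heapq, math
    from collections import namedtuple

    Node = namedtuple("Node", "center radius children objects")   # hypothetical tree node

    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def best_first_knn(query, root, k, dist=euclidean):
        """Best-first k-NN over a metric tree using the lower bound
        max(0, d(q, center) - radius) to order the priority queue."""
        results = []                     # max-heap of (-distance, counter, object), size <= k
        queue = [(0.0, 0, root)]         # min-heap of (lower bound, counter, node)
        counter = 1
        while queue:
            bound, _, node = heapq.heappop(queue)
            if len(results) == k and bound >= -results[0][0]:
                break                    # no remaining cluster can improve the answer
            for obj in node.objects:
                d = dist(query, obj)
                if len(results) < k:
                    heapq.heappush(results, (-d, counter, obj))
                elif d < -results[0][0]:
                    heapq.heapreplace(results, (-d, counter, obj))
                counter += 1
            for child in node.children:
                lb = max(0.0, dist(query, child.center) - child.radius)
                heapq.heappush(queue, (lb, counter, child))
                counter += 1
        return sorted((-negd, obj) for negd, _, obj in results)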

Benjamin Bustos is an assistant professor in the Department of Computer Science at the University of Chile. He is also a researcher at the Millennium Nucleus Center for Web Research. His research interests are similarity searching and multimedia information retrieval. He has a doctoral degree in natural sciences from the University of Konstanz, Germany. Contact him at bebustos@dcc.uchile.cl. Gonzalo Navarro earned his PhD in Computer Science at the University of Chile in 1998, where he is now Full Professor. His research interests include similarity searching, text databases, compression, and algorithms and data structures in general. He has coauthored a book on string matching and around 200 international papers. He has (co)chaired the international conferences SPIRE 2001, SCCC 2004, SPIRE 2005, SIGIR Posters 2005, IFIP TCS 2006, and the ENC 2007 Scalable Pattern Recognition track, and belongs to the editorial board of the Information Retrieval Journal. He is currently Head of the Department of Computer Science at the University of Chile and Head of the Millennium Nucleus Center for Web Research, the largest Chilean project in Computer Science research.

6.
This paper presents a rigorous analytic study of gossip-based message dissemination schemes that can be employed for content/service dissemination or discovery in unstructured, distributed networks. When using random gossiping, communication with multiple peers in one gossiping round is allowed. The algorithms studied in this paper are considered under different network conditions, depending on the knowledge of the state of the neighboring nodes in the network. Different node behaviors, with respect to their degree of cooperation and compliance with the gossiping process, are also incorporated. From the exact analysis, several important performance metrics and design parameters are analytically determined. Based on the proposed metrics and parameters, the performance of the gossip-based dissemination or search schemes and the impact of the design parameters are evaluated.

7.
A distributed database is an exciting concept, since it combines the functional advantages of an integrated database with the economic advantages of a distributed implementation. However, a potential implementor may well ask how much expensive special-purpose software must be produced locally or otherwise added to realise a distributed database or, in other words, how much support for the distributed database concept is currently available from manufacturers or software vendors. This paper outlines the requirements of distributed database systems and attempts to survey the present level of support.

8.
File replication is a widely used technique for achieving high performance in peer-to-peer content delivery networks. A file replication technique should be efficient and, at the same time, facilitate efficient file consistency maintenance. However, most traditional methods do not consider nodes' available capacity and physical location when replicating files, leading to high overhead for both file replication and consistency maintenance. This paper presents a proactive, low-overhead file replication scheme named Plover. By placing file replicas among physically close nodes based on the nodes' available capacities, Plover not only achieves high efficiency in file replication but also supports low-cost and timely consistency maintenance. It also includes an efficient file query redirection algorithm for load balancing between replica nodes. Theoretical analysis and simulation results demonstrate the effectiveness of Plover in comparison with other file replication schemes: it dramatically reduces the overhead of both file replication and consistency maintenance, and it yields a significant reduction in the number of overloaded nodes.

9.
10.
Proximity searches become very difficult in “high-dimensional” metric spaces, that is, those whose histogram of distances has a large mean and/or a small variance. This so-called “curse of dimensionality”, well known in vector spaces, is also observed in metric spaces. The search complexity grows sharply with the dimension and with the search radius. We present a general probabilistic framework applicable to any search algorithm, whose net effect is to reduce the search radius. The higher the dimension, the more effective the technique. We illustrate its practical performance empirically on a particular class of algorithms, where large improvements in search time are obtained at the cost of a very small error probability.
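One concrete instance of such a probabilistic reduction is to "stretch" the pivot-based lower bound before pruning, which behaves like searching with a smaller radius. The sketch below illustrates the general framework only; it is not the paper's exact estimator.

    import math

    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def probabilistic_range_query(query, pivot, indexed, radius, stretch=1.5, dist=euclidean):
        """Range search with pivot-based pruning in which the triangle-inequality
        lower bound is multiplied by 'stretch' (>= 1). Pruning becomes far more
        aggressive in high dimensions, at the cost of a small probability of
        missing a true answer."""
        d_qp = dist(query, pivot)
        answers = []
        for obj, d_op in indexed:                    # d_op = d(obj, pivot), precomputed
            if abs(d_qp - d_op) * stretch > radius:
                continue                             # pruned without computing d(query, obj)
            if dist(query, obj) <= radius:
                answers.append(obj)
        return answers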

11.
Data availability is an important requirement of distributed databases. Replication is a technique that has been proposed to meet this need. In the absence of failures, traditional replica control algorithms provide complete availability in the sense that any transaction can be executed. The worst case for data availability occurs when the system is totally partitioned (each operational site is isolated from every other site). In this paper, we present techniques to achieve high availability under combinations of site failures and partitions. Users are required to specify the database access requirements for the totally-partitioned environment. This information is represented by means of a Read Access Graph (RAG). When failures occur, the set of items that may be accessed by a transaction depends on the connectivity of the network and the RAG. The techniques ensure that, as failures occur, the loss of availability is gradual and graceful. Data availability improves with the level of normalcy in the system. Unless there is a complete failure, at least some predefined set of transactions can be executed. It is shown that these algorithms preserve the integrity of the database by ensuring that all executions are one-copy serializable. The algorithms compare favorably with other replica management schemes in terms of availability. K. Brahmadathan obtained a Bachelor's degree in Electronics and Communications Engineering from the University of Kerala, Trivandrum, India; a Master's degree in Computer Science from the Indian Institute of Technology, Madras, India; and the M.S. and Ph.D. degrees in Computer Science from the University of Pittsburgh. Since 1989, he has been an Assistant Professor of Computer Science at the University of Wyoming. His research interests are in the areas of database systems and distributed systems. K.V.S. Ramarao obtained his M.Sc. in Applied Mathematics from Andhra University, Waltair, India; his M.Tech. in Computer Science from IIT Kanpur, India; and his Ph.D. in Computing Science from the University of Alberta, Edmonton, Canada. He is currently a Senior Technologist for Southwestern Bell Technology Resources, Inc. Prior to that, he was an Assistant Professor at the University of Pittsburgh. His current research interests include distributed systems and distributed databases.

12.
This paper studies the problem of answering aggregation queries satisfying interval validity semantics in a distributed system prone to continuous arrival and departure of participants. The interval validity semantics states that the query answer must be calculated considering contributions from at least all processes that remained in the distributed system for the whole query duration. Satisfying this semantics in systems experiencing unbounded churn is impossible due to the lack of connectivity and path stability between processes. This paper presents a novel architecture, named Virtual Tree, for building and maintaining a structured overlay network with guaranteed connectivity and path stability in settings characterized by a bounded churn rate. The architecture includes a simple query answering algorithm that provides interval-valid answers. The overlay network generated by the Virtual Tree architecture is a tree-shaped topology with virtual nodes constituted by clusters of processes and virtual links constituted by multiple communication links connecting processes located in adjacent virtual nodes. We formally prove a bound on the churn rate for interval-valid queries in a distributed system where communication latencies are bounded by a constant unknown to the processes. Finally, we carry out an extensive experimental evaluation that shows the degree of robustness of the overlay network generated by the Virtual Tree architecture under different churn rates.

13.
Today’s peer-to-peer networks are designed on the assumption that the participating nodes are cooperative, which does not hold in reality. Incentive mechanisms that promote cooperation must therefore be introduced. However, the existing incentive schemes (using either reputation or virtual currency) suffer from various attacks based on false reports. Even worse, a colluding group of malicious nodes in a peer-to-peer network can manipulate the history information of its own members, and the damaging power increases dramatically with the group size. Such malicious nodes and collusions are difficult to detect, especially in a large network without a centralized authority. In this paper, we propose a new distributed incentive scheme in which the amount that a node can benefit from the network is proportional to its contribution, malicious nodes can only attack others at the cost of their own interests, and a colluding group cannot gain an advantage through cooperation regardless of its size. Consequently, the damaging power of colluding groups is strictly limited. The proposed scheme consists of three major components: a distributed authority infrastructure, a key sharing protocol, and a contract verification protocol.

14.
Most of the routing algorithms devised for sensor networks consider either energy constraints or bandwidth constraints to maximize the network lifetime. In a real scenario, both energy and bandwidth are scarce resources for sensor networks. Energy constraints affect only sensor routing, whereas link bandwidth affects both the routing topology and the data rate on each link. Therefore, a heuristic technique that combines both energy and bandwidth constraints for better routing in wireless sensor networks is proposed. Link bandwidth is allocated based on the remaining energy, making the routing solution feasible under bandwidth constraints. The scheme uses an energy-efficient algorithm called the nearest neighbor tree (NNT) for routing. The data gathered from neighboring nodes are also aggregated using an averaging technique in order to reduce the number of data transmissions. Experimental results show that this technique yields good solutions for increasing the sensor network lifetime. The proposed work is also tested on a wildfire application.
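A toy sketch of the two ingredients just described: link bandwidth shared in proportion to the nodes' remaining energy, and a nearest neighbor tree in which each node attaches to the closest node already in the tree. The helper names and the greedy insertion order are assumptions for illustration; the paper's heuristic is more elaborate.

    import math

    def allocate_bandwidth(total_bandwidth, remaining_energy):
        """Split a shared link's bandwidth among nodes in proportion to remaining energy."""
        total = sum(remaining_energy.values())
        return {n: total_bandwidth * e / total for n, e in remaining_energy.items()}

    def nearest_neighbor_tree(positions, sink):
        """Build an NNT: visit nodes by increasing distance to the sink and attach
        each one to the nearest node already in the tree (the sink starts the tree)."""
        def d(a, b):
            return math.dist(positions[a], positions[b])
        in_tree, parent = [sink], {}
        for node in sorted(positions, key=lambda n: d(n, sink)):
            if node == sink:
                continue
            parent[node] = min(in_tree, key=lambda t: d(node, t))
            in_tree.append(node)
        return parent

    # Example
    positions = {"sink": (0, 0), "a": (1, 0), "b": (2, 1), "c": (0.5, 2)}
    print(nearest_neighbor_tree(positions, "sink"))
    print(allocate_bandwidth(100.0, {"a": 40, "b": 10, "c": 50}))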

15.
16.
As databases increasingly integrate different types of information, such as time-series, multimedia, and scientific data, it becomes necessary to support efficient retrieval of multi-dimensional data. Both the dimensionality and the amount of data that needs to be processed are increasing rapidly. As a result of this scale and high-dimensional nature, traditional techniques have proven inadequate. In this paper, we propose search techniques that are effective especially for large high-dimensional data sets. We first propose the VA+-file technique, which is based on scalar quantization of the data. The VA+-file is especially useful for searching for exact nearest neighbors (NN) in non-uniform high-dimensional data sets. We then discuss how to improve the search and make it progressive by allowing some approximation in the query result. We develop a general framework for approximate NN queries, discuss various approaches for progressive processing of similarity queries, and develop a metric for evaluating such techniques. Finally, a new technique based on clustering is proposed, which merges the benefits of various approaches for progressive similarity searching. Extensive experimental evaluation is performed on several real-life data sets. The evaluation establishes the superiority of the proposed techniques over existing techniques for high-dimensional similarity searching. The techniques proposed in this paper are effective for real-life data sets, which are typically non-uniform, and they are scalable with respect to both the dimensionality and the size of the data set.
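The VA-file family scans compact per-dimension quantizations first and refines only promising candidates with the full vectors. Below is a tiny uniform-quantization sketch of that two-phase NN search; the VA+-file additionally adapts the quantizer to the data distribution, which is omitted here, and all names are illustrative.

    import math

    def quantize(vec, bits=4, lo=0.0, hi=1.0):
        """Uniform scalar quantization of each coordinate into 2**bits cells."""
        cells = 2 ** bits
        width = (hi - lo) / cells
        return tuple(min(cells - 1, int((x - lo) / width)) for x in vec)

    def cell_lower_bound(query, cell, bits=4, lo=0.0, hi=1.0):
        """Lower bound on the distance from the query to any vector inside the cell."""
        cells = 2 ** bits
        width = (hi - lo) / cells
        acc = 0.0
        for q, c in zip(query, cell):
            c_lo, c_hi = lo + c * width, lo + (c + 1) * width
            gap = max(c_lo - q, q - c_hi, 0.0)
            acc += gap * gap
        return math.sqrt(acc)

    def va_nn(query, vectors, bits=4):
        """Phase 1: scan the approximations in order of their lower bounds.
        Phase 2: refine with the exact vector until no bound can beat the best."""
        approx = [quantize(v, bits) for v in vectors]
        best_d, best_i = float("inf"), None
        order = sorted(range(len(vectors)), key=lambda i: cell_lower_bound(query, approx[i], bits))
        for i in order:
            if cell_lower_bound(query, approx[i], bits) >= best_d:
                break
            d = math.dist(query, vectors[i])
            if d < best_d:
                best_d, best_i = d, i
        return best_i, best_d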

17.
B. T., Microprocessors and Microsystems, 2002, 26(9-10): 399-406
This paper presents an efficient parallel architecture for Kohonen's Self-Organizing Map (SOM) neural networks and analyzes its area–time complexities. The proposed SIMD architecture for the SOM facilitates its use in real-time applications such as video processing. The operations of norm computation and weight update are performed by the individual neurons, while winner determination is carried out by global serial or parallel logic. Two methods for winner determination are presented, and their time, area, and networking complexities are studied. Optimal techniques for retrieving the winner's index are also proposed for the two methods, and their complexities are investigated.
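The operations that the architecture distributes over the neuron array, norm computation with winner determination followed by the weight update, look as follows in sequential form. This is a plain illustrative sketch of the SOM step itself, not the paper's hardware description.

    import math, random

    def train_som_step(weights, x, learning_rate=0.1, neighborhood=1):
        """One SOM step: each neuron computes its distance to the input, the winner is
        the neuron with the smallest norm, and the winner plus its grid neighbours
        move their weight vectors toward the input."""
        # Norm computation (done in parallel, one neuron per processing element).
        dists = {pos: math.dist(w, x) for pos, w in weights.items()}
        # Winner determination (a global serial or parallel reduction in the architecture).
        winner = min(dists, key=dists.get)
        # Weight update for the winner and its neighbourhood on the 2-D grid.
        for pos, w in weights.items():
            if abs(pos[0] - winner[0]) + abs(pos[1] - winner[1]) <= neighborhood:
                weights[pos] = tuple(wi + learning_rate * (xi - wi) for wi, xi in zip(w, x))
        return winner

    # A 3x3 map with 2-D inputs.
    random.seed(0)
    som = {(i, j): (random.random(), random.random()) for i in range(3) for j in range(3)}
    print(train_som_step(som, (0.8, 0.2)))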

18.
Internet-based distributed systems enable globally scattered resources to be collectively pooled and used in a cooperative manner to achieve unprecedented petascale supercomputing capabilities. Numerous resource discovery approaches have been proposed to help achieve this goal. To report or discover a multi-attribute resource, most approaches use multiple messages, one per attribute, leading to high overhead in memory consumption, node communication, and the subsequent merging operation. Another approach can report and discover a multi-attribute resource with one query by reducing the multiple attributes to a single index, but it is not effective in practice in an environment with a large number of different resource attributes. Furthermore, few approaches are able to locate resources geographically close to the requesters, which is critical to system performance. This paper presents a P2P-based intelligent resource discovery (PIRD) mechanism that weaves all attributes into a set of indices using locality-sensitive hashing and then maps the indices to a structured P2P overlay. PIRD can discover resources geographically close to requesters by relying on a hierarchical P2P structure. It significantly reduces overhead and improves the efficiency and effectiveness of resource discovery. It further incorporates the Lempel–Ziv–Welch algorithm to compress attribute information for higher efficiency. Theoretical analysis and simulation results demonstrate the efficiency of PIRD in comparison with other approaches: it dramatically reduces overhead and yields significant improvements in the efficiency of resource discovery.
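The central step, hashing a whole multi-attribute description into a compact index so that similar resources collide, can be sketched with random-hyperplane LSH. The hash family, the attribute normalization, and the overlay-key derivation below are generic choices made for illustration and may differ from PIRD's.

    import hashlib, random

    random.seed(42)
    NUM_BITS, DIM = 8, 4        # DIM = number of (normalised) resource attributes
    HYPERPLANES = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_BITS)]

    def lsh_signature(attributes):
        """Random-hyperplane LSH: similar attribute vectors tend to get the same bit signature."""
        bits = ["1" if sum(h * a for h, a in zip(plane, attributes)) >= 0 else "0"
                for plane in HYPERPLANES]
        return "".join(bits)

    def overlay_key(signature):
        """Map the signature onto a structured-overlay key (e.g. a DHT identifier)."""
        return hashlib.sha1(signature.encode()).hexdigest()[:8]

    # Two similar resources (cpu, memory, disk, bandwidth, normalised) usually share a key.
    r1 = (0.80, 0.50, 0.20, 0.90)
    r2 = (0.78, 0.52, 0.21, 0.88)
    print(lsh_signature(r1), overlay_key(lsh_signature(r1)))
    print(lsh_signature(r2), overlay_key(lsh_signature(r2)))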

19.
One of the challenges in the design of a distributed multimedia system is devising suitable specification models for the various schemas at different levels of the system. Another important research issue is the integration and synchronization of heterogeneous multimedia objects. In this paper, we present our models for multimedia schemas and transformation algorithms. They transform high-level multimedia objects into schemas that can be used to support the presentation and communication of the multimedia objects. A key module in the system is the Object Exchange Manager (OEM). In this paper, we present the design and implementation of the OEM module and discuss in detail the interaction between the OEM and other modules in a distributed multimedia system.

20.
The file allocation problem for distributed databases has been extensively studied in the literature; the objective is to minimize the total cost, consisting of storage, query, and update communication costs. Current modeling of update communication costs is simplistic and does not capture the working of most of the protocols that have been proposed. This paper shows that more accurate modeling of update costs can be achieved fairly easily without an undue increase in the complexity of the formulation. In particular, formulations for two classes of update protocols are shown. Existing heuristics can be used on these formulations to obtain good solutions.
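Read concretely, the objective is to choose, for every file, a set of sites that minimizes the sum of storage, query-communication, and update-communication costs. Below is a toy evaluator for one candidate allocation; the cost model (queries go to the cheapest replica, updates reach every replica) is a generic simplification, not either of the paper's two update-protocol formulations.

    def total_cost(allocation, storage_cost, query_rate, update_rate, comm_cost):
        """allocation[f] = set of sites holding file f.

        storage_cost[f][s]   cost of storing f at site s
        query_rate[f][o]     query rate for f originating at site o
        update_rate[f][o]    update rate for f originating at site o
        comm_cost[o][s]      communication cost between sites o and s"""
        cost = 0.0
        for f, sites in allocation.items():
            cost += sum(storage_cost[f][s] for s in sites)                  # storage
            for origin, rate in query_rate[f].items():                      # query traffic
                cost += rate * min(comm_cost[origin][s] for s in sites)
            for origin, rate in update_rate[f].items():                     # update traffic
                cost += rate * sum(comm_cost[origin][s] for s in sites)
        return cost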
