首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
Repositories of unstructured data types, such as free text, images, audio and video, have been recently emerging in various fields. A general searching approach for such data types is that of similarity search, where the search is for similar objects and similarity is modeled by a metric distance function. In this article we propose a new dynamic paged and balanced access method for similarity search in metric data sets, named CM-tree (Clustered Metric tree). It fully supports dynamic capabilities of insertions and deletions both of single objects and in bulk. Distinctive from other methods, it is especially designed to achieve a structure of tight and low overlapping clusters via its primary construction algorithms (instead of post-processing), yielding significantly improved performance. Several new methods are introduced to achieve this: a strategy for selecting representative objects of nodes, clustering based node split algorithm and criteria for triggering a node split, and an improved sub-tree pruning method used during search. To facilitate these methods the pairwise distances between the objects of a node are maintained within each node. Results from an extensive experimental study show that the CM-tree outperforms the M-tree and the Slim-tree, improving search performance by up to 312% for I/O costs and 303% for CPU costs.  相似文献   

2.
We propose a new dynamic index structure called the GC-tree (or the grid cell tree) for efficient similarity search in image databases. The GC-tree is based on a special subspace partitioning strategy which is optimized for a clustered high-dimensional image dataset. The basic ideas are threefold: 1) we adaptively partition the data space based on a density function that identifies dense and sparse regions in a data space; 2) we concentrate the partition on the dense regions, and the objects in the sparse regions of a certain partition level are treated as if they lie within a single region; and 3) we dynamically construct an index structure that corresponds to the space partition hierarchy. The resultant index structure adapts well to the strongly clustered distribution of high-dimensional image datasets. To demonstrate the practical effectiveness of the GC-tree, we experimentally compared the GC-tree with the IQ-tree, LPC-file, VA-file, and linear scan. The result of our experiments shows that the GC-tree outperforms all other methods.  相似文献   

3.
Similarity search (e.g., k-nearest neighbor search) in high-dimensional metric space is the key operation in many applications, such as multimedia databases, image retrieval and object recognition, among others. The high dimensionality and the huge size of the data set require an index structure to facilitate the search. State-of-the-art index structures are built by partitioning the data set based on distances to certain reference point(s). Using the index, search is confined to a small number of partitions. However, these methods either ignore the property of the data distribution (e.g., VP-tree and its variants) or produce non-disjoint partitions (e.g., M-tree and its variants, DBM-tree); these greatly affect the search efficiency. In this paper, we study the effectiveness of a new index structure, called Nested-Approximate-eQuivalence-class tree (NAQ-tree), which overcomes the above disadvantages. NAQ-tree is constructed by recursively dividing the data set into nested approximate equivalence classes. The conducted analysis and the reported comparative test results demonstrate the effectiveness of NAQ-tree in significantly improving the search efficiency.  相似文献   

4.
Shape similarity searching is a crucial task in image databases, particularly in the presence of errors induced by segmentation or scanning images. The resulting slight displacements or rotations have not been considered so far in the literature. We present a new similarity model that flexibly addresses this problem. By specifying neighborhood influence weights, the user may adapt the similarity distance functions to his or her own requirements or preferences. Technically, the new similarity model is based on quadratic forms for which we present a multi-step query processing architecture, particularly for high dimensions as they occur in image databases. Our algorithm to reduce the dimensionality of quadratic form-based similarity queries results in a lower-bounding distance function that is proven to provide an optimal filter selectivity. Experiments on our test database of 10,000 images demonstrate the applicability and the performance of our approach, even in dimensions as high as 1,024  相似文献   

5.
Typically searching image collections is based on features of the images. In most cases the features are based on the color histogram of the images. Similarity search based on color histograms is very efficient, but the quality of the search results is often rather poor. One of the reasons is that histogram-based systems only support a specific form of global similarity using the whole histogram as one vector. But there is more information in a histogram than the distribution of colors. This paper has two contributions: (1) a new generalized similarity search method based on a wavelet transformation of the color histograms and (2) a new effectiveness measure for image similarity search. Our generalized similarity search method has been developed to allow the user to search for images with similarities on arbitrary detail levels of the color histogram. We show that our new approach is more general and more effective than previous approaches while retaining a competitive performance.  相似文献   

6.
Bulk construction of dynamic clustered metric trees   总被引:2,自引:2,他引:0  
Repositories of complex data types, such as images, audio, video and free text, are becoming increasingly frequent in various fields. A general searching approach for such data types is that of similarity search, where the search is for similar objects and similarity is modeled by a metric distance function. An important class of access methods for similarity search in metric data is that of dynamic clustered metric trees, where the index is structured as a paged and balanced tree and the space is partitioned hierarchically into compact regions. While access methods of this class allow dynamic insertions typically of single objects, the problem of efficiently inserting a given data set into the index in bulk is largely open. In this article we address this problem and propose novel algorithms corresponding to its two cases, where the index is initially empty (i.e. bulk loading), and where the index is initially non empty (i.e. bulk insertion). The proposed bulk loading algorithm builds the index bottom-up layer by layer, using a new sampling based clustering method, which improves clustering results by improving the quality of the selected sample sets. The proposed bulk insertion algorithm employs the bulk loading algorithm to load the given data into a new index structure, and then merges the new and the existing structures into a unified high quality index, using a novel decomposition method to reduce overlaps between the structures. Both algorithms yield significantly improved construction and search performance, and are applicable to all dynamic clustered metric trees. Results from an extensive experimental study show that the proposed algorithms outperform alternative methods, reducing construction costs by up to 47% for CPU costs and 99% for I/O costs, and search costs by up to 48% for CPU costs and 30% for I/O costs.  相似文献   

7.
Data Mining and Knowledge Discovery - The graph edit distance is an intuitive measure to quantify the dissimilarity of graphs, but its computation is $$mathsf {NP}$$ -hard and challenging in...  相似文献   

8.
9.
The performance of several discrepancy measures for the comparison of edge images is analyzed and a novel similarity metric aimed at overcoming their problems is proposed. The algorithm finds an optimal matching of the pixels between the images and estimates the error produced by this matching. The resulting Pixel Correspondence Metric (PCM) can take into account edge strength as well as the displacement of edge pixel positions in the estimation of similarity. A series of experimental tests shows the new metric to be a robust and effective tool in the comparison of edge images when a small localization error of the detected edges is allowed.  相似文献   

10.
《Information Systems》2004,29(5):405-420
This paper discusses the effective processing of similarity search that supports time warping in large sequence databases. Time warping enables sequences with similar patterns to be found even when they are of different lengths. Prior methods for processing similarity search that supports time warping failed to employ multi-dimensional indexes without false dismissal since the time warping distance does not satisfy the triangular inequality. They have to scan the entire database, thus suffering from serious performance degradation in large databases. Another method that hires the suffix tree, which does not assume any distance function, also shows poor performance due to the large tree size.In this paper, we propose a novel method for similarity search that supports time warping. Our primary goal is to enhance the search performance in large databases without permitting any false dismissal. To attain this goal, we have devised a new distance function, Dtwlb, which consistently underestimates the time warping distance and satisfies the triangular inequality. Dtwlb uses a 4-tuple feature vector that is extracted from each sequence and is invariant to time warping. For the efficient processing of similarity search, we employ a multi-dimensional index that uses the 4-tuple feature vector as indexing attributes, and Dtwlb as a distance function. We prove that our method does not incur false dismissal. To verify the superiority of our method, we have performed extensive experiments. The results reveal that our method achieves a significant improvement in speed up to 43 times faster with a data set containing real-world S&P 500 stock data sequences, and up to 720 times with data sets containing a very large volume of synthetic data sequences. The performance gain increases: (1) as the number of data sequences increases, (2) the average length of data sequences increases, and (3) as the tolerance in a query decreases. Considering the characteristics of real databases, these tendencies imply that our approach is suitable for practical applications.  相似文献   

11.
陈湘涛  丁平尖  王晶 《计算机应用》2014,34(9):2604-2607
现有的相似性搜索算法通常没有考虑时间因素,为此,提出一种异构信息网中基于元路径的动态相似性搜索算法PDSim。PDSim算法首先计算给定元路径下实体的链接矩阵,得到实体之间的元路径实例数比值,同时基于建立时间的不同,计算其时间差异度;在此基础上针对给定的元路径,获得异构信息网中动态相似性的度量。在多个相似性搜索实例中,PDSim能够捕获到实体随时间变化而产生的兴趣的变化;应用于聚类时,相对于PathSim和PCRW方法,其标准互信息聚类精度可以提高0.17%~9.24%。实验结果表明,PDSim方法与传统的基于链接的相似性搜索算法相比,显著提高了异构信息网中动态相似性搜索的效率和用户满意度,是一种研究实体随时间而发生动态变化的相似性搜索方法。  相似文献   

12.
Improving the recall of information retrieval systems for similarity search in time series databases is of great practical importance. In the manufacturing domain, these systems are used to query large databases of manufacturing process data that contain terabytes of time series data from millions of parts. This allows domain experts to identify parts that exhibit specific process faults. In practice, the search often amounts to an iterative query–response cycle in which users define new queries (time series patterns) based on results of previous queries. This is a well-documented phenomenon in information retrieval and not unique to the manufacturing domain. Indexing manufacturing databases to speed up the exploratory search is often not feasible as it may result in an unacceptable reduction in recall. In this paper, we present a novel adaptive search algorithm that refines the query based on relevance feedback provided by the user. Additionally, we propose a mechanism that allows the algorithm to self-adapt to new patterns without requiring any user input. As the search progresses, the algorithm constructs a library of time series patterns that are used to accurately find objects of the target class. Experimental validation of the algorithm on real-world manufacturing data shows, that the recall for the retrieval of fault patterns is considerably higher than that of other state-of-the-art adaptive search algorithms. Additionally, its application to publicly available benchmark data sets shows, that these results are transferable to other domains.  相似文献   

13.
A visual search is required when applying a recognition process on a scene containing multiple objects. In such cases, we would like to avoid an exhaustive sequential search. This work proposes a dynamic visual search framework based mainly on inner-scene similarity. Given a number of candidates (e.g., subimages), we hypothesize is that more visually similar candidates are more likely to have the same identity. We use this assumption for determining the order of attention. Both deterministic and stochastic approaches, relying on this hypothesis, are considered. Under the deterministic approach, we suggest a measure similar to Kolmogorov's epsilon-covering that quantifies the difficulty of a search task. We show that this measure bounds the performance of all search algorithms and suggest a simple algorithm that meets this bound. Under the stochastic approach, we model the identity of the candidates as a set of correlated random variables and derive a search procedure based on linear estimation. Several experiments are presented in which the statistical characteristics, search algorithm, and bound are evaluated and verified.  相似文献   

14.
《Information Fusion》2008,9(2):156-160
A novel objective quality metric for image fusion is presented. The interest of our metric lies in the fact that the redundant regions and the complementary/conflicting regions are treated respectively according to the structural similarity between the source images. The experiments show that the proposed measure is consistent with human visual evaluations and can be applied to evaluate image fusion schemes that are not performed at the same level.  相似文献   

15.
Communities are basic components in networks. As a promising social application, community recommendation selects a few items (e.g., movies and books) to recommend to a group of users. It usually achieves higher recommendation precision if the users share more interests; whereas, in plenty of communities (e.g., families, work groups), the users often share few. With billions of communities in online social networks, quickly selecting the communities where the members are similar in interests is a prerequisite for community recommendation. To this end, we propose an easy-to-compute metric, Community Similarity Degree (CSD), to estimate the degree of interest similarity among multiple users in a community. Based on 3460 emulated Facebook communities, we conduct extensive empirical studies to reveal the characteristics of CSD and validate the effectiveness of CSD. In particular, we demonstrate that selecting communities with larger CSD can achieve higher recommendation precision. In addition, we verify the computation efficiency of CSD: it costs less than 1 hour to calculate CSD for over 1 million of communities. Finally, we draw insights about feasible extensions to the definition of CSD, and point out the practical uses of CSD in a variety of applications other than community recommendation.  相似文献   

16.
WALRUS: a similarity retrieval algorithm for image databases   总被引:2,自引:0,他引:2  
Approaches for content-based image querying typically extract a single signature from each image based on color, texture, or shape features. The images returned as the query result are then the ones whose signatures are closest to the signature of the query image. While efficient for simple images, such methods do not work well for complex scenes since they fail to retrieve images that match the query only partially, that is, only certain regions of the image match. This inefficiency leads to the discarding of images that may be semantically very similar to the query image since they may contain the same objects. The problem becomes even more apparent when we consider scaled or translated versions of the similar objects. We propose WALRUS (wavelet-based retrieval of user-specified scenes), a novel similarity retrieval algorithm that is robust to scaling and translation of objects within an image. WALRUS employs a novel similarity model in which each image is first decomposed into its regions and the similarity measure between a pair of images is then defined to be the fraction of the area of the two images covered by matching regions from the images. In order to extract regions for an image, WALRUS considers sliding windows of varying sizes and then clusters them based on the proximity of their signatures. An efficient dynamic programming algorithm is used to compute wavelet-based signatures for the sliding windows. Experimental results on real-life data sets corroborate the effectiveness of WALRUS'S similarity model.  相似文献   

17.
Learning distance metrics for measuring the similarity between two data points in unsupervised and supervised pattern recognition has been widely studied in unconstrained face verification tasks. Motivated by the fact that enforcing single distance metric learning for verification via an empirical score threshold is not robust in uncontrolled experimental conditions, we therefore propose to obtain a metric swarm by learning local patches alike sub-metrics simultaneously that naturally formulates a generalized metric swarm learning (GMSL) model with a joint similarity score function solved by an efficient alternative optimization algorithm. Further, each sample pair is represented as a similarity vector via the well-learned metric swarm, such that the face verification task becomes a generalized SVM-alike classification problem. Therefore, the verification can be enforced in the represented metric swarm space that can well improve the robustness of verification under irregular data structure. Experiments are preliminarily conducted using several UCI benchmark datasets for solving general classification problem. Further, the face verification experiments on real-world LFW and PubFig datasets demonstrate that our proposed model outperforms several state-of-the-art metric learning methods.  相似文献   

18.
Similarity matching in video databases is of growing importance in many new applications such as video clustering and digital video libraries. In order to provide efficient access to relevant data in large databases, there have been many research efforts in video indexing with diverse spatial and temporal features. However, most of the previous works relied on sequential matching methods or memory-based inverted file techniques, thus making them unsuitable for a large volume of video databases. In order to resolve this problem, this paper proposes an effective and scalable indexing technique using a trie, originally proposed for string matching, as an index structure. For building an index, we convert each frame into a symbol sequence using a window order heuristic and build a disk-resident trie from a set of symbol sequences. For query processing, we perform a depth-first traversal on the trie and execute a temporal segmentation. To verify the superiority of our approach, we perform several experiments with real and synthetic data sets. The results reveal that our approach consistently outperforms the sequential scan method, and the performance gain is maintained even with a large volume of video databases.  相似文献   

19.
Virtual images for similarity retrieval in image databases   总被引:1,自引:0,他引:1  
We introduce the virtual image, an iconic index suited for pictorial information access in a pictorial database, and a similarity retrieval approach based on virtual images to perform content-based retrieval. A virtual image represents the spatial information contained in a real image in explicit form by means of a set of spatial relations. This is useful to efficiently compute the similarity between a query and an image in the database. We also show that virtual images support real-world applications that require translation, reflection, and/or rotation invariance of image representation  相似文献   

20.
Large-scale similarity search engines are complex systems devised to process unstructured data like images and videos. These systems are deployed on clusters of distributed processors communicated through high-speed networks. To process a new query, a distance function is evaluated between the query and the objects stored in the database. This process relays on a metric space index distributed among the processors. In this paper, we propose a cache-based strategy devised to reduce the number of computations required to retrieve the top-k object results for user queries by using pre-computed information. Our proposal executes an approximate similarity search algorithm, which takes advantage of the links between objects stored in the cache memory. Those links form a graph of similarity among pre-computed queries. Compared to the previous methods in the literature, the proposed approach reduces the number of distance evaluations up to 60%.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号