Similar literature
Found 20 similar documents; search took 468 ms
1.
Similarity query processing is becoming increasingly important in many applications such as data cleaning, record linkage, Web search, and document analytics. In this paper we study how to provide end-to-end similarity query support natively in a parallel database system. We discuss how to express a similarity predicate in its query language, how to build indexes, how to answer similarity queries (selections and joins) efficiently in the runtime engine, possibly using indexes, and how to optimize similarity queries. One particular challenge is how to incorporate existing similarity join algorithms, which often require a series of steps to achieve high efficiency, including collecting token frequencies, finding matching record id pairs, and reassembling result records based on id pairs. We present a novel approach that uses existing runtime operators to implement such complex join algorithms without reinventing the wheel; doing so positions the system to automatically benefit from future improvements to those operators. The approach includes a technique to transform a similarity join plan into an efficient operator-based physical plan during query optimization by using a template expressed largely in the system’s user-level query language; this technique greatly simplifies the specification of such a transformation rule. We use Apache AsterixDB, a parallel Big Data management system, to illustrate and validate our techniques. We conduct an experimental study using several large, real datasets on a parallel computing cluster to assess the similarity query support. We also include experiments involving three other parallel systems and report the efficacy and performance results.
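
The three steps named in the abstract above (collecting token frequencies, finding matching record id pairs, reassembling result records) can be illustrated with a small single-machine sketch. This is not the AsterixDB implementation: it assumes whitespace tokenization, Jaccard similarity with a threshold greater than 0, and a standard prefix filter for candidate generation, and the names `jaccard` and `similarity_join` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def similarity_join(records, threshold):
    """Self-join `records` (dict: id -> text) on Jaccard >= threshold (threshold > 0)."""
    tokens = {rid: set(text.lower().split()) for rid, text in records.items()}

    # Step 1: collect global token frequencies.
    freq = Counter(tok for toks in tokens.values() for tok in toks)

    # Step 2: find candidate id pairs. Each record's tokens are ordered
    # rarest-first and only a short prefix is indexed (prefix filtering).
    index = defaultdict(list)
    candidates = set()
    for rid, toks in tokens.items():
        ordered = sorted(toks, key=lambda t: (freq[t], t))
        # The small epsilon guards against floating-point round-up in ceil().
        prefix_len = len(ordered) - math.ceil(threshold * len(ordered) - 1e-9) + 1
        for tok in ordered[:prefix_len]:
            for other in index[tok]:
                candidates.add((other, rid))
            index[tok].append(rid)

    # Step 3: verify each candidate pair and reassemble full result records.
    results = []
    for a, b in sorted(candidates):
        sim = jaccard(tokens[a], tokens[b])
        if sim >= threshold:
            results.append((records[a], records[b], sim))
    return results


docs = {
    1: "big data management system",
    2: "parallel big data management system",
    3: "image retrieval by spatial relations",
}
matches = similarity_join(docs, 0.7)
```

Ordering tokens rarest-first keeps the inverted lists short, which is why the frequency pass comes before candidate generation in this sketch.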

2.
Do similarity or distance measures ever go wrong? The inherent subjectivity in similarity discernment has long supported the view that all judgements of similarity are equally valid, and that any selected similarity measure may only be considered more effective in some chosen domain. This article presents evidence that such a view is incorrect for the specific case of relative structural similarity. In this context, similarity and distance measures occasionally do go wrong, producing judgements that can be considered as errors in judgement. This claim is supported by a novel method for assessing the quality of structural similarity and distance functions, which is based on a relative scale of similarity with respect to chosen reference objects. The method may be applied either with synthetic graph datasets or with graphs representing objects in an application domain of interest. This work demonstrates the method over synthetic datasets with common measures of structural similarity in graphs. Finally, the article identifies three distinct kinds of relative similarity judgement errors, and shows how the distribution of these errors is related to graph properties under common similarity measures.

3.
Recently, permutation-based indexes have attracted interest in the area of similarity search. The basic idea of permutation-based indexes is that data objects are represented as appropriately generated permutations of a set of pivots (or reference objects). Similarity queries are executed by searching for data objects whose permutation representation is similar to that of the query, following the assumption that similar objects are represented by similar permutations of the pivots. In the context of permutation-based indexing, most authors propose to select pivots randomly from the data set, given that traditional pivot selection techniques do not reveal better performance. However, to the best of our knowledge, no rigorous comparison has been performed yet. In this paper we compare five pivot selection techniques on three permutation-based similarity access methods. Among those, we propose a novel technique specifically designed for permutations. Two significant observations emerge from our tests. First, random selection is always outperformed by at least one of the tested techniques. Second, there is no technique that is universally the best for all permutation-based access methods; rather, different techniques are optimal for different methods. This indicates that the pivot selection technique should be considered an integral and relevant part of any permutation-based access method.
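
The basic idea described above, representing each object by the order in which it sees the pivots and comparing those orders, can be sketched as follows. This is a minimal illustration under assumptions not stated in the abstract: Euclidean points, the Spearman footrule as the permutation distance, and a filter-then-refine search. The function names are hypothetical, and no particular pivot selection technique from the paper is implemented.

```python
import math


def pivot_permutation(obj, pivots, dist):
    """Pivot indexes ordered from nearest to farthest from obj (ties by index)."""
    return sorted(range(len(pivots)), key=lambda i: (dist(obj, pivots[i]), i))


def spearman_footrule(perm_a, perm_b):
    """Sum of absolute differences between each pivot's rank in the two permutations."""
    rank_in_b = {pivot: rank for rank, pivot in enumerate(perm_b)}
    return sum(abs(rank - rank_in_b[pivot]) for rank, pivot in enumerate(perm_a))


def permutation_search(query, data, pivots, dist, n_candidates, k):
    """Approximate k-NN: filter by permutation similarity, refine by true distance."""
    # In a real index these permutations would be precomputed and stored.
    perms = [pivot_permutation(obj, pivots, dist) for obj in data]
    q_perm = pivot_permutation(query, pivots, dist)
    candidates = sorted(
        range(len(data)), key=lambda i: spearman_footrule(q_perm, perms[i])
    )[:n_candidates]
    return sorted(candidates, key=lambda i: dist(query, data[i]))[:k]


pivots = [(0, 0), (10, 0), (0, 10)]
data = [(1, 1), (9, 1), (1, 9), (2, 1)]
nearest = permutation_search((1.8, 1), data, pivots, math.dist, n_candidates=2, k=1)
```

Only the `n_candidates` objects with the most similar permutations are compared with the query by true distance, so the result is approximate by design.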

4.
The application of document clustering to information retrieval has been motivated by the potential effectiveness gains postulated by the cluster hypothesis. The hypothesis states that relevant documents tend to be highly similar to each other and therefore tend to appear in the same clusters. In this paper we propose an axiomatic view of the hypothesis by suggesting that documents relevant to the same query (co-relevant documents) display an inherent similarity to each other that is dictated by the query itself. Because of this inherent similarity, the cluster hypothesis should be valid for any document collection. Our research describes an attempt to devise means by which this similarity can be detected. We propose the use of query-sensitive similarity measures that bias interdocument relationships toward pairs of documents that jointly possess attributes expressed in a query. We experimentally tested three query-sensitive measures against conventional ones that do not take the query into account, and we also examined the comparative effectiveness of the three query-sensitive measures. We calculated interdocument relationships for varying numbers of top-ranked documents for six document collections. Our results show a consistent and significant increase in the number of relevant documents that become nearest neighbors of any given relevant document when query-sensitive measures are used. These results suggest that the effectiveness of a cluster-based information retrieval system has the potential to increase through the use of query-sensitive similarity measures.

5.
Image retrieval from an image database by the image objects and their spatial relationships has emerged as an important research subject in recent decades. To retrieve images similar to a given query image, retrieval methods must assess the similarity degree between a database image and the query image from the extracted features with acceptable efficiency and effectiveness. This paper proposes a graph-based model SRG (spatial relation graph) to represent the semantic information of the contained objects and their spatial relationships in an image with no file annotation. In an SRG graph, the image objects are symbolized by the predefined class names as vertices and the spatial relations between object pairs are represented as arcs. The proposed model assesses the similarity degree between two images by calculating the maximum common subgraph of the two corresponding SRGs through intersection, which has quadratic time complexity owing to the characteristics of SRG. Its efficiency remains quadratic regardless of the duplication rate of the object symbols. The extended model SRGT is also proposed, with the same time complexity, for applications that need to consider the topological relations among objects. A synthetic symbolic image database and an existing image dataset are used in the conducted experiments to verify the performance of the proposed models. The experimental results show that the proposed models have comparable retrieval quality with remarkable efficiency improvements compared with three well-known methods, LCS_Clique, SIMR, and 2D Be-string, where LCS_Clique utilizes the number of objects in the maximum common subimage as its similarity function, SIMR uses an accumulation-based similarity function of similar object pairs, and 2D Be-string calculates the similarity of 2D patterns by the linear combination of two 1D similarities.

6.
7.
In several applications, data objects move on pre-defined spatial networks such as road segments, railways, and invisible air routes. Many of these objects exhibit similarity with respect to their traversed paths, and therefore two objects can be correlated based on their motion similarity. Useful information can be retrieved from these correlations and this knowledge can be used to define similarity classes. In this paper, we study similarity search for moving object trajectories in spatial networks. The problem poses some important challenges, since it is quite different from the case where objects are allowed to move freely in any direction without motion restrictions. New similarity measures should be employed to express similarity between two trajectories that do not necessarily share any common sub-path. We define new similarity measures based on spatial and temporal characteristics of trajectories, such that the notion of similarity in space and time is well expressed, and moreover they satisfy the metric properties. In addition, we demonstrate that similarity range queries in trajectories are efficiently supported by utilizing metric-based access methods, such as M-trees.

8.
Optimizing top-k selection queries over multimedia repositories   Cited by: 2 (self-citations: 0; citations by others: 2)
Repositories of multimedia objects having multiple types of attributes (e.g., image, text) are becoming increasingly common. A query on these attributes will typically request not just a set of objects, as in the traditional relational query model (filtering), but also a grade of match associated with each object, which indicates how well the object matches the selection condition (ranking). Furthermore, unlike in the relational model, users may just want the k top-ranked objects for their selection queries for a relatively small k. In addition to the differences in the query model, another peculiarity of multimedia repositories is that they may allow access to the attributes of each object only through indexes. We investigate how to optimize the processing of top-k selection queries over multimedia repositories. The access characteristics of the repositories and the above query model lead to novel issues in query optimization. In particular, the choice of the indexes used to search the repository strongly influences the cost of processing the filtering condition. We define an execution space that is search-minimal, i.e., the set of indexes searched is minimal. Although the general problem of picking an optimal plan in the search-minimal execution space is NP-hard, we present an efficient algorithm that solves the problem optimally with respect to our cost model and execution space when the predicates in the query are independent. We also show that the problem of optimizing top-k selection queries can be viewed, in many cases, as that of evaluating more traditional selection conditions. Thus, both problems can be viewed together as an extended filtering problem to which techniques of query processing and optimization may be adapted.

9.
Databases are becoming increasingly important for storing complex objects from scientific, engineering, or multimedia applications. Examples of such data are chemical compounds, CAD drawings, or XML data. The efficient search for similar objects in such databases is a key feature. However, the general problem of many similarity measures for complex objects is their computational complexity, which makes them unusable for large databases. In this paper, we combine and extend the two techniques of metric index structures and multi-step query processing to improve the performance of range query processing. The efficiency of our methods is demonstrated in extensive experiments on real-world data including graphs, trees, and vector sets.

10.
In this paper, we present a novel image representation method to capture the information about spatial relationships between objects in a picture. Our method is more powerful than all previous methods in terms of accuracy, flexibility, and capability of discriminating pictures. In addition, our method also provides different degrees of granularity for reasoning about directional relations in both 8- and 16-direction reference frames. In similarity retrieval, our system provides twelve types of similarity measures to support flexible matching between the query picture and the database pictures. By exercising a database containing 3600 pictures, we successfully demonstrate the effectiveness of our image retrieval system. Experimental results show that a 97.8% precision rate can be achieved while maintaining a 62.5% recall rate, and a 97.9% recall rate can be achieved while maintaining a 51.7% precision rate. On average, an 86.1% precision rate and an 81.2% recall rate can be achieved simultaneously if the threshold is set to 0.5 or 0.6. This performance is considered to be very good for an information retrieval system.

11.
The similarity join has become an important database primitive for supporting similarity searches and data mining. A similarity join combines two sets of complex objects such that the result contains all pairs of similar objects. Two types of similarity join are well known: the distance range join, in which the user defines a distance threshold for the join, and the closest pair query or k-distance join, which retrieves the k most similar pairs. In this paper, we propose an important third similarity join operation called the k-nearest neighbour join, which combines each point of one point set with its k nearest neighbours in the other set. We discover that many standard algorithms of Knowledge Discovery in Databases (KDD) such as k-means and k-medoid clustering, nearest neighbour classification, data cleansing, postprocessing of sampling-based data mining, etc. can be implemented on top of the k-nn join operation to achieve performance improvements without affecting the quality of the result of these algorithms. We propose a new algorithm to compute the k-nearest neighbour join using the multipage index (MuX), a specialised index structure for the similarity join. To reduce both CPU and I/O costs, we develop optimal loading and processing strategies.
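
The semantics of the k-nearest neighbour join defined above can be shown with a brute-force sketch. It assumes Euclidean points and does not use the multipage index (MuX) or the loading strategies from the paper; the function name `knn_join` is hypothetical.

```python
import heapq
import math


def knn_join(r_points, s_points, k):
    """For each point in r_points, return the indexes of its k nearest points in s_points."""
    result = {}
    for i, r in enumerate(r_points):
        # Ties in distance are broken by the smaller index in s_points.
        result[i] = heapq.nsmallest(
            k,
            range(len(s_points)),
            key=lambda j: (math.dist(r, s_points[j]), j),
        )
    return result


r_set = [(0, 0), (5, 5)]
s_set = [(1, 0), (0, 2), (5, 6), (9, 9)]
joined = knn_join(r_set, s_set, k=2)
```

Unlike the distance range join, the result size here is fixed at k pairs per point of the first set, regardless of how far away the neighbours are.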

12.
13.
Proximity searching is the problem of retrieving, from a given database, those objects closest to a query. To avoid exhaustive searching, data structures called indexes are built on the database prior to serving queries. The curse of dimensionality is a well-known problem for indexes: in spaces with sufficiently concentrated distance histograms, no index outperforms an exhaustive scan of the database. In recent years, a number of indexes for approximate proximity searching have been proposed. These are able to cope with the curse of dimensionality in exchange for returning an answer that might be slightly different from the correct one. In this paper we show that many of those recent indexes can be understood as variants of a simple general model based on K-nearest reference signatures. A set of references is chosen from the database, and the signature of each object consists of the K references nearest to the object. At query time, the signature of the query is computed and the search examines only the objects whose signature is close enough to that of the query. Many known and novel indexes are obtained by considering different ways to determine how much detail the signature records (e.g., just the set of nearest references, or also their proximity order to the object, or also their distances to the object, and so on), how the similarity between signatures is defined, and how the parameters are tuned. In addition, we introduce a space-efficient representation for those families of indexes, making it possible to search very large databases in main memory. Small indexes are cache friendly, inducing faster queries. We perform exhaustive experiments comparing several known and new indexes that derive from our framework, evaluating their time performance, memory usage, and quality of approximation. The best indexes outperform the state of the art, offering an attractive balance between all these aspects, and turn out to be excellent choices in many scenarios. Our framework gives high flexibility to design new indexes.

14.
章旭, 石进, 谢立. 《计算机科学》 (Computer Science), 2008, 35(9): 201-202
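The traditional fuzzy set model performs fuzzy retrieval using a term-term correlation matrix, which considers only the co-occurrence of terms within documents. This paper proposes a fuzzy set model based on a similarity thesaurus that takes into account the similarity between terms and the query and incorporates query expansion into the model, thereby improving retrieval performance to some extent.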

15.
An efficient spatial keyword query strategy based on HBase   Cited by: 2 (self-citations: 0; citations by others: 2)
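With the development of mobile positioning technology and the spread of smartphones, the number of spatio-textual objects on the Internet is growing rapidly. How to run efficient spatial keyword queries over a huge and dynamically growing set of spatio-textual objects has become a concern for many spatial keyword query applications. Existing methods usually process spatial keyword queries with hybrid index structures that combine R-trees and inverted indexes; however, faced with massive and ever-growing numbers of spatio-textual objects, these methods often struggle to support efficient and scalable spatial keyword queries. To address this, we propose SK-HBase, an HBase-based index structure for spatio-textual data. SK-HBase uses HBase as its data store and indexes both the spatial and the textual information of spatio-textual objects through an effective data allocation strategy. On top of SK-HBase, we propose two spatial keyword query algorithms to ensure that spatial keyword queries over different spatial ranges are efficient and scalable. Experiments show that our method performs spatial keyword queries efficiently over massive data and scales well.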

16.
In this paper, we propose a rotation-invariant spatial knowledge representation called RS-string. Then we present the string generation algorithm to automatically generate RS-strings for segmented pictures. We also propose the spatial reasoning and similarity retrieval algorithms based on RS-strings. The similarity retrieval algorithm is much more flexible than all previous 2D string representations because our approach can consider every possible view of a query picture. Thus the system does not require the user to provide a query picture which must have the same orientation as that of a database picture. Finally, we provide several examples to demonstrate the capabilities of spatial reasoning and similarity retrieval based on the RS-string representation.

17.
This paper describes a color-texture-based image retrieval system for query of an image database to find similar images to a target image. The color-texture information is obtained via modeling with the multispectral simultaneous autoregressive (MSAR) random field model. The general color content characterized by ratios of sample color means is also used. The retrieval process involves segmenting the image into regions of uniform color texture using an unsupervised histogram clustering approach that utilizes the combination of MSAR and color features. The color-texture content, location, area and shape of the segmented regions are used to develop similarity measures describing the closeness of a query image to database images. These attributes are derived from the maximum fitting square and best fitting ellipse to each of the segmented regions. The proposed similarity measure combines all these attributes to rank the closeness of the images. The performance of the system is tested on two databases containing synthetic mosaics of natural textures and natural scenes, respectively.

18.
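Subsequence matching is a classic topic in time series mining that aims to find similar data sequences in large datasets. Much of the literature focuses on queries over sequences of a fixed time span, but the problem of querying over multiple different time spans has still not been solved well. A time-span-based query is a query restricted by a time window. To support queries over multiple time spans, simply building an index over the subsequences of each time span costs both time and storage. Judging from the current literature, existing indexes cannot satisfy ...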

19.
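Shape-based spatial queries are closely related to the spatial reasoning of the cognitive subject. From the perspective of spatial cognition, the desired query result is usually a set of objects with similar shape structures. Focusing on the uncertain representation of shape and on fuzzy queries, this paper proposes a spatial query method for 2D object shape recognition, the centripetal envelope algorithm. The algorithm partitions an object into a set of triangles that share the center of the maximum inner diameter as a common point, and on this basis establishes a corresponding shape measure. It obtains the shape similarity between objects by extracting the shape influence factor of every vertex with respect to the overall structure of the object, and establishes a matching relationship with fuzzy shape predicates. Experiments show that the method supports fuzzy spatial queries over 2D objects and that the query results are largely consistent with the fuzzy shape predicates.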

20.
Similarity Joins are extensively used in multiple application domains and are recognized among the most useful data processing and analysis operations. They retrieve all data pairs whose distances are smaller than a predefined threshold ε. While several standalone implementations have been proposed, very little work has addressed the implementation of Similarity Joins as physical database operators. In this paper, we focus on the study, design, implementation, and optimization of a Similarity Join database operator for metric spaces. We present DBSimJoin, a physical database operator that integrates techniques to: enable a non-blocking behavior, prioritize the early generation of results, and fully support the database iterator interface. The proposed operator can be used with multiple distance functions and data types. We describe the changes in each query engine module to implement DBSimJoin and provide details of our implementation in PostgreSQL. We also study ways in which DBSimJoin can be combined with other similarity and non-similarity operators to answer more complex queries, and how DBSimJoin can be used in query transformation rules to improve query performance. The extensive performance evaluation shows that DBSimJoin significantly outperforms alternative approaches and scales very well when important parameters like ε, data size, and number of dimensions increase.
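
The join condition stated above (all pairs whose distance is smaller than ε) can be illustrated with a lazy, brute-force self-join. This sketch is not DBSimJoin and makes no claim about its partitioning or its PostgreSQL integration; it only assumes a pluggable distance function, with Euclidean distance as the default, and uses a Python generator to hint at an iterator-style interface that produces results one pair at a time. The name `similarity_join_iter` is hypothetical.

```python
import math
from itertools import combinations


def similarity_join_iter(items, eps, dist=math.dist):
    """Lazily yield (i, j, d) for every pair of items whose distance d is below eps."""
    for (i, a), (j, b) in combinations(enumerate(items), 2):
        d = dist(a, b)
        if d < eps:  # strict comparison, matching "smaller than" in the abstract
            yield i, j, d


points = [(0, 0), (0, 1), (5, 5)]
pairs = list(similarity_join_iter(points, eps=1.5))
```

Because the function is a generator, a consumer can stop after the first few pairs without the full quadratic scan having been completed.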
