首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 250 毫秒
1.
汤春蕾  董家麒 《计算机学报》2012,35(11):2228-2236
子序列的相似性查询是时间序列数据集中的一种重要操作,包括范围查询和k近邻查询.现有的大多算法是基于欧几里德距离或者DTW距离的,缺点在于查询效率低下.文中提出了一种新的基于LSH的距离度量方法,可以在保证查询结果质量的前提下,极大提高相似性查询的效率;在此基础上,给出一种DS-Index索引结构,利用距离下界进行剪枝,进而还提出了两种优化的OLSH-Range和OLSH-kNN算法.实验是在真实的股票序列集上进行的,数据结果表明算法能快速精确地找出相似性查询结果.  相似文献   

2.
相似性搜索是从数据库中检索出同给定数据对象相似的数据对象,已有的基于R-tree的相似性搜索,当搜索空间的维的个数较小时效率较高,但当搜索空间的维的个数较大时则效率很低.针对此问题,提出了新的度量空间分割方法和索引结构pgh-tree,利用数据对象与很少几个固定参考对象的距离之差进行数据分割和索引,产生一个平衡的索引树.在此基础上,提出了新的算法,利用查询数据对象与固定参考对象的距离之差过滤掉大部分的不相关数据,具有较小的I/O代价和距离计算复杂性,平均复杂性为θ(n^0.58),是目前复杂性最小的相似性搜索算法.另外还讨论了基于pgh-tree的最近相邻点搜索策略.  相似文献   

3.
图相似性搜索是在给定的度量标准下查找与查询图相似的图集合,目前大多采用“过滤-验证”的计算框架。针对现有方法中过滤下界不紧密和索引空间占用较大等问题,提出了一种基于查询图分区的多层级过滤、低索引空间占用的图相似性搜索算法Z-Index。该算法首先通过全局粗粒度过滤得到预候选集;然后提出基于扩展概率的查询图分区算法,并采用层级过滤机制进一步精简候选集,增强下界紧密性;最后引入序列相似性差值计算序列中数据分布的稀疏度,提出分区压缩和差值压缩两种编码压缩算法,并据此构建“零”索引结构,降低索引空间开销。实验结果表明,Z-Index算法所得下界更加紧密,产生的候选集大小可减少50%左右,算法执行时间大大缩短,且该算法在索引空间占用极小的情况下仍具有可扩展性。  相似文献   

4.
基于Buddy*-Hash的移动对象时空查询方法   总被引:1,自引:0,他引:1       下载免费PDF全文
索引技术可以提高数据检索和查询效率,为了实现对时空数据库中移动对象的查询操作,需要引入时空索引技术。在传统Buddy-树的基础上提出Buddy*-Hash索引结构,根据扩展查询窗口策略给出范围查询算法。实验结果表明,基于BH索引结构的范围查询算法具有良好性能。  相似文献   

5.
遥感高光谱数据是一种具有空间聚集特性的高维数据。对PT方法进行改进使之与iDistance的索引机制相适应,并融合这两种不同的空间划分策略,提出一种适用于高光谱数据的索引结构。该索引是一种度量空间的高维索引,采用两级空间划分,在处理光谱相似性查询时可同时完成针对距离和空间方位的数据过滤。实验证明该索引可以有效降低I/O和距离计算次数,具有较高的剪枝效率,适用于高光谱数据相似性查询。  相似文献   

6.
随着基因测序技术和人类基因组计划的发展,从大量的生物数据中寻找相似的序列就越来越成为当前研究的热点问题.本文提出了一种聚类的多解析度字符串索引结构,用于解决生物序列的相似性查询问题.首先,以较小容量的MBR(最小绑定矩形)构造基因序列的多解析度字符串索引结构,然后通过对MBR的聚类以夏保序技术的应用,减小索引中MBR的平均体积,从而增加了查询向量到索引的空间距离,提高了索引的过滤能力.还给出了一种新的后处理方法,通过大量的减少编辑距离的计算,提高索引的性能.文中给出了该索引结构并详细介绍了索引的相关算法.实验表明,该索引结构是一种有效的处理生物数据的相似性查询的索引结构.  相似文献   

7.
空间近似关键字查询包含一个空间条件和一组关键字相似性条件,这种查询在空间数据库中返回同时满足以下条件的对象:1)对象的位置信息满足查询中的空间条件;2)对于查询中的任何一个关键字,对象中至少包含一个关键字与其相似度大于给定阈值.随着当前数据的爆炸性增长,空间数据库无法完整地存放在内存中,因此空间数据库需要支持空间近似关键字查询的外存索引.目前,还没有在外存中支持精确的空间近似关键字查询的索引结构.设计了一种新型的外存索引RB树,在外存中支持精确的空间近似关键字查询.RB树支持的空间近似关键字查询包括多种空间条件,如范围查询、NN查询,同时支持多种关键字相似性度量,包括编辑距离、规范化编辑距离等.通过真实数据中的性能测试验证了RB树的效率.  相似文献   

8.
k-最近邻搜索(KNNS) 在高维空间中应用非常广泛,但目前很多KNNS算法是基于欧氏距离对数据进行索引和搜索,不适合采用角相似性的应用。提出一种基于角相似性的k-最近邻搜索算法(BA-KNNS)。该算法先提出基于角相似性的数据索引结构(BA-Index),参照一条中心线和一条参照线,将数据以系列壳—超圆锥体方式进行组织并分别线性存储;然后确定查询对象的空间位置,有效确定一个以从原点到查询对象的直线为中心线的超圆锥体并在其中进行搜索。实验结果表明,BA-KNNS算法较其他k-最近邻搜索算法有更好的性能。  相似文献   

9.
廖巍  吴晓平  胡卫  钟志农 《计算机科学》2010,37(11):180-183
针对基于空间道路网络的k近部查询处理,提出了分布式移动对象更新策略以有效减少服务器计算代价,利用基于内存的空间道路网络部接矩阵、最短路径矩阵结构和移动对象哈希表索引分别对道路网络无向图与移动对象进行存储管理。提出了基于最短路径度量的网络扩展搜索(SPNE)算法,以通过裁剪网络搜索空间来减少k近部查询搜索代价。实验表明,SPNE算法的性能优于传统的NE和MKNN等k近邻查询处理算法。  相似文献   

10.
空间数据仓库有效地支持对空间数据的管理和分析,提供更加全面的决策支持.讨论了一种有效的空间决策支持手段——空间区域聚集查询的实现.基于aggregate cubetree和aR—tree提出了一个可以有效地在空间维和非空间维上进行区域聚集查询的索引结构aCR-tree及其相关算法,并计算分析了查询算法的时间复杂度.与现有技术相比aCR-tree降低了存储代价和每次查询访问的节点数,通过实验证明,该索引结构可以提供较好的存储性能和查询性能.  相似文献   

11.
Bulk construction of dynamic clustered metric trees   总被引:2,自引:2,他引:0  
Repositories of complex data types, such as images, audio, video and free text, are becoming increasingly frequent in various fields. A general searching approach for such data types is that of similarity search, where the search is for similar objects and similarity is modeled by a metric distance function. An important class of access methods for similarity search in metric data is that of dynamic clustered metric trees, where the index is structured as a paged and balanced tree and the space is partitioned hierarchically into compact regions. While access methods of this class allow dynamic insertions typically of single objects, the problem of efficiently inserting a given data set into the index in bulk is largely open. In this article we address this problem and propose novel algorithms corresponding to its two cases, where the index is initially empty (i.e. bulk loading), and where the index is initially non empty (i.e. bulk insertion). The proposed bulk loading algorithm builds the index bottom-up layer by layer, using a new sampling based clustering method, which improves clustering results by improving the quality of the selected sample sets. The proposed bulk insertion algorithm employs the bulk loading algorithm to load the given data into a new index structure, and then merges the new and the existing structures into a unified high quality index, using a novel decomposition method to reduce overlaps between the structures. Both algorithms yield significantly improved construction and search performance, and are applicable to all dynamic clustered metric trees. Results from an extensive experimental study show that the proposed algorithms outperform alternative methods, reducing construction costs by up to 47% for CPU costs and 99% for I/O costs, and search costs by up to 48% for CPU costs and 30% for I/O costs.  相似文献   

12.
基于广义超曲面树的相似性搜索算法   总被引:2,自引:0,他引:2  
张兆功  李建中 《软件学报》2002,13(10):1969-1976
相似性搜索是数据挖掘的主要领域之一.它在数据库中检索出相似的数据,发现数据间的相似性.它可以应用于图像数据库、空间数据库和时间序列分析.对于欧氏空间(一种特殊的度量空间),相似性搜索算法中基于R-tree的方法,在低维时是高效的,当维数增加时,R-tre e的方法将退化为线性扫描.该现象被称为维数灾难(dimensionality curse),主要原因是存在数据重复.当数据量很大且维数很高时,距离计算和I/O操作将非常费时.提出了度量空间上新的空间分割方法和索引结构rgh-tree,利用数据库的数据对象与很少几个固定参考对象的距离信息进行数据分割和分布,产生一个各节点没有数据重复的平衡树.另外,在rgh-tree的基础上提出了相应的相似性搜索算法,该算法具有较小的I/O代价和距离计算次数,平均复杂性近似为o(n0.58).解决了目前算法存在的一些问题.  相似文献   

13.
半结构化数据相似搜索的索引技术研究   总被引:6,自引:0,他引:6  
杨建武  陈晓鸥 《计算机学报》2002,25(11):1219-1226
为了在海量、高维、动态的半结构化数据集上进行有效的相似搜索,该文提出一种采用聚类技术进行索引构建与更新的多路平衡树--CSS-树以及基于CSS-树的相似搜索与动态更新的算法。CSS-树借鉴SS^ -树基于聚类进行节点组织与分裂的基本思想,避免了根据坐标准进行分裂时所要求的维不相关性,同时在节点组织、分裂算法和搜索算法等方面进行了改进,提出了新的搜索剪枝策略,实验表明,该结构及算法对海量半结构化数据相似搜索和效率明显优于传统算法。  相似文献   

14.
Repositories of unstructured data types, such as free text, images, audio and video, have been recently emerging in various fields. A general searching approach for such data types is that of similarity search, where the search is for similar objects and similarity is modeled by a metric distance function. In this article we propose a new dynamic paged and balanced access method for similarity search in metric data sets, named CM-tree (Clustered Metric tree). It fully supports dynamic capabilities of insertions and deletions both of single objects and in bulk. Distinctive from other methods, it is especially designed to achieve a structure of tight and low overlapping clusters via its primary construction algorithms (instead of post-processing), yielding significantly improved performance. Several new methods are introduced to achieve this: a strategy for selecting representative objects of nodes, clustering based node split algorithm and criteria for triggering a node split, and an improved sub-tree pruning method used during search. To facilitate these methods the pairwise distances between the objects of a node are maintained within each node. Results from an extensive experimental study show that the CM-tree outperforms the M-tree and the Slim-tree, improving search performance by up to 312% for I/O costs and 303% for CPU costs.  相似文献   

15.
Repositories of unstructured data types, such as free text, images, audio and video, have been recently emerging in various fields. A general searching approach for such data types is that of similarity search, where the search is for similar objects and similarity is modeled by a metric distance function. In this article we propose a new dynamic paged and balanced access method for similarity search in metric data sets, named CM-tree (Clustered Metric tree). It fully supports dynamic capabilities of insertions and deletions both of single objects and in bulk. Distinctive from other methods, it is especially designed to achieve a structure of tight and low overlapping clusters via its primary construction algorithms (instead of post-processing), yielding significantly improved performance. Several new methods are introduced to achieve this: a strategy for selecting representative objects of nodes, clustering based node split algorithm and criteria for triggering a node split, and an improved sub-tree pruning method used during search. To facilitate these methods the pairwise distances between the objects of a node are maintained within each node. Results from an extensive experimental study show that the CM-tree outperforms the M-tree and the Slim-tree, improving search performance by up to 312% for I/O costs and 303% for CPU costs.  相似文献   

16.
In this paper we describe a new algorithm for maintaining a balanced search tree on a message-passing MIMD architecture; the algorithm is particularly well suited for implementation on a small number of processors. We introduce a (2B-2, 2B) search tree that uses a bidirectional ring of O(log n) processors to store n entries. Update operations use a bottom-up node-splitting scheme, which performs significantly better than top-down search tree algorithms. The bottom-up algorithm requires many fewer messages and results in less blocking due to synchronization than top-down algorithms. Additionally, for a given cost ratio of computation to communication the value of B may be varied to maximize performance. Implementations on a parallel-architecture simulator are described  相似文献   

17.
Many recent database applications need to deal with similarity queries. For such applications, it is important to measure the similarity between two objects using the distance between them. Focusing on this problem, this paper proposes the slim-tree, a new dynamic tree for organizing metric data sets in pages of fixed size. The slim-tree uses the triangle inequality to prune the distance calculations that are needed to answer similarity queries over objects in metric spaces. The proposed insertion algorithm uses new policies to select the nodes where incoming objects are stored. When a node overflows, the slim-tree uses a minimal spanning tree to help with the splitting. The new insertion algorithm leads to a tree with high storage utilization and improved query performance. The slim-tree is a metric access method that tackles the problem of overlaps between nodes in metric spaces and that allows one to minimize the overlap. The proposed "fat-factor" is a way to quantify whether a given tree can be improved and also to compare two trees. We show how to use the fat-factor to achieve accurate estimates of the search performance and also how to improve the performance of a metric tree through the proposed "slim-down" algorithm. This paper also presents a new tool in the slim-tree's arsenal of resources, aimed at visualizing it. Visualization is a powerful tool for interactive data mining and for the visual tracking of the behavior of a tree under updates. Finally, we present a formula to estimate the number of disk accesses in range queries. Results from experiments with real and synthetic data sets show that the new slim-tree algorithms lead to performance improvements. These results show that the slim-tree outperforms the M-tree by up to 200% for range queries. For insertion and splitting, the minimal-spanning-tree-based algorithm achieves up to 40 times faster insertions. We observed improvements of up to 40% in range queries after applying the slim-down algorithm  相似文献   

18.
Recently, active prevention healthcares are needed for potential patients to be suffered in the future as the forecasted diseases inherited from ancestors. We call active U-healthcare, for providing active, periodic, and continuous medical treatments depending on inherited heterogeneous states in DNAs of patients, such as diabetes, heart diseases, and female diseases. However, the bottleneck of the aggressive active U-healthcare is memory overhead in DNA sequence analysis of each patient since the sequences of DNAs have massive volume. Thus, the efficient retrieve of the many disease patterns in originally recorded on DNAs of potential patients is a major problem. This paper focuses on a novel method for efficient retrieving of disease patterns using a suffix tree in memory. The suffix tree is widely used in the similarity search for sequences consisting of limited characters. It is efficient when the occurrence frequency of a common prefix is high. Since in-memory suffix tree construction algorithms do not scale up, a large-scale disk-based suffix tree construction algorithm, TRELLIS, has been proposed recently. However, the algorithm requires a large amount of memory, disk space, and disk I/Os in order to merge sub-trees having a common prefix. In this paper, we propose a new non-merging method, called NST. The experimental results show that NST constructs an index using less memory than TRELLIS.  相似文献   

19.
This review considers the class of index structures for fast similarity search. In constructing and applying such structures, only information on values or ranks of some distances/similarities between objects is used. The search by metric distances (satisfying the triangle inequality and other metric axioms) and by nonmetric distances is discussed. Structures that return objects of a base that represent the exact answer to a search query and also structures for approximate similarity search are presented (the latter structures do not guarantee precision, but usually return results close to exact and operate faster than structures for exact search). General principles of construction and application of some index structures are stated, and also ideas underlying concrete algorithms (both well-known and proposed lately) are considered.  相似文献   

20.
Approximate similarity retrieval with M-trees   总被引:4,自引:0,他引:4  
Motivated by the urgent need to improve the efficiency of similarity queries, approximate similarity retrieval is investigated in the environment of a metric tree index called the M-tree. Three different approximation techniques are proposed, which show how to forsake query precision for improved performance. Measures are defined that can quantify the improvements in performance efficiency and the quality of approximations. The proposed approximation techniques are then tested on various synthetic and real-life files. The evidence obtained from the experiments confirms our hypothesis that a high-quality approximated similarity search can be performed at a much lower cost than that needed to obtain the exact results. The proposed approximation techniques are scalable and appear to be independent of the metric used. Extensions of these techniques to the environments of other similarity search indexes are also discussed. Received July 7, 1998 / Accepted October 13, 1998  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号