20 similar documents found; search took 15 ms
1.
High-dimensional similarity joins (cited 3 times: 0 self-citations, 3 by others)
Shim K., Srikant R., Agrawal R. IEEE Transactions on Knowledge and Data Engineering, 2002, 14(1): 156-171
Many emerging data mining applications require a similarity join between points in a high-dimensional domain. We present a new algorithm that utilizes a new index structure, called the ε tree, for fast spatial similarity joins on high-dimensional points. This index structure reduces the number of neighboring leaf nodes that are considered for the join test, as well as the traversal cost of finding appropriate branches in the internal nodes. The storage cost for internal nodes is independent of the number of dimensions. Hence, the proposed index structure scales to high-dimensional data. We analyze the cost of the join for the ε tree and the R-tree family, and show that the ε tree will perform better for high-dimensional joins. Empirical evaluation, using synthetic and real-life data sets, shows that similarity join using the ε tree is from twice to an order of magnitude faster than the R+ tree, with the performance gap increasing with the number of dimensions. We also discuss how some of the ideas of the ε tree can be applied to the R-tree family. These biased R-trees perform better than the corresponding traditional R-trees for high-dimensional similarity joins, but do not match the performance of the ε tree.
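The join predicate at the heart of this line of work can be stated compactly. The sketch below (plain Python, not the paper's ε-tree) computes the same result by brute force, which is exactly the O(|A|·|B|) cost that index structures such as the ε-tree are designed to avoid:

```python
import math

def epsilon_join(points_a, points_b, eps):
    """Naive high-dimensional similarity (epsilon) join: return all pairs
    (i, j) whose Euclidean distance is at most eps. Index structures like
    the epsilon-tree exist to avoid this exhaustive pairwise scan."""
    result = []
    for i, p in enumerate(points_a):
        for j, q in enumerate(points_b):
            dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))
            if dist <= eps:
                result.append((i, j))
    return result

A = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
B = [(0.1, 0.0, 0.0), (5.0, 5.0, 5.0)]
print(epsilon_join(A, B, eps=0.5))  # [(0, 0)]
```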
2.
In this paper, we focus on high-dimensional similarity join (HDSJ) using the MapReduce paradigm. As the volume of the data and the number of dimensions increase, the computation cost of HDSJ grows exponentially, and no existing approach can process HDSJ efficiently. We therefore propose a novel method called SAX-based HDSJ. Symbolic aggregate approximation (SAX) is a dimensionality reduction technique widely used in time-series processing; we use SAX to represent the high-dimensional vectors and reorganize the vectors into groups based on their SAX representations. For very high-dimensional vectors, we also propose an improved SAX-based HDSJ approach. Finally, we implement both SAX-based HDSJ and the improved variant on Hadoop 0.20.2 and perform comprehensive experiments comparing them with the existing method. The experimental results show that our proposed approaches perform much better than the existing method. Copyright © 2015 John Wiley & Sons, Ltd.
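SAX itself is well documented in the time-series literature. A minimal sketch of the representation step, using the standard Gaussian breakpoints for a four-letter alphabet (the grouping and MapReduce stages of the paper are omitted), might look like:

```python
import bisect
import statistics

# Standard SAX breakpoints for an alphabet of size 4.
BREAKPOINTS = [-0.6745, 0.0, 0.6745]
ALPHABET = "abcd"

def sax(vector, n_segments):
    """Symbolic aggregate approximation (SAX) sketch: z-normalize,
    reduce with piecewise aggregate approximation (PAA), then map
    each segment mean to a symbol via the breakpoint table."""
    mean = statistics.fmean(vector)
    std = statistics.pstdev(vector) or 1.0
    normed = [(x - mean) / std for x in vector]
    seg_len = len(normed) // n_segments
    word = []
    for s in range(n_segments):
        seg = normed[s * seg_len:(s + 1) * seg_len]
        paa = sum(seg) / len(seg)  # segment mean
        word.append(ALPHABET[bisect.bisect_left(BREAKPOINTS, paa)])
    return "".join(word)

# Vectors with the same SAX word can be placed in the same join bucket.
print(sax([1, 2, 3, 4, 5, 6, 7, 8], n_segments=4))  # abcd
```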
3.
Set similarity self-join, which finds all pairs of sets in a collection whose similarity exceeds a given threshold, has wide applications. Set similarity joins based on the filter-verification framework and the parallel distributed computing framework MapReduce have been a research focus in recent years. However, existing algorithms generate large candidate sets when the threshold is low, leading to unsatisfactory performance. To address this problem, this work proposes compressing the data in memory with the frequent-pattern tree FP-tree and its derived structure FP-tree* to compute the set similarity self-join with a smaller candidate set. It first designs and discusses set similarity join computation based on the existing FP-tree*, along with its strengths and weaknesses, then proposes a linear frequent-pattern tree structure with higher traversal efficiency, TELP-tree, and an algorithm based on it, TELP-SJ (TELP-tree self join), which includes two-phase filtering algorithms for tree construction and tree traversal that reduce both the tree size and the number of traversals. It then designs a MapReduce-based parallel distributed algorithm, FastTELP-SJ. Finally, three sets of performance comparison experiments on four real application datasets show that FastTELP-SJ achieves the best efficiency in execution time, memory usage, disk usage, and scalability for high-dimensional large-scale set similarity self-join computation.
4.
Byoungju Yang Hyun Joon Kim Junho Shim Dongjoo Lee Sang-goo Lee 《Journal of Intelligent Information Systems》2016,46(3):473-497
Vector similarity join, which finds similar pairs of vector objects, is a computationally expensive process: as the number of vectors increases, the time needed for the join operation grows with the square of the number of vectors. Various filtering techniques have been proposed to reduce its computational load, and MapReduce algorithms have been studied to manage large datasets, but recent improvements still suffer in computation time and scalability. In this paper, we propose a MapReduce algorithm, FACET (FAst and sCalable maprEduce similariTy join), to efficiently solve the vector similarity join problem on large datasets. FACET is an all-pairs exact join algorithm composed of two stages. In the first stage, we use our own novel filtering techniques to eliminate dissimilar pairs and generate non-redundant candidate pairs. The second stage matches candidate pairs with the vector data so that similar pairs are produced as output. Both stages employ the parallelism offered by MapReduce. The algorithm is currently designed for cosine similarity and the self-join case; extensions to other similarity measures and the R-S join case are also discussed. We provide an I/O analysis of the algorithm and evaluate its performance on multiple real-world datasets. The experimental results show that our algorithm performs, on average, 40% up to 800% better than previous state-of-the-art MapReduce algorithms.
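For concreteness, the all-pairs cosine similarity self-join that FACET solves at scale can be sketched serially. This is the baseline computation only, not FACET's filtered two-stage MapReduce algorithm:

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def all_pairs_join(vectors, threshold):
    """All-pairs exact cosine similarity self-join: this serial
    version checks every pair, the quadratic cost the paper targets."""
    return [(i, j) for i in range(len(vectors))
            for j in range(i + 1, len(vectors))
            if cosine(vectors[i], vectors[j]) >= threshold]

V = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]
print(all_pairs_join(V, threshold=0.95))  # [(0, 1)]
```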
5.
Jianhua Feng, Jiannan Wang, Guoliang Li. The VLDB Journal, 2012, 21(4): 437-461
A string similarity join finds similar pairs between two collections of strings. Many applications, e.g., data integration and cleaning, can significantly benefit from an efficient string-similarity-join algorithm. In this paper, we study string similarity joins with edit-distance constraints. Existing methods usually employ a filter-and-refine framework and suffer from the following limitations: (1) they are inefficient for data sets with short strings (average string length no larger than 30); (2) they involve large indexes; (3) they are expensive to support dynamic updates of the data sets. To address these problems, we propose a novel method called trie-join, which can generate results efficiently with small indexes. We use a trie structure to index the strings and utilize it to efficiently find similar string pairs based on subtrie pruning. We devise efficient trie-join algorithms and pruning techniques to achieve high performance. Our method can be easily extended to support dynamic updates of data sets efficiently. We conducted extensive experiments on four real data sets. Experimental results show that our algorithms outperform state-of-the-art methods by an order of magnitude on data sets with short strings.
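A minimal filter-and-verify baseline for edit-distance joins can be sketched as follows; it uses only the simple length filter (trie-join's subtrie pruning is far stronger), with standard dynamic-programming edit distance for verification:

```python
def edit_distance(s, t):
    """Standard dynamic-programming (Levenshtein) edit distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]

def similarity_join(strings, tau):
    """Filter-and-verify self-join with threshold tau: a length filter
    prunes pairs whose length difference already exceeds tau, and
    survivors are verified with edit distance."""
    out = []
    for i in range(len(strings)):
        for j in range(i + 1, len(strings)):
            if abs(len(strings[i]) - len(strings[j])) > tau:
                continue  # length filter
            if edit_distance(strings[i], strings[j]) <= tau:
                out.append((strings[i], strings[j]))
    return out

print(similarity_join(["kitten", "sitten", "sitting", "ab"], tau=1))
# [('kitten', 'sitten')]
```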
6.
Identification of all pairs of objects in a dataset whose similarity is not less than a specified threshold is of major importance for the management, search, and analysis of data. Set similarity joins are commonly used to implement this operation; they scale to large datasets and are versatile enough to represent a variety of similarity notions. Most methods proposed so far present two main phases at a high level of abstraction: candidate generation, producing a set of candidate pairs, and verification, applying the actual similarity measure to the candidates and returning the correct answer. Previous work has primarily focused on the reduction of candidates, where candidate generation presented the major effort to obtain better pruning results. Here, we propose the opposite approach: we drastically decrease the computational cost of candidate generation by dynamically reducing the number of indexed objects, at the expense of increasing the workload of the verification phase. Our experimental findings show that this trade-off is advantageous: we consistently achieve substantial speed-ups compared to known algorithms.
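The candidate-generation/verification split described above is commonly implemented with prefix filtering. The sketch below is illustrative rather than the authors' algorithm, and uses a plain sorted token order as a stand-in for the usual global frequency order:

```python
import math
from collections import defaultdict

def jaccard(a, b):
    return len(a & b) / len(a | b)

def prefix_filter_join(records, t):
    """Self-join with Jaccard threshold t. Candidate generation uses
    prefix filtering: two sets can reach similarity >= t only if their
    prefixes (first |s| - ceil(t*|s|) + 1 tokens in a fixed global
    order) share a token. Verification applies Jaccard to candidates."""
    index = defaultdict(set)          # token -> record ids seen so far
    results = []
    for rid, rec in enumerate(records):
        tokens = sorted(rec)          # stand-in for a frequency order
        prefix_len = len(tokens) - math.ceil(t * len(tokens)) + 1
        candidates = set()
        for tok in tokens[:prefix_len]:
            candidates |= index[tok]  # candidate generation
            index[tok].add(rid)
        for cid in candidates:        # verification phase
            if jaccard(records[cid], rec) >= t:
                results.append((cid, rid))
    return results

S = [{"a", "b", "c", "d"}, {"a", "b", "c", "e"}, {"x", "y"}]
print(prefix_filter_join(S, t=0.6))  # [(0, 1)]
```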
7.
Mahalakshmi Lakshminarayanan, William F. Acosta, Robert C. Green II, Vijay Devabhaktuni. The Journal of Supercomputing, 2014, 69(2): 930-954
An efficient MapReduce algorithm for performing similarity joins between multisets is proposed. Filtering techniques for similarity joins minimize the number of pairs of entities joined and hence improve the efficiency of the algorithm. Multisets represent real-world data better by considering the frequency of their elements. Prior serial algorithms incorporate filtering techniques only for sets, not multisets, while prior MapReduce algorithms either do not incorporate any filtering technique or incorporate prefix filtering inefficiently and unscalably. This work extends the prefix, size, and positional filtering techniques to multisets, and also achieves the challenging task of efficiently incorporating them in the shared-nothing MapReduce model, thereby minimizing the pairs generated and joined and resulting in I/O, network, and computational efficiency. A technique to enhance the scalability of the algorithm is also presented as a contingency. Algorithms are developed using Hadoop and tested on real-world Twitter data. Experimental results demonstrate unprecedented performance gains.
8.
9.
10.
String similarity join finds similar string pairs in a string collection and is a key operation in many applications; finding efficient string similarity join algorithms has become a research focus. The partition-based filter-verification method (Pass-Join) is more efficient than other methods. It visits the string collection in order of increasing string length and quickly filters out potentially similar string pairs (the candidate set) by checking whether a partition segment of one string appears in the other string, then verifies similarity using edit distance. This work finds that filtering in order of decreasing string length (decreasing-length filtering) outperforms filtering in order of increasing length (increasing-length filtering). Based on this observation, a bidirectional filter-verification mechanism is proposed: in the filtering phase, the results of decreasing-length filtering are filtered again in increasing-length order to further reduce the candidate set; in the verification phase, the two pairs of partition segments and their matching substrings produced by bidirectional filtering are used to split the string pair, reducing the length of the strings that must be verified and accelerating verification. Experiments show that the bidirectional filter-verification algorithm outperforms the original algorithm on real datasets.
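The partition-based filtering behind Pass-Join rests on the pigeonhole principle. A minimal sketch (even-length partitioning and a plain substring test only, without Pass-Join's position-aware matching) is:

```python
def partitions(s, tau):
    """Split s into tau + 1 nearly even segments: if ed(s, t) <= tau,
    at least one segment of s must appear in t unchanged (pigeonhole)."""
    n, k = len(s), tau + 1
    base, extra = divmod(n, k)
    segs, pos = [], 0
    for i in range(k):
        length = base + (1 if i < extra else 0)
        segs.append(s[pos:pos + length])
        pos += length
    return segs

def may_be_similar(s, t, tau):
    """Candidate test: keep the pair only if some segment of the
    shorter string occurs as a substring of the longer one. Surviving
    pairs still need edit-distance verification."""
    short, long_ = (s, t) if len(s) <= len(t) else (t, s)
    return any(seg in long_ for seg in partitions(short, tau))

print(may_be_similar("similarity", "similarly", tau=2))   # True
print(may_be_similar("similarity", "difference", tau=2))  # False
```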
11.
Multiversion two-phase locking (MV2PL) provides online serializable queries without introducing the long blocking delays that can occur with conventional two-phase locking (2PL). MV2PL requires indexing structures, however, that are capable of supporting multiple versions of data. We present several options for extending single-version indexing schemes for use with MV2PL. These basic approaches are largely orthogonal to the underlying indexing structure (e.g., hashing or B+ trees). The options considered differ in where they place version selection information (i.e., references to individual versions); this information is placed either with the data or with the index entries of one or more of the indices. We also present the results from a performance study that show that placing the version selection information with the data is usually the best option, since it keeps the indices smaller and thus enables a larger fraction of the index pages to remain cached in the buffer pool.
12.
Dori D., Wenyin L. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1999, 21(3): 202-215
Accurate and efficient vectorization of line drawings is essential for their higher-level processing. We present a thinningless sparse pixel vectorization (SPV) algorithm. Rather than visiting all the points along the wire's black area, SPV sparsely visits selected medial-axis points. The result is a crude polyline, which is refined through polygonal approximation by removing redundant points. Due to the sparseness of pixel examination and the use of a specialized data structure, SPV is both time efficient and accurate, as evaluated by our proposed performance evaluation criteria.
13.
Genetic algorithms for approximate similarity queries (cited 1 time: 0 self-citations, 1 by others)
Algorithms that query large sets of simple data (composed of numbers and short character strings) are constructed to retrieve every relevant element, so the answer is said to be exact. Similarity searching over complex data is much more expensive than searching over simple data. Moreover, comparison operations over complex data usually consider features extracted from each element rather than the elements themselves. Thus, even if an algorithm retrieves an exact answer, it is 'exact' with regard to the extracted features, not the original elements. Therefore, trading exact answers for query response time can be worthwhile. In this work we developed two search strategies based on genetic algorithms that retrieve approximate answers over data indexed by metric access methods (MAM) within a limited, user-defined amount of time. These strategies support algorithms for both range and k-nearest-neighbor queries, and also allow estimating the precision obtained for the approximate answer. Experimental evaluation shows that very good results (corresponding to what the user would expect) can be obtained in a fraction of the time required to obtain the exact answer.
14.
Avraham T., Lindenbaum M. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006, 28(2): 251-264
A visual search is required when applying a recognition process to a scene containing multiple objects. In such cases, we would like to avoid an exhaustive sequential search. This work proposes a dynamic visual search framework based mainly on inner-scene similarity. Given a number of candidates (e.g., subimages), we hypothesize that more visually similar candidates are more likely to have the same identity. We use this assumption to determine the order of attention. Both deterministic and stochastic approaches relying on this hypothesis are considered. Under the deterministic approach, we suggest a measure similar to Kolmogorov's epsilon-covering that quantifies the difficulty of a search task. We show that this measure bounds the performance of all search algorithms and suggest a simple algorithm that meets this bound. Under the stochastic approach, we model the identity of the candidates as a set of correlated random variables and derive a search procedure based on linear estimation. Several experiments are presented in which the statistical characteristics, search algorithm, and bound are evaluated and verified.
15.
Giorgos Kollias, Madan Sathe, Olaf Schenk, Ananth Grama. Journal of Parallel and Distributed Computing, 2014
This paper addresses the problem of global graph alignment on supercomputer-class clusters. We define the alignment of two graphs as a mapping of each vertex in the first graph to a unique vertex in the second graph so as to optimize a given similarity-based cost function. Using a state-of-the-art serial algorithm for the computation of vertex similarity scores called Network Similarity Decomposition (NSD), we derive corresponding parallel formulations. Coupling this parallel similarity algorithm with a parallel auction-based bipartite matching technique, we obtain a highly efficient and scalable graph matching pipeline. We validate the performance of our integrated approach on a large parallel platform and on diverse graph instances (including Protein Interaction, Wikipedia and Web networks). Experimental results demonstrate that our algorithms scale to large machine configurations (thousands of cores) and problem instances, enabling the alignment of networks of sizes two orders of magnitude larger than reported in the current literature.
16.
A probabilistic-similarity-based data imputation algorithm for incomplete information systems (cited 2 times: 1 self-citation, 1 by others)
When the decision attributes are known but the distribution of condition attribute values is uncertain, filling in missing data using the principle of probabilistic similarity and partitioning the system by decision attributes can complete an incomplete decision information system with high confidence.
17.
The problem of learning a good similarity function between objects arises naturally in machine learning, pattern recognition, and data mining tasks such as clustering, community detection, and metric learning. We focus on the special case in which the similarity function is completely determined by hidden object classes, but we assume that no information about object labels is accessible at the training stage. The main contribution of the paper is a two-stage algorithm that assigns each object a class label and provides a similarity function based on this assignment. We provide risk bounds and empirical evaluation in support of our algorithm. As a consequence of our analysis, we provide a new tradeoff between the empirical error of a multi-class classifier and its generalization error.
18.
Wolf J.L., Dias D.M., Yu P.S., Turek J. IEEE Transactions on Knowledge and Data Engineering, 1994, 6(6): 990-997
Parallel processing is an attractive option for relational database systems. As in any parallel environment, however, load balancing is a critical issue that affects overall performance. Load balancing for one common database operation in particular, the join of two relations, can be severely hampered for conventional parallel algorithms due to a natural phenomenon known as data skew. In a pair of recent papers (J. Wolf et al., 1993; 1993), we described two new join algorithms designed to address the data skew problem. We propose significant improvements to both algorithms, increasing their effectiveness while simultaneously decreasing their execution times. The paper then focuses on the comparative performance of the improved algorithms and their more conventional counterparts. The new algorithms outperform their more conventional counterparts in the presence of just about any skew at all, dramatically so in cases of high skew.
19.
Pasi Luukka. Expert Systems with Applications, 2009, 36(4): 7463-7468
In this article, a classification method is proposed in which data are first preprocessed using fuzzy robust principal component analysis (FRPCA) algorithms to obtain the data in a more feasible form; a similarity classifier is then used for classification. We tested this procedure on breast cancer and liver-disorder data. The results were quite promising, and better classification accuracy was achieved than with traditional PCA and a similarity classifier. The FRPCA algorithms appear to project these datasets into a more feasible form: together with the similarity classifier, a classification accuracy of 70.25% was achieved on the liver-disorder data and 98.19% on the breast cancer data. Compared to the results achieved with traditional PCA and a similarity classifier, the accuracy was about 4% higher on the liver-disorder data and about 0.5% higher on the breast cancer data.
20.
Machining feature-based similarity assessment algorithms for prismatic machined parts (cited 4 times: 0 self-citations, 4 by others)
This paper presents algorithms for identifying machined parts in a database that are similar to a given query part based on machining features. In this paper we only consider parts that are machined on 3-axis machining centers. We utilize reduced feature vectors consisting of machining feature access directions, feature types, feature volumes, feature dimensional tolerances and feature group cardinality as a basis for assessing shape similarity. We have defined a distance function between two sets of reduced feature vectors to assess the similarity between them from the machining effort point of view. To assess the similarity between the two parts, one set of reduced feature vectors is transformed in space using rigid body transformations with respect to the other set such that the distance between them is minimized. The distance between the two sets of aligned reduced feature vectors is used as a measure of similarity between the two parts. The existing machined parts are rank ordered based on the value of the distance with respect to the query part. The cost of previously machined parts that have a very small distance from the query part can be used as a basis for estimating the cost of machining the new part.