首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 784 毫秒
1.
刘雪莉  王宏志  李建中  高宏 《软件学报》2015,26(6):1421-1437
按照元组描述的实体对其进行组织和查询处理,是一种管理劣质数据的有效方法.考虑到同一个实体的同一属性存在多个描述的值,因此,基于实体的数据库上的连接是支持多个值的相似性连接.与字符串的相似性连接相比较,实体的相似性连接在数据清洗、信息集成、模糊关键字查询、诈骗检测和文本聚集等领域有着更好的应用效果.通过建立双层索引结构,提出了实体数据库上相似性连接算法ES-JOIN.同时,该方法适用于解决集合中字符串模糊匹配的相似性连接问题,而传统的集合相似性连接只针对集合中元素精确匹配的情况.为了加速连接,还提出了过滤措施对算法进行优化,进一步给出了优化算法OPT_ES-JOIN.实验验证了ES-JOIN算法和OPT_ES-JOIN算法具有很好的效率和可扩展性.实验结果表明,过滤措施具有很好的过滤效果.  相似文献   

2.
为了深入理解和全面把握大数据相似性连接查询技术的研究进展,更好地促进其在图片聚类、实体解析、相似文档检测、相似轨迹检索等领域的广泛应用,对大数据相似性连接查询技术相关研究工作进行了深入调研和分析。首先对相似性连接查询的基本概念进行了介绍,然后分别对集合、向量、空间数据、概率数据、字符串等不同类型大数据的相似性连接查询相关研究工作进行了深入研究,对其优缺点进行了分析和总结。最后,指出了大数据相似性连接查询面临的若干挑战性问题及未来的研究重点。  相似文献   

3.
字符串相似性连接是数据质量管理的基本操作,也是数据价值发现的关键步骤。针对目前已有的方法不能满足面向大数据的增量式处理需求的问题,提出一种面向流式数据的增量式字符串相似性连接方法——Inc-Join,并对方法的索引技术进行了优化。该方法以Pass-Join字符串连接算法为基础,首先,采用字符串划分技术将字符串划分成多个互不相交的子串;然后,建立字符串的反向索引列表并将其作为状态;最后,新增数据只需根据状态进行相似性计算,每次连接操作结束后都对状态进行更新。实验结果表明,Inc-Join方法在不影响连接准确率的同时,有效将长、 短字符串重复匹配次数减少为√n(n是批处理方式的匹配次数)。 实验对3种数据集进行处理,发现使用批处理方式进行相似性连接的响应时间是Inc-Join的1至4.7倍,并呈现急剧递增的趋势;而且优化后Inc-Join方法的响应时间最小只占优化前的3/4,并随处理数据的增多所占比例越来越小。同时优化后的Inc-Join不需要保存状态,再一次减小了算法执行的时间和空间开销。  相似文献   

4.
相似性连接查询技术研究进展   总被引:1,自引:0,他引:1  
相似性连接查询,即查找相似的数据对象对,具有广泛的应用领域,例如相似网页检测、实体解析、数据清洗和相似图像检索等。相似性连接查询是当前大数据处理领域的热点问题之一。讨论了相似性连接查询面临的挑战;根据不同的标准对现有的相似性连接查询进行了分类;总结并比较了现有的字符串、集合、向量和图相似性连接算法;探讨了今后的研究重点和发展趋势。  相似文献   

5.
We introduce asymptotically optimal algorithms for gathering and scattering a small-to-moderate sized set of data on a coarse grained parallel computer. We use these operations to obtain efficient to optimal solutions to several fundamental problems in image processing and string matching (exact or approximate) for coarse grained parallel computers.  相似文献   

6.
Genetic algorithms for approximate similarity queries   总被引:1,自引:0,他引:1  
Algorithms to query large sets of simple data (composed of numbers and small character strings) are constructed to retrieve the exact answer, retrieving every relevant element, so the answer said to be exact. Similarity searching over complex data is much more expensive than searching over simple data. Moreover, comparison operations over complex data usually consider features extracted from each element, instead of the elements themselves. Thus, even if an algorithm retrieves an exact answer, it is ‘exact’ regarding the extracted features, not regarding the original elements themselves. Therefore, trading exact answering with query time response can be worthwhile. In this work we developed two search strategies based on genetic algorithms to allow retrieving approximate data indexed by Metric Access Methods (MAM) within a limited, user-defined, amount of time. These strategies allow implementing algorithms to answer both range and k-nearest neighbor queries, and allow also to estimate the precision obtained for the approximate answer. Experimental evaluation shows that very good results (corresponding to what the user would expect) can be obtained in a fraction of the time required to obtain the exact answer.  相似文献   

7.
基于Vague集加权相似度量的双向近似推理   总被引:9,自引:0,他引:9  
本文提出了一种新的Vague集的加权相似度量方法,以解决文献中关于Vague集相似度量的某些缺陷以及权向量中各个分量难以确定的问题,并且提出了Vague集间相似方向的概念,用它来描述两个相拟Vague集中哪个所包含的信息更精确,并给出了一个判定方法。在此基础上的给出了一种基于Vague集加权相似度量的双向近似推理方法,该方法更好地利用了Vague集信息的精确性,从而提高了推理的精确性和适用性。这为智能系统中的近似推理提供了一个十分有用的工具。  相似文献   

8.
Text-indexing structures provide significant advantages in the solution of many problems related to string analysis and comparison, and are nowadays widely used in the analysis of biological sequences. In this paper, we present some applications of affix trees to problems of exact and approximate pattern matching and discovery in RNA sequences. By allowing bidirectional search for symmetric patterns in the sequences, affix trees permit to discover and locate in the sequences patterns describing not only sequence regions, but also containing information about the secondary structure that a given region could form, with improvements in terms of theoretical and practical efficiency over the existing methods. The search can be either exact or approximate, where the approximation can be defined simultaneously both for the sequence and the structure of patterns. The approach presented in this paper could provide significant help in the analysis of RNA sequences, where the functional motifs often involve not only sequence, but also the structural constraints.  相似文献   

9.
为了解决高维数据相似性连接查询中存在的维度灾难和计算代价高等问题,基于p-稳态分布,将高维数据映射到低维空间。根据卡方分布的性质,证明了如果低维空间的距离大于,则原始空间距离大于ε的概率具有一定的下界,从而可以在低维空间以较低的计算代价进行有效过滤。在此基础上,提出了基于卡方分布的高维数据相似性连接查询算法。为了进一步提高查询效率,提出了基于双重过滤的高维数据相似性连接查询算法。利用真实数据集进行了实验,实验结果表明所提方法具有较好的性能。基于卡方分布的相似性连接查询算法召回率可以达到90%以上。基于双重过滤的相似性连接查询算法可以进一步提高性能,但是会损失一定的召回率。对时间性能要求比较高、对召回率要求不太严格的查询任务可以采用基于双重过滤的相似性连接查询算法;反之,可以采用基于卡方分布的相似性连接查询算法。  相似文献   

10.
按照元组描述的实体对其进行组织和查询处理是一种管理劣质数据的有效方法。考虑到同一个实体的同一属性存在多个描述值,因此基于实体的数据库上的连接是支持多个值的相似性连接。由于多表连接操作的连接顺序对连接性能有着重要的影响,研究了实体数据库上多表连接顺序选择方法,采用基于实体的马尔可夫链蒙特卡洛(Markov chain Monte Carol,MCMC)方法估计出实体数据库的相似性连接操作的结果大小,并以连接结果大小和有无索引作为主要代价,提出了基于实体的多连接顺序优化策略。进一步,通过实验证明了估计连接结果大小的算法在大规模数据上有着显著的优势。  相似文献   

11.
Design patterns are important in software maintenance because they help in understanding and re-engineering systems. They propose design motifs, solutions to recurring design problems. The identification of occurrences of design motifs in large systems consists of identifying classes whose structure and organization match exactly or approximately the structure and organization of classes as suggested by the motif. We adapt two classical approximate string matching algorithms based on automata simulation and bit-vector processing to efficiently identify exact and approximate occurrences of motifs. We then carry out two case studies to show the performance, precision, and recall of our algorithms. In the first case study, we assess the performance of our algorithms on seven medium-to-large systems. In the second case study, we compare our approach with three existing approaches (an explanation-based constraint approach, a metric-enhanced explanation-based constraint approach, and a similarity scoring approach) by applying the algorithms on three small-to-medium size systems, JHotDraw, Juzzle, and QuickUML. Our studies show that approximate string matching based on bit-vector processing provides efficient algorithms to identify design motifs.  相似文献   

12.
Approximate pattern matching algorithms have become an important tool in computer assisted music analysis and information retrieval. The number of different problem formulations has greatly expanded in recent years, not least because of the subjective nature of measuring musical similarity. From an algorithmic perspective, the complexity of each problem depends crucially on the exact definition of the difference between two strings. We present an overview of advances in approximate string matching in this field focusing on new measures of approximation.  相似文献   

13.
字符串相似连接是指在字符串集合中找出相似的字符串对,是许多应用的关键操作,寻找高效的字符串相似连接算法已成为研究热点。基于划分的过滤-验证方法(Pass-Join)与其他方法相比具有较高的效率。它按照字符串长度递增的顺序访问字符串集合,通过查找一个字符串的划分块是否存在于另一个字符串中,快速筛选出可能相似的字符串对(候选集),然后利用编辑距离进行相似性验证。研究发现,按照字符串长度递减的顺序进行过滤(长度递减过滤)的效果优于按照长度递增的顺序过滤(长度递增过滤)的效果,基于此,提出双向过滤-验证机制:在过滤阶段对长度递减过滤的结果再进行一次长度递增过滤,进一步减小候选集大小;在验证阶段利用双向过滤产生的两对划分块和其匹配子串分隔字符串对,从而减小需要验证的字符串的长度,加速验证过程。实验证明,双向过滤-验证算法在真实数据集上优于原算法。  相似文献   

14.
A string similarity join finds similar pairs between two collections of strings. Many applications, e.g., data integration and cleaning, can significantly benefit from an efficient string-similarity-join algorithm. In this paper, we study string similarity joins with edit-distance constraints. Existing methods usually employ a filter-and-refine framework and suffer from the following limitations: (1) They are inefficient for the data sets with short strings (the average string length is not larger than 30); (2) They involve large indexes; (3) They are expensive to support dynamic update of data sets. To address these problems, we propose a novel method called trie-join, which can generate results efficiently with small indexes. We use a trie structure to index the strings and utilize the trie structure to efficiently find similar string pairs based on subtrie pruning. We devise efficient trie-join algorithms and pruning techniques to achieve high performance. Our method can be easily extended to support dynamic update of data sets efficiently. We conducted extensive experiments on four real data sets. Experimental results show that our algorithms outperform state-of-the-art methods by an order of magnitude on the data sets with short strings.  相似文献   

15.
Managing large-scale time series databases has attracted significant attention in the database community recently. Related fundamental problems such as dimensionality reduction, transformation, pattern mining, and similarity search have been studied extensively. Although the time series data are dynamic by nature, as in data streams, current solutions to these fundamental problems have been mostly for the static time series databases. In this paper, we first propose a framework to online summary generation for large-scale and dynamic time series data, such as data streams. Then, we propose online transform-based summarization techniques over data streams that can be updated in constant time and space. We present both the exact and approximate versions of the proposed techniques and provide error bounds for the approximate case. One of our main contributions in this paper is the extensive performance analysis. Our experiments carefully evaluate the quality of the online summaries for point, range, and knn queries using real-life dynamic data sets of substantial size. Edited by W. Aref  相似文献   

16.
提出了一种新的Vague集的加权相似度量方法以解决文献[1]中关于vague集相似度量的某些缺陷,并且提出了Vague集间相似方向的概念,用它来描述两个相似Vague集中哪个所包含的信息更精确,并给出了一个判定方法。在此基础上给出了一种基于Vague集加权相似度量的双向近似推理方法,该方法更好地利用了vague集信息的精确性,从而提高了推理的精确性和适用性.这为智能系统中的近似推理提供了一个十分有用的工具.  相似文献   

17.
相似性连接技术在数据清洗、数据集成等领域中具有重要意义,近年来引起了学术界的广泛关注.随着数据量的不断增大、数据处理实时性的要求逐渐提高以及处理器性能提升瓶颈的出现,传统的串行相似性连接方法已经不能满足当前大数据处理的需求.近些年,GPU作为协处理器在机器学习等领域取得了良好的加速效果,因此基于GPU的并行算法开始成为解决各类性能问题的有效解决方案.为此,提出了基于CPU-GPU异构体系的并行相似性连接方法.首先,方法使用GPU构建倒排索引,索引采用SoA(struct of arrays)结构,从而解决了传统索引结构在并行模式下读写效率低的问题.其次,针对串行算法的性能问题,提出基于过滤验证框架的并行双重长度过滤算法,其中利用前缀过滤和构建好的倒排索引提升过滤效果.方法中相似度精确计算验证过程使用CPU计算执行,从而充分利用CPU-GPU的异构计算资源.最后,在多个数据集上进行实验验证性能.通过与串行相似性连接算法进行对比,实验结果表明所提出方法相对于已有方法具有更好的过滤效果和更低的索引生成代价,并在相似性连接上具有更好的性能和良好的加速比.  相似文献   

18.
Common substring problems allowing errors are known to be NP-hard. The main challenge of the problems lies in the combinatorial explosion of potential candidates. In this paper, we propose and study a generalized center string (GCS) problem, where not only all models (center strings) of any length, but also the positions of all their (degenerative) instances in input sequences are searched for. Inspired by frequent pattern mining techniques in data mining field, we present an exact and efficient method to solve GCS. First, a highly parallelized Trie-like structure, consensus tree, is proposed. Based on this structure, we present three Bpriori algorithms step by step. Bpriori algorithms can solve GCS with reasonable time and/or space complexities. We have proved that GCS is fixed parameter tractable with respect to fixed symbol set size and fixed length of input sequences. Experiment results on both artificial and real data have shown the correctness of the algorithms and the validity of our complexity analysis. A comparison with some current algorithms for solving common approximate substring problems is also given  相似文献   

19.
准确匹配实体名称在信息系统集成中有广泛的应用,而在中文环境中,实体名称的变化和笔误使得中文实体名称难以准确匹配,所以需要开发出适应这些变化和笔误的匹配方法。中文实体名称的相似度从字、词、语义三个层次计算出来,将这些相似度线性合并起来,集成各自的优势。为了利用更多的匹配特征,引入了两种机器学习的方法:第一种方法通过训练获得一个优化排序和最佳切分点;第二种方法利用支持向量机来判断两个名称是否指向同一实体。在中文实体名称的数据集上的实验表明,这些方法和特征有效提高了匹配的效果。  相似文献   

20.
两字符串的编辑距离是从一个串转换到另一个串所需要的最少基本操作数。编辑距离广泛应用于字符串近似匹配、字符串相似连接等领域。动态规划法利用编辑距离矩阵来计算两个串的编辑距离,需要计算矩阵中的所有元素,时间效率低。改进的方法改变了矩阵中元素的计算次序,减少了需要比对的元素,但仍需要比对一半以上的元素,时间效率还有待提高。提出基于基本操作序列的编辑距离顺序验证方法。首先,分析了基本操作序列的可列性,给出了列举基本操作序列的方法。然后依次顺序验证基本操作数从小到大的基本操作序列直到某一序列通过验证,得到其编辑距离。在阈值为2的字符串近似搜索实验中发现,所提方法比动态规划类方法具有更高的效率。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号