首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
改进的中文近似字符串匹配算法   总被引:1,自引:0,他引:1  
范立新 《计算机工程与应用》2006,42(34):172-174,207
BPM-BM算法在针对汉字等大字符集的近似字符串匹配时取得了很好的实际效果,但该算法在最差情况下的总体时间复杂度为O(!+nm)。而提出的IBPM-BM算法由于具有记忆的能力,保证了过滤阶段的无回溯,可以在理论上保证最差情况下的总体时间复杂度为O(!+n),而在最佳情况下的时间复杂度与BPM-BM算法一致。  相似文献   

2.
针对近似模式匹配算法在处理带有灵活通配符和长度约束近似模式匹配(APMWL)问题时只能解决替换操作, 提出一种基于动态规划的编辑距离矩阵(EDM)构造方法,设计了基于EDM的近似模式匹配算法APM, 可以处理近似匹配中的三种编辑操作,即插入、替换和删除操作。此外,根据文本中字符是否允许被重复使用的约束条件,设计APM-OF算法。实验结果表明,APM和APM-OF与同类算法相比具备显著的优势:与Sail_Approx匹配算法实验对比, 获取解的平均增长率分别达到8.34%和12.37%; 将APM-OF算法应用至模式挖掘中, 挖掘出的频繁近似模式个数为OneoffMining算法的2.07倍。  相似文献   

3.
一种用于抄袭识别的文档距离度量   总被引:1,自引:0,他引:1       下载免费PDF全文
广义编辑距离的计算是一个NP-完全问题,在充分考虑了文档抄袭行为的特点之后提出一种基于广义编辑距离的单向的低计算复杂性的文档距离度量方法。首先,计算第一文档的各段落在第二文档全文中的近似串匹配距离之和,同时确定各段落在第二文档中的近似匹配子串(即原象串),然后根据这些原象串得到回退数和前跳数,最后将三者求和作为文档距离。该文档距离是一种广义编辑距离的近似值,能够在O(n2)时间内计算,并能充分反映抄袭方向。针对人工文档和实际文档的两组实验表明该距离具有较低的漏检率、误检率。  相似文献   

4.
袁先平  仲红  黄宏升  易磊 《计算机工程》2011,37(20):142-144
数据库中字符串近似匹配查询不能完全保护查询双方的隐私信息。针对该问题,提出一种对数据库中字符串数据的近似匹配查询协议。采用安全计算编辑距离协议、同态加密、茫然传输等安全技术,在有效保护查询双方隐私信息的情况下,实现对字符串近似匹配的查询,并分析该协议的正确性、安全性及复杂性,结果表明,该方案是安全有效的。  相似文献   

5.
支持块编辑距离的索引结构   总被引:1,自引:0,他引:1  
在近似字符串匹配中,传统的编辑距离不能很好地衡量诸如人名、地址等数据的相似关系,而块编辑距离可以很好地衡量两个字符串的相似性.如何有效地支持块编辑距离,进行近似字符串查询处理具有重要的意义.计算两个字符串的块编辑距离是一个NP完全问题,因此希望提供有效的方法可以增强过滤能力,并减少假通过率.设计了一种支持移动编辑距离的新颖的索引结构SHV-Trie,通过研究移动编辑距离的操作特性,使用字母出现的频率作为支持移动编辑距离操作的一个下界,并且提出相应的查询过滤算法,同时,针对索引SHV-Trie的空间开销过大的问题,提出一种优化字母排列的索引结构和一种压缩的索引结构及相关查询过滤算法.真实数据集上的实验结果与分析显示了所提出的索引结构具有良好的过滤能力,并通过减少效率假通过率提高查询的效率.  相似文献   

6.
Gonzalo Navarro 《Software》2001,31(13):1265-1312
We present nrgrep (‘non‐deterministic reverse grep’), a new pattern‐matching tool designed for efficient search of complex patterns. Unlike previous tools of the grep family, such as agrep and Gnu grep, nrgrep is based on a single and uniform concept: the bit‐parallel simulation of a non‐deterministic suffix automaton. As a result, nrgrep can find from simple patterns to regular expressions, exactly or allowing errors in the matches, with an efficiency that degrades smoothly as the complexity of the searched pattern increases. Another concept that is fully integrated into nrgrep and that contributes to this smoothness is the selection of adequate subpatterns for fast scanning, which is also absent in many current tools. We show that the efficiency of nrgrep is similar to that of the fastest existing string‐matching tools for the simplest patterns, and is by far unmatched for more complex patterns. Copyright © 2001 John Wiley & Sons, Ltd.  相似文献   

7.
Justin Zobel  Philip Dart 《Software》1995,25(3):331-345
Approximate string matching is used for spelling correction and personal name matching. In this paper we show how to use string matching techniques in conjunction with lexicon indexes to find approximate matches in a large lexicon. We test several lexicon indexing techniques, including n-grams and permuted lexicons, and several string matching techniques, including string similarity measures and phonetic coding. We propose methods for combining these techniques, and show experimentally that these combinations yield good retrieval effectiveness while keeping index size and retrieval time low. Our experiments also suggest that, in contrast to previous claims, phonetic codings are markedly inferior to string distance measures, which are demonstrated to be suitable for both spelling correction and personal name matching.  相似文献   

8.
Experimental comparisons of the running time of approximate string matching algorithms for the k differences problem are presented. Given a pattern string, a text string, and an integer k, the task is to find all approximate occurrences of the pattern in the text with at most k differences (insertions, deletions, changes). We consider seven algorithms based on different approaches including dynamic programming, Boyer–Moore string matching, suffix automata, and the distribution of characters. It turns out that none of the algorithms is the best for all values of the problem parameters, and the speed differences between the methods can be considerable.  相似文献   

9.
周典瑞  周莲英 《计算机应用》2013,33(8):2208-2211
针对海量数据下相似重复记录检测算法的低查准率和低效率问题,采用综合加权法和基于字符串长度过滤法对数据集进行相似重复检测。综合加权法通过结合用户经验和数理统计法计算各属性的权重。基于字符串长度过滤法在相似检测过程中利用字符串间的长度差异提前结束编辑距离算法的计算,减少待匹配的记录数。实验结果表明,通过综合加权法计算的权重向量更加全面、准确反映出各属性的重要性,基于字符串的长度过滤法减少了记录间的比对时间,能够有效地解决海量数据的相似重复记录检测问题。  相似文献   

10.
A subquadratic algorithm for approximate limited expression matching   总被引:1,自引:0,他引:1  
In this paper we present an efficient subquadratic-time algorithm for matching strings and limited expressions in large texts. Limited expressions are a subset of regular expressions that appear often in practice. The generalization from simple strings to limited expressions has a negligible affect on the speed of our algorithm, yet allows much more flexibility. Our algorithm is similar in spirit to that of Masek and Paterson [MP], but it is much faster in practice. Our experiments show a factor of four to five speedup against the algorithms of Sellers [Se] and Ukkonen [Uk1] independent of the sizes of the input strings. Experiments also reveal our algorithm to be faster, in most cases, than a recent improvement by Chang and Lampe [CL2], especially for small alphabet sizes for which it is two to three times faster.The research of U. Manber was supported in part by a Presidential Young Investigator Award DCR-8451397, with matching funds from AT&T, and by NSF Grant CCR-9001619. G. Myers research was supported in part by NIH Grant LM04960, NSF Grant CCR-9001619, and the Aspen Center for Physics.  相似文献   

11.
刘丽霞  张志强 《计算机应用》2013,33(8):2375-2378
基于Trie树的相似字符串查找算法是利用编辑距离的阈值来计算每个节点的活跃节点集,已有算法由于存在大量的冗余计算,导致时间复杂度和空间复杂度都比较高。针对这个问题,采用了基于活跃节点的对称性和动态规划算法的思想对已有算法进行改进,并对活跃节点集进行了修剪,提出了New-Trie-Stack算法。该算法避免了活跃节点的重复计算,以及已有算法在保存所有已遍历节点的活跃节点集时的空间开销。实验结果表明New-Trie-Stack算法在时间复杂度和空间复杂度上都有明显的下降。  相似文献   

12.
一种融合多种编辑距离的字符串相似度计算方法*   总被引:5,自引:0,他引:5  
针对中西文混合字符串,采用了将汉字作为西文字符的等价单位计算编辑距离的方法,并从输入法的角度提出了采用拼音编码和五笔编码计算编辑距离的方法,最后给出了融合三种编辑距离计算字符串相似度的算法。仿真结果表明,该方法在提高相似重复记录检测的查全率的同时,也能获得较高的查准率。  相似文献   

13.
14.
基于独立分量分析的自适应在线算法   总被引:1,自引:1,他引:1  
独立分量分析(ICA)是近几年兴起的一种高效的信号处理方法,学习步长的优化问题是自适应ICA重要的一方面,基于变步长思想,定义了一种描述信号分离状态的相似性测度,来衡量输出分量之间的相似性程度,并由此提出一种改进的自适应在线算法。根据相似性程度所反映的信号分离状态自适应调节步长,并建立学习步长和相似性测度变化量的非线性关系,克服了传统算法在信道矩阵变化时对步长自适应调整的不足。性能指标分析和仿真实验证明了算法的收敛性和稳态性能。  相似文献   

15.
一种基于链码的三维心血管图像匹配算法   总被引:3,自引:0,他引:3       下载免费PDF全文
为了快速准确地进行三维心血管图像匹配,以帮助医生更加准确地进行心血管疾病的治疗,提出一种基于链码理论的三维心血管图像心血管中轴线的匹配方法,即首先将二维的Freeman编码拓展至三维空间,然后将其用于对已获取的三维心血管进行编码,以便于实现对不同时刻的三维心血管图像心血管中轴线的匹配。另外,还对模式识别中链码的串匹配算法作了一个简要介绍,并讨论了其中的编码、代价函数、归一化的链间距离等难点。为了验证该算法的效果,还选择了两种构造替换代价函数的方法对三维心血管进行了实验,并利用标准公式对实验结果进行了评估。实验结果表明,利用两种代价函数都可以实现图像的匹配,但是匹配的程度有较大差异,其中利用第2种代价函数可以得到更加令人满意的匹配结果。  相似文献   

16.
A method is proposed for approximation of the classic edit distance between strings. The method is based on a mapping of strings into vectors belonging to a space with an easily calculable metric. The method preserves the closeness of strings and makes it possible to accelerate the computation of edit distances. The developed q-gram method of approximation of edit distances and its two randomized versions improves the approximation quality in comparison with well-known results. __________ Translated from Kibernetika i Sistemnyi Analiz, No. 4, pp. 18–38, July–August 2007.  相似文献   

17.
We propose a new variant of the bit-parallel NFA of Baeza-Yates and Navarro (BPD) for approximate string matching [R. Baeza-Yates, G. Navarro, Faster approximate string matching, Algorithmica 23 (1999) 127-158]. BPD is one of the most practical approximate string matching algorithms under moderate pattern lengths and error levels [G. Myers, A fast bit-vector algorithm for approximate string matching based on dynamic programming, J. ACM 46 (3) 1989 395-415; G. Navarro, M. Raffinot, Flexible Pattern Matching in Strings—Practical On-line Search Algorithms for Texts and Biological Sequences, Cambridge University Press, Cambridge, UK, 2002]. Given a length-m pattern and an error threshold k, the original BPD requires (mk)(k+2) bits of space to represent an NFA with (mk)(k+1) states. In this paper we remove redundancy from the original NFA representation. Our variant requires (mk)(k+1) bits of space, which is optimal in the sense that exactly one bit per state is used. The space efficiency is achieved by using an alternative, but equally or even more efficient, simulation algorithm for the bit-parallel NFA. We also present experimental results to compare our modified NFA against the original BPD and its main competitors. Our new variant is more efficient than the original BPD, and it hence takes over/extends the role of the original BPD as one of the most practical approximate string matching algorithms under moderate values of k and m.  相似文献   

18.
为解决传统均衡算法和匹配算法在低信噪比情况下水声稀疏信道估计准确率低和计算量大的问题,提出一种基于正交字典的全反馈信道函数估计方法。通过信道参数估计公式分析,当训练序列与信道函数分量卷积构成正交字典集时,利用匹配算法便可获得最佳匹配结果,结合全反馈结构将信道参数反馈迭代计算,用数学归纳法证明该结构可进一步提高信道函数估计准确率。仿真结果表明,在低信噪比条件下,相比均衡类算法和匹配类算法,提出的信道估计算法可获得更高的估计准确率,具有更低的时间复杂度。  相似文献   

19.
On the Weighted Mean of a Pair of Strings   总被引:4,自引:1,他引:4  
String matching and string edit distance are fundamental concepts in structural pattern recognition. In this paper, the weighted mean of a pair of strings is introduced. Given two strings, x and y, where d(x, y) is the edit distance of x and y, the weighted mean of x and y is a string z that has edit distances d(x, z) and d(z, y)to x and y, respectively, such that d(x, z) _ d(z, y) = d(x, y). We’ll show formal properties of the weighted mean, describe a procedure for its computation, and give practical examples. Received: 26 October 2000, Received in revised form: 27 April 2001, Accepted: 20 July 2001  相似文献   

20.
允许错误的(汉字)字符串快速检索技术   总被引:2,自引:1,他引:2       下载免费PDF全文
在计算机应用的诸多领域中都会遇到字符串似检索问题。本提出了一种技术。它通过应用搜索状态向量及字符-模式匹配向量,将字符串匹配比较转化简单的整数字位运算,有效地解决了字符/汉字串的相似匹配问题,中也给出了实现算法并分析了算法的复杂性。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号