首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 109 毫秒
1.
近似字符串匹配是模式匹配研究领域中的一个重要研究方向。压缩后缀数组是字符串匹配、数据压缩等领域广泛使用的索引结构,具有检索速度快和适用广泛的优点。利用压缩后缀数组,提出了适合近似字符串匹配搜索算法的数据结构,并在此基础上提出了一种匹配搜索算法。实验结果表明,相对于现有的算法,提出的算法在小字母表的情况下具有计算优势。  相似文献   

2.
This paper proposes new algorithms for fixed-length approximate string matching and approximate circular string matching under the Hamming distance. Fixed-length approximate string matching and approximate circular string matching are special cases of approximate string matching and have numerous direct applications in bioinformatics and text searching. Firstly, a counter-vector-mismatches (CVM) algorithm is proposed to solve fixed-length approximate string matching with k-mismatches. The development of CVM algorithm is based on the parallel summation of counters located in the same machine word. Secondly, a parallel counter-vector-mismatches (PCVM) algorithm is proposed to accelerate CVM algorithm in parallel. The PCVM algorithm is integrated into two-level parallelisms that exploit not only word-level parallelism but also data parallelism via parallel environments such as multi-core processors and graphics processing units (GPUs). In the particular case of adopting GPUs, a shared-mem parallel counter-vector-mismatches (PCVMsmem) scheme can be implemented from PCVM algorithm. The PCVMsmem scheme can exploit the memory model of GPUs to optimize performance of PCVM algorithm. Finally, this paper shows several methods to adopt CVM and PCVM algorithms in case the input pattern is in circular structure. In the experiments with real DNA packages, our proposed algorithms and scheme work greatly faster than previous bit-vector-mismatches and parallel bit-vector-mismatches algorithms.  相似文献   

3.
伴随大数据时代的到来,数据快速保序匹配与检索成为众多大数据应用急需解决的关键问题,通过抽象与归约等措施,数据对象可抽象为具有若干属性的点集或序列,从而将数据匹配问题转化为字符或数字序列匹配问题。提出一种基于相似度过滤的数据保序匹配与检索算法,算法分三步:(1)数据转换,基于幅值变化趋势将原始序列转换为二进制,对序列中任何一个字符,通过判断包括其前后邻居在内的三个点的关系定义二进制序列,准确反映相邻三点之间的凸增长(降低)或凹增长(降低)关系;(2)数据归约,为方便候选序列与模式序列之间的相似度计算,运用基于幅度变化比例的数据归约方法,将候选序列与模式序列均归约到固定区间;(3)相似度计算,为区分不同趋势的凸增长(降低)或凹增长(降低)幅度,通过计算候选序列与模式序列对应点之间的差值绝对值之和作为相似度判断依据,提出基于相似度过滤的快速匹配方法,寻找与模式序列变化趋势一致的子序列集合,并按照相似度大小排序。理论分析与实验结果表明:(1)该算法具有亚线性时间复杂度;(2)该算法能有效解决Chhabra等人算法对数据震荡幅度失控的问题,同时解决数据序列与模式序列分段规律但整体不相似的问题;(3)解决了Chhabra等人算法中对匹配序列排序造成的匹配结果疏漏问题。该方法不仅能更准确、更多地匹配出变化趋势一致的子字符串,同时将多个候选子串根据与模式之间的相似度进行排序,为进一步的数据精确检索提供判断依据。  相似文献   

4.
随着网络技术的发展,网络犯罪应运而生。网络钓鱼活动日益加剧,网络钓鱼攻击成为Internet上最主要的网络诈骗方式,对网络安全和电子商务的正常运行构成了极大的威胁。本文通过真实的案例分析,总结了网络钓鱼的手段,并提出了相应的防范对策。  相似文献   

5.
基于匹配区域特征的相似字符串匹配过滤算法孙德才   总被引:1,自引:0,他引:1  
相似字符串匹配过滤算法因其适合大库查找而被广泛应用,为通过提高过滤算法的过滤效率加快匹配速度,提出一种基于匹配区域特征的过滤算法.该算法将模式串和文本串分割成固定长度为kq+1的逻辑块,并从各块中提取了2个新的匹配区域特征:q-gram命中的均匀性和q-gram有效命中的区域性.新算法利用这些新特征优化了传统过滤标准,提高了算法的过滤效率;并改进了QUASAR中基于分块策略的过滤区确定方案.实验结果表明,新算法与改进前相比有效地加快了匹配速度,尤其在误差率较小时改进效果更佳.  相似文献   

6.
The order‐preserving pattern matching problem has gained attention in recent years. It consists in finding all substrings in the text, which have the same length and relative order as the input pattern. Typically, the text and the pattern consist of numbers. Since recent times, there has been a tendency to utilize the ability of the word RAM model to increase the efficiency of string matching algorithms. This model works on computer words, reading and processing blocks of characters at once, so that usual arithmetic and logic operations on words can be performed in one unit of time. In this paper, we present a fast order‐preserving pattern matching algorithm, which uses specialized word‐size packed string matching instructions, grounded on the single instruction multiple data instruction set architecture. We show with experimental results that the new proposed algorithm is more efficient than the previous solutions. ©2016 The Authors. Software: Practice and Experience Published by John Wiley & Sons Ltd.  相似文献   

7.
局部相似自连接能在给定的单个数据集中快速找到所有满足相似要求的记录对,它在数据清洗、基因序列比对和剽窃检测等领域都有广泛的应用。为研究基于单个字符串集的并行自连接算法,提出了一种基于MapReduce框架的自连接算法,解决了局部相似自连接的定位问题。该算法采用了过滤验证二阶段模式;在过滤阶段,采用无关对过滤和冗余对过滤抛弃了大量的无效字符串对;在验证阶段,通过生成小编号串内容保留项解决了字符串编号和内容的快速配对问题。实验结果显示,该算法在大数据集上的自连接速度一直快于当前的优秀算法LS-Join,同时非常适合动态编辑距离参数环境下的局部相似自连接操作。实验结果也证明,该算法中提出的相关技术有效地提高了局部相似自连接的速度。  相似文献   

8.
字符串匹配是生物识别、入侵检测的基础,也是大数据互联网时代的研究热点.随着现代信息技术的发展,日常工作生活中移动及手持小型化设备的使用越发普遍.这些设备的应用场景中包含大量有关串匹配的需求,如人脸识别、实时数据查询等.串匹配算法的实时和准确性决定了使用场景的范围,因此在DSP处理器等移动小型化设备的嵌入式处理器上实现高效串匹配算法的问题变得十分迫切.该文针对DSP处理器因缺乏逻辑判断与跳转指令,难以支持高效串匹配运算的问题,提出了一种基于DSP平台特点的改进串匹配算法.该算法采用位并行的思路,在DSP处理器上实现了串匹配算法的并行化.同时通过前序启动、基于VLIW的数学运算替代逻辑判断、Q-grams等优化手段,提高该算法对于DSP平台的适应性与执行效率,最终实现了一种基于HXDSP的高效串匹配算法VBNDM2.实验结果表明,本算法针对DSP平台,有效地提高了串匹配的效率,实现了算法的高效并行化.  相似文献   

9.
Iconic indexing by 2-d strings   总被引:8,自引:0,他引:8  
In this paper, we describe a new way of representing a symbolic picture by a two-dimensional string. A picture query can also be specified as a 2-D string. The problem of pictorial information retrieval then becomes a problem of 2-D subsequence matching. We present algorithms for encoding a symbolic picture into its 2-D string representation, reconstructing a picture from its 2-D string representation, and matching a 2-D string with another 2-D string. We also prove the necessary and sufficient conditions to characterize ambiguous pictures for reduced 2-D strings as well as normal 2-D strings. This approach thus allows an efficient and natural way to construct iconic indexes for pictures.  相似文献   

10.
一种改进的BM模式匹配算法   总被引:1,自引:0,他引:1       下载免费PDF全文
刘沛骞  冯晶晶 《计算机工程》2011,37(17):248-249
针对BM模式匹配算法的效率问题,提出其改进算法.分析BM模式匹配算法的原理,若文本串中连续的几个字符不在模式字符串中出现,则不需要被比对,以此改变模式字符串的匹配顺序,提高算法的匹配效率.实验结果表明,改进的BM模式匹配算法可以有效地减少字符串的匹配次数和比对次数,能获得良好的字符串匹配效率.  相似文献   

11.
Snort研究及BM算法改进   总被引:1,自引:0,他引:1  
Snort是一个轻型的入侵检测系统,在检测过程中,字符串匹配算法的效率决定了Snort系统的性能.分析了Snort的系统结构和工作流程,对Snort的BM字符匹配算法进行深入研究,提出了BM字符匹配算法的改进方法.实验数据表明,改进的BM字符匹配算法可提高Snort的效率.  相似文献   

12.
在分析了BM模式匹配算法的基础上,提出了一种新的字符串单模式匹配算法,该算法通过对模式中的字符进行等级划分,设置模式中各个字符的优先级,改进模式串的移动方式,减少了模式匹配的次数和字符比较的次数,有效的提高了模式匹配的效率。实验显示,该算法有效的提高了模式匹配的效率。  相似文献   

13.
We propose DMP-tree, a dynamic M-way prefix tree, data structure for the string matching problem in general and prefix matching in particular. DMP-tree has been initially devised for fast and efficiently handling prefix matching which constitutes the building block of some applications in the computer realm and related area. It is assumed there are strings of an alphabet Σ which are ordered. The data strings can have different lengths and some of them can be prefixes of others. Two well known applications of prefix matching are layers 3 and 4 switching in TCP/IP protocols. In layer 3 switching, routers forward an IP packet by checking its destination address and finding the longest matching prefix from a database. In layer 4 switching, the source and destination addresses are used to classify packets for differentiated service and Quality of Services (QoS). DMP-tree is a superset of B-tree. When none of the data strings are a prefix of each other, DMP-tree is the same as B-tree. In DMP-tree, no data string can be in a higher level than another data string which is its prefix. This requires a special procedure for node splitting. Indeed, node splitting differentiates DMP-tree from B-tree. The proposed data structure is simple, well defined, easy to implement in hardware or software and efficient in terms of memory usages and search time compared to other data structures proposed for prefix matching. We have implemented DMP-tree and the experimental results for simulated IP prefixes from the {0, 1} character set show an average search time of for a large number of N, number of data elements, when the internal node branching factor M is big enough (?5).  相似文献   

14.
Information retrieval in document image databases   总被引:2,自引:0,他引:2  
With the rising popularity and importance of document images as an information source, information retrieval in document image databases has become a growing and challenging problem. In this paper, we propose an approach with the capability of matching partial word images to address two issues in document image retrieval: word spotting and similarity measurement between documents. First, each word image is represented by a primitive string. Then, an inexact string matching technique is utilized to measure the similarity between the two primitive strings generated from two word images. Based on the similarity, we can estimate how a word image is relevant to the other and, thereby, decide whether one is a portion of the other. To deal with various character fonts, we use a primitive string which is tolerant to serif and font differences to represent a word image. Using this technique of inexact string matching, our method is able to successfully handle the problem of heavily touching characters. Experimental results on a variety of document image databases confirm the feasibility, validity, and efficiency of our proposed approach in document image retrieval.  相似文献   

15.
Ordered binary decision diagrams (OBDDs) are a very popular graph representation for Boolean functions. They can be viewed as finite automata recognizing sets of strings of a fixed length, where the letters of the input strings are read at most once in a predefined ordering. The string matching problem with string w as pattern, consists of determining, given an input string, whether or not it contains w as substring. We show that for a fraction of orderings tending to 1 when n increases arbitrarily, the minimal size of an OBDD solving the string matching problem for strings of length n has a growth which is an exponential in n.  相似文献   

16.
We propose a fast and memory-efficient algorithm for lexicographically sorting the suffixes of a string, a problem that has important applications in data compression as well as string matching.  相似文献   

17.
To solve, manage and analyze biological problems using computer technology is called bioinformatics. With the emergent evolution in computing era, the volume of biological data has increased significantly. These large amounts of data have increased the need to analyze it in reasonable space and time. DNA sequences contain basic information of species, and pattern matching between different species is an important and challenging issue to cope with. There exist generalized string matching and some specialized DNA pattern matching algorithms in the literature. There is still need to develop fast and space efficient pattern matching algorithms that consider new hardware development. In this paper, we present a novel DNA sequences pattern matching algorithm called EPMA. The proposed algorithm utilizes fixed length 2-bits binary encoding, segmentation and multi-threading. The idea is to find the pattern with multiple searcher agents concurrently. The proposed algorithm is validated with comparative experimental results. The results show that the new algorithm is a good candidate for DNA sequence pattern matching applications. The algorithm effectively utilizes modern hardware and will help researchers in the sequence alignment, short read error correction, phylogenetic inference etc. Furthermore, the proposed method can be extended to generalized string matching and their applications.  相似文献   

18.
在非可信环境下对数据进行加密是保护数据库中数据安全的一种有效方法,但如何对加密数据进行高效地查询是一个难点,引起了研究界的重视.本文提出加密字符数据的一种存储结构,除了加密数据以外,还以加密的方式存储了原始数据的特征值,并基于这种结构实现了对加密数据的两阶段查询方法,通过实验证明其性能较先解密后查询的方法有较大的提高.  相似文献   

19.
A new attributed string matching method for human face profile recognition is proposed in this work. It is a novel idea to apply structural and syntactic technique on face profile matching. The approach works on a chain of profile line segments and highlights the favor of curve matching by suppressing the operations of insert and delete. The technique relies mainly on the merge and change operations of string to tackle the inconsistency problem of feature point detection. A quadratic penalty function is proposed to prohibit large angle changes and overmerging. The method produces very encouraging results and is found to be suitable for similar shape classification.  相似文献   

20.
Vertex-disjoint triangle sets (triangle sets for short) have been studied extensively. Many theoretical and computational results have been obtained. While the maximum triangle set problem can be viewed as the generalization of the maximum matching problem, there seems to be no parallel result to Berge's augmenting path characterization on maximum matching (C. Berge, 1957 [1]). In this paper, we describe a class of structures called triangle string, which turns out to be equivalent to the class of union of two triangle sets in a graph. Based on the concept of triangle string, a sufficient and necessary condition that a triangle set can be augmented is given. Furthermore, we provide an algorithm to determine whether a graph G with maximum degree 4 is a triangle string, and if G is a triangle string, we compute a maximum triangle set of it. Finally, we give a sufficient and necessary condition for a triangle string to have a triangle factor.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号