New Techniques for Regular Expression Searching |
| |
Authors: | Gonzalo Navarro Mathieu Raffinot |
| |
Affiliation: | (1) Department of Computer Science, University of Chile, Blanco Encalada 2120, Santiago, Chile;(2) Equipe Génome et Informatique, Tour Evry 2, 523 place des terrasses de l’Agora, 91034 Evry, France |
| |
Abstract: | We present two new techniques for regular expression searching and use them to
derive faster practical algorithms.
Based on the specific properties of Glushkov’s nondeterministic finite automaton
construction algorithm, we show how to encode a deterministic finite
automaton (DFA) using O(m2m) bits, where m is the number of characters,
excluding operator symbols, in the regular expression. This compares favorably
against the worst case of O(m2m|Σ|) bits needed by a classical DFA
representation (where Σ is the alphabet) and O(m22m) bits needed
by the Wu and Manber approach implemented in Agrep.
We also present a new way to search for regular expressions, which is
able to skip text characters. The idea is to determine the minimum
length ℓ of a string matching the regular expression,
manipulate the original automaton so that it recognizes all the
reverse prefixes of length up to ℓ of the strings originally accepted,
and use it to skip text characters as done for exact string matching
in previous work.
We combine these techniques into two algorithms, one able and one unable to
skip text characters. The algorithms are simple to implement, and our
experiments show that they permit fast searching for regular expressions,
normally faster than any existing algorithm. |
| |
Keywords: | Glushkov automaton Compact DFA representation Bit-parallelism BDM Reverse factors |
本文献已被 SpringerLink 等数据库收录! |
|