首页 | 本学科首页   官方微博 | 高级检索  
     


General suffix automaton construction algorithm and space bounds
Authors:Mehryar Mohri  Pedro Moreno  Eugene Weinstein
Affiliation:1. Courant Institute of Mathematical Sciences, 251 Mercer Street, New York, NY 10012, United States;2. Google Research, 76 Ninth Avenue, New York, NY 10011, United States
Abstract:Suffix automata and factor automata are efficient data structures for representing the full index of a set of strings. They are minimal deterministic automata representing the set of all suffixes or substrings of a set of strings. This paper presents a novel analysis of the size of the suffix automaton or factor automaton of a set of strings. It shows that the suffix automaton or factor automaton of a set of strings UU has at most 2Q−22Q2 states, where QQ is the number of nodes of a prefix-tree representing the strings in UU. This bound significantly improves over 2‖U‖−12U1, the bound given by Blumer et al. A. Blumer, J. Blumer, D. Haussler, R.M. McConnell, A. Ehrenfeucht, Complete inverted files for efficient text retrieval and analysis, Journal of the ACM 34 (1987) 578–589], where ‖U‖U is the sum of the lengths of all strings in UU. More generally, we give novel and general bounds for the size of the suffix or factor automaton of an automaton as a function of the size of the original automaton and the maximal length of a suffix shared by the strings it accepts. We also describe in detail a linear-time algorithm for constructing the suffix automaton SS or factor automaton FF of UU in time O(|S|)O(|S|). Our algorithm applies in fact to any input suffix-unique automaton and strictly generalizes the standard on-line construction of a suffix automaton for a single input string. Our algorithm can also be used straightforwardly to generate the suffix oracle or factor oracle of a set of strings, which has been shown to have various useful properties in string-matching. Our analysis suggests that the use of factor automata of automata can be practical for large-scale applications, a fact that is further supported by the results of our experiments applying factor automata to a music identification task with more than 15,000 songs.
Keywords:String-matching  Pattern-matching  Indexing  Inverted text  Finite automata  Suffix trees  Suffix automata  Factor automata  Music identification
本文献已被 ScienceDirect 等数据库收录!
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号