TY - JOUR
T1 - General suffix automaton construction algorithm and space bounds
AU - Mohri, Mehryar
AU - Moreno, Pedro
AU - Weinstein, Eugene
N1 - Funding Information:
We thank Cyril Allauzen for several discussions about the material presented. The research of Mehryar Mohri and Eugene Weinstein was partially supported by the New York State Office of Science Technology and Academic Research (NYSTAR). This project was also sponsored in part by the Department of the Army Award Number W81XWH-04-1-0307. The US Army Medical Research Acquisition Activity, 820 Chandler Street, Fort Detrick MD 21702-5014 is the awarding and administering acquisition office. The content of this material does not necessarily reflect the position or the policy of the Government and no official endorsement should be inferred.
PY - 2009/9/1
Y1 - 2009/9/1
N2 - Suffix automata and factor automata are efficient data structures for representing the full index of a set of strings. They are minimal deterministic automata representing the set of all suffixes or substrings of a set of strings. This paper presents a novel analysis of the size of the suffix automaton or factor automaton of a set of strings. It shows that the suffix automaton or factor automaton of a set of strings U has at most 2 Q - 2 states, where Q is the number of nodes of a prefix-tree representing the strings in U. This bound significantly improves over 2 {norm of matrix} U {norm of matrix} - 1, the bound given by Blumer et al. [A. Blumer, J. Blumer, D. Haussler, R.M. McConnell, A. Ehrenfeucht, Complete inverted files for efficient text retrieval and analysis, Journal of the ACM 34 (1987) 578-589], where {norm of matrix} U {norm of matrix} is the sum of the lengths of all strings in U. More generally, we give novel and general bounds for the size of the suffix or factor automaton of an automaton as a function of the size of the original automaton and the maximal length of a suffix shared by the strings it accepts. We also describe in detail a linear-time algorithm for constructing the suffix automaton S or factor automaton F of U in time O (| S |). Our algorithm applies in fact to any input suffix-unique automaton and strictly generalizes the standard on-line construction of a suffix automaton for a single input string. Our algorithm can also be used straightforwardly to generate the suffix oracle or factor oracle of a set of strings, which has been shown to have various useful properties in string-matching. Our analysis suggests that the use of factor automata of automata can be practical for large-scale applications, a fact that is further supported by the results of our experiments applying factor automata to a music identification task with more than 15,000 songs.
AB - Suffix automata and factor automata are efficient data structures for representing the full index of a set of strings. They are minimal deterministic automata representing the set of all suffixes or substrings of a set of strings. This paper presents a novel analysis of the size of the suffix automaton or factor automaton of a set of strings. It shows that the suffix automaton or factor automaton of a set of strings U has at most 2 Q - 2 states, where Q is the number of nodes of a prefix-tree representing the strings in U. This bound significantly improves over 2 {norm of matrix} U {norm of matrix} - 1, the bound given by Blumer et al. [A. Blumer, J. Blumer, D. Haussler, R.M. McConnell, A. Ehrenfeucht, Complete inverted files for efficient text retrieval and analysis, Journal of the ACM 34 (1987) 578-589], where {norm of matrix} U {norm of matrix} is the sum of the lengths of all strings in U. More generally, we give novel and general bounds for the size of the suffix or factor automaton of an automaton as a function of the size of the original automaton and the maximal length of a suffix shared by the strings it accepts. We also describe in detail a linear-time algorithm for constructing the suffix automaton S or factor automaton F of U in time O (| S |). Our algorithm applies in fact to any input suffix-unique automaton and strictly generalizes the standard on-line construction of a suffix automaton for a single input string. Our algorithm can also be used straightforwardly to generate the suffix oracle or factor oracle of a set of strings, which has been shown to have various useful properties in string-matching. Our analysis suggests that the use of factor automata of automata can be practical for large-scale applications, a fact that is further supported by the results of our experiments applying factor automata to a music identification task with more than 15,000 songs.
KW - Factor automata
KW - Finite automata
KW - Indexing
KW - Inverted text
KW - Music identification
KW - Pattern-matching
KW - String-matching
KW - Suffix automata
KW - Suffix trees
UR - http://www.scopus.com/inward/record.url?scp=67651095813&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=67651095813&partnerID=8YFLogxK
U2 - 10.1016/j.tcs.2009.03.034
DO - 10.1016/j.tcs.2009.03.034
M3 - Article
AN - SCOPUS:67651095813
SN - 0304-3975
VL - 410
SP - 3553
EP - 3562
JO - Theoretical Computer Science
JF - Theoretical Computer Science
IS - 37
ER -