TY - GEN
T1 - Factor automata of automata and applications
AU - Mohri, Mehryar
AU - Moreno, Pedro
AU - Weinstein, Eugene
N1 - Copyright:
Copyright 2020 Elsevier B.V., All rights reserved.
PY - 2007
Y1 - 2007
N2 - An efficient data structure for representing the full index of a set of strings is the factor automaton, the minimal deterministic automaton representing the set of all factors or substrings of these strings. This paper presents a novel analysis of the size of the factor automaton of an automaton, that is the minimal deterministic automaton accepting the set of factors of a finite set of strings, itself represented by a finite automaton. It shows that the factor automaton of a set of strings U has at most 2|Q| - 2 states, where Q is the number of nodes of a prefix-tree representing the strings in U, a bound that significantly improves over 2||U|| - 1, the bound given by Blumer et al. (1987), where ||U|| is the sum of the lengths of all strings in U. It also gives novel and general bounds for the size of the factor automaton of an automaton as a function of the size of the original automaton and the maximal length of a suffix shared by the strings it accepts. Our analysis suggests that the use of factor automata of automata can be practical for large-scale applications, a fact that is further supported by the results of our experiments applying factor automata to a music identification task with more than 15,000 songs.
AB - An efficient data structure for representing the full index of a set of strings is the factor automaton, the minimal deterministic automaton representing the set of all factors or substrings of these strings. This paper presents a novel analysis of the size of the factor automaton of an automaton, that is the minimal deterministic automaton accepting the set of factors of a finite set of strings, itself represented by a finite automaton. It shows that the factor automaton of a set of strings U has at most 2|Q| - 2 states, where Q is the number of nodes of a prefix-tree representing the strings in U, a bound that significantly improves over 2||U|| - 1, the bound given by Blumer et al. (1987), where ||U|| is the sum of the lengths of all strings in U. It also gives novel and general bounds for the size of the factor automaton of an automaton as a function of the size of the original automaton and the maximal length of a suffix shared by the strings it accepts. Our analysis suggests that the use of factor automata of automata can be practical for large-scale applications, a fact that is further supported by the results of our experiments applying factor automata to a music identification task with more than 15,000 songs.
KW - Factor automata
KW - Finite automata
KW - Information retrieval
KW - Inverted files
KW - Music identification
KW - Suffix automata
KW - Suffix trees
KW - Text indexing
UR - http://www.scopus.com/inward/record.url?scp=38149108437&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=38149108437&partnerID=8YFLogxK
U2 - 10.1007/978-3-540-76336-9_17
DO - 10.1007/978-3-540-76336-9_17
M3 - Conference contribution
AN - SCOPUS:38149108437
SN - 9783540763352
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 168
EP - 179
BT - Implementation and Application of Automata - 12th International Conference, CIAA 2007, Revised Selected Papers
PB - Springer Verlag
T2 - 12th International Conference on Implementation and Application of Automata, CIAA 2007
Y2 - 16 July 2007 through 18 July 2007
ER -