TY - GEN

T1 - Factor automata of automata and applications

AU - Mohri, Mehryar

AU - Moreno, Pedro

AU - Weinstein, Eugene

N1 - Copyright:
Copyright 2020 Elsevier B.V., All rights reserved.

PY - 2007

Y1 - 2007

N2 - An efficient data structure for representing the full index of a set of strings is the factor automaton, the minimal deterministic automaton representing the set of all factors or substrings of these strings. This paper presents a novel analysis of the size of the factor automaton of an automaton, that is the minimal deterministic automaton accepting the set of factors of a finite set of strings, itself represented by a finite automaton. It shows that the factor automaton of a set of strings U has at most 2|Q| - 2 states, where Q is the number of nodes of a prefix-tree representing the strings in U, a bound that significantly improves over 2||U|| - 1, the bound given by Blumer et al. (1987), where ||U|| is the sum of the lengths of all strings in U. It also gives novel and general bounds for the size of the factor automaton of an automaton as a function of the size of the original automaton and the maximal length of a suffix shared by the strings it accepts. Our analysis suggests that the use of factor automata of automata can be practical for large-scale applications, a fact that is further supported by the results of our experiments applying factor automata to a music identification task with more than 15,000 songs.

AB - An efficient data structure for representing the full index of a set of strings is the factor automaton, the minimal deterministic automaton representing the set of all factors or substrings of these strings. This paper presents a novel analysis of the size of the factor automaton of an automaton, that is the minimal deterministic automaton accepting the set of factors of a finite set of strings, itself represented by a finite automaton. It shows that the factor automaton of a set of strings U has at most 2|Q| - 2 states, where Q is the number of nodes of a prefix-tree representing the strings in U, a bound that significantly improves over 2||U|| - 1, the bound given by Blumer et al. (1987), where ||U|| is the sum of the lengths of all strings in U. It also gives novel and general bounds for the size of the factor automaton of an automaton as a function of the size of the original automaton and the maximal length of a suffix shared by the strings it accepts. Our analysis suggests that the use of factor automata of automata can be practical for large-scale applications, a fact that is further supported by the results of our experiments applying factor automata to a music identification task with more than 15,000 songs.

KW - Factor automata

KW - Finite automata

KW - Information retrieval

KW - Inverted files

KW - Music identification

KW - Suffix automata

KW - Suffix trees

KW - Text indexing

UR - http://www.scopus.com/inward/record.url?scp=38149108437&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=38149108437&partnerID=8YFLogxK

U2 - 10.1007/978-3-540-76336-9_17

DO - 10.1007/978-3-540-76336-9_17

M3 - Conference contribution

AN - SCOPUS:38149108437

SN - 9783540763352

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 168

EP - 179

BT - Implementation and Application of Automata - 12th International Conference, CIAA 2007, Revised Selected Papers

PB - Springer Verlag

T2 - 12th International Conference on Implementation and Application of Automata, CIAA 2007

Y2 - 16 July 2007 through 18 July 2007

ER -