Abstract
The edit-distance of two strings is the minimal cost of a sequence of symbol insertions, deletions, or substitutions transforming one string into the other. The definition is used in various contexts to give a measure of the difference or similarity between two strings. This definition can be extended to measure the similarity between two sets of strings. In particular, when these sets are represented by automata, their edit-distance can be computed using the general algorithm of composition of weighted transducers combined with a single-source shortest-paths algorithm. More generally, in some applications such as speech recognition and computational biology, the strings may represent a range of alternative hypotheses with associated probabilities. Thus, we introduce the definition of the edit-distance of two distributions of strings given by two weighted automata. We show that general weighted automata algorithms over the appropriate semirings can be used to compute the edit-distance of two weighted automata exactly. The algorithm for computing exactly the edit-distance of weighted automata can be used to improve the word accuracy of automatic speech recognition systems. More generally, the algorithm can be extended to provide an edit-distance automaton useful for rescoring and other post-processing purposes in the context of large-vocabulary speech recognition. In the course of the presentation of our algorithm, we also introduce a new and general synchronization algorithm for weighted transducers which, combined with ∈-removal, can be used to normalize weighted transducers with bounded delays.
Original language | English (US) |
---|---|
Pages (from-to) | 1-23 |
Number of pages | 23 |
Journal | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
Volume | 2608 |
DOIs | |
State | Published - 2003 |
ASJC Scopus subject areas
- Theoretical Computer Science
- General Computer Science