TY - GEN
T1 - Approximate nearest neighbors and sequence comparison with block operations
AU - Muthukrishnan, S.
AU - Sahinalp, Süleyman Cenk
PY - 2000
Y1 - 2000
N2 - We study sequence nearest neighbors (SNN). Let D be a database of n sequences; we would like to preprocess D so that given any on-line query sequence Q we can quickly find a sequence S in D for which d(S, Q) ≤ d(S, T) for any other sequence T in D. Here d(S, Q) denotes the distance between sequences S and Q, defined to be the minimum number of edit operations needed to transform one to another (all edit operations will be reversible so that d(S, T) = d(T, S) for any two sequences T and S). These operations correspond to the notion of similarity between sequences that we wish to capture in a given application. Natural edit operations include character edits (inserts, replacements, deletes etc), block edits (moves, copies, deletes, reversals) and block numerical transformations (scaling by an additive or a multiplicative constant). The SNN problem arises in many applications. We present the first known efficient algorithm for "approximate" nearest neighbor search for sequences with preprocessing time and space polynomial in size of D and query time near-linear in size of Q. We assume the distance d(S, T) between two sequences S and T is the minimum number of character edits and block operations needed to transform one to the other. The approximation factor we achieve is O(log ℓ(log* ℓ)2), where ℓ is the size of the longest sequence in D. In addition, we also give an algorithm for exactly computing the distance between two sequences when edit operations of the type character replacements and block reversals are allowed. The time and space requirements of the algorithm is near linear; previously known approaches take at least quadratic time.
AB - We study sequence nearest neighbors (SNN). Let D be a database of n sequences; we would like to preprocess D so that given any on-line query sequence Q we can quickly find a sequence S in D for which d(S, Q) ≤ d(S, T) for any other sequence T in D. Here d(S, Q) denotes the distance between sequences S and Q, defined to be the minimum number of edit operations needed to transform one to another (all edit operations will be reversible so that d(S, T) = d(T, S) for any two sequences T and S). These operations correspond to the notion of similarity between sequences that we wish to capture in a given application. Natural edit operations include character edits (inserts, replacements, deletes etc), block edits (moves, copies, deletes, reversals) and block numerical transformations (scaling by an additive or a multiplicative constant). The SNN problem arises in many applications. We present the first known efficient algorithm for "approximate" nearest neighbor search for sequences with preprocessing time and space polynomial in size of D and query time near-linear in size of Q. We assume the distance d(S, T) between two sequences S and T is the minimum number of character edits and block operations needed to transform one to the other. The approximation factor we achieve is O(log ℓ(log* ℓ)2), where ℓ is the size of the longest sequence in D. In addition, we also give an algorithm for exactly computing the distance between two sequences when edit operations of the type character replacements and block reversals are allowed. The time and space requirements of the algorithm is near linear; previously known approaches take at least quadratic time.
UR - http://www.scopus.com/inward/record.url?scp=0033705069&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0033705069&partnerID=8YFLogxK
U2 - 10.1145/335305.335353
DO - 10.1145/335305.335353
M3 - Conference contribution
AN - SCOPUS:0033705069
SN - 1581131844
SN - 9781581131840
T3 - Proceedings of the Annual ACM Symposium on Theory of Computing
SP - 416
EP - 424
BT - Proceedings of the 32nd Annual ACM Symposium on Theory of Computing, STOC 2000
T2 - 32nd Annual ACM Symposium on Theory of Computing, STOC 2000
Y2 - 21 May 2000 through 23 May 2000
ER -