TY - GEN
T1 - Approximate string joins in a database (almost) for free
AU - Gravano, Luis
AU - Ipeirotis, Panagiotis G.
AU - Jagadish, H. V.
AU - Koudas, Nick
AU - Muthukrishnan, S.
AU - Srivastava, Divesh
PY - 2001
Y1 - 2001
N2 - String data is ubiquitous, and its management has taken on particular importance in the past few years. Approximate queries are very important on string data especially for more complex queries involving joins. This is due, for example, to the prevalence of typographical errors in data, and multiple conventions for recording attributes such as name and address. Commercial databases do not support approximate string joins directly, and it is a challenge to implement this functionality efficiently with user-defined functions (UDFs). In this paper, we develop a technique for building approximate string join capabilities on top of commercial databases by exploiting facilities already available in them. At the core, our technique relies on matching short substrings of length q, called q-grams, and taking into account both positions of individual matches and the total number of such matches. Our approach applies to both approximate full string matching and approximate substring matching, with a variety of possible edit distance functions. The approximate string match predicate, with a suitable edit distance threshold, can be mapped into a vanilla relational expression and optimized by conventional relational optimizers. We demonstrate experimentally the benefits of our technique over the direct use of UDFs, using commercial database systems and real data. To study the I/O and CPU behavior of approximate string join algorithms with variations in edit distance and q-gram length, we also describe detailed experiments based on a prototype implementation.
AB - String data is ubiquitous, and its management has taken on particular importance in the past few years. Approximate queries are very important on string data especially for more complex queries involving joins. This is due, for example, to the prevalence of typographical errors in data, and multiple conventions for recording attributes such as name and address. Commercial databases do not support approximate string joins directly, and it is a challenge to implement this functionality efficiently with user-defined functions (UDFs). In this paper, we develop a technique for building approximate string join capabilities on top of commercial databases by exploiting facilities already available in them. At the core, our technique relies on matching short substrings of length q, called q-grams, and taking into account both positions of individual matches and the total number of such matches. Our approach applies to both approximate full string matching and approximate substring matching, with a variety of possible edit distance functions. The approximate string match predicate, with a suitable edit distance threshold, can be mapped into a vanilla relational expression and optimized by conventional relational optimizers. We demonstrate experimentally the benefits of our technique over the direct use of UDFs, using commercial database systems and real data. To study the I/O and CPU behavior of approximate string join algorithms with variations in edit distance and q-gram length, we also describe detailed experiments based on a prototype implementation.
UR - http://www.scopus.com/inward/record.url?scp=84944318804&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84944318804&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:84944318804
T3 - VLDB 2001 - Proceedings of 27th International Conference on Very Large Data Bases
SP - 491
EP - 500
BT - VLDB 2001 - Proceedings of 27th International Conference on Very Large Data Bases
A2 - Apers, Peter M. G.
A2 - Atzeni, Paolo
A2 - Snodgrass, Richard T.
A2 - Ceri, Stefano
A2 - Ramamohanarao, Kotagiri
A2 - Paraboschi, Stefano
PB - Morgan Kaufmann
T2 - 27th International Conference on Very Large Data Bases, VLDB 2001
Y2 - 11 September 2001 through 14 September 2001
ER -