Adaptive Fuzzy String Matching: How to Merge Datasets with only One (Messy) Identifying Field

Aaron R. Kaufman, Aja Klevs

Research output: Contribution to journal › Article › peer-review

Abstract

A single dataset is rarely sufficient to address a question of substantive interest. Instead, most applied data analysis combines data from multiple sources. Very rarely do two datasets share identifiers on which to merge them; fields like name, address, and phone number may be entered incorrectly, missing, or in dissimilar formats. Combining multiple datasets absent a unique identifier that unambiguously connects entries is called the record linkage problem. While recent work has made great progress in the case where there are many possible fields on which to match, the much more uncertain case of only one identifying field remains unsolved: this fuzzy string matching problem, both a problem in its own right and a component of standard record linkage problems, is our focus. We design and validate an algorithmic solution called Adaptive Fuzzy String Matching rooted in adaptive learning, and show that our tool identifies more matches, with higher precision, than existing solutions. Finally, we illustrate its validity and practical value through applications to matching organizations, places, and individuals.
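To make the single-field problem concrete, here is a minimal sketch of a fixed-threshold fuzzy merge. It is not the Adaptive Fuzzy String Matching algorithm described in the abstract; it is a generic baseline built on `difflib.SequenceMatcher` from Python's standard library. The function names (`normalize`, `similarity`, `fuzzy_merge`), the default threshold of 0.8, and the sample organization names are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase the string and replace punctuation with single spaces."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    return " ".join(cleaned.split())


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def fuzzy_merge(left, right, threshold=0.8):
    """Map each string in `left` to its most similar string in `right`.

    A pair is accepted only if its similarity reaches `threshold`
    (a hand-picked, hypothetical cutoff); otherwise the value is None.
    """
    matches = {}
    for a in left:
        best = max(right, key=lambda b: similarity(a, b), default=None)
        if best is not None and similarity(a, best) >= threshold:
            matches[a] = best
        else:
            matches[a] = None
    return matches


if __name__ == "__main__":
    # Hypothetical organization names from two datasets.
    left = ["Acme Corp.", "Globex Corporation", "Initech"]
    right = ["ACME Corp", "Globex Corp", "Umbrella Inc"]
    for name, match in fuzzy_merge(left, right).items():
        print(f"{name!r} -> {match!r}")
```

In this sketch the cutoff is fixed in advance, so a pair such as "Globex Corporation" and "Globex Corp" is accepted or rejected purely according to where the threshold happens to sit. That sensitivity is one reason matching on a single messy field is hard.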

Original language: English (US)
Journal: Political Analysis
State: Accepted/In press - 2021

Keywords

  • Adaptive learning
  • Record linkage

ASJC Scopus subject areas

  • Sociology and Political Science
  • Political Science and International Relations
