# Annotated Bibliography for Matching

This has been written up primarily to keep my thoughts in order about a current research topic, namely, the interaction between matching and disclosure limitation. Don't take anything I say here too seriously `:-)`

If you're reading this, you should probably also know: Bill Winkler (mentioned multiple times below) has recently written an extensive survey on record linkage work.

## The statistical literature

• Newcombe et al., "Automatic linkage of vital records", Science 130:954--959, 1959. Available on-line as part of the background material provided with Record Linkage Techniques - 1985: Proceedings of the Workshop on Exact Matching Methodologies, Arlington, VA, May 9-10, 1985. A very early paper describing a linkage application. It outlines, or at least suggests, many of the key ideas in linking personal records: use of Soundex to handle spelling errors; blocking to limit the number of comparisons; multiple blocking rules to increase the number of matches found; estimating the probability of a match using log-odds and independence assumptions; and using self-match probabilities to estimate match probabilities.
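To make the blocking idea concrete, here is a minimal sketch (my own toy code, not Newcombe's method as published) that groups records by a simplified Soundex code of the surname, so that only records within the same block are ever compared:

```python
from collections import defaultdict

def soundex(name):
    """Simplified Soundex: first letter plus up to three digit codes."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    out, prev = [], codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out.append(code)
        if ch not in "hw":          # h/w do not separate equal codes
            prev = code
    return (name[0].upper() + "".join(out) + "000")[:4]

def blocks(records, key=lambda r: r["surname"]):
    """Group records into candidate blocks; compare only within a block."""
    b = defaultdict(list)
    for r in records:
        b[soundex(key(r))].append(r)
    return b
```

Spelling variants like "Cohen" and "Cohn" map to the same code (C500), so they land in the same block and still get compared, while the total number of pairwise comparisons drops dramatically.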
• Fellegi and Sunter, "A theory for record linkage", Journal of the American Statistical Association, 64:1183--1210, 1969. Also available on-line as part of the background material provided with the Record Linkage Techniques - 1985 Workshop. Very clean and influential work on the theory of record linkage. Roughly speaking, record linkage is viewed as classification of pairs as "matched" or "unmatched", based on certain computed features of the pairs. Technically:
• Fellegi-Sunter describe what I would call a generative Bayesian learning approach, where one classifies a pair (a,b) from (A x B) as matched or unmatched based on `Prob(features(a,b)|match)` and `Prob(features(a,b)|nonmatch)`, where `features(a,b)` is some vector of comparison features between records a and b. One implication of this is that every optimal rule for deciding on matches can be obtained by appropriately thresholding the ratio `Prob(features|match)/Prob(features|nonmatch)`.

• Fellegi-Sunter further propose to assume that the individual features in the vector features(a,b) are independent (a sort of Naive-Bayes assumption), for tractability.
• Finally, Fellegi-Sunter propose a way of estimating probabilities of individual match features like "a and b have the same last name and the last name is 'Jones'" based on error models and the frequency of the name 'Jones' in the datasets A, B, and (A intersect B).
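Under the independence assumption, the Fellegi-Sunter ratio decomposes into a sum of per-feature log-odds weights, and the decision rule is just a pair of thresholds on that sum. A minimal sketch (the m/u probabilities and thresholds here are made-up illustrative values, not Fellegi-Sunter's estimates):

```python
import math

def match_score(agree, m, u):
    """Log of Prob(features|match) / Prob(features|nonmatch), assuming
    independent binary agreement features.
    m[i] = P(feature i agrees | match), u[i] = P(feature i agrees | nonmatch)."""
    score = 0.0
    for a, mi, ui in zip(agree, m, u):
        score += math.log(mi / ui) if a else math.log((1 - mi) / (1 - ui))
    return score

def classify(agree, m, u, upper=2.0, lower=-2.0):
    """Threshold the log ratio: above upper is a match, below lower a
    nonmatch, in between goes to clerical review."""
    s = match_score(agree, m, u)
    return "match" if s > upper else "nonmatch" if s < lower else "review"
```

The two thresholds correspond to the false-match and false-nonmatch rates one is willing to tolerate; pairs between them are exactly the "possible matches" sent for clerical review.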
• William E. Winkler and Yves Thibaudeau, An Application of the Fellegi-Sunter Model of Record Linkage to the 1990 U.S. Census, Number RR91/09. This paper extends the basic Fellegi-Sunter model in a couple of ways. Most importantly, Winkler proposes using EM and a latent (hidden) match variable for each pairing (a,b). The variable is one if (a,b) is a match, zero otherwise. One can use EM with the same independence assumptions as before to compute models of Prob(features|match). One can also use it to estimate some of the error parameters in the frequency-based model of Fellegi-Sunter.

Experimentally, this produces extreme values for some match variables, so Winkler uses some ad hoc smoothing tricks along with EM. I have another worry: in classifying pairs as match/nonmatch, individual pairs (a,b1) and (a,b2) are not independent. In fact, their "classes" are highly correlated, since in many cases only one of (a,b1) and (a,b2) can be a match. The EM formulation glosses over this though. For some reason I find this more disturbing than the assumption of match-feature independence.
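To make the latent-variable formulation concrete, here is a small toy version of the two-class EM over binary agreement vectors (my own sketch, with simple clipping standing in for Winkler's smoothing tricks):

```python
import numpy as np

def em_match(X, n_iter=50, pi=0.1, eps=1e-6):
    """EM for a two-class mixture over binary agreement vectors X (n x k).
    Returns per-pair match responsibilities g and the m/u parameters."""
    rng = np.random.default_rng(0)
    k = X.shape[1]
    m = rng.uniform(0.6, 0.9, k)      # P(agree | match), initialized above u
    u = rng.uniform(0.1, 0.4, k)      # P(agree | nonmatch)
    for _ in range(n_iter):
        # E-step: responsibility that each pair is a match
        lm = X @ np.log(m) + (1 - X) @ np.log(1 - m)
        lu = X @ np.log(u) + (1 - X) @ np.log(1 - u)
        num = pi * np.exp(lm)
        g = num / (num + (1 - pi) * np.exp(lu))
        # M-step, clipped to keep parameters away from the 0/1 extremes
        pi = g.mean()
        m = np.clip((g @ X) / g.sum(), eps, 1 - eps)
        u = np.clip(((1 - g) @ X) / (1 - g).sum(), eps, 1 - eps)
    return g, m, u, pi
```

Note that each pair contributes an independent term to the likelihood: the one-to-one correlation between (a,b1) and (a,b2) is nowhere in the model, which is exactly the gloss complained about above.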

Winkler also outlines a string-matching scheme to measure agreement between partially matching names, like "Cohen" and "Cohn". It's not described very completely in this paper, and I won't review it at all, mostly because it seems very ad hoc and it seems that more principled approaches now exist.

Experimentally, the results with these methods are a bit unclear. They suggest that partial string matching is important, that EM requires smoothing, and that frequency-based parameter setting is quite important. Winkler blames the assumption of match-feature independence, and talks about using more general modelling techniques to fix it.

Winkler says that the standard way in practice of matching records is an iterative one, in which Fellegi-Sunter parameters are estimated, then records are reviewed, then parameters are adjusted, and that experts are needed to make it all work. He also notes that estimated match/nonmatch probabilities are usually not accurate enough to be useful.

• William E. Winkler, Improved Decision Rules in the Fellegi-Sunter Model of Record Linkage, Number RR93/12. Extends the RR91/09 paper by
• Using three latent classes for EM, rather than two. Intuitively, the three classes correspond to matches, nonmatches, and nonmatches from the same household. This seemed a bit ad hoc, but a recent paper by Charniak on unsupervised name matching, Unsupervised learning of name structure from coreference data (NAACL-2001), uses a similar model.
• Using linear constraints on the probability model to encourage (force?) EM to find the "right" latent structure (MCECM). The details aren't described but they come from a paper by Meng and Rubin, and apparently it involves constraining the optimization in the "M" step. It looks like Winkler does something fairly simple in the end, perhaps replacing a parameter p[t+1] (the M-step maximized version of p[t]) with a linear combination of p[t+1] and p[t], and using grid search to find a mixing coefficient alpha that obeys the linear constraints.
• Using models for `Prob(features|match)` that don't assume independence. Again, the details aren't presented.
Experimentally, it looks like these tricks help most when the sets of convex constraints and the set of interactions are carefully chosen for the experimental dataset.
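If my reading of the constrained M-step is right, the mechanics could be sketched roughly as follows (this is purely my guess; `constraint_ok` is a hypothetical predicate standing in for the linear constraints):

```python
def constrained_m_step(p_old, p_new, constraint_ok, grid=21):
    """Pick the largest step from the old parameters p_old toward the
    unconstrained M-step estimate p_new that still satisfies the
    constraints, via a simple grid search over the mixing coefficient."""
    for i in range(grid - 1, -1, -1):
        alpha = i / (grid - 1)
        p = [(1 - alpha) * po + alpha * pn for po, pn in zip(p_old, p_new)]
        if constraint_ok(p):
            return p
    return p_old   # no feasible step found; keep the old parameters
```

With alpha = 1 this is the ordinary unconstrained M-step, and with alpha = 0 the parameters don't move at all, so the grid search just backs off as far as the constraints require.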
• William E. Winkler, Methods for Record Linkage and Bayesian Networks. A much more recent paper ("presented last month in an ASA record linkage session organized by Malay Ghosh"), this paper uses three latent classes, more conventional smoothing methods, and a little bit of labeled data to constrain EM, rather than the constraints of RR93/12. Again a bunch of non-independent models are used for match features, but in this case the independence assumptions don't seem to hurt much.

This paper is very much inspired by Kamal Nigam's thesis work on using EM for semi-supervised learning with multinomial naive Bayes.

## String edit based metrics for record linkage

• A. Monge and C. Elkan have a couple of papers that describe using the Smith-Waterman string edit distance metric for generic, domain-independent record linkage:
• "The field-matching problem: algorithm and applications." In The Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, August 1996.
• "An efficient domain-independent algorithm for detecting approximately duplicate database records", In The proceedings of the SIGMOD 1997 workshop on data mining and knowledge discovery, May 1997.
• Ristad and Yianilos, Learning String Edit Distance, IEEE Trans. PAMI, 20(5):522-532, 1998. (There's also a conference-length version from 1994.) Presents an EM method for learning weights for non-affine edit-distance metrics, given pairs of matched strings.
• Bilenko and Mooney, Learning to Combine Trained Distance Metrics for Duplicate Detection in Databases, Technical Report AI 02-296, Artificial Intelligence Lab, University of Texas at Austin, February 2002.

Extends the Ristad-Yianilos algorithm for affine edit-distance metrics. Affine edit distances allow the cost of a sequence of N insertions (or deletions) to be different from N times the cost of a single insertion (or deletion).
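A minimal sketch of affine-gap edit distance in the style of Gotoh's three-matrix dynamic program (the costs here are arbitrary illustrative values, not Bilenko-Mooney's learned weights):

```python
import math

def affine_edit_distance(s, t, sub=1.0, gap_open=2.0, gap_extend=0.5):
    """Global alignment cost where a run of N insertions (or deletions)
    costs gap_open + (N - 1) * gap_extend, not N times a unit cost."""
    n, m = len(s), len(t)
    INF = math.inf
    # M: last move was a match/substitution; D: gap in t; I: gap in s
    M = [[INF] * (m + 1) for _ in range(n + 1)]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    I = [[INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        D[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        I[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = 0.0 if s[i - 1] == t[j - 1] else sub
            M[i][j] = c + min(M[i-1][j-1], D[i-1][j-1], I[i-1][j-1])
            D[i][j] = min(M[i-1][j] + gap_open, D[i-1][j] + gap_extend,
                          I[i-1][j] + gap_open)
            I[i][j] = min(M[i][j-1] + gap_open, I[i][j-1] + gap_extend,
                          D[i][j-1] + gap_open)
    return min(M[n][m], D[n][m], I[n][m])
```

With these costs, "Cohen" vs "Cohn" is a single one-character gap (cost 2.0), while a three-character gap costs 3.0 rather than 6.0, which is the point of the affine parameterization.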

A nice description of various edit distance algorithms, together with some probabilistic versions of these algorithms based on HMMs, can be found in the book Biological Sequence Analysis : Probabilistic Models of Proteins and Nucleic Acids by Richard Durbin, Sean R. Eddy, Anders Krogh, and Graeme Mitchison.