I'm interested in machine learning problems driven by important real-world challenges. Broadly, my research for the past several years has focused on unsupervised and semi-supervised learning on large, noisy datasets. Specifically, I have developed and deployed methods for learning in non-iid settings and for record linkage (also known as co-reference resolution, entity resolution, or deduplication).
Most recently, I have been researching cross-validation resampling techniques which estimate and correct for the effects of statistical dependency. The independent and identically distributed (iid) assumption is fundamental to the guarantees of most machine learning algorithms. Yet, in practice, it is frequently violated. Data from active learning, time series, and clustering or record-linkage results all break this assumption to varying degrees.
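As a rough illustration of why dependency matters for cross-validation, here is a minimal sketch of a group-aware k-fold split in Python. It is a standard textbook-style device, not the resampling techniques described above: it assumes the dependency structure is already known and given as a group label per record (for example, a record-linkage cluster id), and it simply keeps each group on one side of every train/test split. The function name `group_k_fold` and the toy labels are hypothetical.

```python
from collections import defaultdict


def group_k_fold(groups, k):
    """Yield (train_idx, test_idx) pairs in which no group is split
    between the training and test sides of the same fold."""
    # Collect the record indices belonging to each group label.
    by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        by_group[g].append(idx)
    if k > len(by_group):
        raise ValueError("k cannot exceed the number of distinct groups")

    # Greedy balancing: assign the largest groups first, each to the
    # fold that currently holds the fewest records.
    folds = [[] for _ in range(k)]
    for g in sorted(by_group, key=lambda g: (-len(by_group[g]), str(g))):
        smallest = min(range(k), key=lambda f: len(folds[f]))
        folds[smallest].extend(by_group[g])

    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for j in range(k) if j != f for i in folds[j])
        yield train, test


# Toy example: eight records, four dependent groups (assumed labels).
groups = ["a", "a", "b", "b", "b", "c", "d", "d"]
splits = list(group_k_fold(groups, k=2))
```

With an ordinary random split, records from the same group can land in both the training and the test set, which tends to make held-out error look better than it would be on genuinely new data. Holding out whole groups removes that particular leak, but only when the groups are known in advance.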
I have also researched novel semi-supervised methods and theoretical guarantees for record linkage -- a topic closely related to clustering, yet traditionally a more algorithm-driven field of study. Record linkage problems occur in natural language processing when noun phrases refer to the same person or group (e.g. Mr. President, Obama, he, POTUS), in most databases (e.g. health records, scientific databases, across social networks), and in search engines (e.g. duplicated spam web pages). The results of record linkage are used in several critical applications, including informing policy decisions (e.g. US Census) and counter-terrorism (e.g. suspect communication records at the FBI). From a machine learning standpoint, this is a problem of inferring the latent population and assigning records to the correct latent individuals.
I'm fortunate to apply my work to the counter-human-trafficking project at CMU (think NBC Dateline meets Terminator). We scrape hundreds of millions of online escort ads, extract features, and use machine learning to find cases of human trafficking. The resulting tools (e.g. Marinus Analytics) are used by over 100 law enforcement agencies to make actual arrests and rescue real victims. My PhD advisor for this work is Artur Dubrawski, and the project is part of the DARPA MEMEX initiative. I am also an NSF Fellow.