People-Centric Natural Language Processing

David Bamman
School of Computer Science
Carnegie Mellon University

Thesis (draft): [ pdf ]

Comments welcome! dbamman@cs.cmu.edu

Abstract

In this thesis, I advocate for a model of text analysis that focuses on people, leveraging ideas from machine learning, the humanities and the social sciences. People intersect with text in multiple ways: they are its authors, its audience, and often the subjects of its content. I argue that developing computational models that capture the complexity of people's interaction with text will yield deeper, socio-culturally relevant descriptions of these actors, and that these deeper representations will open the door to new NLP and machine learning applications that have a more useful understanding of the world.

I explore this perspective by designing, implementing and evaluating computational models of three kinds: (a) unsupervised models of personas, which capture patterns of identity and behavior in the description of people as the content of text; (b) unsupervised models of author variation, which capture patterns in how latent and observed qualities of the author influence the text we see; and (c) models of audience variation, which capture patterns in how variation in the audience can influence the text we see. Each of these research fronts captures one dimension of how people interact with each other as mediated through text. Together, these three axes define a coordinate system for investigating written language in its socially embedded context. At a large scale, this thesis illustrates how organizing data around people and reasoning about the subtleties of their interaction with text can both generate new social insight and improve performance on practical tasks.

Thesis committee

Noah Smith (chair), Carnegie Mellon University
Justine Cassell, Carnegie Mellon University
Tom Mitchell, Carnegie Mellon University
Jacob Eisenstein, Georgia Institute of Technology
Ted Underwood, University of Illinois