Unsupervised Discovery of Biographical Structure in Text

David Bamman and Noah Smith. TACL 2014.



In chronicling the events in a set of individuals' lives, encyclopedic biographies — from Plutarch's Parallel Lives to Wikipedia — provide an extraordinary amount of information detailing how the lives of the historically famous unfold. The life events described in these texts have natural structure: events exhibit correlations with each other (e.g., those who divorce must have been married), can occur at roughly similar times in the lives of different individuals (marriage is more likely to occur earlier in one's life than later), and can be bound to historical moments as well ("fights in World War II" peaks in the early 1940s). While social scientists have long been interested in the structure of these events in investigating the role that individual agency and larger social forces play in shaping the course of an individual's life, the data on which these studies draw has largely been restricted to categorical surveys and observational data; we present here a latent-variable model that exploits the correlations of event descriptions in biographies to learn the structure of abstract events, grounded in time, from text alone.

At the same time, the subjects of biographies are not a random sample of the population, nor are their contents unbiased representations. Nearly all encyclopedias necessarily prefer the historically notable (if for no other reason than inherent biases in the preservation of historical records); many, like Wikipedia, also have disproportionately low coverage of women, minorities and other demographic groups. The abstract event classes that we learn with our model allow us to perform a large-scale analysis of the content of 242,970 Wikipedia biographies. Though it is known that women are greatly underrepresented on Wikipedia, both as editors (Wikipedia 2011, Hill and Shaw 2013) and as subjects of articles (Reagle and Rhue 2011), we find that there is a bias in their characterization as well: biographies of women place significantly more emphasis on events of marriage and divorce than biographies of men.



  • 2.3M.wiki.events.txt.gz [192M]. Wikipedia event data, as used in the experiments: 2,313,867 events from 242,970 people, each born after 1800 and with a biography containing at least 5 events. Only events that contain at least 1 term from the 10,000 most frequent words and multiword expressions are retained.
  • wiki.genders.txt [26M]. Inferred gender for 862,171 people on Wikipedia.
  • wiki.people.dates.txt [30M]. Inferred dates of birth (and death, when available) for 927,404 people on Wikipedia.
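
The selection criteria described for the event data can be sketched as a simple filter. This is an illustrative sketch only, not the authors' pipeline: the `Person` record, the `filter_people` function, and the in-memory layout are hypothetical and do not reflect the format of the released files. It also assumes that the vocabulary filter is applied to events before the 5-event minimum is checked, which the description above does not specify.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Person:
    """Hypothetical in-memory record; NOT the format of the released files."""
    name: str
    birth_year: Optional[int]
    events: List[List[str]]  # each event is a list of terms


def filter_people(
    people: List[Person],
    vocabulary: Set[str],
    min_birth_year: int = 1800,
    min_events: int = 5,
) -> List[Person]:
    """Keep people born after `min_birth_year` with at least `min_events`
    events, where an event counts only if it has a term in `vocabulary`."""
    kept = []
    for person in people:
        # Drop people with an unknown birth year or born in/before the cutoff.
        if person.birth_year is None or person.birth_year <= min_birth_year:
            continue
        # Retain only events containing at least one in-vocabulary term.
        events = [e for e in person.events if any(t in vocabulary for t in e)]
        # Assumption: the event minimum is checked after vocabulary filtering.
        if len(events) >= min_events:
            kept.append(Person(person.name, person.birth_year, events))
    return kept
```

In practice `vocabulary` would hold the 10,000 most frequent words and multiword expressions; how that list is computed is not covered here.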

Further Reading

Please cite the following paper when using these resources in research.

David Bamman and Noah Smith (2014). Unsupervised Discovery of Biographical Structure in Text. TACL.

The research reported in this article was supported by U.S. National Science Foundation grant CAREER IIS-1054319 to N.A.S. and Google’s support of the Reading is Believing project at CMU.