A Bayesian Mixed Effects Model of Literary Character

David Bamman, Ted Underwood and Noah Smith. ACL 2014.



Our work here focuses on the unsupervised learning of character types in a collection of 15,099 English novels published between 1700 and 1899, falling in the broader tradition of the unsupervised learning of generic entity classes (Collins and Singer 1999, Elsner et al. 2009, Haghighi and Klein 2010, Yao et al. 2011, Chambers et al. 2013, inter alia). Our prior work (Bamman et al. 2013) assumes that character types probabilistically generate all text associated with a character; the model we introduce here rests on different assumptions. We explicitly account for the influence of extra-linguistic metadata (such as author) on word choice (as in Rosenfeld 1996, Eisenstein et al. 2011, Taddy 2013, Rabinovich and Blei 2014), which lets us discount author-specific vocabulary when learning types that cut across multiple authors.
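As a rough illustration of what it means to discount author-specific vocabulary, the sketch below scores each word as the sum of a background term, an author term, and a character-type term, then normalizes the scores with a softmax. The additive log-linear form, the function names, and every number are assumptions made for illustration only; see the paper for the actual model.

```python
import math

def softmax(scores):
    """Turn a dict of word -> score into a dict of word -> probability."""
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {w: math.exp(s - m) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

def word_distribution(background, author_effect, persona_effect):
    """Hypothetical additive model: score = background + author + persona."""
    scores = {
        w: background[w] + author_effect.get(w, 0.0) + persona_effect.get(w, 0.0)
        for w in background
    }
    return softmax(scores)

# Toy weights, invented for this example.
background = {"said": 2.0, "smiled": 0.5, "schemed": -1.0}
author_effect = {"smiled": 1.0}    # a word this author overuses
villain_effect = {"schemed": 2.0}  # a word typical of the character type

with_author = word_distribution(background, author_effect, villain_effect)
without_author = word_distribution(background, {}, villain_effect)
```

Because the author's habits get their own term, the character-type term does not have to explain them, which is the intuition behind learning types that cut across authors.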

Literary perspective

This paper is less an exercise in "distant reading" than an experiment in literary theory.

We set out to model literary character—an interesting task in part because critics have never reached consensus about what a "character" is. On the one hand, characters are discussed as if they were imaginary persons; in this sense of the word, a character is a collection of social or psychological attributes. On the other hand, characters are formal aspects of narrative—E. M. Forster's "word-masses" (Forster, 1927). If we understand character formally, it becomes difficult to separate a character from their relation to plot, authorial style, and point of view. This reflects our intuition that Huck Finn in Tom Sawyer is in one sense better compared to other "sidekicks" than to the first-person narrator of Adventures of Huckleberry Finn, even though they are referentially "the same person."

As Alex Woloch has pointed out, the tension between "referential" and "formalist" definitions of character has been a centrally "divisive question in ... literary theory" (Woloch, 2003). We don't expect to provide a simple solution. In practice, human readers slide back and forth between referential and formal models of character, depending on the comparisons they need to make in a particular critical context. It may also be true, as Tzvetan Todorov has suggested, that different models of character are appropriate for different genres: the characters in fairy tales, for instance, may be entities of a more structural and less referential kind than characters in a realist novel (Todorov, Poetics of Prose, 1971).

But the ambiguity of "character" is precisely what makes it an interesting concept to model. Literary scholars know that the term covers a range of different phenomena, but we don't yet know how deep those divisions run. Can we reproduce human judgments about character with a single model of the concept, or will it be necessary to develop different models for different genres and critical problems? In the long run, this question will matter for distant readings of literary history, but it also has an immediate theoretical interest.

In this article, we tackle one part of the theoretical question by exploring the relation between character and authorship. It has long been acknowledged that the characters of Charles Dickens, for instance, share a family resemblance we call "Dickensian." On the other hand, we also recognize that a philanthropist in a Dickens novel is, in a different sense, more similar to other philanthropists than he is to a Dickens villain. Our goal in this article is to show that computational methods can support the same range of perspectives, allowing a provisional, flexible separation between these referential and formal dimensions of character.

We framed different models of character based on different assumptions about the relation between character and authorship. One model treated the concepts as fused; another used knowledge about authors to explicitly distinguish authorial effects from character. We trained both models on 257,298 characters drawn from 15,099 works of fiction written in English and published between 1700 and 1899. The training data included all the words grammatically governed by each character, along with their grammatical relationship to the character (for example, whether a verb is something the character did or something that was done to them). The texts were drawn originally from the HathiTrust Digital Library, and include translations from other languages. See the article for more information about data preparation and the NLP pipeline.
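The shape of this training data can be sketched as a count of (relation, word) pairs per character. The snippet below is a minimal sketch under stated assumptions: the relation labels "agent" and "patient", the function name, and the toy tuples are hypothetical, and it assumes a parser has already produced (character, relation, word) triples.

```python
from collections import Counter, defaultdict

def collect_character_features(mentions):
    """Count (relation, word) pairs for each character.

    `mentions` is an iterable of (character, relation, word) triples,
    assumed to come from an upstream parsing step not shown here.
    """
    features = defaultdict(Counter)
    for character, relation, word in mentions:
        features[character][(relation, word.lower())] += 1
    return features

# Toy triples, invented for this example.
mentions = [
    ("Wickham", "agent", "flattered"),    # something Wickham did
    ("Wickham", "agent", "flattered"),
    ("Wickham", "patient", "exposed"),    # something done to Wickham
    ("Darcy", "agent", "exposed"),        # something Darcy did
]
features = collect_character_features(mentions)
```

Keeping the relation in the key means the same verb counts differently depending on whether the character performed the action or received it.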

Finally, we evaluated these models of character against preregistered human hypotheses. For instance, we had predicted in advance that the seducer Wickham in Pride and Prejudice would resemble the similarly unreliable Willoughby in Sense and Sensibility more closely than either resembled Mr Darcy, another character by the same author. Some of the hypotheses we framed (like this one) tested the model's ability to discriminate between character types in a single author (emphasizing a "referential" dimension of character). Other hypotheses tested the model's ability to recognize the "family resemblance" that makes characters Austenian or Dickensian (emphasizing a more "formal" dimension of the concept).
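A hypothesis of this form ("A resembles B more than either resembles C") can be checked mechanically once each character is represented as a probability distribution. The sketch below is illustrative only: the use of Jensen-Shannon divergence as the distance, the function names, and the three toy distributions are assumptions, not results or settings reported here.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: a symmetric, bounded distance-like score."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def hypothesis_holds(a, b, c):
    """True if a and b are closer to each other than either is to c."""
    d_ab = js_divergence(a, b)
    return d_ab < js_divergence(a, c) and d_ab < js_divergence(b, c)

# Toy distributions over three hypothetical character types.
wickham = [0.7, 0.2, 0.1]
willoughby = [0.6, 0.3, 0.1]
darcy = [0.1, 0.2, 0.7]

result = hypothesis_holds(wickham, willoughby, darcy)
```

Under this setup, a model's score is the share of preregistered hypotheses for which the check returns True.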

As an experiment in modeling, our project was straightforwardly successful. We were able to design models that reproduce human intuitions about character in a moderately reliable way. More importantly for the goals of this project, we were able to design a range of options that emulate human readers' ability to use different definitions of character for different purposes. The model that explicitly distinguished authorship from a referential concept of character was better able to discriminate character types in the works of a single author. The model that treated the concepts as fused was better able to discriminate characters drawn from different authors. This result suggests that it will be possible to train probabilistic models that embody different theoretical assumptions about complex literary concepts.

We believe this research will also contribute to literary history, but that part of the project is still in progress. Although we have begun to see interesting patterns in the history of character, we don't mean to suggest that the models described here are optimal for historical research. In this project, our goal was to show that probabilistic models can explicitly encode a range of different theoretical assumptions. But a literary historian might not want to be maximally explicit, in advance, about the definitions of mutable concepts. Since flexibility is an important part of the historian's toolkit, it's possible that the best models for historical research will preserve some ambiguity about underlying assumptions. We plan to address this question in a second phase of the project, aimed at an audience of literary historians.


Further Reading

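Please cite the following paper when using these resources in research:

David Bamman, Ted Underwood and Noah Smith. "A Bayesian Mixed Effects Model of Literary Character." ACL 2014.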


The research reported in this article was supported by a National Endowment for the Humanities start-up grant to T.U., U.S. National Science Foundation grant CAREER IIS-1054319 to N.A.S., and an ARCS scholarship to D.B.