Twitter14K Dataset

The Twitter14K dataset supports the work described in:

David Bamman, Jacob Eisenstein, Tyler Schnoebelen (2012), "Gender in Twitter: Styles, Stances, and Social Networks," [ArXiv]

twitter14K.tar.gz [57M]


This dataset contains aggregated counts of the 10,000 most frequent words (over the period January 1 to July 31, 2011) for each of 14,464 Twitter users, drawn from a total of 9,212,118 tweets. These users self-report a location within the United States, write primarily in English (as observed in our data), and have an active social network of between 4 and 100 people. An active tie is defined by two mutual @-messages between users, one in each direction, separated by at least two weeks; see the paper above for more details.

All of the data presented here has been anonymized.


The data consists of the following three files:

user_info.txt, which contains, for each user, an identification number (1-14464), the induced gender (1=male, 0=female), the expected number of male friends in the network, the expected number of female friends in the network, and the proportion of the network that is male.

		1047	0	6.972	3.028	0.6972
		
unigram_info.txt, which contains an identifier (1-10000) for each word in the 10,000-word vocabulary, along with the word itself.

		279	lot
		1027	loud
		2701	massage
		
word_counts.txt, which lists how many times each user from user_info.txt uses each word from unigram_info.txt, as comma-separated user, word, count triples. For example, user #1047 above uses the word "lot" (#279) 18 times, "loud" (#1027) 3 times and "massage" (#2701) 7 times.

		1047,279,18
		1047,1027,3
		1047,2701,7
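
The three files can be joined on their identifiers. The following is a minimal Python sketch of one way to do that, not code shipped with the dataset. It assumes, based only on the sample lines above, that user_info.txt and unigram_info.txt are whitespace-delimited and that word_counts.txt is comma-delimited; the function names are hypothetical. It is run here on the sample lines rather than on the real files.

```python
from collections import defaultdict


def parse_user_info(lines):
    """Map user id -> the four attributes described for user_info.txt."""
    users = {}
    for line in lines:
        fields = line.split()  # assumption: whitespace-delimited
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        uid, gender, n_male, n_female, prop_male = fields
        users[int(uid)] = {
            "gender": int(gender),  # 1 = male, 0 = female
            "male_friends": float(n_male),
            "female_friends": float(n_female),
            "prop_male": float(prop_male),
        }
    return users


def parse_unigram_info(lines):
    """Map word id -> word."""
    vocab = {}
    for line in lines:
        fields = line.strip().split(None, 1)  # assumption: id, whitespace, word
        if len(fields) != 2:
            continue
        vocab[int(fields[0])] = fields[1]
    return vocab


def parse_word_counts(lines):
    """Map user id -> {word id -> count}."""
    counts = defaultdict(dict)
    for line in lines:
        fields = line.strip().split(",")  # assumption: comma-delimited
        if len(fields) != 3:
            continue
        uid, wid, n = (int(x) for x in fields)
        counts[uid][wid] = n
    return counts


# Sample lines copied from the examples above.
users = parse_user_info(["\t\t1047\t0\t6.972\t3.028\t0.6972"])
vocab = parse_unigram_info(["\t\t279\tlot", "\t\t1027\tloud", "\t\t2701\tmassage"])
counts = parse_word_counts(["\t\t1047,279,18", "\t\t1047,1027,3", "\t\t1047,2701,7"])

# Look up the words used by user #1047.
for wid, n in counts[1047].items():
    print(vocab[wid], n)
# prints:
# lot 18
# loud 3
# massage 7

# In the sample line, the proportion equals male / (male + female).
u = users[1047]
ratio = u["male_friends"] / (u["male_friends"] + u["female_friends"])
print(abs(ratio - u["prop_male"]) < 1e-6)  # prints True
```

To apply this to the real files, pass an open file object in place of each list, for example parse_user_info(open("user_info.txt")); the file encoding is not stated here, so it may need to be specified explicitly.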