Twitter14K Dataset
The Twitter14K dataset supports the work described in:
David Bamman, Jacob Eisenstein, Tyler Schnoebelen (2012), "Gender in Twitter: Styles, Stances, and Social Networks," [ArXiv]
twitter14K.tar.gz [57M]
This dataset contains aggregated word counts of the most frequent 10,000 words (over the period Jan 1 - July 31, 2011) for each of 14,464 Twitter users from a total of 9,212,118 tweets. These are users who self-report their location to be within the United States, whose primary language (as observed in our data) is English, and who have an active social network of between 4 and 100 people (where an active tie is defined by two mutual @-messages between users, one in each direction, separated by at least two weeks); see the paper above for more details.
All of the data presented here has been anonymized.
The data consists of the following three files:
user_info.txt, which contains an identification number (1-14464), induced gender (1=male, 0=female), expected number of male friends in network, expected number of female friends in network, and proportion of network that's male.
    1047 0 6.972 3.028 0.6972
unigram_info.txt, which contains an identifier (1-10000) for each of the words in the 10,000 word vocabulary, along with that word.
    279 lot
    1027 loud
    2701 massage
word_counts.txt, which lists the counts of each user from user_info.txt using each word from unigram_info.txt. For example, user #1047 above uses the word "lot" (#279) 18 times, "loud" (#1027) 3 times and "massage" (#2701) 7 times.
    1047,279,18
    1047,1027,3
    1047,2701,7
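As a minimal sketch of how the three files fit together, the Python below parses each format and joins them on the user and word identifiers. It assumes one record per line, whitespace-separated fields in user_info.txt and unigram_info.txt, and comma-separated fields in word_counts.txt. The function names are hypothetical and are not part of the dataset distribution. The sample records are the ones shown above.

```python
import io
from collections import defaultdict


def parse_user_info(lines):
    """Parse user_info.txt records (assumed whitespace-separated, 5 fields)."""
    users = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or unexpected lines
        users[int(fields[0])] = {
            "gender": "male" if fields[1] == "1" else "female",  # 1=male, 0=female
            "expected_male_friends": float(fields[2]),
            "expected_female_friends": float(fields[3]),
            "proportion_male": float(fields[4]),
        }
    return users


def parse_unigram_info(lines):
    """Parse unigram_info.txt records (assumed 'word_id word' per line)."""
    vocab = {}
    for line in lines:
        fields = line.split(None, 1)
        if len(fields) != 2:
            continue
        vocab[int(fields[0])] = fields[1].strip()
    return vocab


def parse_word_counts(lines):
    """Parse word_counts.txt records ('user_id,word_id,count' per line)."""
    counts = defaultdict(dict)
    for line in lines:
        parts = line.strip().split(",")
        if len(parts) != 3:
            continue
        user_id, word_id, count = (int(p) for p in parts)
        counts[user_id][word_id] = count
    return counts


# In-memory stand-ins for the real files, using the sample records above.
users = parse_user_info(io.StringIO("1047 0 6.972 3.028 0.6972\n"))
vocab = parse_unigram_info(io.StringIO("279 lot\n1027 loud\n2701 massage\n"))
counts = parse_word_counts(io.StringIO("1047,279,18\n1047,1027,3\n1047,2701,7\n"))

# Join word identifiers back to the words themselves for one user.
words_for_1047 = {vocab[word_id]: n for word_id, n in counts[1047].items()}
print(users[1047]["gender"], words_for_1047)
```

To run this against the real data, pass open file handles in place of the in-memory samples, for example parse_user_info(open("user_info.txt")), assuming the archive extracts to those file names in the working directory.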