GeoText -- Geo-tagged Microblog Corpus
=======================================

URL: http://www.ark.cs.cmu.edu/GeoText
Version: 2010-10-12

The dataset is described in the following paper.
Please consider citing it if appropriate. Thanks!

  "A Latent Variable Model for Geographic Lexical Variation."
  Jacob Eisenstein, Brendan O'Connor, Noah A. Smith, and Eric P. Xing.
  http://www.cs.cmu.edu/~nasmith/papers/eisenstein+oconnor+smith+xing.emnlp10.pdf
  In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Cambridge, MA, 2010.

Contact brenocon@cmu.edu with any questions.


Overview
========

377616 messages from 9475 geo-located microblog users approximately within the
United States, over one week in March 2010.  See the 'Data' section of the
paper for more details.


Contents
========

full_text.txt   --  All messages and meta information, in tab-separated fields.

processed_data/
  data.mat      --  Matlab format.  Key variables are
    w_data         -- (User ID, "document position", word ID) triples
    u_lat, u_long  -- Coordinates per user.

  Main plaintext-formatted data:
  user_info     --  Geo coordinates per user (from their first message)
                    User IDs correspond to line numbers in this file.
  vocab_wc_dc   --  Vocabulary file, with word and doc counts. 
                    Word IDs correspond to line numbers in this file.
  user_pos_word --  (User ID, docposition, word ID) triples

  Other versions:
  user_word_tf  --  Normalized TF features per user, triples format.
  {train,dev,test}.dat  --  Word counts per user, "LDA" format.

preproc/        --  Scripts for constructing some of the above files
                    from full_text.txt.

geo_eval/       --  Scripts we used for location prediction evaluation.
                    A little messy; not all are used.
                    geo_dist.py is the most (only?) useful one.

"Document position" means the position in the document obtained from
concatenating all the user's messages together.

Train/Dev/Test splits
=====================

Train, Dev, and Test splits are by user ID.
Folds are numbered 1,2,3,4,5, 1,2,3,4,5, across users
  i.e., fold = (userID % 5); fold = fold==0 ? 5 : fold  

Training set is folds 1,2,3
Dev set is fold 4
Test set is fold 5

Some files already have train,dev,test splitted versions.


De-identification
=================

All messages were public Twitter messages posted in March 2010.

Even so, we have taken an additional, if modest, step of anonymizing usernames
in the author field as well as @-mentions.  This certainly does not ensure
privacy, but makes casual searching for individual users a little harder.