This dataset contains a corpus of e-mail collected at Carnegie Mellon University for the RADAR Project. Because of the current anonymization, an IRB panel review determined it was exempt from further review and could be distributed for research purposes. The dataset can be downloaded from http://www.cs.cmu.edu/~pbennett/action-item-dataset.tgz.

Anonymization of Data

To anonymize the data, many names in the corpus (people, universities, etc.) were first changed to other names from a corresponding dictionary. Each unique token was then mapped to a random string of the same length, and every occurrence of that token in the corpus was replaced with its random string. Since the corpus contains only a small number of messages (744), as far as we know statistical inference can determine the "true identity" of only some non-identifying words (e.g. "the"), but not any significant content of the e-mails or the identity of senders and recipients.
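The token-mapping step can be illustrated with a minimal Python sketch. This is not the code used to build the corpus: the function name anonymize_corpus, the tokenization rule (runs of word characters), the replacement alphabet, and the fixed seed are all assumptions made for illustration.

```python
import random
import re
import string

ALPHABET = string.ascii_letters + string.digits  # assumed replacement alphabet


def anonymize_corpus(messages, seed=0):
    """Replace every unique token with a random string of the same length.

    Hypothetical sketch: tokens are assumed to be runs of word characters,
    and the same token always receives the same replacement.
    """
    rng = random.Random(seed)
    mapping = {}   # original token -> random string
    used = set()   # random strings already handed out

    def replace(match):
        token = match.group(0)
        if token not in mapping:
            # Retry a bounded number of times so that two different tokens
            # never share a replacement (this keeps token counts intact).
            for _ in range(1000):
                candidate = "".join(rng.choice(ALPHABET) for _ in token)
                if candidate not in used:
                    break
            else:
                raise RuntimeError("no unused string of length %d" % len(token))
            used.add(candidate)
            mapping[token] = candidate
        return mapping[token]

    anonymized = [re.sub(r"\w+", replace, message) for message in messages]
    return anonymized, mapping
```

Because whitespace and punctuation are left untouched and each replacement has the same length as the token it replaces, character offsets computed on the original text remain valid for the anonymized text.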

Action-Item Labeling

Each message has been judged as to whether or not it contains an "action-item" and, if so, where the action-item(s) are located in the e-mail. These judgments are given in judgments/judgments.txt. The first column gives the filename of the message. The second column indicates whether the message contains any action-items (Y/N). If it does, the remaining columns are [character_offset string_length] pairs, each giving the character-based offset into the file of the beginning of an action-item and the length of that action-item.
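A line of the judgments file could be read along the following lines. This is only a sketch under assumptions the description above does not confirm: columns are taken to be whitespace-separated, filenames are taken to contain no spaces, offsets are taken to be 0-based character positions, and any literal brackets around the pairs are ignored. The function names are hypothetical.

```python
import re


def parse_judgment_line(line):
    """Split one judgments.txt line into (filename, has_action_item, spans).

    Assumes whitespace-separated columns; spans is a list of
    (character_offset, string_length) tuples.
    """
    fields = line.split()
    filename, label = fields[0], fields[1]
    # Pull out the integers only, so optional brackets are tolerated.
    numbers = [int(n) for n in re.findall(r"\d+", " ".join(fields[2:]))]
    if len(numbers) % 2 != 0:
        raise ValueError("unpaired offset/length in line: %r" % line)
    spans = list(zip(numbers[0::2], numbers[1::2]))
    return filename, label == "Y", spans


def extract_spans(text, spans):
    """Return the substring for each (offset, length) pair (0-based assumed)."""
    return [text[offset:offset + length] for offset, length in spans]
```

When reading a message file to apply extract_spans, the text should be loaded as characters rather than bytes, since the offsets are described as character based.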

Use of the Data in Machine Learning

Because token counts and sequence information are preserved in the data, many machine learning algorithms can be applied to learn to predict action-items within this corpus, and the results can be compared with other published results.
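As a small illustration of why the anonymized text remains usable, n-gram counts can be computed directly on the random strings. The sketch below assumes whitespace tokenization and uses a hypothetical helper name; it is not the feature extraction used in any published experiment.

```python
from collections import Counter


def ngram_counts(text, n=1):
    """Count n-grams of whitespace-separated tokens in a message."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
```

With n=1 this yields bag-of-words counts, and with larger n it captures the preserved sequence information. Either representation can be passed to a standard classifier even though the tokens themselves are not readable.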

Sentence Segmentation and Additional Annotation

Further NLP processing of the original non-anonymized e-mails was performed and used to create annotation files. These files contain [character_offset string_length] pairs for various types of structure detected in the e-mail, for example "sentence", "person", "organization", "date", etc. Using the sentence annotations, it is possible to reproduce all experiments in the literature below. The file MsgAnnotationFilenamePairs.txt gives the correspondence between message filenames and annotation filenames.

Messages

All messages are contained in the messages directory.

Other E-mail Preprocessing

Threading data, From/To, Date, and all other parts of the header except the subject line have been removed. In addition, when an e-mail contained portions of previous messages (e.g. "inline replies"), the portions of the previous messages were stripped by hand.

More Information

Further details on the collection, annotation, and use of this data are documented in the following publications:

Usage Citation

When citing usage of the dataset, please cite as:
Detecting Action-Items in E-mail. Bennett, P.N. & Carbonell, J.G. SIGIR 2005 Beyond Bag of Words Workshop. http://www.cs.cmu.edu/~pbennett/action-item-datset.html. 2005.

Paul N. Bennett
Last modified: Fri Mar 2 12:44:29 PST 2007