This dataset contains a corpus of e-mail collected at Carnegie Mellon
University for the RADAR Project. Because of the current anonymization, an IRB panel review
determined it was exempt from further review and could be distributed
for research purposes. The dataset can be downloaded from http://www.cs.cmu.edu/~pbennett/action-item-dataset.tgz.
Anonymization of Data
In order to anonymize the data, many names in the corpus (people,
universities, etc.) were changed to other names from a corresponding
dictionary. After this, each unique token was mapped to a random
string of the same length, and every occurrence of that token in the
corpus was replaced with the new random string. Since the corpus
only contains a small number of messages (744), as far as we know
statistical inference can only determine the "true identity" of some
non-identifying words (e.g. "the") but not of any significant
content of the e-mails nor the identity of senders and recipients.
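
The token-mapping step described above can be illustrated with a short
Python sketch. It is not the script that was used to produce this
corpus. The tokenization rule (runs of ASCII letters and digits), the
replacement alphabet, and the function name anonymize are all
assumptions made for illustration, and the earlier dictionary-based name
substitution is not shown.

```python
import random
import re
import string

# Assumption: a "token" is a run of ASCII letters and digits. The source
# does not say how tokens were actually delimited.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")
ALPHABET = string.ascii_letters + string.digits


def anonymize(messages, seed=0):
    """Replace each unique token with a random string of the same length.

    The same token always receives the same replacement, so token counts
    and token order are unchanged.
    """
    rng = random.Random(seed)
    mapping = {}   # original token -> random replacement
    used = set()   # replacements already handed out

    def replace(match):
        token = match.group(0)
        if token not in mapping:
            # Redraw on collision so two different tokens never share a
            # replacement (this keeps the counts intact).
            while True:
                candidate = "".join(rng.choice(ALPHABET) for _ in token)
                if candidate not in used:
                    break
            used.add(candidate)
            mapping[token] = candidate
        return mapping[token]

    return [TOKEN_RE.sub(replace, m) for m in messages], mapping
```

Because the mapping is applied consistently across the whole corpus, a
word that appears in two messages receives the same random string in
both.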
Action-Item Labeling
Each message has been judged as to whether or not it contains an
"action-item" and the location of the action item(s) in the e-mail.
This data is given in judgments/judgments.txt. The first column
gives the filename of the message. The second column says whether
or not it contains any action items (Y/N). If it does, the remaining
columns are [character_offset string_length] pairs, each giving the
character-based offset into the file where an action-item begins and
the length of that action-item.
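
As a rough illustration of reading this file, the Python sketch below
parses one line into a filename, a Y/N flag, and a list of (offset,
length) pairs. Several details are assumptions rather than documented
facts: that columns are whitespace-separated, that filenames contain no
whitespace, that offsets are zero-based, and that any square brackets
around the pairs can be ignored. The filenames in the usage note are
hypothetical.

```python
import re
from typing import List, NamedTuple, Tuple


class Judgment(NamedTuple):
    filename: str
    has_action_item: bool
    spans: List[Tuple[int, int]]  # (character_offset, string_length)


def parse_judgment_line(line: str) -> Judgment:
    """Parse one line of judgments.txt (assumed whitespace-separated)."""
    parts = line.split(None, 2)
    filename, flag = parts[0], parts[1]
    rest = parts[2] if len(parts) > 2 else ""
    # Collect the integers after the flag; brackets, if present, are skipped.
    numbers = [int(n) for n in re.findall(r"\d+", rest)]
    if len(numbers) % 2 != 0:
        raise ValueError("unpaired offset/length in line: %r" % line)
    spans = list(zip(numbers[0::2], numbers[1::2]))
    return Judgment(filename, flag.upper() == "Y", spans)


def extract_spans(text: str, spans: List[Tuple[int, int]]) -> List[str]:
    """Slice the labeled action-item text, assuming zero-based offsets."""
    return [text[offset:offset + length] for offset, length in spans]
```

For example, parse_judgment_line("msg001.txt Y 7 24 40 5") would yield
the spans [(7, 24), (40, 5)]. If the extracted strings look shifted by
one character against the real files, the zero-based assumption is the
first thing to check.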
Use of the Data in Machine Learning
Because token counts and sequence information are preserved in the
data, many machine learning algorithms can be applied to learn to
predict action-items within this corpus and compare to other
published data.
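
To make the point about preserved counts and order concrete, the small
Python sketch below builds unigram and bigram counts from a message.
Splitting on whitespace is an assumption made here for brevity, and the
function name count_features is hypothetical. The anonymized tokens are
opaque strings, but they can be counted like ordinary words.

```python
from collections import Counter


def count_features(text):
    """Return (unigram_counts, bigram_counts) for one message.

    Assumes whitespace tokenization. Anonymized tokens are treated as
    opaque symbols.
    """
    tokens = text.split()
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams
```

Features of this kind can be fed to any standard text classifier,
since the classifier never needs to know what the original words were.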
Sentence Segmentation and Additional Annotation
Further NLP processing of the original non-anonymized e-mails was
performed and used to create annotation files. These files contain
[character_offset string_length] pairs for various types of
structure detected in the e-mail, for example "sentence", "person",
"organization", and "date". Using the sentence annotations, it is
possible to reproduce all experiments in the publications listed below. The
file MsgAnnotationFilenamePairs.txt gives the correspondence between
message filenames and annotation filenames.
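
One way to combine the sentence annotations with the action-item
judgments is to mark a sentence as positive when its span overlaps any
action-item span. The Python sketch below assumes both kinds of span
have already been read into (offset, length) tuples. The overlap rule
itself is an assumption for illustration and may differ from the
criterion used in the publications.

```python
def spans_overlap(a, b):
    """True if two (offset, length) spans share at least one character."""
    a_start, a_len = a
    b_start, b_len = b
    return a_start < b_start + b_len and b_start < a_start + a_len


def label_sentences(sentence_spans, action_spans):
    """Return one boolean per sentence: does it overlap an action-item?"""
    return [
        any(spans_overlap(sentence, action) for action in action_spans)
        for sentence in sentence_spans
    ]
```

With sentence spans [(0, 10), (11, 20), (32, 8)] and a single
action-item span (15, 5), only the second sentence is labeled positive.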
Messages
All messages are contained in the messages directory.
Other E-mail Preprocessing
Threading data, From/To, Date, and all other parts of the header
except the subject line have been removed. In addition, when an
e-mail contained portions of previous messages (e.g. "inline
replies"), the portions of the previous messages were stripped by
hand.
More Information
Further details on the collection, annotation, and use of this data are
documented in the following publications:
- Feature Representation for Effective Action-Item
Detection. Bennett, P.N. & Carbonell, J.G. SIGIR 2005 Poster. 2005.
- Detecting Action-Items in E-mail. Bennett, P.N. & Carbonell,
J.G. SIGIR 2005 Beyond Bag of Words Workshop. 2005.
- Building Reliable Metaclassifiers for Text Learning
(Ch. 10). Bennett, P.N. PhD Thesis. 2006.
- Combining Probability-Based Rankers for Action-Item Detection.
Bennett, P.N. HLT 2007. 2007.
Usage Citation
When citing usage of the dataset, please cite as:
Detecting Action-Items in E-mail. Bennett, P.N. & Carbonell,
J.G. SIGIR 2005 Beyond Bag of Words Workshop.
http://www.cs.cmu.edu/~pbennett/action-item-datset.html. 2005.
Paul N. Bennett
Last modified: Fri Mar 2 12:44:29 PST 2007