Software and Datasets


 

Software:  Jangada

 

Jangada is an API for extracting signature blocks and reply lines from email messages. It follows the approach of the CEAS 2004 paper “Learning to Extract Signature and Reply Lines from Email”, but performance was slightly improved by using a new set of features not mentioned in the original reference.

 

Some Features:

Extracts signature blocks and reply lines from email messages with very good accuracy.

Can be easily integrated into other Java applications (for instance, the entire email message can be passed as input as a String).

Can be easily integrated into other Minorthird applications (using the TextLabels format, it accepts as input email messages that already carry other annotations, such as dates, personal names, speech acts, etc.).

 

Licensing: University of Illinois/NCSA Open Source License

Documentation: Very poor. An initial javadocs page is here. There is some documentation on how to use Jangada in the example files below.

Requires: j2sdk1.4 or later. Uses MinorThird.jar.

Recommended: When using email files as input, results will be better if the messages are in MIME (.eml) format.

 

Usage example:

1.      Create a new directory (for instance, jangadaDir).

2.      Download jangada.jar, minorThird.jar, the example files, and the email files to jangadaDir.

3.      Unzip (gunzip Demos.tar.gz) and untar (tar -xvf Demos.tar) the example files, and do the same for the email files.

4.      Add jangadaDir, jangadaDir/minorThird.jar and jangadaDir/jangada.jar to the CLASSPATH.

5.      For a quick demo, compile the example files. For instance: “javac Demo2.java” (in case of errors, please check your CLASSPATH again).

6.      Run the examples on the email files directory: “java Demo2 emails/*”

7.      Check the documentation in the DemoX.java files and try your own application.
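The setup above can be sketched as a short shell session. This is only a sketch under a few assumptions: the variable name JANGADA_DIR is made up for illustration, the downloads are done by hand beforehand (no URLs are assumed here), and Demos.tar is assumed to unpack Demo2.java into the top of jangadaDir. The extraction and demo steps are skipped when the files are not present.

```shell
# Create the working directory (name taken from the instructions above).
JANGADA_DIR="$PWD/jangadaDir"   # hypothetical variable name, for illustration only
mkdir -p "$JANGADA_DIR"

# Download jangada.jar, minorThird.jar, Demos.tar.gz and the email files
# into $JANGADA_DIR by hand before continuing.

# Put the directory and both jars on the CLASSPATH, keeping any existing value.
CLASSPATH="$JANGADA_DIR:$JANGADA_DIR/minorThird.jar:$JANGADA_DIR/jangada.jar${CLASSPATH:+:$CLASSPATH}"
export CLASSPATH

# Unpack the example files only if the archive has been downloaded.
if [ -f "$JANGADA_DIR/Demos.tar.gz" ]; then
  (cd "$JANGADA_DIR" && gunzip Demos.tar.gz && tar -xvf Demos.tar)
fi

# Compile and run the quick demo only if the example source is in place
# (assumes the archive unpacks Demo2.java into the top of the directory).
if [ -f "$JANGADA_DIR/Demo2.java" ]; then
  (cd "$JANGADA_DIR" && javac Demo2.java && java Demo2 emails/*)
fi
```

If javac reports missing classes, print the CLASSPATH (echo "$CLASSPATH") and check that both jar paths point at files that actually exist.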

 

 

Reminder 1: If you’d like to have access to the source code, please send me an email.

Reminder 2: If you use this package, please cite the following reference:

·        Learning to Extract Signature and Reply Lines from Email, Vitor R. Carvalho and William W. Cohen, CEAS-2004 (Conference on Email and Anti-Spam), Mountain View, CA, July 2004


 

 

Software:  Ciranda

 

Ciranda is a Java application that predicts the Email Acts (or email Speech Acts) of email messages. It follows the approach of the following papers (emnlp04 and sigir05), but performance was significantly improved by careful feature selection and additional features.

 

Some Features:

Predicts the following acts: Request, Commit, Deliver, Propose, Meet, dData.

Provides the confidence in each prediction.

Provides an easy way to use these acts as features in your application.

 

Licensing: No guarantees are provided. Lots of bugs for sure. Use at your own risk!

Documentation: Very poor. An initial javadocs page is here. Please check Example.java on how to use it.

Requires: j2sdk1.4 or later. Uses MinorThird.jar (see below)

Questions: I’ll be happy to help, especially if you tell me what a good Ciranda is  :-)

 

Usage example:

1.      Create a new directory called ciranda, with a subdirectory ciranda/lib.

2.      Download ciranda.jar and minorThird.jar to ciranda/lib.

3.      Add ciranda/ and lib/ciranda.jar to the CLASSPATH.

4.      Download the example file Example.java to ciranda/.

5.      Compile it: “javac Example.java” (in case of errors, please check your CLASSPATH again).

6.      Run the example: “java Example”

7.      Alternatively, run the main application on a directory with emails in text format (without headers), as in the next three steps.

8.      Create the test directory ciranda/testdir.

9.      Add some emails in text format (such as msg1, msg2, msg3) to ciranda/testdir.

10. Run “java -jar  lib/ciranda.jar  testdir”

11. Or try your own application.
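The Ciranda setup can be sketched the same way. Again this is only a sketch: the variable name CIRANDA_DIR is made up for illustration, the jars, Example.java and the test messages are assumed to be put in place by hand, and the Java commands are skipped when those files are missing.

```shell
# Create the directory layout described above.
CIRANDA_DIR="$PWD/ciranda"      # hypothetical variable name, for illustration only
mkdir -p "$CIRANDA_DIR/lib" "$CIRANDA_DIR/testdir"

# Download ciranda.jar and minorThird.jar into $CIRANDA_DIR/lib, and
# Example.java into $CIRANDA_DIR, by hand before continuing.

# Put ciranda/ and lib/ciranda.jar on the CLASSPATH, keeping any existing value.
CLASSPATH="$CIRANDA_DIR:$CIRANDA_DIR/lib/ciranda.jar${CLASSPATH:+:$CLASSPATH}"
export CLASSPATH

# Compile and run the example only if it has been downloaded.
if [ -f "$CIRANDA_DIR/Example.java" ]; then
  (cd "$CIRANDA_DIR" && javac Example.java && java Example)
fi

# Run the main application on testdir only if the jar is in place.
# testdir should hold plain-text emails without headers (msg1, msg2, ...).
if [ -f "$CIRANDA_DIR/lib/ciranda.jar" ]; then
  (cd "$CIRANDA_DIR" && java -jar lib/ciranda.jar testdir)
fi
```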

 

Reminder: Send me an email if you'd like the source code. If you use this package, please cite the following reference:

·        Learning to Classify Email into “Speech Acts”, William W. Cohen, Vitor R. Carvalho and Tom M. Mitchell, EMNLP-2004 (Conference on Empirical Methods in Natural Language Processing), Barcelona, Spain, July 2004

 

 


 

 

Dataset:  Signature and Reply Dataset [Datasets in Minorthird Format]

 

These 617 email messages are annotated with signature lines and reply-to lines. The messages are a subset of the 20 Newsgroups dataset (produced by Ken Lang at CMU in the mid-1990s).

 


 

 

 

 

Back to Vitor Carvalho’s Home page