Package saui_pr4 :: Module emailscraper

Module emailscraper

Scrape an mbox-format mailbox for e-mail body text and output data suitable for training a language model.

This is part of project 4 in 05-631 Software Architecture for User Interfaces, Fall 2007.


Author: David Huggins-Daines <dhuggins@cs.cmu.edu>

Functions
 
usage()
Show usage of this module.
 
normalize_subject(msg)
Normalize the subject header of an e-mail message.
 
normalize_addrs(msg)
Normalize address headers of an e-mail message.
 
normalize_body(msg)
Normalize the body of an e-mail message.
 
scrape_mbox(filename)
Scrape usable text out of a mailbox.
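
The summary of scrape_mbox(filename) says only that it scrapes usable text out of a mailbox; its body is not reproduced on this page. As a rough illustration of that kind of step, here is a minimal sketch using the standard-library mailbox module. The helper name iter_plain_bodies and its decoding policy are assumptions made for this example, not the module's actual code, and the sketch targets Python 3 rather than the 2007-era Python the module was written for.

```python
import mailbox


def iter_plain_bodies(filename):
    """Hypothetical helper: yield each text/plain part found in an mbox file."""
    # create=False avoids silently creating an empty mailbox on a bad path.
    box = mailbox.mbox(filename, create=False)
    try:
        for msg in box:
            # walk() visits the message itself and every nested MIME part.
            for part in msg.walk():
                if part.get_content_type() != 'text/plain':
                    continue
                payload = part.get_payload(decode=True)  # bytes, or None
                if payload is None:
                    continue
                charset = part.get_content_charset() or 'utf-8'
                try:
                    yield payload.decode(charset, errors='replace')
                except LookupError:
                    # Unknown charset label: fall back to UTF-8 (assumption).
                    yield payload.decode('utf-8', errors='replace')
    finally:
        box.close()
```

Each yielded string could then be passed to a cleanup step such as normalize_body before being written out as training text.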
Variables
  quotre = re.compile(r'^(\s*>)+\s*', re.MULTILINE)
  origre = re.compile(r'^\s*-+\s*Original\s+Message\s*-+.*', re....
  cvsre = re.compile(r'^Update of /.*', re.MULTILINE | re.DOTALL)
  sharre = re.compile(r'^#!/.*$.*', re.MULTILINE | re.DOTALL)
  msword = re.compile(r'^Content-Type: (application/msword|text/...
  sepre = re.compile(r'[-.#_*=~^]{5,}\s*', re.DOTALL)
  listre = re.compile(r'^\s*[-+*]+\s+', re.MULTILINE)
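
To show what three of these patterns match, the following sketch applies quotre, sepre, and listre to a small sample with re.sub. The patterns are copied from the list above; the helper name strip_markup and the order of substitution are assumptions for illustration, since this page does not show how the module itself uses them.

```python
import re

# Patterns copied verbatim from the Variables list.
quotre = re.compile(r'^(\s*>)+\s*', re.MULTILINE)
sepre = re.compile(r'[-.#_*=~^]{5,}\s*', re.DOTALL)
listre = re.compile(r'^\s*[-+*]+\s+', re.MULTILINE)


def strip_markup(text):
    """Hypothetical helper: drop quote prefixes, separator rules, and bullets."""
    text = quotre.sub('', text)   # leading "> > " reply markers
    text = sepre.sub('', text)    # runs of 5+ rule characters, e.g. "-----"
    text = listre.sub('', text)   # leading "* " or "- " list bullets
    return text


sample = "> > quoted reply\n-----\n* first item\n- second item\nplain line\n"
print(strip_markup(sample))
```

Note that \s also matches newlines, so sepre consumes the line break after a separator rule along with the rule itself.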
Variables Details

origre

Value:
re.compile(r'^\s*-+\s*Original\s+Message\s*-+.*', re.MULTILINE | re.DOTALL)

msword

Value:
re.compile(r'^Content-Type: (application/msword|text/html).*', re.MULTILINE | re.DOTALL)
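
Because both origre and msword combine re.MULTILINE with re.DOTALL, the trailing .* runs to the end of the string, so a single substitution removes everything from the matching line onward. The sketch below demonstrates that behaviour; the helper name drop_trailing_junk is hypothetical, and whether the module applies the patterns this way is an assumption.

```python
import re

# Values copied from the Variables Details entries above.
origre = re.compile(r'^\s*-+\s*Original\s+Message\s*-+.*',
                    re.MULTILINE | re.DOTALL)
msword = re.compile(r'^Content-Type: (application/msword|text/html).*',
                    re.MULTILINE | re.DOTALL)


def drop_trailing_junk(text):
    """Hypothetical helper: cut the text at the first matching marker line."""
    for pattern in (origre, msword):
        text = pattern.sub('', text)
    return text


reply = ("Sounds good, see you then.\n"
         "-----Original Message-----\n"
         "From: someone\n"
         "Old text here.\n")
attachment = ("See attached.\n"
              "Content-Type: text/html; charset=us-ascii\n"
              "<html>...</html>\n")

print(repr(drop_trailing_junk(reply)))
print(repr(drop_trailing_junk(attachment)))
```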