Page Model

Items to be extracted: speaker, location, stime, etime.

A seminar announcement is assumed to correspond to a record in a relational database, containing the fields (in the order listed above): name of the speaker, location of the seminar, and the start and end times of the seminar. Any field may or may not be instantiated for a given announcement. An instantiated field is filled by a single text fragment. When multiple instances occur in a text (as is common for the start time, for example), the extraction of any one occurrence is counted as correct.

Instances of these fields are identified in the text by means of SGML-style tags. A seminar start time, for example, is bracketed by the tags <stime> and </stime>. (In addition, the tags <sentence> and <paragraph> were added to facilitate linguistic experiments, but they were never used in any reported experiments.)
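
As a rough illustration of how the tagged instances might be read back out, the sketch below collects every fragment bracketed by one of the four field tags. The function name extract_fields and the sample string are invented for this example and are not part of the corpus or its tools. It also assumes that field tags are not nested inside one another.

```python
import re

# The four fields named in this README.
FIELDS = ("speaker", "location", "stime", "etime")

def extract_fields(tagged_text):
    """Map each field name to the list of fragments tagged with it.

    Hypothetical helper: a field absent from the text maps to an empty list,
    and a field tagged several times yields several fragments.
    """
    found = {}
    for field in FIELDS:
        # Non-greedy match so that repeated tags give separate fragments;
        # DOTALL lets a fragment span a line break.
        pattern = re.compile(r"<{0}>(.*?)</{0}>".format(field), re.DOTALL)
        found[field] = pattern.findall(tagged_text)
    return found

# Invented sample text, not taken from the corpus.
sample = (
    "Speaker: <speaker>Jane Doe</speaker>\n"
    "Time: <stime>3:30 pm</stime> - <etime>5:00 pm</etime>\n"
    "The talk starts at <stime>3:30</stime>.\n"
)

if __name__ == "__main__":
    for name, fragments in extract_fields(sample).items():
        print(name, fragments)
```

Under the scoring rule described above, an extractor would be credited for the start time if it returned any one of the fragments listed under stime.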

To my knowledge, it is safe to assume that any text matching the regular expression </?[a-z]+> was introduced by me. Thus, to return a seminar announcement to its pristine state, it should suffice to delete all text matching this pattern.
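
The deletion described above can be sketched in a few lines. The regular expression is the one given in this README. The function name strip_tags and the sample string are assumptions made for the example.

```python
import re

# The pattern given in this README: any opening or closing lowercase tag.
TAG_PATTERN = re.compile(r"</?[a-z]+>")

def strip_tags(tagged_text):
    """Delete every substring matching the tag pattern (hypothetical helper)."""
    return TAG_PATTERN.sub("", tagged_text)

# Invented sample text, not taken from the corpus.
tagged = "<paragraph><sentence>Talk at <stime>3:30</stime>.</sentence></paragraph>"

if __name__ == "__main__":
    print(strip_tags(tagged))
```

Note that this removes any lowercase tag-like text, so it relies on the assumption stated above that the original announcements contained no such strings.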

SAMPLE EXTRACTION OUTPUT

cmu.andrew.academic.sds.seminars-56:0

cmu.cs.proj.cimds-249:0