PROBLEMS:

The dangling_period and dangling_terminator states should probably be
merged.  The former occurs mostly with number processing, so maybe
it's ok but I'm not sure anymore.

Number handling should probably be reviewed, particularly the
scientific notation code, which is pretty much as it came from CU,
except for spillover debugging.  This component was not particularly
exercised by WSJ or AP.

Fractions (in WSJ) are dealt with oddly: 7/16 comes out as "seven
/slash one six".  There ought to be a specialized fraction processor,
so that it comes out as "seven sixteenths".
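
A specialized fraction processor could be sketched roughly as follows.
This is only an illustration in Python, not the existing code: the
function names (cardinal, ordinal, fraction_to_words) are hypothetical,
and it assumes a fraction token is exactly one or two digits, a slash,
and one or two digits.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = {2: "twenty", 3: "thirty", 4: "forty", 5: "fifty",
        6: "sixty", 7: "seventy", 8: "eighty", 9: "ninety"}
IRREGULAR_ORDINALS = {"one": "first", "two": "second", "three": "third",
                      "five": "fifth", "eight": "eighth", "nine": "ninth",
                      "twelve": "twelfth"}


def cardinal(n):
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else TENS[tens] + "-" + ONES[ones]


def ordinal(n):
    """Spell out the ordinal of an integer in the range 1..99."""
    head, sep, last = cardinal(n).rpartition("-")
    if last in IRREGULAR_ORDINALS:
        last = IRREGULAR_ORDINALS[last]
    elif last.endswith("y"):
        last = last[:-1] + "ieth"      # twenty -> twentieth
    else:
        last = last + "th"             # sixteen -> sixteenth
    return head + sep + last


def fraction_to_words(token):
    """Return the spoken form of a token like '7/16', or None if the
    token is not a simple fraction (assumed format: d[d]/d[d])."""
    m = re.fullmatch(r"(\d{1,2})/(\d{1,2})", token)
    if m is None:
        return None
    num, den = int(m.group(1)), int(m.group(2))
    if den < 2:
        return None                    # leave 7/0 and 7/1 to other rules
    if den == 2:
        denom = "half" if num == 1 else "halves"
    else:
        denom = ordinal(den) + ("" if num == 1 else "s")
    return cardinal(num) + " " + denom


print(fraction_to_words("7/16"))
```

Tokens that do not match the assumed pattern return None, so the
caller can fall through to the existing number handling.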

Abbreviations, as always, are a problem.  The current list should be
further pruned and some sublists therein expanded.  I'm starting to
believe that abbreviations are probably best dealt with as a
source-specific process and should be moved to the pre-processor
stage, at least for all the odd-ball abbreviations.

wsj90-92 has datelines as part of the text, while wsj87-89 does not.
This should be reconciled.  (Probably by stripping the datelines.)

Tabular and other non-textual material should be filtered.  I believe
this can be done by counting the relative proportions of words, numbers
and punctuation in a <TEXT>.  A minimum percentage of words needs to be
present for a text to count as "real".  Of course this is tricky in
practice (see filter_sgml_smart, where a threshold of 2.0 might work).
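
The counting idea can be sketched like this.  It is a hypothetical
Python illustration, not what filter_sgml_smart does: the tokenization
is ASCII-only, the names (token_profile, word_fraction,
looks_like_text) are made up, and the default min_word_fraction of 0.5
is a placeholder that is unrelated to the 2.0 threshold mentioned
above and would need tuning per source.

```python
import re

# One named group per token class; exactly one group matches per token.
TOKEN_RE = re.compile(
    r"(?P<word>[A-Za-z]+(?:'[A-Za-z]+)?)"   # alphabetic words, optional apostrophe part
    r"|(?P<number>\d+(?:[.,]\d+)*)"         # digit strings such as 12.5 or 1,000
    r"|(?P<punct>[^\sA-Za-z0-9])"           # any other single non-space character
)


def token_profile(text):
    """Count word, number and punctuation tokens in a block of text."""
    counts = {"word": 0, "number": 0, "punct": 0}
    for match in TOKEN_RE.finditer(text):
        counts[match.lastgroup] += 1
    return counts


def word_fraction(text):
    """Fraction of all tokens that are words (0.0 for empty input)."""
    counts = token_profile(text)
    total = sum(counts.values())
    return counts["word"] / total if total else 0.0


def looks_like_text(text, min_word_fraction=0.5):
    """True if enough of the tokens are words to call this 'real' text.
    The default threshold is an assumed placeholder, not a tuned value."""
    return word_fraction(text) >= min_word_fraction


prose = "The company said profits rose sharply in the third quarter."
table = "1987 12.5 13.2 14.1\n1988 11.0 9.7 8.3"
print(looks_like_text(prose), looks_like_text(table))
```

A filter built on this would apply looks_like_text to the contents of
each <TEXT> and drop those that fail; borderline cases such as
earnings reports with short prose headers are where the threshold
gets tricky.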



