A number of software packages are available. Please suggest
additional relevant software.
- A collection of Java classes for accessing web pages using either command line arguments
or direct calls from Java. Supports (a) getting a single page
given its URL, (b) getting a number of pages that match a specified
search query, and (c) crawling and caching entire websites.
UIMA - A package for
combining the outputs of multiple text annotators into an efficient
processing pipeline, interfaced to a database that stores large annotated
datasets (details coming soon; meanwhile, contact Eric Nyberg).
Minorthird - A collection of Java
classes for storing text, annotating text, and learning to extract
entities and categorize text.
Scone - A knowledge base system
which we'll use as a repository for facts and beliefs in the ReadTheWeb
- Named entity extractors using Minorthird: /afs/cs/project/theo-21/software/textAnnotators/minorthird
has several: for people (trained on email), for organizations (trained on
newswire), and for proteins (trained on various subsets of
MEDLINE). They can all be used with edu.cmu.minorthird.ui.ApplyAnnotator.
SecondString - A collection of Java classes for approximate string-matching.
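
For readers new to approximate string matching, here is a minimal sketch of one common technique, Levenshtein edit distance, in plain Java. It does not use SecondString and does not reflect that package's API; the class name EditDistanceSketch and its methods are hypothetical names chosen for illustration only.

```java
// Illustrative sketch only: NOT SecondString code. All names here are hypothetical.
public class EditDistanceSketch {

    /** Returns the Levenshtein distance: the minimum number of single-character
     *  insertions, deletions, and substitutions needed to turn a into b. */
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Turning the empty prefix of a into the first j characters of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Reuse the two rows instead of allocating a full table.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Maps the distance to a similarity score in [0, 1], where 1 means identical. */
    public static double similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return 1.0; // two empty strings are identical
        }
        return 1.0 - (double) distance(a, b) / maxLen;
    }

    public static void main(String[] args) {
        String x = "Tom Mitchell";
        String y = "Tom M. Mitchell";
        System.out.println("distance   = " + distance(x, y));
        System.out.println("similarity = " + similarity(x, y));
    }
}
```

A score like this could be used to decide whether two extracted name strings refer to the same entity, for example by treating pairs above a chosen threshold as matches. The threshold is an assumption that would need tuning on real data.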
This page is located in the file
It is writable by any member of the course.
It was created using NVU, freely available at http://www.nvu.com/
Tom Mitchell, January 20, 2006.