Read The Web - Software Page

Spring 2006


A number of software packages are available.  Please suggest additional relevant software.

WIT - A collection of Java classes for accessing web pages using either command line arguments or direct calls from Java.  Supports (a) getting a single page given its URL, (b) getting a number of pages that match a specified search query, and (c) crawling and caching entire websites.

UIMA - A package for combining the outputs of multiple text annotators into an efficient processing pipeline, interfaced to a database for storing large annotated datasets (details coming soon; meanwhile, contact Eric Nyberg).

Minorthird - A collection of Java classes for storing text, annotating text, and learning to extract entities and categorize text.

Scone - A knowledge base system which we'll use as a repository for facts and beliefs in the ReadTheWeb system.

SecondString - A collection of Java classes for approximate string-matching.
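
For readers new to approximate string-matching, here is a minimal Java sketch of one common measure, Levenshtein edit distance. It does not use SecondString's classes or API; the class name EditDistanceSketch and its methods are hypothetical, chosen only for illustration.

```java
// Illustrative sketch only: this is NOT the SecondString API.
// EditDistanceSketch, levenshtein, and similarity are hypothetical names.
final class EditDistanceSketch {

    // Minimum number of single-character insertions, deletions,
    // and substitutions needed to turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // delete the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int insert = curr[j - 1] + 1;
                int delete = prev[j] + 1;
                int substitute = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insert, delete), substitute);
            }
            // Reuse the two rows instead of allocating a full table.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Normalizes the distance to a score in [0, 1], where 1 means identical.
    static double similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return 1.0;
        }
        return 1.0 - (double) levenshtein(a, b) / maxLen;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting")); // prints 3
        System.out.println(similarity("Tom Mitchell", "Tom M. Mitchell"));
    }
}
```

A normalized score like this makes it possible to compare name variants of different lengths against a single threshold, which is one typical use of approximate matching when reconciling entity names extracted from different web pages.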

This page is located in the file /afs/cs/project/theo-21/www/software.html. 
It is writable by any member of the course.
It was created using NVU, freely available at
Tom Mitchell, January 20, 2006.