Lightweight Structured Text Processing (Thesis Proposal)

Robert C. Miller

Robert C. Miller. "Lightweight Structured Text Processing." PhD Thesis proposal, Computer Science Department, Carnegie Mellon University, April 1999.

Abstract

Web pages, source code, and other text documents contain structured data worthy of automatic processing -- searching, sorting, reformatting, calculating, and so on. Unfortunately, generic text-processing tools limit their input and output to generic formats, which may not match the format of the text the user wants to process. The usual solution to this problem is a custom program that parses the source text and processes it, but custom parsers are hard to build and often discard useful information from the source document. I propose a new approach, lightweight structured text processing, in which users describe relevant document structure interactively and manipulate documents directly with generic tools. This approach will be embodied in LAPIS, a system for displaying and processing web pages, source code, and text files. LAPIS makes several contributions: (1) text constraints, a new pattern language for identifying regions of text in a simple, readable, composable, expressive, and robust manner; (2) algorithms and representations for implementing text constraints with reasonable efficiency; (3) an architecture for composing and reusing structure detectors, such as external C++ or HTML parsers; (4) a user interface that integrates pattern matching, manual selection, learning from examples, and external parsers, allowing the user to combine these techniques for convenient structure description; (5) the ability to refer to "presentation-level" structure that may not be directly reflected in the linear text, such as page layout, typesetting, and table rows and columns; and (6) the ability to handle variations and exceptions in document structure by specifying fuzzy patterns.
My thesis is that lightweight structured text processing lets users describe and manipulate many kinds of text structure, from implicit to explicit, formal to informal, and presentation-level to logical-level, while working with the text in its original format, and that it delivers the convenience, speed, and scalability of automation without the cost or difficulty of writing a custom program.
