Newsgroups: comp.lang.prolog
Path: cantaloupe.srv.cs.cmu.edu!rochester!udel!news.sprintlink.net!in1.uu.net!allegra!alice!pereira
From: pereira@alta.research.att.com (Fernando Pereira)
Subject: Re: Strings in DCG-style Chart Parsing
In-Reply-To: alech@ai.uga.edu's message of 23 Sep 1995 16:56:01 GMT
X-Nntp-Posting-Host: alta.research.att.com
Message-ID: <PEREIRA.95Sep24123722@alta.research.att.com>
Sender: usenet@research.att.com (netnews <9149-80593> 0112740)
Reply-To: pereira@research.att.com
Organization: AT&T Bell Laboratories
References: <441e71$dp7@hobbes.cc.uga.edu>
Date: Sun, 24 Sep 1995 16:37:22 GMT
Lines: 70

In article <441e71$dp7@hobbes.cc.uga.edu> alech@ai.uga.edu (Andrew Lech [MSAI]) writes:
   Recently, I was thinking about the inefficient practice of storing accumulator
   strings in DCG-based chart parsers (i.e. representing that [the,dog] was parsed
   by storing the "before" and "after" strings [the,dog,chased,the,cat] and
   [chased,the,cat]) and I was wondering if anybody else concurred.

   While easily accessible in DCGs, accumulator strings are extremely redundant.
   If they are used to represent what comes between, then, for each partial
   substring, the chart will contain a copy of that substring and two copies of
   the remaining substring.
   [...]
Serious DCG chart parsers use integers rather than lists to represent
input positions. In addition to the redundancy (in space and time) you
mention, the list representation does not allow easy
indexing of chart items by input position, which, for instance, is
needed to achieve O(n^3) parsing time for the DCG encoding of a
CFG. These observations hold with respect to typical Prolog
implementations, which do not preserve ground (sub)term identity when
using a clause for resolution or when asserting a clause. If a Prolog
implementation were to represent input ground (sub)terms as pointers
that are preserved across resolution and assertion, representing input
string positions as suffixes of the input string would not require
more space or time than the integer representation of positions, and
could be indexed efficiently by a suitable indexer. Unfortunately,
such Prolog implementations are not in wide use (if in use at all). (I
used to bug Quintus about this when I was involved with them in
the mid-80s, but they had more urgent things to do.) There's also an
interesting and more theoretical question associated with this issue:
how subterm sharing relates to polynomial parsability. A relevant
paper is

A Tractable Extension of Linear Indexed Grammars.
Bill Keller and David Weir.
EACL-95.
E-printed as cmp-lg/9502021, http://xxx.lanl.gov/ps/cmp-lg/9502021.
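
To make the sharing argument above concrete, here is a minimal sketch in
Python (not Prolog) that models input suffixes as linked cells. Everything
in it is an assumption made for illustration: the names Cons, make_input
and copy_term are hypothetical, and copy_term only stands in for a system
that copies a ground term when it stores it. It is not a description of
any particular Prolog implementation.

```python
class Cons:
    """One list cell; `tail` is a reference to the rest of the input, not a copy."""
    __slots__ = ("head", "tail")

    def __init__(self, head, tail):
        self.head = head
        self.tail = tail


def make_input(words):
    """Build a linked list and return its suffixes, indexed by start position."""
    node = None
    suffixes = [None]  # None stands for the empty suffix at the end of the input
    for word in reversed(words):
        node = Cons(word, node)
        suffixes.append(node)
    suffixes.reverse()
    return suffixes


def copy_term(node):
    """Hypothetical model of a system that copies a ground term when storing it."""
    return None if node is None else Cons(node.head, copy_term(node.tail))


def length(node):
    """Count the cells reachable from `node`."""
    count = 0
    while node is not None:
        count += 1
        node = node.tail
    return count


words = ["the", "dog", "chased", "the", "cat"]
suffixes = make_input(words)

# With sharing, the n + 1 positions cost n cells in total, and a position
# can be looked up by the identity of its suffix.
shared_cells = len(words)
position_of = {id(suffix): i for i, suffix in enumerate(suffixes)}

# Without sharing, every stored suffix is a fresh copy, so the cell count
# grows quadratically and identity no longer relates one suffix to another.
copied = [copy_term(suffix) for suffix in suffixes]
copied_cells = sum(length(c) for c in copied)
```

For this five-word input the shared representation uses 5 cells while the
copied one uses 15, and copied[0].tail.tail is a different object from
copied[2] even though both represent the same position.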

Examples of how to do efficient chart parsing in Prolog can be found
in, for instance:

@book(Pereira+Shieber-85:Prolog,
	Author={Fernando C. N. Pereira and Stuart M. Shieber},
	Key={Pereira and Shieber},
	Title={Prolog and Natural-Language Analysis},
	Publisher={Center for the Study of Language and Information},
	Number={10},
	Address={Stanford, California},
	Series={CSLI Lecture Notes},
	Year={1987},
	Note={Distributed by Cambridge University Press}
)

@Article{Shieber+Schabes+Pereira:infer,
  author = 	 "Stuart M. Shieber and Yves Schabes and Fernando C.
		  N. Pereira",
  title = 	 "Principles and Implementation of Deductive Parsing",
  journal = "Journal of Logic Programming",
  volume = 24,
  number = "1-2",
  pages = "3-36",
  year =	 1995,
  note = "Earlier version e-printed as cmp-lg/9404008."
}
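
As a rough illustration of indexing chart items by integer input
positions, the sketch below uses a plain CKY recognizer in Python. This is
a swapped-in textbook technique, not the Prolog code from the references
above, and the toy grammar, the lexicon and the name parse are assumptions
made up for this example. The grammar is taken to be in Chomsky normal
form.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form (an assumption for this sketch).
LEXICON = {"the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "chased": {"V"}}
RULES = {("NP", "VP"): {"S"}, ("Det", "N"): {"NP"}, ("V", "NP"): {"VP"}}


def parse(words):
    """Return a chart mapping (start, end) to the categories spanning words[start:end]."""
    n = len(words)
    chart = defaultdict(set)

    # Lexical items span one word: positions i to i + 1.
    for i, word in enumerate(words):
        chart[i, i + 1] |= LEXICON.get(word, set())

    # Combine adjacent spans. The three nested position loops give the
    # O(n^3) bound in the input length for a fixed grammar.
    for width in range(2, n + 1):
        for start in range(n - width + 1):
            end = start + width
            for mid in range(start + 1, end):
                for left in chart[start, mid]:
                    for right in chart[mid, end]:
                        chart[start, end] |= RULES.get((left, right), set())
    return chart


chart = parse(["the", "dog", "chased", "the", "cat"])
# "the dog" is recorded as an NP between positions 0 and 2; no copy of
# the remaining words is stored in the chart.
found_np = "NP" in chart[0, 2]
found_s = "S" in chart[0, 5]
```

Because the keys are pairs of small integers, finding every item that
starts or ends at a given position is a direct lookup, which is the kind of
indexing the list representation makes awkward.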

-- 
Fernando Pereira
2B-441, AT&T Bell Laboratories
600 Mountain Ave, Murray Hill, NJ 07974-0636
pereira@research.att.com


