22 June, 1995

parser / mkparserclass
C++ parse objects

        Wilfred J. Hansen, Andrew Consortium

(The C version of this package is described in ./parser.doc.)

A parser object represents a grammar and the state of a parse according
to that grammar.  After parsing one text, the object can be reused to
parse another.  Unlike yacc and other systems, the grammar is
represented by uniquely named tables, so there are no name conflicts and
multiple parsers are possible.  Grammar tables are generated with a
version of the Bison package from the Free Software Foundation.  An awk
script removes the Bison parser, so the resulting parser and application
are not tainted with FSF's General Public License.  (Nor is Bison's
output tainted any longer;  as of June, 1995, FSF removed the GPL
restriction from the Bison parser.) 

Andrew provides an enhanced version of Bison which is upward compatible
from Bison and yacc.  It supports "multi-character tokens" and a few new
switches, including -k, which is required for mkparserclass.

In the descriptions below, it is assumed that the grammar is gggg and is
described in file gggg.gra.  Do not use the string 'bison' as part of
the grammar's filename;  for instance, do not use gggg.bison as a file name. 

For an overview of the parse and compilation tools of Andrew, see 
	help parse

Overview

The gggg.gra file is processed through Bison to produce the gggg.tab.c
file, which is then processed by the 'mkparserclass' script to produce
gggg.C.  This latter file defines the class gggg derived from class
parser.  The application code allocates (with 'new') an instance of
class gggg and then calls its inherited Parse method to parse a lexeme
stream. 


Grammar (.gra file)

The AUIS version of Bison supports one additional token type: 
multi-character tokens.  These are written in the grammar surrounded by
quotation marks, as in "<=".  Thus one rule of the grammar might be

	expression : expression "<=" factor ; 

In other words it is not necessary to define LE as a token and then
teach the token analyzer that a less-than followed by an equal-sign is
the token LE.  (AUIS's 'tlex' token analyzer determines the token list
from the tables generated by Bison.) 

Semantic action routines {specified in braces in the grammar} may refer
to locations in the value stack with $$, $1, $2, and so on, as in yacc
and Bison.  In addition, the variable 'parser' points to the parser
object for the parse in progress.  One use for this is to access the
associated 'rock' value: 

	struct whatever *info 
		= (struct whatever *)parser->GetRock(); 


The parser reports compilation errors by calling the Error method. 
Grammar routines and other application functions can also call the Error
method to report errors.  Error has the default action of calling
parser::ErrorGuts, which prints an error message.  If an application
wishes some other error action, it should override ErrorGuts.  To do so,
it must include at least a declaration for ErrorGuts in the .gra file; 
this is the signal for mkparserclass to include the appropriate
declaration in the class declaration.  The declaration in the gggg.gra
file should be either

	extern void gggg::ErrorGuts(int severity, 
			char *severityname, char *msg); 

or 
	void gggg::ErrorGuts(int severity, 
			char *severityname, char *msg) {
		/* ... appropriate error handler ... */
	}

When ErrorGuts is called, (severity&~parser_FREEMSG) will have one of the
values parser_WARNING, parser_SERIOUS, parser_SYNTAX, or parser_FATAL,
as defined in parser.H;  severityname will be the corresponding string. 
The parameter msg will be a character string; if
(severity&parser_FREEMSG) is non-zero, ErrorGuts must free the character
string.  Applications should call Error instead of ErrorGuts because the
former computes the maximum severity and counts the number of errors. 
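
The handling of the severity argument described above can be sketched as
follows.  This is only an illustration, not the Andrew implementation: 
the demo_* constants are stand-ins (the real values are defined in
parser.H), demo_ErrorGuts is a free-standing function rather than an
override of gggg::ErrorGuts, error_log is a hypothetical application
log, and the message is assumed to have been allocated with malloc when
the FREEMSG bit is set.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

// Stand-in values for illustration only; the real ones are in parser.H.
enum {
    demo_OK = 0, demo_WARNING = 1, demo_SERIOUS = 2,
    demo_SYNTAX = 3, demo_FATAL = 4,
    demo_FREEMSG = 0x100            // hypothetical flag bit
};

// Hypothetical error log kept by the application.
struct LoggedError { int level; std::string text; };
static std::vector<LoggedError> error_log;

// Body of an ErrorGuts-style handler: record the message, then free
// it if the caller set the FREEMSG bit.
void demo_ErrorGuts(int severity, char *severityname, char *msg) {
    assert(severityname != nullptr && msg != nullptr);
    int level = severity & ~demo_FREEMSG;       // strip the flag bit
    error_log.push_back({level, std::string(severityname) + ": " + msg});
    if (severity & demo_FREEMSG)
        std::free(msg);             // the handler owns the string (assumed malloc'ed)
}
```

The message is copied into the log before it is freed, so the log entry
remains valid after the handler returns.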

In a context where there is no pointer to the parser object for the
current compilation, it can be retrieved via parser::GetCurrentparser. 
For instance, here is a possible definition of yyerror: 

	static void
	yyerror(char *msg) {
		parser::GetCurrentparser()->Error(parser_WARNING, msg); 
	}

In yacc and Bison, grammar rule actions may contain special macros to
control parse termination and error processing: yyclearin, yyerrok,
YYACCEPT, YYERROR, YYABORT.  These are supported in parsers built with
mkparserclass. 

    yyerrok - When the current action is completed, the error state will
    be cleared and normal parsing will resume.

    yyclearin - When the current action is completed, the pending input
    token is discarded, so a new one will be fetched before parsing
    proceeds.

    YYACCEPT - The current action terminates immediately and the entire
    parse also terminates, indicating success to the caller.

    YYABORT - The current action terminates immediately and the entire
    parse also terminates, indicating failure to the caller.

    YYERROR - The current action terminates immediately and the current
    reduction is treated as an error.  The parser enters the error state
    and continues scanning input until yyerrok is called by a rule or a
    rule containing 'error' as a token is reduced.  The parser ignores
    any yyerror calls in an action if the action terminates with YYERROR.

In yacc and Bison, YYSTYPE defaults to int.  Mkparserclass removes this
default;  the .gra file must have a type name YYSTYPE established by one
of these means: 

    Include a %union section in the grammar header in gggg.gra.
    #define YYSTYPE in gggg.gra or a file it #includes.
    Typedef YYSTYPE in gggg.gra or a file it #includes.

If a %union { ... } appears in the grammar, an appropriate declaration
for YYSTYPE will appear in gggg.C.  If the -d switch is given to Bison,
the same declaration will also appear in gggg.H. 


Application Code

In general, an application creates a parser object for a given grammar
with 'new': 

	class gggg *ggggparser = new class gggg; 

The parse itself is done by applying the Parse method of this new object: 

	ggggparser->Parse(lexer, lexer_rock); 

where the arguments specify the lexeme stream. 

A complete program might look like this:

	#include <andrewos.h>	/* in Andrew, this must
			precede gggg.H;  but the line is not needed
			in non-Andrew applications */
	. . . 
	#include <gggg.H>	/* class definition created by Bison
			and mkparserclass from gggg.gra */
	. . . 
	class gggg *ggggparser = new class gggg; 

	. . . /* modify ggggparser object.  For instance: */
	. . . ggggparser->SetRock(xxxxx); 

	if ((ggggparser)->Parse(lexer, lexer_rock) == parser_OK) {
		/* action for successful parse */
	}
	else {
		/* action for failed parse */
	}

The file gggg.H is generated by mkparserclass and includes the
declaration of class gggg as a class derived from class parser.  If the
-d switch was specified to Bison, the definitions it produces are
incorporated into gggg.H;  typically these are #defines for the token
numbers of the various terminal symbols.  (For tlex, the -r switch must
also be used.  This switch is currently only in the Andrew version of
Bison.) 

While parsing, the parser fetches each successive token by calling the
lexer provided as the first argument.  The second argument, lexer_rock,
is supplied as one of the arguments to the lexer.  The full type
expected of the lexer function is

	int lexer(void *lexrock, void *yylval)

A lexer routine can copy the semantic value of a token into *yylval,
which will have space for a value of type YYSTYPE.  Note that, if the
value is a pointer to an object, the pointer should be stored in *yylval
and not in yylval (which will disappear as the function returns). 
Suppose YYSTYPE is specified with %union:

	%union {int i; struct hunk *hunkptr; struct hunk v;}

Then, with yylval cast to (YYSTYPE *), a token with an integer semantic
value would store it with
	yylval->i = integer_value;
and a pointer currently in ((struct hunk *) h) would be stored as
	yylval->hunkptr = h;
An actual hunk value, hv, could be copied into yylval with
	yylval->v = hv;
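
A lexer of the shape described above might be sketched as follows.  All
names here are assumptions made for illustration: the YYSTYPE union, the
TOK_* token numbers (real ones come from Bison's output), the LexState
structure passed as the rock, and demo_lexer itself.  The sketch only
shows the casts from the two void * arguments and the storing of values
through the yylval pointer.

```cpp
#include <cassert>
#include <cctype>
#include <cstdlib>

// Hypothetical semantic value type, mirroring a %union in gggg.gra.
union YYSTYPE { int i; const char *text; };

// Hypothetical token numbers; real ones come from Bison's output.
enum { TOK_EOF = 0, TOK_NUMBER = 1, TOK_OTHER = 2 };

// Hypothetical lexer state, passed to the lexer as its rock.
struct LexState { const char *cursor; };

int demo_lexer(void *lexrock, void *yylval) {
    LexState *st = static_cast<LexState *>(lexrock);
    YYSTYPE *val = static_cast<YYSTYPE *>(yylval);
    assert(st != nullptr && val != nullptr);

    while (*st->cursor == ' ')              // skip blanks
        st->cursor++;
    if (*st->cursor == '\0')
        return TOK_EOF;
    if (std::isdigit((unsigned char)*st->cursor)) {
        char *end;
        val->i = (int)std::strtol(st->cursor, &end, 10);
        st->cursor = end;                   // advance past the digits
        return TOK_NUMBER;
    }
    val->text = st->cursor;                 // pointer stored in *yylval
    st->cursor++;
    return TOK_OTHER;
}
```

Each call advances the cursor kept in the rock, so the lexer itself
needs no static state and several parses can run from separate rocks.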

The lexer must return Bison token numbers rather than yacc numbers; 
yacc uses the first 256 values to indicate distinct ASCII characters,
but Bison does not.  In 'tlex', the Bison token numbers are acquired
from the gggg.tab.c output; other lexers can generate yacc token numbers
and then translate them: if the yacc token number is in t, the Bison
token number is

	(ggggparser)->TranslateTokenNumber(t)

Between the 'new class gggg' and the call to the Parse method, the
application can apply other methods to the ggggparser object:
EnumerateReservedWords to enter reserved words in a symbol table, SetKillVal to
handle error cleanup,  SetRock to store a pointer for use by semantic
action routines, and so on. 

The parser returns the maximum severity from among the severity values
passed to the Error method.  These values are, in increasing order,

	parser_OK       no error
	parser_WARNING  there was some minor problem
	parser_SERIOUS  compilation aborted, but scan continued
	parser_SYNTAX   same as SERIOUS, but due to a syntax error
	parser_FATAL    compilation could not continue


Parse-time stacks

A parser object has two stacks which are initially allocated at 500
elements, but grow if needed.  Use left recursion in grammars to avoid
requiring great stack depth.  Note that stack depth reflects the recall
complexity of the program to a person reading it;  consequently, a
grammar requiring a large stack is unlikely to describe a language that
people can feel comfortable with. 

The value stack contains copies of objects as returned by the lexer and
set in the action routines.  If these objects contain pointers to
"pointee" objects, the client is responsible for the memory occupied by
the pointees.  If the parser terminates early for a syntax error or
YYABORT, the pointee values can be deleted by supplying a KillVal
function;  to use f as the function, write

	ggggparser->SetKillVal(f)

After a syntax error and before discarding the stack, this function is
called for each value on the stack.  The killval function is also called
as states are popped for error recovery.  The call is

	(killvalfunction)(parseobject, value-pointer-from-stack)
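
A KillVal function might be sketched as follows.  The exact type
parser_killfptr is defined in parser.H and is not reproduced here;  the
two void * parameters are an assumption based on the call shape shown
above.  The sketch also assumes a hypothetical YYSTYPE in which every
stack value is either a pointer to a heap-allocated hunk or NULL, so the
function can always tell what to delete.

```cpp
#include <cassert>

struct hunk { int payload; };

// Hypothetical YYSTYPE: every value is a hunk pointer or NULL.
union YYSTYPE { struct hunk *hunkptr; };

static int hunks_freed = 0;     // bookkeeping for the illustration only

// Assumed signature: (parse object, pointer to a value on the stack).
void demo_killval(void * /*parseobject*/, void *value) {
    YYSTYPE *v = static_cast<YYSTYPE *>(value);
    assert(v != nullptr);
    if (v->hunkptr != nullptr) {
        delete v->hunkptr;      // release the pointee
        v->hunkptr = nullptr;   // guard against a second call
        hunks_freed++;
    }
}
```

With a real %union holding several member types, the function would need
some way to know which member is live, for example a tag stored with
each value.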


Mkparserclass

The mkparserclass script is invoked to produce a class declaration from
the gggg.tab.c file generated by Bison.  The process can also use the
gggg.tab.h file generated by Bison in response to the -d switch.  (For
use with tlex, Andrew Bison must be used and must also be given the
switches -r and -k.)

At minimum, mkparserclass has one argument, the prefix of the file names: 

	mkparserclass gggg

where the input files are gggg.tab.c and possibly gggg.tab.h. The
derived class will be named gggg and will be described in two generated
files--gggg.C and gggg.H. 

Mkparserclass may have one or two additional arguments, the name of the
Bison output .c file and the name of the .h file.   In any case, the
prefix is used to generate the name of the class and the names of the
generated files.


Compilation

In a Makefile, a .gra file is converted to .C and .H files via rules like

	gggg.C gggg.H: gggg.gra
		rm -f gggg.C gggg.H gggg.tab.c gggg.tab.h
		bison -b gggg -k gggg.gra
		mkparserclass gggg
	gggg.o: gggg.C

The resulting .C file from mkparserclass is compiled as a normal .C file
and linked together with other .o files for the application.  The .H
file is included by the client program. 

Andrew Bison can be given additional flags, among which are

	-d	defines - generates gggg.tab.h
	-r	raw - token numbers in gggg.tab.h are bison numbers 
	-v	verbose - generates gggg.output, useful for debugging

The -k and -r switches are implemented in the AUIS version of Bison.  -k
causes output of a few additional declarations.  (See bison.texinfo in
the Andrew distribution of Bison.) 

Bison's -l switch should NOT be used;  mkparserclass depends on the
#line directives in the file.  If necessary, these can be removed from
gggg.C with sed: 
	sed '/^#line/d' gggg.C > ,gggg.C; mv ,gggg.C gggg.C

In an Andrew Imakefile, the grammar is processed with a rule like: 

	Parser(gggg, flags)

where 'flags' is normally empty but can include the -d, -r, and -v flags
described above.  The result is to process gggg.gra to produce gggg.C
and gggg.H. 


Linking

The application must be linked with parser.o or a library containing it.
Parser.o is the result of compiling parser.C, from the same source
directory as mkparserclass.  In Andrew, parser.o is installed in
	$ANDREWDIR/lib/atk/libparser.a. 


Methods of the 'parser' class

virtual int Parse(parser_lexerfptr, void *lexerrock)
    Causes the parser to run to completion using the lexeme stream
    supplied as the arguments.  Returns one of the severity values,
    indicating the highest severity error encountered. 

static int ParseNumber(char *buf, long *plen, long *intval, double *dblval)
    Parses a number from buf and sets *plen to the number of characters
    recognized.  If intval is non-null, *intval is set to the number's
    integer value.  Similarly, if dblval is non-null, *dblval is set to
    the number's value as a double.  Returns 1 if syntactically an
    integer, 2 for a double, and 0 for a syntax error. 
    An integer is
        a zero followed by a string of octal digits,
        a non-zero digit followed by decimal digits,
        0x followed by a string of hexadecimal digits, or
        a character within apostrophes, possibly \-escaped. 
    A real is of the form
        [ddd][.][ddd][Epddd]
    where
        [...] indicates an optional part except that the
                complete number must have either . or Epddd
        ddd is a digit sequence  (one or more digits)
        p (sign) may be empty or '+' or '-'
        E (exponent indicator) may be 'e' or 'E'

static int TransEscape(char *buf, int *plen)
    When the call is made, buf holds a character sequence (at least
    three chars) that occurred after a backslash in a string.  The
    translation is returned as an int.  The number of characters used is
    returned in *plen.  (plen may be NULL.)

    The translations are a superset of C: 
        	 escape seq      :  translation
		 ------------------------
		 \\ \' \" \b \t	 :  as in C
		 \n \v \f \r	 :  as in C
		 \ddd		 :  octal digits, as in C
		 \?		 :  \177  (DEL)
		 \e		 :  \033  (ESC, ctl-[)
		 \^@		 :  \000  (NUL)
		 \^a ... \^z	 :  \001 ... \032  (ctl-a ... ctl-z)
		 \^[  \^\  \^]	 :  \033  \034  \035
		 \^^  \^_	 :  \036  \037
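
    The entries in the table that go beyond C can be sketched as
    follows.  This is not the code of TransEscape;  demo_trans_extra is
    a hypothetical helper that handles only \e, \? and the \^ forms,
    assumes ASCII, and returns -1 for anything else.  As in TransEscape,
    buf points just after the backslash.

```cpp
#include <cassert>

// Translate the escapes in the table that are not plain C escapes.
// Returns the translated character, or -1 if buf does not start with
// one of them.  *plen (if non-null) gets the number of characters used.
int demo_trans_extra(const char *buf, int *plen) {
    assert(buf != nullptr);
    int used = 0, result = -1;
    if (buf[0] == 'e') {
        result = 033;                   // ESC
        used = 1;
    } else if (buf[0] == '?') {
        result = 0177;                  // DEL
        used = 1;
    } else if (buf[0] == '^') {
        char c = buf[1];
        // '@', 'a'..'z', and the ASCII run  [ \ ] ^ _  are listed.
        bool listed = c == '@' || (c >= 'a' && c <= 'z')
                      || (c >= '[' && c <= '_');
        if (listed) {
            result = c & 037;           // keep the low five bits
            used = 2;
        }
    }
    if (plen != nullptr)
        *plen = used;
    return result;
}
```

    Masking with 037 maps each listed character onto the control code
    in the table, for example 'a' (0141) onto 001 and '[' (0133) onto
    033.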

virtual void Error(int severity, char *msg)
    Call this method to report an error.  It counts the number of
    errors, records the maximum severity, and then calls ErrorGuts for
    disposition. 
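
    The bookkeeping described for Error can be sketched as follows. 
    DemoTally and the demo_* constants are hypothetical stand-ins, not
    part of the parser class, and the sketch assumes that every call is
    counted, whatever its severity.

```cpp
#include <cassert>

// Stand-in severity values, in increasing order (real ones: parser.H).
enum { demo_OK, demo_WARNING, demo_SERIOUS, demo_SYNTAX, demo_FATAL };

// Tracks what GetNErrors and GetMaxSeverity would report.
struct DemoTally {
    int maxSeverity;
    int nErrors;
    DemoTally() : maxSeverity(demo_OK), nErrors(0) {}

    void Report(int severity) {
        assert(severity >= demo_OK && severity <= demo_FATAL);
        nErrors++;                      // count every report
        if (severity > maxSeverity)
            maxSeverity = severity;     // remember the worst so far
    }
};
```

    Because the severities are ordered, the maximum recorded this way is
    also a natural value for Parse to return.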

virtual void ErrorGuts(int severity, char *severityname, char *msg)
    Override this method to handle errors.  The default function prints
    an error message to stderr. 

virtual void EnumerateReservedWords(parser_enumresfptr handler, void *rock)
    The handler is called for each alphabetic reserved word:
                handler(rock, char *word, int tokennumber)
    It is not called for names beginning with "set";  for names
    beginning with "tok", only the rest of the name is passed. 
    Uppercase letters in token names are converted to lower, and vice
    versa. 
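
    A handler that enters each word in a symbol table might be sketched
    as follows.  The type parser_enumresfptr is defined in parser.H; 
    the void return type used here is an assumption, and SymbolTable
    and demo_enter_word are hypothetical names.  The table is passed as
    the rock.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical symbol table: reserved word -> token number.
typedef std::map<std::string, int> SymbolTable;

// Assumed handler shape: handler(rock, word, tokennumber).
void demo_enter_word(void *rock, char *word, int tokennumber) {
    assert(rock != nullptr && word != nullptr);
    SymbolTable *table = static_cast<SymbolTable *>(rock);
    (*table)[word] = tokennumber;   // the word is copied into the map key
}
```

    With a parser object in hand, the call would presumably take the
    form ggggparser->EnumerateReservedWords(demo_enter_word, &table), 
    provided the handler type matches parser_enumresfptr.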

virtual int TokenNumberFromName(char *name)
    Returns the token number corresponding to the string.  Typical strings:
        function   setID   tokNULL   'a'   ":="
    (Note the different quotes around the two kinds of character
    tokens.)  If the name is not found, returns 0. 

char TranslateTokenNumber(int x)
    Bison assigns different token numbers than yacc;  in particular, the
    first 256 do not correspond to the ASCII characters.  This function
    converts a yacc token number, x, into the token number required by
    Bison.

void SetRock(void *r)
    Sets the 'rock' value associated with the parser.  This value is
    then available in any context--lexical analysis, semantic action
    routine, or other--which has a pointer to the parser object.

void * GetRock()
    Returns the current 'rock' value. 

void SetKillVal(parser_killfptr kv)
    This function sets the killval function to kv.  The latter is called
    when value stack items are popped for errors.  See above. 

parser_killfptr GetKillVal()
    Returns the killval function. 

static class parser *GetCurrentparser()
    During any call to the Parse method, this function returns the
    current parser object.  This can be supplied as the object for the
    Error method.  

static boolean SetDebug(boolean value)
    Sets the debug flag to the given value;  value must be 0 or 1.
    Returns the prior value.

int GetErrorState()
    The error state is an integer indicating how many tokens must be
    successfully parsed before resuming correct parsing.  Usually this
    value is zero;  when a syntax error is detected, the value is set to
    three.  To clear the error state, an action invokes yyerrok;  to
    enter the error state, an action terminates with YYERROR. 
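
    The counter can be pictured with the following sketch. 
    DemoErrorState is a hypothetical model, not the parser's code, and
    it assumes, as in yacc and Bison, that each successfully shifted
    token reduces the count by one.

```cpp
#include <cassert>

// Hypothetical model of the value reported by GetErrorState.
struct DemoErrorState {
    int state;
    DemoErrorState() : state(0) {}

    void SyntaxError()  { state = 3; }              // error detected
    void TokenShifted() { if (state > 0) state--; } // one good token
    void ErrOk()        { state = 0; }              // effect of yyerrok
    bool Recovering() const { return state != 0; }
};
```

    In this model, three good tokens in a row end recovery, and yyerrok
    ends it at once.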

void SetMaxSeverity(int s) 
    Sets the value remembered as the maximum severity encountered.  It
    is preferable to do so by calling the Error method.  

int GetMaxSeverity()
    Returns the current maximum severity value.  

void SetNErrors(int n)
    Allows the application to set the number of errors encountered.  It
    is usually incorrect to call this function.  

int GetNErrors()
    Returns the number of errors that have been encountered in the
    current compilation.  

char **GetTokenNames()
    Returns a pointer to an array of all token names in order by token
    number.  

short GetNTokens()
    Returns the number of tokens in the grammar. 


Copyright 1992, 1995 Carnegie Mellon University.  All rights Reserved. 
$Disclaimer:  $