15-312 Recitation #1: Context-Free Grammars
2002-08-28
Joshua Dunfield (joshuad@cs)
Carnegie Mellon University
[Intro]
-----------------------------------------------------
GRAMMARS
-----------------------------------------------------
Much of this material is from Harper's lecture notes (see course web page).
DEF. A context-free grammar has three components:
1. an alphabet \Sigma of _terminals_ (or _letters_, or _tokens_)
2. a finite set N of _nonterminals_ that stand for the syntactic categories
3. a set P of productions
A ::= \alpha
where A is a nonterminal and \alpha is a string of terminals and
nonterminals.
Read "::=" as "can have the form".
For a given grammar, each nonterminal defines a _language_.
DEF. A _language_ is a set of strings over the alphabet \Sigma.
Example:
d ::= 0 | 1 | ... | 9 <-- abbreviates the ten productions
d ::= 0, d ::= 1, ..., d ::= 9
n ::= d | d n
How is the language of a nonterminal defined? It's the set of strings
that can be _derived_ from that nonterminal by _applying_ productions.
If we have a nonterminal x in a string s:
\alpha x \beta
we can apply a production x ::= \gamma
to get the string
\alpha \gamma \beta
\gamma may itself contain nonterminals, so we can repeat until we
get something without nonterminals.
What is L(d)?
We start with d.
d ::= 0 | 1 | ... | 9 (an abbreviation for d ::= 0, d ::= 1, ...)
We can apply any of these productions.
If we apply the first production, we get the derivation
d => 0
If we apply the second, we get d => 1. And so on.
So: L(d) is the 10 digits: {0, 1, 2, ..., 9}.
...Big deal. What is L(n)?
We start with n.
Can apply the first production n ::= d
n => d
and then apply one of the d ::= productions:
n => d => 0
or
n => d => 5
or ...
Clearly, then, L(n) contains L(d).
Or can apply the 2nd production n ::= d n
n => d n
Now we pick a nonterminal on the right hand side, say d:
n => d n => 3 n
Only one nonterminal is left now, namely n. Again we can apply either
n ::= d or n ::= d n.
Let's apply n ::= d.
n => d n => 3 n => 3 d
and choose a d-production:
n => d n => 3 n => 3 6
So. What is L(n)?
Numerals for the natural numbers, with leading zeros allowed (that is,
all nonempty strings of digits):
{0, 1, 2, ..., 9, 00, 01, ..., 10, 11, ..., 1313, ...}
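One way to check a guess about L(n) is to enumerate its short strings
by brute force. The sketch below is an illustration under stated
assumptions, not a method from the notes: the dict GRAMMAR and the
function language_up_to are hypothetical names, and the length-based
pruning is only valid because no production here has an empty
right-hand side.

```python
from collections import deque

GRAMMAR = {
    "d": [[str(i)] for i in range(10)],   # d ::= 0 | 1 | ... | 9
    "n": [["d"], ["d", "n"]],             # n ::= d | d n
}

def language_up_to(start, max_len):
    """All terminal strings of length <= max_len derivable from `start`.

    Assumes no production has an empty right-hand side, so a sentential
    form never gets shorter and over-long forms can be discarded.
    """
    seen = {(start,)}
    queue = deque(seen)
    result = set()
    while queue:
        form = queue.popleft()
        # Expanding only the leftmost nonterminal is enough: every
        # derivable string has a leftmost derivation.
        pos = next((i for i, s in enumerate(form) if s in GRAMMAR), None)
        if pos is None:                   # no nonterminals left
            result.add("".join(form))
            continue
        for rhs in GRAMMAR[form[pos]]:
            new = form[:pos] + tuple(rhs) + form[pos + 1:]
            if len(new) <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return result

short = language_up_to("n", 2)
```

Here short holds the 10 one-digit and 100 two-digit strings, including
ones with a leading zero such as 01.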
?: Give a derivation of 1313 (which happens to be my office number).
We define the syntax of a programming language with a context-free
grammar, designate a nonterminal (say "p" for program) as the start
symbol, and are interested in the _recognition_ question:
Given a string \alpha of length n, is there a derivation
p => ... => ... => \alpha
?
This is decidable and tractable, in O(n^3) time, for any context-free
grammar. (In a grammar that is not context-free, the left-hand side
of a production may contain other symbols besides a single
nonterminal.) In practice, we use more efficient (linear time)
techniques that work only on certain grammars.
One such linear technique is recursive descent, a very simple method
that will suffice for this class. Real compilers often use an
automatically-generated parser; _parser generators_ take context-free
grammars (with some restrictions, to ensure linear time) and emit a
parser, eliminating most of the drudgework.
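To make the cubic bound concrete, here is a sketch of one well-known
O(n^3) recognizer, the CYK algorithm. The notes do not name a
particular algorithm, so this choice is an assumption. CYK needs the
grammar in Chomsky normal form (every production is A ::= a or
A ::= B C), so the digit grammar is rewritten by hand below: the
production n ::= d is replaced by n ::= 0 | ... | 9. The names cyk,
UNIT, and PAIR are hypothetical.

```python
def cyk(tokens, unit_rules, pair_rules, start):
    """CYK recognition for a grammar in Chomsky normal form.

    unit_rules: set of (A, a)    for productions A ::= a
    pair_rules: set of (A, B, C) for productions A ::= B C
    """
    n = len(tokens)
    if n == 0:
        return False
    # table[i][l] = set of nonterminals that derive tokens[i : i + l]
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        table[i][1] = {a for (a, t) in unit_rules if t == tok}
    for length in range(2, n + 1):            # span length
        for i in range(n - length + 1):       # span start
            for split in range(1, length):    # length of the left part
                left = table[i][split]
                right = table[i + split][length - split]
                for (a, b, c) in pair_rules:
                    if b in left and c in right:
                        table[i][length].add(a)
    return start in table[0][n]

DIGITS = "0123456789"
# Hand-converted normal form of  d ::= 0 | ... | 9   n ::= d | d n
UNIT = {("d", ch) for ch in DIGITS} | {("n", ch) for ch in DIGITS}
PAIR = {("n", "d", "n")}

ok = cyk(list("1313"), UNIT, PAIR, "n")
```

The three nested loops over length, start, and split are where the
n^3 comes from; the innermost loop over productions adds a factor that
depends only on the grammar.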
A derivation can be read as a tree [do an example].
Note that there may be *more* than one parse tree for a particular
string derived from p. (Derivations that differ only in the order in
which nonterminals are expanded yield the same tree, so we count
trees, or equivalently leftmost derivations.) If any string has more
than one parse tree from a particular nonterminal, we say the grammar
is _ambiguous_. In practice, we care very much about the particular
parse tree of a string, not just whether it's syntactically correct,
and ambiguities must be eliminated from the grammar, or resolved in
some way.
Example:
Syntax of a subset of ML types
Alphabet: int, list, ->, (, )
Example strings of the language:
int
int list
int list list
int -> int
int -> int list
( int -> int ) list
( int -> int -> int ) list
Note the right associativity of ->: int -> int -> int means
int -> ( int -> int ).
Grammar:
Base types b ::= int
Tycons c ::= b | t list
Types t ::= c | t -> t
Two problems:
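1. Ambiguous. Examples: int -> int -> int
(grouped either as ( int -> int ) -> int or as int -> ( int -> int )),
int -> int list
(grouped either as ( int -> int ) list or as int -> ( int list )).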
2. Doesn't allow parentheses.
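The ambiguity can be checked by counting parse trees. The following is
a sketch under assumptions, not something from the notes: the function
count_parse_trees is a hypothetical helper, hard-wired to the grammar
above, that counts the trees for each span of the token list and
memoizes the results.

```python
from functools import lru_cache

def count_parse_trees(tokens):
    """Number of parse trees for `tokens` from t in the grammar
         b ::= int      c ::= b | t list      t ::= c | t -> t
    """
    tokens = tuple(tokens)

    # Each function counts trees deriving exactly tokens[i:j].
    @lru_cache(maxsize=None)
    def b(i, j):
        return 1 if j - i == 1 and tokens[i] == "int" else 0

    @lru_cache(maxsize=None)
    def c(i, j):
        total = b(i, j)                       # c ::= b
        if j - i >= 2 and tokens[j - 1] == "list":
            total += t(i, j - 1)              # c ::= t list
        return total

    @lru_cache(maxsize=None)
    def t(i, j):
        total = c(i, j)                       # t ::= c
        for k in range(i + 1, j - 1):         # t ::= t -> t, arrow at k
            if tokens[k] == "->":
                total += t(i, k) * t(k + 1, j)
        return total

    return t(0, len(tokens))

for s in ["int list list", "int -> int -> int", "int -> int list"]:
    print(s, ":", count_parse_trees(s.split()))
```

A count greater than 1 for any string shows the grammar is ambiguous;
both example strings from problem 1 have two trees, while
int list list has one.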
Solution to 1: We want -> to be right associative, as it is in ML.
Change t ::= t -> t
to t ::= c -> t
And we want " list" to bind more tightly than "->".
Change c ::= b | t list
to c ::= b | c list
Solution to 2: Add a b-production (or a c-production; it doesn't matter)
b ::= int | ( t )
We're encoding the levels of precedence into the grammar -- a standard
technique.
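For the curious, the revised grammar

    b ::= int | ( t )      c ::= b | c list      t ::= c | c -> t

is simple enough for a recursive-descent parser. The sketch below is
an illustration only: parse_type and the nested-tuple tree shape are
hypothetical, the input is assumed to be already split into tokens,
and the left-recursive production c ::= c list is handled with a loop
(a b followed by zero or more "list"), which is a standard workaround
rather than a literal transcription of the grammar.

```python
def parse_type(tokens):
    """Parse a token list as a type t and return a nested-tuple tree."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expect(tok):
        nonlocal pos
        if peek() != tok:
            raise SyntaxError(
                f"expected {tok!r} at position {pos}, found {peek()!r}")
        pos += 1

    def parse_b():                  # b ::= int | ( t )
        if peek() == "(":
            expect("(")
            tree = parse_t()
            expect(")")
            return tree
        expect("int")
        return "int"

    def parse_c():                  # c ::= b | c list, read as b list list ...
        tree = parse_b()
        while peek() == "list":
            expect("list")
            tree = ("list", tree)
        return tree

    def parse_t():                  # t ::= c | c -> t  (right associative)
        left = parse_c()
        if peek() == "->":
            expect("->")
            return ("->", left, parse_t())
        return left

    tree = parse_t()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {peek()!r} at position {pos}")
    return tree

tree = parse_type("int -> int -> int".split())
```

There is one function per nonterminal, and the order in which they
call each other (t calls c, c calls b) mirrors the precedence levels
encoded in the grammar.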
?: In the programming language Haskell, a list of integers would be written
"[int]", a list of a list of integers "[[int]]", and so forth. Modify
the grammar accordingly.
?: Add * (ML's product type constructor, as in int * int) to the
grammar.
Note: Conformance to a CFG is a very rough first approximation of
what it means for a program to be valid; for example, CFGs can't
enforce the restriction that variables be declared before use,
much less enforce type safety.
(Aside: CFGs also can't describe some apparently-simple languages.
{a^n b^n c^n | n > 0} = {abc, aabbcc, aaabbbccc, ...} is not
generated by any CFG. Rather than formulate the (surprisingly
complicated) context-sensitive grammar that generates this language
and write a parser for that grammar, I would simply use a CFG
S ::= A B C
A ::= a | a A
B ::= b | b B
C ::= c | c C
do context-free parsing, count the a's, b's, and c's, and reject the
string if the counts differ.
Likewise, while some other restrictions (like declaration before use)
can be enforced with more expressive (less restricted) grammar
formalisms, it gets very ugly.
To see what I mean, go to E & S and dig out
A. van Wijngaarden et al. Revised report on the algorithmic
language ALGOL 68, Acta Informatica 5:1-3 (1975)
which uses a "two-level" grammar.)
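The parse-then-count approach from the first aside can be sketched in
a few lines of Python. This is an illustration under assumptions: the
function name is_anbncn is hypothetical, and because the language of
S ::= A B C happens to be regular (a+ b+ c+), a regular expression
stands in for the context-free parser.

```python
import re

def is_anbncn(s):
    """True iff s is a^n b^n c^n for some n > 0."""
    # Step 1: shape check, one or more a's, then b's, then c's.
    m = re.fullmatch(r"(a+)(b+)(c+)", s)
    if m is None:
        return False
    # Step 2: the counting check that no CFG can express.
    a_run, b_run, c_run = m.groups()
    return len(a_run) == len(b_run) == len(c_run)

for s in ["abc", "aabbcc", "aabbc", "abcabc"]:
    print(s, is_anbncn(s))
```

The same split, a grammar for the rough shape plus a separate pass for
everything else, is how declaration-before-use and type checking are
usually handled.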
(Another, course-related aside: You may notice that I haven't told you
how to write a parser. This is intentional; syntax is *not* the focus
of 312. We will not ask you to write a parser, though we probably
will ask you to modify a parser you're given. If we do, that will
certainly be the easy part of the assignment...)
--------------------------------------------------
YOU SHOULD KNOW...
--------------------------------------------------
1. The definition of a CFG
2. How to derive strings from a nonterminal
3. Given a nonterminal n and a CFG, how to figure out L(n)
4. How to design a CFG for a language L, given a description of L
5. How to show that a grammar is ambiguous