15-312 Recitation #1: Context-Free Grammars
2002-08-28
Joshua Dunfield (joshuad@cs)
Carnegie Mellon University

[Intro]

-----------------------------------------------------
GRAMMARS
-----------------------------------------------------

Much of this material is from Harper's lecture notes (see course web
page).

DEF. A context-free grammar has three components:

  1. an alphabet \Sigma of _terminals_ (or _letters_, or _tokens_)
  2. a finite set N of _nonterminals_ that stand for the syntactic
     categories
  3. a set P of productions

         A ::= \alpha

     where A is a nonterminal and \alpha is a string of terminals and
     nonterminals. Read "::=" as "can have the form".

For a given grammar, each nonterminal defines a _language_.

DEF. A _language_ is a set of strings over the alphabet \Sigma.

Example:

   d ::= 0 | 1 | ... | 9       <-- Abbreviates d ::= 0
                                               d ::= 1
                                               ...
   n ::= d | d n

How is the language of a nonterminal defined? It's the set of strings
that can be _derived_ from that nonterminal by _applying_ productions.

If we have a nonterminal x in a string s:

   \alpha x \beta

we can apply a production

   x ::= \gamma

to get the string

   \alpha \gamma \beta

\gamma may itself contain nonterminals, so we can repeat until we get
something without nonterminals.

What is L(d)?

We start with d.

   d ::= 0 | 1 | ... | 9       abbreviation for d ::= 0, d ::= 1, ...

Can apply any of these productions. If we apply the first production,
we get the derivation

   d => 0

If we apply the second, we get d => 1. And so on.

So: L(d) is the 10 digits: {0, 1, 2, ..., 9}. ...Big deal.

What is L(n)?

We start with n. Can apply the first production n ::= d

   n => d

and then apply one of the d ::= productions:

      n => d => 0
   or n => d => 5
   or ...

Clearly, then, L(n) contains L(d).

Or can apply the 2nd production n ::= d n

   n => d n

Now we pick a nonterminal on the right hand side, say d:

   n => d n => 3 n

Only one choice now. Again we can apply either n ::= d or n ::= d n.
Let's apply n ::= d.

   n => d n => 3 n => 3 d

and choose a d-production:

   n => d n => 3 n => 3 6

So. What is L(n)? The natural numbers, written as nonempty strings of
digits (leading zeros allowed):

   {0, 1, 2, ..., 9, 00, 01, 02, ..., 10, 11, ..., 1313, ...}

?: Give a derivation of 1313 (which happens to be my office number).

We define the syntax of a programming language with a context-free
grammar, designate a nonterminal (say "p" for program) as the start
symbol, and are interested in the _recognition_ question:

   Given a string \alpha of length n, is there a derivation

      p => ... => ... => \alpha ?

This is decidable and tractable -- O(n^3) -- for any context-free
grammar. (In a grammar that is not context-free, the left side of a
production may contain other symbols besides the single nonterminal.)

In practice, we use more efficient (linear time) techniques that work
only on certain grammars. One such linear technique is recursive
descent, a very simple method that will suffice for this class. Real
compilers often use an automatically-generated parser; _parser
generators_ take context-free grammars (with some restrictions, to
ensure linear time) and emit a parser, eliminating most of the
drudgework.

A derivation can be read as a tree [do an example].

Note that there may be *more* than one parse tree for a particular
string derived from p. (Derivations that differ only in the order in
which nonterminals are expanded give the same tree, so what we count
is parse trees or, equivalently, leftmost derivations.) If any string
has more than one parse tree from a particular nonterminal, we say the
grammar is _ambiguous_.

In practice, we care very much about the particular parse
tree/derivation of a string, not just whether it's syntactically
correct, and ambiguities must be eliminated from the grammar, or
resolved in some way.

Example: Syntax of a subset of ML types

Alphabet: int, list, ->, (, )

Example strings of the language:

   int
   int list
   int list list
   int -> int
   int -> int list
   ( int -> int ) list
   ( int -> int -> int ) list

Note right associativity of ->.

Grammar:

   Base types   b ::= int
   Tycons       c ::= b | t list
   Types        t ::= c | t -> t

Two problems:

  1. Ambiguous. Examples: int -> int -> int, int -> int list.
  2. Doesn't allow parentheses.
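
Problem 1 can be checked mechanically. The following is a minimal
sketch in Python (not part of the course materials; the function name
count_parse_trees is made up for illustration) that counts the parse
trees of a token list from t under the three-line grammar above. It
assumes the input is already split into the tokens int, list, and ->.
A count greater than 1 means the string witnesses the ambiguity.

```python
from functools import lru_cache


def count_parse_trees(tokens):
    """Count parse trees of `tokens` from nonterminal t under the
    (ambiguous, parenthesis-free) grammar

        b ::= int
        c ::= b | t list
        t ::= c | t -> t

    Hypothetical helper for illustration only.
    """
    toks = tuple(tokens)

    def b(i, j):
        # b ::= int  matches exactly the one-token span "int"
        return 1 if j - i == 1 and toks[i] == "int" else 0

    @lru_cache(maxsize=None)
    def c(i, j):
        if i >= j:
            return 0
        total = b(i, j)                      # c ::= b
        if toks[j - 1] == "list":            # c ::= t list
            total += t(i, j - 1)
        return total

    @lru_cache(maxsize=None)
    def t(i, j):
        if i >= j:
            return 0
        total = c(i, j)                      # t ::= c
        for k in range(i + 1, j - 1):        # t ::= t -> t, split at k
            if toks[k] == "->":
                total += t(i, k) * t(k + 1, j)
        return total

    return t(0, len(toks))


print(count_parse_trees("int -> int -> int".split()))  # 2
print(count_parse_trees("int -> int list".split()))    # 2
print(count_parse_trees("int list list".split()))      # 1
```

The two trees for int -> int -> int correspond to grouping the arrows
to the left or to the right; the two trees for int -> int list
correspond to applying list to the right-hand int only or to the whole
arrow type.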

Solution to 1:

   We want -> to be right associative, as it is in ML.

      Change   t ::= t -> t
      to       t ::= c -> t

   And we want " list" to bind more tightly than "->".

      Change   c ::= b | t list
      to       c ::= b | c list

Solution to 2:

   Add a b-production (or a c-production; it doesn't matter)

      b ::= int | ( t )

We're encoding the levels of precedence into the grammar -- a standard
technique.

?: In the programming language Haskell, a list of integers would be
   written "[int]", a list of a list of integers "[[int]]", and so
   forth. Modify the grammar accordingly.

?: Add *

Note: Conformance to a CFG is a very rough first approximation of what
it means for a program to be valid; for example, CFGs can't enforce
the restriction that variables be declared before use, much less
enforce type safety.

(Aside: CFGs also can't recognize some apparently-simple languages.

   {a^n b^n c^n | n > 0} = {abc, aabbcc, aaabbbccc, ...}

is not recognizable by a CFG. Rather than formulate the (surprisingly
complicated) context-sensitive grammar that recognizes this language
and write a parser for that grammar, I would simply use a CFG

   S ::= ABC
   A ::= a | aA
   B ::= b | bB
   C ::= c | cC

--do context-free parsing, count the a's, b's, and c's, and reject the
string if the counts differ.

Likewise, while some other restrictions (like declaration before use)
can be enforced with less restricted (more powerful) kinds of grammar,
it gets very ugly. To see what I mean, go to E & S and dig out

   A. van Wijngaarden et al.
   Revised report on the algorithmic language ALGOL 68,
   Acta Informatica 5:1-3 (1975)

which uses a "two-level" grammar.)

(Another, course-related aside: You may notice that I haven't told you
how to write a parser. This is intentional; syntax is *not* the focus
of 312. We will not ask you to write a parser, though we probably will
ask you to modify a parser you're given. If we do, that will certainly
be the easy part of the assignment...)

--------------------------------------------------
YOU SHOULD KNOW...
--------------------------------------------------

  1. The definition of a CFG
  2. How to derive strings from a nonterminal
  3. Given a nonterminal n and a CFG, how to figure out L(n)
  4. How to design a CFG for a language L, given a description of L
  5. How to show that a grammar is ambiguous
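
For items 2 and 3, it can help to enumerate the short strings of L(n)
by brute force and compare them with your guess. Below is a minimal
sketch in Python (not part of the course materials; language_up_to and
the dictionary encoding of grammars are assumptions made for this
illustration). It repeatedly expands the leftmost nonterminal and
keeps every all-terminal string of at most max_len symbols. It assumes
no production has an empty right-hand side, so a string in progress
never gets shorter and anything longer than max_len can be discarded.

```python
from collections import deque


def language_up_to(grammar, start, max_len):
    """Return the terminal strings (as tuples of tokens) of length
    <= max_len derivable from `start`.

    `grammar` maps each nonterminal to a list of right-hand sides,
    each a list of symbols. Assumes no right-hand side is empty.
    Hypothetical helper for illustration only.
    """
    seen = {(start,)}
    queue = deque([(start,)])
    result = set()
    while queue:
        form = queue.popleft()
        # Position of the leftmost nonterminal, if any remains.
        idx = next((i for i, s in enumerate(form) if s in grammar), None)
        if idx is None:
            result.add(form)          # no nonterminals: a derived string
            continue
        for rhs in grammar[form[idx]]:
            # Apply the production  form[idx] ::= rhs
            new = form[:idx] + tuple(rhs) + form[idx + 1:]
            if len(new) <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return result


# The digit grammar:  d ::= 0 | 1 | ... | 9     n ::= d | d n
digits = {
    "d": [[str(k)] for k in range(10)],
    "n": [["d"], ["d", "n"]],
}

short = language_up_to(digits, "n", 2)
print(len(short))             # 110: ten 1-digit and a hundred 2-digit strings
print(("0", "1") in short)    # True: leading zeros are in L(n)
```

The same function works on the revised type grammar (b, c, t with
parentheses); listing its strings of length at most 3 is a quick
sanity check that the grammar generates what you intended.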