Outline for Lecture, 12/2/97 o formal languages o regular languages o regular expressions Formal Languages An _alphabet_ is a finite set of characters. examples: {0,1} {a,b} {a,b,c,d,e,...,z} A _string_ is a finite sequence of characters from an alphabet. examples: 01101 baabaa tartans (empty string) <- the string that contains no characters A _language_ is a set of strings over some alphabet. examples: {0,01,10,11,100} <- a finite language, with 5 strings {0,00,000,0000,00000,...} <- a language that is not finite {ab,aabb,aaabbb,...} (empty language) <- the language that contains no strings {strings whose lengths are prime} {(empty string),a,t,at} <- the (empty string) can be a string in a language Here's something curious: The (empty string) does not belong to the (empty language)! Regular Languages We say that a string x is _accepted_ by a finite automaton M if, after reading x, M is left in a special accepting state. (Otherwise, x is _rejected_ by M.) example: M ---- 0 ---- 0 ---- __ start \ | |--------->| /\ |--------->| | | 1 state / | | | \/ |<---------| |<-' ---- ---- 0 ---- \ ^ | 1 \ |_| 1 _\| ---- ^ | /\ | | | \/ | <---\ ---- \ ^ | accepting states |_| 0,1 (This machine has two accepting states, and two non-accepting states.) M accepts 1, 101, 111, any string starting with a 1 M also accepts 0, 01, 011000, any string with an odd number of 0's Let us define L_M to be the language of strings that M accepts. L_M = {x | M accepts x} We say that M accepts L_M. A language is called _regular_ if there is a finite automaton M that accepts it. examples: {strings starting with 1 or having an odd # of 0's} (empty string) Here's the machine: ---- __ \| | | 0,1 /| |<-' ---- {all strings} Here's the machine: ---- __ \| /\ | | 0,1 /| \/ |<-' ---- {strings with an even number of 0's after the first two 1's} Here's the machine: 0 1 1 _ _ _ | | | | | | | v | v | v ---- 1 ---- 1 ---- 0 ---- __ start \ | |--------->| |--------->| /\ |--------->| | | 1 state / | | | | | \/ |<---------| |<-' ---- ---- ---- 0 ---- seen seen first second 1 1 How about the following language: {strings of balanced parenthesis} Is this a regular language? NO! Why not? Because a finite automaton can only remember a finite number of things. In particular, if the finite automaton has k states, then it can only remember k different things. So, it won't be able to keep track of the number of unmatched left parenthesis correctly when there are many of them. Here's another: {strings with the same number of 0's and 1's} (Same problem.) On the positive side, all finite languages are regular. example: {avr, bru, se} ---- a ---- v ---- r ---- start \ | |--------->| |--------->| |--------->| /\ | state / | | | | | | | \/ | ---- ---- ---- ---- | | | | b ---- r ---- u ---- | \---------->| |--------->| |--------->| /\ | | | | | | | \/ | | ---- ---- ---- | | s ---- e ---- \------------>| |--------->| /\ | | | | \/ | ---- ---- all transitions not shown lead to the following state, from which there is no escape: ---- | | | | ---- Theorem: The complement of any regular language is also regular. Proof: For a language L_M accepted by a machine M, make each accepting state in M a rejecting state, and each rejecting state an accepting state. Theorem: The union and intersections of two regular languages are also regular. Idea behind the proof: Build a new machine that "simulates" the machines for both languages simultaneously. The new machine will have one state for each pair of states taken from the first and second machine. Regular Expressions (another way to define regular languages) A regular expression can be formed as follows: Basis rules: 1) a <- a single character is a regular expression 2) (empty string) <- the empty string is a regular expression Each regular expression denotes a language. In these two examples, the regular languages are {a} and {(empty string)} Recursive rules: If R_1 and R_2 are regular expressions, then so are 3) R_1 R_2 <- concatentation examples: R_1 = a, R_2 = b, R_1 R_2 = ab, language = {ab} R_1 = ab, R_2 = b, R_1 R_2 = abb, language = {abb} 4) R_1 | R_2 <- union (read this as R_1 or R_2) R_1 = a, R_2 = b, R_1 | R_2 = a | b, language = {a, b} (a|b)(c|d), language = {ac, ad, bc, bd} (a|(empty string)) a, language = {aa,a} 5) R_1* <- closure, zero or more strings from R_1 concatenated together 6) (R_1) <- parentheses for grouping of operators examples: a* , language = {(empty string),a,aa,aaa,aaaa,aaaaa,...} (a|b)*, language = {(empty string), a, b, aa, ab, ba, bb, aaa, ...} Theorem: A language L can be described by a finite regular expression if and only if there is a finite automaton M that accepts L. Proof: Not in this course. One last example: Suppose we want to write a regular expression that denotes the language: {strings with an even number of 0's} 1*(01*0)*1* Here's the machine the accepts it: ---- 0 ---- start \ | /\ |--------->| | state / | \/ |<-------- | | ---- 0 ---- | ^ | ^ |_| |_| 1 1 How about an even number of 0's or an even number of 1's ? Here's the machine. ---- 0 ---- start \ | /\ |--------->| /\ | state / | \/ |<-------- | \/ | ---- 0 ---- |^ |^ 1 || 1 1 || 1 v| v| ---- 0 ---- | /\ |--------->| | | \/ |<---------| | ---- 0 ---- Coming up with a regular expression is a bit harder! Practical regular expressions: Many high-level programming languages have built-in processing of regular expressions. (Examples: perl and emacs-lisp) These give a very useful way to do string-matching. Here's the notation used in Perl. x match the character x (for non-special chars) [0-9] matches any character in the given range * zero or more iterations of the previous + one or more iterations of the previous . matches any character \ the quote symbol. so "\." matches "." only | the or operator ^ beginning of the string $ end of the string () parentheses So, for example, how would you identify an IP address in string form? It consists of four positive numbers separated by dots. In perl this would be: ^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$ Somebody pointed out that each number in an IP address must be in the range [0-255]. We could recognize such a number with the following regular expression: [0-9] | [1-9][0-9] | 1[0-9][0-9] | 2[0-4][0-9] | 25[0-6] Actually, in perl, "\d" is shorthand for [0-9], so the expression could be abbreviated significantly. To represent any string with an even number of zeros you can write: ^(1*01*01*)*$