Okay, so I've fleshed out and changed the syntax a bit from my earlier
note, and recast examples from the first few chapters of Paulson's book
into this syntax.  Parts of it still seem less than ideal to me, and some
of the examples use syntax that I'm not sure should actually be
implemented.  You should look at

    http://www.cs.cmu.edu/~kw/paulson.txt
    http://www.cs.cmu.edu/~kw/mlnext.txt

for examples, maybe before reading this.

Parsing
~~~~~~~

The basic grammar is the same for terms, types, and patterns, which
simplifies parsing, and (I think) promotes a pleasing regularity, but
maybe that's my Elf bias showing.  Not every term construct is legal as a
pattern or type, of course (and vice versa), and some keywords have
different (but related) meanings at the three levels.

Lexical syntax
..............

The lexical syntax distinguishes "proper" from "common" identifiers.  You
should think (roughly):

    Proper                          Common

    recursive let name              non-recursive let val
    never bound by patterns         always bound by patterns
    stands for possibly             stands for pure value
      effectful computation
    includes datatype and           includes type variables
      constructor names
    can be symbolic                 never symbolic

I usually call a "proper" identifier a NAME and a "common" identifier a
VARIABLE.

The characters allowed in alphabetic identifiers are [A-Za-z0-9.?!_'],
and an alphabetic identifier is "proper" if it starts with [A-Z0-9.?!].

** QUESTION: This is exactly the opposite of the Prolog world (but
** it is the same convention as the Mathematica world).  Is there
** anyone who would argue for the other way?

One exception: identifiers that start with "'" must obey the syntax for
character literals.

A very few tokens that would otherwise be "common" identifiers are
reserved keywords instead:

    of  if  with  where  fn  raise  _

A few symbolic identifiers (see below) are also reserved keywords:

    ( ) { ; } ` , | := == -> => && ||

Numeric, character, and string literals are thought of as proper
identifiers that are present in the initial environment and can never be
shadowed.  If a proper identifier looks like a numeric literal, then it
is; otherwise it's not.  (So "1st" is legal and is not a numeric literal
or anything like that.)  If an identifier looks qualified (e.g.
"Module.Thing") then it is.  But if it uses periods in such a way that it
doesn't look qualified, then it isn't (the most important examples being
the base case "." for lists, and floating point literals like "3.14159").

** QUESTION: how to fit qualified symbolic identifiers into this model?
** They are so rare that they shouldn't influence the rest of the design,
** but in cases where they're unavoidable there should be some hack.

The double quote can only be used to form string literals (see below).
"Special" characters are [][)(}{;,`] and always form a symbolic
identifier (actually all but [ ] are reserved keywords) by themselves.
The hash "#" is for comments and pragmas, in Elf style (except that Elf
uses "%" instead).  Any other printable ASCII character is a symbolic
identifier constituent (so "~&$<>%%" is a legal symbolic identifier), and
symbolic identifiers are always "proper".
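To summarize the classification rule for alphabetic identifiers, here is
a minimal Haskell sketch (the function name, and the decision to classify
by first character only, are mine, not part of the design):

    import Data.Char (isUpper, isDigit)

    -- Hedged sketch of the lexer rule above: an alphabetic identifier
    -- is "proper" iff its first character is in [A-Z0-9.?!].
    isProper :: String -> Bool
    isProper (c:_) = isUpper c || isDigit c || c `elem` ".?!"
    isProper []    = False

    -- isProper "Foo" == True      isProper "foo" == False
    -- isProper "1st" == True      isProper "_x"  == False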
Alphabetic identifiers are never infix or prefix.  Symbolic identifiers
always are.  A use of a symbolic identifier is infix if it looks infix,
or prefix if it looks prefix:

    - 1         Prefix "-"
    1 - 2       Infix "-"

The prefix and infix forms of a symbolic identifier are completely
independent of each other; they can have different types, and different
bindings in the environment (as indeed the two forms of "-" generally do
in all sane programming languages).  If you must destroy the fixity of a
symbolic identifier you can do it by saying e.g.

    (f -> f 1) (#prefix -)      Same as "- 1"
    (f -> f 1 2) (#infix -)     Same as "1 - 2"

There is no way of declaring operator precedences or associativities.
The infix form of a symbolic identifier is right-associative if it ends
with ":", or left-associative otherwise.  The simplest case ":" itself is
the usual right-associative list constructor.

** ASIDE: This can actually be seen as a special case of a convention
** going all the way back to the Principia of Russell and Whitehead (I
** think--I saw it first in Church's paper introducing his simple theory
** of types)

Prefix operators have the highest precedence, followed by left infix
operators, then right infix operators.

** QUESTION: We are giving up the ability to write "1 + 2 * 3" instead of
** "1 + (2 * 3)", and in exchange we don't have to deal with precedence
** declarations--the parsing of an expression is independent of its
** environment.  Is this trade worth it?

Keywords that happen to look like symbolic identifiers (e.g. "->" "&&"
etc.) NEVER have the same precedence as operators; it's always lower.
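To make the consequences concrete, here are a few parses under these
rules (the operators other than ":" are ordinary symbolic identifiers
picked for illustration):

    1 + 2 * 3       parses as   (1 + 2) * 3    # left infix ops share one precedence
    - x + y         parses as   (- x) + y      # prefix binds tightest
    a * b : c       parses as   (a * b) : c    # left infix binds tighter than right
    a : b : c       parses as   a : (b : c)    # ops ending in ":" are right infix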
Any "applicative expression" (see grammar below) can be treated as an
infix identifier by surrounding it with ``:

    x `Elem` set
    foo `Xyzzy 1` bar

These are always left-associative, and have the same precedence as other
left-associative symbolic identifiers.

Grammar
.......

It would be best to read the examples from Paulson and only then look at
the formal grammar (if ever).  The atomic expressions are:

    exp0 ::= VARIABLE
             NAME
             INTEGER
             STRING
             CHAR
             FLOAT
             fn
             _
             ( exp )
             { funrules }       General lambda abstraction
             { exp }            Simple lambda abstraction

The rest of the core expression grammar goes (from tightest to loosest
binding):

    Applicative    exp1 ::= exp1 exp0 | exp0
    Prefix op      exp2 ::= PREFIX_OP exp2 | exp1
    Left op        exp3 ::= exp3 INFIX_LEFT exp2 | exp2
    Right op       exp4 ::= exp3 INFIX_RIGHT exp4 | exp3
    Conjunction    exp5 ::= exp5 && exp4 | exp4
    Disjunction    exp6 ::= exp6 || exp5 | exp5
    Comma          exp7 ::= exp7 , exp6 | exp6
    Simple rule    exp8 ::= exp7 -> exp | exp7 => exp | exp7

Note that the ambiguity in -> and => is resolved by making them extend as
far to the right as possible, as usual.  Finally, some constructs of
least possible precedence:

    exp ::= raise exp                   Exception
            exp8 if exp8 | exp          Conditional
            exp8 with { withrules }     Case analysis
            exp8 with withrule          ditto
            exp8 where { decls }        Local binding
            exp8 where decl             ditto
            exp8

Again where there's ambiguity, these extend as far to the right as
possible.  The "funrules" and "withrules" differ only in how the patterns
on the left-hand sides are interpreted, not syntactically.  Their common
syntax is

    rules ::= ; rules
              rule rules

    rule ::= exp7 -> exp

The "decls" are also what constitute a top-level program.

    decls ::= ; decls
              decl decls

    decl ::= exp
             exp7 of exp
             exp7 == exp
             exp7 := exp
             type exp7 == exp
             data exp7 == { condecs }
             data exp7 == condec

A "condec" differs from a "decl" only in how it's interpreted, not
syntactically.  The words "type" and "data" are not reserved, but this
does not introduce any ambiguity (exercise for the reader: why?)

Informal semantics
~~~~~~~~~~~~~~~~~~

Proper names
............

Proper identifiers stand for possibly effectful computations (though
quite often they'll actually stand for values).  As in Haskell, the scope
of a proper identifier is the entire block of declarations in which it
occurs (so they're implicitly mutually recursive) as well as the scope of
the block itself.  So in the following

    Foo where { Bar == (x -> Foo x); Foo == (x -> Bar x) }

all three Foos are mentions of the same identifier.

The types of proper identifiers are declared with "of", and the
identifiers themselves are defined with "==".  The definition may be by
several cases, e.g.

    Foo Int of Int;
    Foo 1 == 3;
    Foo 2 == Foo 1;
    Foo x == raise Boggle;

but all cases must be contiguous, with no other declarations intervening,
any type declarations for Foo must come before all of them, and all the
patterns on the left of == must have the same number of arguments.

Common names
............

By contrast, common identifiers stand for values.  All common identifiers
are introduced by binding them in a pattern.  The scope of a common
identifier extends to the end of the rule (if the pattern introducing it
is the left-hand side of a rule or ==) or the rest of the block (if the
pattern introducing it is the left-hand side of :=) and the scope of the
block itself.  So in

    Foo x y Zee where { x := Factorial 100; y := x + 1; Zee a == a + x }

all the occurrences of x are mentions of the same identifier.

Anonymous functions
...................

Function definitions can be written in many different ways.  The type of
a function can always be written in a way that makes it look like the
function itself (except that "->" becomes "=>" and "==" becomes "of").

The general form for anonymous functions is exemplified by

    Map == {
      fn f Nil -> Nil;
      fn f (Cons x xs) -> Cons (f x) (fn f xs);
    };

Each rule is one case of the definition.  The pattern on the left, and
the expression on the right, use the keyword "fn" to stand for the
anonymous function itself (and "fn" can be used just like any other
identifier could be), so for instance we allow

    Map == {
      f `fn` Nil -> Nil;
      f `fn` Cons x xs -> Cons (f x) (f `fn` xs);
    };

and anonymous functions can be recursive.  A "fn" on the right-hand side
refers recursively to the nearest enclosing anonymous function.  It is an
error to use "fn" in an expression that's not nested inside an anonymous
function.

When multiple == bindings are used to define a name, the set of bindings
can be transformed into an anonymous binding by replacing each occurrence
of the name, on the left of ==, with "fn", and replacing "==" with "->".
So for instance we transform

    Foo 1 == 3;
    Foo 2 == Foo 1;
    Foo x == raise Boggle;

into

    Foo == {
      fn 1 -> 3;
      fn 2 -> Foo 1;
      fn x -> raise Boggle;
    };

Each rule in an anonymous function must apply "fn" to the same number of
arguments.  If there is just one argument, then the "fn" can be dropped:

    Foo == {
      1 -> 3;
      2 -> Foo 1;
      x -> raise Boggle;
    };

Either all "fn"s are dropped, or none are.  If the function has only one
rule, the outer braces can be dropped.  (More generally, braces are
always used to enclose semicolon-separated things.  If there's only one
thing in the braces, they can be dropped.)

** QUESTION: In the Paulson examples I also used the shorthand form
** {_+1} or {1+_} to stand for sectioning, which breaks this rule
** slightly (we need the braces to see where _ is implicitly bound).
** It's also useful so that we can say {Foo bar baz} where SML would
** force us to say (fn () => Foo bar baz), when playing with lazy
** computations.  Is this special syntax worth it?  (You'll note I did
** include it in the grammar above).
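Since "fn" names the nearest enclosing anonymous function, it is
essentially an anonymous fixpoint.  For comparison, here is a minimal
Haskell sketch of the Map example where the role of "fn" is played by an
explicit fix (the transcription onto ordinary Haskell lists is mine):

    import Data.Function (fix)

    -- "fn" made explicit: fix feeds the whole anonymous function to
    -- itself under the name fn, just as the note's "fn" keyword does.
    mapL :: (a -> b) -> [a] -> [b]
    mapL = fix (\fn f lst -> case lst of
                  []     -> []
                  x : xs -> f x : fn f xs)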
Value bindings and case analysis
................................

Patterns are also used in some other places, where they don't stand for
calls to an anonymous function.  In these cases, the "fn" is NEVER
allowed.

In any context, a "pattern" such as "x y" in

    { x y -> 3 };
    x y := 3;

is bogus.  Only "fn" or a constructor name can be applied to arguments in
a pattern.  (But see the discussion of string constants, below.)
Patterns must match the production "exp7" in the grammar.

Patterns are also used to bind value (common) variables using ":=":

    pi := 3.14159;
    (x, y) := Unzip mylist;

and in the rules of a "with" (case analysis):

    mylist with { Nil -> 0; Cons x xs -> 1 }

(which is "case mylist of Nil => 0 | Cons(x,xs) => 1" in SML)

Function types
..............

The type of a function, in general, is given by a "rule" in which ->
becomes =>:

    Add3 of (fn Int Int Int => Int);
    Add3 == (fn x y z -> x + y + z);

All the same conventions apply.  The "fn" can be dropped if there's only
one argument:

    Increment of (Int => Int);
    Increment == (x -> x + 1);

And you can use the same form for type declarations as for Haskell-style
definitions:

    Add3 Int Int Int of Int;
    Add3 x y z == x + y + z;

Type declarations for values must never use that form.  But the following
is okay:

    x of Float;
    x := 3.14159;

The rule that -> and => extend as far to the right as possible leads to
the natural thing for curried functions (combined with the convention
that the braces surrounding an anonymous function can be dropped if
there's only one rule):

    Add3 of Int => Int => Int => Int;
    Add3 == x -> y -> z -> x + y + z;

Tuples
......

Tuples are given by comma-separated expressions.  The type of a tuple
looks like the tuple itself:

    Diagonal Int of (Int, Int);
    Diagonal x == (x, x);

A tuple need not be surrounded by parentheses.  Commas don't associate at
all, so (1, 2, 3) and ((1, 2), 3) and (1, (2, 3)) are all different.

Type definitions
................

A type (or type constructor) definition is introduced by the keyword
"type":

    type IntPair == (Int, Int);
    type Pair3 a == (a, a, a);
    type CPS a answer == (a => answer) => answer;

Constructor-level functions are given by "->", but an anonymous
constructor-level function must have only one rule, and obviously it
can't use non-trivial pattern matching:

    type Pair3 == a -> (a, a, a);
    type CPS == fn a answer -> ((a => answer) => answer);

All the other conventions for the syntax of functions are the same for
constructor-level functions, though:

    # sick and twisted
    type CPS == a `fn` answer -> ((a => answer) => answer);

Datatypes
.........

A datatype is introduced by the keyword "data":

    data Bool == { True; False };
    data List a == { Nil; Cons a (List a) };
    data Sum a b == { Left a; Right b };

In more advanced systems (which I haven't thought much about), it is
necessary to be able to specify the return type of a constructor:

    data GADT a == { GInt Int of GADT Int; GBool Bool of GADT Bool };
    data NList a n == { Nil of NList a 0; Cons a (NList a n) of NList a (n+1) };

When the return types aren't specified, they default to the type on the
left of the ==.
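As an aside, GHC Haskell's GADT syntax can express the same two
declarations; a hedged transcription (the length index uses a unary Nat
kind in place of the note's n+1, and constructor names are adjusted to
avoid clashes):

    {-# LANGUAGE GADTs, DataKinds, KindSignatures #-}

    data G a where            -- the note's "data GADT a"
      GInt  :: Int  -> G Int
      GBool :: Bool -> G Bool

    data Nat = Z | S Nat      -- unary naturals for the length index

    data NList a (n :: Nat) where
      NNil  :: NList a 'Z
      NCons :: a -> NList a n -> NList a ('S n)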
So the earlier datatype declarations, more explicitly, look like:

    data Bool == { True of Bool; False of Bool };
    data List a == { Nil of List a; Cons a (List a) of List a };
    data Sum a b == { Left a of Sum a b; Right b of Sum a b };

All the same conventions apply for constructor declarations as for any
other declaration, so the following is fine:

    data List a == { Nil of List a; Cons of a => List a => List a };
    data List a == { Nil of List a; Cons of fn a (List a) => List a };

All datatypes are subtypes of a common type Data.  This replaces ML's
"exn".  Once syntax for exception handlers is introduced, it will just
involve case analysis on type Data.

** QUESTION: Obviously this will only make sense if the subtyping for
** datatypes makes sense.  Also there's the question of representation.
** I think having a little flexibility in subtyping datatypes is valuable
** enough that any penalties in the bit-banging of the representations
** are worth it.  But someone who knows should evaluate this.

Existentials in datatypes
.........................

First of all, with the syntax for datatypes discussed above, we already
have existential types, if the binding of type variables is set up the
right way.  I want to think of constructor declarations as always being
in the more explicit form given above.  In that case, each constructor
declaration's free type variables are local to that declaration.  We can
alpha-convert them, and erase the variables on the outside.  For
instance, we could expand

    data List a == { Nil; Cons a (List a) }

into

    data List a == { Nil of List a; Cons a (List a) of List a }

then, since now the full type of each constructor is explicit, erase the
type variable on the outside, and alpha-convert the type variables in the
constructor declarations any way we want:

    data List _ == { Nil of List b; Cons c (List c) of List c }

(We wouldn't have been able to do that, of course, before making the type
of each constructor fully explicit, because

    data List a == { Nil; Cons c (List c) }

obviously doesn't make sense.)

Anyway, if we want an existential type, in the fully explicit form, we
can just introduce more type variables that don't occur in the result
type of a constructor.  For instance, my MinML interpreter represents
functions in CPS style.  The final answer type of the CPS monad is
unknown.  We can capture it with an existential type.  Then the datatype
of values in my MinML interpreter, in fully explicit form, looks like

    data Value == {
      VBool Bool of Value;
      VInt Int of Value;
      VPair Value Value of Value;
      VFun (Value => (Value => a) => a) of Value;
      # et cetera
    };

Since the "a" of the VFun constructor doesn't appear in the constructor's
result type, it's effectively an existential.  Here we're making use of
the equivalence (this is not ML syntax)

    all a. (Foo a => Bar)  =~=  (exists a. Foo a) => Bar

Now that we've seen the fully explicit form, we can go back to the
lighter form, but it still means the same thing:

    data Value == {
      VBool Bool;
      VInt Int;
      VPair Value Value;
      VFun (Value => (Value => a) => a);
      # et cetera
    };
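For comparison, GHC Haskell supports the same idiom, except that the
quantifier has to be written out on the constructor; a hedged sketch:

    {-# LANGUAGE ExistentialQuantification #-}

    -- The note's Value type in GHC Haskell.  The forall on VFun is the
    -- existential: a does not appear in the result type Value.
    data Value
      = VBool Bool
      | VInt Int
      | VPair Value Value
      | forall a. VFun (Value -> (Value -> a) -> a)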
Sometimes there would be a mixture of type variables; this also makes
sense.  For example, we might want to parameterize Value over the
representation of integers.  Then in fully explicit form we'd have

    data Value intrep == {
      VBool Bool of Value intrep;
      VInt intrep of Value intrep;
      VPair (Value intrep) (Value intrep) of Value intrep;
      VFun (Value intrep => (Value intrep => a) => a) of Value intrep;
      # et cetera
    };

In fully explicit form, we can alpha-convert freely, so the above is
exactly the same as

    data Value _ == {
      VBool Bool of Value alice;
      VInt bob of Value bob;
      VPair (Value carol) (Value carol) of Value carol;
      VFun (Value dave => (Value dave => evelyn) => evelyn) of Value dave;
      # et cetera
    };

Or we can go back to the lighter form

    data Value intrep == {
      VBool Bool;
      VInt intrep;
      VPair (Value intrep) (Value intrep);
      VFun (Value intrep => (Value intrep => a) => a);
      # et cetera
    };

but now of course we can't alpha-convert quite as freely, since the name
"intrep" on the left-hand side of "==" is getting copied implicitly into
each constructor declaration as its result type.

Explicit universals and existentials
....................................

I think the syntax for these should be fairly light.  In a lot of type
theories that have universals, we think of them as being given by a
higher-order constructor function (or maybe one of them for each kind).
So the type of the polymorphic identity function looks like

    all a. (a => a)  =~=  All (a -> (a => a))

where

    All of (Type => Type) => Type;

I want to leverage this idea but make it syntactically simpler.  So we
say that when "->" is used at constructor level, sometimes the
constructor variables being abstracted over are implicitly universal or
existential.  The idea is that

    a -> !b -> ?c -> !d -> Foo a !b ?c !d

implicitly means

    a -> All (b -> Exists (c -> All (d -> Foo a b c d)))

or something like that.  (Here my Isabelle history is showing.)  I have
to work out the details.  You could also say

    fn a !b ?c !d -> Foo a !b ?c !d

as usual.  But anyway, the earlier implicit existential in the Value
datatype could then be made explicit by

    data Value == {
      # ...
      VFun (?a -> Value => (Value => ?a) => ?a);
    };

Another example is well-typed self-application:

    SelfApp of (!a -> !a => !a) => (!a -> !a => !a);
    SelfApp x == x x;

(This is in the model of polymorphism based on subtyping, not explicit
instantiation; i.e. Curry style.)

Another example is the type of the Filter function on dependently refined
lists, which needs an existential:

    data List a n == { Nil of List a 0; Cons of a => List a n => List a (n+1) };

    Filter of (a => Bool) => List a n => ?m -> List a ?m suchthat ?m <= n;
    Filter pred == {
      Nil -> Nil;
      Cons x xs -> Cons x (fn xs) if pred x | fn xs;
    };
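For comparison, the self-application example can be written in GHC
Haskell with explicit foralls, wrapping the polymorphic type in a newtype
to sidestep impredicative instantiation (a hedged sketch; the wrapper
name is mine):

    {-# LANGUAGE RankNTypes #-}

    -- all a. (a => a), wrapped so it can be passed around first-class
    newtype Poly = Poly (forall a. a -> a)

    -- well-typed self-application: f's type variable instantiates to Poly
    selfApp :: Poly -> Poly
    selfApp p@(Poly f) = f p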
Lazy datatypes
..............

TODO.  Use Okasaki/Wadler even style.  Declaring a datatype always
implicitly declares its lazy counterpart.  Lazy expressions and functions
must have a lazy datatype for their types.

Lists
.....

The standard list syntax is given by

    data List a == { . ; a : List a };

Strings
.......

The type String is just a synonym for List Char.  A string is just the
list of its Unicode codepoints (Char includes all Unicode codepoints, not
just what the Unicode standard calls a "character"... noncharacters are
useful for sentinel values).

** QUESTION: I currently think that strings should just be lists.  But
** can anyone think of counter-arguments?
**
** As far as representation, I think we can afford, in 2005, to be a
** little thunk-happy.  One simple idea: a List is either a machine
** word uniquely identifying the nil (.) constructor, a pointer to a
** tagged pair for the cons (:) constructor, or a pointer to a thunk
** (tagged as such).  So string constants can be stored in UTF-8, or
** whatever, and the code pointed to by the thunk can actually read
** out the individual characters.

The character "#" is special in string literals.  The type of a string
literal is String if it contains no unescaped "#"s.  If it contains one
"#", then its type is (String => String).  If it contains two "#"s, then
its type is (String => String => String), et cetera.  So we can write
e.g.

    Print ("To # or # to be..." "be" "not")

This would obviously be more powerful in the presence of Haskell-style
type classes, in which case we'd want "To # or # to be..." to have a type
like (Showable => Showable => String) or whatever.

A further generalization would be to allow pattern matching in the same
style:

    Foo ("# + #" s1 s2) == "add # #" s1 s2;

Here I think we'd want to generalize the syntax to allow for regular
expressions.

** QUESTION: If we allow regular expression matching, how do we avoid
** or finesse the issue of multiple possible matches?
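The forward (non-pattern) direction of the "#" rule can be simulated
today with Haskell type classes, which also suggests why the type-class
generalization is natural; a hedged sketch (the names Splice and fill are
mine):

    {-# LANGUAGE FlexibleInstances #-}

    -- fill the first '#' hole in a template with a string
    fill :: String -> String -> String
    fill ('#':rest) s = s ++ rest
    fill (c:rest)   s = c : fill rest s
    fill []         _ = []

    -- a template with n holes acts as an n-argument function
    class Splice r where
      splice :: String -> r

    instance Splice String where
      splice = id

    instance Splice r => Splice (String -> r) where
      splice tmpl s = splice (fill tmpl s)

    -- splice "To # or # to be..." "be" "not" :: String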
Vectors and arrays
..................

** RANT:
**
** In general, there should be no FORCED distinction between "vectors"
** with constant time access and "lists" (you can always use type
** abstraction to CREATE such a distinction if you wish).  Any list can
** be fed to a primitive function
**
**     Pack of List a => List a
**
** that is semantically indistinguishable from the identity function, but
** the list you get back from Pack is "suspiciously fast", in that the
** operation
**
**     Nth of List a => Int => a
**
** will execute in constant time.  The function
**
**     Tabulate of Int => (Int => a) => List a
**
** also creates "suspiciously fast" lists, and in fact Pack can be
** defined in terms of it.
**
** In fact, I might go so far as to eliminate the array types, and go
** with (Ref (List a)) instead.  As long as you use a list in a "linear"
** way, thinking of the primitive
**
**     Update of List a => Int => a => List a
**
** as linear, as long as you never use the first argument again, then the
** underlying vector, if it is a vector, should be destructively
** modified, and the update should happen in constant time.  The
** linearity check could be static or dynamic.  If you're worried that
** the linearity check would be fragile, you can always use type
** abstraction to create an Array a implemented by Ref (List a), where
** all the operations exposed by the abstract type on Array a have
** implementations that are statically linear, and so obviously so that
** even a rather stupid compiler will be able to see that.

Conjunction and disjunction
...........................

As usual, we want these to be short-circuiting, so they need to be
special forms.  As a side benefit, we get to choose the meaning of the
special forms by analogy at the level of types and patterns.  For types,
obviously && should be intersection and || should be union.  For
patterns, && is simultaneous matching and || is alternative matching.

Simultaneous matching generalizes the "as" construct of SML, with the
bonus that you don't have to remember which way it goes (always a problem
for me), because && is commutative:

    Foo (list && Cons x xs) == Cons x list;
    Foo (Cons x xs && list) == Cons x list;

    Primary? (Red || Green || Blue) == True;
    Primary? _ == False;

Numbers
.......

All number types are subtypes of a common type Number.  A Number is one
of the following:

    1. A positive or negative infinite-precision rational.
    2. Positive or negative zero.
    3. Positive or negative infinity.
    4. An IEEE NaN.

This makes it possible to coerce any IEEE floating type to Number without
loss.  (And the infinities are useful anyway, for the same reason that
they are in the IEEE standard.  We can always carve out another subtype
of Number that disallows them, if we want.)

As usual, the basic equality on Number must be transitive, and a
congruence.  So it must distinguish the zeroes, et cetera.  No one is
stopping us from defining other, less sharp equalities.  No one is
stopping us from carving out a subtype of Number not containing the
infinities, in which case the distinction between positive and negative
zero is unobservable (because the special reciprocal operation mapping
this subtype to itself will raise an exception when given a zero rather
than returning an infinity).  No one is stopping us from defining a
subtype Integer of Number containing only the positive zero (and a lossy
operation mapping the Number -0 to the Integer +0).

** QUESTION: Complexes?  (See Kahan's article on the sign of zero.)

... more to come