The d2c Dylan-to-C compiler:

  • Limitations
  • Command-line arguments
  • LID file format
  • Extensions and Libraries
  • Mindy compatibility notes
  • Debugger
  • Environment
  • Function Representations
  • Naming
  • Object representations
  • Special adjectives
  • Optimizations
  • Generated Files
  • Compile Time Constants

    Limitations

    d2c does not implement all of Dylan. In particular:

    Command-line arguments

        d2c -gMndT {-tdirectory} {-Ldirectory}*
            {-Dfeature}* {-Ufeature}*
            lid-file
    

    d2c compilation is driven by a LID (Library Interchange Description) file describing the library contents and various compilation options. It serves a similar purpose to a makefile (though we use make too). See the description of the LID file format below.

    In operation, d2c reads and processes the dylan files, generating .c and .s files and a temporary .mak makefile. It then runs gmake on this makefile.

    d2c recognizes these switches:

    -g
    Dump definitions needed to support debugging with gdb/dig.
    -Ldirectory
    Add directory to the library search path.
    -Dfeature
    -Ufeature
    Define or undefine a feature for #if conditional compilation. Features may also be specified in the LID file by the "Features:" option.
    -M
    Generate dependency info to be included in makefiles. This is included in higher level makefiles such as those generated by the Perl gen-makefile script used to compile the runtime system and compiler.
    -ppathname
    Used to specify the location for the platform description database file. The default is "$DYLANDIR/etc/platforms.descr".
    -no-binaries
    Inhibits compilation of the generated C code, for cross-compilation. You can compile later by running make on the "cc-unitprefix-files.mak" makefile or by using the Makegen-created "cc_files" target.
    -Ttarget
    Generate code for the given target machine. Normally defaults to this platform. See the platforms.descr file for the names of the supported platforms. Often used with -no-binaries.
    -d
    Compiler debug mode (for debugging this compiler)

    LID file format

    A LID file is composed of entries of the form "keyword: value", similar to mail headers and to the Dylan file header format. Currently d2c expects the list of source files to appear as the "main body" of the LID file, after the header and a blank line. In the Harlequin LID format, there is a "Files: " entry which is used instead, and which we do not yet support.

    d2c recognizes these LID entries:

    Library: dylan-library-name
    The Dylan name for the library that we are defining. There must be a corresponding "define library" somewhere in the source for this library.
    Unit-prefix: c_legal_identifier_fragment
    This prefix is used to make the C translation of names in this library unique w.r.t. any other libraries that might be used. This defaults to the library name, so only needs to be specified if the library name contains illegal C name characters (such as "-").
    Unique-ID-base: decimal-integer
    Unique class identifiers for classes defined in this library are assigned sequentially starting with the specified integer. This should always be specified, but you won't get a sensible error if it is missing. The base should be sufficiently far from the base for any other library so that class IDs won't overlap. You will get a compile-time error if overlap occurs. A good base for user code would be 30000.
    Executable: result-file-name
    Specifies that we are building a runnable application rather than a library. The executable is generated with the specified name.
    Entry-Point: dylan-module:dylan-variable
    When generating an executable, this LID option specifies which Dylan function is called as the main entry point. You can also have no main entry point, in which case the program exits after running all of the top-level forms. This entry-point function is called with two arguments, argc (an integer) and argv (a raw pointer). Note that this is incompatible with Mindy, and rather brutal as well. You can get the Mindy semantics of calling Extensions:Main by using the Extensions module in your main module and then specifying "Entry-Point: mymodule:%main" in the LID file. The %main function parses the arguments and then calls Main.
    Linker-options: various "ld" flags
    This option specifies flags which must be passed to ld when linking against this library. This is primarily used when a foreign library is called via one of the undocumented callout mechanisms. For example, Dylan.lid specifies "-lm" so that it can use the math library. This dependency is automatically propagated to users of the library.
    Features: {feature | ~feature}*
    The argument is a space-separated list of features or misfeatures. If the token begins with "~", then the rest of the token is interpreted as a feature to remove. Otherwise, the token is added as a feature.
    Unrecognized LID entries are quietly ignored. This handles any comment-like LID entries such as "Author: ", etc. d2c also recognizes // as a comment-to-end-of-line sequence (equivalent to whitespace.)
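
    The Unique-ID-base rule above can be pictured as an interval check. The following is only a sketch under stated assumptions: it supposes each library occupies the half-open range [base, base + number of classes), and the names id_range and id_ranges_overlap are hypothetical rather than part of d2c.

```c
#include <stdbool.h>

/* Hypothetical model: a library's class IDs occupy [base, base + count). */
struct id_range {
    long base;   /* value given as Unique-ID-base */
    long count;  /* number of classes the library defines */
};

/* Two half-open ranges overlap when each starts before the other ends. */
static bool id_ranges_overlap(struct id_range a, struct id_range b)
{
    return a.base < b.base + b.count && b.base < a.base + a.count;
}
```

    Under this model, a library with base 30000 and 50 classes would collide with one based at 30040 but not with one based at 30050.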

    Here is a sample LID file:


    rcs-header: $Header: /afs/cs.cmu.edu/project/gwydion-9/dylan/docs/htdocs/RCS/d2c.html,v 1.3 97/06/04 19:50:59 ram Exp Locker: ram $
    library: my-program
    unit-prefix: myprog
    unique-id-base: 30000
    executable: mp
    entry-point: main:%main
    
    myprog-exports.dylan
    myprog.dylan
    

    Extensions and Libraries

    d2c has been written so that most of the Dylan code can be shared between Mindy and d2c. Dylan extensions (such as conditional compilation) that are implemented by both Mindy and d2c, and the common libraries, are documented separately.

    Mindy compatibility notes

  • In d2c the "main" entry is specified by the Entry-point: LID file option, and can be called whatever you want. This differs from Mindy, where there is a standard variable "dylan:extensions:main". However, you can get the Mindy semantics by using %main; see Entry-Point:.
  • The "define library" and "define module" forms for a library must be in a separate file which is the first file specified in the LID file. This file should specify "Module: Dylan-User" in its header.
  • In general, d2c is much more picky about all sorts of errors. It enforces type-related stuff to a much greater degree, and is even in some cases more picky about syntax.
  • Unlike Mindy, d2c does implement macros. In some cases this means that syntax error messages are not as good. If you can't figure out a d2c syntax error, try Mindy.
  • The d2c runtime does not automatically force output on *standard-output* on process exit, so you may need to add explicit calls to force-output.
  • d2c is missing the TK and Inspector libraries. Also, the d2c Random library is missing the more exotic functionality of the Mindy Random library, like random-gaussian() and random-exponential().
  • Mindy has some Dylan extensions which d2c does not implement. The most significant omission is threads.

    Debugging D2C code with DIG

    Dig is a wrapper for GDB which incorporates some specialized domain knowledge concerning the Dylan language and the D2C compiler. For the most part, you will seem to still be debugging the generated C code with GDB. However, some commands have been modified to allow a richer interface to Dylan objects and functions.

    Dig currently only compiles on HP/UX and Win32, and is probably not useful on the latter. Feel free to get it working on other platforms. Dig is not strictly necessary in any case, but it does sugarcoat some of the naming issues.

    DIG Commands

    In general, every GDB command still exists within DIG (although you may not be able to abbreviate it as expected -- for example, the "interactive" command shadows the "info" command, so that you must type at least "inf"). However some commands have been modified or added to facilitate debugging of Dylan code. The commands below reflect only the added capabilities.

    print
    This command extends the existing print command by allowing it to print Dylan values and call Dylan functions. This actually comprises three different special capabilities:
    1. DIG applies heuristics to translate Dylan variable names into their C equivalents. It guesses right most of the time, but the results are not guaranteed. Sometimes there are several possibilities; in this case it will ask you for clarification.
    2. If the expression value is a Dylan object then DIG invokes "print" to provide a meaningful description of the object. (The choice of which "print" to use depends upon the setting of *warning-output*.) Because DIG calls a function within your program, the program must be running before Dylan values may be printed.
    3. If the expression contains a Dylan function call, then DIG invokes that Dylan function. Arguments of the form "foo: bar" are translated correctly.
    find
    prints the translation of a Dylan variable name into its C equivalent.
    break
    if you specify a Dylan generic function, then DIG will set breakpoints in all of that function's methods. Let me know if this feature proves to be useful.
    interactive
    by default, anything typed into DIG is assumed to be a DIG command. If you need to provide input to your program, you must use this command to toggle the "interactive mode". In interactive mode, you may type data into your process. However, any type-ahead may produce strange and unpredictable results.
    prompt
    change DIG's command prompt. (You should not use gdb's native "set prompt" command. Bad things will happen.)
    quit
    does about what you'd expect, but does some extra clean-up work.
    gdb
    passes the following text to GDB verbatim. This allows you, for example, to use GDB's "print" command instead of DIG's.

    DIG Gotchas

    • Because of the challenges of name translation, the "print" and "break" commands may be noticeably slower than you expect. Since translations are cached, it will get faster as you go along.
    • If you try to print something that looks like a Dylan object, but isn't valid, DIG will encounter a "seg fault" and have to recover. This has the annoying side-effect of changing the "current frame" to be at the bottom of the call stack, regardless of where you were before.
    • Like GDB, DIG has problems with optimized code. Variables may be re-used (or eliminated), function calls may be inlined, and things will generally be less predictable. In addition, some Dylan functions may disappear if you do not pass the "-g" switch to D2C.
    • There are probably many other gotchas, but I don't know about them. Please tell me about anything not mentioned above.

    Environment

    These environment variables are used by D2C:
    DYLANDIR
    The root of the installed Gwydion tree. In the default configuration, this is "/usr/local" on Unix and "c:\dylan" on Win32. This variable in turn establishes the defaults for DYLANPATH and the "platforms.descr" file.
    DYLANPATH
    The search path for Dylan libraries. Directories in the DYLANPATH are searched after any directories specified by explicit -L options. If not set, this defaults to ".:$DYLANDIR/lib/dylan" (".;%DYLANDIR%\lib\dylan" on Win32). If set, the value must include the directory where the "Dylan" library is to be found.
    PATH
    d2c expects to find make and the C compiler in PATH. On Unix we use the GNU tools gmake, gcc, and ldb. Other compilers can work, but at a minimum this requires a new platform description in "$DYLANDIR/etc/platforms.descr". You must also have some of the GNU-win32 tools to run d2c on Windows, though make and Visual C++ are normally used for compilation. To build d2c, you also need perl and the various scripts in the "tools/" directory.

    The gnu assembler must be used in conjunction with the generated code from gcc. If you somehow end up running the HP/UX "as" with gcc, it will produce many errors about STAB entries, etc.

    CCFLAGS
    This variable holds the flags passed to the C compiler. The default is platform specific, but always includes "-I$DYLANDIR/include". If you do set this variable, you must also specify the Dylan system include directory. The default optimization flags for gcc are " -g -O4 -finline-functions". You can roughly halve the size of the executable by omitting the -g, but at the cost of debuggability. Leaving out the other optimize flags will speed compilation at the cost of runtime speed.

    Function Representations

    In d2c, there are two distinct things that may be thought of as "the function". The first is the actual C code d2c generates for the function. The second is an actual Dylan object (a "function object"), which is a general instance of <function>. Function objects are not created if the compiler can prove it isn't necessary (which is usually the case for functions that aren't exported from a library), where "necessary" means that the function might be stored into a variable, passed to another function, or otherwise used as a first-class value.

    The actual C code comes in three pieces, or entries, with each entry being a separate C function. At a call site, the compiler can either know exactly what function is being called or it might not have a clue (e.g. inside map where it calls the passed in function). So to keep from having to pay the penalty of runtime checking everything all the time instead of just when necessary, we generate multiple entry points for each function.

    The main entry is the entry that is used when the compiler can determine that everything is fine. It doesn't have to check any argument types or figure out what values correspond to what keywords.

    The general entry is the entry that is used in a random call where the compiler can't tell anything. It checks the argument types, decodes the keywords, and then calls the main entry.

    The generic entry is like the general entry, except that it is only used when the method was invoked via some generic function. The generic function dispatch already guarantees that the argument types are okay, so it only has to decode the keywords before calling the main entry.

    Note:
    The various entries are just different pieces of C code. There is no Dylan object that corresponds to them. If the compiler can prove a given entry won't be used, d2c will omit that entry.
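
    The division of labour among the three entries might look roughly like the following for a one-argument method on <integer>. This is a simplified sketch, not d2c's actual output: obj_t, the tag values, the add1_* names, and the error convention are all hypothetical stand-ins, and keyword decoding is omitted.

```c
#include <stddef.h>

/* Hypothetical stand-ins for d2c's runtime types; real layouts differ. */
typedef struct { int type_tag; long value; } obj_t;
enum { TAG_INTEGER = 1, TAG_OTHER = 2 };

/* Main entry: arguments already known to be valid, passed unboxed. */
static long add1_main(long n)
{
    return n + 1;
}

/* Generic entry: dispatch has already checked types, so just unpack. */
static long add1_generic(const obj_t *args)
{
    return add1_main(args[0].value);
}

/* General entry: nothing is known, so check count and types first.
   Returns 0 on success and -1 on a (simulated) type error. */
static int add1_general(const obj_t *args, size_t nargs, long *result)
{
    if (nargs != 1 || args[0].type_tag != TAG_INTEGER)
        return -1;
    *result = add1_main(args[0].value);
    return 0;
}
```

    A call site that knows its argument is an integer would call add1_main directly and pay for no checks at all.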

    Naming

    d2c generates C code, and thus must come up with a unique, legal C identifier for each thing that is to be referenced. (We say "thing" because it isn't necessarily a Dylan object.) d2c starts by computing a unit prefix ("unit" being synonymous with "library"). The unit prefix can be user specified; if not specified, it defaults to the library name in all lowercase.

    Character Set Translation

    Since Dylan allows characters in identifiers that C does not, we must translate these punctuation characters into alphanumeric sequences. Because Dylan is case-insensitive, we also fold all alphabetic characters to lowercase. This frees the uppercase characters to be used to represent the extra characters. We translate characters which aren't legal in C as follows:
    ' ' => "BLANK"
    '!' => "D"
    '%' => "PCT"
    '$' => "C"
    '&' => "AND"
    '*' => "V"
    '+' => "PLUS"
    '-' => "_"
    '/' => "SLASH"
    '<' => "LESS"
    '=' => "EQUAL"
    '>' => "GREATER"
    '?' => "QUERY"
    '^' => "RAISE"
    '_' => "X_"
    '|' => "OR"
    '~' => "NOT"
    otherwise => "Xhex code"
    As a special case to deal with the Dylan <class> naming convention, the brackets are stripped off of the variable name, and CLS_ is prefixed to the name. So <list> becomes CLS_list instead of LESSlistGREATER.

    Basic Name Translation

    A basic name is a module binding (like "define module" or "define constant"). The C name is formed by concatenating the unit prefix, module name and variable name, separated by uppercase Z's:
    unit prefixZmodule nameZbasic name
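
    The character translation and the basic name rule can be combined into a small routine. The sketch below is an illustration written for this document, not code taken from d2c: the function names are hypothetical, the two-digit lowercase hex format for the "otherwise" case is an assumption, the unit prefix is copied verbatim on the assumption that it is already a legal C fragment, and over-long names are silently truncated.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Replacement for one punctuation character, or NULL if none is listed. */
static const char *punct_name(char c)
{
    switch (c) {
    case ' ': return "BLANK";
    case '!': return "D";
    case '%': return "PCT";
    case '$': return "C";
    case '&': return "AND";
    case '*': return "V";
    case '+': return "PLUS";
    case '-': return "_";
    case '/': return "SLASH";
    case '<': return "LESS";
    case '=': return "EQUAL";
    case '>': return "GREATER";
    case '?': return "QUERY";
    case '^': return "RAISE";
    case '_': return "X_";
    case '|': return "OR";
    case '~': return "NOT";
    default:  return NULL;
    }
}

/* Append the translation of name[0..len) to the string already in out.
   Stops early if fewer than 8 bytes of the cap-byte buffer remain. */
static void translate(const char *name, size_t len, char *out, size_t cap)
{
    size_t used = strlen(out);
    for (size_t i = 0; i < len && used + 8 < cap; i++) {
        unsigned char c = (unsigned char)name[i];
        const char *rep = punct_name((char)c);
        if (rep != NULL) {
            used += (size_t)snprintf(out + used, cap - used, "%s", rep);
        } else if (isalnum(c)) {
            out[used++] = (char)tolower(c);   /* fold case */
            out[used] = '\0';
        } else {
            /* assumed hex format for the "otherwise" rule */
            used += (size_t)snprintf(out + used, cap - used, "X%02x", c);
        }
    }
}

/* Build unitZmoduleZname, applying the <class> special case to name. */
static void mangle(const char *unit, const char *module, const char *name,
                   char *out, size_t cap)
{
    size_t n = strlen(name);
    out[0] = '\0';
    strncat(out, unit, cap - 1);              /* unit prefix used as given */
    strncat(out, "Z", cap - strlen(out) - 1);
    translate(module, strlen(module), out, cap);
    strncat(out, "Z", cap - strlen(out) - 1);
    if (n > 2 && name[0] == '<' && name[n - 1] == '>') {
        strncat(out, "CLS_", cap - strlen(out) - 1);
        translate(name + 1, n - 2, out, cap); /* strip the brackets */
    } else {
        translate(name, n, out, cap);
    }
}
```

    With these assumptions, mangle("dylan", "dylan-viscera", "<type>", buf, sizeof buf) produces dylanZdylan_visceraZCLS_type.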

    Derived Name Translation

    d2c needs to create C names for many global definitions which are related to some Dylan variable but which are not the actual Dylan value of that variable. These derived names are created by adding suffixes to the basic name:
    GF name_METH
    Some method on the base name.
    GF name_DISCRIM
    Discriminator for a generic function.
    method name_GENERAL
    Default entry for a method.
    method name_GENERIC
    Method entry used by GF dispatch.
    method name_MAIN
    The actual body of a method.
    method name_INT_local method name
    Local method inside named method.
    method name_INT_method
    Some method form inside the named method.
    slot name_DEFER
    Deferred evaluation of a slot type.
    slot name_INIT
    Slot init function.
    slot name_SETTER
    Method used to implement setting a slot.
    slot name_GETTER
    Method used to implement getting a slot.
    var name_TYPE
    Holds the type of a variable when the type isn't constant.
    var name_VAR
    The actual value of a Dylan variable.
    class name_MAKER
    Internal constructor for a class.
    LINE_542
    Some function resulting from compiling line 542.
    UNKNOWN
    As above, except we don't know where it came from.
    some name_542
    The 542nd distinct instance of name. _VAR names are guaranteed never to have this uniquifier suffix.

    As you might infer from the preceding, some suffixes can be combined, but except for _INT_ not to an arbitrary depth. Some examples:

    	dylanZdylan_visceraZCLS_type    /* <type> */
    
            /* general-entry for maker for <type-error> */
    	dylanZdylan_visceraZCLS_type_error_MAKER_GENERAL
    
    	/* signal{<condition>} internal search */
    	dylanZdylan_visceraZsignal_METH_INT_search_MAIN
    
    	dylanZdylan_visceraZVdebuggerV_VAR     /* *debugger* */
    
    For local variables, we simply add "L_" to the front of the name. This may result in a non-unique name, in which case a uniquifier is appended to the end; see the "some name_542" rule above.

    Object representations

    In general, d2c picks the most specific representation that it can be sure will work. For instance, if d2c is sure that a given object is an <integer>, then it will use the C type "long" to represent the object. If, however, d2c only knows for sure that the object is an <object>, d2c will use the descriptor_t representation, even if it later turns out the <object> is in fact an <integer>.
    Known Dylan type     C type
    <integer>            long
    <single-float>       float
    <double-float>       double
    <extended-float>     long double
    <raw-pointer>        void *
    no data word         heapptr_t
    <object>             descriptor_t
    Note:
    There isn't a dylan type that can describe the set of objects that use the heapptr_t representation. See above.

    Note 2:
    If a functional (see below) class has exactly one data slot that can be magically represented, it is also magically represented in the same way. <character> falls into this category.

    An immediate representation is one where the actual data is directly there, as opposed to a pointer representation, where the actual data lives in the heap and is referenced via a pointer.

    The general representation is the fully general representation that can be used to represent any Dylan object. It consists of a heap pointer and a data word (i.e. descriptor_t). The heap pointer representation is used to represent anything that doesn't need the data word (i.e. heapptr_t).
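
    In C terms, the general representation can be pictured as a two-word struct. The layout below is a sketch based only on the description above (a heap pointer plus a data word); the union members and the box_integer and unbox_integer helpers are assumptions made for illustration, and the actual runtime definitions may differ.

```c
/* Opaque heap object; its contents are not modelled here. */
struct heapobj;
typedef struct heapobj *heapptr_t;

/* General representation: heap pointer plus one data word. */
typedef struct descriptor {
    heapptr_t heapptr;      /* heap pointer half of the pair */
    union {
        long l;             /* <integer> payload */
        float f;            /* <single-float> payload */
        void *ptr;          /* <raw-pointer> payload */
    } dataword;
} descriptor_t;

/* Pair a raw long with a (hypothetical) heap pointer identifying its type. */
static descriptor_t box_integer(heapptr_t integer_proxy, long value)
{
    descriptor_t d;
    d.heapptr = integer_proxy;
    d.dataword.l = value;
    return d;
}

/* Only valid when the caller knows the descriptor holds an <integer>. */
static long unbox_integer(descriptor_t d)
{
    return d.dataword.l;
}
```

    Code that is known to deal only with <integer> values never builds the struct at all and simply passes the long around.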

    "Boxed" and "unboxed" are somewhat vague terms that one will often hear on the 'net. Unboxed data is the raw data, the good stuff, with no overhead. The drawback is that if you don't know the type of the raw data (is it an integer, a character, or a float?), it's just a bunch of bits. Boxing means to add meta-data (the type of the data) so that the data can be interpreted unambiguously. The d2c immediate representation is an unboxed representation, while the d2c general representation is a boxed representation. Depending on the situation, the heap pointer representation might be considered either boxed or unboxed.

    Special adjectives

    inline
    Methods can be declared inline. If a method is inline, then the body of the method is duplicated at all valid call sites. This allows optimization of the called code based on the calling context.
    movable
    Methods and generic functions can be declared movable. A movable function is one that doesn't depend on when it happens. In other words, it can't depend on any global state, just the arguments.

    Plus is an example of a movable function: 2 + 2 is always 4 no matter when. movable implies flushable; see below.

    flushable
    Methods and generic functions can be declared flushable. This means that the function may depend on global state, but cannot change any global state. Extracting the value of a slot is a flushable operation (assuming the slot is guaranteed to be initialized). If the result of a flushable function isn't used, the call can be dropped.
    functional
    Classes can be declared functional. The slots of functional classes have to be constant (and in fact, default to constant). Furthermore, equality (==) is defined in terms of the slot values, not the pointer identity of the heap representation of the object. Currently, however, you have to define a functional-== method yourself to check whether two instances of a functional class are the same. You could intentionally get it wrong, and then strange things would happen. The idea is that two instances are == iff they have the same object-class and all their slots are ==. Functional classes may have subclasses, but be sure to define functional-== methods accordingly.
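
    The intended meaning of == for a functional class can be modelled in C as a field-by-field comparison. This is only an illustration of the rule stated above: struct point and point_functional_equal are hypothetical, and a real functional-== method would be written in Dylan.

```c
#include <stdbool.h>

/* Hypothetical functional class with two constant integer slots. */
struct point {
    const void *object_class;  /* stand-in for the object's class */
    long x;
    long y;
};

/* == for a functional class: same class and every slot ==. */
static bool point_functional_equal(const struct point *a,
                                   const struct point *b)
{
    return a->object_class == b->object_class
        && a->x == b->x
        && a->y == b->y;
}
```

    Two distinct structs with the same class and slot values compare equal here, even though their addresses differ.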

    Optimizations

    d2c performs the following optimizations:

    common sub-expression elimination (CSE), optimistic type inferencing, compile time method selection, inlining, code motion.
    See also compile time constants.

    Generated Files

    d2c generates a variety of files. They are:
    *.c
    d2c generates a .c file for each .dylan file it processes.
    unitprefix-inits.c
    contains code for performing various initializations for this particular library. This includes executing any top-level expressions contained in the library.
    unitprefix-heap.s
    contains the initial heap image for this particular library.
    cc-unitprefix-files.mak
    A makefile which contains rules for compiling all the .c and .s files, and linking either an executable or a library, depending on which the LID file specifies.
    library.lib.du
    (Only generated when not building an executable) Contains various information about the library that d2c needs to remember.
    inits.c
    (Only generated when building an executable) It invokes each library's initialization routines, then calls the entry-point function (if any).
    heap.s
    (Only generated when building an executable) It contains initial heap information that is of a global nature. For instance, because all symbols with the same value are ==, the literal #"foo" in the String-extensions library is == to the literal #"foo" in the Streams library, and so cannot go in either library's unitprefix-heap.s. Currently symbols are the only object with this property.

    Compile Time Constants

    In Dylan, the type in a type declaration is an expression just like any other. This means that in general, the compiler can't tell what a type declaration means without running the program. That won't work, so what d2c actually does is recognize simple type expressions and then exploit the obtained type information for optimization and type inference.

    If the type expression isn't an obvious compile-time constant, then d2c gains no useful information from it, but is required to go to extra work to implement a run-time check. For this reason, your code will compile much better if you only use type expressions which d2c can recognize as being constant. The notion of compile time constant lies at the heart of many of d2c's optimizations.

    To start with, some terminology:

    ctype
    The compile-time representation of a Dylan type. If a type is not constant, it is represented as an <unknown-ctype>, which frustrates any attempt at further type inference.
    ct-value
    A compile-time value. The phrase "compile time constant" is synonymous. <ct-value> is not a subclass of <ctype>, nor the other way around. However, there are many classes which are subtypes of both <ctype> and <ct-value>.
    EQL-value
    A value is an eql-value if members of its class can be compared with ==. <class>es, <integer>s, <character>s, and <symbol>s are eql-values; in d2c non-class <type>s are not.
    eql-ct-value
    A ct-value which is also an eql-value.

    If we can figure out a ct-value equivalent to an expression parse, we say that expression is ct-evaluable. An expression is ct-evaluable if any of the following holds:

    • The expression is a literal.
    • The expression is a body (e.g. the guts of a begin/end), and each expression in the body is ct-evaluable. (A body whose last component expression is ct-evaluable, but which also has non-ct-evaluable expressions in it, might have side effects. Thus, the body can't be ct-evaluable because that might suppress the side effects.)
    • The expression is a reference to a binding, and that binding is to a module binding rather than a local binding, and the definition it is bound to is ct-evaluable (see below).
    • The expression is a function call, and the arguments to the function are all ct-evaluable, and the function itself is a direct reference to a module binding which is a function (i.e., we know which function to invoke), and the function has built-in support, and the function call meets the conditions of that specific function (see below).

    Some definitions are ct-evaluable, and some are not:

    • A "define variable" definition is never ct-evaluable, because it is variable.
    • A "define constant" is ct-evaluable if the type constraint is ct-evaluable and the initial value is ct-evaluable.
    • A "define generic" is ct-evaluable if all specializers, all result types, and the type constraints on #rest types are ct-evaluable.
    • A "define method" doesn't usually define a module binding; define generics do. However, if the define method introduces an implicit generic, that implicit generic is ct-evaluable.
    • "define function" in d2c is equivalent to "define method".
    • "define class" is ct-evaluable if all its superclasses are ct-evaluable. (Note that d2c currently will puke if a class has a non-ct-evaluable superclass, so effectively all classes in d2c are ct-evaluable.)
    • "define macro", "define module", and "define library" don't produce module bindings.

    For function calls to most functions with built-in support, simply knowing that the arguments and the function are ct-evaluable is enough to make the function call ct-evaluable. These functions include:

    type-union, false-or, subclass, direct-instance, negative, abs, \+, \-, \*, ash, \^, logior, logxor, logand, lognot
    limited(<integer>) and limited(<collection>) are also ct-evaluable if their arguments are ct-evaluable.

    However, two functions are different. In addition to requiring that the arguments and the function are ct-evaluable, they impose additional constraints:

    singleton(obj) is ct-evaluable only if obj is an eql-ct-value.

    one-of(obj1, obj2, ...) is ct-evaluable only if every arg is an eql-ct-value.

    The most common way to construct a type which is not ct-evaluable is to create a singleton of a non-class type. For instance, although

            type-union(<foo>, <bar>)
    
    is ct-evaluable,
            singleton(type-union(<foo>, <bar>))
    
    is not.
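
    The rules for function calls can be summarized as a recursive predicate over a tiny expression tree. The sketch below is a deliberately reduced model, not d2c's implementation: the expr structure and its kinds are hypothetical, bodies are left out, and whether a value is an eql-value is supplied by hand rather than inferred.

```c
#include <stdbool.h>
#include <stddef.h>

enum expr_kind {
    EXPR_LITERAL,        /* e.g. 42 or #"foo" */
    EXPR_CONSTANT_REF,   /* module binding with a ct-evaluable definition */
    EXPR_LOCAL_REF,      /* local binding: never ct-evaluable */
    EXPR_BUILTIN_CALL,   /* type-union, false-or, \+, ... */
    EXPR_SINGLETON_CALL, /* singleton or one-of */
    EXPR_OTHER_CALL      /* function without built-in support */
};

struct expr {
    enum expr_kind kind;
    bool eql_value;            /* can the value be compared with ==? */
    const struct expr *args;   /* call arguments, if any */
    size_t nargs;
};

static bool ct_evaluable(const struct expr *e)
{
    switch (e->kind) {
    case EXPR_LITERAL:
    case EXPR_CONSTANT_REF:
        return true;
    case EXPR_BUILTIN_CALL:
    case EXPR_SINGLETON_CALL:
        for (size_t i = 0; i < e->nargs; i++) {
            if (!ct_evaluable(&e->args[i]))
                return false;
            /* singleton and one-of also need eql-ct-value arguments */
            if (e->kind == EXPR_SINGLETON_CALL && !e->args[i].eql_value)
                return false;
        }
        return true;
    default:
        return false;
    }
}
```

    In this model, type-union(<foo>, <bar>) is a built-in call over two class references and passes, while wrapping it in singleton fails because the union itself is not an eql-value.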
