22.1.2. Parsing of Numbers and Symbols

Common Lisp the Language, 2nd Edition

When an extended token is read, it is interpreted as a number or symbol. In general, the token is interpreted as a number if it satisfies the syntax for numbers specified in table 22-2; this is discussed in more detail below.

The characters of the extended token may serve various syntactic functions as shown in table 22-3, but it must be remembered that any character included in a token under the control of an escape character is treated as alphabetic rather than according to the attributes shown in the table. One consequence of this rule is that a whitespace, macro, or escape character will always be treated as alphabetic within an extended token because such a character cannot be included in an extended token except under the control of an escape character.

To allow for extensions to the syntax of numbers, a syntax for potential numbers is defined in Common Lisp that is more general than the actual syntax for numbers. Any token that is not a potential number and does not consist entirely of dots will always be taken to be a symbol, now and in the future; programs may rely on this fact. Any token that is a potential number but does not fit the actual number syntax defined below is a reserved token and has an implementation-dependent interpretation; an implementation may signal an error, quietly treat the token as a symbol, or take some other action. Programmers should avoid the use of such reserved tokens. (A symbol whose name looks like a reserved token can always be written using one or more escape characters.)

change_begin
Just as bignum is the standard term used by Lisp implementors for very large integers, and flonum (rhymes with ``low hum'') refers to a floating-point number, the term potnum has been used widely as an abbreviation for ``potential number.'' ``Potnum'' rhymes with ``hot rum.''
change_end

----------------------------------------------------------------
Table 22-2: Actual Syntax of Numbers

number ::= integer | ratio | floating-point-number 
integer ::= [sign] {digit}+ [decimal-point] 
ratio ::= [sign] {digit}+ / {digit}+ 
floating-point-number ::= [sign] {digit}* decimal-point {digit}+ [exponent] 
                       | [sign] {digit}+ [decimal-point {digit}*] exponent
sign ::= + | - 
decimal-point ::= . 
digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 
exponent ::= exponent-marker [sign] {digit}+  
exponent-marker ::= e | s | f | d | l | E | S | F | D | L
----------------------------------------------------------------

----------------------------------------------------------------
Table 22-3: Standard Constituent Character Attributes

! alphabetic       <page>   illegal          <backspace>   illegal 
" alphabetic *     <return> illegal *        <tab>         illegal * 
# alphabetic *     <space>  illegal *        <newline>     illegal * 
$ alphabetic       <rubout> illegal          <linefeed>    illegal * 
% alphabetic       .        alphabetic, dot, decimal point
& alphabetic       +        alphabetic, plus sign 
' alphabetic *     -        alphabetic, minus sign 
( alphabetic *     *        alphabetic 
) alphabetic *     /        alphabetic, ratio marker 
, alphabetic *     @        alphabetic 
0 alphadigit       A, a     alphadigit 
1 alphadigit       B, b     alphadigit 
2 alphadigit       C, c     alphadigit 
3 alphadigit       D, d     alphadigit, double-float exponent marker 
4 alphadigit       E, e     alphadigit, float exponent marker 
5 alphadigit       F, f     alphadigit, single-float exponent marker 
6 alphadigit       G, g     alphadigit 
7 alphadigit       H, h     alphadigit 
8 alphadigit       I, i     alphadigit 
9 alphadigit       J, j     alphadigit 
: package marker   K, k     alphadigit 
; alphabetic *     L, l     alphadigit, long-float exponent marker 
< alphabetic       M, m     alphadigit 
= alphabetic       N, n     alphadigit 
> alphabetic       O, o     alphadigit 
? alphabetic       P, p     alphadigit 
[ alphabetic       Q, q     alphadigit 
\ alphabetic *     R, r     alphadigit 
] alphabetic       S, s     alphadigit, short-float exponent marker 
^ alphabetic       T, t     alphadigit 
_ alphabetic       U, u     alphadigit 
` alphabetic *     V, v     alphadigit 
{ alphabetic       W, w     alphadigit 
| alphabetic *     X, x     alphadigit 
} alphabetic       Y, y     alphadigit 
~ alphabetic       Z, z     alphadigit 

Entries marked with an asterisk (*) are normally shadowed, because in the standard readtable those characters have syntactic type whitespace, macro character, or escape rather than constituent.

----------------------------------------------------------------

A token is a potential number if it satisfies the following requirements:

It consists entirely of digits, signs (+ or -), ratio markers (/), decimal points (.), extension characters (^ or _), and number markers. (A number marker is a letter. Whether a letter may be treated as a number marker depends on context, but no letter that is adjacent to another letter may ever be treated as a number marker. Floating-point exponent markers are instances of number markers.)
It contains at least one digit. (Letters may be considered to be digits, depending on the value of *read-base*, but only in tokens containing no decimal points.)
It begins with a digit, sign, decimal point, or extension character.
It does not end with a sign.

As examples, the following tokens are potential numbers, but they are not actually numbers as defined below, and so are reserved tokens. (They do indicate some interesting possibilities for future extensions.)

1b5000       777777q      1.7J       -3/4+6.7J    12/25/83 
27^19        3^4/5        6//7       3.1.2.6      ^-43^ 
3.141_592_653_589_793_238_4           -3.7+2.6i-6.17j+19.6k
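
The four requirements above can be approximated by a small predicate. What follows is only a sketch under simplifying assumptions: *read-base* is taken to be 10, the token is supplied as a string containing no escape characters, and any letter with no letter beside it is accepted as a number marker without regard to context. The name potential-number-p is hypothetical and is not part of Common Lisp.

```lisp
(defun potential-number-p (token)
  "Approximate the potential-number rules for *READ-BASE* 10.
TOKEN is a string holding the characters of a token without escapes."
  (let ((n (length token)))
    (flet ((number-marker-p (i)
             ;; A letter may be a number marker only if neither
             ;; neighboring character is also a letter.
             (and (alpha-char-p (char token i))
                  (not (and (> i 0)
                            (alpha-char-p (char token (1- i)))))
                  (not (and (< (1+ i) n)
                            (alpha-char-p (char token (1+ i))))))))
      (and (plusp n)
           ;; Requirement 1: only digits, signs, ratio markers, decimal
           ;; points, extension characters, and number markers.
           (loop for i from 0 below n
                 for c = (char token i)
                 always (or (digit-char-p c)
                            (find c "+-/.^_")
                            (number-marker-p i)))
           ;; Requirement 2: at least one digit.
           (some #'digit-char-p token)
           ;; Requirement 3: begins with a digit, sign, decimal point,
           ;; or extension character.
           (let ((c0 (char token 0)))
             (or (digit-char-p c0) (find c0 "+-.^_")))
           ;; Requirement 4: does not end with a sign.
           (not (find (char token (1- n)) "+-"))
           t))))

(potential-number-p "12/25/83")   ; true
(potential-number-p "^-43^")      ; true
(potential-number-p "1+")         ; false, ends with a sign
(potential-number-p "ab.cd")      ; false, begins with a letter
```

Because the sketch fixes the radix at 10, it does not model the treatment of letters as digits under larger values of *read-base*.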

The following tokens are not potential numbers but are always treated as symbols:

/            /5           +            1+           1- 
foo+         ab.cd        _            ^            ^/-
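
As a quick check of this list, the tokens can be handed to the reader directly. This is a minimal sketch that assumes the standard readtable; *read-base* is bound to 10 explicitly so the result does not depend on the caller's setting.

```lisp
(let ((*read-base* 10)
      (tokens '("/" "/5" "+" "1+" "1-" "foo+" "ab.cd" "_" "^" "^/-")))
  ;; Every token in the list reads as a symbol, so this returns T.
  (every (lambda (token) (symbolp (read-from-string token)))
         tokens))
```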

The following tokens are potential numbers if the value of *read-base* is 16 (an abnormal situation), but they are always treated as symbols if the value of *read-base* is 10 (the usual value):

bad-face        25-dec-83       a/b     fad_cafe        f^

It is possible for there to be an ambiguity as to whether a letter should be treated as a digit or as a number marker. In such a case, the letter is always treated as a digit rather than as a number marker.

Note that the printed representation for a potential number may not contain any escape characters. An escape character robs the following character of all syntactic qualities, forcing it to be strictly alphabetic and therefore unsuitable for use in a potential number. For example, all of the following representations are interpreted as symbols, not numbers:

\256   25\64   1.0\E6   |100|   3\.14159   |3/4|   3\/4   5||

In each case, removing the escape character(s) would allow the token to be treated as a number.
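
The effect of escape characters can be observed with read-from-string. The sketch below assumes the standard readtable; the helper token-symbol-name is a hypothetical name introduced only for illustration. Note that each backslash is doubled because it appears inside a Lisp string literal.

```lisp
(defun token-symbol-name (text)
  "Read TEXT and return the name of the symbol read,
or NIL if TEXT did not read as a symbol."
  (let ((object (read-from-string text)))
    (and (symbolp object) (symbol-name object))))

(token-symbol-name "\\256")     ; "256"
(token-symbol-name "|100|")     ; "100"
(token-symbol-name "1.0\\E6")   ; "1.0E6"
(token-symbol-name "3\\/4")     ; "3/4"
(token-symbol-name "5||")       ; "5"
(token-symbol-name "256")       ; NIL, because 256 reads as an integer
```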

If a potential number can in fact be interpreted as a number according to the BNF syntax in table 22-2, then a number object of the appropriate type is constructed and returned. It should be noted that in a given implementation it may be that not all tokens conforming to the actual syntax for numbers can actually be converted into number objects. For example, specifying too large or too small an exponent for a floating-point number may make the number impossible to represent in the implementation. Similarly, a ratio with denominator zero (such as -35/000) cannot be represented in any implementation. In any such circumstance where a token with the syntax of a number cannot be converted to an internal number object, an error is signaled. (On the other hand, an error must not be signaled for specifying too many significant digits for a floating-point number; an appropriately truncated or rounded value should be produced.)
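
A few tokens that fit the syntax of table 22-2 illustrate the kinds of number objects produced. This is a sketch assuming the standard readtable and a *read-base* of 10; read-number-or-error is a hypothetical helper that turns any error signaled by the reader into the keyword :error.

```lisp
(defun read-number-or-error (text)
  "Read TEXT; return the object read, or :ERROR if the reader signals an error."
  (handler-case (values (read-from-string text))
    (error () :error)))

(read-number-or-error "12.")      ; the integer 12
(read-number-or-error "+7")       ; the integer 7
(read-number-or-error "-3/6")     ; the ratio -1/2
(read-number-or-error ".5")       ; a floating-point number equal to 1/2
(read-number-or-error "1.5d3")    ; a double-float equal to 1500
(read-number-or-error "-35/000")  ; :ERROR, because the denominator is zero
```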

There is an omission in the syntax of numbers as described in table 22-2, in that the syntax does not account for the possible use of letters as digits. The radix used for reading integers and ratios is normally decimal. However, this radix is actually determined by the value of the variable *read-base*, whose initial value is 10. *read-base* may take on any integral value between 2 and 36; let this value be n. Then a token x is interpreted as an integer or ratio in base n if it could be properly so interpreted in the syntax #nRx (see section 22.1.4). So, for example, if the value of *read-base* is 16, then the printed representation

(a small face in a bad place)

would be interpreted as if the following representation had been read with *read-base* set to 10:

(10 small 64206 in 10 2989 place)

because four of the seven tokens in the list can be interpreted as hexadecimal numbers. This facility is intended to be used in reading files of data that for some reason contain numbers not in decimal radix; it may also be used for reading programs written in Lisp dialects (such as MacLisp) whose default number radix is not decimal. Non-decimal constants in Common Lisp programs or portable Common Lisp data files should be written using #O, #X, #B, or #nR syntax.
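
The example above can be reproduced by binding *read-base* around a call to read-from-string. This is a minimal sketch assuming the standard readtable, so the symbols are shown with upper-case names.

```lisp
(let ((*read-base* 16))
  ;; Tokens made only of hexadecimal digits are read as integers;
  ;; the others are read as symbols.
  (read-from-string "(a small face in a bad place)"))
;; returns (10 SMALL 64206 IN 10 2989 PLACE)
```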

When *read-base* has a value greater than 10, an ambiguity is introduced into the actual syntax for numbers because a letter can serve as either a digit or an exponent marker; a simple example is 1E0 when the value of *read-base* is 16. The ambiguity is resolved in accordance with the general principle that interpretation as a digit is preferred to interpretation as a number marker. The consequence in this case is that if a token can be interpreted as either an integer or a floating-point number, then it is taken to be an integer.
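
The token 1E0 shows the resolution rule concretely. In this sketch, which assumes the standard readtable, the same characters are read once in each radix.

```lisp
(let ((*read-base* 10))
  (read-from-string "1E0"))   ; a floating-point number equal to 1

(let ((*read-base* 16))
  (read-from-string "1E0"))   ; the integer 480, that is, hexadecimal 1E0
```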

If a token consists solely of dots (with no escape characters), then an error is signaled, except in one circumstance: if the token is a single dot and occurs in a situation appropriate to ``dotted list'' syntax, then it is accepted as a part of such syntax. Signaling an error catches not only misplaced dots in dotted list syntax but also lists that were truncated by *print-length* cutoff, because such lists end with a three-dot sequence (...). Examples:

(a . b)         ;A dotted pair of a and b 
(a.b)           ;A list of one element, the symbol named a.b 
(a. b)          ;A list of two elements a. and b 
(a .b)          ;A list of two elements a and .b 
(a \. b)        ;A list of three elements a, ., and b 
(a |.| b)       ;A list of three elements a, ., and b 
(a \... b)      ;A list of three elements a, ..., and b 
(a |...| b)     ;A list of three elements a, ..., and b 
(a b . c)       ;A dotted list of a and b with c at the end 
.iot            ;The symbol whose name is .iot 
(. b)           ;Illegal; an error is signaled 
(a .)           ;Illegal; an error is signaled 
(a .. b)        ;Illegal; an error is signaled 
(a . . b)       ;Illegal; an error is signaled 
(a b c ...)     ;Illegal; an error is signaled
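
Several of these cases can be checked with read-from-string. The sketch below assumes the standard readtable; dots-read-error-p is a hypothetical helper that reports whether the reader signaled an error. Backslashes are doubled because they appear inside Lisp string literals.

```lisp
(defun dots-read-error-p (text)
  "Return T if reading TEXT signals an error, and NIL otherwise."
  (handler-case (progn (read-from-string text) nil)
    (error () t)))

(read-from-string "(a . b)")                             ; (A . B)
(length (read-from-string "(a.b)"))                      ; 1
(mapcar #'symbol-name (read-from-string "(a \\. b)"))    ; ("A" "." "B")
(mapcar #'symbol-name (read-from-string "(a |...| b)"))  ; ("A" "..." "B")
(dots-read-error-p "(a .. b)")                           ; T
(dots-read-error-p "(a b c ...)")                        ; T
```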

In all other cases, the token is construed to be the name of a symbol. If there are any package markers (colons) in the token, they divide the token into pieces used to control the lookup and creation of the symbol.

old_change_begin
If there is a single package marker, and it occurs at the beginning of the token, then the token is interpreted as a keyword, that is, a symbol in the keyword package. The part of the token after the package marker must not have the syntax of a number.

If there is a single package marker not at the beginning or end of the token, then it divides the token into two parts. The first part specifies a package; the second part is the name of an external symbol available in that package. Neither of the two parts may have the syntax of a number.

If there are two adjacent package markers not at the beginning or end of the token, then they divide the token into two parts. The first part specifies a package; the second part is the name of a symbol within that package (possibly an internal symbol). Neither of the two parts may have the syntax of a number.
old_change_end

change_begin
X3J13 voted in March 1988 (COLON-NUMBER) to clarify that, in the situations described in the preceding three paragraphs, the restriction on the syntax of the parts should be strengthened: none of the parts may have the syntax of even a potential number. Tokens such as :3600, :1/2, and editor:3.14159 were already ruled out; this clarification further declares that such tokens as :2^3, compiler:1.7J, and Christmas:12/25/83 are also in error and therefore should not be used in portable programs. Implementations may differ in their treatment of such package-marked potential numbers.
change_end

If a symbol token contains no package markers, then the entire token is the name of the symbol. The symbol is looked up in the default package, which is the value of the variable *package*.

All other patterns of package markers, including the cases where there are more than two package markers or where a package marker appears at the end of the token, at present do not mean anything in Common Lisp (see chapter 11). It is therefore currently an error to use such patterns in a Common Lisp program. The valid patterns for tokens may be summarized as follows:

nnnnn           a number
xxxxx           a symbol in the current package
:xxxxx          a symbol in the keyword package
ppppp:xxxxx     an external symbol in the ppppp package
ppppp::xxxxx    a (possibly internal) symbol in the ppppp package

where nnnnn has the syntax of a number, and xxxxx and ppppp do not have the syntax of a number.

change_begin
In accordance with the X3J13 decision noted above (COLON-NUMBER), xxxxx and ppppp may not have the syntax of even a potential number.
change_end
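
The valid patterns summarized above can be exercised as follows. This is a minimal sketch assuming the standard readtable and a current package that uses the common-lisp package (as the user package normally does); cl is the standard nickname of that package.

```lisp
(keywordp (read-from-string ":color"))          ; T, a symbol in the keyword package
(eq (read-from-string "keyword:color") :color)  ; T, the same keyword written with an explicit package
(eq (read-from-string "cl:car") 'car)           ; T, an external symbol of the cl package
(eq (read-from-string "cl::car") 'car)          ; T, two package markers reach the same symbol
(integerp (read-from-string "3600"))            ; T, a number rather than a symbol
```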

[Variable]
*read-base*

The value of *read-base* controls the interpretation of tokens by read as being integers or ratios. Its value is the radix in which integers and ratios are to be read; the value may be any integer from 2 to 36 (inclusive) and is normally 10 (decimal radix). Its value affects only the reading of integers and ratios. In particular, floating-point numbers are always read in decimal radix. The value of *read-base* does not affect the radix for rational numbers whose radix is explicitly indicated by #O, #X, #B, or #nR syntax or by a trailing decimal point.

Care should be taken when setting *read-base* to a value larger than 10, because tokens that would normally be interpreted as symbols may be interpreted as numbers instead. For example, with *read-base* set to 16 (hexadecimal radix), variables with names such as a, b, f, bad, and face will be treated by the reader as numbers (with decimal values 10, 11, 15, 2989, and 64206, respectively). The ability to alter the input radix is provided in Common Lisp primarily for the purpose of reading data files in special formats, rather than for the purpose of altering the default radix in which to read programs. The user is strongly encouraged to use #O, #X, #B, or #nR syntax when notating non-decimal constants in programs.
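
The following sketch, assuming the standard readtable, shows which tokens are affected by *read-base* and which carry their own radix.

```lisp
(let ((*read-base* 16))
  (list (read-from-string "10")        ; read in base 16
        (read-from-string "10.")       ; trailing decimal point forces decimal
        (read-from-string "1/10")      ; ratios are read in base 16 as well
        (read-from-string "#o10")      ; explicit octal
        (read-from-string "#b10")      ; explicit binary
        (read-from-string "#36rZZ")))  ; explicit radix 36
;; returns (16 10 1/16 8 2 1295)
```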

Compatibility note: This variable corresponds to the variable called ibase in MacLisp and to the function called radix in Interlisp.

[Variable]
*read-suppress*

When the value of *read-suppress* is nil, the Lisp reader operates normally. When it is not nil, then most of the interesting operations of the reader are suppressed; input characters are parsed, but much of what is read is not interpreted.

The primary purpose of *read-suppress* is to support the operation of the read-time conditional constructs #+ and #- (see section 22.1.4). It is important for these constructs to be able to skip over the printed representation of a Lisp expression despite the possibility that the syntax of the skipped expression may not be entirely legal for the current implementation; this is because a primary application of #+ and #- is to allow the same program to be shared among several Lisp implementations despite small incompatibilities of syntax.

A non-nil value of *read-suppress* has the following specific effects on the Common Lisp reader:

All extended tokens are completely uninterpreted. It matters not whether the token looks like a number, much less like a valid number; the pattern of package markers also does not matter. An extended token is simply discarded and treated as if it were nil; that is, reading an extended token when *read-suppress* is non-nil simply returns nil. (One consequence of this is that the error concerning improper dotted-list syntax will not be signaled.)
Any standard # macro-character construction that requires, permits, or disallows an infix numerical argument, such as #nR, will not enforce any constraint on the presence, absence, or value of such an argument.
The #\ construction always produces the value nil. It will not signal an error even if an unknown character name is seen.
Each of the #B, #O, #X, and #R constructions always scans over a following token and produces the value nil. It will not signal an error even if the token does not have the syntax of a rational number.
The #* construction always scans over a following token and produces the value nil. It will not signal an error even if the token does not consist solely of the characters 0 and 1.

old_change_begin
Each of the #. and #, constructions reads the following form (in suppressed mode, of course) but does not evaluate it. The form is discarded and nil is produced.
old_change_end

change_begin
X3J13 voted (SHARP-COMMA-CONFUSION) to remove #, from the language.
change_end

Each of the #A, #S, and #: constructions reads the following form (in suppressed mode, of course) but does not interpret it in any way; it need not even be a list in the case of #S, or a symbol in the case of #:. The form is discarded and nil is produced.
The #= construction is totally ignored. It does not read a following form. It produces no object, but is treated as whitespace.
The ## construction always produces nil.

Note that, no matter what the value of *read-suppress*, parentheses still continue to delimit (and construct) lists; the #( construction continues to delimit vectors; and comments, strings, and the quote and backquote constructions continue to be interpreted properly. Furthermore, such situations as '), #<, #), and #<space> continue to signal errors.

In some cases, it may be appropriate for a user-written macro-character definition to check the value of *read-suppress* and to avoid certain computations or side effects if its value is not nil.
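
The treatment of extended tokens under *read-suppress*, and its use by #+, can be sketched as follows. The package name no-such-package and the feature name no-such-feature are assumptions chosen for illustration; the example presumes that neither exists in the running Lisp.

```lisp
;; Normally this token would signal an error, since the package does
;; not exist; with *read-suppress* non-nil it is not interpreted at all.
(let ((*read-suppress* t))
  (read-from-string "no-such-package:foo"))   ; NIL

;; #+ skips the form that follows an absent feature by reading it
;; in suppressed mode, so the ill-formed token is harmless.
(read-from-string "(1 #+no-such-feature no-such-package:foo 2)")   ; (1 2)
```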

change_begin

[Variable]
*read-eval*

X3J13 voted in June 1989 (DATA-IO) to add a new reader control variable, *read-eval*, whose default value is t. If *read-eval* is false, the #. reader macro signals an error.

Printing is also affected. If *read-eval* is false and *print-readably* is true, any print-object method that would otherwise output a #. reader macro must either output something different or signal an error of type print-not-readable.

Binding *read-eval* to nil is useful when reading data that came from an untrusted source, such as a network or a user-supplied data file; it prevents the #. reader macro from being exploited as a ``Trojan horse'' to cause arbitrary forms to be evaluated.
change_end
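
A minimal sketch of reading untrusted text with #. disabled might look like this. The function name read-untrusted and the keyword :rejected are hypothetical choices made for the example; any error signaled by the reader is mapped to that keyword.

```lisp
(defun read-untrusted (text)
  "Read one object from TEXT with *READ-EVAL* bound to NIL.
Return :REJECTED if the reader signals an error."
  (let ((*read-eval* nil))
    (handler-case (values (read-from-string text))
      (error () :rejected))))

(read-untrusted "(1 2 3)")      ; (1 2 3)
(read-untrusted "#.(+ 1 2)")    ; :REJECTED

;; With *read-eval* true, the form after #. is evaluated at read time.
(let ((*read-eval* t))
  (read-from-string "#.(+ 1 2)"))   ; 3
```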
