Typographic text as a data type

5 November 1993

Wilfred J. Hansen
Andrew Consortium
Carnegie Mellon University
5000 Forbes Ave.
Pittsburgh PA 15213
412 268 6788 (Home: 412 421 5121)
wjh@cmu.edu

Abstract: Since computer text has moved beyond plain ASCII to embrace typographical styles, it is time to consider how programs can process such text. This paper describes the semantics of a language facility for accessing and modifying text styles. Functions are provided for defining styles, imposing them on text, clearing them from text, searching for them, and text traversal by like-styled segments. Examples demonstrate the utility of these functions. Alternative approaches are sketched.

In bygone days, users were fortunate if their computer printed numeric tables with values identified by labels in upper case letters. Then came mixed case text and later the selection of characters in ASCII. More recently, computer text has come to include the full printer's arsenal: typographic styles including fonts, indentation, justification, and so on.

In parallel with program output, programming languages have advanced from the 48 characters of Fortran to double that in ASCII and even more in ISO-8859 and representations of non-European languages. It is now time for programming languages to acknowledge typographical text. Language implementations should accept styles within program text and comments, string constants should permit typography, and languages should provide operations for dealing with typographical styles. This paper describes as a set of functions the semantics of a facility for operations on styled text.

Reasonable arguments can be presented on both sides of the question of whether a language ought to support styled text. In a low-level language like C or C++ it may be desirable to forgo styles in order to allow programmers to create parochial style facilities or to allow implementation of multiple facilities. For modern higher-level languages a built-in style system permits efficiency of implementation, absolves the programmer from the implementation task, and permits typography in programs that are to be manipulated by other programs (e.g., a cross-referencer or pretty-printer).

The only languages currently offering style operations are macro languages associated with word processors. At best these languages deal with styles in the step-by-step sequence of a user operating at the console: select text, apply menu operations. This works reasonably well, but may be a bit awkward for systems that are not interactive.

Ness [Hansen, 1990], the language described below, is a macro language, but contains sufficient additional functionality to serve as a general purpose programming language. Ness's string data type is described in the next section. Later sections describe the styles facility as it has been implemented, and section 4 discusses various alternative designs.

1. Subseq Values

Ness is an ideal environment for dealing with styled strings because its subseq data type [Hansen, 1992] relates naturally to a styled portion of text. A subseq data value refers to a subsequence of some underlying text and is thus appropriate as an argument to a function to impose styles on a text or as the result of a function to find text in some particular style.

Five primitive operations can be defined for the subseq type:

next(s) - returns a subseq referring to the character after the last one referred to by s. Empty if s ends at the end of its underlying string.

start(s) - returns an empty subseq referring to the position between the character before s and the first character of s.

base(s) - returns a subseq for the entire sequence of which s refers to a part.

extent(r, s) - returns a subseq for all text from the beginning of r to the end of s. If the end of s precedes the beginning of r, the value is the empty subseq at the end of s.

replace(r, s) - modifies the text underlying r so the portion referred to by r is removed and is replaced with a copy of s. Returns a reference to the copy of s in its new location.

Figure 1 illustrates these primitives.

Figure 1. Four primitive functions. The subseq values below the base show the result of applying the primitive functions to m, s, and p.

Compositions of the primitive functions can yield references to all other interesting subsequences relative to a given subsequence. For instance, the character which begins where s begins is next(start(s)) and the empty sequence at the end of m is start(next(m)).
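
The behavior of the five primitives can be approximated outside Ness. The following is a minimal Python sketch, not the Ness implementation: the names Base, Subseq, and next_ are assumptions made for illustration, a subseq is modeled as a pair of offsets into a shared character list, and (unlike Ness) other subseq values on the same base are not adjusted when replace changes the text.

```python
class Base:
    """Mutable underlying text, stored as a list of characters (hypothetical)."""
    def __init__(self, text):
        self.chars = list(text)


class Subseq:
    """A reference to chars[lo:hi] of some Base (hypothetical model)."""
    def __init__(self, base, lo, hi):
        self.base, self.lo, self.hi = base, lo, hi

    def text(self):
        return "".join(self.base.chars[self.lo:self.hi])


def start(s):
    # Empty subseq just before the first character of s.
    return Subseq(s.base, s.lo, s.lo)


def next_(s):
    # The character after s; empty if s ends at the end of its base.
    hi = min(s.hi + 1, len(s.base.chars))
    return Subseq(s.base, s.hi, hi)


def base(s):
    # The entire sequence of which s refers to a part.
    return Subseq(s.base, 0, len(s.base.chars))


def extent(r, s):
    # From the beginning of r to the end of s; r and s must share a base.
    if s.hi < r.lo:
        return Subseq(s.base, s.hi, s.hi)  # empty subseq at the end of s
    return Subseq(r.base, r.lo, s.hi)


def replace(r, s):
    # Replace the portion referred to by r with a copy of s.
    new = list(s.text())
    r.base.chars[r.lo:r.hi] = new
    return Subseq(r.base, r.lo, r.lo + len(new))
```

With a base of "hello world", m referring to "hello" and s referring to "world", next_(start(s)) yields "w" and start(next_(m)) is the empty subseq at offset 5, matching the compositions described above.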

2. Basic Style Functions

Style values in the Andrew User Interface System include a level of indirection. Styled text is marked with a named style; an additional table associates styling attributes with the named style. For instance, the named style "Heading" has the primitive attributes of boldness and negative indentation. Styled text can nest; that is, text marked with the Heading style can contain text in the Italic style. For visual purposes, this is indistinguishable from having three consecutive segments having, respectively, the styles Heading, Heading/Italic, and Heading.

Ness functions specify style values by supplying a subseq referring to text whose first character is in the desired style. The style facilities of Ness include a number of functions, but all can be described in terms of three basic functions:

addstyles(m, s) - The text referred to by m is modified by imposing the styles of the character next(start(s)). Returns m, which now refers to the newly styled text.

hasstyles(m, s) - boolean. Returns True if the character next(start(m)) has all of the named styles of next(start(s)). Otherwise returns False.

removestyles(m, s) - modifies the text referred to by m so it does not have any of the named styles which apply to the character next(start(s)). Returns m, which now refers to newly unstyled text. (By an oversight, removestyles has not been implemented. That it has not been missed emphasizes the fact that most style algorithms written so far have relied primarily on addstyles and hasstyles.)

One simple function that can be written in terms of these primitives is clearstyles, which removes all the styles from a piece of text:

subseq function clearstyles(subseq m)
    return removestyles(m, m)
end function

(Clearstyles has been implemented, but there are insufficient tools to write removestyles in terms of clearstyles.)

Another important function that has been implemented is searchforstyle:

searchforstyle(m, s) - searches m for a sequence of text having all the styles of next(start(s)) and returns a reference to that sequence. If there is none, returns the empty sequence at start(next(m)).
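
The semantics of these functions can be sketched with a simple model in which every character carries a set of style names. This Python sketch is an assumption-laden approximation, not the AUIS representation: StyledText, sample, and first_styles are hypothetical names, a subseq is a (text, lo, hi) tuple, and nesting of styles is not modeled.

```python
class StyledText:
    """Text with one set of style names per character (hypothetical model)."""
    def __init__(self, text, styles=()):
        self.chars = list(text)
        self.styles = [set(styles) for _ in self.chars]


def sample(name):
    # A one-character text in the named style, used to specify a style.
    return (StyledText("x", [name]), 0, 1)


def first_styles(ref):
    # Named styles of next(start(ref)); empty set if ref is at the end.
    t, lo, _ = ref
    return t.styles[lo] if lo < len(t.chars) else set()


def addstyles(m, s):
    t, lo, hi = m
    wanted = set(first_styles(s))      # copy, since s may overlap m
    for i in range(lo, hi):
        t.styles[i] |= wanted
    return m


def hasstyles(m, s):
    return first_styles(s) <= first_styles(m)


def removestyles(m, s):
    t, lo, hi = m
    unwanted = set(first_styles(s))    # copy, since s may overlap m
    for i in range(lo, hi):
        t.styles[i] -= unwanted
    return m


def clearstyles(m):
    return removestyles(m, m)


def searchforstyle(m, s):
    # First run of characters in m having all the styles of s.
    t, lo, hi = m
    wanted = set(first_styles(s))
    i = lo
    while i < hi and not wanted <= t.styles[i]:
        i += 1
    if i == hi:
        return (t, hi, hi)             # empty subseq at the end of m
    j = i
    while j < hi and wanted <= t.styles[j]:
        j += 1
    return (t, i, j)
```

In this model, clearstyles as defined above removes the styles found on the first character of m; a style that begins later within m would need a further call.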

3. Examples

Many of the programs that have been written with the style functions translate text from a markup language into Andrew text. For instance, in Scribe markup a styled segment is delimited with

@style{ ... }

where the text within the braces is to be given the named style and the delimiters deleted. The heart of the algorithm relies on a table relating style names to styles:

subseq style_table := "@b{@i{@heading{" ~ ...

In this table, each entry is in the style named in the entry; the first is bold, the second italic, and the third is in the heading style. The translator algorithm first searches for '@style{' and sets one variable, say b, to refer to the entirety of this beginning delimiter. Taking care to recursively process nested styles, it then does a second search for the closing brace and sets a second variable, say e, to refer to it. The algorithm proceeds with:

s := search(style_table, b) -- find b in table
addstyles(extent(b, e), s) -- apply style from table to text
replace(b, "") -- delete opening delimiter
replace(e, "") -- delete closing delimiter
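
The same translation can be sketched outside Ness. The Python below is a hedged illustration rather than the author's translator: instead of modifying the text in place with addstyles and replace, it builds a plain string plus a list of (start, end, style name) spans, using a stack to handle nested delimiters. The function name scribe_to_styled and the span representation are assumptions, only the brace form of Scribe delimiters is handled, and an unmatched closing brace is copied literally.

```python
import re

# Matches an opening delimiter such as "@b{" or "@heading{".
OPENER = re.compile(r"@(\w+)\{")


def scribe_to_styled(src):
    """Return (plain_text, spans); spans are (start, end, style_name)."""
    out = []      # characters of the output text
    spans = []    # completed style spans, innermost first
    stack = []    # (style_name, offset in output where the style began)
    i = 0
    while i < len(src):
        m = OPENER.match(src, i)
        if m:
            # Opening delimiter: remember the style, emit nothing.
            stack.append((m.group(1), len(out)))
            i = m.end()
        elif src[i] == "}" and stack:
            # Closing delimiter: the innermost open style ends here.
            name, begin = stack.pop()
            spans.append((begin, len(out), name))
            i += 1
        else:
            out.append(src[i])
            i += 1
    return "".join(out), spans
```

For the input "a @b{bold @i{both}} z" this yields the text "a bold both z" with an italic span over "both" nested inside a bold span over "bold both".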

When two or more people work on a single document, it sometimes happens that different styles are used for headings or other document elements. One author might use the Heading style while the other just uses Bold, and both look the same on the screen. A Ness macro could be added as a menu option to normalize heading styles. The macro would incorporate a loop that searches for Bold text; the string constant naming the style is itself set in bold type, since a style is specified by text that carries it:

m := searchforstyle(text, "bold")

Let us assume that there is a function HeaderLine which determines whether its argument is in a header line, and if so returns the entire line; if not, it returns an empty subseq value. Then the macro continues with

s := HeaderLine(m)
if s /= "" then
    clearstyles(s)
    addstyles(s, "heading")
end if

4. Alternative Designs

The design presented here describes style operations which have the side effect of altering their argument. In the formal presentation of the subsequence reference operations, a functional approach is first described and the replace operator is omitted. It is possible to define style operations which produce a re-styled copy of their argument. However, this creates much more new text, so it was omitted from the design. A programmer wishing to simulate pure functional programming can define faddstyles, which returns a styled copy of its argument:

subseq function faddstyles(subseq m, subseq s)
    return addstyles(m~"", s)
end function

where concatenation (~) always copies its arguments and produces a new string.

In principle, styles are a separate data type from text; by this light, it is incorrect to utilize styled text to represent styles. It is especially awkward in that only the first character of the styled text is applied, so any remaining text is simply ignored. However, styles are not really used enough in programs to justify the introduction of the machinery for a new data type together with suitable constants, values, and operations. Experience to date indicates that using text as the representation of a style is adequate.

If styles were a separate data type, there would be operations in that type for creating and operating on styles. With the text-representation approach, however, a style definition function is needed:

definestyle(name, s) - associates with name the style of next(start(s)).

The value s could itself have been used to impose its style, which might be a composite of several named styles. The result of definestyle is instead a single new named style, which will appear, for instance, in menus.

The designer of some applications may wish to allow the user to enter the name of a style to be imposed on some text. For this purpose Ness provides the function:

addstylebyname(m, name) - imposes on the text referenced by m the style whose name is given by the text of name.

It is an artifact of the underlying AUIS text object that styles may be nested and will always apply to whatever segment of text they are initially applied to. There can be two adjacent segments of bold text, and this is different from one single segment of bold text which encompasses the same characters. One can imagine instead a system where these two cases are indistinguishable, perhaps because the styles for each character are independent. Since programmers may become confused with styles as they are now, it would be possible to introduce a style normalizer which would combine adjacent equal styles. (It could also eliminate the common problem of extraneous styles wrapped around whitespace.)
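
Such a normalizer has not been described for Ness; the following Python sketch only illustrates the idea under stated assumptions. Styles are modeled as (start, end, style name) spans over a plain string, the function name normalize is hypothetical, and spans of the same name are merged whenever they touch or overlap.

```python
def normalize(text, spans):
    """Merge touching spans of the same style; drop whitespace-only spans."""
    merged = []
    # Group by style name, then walk each name's spans left to right.
    for start, end, name in sorted(spans, key=lambda sp: (sp[2], sp[0], sp[1])):
        if text[start:end].strip() == "":
            continue  # style wrapped around whitespace (or nothing)
        if merged and merged[-1][2] == name and start <= merged[-1][1]:
            # Adjacent or overlapping span in the same style: extend it.
            prev = merged.pop()
            merged.append((prev[0], max(prev[1], end), name))
        else:
            merged.append((start, end, name))
    return sorted(merged)
```

Given "one two three" with bold spans over "one" and " two" and italic spans over the following space and "three", the result is one bold span over "one two" and one italic span over "three".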

Another artifact of the AUIS environment is the indirection of mapping style names to style attributes. This mapping is allowed in the Ness implementation to affect style comparison; hasstyles compares the names of styles rather than their attributes. This means that two segments that have the same attributes may not compare as having the same styles. An alternative design would assign style attributes to text without the intervention of named styles; this alternative would be difficult to implement within AUIS text and would not have benefited any of the programs written to date.
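
The difference between comparing by name and comparing by attribute can be made concrete with a small Python sketch. The table contents are invented for illustration: "Strong" is a hypothetical second style name with the same attributes as "Bold", and neither function is part of Ness.

```python
# Hypothetical table associating named styles with primitive attributes.
STYLE_TABLE = {
    "Bold": frozenset({"bold"}),
    "Strong": frozenset({"bold"}),   # different name, same attributes
}


def same_by_name(a, b):
    # Comparison in the manner of hasstyles: style names only.
    return set(a) == set(b)


def same_by_attributes(a, b):
    # The alternative design: compare the attributes the names map to.
    def attrs(names):
        result = set()
        for n in names:
            result |= STYLE_TABLE[n]
        return result
    return attrs(a) == attrs(b)
```

Here same_by_name(["Bold"], ["Strong"]) is false while same_by_attributes(["Bold"], ["Strong"]) is true, which is the situation described above.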

5. Traversal by Groups and Segments

When processing styled text, it is common to traverse the text in terms of its styles. Since styles may be nested, it is not clear exactly what traversal orders should be provided. Ness provides traversal in terms of segments and groups:

segment - the text between two points where styles change. All the text in a segment has the same style.

group - consecutive text on which a style has been imposed. The text may have additional styles nested within it.

Suppose we have text with styles nested in this way:

     2222       55
     33333333 44444   66
   111111111111111111 777
abcdefghijklmnopqrstuvwxyz

That is, d...u has style 1, fghi has style 2, and so on. Then the style segments are abc, de, fghi, jklm, n, op, qr, s, tu, v, wx, y, and z; while the style groups are d...u, fghi, f...m, opqrs, qr, wx, and wxy. Note that the list of style segments covers the entire text, but the style groups cover only the text that has styles.
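
The segment list can be checked mechanically. In this Python sketch the seven style groups are written as half-open offset pairs into the alphabet; the names TEXT, GROUPS, and style_segments are assumptions for illustration. Segments are obtained by cutting the text at every group boundary, which suffices here because the set of styles differs on either side of each boundary.

```python
import string

TEXT = string.ascii_lowercase

# Style number -> (start, end) offsets, end exclusive.
GROUPS = {
    1: (3, 21),    # d...u
    2: (5, 9),     # fghi
    3: (5, 13),    # f...m
    4: (14, 19),   # opqrs
    5: (16, 18),   # qr
    6: (22, 24),   # wx
    7: (22, 25),   # wxy
}


def style_segments(text, groups):
    """Cut text at every point where some style group begins or ends."""
    cuts = {0, len(text)}
    for lo, hi in groups.values():
        cuts.update((lo, hi))
    edges = sorted(cuts)
    return [text[a:b] for a, b in zip(edges, edges[1:])]


def style_groups(text, groups):
    """The text of each group, in style-number order."""
    return [text[lo:hi] for _, (lo, hi) in sorted(groups.items())]
```

Applied to TEXT and GROUPS, style_segments reproduces the thirteen segments listed above, from abc through z.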

Traversal is done by functions whose argument is a subsequence of the text and whose result is the next succeeding appropriate subsequence:

nextstylesegment(m) - returns a subseq for the text from start(next(m)) up to the next style change.

nextstylegroup(m) - returns a subseq for a longer style group that starts at the same place as m; if there is none, returns the shortest of the style groups starting at the next position after start(m) where a style group starts.

enclosingstylegroup(m) - returns a subseq for the smallest style group that covers the same text as m but also covers additional text.

When applied to the alphabet text above, nextstylesegment returns successively each of the listed segments. Nextstylegroup returns the groups in their numeric order. Enclosingstylegroup performs these mappings:

fghi -> fghijklm
fghijklm -> d...u
qr -> opqrs
opqrs -> d...u
wx -> wxy
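
Both traversal functions can be modeled over an explicit list of groups. The Python below is a sketch under assumptions, not the Ness implementation: groups are half-open offset pairs into the alphabet (style 4 covering opqrs), a subseq is an offset pair, None stands in for "no such group", and where several longer groups start at the same place the shortest of them is chosen.

```python
# Style groups 1 through 7 of the alphabet example, as (start, end) offsets.
GROUPS = [(3, 21), (5, 9), (5, 13), (14, 19), (16, 18), (22, 24), (22, 25)]


def nextstylegroup(m, groups):
    lo, hi = m
    # A longer group starting at the same place as m, if any.
    longer = [g for g in groups if g[0] == lo and g[1] > hi]
    if longer:
        return min(longer, key=lambda g: g[1])
    # Otherwise the shortest group at the next place where a group starts.
    later = [g for g in groups if g[0] > lo]
    if not later:
        return None
    first = min(g[0] for g in later)
    return min((g for g in later if g[0] == first), key=lambda g: g[1])


def enclosingstylegroup(m, groups):
    lo, hi = m
    # Groups covering all of m plus some additional text.
    covers = [g for g in groups
              if g[0] <= lo and hi <= g[1] and g != (lo, hi)]
    if not covers:
        return None
    return min(covers, key=lambda g: g[1] - g[0])


def traverse(groups):
    """All groups in nextstylegroup order, starting from the empty subseq."""
    order, m = [], nextstylegroup((0, 0), groups)
    while m is not None:
        order.append(m)
        m = nextstylegroup(m, groups)
    return order
```

In this model traverse(GROUPS) visits the groups in the order they are numbered, and enclosingstylegroup maps qr at (16, 18) to opqrs at (14, 19).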

Conversion from styled text to Scribe form can illustrate style traversal. The code can utilize the same table, style_table, as in the conversion from Scribe to styled text. If we do not have nested styles, the appropriate loop is

while True do
    m := nextstylegroup(m)
    if m = "" then exit while end if
    s := searchforstyle(style_table, m)
    if s /= "" then
        replace(start(m), s)
        replace(start(next(m)), "}")
    end if
end while

Note that searchforstyle finds in style_table the entire entry from @ through {; this value is exactly the value to be inserted at start(m).

When styles can be nested, the algorithm becomes more complicated because two style groups may start in the same place. In order to get their leading delimiters in the correct order, it is necessary to perform the translation from the outermost enclosing style group inward. This can be done with a recursive function utilizing enclosingstylegroup and modifying the text just before the function exits. The alternative of constructing a new result text does not, in this case, lead to a simpler algorithm.
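
The outermost-inward order can be illustrated with a recursive Python sketch. It is not the Ness algorithm: it builds a new string rather than modifying the text in place, it represents styles as (start, end, style name) spans, and it assumes the spans are properly nested with no partial overlaps. The name to_scribe is hypothetical.

```python
def to_scribe(text, spans):
    """Render text with nested (start, end, name) spans as Scribe markup."""
    # Outermost first: earlier start, and for equal starts the longer span.
    ordered = sorted(spans, key=lambda sp: (sp[0], -sp[1]))

    def render(lo, hi, inner):
        # inner holds the spans lying within [lo, hi), outermost first.
        out, pos, i = [], lo, 0
        while i < len(inner):
            start, end, name = inner[i]
            # Spans that begin before this one ends are nested inside it.
            j = i + 1
            while j < len(inner) and inner[j][0] < end:
                j += 1
            out.append(text[pos:start])
            body = render(start, end, inner[i + 1:j])
            out.append("@" + name + "{" + body + "}")
            pos = end
            i = j
        out.append(text[pos:hi])
        return "".join(out)

    return render(0, len(text), ordered)
```

When two groups start at the same place, as fghi and f...m do in the alphabet example, sorting the longer span first places its opening delimiter outside that of the shorter one.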

It is not difficult to express nextstylesegment in terms of hasstyles; the element next(m) is in the same segment as m if

hasstyles(m, next(m)) and hasstyles(next(m), m)

It is more of a challenge to express nextstylegroup in terms of simpler functions:

subseq function nextstylegroup(subseq m)
    subseq s    -- styles required to continue group
    if m = "" then
        -- find next segment that has some styles
        m := nextstylesegment(m)
        if hasstyles(" ", m) then    -- m has no styles
            m := nextstylesegment(m)
        end if
        s := m
    elif not hasstyles(" ", next(m)) and hasstyles(m, next(m))
            and not (hasstyles(next(m), previous(m))
                and hasstyles(previous(m), next(m))) then
        -- next segment has some of the styles of m,
        -- but is not the same as the style before m: extend m
        s := next(m)
    else
        -- m is not empty and cannot be extended
        -- find next segment after first(m) that increases styles
        m := nextstylesegment(start(m))
        while hasstyles(m, next(m)) do
            m := nextstylesegment(m)
        end while
        m := next(m)
        s := m
    end if

    -- m is first segment of group
    -- extend m with all following text having at least same style as s
    while hasstyles(next(m), s) do
        m := extent(m, next(m))
    end while
    return m
end function

6. Experience

Most Ness applications that have used the style functions have employed addstyles. These have included:

Pretty-printer. The pretty-printer for Ness is, naturally, written in Ness. It italicizes reserved words and emboldens the function names in function declaration headers.

Translator from RTF to Andrew format. Each styled portion of text in RTF is delimited with curly braces and a keyword. The translator looks up the keyword in a table and determines the translation. When the translation is a style, the proper style is taken directly from a character in the table.

American Heritage Dictionary. The source form of the American Heritage Dictionary contains macros of the form <ME>, each of which specifies the format for the succeeding text. As with the RTF translator, each macro is looked up in a table and the correct translation is made; the styles are in the table. It is remarkable that adding bold and italic in the right places makes plain-looking text look exactly like a dictionary.

Preserving subsequences across file transfer. In some cases it is desirable to write out a text that has subsequences marked for some purpose. The only way to do this in Ness is to mark the text with a special style for each such subsequence. These styles are preserved when the file is written and later reread. After rereading, searchforstyle or nextstylegroup can be utilized to reconstitute the marks.

Few Ness programs have yet exploited the style facilities. Those that have, however, have not uncovered any hints as to how to design the system better. Further experiments are clearly called for.

REFERENCES

Hansen, Wilfred J., "Enhancing documents with embedded programs: How Ness extends insets in the Andrew ToolKit," Proceedings of IEEE Computer Society 1990 International Conference on Computer Languages, March, 1990, New Orleans, IEEE Computer Society Press (Los Alamitos, CA), 23-32.

Hansen, Wilfred J., "Subsequence References: First Class Values for Substrings," ACM Transactions on Programming Languages and Systems 14, 4, October 1992.