Information Technology Center
Carnegie-Mellon University
4910 Forbes Avenue
Pittsburgh, PA 15213
With bit-mapped graphics systems, like IBM's RT PC, text can be much
more than a stream of ASCII characters. There can be many variations
as shown in Figure 1: differences in font, face code, size, justification,
indentation, sub- or superscripts, and so on. In this article I want
to describe how the Andrew system stores text and displays it on the
screen. As a focus, I will look at the question of how a character
gets to the screen after being typed. In older hardware this echoing
was done by the display device itself, but in modern workstations considerable
software is involved. My emphasis will be on the data structures.
Figure 1: Varieties of text formatting. Note the variations
in font, face type, size, justification, indentation, and subscripts.
This caption has indented margins, the standard italic font, and is
justified left and right.
Before getting on, however, I need to say just a few words about workstations and Andrew. For this article let me define a workstation as a deskside computer with:
Andrew is the name of the software produced by the Information Technology Center, a joint project of IBM and Carnegie-Mellon University. One aspect of the project is a distributed file system designed so that if you have an account you can sit down at any of five thousand work stations and get your files. You can also give anyone permission to read your files, without the inconvenience of making and transporting a floppy disk. This same communication network is the basis of Andrew's comprehensive electronic mail system.
The Andrew user interface software includes a window manager, a subroutine package for dealing with text, an editor and mail system that utilize the text package, and numerous other facilities. I find the large screen to be a significant advantage in my work. I often look at two or more source files at once to write a third or answer a mail message. In the screen image shown in Figure 2, I am writing this article and checking a source file for the definition of the data structure for documents.
Figure 2: An Andrew Screen Image. In this example I am
writing this article and checking a source file. Top to bottom in
the left column are a system status window, a shell window for giving
commands, and an editor for the styles in this article; the latter
is open to the style used for Algorithms. In the right column are
the editor working on this document and another editor where I am checking
source code.
This article attempts to describe how the Andrew system stores and displays text. However, I have taken liberties with the names of routines and simplified many of the details so they are not the same as the actual implementation. Moreover, an improved Andrew formatting system is being built and it differs in many ways from what is described here. Much of the software described below was originally written by James Gosling who based it on his version of the Emacs editor.
Figure 3: An Andrew document. The length of this document
is 35 characters. Characters 0, 9, 22, and 34 are W, r, w, and question
mark, respectively.
At this point I could tell you how documents are implemented, but I
prefer to let you puzzle over that while I describe how applications
manipulate documents. Conceptually, the program refers to the document
as a stream of characters, with the first numbered as zero. See Figure
3. To have a document, a program first declares a variable to refer
to it:
struct document *doc;
This declares doc to be a pointer to a control block for a document;
one element, length, is the number of characters in the document.
The principle operations on documents are these four routines:
NewDocument(initiallength) - returns a pointer to a control block for a newly created document with capacity for initiallength characters. The document can get bigger than initiallength, so the exact value is not particularly important.
CharAt(doc, position) - returns the character presently at location position in document doc. Will return nonsense if position is negative or as large or larger than the number of characters in the document. If doc has the contents shown in Figure 3, CharAt(doc, 1) returns the value 'h'. (For performance, CharAt is implemented as a macro in C.)
InsertString(doc, position, string, length) - inserts length characters from string into document doc. The insertion is such that the first inserted character will wind up in location position. The call InsertString(doc, 9, "talking ", 8) will convert Figure 3 to discuss a "talking raven".
DeleteChars(doc, position, length) - Deletes length characters from doc, beginning with the character at location position. The call DeleteChars(doc, 22, 8) would result in "Why is a raven like a desk?"
With just these routines we can implement all of the operations on
documents that are usually available in text editors. For example,
when the user types a letter to be inserted in the document, the routine
called is InsertCharacterCommand, as shown in Algorithm 1. Here and
below the Algorithms are written in a subset of C similar to Pascal.
Note that a declaration containing an asterisk before the identifier
declares that variable to be a pointer to the indicated type. As a
special case, if variable str is to refer a string of characters it
is declared as "char *str". Given a pointer one may either extract
an element--as in doc->length to get the length of a document--or subscript
it to get an element of an array of the given type--as in str[i] to
get the (i+1)st character of str. (The first character
is str[0].)
Algorithm 1: Code called when a character is typed. Doc is the document being edited and CaretLocation is the place currently selected for insertion.
For a more extensive example, consider the global replace operation. The user is prompted to provide an old string and a new one, then the editor replaces every instance of the old string in the text with the new one.
We need first the routine in Algorithm 2 to find an instance of a string in a document and return its location. Note that the outer while loop terminates when the length remaining in the document is shorter than the string str. The inner while loop terminates either when it finds the string, when i>=len, or when the ith character of the string does not equal the (pos+i)th character of the document.
Given the routine find we can write Algorithm 3 to do the global replace.
In practice it would be called by the command processor after it has
requested the old and new strings from the user.
Algorithm 2: Finding a string in a document
Algorithm 3: Global replace operation
How then is a document implemented? By brute force. The struct document control block points to a single array of characters large enough to store the text of the document. You can perhaps imagine two problems with this scheme. The first is that each insertion might entail moving the entire rest of the document for each character inserted. This is avoided by leaving a gap in the middle of the text array at the location of the most recent insertion or deletion. If I insert a character in the first paragraph and then move to the last paragraph and make an insertion there, the gap is moved by moving all intervening characters, filling the old gap and leaving a new one. After too many insertions the second problem is reached: the document text exceeds the array size. But some documents will never grow large. How then can the size of the text array be adjusted to accomodate both small stable documents and large growing documents? Again Andrew uses a brute force solution: When an insertion would make the text too large, a new array fifty percent larger is allocated and the existing text is copied to it.
The solutions to both problems potentially require copying large portions of a document, which might be thought to be disconcerting to the user. I can report, however, that after two years of experience with the system I have never noticed a delay for copying the text. After all, a typical document is less than a 100,000 characters and a typical copy loop has about six instructions. At one million instructions per second, the entire copy takes no more than 0.15 seconds, when moving four characters per cycle. Most documents are shorter, so the time to copy the text is insignificant; it takes longer than that to paint a full screen of text.
Instead of the data structure in Figure 4, some editors use an alternative
with a linked list of control blocks, one for each line. It is undeniable
that this structure can be much faster for the operations of insertion
and deletion of characters; there is never a copy anywhere near as
long as 0.15 seconds. But other delays are encountered, especially
in a paging environement. Not only does the data structure take considerably
more space--sometimes twice as much--but the control blocks and text
lines can become scattered over numerous virtual memory pages. When
that happens a single screen repaint may require touching twice as
many pages as there are lines on the screen. If they cannot all fit
in memory, lengthy paging delays occur. With the Andrew data structure
paging is minimized.
The complete Andrew document data structure is shown in Figure 4. With this structure InsertCharacter and DeleteChars are written in terms of two subroutines. GapTo(doc, position) moves the gap so it occurs just before the character at the given position. Then a deletion can be made simply by decreasing the size of the document and increasing the value which says where the text after the gap begins. (The initial part of the text after the gap is thus deleted.) For insertions the routine RoomFor(doc, size) is also used. It ensures that the gap is big enough for an insertion of size characters.
It is a bit of C arcanery, but here is the full declaration of CharAt:
Marker magic occurs because they are updated two ways by InsertString and DeleteChars. These routines adjust marker limits so they always refer to the same part of the text. If the string "talking " is inserted in Figure 5 at position 9, just before the 'r', the position value of marker C is incremented by eight, the length value for marker B is increased by eight, and marker A is unchanged. While adjusting limits a changed flag is set in a marker if the text it refers to is modified. For the insertion of "talking ", the flag is set only for marker B. (The text referred to by C has moved, but not changed.) Once the flag has been set, it remains set until some routine outside the document package turns it off. Usually this is a routine associated with the routine that created the marker in the first place.
Figure 5: Three markers on a document. Marker A (position:0
length:6) refers to the text "Why is", B (position:7 length:12)
refers to "a raven like", and C (position:15 length:14) refers
to "like a writing". Note that the text refered to by markers can overlap.
From the discussion so far we can deduce the fields of a marker control block. Each struct document has a pointer to the list of markers associated with the document, and each marker has a pointer, doc, to its document. The extent of the text referred to is given by position and length. If the referenced text is changed the changed flag is set. And finally we have next and prev fields to connect the markers together in a doubly-linked list.
The routines to update markers are straightforward except for one decision that must be made: If an insertion is made at either end of the text referred to by a marker, is the length field of the marker made bigger? The Andrew answer is that the marker is made longer if the position of the insertion is a character that is referred to by the marker. Thus the length of a marker m will be increased only if m->position <= InsertionPosition < m->position+m->length. Note that this rule will never extend the length of a marker that has length zero.
How then do markers aid in updating the display? The trick is that the display management portion of the editor keeps a marker for each line displayed on the screen. The line is redisplayed on the screen only if the text it refers to has changed. To make this possible, the editor is carefully partitioned between the routines which respond to user inputs and those which update the display.
At the highest level is a main loop which determines whether there is user input and processes it. The loop defers calling the screen update routine until no input is pending. This main loop utilizes a data structure for each portion of the screen. For a document the data structure representing its screen image is the view, a structure that keeps all the information needed to format the document for display. Among the fields of a view are these:
The marker for the ith line is view->Line[i]->m. Curiously, it must refer to text beyond the end of the ith line because insertion of a space in the first word of the (i+1)st line may require the ith line to be redrawn with a short new word at its end. Thus the marker must include all the text on the line and also the first word of the next line.
Given these considerations we can now describe the update routine; it reconciles the screen image with the new contents of the document in three phases. The first determines which text lines have to be redrawn based on the changed flags in the Line array. In this process it calls a subroutine, DetermineSpacing which marches across the line interpreting the formatting information so it can find the height and width of each character and the widths for spaces to perform justification. All this information is preserved in the LineImage structure for the line and the value returned by DetermineSpacing is the position in the document of the first character for the next line. Note that an unchanged line may have to be redrawn if its y coordinate changes or if the previous line ends at a different position in the text. The second phase of the update routine erases the old text from each portion of the display that is to be redrawn. The third phase then plots each line that has been identified as needing to be redrawn. The heart of this phase is a call on SendTextToDisplay which uses the information recorded in the LineImage by DetermineSpacing and actually sends the characters to the screen,
The three phases of the view update process are sketched in Algorithm
4.
Algorithm 4: Updating a view. The work of understanding the style information is hidden within DetermineSpacing and SendTextToDisplay.
/* Phase 1: Find lines that need to change
in this view due to changes in doc.*/
To review, here is how a typed character gets to the screen: the interaction routine updates the document with InsertString, which sets the changed flag in a marker in one of the elements of the Line array in the view. The update routine finds that the marker for the line is changed, erases the existing line, and replots the line--now with the new character. It is part of the miracle of modern workstations that this all happens fast enough to keep up with any typist.
(Incidentally, if Lewis Carroll had been asked the Mad Hatter's riddle, I suspect he might have answered that the desk wrote "Nevermore".)
________________________
Wilfred J. Hansen is a System Designer at the Information Technology
Center, Carnegie-Mellon University. His major work there has been
with the Edittext text editor described here and its successor. His
PhD thesis project, with Stanford University, was the first hierarchical
syntax-driven editor and he has co-authored two texts with E. M. Reingold:
Data Structures, 1981, and Data Structures in Pascal, 1986,
Little Brown and Co., Boston.