Introducing Structured Data Types into Internet-scale
Information Systems
Introduction and Overview
Managing the increasingly large volume of information on computer
networks is rapidly becoming an important problem in computing.
The Internet, the largest wide-area computer network, is growing 
exponentially in terms of hosts, users, and traffic.  The NSFNet
backbone carried 14 Terabytes of data in March 1994; about half
of that was due to information services, such as FTP, Gopher, and WWW.
Clearly, both a large supply of information and a large demand for it exist.
  The form in which information is disseminated on the Internet, however,
leaves much to be desired.  Most information has some sort of semantic
structure to it.  It could be a text broken up into chapters and paragraphs,
a bus schedule showing routes and times, a city map displaying streets
and elevations, or a complex medical database.  But while Internet
information systems may be able to transmit the data involved with
these pieces of information, they give little assistance in telling
how the data is structured.
  The semantic structure of information makes a large body of information
much more manageable.  Knowing the meaning of a type of 
information helps one extract, derive, compile, and condense useful
information from a larger set of raw data.  It helps in searching
for relevant information, and in intelligently filtering out information
irrelevant to a query.  In these tasks, it is not enough to simply
know that a piece of information is composed of several components; ideally,
one wants to be able to know the meaning of the components, and what one
can do with the parts.  A search of card catalog entries, for instance,
may need to know how to extract the author of an entry, and compare
the author's name against a search term.
  In the Internet, there is little support for semantically structured
information.  A particular application, such as a library catalog, may
define a certain format for its book entries, which may be semantically
rich, but only meaningful to programs specifically written to understand
that format.  A client program written to read University A's card
catalog may be able to make no sense of University B's card catalog,
even though both are available on the Internet.
  In contrast, applications that want to share their information widely are
generally forced to use a lowest common denominator approach.  The most
common such denominator is plain unstructured text.  Frequently used
applications may, over a long period of time, settle on higher-level common
denominators, such as RFC-822 mail messages, or GIF image formats, or
documents formatted with TeX or HTML.  But these higher-level formats
still lack much of the semantic structure many applications need; and the
process of finding a usable common standard even for these formats can
take years.  (Then, in a few years more, these formats are often replaced
by other, incompatible formats.)  The rate at which new data types
can be introduced and used in an Internet context is far too slow, and
cannot be made much faster with current standards procedures.
  How can information be provided on the Internet at a higher semantic
level, while remaining usable by a large number of information clients?
Two observations are relevant here:
  -  The concept of abstract data types, or of "objects",
    provides a solution to many of the complexities of data formats
    and operations.  Abstract data types provide a well-defined
    interface of operations and attributes, so that a client can
    use a complex datum without having to know how it is formatted,
    or how operations are implemented.  Indeed, a number of systems,
    such as CORBA [OMG92], already attempt to implement an object-oriented
    system distributed over the network.  None have yet, however,
    been able to cope with the scale and heterogeneity of the Internet.
    This is in part because they are designed for general-purpose
    computing, which includes both reads and writes.  They therefore
    have to worry about issues like consistency of data updates,
    fault tolerance, and a fairly uniform semantic model for
    references and meta-data.  These problems are much less
    relevant (and sometimes impossible to solve) in a system designed
    for disseminating information widely, rather than mutating it.
  
  -  A very large body of knowledge and computing power is already
    available in the information agents (clients, servers, and mediators)
    that exist on the Internet.  At present, most agents dealing
    with information are set up in a few standard ways; most commonly,
    a client operated by a user will contact a server maintaining a
    database, and fetch a datum from the database directly.  In occasional
    variations, a "server" may act as a gateway to a database another
    server maintains; or a fixed data type conversion program may be
    run off-line by a client or a server.   These types of interactions
    are useful but limited.  Human "agents" commonly use richer techniques
    to discover information: they collaborate with "experts" in a particular
    domain in order to find relevant initial information in a domain, and
    for assistance in gathering and understanding that information.
    Similar techniques for computerized agents could be quite useful as
    well, in particular "mediators", third-party experts suggested by
    Wiederhold in [Wie92].
I propose to make an explicit object-based level of abstract data usable
in Internet information systems.  Widespread use of such abstract data
requires that new types be definable anywhere on the network, and not
simply by some central standards authority.  Furthermore, in order for these
types to be used, information about these new types, and operations on those
types, must be available to other agents which request it.  This requires
not only support specifically for abstract types, but also a well-defined
interface for agents to talk to each other about types and operations; and
some standard method to provide information about types, their
operations, and their relations.
I claim that these requirements can be satisfied with a two-level
software architecture.  The upper level focuses on the data being shared,
and the abstract operations being carried out on it.  At this level,
methods are invoked; object references are resolved; new data types
and operations are defined.  (See figure 1b.)  The lower level focuses on the agents
supporting these operations.  Here, agents ask other agents for
data objects or references, carry out abstract data operations on
behalf of other agents, and encode and decode concrete representations
of abstract objects so that they can be passed through the network.
(See figure 1a.)
This level abstractly describes what is already carried out (in
a domain-specific manner) by the protocols of many existing Internet agents,
such as HTTP [BL93] servers, Domain Name Service [Moc87] resolvers,
or WAIS [Kah91] indexes.
(Information from these existing systems can also be incorporated into
the higher-level information system through the use of "wrapper" or
"gateway" agents, which provide explicit abstract types for the implicit
data abstractions these systems support.)

Figure 1.  The two levels of abstraction in an information system.
To bridge these two levels of abstraction, the agents need to know
about the types of objects they are manipulating.  For this purpose,
I propose a special mediator agent that can give information
about types of information in the network.  A client or a server can
contact this agent (which I call a type oracle) to find an agent
to carry out a defined operation
on a data type, or to find out how information of one type or encoding can
be converted into another type or encoding.  Someone who wishes to
define a new data type or encoding can register it (and its operations)
with a type oracle, which can then share this information with other
agents, including other type oracles.  Oracles can also use their
knowledge of the lattice of types and encodings to derive new transformations
not provided by any single agent (such as a conversion from type A to type C
that uses a converter from A to B followed by a converter from B to C).
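The derivation of composite conversions described above can be sketched
as a shortest-path search over the converters an oracle knows about.
The Python fragment below is only an illustration under assumed names:
the function find_conversion_path, the dictionary layout, and the agent
names are hypothetical, and are not part of any existing oracle protocol.

```python
from collections import deque


def find_conversion_path(converters, source, target):
    """Return the shortest list of (from_type, to_type, agent) steps
    leading from source to target, or None if no chain exists.

    converters maps (from_type, to_type) pairs to the (hypothetical)
    agent that offers that single conversion.
    """
    # Build an adjacency list: type -> [(reachable type, agent), ...]
    graph = {}
    for (src, dst), agent in converters.items():
        graph.setdefault(src, []).append((dst, agent))

    # Breadth-first search finds a chain with the fewest conversions.
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        current, path = queue.popleft()
        if current == target:
            return path
        for nxt, agent in graph.get(current, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(current, nxt, agent)]))
    return None


# Illustrative registrations only: no single agent converts A to C.
converters = {
    ("A", "B"): "agent-1.example",
    ("B", "C"): "agent-2.example",
}
path = find_conversion_path(converters, "A", "C")
```

Here path holds the two-step chain through B.  A real oracle would
presumably also weigh the cost and fidelity of each step, which this
sketch ignores.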
A few questions arise at this point:  Can a coherent information system be
built to this design?  Will the design really give widely-distributed
information systems more semantic power?  Will it be useful for real
applications, or will it introduce too much
overhead (either in response time to queries, or in the amount of work
a client or provider is expected to do) to be feasible?  Will it be able
to interoperate with existing information systems?
I propose the following course of action to answer these questions:
  -   First, I will analyze existing information systems that are
     already in use over the Internet, such as Gopher, World Wide Web,
     and the Domain Name System.  The objective of the analysis will be
     to show the common features of these systems, show how their data
     and agent abstractions can be interpreted (at least implicitly)
     in terms of the architectural model given above, explain why
     they have gained frequent use in the large-scale, heterogeneous
     environment of the Internet, and discuss problems and limitations 
     of these systems.  Some analogs in distributed systems and
     object-oriented databases will be considered as well.
     
  
  -   Second, I will make a detailed design of an information system
     based on the architecture I proposed, and build a prototype 
     implementation.  This will include a number of information
     agents using a common toolkit; a type oracle; a collection of
     data types and encodings supported by the agents and the oracle;
     and protocols to allow the agents to work together and operate
     on the types.  This will demonstrate that the design is feasible,
     and that it can handle a reasonable-sized repertoire of common types.
  
  -   Third, I will test the implementation.  This will involve
     one or more case studies, where I choose a particular information
     gathering problem (such as a distributed library research problem,
     or a software engineering information search) and show that
     my system makes it significantly easier than existing systems do
     to build agents that help clients find useful information.
     It will also involve observation of a less controlled test: releasing
     the design and implementation to users on the Internet.  This will
     allow me to see if disinterested users find value added in my 
     approach, and also find where difficulties arise in practice with
     the system.
Recap: Key Concepts
The key concepts of the thesis, then, are these: 
- An information system architecture using typed, replicating objects
to model information, with an underlying agent communication protocol.
- Use of mediator agents ("type oracles") to maintain information
about an ever-growing lattice of types, and to assist agents that
want to use these types.
- Encapsulating existing data on the Internet with structured types
and encodings, allowing it to be used in higher-level architectures.
Of these, the type oracle should be the primary contribution of the thesis.
In the remainder of this proposal, I describe in more detail
the rationale for my research.  I will explain my work's relation to the
current state of the practice in Internet information gathering,
and to distributed computing concepts.
I will outline the basic abstractions of my architecture, and
explain the problems they address and some of the problems involved
in using them.  I will describe some relevant related work by others.
I will describe my plan of research, tell how long I expect my activities
to take, and describe what contributions I expect these activities to make.
Internet Information Systems: Uses and Problems
As noted in the introduction, the Internet is rapidly becoming
a widely-used medium for exchanging information.  Many
applications proposed for networked information systems imply a
rich structure to this information.
For example, a medical researcher may want to examine blood pressure
readings from a clinical sampling and correlate them to heart attack
occurrences, using the structure of patient medical histories.  A
scientist may want to find books in several libraries about plate tectonics,
using catalog entries and search indexes.  A software engineer may wish
to find and examine C++ modules for processing SQL queries, using
the structure of program archives and descriptions.
In an ideal world, such tasks would be simple to carry out effectively.
But they remain difficult or infeasible in today's Internet,
due in part to limitations of the net's current model of information
space.  Among these limitations:
- The conceptual structure of information space is hazy:
Experienced users often have
trouble not only with finding information they are interested in, but even
with finding out what information exists on a subject they
are interested in.  The software engineer in the previous
example, for instance, may not know where to search for C++ modules
in the first place, let alone ones that have anything to do with SQL.
Various indexing schemes have been proposed to make a better
conceptual map of cyberspace, but there is no clear consensus yet
on what kinds of indexes to use.  With no common formats,
common semantic bases for indexing, or general mechanisms for
relating one indexing scheme to another, indexing
schemes will remain primitive and incomplete.
- The structure and encoding of information objects is
often inappropriate for applications:
A large corpus of information, even an explicitly structured one,
may still be useless to someone who lacks the knowledge
or computing power to sift through the information to find relevant
facts or derive or synthesize needed knowledge.  Theoretically, one
form of information may have information content equal to (or even
greater than) another, but still be much less useful in a practical
sense.  If, for instance, the only interface to medical records returns
plain text in various formats, it can be prohibitively difficult
to extract appropriate information about blood pressure and heart attacks.
Current practices encourage information providers to provide information
either in a lowest-common-denominator form, or in a form specifically
tuned for a single application.  Both of these inhibit useful information
sharing.
- Maintaining useful information sources is difficult:
It is relatively easy in many cases to put some information on-line
and offer it to the world.  It is much more difficult to keep the
data current and the format relevant.  Part of this problem is related
to the previous one: it is extremely difficult to define new formats
and types of data without having client applications explicitly reprogrammed,
or maintaining a number of gateways or alternate repositories for
different formats understood by clients.  Mediators can conceivably be used
to update data automatically and provide gateway services, but they
require well-understood interfaces to work in a general context.
Computation models: The need for abstract types:  A number
of the problems above can be solved in part by better computation models
for Internet information systems, in particular models based on abstract types.  Some
benefits of abstract type systems:
- They provide an appropriate level of abstraction for data manipulation.
  Client programs can be written
  in data-driven terms like 'search this index using these attributes'
  or 'retrieve the object referred to by this attribute', without
  needing to know the full details of the data implementation.
- They provide a useful model for taking advantage of the expertise
  of a network of agents.  In today's information systems,
  the burden of computation and type decoding falls entirely on
  the client, or on the server providing data.
  But in models using abstract data objects,
  operations manipulating information are associated with the data types,
  rather than any particular agent.  Knowledge
  about how to operate on the types can be delegated to sites that
  define the type, or have been told the type definition.
 
- They provide a vocabulary for information about
  new data types and formats that is independent of representation or
  implementation concerns.  Thus, information systems do not have to settle
  for a lowest-common-denominator approach for information exchange, nor
  do they need to settle for a fixed repertoire of types and operations.
  The structure and semantics of different search indexes, for instance,
  can be described and related via different abstract types.
 
In the next section, I will look briefly at two communities working
towards usable wide-area data types: the distributed computing community
and the community of developers of existing Internet information systems.
By examining the strengths and weaknesses of their approaches, I will
lay the groundwork for an architecture combining features from both
communities.
Distributed computing perspectives
The distributed computing community has already proposed or implemented
a number of systems for distributed objects.  If abstract
data types are so useful for distributed information systems, then,
why hasn't one of these object systems taken over cyberspace?  While
immaturity of these systems may be one possible reason, another
important reason is that the applications these systems are designed
for are different in important ways from information dissemination
applications.
Why existing distributed computing models aren't sufficient:
  Distributed computing researchers have long been aware that
 computing over multiple machines introduces many new problems
 not present in a single address space:
 - Data-related problems.  An arbitrary distributed
 process may need to have strong guarantees about the consistency of the
 data it manipulates.  But in a large-scale heterogeneous system, it can
 be very difficult to keep data consistent without locking up
 arbitrary servers for long periods, which is unacceptable
 in a wide-area information system.
 
 - Operation-related problems.  In a non-distributed
 application, a request for an operation can be made with a simple
 procedure call.
 In a small-scale distributed application, the operation may involve a remote
 procedure call, with some conventions for encoding parameters, carrying out
 the operation, and returning results.  In a large-scale
 distributed world, not all agents are known, and communication channels
 and agents, and their semantics, are out of the control of any one
 person or project.  So even more complications arise.
 It may now be relevant,
   for instance, for a server to know who invokes an operation,
   or for a client to know the cost of an operation.
   New modes of failure and recovery strategies may be called for (since
   permission to carry out an operation may be denied, a remote
   server may be unavailable, or the return type may be unexpected).
   The role of meta-data to evaluate the results of an operation
   becomes more critical.
   These are symptoms of
   the greater level of heterogeneity introduced by scaling up a
   distributed application to the Internet world.  Languages, operating
   systems, data types, and ways of organizing data and software vary
   (both over different servers and over time)
   more widely than most distributed systems are designed to handle.
How existing Internet infosystems are different
Fortunately, because their application domain is limited, Internet
information applications do not have to solve all the problems inherent
in distributed computing.  In particular,
the information delivery task can be simplified by the following
domain assumptions:
  - The predominant flow of information is one-way.  Information is
originally provided by certain sources acting in server roles,
and then retrieved, transformed, and used by other agents acting
in client roles.
Read access to information is widely available, but write
access is not available, or severely limited.  (Clients can transform
the values they receive, but cannot mutate the source data themselves.
Some information systems allow clients to send back requests to
change or add to the data a server provides, but these changes, if
made at all, are done locally by the server, outside the scope of
the information retrieval application.)
This assumption avoids many of the complications of general-purpose wide-area
database systems.
  - Also, in many Internet applications, changes in
information do not have to be propagated immediately.
Conventional database systems take great pains to make sure
that query responses use the latest available version of a set
of data, and that the set of data given is internally consistent.
In many wide-area information applications, these kinds of guarantees
are either prohibitively costly, or flatly impossible.  Fortunately,
many applications do not need these sorts of guarantees; or can
make do with simply knowing roughly the consistency or currency
of information.  And in many systems, mutations of information occur
significantly less frequently than accesses to information.  (And 
with some types of meta-information, such as information about
types and resources, information tends to accumulate but not mutate.)
Relaxing currency and consistency requirements gives
third-party agents a useful role in an Internet information system.
An agent can provide information originally supplied by another agent
without necessarily having to verify that the original agent's information
has not changed.  It can synthesize information based on data from several
agents.   It can derive or transform the information for a client in ways
the original server might not be able or willing to do.
How existing Internet infosystems handle datatypes.
Many Internet information systems have found it useful to define
their own semantic types.  Gopher, for instance, uses menus and
bookmarks to let users navigate.  The World Wide Web (WWW) [BL+92] uses
simple structured hypertext documents to navigate through the system,
and defines a data type (HTML) for these documents.
While these types are more useful than the simple ASCII text used
to encode them, users of these systems soon want more structured types.
For example, a number of WWW sites have "What's New" pages in HTML, which
invariably consist of a list of dates, resource descriptions,
and links to the resources, in reverse date order (and sometimes
spread out over several documents).  This format convention reflects
a new 'abstract type' to the human client.  But this type cannot be
easily used by programs (though it might be convenient for some of them)
because the information system provides no way to describe the new type
in a well-defined way.  A standards body might incorporate it into
a later revision of the information system, but if this occurs at all,
it will take a very long time.
An example.
Even with a relatively small, simple set of types, agents may have
difficulty exchanging information, as shown in the following example.
Suppose that a client program on a Macintosh has a reference to an image
it wishes to display.
Retrieval of the image is simple enough in many Internet infosystems:
The client examines the reference to see what server it should contact,
talks to the server with the appropriate protocol, and gets the image
shipped to it for display.  The World Wide Web, Gopher, and even
anonymous FTP are all capable of doing this.
But can the client do anything with the information it retrieves?
Suppose that the image is stored on a Unix-based server at a remote
university.  The image is saved there in X bitmap format (xbm),
and has been compressed with GNU zip (gzip) to make it easier to store,
and quicker to ship.  This format and encoding makes sense for
the Unix environment where the picture is stored, but may not be useful
to the client.  The Mac client, for instance, may know how
to display GIF images, but not know anything about
XBM images (a similar type, but with a different color model and encoding).
And the GNU uncompressor may not be available on the Macintosh.
The conflict in data types must be resolved if the
two agents are going to interact meaningfully.  First of all, at
least one agent must realize the nature of the conflict.
(A naive client program might
blithely assume everything is going well, and display the unknown-format
image as gibberish-- or worse, crash when it tries to display the
image.)  If the client can tell what kind of information
the server is sending it, it can detect a problem, and possibly
convert the data to a form it can operate on.  Or, the client may
tell the server up front what data formats it can deal with, and the
server can convert the data appropriately.
Existing systems have these capabilities, but only to a limited extent.
When Gopher and WWW servers ship data, they also send meta-data
identifying the type of the data they ship.  The World Wide Web's HTTP
servers also allow a client to send a list of types it will accept.  The
vocabulary of types one can talk about is limited; in Gopher's case, to
a set of single-character codes set by the Gopher developers; and in
the Web's case, to the MIME type set.  MIME's type repertoire
(described in [BF92]) is richer
than Gopher's, and allows people to use their own 'experimental' type
names outside the standard type repertoire, but all parties in a transaction
must have a common understanding of the experimental types used.  Also,
MIME's encoding repertoire is small and fixed, so that 'GIF' and
'compressed GIF' need to be expressed as two different types in the
MIME system.  (Web developers stretched the MIME convention to add new
encoding types, so as to avoid the combinatorial type-expansion problem
arising with different data types having different compressions.  But
the problem resurfaces with two or more levels of encoding, which is not
uncommon.)
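One way to avoid the combinatorial expansion noted above is to describe
a datum as a base type plus an ordered list of encodings, rather than
coining a new type name for every combination.  The sketch below is a
hypothetical illustration, not the MIME or HTTP mechanism: the names
describe, decode, and DECODERS, and the type label "xbm", are assumptions
made for the example.  It relies only on Python's standard gzip and
base64 modules.

```python
import base64
import gzip

# Decoders this (hypothetical) client happens to know, keyed by name.
DECODERS = {"gzip": gzip.decompress, "base64": base64.b64decode}


def describe(base_type, encodings):
    """Meta-data for a datum: its type, and the encodings applied to it,
    listed in the order in which they were applied."""
    return {"type": base_type, "encodings": list(encodings)}


def decode(payload, description, decoders=DECODERS):
    """Undo the listed encodings, outermost (last applied) first.
    Raises LookupError when an encoding is unknown to this agent,
    which is the point at which a third party could be consulted."""
    for name in reversed(description["encodings"]):
        if name not in decoders:
            raise LookupError(f"no decoder for encoding {name!r}")
        payload = decoders[name](payload)
    return payload


# Two levels of encoding: compress first, then make the result text-safe.
raw = b"#define img_width 16\n"
wire = base64.b64encode(gzip.compress(raw))
desc = describe("xbm", ["gzip", "base64"])
restored = decode(wire, desc)
```

With this description, n types and m encodings need n + m names rather
than one name per combination, and a further level of encoding adds only
one more entry to the list.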
Why third party agents are useful.
But there is a more fundamental problem to these systems than
limited vocabularies for types and encodings.  Even if the client
knows the kind of data it gets, and the server knows what kind of
data the client wants, one of the agents has to know how to adapt
to the other.  In the image-fetching example,
one of the parties has to know how to convert the data from
the server's format to the client's format.
If they don't have enough knowledge among themselves to do the conversion,
the agents are stuck-- even though an agent somewhere else in the network
may be able to supply this missing knowledge, or do the required conversion.
Third parties can be useful not only for type conversion, but also
for abstract operations on types.  For instance, if for some reason
the client wanted the image not for display, but simply to get
some information from it (such as its dimensions, or a string corresponding
to text characters embedded in the image), a mediator
could be enlisted to carry out the operation and return the results to
the client.  Conversion might not be necessary at all.
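As a small illustration of such an operation, a mediator that understands
the X bitmap format could answer a 'dimensions' request without converting
the image at all.  An XBM file is a fragment of C source in which the
width and height appear as #define lines, so the sketch below simply
searches for them.  The function name xbm_dimensions and the sample image
are assumptions made for this example.

```python
import re


def xbm_dimensions(xbm_text):
    """Return (width, height) read from the #define lines of an XBM image.
    Raises ValueError if either definition is missing."""
    width = re.search(r"#define\s+\w*width\s+(\d+)", xbm_text)
    height = re.search(r"#define\s+\w*height\s+(\d+)", xbm_text)
    if not (width and height):
        raise ValueError("not a recognizable XBM image")
    return int(width.group(1)), int(height.group(1))


# A tiny, made-up XBM image used only to exercise the function.
sample = (
    "#define img_width 16\n"
    "#define img_height 8\n"
    "static char img_bits[] = { 0x00 };\n"
)
size = xbm_dimensions(sample)
```

The client receives only the pair of numbers, and never needs to know how
XBM data is laid out.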
Basis of a more powerful architecture
To sum up, then, there are two basic requirements for an Internet-wide
system to handle structured types that are not adequately addressed
in existing information systems.  These are:
  1. Ways to define and describe inter-agent operation at a higher level
     than simple client-server interaction with a fixed protocol.  One should
     be able to publicize the services an agent provides
     for data and operations, and the agents should be able to negotiate
     with clients to carry out
     these services.  There needs to be a way of discovering particular
     agents for a needed task.

  2. Ways of talking about the data types, encodings, and associated
     operations that these agents handle.  In a large, distributed
     internet, new types, encodings and operations will be
     introduced all the time.
But this is not an unfamiliar problem.
The universe of information objects on the Internet is large
enough that the futility of central administration of the objects is
obvious.  Instead, infosystems designers have come up with decentralized
ways to distribute and refer to the objects.  The
solutions (such as the Web's URL scheme, which specifies the access
method of a particular Net object) are not perfect, but do currently
provide a workable way to find objects in many cases.
Likewise, with rich enough conventions for talking about types, new
abstract types and operations can be brought into the Internet and used,
without having to wait for some centralized standards body to act.
Statement of thesis.  These two architectural concepts
are related very closely.  A well-defined system for talking
about types and operations provides a rich framework in which agents
can interact.  And an agent that is an expert in cataloging and handing out
type information allows new operations (and types) to be defined
in a distributed manner.
This agent, the type oracle, has a protocol that
allows agents to request services and discover information
about new abstract data types.  (It can, for instance, identify new types
in relation to known types, and can find other agents to carry out
needed operations or conversions on unknown types.)
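To make the oracle's role concrete, the following is a minimal sketch of
the bookkeeping such an agent might keep.  It is not the protocol itself:
the class TypeOracle, its method names, and the agent and type names are
all hypothetical.  The sketch also assumes that an agent able to carry out
an operation on a supertype can be offered for objects of a subtype, since
subtype objects can stand in for supertype objects.

```python
class TypeOracle:
    """Toy registry relating types, supertypes, and operation providers."""

    def __init__(self):
        self.supertypes = {}  # type name -> list of supertype names
        self.providers = {}   # (type name, operation) -> list of agent names

    def register_type(self, name, supertypes=()):
        """Record a new type in relation to already known types."""
        self.supertypes[name] = list(supertypes)

    def register_agent(self, type_name, operation, agent):
        """Record that an agent offers an operation on a type."""
        self.providers.setdefault((type_name, operation), []).append(agent)

    def find_agents(self, type_name, operation):
        """Return agents for the operation on the type itself or, failing
        that, on its nearest supertypes (searched breadth first)."""
        pending, seen = [type_name], set()
        while pending:
            current = pending.pop(0)
            if current in seen:
                continue
            seen.add(current)
            agents = self.providers.get((current, operation))
            if agents:
                return list(agents)
            pending.extend(self.supertypes.get(current, []))
        return []


oracle = TypeOracle()
oracle.register_type("image")
oracle.register_type("xbm-image", supertypes=["image"])
oracle.register_agent("image", "dimensions", "mediator.example")
found = oracle.find_agents("xbm-image", "dimensions")
```

A client that has never heard of "xbm-image" can thus still learn which
agent to ask, because the oracle relates the new type to one it knows.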
My thesis, then, is this:
A model of distributed information systems allowing
individual information providers to define and share their own
abstract structured types is feasible, and will make Internet information
systems significantly more powerful.
  A distributed network of type oracles,
combined with a flexible naming, subtyping, encoding, and inheritance scheme,
will allow these structured types to be introduced and used by
a variety of information agents.
In the sections to follow, I will describe how I plan to investigate
and test this statement.  First, I describe some of the details of
a design which supports this model, to demonstrate how such a
system could be designed and built.  Then, I will describe how the
system relates to other work in similar fields.  Finally, I will discuss
the specific activities I will undertake to complete the thesis, and
the contributions that I expect to result.
The design of an Internet information object system
In this section, I will discuss the key abstractions 
in a design for such an Internet information object system,
explain how and why they would be used, and discuss how they
should be implemented in a workable system.
The major abstractions discussed here are
objects, agents (and their computation model),
and type oracles.  References and meta-data will also be addressed. 
Since one goal of my system is to interoperate (at least to some extent) with
existing information systems, I will also discuss, where appropriate, how some of
these abstractions relate to those systems.
Objects: Abstract types, encodings, and operations
What objects are.
The system I propose represents information in the Internet as
objects, which are instances of
abstract types.
Each type is identified by one or more well-defined names.
Objects are used through operations or methods, whose
names and signatures are available in the type declaration, as with
many object-oriented languages.
Objects may also have a set of attributes, which may be
retrieved, or sometimes set, via an operation.  Types may also have
expected semantics; for instance, one type's "angle" attribute
may be expected to always be a number between 0 and 360.
Objects are used by invoking operations or reading attributes in
the manner of a procedure-call (or remote-procedure call).
A type may have one or more supertypes.
Objects of one type support the operations of the type's supertypes,
and can 'stand in' for the objects of the supertype if necessary.
Inheritance is not implied by subtyping, though, for reasons to be shown later.
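As a rough illustration of the type model just described, the following Python sketch records type declarations and checks substitutability. It is only a sketch under stated assumptions: the `TypeDecl` structure, the helper functions, and the "window-title" operation are hypothetical names invented for this example, not part of the proposed design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TypeDecl:
    """A hypothetical declaration of an abstract type."""
    name: str
    supertypes: tuple = ()                 # names of direct supertypes
    operations: frozenset = frozenset()    # operation names declared by this type


def all_supertypes(name, decls):
    """Return the names of every type reachable through supertype links."""
    seen = set()
    stack = [name]
    while stack:
        current = stack.pop()
        for parent in decls[current].supertypes:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def supported_operations(name, decls):
    """Operations an object must support: its own type's plus its supertypes'."""
    ops = set(decls[name].operations)
    for parent in all_supertypes(name, decls):
        ops |= decls[parent].operations
    return ops


def can_stand_in(subtype, supertype, decls):
    """True if an object of `subtype` may be used where `supertype` is expected."""
    return subtype == supertype or supertype in all_supertypes(subtype, decls)


# Example lattice; "window-title" is an invented operation for illustration.
DECLS = {
    "Image": TypeDecl("Image", operations=frozenset({"display"})),
    "X-Window-Dump": TypeDecl(
        "X-Window-Dump",
        supertypes=("Image",),
        operations=frozenset({"window-title"}),
    ),
}
```

Note that the sketch says nothing about how an operation is implemented; it records only which operations a type promises, which is consistent with subtyping not implying inheritance.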
Objects in a heterogeneous wide-area infosystem.
So far the object model should look quite similar to traditional
programming conventions.  There are two important additional 
aspects of the model, though.  One is that objects can have
meta-data associated
with them, showing their origin, type, or other run-time 
information.  This sort of information is usually handled transparently by
the environment in traditional programming languages, but is made
more explicit here, for reasons we will see later.
Another important aspect of objects in this system is that they may have
encodings.  Encodings are used to transmit object instances from
one agent to another.
They may also be used in the implementation of an object operation.
Encodings are similar to the representations of object-oriented
programming languages, but they are not opaque: agents with a copy
of the object can work directly with its encoding, if they know how.
An encoding specification
includes a lower-level type used to represent the object,
and a named scheme used for translating to and from this type.
For example, an HTML document may be encoded as a sequence of characters,
using its standard SGML representation as its encoding scheme.  An encoding
itself may be encoded, since it too is an instance of an abstract type.
A given object
type may have several encodings associated with it, and subtype
encodings need not have anything to do with supertype encodings.  (This
is one reason why subtyping does not imply inheritance.)
An object will eventually be encoded in a 'primitive' type, which
could be as simple as a sequence of bytes.  (At some level, all
Internet information gets transmitted in this form, though agents
might treat higher-level types as primitives as well.)
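The layering of encodings can be sketched as a chain that is followed until a primitive type is reached. This is a minimal illustration, assuming a hypothetical `Encoding` record and a hand-built table; the type names and the choice of "ISO-8859-1" as a character encoding are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Encoding:
    """An encoding specification: a named scheme plus the lower-level type it yields."""
    scheme: str        # named translation scheme, e.g. "SGML"
    lower_type: str    # lower-level type used to represent the object


# Assumed primitive type for this sketch.
PRIMITIVES = {"byte-sequence"}

# Hypothetical table giving, for one particular object, the encoding used at each level.
CHOSEN = {
    "HTML-document": Encoding("SGML", "character-sequence"),
    "character-sequence": Encoding("ISO-8859-1", "byte-sequence"),
}


def encoding_chain(type_name, chosen, primitives):
    """Follow encodings from an abstract type down to a primitive type.

    Returns the list of scheme names (outermost first) and the primitive reached.
    """
    schemes = []
    seen = set()
    current = type_name
    while current not in primitives:
        if current in seen:
            raise ValueError("encoding cycle at " + current)
        seen.add(current)
        enc = chosen[current]
        schemes.append(enc.scheme)
        current = enc.lower_type
    return schemes, current
```

An agent that treats a higher-level type as primitive would simply pass a larger `primitives` set and stop earlier in the chain.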
Objects in existing infosystems.  If one considers
a byte stream to be a simple object (with operations like 'next-byte'),
all Internet information systems can be modeled with objects, but this
model is degenerate and uninteresting.  But the formats of the data
types used in information systems can be treated as encodings of
abstract types.  They can thus be incorporated into an object-based
information system via an agent that provides an object wrapper around
the encodings.  (Rufus [Sho+93] essentially
does this for its "semi-structured files".)
An HTML document in the World Wide Web, for instance,
could be viewed as an encoding of a "Web-hypertext" object, with methods
like "follow the first link" or "fetch the title".
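A wrapper of this kind might look roughly like the following sketch, which uses Python's standard `html.parser` module to read the encoding. The class name `WebHypertext` and its method names are hypothetical. To stay self-contained, the "first link" operation only returns the link target rather than fetching it.

```python
from html.parser import HTMLParser


class _Scanner(HTMLParser):
    """Collects the title text and the link targets found in an HTML encoding."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


class WebHypertext:
    """Hypothetical object wrapper around the HTML encoding of a document."""

    def __init__(self, html_text):
        scanner = _Scanner()
        scanner.feed(html_text)
        scanner.close()
        self._title = "".join(scanner.title_parts).strip()
        self._links = scanner.links

    def fetch_title(self):
        """Return the document title (empty string if none was found)."""
        return self._title

    def first_link(self):
        """Return the target of the first link, or None if there is none."""
        return self._links[0] if self._links else None


DOC = WebHypertext(
    "<html><head><title>Bus schedule</title></head>"
    "<body><a href='routes.html'>Routes</a> <a href='times.html'>Times</a></body></html>"
)
```

A client using `DOC` never touches the markup itself; an agent that does understand HTML could still work with the encoding directly, as the model allows.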
Information agents
Information agents are programs that operate on information objects.
They can talk to other agents in the Internet, operating
in a client or a server role (and sometimes in both roles).
In my design, agents know of a certain set of types, as well as a
set of definitions of operations on these types.  These definitions
might include one or more of the following:
-  Code for the operation (for at least some encodings)
-  Reference to another agent that can do the operation
-  Knowledge of the operation's existence and implementation.
   (A type oracle is then queried to find an agent that
   can carry out the operation)
A particular agent might only implement an operation for
certain encodings of a type.  The same code may be used
by several implementations.  This allows for a certain degree of code
inheritance, if desired.
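The three kinds of operation definition listed above suggest a simple dispatch order: local code first, then a known referral, then a query to a type oracle. The sketch below shows one way this could work. The `Agent` and `SimpleOracle` classes, their fields, and the "word-count" operation are hypothetical, and the sketch makes no attempt to detect referral loops.

```python
class SimpleOracle:
    """Stand-in for a type oracle: maps (type, operation) to an expert agent."""

    def __init__(self):
        self.experts = {}

    def find_agent(self, type_name, operation):
        return self.experts.get((type_name, operation))


class Agent:
    """Hypothetical agent holding operation definitions of the three kinds."""

    def __init__(self, name, oracle=None):
        self.name = name
        self.oracle = oracle
        self.code = {}        # (type, operation, encoding) -> callable
        self.referrals = {}   # (type, operation) -> another Agent
        self.known = set()    # (type, operation) pairs known to exist only

    def perform(self, type_name, operation, encoding, data):
        # 1. Local code, which may exist only for certain encodings.
        impl = self.code.get((type_name, operation, encoding))
        if impl is not None:
            return impl(data)
        # 2. A direct reference to another agent.
        other = self.referrals.get((type_name, operation))
        # 3. Otherwise ask the oracle, if the operation is at least known.
        if other is None and (type_name, operation) in self.known:
            if self.oracle is not None:
                other = self.oracle.find_agent(type_name, operation)
        if other is None:
            raise LookupError("no way to perform " + operation + " on " + type_name)
        return other.perform(type_name, operation, encoding, data)


# Example wiring: the client only knows that the operation exists.
oracle = SimpleOracle()
expert = Agent("expert")
expert.code[("Text", "word-count", "ascii")] = lambda data: len(data.split())
oracle.experts[("Text", "word-count")] = expert
client = Agent("client", oracle)
client.known.add(("Text", "word-count"))
```

Because implementations are keyed by encoding, the same callable can be registered under several keys, which corresponds to the limited code sharing mentioned above.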
Agents have a repository
   of objects they have direct access to,
   without having to talk to other agents.
   The 'same' object may be in several agent repositories at once,
   since when a server 'transmits an object' to a client, it actually
   transmits a copy of its encoding.
   If clients are particularly concerned about
   consistency, meta-data can be used to identify the agent originating
   an object.
Interoperation with other agents: the computation model.
In my design, agents speak a common protocol about objects, types,
and operations.  To carry out an operation on an object, an agent may
make requests to one or more servers that have knowledge of the object, 
its type, or its operations.
Agents may also know special-purpose protocols to talk
to databases and clients that don't talk the common protocol directly,
such as HTTP servers or SQL databases.  There may be multiple
protocols used to carry out similar operations, depending on the
performance requirements of an application, but my thesis will concentrate
on a single protocol that's robust enough to be usable in case studies
in the later part of my thesis.  In any case, changes in the required
protocol should be much less frequent than changes in the set of
data types.
In my basic protocol, agents use a request-reply interaction similar to that
of remote procedure calls.  (While a simple agent might
actually implement the request and reply as a procedure call, nothing prevents
an agent from having requests pending on multiple agents at once, if
that is desired for efficiency.)  Client agents can make multiple requests
in the same session, but state (other than that inherent in the information
repository) need not be preserved between different sessions, and should
be kept to a minimum within a session.  This has helped keep the interactions
of existing infosystems simple.  Mutation of the repository is not a part
of the protocol, so concerns like serializability are not an issue.
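The request-reply interaction, including the option of having requests pending on several agents at once, can be sketched as follows. The message fields, the `ServerAgent` class, and the use of a thread pool to stand in for network concurrency are all assumptions of this example, not a specification of the protocol.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    object_ref: str    # reference naming the object
    operation: str     # operation or attribute requested


@dataclass(frozen=True)
class Reply:
    ok: bool
    value: object = None


class ServerAgent:
    """Hypothetical server: answers from its repository and keeps no session state."""

    def __init__(self, repository):
        self._repository = dict(repository)   # never mutated by requests

    def handle(self, request):
        obj = self._repository.get(request.object_ref)
        if obj is None or request.operation not in obj:
            return Reply(ok=False)
        return Reply(ok=True, value=obj[request.operation])


def ask_all(pairs):
    """Issue every (server, request) pair without waiting for earlier replies.

    Replies are returned in the same order as the requests.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(pairs))) as pool:
        futures = [pool.submit(server.handle, req) for server, req in pairs]
        return [future.result() for future in futures]


def demo():
    images = ServerAgent({"img-1": {"type-of": "X-Window-Dump"}})
    docs = ServerAgent({"doc-7": {"title": "Bus schedule"}})
    return ask_all([
        (images, Request("img-1", "type-of")),
        (docs, Request("doc-7", "title")),
        (docs, Request("doc-7", "author")),
    ])
```

Since `handle` only reads the repository, the order in which concurrent requests are served does not affect the replies, which is the property that makes serializability a non-issue here.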
The type oracle and its services
What type oracles do.
A type oracle is a mediator agent that provides information about
structured types to information agents, and to application programmers.
Given the name of a type, a type oracle can find its description,
its supertypes, operations, methods, and encodings.  It can refer
clients to agents that can carry out requested operations or conversions
between types, or between encodings.  It can take advantage of its
knowledge of the type lattice to perform conversions and substitutions
that are not explicitly coded by any single agent.  (For example, if
a client has type encoding A, and needs to convert it to B, it can find
a converter from A to C, and another converter from C to B.  See figure 2c
below.)  Earlier research (such as the data translation work of
[Mam+89]) has revealed algorithms for some
of these tasks, but there are still a number of open algorithm questions that
can be studied in the thesis.
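The A-to-C-to-B example can be made concrete by treating known converters as edges of a graph and searching for a path. The sketch below uses a plain breadth-first search, which finds a path with the fewest conversion steps. It is a stand-in chosen for simplicity, not an algorithm taken from the cited work, and the agent names are invented.

```python
from collections import deque


def find_conversion_path(converters, source, target):
    """Find a shortest chain of conversions from `source` to `target`.

    `converters` maps (from_encoding, to_encoding) to the name of an agent
    that performs that conversion.  Returns a list of (from, to, agent)
    steps, an empty list if no conversion is needed, or None if no chain exists.
    """
    graph = {}
    for (src, dst), agent in converters.items():
        graph.setdefault(src, []).append((dst, agent))

    came_from = {source: None}   # node -> (previous node, agent used)
    queue = deque([source])
    while queue:
        current = queue.popleft()
        if current == target:
            break
        for nxt, agent in graph.get(current, []):
            if nxt not in came_from:
                came_from[nxt] = (current, agent)
                queue.append(nxt)

    if target not in came_from:
        return None

    # Walk backwards from the target to rebuild the chain of steps.
    steps = []
    node = target
    while came_from[node] is not None:
        prev, agent = came_from[node]
        steps.append((prev, node, agent))
        node = prev
    steps.reverse()
    return steps


CONVERTERS = {("A", "C"): "agent-1", ("C", "B"): "agent-2"}
```

Here no single agent converts A to B, yet `find_conversion_path(CONVERTERS, "A", "B")` yields a two-step chain through C.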
Why type oracles?
Type oracles simplify the problem of managing large numbers of abstract types.
Their ability to locate third-party expert agents for a type allows
information clients to use many more types (and operations) than those
they were explicitly coded for.  They also avoid the requirement in
many distributed systems that there be a single agreed-on form for all types
(figure 2a below), without requiring explicit conversions from every
type to every other (figure 2b below).
Type oracles can use meta-data associated with conversion operations to
direct type conversions or operations that preserve as much
information as is necessary and feasible.  Some types of conversion require
no information to be lost; others require that certain operations or
expectations are possible, even if this means the loss of extraneous
information.
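One simple way meta-data could direct a conversion is to annotate each converter with the properties it preserves and to consider only converters that keep everything the client requires. The sketch below illustrates this idea; the property names and the representation of the annotations are assumptions made for the example.

```python
from collections import deque


def path_preserving(converters, source, target, required):
    """Find a shortest conversion chain that preserves all `required` properties.

    `converters` maps (from_type, to_type) to the set of properties that the
    conversion preserves.  Returns the list of types visited, or None.
    """
    # Keep only the conversions that preserve every required property.
    usable = {}
    for (src, dst), preserved in converters.items():
        if required <= preserved:
            usable.setdefault(src, []).append(dst)

    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in usable.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None


# Hypothetical annotations: the direct A-to-B converter loses the "text" property.
ANNOTATED = {
    ("A", "B"): {"layout"},
    ("A", "C"): {"layout", "text"},
    ("C", "B"): {"layout", "text"},
}
```

With these annotations, a client that needs only "layout" is sent through the direct converter, while a client that needs "text" is routed through C even though the chain is longer.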
 
 
 
Figure 2.  Different models of type conversion
Multiple type oracles.
A full-blown Internet information system will have multiple type
oracles.  Oracles can query other oracles to find out about new types.
(Conversion and substitution strategies will work best if
a given oracle knows about as many types and mappings as possible.)
A given type can be kept private (as one might wish to do while
developing and testing it) by registering it with a local oracle,
but instructing the oracle not to give information about it to outside
oracles or agents.
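The cooperation between oracles, and the option of keeping a type private, might be sketched as follows. The `TypeOracle` class, its method names, and the example type names are hypothetical. The set of already-asked oracles is a simple device, assumed here, to keep mutual peers from querying each other forever.

```python
class TypeOracle:
    """Hypothetical oracle that answers locally or by asking peer oracles."""

    def __init__(self, name):
        self.name = name
        self._types = {}    # type name -> (description, private flag)
        self.peers = []     # other TypeOracle instances

    def register(self, type_name, description, private=False):
        self._types[type_name] = (description, private)

    def lookup(self, type_name, from_outside=False, _asked=None):
        """Return a description of the type, or None if it cannot be disclosed or found."""
        asked = _asked if _asked is not None else set()
        asked.add(self.name)

        entry = self._types.get(type_name)
        if entry is not None:
            description, private = entry
            # Private types are visible to local clients only.
            if private and from_outside:
                return None
            return description

        # Unknown locally: ask each peer that has not been asked yet.
        for peer in self.peers:
            if peer.name not in asked:
                found = peer.lookup(type_name, from_outside=True, _asked=asked)
                if found is not None:
                    return found
        return None


local = TypeOracle("local")
remote = TypeOracle("remote")
local.peers = [remote]
remote.peers = [local]
local.register("Draft-Schedule", "A type still under development", private=True)
remote.register("Image", "An abstract picture")
```

A local client can look up both types through `local`, while a query arriving via `remote` learns nothing about "Draft-Schedule".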
How types can be used with meta-data and references
Meta-data and references are both essential parts of
a wide-area information system.  In a system where structured information
is passed between agents that may not know each other, data
may need to be accompanied by tags identifying the type of data.
Type tags, however, are not the only kind of meta-data which may
be needed: information about the source, currency, and cost of
information, for instance, may be desired as well.  References are
required whenever a piece of information wishes to name or point to
another piece of information.  They may also be required for
efficiency, when it is not practical to ship a large block of data
from one site to another.
Meta-data and references as abstract types.
There is, however, no universally adopted mechanism to name
objects in information systems.  (Finding adequate naming schemes
in heterogeneous distributed systems is in fact an open problem,
one that this thesis will not attempt to solve in the general case.)
There are, however, a number of naming schemes with varying semantics
(such as the URLs of the Web, the semantic filenames of Prospero
  [Neu92], 
 the domain naming scheme of the Internet, and the Message-IDs of Usenet).
The abstract types model I propose can be used to distinguish and
classify the various naming schemes in use on the Internet.  Similarly,
as new forms of meta-data become necessary, new abstract types can be
used to model them.  Thus, a wide variety of data, from both existing
infosystems and new infosystems, can be used in this framework.
Minimal meta-data and reference requirements.  While
I do not intend to investigate all of the possible types of meta-data
and references in my thesis, I will have to design a few required for
the system to operate.  For instance, basic forms of reference to
other agents must be supported.  Meta-data containing type tags will
be needed to effectively use type oracles.  And the names of types
themselves are references that require certain semantics (particularly 
persistence, unique identification, and resolvability) and namespace
management.  My thesis will include provisions for these basic types
in the protocol or in the basic type lattice.
Image example revisited
How will these abstractions work together in actual use?  We return
to the image example from the previous section for an illustration.
The client starts out with a reference for an image it wishes to display.
It resolves the reference (perhaps with the help of another agent), and
contacts a server that holds the image in its repository.
The server passes the client meta-data indicating that the picture is of
type X-Window-Dump (a subtype of Image), and encoded in the standard XWD
format, further encoded with GNU compression.  Since the client does
not know how to implement the Display operation for this format,
it asks a type oracle for help in displaying it as an Image.
The Display operation cannot
be executed remotely, so conversion is required.  The oracle
tells the client that the image can also be converted to other subtypes,
one of which is the GIF type the client understands.  The client
can display GIFs, so it uses the type oracle to find agents that will
do a conversion out of GNU compression, and then from XWD to GIF.  An
uncompressed
GIF-format image is finally sent to the client, which it then displays.
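The sequence of steps in this example can be sketched in a few lines. The sketch assumes that "GNU compression" means the gzip format and uses Python's standard `gzip` module for that step. The XWD-to-GIF converter is only a placeholder that changes the tags, since a real image conversion is outside the scope of the illustration. The `Tagged` record and the table of oracle-supplied steps are hypothetical.

```python
import gzip
from dataclasses import dataclass


@dataclass(frozen=True)
class Tagged:
    """An encoded object together with the meta-data the server passes along."""
    type_name: str       # e.g. "X-Window-Dump"
    encodings: tuple     # outermost encoding first
    payload: bytes


def gunzip_agent(obj):
    """Conversion agent that removes an outer gzip compression layer."""
    if obj.encodings[0] != "gzip":
        raise ValueError("outermost encoding is not gzip")
    return Tagged(obj.type_name, obj.encodings[1:], gzip.decompress(obj.payload))


def xwd_to_gif_agent(obj):
    """Placeholder for a real XWD-to-GIF converter: only the tags change here."""
    if obj.encodings != ("XWD",):
        raise ValueError("expected an uncompressed XWD encoding")
    return Tagged("GIF", ("GIF",), obj.payload)


# Steps a (hypothetical) oracle would hand back, keyed by outermost encoding.
ORACLE_STEPS = {"gzip": gunzip_agent, "XWD": xwd_to_gif_agent}


def fetch_displayable(obj, displayable):
    """Apply oracle-suggested conversions until the client can display the object."""
    trace = []
    while obj.encodings[0] not in displayable:
        step = ORACLE_STEPS[obj.encodings[0]]   # KeyError if no agent is known
        trace.append(step.__name__)
        obj = step(obj)
    return obj, trace


served = Tagged("X-Window-Dump", ("gzip", "XWD"), gzip.compress(b"raw pixels"))
```

Calling `fetch_displayable(served, {"GIF"})` first strips the compression and then applies the placeholder conversion, mirroring the order of steps in the narrative above.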
The example above elides a number of details that need to be tuned
carefully in an actual implementation.  The strategy for negotiation
between agents is left unspecified, as is the strategy for when to
send data, and to whom.  (Bandwidth may be saved, for instance, if the
initial client request to the server returns meta-data but not
the actual image data, assuming the image is large.)  While I suspect
that different strategies may be appropriate for different applications, 
I hope to discover useful general strategies for agent interaction
in my thesis.
Having completed an overview of my design, I now discuss the relationship
of my work to work in related areas.
Related work
Research projects in a number of areas have direct relevance
to my thesis.  A detailed analysis of this work belongs in the
thesis proper rather than in the proposal, but the following categories
of related work (some already mentioned) are worth noting:
- Distributed objects: A number of groups
have extended the ideas of remote procedure calls and proposed or
created systems where object methods can be invoked from arbitrary
machines over a network.  The CORBA proposal [OMG92]
 of the Object Management
Group is probably the best-known instance; its core Object
Request Broker standard is available today.  CORBA's goals are in
some ways more ambitious than those of the object system I propose, since
the system is meant for general-purpose distributed programming.
Certain essential features of the proposed Internet infosystems
architecture may be hard to implement in this system: in particular,
the migration and replication of objects are not part of the CORBA model.
CORBA's proposed "interface repositories" may contain some of the same
information that a type oracle would, but the repositories are passive
and not designed for global use.  Nevertheless, this system is worth
watching, since it has many similar goals, and is supported by numerous
manufacturers.
    
- Systems with type expertise:  Some other
information systems have agents that are
knowledgeable about new types.  Rufus [Sho+93], a system developed at IBM
Almaden to manage semi-structured information on a site-wide basis,
includes a type expert called a classifier, which constructs structured
objects as proxies for unstructured data files.  The classifier analyzes
the file contents to select a type to use for constructing the object.
A Rufus followup paper [SS94] describes an algorithm that allows the
classifier to learn to classify a possibly arbitrary number of new types.
A Rufus classifier, then, can be thought of as
a particular kind of 'type oracle' whose expertise is converting a
type encoding (a semi-structured file) to a particular abstract type.
Rufus, however, is designed for a single site, and in its current
form does not scale up to Internet-wide information systems.
    
- Extendable types:  Most distributed programming
systems, such as distributed object systems, allow a theoretically unlimited
number of structured data types.  They typically lack general run-time
services to assist in using new data types, though, so it is difficult
for applications to use types other than ones known about at the time of
program creation.
A number of systems, however, give more support.  Rufus's classifier
has already been cited; its type conformance and revision model, where
an arbitrary number of implementations can exist for a known type, also
provides support for new types.  SGML, a well-known text markup
convention described in [AAP86] and elsewhere,
allows syntactic descriptions of new data types (known as DTDs)
to be passed along with data objects, so that arbitrary applications can
parse them, as long as the object format follows certain basic markup
conventions.  The DTDs do not, however, give semantic support for the types.
    
- Agent Cooperation:  Information systems involving
cooperative agents have existed for a long time.  The modern Internet
depends heavily on one such system: the Internet Domain Name System servers
which manage information on a hierarchical name-space of hosts.  Many
research systems also have agents cooperate for a specific purpose.
One example in a particularly relevant domain is Indie
[Dan+92], which
consists of a network of index agents dispatching
requests and trading new search index records based on the published
specialties of each agent.
  While the ways in which agents cooperate are
application-specific in many systems, systems have also been built to
handle more general cooperation strategies.  ISIS
[Bir92], for instance,
provides fault-tolerance guarantees for agents organized in "process groups",
though it says relatively little about the data model the agents use.
     
- Heterogeneous information retrieval:
In designing a new information architecture, one cannot overlook one
reason that systems like Gopher and WWW have gained so much acceptance:
they allow a wide variety of popular information formats to be served
via several common protocols.  The World Wide Web can theoretically
include an arbitrary number of types, but since the only generally understood
way to refer to types at the moment is through the MIME typing system,
its adaptability is limited.  Some research systems, such
as Rufus, also allow data in many different formats to be exchanged
intelligently, due to the classifier mechanisms mentioned earlier.
    
- Software architecture:
Software architecture research provides
useful frameworks for understanding and designing distributed information
systems.  In [Abo+93], Abowd, Allen and Garlan
describe a useful language for
discussing software architectures in terms of components and connectors,
and in terms of particular "styles" of component types and connector
interactions that characterize particular kinds of architecture.
(This description is refined further in [AG94].)
The description of Internet information systems as agents (components)
interacting via a common protocol of abstract data operations (the connectors)
fits well into this
language.  The particular data abstractions described in the previous
section could be thought of, in a general way, as describing the style of Internet
information system architecture, though most of Garlan and Allen's
work on style has concentrated more on protocols and computations than
on data abstractions.
The notion of a reference architecture is also a useful one
for building and analyzing information systems.  Reference architectures often
have several purposes: they can describe the basic abstractions and
building blocks that are used in a particular application domain, and
provide a basis for comparing different systems in the domain; or they
can try to describe a basic "common denominator" that should characterize
all useful systems in a particular domain.  These goals, while related,
should be recognized as often having cross purposes.  My thesis will
contribute towards both goals, but in separate sections.
 
The Plan of the Thesis
The following questions are key to my thesis:
-  Are abstract structured types usable in a distributed
 Internet information system, with scale and data heterogeneity on
 the order of Gopher or WWW?
-  Will type oracles allow useful structured types to be
      defined in a decentralized, continuously updated manner?
      Once defined, can the right types be found for a given job?
-  Will this system give wide-area information systems more semantic
      power in real applications?  Will people find the ideas worth
      adopting for their applications?  Why or why not?
-  Can the system interoperate with existing information outside the
      world of agents specifically constructed for this model?
-  What are the limitations of the design?  How would different
      designs of the object/type system or the agent protocol change
      the system's performance or capabilities?
The following questions are ones I wish to address in my thesis,
but they may not necessarily be completely answered:
-  What are the best ways for type oracles to determine optimal
      conversions to related types?
-  What are the best ways to handle the introduction and revision
      of new types in this system?  To what extent can shared types
      and implementations be evolved without breaking earlier type guarantees?
-  What is the best way to express semantic constraints on types
      and operations, in a widely heterogeneous world? 
     (I propose no formalism for such constraints, but plan to
      include a slot in definitions to place either textual or formal  
      semantic requirements.  Interpretation and enforcement are left to
      agent implementors.)
In order to answer these and other questions, I will do the following:
- Analyze the state of the practice:  I claim that
  my architectural approach to Internet information systems will
  bring more order and more transformational and operational power
  to the world of Internet information systems.  To help justify
  this claim, I will analyze existing information systems that are
     already in use over the Internet, such as Gopher, World Wide Web,
     and the Domain Name System.
  I will describe their architectural concepts, explicit
  and implicit, compare the design decisions that were made in
  their data structures, computation models, and protocols, and
  describe analogs in distributed and object-oriented systems.
  The analysis will
     show the common features of these systems (largely in terms of
     the abstractions described earlier), show how their data
     and agent abstractions can be interpreted in terms of my
     architectural model, and highlight problems and limitations 
     of these systems' designs.  (This type of analysis contrasts with
     the more user-oriented analysis of surveys like
     [Sch+92].)
     I do not expect this part of the
     thesis to take an especially long time, but it will help
     demonstrate how my design takes advantage of known assets
     of existing infosystems, and how it improves on those systems.
     It will also help provide a sound basis for informing and
     justifying my design.
  A complete and thorough analysis of Internet information systems
  could probably make a thesis in itself, similar to Tom Lane's work
  in user interface architectures.   My analysis is not so ambitious;
  it's simply meant to
  help lay the foundations for the specific architecture I will develop.
  The analysis should, however, help information agent and system
  developers better understand their design task, even if they
  do not adopt the specific model I propose in the thesis.
 
 
- Produce a detailed architectural design of information agents
     and type oracles:  With the design analysis as a foundation,
     I will go on to describe in detail a design for Internet agents,
     and specifically for type oracles, that will allow structured types
     to be defined in a distributed fashion and used over a wide-area
     network.  Basic agent services, data models, and the protocols
     used to request services from agents will be described.  Distributed
     type oracle mechanisms will be shown, as well as the basic procedures
     used to define new types, and to handle type conversion and substitution.
     Various sophisticated enhancements could be made to
     the basic system, such as handling of type evolution, and complex
     use of meta-data and graph analysis to maximize information conversion.
     While I hope to explore these issues to some extent in the thesis,
     my first priority will be providing a basis for the basic type
     and agent services.  If those are general and flexible enough,
     others can build more sophisticated type and method services on
     top of the basic architecture.
 
- Produce a prototype to implement and demonstrate the
   architecture, and test its application to actual information systems:
  To demonstrate that my architectural design is workable, I will
  implement a prototype system that supports replicated abstract
  information objects over the Internet, and includes type oracles allowing
  new types to be defined and used.  Implementing the prototype
  will show basic proof of concept.  To show its ability to make
  use of existing information, I will also construct basic agents that
  act as 'gateways' (in both directions) to existing information systems
  such as the World Wide Web.  Furthermore, I will carry out a case
  study showing how my system's use of structured types makes certain
  information gathering applications significantly easier to carry out than
  the current (ad hoc) state of the practice.  (The exact domain of my
  case study is not specified yet; it might, for example, be one of the
  software engineering, library research, or medical applications
  mentioned earlier in the paper.)  Case studies should examine
  how types are created, publicized, found, and changed.
  Here are some possible ways a case study can be evaluated:
 
 -  Using myself:  Comparing code size or development of application 
       implementations in my system with already-built applications of
       the same functionality.  Comparing ease of adding new functionality
       in both cases (assuming source is available for both).
      
 
-  Using other local people:  Comparing types of applications people
       find worth building with the system to types of applications people
       build without it.  Collaborating with selected local projects
       and studying how the system helps or does not help their tasks, 
       and how structured types are defined and used in practice.
 
-  Using people on the net:  Analyzing the growth of the type lattice, and
        the reuse of types (either directly by other projects, or
        indirectly through subtyping).  Analyzing the quality of answers
        provided by the type oracle.  Measuring frequency and types of
        cross-system use between my system and other existing infosystems.
        (Instrumentation code in the type oracle should be quite useful here.)
 
  If time permits, I
  may also put together and document an agent toolkit to help people
  construct their own agents in my system.  Through experience with
  the case studies, I will get a better sense of the strengths and
  weaknesses of my architecture, and be able to suggest improvements.
 
A successful design and implementation of the system, combined with
tests of the system on sample applications, and observation of the
use of the system (or related systems) in the Internet, should provide
the necessary material to answer the key questions identified above.
The additional questions can be answered to a certain extent as I
consider different design possibilities and observe the use of the system
by myself and others.  The thesis will include a report of the
alternatives I considered and implemented, and the strengths and
weaknesses I observed in practice.
Timeline.
I would like to work on all three activities simultaneously in the course
of my thesis, but the relative emphasis will change over time.  The infosystems
analysis and architecture specification will predominate in the first phase;
the implementation in the next phase; and the testing in the third phase.  
Writing will mostly take place during the first and last phases of the work.
I will be taking an incremental approach to the implementation.  I will
be experimenting with different protocols on my own early in the thesis work,
but eventually want to have outside users try out the system as well
to see how well the system scales up.  Partly in order to attract
outside users,
I will need to make my system interoperable (in both directions) with existing
infosystems like WWW.  In this way, users can take advantage of the added value
of my system without losing access to the information resources they
already have.  In addition, WWW browsers like TkWWW and Mosaic with their
fill-out forms capabilities are
general enough that I can probably use them, or slight variants on them,
as the initial user interfaces to my agents, instead of having to build my own.
My thesis has a number of milestones that can serve as good
indicators of progress.  Here are the major ones, with estimates
of probable time to completion:
- Thesis Proposal (May 1994)
- Analysis of existing systems (Aug/Sep 1994)
- Detailed design (Nov/Dec 1994)
- Release basic prototype, with agent protocol, oracles, basic types
  (Dec 1994/Jan 1995)
- Case studies (Spring 1995)
- Finish writing document (Fall 1995)
I expect to complete the thesis in the fall of 1995, if all goes smoothly.
Expected Contributions
These are the key contributions that I expect to come from my thesis:
 -  For researchers: a better understanding of the requirements and
    architecture of Internet-scale information systems, and how structured
    information types can be used in a widely distributed environment.
    This will come through my analysis of existing systems, the design
    of the system I describe, and the comparison to other existing or
    possible systems through analysis and experimentation.  The type oracle
    will also provide a useful example of the well-known
    "mediator" concept in information systems.
 
-  For information agent builders: a working prototype of a
      type oracle service, and an understanding of the strengths and
      weaknesses of various design choices involved in its construction.
      Case studies will show its applicability to various domains.
 
-  For information providers: a lattice of useful structured types
       for common forms of information.  I'll need to build the beginnings
       of the lattice in order to test my design; and users in case studies
       will enlarge it further.  A repertoire of data types will be useful not
       only to people using a system of my design, but also to people
       wanting to incorporate structured data into their own systems.
There are certain questions relevant to my information system
that I do not expect
to make major contributions towards, though I hope to use the
research and experience of others in these areas in my design.
These include questions of
security, privacy, and cost accounting, naming syntax and semantics,
and human interfaces to information systems.
Other problems, such as search and filtering, will probably not be addressed
directly in the thesis, but I hope that the work of the thesis
will enable better solutions to these problems.
While the world of Internet information systems is changing
extremely rapidly, I expect these contributions to have staying power.
A well-constructed design, analysis and experience report on
type oracles and their datatypes should remain useful
as a guide to designers of many distributed information systems
beyond the particular system I design.
Notes
- [AAP86]
- Association of American Publishers.
Standard for Electronic Manuscript Preparation and Markup.
Washington, D.C.: Association of American Publishers, 1986.
- [AG94]
-  Robert Allen and David Garlan.
 "Formal Connectors".
 Technical Report, CMU-CS-94-115, Carnegie Mellon University, Pittsburgh, PA.
 A copy is available on-line.
- [Ank+93]
- F. Anklesaria,
M. McCahill, P. Lindner, D. Johnson, D. Torrey, and B. Alberti.
"The Internet Gopher Protocol."  Internet RFC
1436, March 1993.
- [Abo+93]
-  Gregory Abowd, Robert Allen, and David Garlan.
 "Using Style to Understand Descriptions of Software Architecture".
 In Proceedings of the ACM SIGSOFT '93 Symposium on the
  Foundations of Software Engineering, December 1993, p. 9-20.
- [BC90]
-  Kenneth P. Birman and Robert Cooper,
 "The ISIS Project: Real
 Experience with a Fault Tolerant Programming System".
Technical Report TR 90-1138, Cornell University Department of Computer
Science, July 1990.
- [BF92]
- N. Borenstein and N. Freed.
"MIME (Multipurpose Internet Mail Extensions): Mechanisms for Specifying
and Describing the Format of Internet Message Bodies."  Internet RFC
1341, June 1992.
- [BL+92]
- 
T.J. Berners-Lee, R. Cailliau, J-F Groff, B. Pollermann. "World-Wide Web: The
Information Universe".  In "Electronic Networking: Research, Applications
and Policy", Vol. 2 No 1, pp. 52-58 Spring 1992, Meckler Publishing,
Westport, CT, USA.  (A preprint is available on-line.)
- [BL93]
- 
T.J. Berners-Lee, "Hypertext Transfer Protocol".  Internet draft, CERN,
November 1993.  Work in progress.
- [Bir92]
- Kenneth P. Birman, "The Process Group Approach to
 Reliable Distributed Computing".  Technical Report TR 91-1216, Cornell University Department of Computer Science, July 1991, revised September 1992.
- [Dan+92]
-  Peter B. Danzig, Shih-Hao Li, and Katia Obraczka,
   "Distributed Indexing of Autonomous Internet Services".
Computing Systems, 5(4):433-459, Fall 1992.
(A preprint is available on-line.)
- [Kah91]
-  Brewster Kahle.  "An Information System for Corporate Users:
 Wide Area Information Servers".  Technical Report TMC-199, Thinking
 Machines Corporation, Cambridge, MA, 1991.
- [Mam+89]
-  Sandra A. Mamrak,
Michael J. Kaelbling, C. K. Nicholas, and M. Share.  "Chameleon:
A System for Solving the Data-Translation Problem."  IEEE
Transactions on Software Engineering 15(9): 1090-1108, September 1989.
- [Moc87]
- P. Mockapetris,
"Domain Names - Concepts and Facilities."  Internet RFC
1034, November 1987.
- [Neu92]
- B. C. Neuman,
"The Virtual System Model: A Scalable Approach to Organizing Large Systems".
  Technical Report 92-06-04, University of Washington Computer Science
 Department, Seattle, WA, June 1992.  A copy is 
available on-line.
- [OMG92]
- Object Management Group.
The Common Object Request Broker: Architecture and Specification.
OMG Document Number 91.12.1, Revision 1.1.  Wellesley, MA: QED Publishing Group,
1992.
- [Sho+93]
-  K. Shoens, A. Luniewski, P. Schwarz, and J. Thomas.
"The Rufus System: Information Organization for Semi-Structured Data".
In Proceedings of the 19th VLDB Conference, Dublin,
Ireland, 1993.
- [Sch+92]
- Michael F. Schwartz, Alan Emtage, Brewster Kahle, and B. Clifford
   Neuman. "A Comparison of Internet Resource Discovery Approaches."
   Computing Systems 5(4):461-493, Fall 1992. 
 A preprint is
available on-line.
- [SS94]
-  Peter Schwarz and Kurt Shoens.
"Managing Change in the Rufus System".
In Proceedings of the 1994 International Conference on Data Engineering,
Houston, Texas, February 1994.
- [Wie92]
-  Gio Wiederhold.
"Mediators in the Architecture of Future Information Systems".
IEEE Computer 25(3):38-49, March 1992.
spok@cs.cmu.edu (Written 17-May-94)