% User documentation for the EYE project.
%
% Begun 26 Feb 96
% Copyright Mary Soon Lee 1996
%
% Additions by Andrew on March 20th
%------------------------------------------------------------------
\documentstyle[twoside,psfig]{article}

\title{EYE Documentation: Version 0.01 (March 20th, 1996)}
\author{Developed by Schenley Park Research Inc.}

\begin{document} 
 
\maketitle 
\newpage
\hspace{2cm}
\newpage
\pagestyle{myheadings}
\markboth{EYE Documentation: Version 0.01}{Schenley Park Research}
\tableofcontents
\newpage
 
\section{Introduction}

EYE is a tool to help people apply machine learning and statistical
techniques.  It enables people to use advanced techniques without
needing to master the underlying algorithms.  Instead EYE has a simple
interface where the user presents the raw data, and then applies EYE to
perform the desired analysis.  Functions include:
\begin{itemize}
\item {\bf BlackBox}: searches for a model that accurately explains the data.
\item {\bf Optimize}: finds a set of inputs that will optimize a given
  criterion, for instance to maximize the sum of the outputs.
\item {\bf Predict}: predicts future behavior from past data.
\end{itemize}
See Section~\ref{compendium} for a compendium of EYE functions.

This document explains how to use EYE.  If you are in a hurry to begin,
you need only read section~\ref{example} and section~\ref{starting}.
Later sections describe the online help facility, additional user
interface tools, and the advanced interface to EYE.

To illustrate the use of EYE, we consider the example of a gardener
trying to grow prize-winning flowers.

\subsection{Tutorial Example: The Gardener}
\label{example}

Suppose a gardener with an interest in machine learning wants to grow
prize-winning flowers.  She's kept records of her past attempts: what
fertilizers she used, how much she watered the seedlings, the
temperature of the greenhouse, what height the flowers grew to, how
brightly colored they were.

The gardener decides to use EYE to help her.  She has several new plant
regimens in mind, and wants EYE to predict how well each one will
perform.  She's also curious to see what regimen EYE itself will
recommend if she asks it to maximize the brightness of the flowers,
subject to the constraint that the flowers must be at least thirty
centimeters in height.

\section{Getting Started: How to Get EYE Running}
\label{starting}

This section provides all the information you need to start using EYE.

EYE can run under either Windows 95 or Windows NT on a PC.  To start it,
bring up an MS-DOS Command Prompt window, move to the directory where
you saved the EYE executable (by using the cd command), and type EYE.
This will bring up the EYE window with the initial welcome screen.

The simplest way to use EYE is via the GMBL menu\footnote{For the
curious, GMBL stands for General Memory Based Learning, the machine
learning approach that underpins the EYE code.} on the main menu bar.
Select the GMBL menu with the left mouse button, and then select the
first menu item, ``Run GMBL.''  This brings up the following dialog
box:

\centerline{\psfig{file=simpledialog.ps,height=2.7in}}

Suppose the gardener introduced in section~\ref{example} wants to see
how the flower-height depends on the various factors in the plant
regimen (the quantity of green-grow fertilizer, the number of
mineral-drops, the amount of water, and the temperature of the
greenhouse).  To find out, first type garden.mbl into the datafile
slot of the dialog box to tell EYE to use the gardening data.  Now
select {\bf graph} from the main listbox by clicking it with the left
mouse button.  The dialog box should now look like this:

\centerline{\psfig{file=simpledialog2.ps,height=2.7in}}

Press the RUN button.  The cursor changes to a black eye while EYE
analyzes the data, and then EYE displays four graphs, showing how the
flower-height varies with each of the four factors in turn, while the
other factors are held constant.  Notice that the final graph,
corresponding to the effect of temperature, is very close to a flat
line.  This shows that the flower-height is hardly affected by the
temperature---at least for the regimens the gardener has tried in the
past.

To run EYE again, select ``Run GMBL'' from the GMBL menu as before.
The same dialog box will appear, with the datafile already filled in
as garden.mbl.  Perhaps this time the gardener, being an intrepid
soul, wants to see if EYE can find a model that explains the data.  To
follow in her footsteps, select {\bf BlackBox} from the listbox and
then press the RUN button.

The black eye appears, showing that EYE is at work, and results start
scrolling down the screen.  EYE is busy searching for a good model for
the data.  {\bf BlackBox} performs this search without any prompting
from the user.  It tries out function approximators such as nearest
neighbor and kernel regression, autonomously tuning their parameters,
exploring attribute subsets, and deciding which model to test next.

After a few seconds, the black eye disappears, and the text stops
scrolling.  You can now examine EYE's report on the {\bf BlackBox}
search.  The overall evaluation should be visible at the bottom of the
scrollable window.  It should look something like this\footnote{Because
EYE uses random numbers to make decisions such as which data should be
used in the testset, the precise results will vary from one run to the
next.}:

\begin{verbatim}
5.  Evaluation.

        Now, if we simply predicted the global average, 
        the mean-abs testset error would be 7.02.

        The best thing we've found so far in the
        searches reduces that by 67%.
\end{verbatim}

This tells us that EYE has found a model for the data whose average
prediction error is only thirty-three percent of that for the global
average model.
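
To make the arithmetic behind this report concrete, here is a small
sketch (written in Python purely for illustration; it is not part of
EYE) of the ``global average'' baseline and the percentage reduction
that EYE reports:

```python
# Illustrative sketch (not EYE's code): the "global average" baseline
# predicts the mean output for every point; a model is then scored by
# how much it reduces the baseline's mean absolute error.

def mean_abs_error(predictions, actuals):
    return sum(abs(p - a) for p, a in zip(predictions, actuals)) / len(actuals)

def baseline_reduction(model_error, outputs):
    mean = sum(outputs) / len(outputs)
    baseline = mean_abs_error([mean] * len(outputs), outputs)
    return 100.0 * (1.0 - model_error / baseline)

# A hypothetical model with mean-abs error 2.0 on these four outputs
# achieves roughly a 67% reduction over the baseline:
reduction = baseline_reduction(2.0, [11.9, 12.1, 11.5, 27.9])
```

So a ``reduces that by 67\%'' report means the model's average
prediction error is about a third of the baseline's.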

You have now learned almost all that you need to know to start
applying EYE to your own data.  To run EYE, select ``Run GMBL,'' type
in the datafile, select the function you want, and press the RUN
button.  There is only one more thing you need to know: how to get EYE
to use your own data.

\subsection{Datafiles}

EYE expects datafiles to be arranged with one datapoint per line in
the file.  Each datapoint consists of a sequence of floating point
numbers, specifying the values of each of the variables for that
datapoint.  If you wish to include comments in your datafiles, you can
do so by starting each line of comment with the `\#' character.
For instance, here is part of the garden.mbl datafile:
\begin{verbatim}
# GreenG MinDrop Water Temp  Height   Brightness
    2      2      2     15   11.9      2
    2      2      2     20   12.1      2
    2      2      2     25   11.5      2
    2      2      4     15   27.9      2
\end{verbatim}
By default, EYE assumes that the rightmost column of numbers
represents the output value, and that all the other variables are
inputs.  To find out how to specify other formats, see
section~\ref{format} (in brief: you need to select the advanced option from
the ``Run GMBL'' dialog box, and then edit the format slot in the
advanced dialog box).
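
To make the expected layout concrete, the following short sketch
(written in Python purely for illustration; it is not EYE's own
reader) parses data in this format:

```python
# Illustrative reader for the datafile layout described above:
# one datapoint per line, whitespace-separated floating point numbers,
# and lines beginning with '#' treated as comments.

def read_datafile(lines):
    data = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comment lines
        data.append([float(tok) for tok in line.split()])
    return data

sample = """\
# GreenG MinDrop Water Temp  Height   Brightness
    2      2      2     15   11.9      2
    2      2      2     20   12.1      2
""".splitlines()

rows = read_datafile(sample)
# By default the rightmost column is the output, the rest are inputs:
inputs, outputs = [r[:-1] for r in rows], [r[-1] for r in rows]
```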

You are now ready to try out EYE on your own data.  In doing so, you
may spot unfamiliar terms appearing on the screen, such as {\bf
GMString} or {\bf AutoRSM}.  The following section describes how to
get online help that will explain cryptic terms like these.  Later
sections describe such things as how to switch to the previous screen,
how to halt EYE midway through a computation, and how to use the
advanced interface to gain additional control.

\section{Getting Help}
\label{help}

The simplest way to use EYE's online help is via the Help menu on the
main menu bar.  To see the range of topics for which help is provided,
select ``Introductory Help'' from the Help menu.  This brings up a
list of the available help topics.  To get help on any of these, just
click on the corresponding word with the left mouse button.

Whenever you see underlined words, such as those on the introductory
help screen, you can click them with the left mouse button to get more
information.  Sometimes clicking a word that isn't underlined will
still produce help.  (EYE's output would look rather messy if it always
underlined every word for which help was available.)

Help is also available from the Help buttons on several of the dialog
boxes.

The next section describes additional user interface features---from
how to use the File menu, to how to change the colors used to display
EYE's output.  Section~\ref{advanced} describes more advanced features
of the interface, and section~\ref{compendium} is a compendium of all
the EYE functions.

\section{A Medley of Other GUI Features}

Section~\ref{starting} explained a simple way to run EYE, and
section~\ref{help} described how to use the online help.  This
section discusses other useful features of the interface.  

\subsection{The File Menu and How to Exit EYE}

If you select the File menu with the left mouse button, you will see
something like this:

\centerline{\psfig{file=filemenu.ps,height=2in}}

We will briefly describe each option in turn.
\begin{itemize}
\item{\bf Open} 

This lets you select a new datafile.  It brings up a window showing the
available files, and waits for you to choose one.  When you want to
switch datafiles you can either use Open, or you can bring up the
dialog box to run EYE (see section~\ref{starting}) and type the new
filename into the datafile slot.  As a shortcut, you can invoke Open by
holding down the Ctrl key while you press the letter `O' (that's
what the enigmatic Ctrl+O means).

\item{\bf Save} 

Once you start using the advanced interface to EYE (see
section~\ref{advanced}) you may start modifying your datafile, perhaps
naming your variables or altering which ones are treated as outputs.
Select Save if you wish to save these changes.  As a shortcut, you can
invoke Save by holding down the Ctrl key while you press the
letter `S'.

\item{\bf Save As} 

This lets you save your datafile under a new name, bringing up a window
that shows you the existing files.

\item{\bf 1 garden.mbl} 

EYE remembers the last four files that you looked at.  This lets you
select them directly instead of using Open.

\item{\bf Exit} 

Last but by no means least, Exit allows you to quit EYE.
\end{itemize}

\subsection{Cursors and Pictures}

EYE provides a few visual cues to show what it is doing.  Whenever it
is busy computing, the cursor will change to a black eye.  The cursor
returns to the standard white arrow when EYE has finished computing.

When EYE expects to be engaged in a particularly long computation, it
also draws a black box\footnote{The choice of a black box is a tip of
the hat to the {\bf BlackBox} function of EYE, which is designed to
autonomously search for a good model for the data.} that bounces in the
left hand side of the window.  The black box disappears when EYE
finishes working.

\subsection{Halting EYE}

If you wish to halt EYE midway through a computation, simply press the
letter `Q'.  Within a second or two a dialog box will pop up and give
you the option of halting.  

\subsection{Where Did the Last Screen Go?}

EYE only keeps the latest set of output in its scrolling window.
Anything you saw earlier---perhaps some results, perhaps help on a
particular topic---disappears.  But if you want to inspect earlier
output, it's easy to do.  Simply select the ``Previous Screen'' option
from the GMBL menu.  You can select this repeatedly to look at
increasingly old output.

\subsection{Colors and Fonts}

To change the default colors and text size of EYE's output, bring up
the GMBL menu and pick ``Set Properties.''  This brings up a dialog box
that lets you specify the number of characters per line, the foreground
color, and the height of the text.

Note that the foreground color is the color used for plain text; if you
change it, other colors, such as the background color of the screen,
may change as well.  Later versions of EYE will allow you to specify
these colors directly.

\section{An Advanced Interface to EYE}
\label{advanced}

As well as the simple dialog box to run EYE (obtained by choosing ``Run
GMBL'' from the GMBL menu), there is also an advanced dialog box.  This
provides the user with more control of EYE's operations.  This section
describes how to use the advanced dialog box.

There are two ways to invoke the advanced dialog box.  You can bring it
up by clicking the Advanced button in the simple ``Run GMBL'' dialog
box.  Or you can simply click the right mouse button in the main EYE
window.  Try either of these methods and you should see the following:

\centerline{\psfig{file=advanced.ps,height=3.6in}}

The datafile and action fields should be familiar to you from the
simple dialog box (the action simply being the task you wish EYE to
execute when it runs, such as {\bf BlackBox}).  And the Run, Help, and
Cancel buttons each have the obvious effect.  But there are quite a
few new parameters.  You can edit these parameters directly by typing
a new value into the dialog box, or indirectly by using the Edit
button.

We now explain each of the remaining parameters in turn, and then
discuss the Edit and Inspect buttons.  Section~\ref{compendium} goes
on to describe the EYE functions (such as {\bf BlackBox} and {\bf
Predict}).

\subsection{Use Classification/Use Regression}

By default EYE uses regression and searches for the best general model
for data, just as we have seen with the example of the gardening data.
Sometimes, however, the data falls into a special category: it
represents a classification problem.  Here each datapoint falls into
one of a finite number of classes, and the goal is to be able to
predict which class a new datapoint will belong to.

For instance, suppose our friend the gardener had carried out
experiments on growing hybrids.  Perhaps the color of the flowers on
the hybrid plants varied: some had yellow flowers, some had orange
flowers, and some had red flowers.  Now the gardener would like to
predict what color flowers will result from particular hybrid
experiments.  Each experiment produces a result belonging to one of a
finite number of classes: yellow, orange, or red.  We represent this
by assigning one output variable to each class, and setting that
output variable to be 1 if the result belongs to that class, and 0
otherwise.  

The user can switch on the {\it Use classification} mode if their data
conforms to a classification problem (i.e. for each datapoint there is
a single output that has the value 1.0---corresponding to that
datapoint's class---and all the other outputs are 0.0).  EYE will then
constrain its own predictions and models so that they also conform to
the classification mode.
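
The encoding described above can be sketched in a few lines (a
hypothetical helper, shown in Python purely for illustration; it is
not part of EYE):

```python
# One output column per class: 1.0 for the datapoint's class,
# 0.0 for every other class, as EYE's classification mode expects.

def one_hot(label, classes):
    return [1.0 if c == label else 0.0 for c in classes]

classes = ["yellow", "orange", "red"]
encoded = one_hot("orange", classes)  # [0.0, 1.0, 0.0]
```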

\subsection{Format}
\label{format}

This parameter shows the current input/output status of each of the
data columns in a datafile.  Recall the sample of the garden.mbl
datafile that we showed earlier:
\begin{verbatim}
# GreenG MinDrop Water Temp  Height   Brightness
    2      2      2     15   11.9      2
    2      2      2     20   12.1      2
    2      2      2     25   11.5      2
    2      2      4     15   27.9      2
\end{verbatim}
Here the first four data columns correspond to input variables
(factors in the plant regimen) and the last two data columns represent
output variables (the flower-height and color-brightness that resulted
from the regimen).  The {\it Format} string for the garden.mbl
datafile is thus: ``iiiioo.''  In general, the nth character in a {\it
Format} string is an `i' if the nth data column is being
treated as an input, an `o' if the column is being treated as an
output, and a `-' if the column is being ignored.

By default, EYE assumes that the rightmost column of numbers in a
datafile represents the output value, and that all the other columns
correspond to input variables.  You can edit the {\it Format} string
if this default assumption is incorrect.  

If you wish to permanently record a non-default format for the current
datafile, first edit the {\it Format} string, and then save the
datafile (using the File menu).  Whenever EYE opens that datafile in
the future, it will read in the {\it Format} that you recorded.
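
The meaning of a {\it Format} string can be sketched as follows (an
illustrative helper written in Python; it is not EYE's own code):

```python
# Split one datapoint's columns according to a Format string:
# 'i' marks an input column, 'o' an output column, '-' an ignored one.

def split_by_format(row, fmt):
    inputs = [v for v, f in zip(row, fmt) if f == 'i']
    outputs = [v for v, f in zip(row, fmt) if f == 'o']
    return inputs, outputs

row = [2.0, 2.0, 2.0, 15.0, 11.9, 2.0]
ins, outs = split_by_format(row, "iiiioo")
# ins is the four regimen factors, outs the height and brightness
```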

\subsection{Restrict} 

Not yet available: later versions of EYE will let the user control
this parameter.

\subsection{Verbosity} 

Not yet available: later versions of EYE will let the user control
this parameter.

\subsection{Blackbox test} 

The proportion of the data that {\bf BlackBox} reserves for use in a
test-set (to check against overfitting).  This parameter should be
between 0.0 and 1.0.  See section~\ref{blackbox} for more information
on {\bf BlackBox}.  {\bf Search} also refers to this parameter.

\subsection{Blackbox seconds}

The number of seconds for which {\bf BlackBox} or {\bf Search} will
run before producing a report on their progress.  To halt {\bf
BlackBox} before this time is up, press the letter `Q'.  See
section~\ref{blackbox} for more information on {\bf BlackBox}.

\subsection{No. crossval}

The number of leave-one-out samples to use during cross-validation
(cross-validation is used by both {\bf BlackBox} and {\bf Search}).
Suppose this parameter is set to N; then instead of finding the mean
leave-one-out error of {\it all} points in the dataset,
cross-validation will find the mean leave-one-out error of the N most
recent points.
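
The procedure just described can be sketched as follows (an
illustrative Python sketch; the 1-nearest-neighbor predictor here is
merely a stand-in for whichever model EYE is actually evaluating, and
the helper names are our own):

```python
# Leave-one-out cross-validation over the N most recent datapoints:
# each held-out point is predicted from a model built on all the
# other points, and the errors are averaged.

def one_nn_predict(train, query):
    # train: list of (input_vector, output); predict with nearest input
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(train, key=lambda p: dist(p[0], query))[1]

def loo_error(points, n_crossval):
    recent = points[-n_crossval:]          # the N most recent datapoints
    errors = []
    for held_out in recent:
        train = [p for p in points if p is not held_out]  # leave one out
        errors.append(abs(one_nn_predict(train, held_out[0]) - held_out[1]))
    return sum(errors) / len(errors)
```

Raising N makes the error estimate more reliable, at the cost of more
computation per model evaluated.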

\subsection{Max. no. attributes}

Not yet available: later versions of EYE will let the user control
this parameter.

\subsection{Query point}

The {\it query point} is a vector specifying a point in input space.
The nth number specifies the value of the nth input variable.  The
{\it query point} is used by several EYE functions:

\begin{itemize}
\item {\bf Analysis} gives information about predictions, gradients,
and confidence intervals for the current {\it query point}.

\item {\bf Graph} holds all the inputs (other than the
one currently being graphed) at their values in the {\it query point}.

\item {\bf Predict} makes its prediction about the current {\it query
point}.  See section~\ref{predict}.
\end{itemize}

The initial value assigned to the {\it query point} when a new
datafile is opened is the midpoint of the range of inputs, i.e. the
nth number in the {\it query point} is midway between the lowest and
highest values taken by the nth input variable.
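
That default can be sketched in a couple of lines (illustrative Python,
not EYE's own code):

```python
# Default query point: for each input column, the midpoint between the
# lowest and highest values seen in the datafile.

def default_query_point(input_rows):
    columns = list(zip(*input_rows))
    return [(min(col) + max(col)) / 2.0 for col in columns]

inputs = [[2.0, 2.0, 2.0, 15.0],
          [2.0, 2.0, 4.0, 25.0]]
query = default_query_point(inputs)  # [2.0, 2.0, 3.0, 20.0]
```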

{\bf AutoRSM} and {\bf Optimize} both set the {\it query point} (see
section~\ref{autorsm} and section~\ref{optimize} for details).

\subsection{Testfile}

Not yet available: later versions of EYE will let the user control
this parameter.

\subsection{GMString}
\label{gmstring}

{\it GMStrings} are enigmatic entities that encapsulate descriptions
of function approximators.  Function approximators lie at the heart of
EYE.  For instance {\bf BlackBox} hunts for the function approximator
that most accurately models the data, and when it ceases running, {\it
GMString} will be set to a representation of the best function
approximator it has found.

The following is an example of a {\it GMString}:
\begin{verbatim}
       L24:SN:93--9
\end{verbatim}
In this section we describe how to interpret {\it GMStrings}.  This
description is not for the faint-hearted.  If you prefer, you can skip
this section, and rely on the {\bf Edit} and {\bf Inspect} functions
to give you a more digestible explanation of any {\it GMStrings} that
you encounter.

\subsubsection{Interpreting GMStrings: The three characters before the 
first colon}

The first character of a {\it GMString} specifies the type of local
model to use during regression.  EYE supports five types of local
model:
\begin{itemize}
\item `A': local averaging (kernel regression).
\item `L': locally linear regression.
\item `C': part way between locally linear and locally quadratic
regression: this includes a term containing the sum of the squares
of all the inputs in addition to the linear terms.
\item `E': part way between locally linear and locally quadratic
regression: this includes terms for the squares of each input, but
does not contain any cross-terms (the product of two or more inputs).
\item `Q': locally quadratic regression.
\end{itemize}

The second character of a {\it GMString} specifies how much smoothing
the function approximator uses.  This ranges from 0 (no smoothing at
all) to 9 (a fully global model).  For values from 1 to 8, the data is
partially smoothed: the local model is built from data weighted by a
gaussian centered at the current query point.  The smaller the
standard deviation of the gaussian, the more the data will be biased
toward points close to the query, hence the more local the model.  For
a value of 8, the standard deviation is set to one half of the width
of the input space; for a value of 7, the standard deviation is set to
one quarter of the width of the input space; each time the number
drops by one, the standard deviation is halved again.  Hence a value
of 1 represents a very local model with almost no smoothing, and a
value of 8 leads to considerable smoothing.
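
The halving rule above amounts to a standard deviation of
width$/2^{9-v}$ for smoothing values $v$ from 1 to 8, which can be
sketched as follows (illustrative Python, not EYE's own code):

```python
# Gaussian standard deviation implied by the smoothing character:
# value 8 gives half the input-space width, 7 a quarter, and each
# step down halves it again (0 means no smoothing, 9 a global model).

def smoothing_sigma(value, width):
    assert 1 <= value <= 8, "partial smoothing applies only to values 1..8"
    return width / 2 ** (9 - value)

half = smoothing_sigma(8, 1.0)     # 0.5 of the input-space width
quarter = smoothing_sigma(7, 1.0)  # 0.25 of the input-space width
```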

The third character of a {\it GMString} specifies the minimum number of
nearest neighbors to include in the local regression.  If this is,
say, three, then the three nearest neighbors of a {\it query point}
will always be fully weighted when making predictions, even if they
aren't particularly close to the {\it query point}.

Returning to our example {\it GMString}, L24:SN:93--9, we can now see
that it represents a function approximator that uses locally linear
regression, with little smoothing, and that always weights the four
nearest neighbors fully.

\subsubsection{Interpreting GMStrings: The two characters between the 
colons}

EYE uses data structures called {\it kdtrees} to make predictions more
computationally efficient.  The kdtrees can be used in three modes,
indicated by the value of the first character between the colons:
\begin{itemize}
\item `S': slow mode; this is the most accurate mode and yields exactly
the same answers as conventional prediction methods.
\item `M': medium mode; this is almost as accurate as the slow mode,
but runs more quickly.
\item `F': fast mode; this gives less accurate (but still good)
predictions and runs much more quickly.
\end{itemize}

The second character between the colons is usually set to `N',
meaning that the time variable---if it is even included in the
data---should not be treated specially.  If time is included in the
data, however, you may wish to set this character to `W'.  EYE will
then only use datapoints older than the current query point when
making a prediction (i.e. it won't cheat and peer through a time
portal into the future).

Returning to our example {\it GMString}, L24:SN:93--9, we now also
know that the corresponding function approximator will use kdtrees in
the slow but fully accurate mode, and that it will not treat time
specially.

The default value for this pair of characters is `SN'.  It is legal to
omit this section of the {\it GMString} if it takes the default value.
Thus our example, L24:SN:93--9, could be legally abbreviated as
L24:93--9.

\subsubsection{Interpreting GMStrings: The characters after the final
colon}

The characters after the final colon in a {\it GMString} describe
how much each input should be weighted.  The standard format for this
description uses the nth character in the sequence to specify the
weighting for the nth input variable:
\begin{itemize}
\item `9': indicates that the input variable should be fully weighted.
\item `8': indicates that this input variable should be given a
weighting of a half.
\item `7': indicates that this input variable should be given a
weighting of a quarter.
\item The weighting continues to halve each time the number is reduced
by one, meaning less and less priority should be given to this input,
until ...
\item `0': indicates that the input variable should be left out of
any local models, but included if fully global regression is used.
\item `-': indicates that the input variable should be completely
ignored.
\end{itemize}

Thus we now see that the function approximator given by our example
{\it GMString}, L24:SN:93--9, ignores the third and fourth inputs
entirely, weakly weights the second input, and fully weights inputs
one and five.
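
Putting the three sections together, a decoder for this layout can be
sketched as follows (an illustrative reader written in Python; it is
not EYE's own parser, and the field names are our own):

```python
# Decode a GMString of the form described in this section.  Digit
# weights follow the halving rule: '9' is 1.0, '8' is 0.5, '7' is
# 0.25, and so on; '0' is excluded from local models; '-' is ignored.

def decode_gmstring(s):
    parts = s.split(':')
    if len(parts) == 2:            # the default 'SN' section may be omitted
        parts = [parts[0], 'SN', parts[1]]
    head, kdtree, weights = parts
    def weight(c):
        if c == '-':
            return None            # input completely ignored
        if c == '0':
            return 0.0             # left out of local models only
        return 2.0 ** (int(c) - 9)
    return {
        'local_model': head[0],    # A, L, C, E, or Q
        'smoothing': int(head[1]), # 0 (none) .. 9 (fully global)
        'neighbors': int(head[2]), # nearest neighbors always included
        'kdtree_mode': kdtree[0],  # S, M, or F
        'time_mode': kdtree[1],    # N (normal) or W (no future data)
        'weights': [weight(c) for c in weights],
    }

info = decode_gmstring('L24:SN:93--9')
```

Note that the abbreviated form L24:93--9 decodes to exactly the same
function approximator.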

We have now almost completed our survey of {\it GMStrings}.  The one
remaining detail is that curly brackets may be used to shorten the
description of the input weightings.  Suppose we had a large number
of inputs and a {\it GMString} of:
\begin{verbatim}
E35:FN:90000000000000090000000000000000009000000000002000000000000000000
\end{verbatim}
We would like a shorter notation for the lengthy sequence after the
second colon.  We notice that the characters in this sequence are all
zeroes except for 9's in positions 0, 15, and 34, and a 2 in position
46 (positions are counted from zero in this short-hand).  This can be
denoted by the following notation: \{0\}9[0,15,34]2[46].  The value enclosed
in the curly brackets is the default weighting to be given to inputs.
Any inputs with a non-default weighting are specified afterwards by
listing the value they take, then, within square brackets, all the
inputs that take that value.  

Hence a shorter legal notation for the lengthy {\it GMString} we saw
above would be: E35:FN:\{0\}9[0,15,34]2[46].
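
Expanding the curly-bracket short-hand back into a full weighting
sequence can be sketched as follows (illustrative Python, not EYE's
own code; the total number of inputs must be supplied):

```python
import re

# Expand a short-hand such as '{0}9[0,15,34]2[46]': the character in
# curly brackets is the default weighting, and each value[positions]
# clause overrides it at the listed (zero-counted) input positions.

def expand_weights(shorthand, n_inputs):
    match = re.match(r'\{(.)\}', shorthand)
    chars = [match.group(1)] * n_inputs
    for value, positions in re.findall(r'(.)\[([\d,]+)\]', shorthand[match.end():]):
        for pos in positions.split(','):
            chars[int(pos)] = value
    return ''.join(chars)

expanded = expand_weights('{0}9[0,15,34]2[46]', 50)
```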

We have now finished our survey of {\it GMStrings}.  Phew!

Note that the EYE function {\bf Predict} makes its prediction in
accordance with the current {\it GMString}. See section~\ref{predict}.

\subsection{Edit}

The {\it Edit} button brings up a list of the objects that you are
currently allowed to edit.  If you click on one of the objects, EYE
will bring up a window to help you edit it.  

For example, if you have opened the datafile garden.mbl (which should
therefore be the name displayed in the {\it datafile} field of the
advanced dialog box), then the list of objects you can edit should
include an item called {\it Names}.  If you select this, EYE will
bring up the following dialog box:

\centerline{\psfig{file=edit.ps,height=3.5in}}

This lets you edit the names of the variables for the garden.mbl
datafile, and also lets you edit the ranges of those variables.  The
tall list on the left hand side of the dialog contains the items that
you can edit.  A brief explanation of the currently selected item (in
this case Green-Grow-name) is displayed in the large box to the right.
The item's current value is shown in the smaller box below this; to
change the value, simply type your desired value into the current
value slot---for instance you might decide to shorten Green-Grow's
name to just Green.

By default EYE gives very dull names to the variables represented by
the data columns in the datafile, calling the variable for the first
data column {\it attribute0}, that for the second data column {\it
attribute1}, and so forth.  Often you may want to select the Edit
Names option after you open a datafile for the first time, to assign
more descriptive names to your variables.  If you then save the
datafile, EYE will always remember your names in the future.

All edit dialog boxes have the same basic form.  You select the item
you wish to change, edit its value by typing into the {\it current
value} slot of the dialog, and then select the next item you wish to
change.  When you have made all the changes you wish, simply press the
Done button.

\subsection{Inspect}

The {\it Inspect} button brings up a list of the objects that you are
currently allowed to inspect.  If you click on one of the objects, EYE
will display detailed information about that object, explaining what
the object is and showing its current value.

\section{A Compendium of EYE Functions} 
\label{compendium}

In this section we provide a compendium of the EYE functions.  Many of
these functions use additional parameters, which are italicized for
clarity throughout this section.  If you wish to adjust these
parameters, you can do so via the advanced dialog box described in
section~\ref{advanced}.  That section also briefly explains the meaning
of each of the parameters.

\subsection{BlackBox}
\label{blackbox}

{\bf BlackBox} is one of the core components of EYE.  It aggressively
hunts for the model that best explains the data\footnote{To be more
exact: it searches for the function approximator that will give the
best predictive accuracy when used on new data drawn from the same
distribution as the data in the current training and test sets.}.  

{\bf BlackBox} searches over a wide variety of models, and incremental
reports on its findings scroll down the screen as it runs.  As well as
considering different kinds of function approximator, such as nearest
neighbor and kernel regression, it also searches for the best
attribute subsets (determining whether any input variables can be
ignored, and what relative weights should be given to the remaining
inputs).  Each of the models typically has several parameters that
need to be tuned, such as distance metric parameters and smoothing
parameters.  {\bf BlackBox} autonomously searches for the optimal
values of these parameters.  It uses multiple levels of
cross-validation to police itself against overfitting.

Because searching over all models may take a considerable time, the
user can set the {\it Blackbox seconds} parameter to impose an upper
limit on the execution time.  When the time limit is reached, {\bf
BlackBox} stops running and produces a summary of its findings.

If you have run {\bf BlackBox} earlier on---and if you are still using
the same datafile and parameter settings---then a new call to {\bf
BlackBox} will resume from where it last stopped, rather than
duplicating earlier work.  Note that {\bf BlackBox} will also refer to
any results discovered during the EYE function {\bf Search} (see
section~\ref{search}).

{\bf BlackBox} uses the following parameters: {\it
Classification/Regression, Blackbox seconds, Blackbox test,
No.~crossval, Max.~No.~Attributes, Testfile}.

{\bf BlackBox} sets the {\it GMString} parameter to the best function
approximator that it has found for modeling the data.

\subsection{Predict}
\label{predict}

This option predicts the outputs for a given setting of the input
variables.  The value of the input variables to use is determined from
the {\it query point}.  For example, our friend the gardener might want
to see the predicted flower-height and flower-brightness for the
following regimen:
\begin{verbatim}
  Amount of Green-Grow        2.1 
  Amount of Mineral-Drops     5   
  Amount of Water             6
  Temperature                 20
\end{verbatim}

To find EYE's prediction, the gardener must first tell EYE which
regimen she wants to investigate.  This is done by setting the {\it
query point} to the corresponding values.  In this case she would set
the query point to be:
\begin{verbatim}
  2.1  5   6  20
\end{verbatim}
She can then find EYE's prediction by selecting the {\bf Predict}
action, and pressing the RUN button.

Note that {\bf Predict} uses the current {\it GMString} to determine
which model is used when predicting the output.  See
section~\ref{gmstring} for information on {\it GMStrings}.

{\bf Predict} uses the following parameters: {\it
Classification/Regression, GMString, Query point}.

\subsection{Set}

This function allows the user to set the value of an object.  The current
version simply brings up the following dialog:

\centerline{\psfig{file=set.ps,height=1.3in}}

The user types in the object-name and its new value.  For instance, to
change the datafile to vulture.mbl the user would type ``datafile
vulture.mbl'' into the dialog:

\centerline{\psfig{file=set2.ps,height=1.3in}}

\noindent and press the OK button.  To find out which objects can
currently be set in this way, click on Help and then click on Set
(included in the list of GMBL actions).  Future versions of EYE will
have a more user-friendly implementation of {\bf Set}.

\subsection{Graph}

{\bf Graph} draws univariate graphs and bivariate contour plots
showing how a predicted output varies with one or two chosen
inputs.  It gives you four options:
\begin{itemize}
\item Draw a single 1d graph
\item Draw a single 2d graph
\item Draw all 1d graphs
\item Draw all 2d graphs
\end{itemize}
In the first two cases, you get to select which input and output
attributes are involved in the graph.
Predictions are made in accordance with the current {\it GMString}.
For the 1d graphs, $95\%$ confidence intervals on the predictions are drawn.

Later versions of EYE will expand the functionality of {\bf Graph}.
Options will include:
\begin{itemize}
\item Plotting only the known datapoints, not the predicted values
(currently both predicted values and known values are graphed).
\item Plotting inputs against time.
\end{itemize}

{\bf Graph} uses the following parameters: {\it
Classification/Regression, GMString, Query point}.
If you ask to draw a long series of graphs and part-way through you
get tired of waiting for the remainder, you can press 'Q' to stop
the graph generation after the current graph.

\subsection{LOOHistogram}

Not yet available: later versions of EYE will include this function.

\subsection{LOOPredict}

{\bf LOOPredict} displays the predicted values of a set of
leave-one-out (LOO) predictions.  EYE predicts the output for each
input-point in the datafile.  While making its prediction for a given
input-point, EYE must ignore that point's known output (leaving that
one point out of its model: hence leave-one-out).  Predictions are
made in accordance with the current {\it GMString}.

{\bf LOOPredict} uses the following parameters: {\it
Classification/Regression, GMString}.

\subsection{Search}
\label{search}

{\bf Search} allows the user to guide the search for the best model
for the data\footnote{When we say `best model' we mean the function
approximator that will give the best predictive accuracy when used on
new data drawn from the same distribution as the data in the current
training and test sets.}.  In contrast, the EYE function {\bf
BlackBox} autonomously seeks the best model.  The results discovered
during a user-guided {\bf Search} will be remembered and referenced by
later calls to {\bf BlackBox}.  Likewise, {\bf Search} will refer to
the results found during earlier runs of {\bf BlackBox}, so as not to
duplicate earlier work.

If you select the {\bf Search} function, EYE will bring up a dialog
box to help you specify what you wish to search for.  Options include
searching for the best amount of smoothing, the best distance metric
parameters, and the best attribute subsets.  Once you have made your
selection, incremental results will start scrolling down the screen as
EYE performs the search, just as in {\bf BlackBox}.  When {\bf Search}
is finished, it displays a summary of its findings.

To prevent EYE from spending too long hunting for the very best model,
the user can set the {\it Blackbox seconds} parameter to limit the
execution time.  When the time limit is reached, {\bf Search} stops
running and produces a summary of its findings.

{\bf Search} polices itself against overfitting by using multiple
levels of cross-validation.  If {\bf Search} discovers a better model
than any found beforehand, it will set the {\it GMString} to an
encapsulation of this model.  (See section~\ref{gmstring} for an
explanation of {\it GMStrings}.)

{\bf Search} uses the following parameters: {\it
Classification/Regression, Blackbox seconds, Blackbox test,
No.~crossval, Max.~No.~Attributes, Testfile}.

\subsection{IntelliPrinc}

Not yet available: later versions of EYE will include this function.

\subsection{Transform}

Not yet available: later versions of EYE will include this function.

\subsection{Analysis}

{\bf Analysis} displays information about predictions, gradients,
confidence intervals, and noise. All the information is with respect
to the current {\it query point}.

\subsection{AutoRSM}
\label{autorsm}

{\bf AutoRSM} is an experiment design tool. See
Section~\ref{se:autonopt} below.

\subsection{AutoLOOP}
\label{autoloop}

{\bf AutoLOOP} is an experiment design utility. See
Section~\ref{se:autonopt} below.

\section{Auton Optimize: Autonomous Experiment Design}
\label{se:autonopt}

(WARNING: THIS SECTION
IS A PRELIMINARY DRAFT. MORE-USER-FRIENDLY SOFTWARE AND DOCUMENTATION
IS UNDER DEVELOPMENT FOR FUTURE EYE RELEASES)

RSM stands for Response Surface Methods: techniques by which
statisticians choose experiments in order to model and optimize
systems.  The purpose of the RSM software is to automate that process
of choosing experiments.

Imagine you have a widget maker you would like to model and optimize.
Your widget maker has several knobs on the outside of it that may be
adjusted which will affect the speed at which completed widgets come
out.  You would like to find the knob settings that maximize the rate
at which widgets are produced.

If you have a way to set the knobs to lots of different positions
automatically, and it takes a very short time after each setting (say,
less than a second) to observe the widget rate for that setting, then
you might simply try a lot of settings and choose the best.  If,
however, it takes longer to observe how many widgets come out for each
setting, or it is expensive to change the settings because you risk
lost widget production at a poor setting, then you would not want to,
or would not be able to, try lots of settings.  This is where the RSM
software can help you.  It can take into account the expense of
running experiments and choose experiments very carefully to optimize
your widget maker while minimizing the cost incurred.

RSM is specifically designed to optimize noisy systems.  This is good
if your system happens to be a noisy one.  It may also be good even if
you don't think your system is particularly noisy.  Suppose you have
determined that counting the number of widgets that come out in 15
minutes gives an accurate estimate of the production rate at that
setting, so you assume the system you're optimizing is not noisy.
However, if you only watch for one minute, you will see slight
fluctuations in the number you observe.  Your process has then become
a noisy one.

Why would you want to take a non-noisy optimization problem and turn it
into a noisy one?  Suppose that some knob settings are so bad that it
is obvious after one minute that they are far from optimal.  Then you
wouldn't want to waste time watching what happens for a full 15 minutes.
You would like to cut the experiment short and move on to a new one.
If you switch to watching each setting for only one minute and reporting
the noisy result to RSM, it will automatically request more experiments
in the more promising areas and quickly ignore the poor ones.  The software
will give you the effect of running the poor experiments for only a minute
or two while running the good ones for longer periods because it is built
for noisy optimization.

Do you have a widget maker?  If you're still reading this document, the answer
is almost certainly yes.  Do you have a system where decisions are made that 
affect the value of a result?  Do you have an opportunity to try out some
different decisions to see if they'll work better?  If so, then you have a
candidate for our RSM software.  In addition to manufacturing processes, there
are lots of other examples:  

\begin{itemize}

\item
Choosing the parameter settings of a software algorithm (such as the 
configuration and learning rate of a neural network).

\item
Choosing a marketing campaign that will most improve sales.

\item
Given the limited ability to test product designs, choosing the design values
most likely to meet or exceed specifications, thus reducing product
development cost and/or improving the performance of the final design.
\end{itemize}

\subsection{The RSM Methods}

All of the algorithms used in the RSM software package rely on a
function approximator that can provide estimates of the value of
parameter settings, their gradients, the noise level, and confidence
intervals on those estimates.  RSM is integrated with the GMBL (General Memory Based
Learning) package and uses it to provide these estimates from historical data
and new data from experiments.  GMBL is described in another document and its
inner workings will not be discussed here.  We only need to know that it can
provide the estimates listed above.  The rest of this section describes how
each of the different algorithms for optimization works.

\subsubsection{AutoRSM}

AutoRSM is an automation and extension of the techniques that would be used
by a statistician applying response surface methodology.  In the basic RSM
method, experiments are run in a certain region of interest in order to
obtain a local model of the effects of the input variables on the outputs.
The parameter settings of these experiments are chosen in order to maximize
the information gained from each experiment.  Once a particular region of
interest is well understood, a decision is made.  Either we believe that a
local optimum lies within the current region of interest, in which case we
report the optimum, or we move the region of interest to a new area that is
expected to yield better results.

We can now describe the AutoRSM algorithm:

\begin{enumerate}
\item
Choose an initial base point (center of region of interest).

\item
Check to see if we have enough information to follow a gradient
    to a new region of interest.  If so, move the base point to the
    new region of interest and suggest an experiment in the new
    region of interest.  The decision of where to move the base includes
    checking for quadratic ridges and valleys in order to find the
    direction of movement that will be most efficient in getting to
    the optimum.  Go to step 5.

\item
Check to see if we have enough information to say that there is
    a local optimum within this region of interest.  Suggest an experiment
    at or near the optimum.  We may choose an experiment near rather than
    at the optimum in order to get more information to more precisely
    identify its location.  Go to step 5.

\item
Since we will not be moving the base yet, suggest the experiment that will
    add the most information about our current estimate of the gradient at
    the base point.  Go to step 5.

\item
Check for stopping criteria.  Examples include a fixed number of 
    experiments, or the identification of a local optimum.  If no
    criterion is met, return to step 2.
\end{enumerate}
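
The five steps above can be sketched as a loop.  Every helper below is
a hypothetical stand-in for EYE's internal statistical tests, not
actual EYE code:

```python
# Illustrative only: not EYE's API.  Each callable stands in for one of
# AutoRSM's internal decisions.
def auto_rsm(base, can_move_base, move_base, optimum_found,
             probe_optimum, refine_gradient, run_experiment,
             max_experiments=30):
    """Iterate steps 2-5 until the experiment budget is spent."""
    history = []                                 # (x, y) pairs observed so far
    for _ in range(max_experiments):             # step 5: budget stopping criterion
        if can_move_base(base, history):         # step 2: follow the gradient
            base, x = move_base(base, history)
        elif optimum_found(base, history):       # step 3: experiment near the optimum
            x = probe_optimum(base, history)
        else:                                    # step 4: refine the gradient estimate
            x = refine_gradient(base, history)
        history.append((x, run_experiment(x)))
    return base, history
```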

\subsubsection{PMAX}

PMAX stands for predicted maximum.  It and all the other algorithms that
follow choose an experiment by assigning a value to each possible new
experiment; their suggestion is made by searching the space to find the
one with the best value.  Therefore, each of these algorithms is described
by stating its experiment valuation function.

We call the vector of parameters, or inputs, X.  We call the vector of 
results, or outputs, Y.  We use C(X,Y) to represent the cost (or reward)
of performing experiment X and obtaining result Y.  Y = f(X) is the function
mapping parameter settings onto results.  E(f(X)), or E(Y|X), is what our
function approximator tells us the expected result of a particular experiment
is, given all the data it has been trained on so far.  V(X) is the function
that assigns a value to each experiment X.  Using this notation, PMAX is
defined as:
\begin{equation}
V(X) = C(X,E(f(X)))
\end{equation}
In words, the value is just the cost of the experiment, given that its result
is the expected result.  PMAX will suggest the experiment which has the 
largest, or smallest, value depending on whether we are maximizing or 
minimizing (whether C is a cost or a reward).
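
As a minimal sketch (the names {\tt expected\_y} and {\tt cost} are
illustrative, not EYE's API), the PMAX rule picks the candidate whose
cost at the expected result is best:

```python
# Illustrative only: not EYE's API.
def pmax_value(x, expected_y, cost):
    """V(X) = C(X, E(f(X))): cost evaluated at the expected result."""
    return cost(x, expected_y(x))

def pmax_suggest(candidates, expected_y, cost, maximize=True):
    """Suggest the candidate experiment with the best PMAX value."""
    pick = max if maximize else min
    return pick(candidates, key=lambda x: pmax_value(x, expected_y, cost))
```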


\subsubsection{IEMAX}

IEMAX stands for interval estimation maximization, named after the interval
estimation algorithm proposed by Leslie Kaelbling.  Instead of using the 
expected cost, the algorithm is optimistic and uses the best cost that falls
anywhere within a confidence interval on the expected value of Y.  We use
I(f(X)) to denote all the values of Y inside a confidence interval on the
estimation of f(X). Then IEMAX is defined as:
\begin{equation}
V(X) = \max_{Y \in I(f(X))} C(X,Y)
\end{equation}
Again, IEMAX will suggest the experiment with the highest (or lowest) value.
In the case of minimization, the MAX operator is replaced with the MIN
operator.
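
A coarse sketch of the IEMAX valuation, assuming a hypothetical
{\tt interval} function that returns the confidence interval on f(X)
(all names are illustrative, not EYE's API):

```python
# Illustrative only: not EYE's API.
def iemax_value(x, interval, cost, maximize=True):
    """Best cost achievable for any Y inside the confidence interval I(f(X))."""
    lo, hi = interval(x)
    ys = [lo + (hi - lo) * i / 10.0 for i in range(11)]  # coarse grid over I(f(X))
    pick = max if maximize else min
    return pick(cost(x, y) for y in ys)
```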


\subsubsection{OPTEX}
OPTEX is short for optimal experiment, so named because it is derived
from a crude approximation to the optimal value of an experiment (determining 
the true optimal value is terribly intractable).  The OPTEX valuation involves
two different types of cost function.  One is called the ``online'' cost,
which is the cost of each experiment.  The cost of a set of experiments, then,
is the sum of the online costs of each one.  The second cost is called the
``final'' cost.  In optimization problems there is a criterion being
optimized, and the value of a set of experiments is just the value of the
best result found in the batch.  Often an optimization problem includes both
online and final costs.  The OPTEX experiment valuation consists of three
terms, which will be discussed separately.

The first and second terms operate on the online cost.  The first is 
E(C(X,f(X))).  This is the expected cost of experiment X.  Note the difference
between this and PMAX, which uses the cost of the experiment when the result
is the expected result.  In order to compute this cost we must evaluate the 
following integral over the set of all possible Y's:
\begin{equation}
v_1 = \int C(X,Y) \, p(Y|X) \, dY
\end{equation}
where p(Y|X) is the probability of result Y from experiment X.  The
implementation actually performs only a coarse numerical evaluation of
this integral within a confidence interval.
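
The coarse numerical evaluation the text describes might look like the
following midpoint-rule sketch; {\tt pdf} stands in for p(Y|X), and all
names are illustrative, not EYE's actual implementation:

```python
# Illustrative only: not EYE's implementation.
def expected_cost(x, cost, pdf, lo, hi, n=50):
    """Midpoint-rule approximation of v1 over a confidence interval [lo, hi]."""
    dy = (hi - lo) / n
    total = 0.0
    for i in range(n):
        y = lo + (i + 0.5) * dy              # midpoint of the i-th sub-interval
        total += cost(x, y) * pdf(y, x) * dy  # C(x,Y) * p(Y|x) * dY
    return total
```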

We define Cstar to be the best of the estimated costs at each of the points
we have seen so far.  A true optimal experiment valuation would give this
experiment credit for all future improvements in cost derived from the result
of the experiment.  Since computing that is intractable, we make a gross
simplification.  We assume that after this experiment, our policy will be to
choose the best experiment we've seen from then on (we will not do that,
of course).  Then the future improvement in cost from this experiment is the
improvement of its result over the best seen so far, times the number of
experiments we will perform in the future (the integral is again over all 
possible Y's):
\begin{equation}
v_2 = n \int \max(0,\, C(X,Y) - \mbox{Cstar}) \, p(Y|X) \, dY
\end{equation}
n is the number of experiments remaining to be done.  Again, if the objective
is minimization, the max operator becomes a min operator.  The effect of
multiplying by the number of steps to go is to make the experiments more
aggressive when there is a lot of time left, and more conservative when the
trials are almost done.

The third term operates on the final cost function.  It looks like the second
term without the number of steps factored in.  Its purpose is simply to
credit the experiment with any improvement it makes upon the best result
obtained so far with respect to the final cost function.  Adding a subscript
$f$ to label quantities pertaining to the final cost rather than the online
cost, the term is:
\begin{equation}
v_3 = \int \max(0,\, C_f(X,Y) - \mbox{Cstar}_f) \, p(Y|X) \, dY
\end{equation}
Again, the integral is a coarse numerical approximation and the max is a min
in the case of minimization.

The entire experiment value is $V(X) = v_1 + v_2 + v_3$, and OPTEX suggests
the experiment with the maximum (or minimum) value.
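
Combining the three terms with the coarse-integration idea described in
the text gives a sketch like the following; every name is an
illustrative stand-in, not an EYE internal:

```python
# Illustrative only: not EYE's implementation.  Combines the expected
# online cost (v1), the expected improvement over Cstar weighted by the
# remaining experiments (v2), and the final-cost improvement (v3).
def optex_value(x, cost_online, cost_final, pdf, lo, hi,
                c_star_on, c_star_fin, n_remaining, n_grid=50):
    dy = (hi - lo) / n_grid
    v1 = v2 = v3 = 0.0
    for i in range(n_grid):
        y = lo + (i + 0.5) * dy                           # midpoint rule
        p = pdf(y, x) * dy
        v1 += cost_online(x, y) * p                       # expected online cost
        v2 += max(0.0, cost_online(x, y) - c_star_on) * p  # expected improvement
        v3 += max(0.0, cost_final(x, y) - c_star_fin) * p  # final-cost improvement
    return v1 + n_remaining * v2 + v3
```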

\subsection{Using Auton RSM}

From the main EYE menu you may choose the RSM command or the AutoLOOP command.
RSM permits up to seven options, though the first time you use it, many of
these won't be available.
\begin{itemize}
\item
{\bf new}
Create a new AutoRSM project. If there is currently data loaded in
EYE, you will be given the option of using that data, or starting from
no data at all.

\item
{\bf load}
If, in an earlier session, you saved an AutoRSM project, this will
allow you to reload that project in the same state that it was saved.

\item
{\bf choose}
This suggests the next experiment, based on your current optimization
method. It explains how it made its decision with words and graphs.

\item
{\bf edit}
This allows an expert user to alter the optimization method, the task
specification, and other AutoRSM objects described below.

\item
{\bf inspect}
This allows an expert user to inspect the optimization method, the task
specification, and other AutoRSM objects described below.


\item
{\bf observe}
This allows the user to report to AutoRSM the result of performing one
or more of its recommended experiments.  The result need not come from
an experiment which RSM recommended: you can tell AutoRSM whatever input
parameter values you used and the output values you observed.  This will
also change the data in EYE's current datafile, and so you will be able
to run all the other EYE operations (graph, blackbox, etc.) taking the
new data into account.

\item
{\bf save}
This saves the entire state of the autorsm project so that you can load
it for future optimization sessions.
\end{itemize}

\subsection{The RSM Data Structures}

\subsubsection{Task Specification}

The TaskSpec structure is used to define an optimization problem to
the system.
Here is an example:
\begin{verbatim}
                  |  online->optweights NULL  |  final->xweights  NULL   |     
                  |  online->ytarget    NULL  |  final->pn_string NULL   |     
                  |  online->yweights   NULL  |  final->max       TRUE   |     
                  |  online->xtarget    NULL  |  con->fixed_dims  0      |     
                  |  online->xweights   NULL  |  con->fixed_vals  0.5    |     
                  |  online->pn_string  NULL  |  con->lo_vals     0      |     
                  |  online->max        TRUE  |  con->hi_vals     1      |     
                  |  final->optweights  1     |  total_steps      30     |     
                  |  final->ytarget     NULL  |  auto_facode      FALSE  |     
                  |  final->yweights    NULL  |  [usin_size       1]     |     
                  |  final->xtarget     NULL  |  [usout_size      1]     |     
\end{verbatim}
Online and final are the online and final costs.  Each is
specified by a quality structure.  The cost function given by a
quality structure is:
\begin{verbatim}
C(X,Y) = eval(pn) + 
         optweights . Y + 
         (X - xtarget)' * diag(xweights) * (X - xtarget) +
         (Y - ytarget)' * diag(yweights) * (Y - ytarget)
\end{verbatim}
There are four terms.  The first is a parse node.  It is given by specifying
a string expression such as ``5.0 * x[1] - 3.2 * y[0] + sin(x[0])''.  Look in
the source file ``damut/parse.c'' to see its full syntax.  x and y are used
for the input and output variables, but their respective variable names
(described later in the sidat structure) may also be used.  This parse node
could be the sole method of describing cost functions, since each of the
remaining three terms could be written in it.  However, those terms are
listed separately because they are the most common forms for cost functions,
and some algorithms can take advantage of knowing the cost function is in
one of those forms.
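
A sketch of the quality-structure cost, omitting the parse-node term;
vectors are plain Python lists, and all names are illustrative rather
than EYE's own:

```python
# Illustrative only: not EYE's implementation.  Computes
#   optweights . Y
#   + (X - xtarget)' diag(xweights) (X - xtarget)
#   + (Y - ytarget)' diag(yweights) (Y - ytarget)
def quality_cost(x, y, optweights, xtarget, xweights, ytarget, yweights):
    dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))
    sq_dev = lambda v, target, w: sum(wi * (vi - ti) ** 2
                                      for vi, ti, wi in zip(v, target, w))
    return (dot(optweights, y)              # weighted sum of outputs
            + sq_dev(x, xtarget, xweights)  # penalty on deviation from xtarget
            + sq_dev(y, ytarget, yweights)) # penalty on deviation from ytarget
```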

The second term is a weighted sum of the outputs.  This is most useful for
straight minimization or maximization of output values.

The third and fourth terms penalize squared deviation from a target input and
output value respectively.  A common use is to find the X that achieves a
target value of Y rather than minimizing or maximizing it.  Similarly, it may
be desirable to find the X nearest some nominal X that achieves a target or
minimal cost, and thus there might be a penalty on squared deviation from the
nominal.  This is good for regularizing a problem that would otherwise be
ill-defined because many Xs achieve the target Y.  In general, a quadratic
penalty function can include a full matrix of weights.  In practice, it is
rare for this matrix to have non-zero terms anywhere but on the diagonal.
Therefore, the structure specifies the weights as a vector of the weights
on the diagonal.

For simplicity, the algorithms expect the online and final costs to be both
maximization or both minimization.

The constraints describe the range of allowed experiments.  Fixeddims
and fixedvals allow some of the input variables to be fixed at a certain
value and thus excluded from the search.  Fixeddims is an integer vector with
1s for any dimension that should be held fixed and 0s for the ones that should
be experimented over.  The fixedvals vector specifies what the values should
be for the dimensions that are fixed; the values in the fixedvals vector
for dimensions that are not fixed are ignored.
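
As a small sketch of how fixeddims and fixedvals constrain a candidate
experiment (the function name is illustrative, not EYE's API):

```python
# Illustrative only: not EYE's API.
def apply_fixed(x, fixed_dims, fixed_vals):
    """Overwrite the fixed dimensions of x; free dimensions pass through."""
    return [v if f == 0 else fv
            for v, f, fv in zip(x, fixed_dims, fixed_vals)]
```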

Why not just remove from the problem specification any dimensions that you
don't want to search over?  The reason is that you may have data in which
your ``fixed variables'' are set at values other than those you intend to fix
them at.  You would still like to use that data and have it contribute to the
function approximator's estimation, even though you already know what value
you want for those variables.

Fixeddims and fixedvals are equality constraints on some of the variables.
Inequality constraints are specified with the lovals and hivals vectors.
Only axis-parallel constraints may be given and you must either give upper and
lower bounds on all the input variables, or on none of them.  The software
will still operate if you give no bounds, but it is almost always better to
set some reasonable limits on the input variables.

totalsteps is used by OPTEX to compute n in the second term of its valuation
function.  It is the total number of experiments you intend to run including 
any historical results you have.  n is totalsteps minus the number of data 
points already given.

autofacode is a boolean indicating whether you would like to use an algorithm
that will automatically choose the best facode for you.  The facode is part of
the GMBL function approximator and is described later.  Use of this option is
not yet recommended.  In its current form, it always suggests gmstrings of the
form {\tt L?0:SN:\{9\}}, where the ? is filled in by its estimate of the best
smooth code to use, based mostly on leave-one-out cross-validation error.
Because of the special requirements of the RSM algorithms, choosing a facode
based strictly on leave-one-out prediction error is not necessarily the best
thing to do.  It is important that the function approximator extrapolates in a
reasonable way, because the RSM algorithms need reasonable estimates of what
will happen if they explore outside the region currently covered by the data.
Reasonable extrapolation is not something that is directly encouraged by
minimizing leave-one-out error.

\subsubsection{OptPars}

The optpars structure specifies which algorithm you want to use and the
parameters for that algorithm.  Most of these parameters are not that important
to understand, however, because they have good defaults.

The one field you must choose is opttype.  It is one of {\bf rsm},
{\bf pmax}, {\bf iemax}, or {\bf optex}, described earlier.

The rest of the parameters are described with respect to each optimization
algorithm.

\noindent
{\bf RSM}
The AutoRSM method is currently restricted to one output variable.  Currently,
it does not support inequality constraints different from the extents of the
sidat structure; it uses the sidat extents instead of the constraints
structure.  It ignores all the fields of the quality data structure except for
the boolean max in final: it will try to maximize or minimize the single
output value, depending on this boolean, according to the algorithm described
earlier.  It uses the following other parameters:
\begin{verbatim}
radius - Specifies the radius of the region of interest about the base in 
         coordinates scaled to the size of the extent in the sidat data 
         structure described below.  It is also the distance the base is
         moved whenever a base movement is chosen

jumpheight - The amount of increase in the expected output necessary for a
              movement of the base.

jumpconf - The confidence in obtaining the given jumpheight necessary for
            a movement of the base.

bestconf -

locmaxconf - The level of confidence in a local optimum required to declare
              it found and move the next experiment there.

criticaldegrees -

rsmfinalexcite - A boolean saying whether or not to continue small
                   experiments about the location of the local optimum in order
                   to more precisely determine its position.

usesteepest -
\end{verbatim}

\noindent
{\bf PMAX}

PMAX operates on the final cost.  There is only one relevant parameter:
\begin{verbatim}
epsilon - PMAX uses a simplex search to find the best experiment, given its
          experiment valuation formula.  That search is terminated when an
          optimal value has been found to within epsilon.
\end{verbatim}

\noindent
{\bf IEMAX}

IEMAX operates on the final cost.  Its parameters are:
\begin{verbatim}
epsilon - As in PMAX.

conflevel - The width of the confidence intervals used.  For example, 0.95 is
             the default.
\end{verbatim}

\noindent
{\bf OPTEX}

OPTEX uses both online and final cost as described above.  Its parameters are:

\begin{verbatim}
epsilon - As in PMAX.
\end{verbatim}


\subsubsection{OptState}

This is for the optimization algorithms that need to keep some internal state
other than the list of experiments done and historical data. 

\subsection{User Edits of RSM Parameters}
Changes to the RSM parameters are increasingly drastic as shown
in the following list:
\begin{verbatim}
Facode - Change the function approximator used by the RSM algorithm.  This may
         be done as more data is gathered and you have a better idea of what
         yields the best function approximator.

OptState - In general, changing this could be a dangerous thing, since
           optstate is something these algorithms keep for their own benefit.
           Currently, base is the only field to change, and that is fairly
           safe.

OptPars - Changes here amount to changing the solution you are using on
          your problem.

TaskSpec - Changes here change the problem being solved.
\end{verbatim}



\section{Schenley Park Research, Inc.}

EYE is produced by Schenley Park Research, Inc., a software company
committed to bringing advanced statistical, machine learning, and
artificial intelligence algorithms to the marketplace.

For more information on Schenley Park Research contact: 
\begin{verbatim}
  Jeff Schneider                  Email: j.schneider@cs.cmu.edu
  Schenley Park Research, Inc.    Phone: 412-268-2339
  101 Oak Park Place              Fax: 412-268-5571
  Pittsburgh, PA 15243
\end{verbatim}
  
\end{document}

