% most useful Latex command
\newcommand{\comment}[1]{}
\documentclass[12pt,twoside]{report}
\usepackage{fullpage}
\usepackage{cmu-titlepage}
\usepackage{epsf}
\usepackage{times}
%\usepackage[nomarkers]{endfloat}
%\usepackage{changebar}
%\usepackage{lscape}


%\input{restore}

%\renewcommand{\dbltopfraction}{.9}
%\renewcommand{\topfraction}{.9}
%\renewcommand{\textfraction}{.1}
%\renewcommand{\floatpagefraction}{.9}
%\renewcommand{\dbltopnumber}{1}
%\renewcommand{\dblfloatpagefraction}{.9}

%\def\normspacing {\renewcommand{\baselinestretch}{0.9}}
%\def\tinyspacing {\renewcommand{\baselinestretch}{0.7}}

\def\normspacing {}
\def\tinyspacing {}

\normspacing

\def\thesection{\arabic{section}}
\def\thefigure{\arabic{figure}}

%\newcounter {subsubsubsection}[subsubsection]
%\def\thesubsubsubsection {\thesubsubsection.\arabic{subsubsubsection}}
%\def\subsubsubsection {\subsubsection*}

\makeatletter
\long\def\unmarkedfootnote#1{{\long\def\@makefntext##1{##1}\footnotetext{#1}}}
\makeatother

\def\pagefigwidth {\epsfxsize=5in}
\def\colfigwidth {\epsfxsize=5in}
\def\figfontsize {}
%\def\leadfigwidth{\epsfxsize=2in}
%\def\absfigsize {\epsfxsize=3.0in}
%\def\relfigsize {\epsfxsize=3.0in} 

\newcounter{ecount}
\newenvironment{ecompact}{
  \begin{list}%
    {\arabic{ecount}}{\usecounter{ecount}
    \parsep 0pt plus 1pt
    \partopsep 0pt plus 1pt
    \topsep 2pt plus 2pt minus 1pt
    \itemsep 0pt plus 1pt}}%
  {\end{list}}

\newenvironment{icompact}{
  \begin{list}{$\bullet$}{
    \parsep 0pt plus 1pt
    \partopsep 0pt plus 1pt
    \topsep 2pt plus 2pt minus 1pt
    \itemsep 0pt plus 1pt}}%
  {\end{list}}


\begin{document}

\bibliographystyle{acm}

\title{The Case For\\
Prediction-based Best-effort Real-time Systems}

%\comment{
\author{Peter A. Dinda \and Bruce Lowekamp 
\and Loukas Kallivokas \and David R. O'Hallaron}


\citationinfo{A version of this paper appeared in the Seventh Workshop on Parallel and Distributed Real-Time Systems (WPDRTS '99)}

\date{January 1999}

\trnumber{CMU-CS-98-174}

\keywords{distributed real-time systems, interactive applications,
host load prediction}


\support{
Effort sponsored in part by the Advanced Research Projects Agency and
Rome Laboratory, Air Force Materiel Command, USAF, under agreement
number F30602-96-1-0287, in part by the National Science Foundation
under Grant CMS-9318163, and in part by a grant from the Intel
Corporation.
}


\abstract{
We propose a prediction-based best-effort real-time service to support
distributed, interactive applications in shared, unreserved computing
environments.  These applications have timing requirements, but can
continue to function when deadlines are missed.  In addition, they
expose two kinds of adaptability: tasks can be run on any host, and
their resource demands can be adjusted based on user-perceived
quality.  After defining this class of applications, we describe two
significant examples, an earthquake visualization tool and a GIS map
display tool, and show how they could benefit from the service.
Finally, we present evidence that the service is feasible in the form
of two studies of algorithms for host load prediction and for
predictive task mapping.
}

\maketitle
\cleardoublepage

\section{Introduction}



There is an interesting class of interactive applications that could
benefit from a real-time service, but which must run on conventional
reservation-less networks and hosts where traditional forms of
real-time service are difficult or impossible to implement.  However,
these applications do not require either deterministic or statistical
guarantees about the number of tasks that will meet their deadlines.
Further, they are adaptable in two ways: their components are
distributed, so a task can effectively be run on any available host,
and they can adjust the computation their tasks perform and the
communication between those tasks, trading user-perceived quality
against the chances of meeting deadlines.

We propose a best-effort real-time service that uses history-based
prediction to choose how to exploit these two levels of adaptation, so
that most tasks meet their deadlines and the application presents
reasonable quality to the user.  We believe
that such a service is feasible, and while providing no guarantees,
would nonetheless simplify building responsive interactive
applications and greatly improve user experience. 

The paper has two main thrusts.  The first is to define the class of
resilient, adaptable, interactive applications we have described above
and to show that it contains significant real applications.  To this
end, we analyze two applications in detail, explaining the demands
users place on them and the adaptability they provide.  The two
applications are QuakeViz, a distributed visualization system for
earthquake simulations being developed at CMU, and OpenMap, a
framework for interactively presenting map-based information developed
by BBN.

The second main thrust of the paper is to support the claim that
history-based prediction is an effective way to implement a
best-effort real-time service for this class of applications.  We
present two pieces of supporting evidence. The first is a trace-based
simulation study of simple prediction algorithms that use a history of
past running times to predict on which host a task is most likely to
meet its deadline.  One of the algorithms provides near-optimal
performance in the environments we simulated.  The second is a study of
linear time series models for predicting host load that shows that
simple, practical models can provide very good load predictions.
Because running time is strongly correlated with the load experienced
during execution, the fact that load is predictable suggests that
information that can be collected in a scalable way can be used to
make decisions about where to execute tasks.  Together, these two
results suggest that prediction is a feasible approach to implementing
a best-effort real-time service.

\section{Application characteristics}
\label{sec:app_chars}

The applications we are interested in supporting have the following
four key characteristics.

\comment {
\begin{description}
\item[interactive] Computation is initiated or guided by a
human being.
\item[resilient] Missed deadlines are not fatal.
\item[distributable] Their components can be run on any host with the
required resources.
\item[adaptable] Application behavior can be changed
according to the needs of the user and the capabilities of the system.
\end{description}
}

\subsubsection*{Interactivity} 
The applications are interactive---computation takes the form of tasks
that are initiated or guided by a human being who desires
responsiveness.  Achieving responsiveness amounts to providing timely,
consistent, and predictable feedback to individual user actions.  If
the feedback arrives too late or there is too much jitter for a series
of similar actions, the utility of the program is degraded, perhaps
severely.  Research has shown that people have difficulty using an
interactive application that does not respond in this
manner~\cite{EDITOR-RESPONSE-VARIATIONS-EMBLY-81,PSYCH-LIMITS-ON-SYS-RESPONSE-TIME-KOMATSUBARA-97}.
%For example, high-jitter or long latency feedback makes freehand
%drawing impossible in an image editor.  
Our mechanism for specifying timely, consistent, predictable feedback
is the task deadline.

\comment{
Therefore, an application must be designed to provide response
to a user's request by a specified deadline.  The key challenge in
this work is meeting these requirements as well as possible in an
environment lacking the guanteed resources which allow deterministic
performance.
}


\subsubsection*{Resilience} 
The applications are resilient in the face of missed deadlines to the
degree that they do not require either statistical or deterministic
guarantees from the real-time system.  The inability to meet a
deadline does not make these applications unusable, but merely results
in lowered quality.  For example, occasionally dropped frames do not
make video playback unacceptable.  Consistently missing or irregular
frames, however, do.  Resilience is the characteristic that suggests
a best-effort real-time approach, instead of traditional ``soft''
(statistically
guaranteed)~\cite{LOTTERY-OSDI,PROB-JOB-SCHED-DIST-RT-BESTAVROS,TIME-DRIVEN-SCHED-MODEL-RTS-JENSEN-85}
and ``hard'' (deterministically guaranteed) real-time
approaches~\cite{HARD-RTS-STANKOVIC-BOOK-88,MARS-SURVEY}.


\subsubsection*{Distributability} 
The applications have been developed
with distributed, possibly parallel, operation in mind.  We assume
that it is possible to execute their tasks on any of the available
hosts using, for example, mechanisms such as
CORBA~\cite{CORBA-20-ARCH-SPEC-TECHREPORT} or Java
RMI~\cite{JAVA-RMI-SPEC}.  Tasks need not be replicable (stateless),
but any data movement required to execute a task on a particular host
must be exposed, perhaps via an interface repository or reflection
mechanism.  

\comment{
support parallelism internally and are able to take advantage of
predictions designed to optimize the combination of computation and
communication they need to meet their requirements.  Replicable tasks
(those without side-effects) are also useful, because they simplify
reallocation and allow several copies of the tasks to be run
simultaneously to maximize the probability of meeting a deadline.
}

\subsubsection*{Adaptability} 
The applications expose controls, called {\em
application-specific quality parameters}, that can be adjusted to
change the amount of computation and communication resources a task
requires.  Adjustments such as changing resolution, image quality,
frame rate, and response time may be needed in order to deal with
situations where a task's deadline cannot be met by choosing the
appropriate host to run it, or when longer term changes in the
available resources result in many tasks missing their deadlines.
These parameters can also be used as ``knobs'' to adjust the behavior
of the application to meet the user's quality metrics.

\comment{
The applications we are interested in supporting have the following
characteristics:
\begin{enumerate}
\item They are {\em interactive}---computation is initiated or guided by a
human being who desires consistent, predictable response times.
\item They are {\em resilient} in the face of missed deadlines---no
guarantees, either statistical or deterministic about the number of
missed deadlines are strictly necessary.
\item They consist of {\em distributed and Replicable components},
making it possible to execute their tasks on any available host.
\item They expose {\em quality parameters} that can be adjusted to change the
amounts of computation and communication a task requires.
\end{enumerate}

Applications with  these characteristics  expose their requirements in
the form of deadlines for tasks, and provide two degrees of freedom in
controlling how those tasks are executed.  The real-time system adjusts
these control knobs to attempt to meet  as many deadlines as possible.
Because no reservations are provided  by the computing environments we
target, and we are in uncontrollable competing compute traffic

}



\section{QuakeViz}

The Quake project developed a toolchain capable of detailed simulation
of large geographic areas during strong
earthquakes~\cite{QUAKE-JOURNAL-98}.  For example, a simulation was
developed of the response of the San Fernando Valley to an aftershock
of the 1994 Northridge Earthquake.  The finite element representation
of this simulation contains up to 77 million tetrahedra, producing
over 40 million unknowns.  Storing all of the data for each time step
of the simulation would require approximately 6TB.  Even
selective output results in tens of gigabytes of data.


Thorough assessment of the seismicity associated with the geographic
region requires accurate, interactive visualization of the data
produced by the simulation.  To facilitate rapid turn-around in
improving the simulation and in testing different conditions, we are
designing an environment where the visualization can be done on a
researcher's regular desktop, without requiring special hardware
support.


Designing a visualization system for Quake that can be used
interactively on a standard desktop system across unreserved network
resources is a challenge because such systems were not designed to
handle this type of load.  Nevertheless, with proper management of the
visualization data, it is possible to achieve reasonable quality in
this environment.


\begin{figure}
\centerline{\pagefigwidth
\epsfbox{quake_structure.eps}}
\caption{Stages of the Quake data visualization toolchain.
Displacement data follows the upper path.  Density data, which does
not change during the simulation, but may need to be recalculated if
the visualization parameters change, follows the lower path.  Data can
be downsampled on any edge if network bandwidth is limited.}
\label{fig:quake}
\end{figure}
The visualization toolchain is shown in Figure~\ref{fig:quake}.  The
raw simulation data is typically read from storage, rather than from
the running application, because of the simulation's high cost,
scheduling complexities, and the lack of any runtime tunable
parameters in the simulation.  The raw irregular mesh data is first
interpolated onto a regular grid to facilitate processing and
downsampled to a more manageable size.  The resulting grid is then
used to calculate isosurfaces corresponding to various response
intensities.  The displacement and density isosurfaces are combined
with topology information to produce a 3-dimensional image, which is
then drawn on the user's desktop display.


Because the Quake visualization is done offline, the user's input
consists of controlling the parameters of the visualization.  These
operations may include rotating the display, removing surface layers,
zooming in and out, and adjusting the parameters for isosurface
calculation.


In the visualization pipeline shown in Figure~\ref{fig:quake}, we
generally expect the resources needed to retrieve the data and do the
initial interpolation to be available on only a few high-performance
machines, such as those commonly used at remote supercomputing facilities.
Because the output device is determined by the user's location, the
isosurface calculation and scene calculation are the two components
which can be freely distributed to optimize performance.


\subsection{Task location}
The location of these two phases is governed by the combination of
network bandwidth and the computing and graphics resources available to
the user.  In fact, there are reasons to divide the pipeline at any of
the three remaining edges, depending on predicted processor and
network capacity.  Three examples are shown in
Figure~\ref{fig:quake-examples}.
\begin{figure}
\pagefigwidth
\centerline{\epsfbox{quake_options.eps}}
\caption{Different placements of the Quake tasks corresponding to
different resource situations.  (a) A high performance network and
workstation are available for the visualization.  (b) A low-bandwidth
network and PC with a 3D graphics accelerator are used.  (c)  A low-end
PC is used as a framebuffer displaying the result of a completely
remote visualization process.}
\label{fig:quake-examples}
\end{figure}



Figure~\ref{fig:quake-examples}a is the ideal case where resources
allow the full regular mesh produced by the interpolation phase to be
sent directly to the user's desktop.  This would be possible in the
case where there is a large amount of bandwidth between the remote
site and the user's desktop, and the desktop has sufficient
power to perform all of the 3D processing operations in real time.
This situation is unlikely in all but the most powerful environments,
but may offer the best opportunity for interactive use.


In Figure~\ref{fig:quake-examples}b, the isosurface calculation is
performed on the supercomputer doing the interpolation, while the scene
calculation is done on the user's desktop machine.  This could be used
when less bandwidth is available, because the isosurfaces are
significantly lower bandwidth than the grid data.  The isosurface
calculation is also fairly expensive, whereas the scene calculation
can be done quite effectively by a variety of commodity graphics cards
currently available.


A very limited case is shown in Figure~\ref{fig:quake-examples}c,
where the user's desktop is acting only as a framebuffer for the
images.  This setup may be useful if the size of the final pixmap is
smaller than the size of the 3D representation.  This will depend on a
number of factors, such as resolution, number of surfaces,
translucency, and the occlusion of data by opaque geographic features.
This arrangement could also be used in a case where the desktop
machine does not support hardware 3D graphics.



\subsection{Quality parameters}
Frame rate, resolution, and interactive response time are the primary
quality parameters that can be adjusted to meet the user's
requirements.  The location of the different tasks was discussed above.
Now consider some of the other adaptations that can take place:


\begin{itemize}
\item While running uninterrupted, the frame rate is determined by the
constant delays in the pipeline.  If lower response latency is
desired, the resolution of the images can be reduced at any edge of
the pipeline.  This can be important because issuing a change request
all the way back to the supercomputer processing the images may incur
high latency.


\item Although zooming in on the data can be performed at any stage of
the pipeline, it makes sense to propagate the change back to the
interpolator, so that it can maximize the amount of data produced
representing the region of interest.  The combination of these
adjustments makes it possible to provide quick response time to
adjustments to the visualizer by temporarily reducing image quality
until the adjustments can pass through the entire pipeline.


\item Even if the system is running with the user's terminal being
used only as a framebuffer, that image can be used for most
adjustments in a temporary manner.  For instance, although zooming in
won't immediately reveal any more details, it will provide the user
with the positive feedback required to maintain usability.
\end{itemize}

\comment{
(i) Description: In short, QuakeViz deals with the visualization and
post-processing of data sets that result from large-scale finite
element simulations of the earthquake-induced ground motion in large
basins.

(ii) Background: It is of great interest and significance, for
seismologists, engineers, urban planners and policy makers alike, to
be able to predict as accurately as possible the way a given
geographic location will react to future strong earthquakes. The
knowledge of the specific response characteristics of a geographic
location, i.e. its seismicity, is typically incorporated into design
codes that guide practicing engineers in erecting structures that are
safe and tailored to the local seismic expectations. To be able though
to assess the seismicity associated with a large geologic structure (>
10,000 ml3) it becomes necessary to simulate its response in a
multiple-scenario mode under excitations that are provided by past or
even future-anticipated earthquakes.

We have previously developed a toolchain that allows us to recover the
volume and surface motion characteristics of large geographic areas
that result when the soil is disturbed by the action of a strong
earthquake. Using the motion data, it is possible to arrive at
characterizations of the area's seismicity. We have used the toolchain
to, amongst others, simulate the response of the San Fernando Valley
in California to an aftershock of the 1994 Northridge Earthquake. The
core of the simulation engine is based on finite element computations
on an unstructured mesh. For the finest mesh created thus far, there
resulted 77 million tetrahedra and 13.3 million nodes giving rise to
approximately 40 million unknowns (the components of the displacement
vector). The simulation takes place in the time domain that is
discretized into equidistant time-steps. Thus, the simulation of a
one-minute earthquake-induced motion, using 20,000 time steps, will
result in, approximately, 6TB of data. Selective output might, at
best, reduce the data set to several tens of gigabytes of data. The
data set represents nodal values on the vertices of a 3D unstructured
mesh and its size prohibits multiple copy storage for local
processing. To convey the information encased in the data set to
analysts, seismologists, engineers, designers, etc, while allowing a
certain level of interaction, it becomes necessary to post-process and
visualize the data set on a highly heterogeneous networked
environment.

We next identify the visualization and processing tasks according to
whether they are compute- or communication-intensive.

(iii) Computational tasks

From 3d-unstructured grid data: 

surface data (regular grid extrapolations) - isosurface 

overlays with terrain and

urban landscape data (may entail extrapolation or downsampling) -

isosurface 

Zooming (near-source effects, edge-effects) 

higherresolution needs - isosurface Volume slicing (entails regular grid
extrapolations) 

isosurface Local/global response spectra (ffts,
searches for maxima, multiple freqs) 

Correlations with existing damage
distributions (energy measures + overlays) 

(iv)
Communication-intensive tasks
}


\section{OpenMap}
\label{sec:openmap}

\begin{figure*}
\centerline{
\epsfxsize=6.5in
\epsfbox{openmap.eps}
}
\caption{Structure of an OpenMap application.  Solid arrows represent
the flow of data, while dotted arrows represent user requests.}
\label{fig:openmap}
\end{figure*}

BBN's OpenMap is an architecture for combining geographical information
from a variety of different, separately developed sources in order to
present a unified, coherent visual representation, in the form of a
multi-layered map, to the end user~\cite{OPENMAP-WEB-PAGE,QOS-GROUPWARE-MUTT-OPENMAP-99}.
%
% there is an openmap paper I could reference here, but I can't find
% the damn thing, either my copy or via inspec.
%

OpenMap consists of four different kinds of components. Geographical
information is provided by third party {\em data sources}, which have
unique interfaces.  A {\em specialist} encapsulates a specific data
source, hiding the details of accessing it behind a uniform
CORBA interface.  The interface is based on sequences of objects
to be drawn.  A specialist has a corresponding {\em layer} that draws
an individual map based on the drawing objects.  Finally, a {\em map
bean} manages a group of layers, overlaying their maps to produce a
single combined map for the user.  Map beans and layers are Java
Beans, which can be conveniently embedded into Java applications. 

In Figure~\ref{fig:openmap}, we show the structure of an example
OpenMap application where information from separate terrain and
political boundary data sources is combined to present the user with
a map of the Boston area. While the structure shown in
Figure~\ref{fig:openmap} appears at first glance to be a pipeline, it
is important to note that it actually operates in a request-response
manner.  Computation happens only when the user decides to change the
{\em projection} of the map (the set of layers and the region of the
planet that is being viewed).

OpenMap is interactive---computation happens as a direct result of a
projection change.  To provide a good user experience, the time from a
projection change to the resulting map display should be short,
consistent, and predictable.  A good abstraction for this requirement
is a deadline placed on the computation initiated by a projection
change.  Achieving such deadlines is challenging because specialists
and data sources may be located at distant sites and run on shared,
unreserved hosts communicating via the Internet.  However, missing
OpenMap deadlines only degrades the user's experience---OpenMap is
resilient.
 
The components of OpenMap were designed from the start to be
physically distributed using CORBA communication mechanisms.  We can
exploit this to build replicated specialists and data sources,
as we highlight in gray in Figure~\ref{fig:openmap}.  This provides a
choice of which specialist is used to satisfy a projection change for
a given layer, which is one of the degrees of freedom that a
best-effort real-time system can exploit.

An OpenMap specialist can also expose two quality parameters to a
best-effort real-time system.  First, it can adjust the query sent to
its data source to change the amount of computation that needs to be
done to satisfy the request and the size of the response.  This is a
useful adaptation when all of the replicated data sources are
simultaneously too busy.  The second quality parameter a specialist
can expose is the size of the response it sends to its layer.  If
the communication path from the specialist to the layer should become
very constrained, the specialist can send a lower quality drawing.


\comment{
For example, suppose the user decides to ``zoom''
in to display central Boston.  The map bean will request that the two
layers redraw themselves given the new projection.  Each layer must
then request a more detailed rendering from their corresponding
specialists, which in turn request new data from their corresponding
data sources.   It is only at this point that data begins to flow
toward the user.  


i) Description

(1) User requests to Bean result in layer updates

(2) Layer updates involve requests to specialists that take their local information and render polygons (well, drawing commands), which get sent back to the Bean, which merges them with other updates and draws

ii) Interactive - obvious, example of movement 

iii) Resilient to missed deadlines - guarantees not needed

(1) Non-JTF planner example

iv) Distributed and Replicable components

(1) Java Bean, specialists, replicated specialists

v) Application-unaware adaptively

(1) Choice of specialist in replicated setup

vi) Application-aware adaptively 

(1) Location of rendering component (weak, probably toss this)

vii) Communication/Compute characteristics

(1) Data sources are almost necessarily physically distributed - can argue that there is always some new data source that is of interest

(2) Lots of data is static and easily Replicable - political boundaries, vegetation, etc

(3) Other data changes slowly and can be replicated with low overhead for updates

(4) Example of quickly changing data - whiteboardish planning application - drawing layer

}

\section{Structure of a best-effort real-time system}

\begin{figure}
\epsfysize=2in
\centerline{\epsfbox{system_model.eps}}
\caption{Structure of the best-effort real-time system.  The highlighted
portion is the part of the design discussed in this paper.}
\label{fig:struct}
\end{figure}

Figure~\ref{fig:struct} illustrates the structure of our proposed
best-effort real-time system.  The boxes represent system components.
Thin arrows represent the flow of measured and predicted information,
thick arrows represent the control the system exerts over the
application, and the dotted arrow represents the flow of application
requests.  We have highlighted the parts of the design that we discuss
in this paper.

To be used with our system, an application must have the attributes we
discussed in Section~\ref{sec:app_chars}: the execution of the
application is decomposable into tasks with well-defined inputs,
outputs, and resource requirements; these tasks can be run on any of
the available hosts; and the end-user's quality (and the computation
and communication required) can be changed via application-specific
quality parameters.

The system accepts requests to run tasks from the application and then
uses a mapping algorithm to assign these tasks to the hosts where they
are most likely to meet their deadlines.  A task mapping request
includes a description of the data the task needs, its resource
requirements (either supplied by the application programmer or
predicted based on past executions of the task), and the required
deadline.  The mapping algorithm uses predictions of the load on each
of the available hosts and on the network to determine the expected
running time of the task on each of the hosts, and then runs the task
on one of the hosts (the target host) where it is likely to meet its
deadline with high confidence.  If no such host exists, the system can
either map the task suboptimally, hoping that the resource crunch is
transient and will soon disappear, or adjust the quality
parameters of the application to attempt to reduce the task's
computation and/or communication requirements or to add more slack to
the deadline.
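To make the mapping step above concrete, the following Python sketch
chooses a target host given per-host running-time confidence
intervals.  It is illustrative only: the names \texttt{map\_task} and
\texttt{predict}, and the fallback policy when no host is confident,
are our assumptions, not part of the system as specified.

```python
def map_task(task_nominal_time, deadline, hosts, predict):
    """Pick the host where the task is most likely to meet its deadline.

    predict(host, t_nominal) returns a (low, high) confidence interval
    for the task's running time on that host.  (Hypothetical interface.)
    """
    candidates = []
    for host in hosts:
        low, high = predict(host, task_nominal_time)
        if high <= deadline:                 # meets deadline with high confidence
            candidates.append((high, host))
    if candidates:
        # any candidate suffices; pick the one with the most slack
        return min(candidates)[1]
    # No confident host: map suboptimally to the best expected case and
    # let higher-level adaptation adjust quality parameters if misses persist.
    return min(hosts, key=lambda h: sum(predict(h, task_nominal_time)) / 2)
```

For example, with two hosts whose predicted intervals are (3,\,6) and
(2,\,4) and a deadline of 5, only the second host is chosen with high
confidence.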

The choice of target host and the choice of application-specific
quality parameters are adaptation mechanisms that are used at
different time scales.  A target host is chosen for every task, but
quality parameters are only changed when many tasks have failed to
meet their deadlines, or when it becomes clear that sufficient
resources exist to improve quality while still ensuring that most
deadlines can be met by task mapping.  Intuitively, the system tries
to meet deadlines by using adaptation mechanisms that do not affect
the user's experience, falling back on mechanisms that do affect it
only when that fails.  Of course, the system can also make these adjustments
pre-emptively based on predictions.

The performance of the system largely reflects the quality of its
measurements and predictions.  The measurement and prediction stages
run continuously, taking periodic measurements and fitting predictive
models to series of these measurements.  For example, each host runs a
daemon that measures the load with some periodicity.  As the daemon
produces measurements, they are passed to a prediction stage which
uses them to improve its predictive model.  When servicing a mapping
request, the mapping algorithm requests predictions from these models.
The predictions are in the form of a series of estimated future values
of the sequence of measurements, annotated with estimates of their
quality and measures of the past performance of the predictive model.
Given the predictions, the mapping algorithm computes the running time
of the task as a confidence interval whose length is determined by the
quality measures of the prediction.  Higher quality predictions lead
to shorter confidence intervals, which makes it easier for the mapping
algorithm to choose between its various options.
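As a rough illustration of how a prediction stage might annotate its
estimates with quality measures, consider the sketch below.  The
exponentially weighted model and the 2-sigma interval are illustrative
assumptions of ours, not the linear time series models studied later
in the paper.

```python
class LoadPredictor:
    """Hypothetical one-step-ahead load predictor that tracks its own
    past errors so each prediction carries a confidence interval."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha      # smoothing weight for the mean estimate
        self.mean = None        # current estimate of the load
        self.err_sq = 0.0       # exponentially weighted squared prediction error

    def update(self, measurement):
        """Fold a new load measurement into the model."""
        if self.mean is None:
            self.mean = measurement
            return
        err = measurement - self.mean
        self.err_sq = self.alpha * err * err + (1 - self.alpha) * self.err_sq
        self.mean = self.alpha * measurement + (1 - self.alpha) * self.mean

    def predict(self):
        """Return (estimate, half_width): a load prediction annotated with
        a rough confidence half-width based on past one-step errors."""
        half_width = 2.0 * self.err_sq ** 0.5
        return self.mean, half_width
```

A mapping algorithm would convert such annotated load predictions into
running-time confidence intervals: tighter intervals make the choice
between hosts easier, as described above.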


\section{Evidence for a history-based prediction approach}

Is history-based prediction a feasible approach to implementing a
best-effort real-time service for the application domain we described
earlier?  In this section, we present two pieces of evidence that
suggest that it is.  The first is a trace-based
simulation study of simple mapping algorithms that use past task
running times to predict on which host a task is most likely to meet
its deadline.  The study shows that being able to map a task to any
host in the system exposes significant opportunities to meet its
deadline.  Further, one of the algorithms provides near optimal
performance in the environments we simulated.  The second piece of
evidence is a study of linear time series models for predicting host
load.  We found that simple, practical models can provide very good
load predictions, and these good predictions lead to short confidence
intervals on the expected running times of tasks, making it easier for
a mapping algorithm to choose between the hosts.  Network prediction
is of considerable current interest, so we conclude with a short
discussion of representative results from the literature.


\subsection{Relationship of load and running time}
\label{sec:relationship}

The results in this section depend on the strong relationship that
exists between host load and running time for CPU-bound tasks.  We
measure the host load as the Unix one-minute load average and sample
it every second.  We collected week-long traces of such measurements
on 38 different machines in August 1997 and March 1998.  We use these
extensive traces to compute realistic running times for the simulation
experiments of Section~\ref{sec:simple_prob}; in
Section~\ref{sec:load_pred} we predict the load measurements directly,
with an eye to using those predictions to estimate running times.  A detailed
statistical analysis of the traces is available
in~\cite{DINDA-STAT-PROPS-HOST-LOAD-LCR98}.  Considering load as a
continuous signal $z(t)$, the relationship between load and running
time $t_{running}$ is
\begin{displaymath}
\int_{0}^{t_{running}}\frac{1}{1+z(t)}dt=t_{nominal}
\end{displaymath}
where $t_{nominal}$ is the running time of the task on a completely
unloaded host.  Notice that the integrand, $1/(1+z(t))$, is the
fraction of the CPU available to the task at time $t$, so the integral
accumulates the task's nominal CPU seconds; the task completes when
the total reaches $t_{nominal}$.  Further details on how
this relationship was derived and verified can be found
in~\cite{DINDA-LOAD-PRED-TR-98}.  
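As a concrete illustration, this relationship can be inverted
numerically to synthesize a running time from a trace of load samples,
much as the trace-driven simulation experiments of
Section~\ref{sec:simple_prob} do.  The sketch below is illustrative
only: it assumes one load sample per second and a purely compute-bound
task, and the function name and trace format are our own.

```python
def running_time(load, t_nominal, dt=1.0):
    """Numerically invert the integral relating load to running time.

    `load` is a sequence of load samples z(t) taken every `dt` seconds.
    The task accumulates 1/(1+z(t)) "nominal CPU seconds" per wall-clock
    second; it finishes when the accumulated total reaches t_nominal.
    """
    done = 0.0      # nominal CPU seconds accumulated so far
    elapsed = 0.0   # wall-clock seconds elapsed
    for z in load:
        share = 1.0 / (1.0 + z)  # CPU share during this interval
        if done + share * dt >= t_nominal:
            # task finishes partway through this interval
            return elapsed + (t_nominal - done) / share
        done += share * dt
        elapsed += dt
    raise ValueError("trace too short for task to complete")

# On an unloaded host (z = 0) the task runs in exactly t_nominal seconds.
print(running_time([0.0] * 10, 5.0))   # 5.0
# A constant load of 1 halves the CPU share, doubling the running time.
print(running_time([1.0] * 20, 5.0))   # 10.0
```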

\subsection{Simple prediction-based mapping algorithms}
\label{sec:simple_prob}
\newcommand{\RangeCounter}[0]{{\em RangeCounter(W)}}

Intelligently exploiting the freedom of which host to map a task to
can significantly increase the number of tasks that meet their
deadlines compared to strategies such as always mapping to the same
host or mapping to random hosts.  We reached this conclusion by
developing and evaluating a variety of simple mapping algorithms for
mapping compute-bound tasks.

We developed and evaluated nine different mapping algorithms that use
a past history of task running times on the different hosts to predict
the host on which the current task's deadline is most likely to be
met.  These algorithms included several different window-average
schemes, schemes based on confidence intervals computed using
stationary IID stochastic models of running times, and simple neural
networks.  For conciseness, we will only discuss the most successful
of the algorithms here.  That algorithm, which we call \RangeCounter,
associates a quality value, initially zero, with each host.  To choose
a host, the following steps are taken: if no host has a quality value
significantly higher than zero, a random host is chosen; otherwise, the
host with the highest quality value is chosen.  Next, the quality
values of all hosts are reduced (aged) by a small amount.  If the
mapping results in the deadline being met, the chosen host's quality
value is increased by the inverse of our statistical confidence in it.
The statistical confidence is the cumulative probability over the
deadline range of a Normal distribution parameterized by the sample
mean and variance of previous task running times on the host.  If the
mapping is unsuccessful, we divide the chosen host's quality value by
two.
\RangeCounter\ rewards good choices in proportion to how unlikely they
appeared to be to bear fruit, and punishes all bad choices equally, by
halving their quality.  The idea is to encourage exploration and risk
taking, but to react quickly and decisively to failure.  The aging
process discourages complacency.
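The steps above can be sketched in Python.  The aging rate, the
threshold for ``significantly higher than zero,'' and the treatment of
the confidence value are illustrative assumptions; we do not give
exact constants here.

```python
import random

class RangeCounter:
    """Sketch of the RangeCounter mapping heuristic.  The aging rate,
    exploration threshold, and confidence handling are assumptions."""

    def __init__(self, n_hosts, aging=0.01, threshold=0.1):
        self.quality = [0.0] * n_hosts
        self.aging = aging
        self.threshold = threshold

    def choose(self):
        best = max(range(len(self.quality)), key=lambda h: self.quality[h])
        if self.quality[best] <= self.threshold:
            best = random.randrange(len(self.quality))  # explore randomly
        # age every host's quality value by a small amount
        self.quality = [max(0.0, q - self.aging) for q in self.quality]
        return best

    def update(self, host, met_deadline, confidence):
        """confidence: cumulative probability of meeting the deadline
        under a Normal model of past running times on this host."""
        if met_deadline:
            # reward in inverse proportion to how likely success appeared
            self.quality[host] += 1.0 / max(confidence, 1e-6)
        else:
            self.quality[host] /= 2.0  # react decisively to failure
```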

We evaluated the algorithms with a trace-driven simulator that uses
the load traces described in Section~\ref{sec:relationship} to
synthesize highly realistic execution times.   The following loop was
simulated: 
\begin{verbatim}
for i=1 to N do
   MAP task() IN t_max
endfor
\end{verbatim}
where the nominal execution time, $t_{nominal}$, of
\verb.task. is either constant or chosen from an
enumerated distribution, and the deadline, $t_{max}$, can be varied.
Communication costs are ignored here---\verb.task. is entirely
compute-bound and requires no inputs or outputs to be communicated.

Given a test configuration of a group of hosts (represented by their
load traces), the simulator determines how many tasks meet their
deadlines for each algorithm.  In addition, the simulator also
evaluates a random mapping and determines the maximum (optimal) number
of tasks that could have met their deadlines using all of the hosts,
and using only the best individual host.

We studied six different configurations of hosts, and for each one
we used eight different values of the nominal execution time
$t_{nominal}$ and twelve different deadlines $t_{max}$, ranging from
$t_{nominal}$ to three times $t_{nominal}$.  We simulated $N=100,000$
tasks for each of these 576 different combinations for a total of 57.6
million simulated tasks.  An additional 72 combinations (7.2 million
tasks) in which $t_{nominal}$ varied were also simulated.

\begin{figure*}
\centerline{
\begin{tabular}{cc}
\epsfxsize=3.25in
\epsfbox{nom_8mm.eps} &
\epsfxsize=3.25in
\epsfbox{const_8mm.eps} \\
(a) & (b) \\
\end{tabular}
}
\caption{Representative data from our study of simple prediction-based
mapping algorithms:  (a) the effect of varying nominal execution time,
 (b) the effect of increasing deadline slack.}
\label{fig:predmap}
\end{figure*}

Figure~\ref{fig:predmap}(a), which is representative of our
overall results, illustrates the effect of varying the nominal time,
$t_{nominal}$, with a tight deadline of $t_{max}=t_{nominal}$.  The
configuration being studied contains 8 hosts with a wide variety of
different load behaviors.  The x-axis in the figure is the nominal
time, $t_{nominal}$, and ranges from 50 milliseconds to 60 seconds,
while the y-axis is the percentage of tasks that meet their deadlines.
In addition to the performance of the \RangeCounter\ algorithm, the
performance of random mapping, always mapping to the best individual
host, and mapping optimally to all of the hosts are also plotted.

Notice that there is a substantial gap between the performance of the
optimal mapping algorithm and the performance of random mappings.
Further, it is clear that always mapping to the same host, even the
best one, does not result in an impressive number of tasks meeting
their deadlines.  These differences suggest that a good mapping
algorithm can make a large difference.  We can also see that
\RangeCounter\ performs quite well, even for fairly long tasks.


Even with relaxed deadlines, a good mapping algorithm can make a
significant difference. Figure~\ref{fig:predmap}(b), which
is representative of our results, illustrates the effect of relaxing
the deadline $t_{max}$ (normalized to $t_{nominal}$ on the x-axis) on
the percentage of tasks that meet their deadlines.  As before, data is
plotted for \RangeCounter, random mapping, always mapping to the best
individual host, and optimal mapping.  Notice that there is a large,
persistent difference between random mapping and optimal mapping even
with a great deal of slack.  Further, using the best single host is
suboptimal until the deadline is almost doubled.  On the other hand,
\RangeCounter\ achieves nearly optimal performance even with the
tightest possible deadline.

In a real system, the cost of communication will make fewer deadlines
attainable and will tend to encourage placing tasks near the data they
consume, which may reduce the benefits of prediction.  Still, this
study shows that it is
possible to meet many deadlines by using prediction to exploit the
degree of freedom that being able to run a task on any host gives us.

\subsection{Linear time series models for host load prediction}
\label{sec:load_pred}

Implicitly predicting the current task's running time on a particular
host based on previous task running times on that host, as we did in
the previous section, limits scalability because each application must
keep track of running times for every task on every node, and the
predictions are all made on a single host.

Because running time is so strongly related to load, as we discussed
in Section~\ref{sec:relationship}, another possibility is to have each
host directly measure and predict its own load and then make these
predictions available to other hosts.  When mapping a task submitted
on some host, the best-effort real-time system can use the load
predictions and the resource requirements of the task to estimate its
running time on each of the other hosts and then choose one where the
task is likely to meet its deadline with high confidence.

\def\benefitfigwidth{\epsfxsize=3in}

\begin{figure}
\centerline{
\begin{tabular}{cc}
\benefitfigwidth
\epsfbox{unix16-simple.epsf} &
\benefitfigwidth
\epsfbox{axpfea-simple.epsf} \\
(a) & (b) \\
\end{tabular}
}
\caption{Benefits of prediction: (a) server machine with 40 users and
long term load average of 10, (b) interactive cluster machine with long term
load average of 0.17.}
\label{fig:benefits}
\end{figure}


Load predictions are unlikely to be perfect, so the estimates of
running time are actually confidence intervals.  Better load
predictions lead to smaller confidence intervals, which makes it
easier for the mapping algorithm to decide between the available
hosts.  We have found that very good load predictions that lead to
acceptably small confidence intervals can be made using relatively
simple linear time series models.  Due to space limits, we do not
discuss these models here, but interested readers will find Box et
al.~\cite{BOX-JENKINS-TS} or Brockwell and
Davis~\cite{INTRO-TIME-SERIES-FORECASTING} to be worthy introductions.
Figure~\ref{fig:benefits} illustrates the benefits of such models on
(a) a heavily loaded server and (b) a lightly loaded interactive
cluster machine.  In each of the graphs we plot the length of the
confidence interval for the running time of a one second task as a
function of how far ahead the load predictions are made.
Figure~\ref{fig:benefits}(a) compares the confidence intervals for a
predictive AR($9$) model and for the raw variance of the load.  Notice
that for short-term predictions, the AR($9$) model provides confidence
intervals that are almost an order of magnitude smaller than the raw
signal.  For example, when predicting 5 seconds ahead, the confidence
interval for the AR($9$) model is less than 1 second while the
confidence interval for the raw load signal is about 4 seconds.
Figure~\ref{fig:benefits}(b) shows that such benefits are also
possible on a very lightly loaded machine.
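The conversion from a load prediction to a running-time confidence
interval can be sketched as follows.  This sketch makes two
simplifying assumptions for illustration: load is roughly constant
over a short task, so $t_{running} \approx t_{nominal}(1+z)$, and the
prediction error is Normally distributed.

```python
import math

def running_time_ci(t_nominal, z_pred, mse, level=1.96):
    """Running-time confidence interval for a short CPU-bound task,
    given a load prediction z_pred and the predictor's k-step-ahead
    mean squared error (mse).  Assumes t_running ~= t_nominal*(1+z)
    and a Normal error model (level=1.96 gives a ~95% interval)."""
    spread = level * math.sqrt(mse)
    lo = t_nominal * (1.0 + max(z_pred - spread, 0.0))  # load >= 0
    hi = t_nominal * (1.0 + z_pred + spread)
    return lo, hi

# A better predictor (smaller mse) yields a tighter interval:
print(running_time_ci(1.0, 0.5, 0.01))  # about (1.30, 1.70)
print(running_time_ci(1.0, 0.5, 0.25))  # about (1.00, 2.48)
```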

The prediction process begins by fitting a {\em model} to a history of
periodically sampled load measurements.  As new measurements become
available, they are fed to the fitted model in order to refine its
predictions.  At any point after the model is fitted, we can request
predictions for an arbitrary number of samples into the future.  The
quality of predictions for $k$ samples into the future is measured by
the {\em $k$-step ahead mean squared error}, which is the average of
the squares of the differences between the $k$-step ahead predictions
and their corresponding actual measurements.  After some number of new
samples have been incorporated, a decision is made to refit the model
and the process starts again.  We refer to one iteration of this
process as a {\em testcase.}
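To make the process concrete, the sketch below fits an AR($p$) model
by ordinary least squares (a simplification of standard time-series
fitting procedures), iterates it $k$ steps ahead, and measures the
$k$-step-ahead mean squared error of one testcase.  The synthetic
trace, the model order, and all constants are illustrative
assumptions, not values from our traces.

```python
import numpy as np

def fit_ar(series, p):
    """Fit an AR(p) model z_t = c + a_1 z_{t-1} + ... + a_p z_{t-p}
    by ordinary least squares."""
    n = len(series)
    X = np.column_stack([series[p - i - 1:n - i - 1] for i in range(p)])
    X = np.column_stack([np.ones(len(X)), X])  # intercept column
    coef, *_ = np.linalg.lstsq(X, series[p:], rcond=None)
    return coef  # [c, a_1, ..., a_p]

def predict_k(coef, history, k):
    """Iterate the fitted model to predict k samples into the future."""
    p = len(coef) - 1
    h = list(history[-p:])
    z = h[-1]
    for _ in range(k):
        z = coef[0] + sum(a * v for a, v in zip(coef[1:], reversed(h)))
        h = (h + [z])[-p:]
    return z

# One "testcase": fit on one section of a synthetic trace, then measure
# the k-step-ahead mean squared error on the following section.
rng = np.random.default_rng(0)
trace = np.sin(np.arange(4000) * 0.05) + 0.1 * rng.standard_normal(4000)
fit_part, test_part = trace[:2000], trace[2000:]
coef = fit_ar(fit_part, p=8)

k = 15
errs = [(predict_k(coef, test_part[:t], k) - test_part[t + k - 1]) ** 2
        for t in range(8, len(test_part) - k)]
print("15-step MSE:", np.mean(errs), "raw variance:", np.var(test_part))
```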

A good model provides {\em consistent predictability} of load, by
which we mean it satisfies the following two requirements.  First, for
the average testcase, the model must have a considerably lower
expected mean squared error than the expected raw variance of the
load.  Second, there must be little variability in mean squared error
from testcase to testcase, so that this expectation is representative
of most testcases.
Intuitively, the first requirement says that the model provides good
predictions on average, while the second says that most predictions
are close to that average.
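These two requirements can be stated as a simple check over the
per-testcase mean squared errors.  The ratio thresholds below are
illustrative assumptions, not values from our study.

```python
import numpy as np

def consistently_predictable(testcase_mses, raw_variance,
                             mean_ratio=0.5, spread_ratio=0.5):
    """Check both requirements: (1) the average testcase MSE is well
    below the raw variance of the load; (2) the MSE varies little
    from testcase to testcase.  Thresholds are illustrative."""
    mses = np.asarray(testcase_mses, dtype=float)
    good_mean = mses.mean() < mean_ratio * raw_variance
    good_spread = mses.std() < spread_ratio * mses.mean()
    return bool(good_mean and good_spread)

print(consistently_predictable([0.10, 0.12, 0.11], raw_variance=1.0))  # True
print(consistently_predictable([0.10, 0.90, 0.05], raw_variance=1.0))  # False
```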

\def\loadpredfigwidth{\epsfxsize=3in}

\begin{figure}
\centerline {
\begin{tabular}{cc}
\loadpredfigwidth
\epsfbox{all_lead15_8to8.eps} &
\loadpredfigwidth
\epsfbox{axp0_lead15_8to8.eps} \\
(a) & (b) 
\end{tabular}
}
\caption{Performance of 8 parameter host load prediction models for 15
second ahead predictions: (a) All traces, (b) Moderately loaded
interactive host.}
\label{fig:lead15_8to8}
\end{figure}

In a previous paper~\cite{DINDA-LOAD-PRED-TR-98}, we evaluated the
performance of linear time series models for predicting our load
traces using the criterion of consistent predictability discussed
above.  Although load exhibits statistical properties such as
self-similarity (Bassingthwaighte et
al.~\cite{FRACTAL-STRUCTURES-CHAOS-SCI-MED} and
Beran~\cite{BERAN-STAT-LONG-RANGE-METHODS} provide good introductions
to self-similarity and long-range dependence) and epochal
behavior~\cite{DINDA-STAT-PROPS-HOST-LOAD-LCR98} that suggest that
complex, expensive models such as
ARFIMA~\cite{HOSKING-FRACT-DIFF,GRANGER-JOYEUX-FRACT-DIFF} or
TAR~\cite{TAR-TONG-BOOK} models might be necessary to ensure
consistent predictability, we found that relatively simple models
actually performed just about as well.

Figure~\ref{fig:lead15_8to8}(a) shows the performance of 8 parameter
versions of the models we studied for 15 second ahead predictions,
aggregated over all of our traces, while
Figure~\ref{fig:lead15_8to8}(b) shows the performance on the trace of
a single, moderately loaded interactive host. Each of the graphs in
Figure~\ref{fig:lead15_8to8} is a Box plot that shows the distribution
of 15-step-ahead (predicting load 15 seconds into the future) mean
squared error.  The data for the figure comes from running a large
number of randomized testcases.  Each testcase fits a model to a
random section of a trace and then tests the model on a consecutive
random section of the trace.  In the figure, each category is a
specific model and is annotated with the number of testcases used.
For each model, the circle indicates the expected mean squared error,
while the triangles indicate the 2.5th and 97.5th percentiles
assuming a normal distribution.  The center line of each box shows the
median while the lower and upper limits of the box show the 25th and
75th percentiles and the lower and upper whiskers show the actual
2.5th and 97.5th percentiles.

Each of the predictive models has a significantly lower expected mean
squared error than the expected raw variance of the load (measured by
the MEAN model) and there is also far less variation in mean squared
error from testcase to testcase.  These are the criteria for
consistent predictability that we outlined earlier.  Another important
point is that there is little variation in the performance across the
predictive models, other than that the MA model does not perform well.
This is interesting because the ARMA, ARIMA, and especially the
long-range dependence capturing ARFIMA models are vastly more
expensive to fit and use than the simpler AR and BM models.  An
important conclusion of our study is that reasonably high order AR
models are sufficient for predicting host load.  We recommend AR(16)
models or better.

\subsection{Network prediction}
Predicting network traffic levels is challenging because of the
large number of machines that can create traffic over a shared
network.  Statistical models are providing a better understanding of
how both wide-area~\cite{wan-olympics,poisson-failure} and
local-area~\cite{SELF-SIM-HIGH-VAR-ETHERNET-WILLINGER-SIGCOMM95}
network traffic behave.  Successes have been reported in using linear
time series models to predict both long
term~\cite{TIME-SERIES-MODEL-LONG-TERM-NSFNET-GROSCHWITZ-ICC94} and
short term~\cite{TIME-SERIES-MODELS-INTERNET-TRAFFIC-BASU-GT-TR-95}
Internet traffic.  Building on such results, systems such as
NWS~\cite{nws} and Remos~\cite{remos} are being developed to provide
the predictions of network performance needed for best-effort
real-time service.  For applications interested in predicting the
behavior of a link on which they are already communicating, passive
monitoring may be appropriate~\cite{bolliger-passive,spand}.



\section{Conclusion}

We have described two applications that can benefit from a
best-effort real-time service.  Without such a service, people using
such interactive applications would face a choice between acquiring
other, possibly reserved or dedicated, resources or running the
application at degraded quality.  QuakeViz and OpenMap represent an
interesting class of applications that users such as engineering
researchers and military and civilian GIS analysts can benefit from
running in their existing environments.

The success of best-effort real-time depends on the accuracy of the
predictions of resource availability.  We have shown that CPU load can
be predicted with a high degree of accuracy using simple history-based
time series models.  When combined with currently available network
information systems, these predictions enable the task placement and
application parameter selection decisions needed to provide a usable
system.

We are currently integrating CPU load prediction and network
prediction into the Remos architecture.  Developing this system and
integrating QuakeViz and OpenMap with it will allow us to test how
well best-effort real-time meets real-world quality demands in
environments similar to those in which end-users run these
applications.


\small 
\tinyspacing
\bibliography{texbib,bruce}
\normalsize
\normspacing

\end{document}

