% most useful Latex command
\newcommand{\comment}[1]{}
\documentclass[12pt,twoside]{report}
\usepackage{fullpage}
\usepackage{cmu-titlepage}
\usepackage{epsf}
\usepackage{times}
%\usepackage[nomarkers]{endfloat}
%\usepackage{changebar}
%\usepackage{lscape}


%\input{restore}

\renewcommand{\dbltopfraction}{.9}
\renewcommand{\topfraction}{.9}
\renewcommand{\textfraction}{.1}
\renewcommand{\floatpagefraction}{.9}
\renewcommand{\dblfloatpagefraction}{.9}

%\def\normspacing {\renewcommand{\baselinestretch}{0.9}}
%\def\tinyspacing {\renewcommand{\baselinestretch}{0.7}}

\def\normspacing {}
\def\tinyspacing {}

\normspacing

\setcounter{secnumdepth}{10}

\def\thefigure{\arabic{figure}}
\def\thesection{\arabic{section}}
\def\theequation{\arabic{equation}}

%\newcounter{subsubsection}[subsection]
%\def\thesubsection{\thesection.\arabic{subsection}}
%\def\thesubsubsection{\thesubsection.\arabic{subsubsection}}



%\newcounter {subsubsubsection}[subsubsection]
%\def\thesubsubsubsection {\thesubsubsection.\arabic{subsubsubsection}}
%\def\subsubsubsection {\subsubsection*}

\makeatletter\long\def\unmarkedfootnote#1{{\long\def\@makefntext##1{##1}\footnotetext{#1}
}}

\def\pagefigwidth {\epsfxsize=6in}
\def\colfigwidth {\epsfxsize=3in}
\def\figfontsize {}
%\def\leadfigwidth{\epsfxsize=2in}
%\def\absfigsize {\epsfxsize=3.0in}
%\def\relfigsize {\epsfxsize=3.0in} 

\newcounter{ecount}
\newenvironment{ecompact}{
  \begin{list}%
    {\arabic{ecount}}{\usecounter{ecount}
    \parsep 0pt plus 1pt
    \partopsep 0pt plus 1pt
    \topsep 2pt plus 2pt minus 1pt
    \itemsep 0pt plus 1pt}}%
  {\end{list}}

\newenvironment{icompact}{
  \begin{list}{$\bullet$}{
    \parsep 0pt plus 1pt
    \partopsep 0pt plus 1pt
    \topsep 2pt plus 2pt minus 1pt
    \itemsep 0pt plus 1pt}}%
  {\end{list}}


\begin{document}

\def\RPS{RPS}

\def\REFIT{REFIT$<$T$>$}

\bibliographystyle{acm}

\title{An Extensible Toolkit for Resource Prediction\\ In Distributed Systems}

\author{Peter A. Dinda \and David R. O'Hallaron}

\date{July 1999}

\trnumber{CMU-CS-99-138}

\keywords{resource prediction, performance prediction, 
linear time series models, time series prediction, prediction
toolkits, distributed systems, predictive scheduling,
application-level scheduling}

\support{
\small
Effort sponsored in part by the Advanced Research Projects Agency and
Rome Laboratory, Air Force Materiel Command, USAF, under agreement
number F30602-96-1-0287, in part by the National Science Foundation
under Grant CMS-9318163, and in part by a grant from the Intel
Corporation. The U.S. Government is authorized to reproduce and
distribute reprints for Governmental purposes notwithstanding any
copyright annotation thereon.  The views and conclusions contained
herein are those of the authors and should not be interpreted as
necessarily representing the official policies or endorsements, either
expressed or implied, of the Advanced Research Projects Agency, Rome
Laboratory, or the U.S. Government.
\normalsize
}


\abstract{
This paper describes the design, implementation, and performance of
\RPS, an extensible toolkit for building flexible on-line and 
off-line resource prediction systems in which resources are
represented by independent, periodically sampled, scalar-valued
measurement streams.  \RPS-based prediction systems predict future
values of such streams from past values.  Systems are composed at
run-time out of an extensible set of communicating prediction
components which are in turn constructed using \RPS's sensor,
prediction, and communication libraries.  We have used \RPS\ to
evaluate predictive models and build on-line prediction systems for
host load and network bandwidth.  The overheads involved in such
systems are quite low.  }

\maketitle

\cleardoublepage

\section{Introduction}
\label{sec:intro}

In order for a distributed, interactive application such as a
scientific visualization tool~\cite{DV-PRELIM-REPORT-PDPTA99} to be
responsive to the user's demands when running on a shared, unreserved
distributed computing environment, it must adapt its behavior to
dynamically changing resource
availability~\cite{DINDA-CASE-BERT-WPDRTS-99,GRID-BOOK}.  Adaptation
can take many forms, including dynamically choosing where to execute
the application's tasks and changing the computation that a task
performs, perhaps by changing the quality of its results.

Implicit in such application-level scheduling~\cite{APPLES-HPDC96} is
resource prediction.  The application scheduler adapts the
application's behavior based on predicted resource availability.  Each
resource has an associated sensor mechanism which periodically
measures its availability as a scalar value.  The prediction system
for a particular resource uses an appropriate predictive model to map
current and past resource measurements into predictions of future
resource measurements.  These predictions and the application's
resource requirements are then mapped into an application-level metric
such as task execution time.  The scheduler uses these metrics to make
scheduling decisions.

Because of their central role, resource prediction systems are of
considerable importance.  Unfortunately, tools to simplify studying
resource prediction and building appropriate and efficient resource
prediction systems are scarce.  To remedy this situation, we have
created the \RPS\ (Resource Prediction System) toolkit.  \RPS\ is a
set of libraries and programs implemented using them that simplifies
the evaluation of prospective models and the creation of efficient and
flexible resource prediction systems out of communicating components.
This paper describes the design and implementation of \RPS, and the
performance of a representative \RPS-based resource prediction system.

To understand the role \RPS\ plays, consider the process of building a
prediction system for a new kind of resource.  Once a sensor mechanism
has been chosen, this entails essentially two steps.  The first is an
off-line process consisting of analyzing representative measurement
traces, choosing candidate predictive models based on the analysis,
and evaluating these models using the traces.  The second step is to
build an on-line prediction system that implements the most
appropriate model with minimal overhead.  There are a wide variety of
statistical and signal processing tools for interactive analysis of
measurement traces that work very well for performing most of the
first step.  However, tools for doing large scale trace-based model
evaluation are usually ad hoc and do not take advantage of the
available parallelism.  With regard to the second step, building an
on-line prediction system using the appropriate model, \RPS\ provides
tools for quickly composing such a system out of communicating
components.

\RPS\  is designed to be generic, extensible, distributable, portable,
and efficient.  The basic abstraction is the prediction of
periodically sampled, scalar-valued measurement streams.  Many such
streams arise in a typical distributed system.  \RPS\ can be easily
extended with new classes of predictive models and new components can
be easily implemented using \RPS.  These components inherit the
ability to run on any host in the network and can communicate in
powerful ways.  The only tool needed to build \RPS\ is a C++ compiler,
and it has been ported to five different Unix systems and Windows NT.
For typical measurement stream sample rates and predictive models, the
additional load that an \RPS-based prediction system places on a host is
in the noise, while the maximum sample rates possible on a typical
machine are as high as 2.7 KHz.

Our experience shows that it is possible to use \RPS\ to find and
evaluate appropriate predictive models for a measure of resource
availability, and then implement a low overhead prediction system that
provides timely and useful predictions for that resource.  We used
\RPS\  for an extensive evaluation of linear models for host load
prediction~\cite{DINDA-LOAD-PRED-HPDC-99} and a smaller evaluation of
linear models for network bandwidth prediction.  Following the
evaluation studies, we used \RPS\ to implement on-line prediction
systems for these resources.  These systems have been used in the CMU
Remos~\cite{REMOS-HPDC98} resource measurement system, the BBN QuO
distributed object quality of service system~\cite{QUO-JOURNAL-98},
and are currently being integrated into the Dv distributed
visualization framework~\cite{DV-PRELIM-REPORT-PDPTA99}.  Parts of
\RPS\ have also been used to explore the relationship between SNMP
network measurements and application-level
bandwidth~\cite{LOWEKAMP-SNMP-VS-APP-HPD99}.

We begin our description of \RPS\  by enumerating, in general terms, the
steps that a researcher would follow in providing prediction for a new
resource (Section~\ref{sec:req}.)  Simplifying the final two steps of
the process is the primary goal of \RPS.  In the same section, we also 
detail the additional requirements we placed on the \RPS\ 
implementation.  Next, we describe the high level composition of the
\RPS\ toolkit and how the pieces fit together
(Section~\ref{sec:overall}.)  Then we describe each piece of \RPS\  in
detail: the sensor libraries, which measure host load and network flow
bandwidth (Section~\ref{sec:sensors}); the time series prediction
library which provides an extensible abstraction for prediction and
implementations of a number of useful predictive models
(Section~\ref{sec:tspl}); the mirror communication template library,
which provides the infrastructure to build prediction components which
can communicate in sophisticated ways (Section~\ref{sec:mirror}); and
the prediction components we have built (Section~\ref{sec:predcomp}.)

Throughout these sections, we provide examples of how these tools can
be used by a researcher or practitioner.  In Section~\ref{sec:crossval_system},
we describe the parallelized cross-validation system we implemented
around the time series prediction library in order to evaluate
prospective models on measurement traces.  

We illustrate the performance of \RPS\ in two ways.  First, in
Section~\ref{sec:tspl_perf}, we detail the computational costs
involved in using the various predictive models in the time series
prediction library on representative measurement streams.  The cost to
fit and use different predictive models varies widely.  Second, in
Section~\ref{sec:on-line_perf}, we describe a representative and
realistic \RPS-based prediction system for host load.  We measure the
performance of this system in terms of the achievable sample rates,
the measurement to prediction latency, and the additional CPU and
communication loads presented by the system when run at different
sample rates.  For sample rates that are of interest in this system,
we find that the additional load is barely discernible.  Its maximum
achievable rate is about two orders of magnitude higher than we need.

Finally, we address related work in Section~\ref{sec:related} and
conclude in Section~\ref{sec:conclusion} with a discussion of the
limitations of \RPS\ and the work that is needed to simplify managing
running \RPS-based prediction systems.


\section{Goal and requirements}
\label{sec:req}

The goal of \RPS\ is to facilitate two aspects of the construction of
on-line resource prediction systems: the off-line evaluation of
predictive models and the construction of on-line systems using
appropriate models.  In addition, we placed other
requirements on \RPS\ that we felt would provide flexibility for
further research into resource prediction.

\subsection{Designing a new prediction system}
\label{sec:newpredsys}

To understand the goal of \RPS, it is useful to consider the task of a
researcher interested in predicting a new kind of resource.  Roughly,
he will follow these steps:
\begin{enumerate}
\item Construct a {\em sensor} for the resource.  The sensor generates a
periodically sampled, scalar-valued {\em measurement stream}.  We will
also refer to such a stream as a {\em signal.}
\item Collect measurement {\em traces} from representative environments. 
\item Analyze these traces using various statistical tools.
\item Choose candidate models based on that analysis.
\item Evaluate the models in an unbiased off-line evaluation study
based on the traces to find which of the candidate models are indeed
appropriate.
\item Implement an on-line prediction system based on the appropriate 
models.
\end{enumerate}
The first two steps generally require the researcher to implement
custom tools using his domain-specific knowledge of the resource in
question.  When carrying out the third step, the researcher is aided
by powerful, commonly available tools.  This step, as well as the
fourth, also relies heavily on the statistical expertise of the
researcher.  The final two steps can benefit considerably from
automated and reusable tools.  However, such tools are scarce.  The
goal of \RPS\ is to provide tools for these last two steps.  The
researcher should be able to use \RPS\ to conduct off-line model
evaluation, and then construct an on-line resource prediction system
based on the appropriate models.

The first step requires domain-specific knowledge.  There are a myriad
of ways in which interesting signals in a distributed environment can
be captured.  For example, our load prediction work uses OS-specific
mechanisms to retrieve Unix load averages (average run queue
lengths)~\cite{DINDA-STAT-PROP-HOST-LOAD-SCIPROG-99}.  The Network
Weather Service uses benchmarking to measure network bandwidths and
latencies between hosts~\cite{FORECASTING-NETWORK-NWS-WOLSKI-HPDC97}
and load average and accounting-based sensors for host
load~\cite{WOLSKI-PRED-CPU-AVAIL-HPDC-99}.  Remos uses SNMP queries to
measure bandwidths and latencies on supported
LANs~\cite{REMOS-HPDC98}.  These different mechanisms developed from
the expertise of their researchers.  Of course, some measurement
issues are more general.  For example, determining the signal's
bandwidth and how to resample possibly aperiodically sampled signals
to periodicity are important signal processing issues that arise for
all signals.

Once the issues involved in the sensor are resolved, the researcher
must use his expertise to choose a representative set of environments
and capture traces from them using the sensor implementation.  This
second step is also highly dependent on the particulars of the signal
and the environments of interest. 

Unlike the first two steps, the third and fourth steps, analyzing
the collected traces and choosing candidate models based on the
analysis, require far less domain-specific knowledge and more
general-purpose statistical knowledge.  Consequently, there are a number of
tools available to help the researcher perform these steps.  For
example, our research into the properties of host load
signals~\cite{DINDA-STAT-PROP-HOST-LOAD-SCIPROG-99} made extensive use
of the exploratory data analysis and time series analysis tools
available in S-Plus~\cite{SPLUS-MANUAL}, and in
Matlab's~\cite{MATLAB-MANUAL} System Identification
Toolbox~\cite{MATLAB-SYSIDENT-MANUAL}.

Surprisingly, there are few tools to simplify the fifth step,
evaluating the candidate models in an unbiased manner.  Time series
analysis methodologies such as that of Box and
Jenkins~\cite{BOX-JENKINS-TS} do generally have an evaluation step,
but it is very interactive and primarily concerned with the fit of the
model and not its predictive power, which is really what is of
interest to the distributed computing community.  Furthermore, these
methodologies often assume that computational resources are scarce,
when, in fact, they are currently widely available.

We believe that the appropriate way to evaluate prospective models
is to study their predictive performance when confronted by many
randomly selected testcases.  Running this plethora of testcases
requires considerable computational resources.  While it is possible
to script tools such as Matlab and S-Plus to do the job, it would be
considerably more efficient to have a more specialized tool that could
exploit the embarrassing parallelism available here.  Furthermore, it
would be desirable for the tool to use the same model implementations
that would ultimately be used in the on-line prediction system.  We
evaluated the use of linear models for host load prediction using such
a parallelized tool implemented using
\RPS~\cite{DINDA-LOAD-PRED-HPDC-99}, which we describe in 
Section~\ref{sec:crossval_system}.

The final step, implementing an on-line prediction system for the
signal, requires implementing, or reusing, the model or models that
survived the evaluation step and enabling them to communicate with
sensors and the applications or middleware that may be interested in
the predictions.  In addition, mechanisms to evaluate and control a
prediction system must be provided.  In
Section~\ref{sec:host_load_pred_sys} we show how we use \RPS\ to
implement an on-line host load prediction system using the same models
we evaluated earlier.


\subsection{Implementation requirements}
\label{sec:implreqs}

The goal of \RPS\ is to provide a toolkit that simplifies the final
two steps of the prediction process, as described above.  We also
required that our implementation would provide genericity,
extensibility, distributability, portability, and efficiency.  These
requirements were intended to make \RPS\ as flexible as possible for
future research into resource prediction.  We discuss each of these
requirements below, noting how our implementation addresses them.

{\bf Genericity:} Nothing in the system should be tied to a specific
kind of signal or measurement approach.  \RPS\ should be able to
operate on periodically sampled, scalar-valued measurement streams
from any source.  Genericity is important in a research system such as
\RPS\ because there are a plethora of different signals in a
distributed system whose predictability we would like to study.  While
our research has focused primarily on the prediction of host load as
measured by the load average, we have also used \RPS\ to study the
predictability of network bandwidth as measured by Remos and through
commonly available traces.

{\bf Extensibility:} It should be easy to add new models to
\RPS\ and to write new \RPS-based prediction components.  Being able to add
new models is important because the statistical study of a signal can
point to the appropriateness of particular model classes.  For example, we added an
implementation of the fractional ARIMA model to \RPS\ after we noted
that host load traces exhibited self-similarity.  An obvious example
of the need for easily constructable components is implementing new
sensors.  For example, we added a Remos-based network bandwidth sensor
many months after writing a host load sensor.

{\bf Distributability:} It should be possible to place \RPS-based
prediction components in different places on the network and have them
communicate using various transports.  On-line prediction places
computational load on the distributed system and it should be possible
to distribute this load across the hosts as desired.  Similarly, it
should be possible to adjust the communication load and performance by
using different transports.  For example, many applications may be
interested in host load predictions, so it should be easy to multicast
them.  If two communicating components are located on the same
machine, they should be able to use a faster transport.  \RPS-based
components are distributable and can communicate using TCP, UDP, Unix
domain sockets, pipes, and files.

{\bf Portability:} It should be easy to port \RPS\ to new platforms,
including those without threads, such as FreeBSD, and non-Unix
systems, such as NT.  FreeBSD and NetBSD are important target
platforms for Remos and \RPS.  Some potential users of \RPS, such as
the organizations involved in the ARPA Quorum project, are increasingly
interested in NT-based software.  \RPS\ requires only a modern C++
compiler to build and has been ported to Linux, Digital Unix, Solaris,
FreeBSD, NetBSD, and Windows NT.

{\bf Efficiency:} An on-line prediction system implemented using \RPS\
components should be able to operate at reasonably high measurement
rates, and place only minor computational and communication loads on
the system when operating at typical sampling rates.  Obviously, what
is reasonable depends on the signal and the model being used to
predict it.  The idea here is not to achieve near-optimal performance,
but rather to achieve sufficient performance to make an \RPS-based
resource prediction system usable in practice.  Ultimately, something
like a host load prediction system would probably be implemented as a
single, hand-coded daemon which would be considerably faster than a
system composed out of communicating \RPS\ prediction components, such
as we measure later.  However, the latter \RPS-based system can
operate at over 700 Hz and offers noise-floor level load at
appropriate rates with median prediction latencies in the 2
millisecond range.  A comparable monolithic system, composed at
compile-time using \RPS, can sustain a rate of 2.7 KHz.


\section{Overall system design}
\label{sec:overall}

\begin{figure}
\centerline{
\colfigwidth
\epsfbox{predoverview.eps}
}
\caption{Overview of an on-line resource prediction system.}
\label{fig:predoverview}
\end{figure}

Here we describe the structure of \RPS\ as it relates to the
construction of an on-line resource prediction system (step 6 of
Section~\ref{sec:newpredsys}.)  In addition, \RPS\ can also be used to
implement off-line evaluation systems as per step 5 of
Section~\ref{sec:newpredsys}.  In the following, we focus on on-line
prediction, pointing out the particulars of off-line prediction only
in passing.  Section~\ref{sec:crossval_system} presents more details
about a parallel off-line system.

Figure~\ref{fig:predoverview} presents an overview of an on-line time
series prediction system.  In the system, a {\em sensor} produces a
{\em measurement stream} (we also refer to this as a {\em signal}) by
periodically sampling some attribute of the distributed system and
presenting it as a scalar.  The measurement stream is the input of a
{\em predictor}, which, for each individual measurement produces a
vector-valued prediction.  The vector contains predictions for the
next $m$ values of the measurement stream, where $m$ is configurable.
Each of the predictions in a vector is annotated with an estimate of
its error.  Consecutive vectors form a {\em prediction stream}, which
is the output of the predictor.  {\em Applications} (including other
middleware) can subscribe directly to the prediction stream.  The
prediction stream also flows into a {\em buffer}, which keeps a short
history of the prediction stream and permits applications to access
these predictions asynchronously, via a request/response mechanism.
The measurement and prediction streams also feed an optional {\em
evaluator}, which continuously monitors the performance of the
predictor by comparing the predictor's actual prediction error with a
maximum permitted error and by comparing the predictor's estimates of
its error with another maximum permitted error level.  If either
maximum is exceeded --- the predictor is either making too many errors
or is misestimating its own error --- the evaluator calls back to the
predictor to tell it to refit its model.  The user can exert control
over the system through an asynchronous request/response mechanism.  For
example, he might change the sampling rate of the sensor, the model
the predictor is using, or the size of the buffer's history.

The implementation of Figure~\ref{fig:predoverview} relies on several
functionally distinct pieces of software:  the sensor libraries, the
time series prediction library, the mirror communication template
library, the prediction components, and scripts and other ancillary
codes. 

{\bf Sensor libraries} implement function calls that measure some
underlying signal and return a scalar.  Section~\ref{sec:sensors}
provides more information about host load and flow bandwidth libraries
that we provide.

The {\bf time series prediction library} provides an extensible,
object-oriented C++ abstraction for time series prediction software as
well as implementations of a variety of useful linear models.
Section~\ref{sec:tspl} provides a detailed description of this library
and a study of the stand-alone performance of the various models we
implemented.

The {\bf mirror communication template library} provides C++ template
classes that implement the communication represented by arrows in
Figure~\ref{fig:predoverview}.  It makes it very easy to create a
component, such as the predictor or any of the other boxes in the
figure, with a large amount of flexibility.  In particular, the library
provides run-time configurability, the ability to handle multiple data
sources and targets, request/response interactions, and the ability to
operate over a variety of transports.  Section~\ref{sec:mirror}
describes the mirror library in detail.

{\bf Prediction components} are programs that we implemented using the
preceding libraries.  They realize the boxes of
Figure~\ref{fig:predoverview} and can be connected as desired when
they are run.  Section~\ref{sec:predcomp} describes the prediction
components we implemented.  Section~\ref{sec:on-line_perf} describes
the performance and overhead of an on-line host load prediction system
composed from these components and communicating using TCP.

{\bf Ancillary software} includes scripts to instantiate prediction
systems on machines, tools for replaying host load traces, and tools
for testing host load prediction-based schedulers.  We don't describe
this software in any further detail in this paper.  One piece of
software that has not been implemented is a system for keeping track
of instantiated prediction systems and their underlying data streams.
Currently, the client middleware or the user must instantiate
prediction components and manage them.


\section{Sensor libraries}
\label{sec:sensors}

Currently, two sensor libraries have been implemented.  The first
library, {\em GetLoadAvg} provides a function that retrieves the load
averages (i.e., average run queue lengths) of the Unix system it is
running on.  On some systems, such as Digital Unix, these are 5, 30,
and 60 second averages, while on others, such as Linux, these are 1,
5, and 15 minute averages.  We have shown elsewhere that the execution
time of a compute-bound task on a Digital Unix system is strongly
related to the average load it experiences during
execution~\cite{DINDA-LOAD-PRED-HPDC-99}.  

The GetLoadAvg code was borrowed from Xemacs and considerably
modified.  It uses efficient OS-specific mechanisms to retrieve the
numbers where possible.  When such mechanisms are not available, or
when the user's permissions are inadequate, it runs the Unix uptime
utility and parses its output.  Because NT does not have an equivalent
to the load average, it gracefully fails on that platform.  Additional
code is available from us for directly sampling the run-queue length
on an NT system using the registry interface.  On a 500 MHz Digital
Unix machine, approximately 640,000 GetLoadAvg calls can be made per
second.  The maximum observed latency is about 10 milliseconds.  We
normally operate at about 1 call per second.

The second library, {\em GetFlowBW}, provides a function that measures
the bandwidth that a prospective new flow between two IP addresses
would receive, assuming no change in other flows.  The implementation
is based on Remos~\cite{REMOS-HPDC98}, which uses SNMP queries to
estimate this value on LANs and benchmarking to estimate it on WANs.
For SNMP queries on a private LAN, about 14 calls can be made per
second.  

\section{Time series prediction library}
\label{sec:tspl}

The time series prediction library is an extensible set of C++ classes
that cooperate to fit models to data, create predictors from fitted
models, and then evaluate those predictors as they are used.  While
the abstractions of the library are designed to facilitate on-line
prediction, we have also implemented several off-line prediction tools
using the library, including a parallelized cross-validation tool.
Currently, the library implements the Box-Jenkins linear time series
models (AR, MA, ARMA, ARIMA), a fractionally integrated ARIMA model
which is useful for modeling long-range dependence such as
arises from self-similar signals, a ``last value'' model, a windowed
average model, and a long term average model.  In addition, we
implemented a template-based utility model which can be parameterized 
with another model, resulting in a version of the underlying model that
periodically refits itself to data.

\subsection{Abstractions}

\begin{figure}
\centerline{
\colfigwidth
\epsfbox{ppl.eps} 
}
\caption{Abstractions of the time series prediction library}
\label{fig:ppl}
\end{figure}

The abstractions of the time series prediction library are illustrated
in Figure~\ref{fig:ppl}.  The user begins with a {\em measurement
sequence}, $\langle z_{t-N},\ldots,z_{t-2},z_{t-1} \rangle$, which is a
sequence of $N$ scalar values that were collected at periodic
intervals, and a {\em model template} which contains information about
the structure of the desired model.  Although the user can create a
model template himself, a function is also provided to create a model
template by parsing a sequence of strings such as command-line
arguments.

The measurement sequence and model template are supplied to a {\em
modeler} which will fit a {\em model} of the appropriate structure to
the sequence and return the model to the user.  The user can select
the appropriate modeler himself or use a provided function which
chooses it based on the model template.  The returned model represents
a fit of the model structure described in the model template to the
measurement sequence.   

To predict future values, the model creates a {\em predictor.}  A
predictor is a filter which operates on a scalar-valued {\em
measurement stream}, $z_{t},z_{t+1},\ldots$,  
producing a vector-valued {\em prediction stream},
$
[\hat{z}_{t,t+1},\hat{z}_{t,t+2},\ldots,\hat{z}_{t,t+m}],
[\hat{z}_{t+1,t+2},\hat{z}_{t+1,t+3},\ldots,\hat{z}_{t+1,t+1+m}],\ldots
$
Each new measurement generates predictions for what the next $m$
measurements will be, conditioned on the fitted model and on all the
measurements up to and including the new measurement.  $m$ can be
different for each step, and the predictor can be asked for the next
$m$ values at any point.  The predictor can also
produce {\em error estimates} for its $1,2,\ldots,m$-step ahead
predictions.  Ideally, the prediction error will be normally
distributed and so these estimates can serve to compute a confidence
interval for the prediction. 
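For instance, if the errors are indeed normal, a 95\% confidence
interval for the $k$-step-ahead prediction follows directly from the
corresponding error estimate $\hat{\sigma}_k$:
\begin{displaymath}
\hat{z}_{t,t+k} \pm 1.96\,\hat{\sigma}_k
\end{displaymath}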

The measurement and prediction streams can also be supplied to an {\em
evaluator}, which evaluates the actual quality of the predictions
independent of any particular predictor, producing {\em error
metrics.}  The user can compare the evaluator's error metrics and the
predictor's error estimates to determine whether a new model needs to
be fitted.

\subsection{Implementation}
\label{sec:tspl_impl}

\begin{figure}
\small
\centerline{
\begin{tabular}{|c|l|}
\hline
Model           & Notes \\
\hline
\multicolumn{2}{|c|}{Simple Models} \\
\hline
MEAN            & Long-range mean \\
LAST            & Last-value \\
BM(p)           & Mean over ``best'' window \\
\hline
\multicolumn{2}{|c|}{Box-Jenkins Models}\\
\hline
AR(p)           & Uses Yule-Walker \\
MA(q)           & Uses Powell \\
ARMA(p,q)       & Uses Powell \\
ARIMA(p,d,q)    & Captures non-stationarity, uses Powell \\
\hline
\multicolumn{2}{|c|}{Self-similar Models}\\
\hline
ARFIMA(p,d,q)   & Captures long-range dependence \\
\hline
\multicolumn{2}{|c|}{Utility Models}\\
\hline
\REFIT\        & Auto-refitting model \\
\hline
\end{tabular}
\normalsize
}
\caption{Currently implemented predictive models.}
\label{fig:model_structs}
\end{figure}

The time series prediction library is implemented in C++.  To extend
the basic framework shown in Figure~\ref{fig:ppl} to implement a new
model, one creates subclasses of model template, modeler, model and
predictor, and updates several helper functions.  We implemented the
nine models shown in Figure~\ref{fig:model_structs} in this way.
The evaluator class can also be subclassed, but the base class already provides a
comprehensive implementation.  

In the remainder of this section, we focus on the implementation of
the time series prediction library; the treatment of the individual
modeling techniques is intentionally brief.  More detail
can be found in our earlier paper on linear models for host load
prediction~\cite{DINDA-LOAD-PRED-HPDC-99} and in standard references
such as Box and Jenkins~\cite{BOX-JENKINS-TS}, Brockwell and
Davis~\cite{INTRO-TIME-SERIES-FORECASTING}, and in the classic
articles on fractional ARIMA
models~\cite{HOSKING-FRACT-DIFF,GRANGER-JOYEUX-FRACT-DIFF}.
S-Plus~\cite{SPLUS-MANUAL} and Matlab's System Identification
Toolbox~\cite{MATLAB-SYSIDENT-MANUAL} provide good tools for learning
and experimenting with these models.

\begin{figure}[tbp]
\centerline {
\pagefigwidth
\epsfbox{ltsmodel.eps}
}
\caption{Linear time series model.}
\label{fig:ltsmodels}
\end{figure}

The models we implemented are different kinds of linear time series
models.  In fitting such a model, the idea is to treat the measurement
sequence $\langle z_t
\rangle$ as the output of a linear filter being driven by a white
noise sequence $\langle a_t \rangle$.  Figure~\ref{fig:ltsmodels}
illustrates this decomposition.  The filter coefficients $\psi_j$ are
estimated from past observations of the sequence with the goal of
minimizing the variance (or energy) of the driving source,
$\sigma_a^2$.  This residual variance is then interpreted as an
estimate of the model's one-step-ahead prediction error.

This general form of the linear time series model is impractical,
since it involves an infinite summation using an infinite number of
completely independent weights.  The practical models we implemented
model the filter coefficients $\psi_j$ as the coefficients of a ratio
of polynomials in the backshift operator $B$, where $B^dz_t =z_{t-d}$.
Using this scheme, the models we implemented are all variants of the
following form:
\begin{equation}
z_t = \frac{\theta(B)}{\phi(B)(1-B)^d}a_t + \mu
\label{eqn:ts_poly}
\end{equation}

\subsubsection{Implemented models}

{\bf MEAN model:} The MEAN model has $z_t =\mu$, so all future values
of the sequence are predicted to be the mean.  This is the best
predictor, in terms of minimum mean squared error, for a sequence
which has no correlation over time --- in other words, it is best if
the sequence is entirely white noise.  The MEAN modeler and model
classes essentially do no work, while the MEAN predictor class
maintains a running estimate of the mean and variance of the signal.
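A sketch of this kind of incremental update follows (Welford's
method; this is illustrative, not the library's actual MEAN
predictor interface).
\small
\begin{verbatim}
// Running mean and variance, updated one measurement at a time.
class RunningStats {
  long   n;
  double mean, m2;   // m2 = sum of squared deviations from mean
public:
  RunningStats() : n(0), mean(0.0), m2(0.0) {}
  void Step(double z) {
    n++;
    double delta = z - mean;
    mean += delta / n;
    m2   += delta * (z - mean);
  }
  double Mean()     const { return mean; }
  double Variance() const { return n > 1 ? m2 / (n - 1) : 0.0; }
};
\end{verbatim}
\normalsize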

{\bf LAST model:} LAST models have $z_t = \frac{1}{\phi(B)}a_t$ where
$\phi(B)$ has one coefficient, set to one.  In other words, $z_t =
z_{t-1}$, so the one step ahead prediction is simply the last measured
value.  LAST is implemented as a BM(1) model, which we describe next.

{\bf BM($p$) models:} BM($p$) models have $z_t = \frac{1}{\phi(B)}a_t$
where $\phi(B)$ has $N \leq p$ coefficients, each set to
$1/N$.  This simply predicts the next sequence value to be the average
of the previous $N$ values, a simple windowed mean.  The BM($p$)
modeler chooses $N$ to minimize the one-step-ahead prediction error
for the measurement sequence.  The BM($p$) model simply keeps track of
this $N$ and the BM($p$) predictor implements the windowed average.

{\bf AR($p$) models:} AR($p$) (purely autoregressive) models have $z_t
=\frac{1}{\phi(B)}a_t + \mu$ where $\phi(B)$ has $p$ coefficients
which the modeler chooses to minimize $\sigma_a^2$.  Our
implementation uses the Yule-Walker technique to fit the model.  In
this technique, the autocorrelation function of the measurement
sequence is computed to a maximum lag of $p$ and then a $p$-wide
Toeplitz system of linear equations in the coefficients is solved.
Even for relatively large values of $p$, this can be done quite
quickly, and the technique makes no assumptions about the error
distribution.  The AR($p$) model stores the $p$ coefficients, $\mu$,
and $\sigma_a^2$.  Prediction is done with an Eta-theta predictor,
which we describe next.
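Before moving on to that predictor, the following sketch shows the
Yule-Walker fit in miniature, solving the Toeplitz system with the
Levinson-Durbin recursion.  It is illustrative only; the library's
actual code is structured differently.
\small
\begin{verbatim}
#include <vector>
using std::vector;

// Fit AR(p) coefficients phi[1..p] from autocovariances r[0..p]
// via the Levinson-Durbin recursion; returns the residual
// variance, an estimate of the one-step prediction error.
double LevinsonDurbin(const vector<double> &r, int p,
                      vector<double> &phi) {
  phi.assign(p + 1, 0.0);               // phi[0] unused
  double sigma2 = r[0];
  for (int k = 1; k <= p; k++) {
    double acc = r[k];
    for (int j = 1; j < k; j++) acc -= phi[j] * r[k - j];
    double kappa = acc / sigma2;        // reflection coefficient
    vector<double> prev(phi);
    phi[k] = kappa;
    for (int j = 1; j < k; j++) phi[j] = prev[j] - kappa * prev[k - j];
    sigma2 *= (1.0 - kappa * kappa);
  }
  return sigma2;
}
\end{verbatim}
\normalsize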

{\bf Eta-theta predictor:} The AR, MA, ARMA, ARIMA, and ARFIMA models
share a single predictor implementation.  That predictor maintains a
copy of the model coefficients ($\eta(B) = \phi(B)(1-B)^d$),
$\theta(B)$ as before, $\mu$, and $\sigma_a^2$.  In addition, it
maintains a prediction state in the form of a history of previous
one-step-ahead predictions and their corresponding errors (the white
noise).  It operates linearly on this state and the coefficients to
produce predictions.  New measurement stream values change the state.
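For the purely autoregressive special case ($\theta(B)=1$, $d=0$),
stepping reduces to a dot product with recent history.  A sketch,
assuming the sign convention $z_t = \mu + \sum_{j=1}^{p}
\phi_j(z_{t-j}-\mu) + a_t$ (the library's internal conventions may
differ):
\small
\begin{verbatim}
// One-step-ahead AR(p) prediction from the p most recent
// measurements, hist[0] = z_t, hist[1] = z_{t-1}, ...
double ARPredictOne(const double *phi, const double *hist,
                    int p, double mu) {
  double zhat = mu;
  for (int j = 0; j < p; j++) {
    zhat += phi[j] * (hist[j] - mu);
  }
  return zhat;
}
\end{verbatim}
\normalsize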

{\bf MA($q$) models:} 
MA($q$) (purely moving average) models have $z_t =
\theta(B)a_t$ where $\theta(B)$ has $q$ coefficients. The
modeler uses the Numerical Recipes implementation of Powell's method
for multi-dimensional function
minimization~\cite{NUM-RECIPES-FORTRAN-86}, pp. 406--413, to choose
coefficients which minimize $\sigma_a^2$ for the measurement sequence.
The MA($q$) model stores the $q$ coefficients, $\mu$, and
$\sigma_a^2$.


{\bf ARMA($p$,$q$) models:} ARMA($p$,$q$) (autoregressive moving
average) models have $z_t=\frac{\theta(B)}{\phi(B)}a_t + \mu$ where
$\phi(B)$ has $p$ coefficients and $\theta(B)$ has $q$ coefficients.
The modeler uses Powell's function minimization routine to choose the
$p+q$ coefficients to minimize $\sigma_a^2$ for the measurement
sequence.  The ARMA($p$,$q$) model stores the $p+q$ coefficients, $\mu$,
and $\sigma_a^2$.

{\bf ARIMA($p$,$d$,$q$) models:} ARIMA($p$,$d$,$q$) (autoregressive
integrated moving average) models implement Equation~\ref{eqn:ts_poly}
for $d=1,2,\ldots$  The purpose of these unit roots is to
introduce integration of the signal, which allows ARIMA models to
model non-stationary signals.  The modeler fits ARIMA($p$,$d$,$q$)
models by differencing the sequence $d$ times and then fitting an
ARMA($p$,$q$) model as above to the result.  The model contains the
$d$, the $p+q$ coefficients, $\mu$, and $\sigma_a^2$.  When an
eta-theta predictor is constructed, the $d$ unit roots are folded
into the $\eta$ portion of the predictor.
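For instance, differencing a sequence $d$ times before the ARMA fit
might look like the following sketch (illustrative only; each pass
shortens the sequence by one element).
\small
\begin{verbatim}
#include <vector>
using std::vector;

// Apply (1-B)^d to a sequence, i.e., difference it d times.
vector<double> Difference(vector<double> z, int d) {
  for (int pass = 0; pass < d && !z.empty(); pass++) {
    for (size_t i = 0; i + 1 < z.size(); i++) {
      z[i] = z[i + 1] - z[i];
    }
    z.pop_back();
  }
  return z;
}
\end{verbatim}
\normalsize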

{\bf ARFIMA($p$,$d$,$q$) models:} ARFIMA($p$,$d$,$q$) (autoregressive
fractionally integrated moving average) models implement
Equation~\ref{eqn:ts_poly} for fractional values of $d$, $0<d<0.5$.
It can be shown that this fractional integration can model long-range
dependence such as arises from
self-similarity~\cite{BERAN-STAT-LONG-RANGE-METHODS,HOSKING-FRACT-DIFF,GRANGER-JOYEUX-FRACT-DIFF}.
In addition, the ``ARMA part'' of the model models the short-range
dependence in the signal.  To fit ARFIMA models, we use Fraley's
Fortran 77 code~\cite{FRALEY-FRACDF}, which does maximum likelihood
estimation of ARFIMA models assuming a normally distributed white
noise source following Haslett and
Raftery~\cite{HASLETT-RAFTERY-LT-DEP-WIND}.  This implementation is
also used by commercial packages such as S-Plus.  When the predictor
is constructed, we truncate $(1-B)^d$ at 300 coefficients (other
choices are possible).

{\bf \REFIT\ model:} The \REFIT\ modeler, model, and predictor
are C++ template classes that are parameterized by some modeler class
and produce models of the underlying type that will automatically
refit themselves at regular, user-specified, intervals.  For example,
\small
\begin{verbatim}
  RefittingModeler<ARModeler>::Fit(seq,seqlen,modeltemplate,interval)
\end{verbatim}
\normalsize
will return an AR model whose predictor will automatically fit a new
AR model and update itself after every \verb.interval. new samples.

\subsubsection{Use of Powell's method} 

The choice of Powell's method, which we use in our implementations of
the MA, ARMA, and ARIMA models, is a compromise.  Powell's method does
not require derivatives of the function being minimized, but operates
more slowly than other methods which can make use of derivatives.

We use this method because we want to minimize $\sigma_a^2$ (the sum
of squared prediction errors) directly.  Other, faster methods to fit
MA, ARMA, and ARIMA models exist.  Instead of minimizing $\sigma_a^2$,
these methods maximize the likelihood, which is a function of
$\sigma_a^2$ whose form is determined by the distribution of the
errors.  By {\em assuming a particular distribution}, a function with
known derivatives is produced and this allows the use of faster
function minimization methods.  However, we found that assuming a
particular error distribution was rarely valid for host load and
network flow bandwidth, two signals that were of considerable interest
to us.  The prediction errors of linear time series models on real
signals are rarely distributed according to a convenient analytic
distribution, although they usually are quite white (uncorrelated) and
have low $\sigma_a^2$.
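Concretely, the objective handed to Powell's method for an MA($q$)
fit is just the residual variance of the candidate coefficients,
recovered by running the innovations recursion over the sequence.  A
sketch, assuming the sign convention $z_t = a_t + \sum_{j=1}^{q}
\theta_j a_{t-j}$ on a mean-removed sequence:
\small
\begin{verbatim}
#include <vector>
using std::vector;

// Given candidate MA(q) coefficients theta[0..q-1], recover the
// innovations a_t from the mean-removed sequence z and return
// their variance, sigma_a^2.
double MAResidualVariance(const vector<double> &z,
                          const vector<double> &theta) {
  int q = theta.size();
  vector<double> a(z.size(), 0.0);
  double sumsq = 0.0;
  for (size_t t = 0; t < z.size(); t++) {
    double at = z[t];
    for (int j = 1; j <= q && (size_t)j <= t; j++) {
      at -= theta[j - 1] * a[t - j];
    }
    a[t] = at;
    sumsq += at * at;
  }
  return z.empty() ? 0.0 : sumsq / z.size();
}
\end{verbatim}
\normalsize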

\subsubsection{Prediction evaluation}
\label{sec:evaluator}

The evaluator we implemented measures the following error metrics of a
predictor.  For each lead time, the minimum, median, maximum, mean,
mean absolute, and mean squared prediction errors are computed.  Of
these, the mean squared prediction errors are especially useful, since
they can be compared against the predictor's own estimates to
determine whether a new model needs to be fitted.  Of course, a new
model can also be fitted if the prediction error is simply too high,
or for any reason, at any time.

The one-step-ahead prediction errors (i.e., $a_{t+i}^1$,
$i=1,2,\ldots,n$) are also subject to IID and normality tests as
described by Brockwell and Davis~\cite{INTRO-TIME-SERIES-FORECASTING},
pp. 34--37.  IID tests include the fraction of the autocorrelations
that are significant, the Portmanteau Q statistic (the power of the
autocorrelation function), the turning point test, and the sign test.
Recall that with an adequate model, the prediction errors should be
uncorrelated (white) noise.  If an IID test finds significant
correlation in the errors, then a new model can be fitted to attempt
to capture this correlation.  The evaluator also tests if the errors
are distributed normally by computing the $R^2$ value of a
least-squares fit to a quantile-quantile plot of the errors versus a
sequence of normals of the same mean and variance.  If the $R^2$ is
high, then using the simplifying assumption that the errors are
normally distributed is well founded.
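As one example of these tests, the Portmanteau Q statistic measures
the power in the first $h$ autocorrelations of the one-step errors.
A sketch of the Box-Pierce form, $Q = n\sum_{k=1}^{h}\hat{\rho}_k^2$,
follows; the library may compute a variant such as Ljung-Box.
\small
\begin{verbatim}
#include <vector>
using std::vector;

// Box-Pierce portmanteau statistic over one-step errors e[0..n-1].
// Large Q indicates significant residual correlation, i.e., an
// inadequate model.
double PortmanteauQ(const vector<double> &e, int h) {
  int n = e.size();
  double mean = 0.0;
  for (int i = 0; i < n; i++) mean += e[i];
  mean /= n;
  double c0 = 0.0;   // lag-0 autocovariance (times n)
  for (int i = 0; i < n; i++) c0 += (e[i] - mean) * (e[i] - mean);
  double q = 0.0;
  for (int k = 1; k <= h; k++) {
    double ck = 0.0;
    for (int i = 0; i + k < n; i++)
      ck += (e[i] - mean) * (e[i + k] - mean);
    double rho = ck / c0;
    q += rho * rho;
  }
  return n * q;
}
\end{verbatim}
\normalsize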


\subsection{Example}
\label{sec:ts_example}

The following is a code fragment to show how the time series
prediction library can be used.  In the code, we fit an ARMA(2,2)
model to the first half of the sequence
\verb.seq. and then do 8-step-ahead predictions on the second half of
the sequence:
\small
\begin{verbatim}
  const char    *modelargs[] = {"ARMA","2","2"};
  ModelTemplate *mt;
  Model         *model;
  Predictor     *predictor;
  Evaluator     *evaluator;
  double        predictions[8], errorestimates[8];

  // fit model to 1st half and create predictor
  mt        = ParseModel(3,modelargs);
  model     = FitThis(&(seq[0]),seqlen/2,*mt);
  predictor = model->MakePredictor();
  evaluator = new Evaluator;

  // bring predictor state up to date
  Prime(predictor,&(seq[0]),seqlen/2);

  evaluator->Initialize(8);

  // 8-ahead predictions for rest of sequence
  for (int i=seqlen/2+1;i<seqlen;i++) {
    // Step the new observation into the predictor - this
    // returns the current one-step-ahead prediction, but
    // we're just ignoring it here.
    predictor->Step(seq[i]);
    // Ask for predictions + error estimates from 1 to 8 steps
    // into the future given the state in the predictor
    predictor->Predict(8,predictions);
    predictor->ComputeVariances(8,errorestimates);
    // Send output to evaluator
    evaluator->Step(seq[i],predictions);
    // do something useful with predictions here
  }

  // Get final stats from evaluator
  evaluator->Drain();
  PredictionStats *predstats = evaluator->GetStats();
\end{verbatim}
\normalsize
To use a different model, all that is needed is to change the
arguments to the \verb.ParseModel. call, which could just as easily
come from the command line.  \verb.ParseModel. and \verb.FitThis. are
helper functions to simplify dealing with the large and extensible set of
available model templates, modelers, models, and predictors.  It is
also possible to invoke modelers directly, with or without model
templates.  

\subsection{Parallel cross-validation system}
\label{sec:crossval_system}

Using the time series prediction library, we implemented a parallel
cross-validation system for studying the predictive power of models
on traces of measurement data.  The user supplies a measurement trace
and a file containing a sequence of testcase templates.  A testcase
template contains ranges of valid values for model classes, numbers of
model parameters, lengths of sequences to fit models to, and lengths
of subsequent sequences to test the fitted models on.  

As the system runs, testcases are randomly generated by a master
program using the template's limits on valid values and parceled out
to worker processes using PVM~\cite{PVM-BOOK}.  The workers run code
similar to that of Section~\ref{sec:ts_example} to evaluate a
testcase.  Essentially, the result is a set of error metrics for a
randomly chosen model fit to a random section of the trace and tested
on a subsequent random section of the trace.  When a worker finishes
evaluating a testcase, it sends the resulting set of error metrics back
to the master, which prints them in a form suitable for importing into
a database table for further study.

Because the testcases are randomly generated, the database of
testcases can be used to draw unbiased conclusions about the absolute
and relative performance of particular prediction models on particular
kinds of measurement sequences.
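To make the flow concrete, a randomly drawn testcase might look
roughly like the following sketch.  The field names, the ranges, and
the RandInRange helper are all hypothetical, not the actual template
file format.
\small
\begin{verbatim}
#include <cstdlib>

struct Testcase {
  int modelclass;  // index into the allowed model classes
  int numparams;   // e.g., p for AR(p)
  int fitstart;    // where in the trace fitting begins
  int fitlen;      // length of the fit sequence
  int testlen;     // length of the subsequent test sequence
};

// Hypothetical helper: uniform integer in [lo,hi].
int RandInRange(int lo, int hi) {
  return lo + rand() % (hi - lo + 1);
}

Testcase GenerateTestcase(int tracelen) {
  Testcase tc;                            // ranges are arbitrary
  tc.modelclass = RandInRange(0, 2);
  tc.numparams  = RandInRange(1, 32);
  tc.fitlen     = RandInRange(600, 2000);
  tc.testlen    = RandInRange(600, 2000);
  tc.fitstart   = RandInRange(0, tracelen - tc.fitlen - tc.testlen);
  return tc;
}
\end{verbatim}
\normalsize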

\subsection{Performance}
\label{sec:tspl_perf}

In implementing an on-line resource prediction system, it is obviously
important to know the costs involved in using the various models
supported by the time series prediction library.  For example, if the
measurement stream produces data at a 10 Hz rate and the predictor
requires 200 ms to produce a prediction, then it will fall further and
further behind, producing ``predictions'' for times that are
increasingly further in the past.   Clearly, such a predictor is
useless.  Another predictor that requires 100 ms will give up-to-date
predictions, but at the cost of saturating the CPU of the machine
where it is running.  A predictor that requires 1 ms or less would
clearly be desirable since it would consume only 1\% of the CPU.
Similarly, the cost of fitting a model and the measurement rate
determines how often we can refit the model.  At the 10 Hz rate, a
model that takes 10 seconds to fit cannot be fit any more often than
every 100 measurements, and only then if we are willing to saturate
the CPU.  
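The arithmetic behind these observations is simply that a predictor
stepped at measurement rate $f$ with per-step cost $t_{step}$
consumes the CPU fraction
\begin{displaymath}
U = f \cdot t_{step},
\end{displaymath}
so at $f=10$ Hz a 100 ms step gives $U=1$ (saturation), while a 1 ms
step gives $U=0.01$.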


\def\predfigsize{\epsfxsize=3in}

\begin{figure}
\centerline{
\begin{tabular}{cc}
\predfigsize
\epsfbox{ar600.epsf}
&
\predfigsize
\epsfbox{bm600.epsf}
\\
(a) AR Models & (b) BM Models \\
\predfigsize
\epsfbox{ma600.epsf}
&
\predfigsize
\epsfbox{arma600.epsf}
\\
(c) MA Models & (d) ARMA Models \\
\predfigsize
\epsfbox{arima600.epsf}
&
\predfigsize
\epsfbox{arfima600.epsf}
\\
(e) ARIMA Models & (f) ARFIMA Models \\
\end{tabular}
}
\caption{Timing of various prediction models, 600 sample fits.}
\label{fig:ts600}
\end{figure}


\begin{figure}
\centerline{
\begin{tabular}{cc}
\predfigsize
\epsfbox{ar2000.epsf}
&
\predfigsize
\epsfbox{bm2000.epsf}
\\
(a) AR Models & (b) BM Models \\
\predfigsize
\epsfbox{ma2000.epsf}
&
\predfigsize
\epsfbox{arma2000.epsf}
\\
(c) MA Models & (d) ARMA Models \\
\predfigsize
\epsfbox{arima2000.epsf}
&
\predfigsize
\epsfbox{arfima2000.epsf}
\\
(e) ARIMA Models & (f) ARFIMA Models \\
\end{tabular}
}
\caption{Timing of various prediction models, 2000 sample fits.}
\label{fig:ts2000}
\end{figure}

We measured the costs, in terms of system and user time required to
(1) fit a model and create a predictor and (2) step one measurement
into the predictor producing one set of 30-step-ahead predictions.
Because the time to fit a model is dependent on the length of the
measurement sequence, while the predictor time is not, we measured the
costs for two different measurement sequence lengths, 600 samples and
2000 samples.  The measurement sequence used was a representative host
load trace.

The results are shown in Figures~\ref{fig:ts600} and~\ref{fig:ts2000}.
Each figure contains six plots, one for the (a) MEAN, LAST, and AR
models, and one each for the remaining (b) BM, (c) MA, (d) ARMA, (e)
ARIMA, and (f) ARFIMA models.  The \REFIT\ variants were not measured,
although their performance can certainly be derived from the
measurements we did take.  For each model, we plot several different
and interesting combinations of parameter values.  For each
combination, we plot two bars, the first bar (Fit/Init) plots the time
to fit the model and produce a predictor while the second bar
(Step/Predict) plots the time to step that predictor.  Each bar is the
average of 30 trials, each of which consists of one Fit/Init step and
a large number of Step/Predict steps.  The y axis on each plot is
logarithmic.  We replicate some of the bars from graph to graph to
simplify comparing models across graphs and we also draw horizontal
lines at roughly 1 ms and 100 ms, which are the Fit/Init times of
AR(16) and AR(512) models, respectively.  1 ms is also the
Step/Predict time of an AR(512) predictor.

There are several important things to note when examining
Figures~\ref{fig:ts600} and~\ref{fig:ts2000}.  First, the inclusion of 
LAST and MEAN on the (a) plots provides a measure of the overhead of the
predictor and modeler abstractions, since LAST's predictor and MEAN's
modeler do hardly any work.  As we can see, the overhead of the
abstractions is quite low and on par with virtual function calls, as
we might expect.

The second important observation from the figures is that AR models,
even with very high order, are quite inexpensive to fit.  An AR(512)
fit on a 2000 element sequence takes about 100 ms.  In fact, ignoring
LAST and MEAN, the only real competition to even the AR(512) comes
from very low order versions of the other models.  The downside of
high order AR models is that the Step/Predict time tends to be much
higher than that of lower order versions of the more complex models.
For example, the predictor for an ARIMA(8,3,8) model operates in 1/100
the time of an AR(512).  This is because the number of operations an
eta-theta predictor performs is linear in the number of model
parameters.  If very high measurement rates are important, these more
parsimonious models may be preferable.  Interestingly, the ARFIMA
models also have very expensive predictors.  This is because, although 
the model captures long-range dependence very parsimoniously in the
form of the $d$ parameter, we multiply out the $(1-B)^d$ term to
generate 300 coefficients in the eta-theta predictor.  It is not clear 
how to avoid this.

A final observation is that the MA, ARMA, and ARIMA models, quite
surprisingly, are considerably more expensive to fit than the much
more complex ARFIMA models.  This is because we use a highly-tuned
maximum likelihood code that assumes a normal error distribution to
fit the ARFIMA model.  The MA, ARMA, and ARIMA models are fit without
making this assumption using a function optimizer which does not
require derivatives.  We used this approach because experimentation
with Matlab, which uses a maximum likelihood approach, showed that the
assumption was rarely valid for traces we were interested in.  Maximum
likelihood based modelers for MA, ARMA, and ARIMA models would reduce
their Fit/Init times to a bit below those of the ARFIMA models.
However, fitting even high-order AR models should still be cheaper
because AR($p$) models are fit by solving a $p$-wide Toeplitz system,
while the other models require some form of function optimization over
their parameters.



\section{Mirror communication template library}
\label{sec:mirror}

\begin{figure}
\centerline{
\begin{tabular}{cc}
\colfigwidth
\epsfbox{mirror.eps} &
\colfigwidth
\epsfbox{mirror_detail.eps} \\
(a) & (b) \\
\end{tabular}
}
\caption{The mirror abstraction (a) and implementation (b).}
\label{fig:mirror}
\end{figure}

As we began to implement an on-line resource prediction service for
host load and contemplated implementing another for network bandwidth,
we discovered that we were often rewriting the same communication code
in each new program.  As the number of such prediction components
began to grow and we sought to incorporate more sophisticated
communication transports such as multicast IP, the situation became
untenable.  Stepping back, we factored out the communication
requirements of the prediction components and decided to implement
support for them separately.

\subsection{Motivation}

Consider Figure~\ref{fig:predoverview}, which shows a high level view
of how the components of an on-line prediction service communicate.
Notice that each component is roughly similar in how it
communicates with other components.  It receives data from one or more
input {\em data streams} and sends data to one or more output data
streams.  When a new data item becomes available on some input data
stream, the component performs computation on it and forwards it to
all of the output data streams.  In addition to this data path, the
component also provides {\em request-response control} which operates
asynchronously.  We refer to this abstraction as a {\em mirror} (with
no computation, input data is ``reflected'' to all of the outputs) and
illustrate it in Figure~\ref{fig:mirror}(a).

We wanted to be able to implement the communication a mirror performs in
different ways depending on where the components are situated and how
many there are.  For example, the predictor component in
Figure~\ref{fig:predoverview} accepts a stream of measurements from a
sensor and produces a stream of predictions which are consumed by the
buffer and the evaluator.  In addition, the evaluator and the user can
asynchronously request that the predictor refit its model.
Applications may connect at any time to receive the prediction
stream.  If we colocate all of the components on a single machine, Unix
domain sockets or even pipes might be the fastest communication
mechanism to use.  If the components are on different machines --- for
example, it may be impossible to locate any components other than the
sensor on a network router --- TCP-based communication may be
preferable.  If a large number of applications are interested in the
prediction stream, it may be necessary to use multicast IP in order to
keep network traffic low.  This is the kind of flexibility we expected 
from our mirror implementation.

\subsection{Rejected implementation approaches}

We considered five different ways of implementing the mirror
abstraction.  The first possibility was to implement each component as
a CORBA~\cite{CORBA-FUND-PROG-SIEGEL-BOOK} server and use remote
method invocation (RMI) to transfer data for the streams as well as
for control communication.  This would have the advantage that the IDL
compiler would generate all of the necessary object serialization code
for us and simplify making changes in the future.  However, it would
require users of \RPS\ to buy, install, and understand a CORBA
implementation.  Furthermore, there is no single CORBA implementation
that is available on all the platforms we wanted to be able to port
to, and some of our target platforms did not have a CORBA
implementation at all.

The second possibility was to use an even higher level tool, the CORBA
event channel service~\cite{CORBA-FUND-PROG-SIEGEL-BOOK}, pp. 196--204.  Event
channels essentially provide the data path part of our mirror
abstraction, but without computation.  Multiple producers place atomic
events into the channel and the channel delivers them to all
listeners.  An input event channel driving a single request/response
server driving an output event channel would serve to implement a
mirror.  While this approach would have greatly simplified our
implementation, it has the disadvantage that it would require users of
\RPS\ to buy, install, and understand not only a CORBA implementation
but also an event channel implementation.  Ultimately, we decided this
would be too heavyweight.

The third possibility was to use Java.  Java RMI~\cite{JAVA-RMI-SPEC}
is in some ways even more powerful than CORBA RMI in that more complex
object structures can be automatically serialized.  Java also provides
threads, which would have considerably simplified programming mirrors.
However, we would have had to make extensive use of the Java Native
Interface (JNI) to call our prediction library and sensor code, both
of which provide only C++ interfaces.  Our experience with JNI on some
of the less common platforms we wanted to support, especially
FreeBSD and NetBSD, led us to fear the Java approach.  Avoiding JNI by
porting the rest of \RPS\ to Java was not really an option since some
components, such as sensors, must use native code.

The fourth possibility was to use our home-grown distributed object
system, LDOS.  LDOS consists of a CORBA IDL compiler and a lightweight
run-time system.  The combination allows for the easy development of
C++ objects that can be called over the network via object reference
stubs, a dynamic invocation mechanism, or via http and an html forms
interface.  Using LDOS would confer several of the benefits of the
CORBA and Java approaches without requiring users to purchase CORBA
implementations or burdening us with the complexity of JNI.  However,
LDOS relies heavily on native threads (pthreads or Win32 threads),
which are simply not available on some of our target platforms, such
as NetBSD and FreeBSD.  

The final possibility we considered, and which we ultimately decided
to implement, was a ``from scratch'' mirror implementation based on
sockets and the Unix select call.

\subsection{Implementation}

Our mirror implementation, illustrated in Figure~\ref{fig:mirror}(b),
is a C++ template class which is parameterized at compile-time by
handlers for stream input and for request/response input.
Additionally, it is parameterized by handlers for new connection
arrivals for streams and for request/response traffic, although the
default handlers are usually used for this functionality.
Parameterized stream input and request-response handlers are also
supplied for serializable objects, which can be used to hide all the
details of communication from the computation that a mirror performs 
for data or control.  Beyond this, there are other template classes
and default handler implementations to simplify using a mirror.  For
example, our prediction mirror implementation uses one of these
templates, \verb.FilterWithControl<>., to simplify its design:
\small
\begin{verbatim}
class Measurement : public SerializableInfo {...};
class PredictionResponse : public SerializableInfo {...};
class PredictionReconfigurationRequest : public SerializableInfo {...};
class PredictionReconfigurationResponse : public SerializableInfo {...};

class Prediction {
...
public: 
  static int Compute(Measurement &measure, 
                     PredictionResponse &pred);
...
};

class Reconfiguration {
...
public:
 static int Compute(PredictionReconfigurationRequest &req, 
                    PredictionReconfigurationResponse &resp);
...
};

typedef FilterWithControl<
  Measurement,
  Prediction,
  PredictionResponse,
  PredictionReconfigurationRequest,
  Reconfiguration,
  PredictionReconfigurationResponse
> PredictionMirror;
\end{verbatim}
\normalsize
To implement serialization, the classes descended from
\verb.SerializableInfo. implement methods for getting their packed
size and for packing and unpacking their data to and from a buffer
object.  The implementer of the prediction mirror does not write any
communication code; it is all provided by the
\verb.FilterWithControl. template, which ultimately expands into a
mirror template.
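To make the serialization contract concrete, the following sketch
shows the shape of a serializable class.  The method names
(\verb.GetPackedSize., \verb.Pack., \verb.Unpack.) and the
\verb.Buffer. type are illustrative assumptions based on the
description above, not the exact \RPS\ signatures:
\small
\begin{verbatim}
// Hypothetical sketch; method names and the Buffer type are
// assumptions, not the actual RPS interfaces.
class Measurement : public SerializableInfo {
public:
  double timestamp;  // when the measurement was taken
  double value;      // the measured quantity

  // Report how many bytes Pack() will produce.
  int GetPackedSize() const { return 2*sizeof(double); }
  // Write the fields into the buffer in a fixed order.
  int Pack(Buffer &buf) const {
    buf.Pack(timestamp);
    buf.Pack(value);
    return 0;
  }
  // Read the fields back in the same order.
  int Unpack(Buffer &buf) {
    buf.Unpack(timestamp);
    buf.Unpack(value);
    return 0;
  }
};
\end{verbatim}
\normalsize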

As shown in Figure~\ref{fig:mirror}(b), the heart of a mirror is a
select statement that waits for activity on the file descriptors
associated with the various input streams, request/response ports, and
ports where new connections arrive.  Streams can also originate from
in-process sources, and so the select includes a timeout for
periodically calling back to these local sources to get input stream
data.  

When the select falls through, all local callbacks that are past due
are executed, and the corresponding stream handler is run on each new
data item.  Next, each open file descriptor that has a read pending on
it is passed to its corresponding stream, request/response, or new
connection handler.  A stream handler will unserialize an input data
item from the stream and perform computation on it, yielding an output
data item that it passes to the mirror's data forwarder component.

The data forwarder will then serialize the item to all the open output
streams.  If a particular output stream is not writeable, it will
buffer the write and register a handler with the selector to be called
when the stream is once again writeable.  This guarantees that the
mirror's operation will not block due to an uncooperative
communication target.  

A request/response handler will unserialize the input data item from
the file descriptor, perform computation yielding an output data item,
and then serialize that output data item onto the same file
descriptor.  A new connection handler will simply accept the new
connection, instantiate the appropriate handler for it (i.e., stream
or request/response), and then register the handler with the connection
manager.
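The following self-contained sketch illustrates the shape of this
event loop.  It is a simplification, not the actual mirror template:
handler dispatch is reduced to stub functions, and the scheduling of
local callbacks to a single fixed period:
\small
\begin{verbatim}
#include <sys/select.h>
#include <sys/time.h>
#include <vector>

// Simplified stand-ins for the mirror's handlers; the real mirror
// dispatches to template-supplied handler objects instead.
void HandleStreamData(int fd)      { /* unserialize, compute, forward */ }
void HandleRequestResponse(int fd) { /* unserialize, compute, reply   */ }
void HandleNewConnection(int fd)   { /* accept, register new handler  */ }
void RunPastDueLocalCallbacks()    { /* poll in-process sources       */ }

enum FdKind { STREAM, REQRESP, LISTENER };
struct Source { int fd; FdKind kind; };

void MirrorLoop(std::vector<Source> &sources, long period_usec)
{
  for (;;) {
    fd_set readfds;
    FD_ZERO(&readfds);
    int maxfd = -1;
    for (size_t i = 0; i < sources.size(); i++) {
      FD_SET(sources[i].fd, &readfds);
      if (sources[i].fd > maxfd) maxfd = sources[i].fd;
    }
    // The timeout bounds how long we wait, so that in-process
    // sources are polled periodically even if no fd is active.
    struct timeval timeout;
    timeout.tv_sec  = period_usec / 1000000;
    timeout.tv_usec = period_usec % 1000000;
    int rc = select(maxfd + 1, &readfds, 0, 0, &timeout);
    RunPastDueLocalCallbacks();
    if (rc <= 0) continue;  // timeout or interrupted: nothing to read
    for (size_t i = 0; i < sources.size(); i++) {
      if (!FD_ISSET(sources[i].fd, &readfds)) continue;
      switch (sources[i].kind) {
        case STREAM:   HandleStreamData(sources[i].fd);      break;
        case REQRESP:  HandleRequestResponse(sources[i].fd); break;
        case LISTENER: HandleNewConnection(sources[i].fd);   break;
      }
    }
  }
}
\end{verbatim}
\normalsize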

The mirror class knows about a variety of different transport
mechanisms, in particular, TCP, UDP (including multicast IP), Unix
domain sockets, and pipes or file-like entities.  The user asks the
mirror to begin listening at a particular port for data or control
messages either through an explicit mirror interface or by using
\verb.EndPoint.s, which are objects that encapsulate all of a mirror's
available transport mechanisms and can parse a string into an internal
representation of a particular transport mechanism.  


\subsection{Example}

Here is how the prediction server instantiates a prediction mirror
that will receive measurements from a host named ``pyramid'' using TCP
at port 5009, support reconfiguration requests via TCP at port 5010,
and send predictions to all parties that connect via TCP at port 5011
or listen via multicast IP at address 239.99.99.99, port 5012, and
also to standard out:
\small
\begin{verbatim}
PredictionMirror mirror;
EndPoint tcpsource, tcpserver, tcpconnect;
EndPoint multicasttarget, stdouttarget;

tcpsource.Parse("source:tcp:pyramid:5009");
tcpserver.Parse("server:tcp:5010");
tcpconnect.Parse("connect:tcp:5011");
multicasttarget.Parse("target:udp:239.99.99.99:5012");
stdouttarget.Parse("target:stdio:stdout");

mirror.AddEndPoint(tcpsource);
mirror.AddEndPoint(tcpserver);
mirror.AddEndPoint(tcpconnect);
mirror.AddEndPoint(multicasttarget);
mirror.AddEndPoint(stdouttarget);

mirror.Run();
\end{verbatim}
\normalsize

In order to simplify writing clients for mirrors, we also implemented
three reference classes: one for streaming input, one for streaming
output, and one for request/response transactions.  Here is how a
client would begin to receive the multicasted prediction stream
produced by the mirror code above:
\small
\begin{verbatim}
EndPoint source;
StreamingInputReference<PredictionResponse> ref;
PredictionResponse pred;

source.Parse("source:udp:239.99.99.99:5012");
ref.ConnectTo(source);
while (...) { 
   ref.GetNextItem(pred);
   pred.Print();
}
ref.Disconnect();
\end{verbatim}
\normalsize
Similarly, here is the code that a client might use to reconfigure the
prediction mirror via its TCP request/response interface, assuming the
mirror is running on mojave:
\small
\begin{verbatim}
EndPoint source;
Reference<PredictionReconfigurationRequest,
          PredictionReconfigurationResponse> ref;
PredictionReconfigurationRequest  req(...);
PredictionReconfigurationResponse resp;

source.Parse("source:tcp:mojave:5010");
ref.ConnectTo(source);
ref.Call(req,resp);
resp.Print();
\end{verbatim}
\normalsize

\section{Prediction components}
\label{sec:predcomp}

\begin{figure}
\small
\centerline{
\begin{tabular}{|l|l|}
\hline
Component & Function \\
\hline
\multicolumn{2}{|c|}{Host Load Measurement} \\
\hline
loadserver    & Generates stream of host load measurements \\
loadclient    & Prints loadserver's stream                 \\
loadreconfig  & Changes a loadserver's host load measurement rate          \\
loadbuffer    & Buffers measurements with request/response access \\
loadbufferclient & Provides access to a loadbuffer \\
load2measure  & Converts a load measurement stream to a generic
measurement stream \\
\hline        
\multicolumn{2}{|c|}{Flow Bandwidth Measurement} \\
\hline
flowbwserver    & Generates stream of flow bandwidth measurements
using Remos \\
flowbwclient    & Prints flowbwserver's stream                 \\
flowbwreconfig  & Reconfigures a running flowbwserver          \\
flowbwbuffer    & Buffers a flow bandwidth measurement stream with request/response access \\
flowbwbufferclient & Provides access to a flowbwbuffer \\
flowbw2measure  & Converts a flow bandwidth measurement stream to a generic
measurement stream \\
\hline        
\multicolumn{2}{|c|}{Measurement Management} \\
\hline
measureclient  &  Prints a generic measurement stream \\
measurebuffer  &  Buffers generic measurements with request/response
access \\
measurebufferclient & Provides access to a measurebuffer \\
\hline        
\multicolumn{2}{|c|}{Stream-based Prediction}  \\
\hline
predserver        & Computes predictions for a generic measurement stream \\
predserver\_core   & Performs predserver's actual computations so that
failures are contained \\
predreconfig      & Reconfigures a running predserver \\
evalfit           & Evaluates a running predserver and reconfigures it
when necessary \\
predclient        & Prints a prediction stream \\
predbuffer        & Buffers a prediction stream with request/response
access \\
predbufferclient  & Provides access to a predbuffer \\
\hline        
\multicolumn{2}{|c|}{Request/Response Prediction}  \\
\hline
pred\_reqresp\_server & Computes ``one-off'' predictions for
request/response clients \\
pred\_reqresp\_client & Makes ``one-off'' prediction requests on a
pred\_reqresp\_server \\
\hline
\end{tabular}
}
\caption{Prediction components implemented using \RPS\ libraries.}
\label{fig:predcomps}
\normalsize
\end{figure}

Using the functionality implemented in the sensor libraries of
Section~\ref{sec:sensors}, the time series prediction library of
Section~\ref{sec:tspl}, and the mirror communication template library
of Section~\ref{sec:mirror}, we implemented a set of {\em prediction
components}.  Each component is a program that implements a specific
\RPS\ function.   On-line resource prediction systems are implemented by
composing these components.  The communication connectivity of a
component is specified via command-line arguments, which means the
location of the components and what transport any two components use
to communicate can be determined at startup time.  In addition, the
components also support transient connections to allow run-time
reconfiguration and to permit multiple applications to use their
services. In Section~\ref{sec:host_load_pred_sys} we compose an
on-line host load prediction system out of the components we describe
in this section.

We implemented a large set of prediction components, which are shown
in Figure~\ref{fig:predcomps}.  They fit into five basic groups: host
load measurement, flow bandwidth measurement, measurement management,
stream-based prediction, and request/response prediction.  

The host load measurement and flow bandwidth measurement groups
implement sensors and tools for working with them.  In each group, the
sensor component (e.g., loadserver, flowbwserver) generates a stream of
sensor-specific measurements, while the other components provide
mechanisms to control the sensor, read the measurement streams, buffer
the measurement streams to provide asynchronous request/response
access to the measurements, and, finally, convert sensor-specific
measurements into a generic measurement type.  The remainder of the
components use these generic measurement streams.

The measurement management group provides tools for receiving generic
measurement streams, buffering generic measurements, and accessing
such buffers.

The stream-based prediction group provides continuous prediction
services for generic measurement streams.  Predserver is the main
component in this group.  When started up, it retrieves a measurement
sequence from a measurebuffer, fits the desired model to it, and then
creates a predictor.  As new measurements arrive in the stream, they
are passed through the predictor to form $m$-step-ahead predictions
and corresponding estimates of prediction error.  These operations are
similar to those described in Section~\ref{sec:ts_example}.  The
actual work is done by a subprocess, predserver\_core.  This limits
the impact of a crash caused by a bad model fit.  If predserver\_core
crashes, predserver simply starts a new copy.
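Here is a minimal, self-contained sketch of this containment strategy.
The real predserver also reconnects predserver\_core's input and
output streams after a restart, which we elide:
\small
\begin{verbatim}
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Keep the worker subprocess alive: if it dies (e.g., crashes on
// a numerically bad model fit), simply start a new copy.  Stream
// plumbing between parent and child is elided in this sketch.
void SuperviseCore(const char *path, char *const argv[])
{
  for (;;) {
    pid_t pid = fork();
    if (pid == 0) {
      execv(path, argv);  // child becomes predserver_core
      _exit(1);           // only reached if exec fails
    }
    int status;
    waitpid(pid, &status, 0);  // block until the child exits
    // loop around and restart the core
  }
}
\end{verbatim}
\normalsize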

Predserver also provides a request/response control interface for
changing the type of model, the length of the sequence to which the
model is fit, and the number, $m$, of predictions it will make.  This
interface can be used by the user through the predreconfig program.
Alternatively, and even at the same time, evalfit can use the
interface.  Evalfit receives a generic measurement stream and a
prediction stream, and continuously evaluates the quality of the
predictions using an evaluator as discussed in
Section~\ref{sec:evaluator}.  When the prediction quality exceeds
limits set by the user, evalfit will force the predserver it is
monitoring to refit the model.
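To make this concrete, here is a hedged sketch of evalfit's core loop,
written against the reference classes shown in
Section~\ref{sec:mirror}.  The pairing of each measurement with the
prediction made for it $m$ steps earlier is elided, and
\verb.ErrorExceedsLimits. and the request constructor arguments are
hypothetical:
\small
\begin{verbatim}
// Hedged sketch of evalfit's main loop; ErrorExceedsLimits and
// the request constructor arguments are hypothetical.
EndPoint measurements, predictions, control;
StreamingInputReference<Measurement> mref;
StreamingInputReference<PredictionResponse> pref;
Reference<PredictionReconfigurationRequest,
          PredictionReconfigurationResponse> ctl;
Measurement meas;
PredictionResponse pred;

mref.ConnectTo(measurements);
pref.ConnectTo(predictions);
ctl.ConnectTo(control);
while (...) {
  mref.GetNextItem(meas);
  pref.GetNextItem(pred);
  // Compare each measurement against the prediction made for
  // it m steps earlier (bookkeeping elided) and maintain a
  // running assessment of prediction quality.
  if (ErrorExceedsLimits(meas,pred)) {
    PredictionReconfigurationRequest  req(...);
    PredictionReconfigurationResponse resp;
    ctl.Call(req,resp);   // force the predserver to refit
  }
}
\end{verbatim}
\normalsize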

The remaining components in the stream-based prediction group simply
provide buffering and client functionality for prediction streams.

The request/response prediction group provides classic client/server
access to the time series prediction library.  Pred\_reqresp\_client
sends a measurement sequence and a model template to
pred\_reqresp\_server, which fits a model and returns predictions for
the next $m$ values of the sequence.

It is important to note that the set of prediction components is
not fixed.  It is quite easy to construct new components using the
libraries we described earlier.  Indeed, we constructed additional
components for the performance evaluation we describe in the next
section.


\section{Performance}
\label{sec:on-line_perf}

The \RPS-based prediction components described in the previous
section are composed at startup time to form on-line prediction
systems.  To evaluate the performance of \RPS\ for constructing such
systems, we constructed a representative \RPS-based prediction system
and measured its performance in terms of the timeliness of its
predictions, the maximum measurement rates that can be achieved, and
the additional computational and communication load it places on the
distributed system.  In addition to the composed system, we also
constructed a monolithic system using the \RPS\ libraries directly
and measured the maximum measurement rates it could support.  

The conclusion of our study is that, for interesting measurement
rates, both the composed and the monolithic systems can provide timely
predictions using only tiny amounts of CPU time and network bandwidth.
In addition, the maximum achievable measurement rates are 2 to 3
orders of magnitude higher than we currently need. 

It is important to note that \RPS\ is a toolkit for resource
prediction, and, because of the inherent flexibility of such a design,
it is difficult to measure \RPS's performance for creating on-line
resource prediction systems in a vacuum.  Prediction components can be
composed in many different ways to construct on-line resource
prediction systems and the \RPS\ libraries enable the construction of
additional components or increased integration of functionality.
Furthermore, prediction components can communicate in different ways.
Finally, different resources require different measurement rates and
predictive models.  More complex predictive models require more
computational resources while higher measurement rates require more
computational and communication resources.

Because of the intractability of attempting to characterize this
space, we instead focused on measuring the performance of a
representative \RPS-based on-line host load prediction system.  The
system is representative in the sense that it implements the
functionality of Figure~\ref{fig:predoverview} using the prediction
components.  Furthermore, it is also a realistic system.  It uses a
predictive model that we have found appropriate for host load
prediction in other work~\cite{DINDA-LOAD-PRED-HPDC-99}.  Finally, it
is a fairly widely used system which has been distributed with
Remos~\cite{REMOS-HPDC98}, is currently used in
QuO~\cite{QUO-JOURNAL-98}, and is currently being used in our
distributed real-time scheduling
work~\cite{DINDA-CASE-BERT-WPDRTS-99}.

\subsection{Host load prediction system}
\label{sec:host_load_pred_sys}

\begin{figure}
\centerline{
\colfigwidth
\epsfbox{loadpred.eps} 
}
\caption{Online host load prediction system composed out of the \RPS\ prediction components described in Section~\ref{sec:predcomp}. }
\label{fig:loadpredsys}
\end{figure}

Figure~\ref{fig:loadpredsys} shows the configuration of prediction
components used for host load prediction.  The boxes in the figure
represent prediction components, dark arrows represent stream
communication between components, and symmetric arrows represent
request/response communication between components.  The arrows are
annotated with communication volumes per cycle of operation of the
system for streams and per call for request/response communication.
$s$ is the number of measurements being requested asynchronously from
the measurebuffer while $m$ is the number of steps ahead for which
predictions are made and $w$ is the number of predictions being
requested from the predbuffer.  Notice the similarity of this system
to the high-level view of Figure~\ref{fig:predoverview}.

The system works as follows.  The loadserver component periodically
measures the load on the host on which it is running using the
GetLoadAvg library described in Section~\ref{sec:sensors}.  Each new
measurement is forwarded to any attached loadclients and also to
load2measure, which converts it to a generic measurement form and
forwards it to measurebuffer.  Measurebuffer buffers the last $N$
measurements and provides request/response access to them.  It also
forwards the current measurement to predserver and evalfit.
Predserver consumes the measurement and produces an $m$-step-ahead
prediction using its subprocess, predserver\_core.  It forwards the
prediction to predbuffer and to evalfit.  Evalfit continuously
compares predserver's predictions with the measurements it receives
from measurebuffer and computes its own assessment of the quality of
the predictions.  For each new measurement, it compares its assessment
with the requirements the user has specified as well as with the
predictor's own estimates of their quality.  When quality limits are
exceeded it calls predserver to refit the model.  Predserver's
predictions also flow to predbuffer, which provides request/response
access to some number of previous predictions and also forwards the
predictions to any attached predclients.  Predbufferclients can
asynchronously request predictions from predbuffer.  Of course,
applications can decide, at any time, to access the prediction stream
or the buffered predictions in the manner of predclient and
predbufferclient.

Each measurement that loadserver produces is timestamped.  This
timestamp is passed along as the measurement makes its way through the
system and is joined with a timestamp for when the corresponding
prediction is completed, and for when the prediction finally arrives
at an attached predclient.  We shall use these timestamps to measure
the latency from when a measurement is made to when its corresponding
prediction is available for applications.
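For example, a client can compute the end-to-end latency directly from
a received timestamp; this small sketch assumes only
\verb.gettimeofday. and a timestamp expressed in seconds:
\small
\begin{verbatim}
#include <sys/time.h>

// Seconds since the epoch, with microsecond resolution.
double Now()
{
  struct timeval tv;
  gettimeofday(&tv, 0);
  return tv.tv_sec + tv.tv_usec / 1e6;
}

// Latency from when the measurement was taken to when its
// prediction became available at this client; the argument is
// the timestamp carried through the system with the prediction.
double PredictionLatency(double measurementtimestamp)
{
  return Now() - measurementtimestamp;
}
\end{verbatim}
\normalsize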

The system can be controlled in various ways.  For example, the user
can change loadserver's measurement rate, the predictive model that
predserver uses, and the time horizon for predictions.  We used the
control over loadserver's measurement rate to help determine the
computational and communication resources the system uses.

So far, we have not specified where each of the components runs or how
the components communicate.  As we discussed in the previous section,
\RPS\ lets us defer these decisions until startup time
and even run-time.  In the study we describe in this section, we ran
all of the components on the same machine and arranged for them to
communicate using TCP.  The machine we used is a 500 MHz Alpha
21164-based DEC personal workstation.

This configuration of prediction components is an interesting one to
measure.  It is reasonable to run all the components on a single
machine since relatively low measurement rates and reasonably simple
predictive models are sufficient for host load
prediction~\cite{DINDA-LOAD-PRED-HPDC-99}.  It would be more efficient
to use a local IPC mechanism such as pipes or Unix domain sockets to
communicate between components.  Indeed, a production host load
prediction system might very well be implemented as a single process.
We briefly discuss the performance of such an implementation in
Section~\ref{sec:all-in-one}.  TCP is interesting to look at because
it gives us some idea of how well an \RPS-based system might perform
running on multiple hosts, which might be desirable, for, say, network
bandwidth prediction.  Furthermore, if \RPS\ can achieve reasonable
performance levels in such a flexible configuration, it is surely the
case that a performance-optimized \RPS-based system would do at least
as well.

The predictive model that is used is an AR(16) fit to 600 samples and
evalfit is configured so that model refitting does not occur.
Predictions are made 30 steps into the future.  The default
measurement rate is 1 Hz.  This model and rate are appropriate for host
load prediction, as we discuss in earlier
work~\cite{DINDA-STAT-PROP-HOST-LOAD-SCIPROG-99,DINDA-LOAD-PRED-HPDC-99}.

The following illustrates how the various prediction components are
started:
\small
\begin{verbatim}
% loadserver 1000000 server:tcp:5000 connect:tcp:5001 &
% loadclient source:tcp:`hostname`:5001 &
% load2measure 0 source:tcp:`hostname`:5001 connect:tcp:5002 &
% measurebuffer 1000 source:tcp:`hostname`:5002 
   server:tcp:5003 connect:tcp:5004 &
% predserver source:tcp:`hostname`:5004 
   source:tcp:`hostname`:5003 server:tcp:5005 connect:tcp:5006 &
% evalfit source:tcp:`hostname`:5004 source:tcp:`hostname`:5006 
   source:tcp:`hostname`:5005 
   30 999999999 1000.0 999999999 600 30 AR 16  &
% predbuffer 100 source:tcp:`hostname`:5006 server:tcp:5007 
   connect:tcp:5008 &
% predclient source:tcp:`hostname`:5008  &
\end{verbatim}
\normalsize
The measurebufferclient and predbufferclient components are not shown
above since they are run only intermittently.

\subsection{Limits}
\label{sec:perf_limits}

Before we present the details of the performance of the host load
prediction system, it is a good idea to understand the limits of
achievable performance on this machine.  Recall from
Section~\ref{sec:sensors} that the host load sensor library requires
only about 1.6 $\mu$s to acquire a sample.  As for the cost of
prediction, Figure~\ref{fig:ts600} indicates that fitting and
initializing an AR(16) model on 600 data points requires about 1 ms of
CPU time, with a step/predict time of about 100 $\mu$s.  The
computation involved in evalfit, load2measure, and the various buffers
amounts to about 50 $\mu$s, thus the total computation time per cycle
is $151.6$ $\mu$s. If no communication was involved, we would expect
the prediction system to operate at a rate no higher than 6.6 KHz.

However, the prediction system also performs communication.
Examination of Figure~\ref{fig:loadpredsys} indicates that, for
30-step-ahead ($m=30$) predictions, eight messages are sent for each
cycle.  There are 3 28 byte messages, 2 52 byte messages, and 3 536
byte messages.  The measured bandwidths of the host for messages of
this size are 2.4 MB/s (28 bytes), 4.2 MB/s (52 bytes), and 15.1 MB/s
(536 bytes).  Therefore the lower bound transfer times for these
messages are 11.7 $\mu$s (28 bytes), 12.4 $\mu$s (52 bytes), and 35.5
$\mu$s (536 bytes).  The total communication time per cycle is
therefore at least $3(11.7) + 2(12.4) + 3(35.5) = 166.4$ $\mu$s, and
the total time per cycle is at least $151.6 + 166.4 = 318$ $\mu$s,
which suggests a corresponding upper bound on the system's rate of
about 3.1 KHz.
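Recasting this accounting as a single bound (the same numbers as
above, not an additional measurement):
\begin{displaymath}
t_{cycle} \geq \underbrace{1.6 + 100 + 50}_{\mbox{\small computation}} +
\underbrace{3(11.7) + 2(12.4) + 3(35.5)}_{\mbox{\small communication}}
= 318\ \mu\mbox{s},
\end{displaymath}
which corresponds to a maximum rate of
$1/(318\ \mu\mbox{s}) \approx 3.1$ KHz.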

It is important to note that these rates are far in excess of the 1 Hz
rate we expect from a host load prediction system, or even the 14 Hz
peak rate for our Remos-based network sensor.  What these high rates
suggest, however, is that, for rates of interest to us, we can expect
the prediction system to use only a tiny percentage of the CPU.  In
terms of communication load, we will only place $3(28) + 2(52) +
3(536) = 1796$ bytes onto the network per cycle.  At a rate of 14 Hz,
this amounts to about 25 KB/s of traffic.

Of course, these are the upper limits of what is possible.  We would
expect the overheads of the mirror communication template library and
the data copying implied by the TCP communication we use to result in
lower performance levels. 

The host we evaluated the system on has a timer interrupt rate of 1024
Hz, which means that all measurement rates in excess of this amount to
``as fast as possible.''  This rate also results in a clock accuracy
of approximately one millisecond.



\subsection{Evaluation}
\label{sec:perf_eval}

We configured the host load prediction system so that the model will
be fit only once, and thus measured the system in steady state.  We
measured the prediction latency, communication bandwidth, and the CPU
load as functions of the measurement rate, which we swept from 1 Hz to
1024 Hz in powers of 2.  We found that the host load prediction system
can sustain measurement rates of up to 730 Hz with mean and median
prediction latencies of around 2 ms.  For measurement rates that are of
interest to us, such as the 1 Hz rate for load and the 14 Hz rate for flow
bandwidth, the additional load the system places on the machine is
minimal.

\subsubsection{Prediction latency}

\begin{figure}
\centerline{
\begin{tabular}{cc}
\colfigwidth
\epsfbox{predlat_mean.epsf}
&
\colfigwidth
\epsfbox{predlat_minmaxmed.epsf}
\\
(a) & (b) 
\end{tabular}
}
\caption{Prediction latency as a function of measurement rate: 
(a) 95\% confidence interval of mean latency, (b) Minimum, median,
maximum latency}
\label{fig:predlat}
\end{figure}

In an on-line prediction system, the timeliness of the predictions is
paramount.  No matter how good a prediction is, it is useless if it
does not arrive sufficiently earlier than the measurement it predicts.
We measured this timeliness in the host load prediction system as the
latency from when a measurement becomes available to when the
prediction it generates becomes available to applications that are
interested in it.  This is the latency from the loadserver component
to the predclient component in Figure~\ref{fig:loadpredsys}.

The prediction latency should be independent of the measurement rate
until the prediction system's computational or communication resource
demands saturate the CPU or the network.  Figure~\ref{fig:predlat}
shows that this is indeed the case.  Figure~\ref{fig:predlat}(a) plots
the 95\% confidence interval for the mean prediction latency as a
function of increasing measurement rates.  We do not plot the latency
for the 1024 Hz rate since at this point the CPU is saturated and the
latency increases with backlogged predictions.  Up to this point, the
mean prediction latency is roughly 2 ms.

Figure~\ref{fig:predlat}(b) plots the minimum, median, and maximum
prediction latencies as a function of increasing measurement rate.
Once again, we have elided the 1024 Hz rate since latency begins to
grow with backlog.  The median latency is 2 ms, while the minimum
latency is 1 ms, which is the resolution of the timer we used.
The highest latency we saw was 33 ms.

\subsubsection{Resource usage}

\begin{figure}
\centerline{
\begin{tabular}{cc}
\colfigwidth
\epsfbox{sweep.epsf}
&
\colfigwidth
\epsfbox{sweep_detail.epsf}
\\
(a) & (b) \\
\colfigwidth
\epsfbox{sweep_sysusr.epsf}
&
\colfigwidth
\epsfbox{sweep_load.epsf}
\\
(c) & (d) 
\end{tabular}
}
\caption{CPU load produced by system.  The measurement rate is swept from
1 Hz to 1024 Hz.  (a) shows total percentage of CPU used over time,
(b) is the same as (a) but includes operational details, (c) shows user and
system time, (d) shows load average.  }
\label{fig:sweep_load}
\end{figure}

In addition to providing timely predictions, an on-line resource
prediction system should also make minimal resource demands.  After
all, the purpose of the system is to predict resource availability for 
applications, not to consume the resources for itself.  

To measure the CPU usage of our representative host load prediction
system, we did the following.  First, we started two resource
monitors, vmstat and our own host load sensor.  Vmstat is run every
second and prints the percentage of the last second that has been
charged to system and user time.  Our host load sensor measures the
average run queue length every second.  After the sensors were
started, we started the prediction system at its default rate of 1 Hz
and let it quiesce.  Next, we started a predclient and let the system
quiesce.  Then, we swept the measurement rate from 1 Hz to 1024 Hz in
powers of 2.  For each of the 10 rates, we let the system quiesce.
Finally, we reset the rate to 1 Hz.  Figure~\ref{fig:sweep_load} shows
plots of what the sensors recorded over time.

Figure~\ref{fig:sweep_load}(a) shows the percentage of the CPU that
was in use over time, as measured by vmstat.
Figure~\ref{fig:sweep_load}(b) is the same graph, annotated with the
operational details described above.  Figure~\ref{fig:sweep_load}(c)
breaks down the CPU usage into its system and user components.  The
system component is essentially the time spent doing TCP-based IPC
between the different components.  Figure~\ref{fig:sweep_load}(d)
shows the output of the load average sensor.  When the load measured
by this sensor exceeds one, we have saturated the CPU.

There are several important things to notice about
Figure~\ref{fig:sweep_load}.  First, we can sustain a measurement rate
of between 512 Hz and 1024 Hz on this machine.  Interpolating, it
seems that we can sustain about a 730 Hz rate using TCP-based IPC, or
about 850 Hz ignoring the system-side cost of IPC.  While this is
nowhere near the upper bound of 3.1 KHz that we arrived at in
Section~\ref{sec:perf_limits}, it is still much faster than we
actually need for the purposes of host load prediction (1 Hz) and than
the limits of our network flow bandwidth sensor (14 Hz).  

In Section~\ref{sec:all-in-one}, we compare the maximum rate
achievable by this composed host load prediction system to a
monolithic system.  The monolithic system achieves much higher rates
overall, and those rates are closer to the upper bound.

A second observation is that for these interesting 1 and 14 Hz rates,
CPU usage is quite low.  At 1 Hz, it is around 2\%, while at 16 Hz
(the closest rate to 14 Hz) it is about 5\%.  For comparison, the
``background'' CPU usage measured when only running the vmstat probe
is itself around 1.5\%.  Figure~\ref{fig:sweep_load}(d) shows that
this is also the case when measured by load average.  


\begin{figure}
\small
\centerline{
\begin{tabular}{|c|c|}
\hline
Rate (Hz) & Bytes/sec \\
\hline
1    & 1796 \\
2    & 3592 \\
4    & 7184 \\
8    & 14368 \\
16   & 28736 \\
32   & 57472 \\
64   & 114944 \\
128  & 229888 \\
256  & 459776 \\
512  & 919552 \\
1024 & 1839104 \\
\hline
\end{tabular}
}
\normalsize
\caption{Bandwidth requirements as a function of measurement rate.}
\label{fig:bandwidth}
\end{figure}

Figure~\ref{fig:bandwidth} shows the bandwidth requirements of the
system at the different measurement rates.  To understand how small
these requirements are, consider a 1 Hz host load prediction system
running on each host in the network and multicasting its predictions
to each of the other hosts.  Approximately 583 hosts could multicast
their prediction streams in 1 MB/s of sustained traffic, with each
host using only 0.5\% of its CPU to run its prediction system.
Alternatively, 42 network flows measured at the maximum rate could be
predicted.  If each host or flow only used the network to provide
asynchronous request/response access to its predictions, many more
hosts and flows could be predicted.  For example, if prediction
requests from applications arrived at a rate of one per host per
second, introducing 552 bytes of traffic per prediction
request/response transaction, 1900 hosts could operate in 1 MB/s.
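These counts follow directly from the per-cycle message volume
computed in Section~\ref{sec:perf_limits}, taking 1 MB as $2^{20}$
bytes:
\begin{displaymath}
\frac{2^{20}}{1796} \approx 583, \qquad
\frac{2^{20}}{14 \times 1796} \approx 42, \qquad
\frac{2^{20}}{552} \approx 1900.
\end{displaymath}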

\subsubsection{A monolithic system}
\label{sec:all-in-one}

\begin{figure}
\small
\centerline{
\begin{tabular}{llccc}
\hline
System & Transport & Optimal Rate & Measured Rate & Percent of Optimal \\
\hline
Monolithic & In-process    &  6.6 KHz  & 5.3 KHz  & 80 \% \\
Monolithic & Unix domain socket & 5.5 KHz   & 3.6 KHz  & 65 \% \\
Monolithic & TCP           & 5.3 KHz   & 2.7 KHz  & 51 \% \\
Composed   & TCP           & 3.1 KHz   & 720 Hz   & 24 \% \\
\hline
\end{tabular}
}
\normalsize
\caption{Maximum measurement rates achieved by monolithic and composed 
host load prediction systems.}
\label{fig:max_rates}
\end{figure}

The composed host load prediction system we have described so far can
operate at a rate 52--730 times higher than we need and uses
negligible CPU and communication resources at the rates at which we
actually desire to operate it.  However, the maximum rate it can
sustain is only 24\% of the upper bound we determined in
Section~\ref{sec:perf_limits}.  To determine if higher rates are
indeed possible, we implemented a monolithic, single process host load
prediction system using the \RPS\ libraries directly.  This design can
sustain a peak rate of 2.7 KHz when configured to use TCP, which is
almost four times higher than the composed system.

Figure~\ref{fig:max_rates} shows the maximum rates the monolithic
system achieved for three transports: in process, where the client is
in the same process; Unix domain socket, where the (local) client
listens to the prediction stream through a Unix domain socket; and
TCP, where the client operates as with the earlier system.  For
comparison, it also includes the maximum rate of the composed system
described earlier.  In each case, we also show the optimal rate, which
is derived in a manner similar to Section~\ref{sec:perf_limits}.  The
in-process case shows us the overhead of using the mirror
communication template library, which enables considerable
flexibility.  That overhead is approximately 20\%.  The domain socket
and TCP cases include additional, unmodeled overheads that are
specific to these transports.


\section{Related work}
\label{sec:related}

Application-level schedulers~\cite{APPLES-HPDC96}, such as
best-effort real-time schedulers~\cite{DINDA-CASE-BERT-WPDRTS-99}, are 
the primary users of on-line resource prediction systems.  

Research into resource prediction has focused on determining
appropriate predictive models for host
behavior~\cite{DINDA-LOAD-PRED-HPDC-99,WOLSKI-PRED-CPU-AVAIL-HPDC-99,PRED-BASED-SCHED-DIST-COMP-SAMADANI-UNPUB-96},
and network
behavior~\cite{FORECASTING-NETWORK-NWS-WOLSKI-HPDC97,TIME-SERIES-MODELS-INTERNET-TRAFFIC-BASU-GT-TR-95,TIME-SERIES-MODEL-LONG-TERM-NSFNET-GROSCHWITZ-ICC94}.
\RPS\ is a toolkit that can facilitate this research.

Interactive data analysis tools such as
Matlab~\cite{MATLAB-MANUAL,MATLAB-SYSIDENT-MANUAL} and
S-Plus~\cite{SPLUS-MANUAL} provide many of the statistical and signal
processing procedures needed to study resource signals and to find
appropriate predictive models.  Ideally, large-scale, randomized
evaluations of such candidate models are then needed.  We have built
\RPS-based tools to help to efficiently conduct such studies.

Resource measurement systems, such as the Network Weather
Service~\cite{NWS-DESIGN-JOURNAL-99,WOLSKI-PRED-CPU-AVAIL-HPDC-99,FORECASTING-NETWORK-NWS-WOLSKI-HPDC97},
Remos~\cite{REMOS-HPDC98}, and
Topology-d~\cite{SERVICE-NET-AWARE-APPS-SPDT-98} provide sensors that
create the measurement streams that \RPS-based systems can attempt to
predict.

On-line resource prediction systems collect measurements from resource
measurement systems and use them to predict future measurements.
Other than \RPS, the Network Weather
Service~\cite{NWS-DESIGN-JOURNAL-99,WOLSKI-PRED-CPU-AVAIL-HPDC-99,FORECASTING-NETWORK-NWS-WOLSKI-HPDC97}
(NWS) is the only example of an on-line resource prediction system we
are aware of.  While NWS is a production system that tries to provide
a ubiquitous resource prediction service for metacomputing, \RPS\ is a
toolkit for constructing such systems and others.  The \RPS\ user can
commit to as little or as much of
\RPS\ as is desired.  NWS and \RPS\ are complementary.  For example, \RPS-based
systems could use NWS sensors, or NWS could use \RPS's predictive
models.

\section{Conclusion}
\label{sec:conclusion}

We have designed, implemented, and evaluated \RPS, a toolkit for
constructing on-line and off-line resource prediction systems in which
resources are represented by independent, periodically sampled,
scalar-valued measurement streams.  \RPS\ consists of resource sensor
libraries, an extensive time series prediction library, a
sophisticated communication library, and a set of prediction
components out of which resource prediction systems can be readily
composed.  The performance of \RPS\ is quite good.  We measured a
representative \RPS-based host load prediction system and found that
it provides timely predictions with minimal CPU and network load at
reasonable measurement rates.  The system can operate at measurement
rates approaching 1 KHz, while a second, monolithic \RPS-based system
can operate at 2.7 KHz.

These results support the feasibility of resource prediction in
general, and of using \RPS-based systems for resource prediction in
particular.  The \RPS-based host load prediction systems we
implemented are being used in our research into prediction-based
best-effort real-time systems~\cite{DINDA-CASE-BERT-WPDRTS-99} for
interactive applications such as scientific visualization
tools~\cite{DV-PRELIM-REPORT-PDPTA99}.  The systems have also been
distributed with the Remos measurement system~\cite{REMOS-HPDC98} and
are used in the QuO object quality of service
system~\cite{QUO-JOURNAL-98}.

We have several extensions in mind for \RPS.  Currently, the user is
responsible for instantiating and finding \RPS-based prediction
components.  In the future, we intend to provide a service, probably
over SLP~\cite{SLP-RFC}, to manage running \RPS\ components,
prediction systems, and the measurement streams that they predict.  We
also will provide high-level calls to estimate task execution times
based on application input and predicted host and network loads.  We
may increase the functionality of the mirror library's support for
request/response communication, making it more like a CORBA ORB.  This
would make it possible to write template classes to make the
construction of very fast monolithic prediction systems trivial.
Finally, we will gradually add new predictive models as they become
desirable.  Currently, we are interested in providing support for
nonlinear threshold autoregressive models~\cite{TAR-TONG-BOOK} and
models appropriate for chaotic
signals~\cite{ABARBANEL-CHAOTIC-DATA-BOOK}.


\small 
\tinyspacing 
\bibliography{texbib} 
\normalsize
\normspacing 

\end{document}
