\documentclass[11pt]{article}
\usepackage{times}
\usepackage{epsf}
\usepackage{fullpage}

\renewcommand{\dbltopfraction}{.9}
\renewcommand{\topfraction}{.9}
\renewcommand{\textfraction}{.1}
\renewcommand{\floatpagefraction}{.9}
%\renewcommand{\dbltopnumber}{1}
\renewcommand{\dblfloatpagefraction}{.9}

%
% goal: 3000 words
% start: 15000 words
%


\def\colfigsize{\epsfxsize=2.5in}

\def\pagefigsize{\epsfxsize=6in}
\def\squashedpagedfigsize{\epsfxsize=6in \epsfysize=1.5in}

\newcounter{ecount}
\newenvironment{ecompact}{
  \raggedright
  \begin{list}%
    {\arabic{ecount}}{\usecounter{ecount}
    \parsep 0pt plus 1pt
    \partopsep 0pt plus 1pt
    \topsep 2pt plus 2pt minus 1pt
    \itemsep 0pt plus 1pt}}%
  {\end{list}}


\newenvironment{icompact}{
  \raggedright
  \begin{list}{$\bullet$}{
    \parsep 0pt plus 1pt
    \partopsep 0pt plus 1pt
    \topsep 2pt plus 2pt minus 1pt
    \itemsep 0pt plus 1pt}}%
  {\end{list}}

\newcommand{\comment}[1]{}

\begin{document}

\title{A Running Time Advisor\\
(Extended Abstract) }
\author{
Peter A. Dinda \hspace{.5in} David R. O'Hallaron\\
\{pdinda, droh\}@cs.cmu.edu \\
School of Computer Science\\
Carnegie Mellon University\\
5000 Forbes Avenue\\
Pittsburgh, PA 15213
}
\maketitle

\section{Introduction}

Consider an application that wants to schedule a real-time task with
known resource requirements on one of the hosts of a shared
distributed computing environment that supports neither reservations
nor a global notion of priority.  If the application could predict the
running time of the task on each of the available hosts, it could
trivially choose an appropriate host to run the task.  Even if no
host existed on which the task could meet its original deadline, such
predictions of running time would permit the application to modify the
resource requirements of the task or its deadline until an appropriate
host could be found.

This paper describes a service, the running time advisor, that can
supply these predictions for the case of compute-bound tasks.  In
order to characterize the variability inherent to distributed systems
and to the process of prediction, the running time advisor predicts a
task's running time as a confidence interval computed to the
application's requested confidence level.  Confidence intervals
provide a simple abstraction to the application, but still provide
sufficient information to enable valid statistical reasoning in the
scheduling process. The running time advisor has the following interface:
\small
\begin{verbatim}
int PredictRunningTime(RunningTimePredictionRequest  &req,
                       RunningTimePredictionResponse &resp);

struct RunningTimePredictionRequest {
  Host   host;
  double conf;
  double tnom;
};

struct RunningTimePredictionResponse {
  Host   host;
  double tnom;
  double conf;
  double texp;
  double tlb;
  double tub;
};
\end{verbatim}
\normalsize
A query takes the form of a \verb.host., a confidence level
\verb.conf. (e.g., 95\%), and a nominal time \verb.tnom. ($t_{nom}$),
which is the running time of the task on an otherwise vacant machine.
A response consists of a copy of the request's fields, the expected
running time of the task \verb.texp. ($t_{exp}$), and the lower and
upper bounds of the \verb.conf. confidence interval for the running
time, [\verb.tlb., \verb.tub.] ($[t_{lb},t_{ub}]$).  $t_{exp}$ is a
point estimate that represents the most likely running time.  The
actual running time, $t_{act}$, will generally differ from
$t_{exp}$, but will be near it.  The confidence interval represents a range
of values around $t_{exp}$ such that $t_{act}$ will fall in the range a
fraction \verb.conf. of the time.
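To make the interface concrete, the following sketch shows how a client might use \verb.PredictRunningTime. to select a host whose predicted upper bound meets a deadline.  The \verb.Host. typedef, the host names, and the stub predictor body are illustrative assumptions only; the real advisor derives its response from host load predictions as described below.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Types from the advisor's interface; Host is assumed to be a hostname string.
typedef std::string Host;

struct RunningTimePredictionRequest {
  Host   host;
  double conf;   // requested confidence level, e.g. 0.95
  double tnom;   // nominal (unloaded) running time in seconds
};

struct RunningTimePredictionResponse {
  Host   host;
  double tnom, conf;
  double texp;       // expected running time
  double tlb, tub;   // bounds of the conf confidence interval
};

// Stub standing in for the real advisor: here each host simply has a
// fixed average load, so texp = tnom * (1 + load), with a +/-10% interval.
int PredictRunningTime(RunningTimePredictionRequest &req,
                       RunningTimePredictionResponse &resp) {
  double load = (req.host == "busy.cs.cmu.edu") ? 3.0 : 0.5;
  resp.host = req.host;
  resp.tnom = req.tnom;
  resp.conf = req.conf;
  resp.texp = req.tnom * (1.0 + load);
  resp.tlb  = 0.9 * resp.texp;
  resp.tub  = 1.1 * resp.texp;
  return 0;
}

// Pick the first host whose *upper bound* meets the deadline, so the task
// finishes on time with probability at least conf.
Host ChooseHost(const std::vector<Host> &hosts, double tnom,
                double conf, double deadline) {
  for (size_t i = 0; i < hosts.size(); i++) {
    RunningTimePredictionRequest req = { hosts[i], conf, tnom };
    RunningTimePredictionResponse resp;
    if (PredictRunningTime(req, resp) == 0 && resp.tub <= deadline)
      return hosts[i];
  }
  return "";  // no host can meet the deadline
}
```

Testing against the upper bound $t_{ub}$ rather than $t_{exp}$ is what lets the client reason statistically: the deadline is then met with probability at least \verb.conf..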

The advisor's response is computed from host load predictions.  As we
reported in an earlier paper, we have found that host load,
specifically the Digital Unix 5 second load average, can be usefully
predicted using simple autoregressive (AR) models of order 16 or
higher.  We have implemented an extremely low overhead online host
load prediction system, built on our RPS Toolkit, that uses
AR(16) models.  In addition to generating predictions for future
samples of the host load signal, this system also estimates the
covariance matrix of their prediction errors.  Our algorithm for
predicting running times is based on estimating the average of the
load signal over an interval of time using the predicted samples, and
the variance of that estimate using the covariance matrix.  We assume
a fair, priority-less, fluid model scheduler.  This assumption seems
to work well in practice, except for short tasks.  Unix schedulers
give tasks a quickly decaying priority boost when they awaken from
blocking I/O, and such boosts significantly reduce the effect of
background load on short tasks.  To account for such boosts, we
introduce a discounting of the predicted load.

We evaluated the system using a real environment where the background
load on a host is supplied by host load trace playback.  Using this
technique, we can (re)construct a realistic repeatable workload on a
real machine without limiting ourselves to synthetic workloads that
may not capture the higher order behavior that host load prediction
depends on, and which real load signals exhibit.  We used traces from
38 different machines.  For each trace, we ran approximately 3000
testcases, each consisting of tasks with nominal times selected randomly
from 0.1 to 10 seconds, arriving randomly 5 to 15 seconds apart.  In
all cases a confidence level of 95\% was used.


We draw a considerable number of conclusions from our examination of
the testcases.  The main result is that our system, using a strong
host load predictor such as the AR(16) predictor, does indeed compute
reasonable and useful confidence intervals for the running time of
tasks.  The expected running times that are computed are also quite
accurate.  The benefits of more sophisticated predictors depend on
how heavily loaded the host is and on the nominal time of the task.
The 38 traces in our evaluation fit into five classes as far as
the performance of our system is concerned.  For most of these
classes, and for $>90\%$ of the traces, the AR(16) predictor produces the
best results.


\section{Host load predictions}


\section{Algorithm}
\label{sec:confint}

The running time of a task, $t_{exec}$, can be related to the average
load it experiences while it runs using the following continuous time
model:
\begin{displaymath}
\frac{t_{exec}}{1+\frac{1}{t_{exec}}\int_{0}^{t_{exec}} z(t) dt}
=
t_{nom}
\end{displaymath}
Here $z(t)$ is the load signal, shifted such that $z(0)$ is the value
of the signal at the current time, $t_{now}$.  We introduce this shift
to simplify the presentation of our algorithm, and to conform to the
Box-Jenkins notation for time series prediction that we used in the
previous chapter.  This simplification does not affect the results.
$t_{nom}$ is the nominal running time of the task, which quantifies
the CPU demand of the task as its running time on an unloaded machine.

This continuous time equation is basically a fluid model of a
priority-less host scheduler.  We will use this simple model to
describe our estimation procedure.  However, real schedulers
incorporate priorities that can change over time.  We assume that the
majority of the workload runs at similar priorities.  In particular,
we assume that there are no processes whose priorities have been
drastically increased or decreased, such as with the Unix ``nice''
facility.  Ultimately, we will relax this assumption slightly and
model the temporary priority boosts that most Unix implementations
give processes immediately after they become unblocked.  Given this
extension, the procedure we outline in this section works quite well.

\subsection{Continuous time}

The above equation is somewhat unwieldy to discretize and use, so,
before we continue, let's define the {\em available time
function}
\begin{equation}
at(t) = \frac{t} {1 + al(t) } \ , \  t>0
\end{equation}
which depends on the {\em average load function} 
\begin{equation}
al(t) = \frac{1}{t}\int_{0}^{t} z(\tau) d\tau \ , \   t>0
\end{equation}
$at(t)$ represents the available CPU time over the next $t$ seconds,
which is inversely related to the average load during that interval,
$al(t)$.  As the average load increases, the available time decreases.
$t_{exec}$ is then the minimum $t$ for which $at(t)=t_{nom}.$ Using
this available time function makes it easier to explain how
our algorithm estimates the running time of a task, and, of course,
the available time function is offered directly through the API
described in Section~\ref{sec:execpred.api}.

\subsection{Discrete time}

In our system, $z$ is not a continuous signal; rather, it is a
discrete time approximation of the continuous signal, consisting of
samples $z_{t+i}$ taken with period $\Delta$ seconds,
$i=1,2,\ldots,\infty$.  Note the similarity of this notation to that
of $1,2,\ldots,\infty$-step-ahead predictions introduced in
Chapter~\ref{chap:statprop}.  We will indeed later replace these
values with predictions.  $z_{t+1}$ represents $z(t)$ for
$0<t\leq\Delta$, and so on.  In general, we approximate $z(t)$ as
$z(t) = z_{t+\lceil t/\Delta \rceil}$.  This lets us write a discrete time
approximation of $at(t)$ and $al(t)$:
\begin{equation}
at_i = 
\left \{
\begin{array}{ll}
0 & i=0 \\
\frac{i\Delta}{1+al_i} &  i>0
\end{array}
\right .
\end{equation}
\begin{equation}
al_i = \frac{1}{i} \sum_{j=1}^{i} z_{t+j} \  ,  \ i>0
\end{equation}
$at_i$ is the time available during the next $i\Delta$ seconds and
$al_i$ is the average load that will be encountered over the next
$i\Delta$ seconds.  We then estimate the available time $at(t)$ by
linear interpolation:
\begin{equation}
at(t) = 
at_{\lfloor t/\Delta \rfloor} +
\frac{t-\lfloor t/\Delta \rfloor \Delta}{\Delta} (at_{\lceil t /\Delta
\rceil} - at_{\lfloor t/\Delta \rfloor}) 
\end{equation}
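A minimal sketch of this discretization, assuming the load samples $z_{t+1},\ldots,z_{t+n}$ arrive as a zero-indexed vector and using the fractional-part interpolation, $(t - \lfloor t/\Delta \rfloor \Delta)/\Delta$:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Discrete-time average load: al_i = (1/i) * sum_{j=1..i} z_{t+j},
// where z[0..n-1] holds the samples z_{t+1}..z_{t+n}.
double AvgLoad(const std::vector<double> &z, int i) {
  double sum = 0.0;
  for (int j = 0; j < i; j++) sum += z[j];
  return sum / i;
}

// Discrete-time available CPU time over the next i*Delta seconds:
// at_i = i*Delta / (1 + al_i), with at_0 = 0.
double AvailTime(const std::vector<double> &z, int i, double Delta) {
  if (i == 0) return 0.0;
  return (i * Delta) / (1.0 + AvgLoad(z, i));
}

// Continuous-time approximation at(t), by linear interpolation between
// at_floor(t/Delta) and at_ceil(t/Delta).
double AvailTimeAt(const std::vector<double> &z, double t, double Delta) {
  int lo = (int)std::floor(t / Delta);
  int hi = (int)std::ceil(t / Delta);
  double atlo = AvailTime(z, lo, Delta);
  if (hi == lo) return atlo;
  double frac = (t - lo * Delta) / Delta;  // fractional part of t/Delta
  return atlo + frac * (AvailTime(z, hi, Delta) - atlo);
}
```

For example, with a constant load of 1.0 and $\Delta=1$, $at_i = i/2$: the machine delivers half of real time to the task.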

\subsection{Host load predictions}
\label{sec:execpred.hlp}

Given these definitions, we substitute the predicted load signal
$\widehat{z}_{t+i}$ for $z_{t+i}$ resulting in the predicted average
load $\widehat{al}_i$, and continue substituting back to give the
predicted available time $\widehat{at}_i$ and its corresponding
continuous time approximation:
\begin{equation}
\widehat{at}_i = 
\left \{
\begin{array}{ll}
0 & i=0 \\
\frac{i\Delta}{1+\widehat{al}_i} & i>0
\end{array}
\right .
\end{equation}
\begin{equation}
\widehat{al}_i = \frac{1}{i} \sum_{j=1}^{i} \widehat{z}_{t+j} \ , \  i>0
\label{eqn:execpred.alhat}
\end{equation}
\begin{equation}
\widehat{at}(t) = 
\widehat{at}_{\lfloor t/\Delta \rfloor} +
\frac{t-\lfloor t/\Delta \rfloor \Delta}{\Delta} (\widehat{at}_{\lceil t /\Delta
\rceil} - \widehat{at}_{\lfloor t/\Delta \rfloor}) 
\label{eqn:execpred.athat}
\end{equation}
Then, the expected running time of the task, $t_{exp}$,  is simply the
smallest $t$ for which $\widehat{at}(t)=t_{nom}$.  

\subsection{Confidence intervals}
\label{sec:execpred.cis}

Because host load predictions are not perfect, we also report the
running time or available time as a confidence interval, computed to a
user-specified confidence level.  The better the predictions are, the
narrower the confidence interval is.  

The predicted load signal is $\widehat{z}_{t+i} = z_{t+i} + a_{t+i}$,
where $z_{t+i}$ is the real value of the signal and $a_{t+i}$ is the
$i$-step-ahead prediction error term which we summarize with a
variance $\sigma^2_{a,i}$.  Our uncertainty in estimating the
available time $at_i$ is due to our uncertainty in estimating the
average load $al_i$, which is due in turn to these error terms and
their variance.  To represent this uncertainty in the form of a
confidence interval, we must push the underlying error variances
through the equations above to arrive at a variance for the average
load $al_i$.

Notice that the average load (Equation~\ref{eqn:execpred.alhat}) sums
the estimates $\widehat{z}_{t+i}$.  Rewriting the equation, we can see
that 
\begin{equation}
\widehat{al}_i = al_i + \frac{1}{i} \sum_{j=1}^{i} a_{t+j}
\end{equation}
By the central limit theorem, then, $\widehat{al}_i$ will become
increasingly normally distributed with increasing $i$.  Given that the
errors $a_{t+i}$ are of zero mean, $\widehat{al}_i$ has a mean
(expected) value of $al_i$ and a variance that depends on the sum of
the prediction errors $a_{t+i}$:
\begin{equation}
\widehat{al}_i \sim  N\left(al_i, Var\left\{ \frac{1}{i} \sum_{j=1}^{i}
a_{t+j} \right\} \right)
\end{equation}
It is important to note that for short jobs or large $\Delta$, this
normality assumption may be invalid.  We will evaluate the system
later and determine whether the results of the assumption are
reasonable. 

Suppose the user requests a confidence interval at 95\% confidence.
We can then compute a confidence interval for $al_i$ (for $i>0$):
\begin{equation}
[al_i^{low},al_i^{high}] = 
\left[
\max\left(0,\widehat{al}_i - \frac{1.96}{i}\sqrt{Var\left\{\sum_{j=1}^{i}
a_{t+j} \right\}}\right) , \ 
\widehat{al}_i + \frac{1.96}{i}\sqrt{Var\left\{\sum_{j=1}^{i}
a_{t+j} \right\}}
\right]
\label{eqn:execpred.alci}
\end{equation}
What this means is that we predict that $al_i$ will be in the range
$[al_i^{low}, al_i^{high}]$ with 95\% probability.  The $1.96$ is the
number of standard deviations on either side of the mean of a standard
normal needed to capture 95\% of its values, and the $1/i$ factor
converts the standard deviation of the summed errors into that of the
averaged errors.  The $\max$ is important because the average load
cannot drop below zero, although the prediction errors can make that
appear to be the case.
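As a sketch, the interval computation reduces to a few lines, given $\widehat{al}_i$ and the variance of the summed prediction errors.  The half-width is $1.96$ standard deviations of the averaged error (the standard deviation of the sum divided by $i$), and the lower bound is clamped at zero:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// 95% confidence interval for the true average load al_i, given the
// predicted average load alhat_i and var_sum = Var{ sum_{j=1..i} a_{t+j} }.
// The lower bound is clamped at zero since the average load cannot be
// negative.
void AvgLoadCI(double alhat, double var_sum, int i,
               double *low, double *high) {
  double halfwidth = (1.96 / i) * std::sqrt(var_sum);
  *low  = std::max(0.0, alhat - halfwidth);
  *high = alhat + halfwidth;
}
```

For a different confidence level, the $1.96$ would be replaced by the corresponding standard normal quantile.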

We can now back-substitute these upper and lower bounds of the
confidence interval into $at(t)$, resulting in a confidence
interval for the available time, $[at^{low}(t), at^{high}(t)]$.  Then
the confidence interval on the running time is $[t_{lb},t_{ub}]$,
where $t_{lb}$ is the minimum $t$ for which $at^{high}(t)=t_{nom}$ and
$t_{ub}$ is the minimum $t$ for which $at^{low}(t)=t_{nom}$.

\subsection{Correlated prediction errors}
\label{sec:execpred.corrpred}

Given the discussion of the previous section, we must still determine
the variance of a sum of consecutive predictions in
Equation~\ref{eqn:execpred.alci} in order to compute the confidence
interval.  This is one of the subtler issues in converting from load
predictions to running time predictions.

In essence, we are summing the one, two, $i$-step-ahead prediction
errors, which are the random variables $a_{t+1}, a_{t+2}, \ldots,
a_{t+i}$.  Each of these random variables has a corresponding
variance: $\sigma^2_{a,1}, \sigma^2_{a,2}, \ldots, \sigma^2_{a,i}$.
These variances are simply the mean squared error estimates produced
by the predictor.  We want to know the variance of their sum.  By the
central limit theorem, the distribution of such a sum converges to
normality.  It is tempting, but wrong, to compute the variance of the
sum as
\begin{equation}
Var\left\{\sum_{j=1}^{i} a_{t+j} \right \} = 
\sum_{j=1}^{i} Var\{ a_{t+j} \} =
\sum_{j=1}^{i} \sigma^2_{a,j}
\label{eqn:execpred.var_wrong}
\end{equation}
This would be valid if the prediction errors were independent, but it
turns out that they are not.  Almost by their very nature, linear
models produce prediction errors which are correlated over time.  The
two-step-ahead prediction is not independent of the one-step-ahead
prediction.  Computing the variance of the sum in the above manner
understates the variance, leading to confidence intervals that are too
small.

To see that the prediction errors are correlated, consider an AR(1)
model, $z_{t+1} = \phi_0z_t + a_{t+1}$.  At time $t$, the future
values will be:
\begin{displaymath}
\begin{array}{rclcl}
z_{t+1} & = & \phi_0z_t + a_{t+1} & \\
z_{t+2} & = & \phi_0z_{t+1} + a_{t+2} & = &\phi_0^2z_t + \phi_0a_{t+1}
+ a_{t+2}
\\
z_{t+3} & = & \phi_0z_{t+2} + a_{t+3} & = & \phi_0^3z_t +
\phi^2_0a_{t+1} + \phi_0a_{t+2} + a_{t+3} \\
z_{t+n} & = & \phi_0z_{t+n-1} + a_{t+n} & = & \phi_0^nz_t +
\sum_{j=0}^{n-1} \phi_0^ja_{t+n-j}
\end{array}
\end{displaymath}
The predictions at time $t$ must assume that all the white noise
terms $a_{t+k}$ are their expected values, which are zero:
\begin{displaymath}
\begin{array}{rclcl}
\widehat{z}_{t+1} & = & \phi_0z_t & & \\
\widehat{z}_{t+2} & = & \phi_0\widehat{z}_{t+1} & = & \phi_0^2z_t \\ 
\widehat{z}_{t+3} & = & \phi_0\widehat{z}_{t+2} & = & \phi_0^3z_t \\
\widehat{z}_{t+n} & = & \phi_0\widehat{z}_{t+n-1} & = & \phi_0^nz_t \\
\end{array}
\end{displaymath}
Thus the one-step-ahead prediction error is $a_{t+1}$.  The
two-step-ahead prediction error is $\phi_0a_{t+1} + a_{t+2}$, the
value of which is clearly dependent on the one-step-ahead prediction
error.  The three-step-ahead prediction error is $\phi_0^2a_{t+1} +
\phi_0a_{t+2} + a_{t+3}$, which depends on the one- and two-step-ahead
prediction errors, and so on.

To correctly compute the variance of the sum of load predictions, we
must compute the covariance of each of the prediction errors with each
of the other prediction errors and then sum all $i^2$ terms of this
covariance matrix.  Entry $j,k$ of this matrix is
$CoVar\{a_{t+j},a_{t+k}\}$, the covariance of the $j$-step-ahead
prediction error with the $k$-step-ahead prediction error.  Notice
that the variances of the individual predictions are simply the
diagonal elements of the matrix.

The prediction errors' correlation over lead time is akin to a
signal's autocorrelation over time.  Recall that an autocorrelation
sequence is simply a normalized autocovariance sequence.  The
covariances are easily computed from the autocovariance sequence.  In
particular, $CoVar\{a_{t+j},a_{t+k}\} = AutoCoVar\{a_{t},a_{t+| j-k
|}\}$.  

The host load prediction system uses the algorithm of Box et al.\ to
compute the autocovariance sequence for any linear
model~\cite[pp. 159--160]{BOX-JENKINS-TS}.  Since the LAST predictor
is simply an AR(1) model with $\phi_0=1$, its autocovariances can also
be computed using Box et al.'s method.  In the case of the MEAN
predictor, the autocovariances are simply the autocovariances of the
signal itself.  

The client can request that the host load prediction system summarize
the prediction errors either by the individual variances (the diagonal
of the covariance matrix), the covariances of all the predictions (the
entire covariance matrix), or the variance of the sum of the first
$1,2,\ldots,i$ predictions (the sum of the first $i$ rows and columns
of the covariance matrix.)  The variance of the sum of the first $i$
predictions, which we will write as $\sigma^2_{s,i}$ is then computed as
\begin{equation}
\begin{array}{rcl}
\sigma^2_{s,i} & = & Var\left\{\sum_{j=1}^{i} a_{t +j} \right\} \\
               & = & SumVar\{\langle
a_{t+1},a_{t+2},\ldots,a_{t+i}\rangle\} \\
               & = & \sum_{j=1}^i\sum_{k=1}^{i} CoVar\{a_{t+j}, a_{t+k}\} \\
\end{array}
\label{eqn:execpred.covar}
\end{equation}
The sum is computed by the host load prediction system in order to
avoid communicating the whole covariance matrix.  
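For the AR(1) example above, the error covariances have a closed form, which gives a concrete way to check that accounting for correlation enlarges the variance of the sum.  This analytic shortcut is specific to AR(1) with $|\phi_0|<1$ and merely stands in for the general Box et al.\ algorithm used by the system:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// For an AR(1) model z_{t+1} = phi*z_t + a_{t+1} with innovation variance
// sigma2a, the j-step-ahead prediction error is sum_{m=0..j-1} phi^m a_{t+j-m},
// so for j <= k the error covariance is
//   CoVar = sigma2a * phi^(k-j) * (1 - phi^(2j)) / (1 - phi^2).
// Assumes |phi| < 1.
double ErrCoVar(double phi, double sigma2a, int j, int k) {
  if (j > k) std::swap(j, k);
  return sigma2a * std::pow(phi, k - j)
         * (1.0 - std::pow(phi, 2 * j)) / (1.0 - phi * phi);
}

// Correct variance of the sum of the first i prediction errors:
// sum of all i*i entries of the covariance matrix.
double SumVar(double phi, double sigma2a, int i) {
  double s = 0.0;
  for (int j = 1; j <= i; j++)
    for (int k = 1; k <= i; k++)
      s += ErrCoVar(phi, sigma2a, j, k);
  return s;
}

// The (incorrect) independent-errors estimate: diagonal entries only.
double SumVarIndep(double phi, double sigma2a, int i) {
  double s = 0.0;
  for (int j = 1; j <= i; j++) s += ErrCoVar(phi, sigma2a, j, j);
  return s;
}
```

For any positively correlated predictor ($\phi_0 > 0$) and $i > 1$, \verb.SumVar. strictly exceeds \verb.SumVarIndep., which is exactly why the diagonal-only formula understates the width of the confidence interval.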

Typically, the off-diagonal elements of the covariance matrix are
greater than zero.  This results in Equation~\ref{eqn:execpred.covar}
yielding a larger variance than Equation~\ref{eqn:execpred.var_wrong}.  This increased
variance of the sum widens the confidence interval for $al_i$ and,
correspondingly, for the available time $at(t)$ and for the running
time $t_{exec}$.


\subsection{Load discounting}

After we implemented the algorithm for computing a task's expected
running time ($t_{exp}$) and its confidence interval
($[t_{lb},t_{ub}]$) described thus far in this section, we found that
our results were somewhat dependent on the nominal running time of the
task, $t_{nom}$.  We will describe our evaluation methodology in
detail in Section~\ref{sec:execpred.setup}, but, for now,
consider running, at a random time, a task with a randomly chosen
nominal time $t_{nom}$ on a host with an interesting background load.
Before the task is run, we compute its expected running time, 
$t_{exp}$.  Then we run the task and record its actual running time,
$t_{act}$.

\begin{figure}
\centerline{
\epsfxsize=3in
\epsfbox{tnomest_pyramid_before.epsf}
}
\caption[Relative prediction error without load discounting]{Relative errors of predictions as a function of nominal time
without load discounting}
\label{fig:execpred.relerr_b4_discount}
\end{figure}

Figure~\ref{fig:execpred.relerr_b4_discount} shows the result of
running many such tasks.  The figure plots the relative error
($(t_{exp}-t_{act})/t_{exp}$) of the predictions versus the nominal
time of the task.  The figure plots 3000 tasks with nominal times
chosen from a uniform distribution between 0.1 to 10 seconds.  The
interval between task arrivals was chosen from a uniform distribution
between 5 to 15 seconds.  The background load was the August 1997 axp0
trace and the predictor used was an AR(16).

Notice that relative error is always positive and increases markedly
as nominal time decreases.  For one second tasks, the running time is
over-predicted by 80\%.  The confidence intervals were also skewed,
with far too many points falling below the lower bounds of the
intervals.

The problem is due to the Digital Unix scheduler, which is
priority-based and which gives an ``I/O boost'' to processes
(increases their priority) when a blocking I/O operation completes.
For example, a process (such as a CORBA ORB, an active frame server,
or the spin server we use in our evaluation
(Section~\ref{sec:execpred.spin})) that is blocked in a read on a
network socket will have its priority boosted when the read completes.
Over time, the process's priority will ``age'' to its baseline level.
The result of this is that the awakened process will get more than its
fair share of the CPU until its priority has degraded.  The shorter
the task, the more this mechanism will benefit it, and the more
inaccurate our running time estimate will be, just as shown in
Figure~\ref{fig:execpred.relerr_b4_discount}.

Although we had originally intended to avoid modeling priority
scheduling, leaving the modeling of a full priority-based scheduler
for later, it was clear that we had to capture this priority boost
effect.

Our solution is load discounting.  Conceptually, when a task begins to
run, its priority boost means that the background load on the system
will affect it only slightly.  As it continues to run, its priority
drops and the background load becomes more and more important.  We
model this by exponentially decaying the load predictions
$\widehat{z}_{t+j}$, the discounted load $\widehat{zd}_{t+j}$ being
\begin{equation}
\widehat{zd}_{t+ j} = (1-e^{-j\Delta/\tau_{discount}})
\widehat{z}_{t+j}
\label{eqn:execpred.discount}
\end{equation} 
How quickly the initial load discounting decays depends on the setting
of $\tau_{discount}$.
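Load discounting is a one-line transformation of the prediction vector.  This sketch assumes the predictions arrive as a vector whose $m$-th entry is the $(m+1)$-step-ahead prediction:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Apply the exponential load discount: the (m+1)-step-ahead prediction
// zhat[m] is scaled by (1 - exp(-(m+1)*Delta/tau)), so load is nearly
// ignored immediately after the task starts (priority boost in effect)
// and fully counted once the boost has aged away.
std::vector<double> DiscountLoads(const std::vector<double> &zhat,
                                  double Delta, double tau) {
  std::vector<double> zd(zhat.size());
  for (size_t m = 0; m < zhat.size(); m++) {
    double j = (double)(m + 1);  // lead time index of zhat[m]
    zd[m] = (1.0 - std::exp(-j * Delta / tau)) * zhat[m];
  }
  return zd;
}
```

With $\tau_{discount}=4.5$ seconds and $\Delta=1$, the first prediction is scaled by about $0.2$ while predictions tens of seconds ahead pass through essentially unchanged.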

\begin{figure}
\centerline{
\epsfxsize=3in
\epsfbox{tauest_pyramid_before.epsf}
}
\caption[Estimation of $\tau_{discount}$]{Relative errors of predictions as a function of $\tau_{discount}$, and
derivation of $\tau_{discount}$ by linear fit.}
\label{fig:execpred.relerr_vs_tau}
\end{figure}

We determined the value of $\tau_{discount}$ empirically by again
running a large number of randomized testcases as described above.
For this set of testcases, we used load discounting and chose
$\tau_{discount}$ randomly from the range 0 to 10 seconds.
Figure~\ref{fig:execpred.relerr_vs_tau} plots the relative error of
these testcases as a function of $\tau_{discount}$.  Although there is
plenty of dispersion (recall that these are point estimates $t_{exp}$,
not confidence intervals), a linear trend clearly exists.  We fitted a
line to the points and determined that it crossed zero relative error
at $\tau_{discount}=4.5$ seconds.

\begin{figure}
\centerline{
\epsfxsize=3in
\epsfbox{tnomest_pyramid_after.epsf}
}
\caption[Relative prediction errors with load discounting]{Relative errors of predictions as a function of nominal time
with load discounting and appropriate $\tau_{discount}$ value.}
\label{fig:execpred.relerr_after_discount}
\end{figure}

Next, we ran a further large number of testcases with load discounting
and $\tau_{discount}=4.5$ seconds.  The results are plotted in
Figure~\ref{fig:execpred.relerr_after_discount}.  The figure plots the
relative error as a function of the nominal time and is suitable for
direct comparison with Figure~\ref{fig:execpred.relerr_b4_discount}.
As can be seen, the appropriate $\tau_{discount}$ value has
eliminated the dependence of the relative error on the nominal time
and has further reduced the average relative error to almost exactly
zero, which we would also expect from these point estimates.

Load discounting is an effective solution to the priority boosts that
most Unix schedulers give to processes that have become unblocked.  It
is important to note, however, that other priority problems remain.
For example, a background process which has had its priority
significantly reduced (e.g., a process which has been ``reniced'') but
which remains compute bound will result in artificially exaggerated
predictions.  Similarly, a process with high priority will result in
predictions that are too low.

\subsection{Implementation}

The following summarizes how our implementation handles the
\verb.PredictRunningTime. call.  
\begin{ecompact}
\item Set a prediction horizon $n=3\lceil t_{nom}/\Delta \rceil$.
\item Collect the latest predictions,
$\widehat{z}_{t+1},\widehat{z}_{t+2},\ldots,\widehat{z}_{t+n}$, and
the cumulative variances of their sums,
$\sigma^2_{s,1},\sigma^2_{s,2},\ldots,\sigma^2_{s,n}$, from the host
load prediction system running on the host.  The variances are
computed as per Equation~\ref{eqn:execpred.covar}.
\item Generate the discounted load predictions,
$\widehat{zd}_{t+1},\widehat{zd}_{t+2},\ldots,\widehat{zd}_{t+n}$, as
per Equation~\ref{eqn:execpred.discount}.
\item Using the discounted load predictions, compute the discrete time
expected available time, $\widehat{at}_1, \widehat{at}_2,
\ldots,\widehat{at}_n$ using the technique of
Section~\ref{sec:execpred.hlp}, and the discrete time confidence
intervals on the available time,
$[at_1^{low},at_1^{high}],[at_2^{low},at_2^{high}], \ldots,
[at_n^{low},at_n^{high}]$ using the technique of
Section~\ref{sec:execpred.cis}.     
\item If the lower bound on the maximum available time, $at_n^{low}$,
is less than the nominal time, $t_{nom}$, increase the prediction
horizon $n$ and go to step 2.
\item To find the expected running time, $t_{exp}$, do a binary search
to find the smallest $i$ for which $\widehat{at}_i>t_{nom}$, interpolate
$\widehat{at}(t)$ around $i\Delta$ using
Equation~\ref{eqn:execpred.athat}, and find $t$ such that
$\widehat{at}(t)=t_{nom}$.  This is $t_{exp}$.
 \item Repeat the previous step with the $at_i^{low}$ and
 $at_i^{high}$ sequences to find the upper and lower bounds of the
 confidence interval, $[t_{lb},t_{ub}]$.  
\end{ecompact}
 The \verb.PredictAvailableTime. call is implemented similarly.
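Steps 4 through 6 can be sketched as follows, with a linear scan standing in for the binary search (the available-time sequence is nondecreasing, so either works); the confidence-interval bounds $[t_{lb},t_{ub}]$ come from calling the same routine on the $at_i^{high}$ and $at_i^{low}$ sequences:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// From a sequence of (discounted) load predictions zd[0..n-1], compute the
// predicted available-time series incrementally and invert it to find the
// smallest t with athat(t) = tnom.
double ExpectedRunningTime(const std::vector<double> &zd,
                           double Delta, double tnom) {
  double prev_at = 0.0, sum = 0.0;
  for (size_t i = 1; i <= zd.size(); i++) {
    sum += zd[i - 1];
    double al = sum / i;                   // predicted average load al_i
    double at = (i * Delta) / (1.0 + al);  // predicted available time at_i
    if (at >= tnom) {
      // Interpolate linearly between ((i-1)*Delta, prev_at) and
      // (i*Delta, at) to solve athat(t) = tnom.
      return (i - 1) * Delta + Delta * (tnom - prev_at) / (at - prev_at);
    }
    prev_at = at;
  }
  return -1.0;  // horizon too short; caller should extend it and retry
}
```

For example, with a constant load of 1.0 and $\Delta=1$, a task with $t_{nom}=2$ yields $t_{exp}=4$, since the task receives half the CPU.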

\section{Experimental infrastructure}
\label{sec:execpred.setup}

The most appropriate way to evaluate task running time prediction is
to actually run our implementation on a host, submit tasks to it, and
measure how well the predicted times match the actual times.  This
section describes the infrastructure that we used to do this.  We also
used this infrastructure to evaluate the advisory real-time service.
The next chapter describes that evaluation.

The task running time prediction system is a composite where errors
are due to inaccurate host load predictions and modeling errors in
transforming from the host load predictions and task CPU demand to
task running time.  We are interested in how this composite performs
as a whole.  The essential comparison to make when evaluating the
system is the comparison between predicted and actual running times.
To make such a comparison in a simulation would require a method for
realistically estimating the actual running time, but that is a part
of the system we want to evaluate!  The measurement-based approach
avoids this problem because the actual running time is measured
directly.  

The goal of our experimental infrastructure is to provide a realistic,
repeatable environment for evaluating running time prediction.  The
infrastructure hardware consists of two Alphastation 255 hosts
connected with a private network.  Both machines run Digital Unix
4.0D.  One host is referred to as the measurement host while the other
is called the recording host.  The hosts have no other load on them.

The recording host runs software that interrogates the components
running on the measurement host and submits tasks to it.  The
measurement host runs the following components:
\begin{itemize}
\item{A load playback tool operating on some load trace}
\item{A host load sensor}
\item{One or more host load prediction systems}
\item{A spin server}
\end{itemize}
The load playback tool, described in the next section, performs work
that replicates the workload captured in the load trace.  The choice
of load trace is experiment-specific.  The host load sensor provides
an interface for the recording host to request the latest load
measurement on the host.  The host load prediction systems are
described in Section~\ref{sec:rps.host_load_pred_sys} and provide an
interface for the recording host to request the latest load
predictions using some experiment-specific prediction model.  The spin
server runs tasks --- it takes requests to compute (using a busy loop)
for some number of CPU-seconds and then returns the wall-clock time
that the task took to complete.

The remainder of this section describes the load playback tool and the
spin server in more detail.


\subsection{Load trace playback}

To evaluate task running time prediction, a realistic background
workload is necessary.  The load signal produced by running this
workload should have a realistic correlation structure, since it is
precisely this that a predictor attempts to model.

To achieve these requirements, our infrastructure uses {\em load trace
playback}, a new technique in which the workload is generated
according to a load trace.  With no other work on the host, this
background load results in the host's load signal repeating that of
the load trace.  Furthermore, we can repeatedly play back the same
load and explore the prediction system's behavior using the traces
from many different hosts.  Because the workload reflects the trace,
which certainly records the real behavior of a host, it is suitable
grist for the prediction mill.

Playback may seem like overkill, but it is not.  As we discovered in
Chapter~\ref{chap:statprop}, host load signals typically have complex
behaviors.  They typically have long, complex autocorrelation
structures which can change abruptly.  Our prediction systems exploit
the autocorrelation structure to increase predictability and treat
abrupt changes as points at which models need to be refit.  Any
synthetic load generator would have to reproduce these properties, and
that is difficult.  Furthermore, there may be other properties of the
signals that we do not yet understand but which may be important for
prediction.  For these reasons, playing back actual load behavior is
most appropriate.

\begin{figure}
\centerline{
\begin{tabular}{cc}
\epsfxsize=3in
\epsfbox{replay_axp0.23.txt.pyramid.out.WEB.ps}
&
\epsfxsize=3in
\epsfbox{playback_percenterrs_axp0_pyramid_1stday.eps} \\
(a) Time series of first hour & (b) Histogram of relative errors
during first day
\end{tabular}
}
\caption[Host load trace playback example]{Example of host load trace playback.}
\label{fig:execpred.playback}
\end{figure}

Figure~\ref{fig:execpred.playback} is an example of load trace
playback.  In this case, the trace collected on the host axp0.psc on
August 23, 1997 is being played back.
Figure~\ref{fig:execpred.playback}(a) plots the signal in the load
trace, the target load signal, and the measured load signal versus
time for the first hour of playback.  As can be seen, there is very
close agreement between the three signals.  The target signal and the
measured signal are slightly delayed compared to the load trace's
signal because work-based playback (explained later) is being used and
there is a modicum of other work on the machine.
Figure~\ref{fig:execpred.playback}(b) plots a histogram of the
relative error in playback, as measured by the load signal, for the
whole trace.  Notice that most of the errors are quite low.  Most of
the larger errors are due to small absolute errors being magnified
when the load is very low.  The graph also overstates the actual error
in terms of the work performed.  This is due to extraneous load on the
system, sampling issues, and the effects of the kernel's smoothing
filter, which we discuss in
Section~\ref{sec:execpred.imperfect}. 


\subsubsection{The playback algorithm}

To describe the operation of the system, we will focus on a
periodically sampled trace.  The load playback tool can also operate
with nonperiodically sampled traces.

The load average is an exponential average of the number of processes
in the system's ready-to-run queue.  Conceptually, the length of the
ready queue is periodically sampled, and these samples $x_i$ flow
through an exponential filter
\begin{displaymath}
  z_{i+1} = (e^{-\Delta/\tau_{record}})z_i + (1-e^{-\Delta/\tau_{record}})x_i
\end{displaymath}
where $\Delta$ is the sampling interval and the application-visible
load average is the $z_i$ series.  A load trace is gathered by
periodically sampling the load average and recording the time-stamped
samples to a file.  The sample rate should be at least twice as high
as the sample rate of the in-kernel process, so we rewrite the above
equation as:
\begin{displaymath}
   trace_{i+1} = (e^{-\Delta/(2\tau_{record})})trace_{i} + (1-e^{-\Delta/(2\tau_{record})})x'_i
\end{displaymath}
In load trace playback, we want the measured load average to track the
load average samples in the trace file.  To do this, we treat the
$x'_i$ in the above equation as the expected run-queue length during
the last $\Delta/2$ seconds.  To determine the $x'_i$, we ``deconvolve
out'' the smoothing function using the method described by Arndt
et~al.~\cite{WINNER-WS-98}.  All that is necessary to do this is
knowledge of the smoothing constant $\tau_{record}$ for the machine on
which the trace was taken.  For Digital Unix, $\tau_{record}$ is 5
seconds.  On most other Unix systems, $\tau_{record}$ is 60 seconds.
A larger $\tau_{record}$ value indicates that the load average will
behave more smoothly. 
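Given $\tau_{record}$, the deconvolution amounts to solving the
recurrence above for $x'_i$.  The following sketch illustrates this
step in Python (our own illustration of the method, not the tool's
actual code; function names are hypothetical):

```python
import math

def deconvolve_load_trace(trace, delta, tau_record):
    """Recover the expected run-queue lengths x'_i from a load trace
    by inverting the kernel's exponential smoothing filter.

    trace:      load-average samples taken every delta seconds
    tau_record: smoothing constant of the machine the trace came from
                (e.g., 5 seconds for Digital Unix, 60 for most others)
    """
    a = math.exp(-delta / (2.0 * tau_record))
    xs = []
    for i in range(len(trace) - 1):
        # trace[i+1] = a*trace[i] + (1-a)*x'_i, so solve for x'_i
        x = (trace[i + 1] - a * trace[i]) / (1.0 - a)
        xs.append(max(x, 0.0))  # a run-queue length cannot be negative
    return xs
```

Applying the smoothing filter to the recovered $x'_i$ series
reproduces the original trace, which is a simple way to check the
inversion.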

A value $x'_i$ represents the amount of contention the CPU saw during
the time interval.  To reproduce this contention, we split the interval
into smaller subintervals, each of which is bigger than the scheduler
quantum and stochastically assign subprocesses to either work or
sleep during the subintervals.  For example, suppose $x'_i=1.5$.  We
would assign subprocess 0 to work during a subinterval with
probability 1, and subprocess 1 to work during a subinterval with
probability 0.5.   
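The fractional split can be sketched as follows (hypothetical helper
names; the real tool manages actual subprocesses rather than boolean
flags):

```python
import math
import random

def subprocess_work_probs(x):
    """Split a fractional run-queue length x into per-subprocess
    probabilities of working during a subinterval; e.g. x = 1.5
    yields [1.0, 0.5]."""
    probs = [1.0] * int(math.floor(x))   # fully busy subprocesses
    frac = x - math.floor(x)
    if frac > 0:
        probs.append(frac)               # one partially busy subprocess
    return probs

def assign_subinterval(probs, rng=random):
    """For one subinterval, stochastically decide which subprocesses
    work (True) and which sleep (False)."""
    return [rng.random() < p for p in probs]
```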

After each individual $x'_i$ value has been played, the load average
on the system is measured.  This measurement, $measure_i$, is compared
against the load trace {\em as it would have appeared given the
$\tau_{playback}$ of the playback machine}, $trace'_i$.  $trace'_i$ is
determined by smoothing the $x'_i$ series with an exponential
averaging filter having the playback machine's $\tau_{playback}$.

\subsubsection{Interacting with other load}


Most likely, there will be other processes on the system --- the tasks
we will run during the evaluation, for example --- and so it is
important to understand how the load generated by the playback tool
reacts to them.  Two modes of operation are possible: time-based and
work-based.  In our evaluation we use work-based playback.

In the time-based mode of operation, the contention probabilities
implied by the $x'_i$ value last only until the end of $x'_i$'s interval
in real time.  Essentially, this means that the other load on the
machine can only amplitude modulate the played back load.  For example,
suppose there is a uniform 1.0 load on the machine and the load trace
dictates a 1.0 load from 0 to 1 second and zero elsewhere.  Then, the
measured load will (ideally) be 2.0 from 0 to 1 second and 1.0
elsewhere. 

In the work-based mode of operation, the load is interpreted as work
that must always be done, but which may be slowed down by other load.
This means that the other load on the system can also frequency
modulate the played load.  Using the same example as the previous
paragraph, the measured load would be 2.0 from 0 to 2 seconds and 1.0
elsewhere.  
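As a concrete check of the two modes, the idealized effect on the
measured load can be computed as follows (a simplification of our own
that assumes constant loads and proportional-share scheduling;
illustrative only):

```python
def measured_segment(trace_load, trace_duration, other_load, mode):
    """Ideal measured load level and segment duration when a constant
    trace load segment is played against a constant other load.

    time-based: the other load only amplitude-modulates the playback.
    work-based: the played load is work that must complete, so the
    other load also frequency-modulates (stretches) the segment.
    """
    level = trace_load + other_load
    if mode == "time-based":
        return level, trace_duration
    if mode == "work-based":
        # proportional sharing stretches the segment by level/trace_load
        stretch = level / trace_load if trace_load > 0 else 1.0
        return level, trace_duration * stretch
    raise ValueError("unknown mode: " + mode)
```

For the example in the text (a 1.0 trace load for one second against a
uniform 1.0 background load), this yields a measured load of 2.0 for
one second in time-based mode and 2.0 for two seconds in work-based
mode.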

The load playback tool also supports a negative feedback mechanism.
Using this mechanism, it can sometimes more accurately track the load
average in the trace.  However, the feedback mechanism will also try
to correct for other load on the system, trying to make the total
load track the trace load.  Because of this obvious flaw, we do not
use it in our evaluation.

\subsection{Spin server}
\label{sec:execpred.spin}

A spin server runs submitted tasks and measures how long they take to
complete.  A task is specified by the number of CPU seconds that
should be run.  The spin server uses a busy loop that periodically
checks the accounted system and user time (using the getrusage system
call) that the process has consumed.  The loop exits when the
requested number of CPU seconds has been consumed and reports the wall
clock time used.
At startup, the spin server calibrates itself with respect to the
host's running rate and the overhead of the getrusage call.  This
permits the loop to be extremely precise about the amount of
computation it does while using a minimal number of getrusage calls.
The relative absolute error is much less than 1\%.
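The core of such a busy loop can be sketched in Python (a simplified,
uncalibrated version of what the text describes; the real server
sizes its inner loop from the calibration step):

```python
import time
import resource  # Unix-only

def spin(cpu_seconds, chunk=100000):
    """Busy-loop until this process has consumed cpu_seconds of
    user+system CPU time; return the wall-clock time that took."""
    def cpu_used():
        r = resource.getrusage(resource.RUSAGE_SELF)
        return r.ru_utime + r.ru_stime

    start_cpu = cpu_used()
    start_wall = time.time()
    while cpu_used() - start_cpu < cpu_seconds:
        # burn CPU between getrusage checks; calibration would pick a
        # chunk size that minimizes both overshoot and call count
        for _ in range(chunk):
            pass
    return time.time() - start_wall
```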


\section{Evaluation}

To evaluate task running time prediction, we ran 114,000 randomized
testcases using the infrastructure and studied the results.  The
testcases were randomized with respect to their starting time, their
nominal running time (0.1 to 10 seconds), the underlying host load
predictor that was used (MEAN, LAST, and AR(16)), and the load trace
used to generate the background load (all of the August, 1997 traces).
We characterized the quality of the expected running times and
confidence intervals produced by the system using three metrics.  We
evaluated the effect on these metrics of the different traces, load
predictors, and nominal running time.

The main conclusion is that our system does indeed produce high
quality predictions for task running times.  Performance depends most
strongly on the choice of host load predictor.  There is a marked
increase in performance in moving from the MEAN predictor (where
prediction errors are due to the raw signal variance) to the LAST
predictor (which is the simplest predictor to take advantage of the
autocorrelation of load signals).  There is a smaller, but important,
gain in moving from the LAST predictor to the AR(16) predictor (which
is the predictor that we found most appropriate for host load
prediction in the previous chapter).  The nature of these gains
depends on how heavily loaded the host is.  Performance also depends
on the running time of the task and generally improves as running time
increases.


\subsection{Methodology}
\label{sec:execpred.method}

To evaluate task running time prediction given a particular traced
host, we start up the experimental infrastructure described in
Section~\ref{sec:execpred.setup} on the measurement and recording
hosts.  The load playback tool is set to replay the selected trace.
The host load sensor is configured to run at 1 Hz.  Three host load
prediction systems are started: MEAN, LAST, and AR(16).  The systems
are configured to fit to 300 measurements (5 minutes) and to refit
themselves when the absolute error for a one-step-ahead prediction
exceeds 0.01 or the average measured one-step-ahead mean squared error
exceeds the estimated one-step-ahead mean squared error by more than
5\%.  The minimum interval between refits is 30 seconds and the
maximum interval before the measured mean squared error is tested is 300
seconds.  

The prediction and measurement software are permitted to quiesce for
at least 600 seconds.  Then 3000 consecutive testcases are run on the
recording host, each according to this procedure:
\begin{ecompact}
\item Wait for a delay interval, $t_{interval}$, selected from a
uniform distribution from 5 to 15 seconds.
\item Get the current time $t_{now}$.
\item Select the task's nominal time, $t_{nom}$, from a uniform
distribution from 100 ms to 10 seconds.
\item Select a random host load prediction system from among MEAN, LAST,
AR(16).
\item Use the \verb.PredictRunningTime. API to compute the expected running time $t_{exp}$ and the 95\% confidence interval $[t_{lb},t_{ub}]$
using the latest predictions available from the selected
host load prediction system.
\item Run the task on the spin server and retrieve its actual running
time, $t_{act}$.
\item Record the timestamp $t_{now}$, the prediction system used, the
nominal time $t_{nom}$, the expected running time $t_{exp}$, the
confidence interval $[t_{lb}, t_{ub}]$, and the actual running time
$t_{act}$.
\end{ecompact} 
After all 3000 testcases have been run, their records are imported
into a database table corresponding to the trace.  
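The per-testcase procedure can be sketched as a driver loop (the
predict, run\_task, and record callables stand in for
\verb.PredictRunningTime., the spin server, and the database import;
the sleep parameter is exposed only so the sketch is testable):

```python
import random
import time

def run_testcases(n, predict, run_task, record,
                  rng=random, sleep=time.sleep):
    """Randomized evaluation driver.

    predict(t_nom, predictor) -> (t_exp, t_lb, t_ub)
    run_task(t_nom)           -> t_act  (e.g., via the spin server)
    record(row)               stores one result for later mining
    """
    predictors = ["MEAN", "LAST", "AR(16)"]
    for _ in range(n):
        sleep(rng.uniform(5.0, 15.0))     # inter-testcase delay
        t_now = time.time()
        t_nom = rng.uniform(0.1, 10.0)    # nominal running time
        predictor = rng.choice(predictors)
        t_exp, t_lb, t_ub = predict(t_nom, predictor)
        t_act = run_task(t_nom)
        record((t_now, predictor, t_nom, t_exp, t_lb, t_ub, t_act))
```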

It takes approximately 13 hours to complete 3000 testcases.
To evaluate the running time predictions, the database is mined.  

\subsection{Metrics}

The \verb.PredictRunningTime. function predicts the running time of a task
in two ways:  as an expected value and as a confidence interval.
Because the lower bound of the confidence interval is artificially
limited due to the fact that load cannot drop below zero, the expected
time is not necessarily in the middle of the confidence interval.  

Evaluating the quality of the confidence interval, $[t_{lb},t_{ub}]$,
is a somewhat complex endeavor.  Suppose we run a wide
variety of testcases with a specified confidence, say 95\%.  If we
used the ideal algorithm for computing confidence intervals and the
best possible predictor, the lengths of the tasks' confidence
intervals would be the minimum possible such that 95\% of the tasks
would have running times in their predicted intervals.  An imperfect
algorithm, such as ours, will compute confidence intervals that are
larger or smaller than ideal, in which more or fewer than 95\% of the
tasks complete in their intervals.  The important point is that to
evaluate a confidence interval algorithm, we must measure the lengths
of the confidence intervals it produces and the number of tasks which
complete within these confidence intervals.  To evaluate confidence
intervals, we will use the following two metrics:
\begin{itemize}
\item {\bf Coverage}: the fraction of tasks which complete within their
predicted confidence intervals
\item {\bf Span}: the average width of the confidence intervals, in
seconds
\end{itemize}
The ideal system will have the minimum possible span such that the
coverage is 95\%.
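Over a set of testcase records, the two metrics amount to the
following (a straightforward computation; the tuple layout is our
own):

```python
def coverage_and_span(records):
    """Coverage and span over testcase records.

    records: iterable of (t_act, t_lb, t_ub) tuples.
    Returns (coverage, span): the fraction of tasks whose actual
    running time fell within the predicted interval, and the mean
    interval width in seconds.
    """
    records = list(records)
    hits = sum(1 for t_act, t_lb, t_ub in records
               if t_lb <= t_act <= t_ub)
    span = sum(t_ub - t_lb for _, t_lb, t_ub in records) / len(records)
    return hits / len(records), span
```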

To evaluate the quality of the expected running time, $t_{exp}$, we
are interested in how strongly the actual time, $t_{act}$, correlates
with it.  Imagine plotting the actual times versus the expected times
for all the tasks.  With a perfect predictor, the result would be a
perfect 45 degree line starting at 0.  An imperfect predictor, such as
ours, will approximate the line but with points scattered around it.
The degree of this scatter is measured by the $R^2$ value of the
linear fit.  $R^2=1$ would correspond to a perfect predictor while
$R^2=0$ corresponds to randomness.  We will evaluate the quality of the
expected times using this $R^2$ value, which we will simply refer to
as the {$\mathbf{R^2}$} value.  This is not the entire story.  We also
care about the slope and intercept of the line.  We discuss this
further in Section~\ref{sec:execpred.expected_overall}.
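The fit and its $R^2$ can be computed directly from the paired times
(a standard simple linear regression, sketched here for concreteness;
slope and intercept are returned as well since they matter to the
discussion above):

```python
def fit_actual_vs_expected(expected, actual):
    """Least-squares fit of actual running times against expected
    running times.  Returns (r_squared, slope, intercept); a perfect
    predictor gives (1.0, 1.0, 0.0)."""
    n = len(expected)
    mx = sum(expected) / n
    my = sum(actual) / n
    sxx = sum((x - mx) ** 2 for x in expected)
    syy = sum((y - my) ** 2 for y in actual)
    sxy = sum((x - mx) * (y - my) for x, y in zip(expected, actual))
    slope = sxy / sxx
    intercept = my - slope * mx
    r2 = (sxy * sxy) / (sxx * syy)   # R^2 = squared correlation
    return r2, slope, intercept
```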


\subsection{Results}

Running 3000 randomized testcases on each of the 39 traces in our
August 1997 set of traces resulted in 114,000 testcases to mine.  This
plethora of testcases allows us to characterize the performance of
task running time prediction in a wide variety of ways.

We want to answer several questions.  The most important of these is
whether our system does indeed provide useful predictions of task
running times, both in terms of the expected running time and in terms
of the confidence interval for the running time.  We are most
interested in the confidence intervals because these are the
predictions that we will use in the next chapter to make real-time
scheduling decisions.  In addition we want to understand how the
choice of underlying host load predictor affects prediction
performance, and how that performance depends on the nominal time of
the task.  

To address these questions, we looked at the testcases in five ways.
First, we measured the quality of the confidence intervals
independently of the nominal time of the task.  For each trace, we
computed the confidence interval metrics of coverage and span.  Then
we compared the different predictors based on these per-trace metrics.
Next, we conditioned this comparison on the nominal time of the task,
dividing the range into small, medium, and large tasks.  The third
and fourth steps repeat this approach but using the $R^2$ metric for
the expected running time.  Finally, we hand-classified each trace
based on the relationship of the performance metrics and the nominal
time.  This resulted in five classes.  We then developed a
recommendation for each class.

The overall conclusion is that the system does indeed predict
reasonable point estimates and confidence intervals for task running
times.  Using the AR(16) predictor, we found that there were only five
traces (out of 39) in which fewer than 90\% of the tasks completed
within their confidence intervals, and only one host where fewer than
85\% did so.  The relationships between
MEAN, LAST, and AR(16) depend on whether the host is heavily loaded or
not.  On hosts with high load, the more sophisticated predictors were
able to produce significantly better coverage by estimating wider
spans.  On hosts with low load, they achieved appropriate coverage
with much smaller spans.  The AR(16) predictor generally produced
better results than LAST, and much better than MEAN.  Performance
generally improved as nominal time was increased.  In terms of
predicting the expected running time, although there were significant
differences between the predictors, none of them could be considered
to have the ``best'' performance.  Interestingly, unlike with
confidence intervals, the expected running time predictions grew worse
with increasing nominal time.

\subsubsection{Nominal time independent evaluation of confidence
interval quality}
\label{sec:execpred.overall}

Consider mapping a task with a nominal time of 0.1 to 10 seconds onto
a randomly selected host.  Will the \verb.PredictRunningTime. function
produce an accurate confidence interval for its running time?  Will
the accuracy depend on which underlying host load predictor is used?

\begin{figure}
\centerline{
\begin{tabular}{cc}
\colfigsize
\epsfbox{pred_fracinci_overall_0_10.epsf}
&
\colfigsize
\epsfbox{pred_ciwidth_overall_0_10.epsf}
\\
(a) Coverage & (b) Span
\end{tabular}
}
\caption[Confidence interval metrics, all hosts, all nominal times]{Confidence interval metrics for 0.1 to 10 second tasks on all
hosts, independent of nominal time.}
\label{fig:execpred.overall_0_10}
\end{figure}

\begin{figure}
\centerline{
\begin{tabular}{cc}
\colfigsize
\epsfbox{pred_fracinci_vsmean_0_10.epsf}
&
\colfigsize
\epsfbox{pred_ciwidth_vsmean_0_10.epsf} \\
(a) Coverage & (b) Span \\
\end{tabular}
}
\caption[Confidence interval metrics, LAST and AR(16) vs. MEAN, all nominal times]{Confidence interval metrics showing LAST and AR(16) versus MEAN
for 0.1 to 10 second tasks on all hosts, independent of nominal time.}
\label{fig:execpred.vsmean_0_10}
\end{figure}

\begin{figure}
\centerline{
\begin{tabular}{cc}
\colfigsize
\epsfbox{pred_fracinci_vslast_0_10.epsf}
&
\colfigsize
\epsfbox{pred_ciwidth_vslast_0_10.epsf}
\\
(a) Coverage & (b) Span \\
\end{tabular}
}
\caption[Confidence interval metrics, AR(16) vs. LAST, all nominal times]{Confidence interval metrics showing AR(16) versus LAST for 0.1
to 10 second tasks on all hosts, independent of nominal time.}
\label{fig:execpred.vslast_0_10}
\end{figure}

Figures~\ref{fig:execpred.overall_0_10},
\ref{fig:execpred.vsmean_0_10}, and~\ref{fig:execpred.vslast_0_10}
provide answers to these questions.  These figures present the same
information, but arranged in different forms in order to more clearly
illustrate different observations.  For example,
Figure~\ref{fig:execpred.overall_0_10} plots the (a) coverage, and (b)
span for each of the load traces in our study.  Consider
Figure~\ref{fig:execpred.overall_0_10}(a).  The coverage of trace axp1
using the MEAN predictor is approximately 0.62.  This point is the
result of running 3000 testcases (Section~\ref{sec:execpred.method})
on the axp1 trace, selecting the approximately 1000 of these that used
the MEAN predictor, counting the number of those testcases in which
the actual running time fell within the predicted confidence interval,
and dividing by the total number of MEAN
testcases.  Recall that our goal is a 95\% confidence interval, so
MEAN is clearly inappropriate.  We can see, however, that for the same
trace, the AR(16) predictor provides much better coverage, about
92\%, while the LAST predictor does slightly better at about 98\%.  If
we look at the axp1's span metric in
Figure~\ref{fig:execpred.overall_0_10}(b), we can begin to understand
why.  To produce the point for the MEAN metric, the confidence
interval widths for the approximately 1000 MEAN testcases were
averaged.  We see that the span is about 3.4 seconds.  The AR(16)
predictor's span, computed similarly, is a slightly higher (worse) 4
seconds, while the LAST predictor's span is a huge 6.4 seconds.  The
small increase in span going from the MEAN to the AR(16) predictor
brings the coverage from 0.62 to 0.92, which is very close to the
target of 0.95.  The much greater increase in span going from AR(16)
to LAST overshoots the target coverage, going as far as 0.98.

As can be seen from Figure~\ref{fig:execpred.overall_0_10}(a), for
almost every trace in our study, there is some predictor that provides
a coverage that is near our target 95\%.  This is an important result
that supports our claim of being able to compute accurate confidence
intervals.  However, perhaps the predictors are producing inordinately
wide spans, reducing their usefulness.  The spans in
Figure~\ref{fig:execpred.overall_0_10}(b) are difficult to
interpret.  There is clearly a {\em difference} between the
predictors, and the MEAN predictor seems to usually be much wider than
the LAST and AR(16) predictors.  However, there are some traces where
the more sophisticated predictors seem to run aground, resulting in
larger spans than MEAN.

Figure~\ref{fig:execpred.overall_0_10} is powerful for making
observations about individual hosts, but it is somewhat difficult to
compare predictor performance between hosts using it.
Figure~\ref{fig:execpred.vsmean_0_10} rearranges the data in order to
make such comparisons easier.  We plot the performance metrics' values
for LAST and AR(16) for each trace versus the metrics values for the
MEAN predictor.  This helps to answer the question: What is the
benefit of using a more sophisticated predictor than MEAN?  For
example, consider Figure~\ref{fig:execpred.vsmean_0_10}(a).  For each
trace, a dot is plotted at (coverage of MEAN testcases, coverage of
LAST testcases), and a triangle is plotted at (coverage of MEAN
testcases, coverage of AR(16) testcases).  On the graph, a 45 degree
line separates the regions where the coverage is better than MEAN from
that where it is worse.  Figure~\ref{fig:execpred.vsmean_0_10}(b) is
arrived at with exactly the same methodology, except using the span
metric.

Figure~\ref{fig:execpred.vsmean_0_10}(a) and (b) make it clear that
the LAST and AR(16) predictors generally provide quite different
performance results than the simple MEAN predictor.  For nine of the
traces, the more sophisticated predictors provide significantly better
coverage than the MEAN predictor.  For the remainder of the traces,
the more sophisticated predictors have slightly lower coverage.  In
terms of the span metric, nine of the traces show significantly
wider spans than MEAN, while the remainder are much narrower.  

The nine traces on which the LAST and AR(16) predictors produce
better coverage performance are the same traces on which they produce
worse span
performance than MEAN.  These nine traces correspond to the hosts that
exhibit greater mean load (and correspondingly greater variability in
load; see Chapter~\ref{chap:statprop}).  The LAST and AR(16) predictors
are better able to ``understand'' such hosts and compute appropriately
wider confidence intervals compared to MEAN.  These wider confidence
intervals result in a far greater chance of a task's actual running
time falling within its computed confidence interval.  This is
precisely the behavior that we want.  Our goal is that 95\% of tasks
fall within their confidence intervals.  With the AR(16) predictor,
only 5 cases are less than 90\% and only one less than 85\%, whereas
with the MEAN predictor, only one of the high load traces is better
than 85\%. The gain from MEAN to AR(16) can be as much as 30\%, and it
is typically around 15\%.

Two effects are at work here.  First, the predictions of the LAST and
AR(16) predictors depend most strongly on recent measurements.  The
MEAN predictor, on the other hand, always presents the long term mean
of the signal.  As a result, the LAST and AR(16) predictors will
respond much more quickly during the period after an epoch transition
(Chapter~\ref{chap:statprop}) before a model refit happens.  This
means that their predictions, and thus the center point of the
confidence interval, will much more likely be in the right place.  The
second effect results from how the confidence interval length is
computed.  Recall that with the MEAN predictor the autocovariance of
the signal is used to compute the confidence interval, while for the
LAST and AR(16) predictors it is the autocovariance of their
prediction errors that is used.  On a high load, high variability
host, an epoch transition is more likely than on a low load, low
variability host to make the autocovariance of the signal fail to
characterize the new epoch well.  The structure of the autocovariance
of the prediction errors will not change at all, although the
individual predictions may be less accurate.

For those hosts which have lower load and variability, the LAST and
AR(16) predictors produce significantly narrower confidence intervals
than MEAN while still capturing a reasonable number of tasks within
their computed confidence intervals.  On average, the confidence
intervals are shrunk by 2-3 seconds while the fraction of tasks within
their confidence intervals shrinks by about 5\%.  Since for these
lightly loaded hosts, the MEAN predictor results in coverages that are
larger than the target 95\%, this is not an unreasonable tradeoff.
Essentially, for these low load hosts, moving from MEAN to AR(16)
reduces coverage by about 5\% while decreasing the span by 2-3
seconds (about 33\%).

At this point, we have shown that our algorithm does indeed compute
reasonable confidence intervals for task running times and that it
does so more accurately when using a more sophisticated predictor than
MEAN.  Now we would like to know whether we should prefer the LAST
predictor or the AR(16) predictor.  We have already pointed out some
of the differences between these two.  To continue, it is useful to
plot the data differently once again.
Figure~\ref{fig:execpred.vslast_0_10} plots the performance of the
AR(16) predictor versus the LAST predictor.  From
Figure~\ref{fig:execpred.vslast_0_10}(a), we can see that the
confidence intervals computed using AR(16) generally include more of
their tasks than those computed using LAST.  Using the AR(16)
predictor, only four of the cases are at less than 90\% and only one
less than 85\%.  Using LAST, nine are less than 90\%, while four are less
than 85\%.  This gain is due to AR(16) predictors producing slightly
wider confidence intervals, as can be seen from
Figure~\ref{fig:execpred.vslast_0_10}(b).  The span increases by 0.5
to 1 second in moving from LAST to AR(16).

On heavily loaded hosts, the gain in moving from LAST to AR(16) is
more significant --- either we see a large increase in coverage, on
the order of 10\% or more, or there is a slight decline of 5\% or
significantly less.  Correspondingly, the span either grows on the
order of 2-3 seconds or it shrinks by 1-2 seconds.    

To summarize, the benefit of using the LAST or AR(16) predictor over
MEAN is twofold.  For heavily loaded hosts, the more sophisticated
predictors compute wider confidence intervals which lead to a much
larger fraction of the tasks running within their confidence
intervals.  For lightly loaded hosts, LAST and AR(16) produce much
narrower confidence intervals than MEAN at a small, reasonable cost to
coverage.

The benefit of using the AR(16) predictor over the LAST predictor is
also twofold.  On about half of the heavily loaded traces, it
produces wider confidence intervals that
significantly increase coverage, driving it much closer to the target
of 95\%.  On the other half, it shrinks the confidence intervals at
small cost to coverage.  For almost all lightly loaded hosts, the span
increases slightly, driving the coverage slightly closer to the target
level.


\subsubsection{Effect of nominal time on confidence interval quality}

\begin{figure}
\centerline{
\begin{tabular}{cc}
\colfigsize
\epsfbox{pred_fracinci_overall_0_3.epsf}
&
\colfigsize
\epsfbox{pred_ciwidth_overall_0_3.epsf}
\\
(a) Coverage, 0.1 to 3 second tasks & (b) Span, 0.1 to 3 second tasks \\
\colfigsize
\epsfbox{pred_fracinci_overall_3_6.epsf}
&
\colfigsize
\epsfbox{pred_ciwidth_overall_3_6.epsf}
\\
(c) Coverage, 3 to 6 second tasks & (d) Span, 3 to 6 second tasks \\
\colfigsize
\epsfbox{pred_fracinci_overall_6_10.epsf}
&
\colfigsize
\epsfbox{pred_ciwidth_overall_6_10.epsf}
\\
(e) Coverage, 6 to 10 second tasks & (f) Span, 6 to 10 second tasks \\
\end{tabular}
}
\caption
[Effect of nominal time on confidence interval metrics, all hosts]
{Effect of nominal time on confidence interval metrics, all hosts.}
\label{fig:execpred.overall_all}
\end{figure}


\begin{figure}
\centerline{
\begin{tabular}{cc}
\colfigsize
\epsfbox{pred_fracinci_vsmean_0_3.epsf}
&
\colfigsize
\epsfbox{pred_ciwidth_vsmean_0_3.epsf}
\\
(a) Coverage, 0.1 to 3 second tasks & (b) Span, 0.1 to 3 second tasks \\
\colfigsize
\epsfbox{pred_fracinci_vsmean_3_6.epsf}
&
\colfigsize
\epsfbox{pred_ciwidth_vsmean_3_6.epsf}
\\
(c) Coverage, 3 to 6 second tasks & (d) Span, 3 to 6 second tasks \\
\colfigsize
\epsfbox{pred_fracinci_vsmean_6_10.epsf}
&
\colfigsize
\epsfbox{pred_ciwidth_vsmean_6_10.epsf}
\\
(e) Coverage, 6 to 10 second tasks & (f) Span, 6 to 10 second tasks \\
\end{tabular}
}
\caption
[Effect of nominal time on confidence interval metrics, LAST and AR(16) compared to MEAN]
{Effect of nominal time on confidence interval metrics, all
hosts, LAST and AR(16) compared to MEAN.}
\label{fig:execpred.vsmean_all}
\end{figure}


\begin{figure}
\centerline{
\begin{tabular}{cc}
\colfigsize
\epsfbox{pred_fracinci_vslast_0_3.epsf}
&
\colfigsize
\epsfbox{pred_ciwidth_vslast_0_3.epsf}
\\
(a) Coverage, 0.1 to 3 second tasks & (b) Span, 0.1 to 3 second tasks \\
\colfigsize
\epsfbox{pred_fracinci_vslast_3_6.epsf}
&
\colfigsize
\epsfbox{pred_ciwidth_vslast_3_6.epsf}
\\
(c) Coverage, 3 to 6 second tasks & (d) Span, 3 to 6 second tasks \\
\colfigsize
\epsfbox{pred_fracinci_vslast_6_10.epsf}
&
\colfigsize
\epsfbox{pred_ciwidth_vslast_6_10.epsf}
\\
(e) Coverage, 6 to 10 second tasks & (f) Span, 6 to 10 second tasks \\
\end{tabular}
}
\caption
[Effect of nominal time on confidence interval metrics, all
hosts, AR(16) compared to LAST]
{Effect of nominal time on confidence interval metrics, all
hosts, AR(16) compared to LAST.}
\label{fig:execpred.vslast_all}
\end{figure}

The performance of task running time prediction depends on the nominal
time of the task, $t_{nom}$.  To illustrate this dependence, we use the
same methodology to mine the testcases as in
Section~\ref{sec:execpred.overall}, but we condition the results based
on $t_{nom}$.  Recall that the range of $t_{nom}$ is from 0.1 to 10
seconds.  In Section~\ref{sec:execpred.overall}, to produce a point on
a graph, we used testcases from this entire range.  In this section,
we divide the range of $t_{nom}$ into three subranges: 0.1 to 3
seconds (``small tasks''), 3 to 6 seconds (``medium tasks''), and 6 to
10 seconds (``large tasks'').  Then, for each of these subranges we
produce a graph.  Where in Section~\ref{sec:execpred.overall} we had
one graph where each point represented the average of approximately
1000 testcases, in this section we will have three graphs and a point
on a graph represents the average of approximately 333 testcases in
the appropriate range.  The semantics of the graphs are identical to
those of Section~\ref{sec:execpred.overall}.  In the next section, we
will delve deeper, dividing the hosts into five classes and
illustrating how performance depends on nominal time for individual
representative hosts.

Figures~\ref{fig:execpred.overall_all}, \ref{fig:execpred.vsmean_all},
and \ref{fig:execpred.vslast_all} illustrate how the coverage and span
of confidence intervals depend on the nominal time.  The first figure
presents the performance metrics of the three predictors on each of
the traces for each of the three sizes of tasks.
Figures~\ref{fig:execpred.overall_all}(a) and (b) show the coverage and
span for tasks of 0.1 to 3 second nominal times, (c) and (d) for 3 to
6 second tasks, and (e) and (f) for 6 to 10 second tasks.

In terms of the coverage, Figure~\ref{fig:execpred.overall_all}
appears to show that all of the predictors do better as the nominal
time increases.  For short 0.1 to 3 second tasks
(Figure~\ref{fig:execpred.overall_all}(a)), there are a number of
cases where none of the predictors provide better than 90\% coverage,
while for 6 to 10 second tasks
(Figure~\ref{fig:execpred.overall_all}(e)), there is almost always
some predictor which provides coverage of over 95\%.  In terms of the
span, there appears to be a relative decrease in the gain of the more
sophisticated predictors.

Figure~\ref{fig:execpred.vsmean_all} replots the data to compare the
LAST and AR(16) predictors to the MEAN predictor while
Figure~\ref{fig:execpred.vslast_all} compares the AR(16) predictor
with the LAST predictor.  This illustrates the benefits of moving to
the more sophisticated predictors.  The benefit clearly increases as
the nominal time increases.  For example, at 0.1 to 3 seconds
(Figure~\ref{fig:execpred.vsmean_all}(a) and (b)) we can see that the
more sophisticated predictors actually have worse coverage than the
simple MEAN predictor because their spans are too narrow.  In general,
however, we can see that the AR(16) predictor does better than LAST
(Figure~\ref{fig:execpred.vslast_all}(a) and (b)), producing wider
spans that result in better coverage.  However, this is clearly a bad
case, since even with AR(16), the coverage on half of the traces is
less than 80\%.

As the nominal time increases
(Figure~\ref{fig:execpred.vsmean_all}(c)--(f),
Figure~\ref{fig:execpred.vslast_all}(c)--(f)) we begin to see a
picture that is more similar to the overall picture of
Section~\ref{sec:execpred.overall}. On the heavily loaded hosts, the
more sophisticated predictors produce wider spans which result in much
better (and more appropriate) coverage.  On the lightly loaded hosts,
the coverage is similar between the different predictors, but LAST and
AR(16) produce much narrower spans.  In comparing LAST and AR(16), we
can see that as the nominal time increases the relationship between
these two predictors becomes more complex.  For medium sized tasks
(Figure~\ref{fig:execpred.vslast_all}(c)--(d)), AR(16) produces large
gains in coverage over LAST, bringing coverage to the target 95\%
level in many cases.  This is bought at a typically small increase in
span, although the difference is larger for more heavily loaded hosts.
For large tasks (Figure~\ref{fig:execpred.vslast_all}(e)--(f)), LAST
and AR(16) seem to be in a dead heat.
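
For reference, the three host load predictors compared throughout this
section can be sketched as follows.  This is a simplified
illustration: in practice the AR(16) coefficients are fit to each load
trace, whereas this sketch simply takes them as inputs:

```python
# Sketch of the three host load predictors compared in this section.
# `history` holds recent load measurements, newest last.  In practice
# the AR coefficients are fit to each trace; here they are inputs.

def predict_mean(history):
    """MEAN: predict the long-term average load."""
    return sum(history) / len(history)

def predict_last(history):
    """LAST: predict that load stays at its most recent value."""
    return history[-1]

def predict_ar(history, coeffs, mean):
    """AR(p): linear combination of the last p deviations from the
    mean; coeffs[0] weights the newest measurement."""
    p = len(coeffs)
    devs = [x - mean for x in history[-p:]]
    return mean + sum(c * d for c, d in zip(coeffs, reversed(devs)))

hist = [0.1, 0.2, 0.4, 0.9]
print(predict_mean(hist))                      # average load
print(predict_last(hist))                      # → 0.9
print(predict_ar(hist, [0.6, 0.2], mean=0.4))  # AR(2) example
```

MEAN and LAST are the two extremes --- all history versus only the
newest sample --- while an AR model interpolates between them.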


\subsubsection{Nominal time independent evaluation of quality of
expected running times}
\label{sec:execpred.expected_overall}

\begin{figure}
\centerline{
\colfigsize
\epsfbox{pred_tacttexpr2_overall_0_10.epsf}
}
\caption
[$R^2$, all hosts, all nominal times]
{$R^2$ metric for 0.1 to 10 second tasks on all hosts, independent of
nominal time.}
\label{fig:execpred.r2_overall_0_10}
\end{figure}

\begin{figure}
\centerline{
\colfigsize
\epsfbox{pred_tacttexpr2_vsmean_0_10.epsf}
}
\caption
[$R^2$, LAST and AR(16) vs. MEAN, all nominal times]
{$R^2$ metric showing LAST and AR(16) versus MEAN for  0.1 to 10
second tasks on all hosts, independent of nominal time.}
\label{fig:execpred.r2_vsmean_0_10}
\end{figure}

\begin{figure}
\centerline{
\colfigsize
\epsfbox{pred_tacttexpr2_vslast_0_10.epsf}
}
\caption
[$R^2$, AR(16) vs. LAST, all nominal times]
{$R^2$ metric showing AR(16) versus LAST for  0.1 to 10 second
tasks on all hosts, independent of nominal time.}
\label{fig:execpred.r2_vslast_0_10}
\end{figure}

This section evaluates how well the expected running time, $t_{exp}$,
provided by our algorithm predicts the actual running time, $t_{act}$,
encountered when running the task.  We begin by using the same
methodology as the previous sections, but with the $R^2$ metric.  We
also discuss certain important characteristics that are not measured
by $R^2$.

Figure~\ref{fig:execpred.r2_overall_0_10} shows the $R^2$ values
measured on each of the traces using each of the predictors.  Each
point represents approximately 1000 testcases.  As we can see, in
almost all cases, some predictor provides an $R^2>0.9$; that is, our
task running time predictions explain over 90\% of the variability in
the running time.  There is, however, little distinction between the
different predictors.
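
Since the discussion below turns on what $R^2$ does and does not
measure, here is a generic sketch of the metric as the squared
correlation between actual and predicted running times (not our
measurement code); note that it captures only the strength of the
linear relationship, not its slope or intercept:

```python
# Sketch: R^2 as the squared correlation between actual and predicted
# running times.  It measures only the strength of the linear
# relationship, not its slope or intercept.

def r_squared(actual, predicted):
    n = len(actual)
    ma = sum(actual) / n
    mp = sum(predicted) / n
    cov = sum((a - ma) * (p - mp) for a, p in zip(actual, predicted))
    var_a = sum((a - ma) ** 2 for a in actual)
    var_p = sum((p - mp) ** 2 for p in predicted)
    return cov * cov / (var_a * var_p)

# A predictor that is exactly half the actual time still gets
# R^2 = 1: the relationship is perfectly linear, just with slope 2.
print(r_squared([2.0, 4.0, 6.0], [1.0, 2.0, 3.0]))  # → 1.0
```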

Figure~\ref{fig:execpred.r2_vsmean_0_10} plots the $R^2$ values of the
LAST and AR(16) predictors versus their corresponding values for the
MEAN predictor.  Roughly half of the hosts benefit from the more
sophisticated predictors.  Among these are the more heavily loaded
hosts.  The other half of the hosts do not benefit from the more
sophisticated predictors, but their loss is not as significant as the
gains of the half that do.  It is also clear that the AR(16)
predictors generally have very little loss compared to MEAN.

Figure~\ref{fig:execpred.r2_vslast_0_10} compares the $R^2$ values of
the LAST and AR(16) predictors.  About 2/3 of the hosts see some
benefit from the move to the AR(16) predictor, while the rest see some
detriment.  The gain or loss is on the order of 0.05.

It turns out that most of the variation in the running time of tasks
can be explained by the nominal time of the task itself.  Typically,
the $R^2$ using the nominal time as the predictor is on the order of
0.8, meaning that 80\% of the variation in the running time can be
explained simply by the nominal time of the task, $t_{nom}$.  This
leaves little room for improvement by the predictor, which is why the
predictors have similar performance.

This is not to say that the nominal time is a good overall predictor,
because it can be very sensitive to transient behavior that a
predictor based on host load can more appropriately deal with.
Furthermore, the $R^2$ value shows how strong a linear relationship
there is between the actual running time and the prediction, but it
gives no insight into the slope and intercept of that relationship.
Ideally, a good predictor would have $t_{act} = m\,t_{exp} + b +
\mathit{noise}$ with $m=1$ and $b=0$.  Intuitively, the value of $m$
is determined by how well the predictor captures the contribution of
the load to the running time.
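
The slope $m$ and intercept $b$ of this relationship can be estimated
with an ordinary least-squares fit; a minimal sketch (illustrative,
not our implementation):

```python
# Sketch: ordinary least-squares estimate of the slope m and
# intercept b in t_act = m * t_exp + b.  Illustrative only.

def linear_fit(x, y):
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    m = sxy / sxx
    b = my - m * mx
    return m, b

# A predictor that halves the true running time fits with m = 2,
# far from the ideal m = 1, b = 0.
print(linear_fit([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # → (2.0, 0.0)
```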

\begin{figure}
\centerline{
\begin{tabular}{cc}
\colfigsize
\epsfbox{pred_compare_direct_mojave_tnom.epsf}
&
\colfigsize
\epsfbox{pred_compare_direct_mojave_last.epsf}
\\
(a) $t_{nom}$ predictor & (b) LAST predictor \\
\end{tabular}
}
\caption
[Actual versus expected running times, lightly loaded host]
{Actual versus expected running times for lightly loaded host
with intermittent load.}
\label{fig:execpred.mojave}
\end{figure}

Figure~\ref{fig:execpred.mojave} illustrates how a more sophisticated
predictor can avoid transient behavior, resulting in significantly
higher $R^2$.  Each of the graphs plots 1000 tasks chosen from 0.1 to
30 seconds, plotting their actual running time versus the predicted
running time, determined either as (a) the nominal time $t_{nom}$ or
(b) by using the LAST predictor.  We have also plotted a 50\% error
region around the predictions.  Because the host had a background
process that ran at infrequent times, the $t_{nom}$ predictor turned
out to be wildly inaccurate for about 10\% of the tasks,
under-predicting running times by a factor of 2.  The LAST predictor,
which incorporates the most recent host load measurements, was able to
detect when the background process ran, offering higher predicted
running times when this was the case.  Because of this, the $R^2$ rose
from 0.79 to 0.99.  More importantly, the linear fit improved from
$m=1.1$ and $b=0.02$ to $m=1$ and $b=-0.03$.

\begin{figure}
\centerline{
\begin{tabular}{cc}
\colfigsize
\epsfbox{pred_compare_direct_axp0_tnom.epsf}
&
\colfigsize
\epsfbox{pred_compare_direct_axp0_ar16.epsf}
\\
(a) $t_{nom}$ predictor & (b) AR(16) predictor \\
\end{tabular}
}
\caption
[Actual versus expected running time, heavily loaded host]
{Actual versus expected running times for heavily loaded host
(axp0 trace).}
\label{fig:execpred.axp0}
\end{figure}

Figure~\ref{fig:execpred.axp0} shows another typical behavior.  In
this case we have run approximately 3000 tasks ranging from 0.1 to 10
seconds and have plotted them as per the previous figure.  The host
was playing back the high load axp0 trace.  In
Figure~\ref{fig:execpred.axp0}(a) we see that over 74\% of the
$t_{nom}$ predictions are wrong by more than 50\%.  On the other hand,
using the AR(16) predictor (Figure~\ref{fig:execpred.axp0}(b)) only
3\% of the predictions are wrong by more than 50\%.  However, note
that the $R^2$ value has changed only marginally, from 0.82 to 0.87
--- the linear fit to the $t_{nom}$ graph is not much worse than that
of the AR(16) predictor.  However, the fit has improved from
$m=2.15$ and $b=0.82$ to $m=0.95$ and $b=0.06$.


\subsubsection{Effect of nominal time on quality of expected running times}

\begin{figure}
\centerline{
\begin{tabular}{c}
\colfigsize
\epsfbox{pred_tacttexpr2_overall_0_3.epsf} \\
(a) $R^2$, 0.1 to 3 second tasks \\
\colfigsize
\epsfbox{pred_tacttexpr2_overall_3_6.epsf} \\
(b) $R^2$, 3 to 6 second tasks \\
\colfigsize
\epsfbox{pred_tacttexpr2_overall_6_10.epsf} \\
(c) $R^2$, 6 to 10 second tasks \\
\end{tabular}
}
\caption
[Effect of nominal time on $R^2$, all hosts]
{Effect of nominal time on $R^2$ metric, all hosts.}
\label{fig:execpred.r2_overall_all}
\end{figure}

\begin{figure}
\centerline{
\begin{tabular}{c}
\colfigsize
\epsfbox{pred_tacttexpr2_vsmean_0_3.epsf} \\
(a) $R^2$, 0.1 to 3 second tasks \\
\colfigsize
\epsfbox{pred_tacttexpr2_vsmean_3_6.epsf} \\
(b) $R^2$, 3 to 6 second tasks \\
\colfigsize
\epsfbox{pred_tacttexpr2_vsmean_6_10.epsf} \\
(c) $R^2$, 6 to 10 second tasks \\
\end{tabular}
}
\caption
[Effect of nominal time on $R^2$, LAST and AR(16) vs. MEAN]
{Effect of nominal time on $R^2$ metric, all hosts, LAST and
AR(16) compared to MEAN.}
\label{fig:execpred.r2_vsmean_all}
\end{figure}


\begin{figure}
\centerline{
\begin{tabular}{c}
\colfigsize
\epsfbox{pred_tacttexpr2_vslast_0_3.epsf} \\
(a) $R^2$, 0.1 to 3 second tasks \\
\colfigsize
\epsfbox{pred_tacttexpr2_vslast_3_6.epsf} \\
(b) $R^2$, 3 to 6 second tasks \\
\colfigsize
\epsfbox{pred_tacttexpr2_vslast_6_10.epsf} \\
(c) $R^2$, 6 to 10 second tasks \\
\end{tabular}
}
\caption
[Effect of nominal time on $R^2$, AR(16) vs. LAST]
{Effect of nominal time on $R^2$ metric, all hosts, AR(16)
compared to LAST.}
\label{fig:execpred.r2_vslast_all}
\end{figure}

The $R^2$ of a prediction is strongly dependent on the nominal time of
the task, but is somewhat independent of which host load predictor is
used.  Figure~\ref{fig:execpred.r2_overall_all} illustrates how the
average $R^2$ value measured for each of the load traces depends on
the nominal time.  As can be seen from the figure, $R^2$ values
decline significantly as tasks grow in size.  For small tasks
(Figure~\ref{fig:execpred.r2_overall_all}(a)), there is generally some
predictor which provides $R^2>0.9$, while for large tasks
(Figure~\ref{fig:execpred.r2_overall_all}(c)), most $R^2$ values are
significantly below 0.8.  Furthermore, as tasks increase in size,
there seems to be greater variability in the performance of the
different predictors.  

This variability in performance among the predictors is not, alas,
evidence that one of the predictors is consistently better than the
others for all, or even a large group of hosts.  Indeed,
Figure~\ref{fig:execpred.r2_vsmean_all}, which compares the $R^2$ of
the LAST and AR(16) predictors to that of the MEAN predictor, and
Figure~\ref{fig:execpred.r2_vslast_all}, which compares the $R^2$ of
the LAST and AR(16) predictors, show that there are no clear
favorites.  As running times increase, $R^2$ values decline
precipitously and no consistent pattern of difference among the hosts
emerges --- no one predictor is preferable.  If $R^2$ is the desired
figure of merit, then a multiple expert approach may be warranted, in
which different predictors are used simultaneously and the predictor
with the best recent predictions is the one whose value is reported to
the user.
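
Such a multiple expert scheme might be sketched as follows.  This is
one possible realization only; the class, window size, and scoring
rule are illustrative assumptions, not part of our system:

```python
# Sketch of a multiple expert scheme: run every predictor, track each
# one's recent squared error, and report the value from the predictor
# with the lowest recent error.  Purely illustrative.

from collections import deque

class MultiExpert:
    def __init__(self, predictors, window=10):
        self.predictors = predictors  # name -> callable(history) -> prediction
        self.errors = {name: deque(maxlen=window) for name in predictors}
        self.pending = {}             # most recent prediction per expert

    def predict(self, history):
        self.pending = {n: p(history) for n, p in self.predictors.items()}
        def score(name):
            errs = self.errors[name]
            return sum(errs) / len(errs) if errs else float("inf")
        # lowest mean recent squared error wins; ties broken by name
        best = min(self.pending, key=lambda name: (score(name), name))
        return best, self.pending[best]

    def observe(self, actual):
        # after the task completes, charge each expert its squared error
        for name, pred in self.pending.items():
            self.errors[name].append((pred - actual) ** 2)
```

After each task completes, `observe` records every expert's error, so
later calls to `predict` steer toward whichever predictor has been
most accurate recently.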


\subsubsection{Host classes}

For each individual load trace, we can plot our three performance
metrics (coverage, span, and $R^2$) versus the nominal time
$t_{nom}$.  When we did this, we found that an interesting pattern
emerged.  By visual inspection, the results for the 39 traces could be
placed into five categories.  At present, we make no use of these
categorizations, and we certainly have not automated the
characterization process.  Nonetheless, examining representatives of
each of the classes is enlightening and permits us to make
recommendations. 

\begin{figure}
\centerline{
\begin{tabular}{cc}
\colfigsize
\epsfbox{predresults_axp2_tnom0:2:10_.table.percent.ci.eps}
&
\colfigsize
\epsfbox{predresults_axp2_tnom0:2:10_.table.avgciwidth.noci.eps}
\\
(a) Coverage & (b) Span \\
\multicolumn{2}{c}{
\colfigsize
\epsfbox{predresults_axp2_tnom0:2:10_.table.tactvtexp_r2.eps}
} \\
\multicolumn{2}{c}{
(c) $R^2$
}\\
\end{tabular}
}
\caption
[Confidence interval metrics on Class I hosts]
{Performance metrics as function of time for Class I
(``typical low load host'') host axp2.}
% low load 2
\label{fig:execpred.class_i}
\end{figure}

{\bf Class I} : Class I, which we also call the ``typical low load
host'' class, represents by far the most common behavior that we have
encountered. The class consists of 29 of the 39 hosts (74\%): axp2,
axp3, axp6, axp7, axp8, axp9, axpfea, axpfeb, manchester-1,
manchester-2, manchester-5, manchester-6, manchester-7, manchester-8,
mojave, sahara, aphrodite, asbury-park, asclepius, cobain, darryl,
hestia, hawaii, newark, pryor, rhea, rubix, uranus, and zeno.

Class I is exemplified by trace axp2, whose metrics we plot as
functions of the nominal time in Figure~\ref{fig:execpred.class_i}.
Each point in the graph is the average of about 200 testcases and
covers a 2 second span of nominal time, extending from one second
before the point's x-coordinate to one second after.  The main
characteristics of the class are the following.  The coverage is only
slightly dependent on the nominal time, increasing slightly for all
predictors as the nominal time increases.  The MEAN predictor
typically has almost 100\% coverage and is closely followed by the
AR(16) and then the LAST predictor.  The LAST and AR(16) predictors
have significantly narrower spans than the MEAN predictor, with AR(16)
producing slightly wider spans than LAST.  The $R^2$ drops
precipitously with increasing nominal time, especially for MEAN.  The
drop is less precipitous for LAST and AR(16), and LAST is usually
slightly better than AR(16) in terms of $R^2$.  For this particular
trace, the $R^2$ of LAST and AR(16) turns slightly upward at the
largest nominal times, while in other cases the decline is monotonic.

Our primary concern is producing accurate confidence intervals.  For
this reason, we believe that the AR(16) is the best predictor for this
class of host.  The coverage is nearly as good as MEAN and is
typically near the target 95\% point, while LAST tends to lag behind,
especially for smaller tasks.  Furthermore, the span of AR(16) is
typically half that of MEAN and only slightly wider than LAST.  On
most hosts, then, a better predictor produces much narrower, yet
still accurate, confidence intervals.

\begin{figure}
\centerline{
\begin{tabular}{cc}
\colfigsize
\epsfbox{predresults_axp1_tnom0:2:10_.table.percent.ci.eps}
&
\colfigsize
\epsfbox{predresults_axp1_tnom0:2:10_.table.avgciwidth.noci.eps}
\\
(a) Coverage & (b) Span  \\
\multicolumn{2}{c} {
\colfigsize
\epsfbox{predresults_axp1_tnom0:2:10_.table.tactvtexp_r2.eps}
}
\\
\multicolumn{2}{c} {
(c) $R^2$
}
\\
\end{tabular}
}
\caption
[Confidence interval metrics on Class II hosts]
{Performance metrics as function of time for Class II
(``atypical low load host'') host axp1.}
%(``low load 1'')
\label{fig:execpred.class_ii}
\end{figure}

{\bf Class II} : Class II hosts, which we refer to as being in the
``atypical low load host'' class, present the second most common
behavior among our traces.  The class consists of 4 of the 39 
hosts (10\%): axp1, manchester-3, manchester-4, and bruce.

An exemplar of Class II is the trace axp1.  We have plotted the
metrics of axp1 in Figure~\ref{fig:execpred.class_ii}.  The methodology
behind the graphs is identical to that of Class I.  An important
distinguishing feature of this class is that the coverage of the MEAN
predictor drops precipitously with increasing nominal time because the
span of its confidence interval is not sufficiently large.  In
contrast, LAST and AR(16) compute slightly larger confidence intervals
which result in excellent coverage that increases with increasing
nominal time.  LAST and AR(16) have similar coverage (in this case
LAST is slightly ahead, in other cases AR(16) is slightly ahead).  The
$R^2$ of MEAN also decays quickly with increasing nominal time, while
those of LAST and AR(16) decay much more slowly.  Again, in some
cases, AR(16) is slightly ahead, in others LAST is slightly ahead by
this metric.  

In terms of computing confidence intervals, either AR(16) or LAST
seems adequate for producing confidence intervals for this class of
host.  Compared to MEAN, both produce significantly larger spans that
result in much better coverage.   In terms of the expected time,
AR(16) and LAST are also to be strongly preferred over the MEAN
predictor, while there is no clear advantage to recommending one over
the other.

\begin{figure}
\centerline{
\begin{tabular}{cc}
\colfigsize
\epsfbox{predresults_axp4_tnom0:2:10_.table.percent.ci.eps}
&
\colfigsize
\epsfbox{predresults_axp4_tnom0:2:10_.table.avgciwidth.noci.eps}
\\
(a) Coverage & (b) Span \\
\multicolumn{2}{c}{
\colfigsize
\epsfbox{predresults_axp4_tnom0:2:10_.table.tactvtexp_r2.eps}
}
\\
\multicolumn{2}{c}{
(c) $R^2$ 
}\\
\end{tabular}
}
\caption
[Confidence interval metrics on Class III hosts]
{Performance metrics as function of time for typical class III
(``high load 1'') host axp4.}
\label{fig:execpred.class_iii}
%(``high load 1'')
\end{figure}

{\bf Class III} : The remainder of the five host classes all contain
high load hosts.  There does not seem to be a ``typical'' behavior on
a high load host, so we will simply enumerate these classes.  Class
III, which we also call ``high load 1'', consists of 3 of the 39
hosts (8\%): axp4, axp5, and argus.

Using the same methodology as before,
Figure~\ref{fig:execpred.class_iii} plots the performance metrics as a
function of the nominal time for an exemplar, axp4.  Compared to the
low load hosts, this high load 1 host displays much more complex
behavior.  The predictor with the best coverage depends strongly on
the nominal time.  For very short tasks, MEAN is slightly better than
AR(16), which is much better than LAST, although the coverage is quite
poor with all three predictors.  For medium size tasks, AR(16)
provides the best coverage, followed at a distance by MEAN and LAST,
which become interchangeable.  For large tasks, AR(16) and LAST have
similar coverage, with AR(16) lagging slightly, while MEAN's coverage
is far behind.  In terms of the span, AR(16) and LAST both compute
much wider confidence intervals than MEAN, which explains why their
coverage is so much better.  MEAN is unable to capture the dynamic
load behavior of this kind of host.  Predictably, for the nominal times
where AR(16) is preferable to LAST, it has a larger span.  In terms of
the $R^2$, AR(16) and LAST are clearly preferable to MEAN since their
performance declines much more gradually with increasing nominal time.
They are interchangeable.

In terms of computing accurate expected running times for this class
of hosts, either AR(16) or LAST are appropriate since their similar
$R^2$ metrics are significantly better than that of MEAN.  In terms of
computing accurate confidence intervals, the best predictor is highly
dependent on the nominal time.  For very short tasks, MEAN or AR(16)
is preferable, but either has rather poor coverage.  For medium tasks,
AR(16) produces the best performance.  For large tasks, LAST is best. 


\begin{figure}
\centerline{
\begin{tabular}{cc}
\colfigsize
\epsfbox{predresults_axp10_tnom0:2:10_.table.percent.ci.eps}
&
\colfigsize
\epsfbox{predresults_axp10_tnom0:2:10_.table.avgciwidth.noci.eps}
\\
(a) Coverage & (b) Span\\
\multicolumn{2}{c}{
\colfigsize
\epsfbox{predresults_axp10_tnom0:2:10_.table.tactvtexp_r2.eps}
}
\\
\multicolumn{2}{c}{
(c) $R^2$
} \\
\end{tabular}
}
\caption
[Confidence interval metrics on Class IV hosts]
{Performance metrics as function of time for typical class IV (``high load 2'') host axp10.}
\label{fig:execpred.class_iv}
\end{figure}

{\bf Class IV} : This class, which we also refer to as the ``high load
2'' class, contains two hosts: axp10 and themis.  

Figure~\ref{fig:execpred.class_iv} plots the performance of the
predictors on a representative trace, axp10, using the same
methodology as before.  We can see that the coverage of LAST and
AR(16) are virtually identical here and increase slowly with nominal
time.  MEAN has similar coverage for small tasks, but then behaves
increasingly poorly, with coverage decreasing rapidly with nominal
time.  In terms of the span, LAST grows much more quickly than MEAN
with increasing nominal time, while AR(16) is almost exactly in
between them.  For very short nominal times the spans are all
identical.  The $R^2$ of MEAN drops steadily with increasing nominal
time, while the $R^2$ values of LAST and AR(16), which remain
virtually identical, drop much more slowly.

In terms of computing confidence intervals, AR(16) clearly produces
the best results for this class of hosts, getting coverage identical
to that of LAST with a span that is often half as wide.  In terms of
computing expected times, AR(16) and LAST are nearly identical, and
significantly better than MEAN.

\begin{figure}
\centerline{
\begin{tabular}{cc}
\colfigsize
\epsfbox{predresults_axp0_tnom0:2:10_.table.percent.ci.eps}
&
\colfigsize
\epsfbox{predresults_axp0_tnom0:2:10_.table.avgciwidth.noci.eps}
\\
(a) Coverage & (b) Span \\
\multicolumn{2}{c}{
\colfigsize
\epsfbox{predresults_axp0_tnom0:2:10_.table.tactvtexp_r2.eps}
}
\\
\multicolumn{2}{c}{
(c) $R^2$
}
\\
\end{tabular}
}
\caption
[Confidence interval metrics on Class V hosts]
{Performance metrics as function of time for typical class V (``high load 3'') host axp0.}
\label{fig:execpred.class_v}
\end{figure}

{\bf Class V} : Class V, which we also refer to as the ``high load 3''
class, consists of a single host, axp0.  

Figure~\ref{fig:execpred.class_v} plots the performance of the
predictors on axp0 using the same methodology as before.  In terms of
coverage, AR(16) is clearly the winner here, especially for medium
sized tasks.  It achieves its reasonable coverage (the goal is 95\%)
by computing slightly larger confidence intervals than MEAN.  LAST
computes confidence intervals that are far too small, resulting in
abysmal coverage.  In terms of $R^2$, LAST and AR(16) provide nearly
identical performance that drops much more gradually with increasing
nominal time than that of MEAN. 

AR(16) is clearly the preferable predictor for this class of hosts in
terms of computing confidence intervals.  For the expected time,
AR(16) and LAST are interchangeable here.


\section{Conclusion}

We began this chapter with a resource prediction service capable of
providing accurate host load predictions.  However, applications and
application schedulers are interested in application-level
predictions.  In particular, our advisory real-time service needs
predictions of task running time.  To provide these predictions, we
developed an algorithm that transforms host load predictions and a
task's nominal time, a measure of the task's CPU demand, into a
prediction of the task's running time.  The prediction is expressed
both as a point estimate (the expected running time) and a confidence
interval for the running time.  

We implemented the algorithm to provide a new service, the task
running time prediction service, on top of the existing host load
prediction service.  We constructed a sophisticated run-time evaluation
infrastructure based on the new technique of load trace replay and
evaluated our implementation using it.  The system works well when
combined with an appropriate host load predictor, such as the AR(16)
predictor that we found most appropriate when we evaluated host load
prediction in the previous chapter.  

The following summarizes the important elements of task running time
prediction and its properties:
\begin{icompact}
\item Our algorithm is based on a discretized version of a fluid model
of a priority-less host scheduler.  The assumption is that all
processes run at the same priority.
\item The expected running time is computed based on the individual host load
predictions. 
\item The confidence interval is computed based on the estimated
covariances of host load prediction errors.
\item Capturing the correlation between host load prediction errors in
the form of these covariances is of critical importance in computing
accurate confidence intervals.
\item Load discounting is needed to capture the priority boost that
a Unix scheduler gives a task when it begins.
\item Load trace playback is an effective tool for (re)producing a realistic
workload on a host and is the basis of our evaluation infrastructure.
\item The combination of our algorithm and an appropriate host load
predictor can indeed produce effective predictions of task running time.
\item There is almost always a host load predictor that can be used to
compute reasonable confidence intervals, both in terms of having the
desired coverage and having small span.
\item Of the host load predictors we studied (MEAN, LAST, and
AR(16)), the best is usually AR(16).  This is in keeping with our
results from the previous chapter.
\item There is a significant performance gain, in terms of either
shorter span or better coverage, in moving from the MEAN to the LAST
predictor.
\item Compared to MEAN, the LAST and AR(16) predictors tend to provide
significantly better coverage (high load hosts) or significantly
shorter spans (low load hosts).
\item There is a smaller, but significant performance gain in moving
from the LAST predictor to the AR(16) predictor, especially on more
heavily loaded hosts.
\item The AR(16) and LAST predictors perform well in terms of the
expected running times of tasks.  
\item Predictor performance depends on the nominal time of the task.
The quality of confidence intervals, in terms of span and coverage,
increases with increasing nominal time.  On the other hand, the
quality of expected running times generally declines with increasing
nominal time.  For both predictions the differences between the different
predictors grows with increasing nominal time.
\item We identified five different classes of behavior in how the
performance metrics vary with the underlying host load predictor and
with the task nominal time.
\item For all the classes, for most nominal times, the AR(16)
predictor is preferable.
\end{icompact}
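
The core of the running time computation summarized above can be
sketched as follows.  This is a simplified discretization under the
equal-priority fluid model, in which a task competing with predicted
load $l$ receives a $1/(1+l)$ share of the CPU; it omits the
confidence interval computation and load discounting:

```python
# Sketch: discretized fluid model of a priority-less host scheduler.
# In a step of length dt during which the predicted load is l, a task
# sharing the CPU equally with the competing load receives a
# 1/(1 + l) fraction of the CPU.  We accumulate CPU service until it
# reaches the task's nominal time.  Simplified: no confidence
# interval, no load discounting.

def expected_running_time(t_nom, load_predictions, dt=1.0):
    served = 0.0   # CPU seconds delivered to the task so far
    elapsed = 0.0  # wall-clock seconds so far
    for load in load_predictions:
        rate = 1.0 / (1.0 + load)  # task's CPU share this step
        if served + rate * dt >= t_nom:
            # task finishes partway through this step
            return elapsed + (t_nom - served) / rate
        served += rate * dt
        elapsed += dt
    raise ValueError("prediction horizon too short for this task")

# A 2 second nominal task competing with one predicted process
# (load 1.0) should take about twice its nominal time.
print(expected_running_time(2.0, [1.0] * 10))  # → 4.0
```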



\small
\bibliographystyle{acm}
\bibliography{texbib}
\normalsize

\end{document}