\chapter{Running Time Advisor}
\label{chap:execpred}

This dissertation argues for basing real-time scheduling advisors on
explicit resource-oriented prediction, specifically on the prediction
of resource signals.  In the previous chapters, we presented a
methodology and tools for understanding such signals and developing
prediction systems for them.  We then applied the approach to host
load to develop an appropriate host load prediction system.  In
this chapter, we concentrate on how to use these resource-level
predictions to provide application-level predictions, predictions of
task running time.

When deciding on which host to run a task, applications and adaptation
advisors such as the real-time scheduling advisor need to know the
likely running time of the task on each of the available hosts.  If
they can modify the CPU demand of a task, perhaps by adjusting a
quality parameter such as resolution, they may also be interested in
the likely available processor time on each of the hosts.  This
chapter defines an interface for this purpose, describes an algorithm
that uses host load predictions to implement the interface, and
evaluates the performance of an implementation of that algorithm using
a large variety of experimental environments.  We refer to that
implementation as the running time advisor.

To use the system, the application supplies the prospective host, the
CPU demand of the task (expressed as the nominal running time on an
unloaded host), and a confidence level.  The system then collects the
latest host load predictions and their estimated mean squared errors
and correlations from the prospective host's load prediction system.  
It then uses the algorithm to compute an estimate of the running time
of the task on that host.

The estimate the system reports contains two values.  The first is the
expected running time of the task on the host.  The second is a
confidence interval (as per the user's supplied confidence level) for
the running time.  For the purposes of the real-time scheduling
advisor, we are most concerned about the confidence intervals that the
system computes.

We evaluated the system using a real environment where the background
load on a host is supplied by playing back a load trace.  Using this
technique, we can (re)construct a realistic repeatable workload on a
real machine without limiting ourselves to synthetic workloads that
may not capture the higher order behavior that host load prediction
depends on, and which real load signals exhibit.

The host load predictions used to evaluate the system are provided by
the on-line host load prediction system described in
Chapter~\ref{chap:rps}.  The study described in
Chapter~\ref{chap:loadpred} concluded that the most appropriate
predictor to use for host load prediction was based on an AR(16) model
or better.  In evaluating the algorithm developed in this chapter, we
use predictors based on the MEAN, LAST, and AR(16) models, which
provide a measure of how well the algorithm works given (respectively)
the raw variance of the load signal, the considerably lower mean
squared errors of a simple predictor, and the mean squared error of
the most appropriate predictor.

The evaluation consists of 114,000 randomized testcases run against
each of our August, 1997 traces.  To understand the performance of the
running time advisor, we data-mined these testcases.

We draw a considerable number of conclusions from our examination of
the testcases.  The main result is that our system, using a strong
host load predictor such as the AR(16) predictor, indeed computes
reasonable and useful confidence intervals for the running time of
tasks.  The expected running times that are computed are also quite
accurate.  The benefits of more sophisticated predictors depend on
how heavily loaded the host is and on the nominal time of the task.
The 39 traces in the evaluation fit into five classes as far as the
performance of our system is concerned.  For most of these classes,
and more than 90\% of traces, the AR(16) predictor produces the best
results.

\section{Programming interface}
\label{sec:execpred.api}

When an application or application scheduler considers running a task
on a particular host, it wants to know what the running time of the
task on that host will be given the task's CPU demand.  Alternatively,
it may want to know the available CPU time during a given interval
extending into the future. 

The API we provide allows the application to ask these questions and
get qualified answers.  The running time advisor will
use the host load predictions that the host makes available to compute
the answers.  The answers are qualified in the sense that they are
expressed both as expected values and as confidence intervals.  The
confidence interval captures the uncertainty associated with
prediction and modelling errors.  The API has the following form:

\small
\begin{verbatim}
int PredictRunningTime(Host                          &host,
                       RunningTimePredictionRequest  &req,
                       RunningTimePredictionResponse &resp);

struct RunningTimePredictionRequest {
  double conf;
  double tnom;
};

struct RunningTimePredictionResponse {
  double tnom;
  double conf;
  double texp;
  double tlb;
  double tub;
};

int PredictAvailableTime(Host                            &host,
                         AvailableTimePredictionRequest  &req,
                         AvailableTimePredictionResponse &resp);

struct AvailableTimePredictionRequest {
  double conf;
  double interval;
};

struct AvailableTimePredictionResponse {
  double conf;
  double interval;
  double texp;
  double tlb;
  double tub;
};
\end{verbatim}
\normalsize

The first function, \verb.PredictRunningTime., is used to determine the
running time of a task of nominal time \verb.tnom. ($t_{nom}$) on the
host.  The request also includes a confidence level, \verb.conf., to
describe how accurate the prediction needs to be.  These components of
the query are fields of the\\
\verb.RunningTimePredictionRequest. structure.  The response returns
in the\\
 \verb.RunningTimePredictionResponse. structure.  In addition to
a copy of the request's fields, it includes the expected running time
of the task, \verb.texp. ($t_{exp}$), as well as the upper and lower
bounds of the \verb.conf. confidence interval for the running time,
[\verb.tlb., \verb.tub.] ($[t_{lb},t_{ub}]$).  $t_{exp}$ is a point
estimate which represents the most likely running time.  The actual
running time, $t_{act}$, will likely be different from $t_{exp}$ but
be near it.  The confidence interval represents a range of values
around $t_{exp}$ such that $t_{act}$ will be in the range a fraction
\verb.conf. of the time.

The second function, \verb.PredictAvailableTime., is used to ask how
much CPU time will be available on the host during the next
\verb.interval. seconds.  The request and response structures are
similar to those already described.  The differences are that the
query contains the interval instead of the nominal task time, and that
the response's estimates are of the available time.
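To make the calling convention concrete, the following sketch declares
the request and response structures and a stub advisor.  The stub is
purely hypothetical (it assumes a fixed average load of 0.5 rather than
querying a load prediction system) and stands in for the real running
time advisor.

```cpp
#include <cassert>
#include <string>

struct Host { std::string name; };   // placeholder for the real Host type

struct RunningTimePredictionRequest {
  double conf;   // requested confidence level, e.g. 0.95
  double tnom;   // nominal running time on an unloaded host (seconds)
};

struct RunningTimePredictionResponse {
  double tnom;   // copy of the request's nominal time
  double conf;   // copy of the request's confidence level
  double texp;   // expected running time
  double tlb;    // lower bound of the confidence interval
  double tub;    // upper bound of the confidence interval
};

// Hypothetical stub: assumes a constant load of 0.5, so the expected
// running time is 1.5 * tnom, with an arbitrary +/-20% interval.
int PredictRunningTime(Host &host,
                       RunningTimePredictionRequest &req,
                       RunningTimePredictionResponse &resp) {
  (void)host;                      // a real advisor would query this host
  resp.tnom = req.tnom;
  resp.conf = req.conf;
  resp.texp = 1.5 * req.tnom;
  resp.tlb  = 0.8 * resp.texp;
  resp.tub  = 1.2 * resp.texp;
  return 0;                        // success
}
```

A client fills in \verb.conf. and \verb.tnom., calls the function for
each candidate host, and compares the returned intervals.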

\section{Predicting running times}
\label{sec:execpred.confint}

In this section, we describe the algorithm we developed for
transforming from host load predictions and a task's nominal time to a
prediction of the task's running time.  We begin by describing the
core algorithm, first illustrating how it works in continuous time,
then discretizing it, and finally incorporating host load predictions.
To compute confidence intervals using the algorithm, we must determine
the variance of a sum of host load predictions, which turns out to be
somewhat subtle because the prediction errors are themselves
correlated.  Next, we discover that the algorithm's predictions are
erroneous because it does not model an important aspect of typical Unix
schedulers.  We introduce load discounting to repair the algorithm. 
Finally, we summarize the steps of the algorithm.

\subsection{Core algorithm}

In Chapter~\ref{chap:statprop}, we showed experimental results that
related the running time of a task, $t_{exec}$, to the average load it
experiences while it runs using the following continuous time model:
\begin{displaymath}
\frac{t_{exec}}{1+\frac{1}{t_{exec}}\int_{0}^{t_{exec}} z(t) dt}
=
t_{nom} .
\end{displaymath}
Here $z(t)$ is the load signal, shifted such that $z(0)$ is the value
of the signal at the current time, $t_{now}$.  We introduce this shift
to simplify the presentation of our algorithm, and to conform to the
Box-Jenkins notation for time series prediction that we used in the
previous chapter.  This simplification does not affect the results.
$t_{nom}$ is the nominal running time of the task, which quantifies
the CPU demand of the task as its running time on an unloaded machine.

This continuous time equation is basically a fluid model of a
priority-less host scheduler.  We will use this simple model to
describe our estimation procedure.  However, real schedulers
incorporate priorities that can change over time.  We assume that the
majority of the workload runs at similar priorities.  In particular,
we assume that there are no processes whose priorities have been
drastically increased or decreased, such as with the Unix ``nice''
facility.  Ultimately, we will relax this assumption slightly and
model the temporary priority boosts that most Unix implementations
give processes immediately after they become unblocked.  Given this
extension, the procedure we outline in this section works quite well.

\subsubsection{Continuous time}

The above equation is somewhat unwieldy to discretize and use, so,
before we continue, let's define the {\em available time
function}
\begin{equation}
at(t) = \frac{t} {1 + al(t) } \ , \  t>0
\end{equation}
which depends on the {\em average load function} 
\begin{equation}
al(t) = \frac{1}{t}\int_{0}^{t} z(\tau) d\tau \ , \   t>0 \ .
\end{equation}
$at(t)$ represents the available CPU time over the next $t$ seconds,
which is inversely related to the average load during that interval,
$al(t)$.  As the average load increases, the available time decreases.
$t_{exec}$ is then the minimum $t$ for which $at(t)=t_{nom}.$ Using
this available time function makes it easier to explain how our
algorithm estimates the running time of a task, and, of course, the
available time function is offered directly through the API described
in Section~\ref{sec:execpred.api}.
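As a quick sanity check of these definitions, consider a constant
load, $z(t) = \bar{z}$.  Then $al(t) = \bar{z}$ for all $t > 0$, so
\begin{displaymath}
at(t) = \frac{t}{1+\bar{z}} \ ,
\end{displaymath}
and solving $at(t) = t_{nom}$ gives $t_{exec} = (1+\bar{z})\,t_{nom}$.
For example, a task with $t_{nom} = 10$ seconds facing a constant
background load of $0.5$ would be predicted to run in $15$ seconds.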

\subsubsection{Discrete time}

In our system, $z$ is not a continuous-time signal, but rather it is a
discrete-time $1/\Delta$ Hz sampling of that signal, $\langle z_{t+i}
\rangle$, $z_{t+i} = z(i\Delta)$ for $i=1,2,\ldots,\infty$.  Note the
intentional similarity of this notation to that of the
$1,2,\ldots,\infty$-step-ahead predictions introduced in
Chapter~\ref{chap:statprop}.  We will indeed later replace these
values with predictions.  $z_{t+1}$ represents $z(t)$ for
$0<t\leq\Delta$, and so on.  We approximate $z(t)$ as $z(t) =
z_{t+\lceil t/\Delta \rceil}$.  This lets us write a discrete time
approximation of $at(t)$ and $al(t)$:
\begin{equation}
at_i = 
\left \{
\begin{array}{ll}
0 & i=0 \\
\frac{i\Delta}{1+al_i} &  i>0
\end{array}
\right .
\end{equation}
\begin{equation}
al_i = \frac{1}{i} \sum_{j=1}^{i} z_{t+j} \  ,  \ i>0
\end{equation}
$at_i$ is the time available during the next $i\Delta$ seconds and
$al_i$ is the average load that will be encountered over the next
$i\Delta$ seconds.  We then estimate the available time $at(t)$ by
linear interpolation:
\begin{equation}
at(t) = 
at_{\lfloor t/\Delta \rfloor} +
\frac{t-\Delta\lfloor t/\Delta \rfloor}{\Delta} (at_{\lceil t /\Delta
\rceil} - at_{\lfloor t/\Delta \rfloor}) 
\label{eqn:execpred.atinterp}
\end{equation}

\subsubsection{Host load predictions}
\label{sec:execpred.hlp}

Given these definitions, we substitute the predicted load signal
$\widehat{z}_{t+i}$ for each $z_{t+i}$, resulting in the predicted average
load $\widehat{al}_i$, and then continue substituting back to give the
predicted available time $\widehat{at}_i$ and its corresponding
continuous time approximation:
\begin{equation}
\widehat{at}_i = 
\left \{
\begin{array}{ll}
0 & i=0 \\
\frac{i\Delta}{1+\widehat{al}_i} & i>0
\end{array}
\right .
\end{equation}
\begin{equation}
\widehat{al}_i = \frac{1}{i} \sum_{j=1}^{i} \widehat{z}_{t+j} \ , \  i>0
\label{eqn:execpred.alhat}
\end{equation}
\begin{equation}
\widehat{at}(t) = 
\widehat{at}_{\lfloor t/\Delta \rfloor} +
\frac{t-\Delta\lfloor t/\Delta \rfloor}{\Delta} (\widehat{at}_{\lceil t /\Delta
\rceil} - \widehat{at}_{\lfloor t/\Delta \rfloor}) 
\label{eqn:execpred.athat}
\end{equation}
Then, the expected running time of the task, $t_{exp}$,  is simply the
smallest $t$ for which $\widehat{at}(t)=t_{nom}$.  
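The transformation from a vector of load predictions to $t_{exp}$ can
be sketched as follows; the function names are ours, and this
simplified version scans linearly for the first $\widehat{at}_i$ that
reaches $t_{nom}$ before interpolating.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Predicted available time at_i = i*Delta / (1 + al_i), where al_i is
// the running mean of the predicted load samples zhat_{t+1..t+i}.
// at[0] = 0 by definition.
std::vector<double> PredictedAvailableTime(const std::vector<double> &zhat,
                                           double Delta) {
  std::vector<double> at(zhat.size() + 1, 0.0);
  double sum = 0.0;
  for (size_t i = 1; i <= zhat.size(); ++i) {
    sum += zhat[i - 1];
    double al = sum / i;                  // predicted average load al_i
    at[i] = (i * Delta) / (1.0 + al);     // predicted available time at_i
  }
  return at;
}

// Smallest t for which at(t) = tnom: find the first at_i >= tnom and
// linearly interpolate between at_{i-1} and at_i.
double ExpectedRunningTime(const std::vector<double> &at,
                           double Delta, double tnom) {
  for (size_t i = 1; i < at.size(); ++i) {
    if (at[i] >= tnom) {
      double frac = (tnom - at[i - 1]) / (at[i] - at[i - 1]);
      return (i - 1 + frac) * Delta;
    }
  }
  return -1.0;   // prediction horizon too short; caller must extend it
}
```

For a constant predicted load of 1.0 and $\Delta = 1$, the available
time is $i/2$, so a task with $t_{nom} = 2.5$ seconds is predicted to
run in 5 seconds, matching the fluid model.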

\subsubsection{Confidence intervals}
\label{sec:execpred.cis}

Because host load predictions are not perfect, we also report the
running time or available time as a confidence interval, computed to a
user-specified confidence level.  The better the predictions are, the
narrower the confidence interval is.  

The predicted load signal is $\widehat{z}_{t+i} = z_{t+i} + a_{t+i}$,
where $z_{t+i}$ is the real value of the signal and $a_{t+i}$ is the
$i$-step-ahead prediction error term which we summarize with a
variance $\sigma^2_{a,i}$.  Our uncertainty in estimating the
available time $at_i$ is due to our uncertainty in estimating the
average load $al_i$, which is due in turn to these error terms and
their variance.  To represent this uncertainty in the form of a
confidence interval, we must push the underlying error variances
through the equations above to arrive at a variance for the average
load $al_i$.

Notice that the average load (Equation~\ref{eqn:execpred.alhat}) sums
the estimates $\widehat{z}_{t+i}$.  Rewriting the equation, we can see
that 
\begin{equation}
\widehat{al}_i = al_i + \frac{1}{i} \sum_{j=1}^{i} a_{t+j}
\end{equation}
By the central limit theorem, then, $\widehat{al}_i$ will become
increasingly normally distributed with increasing $i$.  Given that the
errors $a_{t+i}$ are of zero mean, $\widehat{al}_i$ has a mean
(expected) value of $al_i$ and a variance that depends on the sum of
the prediction errors $a_{t+i}$:
\begin{equation}
\widehat{al}_i \sim  N\left(al_i, Var\left\{ \frac{1}{i} \sum_{j=1}^{i}
a_{t+j} \right\} \right)
\end{equation}
It is important to note that for short jobs or large $\Delta$, this
normality assumption may be invalid.  We will evaluate the system
later and determine whether the results of the assumption are
reasonable. 

Suppose the user requests a confidence interval at 95\% confidence.
We can then compute a confidence interval for $al_i$ (for $i>0$):
\begin{equation}
[al_i^{low},al_i^{high}] = 
Max\left(0,\widehat{al}_i \mp \frac{1.96}{i}\sqrt{Var\left\{\sum_{j=1}^{i}
a_{t+j} \right\}}\right)
\label{eqn:execpred.alci}
\end{equation}
What this means is that we predict that $al_i$ will be in the range
$[al_i^{low}, al_i^{high}]$ with 95\% probability.  The $1.96$ is the
number of standard deviations on either side of the mean of a standard
normal needed to capture 95\% of its values.  The $Max$ is important
because the average load cannot drop below zero, although the
prediction errors can make it appear to do so.

We can now back-substitute these upper and lower bounds of the
confidence interval into $at(t)$
(Equation~\ref{eqn:execpred.atinterp}), resulting in upper and lower
confidence intervals for $at(t) = [at^{low}(t), at^{high}(t)]$.  Then
the confidence interval on the running time is $[t_{lb},t_{ub}]$,
where $t_{lb}$ is the minimum $t$ for which $at^{high}(t)=t_{nom}$ and
$t_{ub}$ is the minimum $t$ for which $at^{low}(t)=t_{nom}$.
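Given the predicted average loads and the standard deviations of the
summed prediction errors (supplied by the load prediction system), the
interval computation can be sketched as below.  The names are ours; the
load bounds use a half-width of $1.96\,\sigma/i$ for the mean of $i$
error terms, clamp the lower load bound at zero, and note that the
{\em lower} load bound yields the {\em upper} available-time bound.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Invert a monotone available-time sequence: smallest t with at(t) = tnom,
// interpolating linearly between successive at_i (at_0 = 0, step Delta).
static double InvertAvailableTime(const std::vector<double> &at,
                                  double Delta, double tnom) {
  for (size_t i = 1; i < at.size(); ++i)
    if (at[i] >= tnom)
      return (i - 1 + (tnom - at[i - 1]) / (at[i] - at[i - 1])) * Delta;
  return -1.0;   // horizon too short
}

// Confidence interval [tlb, tub] on running time.  alhat[i-1] is the
// predicted average load over the next i steps; sigma_sum[i-1] is the
// standard deviation of the sum of the first i prediction errors.
// zcrit is the normal critical value, e.g. 1.96 for 95% confidence.
void RunningTimeInterval(const std::vector<double> &alhat,
                         const std::vector<double> &sigma_sum,
                         double Delta, double tnom, double zcrit,
                         double *tlb, double *tub) {
  size_t n = alhat.size();
  std::vector<double> at_low(n + 1, 0.0), at_high(n + 1, 0.0);
  for (size_t i = 1; i <= n; ++i) {
    double half = zcrit * sigma_sum[i - 1] / i;   // CI half-width for al_i
    double al_low  = std::max(0.0, alhat[i - 1] - half);   // clamp at zero
    double al_high = alhat[i - 1] + half;
    at_high[i] = (i * Delta) / (1.0 + al_low);    // low load -> much time
    at_low[i]  = (i * Delta) / (1.0 + al_high);   // high load -> little time
  }
  *tlb = InvertAvailableTime(at_high, Delta, tnom);
  *tub = InvertAvailableTime(at_low, Delta, tnom);
}
```

With zero prediction-error variance the interval collapses to the
expected running time; positive variance widens it on both sides.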

\subsection{Correlated prediction errors}
\label{sec:execpred.corrpred}

Given the discussion of the previous section, we must still determine
the variance of a sum of consecutive predictions in
Equation~\ref{eqn:execpred.alci} to compute the confidence
interval.  This is one of the subtler issues in converting from load
predictions to running time predictions.

In essence, we are summing the one-, two-, \ldots, $i$-step-ahead
prediction errors, which are the random variables $a_{t+1}, a_{t+2}, \ldots,
a_{t+i}$.  Each of these random variables has a corresponding
variance: $\sigma^2_{a,1}, \sigma^2_{a,2}, \ldots, \sigma^2_{a,i}$.
These variances are simply the mean squared error estimates produced
by the predictor.  We want to know the variance of their sum.  By the
central limit theorem, the distribution of such a sum converges to
normality.  It is tempting, but wrong, to compute the variance of the
sum as
\begin{equation}
Var\left\{\sum_{j=1}^{i} a_{t+j} \right \} = 
\sum_{j=1}^{i} Var\{ a_{t+j} \} =
\sum_{j=1}^{i} \sigma^2_{a,j}
\label{eqn:execpred.var_wrong}
\end{equation}
This would be valid if the prediction errors were independent, but it
turns out that they are not.  Almost by their very nature, linear
models produce prediction errors which are correlated over time.  The
two-step-ahead prediction is not independent of the one-step-ahead
prediction.  Computing the variance of the sum in the above manner
understates the variance, producing confidence intervals that are too
small.

To see that the prediction errors are correlated, consider an AR(1)
model, $z_{t+1} = \phi_0z_t + a_{t+1}$.  At time $t$, the subsequent
values will be:
\begin{displaymath}
\begin{array}{rclcl}
z_{t+1} & = & \phi_0z_t + a_{t+1} & \\
z_{t+2} & = & \phi_0z_{t+1} + a_{t+2} & = &\phi_0^2z_t + \phi_0a_{t+1}
+ a_{t+2}
\\
z_{t+3} & = & \phi_0z_{t+2} + a_{t+3} & = & \phi_0^3z_t +
\phi^2_0a_{t+1} + \phi_0a_{t+2} + a_{t+3} \\
z_{t+n} & = & \phi_0z_{t+n-1} + a_{t+n} & = & \phi_0^nz_t +
\sum_{j=0}^{n-1} \phi_0^ja_{t+n-j}
\end{array}
\end{displaymath}
The predictions at time $t$ must assume that all the white noise
terms $a_{t+k}$ are their expected values, which are zero, and thus
\begin{displaymath}
\begin{array}{rclcl}
\widehat{z}_{t+1} & = & \phi_0z_t & & \\
\widehat{z}_{t+2} & = & \phi_0\widehat{z}_{t+1} & = & \phi_0^2z_t \\ 
\widehat{z}_{t+3} & = & \phi_0\widehat{z}_{t+2} & = & \phi_0^3z_t \\
\widehat{z}_{t+n} & = & \phi_0\widehat{z}_{t+n-1} & = & \phi_0^nz_t \\
\end{array}
\end{displaymath}
Now then, the one-step-ahead prediction error is $a_{t+1}$.  The
two-step-ahead prediction error is $\phi_0a_{t+1} + a_{t+2}$, the
value of which is clearly dependent on the one-step-ahead prediction
error.  The three-step-ahead prediction error is $\phi_0^2a_{t+1} +
\phi_0a_{t+2} + a_{t+3}$, which depends on the one- and two-step-ahead
prediction errors, and so on.

To correctly compute the variance of the sum of load predictions, we
must compute the covariance of each of the prediction errors with each
of the other prediction errors and then sum all $i^2$ terms of this
covariance matrix.  Entry $j,k$ of this matrix is
$CoVar\{a_{t+j},a_{t+k}\} =
\sigma_{a,j}\sigma_{a,k}$ and is the covariance of the $j$-step-ahead
prediction with the $k$-step-ahead prediction.  Notice that variances
of the individual predictions are simply the diagonal elements of the
matrix.

The prediction errors' correlation over lead time is akin to a
signal's autocorrelation over time.  Recall that an autocorrelation
sequence is simply a normalized autocovariance sequence.  The
covariances are easily computed from the autocovariance sequence.\\
In particular, $CoVar\{a_{t+j},a_{t+k}\} = AutoCoVar\{a_{t},a_{| j-k
|}\}$.  

The host load prediction system uses the algorithm of Box et al.\ to
compute the autocovariance sequence for any linear
model~\cite[pp. 159--160]{BOX-JENKINS-TS}.  Since the LAST predictor
is simply an AR(1) model with $\phi_0=1$, its autocovariances can also
be computed using the method of Box et al.  In the case of the MEAN
predictor, the autocovariances are simply the autocovariances of the
signal itself.

The client can request that the host load prediction system summarize
the prediction errors either by the individual variances (the diagonal
of the covariance matrix), the covariances of all the predictions (the
entire covariance matrix), or the variance of the sum of the first
$1,2,\ldots,i$ predictions (the sum of the first $i$ rows and columns
of the covariance matrix).  The variance of the sum of the first $i$
predictions, which we will write as $\sigma^2_{s,i}$, is then computed as
\begin{equation}
\begin{array}{rcl}
\sigma^2_{s,i} & = & Var\left\{\sum_{j=1}^{i} a_{t +j} \right\} \\
               & = & SumVar\{\langle
a_{t+1},a_{t+2},\ldots,a_{t+i}\rangle\} \\
               & = & \sum_{j=1}^i\sum_{k=1}^{i} CoVar\{a_{t+j}, a_{t+k}\} \\
               & = & \sum_{j=1}^i\sum_{k=1}^{i}\sigma_{a,j}\sigma_{a,k}\\
\end{array}
\label{eqn:execpred.covar}
\end{equation}
The sum is computed by the host load prediction system to avoid
communicating the whole covariance matrix.
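Computed directly, Equation~\ref{eqn:execpred.covar} is a double sum
over the covariance matrix, each entry of which is available from the
error autocovariance sequence via $CoVar\{a_{t+j},a_{t+k}\} =
AutoCoVar\{a_{t},a_{|j-k|}\}$ as described above.  A sketch, using a
hypothetical autocovariance sequence:

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>
#include <vector>

// Variance of the sum of the first i prediction errors, summing all
// i*i entries of the covariance matrix.  acov[d] holds the error
// autocovariance at lag d (acov[0] is the error variance), so entry
// (j,k) of the matrix is acov[|j-k|].
double SumVariance(const std::vector<double> &acov, int i) {
  double var = 0.0;
  for (int j = 1; j <= i; ++j)
    for (int k = 1; k <= i; ++k)
      var += acov[std::abs(j - k)];   // CoVar{a_{t+j}, a_{t+k}}
  return var;
}
```

For the hypothetical sequence $\{1.0, 0.5, 0.25\}$ and $i=3$, the sum
over all nine entries is $5.5$, whereas summing only the diagonal (the
independent-errors assumption of
Equation~\ref{eqn:execpred.var_wrong}) gives just $3.0$.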

Typically, the non-diagonal elements of the covariance matrix are
larger than zero.  This results in Equation~\ref{eqn:execpred.covar}
being larger than Equation~\ref{eqn:execpred.var_wrong}.  This increased
variance of the sum widens the confidence interval for $al_i$ and,
correspondingly, for the available time $at(t)$ and for the running
time $t_{exec}$.


\subsection{Load discounting}

After we implemented the algorithm for computing a task's expected
running time ($t_{exp}$) and its confidence interval
($[t_{lb},t_{ub}]$) as described thus far in this section, we found that
our results were somewhat dependent on the nominal running time of the
task, $t_{nom}$.  We will describe our evaluation methodology in
detail in Section~\ref{sec:execpred.setup}, but, for now,
consider running, at a random time, a task with a randomly chosen
nominal time $t_{nom}$ on a host with an interesting background load.
Before the task is run, we compute its expected running time, 
$t_{exp}$.  Then we run the task and record its actual running time,
$t_{act}$.

\begin{figure}
\centerline{
\epsfxsize=3in
\epsfbox{eps/tnomest_pyramid_before.epsf}
}
\caption[Relative prediction error without load discounting]{Relative errors of predictions as a function of nominal time
without load discounting}
\label{fig:execpred.relerr_b4_discount}
\end{figure}

Figure~\ref{fig:execpred.relerr_b4_discount} shows the result of
running many such tasks.  The figure plots the relative error
($(t_{exp}-t_{act})/t_{exp}$) of the predictions versus the nominal
time of the task.  The figure plots 3000 tasks with nominal times
chosen from a uniform distribution between 0.1 and 10 seconds.  The
interval between task arrivals was chosen from a uniform distribution
between 5 and 15 seconds.  The background load was the August, 1997 axp0
trace and the predictor used was an AR(16).

Notice that relative error is always positive and increases markedly
as nominal time decreases.  For one-second tasks, the running time is
over-predicted by as much as 80\%.  The confidence intervals are
correspondingly skewed, with far too many points falling below the
lower bounds of the intervals.

The problem is due to the Digital Unix scheduler, which is
priority-based and which gives an ``I/O boost'' to processes
(increases their priority) when a blocking I/O operation completes.
For example, a process (such as a CORBA ORB, an active frame server,
or the spin server we use in our evaluation
(Section~\ref{sec:execpred.spin})) that is blocked in a read on a
network socket will have its priority boosted when the read completes.
Over time, the process's priority will ``age'' to its baseline level.
The result of this is that the awakened process will get more than its
fair share of the CPU until its priority has degraded.  The shorter
the task, the more this mechanism will benefit it, and the more
inaccurate our running time estimate will be, just as shown in
Figure~\ref{fig:execpred.relerr_b4_discount}.

Although we had originally intended to avoid modeling priority
scheduling, leaving the modeling of a full priority-based scheduler
for later, it was clear that we had to capture this priority boost
effect.

Our solution is load discounting.  Conceptually, when a task begins to
run, its priority boost means that the background load on the system
will affect it only slightly.  As it continues to run, its priority
drops and the background load becomes more and more important.  We
model this by exponentially decaying the load predictions
$\widehat{z}_{t+j}$, the discounted load $\widehat{zd}_{t+j}$ being
\begin{equation}
\widehat{zd}_{t+ j} = (1-e^{-j\Delta/\tau_{discount}})
\widehat{z}_{t+j}
\label{eqn:execpred.discount}
\end{equation} 
How quickly the initial load discounting decays depends on the setting
of $\tau_{discount}$.
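The discounting step itself is a one-line transformation of the
prediction vector; a sketch (the function name is ours):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Discounted load predictions per the exponential-decay model:
// zd_{t+j} = (1 - exp(-j*Delta/tau)) * zhat_{t+j}.  Early predictions
// are discounted heavily (modeling the priority boost); the factor
// approaches 1 as the boost ages away.
std::vector<double> DiscountLoad(const std::vector<double> &zhat,
                                 double Delta, double tau) {
  std::vector<double> zd(zhat.size());
  for (size_t j = 1; j <= zhat.size(); ++j)
    zd[j - 1] = (1.0 - std::exp(-(j * Delta) / tau)) * zhat[j - 1];
  return zd;
}
```

With $\Delta = 1$ and $\tau_{discount} = 4.5$, the first prediction is
discounted to about 20\% of its value, while predictions twenty seconds
out are nearly undiscounted.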

\begin{figure}
\centerline{
\epsfxsize=3in
\epsfbox{eps/tauest_pyramid_before.epsf}
}
\caption[Estimation of $\tau_{discount}$]{Relative errors of predictions as a function of $\tau_{discount}$, and
derivation of $\tau_{discount}$ by linear fit.}
\label{fig:execpred.relerr_vs_tau}
\end{figure}

We determined the value of $\tau_{discount}$ empirically by again
running a large number of randomized testcases as described above.
For this set of testcases, we used load discounting and chose
$\tau_{discount}$ randomly from the range 0 to 10 seconds.
Figure~\ref{fig:execpred.relerr_vs_tau} plots the relative error of
these testcases as a function of $\tau_{discount}$.  Although there is
plenty of dispersion (recall that these are point estimates, $t_{exp}$,
not confidence intervals), a linear trend clearly exists.  We fitted a
line to the points and determined that it crossed zero relative error
at $\tau_{discount}=4.5$ seconds.

\begin{figure}
\centerline{
\epsfxsize=3in
\epsfbox{eps/tnomest_pyramid_after.epsf}
}
\caption[Relative prediction errors with load discounting]{Relative errors of predictions as a function of nominal time
with load discounting and appropriate $\tau_{discount}$ value.}
\label{fig:execpred.relerr_after_discount}
\end{figure}

Next, we ran a further large number of testcases with load discounting
and $\tau_{discount}=4.5$. The results are plotted in
Figure~\ref{fig:execpred.relerr_after_discount}.  The figure plots the
relative error as a function of the nominal time and is suitable for
direct comparison with Figure~\ref{fig:execpred.relerr_b4_discount}.
As can be seen, the appropriate $\tau_{discount}$ value has
eliminated the dependence of the relative error on the nominal time
and has further reduced the average relative error to almost exactly
zero, which we would also expect from these point estimates.

Load discounting is an effective solution to the priority boosts that
most Unix schedulers give to processes that have become unblocked.  It
is important to note, however, that other priority problems remain.
For example, a background process which has had its priority
significantly reduced (e.g., a process that has been ``reniced'')
but which remains compute bound will result in artificially
exaggerated predictions.  Similarly, a background process with high
priority will result in predictions that are too low.

\subsection{Implementation}

The following summarizes how the running time advisor implements the
\verb.PredictRunningTime. call.
\begin{enumerate}
\item Set a prediction horizon $n=3\lceil t_{nom}/\Delta \rceil$.
\item Collect the latest predictions,\\
$\widehat{z}_{t+1},\widehat{z}_{t+2},\ldots,\widehat{z}_{t+n}$, and
the cumulative variances of their sums,
$\sigma^2_{s,1},\sigma^2_{s,2},\ldots,\sigma^2_{s,n}$, from the host
load prediction system running on the host.  The variances are
computed as per Equation~\ref{eqn:execpred.covar}.
\item Generate the discounted load predictions,
$\widehat{zd}_{t+1},\widehat{zd}_{t+2},\ldots,\widehat{zd}_{t+n}$, as
per Equation~\ref{eqn:execpred.discount}.
\item Using the discounted load predictions, compute the discrete time
expected available time,\\
 $\widehat{at}_1, \widehat{at}_2,
\ldots,\widehat{at}_n$ using the technique of
Section~\ref{sec:execpred.hlp}, and the discrete time confidence
intervals on the available time,
$[at_1^{low},at_1^{high}],[at_2^{low},at_2^{high}], \ldots,
[at_n^{low},at_n^{high}]$ using the technique of
Section~\ref{sec:execpred.cis}.     
\item If the lower bound on the maximum available time, $at_n^{low}$,
is less than the nominal time, $t_{nom}$, increase the prediction
horizon $n$ and go to step 2.
\item To find the expected running time, $t_{exp}$, do a binary search
to find the smallest $i$ for which $at_i>t_{nom}$, interpolate $at(t)$
around $i\Delta$ using Equation~\ref{eqn:execpred.athat}, and find $t$
such that $at(t)=t_{nom}$.  This is $t_{exp}$.
 \item Repeat the previous step with the $at_i^{low}$ and
 $at_i^{high}$ sequences to find the upper and lower bounds of the
 confidence interval, $[t_{lb},t_{ub}]$.  
\end{enumerate}
The \verb.PredictAvailableTime. call is implemented similarly.

\section{Experimental infrastructure}
\label{sec:execpred.setup}

The most appropriate way to evaluate the running time advisor is to
actually run it on a host, submit tasks to that host, and measure how
well the predicted times match the actual times.  This section
describes the infrastructure that we used to do this.  We also used
this infrastructure to evaluate the real-time scheduling advisor.  The
next chapter describes that evaluation.

The running time advisor is a composite whose errors are due to
inaccurate host load predictions as well as modeling errors in
transforming from the host load predictions and task CPU demand to
task running time.  We are interested in how this composite performs
as a whole.  The essential comparison to make when evaluating the
system is the comparison between predicted and actual running times.
To make such a comparison in a simulation would require a method for
realistically estimating the actual running time, but that is a part
of the system we want to evaluate!  The measurement-based approach
avoids this problem because the actual running time is measured
directly.

The goal of our experimental infrastructure is to provide a realistic,
repeatable environment for evaluating the running time advisor and, in
the next chapter, the real-time scheduling advisor.  The
infrastructure hardware consists of two Alphastation 255 hosts
connected with a private network.  Both machines run Digital Unix
4.0D.  One host is referred to as the measurement host while the other
is called the recording host.  The hosts have no other load on them.

The recording host runs software that interrogates the components
running on the measurement host and submits tasks to it.  The
measurement host runs the following components:
\begin{itemize}
\item{A load playback tool operating on some load trace}
\item{A host load sensor}
\item{One or more host load prediction systems}
\item{A spin server}
\end{itemize}
The load playback tool, described in the next section, performs work
that replicates the workload captured in the load trace.  The choice
of load trace is experiment-specific.  The host load sensor provides
an interface for the recording host to request the latest load
measurement on the host.  The host load prediction systems are
described in Section~\ref{sec:rps.host_load_pred_sys} and provide an
interface for the recording host to request the latest load
predictions using some experiment-specific prediction model.  The spin
server runs tasks---it takes requests to compute (using a busy loop)
for some number of CPU-seconds and then returns the wall-clock time
that the task took to complete.
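The core of the spin server can be sketched in a few lines: consume the
requested CPU time in a busy loop, then report the elapsed wall-clock
time (the request/response transport is omitted here).

```cpp
#include <cassert>
#include <chrono>
#include <ctime>

// Spin for the requested number of CPU-seconds in a busy loop, then
// return the wall-clock time the task took to complete (t_act).
double SpinTask(double cpu_seconds) {
  auto wall_start = std::chrono::steady_clock::now();
  std::clock_t cpu_start = std::clock();
  while (static_cast<double>(std::clock() - cpu_start) / CLOCKS_PER_SEC
         < cpu_seconds) {
    // busy loop: consume CPU
  }
  std::chrono::duration<double> wall =
      std::chrono::steady_clock::now() - wall_start;
  return wall.count();
}
```

On a loaded host the returned wall-clock time exceeds the requested CPU
time, and that gap is exactly what the running time advisor tries to
predict.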

The remainder of this section describes the load playback tool and the
spin server in more detail.


\subsection{Load trace playback}

To evaluate the running time advisor, a realistic background workload
is necessary.  The load signal produced by running this workload
should have a realistic correlation structure, since it is
precisely this that a predictor attempts to model.

To meet these requirements, our infrastructure uses {\em load trace
playback}, a new technique in which a workload is generated
according to a load trace.  With no other work on the host, this
background load results in the host's load signal repeating that of
the load trace.  Furthermore, we can repeatedly play back the same
load and explore the prediction system's behavior using the traces
from many different hosts.  Because the workload reflects the trace,
which certainly records the real behavior of a host, it is suitable
for evaluating prediction-based systems.

Playback may seem like overkill, but it is not.  As we discovered in
Chapter~\ref{chap:statprop}, host load signals typically have complex
behaviors.  They typically have long, complex autocorrelation
structures which can change abruptly.  Our prediction systems exploit
the autocorrelation structure to increase predictability and treat
abrupt changes as points at which models need to be refit.  Any
synthetic load generator would have to reproduce these properties, and
that is difficult.  Furthermore, there may be other properties of the
signals that we do not yet understand, but which may be important for
prediction.  For these reasons, playing back actual load behavior is
most appropriate.

\begin{figure}
\centerline{
\begin{tabular}{cc}
\colfigsize
\epsfbox{eps/manch-7.good.compare.3600.out.targettrace.eps}
&
\colfigsize
\epsfbox{eps/manch-7.good.compare.3600.out.errhist.eps}
\\
(a) Target trace  & (c) Error histogram
\\
\colfigsize
\epsfbox{eps/manch-7.good.compare.3600.out.measuredtrace.eps}
&
\colfigsize
\epsfbox{eps/manch-7.good.compare.3600.out.erracf.eps}
\\
(b) Measured trace  &  (d) Error autocorrelation
\\
\end{tabular}
}
\caption[Host load trace playback example]{Example of host load trace playback.}
\label{fig:execpred.playback}
\end{figure}

Figure~\ref{fig:execpred.playback} is an example of load trace
playback.  In this case, the trace collected on the host axp0.psc on
August 23, 1997 is being played back.
Figure~\ref{fig:execpred.playback} plots (a) the target load signal,
(b) the measured load signal, (c) the histogram of playback errors,
and (d) the autocorrelation function of the playback error for
playback of the first hour of the trace.  As can be seen, the measured
load tracks the target load with low and largely uncorrelated error.
There is a slight cumulative delay in the playback process because
work-based playback (explained later) is being used.  Another
extraneous source of playback error is the modicum of other work on
the machine.  The histogram shows us that most of the errors are quite
small.  The histogram also slightly overstates the actual error in
terms of the work performed.  This is due to other extraneous load on
the system, sampling issues, and the effects of the kernel's smoothing
filter, which we discuss later in this section.  More details on host
load playback, including its performance on platforms other than DUX,
can be found elsewhere~\cite{DINDA-LOAD-PLAYBACK-LCR00}.


\subsubsection{The playback algorithm}

To describe the operation of the system, we focus on a periodically
sampled trace, although the load playback tool can also operate with
nonperiodically sampled traces.

The load average is an exponential average of the number of processes
in the system's ready-to-run queue.  Conceptually, the length of the
ready queue is periodically sampled, and these samples $x_i$ flow
through an exponential filter
\begin{equation}
  z_{i+1} = (e^{-\Delta/\tau_{record}})z_i +
(1-e^{-\Delta/\tau_{record}})x_i
\label{eqn:execpred.filter}
\end{equation}
where $\Delta$ is the sampling interval and the application-visible
load average is the $z_i$ series.  A load trace is gathered by
periodically sampling the load average and recording the time-stamped
samples to a file.  The sample rate should be at least twice as high
as the sample rate of the in-kernel process, so we rewrite the above
equation like this:
\begin{equation}
   trace_{i+1} = (e^{-\Delta/2\tau_{record}})trace_{i} + (1-e^{-\Delta/2\tau_{record}})x'_i
\label{eqn:execpred.tracefilter}
\end{equation}
In load trace playback, we want the measured load average to track the
load average samples in the trace file.  To do this, we treat the
$x'_i$ in the above equation as the expected run-queue length during
the last $\Delta/2$ seconds.  To determine the $x'_i$, we ``deconvolve
out'' the smoothing function using the method described by Arndt
et~al.~\cite{WINNER-WS-98}.  All that is necessary to do this is
knowledge of the smoothing constant $\tau_{record}$ for the machine on
which the trace was taken.  For Digital Unix, $\tau_{record}$ is 5
seconds.  On most other Unix systems, $\tau_{record}$ is 60 seconds.
A larger $\tau_{record}$ value means that the load average behaves
more smoothly.
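The deconvolution is a direct inversion of the rewritten filter
equation.  The sketch below (in Python, with hypothetical function
names; the actual tool is not written this way) recovers the $x'_i$
series from a trace and includes the forward filter so the round trip
can be checked:

```python
import math

def deconvolve(trace, delta, tau_record):
    # Invert the smoothing filter trace[i+1] = a*trace[i] + (1-a)*x'[i],
    # where a = exp(-delta / (2*tau_record)), solving for the expected
    # run-queue lengths x'[i].
    a = math.exp(-delta / (2.0 * tau_record))
    return [(trace[i + 1] - a * trace[i]) / (1.0 - a)
            for i in range(len(trace) - 1)]

def smooth(x, z0, delta, tau_record):
    # Forward direction: run raw run-queue samples x through the
    # exponential filter, starting from load average z0.
    a = math.exp(-delta / (2.0 * tau_record))
    z = [z0]
    for xi in x:
        z.append(a * z[-1] + (1.0 - a) * xi)
    return z
```

Smoothing a sample series and then deconvolving the result reproduces
the original samples exactly (up to floating-point error), which is a
useful sanity check on the filter constants.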

A value $x'_i$ represents the amount of contention the CPU saw for the
time interval.  To reproduce this contention, we split the interval
into smaller subintervals, each of which is larger than the scheduler
quantum, and then stochastically assign subprocesses to either work or
sleep during the subintervals.  For example, suppose $x'_i=1.5$.  We
would assign subprocess 0 to work during a subinterval with
probability 1, and subprocess 1 to work during a subinterval with
probability 0.5.
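The split of a contention level across subprocesses can be sketched as
follows (hypothetical helper names; the subinterval length relative to
the scheduler quantum is handled elsewhere in the real tool):

```python
import math
import random

def subprocess_work_probabilities(x):
    # Split contention level x across subprocesses: one subprocess per
    # whole unit works with probability 1; the fractional remainder is
    # handled by one more subprocess working with that probability.
    whole = int(math.floor(x))
    probs = [1.0] * whole
    frac = x - whole
    if frac > 0:
        probs.append(frac)
    return probs

def assign_subintervals(x, nsub, seed=0):
    # For each of nsub subintervals, stochastically decide whether each
    # subprocess works or sleeps during that subinterval.
    rng = random.Random(seed)
    probs = subprocess_work_probabilities(x)
    return [[rng.random() < p for p in probs] for _ in range(nsub)]
```

Note that the sum of the work probabilities equals $x'_i$, so the
expected run-queue length over the interval matches the target.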

After each individual $x'_i$ value has been played, the load average
on the system is measured.  This measurement, $measure_i$, is compared
against the load trace {\em as it would have appeared given the
$\tau_{playback}$ of the playback machine}, $trace'_i$.  $trace'_i$ is
determined by smoothing the $x'_i$ series with an exponential
averaging filter having the playback machine's $\tau_{playback}$.

\subsubsection{Interacting with other load}


Most likely, there will be other processes producing load on the
system---the tasks we will run during the evaluation, for example.  It
is important to understand how the load introduced by the load
playback tool reacts to this other load.  Two modes of operation are
possible, time-based and work-based.  In our evaluation we use
work-based playback.

In the time-based mode of operation, the contention probabilities
implied by the $x'_i$ value last only until the end of $x'_i$'s interval
in real time.  Essentially, this means that the other load on the
machine can only amplitude modulate the played back load.  For example,
suppose there is a uniform 1.0 load on the machine and the load trace
dictates a 1.0 load from 0 to 1 second and zero elsewhere.  Then, the
measured load will (ideally) be 2.0 from 0 to 1 second and 1.0
elsewhere. 

In the work-based mode of operation, the load is interpreted as work
that must always be done, but which may be slowed down by other load.
This means that the other load on the system can also frequency
modulate the played load.  Using the same example as the previous
paragraph, the measured load would be 2.0 from 0 to 2 seconds and 1.0
elsewhere.  

The load playback tool also supports a negative feedback mechanism.
Using this mechanism, it can sometimes more accurately track the load
average in the trace.  However, the feedback mechanism will also try
to correct for other load on the system, trying to make the total
load track the trace load.  Because of this obvious flaw, we do not
use it in our evaluation.

\subsubsection{Imperfections}
\label{sec:execpred.imperfect}

The measured load during playback tracks the desired load only
approximately in some cases.  This is mainly because the OS's
scheduler functions both as the sampling process and as part of the
sampled process.

The sampling process that produces the $x_i$s in
Equation~\ref{eqn:execpred.filter} is the OS's scheduler, which sees
only integral numbers of processes on the ready queue.  We treat our
approximation of $x_i$, $x'_i$, as the expected value of $x_i$,
$E[x_i]$.  In playing back $x'_i$, we contrive to make the expected
value of the scheduler's sample equal this value.  However, the
variance of $x_i$ is nonzero, so the value actually sampled may differ
even if the playback is dead on.  Consider the following contrived
example.  Suppose $x'_i=0.5$: we have a subprocess spend 50\% of its
time sleeping and 50\% working, alternating randomly between these
extremes.  The expected run queue length the scheduler would see is
then $E[x_i] = (0.5)(1) + (0.5)(0) = 0.5$.  However, the scheduler
will really sample either 1 or 0.  The effect on the load average in
either case will be drastically different from that of the expected
value, resulting in error.  Furthermore, because the load is an
exponential average, that error will persist over some amount of time
(it declines to $e^{-1} \approx 37\%$ of its initial value after
$\tau_{record}$ seconds).  Another way of
thinking about this is to consider the second moment, $E[x_i^2]$.  For
this example, $E[x_i^2] = (0.5)(1)^2 + (0.5)(0)^2 = 0.5$, so the
standard deviation of the distribution of $E[x_i]$ is
$\sqrt{E[x_i^2]-E[x_i]^2} = \sqrt{0.5-0.25} = 0.5$ which explains the
variability in what the scheduler will actually sample.
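The moment arithmetic in this example can be checked mechanically; the
sketch below simply restates the Bernoulli computation for a
subprocess that works with probability $p$:

```python
def bernoulli_moments(p):
    # The scheduler's sample x_i is 1 with probability p, 0 otherwise.
    e_x = p * 1 + (1 - p) * 0              # E[x_i]
    e_x2 = p * 1 ** 2 + (1 - p) * 0 ** 2   # E[x_i^2]
    std = (e_x2 - e_x ** 2) ** 0.5         # sqrt(E[x^2] - E[x]^2)
    return e_x, e_x2, std
```

For $p=0.5$ this yields the expected value 0.5, second moment 0.5, and
standard deviation 0.5 discussed above.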

Even if the sample rate of the trace is low compared to the sample
rate of the scheduler, so that each $x'_i$ corresponds to more than
one observation by the scheduler, any extant error is still propagated
by the exponential filter.

Another source of error is that an actual process on the ready queue
may not consume its full quantum.  


\subsubsection{Real traces versus synthetic traces}

In almost all cases, when a real load trace that has been sampled at
a sufficiently high rate is played, the $x'_i$s are close to integers.
Intuitively, this makes sense---the scheduler can only observe
integral numbers of processes on the ready queue, and if our estimates
of its observations (the $x'_i$s) are accurate, they should also mostly
be integers.  

It's easy to construct synthetic traces that are not sensible in
terms of short-term behavior.  One such trace is the continuous 0.5
example discussed in the previous section.  Deconvolving the trace
produces $x'_i$s slightly above 0.5.  Load playback reasonably
produces a situation where the expected ready queue length is 0.5 by
having a process spend half of its time working and half sleeping.
However, the observations the kernel makes (the $x_i$s) will be
either 0 or 1, so the measured load trace will vary widely.  The
average load will match the 0.5 we desire, but the short-term
behavior will not.  It's important to note that the load playback tool
is doing what the user probably wants (keeping the CPU 50\% busy), but
that the load average fails to be a good measure of how well the
generated load conforms.

Another way a synthetic trace can fail is to have very abrupt changes.
For example, a square wave will be reproduced with lots of overshoot
and error.  In order for the abrupt swings of a square wave to have
been the output of the exponential smoothing, the input $x_i$ must have
been much more abrupt, and the estimate $x'_i$ will also be quite
large.  This means the load playback tool must have many processes
contend for the CPU at once, which raises the variability considerably.

The best way to produce synthetic traces is to create a string of
integer-valued $x_i$s and smooth them with the appropriate
$\tau_{record}$.  Another possibility is to present these $x_i$s
directly to load playback with a very small $\tau_{record}$ value.
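The first approach might be sketched as follows (hypothetical function
name; the range of queue lengths and the random source are purely
illustrative):

```python
import math
import random

def synthesize_trace(n, delta, tau_record, max_procs=3, seed=0):
    # Draw integer run-queue lengths (the only values a scheduler could
    # actually observe) and smooth them with the recording machine's
    # exponential filter to produce a sensible synthetic load trace.
    rng = random.Random(seed)
    x = [rng.randint(0, max_procs) for _ in range(n)]
    a = math.exp(-delta / tau_record)
    z, trace = 0.0, []
    for xi in x:
        z = a * z + (1.0 - a) * xi
        trace.append(z)
    return trace
```

Because each output is a convex combination of integer queue lengths,
the resulting trace stays within the range of the inputs and has the
smoothness the playback tool expects.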

\subsection{Spin server}
\label{sec:execpred.spin}

The spin server runs submitted tasks and measures how long they take
to complete.  A task consists of the number of CPU seconds that should
be consumed.  The spin server uses a busy loop that periodically
checks the accounted system and user time (using the
\verb.getrusage. system call) that the process has consumed.  The loop
exits when the requested number of CPU seconds has been consumed and
reports the wall clock time used.  At startup, the spin server
calibrates itself with respect to the host's running rate and the
overhead of the \verb.getrusage. call.  This permits the loop to be
extremely precise about the amount of computation it does using a
minimal number of \verb.getrusage. calls.  The relative absolute error
is much less than 1\%.
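A minimal sketch of the core loop, omitting the calibration step, is
shown below (using the getrusage wrapper in Python's resource module
on a Unix system; the actual server is not implemented this way):

```python
import resource
import time

def spin(cpu_seconds):
    # Busy-loop until this process has accumulated cpu_seconds of
    # user+system CPU time, then return the wall-clock time taken.
    start_wall = time.time()
    r = resource.getrusage(resource.RUSAGE_SELF)
    target = r.ru_utime + r.ru_stime + cpu_seconds
    while True:
        for _ in range(20000):
            pass  # burn cycles between accounting checks
        r = resource.getrusage(resource.RUSAGE_SELF)
        if r.ru_utime + r.ru_stime >= target:
            return time.time() - start_wall
```

The difference between the returned wall-clock time and the requested
CPU time reflects contention from other load on the host, which is
exactly what the running time advisor tries to predict.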


\section{Evaluation}

To evaluate the running time advisor, we ran 114,000 randomized
testcases using the infrastructure and studied the results.  The
testcases were randomized with respect to their starting time, their
nominal running time (0.1 to 10 seconds), the underlying host load
predictor that was used (MEAN, LAST, and AR(16)), and the load trace
used to generate the background load (all of the August, 1997 traces).
We characterized the quality of the expected running times and
confidence intervals produced by the system using three metrics.  We
evaluated the effect on these metrics of the different traces, load
predictors, and nominal running time.

The main conclusion is that the running time advisor does indeed
produce high quality predictions for task running times.  Performance
depends most strongly on the choice of host load predictor.  There is
a marked increase in performance in moving from the MEAN predictor
(where prediction errors are due to the raw signal variance) to the LAST
predictor (which is the simplest predictor to take advantage of the
autocorrelation of load signals).  There is a smaller, but important,
gain in moving from the LAST predictor to the AR(16) predictor (which
is the predictor that we found most appropriate for host load
prediction in the previous chapter).  The nature of these gains
depends on how heavily loaded the host is.  Performance also depends
on the running time of the task and generally improves as running time
increases.


\subsection{Methodology}
\label{sec:execpred.method}

To evaluate the running time advisor given a particular traced host,
we start up the experimental infrastructure described in
Section~\ref{sec:execpred.setup} on the measurement and recording
hosts.  The load playback tool is set to replay the selected trace.
The host load sensor is configured to run at 1 Hz.  Three host load
prediction systems are started: MEAN, LAST, and AR(16).  The systems
are configured to fit to 300 measurements (5 minutes) and to refit
themselves when the absolute error for a one-step-ahead prediction
exceeds 0.01 or the measured one-step-ahead mean squared error exceeds
the estimated one-step-ahead mean squared error by more than 5\%.  The
minimum interval between refits is 30 seconds and the maximum interval
before the measured mean squared error is tested is 300 seconds.

The prediction and measurement software are permitted to quiesce for
at least 600 seconds.  Then 3000 consecutive testcases are run on the
recording host, each according to this procedure:
\begin{enumerate}
\item Wait for a delay interval, $t_{interval}$, selected from a
uniform distribution from 5 to 15 seconds.
\item Get the current time $t_{now}$.
\item Select the task's nominal time, $t_{nom}$, from a uniform
distribution from 100 ms to 10 seconds.
\item Select a random host load prediction system from among MEAN, LAST,
AR(16).
\item Use the \verb.PredictRunningTime. API to compute the expected running time $t_{exp}$ and the 95\% confidence interval $[t_{lb},t_{ub}]$
using the latest available predictions available from the selected
host load prediction system.
\item Run the task on the spin server and measure its actual running
time, $t_{act}$.
\item Record the timestamp $t_{now}$, the prediction system used, the
nominal time $t_{nom}$, the expected running time $t_{exp}$, the
confidence interval $[t_{lb}, t_{ub}]$, and the actual running time
$t_{act}$.
\end{enumerate} 
After all 3000 testcases have been run, their records are imported
into a database table corresponding to the trace.  
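The testcase loop can be sketched as follows, where \verb.predict. and
\verb.run_task. are hypothetical stand-ins for the
\verb.PredictRunningTime. API and the spin server:

```python
import random

def run_testcases(n, predict, run_task,
                  predictors=("MEAN", "LAST", "AR(16)"), seed=0):
    # predict(predictor, t_nom) -> (t_exp, t_lb, t_ub)
    # run_task(t_nom) -> t_act (actual wall-clock running time)
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        # (a real run also waits a uniform 5-15 s delay between cases)
        t_nom = rng.uniform(0.1, 10.0)   # nominal time: 100 ms to 10 s
        pred = rng.choice(predictors)    # random prediction system
        t_exp, t_lb, t_ub = predict(pred, t_nom)
        t_act = run_task(t_nom)
        records.append((pred, t_nom, t_exp, t_lb, t_ub, t_act))
    return records
```

Each record corresponds to one row of the database table for the
trace being played back.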

It takes approximately 13 hours to complete 3000 testcases.  To
evaluate the running time advisor, we then mined the database.

\subsection{Metrics}

The \verb.PredictRunningTime. function predicts the running time of a task
in two ways:  as an expected value and as a confidence interval.
Because the lower bound of the confidence interval is artificially
limited by the fact that load cannot drop below zero, the expected
time is not necessarily in the middle of the confidence interval.

Evaluating the quality of the confidence interval, $[t_{lb},t_{ub}]$,
is a somewhat complex endeavor.  Suppose we run a wide variety of
testcases with a specified confidence, say 95\%.  If we used the ideal
algorithm for computing confidence intervals and the best possible
predictor, the lengths of the tasks' confidence intervals would be the
minimum possible such that 95\% of the tasks would have running times
in their predicted intervals.  An imperfect algorithm, such as ours,
will compute confidence intervals that are larger or smaller than
ideal, so that more or fewer than 95\% of the tasks complete in their
intervals.  The important point is that to evaluate a confidence
interval algorithm, we must measure both the lengths of the confidence
intervals it produces and the number of tasks that complete within
them.  We used the following two metrics:
\begin{itemize}
\item {\bf Coverage}: the fraction of tasks that complete within
their predicted confidence intervals
\item {\bf Span}: the average width of the confidence intervals, in
seconds
\end{itemize}
The ideal system will have the minimum possible span such that the
coverage is 95\%.
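Both metrics are straightforward to compute from the recorded
testcases; a sketch, assuming each record is reduced to a
$(t_{lb}, t_{ub}, t_{act})$ triple:

```python
def coverage_and_span(records):
    # records: (t_lb, t_ub, t_act) triples for one trace and predictor.
    n = len(records)
    inside = sum(1 for lb, ub, act in records if lb <= act <= ub)
    coverage = inside / n                             # fraction in CI
    span = sum(ub - lb for lb, ub, _ in records) / n  # mean CI width (s)
    return coverage, span
```

For a 95\% target, the ideal system maximizes coverage toward 0.95
while minimizing span.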

To evaluate the quality of the expected running time, $t_{exp}$, we
are interested in how strongly the actual time, $t_{act}$, correlates
with it.  Imagine plotting the actual times versus the expected times
for all the tasks.  With a perfect predictor, the result would be a
perfect 45 degree line starting at 0.  An imperfect predictor, such as
ours, will approximate the line but with points scattered around it.
The degree of this scatter is measured by the $R^2$ value of the
linear fit.  $R^2=1$ would correspond to a perfect predictor while
$R^2=0$ corresponds to randomness.  We will evaluate the quality of the
expected times using this $R^2$ value, which we will simply refer to
as the {$\mathbf{R^2}$} value.  This is not the entire story.  We also
care about the slope and intercept of the line.  We discuss this
further in Section~\ref{sec:execpred.expected_overall}.
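The $R^2$ value can be computed directly from the paired expected and
actual running times; a minimal sketch:

```python
def r_squared(expected, actual):
    # R^2 of the least-squares linear fit of actual versus expected
    # times: 1 is a perfect linear relationship, 0 is no relationship.
    n = len(expected)
    mx, my = sum(expected) / n, sum(actual) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(expected, actual))
    sxx = sum((x - mx) ** 2 for x in expected)
    syy = sum((y - my) ** 2 for y in actual)
    return (sxy * sxy) / (sxx * syy)
```

Note that $R^2$ alone does not distinguish the ideal 45 degree line
from any other line, which is why the slope and intercept are examined
separately.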


\subsection{Results}

Running 3000 randomized testcases on each of the 39 traces in our
August, 1997 set of traces resulted in 114,000 testcases to mine.
This plethora of testcases allows us to characterize the performance
of the running time advisor in a wide variety of ways.  Furthermore,
because the testcases are randomized, our results generalize to hosts
for which our set of traces is representative, which we believe to be
a considerable portion of hosts.

We want to answer several questions.  The most important of these is
whether our system does indeed provide useful predictions of task
running times, both in terms of the expected running time and in terms
of the confidence interval for the running time.  We are most
interested in the confidence intervals because these are the
predictions that we will use in the next chapter to make real-time
scheduling decisions.  In addition we want to understand how the
choice of underlying host load predictor affects prediction
performance, and how that performance depends on the nominal time of
the task.  

To address these questions, we looked at the testcases in five ways.
First, we measured the quality of the confidence intervals
independently of the nominal time of the task.  For each trace, we
computed the confidence interval metrics of coverage and span.  Then
we compared the different predictors based on these per-trace metrics.
Next, we conditioned this comparison on the nominal time of the task,
dividing the range into small, medium, and large tasks.  The third
and fourth steps repeat this approach but using the $R^2$ metric for
the expected running time.  Finally, we hand-classified each trace
based on the relationship of the performance metrics and the nominal
time.  This resulted in five classes.  We then developed a
recommendation for each class.

The overall conclusion is that the system does indeed predict
reasonable point estimates and confidence intervals for task running
times.  Using the AR(16) predictor, we found that there were only five
traces (out of 39) in which fewer than 90\% of the tasks completed
within their confidence intervals, and only one trace in which fewer
than 85\% did so.  The relationships
between MEAN, LAST, and AR(16) depend on whether the host is heavily
loaded or not.  On hosts with high load, the more sophisticated
predictors were able to produce significantly better coverage by
estimating wider spans.  On hosts with low load, they achieved
appropriate coverage with much smaller spans.  The AR(16) predictor
generally produced better results than LAST, and much better than
MEAN.  Performance generally improved as nominal time was increased.
In terms of predicting the expected running time, although there were
significant differences between the predictors, none of them could be
considered to have the ``best'' performance.  Interestingly, unlike
with confidence intervals, the expected running time predictions grew
worse with increasing nominal time.

\subsubsection{Nominal time independent evaluation of confidence
interval quality}
\label{sec:execpred.overall}

Consider mapping a task with a nominal time of 0.1 to 10 seconds on a
randomly selected host.  Will the \verb.PredictRunningTime. function
produce an accurate confidence interval for its running time?  How
will the accuracy depend on which underlying host load predictor is
used?

\begin{figure}
\centerline{
\begin{tabular}{cc}
\colfigsize
\epsfbox{eps/pred_fracinci_overall_0_10.epsf}
&
\colfigsize
\epsfbox{eps/pred_ciwidth_overall_0_10.epsf}
\\
(a) Coverage & (b) Span
\end{tabular}
}
\caption[Confidence interval metrics, all hosts, all nominal times]{Confidence interval metrics for 0.1 to 10 second tasks on all
hosts, independent of nominal time.}
\label{fig:execpred.overall_0_10}
\end{figure}

\begin{figure}
\centerline{
\begin{tabular}{cc}
\colfigsize
\epsfbox{eps/pred_fracinci_vsmean_0_10.epsf}
&
\colfigsize
\epsfbox{eps/pred_ciwidth_vsmean_0_10.epsf} \\
(a) Coverage & (b) Span \\
\end{tabular}
}
\caption[Confidence interval metrics, LAST and AR(16) vs. MEAN, all nominal times]{Confidence interval metrics showing LAST and AR(16) versus MEAN
for 0.1 to 10 second tasks on all hosts, independent of nominal time.}
\label{fig:execpred.vsmean_0_10}
\end{figure}

\begin{figure}
\centerline{
\begin{tabular}{cc}
\colfigsize
\epsfbox{eps/pred_fracinci_vslast_0_10.epsf}
&
\colfigsize
\epsfbox{eps/pred_ciwidth_vslast_0_10.epsf}
\\
(a) Coverage & (b) Span \\
\end{tabular}
}
\caption[Confidence interval metrics, AR(16) vs. LAST, all nominal times]{Confidence interval metrics showing AR(16) versus LAST for 0.1
to 10 second tasks on all hosts, independent of nominal time.}
\label{fig:execpred.vslast_0_10}
\end{figure}

Figures~\ref{fig:execpred.overall_0_10},
\ref{fig:execpred.vsmean_0_10}, and \ref{fig:execpred.vslast_0_10}
provide answers to these questions.  These figures present the same
information, but arranged in different forms to more clearly
illustrate different observations.  For example,
Figure~\ref{fig:execpred.overall_0_10} plots the (a) coverage, and (b)
span for each of the load traces in our study.  Consider
Figure~\ref{fig:execpred.overall_0_10}(a).  The coverage of trace axp1
using the MEAN predictor is approximately 0.62.  This point is the
result of running 3000 testcases (Section~\ref{sec:execpred.method})
on the axp1 trace, selecting the approximately 1000 of these that used
the MEAN predictor, counting the number of those testcases in which
the actual running time fell within the predicted confidence interval,
and dividing by the total number of MEAN
MEAN is clearly inappropriate.  We can see, however, that for the same
trace, the AR(16) predictor provides much better coverage, about 92\%,
while the LAST predictor does slightly better at about 98\%.  If we
look at the axp1's span metric in
Figure~\ref{fig:execpred.overall_0_10}(b), we can begin to understand
why.  To produce the point for the MEAN metric, the confidence
interval widths for the approximately 1000 MEAN testcases were
averaged.  We see that the span is about 3.4 seconds.  The AR(16)
predictor's span, computed similarly, is a slightly higher (worse) 4
seconds, while the LAST predictor's span is a huge 6.4 seconds.  The
small increase in span going from the MEAN to the AR(16) predictor
brings the coverage from 0.62 to 0.92, which is very close to the
target of 0.95.  The much greater increase in span going from AR(16)
to LAST overshoots the target coverage, going as far as 0.98.

As can be seen from Figure~\ref{fig:execpred.overall_0_10}(a), for
almost every trace in our study, there is some predictor that provides
a coverage that is near our target 95\%.  This is an important result
that supports our claim of being able to compute accurate confidence
intervals.  However, perhaps the predictors are producing inordinately
wide spans, reducing their usefulness.  The spans in
Figure~\ref{fig:execpred.overall_0_10}(b) are difficult to interpret.
There is clearly a {\em difference} between the predictors, and the
span of the MEAN predictor seems to usually be much wider than that of
the LAST and AR(16) predictors.  However, there are some traces where
the more sophisticated predictors seem to run aground, resulting in
larger spans than MEAN.

Figure~\ref{fig:execpred.overall_0_10} is powerful for making
observations about individual hosts, but it is somewhat difficult to
compare predictor performance between hosts using it.
Figure~\ref{fig:execpred.vsmean_0_10} rearranges the data to make such
comparisons easier.  We plot the performance metrics' values for LAST
and AR(16) for each trace versus the metrics values for the MEAN
predictor.  This helps to answer the question: What is the benefit of
using a more sophisticated predictor than MEAN?  For example, consider
Figure~\ref{fig:execpred.vsmean_0_10}(a).  For each trace, a dot is
plotted at (coverage of MEAN testcases, coverage of LAST testcases),
and a triangle is plotted at (coverage of MEAN testcases, coverage of
AR(16) testcases).  On the graph, a 45 degree line separates the
regions where the coverage is better than MEAN from that where it is
worse.  Figure~\ref{fig:execpred.vsmean_0_10}(b) is arrived at with
exactly the same methodology, except using the span metric.

Figure~\ref{fig:execpred.vsmean_0_10}(a) and (b) make it clear that
the LAST and AR(16) predictors generally provide quite different
performance results from the simple MEAN predictor.  For nine of the
traces, the more sophisticated predictors provide significantly better
coverage than the MEAN predictor.  For the remainder of the traces,
the more sophisticated predictors have slightly lower coverage.  In
terms of the span metric, nine of the traces show significantly
wider spans than MEAN, while the remainder are much narrower.  

The nine traces on which the LAST and AR(16) predictors produce better
coverage performance are the same traces on which they produce worse
span performance than MEAN.  These nine traces correspond to the hosts that
exhibit greater mean load (and correspondingly, greater variability in
load (Chapter~\ref{chap:statprop})).  The LAST and AR(16) predictors
are better able to ``understand'' such hosts and compute appropriately
wider confidence intervals compared to MEAN.  These wider confidence
intervals result in a far greater chance of a task's actual running
time falling within its computed confidence interval.  This is
precisely the behavior that we want.  Our goal is that 95\% of tasks
fall within their confidence intervals.  With the AR(16) predictor,
only five cases are less than 90\% and only one less than 85\%, whereas
with the MEAN predictor, only one of the high load traces is better
than 85\%. The gain from MEAN to AR(16) can be as much as 30\%, and it
is typically around 15\%.

Two effects are at work here.  First, the predictions of the LAST and
AR(16) predictors depend most strongly on recent measurements.  The
MEAN predictor, on the other hand, always presents the long term mean
of the signal.  As a result, the LAST and AR(16) predictors will
respond much more quickly during the period after an epoch transition
(Chapter~\ref{chap:statprop}) before a model refit happens.  This
means that their predictions, and thus the center point of the
confidence interval, will be much more likely to be in the right
place.  The
second effect results from how the confidence interval length is
computed.  Recall that with the MEAN predictor the autocovariance of
the signal is used to compute the confidence interval, while for the
LAST and AR(16) predictors it is the autocovariance of their
prediction errors that is used.  On a high load, high variability
host, an epoch transition is more likely than on a low load, low
variability host to make the autocovariance of the signal fail to
characterize the new epoch well.  In contrast, the structure of the
autocovariance of the prediction errors will probably change
considerably less drastically.

For those hosts which have lower load and variability, the LAST and
AR(16) predictors produce significantly narrower confidence intervals
than MEAN while still capturing a reasonable number of tasks within
their computed confidence intervals.  On average, the confidence
intervals are shrunk by 2-3 seconds while the fraction of tasks within
their confidence intervals shrinks by about 5\%.  Since for these
lightly loaded hosts, the MEAN predictor results in coverages that are
larger than the target 95\%, this is not an unreasonable tradeoff.
Essentially, for these low load hosts, moving from MEAN to AR(16)
reduces coverage by about 5\% while decreasing the span by 2-3
seconds (about 33\%).

At this point, we have shown that our algorithm does indeed compute
reasonable confidence intervals for task running times and that it
does so more accurately when using a more sophisticated predictor than
MEAN.  Now we would like to know whether we should prefer the LAST
predictor or the AR(16) predictor.  We have already pointed out some
of the differences between these two.  To continue, it is useful to
plot the data differently once again.
Figure~\ref{fig:execpred.vslast_0_10} plots the performance of the
AR(16) predictor versus the LAST predictor.  From
Figure~\ref{fig:execpred.vslast_0_10}(a), we can see that the
confidence intervals computed using AR(16) generally include more of
their tasks than those computed using LAST.  Using the AR(16)
predictor, only four of the cases are less than 90\% and only one is
less than 85\%.  Using LAST, nine are less than 90\%, while four are
less than 85\%.  This gain is due to the AR(16) predictor producing
slightly
wider confidence intervals, as can be seen from
Figure~\ref{fig:execpred.vslast_0_10}(b).  The span increases by 0.5
to 1 second in moving from LAST to AR(16).

On heavily loaded hosts, the gain in moving from LAST to AR(16) is
more significant---either we see a large increase in coverage, on
the order of 10\% or more, or there is a slight decline of 5\% or
less.  Correspondingly, the span either grows on the
order of 2-3 seconds or it shrinks by 1-2 seconds.    

To summarize, the benefit of using the LAST or AR(16) predictor over
MEAN is twofold.  For heavily loaded hosts, the more sophisticated
predictors compute wider confidence intervals which lead to a much
larger fraction of the tasks running within their confidence
intervals.  For lightly loaded hosts, LAST and AR(16) produce much
narrower confidence intervals than MEAN at a small, reasonable cost to
coverage.

The benefit of using the AR(16) predictor over the LAST predictor is
also twofold.  For about half of the heavily loaded hosts, it
produces wider confidence intervals that significantly increase
coverage, driving it much closer to the target of 95\%.  On the other
half, it shrinks the confidence intervals at small cost to coverage.
For almost all lightly loaded hosts, the span
increases slightly, driving the coverage slightly closer to the target
level.  

In essence, the MEAN predictor typically produces confidence intervals
with larger than needed spans, resulting in more coverage than is
requested. In contrast, the LAST predictor typically produces
insufficiently large spans, resulting in smaller than requested
coverage.  The AR(16) predictor typically operates between these two
extremes, producing appropriate coverage with small span.  On some
high load machines, the behaviors of MEAN and LAST are reversed, but
AR(16) still produces the best results.
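The coverage and span metrics at the heart of this comparison are
straightforward to compute from a set of testcases.  The following
minimal sketch illustrates them; the testcase record and field names
are hypothetical illustrations, not the dissertation's actual
evaluation code:

```python
# Sketch of the coverage and span metrics used in this evaluation.
# The Testcase container and its field names are hypothetical.
from dataclasses import dataclass

@dataclass
class Testcase:
    t_act: float    # actual running time (seconds)
    ci_low: float   # lower bound of the predicted confidence interval
    ci_high: float  # upper bound of the predicted confidence interval

def coverage(testcases):
    """Fraction of tasks whose actual running time fell inside their
    predicted confidence interval (the target here is 0.95)."""
    hits = sum(1 for tc in testcases if tc.ci_low <= tc.t_act <= tc.ci_high)
    return hits / len(testcases)

def mean_span(testcases):
    """Average width of the predicted confidence intervals, in seconds."""
    return sum(tc.ci_high - tc.ci_low for tc in testcases) / len(testcases)

cases = [Testcase(2.0, 1.5, 2.5), Testcase(3.0, 1.0, 2.0),
         Testcase(5.0, 4.0, 6.5)]
print(coverage(cases))   # two of the three intervals contain t_act
print(mean_span(cases))
```

A wider interval trivially raises coverage, which is why the two
metrics must be read together, as in the discussion above.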


\subsubsection{Effect of nominal time on confidence interval quality}

\begin{figure}
\centerline{
\begin{tabular}{cc}
\colfigsize
\epsfbox{eps/pred_fracinci_overall_0_3.epsf}
&
\colfigsize
\epsfbox{eps/pred_ciwidth_overall_0_3.epsf}
\\
(a) Coverage, 0.1 to 3 second tasks & (b) Span, 0.1 to 3 second tasks \\
\colfigsize
\epsfbox{eps/pred_fracinci_overall_3_6.epsf}
&
\colfigsize
\epsfbox{eps/pred_ciwidth_overall_3_6.epsf}
\\
(c) Coverage, 3 to 6 second tasks & (d) Span, 3 to 6 second tasks \\
\colfigsize
\epsfbox{eps/pred_fracinci_overall_6_10.epsf}
&
\colfigsize
\epsfbox{eps/pred_ciwidth_overall_6_10.epsf}
\\
(e) Coverage, 6 to 10 second tasks & (f) Span, 6 to 10 second tasks \\
\end{tabular}
}
\caption
[Effect of nominal time on confidence interval metrics, all hosts]
{Effect of nominal time on confidence interval metrics, all hosts.}
\label{fig:execpred.overall_all}
\end{figure}


\begin{figure}
\centerline{
\begin{tabular}{cc}
\colfigsize
\epsfbox{eps/pred_fracinci_vsmean_0_3.epsf}
&
\colfigsize
\epsfbox{eps/pred_ciwidth_vsmean_0_3.epsf}
\\
(a) Coverage, 0.1 to 3 second tasks & (b) Span, 0.1 to 3 second tasks \\
\colfigsize
\epsfbox{eps/pred_fracinci_vsmean_3_6.epsf}
&
\colfigsize
\epsfbox{eps/pred_ciwidth_vsmean_3_6.epsf}
\\
(c) Coverage, 3 to 6 second tasks & (d) Span, 3 to 6 second tasks \\
\colfigsize
\epsfbox{eps/pred_fracinci_vsmean_6_10.epsf}
&
\colfigsize
\epsfbox{eps/pred_ciwidth_vsmean_6_10.epsf}
\\
(e) Coverage, 6 to 10 second tasks & (f) Span, 6 to 10 second tasks \\
\end{tabular}
}
\caption
[Effect of nominal time on confidence interval metrics, LAST and AR(16) compared to MEAN]
{Effect of nominal time on confidence interval metrics, all
hosts, LAST and AR(16) compared to MEAN.}
\label{fig:execpred.vsmean_all}
\end{figure}


\begin{figure}
\centerline{
\begin{tabular}{cc}
\colfigsize
\epsfbox{eps/pred_fracinci_vslast_0_3.epsf}
&
\colfigsize
\epsfbox{eps/pred_ciwidth_vslast_0_3.epsf}
\\
(a) Coverage, 0.1 to 3 second tasks & (b) Span, 0.1 to 3 second tasks \\
\colfigsize
\epsfbox{eps/pred_fracinci_vslast_3_6.epsf}
&
\colfigsize
\epsfbox{eps/pred_ciwidth_vslast_3_6.epsf}
\\
(c) Coverage, 3 to 6 second tasks & (d) Span, 3 to 6 second tasks \\
\colfigsize
\epsfbox{eps/pred_fracinci_vslast_6_10.epsf}
&
\colfigsize
\epsfbox{eps/pred_ciwidth_vslast_6_10.epsf}
\\
(e) Coverage, 6 to 10 second tasks & (f) Span, 6 to 10 second tasks \\
\end{tabular}
}
\caption
[Effect of nominal time on confidence interval metrics, all
hosts, AR(16) compared to LAST]
{Effect of nominal time on confidence interval metrics, all
hosts, AR(16) compared to LAST.}
\label{fig:execpred.vslast_all}
\end{figure}

The performance of the running time advisor depends on the nominal
time of the task, $t_{nom}$.  To illustrate this dependence, we use
the same methodology as in the previous section to mine the testcases,
but we condition the results on $t_{nom}$.
Recall that the range of $t_{nom}$ is from 0.1 to 10 seconds.  In the
previous section, to produce a point on a graph, we used testcases
from this entire range.  In this section, we divide the range of
$t_{nom}$ into three subranges: 0.1 to 3 seconds (``small tasks''), 3
to 6 seconds (``medium tasks''), and 6 to 10 seconds (``large
tasks'').  Then, for each of these subranges we produce a graph.
Where previously each point on a single graph represented the average
of approximately 1000 testcases, we now have three graphs, and each
point represents the average of approximately 333 testcases in the
appropriate subrange.  The semantics of the graphs are
as before.  In the next section, we will delve deeper, dividing the
hosts into five classes and illustrating how performance depends on
nominal time for individual representative hosts.
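The conditioning step described above can be sketched as follows; the
tuple layout is a hypothetical illustration rather than the actual
evaluation harness:

```python
# Sketch of conditioning testcases on nominal time: the 0.1 to 10
# second range of t_nom is split into three subranges and the metric
# is averaged within each.  The (t_nom, metric) layout is assumed.
def bin_by_nominal_time(testcases):
    """Partition (t_nom, metric) pairs into small/medium/large tasks
    and return the average metric value for each nonempty subrange."""
    bins = {"small (0.1-3s)": [], "medium (3-6s)": [], "large (6-10s)": []}
    for t_nom, metric in testcases:
        if t_nom < 3.0:
            bins["small (0.1-3s)"].append(metric)
        elif t_nom < 6.0:
            bins["medium (3-6s)"].append(metric)
        else:
            bins["large (6-10s)"].append(metric)
    return {name: sum(vals) / len(vals) for name, vals in bins.items() if vals}
```

Each per-trace, per-subrange average produced this way corresponds to
one point on one of the three graphs.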

Figures~\ref{fig:execpred.overall_all}, \ref{fig:execpred.vsmean_all},
and \ref{fig:execpred.vslast_all} illustrate how the coverage and span
of confidence intervals depend on the nominal time.  The first figure
presents the performance metrics of the three predictors on each of
the traces for each of the three sizes of tasks.
Figures~\ref{fig:execpred.overall_all}(a) and (b) show the coverage and
span for tasks of 0.1 to 3 second nominal times, (c) and (d) for 3 to
6 second tasks, and (e) and (f) for 6 to 10 second tasks.

In terms of the coverage, Figure~\ref{fig:execpred.overall_all}
appears to show that all of the predictors do better as the nominal
time increases.  For short 0.1 to 3 second tasks
(Figure~\ref{fig:execpred.overall_all}(a)), there are a number of
cases where none of the predictors provide better than 90\% coverage,
while for 6 to 10 second tasks
(Figure~\ref{fig:execpred.overall_all}(e)), there is almost always
some predictor that provides coverage of over 95\%.  In terms of the
span, the relative gain of the more sophisticated predictors appears
to decrease.

Figure~\ref{fig:execpred.vsmean_all} replots the data to compare the
LAST and AR(16) predictors to the MEAN predictor while
Figure~\ref{fig:execpred.vslast_all} compares the AR(16) predictor
with the LAST predictor.  This illustrates the benefits of moving to
the more sophisticated predictors.  The benefit clearly increases as
the nominal time increases.  For example, at 0.1 to 3 seconds
(Figure~\ref{fig:execpred.vsmean_all}(a) and (b)) we can see that the
more sophisticated predictors actually have worse coverage than the
simple MEAN predictor because their spans are too narrow.  In general,
however, we can see that the AR(16) predictor does better than LAST
(Figure~\ref{fig:execpred.vslast_all}(a) and (b)), producing wider
spans that result in better coverage.  However, this is clearly a
difficult case, since even with AR(16), the coverage on half of the
traces is less than 80\%.

As the nominal time increases
(Figure~\ref{fig:execpred.vsmean_all}(c)--(f),
Figure~\ref{fig:execpred.vslast_all}(c)--(f)) we begin to see a
picture that is more similar to the overall picture of
Section~\ref{sec:execpred.overall}. On the heavily loaded hosts, the
more sophisticated predictors produce wider spans which result in much
better (and more appropriate) coverage.  On the lightly loaded hosts,
the coverage is similar between the different predictors, but LAST and
AR(16) produce much narrower spans.  In comparing LAST and AR(16), we
can see that as the nominal time increases the relationship between
these two predictors becomes more complex.  For medium sized tasks
(Figure~\ref{fig:execpred.vslast_all}(c)--(d)), AR(16) produces large
gains in coverage over LAST, bringing coverage to the target 95\%
levels in many cases.  The cost is a typically small increase in span,
although the difference is larger for more heavily loaded hosts.  For
large tasks (Figure~\ref{fig:execpred.vslast_all}(e)--(f)), LAST and
AR(16) seem to be in a dead heat.


\subsubsection{Nominal time independent evaluation of quality of
expected running times}
\label{sec:execpred.expected_overall}

\begin{figure}
\centerline{
\colfigsize
\epsfbox{eps/pred_tacttexpr2_overall_0_10.epsf}
}
\caption
[$R^2$, all hosts, all nominal times]
{$R^2$ metric for 0.1 to 10 second tasks on all hosts, independent of
nominal time.}
\label{fig:execpred.r2_overall_0_10}
\end{figure}

\begin{figure}
\centerline{
\colfigsize
\epsfbox{eps/pred_tacttexpr2_vsmean_0_10.epsf}
}
\caption
[$R^2$, LAST and AR(16) vs. MEAN, all nominal times]
{$R^2$ metric showing LAST and AR(16) versus MEAN for  0.1 to 10
second tasks on all hosts, independent of nominal time.}
\label{fig:execpred.r2_vsmean_0_10}
\end{figure}

\begin{figure}
\centerline{
\colfigsize
\epsfbox{eps/pred_tacttexpr2_vslast_0_10.epsf}
}
\caption
[$R^2$, AR(16) vs. LAST, all nominal times]
{$R^2$ metric showing AR(16) versus LAST for  0.1 to 10 second
tasks on all hosts, independent of nominal time.}
\label{fig:execpred.r2_vslast_0_10}
\end{figure}

This section evaluates how well the expected running time, $t_{exp}$,
provided by our algorithm predicts the actual running time, $t_{act}$,
encountered when running the task.  We will begin by using the same
methodology as the previous sections, but with the $R^2$ metric.  We
will also discuss certain important characteristics that are not
measured by $R^2$.

Figure~\ref{fig:execpred.r2_overall_0_10} shows the $R^2$ values
measured on each of the traces using each of the predictors.  Each
point represents approximately 1000 testcases.  As we can see, in
almost all cases, some predictor provides an $R^2>0.9$---our task
running time predictions explain over 90\% of the variability in the
running time.  There seems to be little distinction between the
different predictors, however.

Figure~\ref{fig:execpred.r2_vsmean_0_10} plots the $R^2$ values of the
LAST and AR(16) predictors versus their corresponding values for the
MEAN predictor.  Roughly half of the hosts benefit from the more
sophisticated predictors.  Among these are the more heavily loaded
hosts.  The other half of the hosts do not benefit from the more
sophisticated predictors, but their loss is not as significant as the
gains of the half that do.  It is also clear that the AR(16)
predictors generally have very little loss compared to MEAN.

Figure~\ref{fig:execpred.r2_vslast_0_10} compares the $R^2$ values of
the LAST and AR(16) predictors.  About two thirds of the hosts see
some benefit from the move to the AR(16) predictor, while the rest see
some detriment.  The gain or decline is on the order of 0.05.

It turns out that most of the variation in the running time of tasks
can be explained by the nominal time of the task itself.  Typically,
the $R^2$ using the nominal time as a predictor is on the order of
0.8, meaning that 80\% of the variation in the running time can be
explained simply by the nominal time of the task, $t_{nom}$.  Because
of this, there is little room for improvement by the predictor, which
is why the predictors have similar performance.

This is not to say that the nominal time is a good overall predictor,
because it can be very sensitive to transient behavior that a
predictor based on host load can more appropriately deal with.
Furthermore, the $R^2$ value shows how much of a linear relationship
there is between the actual running time and the prediction, but it
does not give any insight into what the slope and intercept of this
relationship are.  Ideally, a good predictor would have $t_{act} =
m\, t_{exp} + b + \mathit{noise}$ where $m=1$ and $b=0$.  Intuitively, the value of
$m$ is determined by how well the predictor captures the contribution
of the load to the running time.
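Both quantities used in this evaluation---the $R^2$ metric and the
slope and intercept of the linear relationship between $t_{act}$ and
$t_{exp}$---come from an ordinary least-squares fit.  A minimal
sketch, assuming plain lists of expected and actual running times:

```python
# Sketch of the R^2 metric and the linear fit t_act = m * t_exp + b.
# An ideal predictor would give m = 1, b = 0, and R^2 close to 1.
def linear_fit_r2(t_exp, t_act):
    """Return (m, b, r2) for the ordinary least-squares fit of
    t_act against t_exp.  r2 is the fraction of variance in
    t_act explained by the fit."""
    n = len(t_exp)
    mx = sum(t_exp) / n
    my = sum(t_act) / n
    sxx = sum((x - mx) ** 2 for x in t_exp)
    sxy = sum((x - mx) * (y - my) for x, y in zip(t_exp, t_act))
    m = sxy / sxx                       # slope of the fit
    b = my - m * mx                     # intercept of the fit
    ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(t_exp, t_act))
    ss_tot = sum((y - my) ** 2 for y in t_act)
    r2 = 1.0 - ss_res / ss_tot
    return m, b, r2
```

Note that $R^2$ alone cannot distinguish a fit with $m=1$, $b=0$ from
one with a very different slope, which is exactly the limitation
discussed above.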

\begin{figure}
\centerline{
\begin{tabular}{cc}
\colfigsize
\epsfbox{eps/pred_compare_direct_mojave_tnom.epsf}
&
\colfigsize
\epsfbox{eps/pred_compare_direct_mojave_last.epsf}
\\
(a) $t_{nom}$ predictor & (b) LAST predictor \\
\end{tabular}
}
\caption
[Actual versus expected running times, lightly loaded host]
{Actual versus expected running times for lightly loaded host
with intermittent load.}
\label{fig:execpred.mojave}
\end{figure}

Figure~\ref{fig:execpred.mojave} illustrates how a more sophisticated
predictor can avoid transient behavior, resulting in significantly
higher $R^2$.  Each of the graphs plots 1000 tasks chosen from 0.1 to
30 seconds, plotting their actual running time versus the predicted
running time, determined either as (a) the nominal time $t_{nom}$ or
(b) by using the LAST predictor.  We have also plotted a 50\% error
region around the predictions.  Because the host had a background
process that ran at infrequent times, the $t_{nom}$ predictor turned
out to be wildly inaccurate for about 10\% of the tasks,
under-predicting running times by a factor of 2.  The LAST predictor,
which incorporates the most recent host load measurements, was able to
detect when the background process ran, offering higher predicted
running times when this was the case.  Because of this, the $R^2$ rose
from 0.79 to 0.99.  More importantly, the linear fit has improved from
$m=1.1$ and $b=0.02$ to $m=1$ and $b=-0.03$.

\begin{figure}
\centerline{
\begin{tabular}{cc}
\colfigsize
\epsfbox{eps/pred_compare_direct_axp0_tnom.epsf}
&
\colfigsize
\epsfbox{eps/pred_compare_direct_axp0_ar16.epsf}
\\
(a) $t_{nom}$ predictor & (b) AR(16) predictor \\
\end{tabular}
}
\caption
[Actual versus expected running time, heavily loaded host]
{Actual versus expected running times for heavily loaded host
(axp0 trace).}
\label{fig:execpred.axp0}
\end{figure}

Figure~\ref{fig:execpred.axp0} shows another typical behavior.  In
this case we have run approximately 3000 tasks ranging from 0.1 to 10
seconds and have plotted them as per the previous figure.  The host
was playing back the high load axp0 trace.  In
Figure~\ref{fig:execpred.axp0}(a) we see that over 74\% of the
$t_{nom}$ predictions are wrong by more than 50\%.  On the other hand,
using the AR(16) predictor (Figure~\ref{fig:execpred.axp0}(b)) only
3\% of the predictions are wrong by more than 50\%.  However, note
that the $R^2$ value has changed only marginally, from 0.82 to
0.87---the linear fit to the $t_{nom}$ graph is not much worse than
that of the AR(16) predictor.  However, the structure of the fit has
improved from $m=2.15$ and $b=0.82$ to $m=0.95$ and $b=0.06$.


\subsubsection{Effect of nominal time on quality of expected running times}

\begin{figure}
\centerline{
\begin{tabular}{c}
\colfigsize
\epsfbox{eps/pred_tacttexpr2_overall_0_3.epsf} \\
(a) $R^2$, 0.1 to 3 second tasks \\
\colfigsize
\epsfbox{eps/pred_tacttexpr2_overall_3_6.epsf} \\
(b) $R^2$, 3 to 6 second tasks \\
\colfigsize
\epsfbox{eps/pred_tacttexpr2_overall_6_10.epsf} \\
(c) $R^2$, 6 to 10 second tasks \\
\end{tabular}
}
\caption
[Effect of nominal time on $R^2$, all hosts]
{Effect of nominal time on $R^2$ metric, all hosts.}
\label{fig:execpred.r2_overall_all}
\end{figure}

\begin{figure}
\centerline{
\begin{tabular}{c}
\colfigsize
\epsfbox{eps/pred_tacttexpr2_vsmean_0_3.epsf} \\
(a) $R^2$, 0.1 to 3 second tasks \\
\colfigsize
\epsfbox{eps/pred_tacttexpr2_vsmean_3_6.epsf} \\
(b) $R^2$, 3 to 6 second tasks \\
\colfigsize
\epsfbox{eps/pred_tacttexpr2_vsmean_6_10.epsf} \\
(c) $R^2$, 6 to 10 second tasks \\
\end{tabular}
}
\caption
[Effect of nominal time on $R^2$, LAST and AR(16) vs. MEAN]
{Effect of nominal time on $R^2$ metric, all hosts, LAST and
AR(16) compared to MEAN.}
\label{fig:execpred.r2_vsmean_all}
\end{figure}


\begin{figure}
\centerline{
\begin{tabular}{c}
\colfigsize
\epsfbox{eps/pred_tacttexpr2_vslast_0_3.epsf} \\
(a) $R^2$, 0.1 to 3 second tasks \\
\colfigsize
\epsfbox{eps/pred_tacttexpr2_vslast_3_6.epsf} \\
(b) $R^2$, 3 to 6 second tasks \\
\colfigsize
\epsfbox{eps/pred_tacttexpr2_vslast_6_10.epsf} \\
(c) $R^2$, 6 to 10 second tasks \\
\end{tabular}
}
\caption
[Effect of nominal time on $R^2$, AR(16) vs. LAST]
{Effect of nominal time on $R^2$ metric, all hosts, AR(16)
compared to LAST.}
\label{fig:execpred.r2_vslast_all}
\end{figure}

The $R^2$ of a prediction is strongly dependent on the nominal time of
the task, but is somewhat independent of which host load predictor is
used.  Figure~\ref{fig:execpred.r2_overall_all} illustrates how the
average $R^2$ value measured for each of the load traces depends on
the nominal time.  As can be seen from the figure, $R^2$ values
decline significantly as tasks grow in size.  For small tasks
(Figure~\ref{fig:execpred.r2_overall_all}(a)), there is generally some
predictor which provides $R^2>0.9$, while for large tasks
(Figure~\ref{fig:execpred.r2_overall_all}(c)), most $R^2$ values are
significantly below 0.8.  Furthermore, as tasks increase in size,
there seems to be greater variability in the performance of the
different predictors.  

This variability in performance among the predictors is not, alas,
evidence that one of the predictors is consistently better than the
others for all, or even a large group of,  hosts.  Indeed,
Figure~\ref{fig:execpred.r2_vsmean_all}, which compares the $R^2$ of
the LAST and AR(16) predictors to that of the MEAN predictor, and
Figure~\ref{fig:execpred.r2_vslast_all}, which compares the $R^2$ of
the LAST and AR(16) predictors, show that there are no clear
favorites.  As running times increase, $R^2$ values decline
precipitously and no clear pattern of difference among the hosts
emerges---no one predictor is preferable.  If $R^2$ is the desired
figure of merit, then a multiple expert approach may be warranted, in
which different predictors are used simultaneously and the value
reported to the user is that of the predictor with the best recent
predictions.
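Such a multiple expert scheme might be sketched as follows; the class,
the window length, and the selection rule are illustrative
assumptions, not a component of the running time advisor:

```python
# Sketch of a multiple expert predictor: run every predictor, track
# each one's recent absolute error, and report the prediction of
# whichever expert has been most accurate lately.  All names and the
# window length are illustrative assumptions.
from collections import deque

class MultipleExpert:
    def __init__(self, predictors, window=30):
        self.predictors = predictors                    # name -> callable
        self.errors = {p: deque(maxlen=window) for p in predictors}
        self.pending = {}                               # name -> last prediction

    def predict(self, task):
        self.pending = {p: f(task) for p, f in self.predictors.items()}
        # Choose the expert with the smallest mean recent error; before
        # any feedback arrives, fall back to the first predictor.
        best = min(self.errors,
                   key=lambda p: (sum(self.errors[p]) / len(self.errors[p]))
                   if self.errors[p] else float("inf"))
        return self.pending[best]

    def observe(self, t_act):
        # Record each expert's error once the true running time is known.
        for p, t_exp in self.pending.items():
            self.errors[p].append(abs(t_exp - t_act))
```

As feedback accumulates, the advisor's reported value would track
whichever underlying predictor has recently been closest to the actual
running times.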


\subsubsection{Host classes}

For each individual load trace, we plotted our three performance
metrics (coverage, span, and $R^2$) versus the nominal time
$t_{nom}$.  When we did this, we found that an interesting pattern
emerged.  By visual inspection, the results for the 39 traces could be
placed into five classes.  At present, we make no use of these
categorizations, and we certainly have not automated the
characterization process.  Nonetheless, examining representatives of
each of the classes is enlightening and permits us to make
recommendations for each of the classes.

\begin{figure}
\centerline{
\begin{tabular}{cc}
\colfigsize
\epsfbox{eps/class_i_coverage.epsf}
&
\colfigsize
\epsfbox{eps/class_i_span.epsf}
\\
(a) Coverage & (b) Span \\
\multicolumn{2}{c}{
\colfigsize
\epsfbox{eps/class_i_r2.epsf}
} \\
\multicolumn{2}{c}{
(c) $R^2$
}\\
\end{tabular}
}
\caption
[Confidence interval metrics on Class I hosts]
{Performance metrics as function of time for Class I
(``typical low load host'') host axp2.}
% low load 2
\label{fig:execpred.class_i}
\end{figure}

{\bf Class I} : Class I, which we also call the ``typical low load
host'' class, represents by far the most common behavior that we have
encountered.  The class consists of 29 of the 39 hosts (74\%): axp2,
axp3, axp6, axp7, axp8, axp9, axpfea, axpfeb, manchester-1,
manchester-2, manchester-5, manchester-6, manchester-7, manchester-8,
mojave, sahara, aphrodite, asbury-park, asclepius, cobain, darryl,
hestia, hawaii, newark, pryor, rhea, rubix, uranus, and zeno.

Class I is exemplified by trace axp2, whose metrics we plot as
functions of the nominal time in Figure~\ref{fig:execpred.class_i}.
Each point in the graph represents the average of about 200 testcases
and represents a 2 second span of nominal time, extending from one
second before the point's x-coordinate to one second after.  The main
characteristics of the class are the following.  The coverage is only
slightly dependent on the nominal time, increasing slightly for all
predictors as the nominal time increases.  The MEAN predictor
typically has almost 100\% coverage and is closely followed by the
AR(16) and then the LAST predictor.  The LAST and AR(16) predictors
have significantly narrower spans than the MEAN predictor, with AR(16)
producing slightly wider spans than LAST.  The $R^2$ drops
precipitously with increasing nominal time, especially for MEAN.  The
drop is less precipitous for LAST and AR(16), and LAST is usually
slightly better than AR(16) in terms of $R^2$.  For this particular
trace, the $R^2$ of LAST and AR(16) turns slightly upward at the
largest nominal times, while in other cases the decline is monotonic.
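The sliding-window averaging behind these per-host plots can be
sketched as follows; the data layout is hypothetical, and each point
averages the testcases whose nominal time lies within one second of
the point's x-coordinate:

```python
# Sketch of the 2-second sliding-window averaging used for the
# per-host plots.  The (t_nom, metric) layout is an assumption.
def windowed_average(testcases, centers, half_width=1.0):
    """testcases: list of (t_nom, metric); centers: x-coordinates of
    the plotted points.  Returns one averaged metric per center, or
    None where no testcase falls in the window."""
    points = []
    for c in centers:
        vals = [m for t, m in testcases
                if c - half_width <= t <= c + half_width]
        points.append(sum(vals) / len(vals) if vals else None)
    return points
```

Overlapping windows smooth the curves, at the cost of adjacent points
sharing some of the same testcases.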

Our primary concern is producing accurate confidence intervals.  For
this reason, we believe that the AR(16) is the best predictor for this
class of host.  The coverage is nearly as good as MEAN and is
typically near the target 95\% point, while LAST tends to lag behind,
especially for smaller tasks.  Furthermore, the span of AR(16) is
typically half that of MEAN and only slightly wider than LAST.  In
most hosts, then, a better predictor produces much narrower accurate
confidence intervals.

\begin{figure}
\centerline{
\begin{tabular}{cc}
\colfigsize
\epsfbox{eps/class_ii_coverage.epsf}
&
\colfigsize
\epsfbox{eps/class_ii_span.epsf}
\\
(a) Coverage & (b) Span \\
\multicolumn{2}{c}{
\colfigsize
\epsfbox{eps/class_ii_r2.epsf}
}
\\
\multicolumn{2}{c} {
(c) $R^2$
}
\\
\end{tabular}
}
\caption
[Confidence interval metrics on Class II hosts]
{Performance metrics as function of time for Class II
(``atypical low load host'') host axp1.}
%(``low load 1'')
\label{fig:execpred.class_ii}
\end{figure}

{\bf Class II} : Class II hosts, which we refer to as being in the
``atypical low load host'' class, present the second most common
behavior among our traces.  The class consists of 4 of the 39 
hosts (10\%): axp1, manchester-3, manchester-4, and bruce.

An exemplar of Class II is the trace axp1.  We have plotted the
metrics of axp1 in Figure~\ref{fig:execpred.class_ii}.  The methodology
behind the graphs is identical to that of Class I.  An important
distinguishing feature of this class is that the coverage of the MEAN
predictor drops precipitously with increasing nominal time because the
span of its confidence interval is not sufficiently large.  In
contrast, LAST and AR(16) compute slightly larger confidence intervals
which result in excellent coverage that increases with increasing
nominal time.  LAST and AR(16) have similar coverage (in this case
LAST is slightly ahead, in other cases AR(16) is slightly ahead).  The
$R^2$ of MEAN also decays quickly with increasing nominal time, while
those of LAST and AR(16) decay much more slowly.  Again, in some
cases, AR(16) is slightly ahead, in others LAST is slightly ahead by
this metric.  

In terms of computing confidence intervals, either AR(16) or LAST
seems adequate for producing confidence intervals for this class of
host.  Compared to MEAN, both produce significantly larger spans that
result in much better coverage.   In terms of the expected time,
AR(16) and LAST are also to be strongly preferred over the MEAN
predictor, while there is no clear advantage to recommending one over
the other.

\begin{figure}
\centerline{
\begin{tabular}{cc}
\colfigsize
\epsfbox{eps/class_iii_coverage.epsf}
&
\colfigsize
\epsfbox{eps/class_iii_span.epsf}
\\
(a) Coverage & (b) Span \\
\multicolumn{2}{c}{
\colfigsize
\epsfbox{eps/class_iii_r2.epsf}
}
\\
\multicolumn{2}{c}{
(c) $R^2$ 
}\\
\end{tabular}
}
\caption
[Confidence interval metrics on Class III hosts]
{Performance metrics as function of time for typical class III
(``high load 1'') host axp4.}
\label{fig:execpred.class_iii}
%(``high load 1'')
\end{figure}

{\bf Class III} : The remainder of the five host classes all contain
high load hosts.  There does not seem to be a ``typical'' behavior on
a high load host, so we will simply enumerate these classes.  Class
III, which we also call ``high load 1'', consists of 3 of the 39
hosts (8\%): axp4, axp5, and argus.

Using the same methodology as before,
Figure~\ref{fig:execpred.class_iii} plots the performance metrics as a
function of the nominal time for an exemplar, axp4.  Compared to the
low load hosts, this high load 1 host displays much more complex
behavior.  The predictor with the best coverage depends strongly on
the nominal time.  For very short tasks, MEAN is slightly better than
AR(16), which is much better than LAST, although the coverage is quite
poor with all three predictors.  For medium size tasks, AR(16)
provides the best coverage, followed at a distance by MEAN and LAST,
which become interchangeable.  For large tasks, AR(16) and LAST have
similar coverage, with AR(16) lagging slightly, while MEAN's coverage
is far behind.  In terms of the span, AR(16) and LAST both compute
much wider confidence intervals than MEAN, which explains why their
coverage is so much better.  MEAN is unable to capture the dynamic
behavior of this kind of host.  Predictably, for the nominal times
where AR(16) is preferable to LAST, it has a larger span.  In terms of
the $R^2$, AR(16) and LAST are clearly preferable to MEAN since their
performance declines much more gradually with increasing nominal time.
They are interchangeable.

In terms of computing accurate expected running times for this class
of hosts, either AR(16) or LAST are appropriate since their similar
$R^2$ metrics are significantly better than that of MEAN.  In terms of
computing accurate confidence intervals, the best predictor is highly
dependent on the nominal time.  For very short tasks, MEAN or AR(16)
is preferable, but either has rather poor coverage.  For medium tasks,
AR(16) produces the best performance.  For large tasks, LAST is best. 


\begin{figure}
\centerline{
\begin{tabular}{cc}
\colfigsize
\epsfbox{eps/class_iv_coverage.epsf}
&
\colfigsize
\epsfbox{eps/class_iv_span.epsf}
\\
(a) Coverage & (b) Span \\
\multicolumn{2}{c}{
\colfigsize
\epsfbox{eps/class_iv_r2.epsf}
}
\\
\multicolumn{2}{c}{
(c) $R^2$
} \\
\end{tabular}
}
\caption
[Confidence interval metrics on Class IV hosts]
{Performance metrics as function of time for typical class IV (``high load 2'') host axp10.}
\label{fig:execpred.class_iv}
\end{figure}

{\bf Class IV} : This class, which we also refer to as the ``high load
2'' class, contains two hosts: axp10 and themis.  

Figure~\ref{fig:execpred.class_iv} plots the performance of the
predictors on a representative trace, axp10, using the same
methodology as before.  We can see that the coverage of LAST and
AR(16) are virtually identical here and increase slowly with nominal
time.  MEAN has similar coverage for small tasks, but then behaves
increasingly poorly, with coverage decreasing rapidly with nominal
time.  In terms of the span, LAST grows much more quickly than MEAN
with increasing nominal time, while AR(16) is almost exactly in
between them.  For very short nominal times the spans are all
identical.  The $R^2$ of MEAN drops in step with increasing nominal
time, while the $R^2$ of LAST and AR(16), which remain virtually
identical, drop much more slowly.

In terms of computing confidence intervals, AR(16) clearly produces
the best results for this class of hosts, getting coverage identical
to that of LAST with a span that is often half as wide.  In terms of
computing expected times, AR(16) and LAST are nearly identical, and
significantly better than MEAN.

\begin{figure}
\centerline{
\begin{tabular}{cc}
\colfigsize
\epsfbox{eps/class_v_coverage.epsf}
&
\colfigsize
\epsfbox{eps/class_v_span.epsf}
\\
(a) Coverage & (b) Span \\
\multicolumn{2}{c}{
\colfigsize
\epsfbox{eps/class_v_r2.epsf}
}
\\
\multicolumn{2}{c}{
(c) $R^2$
}
\\
\end{tabular}
}
\caption
[Confidence interval metrics on Class V hosts]
{Performance metrics as function of time for typical class V (``high load 3'') host axp0.}
\label{fig:execpred.class_v}
\end{figure}

{\bf Class V} : Class V, which we also refer to as the ``high load 3''
class, consists of a single host, axp0.  

Figure~\ref{fig:execpred.class_v} plots the performance of the
predictors on axp0 using the same methodology as before.  In terms of
coverage, AR(16) is clearly the winner here, especially for medium
sized tasks.  It achieves its reasonable coverage (the goal is 95\%)
by computing slightly larger confidence intervals than MEAN.  LAST
computes confidence intervals that are far too small, resulting in
abysmal coverage.  In terms of $R^2$, LAST and AR(16) provide nearly
identical performance that drops much more gradually with increasing
nominal time than that of MEAN. 

AR(16) is clearly the preferable predictor for this class of hosts in
terms of computing confidence intervals.  For the expected time,
AR(16) and LAST are interchangeable here.


\section{Conclusion}

We began this chapter with a resource prediction service capable of
providing accurate host load predictions.  However, applications and
application schedulers are interested in application-level
predictions.  In particular, the real-time scheduling advisor needs
predictions of task running time.  To provide these predictions, we
developed an algorithm for transforming host load predictions and a
task's nominal time, a measure of the CPU demand of the task, into a
prediction of the task's running time.  The prediction is expressed
both as a point estimate (the expected running time) and a confidence
interval for the running time.

We implemented the algorithm to provide a new service, the running
time advisor, on top of the existing host load prediction service.  We
constructed a sophisticated run-time evaluation infrastructure based
on the new technique of load trace playback and evaluated our
implementation using it.  The main result is that the system works
well when combined with an appropriate host load predictor, such as
the AR(16) predictor that we found most appropriate when we evaluated
host load prediction in the previous chapter.

%The following summarizes the important elements of task running time
%prediction and its properties:
%\begin{itemize}
%\item Our algorithm is based on a discretized version of a fluid model
%of a priority-less host scheduler.  The assumption is that all
%processes run at the same priority.
%\item The expected running time is computed based on the individual host load
%predictions. 
%\item The confidence interval is computed based on the estimated covariances of
%of host load prediction errors.
%\item Capturing the correlation between host load prediction errors in
%the form of these covariances is of critical importance in computing
%accurate confidence intervals.
%\item Load discounting is needed to capture the priority boost that
%a Unix scheduler gives a task when it begins.
%\item Load trace playback is an effective tool for (re)producing a realistic
%workload on a host and is the basis of our evaluation infrastructure.
%\item The combination of our algorithm and an appropriate host load
%predictor can indeed produce effective predictions of task running time.
%\item There is almost always a host load predictor that can be used to
%compute reasonable confidence intervals, both in terms of having the
%desired coverage and having small span.
%\item Of the host load predictors we studied (MEAN, LAST, and AR(16)) 
%The best predictor is usually AR(16).  This is in keeping with our
%results of the previous chapter.
%\item There is a significant performance gain, in terms of either
%shorter span or better coverage in moving from the MEAN to the LAST
%predictor.
%\item Compared to MEAN, the LAST and AR(16) predictors tend to provide
%significantly better coverage (high load hosts) or significantly
%shorter spans (low load hosts).
%\item There is a smaller, but significant performance gain in moving
%from the LAST predictor to the AR(16) predictor, especially on more
%heavily loaded hosts.
%\item The AR(16) and LAST predictors perform well in terms of the
%expected running times of tasks.  
%\item Predictor performance depends on the nominal time of the task.
%The quality of confidence intervals, in terms of span and coverage,
%increases with increasing nominal time.  On the other hand, the
%quality of expected running times generally declines with increasing
%nominal time.  For both predictions the differences between the different
%predictors grows with increasing nominal time.
%\item We identified five different classes of behavior in how the
%performance metrics vary with the underlying host load predictor and
%with the task nominal time.
%\item For all the classes, for most nominal times, the AR(16)
%predictor is preferable.
%\end{itemize}


%
% 3 column landscape versions
%
%
%


%\begin{landscape}
%\begin{figure}
%\centerline{
%\begin{tabular}{ccc}
%\lscape3colfigsize
%\epsfbox{eps/pred_fracinci_overall_0_10.epsf}
%&
%\lscape3colfigsize
%\epsfbox{eps/pred_ciwidth_overall_0_10.epsf}
%&
%\lscape3colfigsize
%\epsfbox{eps/pred_tacttexpr2_overall_0_10.epsf}
%\\
%(a) coverage & (b) span & (c) tacttexpr2 \\
%\end{tabular}
%}
%\caption{Overall performance metrics, 0.1 to 10 second tasks, all hosts.}
%\label{fig:execpred.overall_0_10}
%\end{figure}
%\begin{figure}
%\centerline{
%\begin{tabular}{ccc}
%\lscape3colfigsize
%\epsfbox{eps/pred_fracinci_vsmean_0_10.epsf}
%&
%\lscape3colfigsize
%\epsfbox{eps/pred_ciwidth_vsmean_0_10.epsf}
%&
%\lscape3colfigsize
%\epsfbox{eps/pred_tacttexpr2_vsmean_0_10.epsf}
%\\
%(a) coverage & (b) span & (c) tacttexpr2 \\
%\end{tabular}
%}
%\caption{Performance metrics showing LAST and AR(16) versus MEAN, 0.1 to 10 second tasks, all hosts.}
%\label{fig:execpred.vsmean_0_10}
%\end{figure}
%\begin{figure}
%\centerline{
%\begin{tabular}{ccc}
%\lscape3colfigsize
%\epsfbox{eps/pred_fracinci_vslast_0_10.epsf}
%&
%\lscape3colfigsize
%\epsfbox{eps/pred_ciwidth_vslast_0_10.epsf}
%&
%\lscape3colfigsize
%\epsfbox{eps/pred_tacttexpr2_vslast_0_10.epsf}
%\\
%(a) coverage & (b) span & (c) tacttexpr2 \\
%\end{tabular}
%}
%\caption{Performance metrics showing AR(16) versus LAST, 0.1 to 10 second tasks, all hosts.}
%\label{fig:execpred.vslast_0_10}
%\end{figure}
%\end{landscape}


%\begin{landscape}
%\begin{figure}
%\centerline{
%\begin{tabular}{ccc}
%\lscape3colfigsize
%\epsfbox{eps/pred_fracinci_overall_0_3.epsf}
%&
%\lscape3colfigsize
%\epsfbox{eps/pred_ciwidth_overall_0_3.epsf}
%&
%\lscape3colfigsize
%\epsfbox{eps/pred_tacttexpr2_overall_0_3.epsf}
%\\
%(a) coverage 0:3 & (b) span 0:3 & (c) tacttexpr2 0:3 \\
%\lscape3colfigsize
%\epsfbox{eps/pred_fracinci_overall_3_6.epsf}
%&
%\lscape3colfigsize
%\epsfbox{eps/pred_ciwidth_overall_3_6.epsf}
%&
%\lscape3colfigsize
%\epsfbox{eps/pred_tacttexpr2_overall_3_6.epsf}
%\\
%(a) coverage 3:6 & (b) span 3:6  & (c) tacttexpr2 3:6 \\
%\lscape3colfigsize
%\epsfbox{eps/pred_fracinci_overall_6_10.epsf}
%&
%\lscape3colfigsize
%\epsfbox{eps/pred_ciwidth_overall_6_10.epsf}
%&
%\lscape3colfigsize
%\epsfbox{eps/pred_tacttexpr2_overall_6_10.epsf}
%\\
%(a) coverage 6:10 & (b) span 6:10  & (c) tacttexpr2 6:10 
%\end{tabular}
%}
%\caption{Performance metrics of LAST and AR(16) and MEAN as a function of $t_{nom}$ on all hosts.}
%\label{fig:execpred.overall_all}
%\end{figure}
%\end{landscape}

%\begin{landscape}
%\begin{figure}
%\centerline{
%\begin{tabular}{ccc}
%\lscape3colfigsize
%\epsfbox{eps/pred_fracinci_vsmean_0_3.epsf}
%&
%\lscape3colfigsize
%\epsfbox{eps/pred_ciwidth_vsmean_0_3.epsf}
%&
%\lscape3colfigsize
%\epsfbox{eps/pred_tacttexpr2_vsmean_0_3.epsf}
%\\
%(a) coverage 0:3 & (b) span 0:3 & (c) tacttexpr2 0:3 \\
%\lscape3colfigsize
%\epsfbox{eps/pred_fracinci_vsmean_3_6.epsf}
%&
%\lscape3colfigsize
%\epsfbox{eps/pred_ciwidth_vsmean_3_6.epsf}
%&
%\lscape3colfigsize
%\epsfbox{eps/pred_tacttexpr2_vsmean_3_6.epsf}
%\\
%(a) coverage 3:6 & (b) span 3:6  & (c) tacttexpr2 3:6 \\
%\lscape3colfigsize
%\epsfbox{eps/pred_fracinci_vsmean_6_10.epsf}
%&
%\lscape3colfigsize
%\epsfbox{eps/pred_ciwidth_vsmean_6_10.epsf}
%&
%\lscape3colfigsize
%\epsfbox{eps/pred_tacttexpr2_vsmean_6_10.epsf}
%\\
%(a) coverage 6:10 & (b) span 6:10  & (c) tacttexpr2 6:10 
%\end{tabular}
%}
%\caption{Performance metrics of LAST and AR(16) versus MEAN as a function of $t_{nom}$ on all hosts.}
%\label{fig:execpred.vsmean_all}
%\end{figure}
%\end{landscape}


%\begin{landscape}
%\begin{figure}
%\centerline{
%\begin{tabular}{ccc}
%\lscape3colfigsize
%\epsfbox{eps/pred_fracinci_vslast_0_3.epsf}
%&
%\lscape3colfigsize
%\epsfbox{eps/pred_ciwidth_vslast_0_3.epsf}
%&
%\lscape3colfigsize
%\epsfbox{eps/pred_tacttexpr2_vslast_0_3.epsf}
%\\
%(a) coverage 0:3 & (b) span 0:3 & (c) tacttexpr2 0:3 \\
%\lscape3colfigsize
%\epsfbox{eps/pred_fracinci_vslast_3_6.epsf}
%&
%\lscape3colfigsize
%\epsfbox{eps/pred_ciwidth_vslast_3_6.epsf}
%&
%\lscape3colfigsize
%\epsfbox{eps/pred_tacttexpr2_vslast_3_6.epsf}
%\\
%(a) coverage 3:6 & (b) span 3:6  & (c) tacttexpr2 3:6 \\
%\lscape3colfigsize
%\epsfbox{eps/pred_fracinci_vslast_6_10.epsf}
%&
%\lscape3colfigsize
%\epsfbox{eps/pred_ciwidth_vslast_6_10.epsf}
%&
%\lscape3colfigsize
%\epsfbox{eps/pred_tacttexpr2_vslast_6_10.epsf}
%\\
%(a) coverage 6:10 & (b) span 6:10  & (c) tacttexpr2 6:10 
%\end{tabular}
%}
%\caption{Performance metrics of AR(16) versus LAST as a function of $t_{nom}$ on all hosts.}
%\label{fig:execpred.vslast_all}
%\end{figure}
%\end{landscape}





%We begin by describing the application-level API that we wanted to
%provide.  Implementing the API required solving two significant
%problems: dealing with the priority boost that Unix schedulers tend to
%give to newly arrived tasks, and computing a confidence interval given
%correlated prediction errors on the part of the predictors.  We
%describe our solutions to these problems and then evaluate the
%resulting system.  The evaluation is done using load trace playback, a
%new technique that allows us to generate work on a host which conforms
%to that represented by a load trace.  We describe this technique and
%our load trace playback-based experimental setup.  While the host is
%playing back a representative trace (we use all of the traces in the
%August 1997 group), we run randomly selected testcases and evaluate
%how well their predicted running time agrees with the actual running
%time.  We use three different predictors: MEAN, LAST, and AR(16).  As
%noted in the previous chapter, AR(16) seems most appropriate for host
%load prediction.  We use three metrics for evaluation.


%\begin{equation}
%l(t) = \frac{\Delta - (t-\Delta\lfloor t/\Delta \rfloor)}{\Delta} at_{\lfloor t/\Delta \rfloor} + \frac{t-\Delta\lfloor t/\Delta \rfloor}{\Delta} at_{\lceil t/\Delta \rceil}
%\end{equation}
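A piecewise-linear interpolation of a sampled load signal, of the kind the equation above describes, can be sketched in Python.  This is an illustrative sketch only; the function and parameter names are ours, with `trace[k]` standing in for the sample $at_k$ taken at time $k\Delta$.

```python
import math

def interpolate_load(trace, delta, t):
    """Piecewise-linear interpolation of a sampled load signal.

    trace[k] is the load sampled at time k*delta; t is an arbitrary
    time within the span of the trace.  (Illustrative sketch; the
    names here are ours, not the dissertation's.)
    """
    lo = math.floor(t / delta)             # index of sample at or before t
    hi = min(lo + 1, len(trace) - 1)       # next sample (clamped at the end)
    frac = (t - lo * delta) / delta        # fractional position between samples
    return (1.0 - frac) * trace[lo] + frac * trace[hi]
```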



%Essentially, we are evaluating three things here.  The first is whether
%our algorithm actually usefully transforms from the predictions of an
%underlying host load predictor to a confidence interval for a task's
%running time.  The second thing we want to learn is how the quality of
%such transformations (as measured by the coverage and the span)
%depends on the underlying host load predictor---what are the
%benefits from moving from the variance of the underlying load signal
%(MEAN predictor), to a simple local predictor (LAST), and to the
%predictor we found most appropriate for host load prediction in
%Chapter~\ref{chap:hostloadpred}?  The third question is how the
%performance depends on the nominal running time of the task.



%This section addresses the following questions:
%\begin{enumerate}
%\item Does task running time prediction produce effective confidence
%intervals on a wide variety of different hosts for 100ms to 10 second
%tasks?
%\item How does the quality of the confidence interval depend on choice
%of host and on the host load predictor?
%\item Does task running time prediction produce effective expected
%running times on a variety of hosts?
%\item How does the quality of the expected running time depend on
%the choice of host and host load predictor?
%\end{enumerate}

%To address the questions, we look at the average value of each
%performance metric on each load trace using each host load predictor.
%Figure~\ref{fig:execpred.overall_0_10} shows the values for the (a)
%coverage, (b) span, and (c) tacttexpr2 metrics.  


%Our algorithm does indeed compute reasonable confidence intervals for
%the running time of a task.  It can do this using any underlying host
%load predictor.  We studied its behavior with the MEAN, LAST, and AR(16)
%predictor.  Either of the more sophisticated predictors is clearly
%preferable to MEAN.  MEAN computes far too narrow confidence intervals
%for heavily loaded hosts and far too wide confidence intervals for
%lightly loaded hosts.  The confidence intervals computed by LAST are a
%dramatic improvement, especially for heavily loaded hosts, but tend to
%be too narrow for lightly loaded hosts.  The confidence intervals
%computed by AR(16) are a small, but significant improvement over those
%of LAST.  
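The mapping from a host load prediction to a running-time confidence interval can be sketched as follows.  This is a minimal sketch under a simple fluid model, $t = t_{nom}(1 + load)$, with normally distributed, uncorrelated prediction errors; it is not the advisor's exact algorithm, which also accounts for correlated prediction errors and the Unix scheduler's priority boost.  All names here are ours.

```python
from statistics import NormalDist

def running_time_interval(t_nom, load_mean, load_var, confidence=0.95):
    """Expected running time and confidence interval for a task of
    nominal (unloaded) running time t_nom, given the predicted mean
    and variance of the average load over the task's execution.

    Simplifying assumptions (not the dissertation's full algorithm):
    a fluid model t = t_nom * (1 + load), and normally distributed,
    uncorrelated load prediction errors.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)   # two-sided quantile
    t_exp = t_nom * (1.0 + load_mean)                  # expected running time
    half = z * t_nom * load_var ** 0.5                 # half-width of interval
    # The task can never finish faster than its nominal time.
    return t_exp, (max(t_nom, t_exp - half), t_exp + half)
```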



%%The first kind of graph will plot the absolute value of the
%%performance metric as measured for each of the three predictors on
%%each of our traces.  The second kind of graph will plot the values of
%%the metric for the AR(16) and LAST predictors versus the value for the
%%MEAN predictor.  This allows us to gauge the advantage of using a
%%predictor to characterize some of the variance of the underlying load
%%signal.  Finally, the third kind of graph will plot the value of the
%%metric for the AR(16) predictor versus the value for the LAST
%%predictor.  This illustrates the additional benefit that is gained
%%from moving from a simple predictor to a more sophisticated predictor.
%%Recall that the costs of the AR(16) predictor are not significantly
%%higher than the LAST predictor.  It is also important to note that in
%%all cases, we use a strong characterization of the signal variances
%%(the autocovariance function of the load signal) or the error signal
%%variance (the autocovariance function of the prediction errors).


%We begin by examining the overall performance of task running time
%prediction independent of the running time.  For each host we have
%3000 observations, approximately 1000 for each predictor.  Using these
%observations, we compute the average value of each of our three
%metrics for each host and predictor pair.
%Figure~\ref{fig:execpred.overall_0_10} presents these numbers for the
%(a) coverage, (b) span, and (c) tacttexpr2 metrics.
%Figure~\ref{fig:execpred.vsmean_0_10} presents the same data except we
%have plotted the LAST and AR(16) metrics for each host against their
%corresponding MEAN metric.  With this arrangement, the name of each
%individual host is lost, but it is now easy to
%compare performance with the MEAN predictor.

%In terms of the coverage metric,
%Figure~\ref{fig:execpred.overall_0_10}(a), the three predictors seem
%to have similar performance, except on about 1/3 of the hosts where
%there is considerable disparity.  



% That is the question that this section answers.  

%  - overall, the ar models produce consistent 95\% CIs - only 5 cases less than 95% and only one less than 90%
%     versus last where 5 cases below 85% and two below 65%
%  - note two groups (low and high load) on the coverage and span graphs compared to mean
%    - low load hosts have shorter CIs than mean with ar16 and last
%    - high load hosts have better coverage than mean with ar16 and last
%  - conclusion reasonable CIs are being computed
%  - no clear difference in perf of tacttexpr2 between the three models
%     - however, most of the $R^2$ values are in excess of 0.80
%  - generally, ar16 produces slightly wider confidence intervals than last, and significantly wider than



%To evaluate the performance of task running time prediction, we
%decided that an off-line study, such as the one we performed to evaluate
%host load predictors in Chapter~\ref{chap:hostloadpred}, was inappropriate.
%In evaluating host load prediction we wanted to know how well past
%values of a discrete-time load signal could predict future values.
%Since we had complete knowledge of those future values, all errors
%were prediction errors.  In evaluating task running time prediction,
%we are evaluating a composite---the load predictors of
%Chapter~\ref{chap:hostloadpred} and the algorithm for transforming from
%load predictions to confidence intervals for running times described
%in Section~\ref{sec:execpred.confint}.  We want to know how well the
%task running time predictions produced by this composite compare to
%the actual running times.  However, in an off-line evaluation we
%wouldn't know the actual times.  We would have to use the load trace
%to synthesize a running time using some sort of model, as we did in
%the preliminary study of Chapter~\ref{chap:prelim}.  However, this model
%would introduce its own errors, which would be difficult to
%disentangle from those of the model in the composite, since, after all, it
%is a similar model.  On-line evaluation avoids this issue completely since
%the actual running time can be directly measured.

%The primary disadvantage of on-line evaluation is that it is
%difficult to capture behaviors on a wide range of host types and to
%repeat experiments.  It is one thing to convince the owner of an
%interesting host to let you use low overhead software to collect load
%signals, but it is quite another to convince him to let you run 1000s
%of compute-bound tasks on his machine.  If you are blessed by
%understanding host owners, an interesting behavior may not recur at
%convenient times.  This is especially critical when trying to perfect
%elements of the prediction system.  If you fix something that broke
%due to some load behavior, how can you test it?

%Systems are often evaluated using a synthetic load generator.  It is
%difficult to evaluate a prediction-based system in this way because
%the performance of prediction depends strongly on the system behavior
%being correlated over time, which is difficult to achieve in a
%synthetic generator.  Furthermore, this correlation is not expressed
%as a scalar value but as a vector-valued autocorrelation sequence.  In
%the specific case of host load, it is a sequence that can also change
%abruptly at different points of time.  To capture this behavior in a
%synthetic model would be a difficult undertaking.

%In our experimental environment, we use {\em load trace playback} to
%produce a repeatable background load that is nearly identical to that
%captured by a load trace taken from a real machine.    
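A duty-cycle approximation of load trace playback can be sketched as below.  This is a deliberate simplification: the real playback tool chooses its spin times so that the host's *measured* load average tracks the trace, which this duty-cycle scheme does not guarantee.  The function and parameter names are ours.

```python
import time

def play_back(trace, delta):
    """Impose a background load that roughly tracks a recorded trace.

    For each load sample, busy-spin for a fraction of the delta-second
    interval and sleep for the rest.  Simplified sketch: the actual
    playback technique calibrates spin times against the measured load
    average rather than using a fixed duty cycle.
    """
    for load in trace:
        duty = min(load, 1.0)              # fraction of the interval to burn
        start = time.monotonic()
        while time.monotonic() - start < duty * delta:
            pass                           # busy-wait (consumes CPU)
        time.sleep(max(0.0, (1.0 - duty) * delta))
```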
