\chapter{Host Load Prediction}
\label{chap:loadpred}


\comment{
Chapter~\ref{chap:loadpred} describes a large scale study that
evaluated linear models to determine which are most appropriate
for host load prediction.  The study was based on running randomized
testcases using the load traces described in
Chapter~\ref{chap:statprop}, and data-mining the results.  We found
that despite the complex behavior of host load signals, relatively
simple and computationally inexpensive autoregressive models, of
sufficiently high order, are appropriate for host load prediction.
This new knowledge forms the basis for the host load prediction system
in Figure~\ref{fig:intro.rtsa_structure}.  It was simple to construct
this component using \RPS\ once the choice of model became clear.
}


This dissertation argues for basing real-time scheduling advisors on
explicit resource-oriented prediction, specifically on the prediction
of resource signals.  The signal we focus on is host load.  In the
last chapter, we studied the statistical properties of a large
collection of host load traces and came to the conclusion that linear
time series models might be appropriate for predicting host load.
That analysis carried out the first four steps of our resource signal
methodology, as presented in Section~\ref{sec:rps.newpredsys}.  This
chapter carries out the final two steps of the methodology.  We use
the \RPS\ Toolkit to evaluate many different linear models on the load
traces we described in the previous chapter and come to the conclusion
that autoregressive models of order 16 or better are appropriate for
predicting the host load signal.  Having decided on a model,
implementing an \RPS-based on-line host load prediction system that
uses it is trivial, but the evaluation of that system is not, as we
shall see in the subsequent chapters.

It is important to note that a prediction consists not only of the
expected future values of the discrete-time host load signal, but also
of the expected variance (or mean squared error) of their prediction
errors and the covariance of those errors with each other.  In other
words, each prediction is qualified with an estimate of how erroneous
it is likely to be.  These measures of prediction quality provide the basis of
statistical reasoning using the predictions.  In the next chapter, we
use the predictions and the estimates of their quality to compute
confidence intervals for running times of tasks.

Intuitively, the expected mean squared error and the covariances are
measures of the ``surprise'' the user of the predictions is likely to
encounter.  The mean squared error is directly comparable to the
variance of the signal itself, while the covariances can be compared
to the autocorrelation of the load signal.  If we want to predict a
future value of the signal, then the mean squared error is sufficient
to qualify the prediction.  If the goal is to predict some function of
a number of future values (the average over an interval of time, for
example), then the covariances are needed because the prediction
errors of the individual values are not necessarily independent.  This
chapter focuses on the measured mean squared error of the predictors.
Ideally, the prediction errors will be less surprising (have lower
mean square error) than the signal itself.

The estimates of the mean squared error of the predictions are derived
from the fit of the predictive model to some region of the signal.
However, in prediction, we are concerned with the measured mean square
error of the model on the signal subsequent to where it was fit.  In
other words, we care not about how well the model fits the signal, but
about how well the fitted model, when used in practice, predicts the signal.
For this reason, this chapter concentrates on the measured mean square
error of the different predictors.

In a system such as \RPS, predictive models are likely to be refit at
arbitrary times whenever monitoring software decides that their
performance has degraded. Because of this, it is important to
consider, in as unbiased a fashion as possible, the performance of the
models in different situations.  Our evaluation of the predictive
models is randomized in order to avoid bias.  Another important point
is that we are interested in models that not only have a low measured
mean squared error on average, but whose measured mean squared error
is consistently low.  This requirement for consistent predictability
means that, regardless of when we fit the model, we can expect it to
provide good results.

We find that load is, in fact, consistently predictable to a very
useful degree from past behavior, and that simple linear time series
models, especially AR(16)s or better, are sufficiently powerful load
predictors.  These results are surprising given the complex behavior
of host load that we identified in the previous chapter. As far as we
are aware, this is the first study to identify these properties of
host load and then to evaluate the practical predictive power of
linear time series models that both can and cannot capture them.  It
is also unique in focusing on predictability on the scale of seconds
as opposed to minutes or longer time scales.


\section{Prediction}

Recall from the previous chapter that we write the host load signal as
$\langle z_t \rangle = \ldots,z_{t-1},z_t,z_{t+1},\ldots$, where $t$
is the current time and $z_{t+j}$ denotes a sample of the signal $j$
steps into the future.  The {\em prediction} of any particular load
sample will be written with a hat.  For example, $\hat{z}_{t+1}$
denotes a prediction of the value $z_{t+1}$ (the one-step-ahead
prediction) using all previous values of the signal.  Such predictions
use all available data up to and including time $t$.  We will also
write $\hat{z}_{t+i,t+j}$ to denote the prediction of $z_{t+j}$ using
all values up to and including $z_{t+i}$.

An \RPS-based resource prediction system provides one-step-ahead to
$m$-step-ahead predictions, where $m$ is user-specified.  At time
$t+i$, the value $z_{t+i}$ is measured and pushed to \RPS, which
responds with the vector
$[\hat{z}_{t+i,t+i+1},\hat{z}_{t+i,t+i+2},\ldots,\hat{z}_{t+i,t+i+m}]$.
When $z_{t+i+1}$ becomes available, we measure its one-step-ahead
prediction error as $a_{t+i,t+i+1}=z_{t+i+1}-\hat{z}_{t+i,t+i+1}$.
Similarly, when $z_{t+i+2}$ becomes available, we can compute the
two-step-ahead prediction error as
$a_{t+i,t+i+2}=z_{t+i+2}-\hat{z}_{t+i,t+i+2}$.  Taken over all $i$, we
summarize the set of one-step-ahead prediction errors by its variance,
which is commonly called the one-step-ahead {\em mean squared error}.
The two-step-ahead mean squared error is computed similarly.  There is a
mean squared error associated with every {\em lead time} $k$.  We
shall study the one- to 30-step-ahead (or one- to 30-second-ahead,
given our sample rate of 1 Hz) mean squared errors.
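As an illustration, this bookkeeping can be sketched in Python.  The
\verb.predict. interface, names, and example predictor below are ours
for illustration; they are not the \RPS\ API.

```python
import numpy as np

def lead_time_mse(signal, predict, max_lead=30):
    """Measure the 1- to max_lead-step-ahead mean squared errors.

    `predict(history, m)` is assumed to return the prediction vector
    [zhat_{t,t+1}, ..., zhat_{t,t+m}] given all samples up to time t.
    """
    errors = {k: [] for k in range(1, max_lead + 1)}
    for t in range(1, len(signal) - max_lead):
        preds = predict(signal[:t], max_lead)   # history is z_0 .. z_{t-1}
        for k in range(1, max_lead + 1):
            # k-step-ahead prediction error: actual minus predicted
            errors[k].append(signal[t - 1 + k] - preds[k - 1])
    return {k: float(np.mean(np.square(e))) for k, e in errors.items()}

# The LAST model repeats the most recent measurement for every lead time.
last = lambda history, m: [history[-1]] * m
mse = lead_time_mse(np.random.default_rng(0).standard_normal(200).cumsum(),
                    last, max_lead=5)
```

For LAST on a random walk, the measured mean squared error grows with
the lead time $k$, which is the behavior the Box plots later in this
chapter summarize across testcases.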


It is important to note that the mean $k$-step-ahead prediction error
should be zero (which it is in this study).  A non-zero mean would
indicate that the predictor has an unhealthy systematic bias.  The
mean squared error is the important quantity to examine, since it
measures how far a particular prediction error is likely to vary from
zero.  With a good predictor, the sequence of $k$-step-ahead
prediction errors should be IID.  If the error distribution is normal,
then the $k$-step-ahead mean squared error is sufficient to place
confidence intervals on the predictions.  We found that the prediction
errors are indeed independent, but not normally distributed.  In the
next chapter we show that making the normality assumption nonetheless
leads to good results in predicting the running time of tasks.
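Under that normality assumption, a confidence interval follows
directly from the $k$-step-ahead mean squared error.  A minimal
sketch (1.96 is the two-sided 95\% normal quantile):

```python
import math

def prediction_interval(z_hat, mse_k, z_crit=1.96):
    """95% confidence interval for a k-step-ahead prediction, assuming
    normally distributed errors with zero mean and variance mse_k."""
    half_width = z_crit * math.sqrt(mse_k)
    return (z_hat - half_width, z_hat + half_width)

# A predicted load of 0.5 qualified by a measured MSE of 0.02:
lo, hi = prediction_interval(0.5, 0.02)
```

The next chapter extends this construction from intervals on load
values to confidence intervals on task running times.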

While the sequence of $k$-step-ahead prediction errors is independent,
the $(k+1)$-step-ahead prediction error is dependent on the
$k$-step-ahead prediction error.  As we show in the next chapter, this
becomes important when it is necessary to sum over a number of
predictions, say over a time horizon from now to 10 seconds from now.
We do not evaluate the covariance of the prediction errors in this
chapter.  It is unclear what the figure of merit should be.  The more
predictable a signal is, the greater the covariances typically are.  


\section{Predictive models}


For the simplest prediction, the long-term mean of the signal, the
mean squared error (for any lead time) is simply the variance of the
load signal.  As we saw in the previous chapter, load has quite high
variance and exhibits other complex properties. The hope is that more
sophisticated prediction schemes will have much better mean squared
errors.

On the one hand, the analysis of the last chapter suggested that
linear time series models such as those in the
Box-Jenkins~\cite{BOX-JENKINS-TS} AR, MA, ARMA, and ARIMA classes
might be appropriate for predicting load.  On the other hand, the
existence of self-similarity induced long-range dependence suggested
that such models might require an impractical number of parameters or
that the much more expensive ARFIMA model class, which explicitly
captures long-range dependence, might be more appropriate.  Since it
is not obvious which model is best, we empirically evaluated the
predictive power of the AR, MA, ARMA, ARIMA, and ARFIMA model classes,
as well as a simple ad hoc windowed-mean predictor called BM and a
long-term mean predictor called MEAN.  We also looked at the
performance of BM(1) models, in which the last measured value of the
signal is used as the prediction of all future values.  We call this
the LAST model class.  These model classes, their implementations, and
their overheads are described in Chapter~\ref{chap:rps}.

It is important to note that host load is an exponentially smoothed
signal---an AR(1) with $\phi_1=0.8187$.  The benefits of higher order
linear models, as well as the clear benefits of the linear models over
LAST and BM show that there is significant predictability in the load
signal beyond that introduced by this exponential smoothing.
Furthermore, the better models show significantly better performance
for long lead times than would be expected from simply the exponential
smoothing.  Also, if we factor out the exponential smoothing, we still
see significant predictability.  Finally, as we show in the next
chapter, our predictions of the unmolested load signal work well to
estimate good confidence intervals for the running time of tasks.
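To make the smoothing concrete, the following sketch applies and
inverts an exponential smoother with $\phi_1=0.8187$; inverting it is
how we ``factor out'' the smoothing.  This is illustrative code, not
the kernel's implementation.

```python
import numpy as np

PHI = 0.8187  # smoothing constant from the text

def smooth(u, phi=PHI):
    """Exponential smoothing: z_t = phi * z_{t-1} + (1 - phi) * u_t."""
    z = np.empty(len(u))
    z[0] = u[0]
    for t in range(1, len(u)):
        z[t] = phi * z[t - 1] + (1 - phi) * u[t]
    return z

def unsmooth(z, phi=PHI):
    """Invert the smoothing to recover the underlying signal."""
    u = np.empty(len(z))
    u[0] = z[0]
    u[1:] = (z[1:] - phi * z[:-1]) / (1 - phi)
    return u

raw = np.abs(np.random.default_rng(1).standard_normal(100))
recovered = unsmooth(smooth(raw))  # round-trips to the original signal
```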


\section{Evaluation methodology}
\label{sec:loadpred.method}

Our methodology is designed to determine whether there are consistent
differences in the {\em practical predictive power} of the different
model classes.  Other goals are also possible. For example, one could
determine the {\em explanatory power} of the models by evaluating how
well they fit data, or the {\em generative power} of the model by
generating new load traces from fitted models and evaluating how well
their statistical properties match the original traces.  We have
touched on these other goals, but do not discuss them here.  To assess
the practical predictive power of the different model classes, we
designed a randomized, trace-driven simulation methodology that fits
randomly selected models to subsequences of a load trace and tests
them on immediately following subsequences.

%
% The daemon has one thread of control of this form, and also at least
% one thread that services prediction requests
%
In essence, a host load prediction system has one thread of control of
the following form:
\small
\begin{verbatim}
do forever {
   get new load measurement; 
   update history;
   if (some criteria) {
      refit model to history;
      make new predictor;
   }
   step predictor;
   sleep for sample interval;
}
\end{verbatim}
\normalsize
where \verb.history. is a window of previous load values.  Fitting the
model to the history is an expensive operation.  Once a model is
fitted, a predictor can be produced from it.  {\em Stepping} this
predictor means informing it of a new load measurement so that it can
modify its internal state. This is an inexpensive operation.
Prediction requests arrive asynchronously and are serviced using the
current state of the predictor.  A prediction request that arrives at
time $t+i$ includes a lead time $k$. The predicted load values
$\hat{z}_{t+i,t+i+1},\hat{z}_{t+i,t+i+2},\ldots,\hat{z}_{t+i,t+i+k}$ are returned
along with model-based estimates of the mean squared prediction error
for each prediction.
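A minimal Python sketch of such a predictor makes the expensive-fit
versus cheap-step split concrete.  The least-squares AR fit below is
our illustration; the actual model implementations are described in
Chapter~\ref{chap:rps}.

```python
import numpy as np

class ARPredictor:
    """Minimal AR(p) predictor fit by ordinary least squares (a sketch,
    not the RPS implementation)."""

    def __init__(self, history, p):
        h = np.asarray(history, dtype=float)
        self.mean = h.mean()
        d = h - self.mean
        # Expensive: fit d[t] ~ sum_j phi[j] * d[t-1-j] by least squares.
        rows = np.array([d[t - p:t][::-1] for t in range(p, len(d))])
        self.phi, *_ = np.linalg.lstsq(rows, d[p:], rcond=None)
        self.state = list(d[-p:][::-1])  # most recent deviation first

    def step(self, z):
        """Inform the predictor of a new measurement (cheap)."""
        self.state = [z - self.mean] + self.state[:-1]

    def predict(self, m):
        """Return the 1- to m-step-ahead predictions."""
        state, out = list(self.state), []
        for _ in range(m):
            nxt = float(np.dot(self.phi, state))
            out.append(nxt + self.mean)
            state = [nxt] + state[:-1]
        return out
```

In the loop above, refitting corresponds to constructing a new
\verb.ARPredictor. from the history, while stepping corresponds to the
cheap \verb.step. call made once per sample.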

Evaluating the predictive power of the different model classes in such
a context is complicated because there is such a vast space of
configuration parameters to explore.  These parameters include: the
trace, the model class, the number of parameters we allow the model
and how they are distributed, the lead time, the length of the history
to which the model is fit, the length of the interval during which the
model is used to predict, and at what points the model is refit.  We
also want to avoid biases due to favoring particular regions of
traces.

\begin{table}[tbp]
\centerline {
\begin{tabular}{c}
\epsfxsize=4in
\epsfbox{eps/testcase1.eps}
\\
\small
\begin{tabular}{|l|l|}
\hline
Model class & Number of parameters \\
\hline
MEAN  & none \\
BM(p) (inc. LAST)    & p=1..32, chosen by fit \\
\hline
AR(p) & p=1..32 \\
MA(q) & q=1..8 \\
ARMA(p,q) & p=1..4, q=1..4 \\
ARIMA(p,d,q) & p=1..4, d=1..2, q=1..4 \\
ARFIMA(p,d,q) & p=1..4, d by fit, q=1..4 \\
\hline
\end{tabular}
\normalsize
\end{tabular}
}
\caption
[Testcase generation]
{Testcase generation.}
\label{tab:loadpred.testcasegen}
\end{table}

To explore this space in a reasonably unbiased way, we ran a large
number of randomized testcases on each of the traces.
Table~\ref{tab:loadpred.testcasegen} illustrates the parameter space from
which testcase parameters are chosen.  A testcase is generated and
evaluated using the following steps. 
\begin{enumerate}
\item Choose a random {\em crossover point}, $t_{cross}$, from within the
trace.
\item Choose a random number of samples, $m$, from
$600,601,\ldots,10800$ (five minutes to three hours).   The $m$ samples
preceding the crossover point, $z_{t_{cross}-m}, z_{t_{cross}-m+1},
\ldots,z_{t_{cross}-1}$, are in the {\em fit interval}.
\item Choose a random number of samples, $n$, from $600,601,\ldots,10800$
(five minutes to three hours).  The $n$ samples including and following the
crossover point, $z_{t_{cross}}, z_{t_{cross}+1}, \ldots,
z_{t_{cross}+n-1}$, are in the {\em test interval}.
\item Choose a random AR, MA, ARMA, ARIMA, or ARFIMA {\em test model}
from the table in Table~\ref{tab:loadpred.testcasegen}, fit it to the
samples in the fit interval, and generate a predictor from the fitted
test model. 
\item For $i=m$ to $1$, step the predictor with $z_{t_{cross}-i}$
(the values in the fit interval) to initialize its internal state.
After this step, the predictor is ready to be tested.
\item For  $i=0$ to $n-1$ do the following:
\begin{enumerate}
\item Step the predictor with $z_{t_{cross}+i}$ (the next value in the
test interval).
\item For each lead time $k=1,2,\ldots,30$ seconds, produce the
predictions \\
$\hat{z}_{t_{cross}+i,t_{cross}+i+k}$.  $\hat{z}_{t_{cross}+i,t_{cross}+i+k}$ is
the prediction of $z_{t_{cross}+i+k}$ given the samples\\
$z_{t_{cross}-m},z_{t_{cross}-m+1},\ldots,z_{t_{cross}+i}$.
Compute
the prediction errors\\
$a_{t_{cross}+i,t_{cross}+i+k} = z_{t_{cross}+i+k} - \hat{z}_{t_{cross}+i,t_{cross}+i+k}$.
\end{enumerate}
\item For each lead time $k=1,2,\ldots,30$ seconds, analyze the
$k$-step-ahead prediction errors \\
$a_{t_{cross}+i,t_{cross}+i+k}$, $i=0,1,\ldots,n-1$.
\item Output the testcase parameters and the analysis of the
prediction errors.
\end{enumerate}
For clarity in the above, we focused on the linear time series model
under test.  Each testcase also includes a parallel evaluation of the
BM and MEAN models to facilitate direct comparison with the simple BM
model and the raw signal variance.
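The parameter-drawing steps above can be sketched as follows.  The
ranges follow Table~\ref{tab:loadpred.testcasegen}; the code itself is
an illustration, not our evaluation tool.

```python
import random

MODEL_SPACE = {
    "AR":     lambda r: {"p": r.randint(1, 32)},
    "MA":     lambda r: {"q": r.randint(1, 8)},
    "ARMA":   lambda r: {"p": r.randint(1, 4), "q": r.randint(1, 4)},
    "ARIMA":  lambda r: {"p": r.randint(1, 4), "d": r.randint(1, 2),
                         "q": r.randint(1, 4)},
    "ARFIMA": lambda r: {"p": r.randint(1, 4), "d": "by fit",
                         "q": r.randint(1, 4)},
}

def make_testcase(trace_len, rng):
    """Draw one randomized testcase following the steps above."""
    m = rng.randint(600, 10800)              # fit-interval length (samples)
    n = rng.randint(600, 10800)              # test-interval length (samples)
    t_cross = rng.randint(m, trace_len - n)  # crossover point
    model = rng.choice(sorted(MODEL_SPACE))
    return {"t_cross": t_cross, "m": m, "n": n,
            "model": model, "params": MODEL_SPACE[model](rng)}

tc = make_testcase(trace_len=86400, rng=random.Random(0))  # one day at 1 Hz
```

Constraining the crossover point so that both intervals fit inside the
trace is what keeps every drawn testcase valid.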

The lower limit we place on the length of the fit and test intervals
is purely pragmatic---the ARFIMA model needs about this much data to
be successfully fit.
most epoch lengths so that we can see the effect of crossing epoch
boundaries.  The models are limited to eight parameters because
fitting larger MA, ARMA, ARIMA, or ARFIMA models is prohibitively
expensive in a real system.  We did also explore larger AR models, up
to order 32.

The analysis of the prediction errors includes the following.  For
each lead time, the minimum, median, maximum, mean, mean absolute, and
mean squared prediction errors are computed.  The one-step-ahead
prediction errors (i.e., $a_{t_{cross}+i,t_{cross}+i+1}$,
$i=0,1,\ldots,n-1$) are also subject to IID and normality tests as
described by Brockwell and
Davis~\cite[pp. 34--37]{INTRO-TIME-SERIES-FORECASTING}.  IID tests
included the fraction of the autocorrelations that are significant,
the Portmanteau Q statistic (the power of the autocorrelation
function), the turning point test, and the sign test.  Normality was
tested by computing the $R^2$ value of a least-squares fit to a
quantile-quantile plot of the values or errors versus a sequence of
normals of the same mean and variance.  
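The quantile-quantile normality check can be sketched as follows.  The
plotting positions and other implementation details are our
illustrative choices, not a reconstruction of the exact test code.

```python
import numpy as np
from statistics import NormalDist

def qq_r2(errors):
    """R^2 of a least-squares fit to a quantile-quantile plot of the
    errors against normals of the same mean and variance."""
    e = np.sort(np.asarray(errors, dtype=float))
    nd = NormalDist(mu=float(e.mean()), sigma=float(e.std()))
    # Theoretical normal quantiles at plotting positions (i + 0.5)/n.
    q = np.array([nd.inv_cdf((i + 0.5) / len(e)) for i in range(len(e))])
    slope, intercept = np.polyfit(q, e, 1)
    ss_res = float(np.sum((e - (slope * q + intercept)) ** 2))
    ss_tot = float(np.sum((e - e.mean()) ** 2))
    return 1.0 - ss_res / ss_tot

# Normal errors score near 1; heavy-tailed errors score visibly lower.
rng = np.random.default_rng(2)
r2_normal = qq_r2(rng.standard_normal(2000))
```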

Because the properties of the two sets of traces are so similar, our
study focused on the August, 1997 set.  We implemented the randomized
evaluation using the parallelized \RPS-based tool described in
Chapter~\ref{chap:rps}.  We ran approximately 152,000 testcases, which
amounted to about 4000 testcases per trace, or about 1000 per model
class and parameter set, or about 30 per trace, model class and
parameter set.  We ran an additional 38,000 testcases explicitly to
compare AR(16) models with the LAST model (1000 testcases per trace).
It is these testcases that we discuss in the remainder of this
chapter.  We also ran an additional 152,000 fully randomized testcases
on the traces where we separately removed the exponential averaging.
Our parallelized simulation discarded testcases in which an unstable
model ``blew up'' (less than 5\%, almost always ARIMA or ARFIMA
models), either detectably or due to a floating-point exception.  The
results of the accepted testcases were committed to a SQL database to
simplify the analysis discussed in the following section.


\section{Results}
\label{sec:loadpred.results}

This section addresses the following questions: Is load consistently
predictable? If so, what are the consistent differences between the
different model classes, and which class is preferable?  To answer
these questions we analyze the database of randomized testcases from
Section~\ref{sec:loadpred.method}.  For the most part we will address
only the mean squared error results, although we will touch on the
other results as well.

\subsection{Load is consistently predictable}

For a model to provide consistent predictability of load, it must
satisfy two requirements.  First, for the average testcase, the model
must have a considerably lower expected mean squared error than the
expected raw variance of the load signal (i.e., the expected mean
squared error of the MEAN model).  Second, this expectation must be
dependable: there must be little variability in the measured mean
squared error from testcase to testcase.  Intuitively, the first
requirement says that the model provides better predictions on
average, while the second says that most predictions are close to that
average.

\begin{figure}[tbp]
\centerline {
\colfigsize
\epsfbox{eps/all_lead1_8to8.eps} 
}
\caption
[Predictor performance, all traces, 1 second lead, 8 parameters]
{All traces, 1 second lead, 8 parameters}
\label{fig:loadpred.all_lead1_8to8}
\end{figure}

Figure~\ref{fig:loadpred.all_lead1_8to8} suggests that load is indeed
consistently predictable in this sense.  The figure is a box plot that
shows the distribution of one-step-ahead mean squared error measures
for 8 parameter models on all of the traces.  In the figure, each
category is a specific class of model and is annotated with the number
of samples for that class.  For each class, the circle indicates the
expected mean squared error, while the triangles indicate the 2.5th
and 97.5th percentiles assuming a normal distribution.  The center
line of each box shows the median, while the lower and upper limits of
the box show the 25th and 75th percentiles, and the lower and upper
whiskers show the actual 2.5th and 97.5th percentiles.

Notice that the expected raw variance (MEAN) of a testcase is
approximately 0.05, while the expected mean squared error for {\em
all} of the model classes is nearly zero.  The figure also shows that
our second requirement for consistent predictability is met.  We see
that the variability around the expected mean squared error is much
lower for the predictive models than for MEAN. For example, the 97.5th
percentile of the raw variance is almost 0.3, while it is about 0.02
for the predictive models.

\begin{figure}[tbp]
\centerline {
\colfigsize
\epsfbox{eps/all_lead15_8to8.eps}
}
\caption
[Predictor performance, all traces, 15 second lead, 8 parameters]
{All traces, 15 second lead, 8 parameters.}
\label{fig:loadpred.all_lead15_8to8}
\end{figure}

\begin{figure}[tbp]
\centerline {
\colfigsize
\epsfbox{eps/all_lead30_8to8.eps}
}
\caption[Predictor performance, all traces, 30 second lead, 8 parameters]
{All traces, 30 second lead, 8 parameters.}
\label{fig:loadpred.all_lead30_8to8}
\end{figure}

Figures~\ref{fig:loadpred.all_lead15_8to8}
and~\ref{fig:loadpred.all_lead30_8to8} show the results for 15 second
predictions and 30 second predictions.  Notice that, except for the MA
models, the predictive models are consistently better than the raw
load variance, even with 30 second ahead predictions.  We also see
that MA models perform quite badly, especially at higher lead times.
This was also the case when we considered the traces individually and
broadened the number of parameters.  MA models are clearly ineffective
for load prediction.

\subsection{Successful models have similar performance}

Surprisingly,
Figures~\ref{fig:loadpred.all_lead1_8to8}~--~\ref{fig:loadpred.all_lead30_8to8} also
show that the differences between the successful models are actually
quite small.  This is also the case if we expand to include testcases
with 2 to 8 parameters instead of just 8 parameters.  With longer lead
times, the differences do slowly increase.

\begin{figure}[tbp]
\centerline {
\colfigsize
\epsfbox{eps/axp0_lead1_8to8.eps}
}
\caption
[Predictor performance, axp0 trace, 1 second lead, 8 parameters]
{Axp0, 1 second lead, 8 parameters}
\label{fig:loadpred.axp0_lead1_8to8}
\end{figure}

\begin{figure}[tbp]
\centerline {
\colfigsize
\epsfbox{eps/axp0_lead15_8to8.eps}
}
\caption
[Predictor performance, axp0 trace, 15 second lead, 8 parameters]
{Axp0, 15 second lead, 8 parameters}
\label{fig:loadpred.axp0_lead15_8to8}
\end{figure}

\begin{figure}[tbp]
\centerline {
\colfigsize
\epsfbox{eps/axp0_lead30_8to8.eps}
}
\caption
[Predictor performance, axp0 trace, 30 second lead, 8 parameters]
{Axp0, 30 second lead, 8 parameters}
\label{fig:loadpred.axp0_lead30_8to8}
\end{figure}

For more heavily loaded machines, the differences can be much more
dramatic.  For example, Figure~\ref{fig:loadpred.axp0_lead1_8to8}
shows the distribution of one-step-ahead mean squared error measures
for 8 parameter models on the axp0.psc trace.  Here we see an expected
raw variance (MEAN) of almost 0.3 reduced to about 0.02 by nearly all
of the models.  Furthermore, the mean squared errors for the different
model classes are tightly clustered around the expected 0.02 value,
quite unlike with MEAN, where we can see a broad range of values and
the 97.5th percentile is almost 0.5.  The axp0.psc trace and others
are also quite amenable to prediction with long lead times.  For
example, Figures~\ref{fig:loadpred.axp0_lead15_8to8}
and~\ref{fig:loadpred.axp0_lead30_8to8} show 15 and 30 second ahead
predictions for 8 parameter models on the axp0.psc trace.  With the
exception of the MA models, even 30 second ahead predictions are
consistently much better than the raw signal variance.  These figures
remain essentially the same if we include testcases with 2 to 8
parameters instead of just 8 parameters.

\begin{table}[tbp]
\figfontsize
\centerline{
\begin{tabular}{l|ccccccc}
	&MEAN	&BM	&AR	&MA	&ARMA	&ARIMA	&ARFIMA\\
\hline
MEAN	&$=$	&$>$	&$>$	&$>$	&$>$	&$>$	&$>$ \\
BM	&$<$	&$=$	&$=$	&$<$	&$=$	&$=$	&$=$ \\
\hline
AR	&$<$	&$=$	&$=$	&$<$	&$=$	&$=$	&$=$ \\ 
\hline
MA	&$<$	&$>$	&$>$	&$=$	&$>$	&$=$	&$>$ \\
ARMA	&$<$	&$=$	&$=$	&$<$	&$=$	&$=$	&$=$ \\
ARIMA	&$<$	&$=$	&$=$	&$<$	&$=$	&$=$	&$=$ \\
ARFIMA	&$<$	&$=$	&$=$	&$<$	&$=$	&$=$	&$=$ 
\end{tabular}
}
\caption
[Predictor performance, t-test comparisons, all traces, 1 second lead, 8 parameters]
{T-test comparisons - all traces aggregated, lead 1, 8
parameters.  Longer leads are essentially the same except MA is always
worse.}
\label{tab:loadpred.t-test-all-lead1-8to8}
\normalsize
\end{table}

Although the differences in performance between the successful models
are somewhat small, they are generally statistically significant.  We
can algorithmically compare the expected mean squared error of the
models using the unpaired t-test~\cite[pp. 209--212]{JAIN-PERF-ANALYSIS},
and do ANOVA procedures to verify that the differences we detect are
significantly above the noise floor.  For each pair of model classes,
the t-test tells us, with 95\% confidence, whether the first model is
better, the same, or worse than the second.  We do the comparisons for
the cross product of the models at a number of different lead times.
We consider the traces both in aggregate and individually, and we use
several different constraints on the number of parameters.
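The pairwise comparison can be sketched as follows, using Welch's
unpaired t statistic with a large-sample 95\% critical value.  This is
a sketch of the procedure, not the exact statistical code we used.

```python
import math

def unpaired_t(a, b):
    """Welch's unpaired t statistic and degrees of freedom for comparing
    the mean squared errors of two model classes."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    t = (ma - mb) / math.sqrt(va / na + vb / nb)
    # Welch-Satterthwaite degrees of freedom
    df = (va / na + vb / nb) ** 2 / (
        (va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

def compare(a, b, crit=1.96):
    """Return '<', '=', or '>' at ~95% confidence (large-sample value)."""
    t, _ = unpaired_t(a, b)
    return "<" if t < -crit else ">" if t > crit else "="
```

Applying \verb.compare. to the per-testcase mean squared errors of two
model classes yields one cell of the tables that follow.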

Table~\ref{tab:loadpred.t-test-all-lead1-8to8} shows the results of such a
t-test comparison for the aggregated traces, a lead time of 1, and 8
parameters.  In the table, the row class is being compared to the
column class.  For example, at the intersection of the AR row and MA
column, there is a '$<$', which indicates that, with 95\% confidence,
the expected mean squared error of the AR models is less than that of
MA models, thus the AR models are better than MA models for this set
of constraints.  For longer lead times, we found that the results
remained essentially the same, except that MA became consistently
worse.  

The message of Table~\ref{tab:loadpred.t-test-all-lead1-8to8} is that,
for the typical trace and a sufficient number of parameters, there are
essentially no differences in the predictive powers of the BM, AR,
ARMA, ARIMA, and ARFIMA models, even with high lead times.  Further,
with the exception of the MA models, all of the models are better than
the raw variance of the signal (MEAN model).

\begin{table}[tbp]
\figfontsize
\centerline{
\begin{tabular}{l|ccccccc}
	&MEAN	&BM	&AR	&MA	&ARMA	&ARIMA	&ARFIMA \\
\hline
MEAN   &$=$     & $>$  &    $>$  &    $=$   &   $>$   &   $>$   &   $>$\\
BM     &$<$      &$=$  &    $>$   &   $<$   &   $=$   &   $=$   &   $>$\\
\hline
AR     &$<$      &$<$  &    $=$   &   $<$   &   $=$   &   $<$   &   $=$\\
\hline
MA     &$=$      &$>$  &    $>$   &   $=$   &   $>$   &   $>$   &   $>$\\
ARMA   &$<$      &$=$  &    $=$   &   $<$   &   $=$   &   $=$   &   $>$\\
ARIMA  &$<$      &$=$  &    $>$   &   $<$   &   $=$   &   $=$   &   $>$\\
ARFIMA &$<$      &$<$  &    $=$   &   $<$   &   $<$   &   $<$   &
$=$\\
\end{tabular}
}
\caption
[Predictor performance, t-test comparisons, axp0 trace, 16 second lead, 8 parameters]
{T-test comparisons - axp0, lead 16, 8 parameters.}
\label{tab:loadpred.t-test-axp0-lead16-8to8}
\normalsize
\end{table}

For machines with higher mean loads, there are more statistically
significant differences between the models.  For example,
Table~\ref{tab:loadpred.t-test-axp0-lead16-8to8} shows a t-test
comparison for the axp0.psc trace at a lead time of 16 seconds for 8
parameter models.  Clearly, there is more differentiation here, and we
also see that the ARFIMA models do particularly well, as we might
expect given the self-similarity result of
Section~\ref{sec:statprop.self-sim}.  However, notice that the AR
models are doing about as well.

The results of these t-test comparisons can be summarized as follows:
(1) Except for the MA models, the models we tested are significantly
better (in a statistical sense) than the raw variance of the trace and
the difference in expected performance is significant from a systems
point of view.  (2) The AR, ARMA, ARIMA, and ARFIMA models perform at
least as well as the BM model, and perform better on more heavily
loaded hosts.  (3) The AR, ARMA, ARIMA, and ARFIMA models generally
have similar expected performance.

\subsection{AR models are better than BM and LAST models}

Since the AR, ARMA, ARIMA, and ARFIMA models have similar performance
when evaluated using the methods of the previous sections, the
inclination is to prefer the AR models because of their much lower
overhead (Section~\ref{sec:rps.tspl_perf}).  However, should we prefer
AR models over BM models, or over their degenerate case, the LAST
model?  After all, these latter models also have low overhead and are
much easier to understand and build.

\begin{figure}
\centerline{
\begin{tabular}{cc}
\colfigsize
\epsfbox{eps/ar1_bm32_direct.epsf}
&
\colfigsize
\epsfbox{eps/ar2_bm32_direct.epsf}
\\
(a) AR(1) and BM(32) & (b) AR(2) and BM(32)
\\
\colfigsize
\epsfbox{eps/ar4_bm32_direct.epsf}
&
\colfigsize
\epsfbox{eps/ar8_bm32_direct.epsf}
\\
(c) AR(4) and BM(32) & (d) AR(8) and BM(32)
\\
\colfigsize
\epsfbox{eps/ar16_bm32_direct.epsf}
&
\colfigsize
\epsfbox{eps/ar32_bm32_direct.epsf}
\\
(e) AR(16) and BM(32) & (f) AR(32) and BM(32)
\end{tabular}
}
\caption[Paired comparisons of AR and BM(32) models by lead time]
{Paired comparisons of AR and BM models by lead time.}
\label{fig:loadpred.paired_arbm_lead}
\end{figure}


To address this issue, we performed paired comparisons of testcases
that used these models.  Recall from Section~\ref{sec:loadpred.method}
that each testcase is run both with a randomly chosen model and with a
BM model.  For each testcase and lead time, we can directly compare
the variance of the test interval (as measured by the MEAN model), the
mean squared error of the randomly chosen model, and mean square error
of the BM model.  Further, we can normalize the reduction of variance
each model provides at a particular lead time to the variance of the
test interval itself (i.e., (variance $-$ mean squared error)/variance).
Finally, we can aggregate these normalized reductions in variance
across testcases.
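This normalization and aggregation can be sketched as (the
(variance, MSE) pairs shown are illustrative values, not measured
results):

```python
import numpy as np

def mean_pct_reduction(testcases):
    """Average percentage reduction in variance across testcases.

    Each testcase supplies (test-interval variance, measured MSE).
    The MEAN model scores 0% by construction; a negative score means
    the prediction errors are more 'surprising' than the signal itself.
    """
    reductions = [(var - mse) / var for var, mse in testcases]
    return 100.0 * float(np.mean(reductions))

# Three illustrative (variance, MSE) pairs:
pct = mean_pct_reduction([(0.05, 0.004), (0.30, 0.02), (0.10, 0.01)])
```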

Figure~\ref{fig:loadpred.paired_arbm_lead} uses this process to
compare AR models of different orders to BM models.  Each graph in the
figure plots the average percentage reduction in variance as a
function of the lead time for AR models of a given order and their
corresponding BM models.  The performance of the MEAN model is 0\% in
all cases.  Each point on a graph encodes approximately 1000
testcases.

As the figure makes clear, AR models of sufficiently high order
outperform BM models, especially at high lead times.  Notice that for
predictions as far ahead as 30 seconds, the AR(16) and AR(32) models
provide lower variance than the raw load signal, while the BM models
produce prediction errors that are actually more surprising than the
signal itself.  Indeed, the BM models seem useful only at lead times
of 10 seconds or less; only at extremely short prediction horizons do
they perform as well as the AR models.

Figure~\ref{fig:loadpred.paired_arbm_lead} also shows that the
performance of the AR models generally increases as their order
increases.  There are diminishing returns with higher order models,
however.  Notice that the gain from the AR(16) model to the AR(32)
model is marginal.  For very low order models, rather odd effects can
occur, such as with AR(2) models
(Figure~\ref{fig:loadpred.paired_arbm_lead}(b)), which buck the
overall trend of higher order models improving on lower order ones.
We noticed that AR models of order less than 5 have highly variable
behavior, while AR models of order greater than 16 do not
significantly improve performance.

Figure~\ref{fig:loadpred.paired_arbm_lead}(a) shows the performance of
an AR(1) model.  If the only source of predictability in the host load
signal were the exponential smoothing done in the kernel, we would
expect that AR models of higher order would not provide better
predictability.  As we can see, however, higher order models do result
in lower mean squared prediction errors.
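This observation can be made precise.  The kernel's exponential
average has the form of a first-order recurrence driven by the sampled
ready-queue length $q_t$:
\[
z_t = (1-\alpha)z_{t-1} + \alpha q_t
\]
where $\alpha$ is set by the five-second time constant.  If $q_t$ were
white noise, this recurrence would make $z_t$ exactly an AR(1)
process, and no higher order model could further reduce the prediction
error.  The gains we observe from higher order AR models therefore
imply that the ready-queue length itself is correlated over time.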

\comment{
\begin{figure}
\centerline{
\begin{tabular}{cc}
\colfigsize
\epsfbox{eps/ar_bm_1sec_direct.epsf}
&
\colfigsize
\epsfbox{eps/ar_bm_2sec_direct.epsf}
\\
(a) one second ahead & (b) two seconds ahead
\\
\colfigsize
\epsfbox{eps/ar_bm_4sec_direct.epsf}
&
\colfigsize
\epsfbox{eps/ar_bm_8sec_direct.epsf}
\\
(c) four seconds ahead & (d) eight seconds ahead
\\
\colfigsize
\epsfbox{eps/ar_bm_16sec_direct.epsf}
&
\colfigsize
\epsfbox{eps/ar_bm_30sec_direct.epsf}
\\
(e) sixteen seconds ahead & (f) thirty seconds ahead
\end{tabular}
}
\caption[Paired comparisons of AR and BM models by model order]
{Paired comparisons of AR and BM models by model order}
\label{fig:loadpred.paired_arbm_order}
\end{figure}
}

\begin{figure}
\centerline{
\colfigsize
\epsfbox{eps/ar16_last_direct.epsf}
}
\caption[Paired comparisons of AR(16) and LAST models by lead time]
{Paired comparisons of AR(16) and LAST models by lead time.}
\label{fig:loadpred.paired_arlast_lead}
\end{figure}

It is interesting to compare AR(16) models to the simplest possible
predictor, the LAST model.  To do this, we ran an additional set of
testcases comparing AR(16) and LAST models.  It is possible, of
course, to mine the testcases in the original set to do the
comparison.  However, the testcases we would look at would not be
random, but rather those where the BM model had selected LAST as being
most appropriate.  By running a separate set of testcases, we were
able to eliminate this bias.

Figure~\ref{fig:loadpred.paired_arlast_lead} compares the average
percentage reduction in variance, computed as before, of the AR(16)
model and the LAST model as a function of the lead time.  As we can
see, the AR(16) model significantly outperforms the LAST model,
especially for lead times greater than five seconds.  The AR(16) curve 
of Figure~\ref{fig:loadpred.paired_arlast_lead} is different from that 
of Figure~\ref{fig:loadpred.paired_arbm_lead}(e) because of the use
of a different random set of testcases.


\subsection{AR(16) models or better are appropriate}

Clearly, using one of the model classes, other than MA, for load
prediction is beneficial.  Further, using an AR, ARMA, ARIMA, or
ARFIMA model is preferable to the BM model.  In paired comparisons, AR
models of sufficiently high order considerably outperform BM and LAST
models, especially at longer lead times.  Of this group of models, we
found in Section~\ref{sec:rps.tspl_perf} that the AR models are much
less expensive to fit than the ARMA, ARIMA, or ARFIMA models and are
of similar or lesser cost to use.  In addition, AR models can be
fitted in a deterministic amount of time.  Because of this combination
of high performance and low overhead, AR models are the most
appropriate for host load prediction.  AR models of order 5 or higher
seem to work well.  However, we found the knee in the performance of
the AR models to be around order 16.  Since fitting AR models of this
order, or even considerably higher, is quite fast, we recommend using
AR($16$)s or better for load prediction.
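The determinism of AR fitting is easy to see in code.  The following
sketch (ours, not the \RPS\ implementation) fits an AR($p$) model by
the Yule-Walker method, solving the Toeplitz system with the
Levinson-Durbin recursion in a fixed number of operations:

```python
def yule_walker_ar(z, p):
    """Fit an AR(p) model to the series z by the Yule-Walker method,
    solving the Toeplitz system with the Levinson-Durbin recursion.
    Returns (phi, noise_variance), where phi[j] weights lag j+1.
    The cost is a fixed O(n*p + p^2) -- deterministic, unlike the
    iterative minimization needed for MA, ARMA, or ARFIMA models."""
    n = len(z)
    mean = sum(z) / n
    x = [v - mean for v in z]
    # Biased sample autocovariances r[0..p] (keeps the system PSD)
    r = [sum(x[t] * x[t + k] for t in range(n - k)) / n
         for k in range(p + 1)]
    phi = []
    sigma2 = r[0]
    for k in range(1, p + 1):
        # Reflection coefficient for order k
        acc = r[k] - sum(phi[j] * r[k - 1 - j] for j in range(k - 1))
        kappa = acc / sigma2
        phi = [phi[j] - kappa * phi[k - 2 - j]
               for j in range(k - 1)] + [kappa]
        sigma2 *= (1.0 - kappa * kappa)
    return phi, sigma2
```

For $p=16$ and the trace lengths we use, this runs essentially
instantaneously, which is why fitting AR models of this order or
considerably higher is practical.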


%
% This section will probably have to go due to space constraints
%
\subsection{Prediction errors are not IID normal}

As we described in Section~\ref{sec:loadpred.method}, our evaluation
of each testcase includes tests for the independence and normality of
prediction errors.  Intuitively, we want the errors to be independent
so that they can be characterized with a probability distribution
function, and we want the distribution function to be normal to
simplify computing confidence intervals from it.

For the most part, the errors we see with the different models are not
independent or normal to any confidence level.  However, from a
practical point of view, the errors are much less correlated over time
than the raw signal.  For example, for AR($8$) models, the Portmanteau
Q statistic tends to be reduced by an order of magnitude or more,
which suggests that those autocorrelations that are large enough to be
significant are only marginally so.  Furthermore, assuming normality
when computing confidence intervals at high confidence levels, such as
95\%, seems to be reasonable.  Finally, if we want to predict the
average host load over an interval of time, we will sum individual
predictions.  By the central limit theorem, the distribution of this
sum should converge to normality, although it may do so slowly.  In
the next chapter, we successfully rely on this convergence.
%However, since a load prediction system
%will probably continuously evaluate a fitted model in order to
%determine when to refit it, it seems reasonable for it to keep a
%histogram or other more detailed representation of the errors in order
%to more accurately compute confidence intervals.
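The Portmanteau Q statistic referred to above can be computed as
follows (a sketch in the Ljung-Box form; the helper name is ours):

```python
def ljung_box_q(errors, max_lag):
    """Ljung-Box portmanteau Q statistic over the first max_lag
    residual autocorrelations.  Under the null hypothesis of
    independent errors, Q is approximately chi-squared with max_lag
    degrees of freedom, so a large Q flags significant residual
    correlation."""
    n = len(errors)
    mean = sum(errors) / n
    x = [e - mean for e in errors]
    c0 = sum(v * v for v in x) / n
    q = 0.0
    for k in range(1, max_lag + 1):
        ck = sum(x[t] * x[t + k] for t in range(n - k)) / n
        rk = ck / c0  # lag-k autocorrelation of the errors
        q += rk * rk / (n - k)
    return n * (n + 2) * q
```

A fitted model whose residuals are an order of magnitude less
correlated than the raw signal shows a correspondingly smaller Q.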


%\begin{figure}[tbp]
%\figfontsize
%\centerline{
%\begin{tabular}{l|ccccccc}
%Model &	 0 &	1&	2&	3&	4&	5&	6 \\
%\hline	
%MEAN	&  0.00\% &	  0.00\% &	  0.00\% &	  0.00\% &	  0.00\% &	  0.00\% &	100.00\% \\	
%BM	& 13.79\% &	 17.24\% &	  3.45\% &	 24.14\% &	 27.59\% &	 13.79\% &	  0.00\% \\
%AR	& 20.69\% &	 24.14\% &	 17.24\% &	 13.79\% &	 10.34\% &	 13.79\% &	  0.00\% \\
%MA	&  3.45\% &	  6.90\% &	  3.45\% &	 10.34\% &	 31.03\% &	 44.83\% &	  0.00\% \\
%ARMA	&  6.90\% &	 20.69\% &	 24.14\% &	 34.48\% &	 10.34\% &	  3.45\% &	  0.00\% \\
%ARIMA	& 48.28\% &	  6.90\% &	 20.69\% &	  3.45\% &	  6.90\% &	 13.79\% &	  0.00\% \\
%ARFIMA	&  6.90\% &	 24.14\% &	 31.03\% &	 13.79\% &	 13.79\% &	 10.34\% &	  0.00\% \\
%\end{tabular}
%}
%\caption{Model rankings by mean for lead 1, 8:8 params}
%\label{fig:loadpred.t-test-example}
%\normalsize
%\end{figure}

%\begin{figure}[tbp]
%\figfontsize
%\centerline{
%\begin{tabular}{l|ccccccc}
%Model &	 0 &	1&	2&	3&	4&	5&	6 \\
%\hline	
%MEAN	&  2.63\% &	  5.26\% &	  7.89\% &	 26.32\% &	  7.89\% &	 50.00\% &	  0.00\% \\
%BM	&  5.26\% &	 23.68\% &	 18.42\% &	 15.79\% &	 18.42\% &	 10.53\% &	  7.89\% \\
%AR	& 21.05\% &	 13.16\% &	 13.16\% &	 18.42\% &	 23.68\% &	  7.89\% &	  2.63\% \\
%MA	&  2.63\% &	  7.89\% &	  5.26\% &	  5.26\% &	  2.63\% &	  7.89\% &	 68.42\% \\
%ARMA	&  5.26\% &	 21.05\% &	 15.79\% &	 21.05\% &	 18.42\% &	 10.53\% &	  7.89\% \\
%ARIMA	& 31.58\% &	 15.79\% &	 18.42\% &	  7.89\% &	 10.53\% &	  7.89\% &	  7.89\% \\
%ARFIMA	& 31.58\% &	 13.16\% &	 21.05\% &	  5.26\% &	 18.42\% &	  5.26\% &	  5.26\% \\
%\end{tabular}
%}
%\caption{Model rankings by mean for lead 16, 8:8 params}
%\label{fig:loadpred.t-test-example}
%\normalsize
%\end{figure}


\section{Implementation of on-line host load prediction system}

Using \RPS, we were able to quickly implement an on-line host load
prediction system based on the AR(16) model.  Indeed, the initial
implementation was simply to compose the prediction components
described in Section~\ref{sec:rps.predcomp}.  The predserver component
can have its predictive model changed at run-time, and so it is
trivial to configure it to use the AR(16) model.  Later, we
constructed a monolithic system with identical features. Both systems
are described in Section~\ref{sec:rps.on-line_perf}, which also
evaluates their performance and the added load they place on the host.
To reiterate the conclusions, both the composed and monolithic systems
have virtually unmeasurable overheads when used to predict the 1 Hz
load signal with AR(16) models.  Furthermore, they can be used at
measurement rates 2--3 orders of magnitude higher than we require
before saturating the CPU.

In the following chapters, we use the monolithic host load prediction
system with the MEAN, LAST, and AR(16) models.  This set of models
lets us compare against the raw signal variance, an ad hoc predictor,
and the formal predictor that we have found to be most appropriate.
For each of these models, the system supplies predictions, mean
squared error estimates, and estimates of the covariances of the
prediction errors.  In the case of the MEAN model,
the mean squared error estimates are simply the measured variance of
the signal, while the covariances correspond to the measured
autocorrelation structure of the host load signal.  For the LAST and
AR(16) models, the mean squared error estimates and covariances are
computed from the fitted model (LAST is treated as an AR(1)).
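For the AR-based models, the lead-$k$ mean squared error estimates
follow from the model's impulse response: if $\sigma_a^2$ is the
fitted noise variance and $\langle\psi_j\rangle$ are the model's psi
weights, the $k$-step error variance is
$\sigma_a^2\sum_{j=0}^{k-1}\psi_j^2$.  A sketch (function names ours):

```python
def ar_psi_weights(phi, k):
    """First k impulse-response (psi) weights of an AR model with
    coefficients phi, where phi[j] weights lag j+1.  psi[0] is 1."""
    psi = [1.0]
    for i in range(1, k):
        psi.append(sum(phi[j] * psi[i - 1 - j]
                       for j in range(min(len(phi), i))))
    return psi

def ar_lead_mse(phi, noise_variance, k):
    """Expected squared error of a k-step-ahead AR prediction:
    noise_variance times the sum of the first k squared psi weights."""
    psi = ar_psi_weights(phi, k)
    return noise_variance * sum(w * w for w in psi)
```

Treating LAST as an AR(1) with coefficient 1 makes every psi weight 1,
so its lead-$k$ error variance estimate grows linearly as
$k\sigma_a^2$.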


\section{Conclusions}
\label{sec:loadpred.conclusions}

In this chapter, we applied the final two steps of the resource signal
methodology of Section~\ref{sec:rps.newpredsys} to host load signals,
specifically the 1 Hz Digital Unix five-second load average.  The last
chapter studied traces of such signals and concluded that linear
models such as the Box-Jenkins models (AR, MA, ARMA, and ARIMA) might
be appropriate for predicting them.  However, it also found that host
load was self-similar, which suggested that the more complex ARFIMA
model might be needed.  In addition, we found that host load exhibits
epochal behavior, which could rule out linear models altogether.

To determine which, if any, of these models is truly appropriate, we
studied their performance by running almost 200,000 randomized
testcases on the August, 1997 set of load traces and then analyzing
the results. We looked at predictions from 1 to 30 seconds into the
future.  In addition to the AR, MA, ARMA, ARIMA, and ARFIMA models, we 
also looked at the more intuitive BM and LAST models.  

The main contribution of our evaluation is to show, in a rigorous
manner, that host load on real systems is predictable to a very useful
degree from past behavior by using linear time series techniques.  In
addition, we discovered that, while there are statistically
significant differences between the different classes of models we
studied, the marginal benefits of the more complex models do not
warrant their much higher run-time costs.  This is a somewhat
surprising result given the complex properties we identified in the
last chapter.  Finally, we reached the conclusion that AR models of
order 16 or higher are sufficient for predicting the host load signal
up to 30 seconds in the future.  These models work very well and are
very inexpensive to fit to data and to use.

Using \RPS, we implemented an on-line host load prediction system that
uses the AR(16) model.  This extremely low overhead system is
described and evaluated in Section~\ref{sec:rps.on-line_perf}.  In the
next chapter, we will use the predictions of this system as the basis
for predicting the running time of a task.

\comment{
However, it is clear that host load signals are not generated by
linear systems.  Other techniques, such as threshold autoregressive
models~\cite{TAR-TONG-BOOK} or Abarbanel's methodology for chaotic
data~\cite{ABARBANEL-CHAOTIC-DATA-BOOK} may be more appropriate.  We
have applied parts of Abarbanel's approach to our traces with
promising results.  However, it may be more appropriate for finer or
coarser grain load prediction---we are satisfied with simple linear
methods for the regime we studied in this chapter.  We are currently
studying bandwidth and latency prediction in local area networks using
the evaluation framework described in this chapter.
}



%
% Old junk
%
%


\comment{
\begin{figure}[tbp]
\centerline{
\colfigsize
\epsfbox{eps/unix16-simple.epsf}
}
\caption
[Benefits of prediction, server machine]
{Benefits of prediction: server machine with 40 users and
long term load average of 10.}
\label{fig:loadpred.extreme}
\end{figure}

\begin{figure}[tbp]
\centerline{
\colfigsize
\epsfbox{eps/axpfea-simple.epsf}
}
\caption
[Benefits of prediction, interactive machine]
{Benefits of prediction: interactive cluster machine with long term
load average of 0.17.}
\label{fig:loadpred.argus-bm-better}
\end{figure}
}

\comment{
Better load predictions can lead to drastically smaller confidence
intervals on real hosts, both on those with very high overall load,
such as shared servers, and those with very low overall load, such as
desktop workstations.  Figure~\ref{fig:loadpred.extreme} shows the benefits of
good predictions on a heavily loaded server.  The figure plots the
length of the confidence interval for the running time of a one second
task as a function of how far ahead the load predictions are made.
The confidence intervals for a predictive AR($9$) model (described in
Section~\ref{sec:loadpred.ltsmodels}) and for the raw variance of the load
signal itself are represented.  Notice that for short-term
predictions, the AR($9$) model provides confidence intervals that are
almost an order of magnitude smaller than the raw signal.  For
example, when predicting 5 seconds ahead, the confidence interval for
the AR($9$) model is less than 1 second while the confidence interval for the
raw load signal is about 4 seconds.  Figure~\ref{fig:loadpred.argus-bm-better}
shows that such benefits are also possible on a very lightly loaded
machine.
}


\comment{
The more complex models do
indeed tend to {\em fit} the data of a load trace better than simple
models.  For example, fractional ARFIMA models had as much as 40\%
lower mean squared error in some cases.  Surprisingly, however, we
found that simple autoregressive models performed sufficiently well
for short range {\em prediction}~\cite{DINDA-LOAD-PRED-TR-98}.  While
there were statistically significant differences between the models we
studied, they were not sufficiently large to warrant the use of more
complex models.  Our recommendation is to use autoregressive models of
order 16 or higher for prediction horizons of up to 30 seconds.

%The context of our work is dynamically mapping real-time tasks to
%hosts in a distributed environment. In this context, we are interested
%in load models for classifying environments and hosts, synthesizing
%realistic running times in a simulator, and for predicting running
%times on remote hosts.  We will gladly share the load traces we have
%collected with members of the research community.
}


\comment{
\section{Statistical properties of load traces}
\label{sec:loadpred.stats}

%We collected and studied long, fine grain load traces from a wide
%variety of machines in order to discover common statistical properties
%and their implications for load prediction.  This work, which we
%summarize here, is reported in detail in~\cite{DINDA-STAT-PROPS-HOST-LOAD-LCR98}.  The
%results suggest that classical Box-Jenkins linear models may be
%appropriate for load prediction, but that large numbers of parameters
%may be necessary to capture its complicated dynamics.  Two interesting
%new results are that load signals are self-similar and that they
%display what we call epochal behavior.  Self-similarity suggests the
%use of expensive ARFIMA linear models, which can capture long-range
%dependence, may be appropriate.  Epochal behavior suggests that
%reasonably frequent model refitting, piecewise linear models (such as
%TAR models~\cite{TAR-TONG-BOOK}), or locally weighted regression on an
%``unfolded'' load signal~\cite{ABARBANEL} may be appropriate.  This
%chapter studies the performance of the classical Box-Jenkins models and
%the ARFIMA model.  In Section~\ref{eval} we evaluate these schemes on
%these load traces we introduce in this section.

%\subsection{Measurement methodology}
%\label{sec:loadpred.load_trace_measurement}

The load on a Unix system at any given instant is the number of
processes that are running or are ready to run, which in turn is the
length of the ready queue maintained by the scheduler.  The kernel
samples the length of the ready queue at some rate and exponentially
averages some number of previous samples to produce a load average
which can be accessed from a user program.  We wrote a small tool to
sample this value on Digital Unix (DUX) machines.  On these machines,
the time constant of the exponential average is five seconds.  By
subjecting these systems to varying loads and sampling at
progressively higher rates, we determined that DUX updates the load
average value at a rate of $1/2$ Hz.  We chose to sample at 1 Hz in
order to capture all of the load signal's dynamics.  We ran this trace
collection tool on 38 hosts belonging to the Computing, Media, and
Communication Laboratory (CMCL) at CMU and to the Pittsburgh
Supercomputing Center (PSC) for slightly more than one week in late
August, 1997.  A second set of week-long traces was acquired on almost
exactly the same set of machines in late February through early March,
1998.  The results of the statistical analysis were similar for the
two sets of traces.  In this chapter, we describe and use the August,
1997 set.  A more detailed description of our analysis can be found
in~\cite{DINDA-STAT-PROPS-HOST-LOAD-LCR98,DINDA-STAT-PROPS-HOST-LOAD-TR}
and the traces themselves can be found on the
web.~\footnote{http://www.cs.cmu.edu/\~{}pdinda/LoadTraces}

%http://www.cs.cmu.edu/~pdinda/LoadTraces.

All of the hosts in the August, 1997 set were DEC Alpha DUX machines,
and they form four classes: {\em Production Cluster}, 13 hosts of the
PSC's ``Supercluster'', including two front-end machines (axpfea,
axpfeb), four interactive machines (axp0 through axp3), and seven
batch machines); {\em Research Cluster}, eight machines in an
experimental cluster in our lab (manchester-1 through manchester-8));
{\em Compute servers}, two high performance large memory machines used
by our group as compute servers for simulations and the like (mojave
and sahara); and {\em Desktops}, 15 desktop workstations owned by
members of our research group (aphrodite through zeno).


\begin{figure}[tbp]
\centerline{
\pagefigsize
\epsfbox{eps/Aug97BoxPlot.eps}
}
\caption
[Statistical summary of load traces]
{Statistical summary of load traces.}
\label{fig:loadpred.trace_stats}
\end{figure}

Figure~\ref{fig:loadpred.trace_stats} summarizes these traces in a Box plot
variant.  The central line in each box marks the median load, while
the lower and upper lines mark the 25th and 75th percentiles.  The
lower and upper ``whiskers'' extending from the box mark the actual
2.5th and 97.5th percentiles.  The circle marks the mean and the
triangles mark the 2.5th and 97.5th percentiles assuming a normal
distribution with the trace's mean and variance.  

The following points summarize the results of our statistical analysis
of these traces that are relevant to this study:

(1) The traces exhibit low means but very high variability, measured
by the variance, interquartile range, and maximum.  Only four traces
have mean loads near 1.0.  The standard deviation (square root of the
variance) is typically at least as large as the mean, while the
maximums can be as much as two orders of magnitude larger.  This high
variability suggests that there is considerable opportunity for
prediction algorithms.

(2) Measures of variability, such as the variance and maximum, are
positively correlated with the mean, so a machine with a high mean
load will also tend to have a large variance and maximum.  This
correlation suggests that there is more opportunity for prediction
algorithms on more heavily loaded machines.

(3) The traces have relatively complex, sometimes multimodal
distributions that are not well fitted by common analytic
distributions.  However, we note here that assuming normality (but
disallowing negative values) for the purpose of computing a 95\%
confidence interval is not unreasonable, as can be seen by comparing
the 2.5th and 97.5th percentile whiskers and the corresponding normal
percentiles (triangles) computed from the mean and variance in
Figure~\ref{fig:loadpred.trace_stats}.

(4) Time series analysis of the traces shows that load is strongly
correlated over time.  The autocorrelation function typically decays
very slowly while the periodogram shows a broad, almost noise-like
combination of all frequency components.  An important implication is
that linear models may be appropriate for predicting load signals.
However, the complex frequency domain behavior suggests such models
may have to be of unreasonably high order.

\begin{figure}[tbp]
\centerline{
\colfigsize
\epsfbox{eps/hurst_all_automated_aug97.eps}
}
\caption
[Hurst parameter estimates]
{Hurst parameter estimates.}
\label{fig:loadpred.trace_hurst}
\end{figure}

%
% Decided to get rid of self-similarity citation altogether - we cite
% the lcr chapter earlier
%
(5) The traces exhibit self-similarity.  We estimated their Hurst
parameters~\cite{HURST51} using four different nonparametric
methods~\cite{TAQQU-HURST-ESTIMATORS-COMPARE} which provided the
results in Figure~\ref{fig:loadpred.trace_hurst}.  The traces' average Hurst
parameter estimates ranged from 0.73 to 0.99, with a strong bias
toward the top of that range.  Hurst parameters in the range of 0.5 to
1.0 indicate self-similarity with positive near-neighbor correlations.
This result tells us that load varies in complex ways on all time
scales and has long-range dependence.  Long-range suggests that using
the fractional ARIMA (ARFIMA) modeling
approach~\cite{HOSKING-FRACT-DIFF,GRANGER-JOYEUX-FRACT-DIFF,BERAN-STAT-LONG-RANGE-METHODS}
may be appropriate.

%\begin{figure}[tbp]
%\centerline{
%\colfigsize
%\epsfbox{eps/sdev_mean_epoch.epsf}
%}
%\caption{Mean epoch lengths and +/- one standard deviation.}
%\label{fig:loadpred.trace_epoch}
%\end{figure}

(6) The traces display what we term ``epochal behavior''.  The local
frequency content (measured by using a spectrogram) of the load signal
remains quite stable for long periods of time (150-450 seconds mean,
with high standard deviations), but changes abruptly at the boundaries
of such epochs.  Such abrupt transitions are likely to be due to
processes being created, destroyed, or entering new execution phases.
This result implies that linear models may have to be refit at these
boundaries, or that threshold models~\cite{TAR-TONG-BOOK} or other
nonlinear or chaotic dynamics methods may be appropriate.  We do not
investigate these methods in this chapter.


%Figure~\ref{fig:loadpred.trace_epoch} shows the mean epoch
%length and standard deviation of epoch length for each of the traces.


\section{Linear time series models}
\label{sec:loadpred.ltsmodels}

\comment{

\begin{figure}[tbp]
\centerline {
\pagefigsize
\epsfbox{eps/ltsmodel.eps}
}
\caption
[Linear time series models for load prediction]
{Linear time series models for load prediction.}
\label{fig:loadpred.ltsmodels}
\end{figure}

The main idea behind using a linear time series model in load
prediction is to treat the sequence of periodic samples of host load,
$\langle z_t \rangle$, as a realization of a stochastic process that 
can be modeled as a white noise source driving a linear filter.  The
filter coefficients can be estimated from past observations of the
sequence.  If most of the variability of the sequence results from the
action of the filter, we can use its coefficients to estimate future
sequence values with low mean squared error.

Figure~\ref{fig:loadpred.ltsmodels} illustrates this decomposition.  In keeping
with the relatively standard Box-Jenkins
notation~\cite{BOX-JENKINS-TS}, we represent the input white noise
sequence as $\langle a_t \rangle$ and the output load sequence as
$\langle z_t \rangle$.  On the right of Figure~\ref{fig:loadpred.ltsmodels} we
see our partially predictable sequence $\langle z_t \rangle$, which
exhibits some mean $\mu$ and variance $\sigma_z^2$.  On the left, we
see our utterly unpredictable white noise sequence $\langle a_t
\rangle$, which exhibits a zero mean and a variance $\sigma_a^2$.  In
the middle, we have our fixed linear filter with coefficients $\langle
\psi_j
\rangle$.  Each output value $z_t$ is the sum of the current noise
input $a_t$ and all previous noise inputs, weighted by the
$\langle\psi_j\rangle$ coefficients.

Given an observed load sequence $\langle z_t \rangle$, the optimum
values for the coefficients $\psi_j$ are those that minimize
$\sigma_a^2$, the variance of the driving white noise sequence
$\langle a_t \rangle$.  Notice that the one-step-ahead prediction
given all the data up to and including time $t$ is $\hat{z}_{t}^1 =
\sum_{j=0}^\infty \psi_ja_{t-j}$, since the expected value of
$a_{t+1}$ is zero.  The noise sequence consists simply of the
one-step-ahead prediction errors and the optimal coefficient values
minimize the sum of squares of these prediction errors.

The general form of the linear time series model is, of course,
impractical, since it involves an infinite summation using an infinite
number of completely independent weights.  Practical linear time
series models use a small number of coefficients to represent infinite
summations with restrictions on the weights, as well as special casing
the mean value of the sequence, $\mu$.  To understand these models, it
is easiest to represent the weighted summation as a ratio of
polynomials in $B$, the backshift operator, where $B^dz_t =z_{t-d}$.
For example, we can write $z_t = \sum_{j=1}^\infty \psi_ja_{t-j}+a_t$
as $z_t =
\psi(B)a_t$ where $\psi(B)=1+\psi_1B+\psi_2B^2+\ldots$  Using this
scheme, the models we examine in this chapter can be represented as
\begin{equation}
z_t = \frac{\theta(B)}{\phi(B)(1-B)^d}a_t + \mu
\label{eqn:loadpred.ts_poly}
\end{equation}
where the different model classes we examine in this chapter (AR, MA,
ARMA, ARIMA, ARFIMA, BM, and MEAN) constrain $\theta(B)$, $\phi(B)$, and
$d$ in different ways.  In the signal processing domain, this kind of
filter is known as a pole-zero filter.  The roots of $\theta(B)$ are
the zeros and the roots of $\phi(B)(1-B)^d$ are the poles.  In
general, such a filter can be unstable in that its outputs can rapidly
diverge from the input signal.  This instability is extremely
important from the point of view of the implementor of a load
prediction system.  Such a system will generally fit a model (choose
the $\theta(B)$ and $\phi(B)$ coefficients, and $d$) using some $n$
previous observations.  The model will then be ``fed'' the next $m$
observations and asked to make predictions in the process.  If
coefficients are such that the filter is unstable, then it may explain
the $n$ observations very well, yet fail miserably and even diverge
(and crash!) when used on the $m$ observations after the fitting.


\subsection{AR($p$) models}
The class of AR($p$) models (purely autoregressive models) have $z_t =
\frac{1}{\phi(B)}a_t + \mu$ where $\phi(B)$ has $p$ coefficients.
Intuitively, the output value is a $\phi$-weighted sum of the $p$
previous output values.  The $t+1$ prediction for a load sequence is the
$\phi$-weighted sum of the $p$ previous load measurements.  The $t+2$
prediction is the $\phi$-weighted sum of the $t+1$ prediction and the
$p-1$ previous load measurements, and so on.

From the point of view of a system designer, AR($p$) models are highly
desirable since they can be fit to data in a deterministic amount of
time.  In the Yule-Walker technique that we used, the autocorrelation
function is computed to a maximum lag of $p$ and then a $p$-wide
Toeplitz system of linear equations is solved.  Even for relatively
large values of $p$, this can be done almost instantaneously.

\subsection{MA($q$) models}
The class of MA($q$) models (purely moving average models) have $z_t =
\theta(B)a_t$ where $\theta(B)$ has $q$ coefficients.  Intuitively,
the output value is the $\theta$-weighted sum of the current and the
$q$ previous input values.  The $t+1$ prediction of a load sequence is
the $\theta$-weighted sum of the $q$ previous $t+1$ prediction errors.
The $t+2$ prediction is the $\theta$-weighted sum of the predicted
$t+1$ prediction error (zero) and the $q-1$ previous $t+1$ prediction
errors, and so on.

MA($q$) models are a much more difficult proposition for a system
designer since fitting them takes a nondeterministic amount of time.
Instead of a linear system, fitting a MA($q$) model presents us with a
quadratic system.  Our implementation, which is nonparametric (i.e., it
assumes no specific distribution for the white noise source), uses the
Powell procedure to minimize the sum of squares of the $t+1$
prediction errors.  The number of iterations necessary to converge is
nondeterministic and data dependent.

\subsection{ARMA($p$,$q$) models}
The class of ARMA($p$,$q$) models (autoregressive moving average models) have
$z_t=\frac{\theta(B)}{\phi(B)}a_t + \mu$ where $\phi(B)$ has $p$
coefficients and $\theta(B)$ has $q$ coefficients.  Intuitively, the
output value is the $\phi$-weighted sum of the $p$ previous output
values plus the $\theta$-weighted sum of the current and $q$ previous
input values.  The $t+1$ prediction for a load sequence is the
$\phi$-weighted sum of the $p$ previous load measurements plus the
$\theta$-weighted sum of the $q$ previous $t+1$ prediction errors.
The $t+2$ prediction is the $\phi$-weighted sum of the $t+1$ prediction
and the $p-1$ previous measurements plus the $\theta$-weighted sum of
the predicted $t+1$ prediction error (zero) and the $q-1$ previous
prediction errors, and so on.

By combining the AR($p$) and MA($q$) models, ARMA($p$,$q$) models hope
to achieve greater parsimony---using fewer coefficients to explain
the same sequence.  From a system designer's point of view, this may
be important, at least in so far as it may be possible to fit a more
parsimonious model more quickly.  Like MA($q$) models, however,
ARMA($p$,$q$) models take a nondeterministic amount of time to fit to
data, and we use the same Powell minimization procedure to fit them.

\subsection{ARIMA($p$,$d$,$q$) models}
The class of ARIMA($p$,$d$,$q$) models (autoregressive integrated
moving average models) implement Equation~\ref{eqn:loadpred.ts_poly} for
$d=1,2,\ldots$ Intuitively, the $(1-B)^d$ component amounts to a
$d$-fold integration of the output of an ARMA($p$,$q$) model.
Although this makes the filter inherently unstable, it allows for
modeling nonstationary sequences.  Such sequences can vary over an
infinite range and have no natural mean.  Although load clearly cannot
vary infinitely, it doesn't have a natural mean either.

ARIMA($p$,$d$,$q$) models are fit by differencing the sequence $d$
times and fitting an ARMA($p$,$q$) model as above to the result.  

\subsection{ARFIMA($p$,$d$,$q$) models}
The class of ARFIMA($p$,$d$,$q$) models (autoregressive fractionally
integrated moving average models) implement Equation~\ref{eqn:loadpred.ts_poly}
for fractional values of $d$, $0<d<0.5$.  By analogy to
ARIMA($p$,$d$,$q$) models, ARFIMAs are fractionally integrated
ARMA($p$,$q$) models.  The details of fractional
integration~\cite{HOSKING-FRACT-DIFF,GRANGER-JOYEUX-FRACT-DIFF} are
not important here other than to note that $(1-B)^d$ for fractional
$d$ is an infinite sequence whose coefficients are functions of $d$.
The idea is that this infinite sequence captures long range dependence
while the ARMA coefficients capture short range dependence.  Since our
sequences exhibit long-range dependence, even after differencing,
ARFIMAs may prove to be beneficial models.

To fit ARFIMA models, we use Fraley's Fortran 77
code~\cite{FRALEY-FRACDF}, which does maximum likelihood estimation of
ARFIMA models following Haslett and
Raftery~\cite{HASLETT-RAFTERY-LT-DEP-WIND}.  This implementation is
also used by commercial packages such as S-Plus.  We truncate
$(1-B)^d$ at 300 coefficients and use the same representation and
prediction engine as with the other models.

\subsection{Simple models for comparison}
In addition to the Box-Jenkins models and the ARFIMA model described
above, we also use two very simple models for comparison, MEAN and
BM. MEAN has $z_t =\mu$, so all future values of the sequence are
predicted to be the mean.  This is the best predictor, in terms of
minimum mean squared error, for a sequence which has no correlation
over time.  It is useful for measuring the raw variance in the load
signal and the benefits accrued from exploiting the correlations that
exist.  The BM model is an AR($p$) model whose coefficients are all
set to $1/p$.  This simply predicts the next sequence value to be the
average of the previous $p$ values, a simple windowed mean.  $p$ is
chosen to minimize mean squared error for $t+1$ predictions.
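A minimal sketch of the BM predictor (illustrative only):

```python
from collections import deque

class BMPredictor:
    """Predicts the next value as the mean of the last p observations,
    i.e., an AR(p) model with all coefficients fixed at 1/p."""
    def __init__(self, p):
        self.window = deque(maxlen=p)   # holds the last p samples

    def step(self, z):
        self.window.append(z)

    def predict(self):
        return sum(self.window) / len(self.window)
```

In the study, $p$ would be swept over a range and the value with the
smallest one-step-ahead mean squared error retained.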

\subsection{Making predictions}
After fitting one of the above models, we construct a predictor from
it.  The predictor consists of the model converted to a uniform form,
the predicted next sequence value, a queue that holds the last $p+d$
($d=300$ for ARFIMA models) sequence values, and a queue that holds the
last $q$ prediction errors.  When the next sequence value becomes
available, it is pushed onto the sequence queue, its corresponding
predicted value's absolute error is pushed onto the error queue, and
the model is evaluated ($O(p+d+q)$ operations) to produce a new
predicted next sequence value.  We refer to this as {\em stepping} the
predictor.  At any point, the predictor can be queried for the
predicted next $k$ values of the sequence, along with their expected
mean squared errors ($O(k(p+d+q))$ operations).
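The stepping procedure can be sketched for the pure AR($p$) case, which
needs only the sequence-history queue; the ARMA and ARIMA cases add the
error and differencing queues described above.  This is an illustrative
sketch, not the \RPS\ implementation:

```python
from collections import deque

class ARPredictor:
    """One-step AR(p) predictor: z_hat = mu + sum_i a_i (z_{t-i} - mu)."""
    def __init__(self, coeffs, mean=0.0):
        self.a = coeffs                      # AR coefficients a_1..a_p
        self.mu = mean
        self.hist = deque(maxlen=len(coeffs))  # most recent sample first

    def step(self, z):
        """Push a new observation; return its one-step prediction error."""
        err = z - self.predict() if len(self.hist) == len(self.a) else 0.0
        self.hist.appendleft(z)
        return err

    def predict(self, k=1):
        """Predict k steps ahead by feeding predictions back as inputs."""
        hist = deque(self.hist, maxlen=len(self.a))
        z = self.mu
        for _ in range(k):
            z = self.mu + sum(a * (h - self.mu)
                              for a, h in zip(self.a, hist))
            hist.appendleft(z)
        return z
```

Each `step` costs $O(p)$ operations here, and querying $k$ future values
costs $O(kp)$, matching the complexity noted above.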
}


Consider an application program that wants to schedule a compute-bound
soft real-time task in a typical distributed computing environment.
By using mechanisms such as CORBA~\cite{CORBA-23-ARCH-SPEC-TR}
or Java RMI~\cite{JAVA-RMI-SPEC}, the task can be executed on any of
the host machines in the network.  Given this freedom, the application
can choose the host on which the task's deadline is most likely to be
met.

If the application could predict the exact running time of the task on
each of the hosts, choosing a host would be easy.  However, such
predictions are unlikely to be exact due to the dynamic nature of the
distributed system.  Each of the hosts is acting independently, its
vendor-supplied operating system scheduling tasks initiated by other
users, paying no special attention to the task the application is
trying to schedule.  The computational load on each of the hosts can
vary drastically over time.

Because of this dynamic nature, the application's predictions of the
task's running time on each of the hosts have confidence intervals
associated with them.  Better predictions lead to smaller confidence
intervals, which makes it easier for the application to choose between
the hosts, or to decide how likely a particular host is to meet the
deadline.

The running time of a compute-bound task on a host is strongly related
to the computational load on the host. If we could
predict the load that a task would encounter on some host, we could
predict the running time on that host.  Better
load predictions lead to better running time predictions and thus to
smaller confidence intervals.  For this reason, we concentrate on host
load prediction here.


Clear benefits of this kind motivate the following questions: Is load
consistently predictable?  If so, what classes of predictive models
are appropriate?  What are the differences between these different
classes?  Finally, what class can be recommended for use in a real
system?  This chapter describes the results of a study to answer these
questions.


Our measure of host load is the Digital Unix five-second load average,
which is closely related to the execution time of compute-bound tasks
(Section~\ref{sec:loadpred.load_running_time}). We collected a large number of
1 Hz benchmark load traces and subjected them to a detailed
statistical analysis, which we summarize in Section~\ref{sec:loadpred.stats}.

}

%Section~\ref{sec:loadpred.related_work} puts our study in the context of other
%work in the area, and Section~\ref{sec:loadpred.conclusions} summarizes and
%concludes with a discussion of future directions of this research.


%A typical
%testcase in our study fit a randomly selected model with a randomly
%selected number of up to 8 parameters to a randomly selected five
%minute to three hour subsequence of a trace.  This fitted model was
%then evaluated on an immediately following subsequence with a length
%randomly selected from five minutes to three hours.  As each
%subsequent value in this test sequence became available, the fitted
%model was used to make 1,2,3,...,30 second ahead predictions.  At the
%end of the testcase, the mean squared error for each of the prediction
%horizons, as well as a variety of other metrics was recorded.  It is
%from an aggregate of about 152,000 such testcases (about 4000 per
%trace)


\comment{
\section{Host load and running time}
\label{sec:loadpred.load_running_time}

\begin{figure}[tbp]
\centerline{
\colfigsize
\epsfbox{eps/load2time.eps}
}
\caption
[Relationship between host load and running time]
{Relationship between average load during execution and execution time}
\label{fig:loadpred.load2time}
\end{figure}

For CPU-bound tasks, there is an intuitive relationship between host
load and execution time.  Consider Figure~\ref{fig:loadpred.load2time}, which
plots execution time versus average load experienced during execution
for tasks consisting of simple wait loops.  The data was generated by
running variable numbers of these tasks together, at identical
priority levels, on an otherwise unloaded Digital Unix machine.  Each
of these tasks sampled the Digital Unix five-second load average at
roughly one second intervals during their execution and at termination
printed the average of these samples as well as their execution time.
It is these pairs that make up the 42,000 points in the figure.
Notice that the relationship between the measured load during
execution and the execution time is almost perfectly linear
($R^2>0.99$).

If load were presented as a continuous signal, we would summarize this
relationship between execution time and load as 
\begin{displaymath}
\int_{t_{now}}^{t_{now}+t_{exec}}\frac{1}{1+z(t)}dt=t_{nominal}
\end{displaymath}
where $t_{now}$ is when the task begins executing, $t_{nominal}$ is
the execution time of the task on a completely unloaded machine,
$(1+z(t))$ is the load experienced during execution ($z(t)$ is the
continuous ``background'' load), and $t_{exec}$ is the task's
execution time.  Notice that the integral simply computes the inverse
of the average load during execution.  In practice, we can only sample
the load with some noninfinitesimal sample period $\Delta$ so we can
only approximate the integral by summing over the values in the sample
sequence.  In this chapter, $\Delta=1$ second, so we will write such a
sequence of load samples as $\langle z_t \rangle
=\ldots,z_{t-1},z_t,z_{t+1},\ldots$. We will also refer to such a
sequence as a {\em load signal}.
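In discrete time the integral becomes a sum over samples of length
$\Delta$, and a task's execution time can be estimated by accumulating
nominal work completed at rate $1/(1+z_t)$ per sample.  A minimal
sketch, assuming one-second samples:

```python
def execution_time(load_samples, t_nominal, delta=1.0):
    """Estimate the execution time of a task needing t_nominal seconds
    of CPU on an unloaded machine, given the background load sequence
    z_t it encounters.  Work accrues at rate 1/(1+z_t) per sample."""
    work, t = 0.0, 0.0
    for z in load_samples:
        work += delta / (1.0 + z)
        t += delta
        if work >= t_nominal:
            return t
    raise ValueError("trace too short to complete the task")
```

For a constant background load of 1.0 the task shares the CPU equally
with one competitor, so a 5-second nominal task takes 10 seconds.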
}

\comment{
If we knew $z_{t+k}$ for $k>0$, we could directly compute the
execution time.  Unfortunately, the best we can do is predict these
values in some way.  The quality of these predictions will then
determine how tightly we can bound the execution time of our task.  We
measure prediction quality by the {\em mean squared error}, which is
the average of the square of the difference between predicted values
and actual values.  Note that two-step-ahead predictions ($k=2$)
have a different (and probably higher) mean squared error than
one-step-ahead predictions ($k=1$), and so on.

For the simplest prediction, the long-term mean of the signal, the
mean squared error is simply the variance of the load signal.  As we
shall see in Section~\ref{sec:loadpred.stats}, load is highly variable
and exhibits other complex properties, which leads to very loose
estimates of execution time.  The hope is that sophisticated
prediction schemes will have much better mean squared errors.
}


\comment{
Our evaluation methodology, which we describe in detail in
Section~\ref{sec:loadpred.method}, is to run randomized testcases on benchmark
load traces.  The testcases (152,000 in all, or about 4000 per trace)
were randomized with respect to model class, the number of model
parameters and their distribution, the trace subsequence used to fit
the model, and the subsequent trace subsequence used to test the
model.  We collected a large number of metrics for each testcase, but
we concentrate on the mean squared error metric in this chapter.  This
metric is directly comparable to the raw variance in the load signal,
and, with a normality assumption, can be translated into a confidence
interval for the running time of a task.
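For example, under the normality assumption a prediction with mean
squared error $\sigma^2$ translates into the confidence interval
$\hat{z} \pm z_{\alpha/2}\,\sigma$.  A minimal sketch for the 95\% case:

```python
import math

def confidence_interval(prediction, mse, z_crit=1.96):
    """95% confidence interval for a predicted value, assuming the
    prediction errors are normally distributed with variance mse."""
    half = z_crit * math.sqrt(mse)
    return prediction - half, prediction + half
```

A smaller mean squared error thus directly tightens the interval, which
is what makes the metric meaningful for running-time estimation.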

The results of our evaluation are presented in
Section~\ref{sec:loadpred.results}.  We find that load is consistently
predictable to a useful degree from past behavior.  Except for the MA
models, which performed quite badly, the predictive models were all
roughly similar, although statistically significant differences were
indeed found.  These marginal differences do not seem to warrant the
use of the more sophisticated model classes because their run-time
costs are much higher.  Our recommendation is to use AR models of
relatively high order (16 or better) for load prediction within the
regime we studied.

}