% most useful Latex command
\newcommand{\comment}[1]{}

\documentclass[twocolumn]{article}
\usepackage{fullpage}
\usepackage{epsf}
\usepackage{times}
%\usepackage{changebar}
%\usepackage{lscape}

%\input{restore}

\renewcommand{\dbltopfraction}{.9}
\renewcommand{\topfraction}{.9}
\renewcommand{\textfraction}{.1}
\renewcommand{\floatpagefraction}{.9}
%\renewcommand{\dbltopnumber}{1}
\renewcommand{\dblfloatpagefraction}{.9}

%\def\normspacing {\renewcommand{\baselinestretch}{0.9}}
%\def\tinyspacing {\renewcommand{\baselinestretch}{0.7}}
\def\normspacing {}
\def\tinyspacing {}

\normspacing

\def\thesection{\arabic{section}}
\def\thefigure{\arabic{figure}}

%\newcounter {subsubsubsection}[subsubsection]
%\def\thesubsubsubsection {\thesubsubsection.\arabic{subsubsubsection}}
%\def\subsubsubsection {\subsubsection*}

\makeatletter\long\def\unmarkedfootnote#1{{\long\def\@makefntext##1{##1}\footnotetext{#1}
}}

\def\pagefigwidth {\epsfxsize=5in}
\def\colfigwidth {\epsfxsize=2.5in}
\def\figfontsize {\tiny}
\def\leadfigwidth{\epsfxsize=2in}
%\def\absfigsize {\epsfxsize=3.0in}
%\def\relfigsize {\epsfxsize=3.0in} 


\begin{document}

\bibliographystyle{acm}

\title{High Quality Linear Prediction of Host Load}

\author{Peter A. Dinda \hspace{.5in} David R. O'Hallaron \\
Carnegie Mellon University\\ 5000 Forbes Avenue \\ Pittsburgh, PA
15213 \\ }
\date{}
\maketitle

\unmarkedfootnote{
Effort sponsored in part by the Advanced Research Projects Agency and
Rome Laboratory, Air Force Materiel Command, USAF, under agreement
number F30602-96-1-0287, in part by the National Science Foundation
under Grant CMS-9318163, and in part by a grant from the Intel
Corporation. The U.S. Government is authorized to reproduce and
distribute reprints for Governmental purposes notwithstanding any
copyright annotation thereon.  The views and conclusions contained
herein are those of the authors and should not be interpreted as
necessarily representing the official policies or endorsements, either
expressed or implied, of the Advanced Research Projects Agency, Rome
Laboratory, or the U.S. Government.
}

\begin{abstract}
%
% This says everything important but is still too long
%
The running time of a task is strongly related to the host load it
experiences, as measured by the Unix one-minute load average. High
quality predictions of that load signal can greatly reduce the
uncertainty in the expected running time of tasks, sometimes by
factors of five or more.  We studied the prediction of load for 1 to
30 seconds in the future given 1 Hz samples of it's past behavior.  We
began by collecting and analyzing a large number of long load traces.
The analysis suggested that the Box-Jenkins linear time series models
(AR, MA, ARMA, ARIMA) might be appropriate for predicting load but
that such models might require an infeasible number of parameters.  We
also found self-similarity, which suggested that expensive ARFIMA
models might be appropriate.  We next evaluated the Box-Jenkins and
ARFIMA models, as well as a simple ad hoc scheme.  We ran a large
number of randomized testcases on the traces and used the results to
rigorously compare the models' mean square error levels as well as
their consistency in providing this level of quality.  We found that
load is consistently predictable to a very useful degree from past
behavior and that simple linear models are sufficient.  Although there
are statistically significant differences in the performance of the
different models, these differences are not large enough to warrant
using more the sophisticated models given their much higher run-time
costs and the possibility of instability. Our recommendation is to use
AR models of order 16 or better.
\end{abstract}

\section{Introduction}
\label{sec:intro}

Most distributed computing environments consist of a collection of
loosely interconnected hosts running vendor operating systems.  Tasks
are initiated independently by users and are scheduled locally by a
vendor supplied operating system.  This results in the computational
load, measured by the Unix one minute exponential load average,
changing over time on each of the individual hosts.  The running time
of a particular task is strongly related to the load that it
encounters during its execution.  For a compute bound task, the
relationship is very simple --- the running time is inversely related
to the average load the task encounters during execution.

Consider an adaptive application running in such an environment.  When
the application initiates a compute bound task, it can choose which
host will execute the task.  In many cases, such as the distributed
load balancing of a parallel application or in distributed soft
real-time systems, the objective behind this decision is to minimize
or bound the running time of the task.  Given a prediction of the
future load behavior of each host, the application or underlying QoS
system~\cite{QUO-JOURNAL-98} can estimate the running time of the
task on each host and use these estimates to decide which host should
execute the task.

Unfortunately, load predictions are unlikely to be perfect, so the
estimate of a task's running time has an associated confidence
interval that depends on the quality of the underlying load
predictions.  For a given confidence level, higher quality load
predictions result in shorter confidence intervals, which simplify
choosing between the hosts.  

Good host load predictions can drastically reduce the confidence
interval for a task's running time.  On one heavily loaded shared Unix
server we studied, the 95\% confidence interval for the running time
of a one second task due to the ``raw'' variance of the load signal
was over 13 seconds.  Even a simple windowed mean prediction scheme
was able to shrink the confidence interval to about $1/4$ second while
a more sophisticated autoregressive prediction scheme reduced it
virtually to zero.  On a lightly loaded, single user desktop host, we
noted reductions from five seconds to about one second.

Such promising datapoints motivated a more rigorous study of load and
load prediction.  The study addressed the following questions: What
are the statistical properties of host load and what are their
implications for prediction?  Is load consistently predictable, and,
if so, what are the consistent differences between different classes
of models.  Finally, which model class is preferable both from the
point of view of providing accurate predictions and being reasonable
to implement in a real system?

The prediction quality metric we concentrate on in this paper (we will
touch on others) is the mean of squared prediction errors.  There are
two advantages to this metric.  First, we can directly compare the
mean squared error of a prediction scheme to the raw variance of the
load (ie, how well the mean predicts the load signal) in order to
determine the degree of benefit we get.  Second, for high confidence
levels, such as 95\%, we can easily compute reasonable confidence
intervals from the mean square error by assuming (reasonably) that the
prediction errors have a normal distribution with zero mean and the
mean square error being the variance.

We began our study by collecting long (approximately one week), high
resolution (1 Hz sample rate) traces of the Unix one minute
exponential load average on a large number (38) of machines at two
different times of the year.  The machines included a production
cluster, a research cluster, large memory compute servers, and desktop
workstations.  We performed a detailed statistical
analysis~\cite{DINDA-STAT-PROPS-HOST-LOAD-LCR98} of the load traces in
order to frame the prediction problem.  The results of the analysis
that are salient here are that (1) load is highly variable, even on
lightly loaded machines, and (2) variability tends to increase with
higher mean loads.  These two results show that considerable
opportunity exists for prediction algorithms.  We also found that (3)
load is strongly correlated over time, which suggests that linear time
series models, such as the Box-Jenkins family~\cite{BOX-JENKINS-TS}
may serve as good load predictors.  However, (4) the periodograms of
the load signals showed broad, almost noise-like combinations of all
frequency components, which implies that linear models may require an
impractical number of parameters.  An interesting result that has not
been noted before is that (5) load is self-similar.  This implies that
long range dependences exist and suggests that expensive models such
as fractionally integrated ARIMAs (ARFIMA) are appropriate.  Finally,
another interesting new result is that (6) load displays epochal
behavior --- the frequency content of the signal is not constant, but
changes abruptly at intervals spaced minutes apart.  This behavior is
very different from what could be expected from a single linear model.

Given these results, we next studied the performance of the classical
Box-Jenkins model classes (AR,MA,ARMA,ARIMA), the ARFIMA model class,
and a simple moving average scheme we call Best Mean (BM) on the first
set of traces (the second set had similar properties.)  We
concentrated on predicting load from 1 to 30 seconds into the future.
We implemented the model classes and a framework for testing them on our load
traces.  The same codebase is also used in practice, both in a load
prediction daemon that is being incorporated into the Remos resource
measurement system~\cite{REMOS-HPDC98} and into a load prediction system
condition object for BBN's QuO distributed object quality of service
system~\cite{QUO-JOURNAL-98}.  We ran approximately 152,000
testcases (about 4000 per trace) through our framework to collect the
raw data for our study.  The testcases were randomized with respect to
model class, the number of model parameters and their distribution,
the trace subsequence used to fit the model, and the subsequent trace
subsequence used to test the model.

The conclusions we draw about load prediction using these linear
methods are: (1) Load is consistently predictable to a useful degree
from past behavior; (2) Simple linear models such as BM and AR models,
which are cheap to fit and use, produce very good predictions; (3) MA
models are ineffective for predicting load; (4) There are
statistically significant differences in the performance of the
effective model classes; (5) These differences do not warrant the use of the
more sophisticated model classes because the marginal benefits are more than
offset by their considerably higher run-time cost and the possibility
of instability.  Our recommendation is to use AR models of relatively
high order for load prediction within the regimes we studied.

%A typical
%testcase in our study fit a randomly selected model with a randomly
%selected number of up to 8 parameters to a randomly selected five
%minute to three hour subsequence of a trace.  This fitted model was
%then evaluated on an immediately following subsequence with a length
%randomly selected from five minutes to three hours.  As each
%subsequent value in this test sequence became available, the fitted
%model was used to make 1,2,3,...,30 second ahead predictions.  At the
%end of the testcase, the mean square error for each of the prediction
%horizons, as well as a variety of other metrics was recorded.  It is
%from an aggregate of about 152,000 such testcases (about 4000 per
%trace)


\section{Why host load prediction?}
\label{sec:why_prediction}

Intuitively, the motivation behind predicting host load is to assist
in predicting the execution times of tasks.  Predicting execution time
is important in many contexts, including dynamic load
balancing~\cite{JADE-LANGUAGE,DOME-TECHREPORT,SIEGELL-DYNAMIC-LB-94}
and distributed soft real-time
systems~\cite{REAL-TIME-MANIFESTO-JENSEN,REALTIME-CORBA-WHITEPAPER,PREDICTABLE-NETWORK-COMPUTING-POLZE-ICDCS97}.
.

\begin{figure}
\centerline{
\colfigwidth
\epsfbox{load2time.eps}
}
\caption{Relationship between average load during execution and execution time}
\label{fig:load2time}
\end{figure}

For cpu-bound tasks, the relationship between host load and execution
time is very intuitive.  Consider Figure~\ref{fig:load2time}, which
plots execution time versus average load experienced during execution
for tasks consisting of simple wait loops.  The data was generated by
running variable numbers of these tasks together, at identical
priority levels, on an otherwise unloaded Digital Unix machine.  Each
of these tasks sampled the Unix one-minute load average at roughly one
second intervals during their execution and at termination printed the
average of these samples as well as their execution time.  It is these
pairs that make up the 42,000 points in the figure.  Notice that the
relationship between the measured load during execution and the
execution time is almost perfectly linear ($R^2>0.99$).  

If load were presented as a continuous signal, we would summarize this
relationship between execution time and load as 
\begin{displaymath}
\int_{t_{now}}^{t_{now}+t_{exec}}\frac{1}{1+z(t)}dt=t_{nominal}
\end{displaymath}
where $t_{now}$ is when the task begins executing, $t_{nominal}$ is
the execution time of the task on a completely unloaded machine,
$(1+z(t))$ is the load experienced during execution ($z(t)$ is the
continuous ``background'' load), and $t_{exec}$ is the task's
execution time.  Notice that the integral simply computes the inverse
of the average load during execution.  In practice, we can only sample
the load with some noninfinitesimal sample period $\Delta$ so we can
only approximate the integral by summing over the values in the sample
sequence.  In this paper, $\Delta=1$ second, so we will write such a
sequence of load samples as $\langle z_t \rangle
=\ldots,z_{t-1},z_t,z_{t+1},\ldots$.

If we knew $z_{t+k}$ for $k>0$ we could directly compute the execution
time.  Unfortunately, the best we can do is predict these values in
some way.  The quality of these predictions will then determine how
tightly we can bound the execution time of our task.  We measure
prediction quality by the {\em mean square error}, which is the
average of the square of the difference between predicted values and
actual values.  Note that there is a mean square error associated with
every {\em lead time} $k$.  Two-step-ahead predictions ($k=2$) have a
different (an probably higher) mean squared error than
one-step-ahead predictions ($k=1$), and so on.

For the simplest prediction, the long-term mean of the signal, the
mean square error is simply the variance of the load signal.  As we
shall see in Section~\ref{sec:stats}, load is highly variable and
exhibits other complex properties, which leads to very loose estimates
of execution time.  The hope is that sophisticated prediction schemes
will have much better mean square errors.

\begin{figure}
\centerline{
\colfigwidth
\epsfbox{unix16-arbetter.epsf}
}
\caption{An extreme example of the benefits of load prediction}
\label{fig:extreme}
\end{figure}

Figure~\ref{fig:extreme} shows an extreme example of the benefits of
two such models (models and methodology described in
Sections~\ref{sec:ltsmodels} and~\ref{sec:method}).  The load signal
(trace) used in this case here was measured on an undergraduate
UltraSparc server machine in mid September, 1998. The machine had over
40 individuals logged in and an average load of over 10.  The figure
plots the length of the the 95\% confidence interval for the execution
time of a task with a nominal execution time of one second as a
function of the lead time for three models.  For example, for a 10
second lead time, the execution time estimate based on the AR(9) model
will be accurate, +/- 1 second (2 second interval), 95\% of the time,
while the the estimate based on the Best Mean model will be accurate,
+/- 2.5 seconds (5 second interval), 95\% of the time.  Note that the
estimate based on the mean model, which exposes all of the variance of
the raw load signal, results in a 95\% interval that is over 13
seconds wide.  Clearly, good models can produce order-of-magnitude
better load predictions that translate into significant differences in
user-perceived performance.

\begin{figure}
\centerline{
\colfigwidth
\epsfbox{argus-bm-better.epsf}
}
\caption{Load prediction on one of our traces where simple model wins}
\label{fig:argus-bm-better}
\end{figure}

On less loaded machines, such as the ones we study rigorously in this
paper, improvements are less pronounced, but can still be quite
significant.  For example, Figure~\ref{fig:argus-bm-better} shows the
performance of the Best Mean model and a much more complex
ARFIMA($3,d,1$) model on a load trace taken in late August, 1997 on a
desktop workstation whose long-term load average was less than 0.5.
The interpretation of the figure is as above.  There are two things to
notice here.  First, even on this less dynamic machine, good models
resulted in significant and user-perceptible improvements in execution
time estimates.  For example, for a lead time of one second, we see
the confidence interval shrink from five seconds to about a second.
Second, the much more sophisticated ARFIMA($3,d,1$) model, which, by
our statistical analysis of this load trace (Section~\ref{sec:stats})
would seem to be much more appropriate, in fact does worse than the
very simple Best Mean model.  The crux of our study of the traces was
to discover if there are {\em consistent} differences between the
models.


\section{Statistical properties of load traces}
\label{sec:stats}

%We collected and studied long, fine grain load traces from a wide
%variety of machines in order to discover common statistical properties
%and their implications for load prediction.  This work, which we
%summarize here, is reported in detail in~\cite{DINDA-STAT-PROPS-HOST-LOAD-LCR98}.  The
%results suggest that classical Box-Jenkins linear models may be
%appropriate for load prediction, but that large numbers of parameters
%may be necessary to capture its complicated dynamics.  Two interesting
%new results are that load signals are self-similar and that they
%display what we call epochal behavior.  Self-similarity suggests the
%use of expensive ARFIMA linear models, which can capture long-range
%dependence, may be appropriate.  Epochal behavior suggests that
%reasonably frequent model refitting, piecewise linear models (such as
%TAR models~\cite{TAR-TONG-BOOK}), or locally weighted regression on an
%``unfolded'' load signal~\cite{ABARBANEL} may be appropriate.  This
%paper studies the performance of the classical Box-Jenkins models and
%the ARFIMA model.  In Section~\ref{eval} we evaluate these schemes on
%these load traces we introduce in this section.

%\subsection{Measurement methodology}
%\label{sec:load_trace_measurement}

The load on a Unix system at any given instant is the number of
processes that are running or are ready to run, which is the length of
the ready queue maintained by the scheduler.  The kernel samples the
length of the the ready queue at some rate and exponentially averages
some number of previous samples to produce a load average which can be
accessed from a user program.  We wrote a small tool to sample this
value on Digital Unix (DUX) machines.  By subjecting these systems to
varying loads and sampling at progressively higher rates, we
determined that DUX updates the load average value at a rate of $1/2$
Hz.  We chose to sample at 1 Hz in order to capture all of the load
signal's dynamics.  We ran this trace collection tool on 38 hosts
belonging to the Computing, Media, and Communication Laboratory (CMCL)
at CMU and the Pittsburgh Supercomputing Center (PSC) for slightly
more than one week in late August, 1997.  A second set of week-long
traces was acquired on almost exactly the same set of machines in late
February through early March, 1998.  The results of the statistical
analysis were similar for the two sets of traces.  In this paper, we
describe and use the August 1997 set.  A more detailed description of
our analysis can be found
in~\cite{DINDA-STAT-PROPS-HOST-LOAD-LCR98,DINDA-STAT-PROPS-HOST-LOAD-TR}
and the traces themselves can be found on the web at
http://www.cs.cmu.edu/~pdinda/LoadTraces.

All of the hosts in the August 1997 set were DEC Alpha DUX machines,
and they form four classes: {\em Production Cluster}, 13 hosts of the
PSC's ``Supercluster'', including two front-end machines (axpfea,
axpfeb), four interactive machines (axp0 through axp3), and seven
batch machines); {\em Research Cluster}, eight machines in an
experimental cluster in our lab (manchester-1 through manchester-8));
{\em Compute servers}, two high performance large memory machines used
by our group as compute servers for simulations and the like (mojave
and sahara); and {\em Desktops}, 15 desktop workstations owned by
members of our research group (aphrodite through zeno).


\begin{figure}
\centerline{
\epsfxsize=3.5in
\epsfbox{Aug97BoxPlot.eps}
}
\caption{Statistical summary of load traces.}
\label{fig:trace_stats}
\end{figure}

Figure~\ref{fig:trace_stats} summarizes these traces in a Box plot
variant.  The central line in each box marks the median load, while
the lower and upper lines mark the 25th and 75th percentiles.  The
lower and upper ``whiskers'' extending from the box mark the actual
2.5th and 97.5th percentiles.  The circle marks the mean and the
triangles mark the 2.5th and 97.5th percentiles assuming a normal
distribution with the trace's mean and variance.  

The following points summarize the results of our statistical analysis
of these traces which are relevant to this study:

(1) The traces exhibit low means but very high variability, measured
by the variance, interquartile range, and maximum.  Only four traces
had mean loads near 1.0.  The standard deviation (square root of the
variance) is typically at least as large as the mean, while the
maximums can be as much as two orders of magnitude larger.  This high
variability suggests that there is considerable opportunity for
prediction algorithms.

(2) Measures of variability, such as the variance and maximum, are
positively correlated with the mean, so a machine with a high mean
load will also tend to have a large variance and maximum.  This
correlation suggests that there is more opportunity for prediction
algorithms on more heavily loaded machines.

(3) The traces have relatively complex, sometimes multimodal
distributions that are not well fitted by common analytic
distributions.  However, we note here that assuming normality (but
disallowing negative values) for the purpose of computing a 95\%
confidence interval is not unreasonable, as can be seen by comparing
the 2.5th and 97.5th percentile whiskers and the corresponding normal
percentiles (triangles) computed from the mean and variance in
Figure~\ref{fig:trace_stats}.

(4) Time series analysis of the traces shows that load is strongly
correlated over time.  The autocorrelation function typically decays
very slowly while the periodogram shows a broad, almost noise-like
combination of all frequency components.  An important implication is
that linear models may be appropriate for predicting load signals.
However, the complex frequency domain behavior suggests such models
may have to be of unreasonably high order.

\begin{figure}
\centerline{
\colfigwidth
\epsfbox{hurst_all.epsf}
}
\caption{Hurst parameter estimates.}
\label{fig:trace_hurst}
\end{figure}

(5) The traces exhibit
self-similarity~\cite{BERAN-STAT-LONG-RANGE-METHODS}.  We estimated
their Hurst parameters~\cite{HURST51} using four different
nonparametric methods~\cite{TAQQU-HURST-ESTIMATORS-COMPARE} which
provided the results in Figure~\ref{fig:trace_hurst}.  The traces'
average Hurst parameter estimates ranged from 0.63 to 0.97, with a
strong bias toward the top of that range.  Hurst parameters in the
range of 0.5 to 1.0 indicate self-similarity with positive
near-neighbor correlations.  This result tells us that load varies in
complex ways on all time scales and has long-range dependence.
Long-range suggests that using the fractional ARIMA (ARFIMA) modeling
approach~\cite{HOSKING-FRACT-DIFF,GRANGER-JOYEUX-FRACT-DIFF,BERAN-STAT-LONG-RANGE-METHODS}
may be appropriate.  

\begin{figure}
\centerline{
\colfigwidth
\epsfbox{sdev_mean_epoch.epsf}
}
\caption{Mean epoch lengths and +/- one standard deviation.}
\label{fig:trace_epoch}
\end{figure}

(6) The traces display what we term ``epochal behavior''.  The local
frequency content (measured by using a spectrogram) of the load signal
remains quite stable for long periods of time (150-450 seconds mean),
but changes abruptly at the boundaries of such epochs.
Figure~\ref{fig:trace_epoch} shows the mean epoch length and standard
deviation of epoch length for each of the traces.  This result implies
that linear models may have to be refit at these boundaries, or that
threshold models~\cite{TAR-TONG-BOOK} or other nonlinear or chaotic
dynamics methods may be appropriate.  We do not investigate these
methods in this paper.


\section{Linear time series models}
\label{sec:ltsmodels}

\begin{figure}
\centerline {
\epsfxsize=3in
\epsfbox{ltsmodel.eps}
}
\caption{Linear time series models for load prediction.}
\label{fig:ltsmodels}
\end{figure}

The main idea behind using a linear time series model in load
prediction is to treat our sequence of periodic samples of host load,
$\langle z_t \rangle$, as a realization of a stochastic process which
can be modeled as a white noise source driving a linear filter.  The
filter coefficients can be estimated from past observations of the
sequence.  If most of the variability of the sequence results from the
action of the filter, we can use its coefficients to estimate future
sequence values with low mean squared error.

Figure~\ref{fig:ltsmodels} illustrates this decomposition.  In keeping
with the relatively standard Box-Jenkins
notation~\cite{BOX-JENKINS-TS}, we represent the input white noise
sequence as $\langle a_t \rangle$ and the output load sequence as
$\langle z_t \rangle$.  On the right of Figure~\ref{sec:ltsmodels} we
see our partially predictable sequence $\langle z_t \rangle$, which
exhibits some mean $\mu$ and variance $\sigma_z^2$.  On the left, we
see our utterly unpredictable white noise sequence $\langle a_t
\rangle$, which exhibits a zero mean and a variance $\sigma_a^2$.  In
the middle, we have our fixed linear filter with coefficients $\langle
\psi_j
\rangle$.  Each output value $z_t$ is the sum of the current noise
input $a_t$ and all previous noise inputs, weighted by the
$\langle\psi_j\rangle$ coefficients.

Given an observed load sequence $\langle z_t \rangle$, the optimum
values for the coefficients $\psi_j$ are those that minimize
$\sigma_a^2$, the variance of the driving white noise sequence
$\langle a_t \rangle$.  Notice that a one-step-ahead prediction for a
given set of coefficients is $\hat{z}_{t+1} = \sum_{j=0}^\infty
\psi_ja_{t-j}$ (since the expected value of $a_{t+1}=0$), thus the noise
sequence is simply the one-step-ahead (``$t+1$'' or predictions for a
lead time of one) prediction errors, the optimal coefficient values
minimize the sum of squares of the one-step-ahead prediction errors.

The general form of the linear time series model as, of course,
impractical, since it involves an infinite summation using an infinite
number of completely independent weights.  Practical linear time
series models use a small number of coefficients to represent infinite
summations with restrictions on the weights, as well as special casing
the mean value of the sequence, $\mu$.  To understand these models, it
is easiest to represent the weighted summation as a ratio of
polynomials in $B$, the backshift operator, where $B^dz_t =z_{t-d}$.
For example, we can write $z_t = \sum_{j=1}^\infty \psi_ja_{t-j}+a_t$
as $z_t =
\psi(B)a_t$ where $\psi(B)=1+\psi_1B+\psi_2B+\ldots$  Using this
scheme, the models we examine in this paper can be represented as
\begin{equation}
z_t = \frac{\theta(B)}{\phi(B)(1-B)^d}a_t + \mu
\label{eqn:ts_poly}
\end{equation}
where the different model classes we examine in this paper (AR, MA,
ARMA, ARIMA, ARFIMA, BM, and MEAN)constrain $\theta(B)$, $\phi(B)$ and
$d$ in different ways.  In the signal processing domain, this kind of
filter is known as a pole-zero filter.  The roots of $\theta(B)$ are
the zeros and the roots of $\phi(B)(1-B)^d$ are the poles.  In
general, such a filter can be unstable in that its outputs can rapidly
diverge from the input signal.  This instability is extremely
important from the point of view of the implementor of a load
prediction system.  Such a system will generally fit a model (choose
the $\theta(B)$ and $\phi(B)$ coefficients, and $d$) using some $n$
previous observations.  The model will then be ``fed'' the next $m$
observations and asked to make predictions in the process.  If
coefficients are such that the filter is unstable, then it may explain
the $n$ observations very well, yet fail miserably and even diverge
(and crash!) when used on the $m$ observations after the fitting.


\subsubsection*{AR($p$) models}
The class of AR($p$) models (purely autoregressive models) have $z_t =
\frac{1}{\phi(B)}a_t + \mu$ where $\phi(B)$ has $p$ coefficients.
Intuitively, the output value is a $\phi$-weighted sum of the $p$
previous output values.  The $t+1$ prediction for a load sequence is the
$\phi$-weighted sum of the $p$ previous load measurements.  The $t+2$
prediction is the $\phi$-weighted sum of the $t+1$ prediction and the
$p-1$ previous load measurements, and so on.

From the point of view of a system designer, AR($p$) models are highly
desirable since they can be fit to data in a deterministic amount of
time.  In the Yule-Walker technique that we used, The autocorrelation
function is computed to a maximum lag of $p$ and then a $p$-wide
Toeplitz system of linear equations is solved.  Even for relatively
large values of $p$, this can be done almost instantaneously.

\subsubsection*{MA($q$) models}
The class of MA($q$) models (purely moving average models) have $z_t =
\theta(B)a_t$ where $\theta(B)$ has $q$ coefficients.  Intuitively,
the output value is the $\theta$-weighted sum of the current and the
$q$ previous input values.  The $t+1$ prediction of a load sequence is
the $\theta$-weighted sum of the $q$ previous $t+1$ prediction errors.
The $t+2$ prediction is the $\theta$-weighted sum of the predicted
$t+1$ prediction error (zero) and the $q-1$ previous $t+1$ prediction
errors, and so on.

MA($q$) models are a much more difficult proposition for a system
designer since fitting them takes a nondeterministic amount of time.
Instead of a linear system, fitting a MA($q$) model presents us with a
quadratic system.  Our implementation, which is nonparametric (ie, it
assumes no specific distribution for the white noise source), uses the
Powell procedure to minimize the sum of squares of the $t+1$
prediction errors.  The number of iterations necessary to converge is
nondeterministic and data dependent.

\subsubsection*{ARMA($p$,$q$) models}
The class of ARMA($p$,$q$) models (autoregressive moving average models) have
$z_t=\frac{\theta(B)}{\phi(B)}a_t + \mu$ where $\phi(B)$ has $p$
coefficients and $\theta(B)$ has $q$ coefficients.  Intuitively, the
output value is the $\phi$-weighted sum of the $p$ previous output
values plus the $\theta$-weighted sum of the current and $q$ previous
input values.  The $t+1$ prediction for a load sequence is the
$\phi$-weighted sum of the $p$ previous load measurements plus the
$\theta$-weighted sum of the $q$ previous $t+1$ prediction errors.
The $t+2$ prediction is the $\phi$-weighted sum of the $t+1$ prediction
and the $p-1$ previous measurements plus the $\theta$-weighted sum of
the predicted $t+1$ prediction error (zero) and the $q-1$ previous
prediction errors, and so on.

By combining the AR($p$) and MA($q$) models, ARMA($p$,$q$) models hope
to achieve greater parsimony --- using fewer coefficients to explain
the same sequence.  From a system designer's point of view, this may
be important, at least in so far as it may be possible to fit a more
parsimonious model more quickly.  Like MA($q$) models, however,
ARMA($p$,$q$) models take a nondeterministic amount of time to fit to
data, and we use the same Powell minimization procedure to fit them.

\subsubsection*{ARIMA($p$,$d$,$q$) models}
The class of ARIMA($p$,$d$,$q$) models (autoregressive integrated
moving average models) implement Equation~\ref{eqn:ts_poly} for
$d=1,2,\ldots$ Intuitively, the $(1-B)^d$ component amounts to a
$d$-fold integration of the output of an ARMA($p$,$q$) model.
Although this makes the filter inherently unstable, it allows for
modelling nonstationary sequences.  Such sequences can vary over an
infinite range and have no natural mean.  Although load clearly cannot
vary infinitely, it doesn't have a natural mean either.

ARIMA($p$,$d$,$q$) models are fit by differencing the sequence $d$
times and fitting an ARMA($p$,$q$) model as above to the result.  

\subsubsection*{ARFIMA($p$,$d$,$q$) models}
The class of ARFIMA($p$,$d$,$q$) models (autoregressive fractionally
integrated moving average models) implement Equation~\ref{eqn:ts_poly}
for fractional values of $d$, $0<d<0.5$.  By analogy to
ARIMA($p$,$d$,$q$) models, ARFIMAs are fractionally integrated
ARMA($p$,$q$) models.  The details of fractional
integration~\cite{HOSKING-FRACT-DIFF,GRANGER-JOYEUX-FRACT-DIFF} are
not important here other than to note that $(1-B)^d$ for fractional
$d$ is an infinite sequence whose coefficients are functions of $d$.
The idea is that this infinite sequence captures long range dependence
while the ARMA coefficients capture short range dependence.  Since our
sequences exhibit long-range dependence, even after differencing,
ARFIMAs may prove to be beneficial models.

To fit ARFIMA models, we use Fraley's Fortran 77
code~\cite{FRALEY-FRACDF}, which does maximum likelihood estimation of
ARFIMA models following Haslett and
Raftery~\cite{HASLETT-RAFTERY-LT-DEP-WIND}.  This implementation is
also used by commercial packages such as S-Plus.  We truncate
$(1-B)^d$ at 300 coefficients and use the same representation and
prediction engine as with the other models.

\subsubsection*{Simple models for comparison}
In addition to the Box-Jenkins models and the ARFIMA model described
above, we also use two very simple models for comparison, MEAN and
BM. MEAN has $z_t =\mu$, so all future values of the sequence are
predicted to be the mean.  This is the best predictor, in terms of
minimum mean squared error, for a sequence which has no correlation
over time.  It is useful for measuring the raw variance in the load
signal and the benefits accrued from exploiting the correlations that
exist.  The BM model is an AR($p$) model whose coefficients are all
set to $1/p$.  This simply predicts the next sequence value to be the
average of the previous $p$ values, a simple windowed mean.  $p$ is
chosen to minimize mean squared error for $t+1$ predictions.

\subsubsection*{Making predictions}
After fitting one of the above models, we construct a predictor from
it.  The predictor consists of the model converted to a uniform form,
the predicted next sequence value, a queue that holds the last $p+d$
($d=300$ for ARFIMA models) sequence values, and a queue the holds the
last $q$ prediction errors.  When the next sequence value becomes
available, it is pushed onto the sequence queue, it's corresponding
predicted value's absolute error is pushed onto the error queue, and
the model is evaluated ($O(p+d+q)$ operations) to produce a new
predicted next sequence value.  We refer to this as ``stepping'' the
predictor.  At any point, the predictor can be queried for the
predicted next $k$ values of the sequence, along with their expected
mean square errors ($O(k(p+d+q)$ operations)).

\section{Evaluation methodology}
\label{sec:method}

Our methodology is designed to determine whether there are consistent
differences in {\em practical predictive power} between the different
model classes.  Other goals are also possible. For example, one could
determine the {\em explanatory power} of the models by evaluating how
well they fit data, or the {\em generative power} of the model by
generating new load traces from fitted models and evaluating how well
their statistical properties match the original traces.  We have
touched on these other goals, but do not discuss them here.  To
assess the practical predictive power of the different model
classes, we designed a randomized, trace-driven simulation methodology
that fits randomly selected models to subsequences of a load trace and
tests them on immediately following subsequences.

In practice, a load prediction daemon will have one thread has a
structure like the following:
\small
\begin{verbatim}
do forever {
   get new load measurement; 
   update history;
   if (some criteria) {
      refit model to history;
      make new predictor;
   }
   step predictor;
   sleep for sample interval;
}
\end{verbatim}
\normalsize
where \verb.history. is a window of previous load values.  Fitting the
model to the history is an expensive operation.  Once a model is
fitted, a predictor can be produced from it.  ``Stepping'' this
predictor means informing it of a new load measurement so that it can
modify its internal state. This is an inexpensive operation.
Prediction requests arrive asynchronously and are serviced using the
current state of the predictor.  A prediction request that arrives at
time $t$ includes a lead time $k$. The predicted load values
$\hat{z}_{t+1},\hat{z}_{t+2},\ldots,\hat{z}_{t+k}$ are returned along
with model-based estimates of the mean square prediction error for
each prediction.  

Evaluating the predictive power of the different model classes in such
a context is complicated because there is such a vast space of
configuration parameters to explore.  These parameters include: the
trace, the model class, the number of parameters we allow the model
and how they are distributed, the lead time, the length of the history
to which the model is fit, the length of the interval during which the
model is used to predict, and at what points the model is refit.  We
also want to avoid biases due to favoring particular regions of
traces.

\begin{figure}
\centerline {
\begin{tabular}{c}
\colfigwidth
\epsfbox{testcase1.eps}
\\
\small
\begin{tabular}{|l|l|}
\hline
Model class & Number of parameters \\
\hline
MEAN  & none \\
BM    & none \\
\hline
AR(p) & p=1..32 \\
MA(q) & q=1..8 \\
ARMA(p,q) & p=1..4, q=1..4 \\
ARIMA(p,d,q) & p=1..4, d=1..2, q=1..4 \\
ARFIMA(p,d,q) & p=1..4, d by fit, q=1..4 \\
\hline
\end{tabular}
\normalsize
\end{tabular}
}
\caption{Testcase generation.}
\label{fig:testcasegen}
\end{figure}

To explore this space in a reasonably unbiased way, we decided to run
a large number of randomized testcases on each of the traces.
Figure~\ref{fig:testcasegen} summarizes how a testcase is generated.
First, a random crossover point is chosen in the trace.  Next, a fit
interval with a randomly selected length of five minutes (600 samples)
to three hours (10800 samples) that ends at the crossover point is
chosen.  Then, a test interval is chosen that begins at the crossover
point and has a random length from the same range.  Finally, a test
model (class and number of parameters) is randomly chosen from the
table in Figure~\ref{fig:testcasegen}.  The lower limit we place on
the length of the fit and test intervals is purely prosaic --- the
ARFIMA model needs about this much data to be successfully fit.  The
upper limit is chosen to be greater than most epoch lengths so that we
can see the effect of crossing epoch boundaries.  The models are
limited to eight parameters because fitting larger MA, ARMA, ARIMA, or
ARFIMA models is prohibitively expensive in a real system.  We did
also explore larger AR models, up to order 32.

The random testcase is run in the following way: (1) The test model,
the BM model, and the MEAN model are fit to the measurements in the
fit interval (the ``history.'')  (2)Predictors are produced from the
three fitted models and the data in the fit interval is stepped into
the predictors to bring them to the crossover point.  (3) For each
subsequent value in the test interval each of the predictors is
stepped with the value, and then asked to generate predictions for
$1,2,3,\ldots,30$ seconds into the future.  The value and the
predictions it produced are saved.  (4) The values and predictions
produced in the previous step are passed to an evaluation module
which, for the actual values and predictions at each lead, computes
the minimum, median, maximum, mean, mean absolute, and mean squared
prediction errors for each lead time.  The values and one-step-ahead
prediction errors are also subject to IID and normality tests as
described by Brockwell and Davis~\cite{INTRO-TIME-SERIES-FORECASTING},
pp. 34--37.  IID tests included the fraction of the autocorrelations
that are significant, the Portmanteau Q statistic (the power of the
autocorrelation function), the turning point test, and the sign test.
Normality was tested by computing the $R^2$ value of a least-squares
fit to a quantile-quantile plot of the values or errors versus a
sequence of normals of the same mean and variance.

We ran approximately 152,000 such testcases, which amounted to about
4000 testcases per trace, or about 1000 per model class and parameter
set, or about 30 per trace, model class and parameter set.  Our
parallelized simulation discarded testcases in which an unstable model
``blew up,'' either detectably or due to a floating point exception.
The results of the accepted testcases were committed to a SQL database
to simplify the analysis discussed in the following section.


\section{Results}
\label{sec:results}

The main purpose of this section is to answer the questions we posed
in the introduction: is load consistently predictable and, if so, what
are the consistent differences between the different model classes,
and which class is preferable?  To answer these questions we
data-mined the database of randomized testcases that we describe in
Section~\ref{sec:method}.  We for the most part will address only the
mean square error results, although we will touch on the other
results.

\subsubsection*{Load is consistently predictable}

For a model to provide consistent predictability of load, it must
satisfy two requirements.  First, for the average testcase, the model
must have a considerably lower expected mean square error than the
expected raw variance of the load signal (ie, the expected mean square
error of the MEAN model.) The second requirement is that this
expectation is also very likely, or that there is little variability
from testcase to testcase.  Intuitively, the first requirement says
that the model provides better predictions on average, while the
second says that most predictions are close to that average.

\begin{figure}
\centerline {
\colfigwidth
\epsfbox{all_lead1_8to8.eps} 
}
\caption{All traces, 1 second lead, 8 parameters}
\label{fig:all_lead1_8to8}
\end{figure}

Figure~\ref{fig:all_lead1_8to8} shows that load is indeed consistently
predictable in this sense.  The figure is a Box plot that shows the
distribution of one-step-ahead mean square error measures for 8
parameter models on all of the traces.  In the figure, each category
is a specific class of model and is annotated with the number of
samples for that class.  For each class, the circle indicates the
expected mean square error, while the triangles indicated the 2.5th
and 97.5th percentiles assuming a normal distribution.  The center
line of each box shows the median while the lower and upper limits of
the box show the 25th and 75th percentiles and the lower and upper
whiskers show the actual 2.5th and 97.5th percentiles.  Notice that
the expected raw variance (MEAN) of a testcase is approximately 0.05,
while the expected mean square error for {\em all} of the model
classes is nearly zero.  In terms of the length of the 95\% confidence
interval for the execution time of a one second task as described in
Section~\ref{sec:why_prediction}, this is a reduction from
$(2)(1.96)\sqrt{0.05} = 0.87$ seconds to virtually zero for all of the
classes of predictive models, including the excruciatingly simple BM
model.  The figure also makes it clear that the second requirement is
met.  The variability around the expected mean square error is much
lower for the predictive models than for MEAN. For example, the 97.5th
percentile of the raw variance is almost 0.3 (2.2 second interval)
while it is about 0.02 (0.6 second interval) for the predictive
models.

\begin{figure}
\centerline {
\colfigwidth
\epsfbox{all_lead15_8to8.eps}
}
\caption{All traces, 15 second lead, 8 parameters}
\label{fig:all_lead15_8to8}
\end{figure}

\begin{figure}
\centerline {
\colfigwidth
\epsfbox{all_lead30_8to8.eps}
}
\caption{All traces, 30 second lead, 8 parameters}
\label{fig:all_lead30_8to8}
\end{figure}

Figures~\ref{fig:all_lead15_8to8} and~\ref{fig:all_lead30_8to8} show
the results for 15 second predictions and 30 second predictions.
Notice that, except for the MA models, the predictive models are
consistently better than the raw load variance even with 30 second
ahead predictions.  We also begin to see some differentiation between
the classes with AR and ARFIMA providing slightly better results than
the others.  By expanding to include AR models of up to 32 parameters
(which involve reasonable amounts of computation), AR matches the
performance of ARFIMA for long lead times.  The results shown in
Figures~\ref{fig:all_lead1_8to8}--\ref{fig:all_lead30_8to8} remain
essentially the same if we include testcases with 2 to 8 parameters
instead of just 8 parameters.  The differences between the predictive
models, excluding the MA models, are not particularly significant from
a systems point of view.

\begin{figure}
\centerline {
\colfigwidth
\epsfbox{axp0_lead1_8to8.eps}
}
\caption{Axp0, 1 second lead, 8 parameters}
\label{fig:axp0_lead1_8to8}
\end{figure}

\begin{figure}
\centerline {
\colfigwidth
\epsfbox{axp0_lead15_8to8.eps}
}
\caption{Axp0, 15 second lead, 8 parameters}
\label{fig:axp0_lead15_8to8}
\end{figure}

\begin{figure}
\centerline {
\colfigwidth
\epsfbox{axp0_lead30_8to8.eps}
}
\caption{Axp0, 30 second lead, 8 parameters}
\label{fig:axp0_lead30_8to8}
\end{figure}

For more heavily loaded machines, the differences can be much more
dramatic.  For example, Figure~\ref{fig:axp0_lead1_8to8} shows the
distribution of one-step-ahead mean square error measures for 8
parameter models on the axp0.psc trace.  Here we see an expected raw
variance (MEAN) of almost 0.3 (2.2 second confidence interval) reduced
to about 0.02 (0.6 second interval) by nearly all of the models.
Furthermore, the mean square errors for the different model classes
are tightly clustered around the expected 0.02 value, quite unlike
with MEAN, where we can see a broad range of values and the 97.5th
percentile is almost 0.5 (2.8 second interval.)  The axp0.psc trace
and others are also quite amenable to prediction with long lead times.
For example, Figures~\ref{fig:axp0_lead15_8to8}
and~\ref{fig:axp0_lead30_8to8} show 15 and 30 second ahead predictions
for 8 parameter models on the axp0.psc trace.  With the exception of
the MA models, even 30 second ahead predictions are consistently much
better than the raw signal variance.  These figures remain essentially
the same if we include testcases with 2 to 8 parameters instead of
just 8 parameters.

%\begin{figure}
%\centerline {
%\colfigwidth
%\epsfbox{axp0_lead1_2to8.eps}
%}
%\caption{Axp0, 1 second lead, 2 to 8 parameters}
%\label{fig:axp0_lead1_2to8}
%\end{figure}

%\begin{figure}
%\centerline {
%\colfigwidth
%\epsfbox{axp0_lead16_2to8.eps}
%}
%\caption{Axp0, 16 second lead, 2 to 8 parameters}
%\label{fig:axp0_lead16_2to8}
%\end{figure}


%\begin{figure}
%\centerline {
%\colfigwidth
%\epsfbox{mojave_lead1_2to8.eps}
%}
%\caption{Mojave, 1 second lead, 2 to 8 parameters}
%\label{fig:mojave_lead1_2to8}
%\end{figure}

%\begin{figure}
%\centerline {
%\colfigwidth
%\epsfbox{mojave_lead16_2to8.eps}
%}
%\caption{mojave, 16 second lead, 2 to 8 parameters}
%\label{fig:mojave_lead16_2to8}
%\end{figure}

%\begin{figure}
%\centerline {
%\colfigwidth
%\epsfbox{mojave_lead1_8to8.eps}
%}
%\caption{Mojave, 1 second lead, 8 parameters}
%\label{fig:mojave_lead1_8to8}
%\end{figure}

%\begin{figure}
%\centerline {
%\colfigwidth
%\epsfbox{mojave_lead16_8to8.eps}
%}
%\caption{Mojave, 16 second lead, 8 parameters}
%\label{fig:mojave_lead16_8to8}
%\end{figure}

\subsubsection*{Statistically significant differences between models}

\begin{figure}
\figfontsize
\centerline{
\begin{tabular}{l|ccccccc}
	&MEAN	&BM	&AR	&MA	&ARMA	&ARIMA	&ARFIMA\\
\hline
MEAN	&$=$	&$>$	&$>$	&$>$	&$>$	&$>$	&$>$ \\
BM	&$<$	&$=$	&$=$	&$<$	&$=$	&$=$	&$=$ \\
AR	&$<$	&$=$	&$=$	&$<$	&$=$	&$=$	&$=$ \\ 
MA	&$<$	&$>$	&$>$	&$=$	&$>$	&$=$	&$>$ \\
ARMA	&$<$	&$=$	&$=$	&$<$	&$=$	&$=$	&$=$ \\
ARIMA	&$<$	&$=$	&$=$	&$<$	&$=$	&$=$	&$=$ \\
ARFIMA	&$<$	&$=$	&$=$	&$<$	&$=$	&$=$	&$=$ 
\end{tabular}
}
\caption{T-test comparisons - all traces aggregated, lead 1, 8
parameters.  Longer leads are essentially the same except MA is always
worse.}
\label{fig:t-test-all-lead1-8to8}
\normalsize
\end{figure}



The Box plots of the previous section are very useful for visually
comparing the model classes.  We also algorithmically compared the
expected mean square error of the models using the unpaired
t-test~\cite{JAIN-PERF-ANALYSIS}, pp. 209--212, and did ANOVA
procedures to verify that the differences we detected were
significantly above the noise floor, which they were.  For each pair
of model classes, the t-test tells us, with 95\% confidence, whether
the first model is better, the same, or worse than the second.  We did
such comparisons for the cross product of the models at a number of
different lead times.  We considered the traces both in aggregate and
individually, and used several different constraints on the number of
parameters.  Figure~\ref{fig:t-test-all-lead1-8to8} shows the results
of such a comparison for the aggregated traces, a lead time of 1, and
8 parameters.  In the figure, the row class is being compared to the
column class.  For example, at the intersection of the AR row and MA
column, there is a '$<$', which indicates that, with 95\% confidence,
the expected mean square error of the AR models is less than that of
MA models, thus the AR models are better than MA models for this set
of constraints.  For longer lead times, we found that the results
remained essentially the same, except that MA became consistently
worse.

Figure~\ref{fig:t-test-all-lead1-8to8} shows that, for the typical
trace and a sufficient number of parameters, there is essentially no
difference between the predictive powers of the BM, AR, ARMA, ARIMA,
and ARFIMA models, even with high lead times.  Further, it shows that,
with the exception of the MA models, all of the models are better than
the raw variance of the signal (MEAN model.)

\begin{figure}
\figfontsize
\centerline{
\begin{tabular}{l|ccccccc}
	&MEAN	&BM	&AR	&MA	&ARMA	&ARIMA	&ARFIMA \\
\hline
MEAN   &$=$     & $>$  &    $>$  &    $=$   &   $>$   &   $>$   &   $>$\\
BM     &$<$      &$=$  &    $>$   &   $<$   &   $=$   &   $=$   &   $>$\\
AR     &$<$      &$<$  &    $=$   &   $<$   &   $=$   &   $<$   &   $=$\\
MA     &$=$      &$>$  &    $>$   &   $=$   &   $>$   &   $>$   &   $>$\\
ARMA   &$<$      &$=$  &    $=$   &   $<$   &   $=$   &   $=$   &   $>$\\
ARIMA  &$<$      &$=$  &    $>$   &   $<$   &   $=$   &   $=$   &   $>$\\
ARFIMA &$<$      &$<$  &    $=$   &   $<$   &   $<$   &   $<$   &
$=$\\
\end{tabular}
}
\caption{T-test comparisons - axp0, lead 16, 8 parameters.}
\label{fig:t-test-axp0-lead16-8to8}
\normalsize
\end{figure}

For machines with higher mean loads, there are more significant
differences between the models.  For example,
Figure~\ref{fig:t-test-axp0-lead16-8to8} shows a t-test comparison for
the axp0.psc trace at a lead time of 16 seconds for 8 parameter
models.  Clearly, there is more differentiation here, and we also see
that the ARFIMA models do particularly well, as we might expect given
the self-similarity result of Section~\ref{sec:stats}.  However,
notice that the AR models are doing about as well.  

The results of these t-test comparisons can be summarized as follows:
(1) Except for the MA models, the models we tested were significantly
better than the raw variance of the trace and the difference in
expected performance was significant from a systems point of view.
(2) the AR, ARMA, ARIMA, and ARFIMA models were significantly better
than or equal to the BM model and occasionally the difference in
expected performance was also significant from a systems point of
view. (3) The AR, ARMA, ARIMA, and ARFIMA models had similar expected
performance.

%
% This section will probably have to go due to space constraints
%
\subsubsection*{Characteristics of prediction errors}

As we described in Section~\ref{sec:method}, our evaluation of each
testcase included tests for the independence and normality of
prediction errors.  Intuitively, we want the errors to be independent
so that they can be characterized with probability distribution
function, and we want the distribution function to be normal-variant
in order to simplify computing confidence intervals from it.  

For the most part, the errors we see with the different models are not
independent or Normal to any confidence level.  However, from a
practical point of view, the errors are much less correlated over time
than the raw signal.  For example, for AR($8$) models, the Portmanteau
Q statistic tends to be reduced by an order of magnitude or more,
which suggests that those autocorrelations that are large enough to be
significant are only marginally so.  Furthermore, assuming normality
for computing confidence intervals for high confidence levels, such as
95\%, seems to be reasonable.  However, since a load prediction system
will probably continuously evaluate a fitted model in order to
determine when to refit it, it seems reasonable for it to keep a
histogram or other more detailed representation of the errors in order
to more accurately compute confidence intervals.

\subsubsection*{Recommendations}

Clearly, using one of the model classes, other than MA, for load
prediction is beneficial.  Further, using an AR, ARMA, ARIMA, or
ARFIMA model is occasionally preferable to the BM model.  Since there
were no major differences in the performance of these models, and
fitting AR models is much less expensive than the other models (and
also takes deterministic amount of time), we recommend the use of AR
models.  AR models of order 4 or higher seem to work fine.  However,
we found the knee in the performance of the AR models to be around
order 16.  Since fitting AR models of this order, or even considerably
higher is quite fast, we recommend using AR($16$)s or better.


%\begin{figure}
%\figfontsize
%\centerline{
%\begin{tabular}{l|ccccccc}
%Model &	 0 &	1&	2&	3&	4&	5&	6 \\
%\hline	
%MEAN	&  0.00\% &	  0.00\% &	  0.00\% &	  0.00\% &	  0.00\% &	  0.00\% &	100.00\% \\	
%BM	& 13.79\% &	 17.24\% &	  3.45\% &	 24.14\% &	 27.59\% &	 13.79\% &	  0.00\% \\
%AR	& 20.69\% &	 24.14\% &	 17.24\% &	 13.79\% &	 10.34\% &	 13.79\% &	  0.00\% \\
%MA	&  3.45\% &	  6.90\% &	  3.45\% &	 10.34\% &	 31.03\% &	 44.83\% &	  0.00\% \\
%ARMA	&  6.90\% &	 20.69\% &	 24.14\% &	 34.48\% &	 10.34\% &	  3.45\% &	  0.00\% \\
%ARIMA	& 48.28\% &	  6.90\% &	 20.69\% &	  3.45\% &	  6.90\% &	 13.79\% &	  0.00\% \\
%ARFIMA	&  6.90\% &	 24.14\% &	 31.03\% &	 13.79\% &	 13.79\% &	 10.34\% &	  0.00\% \\
%\end{tabular}
%}
%\caption{Model rankings by mean for lead 1, 8:8 params}
%\label{fig:t-test-example}
%\normalsize
%\end{figure}

%\begin{figure}
%\figfontsize
%\centerline{
%\begin{tabular}{l|ccccccc}
%Model &	 0 &	1&	2&	3&	4&	5&	6 \\
%\hline	
%MEAN	&  2.63\% &	  5.26\% &	  7.89\% &	 26.32\% &	  7.89\% &	 50.00\% &	  0.00\% \\
%BM	&  5.26\% &	 23.68\% &	 18.42\% &	 15.79\% &	 18.42\% &	 10.53\% &	  7.89\% \\
%AR	& 21.05\% &	 13.16\% &	 13.16\% &	 18.42\% &	 23.68\% &	  7.89\% &	  2.63\% \\
%MA	&  2.63\% &	  7.89\% &	  5.26\% &	  5.26\% &	  2.63\% &	  7.89\% &	 68.42\% \\
%ARMA	&  5.26\% &	 21.05\% &	 15.79\% &	 21.05\% &	 18.42\% &	 10.53\% &	  7.89\% \\
%ARIMA	& 31.58\% &	 15.79\% &	 18.42\% &	  7.89\% &	 10.53\% &	  7.89\% &	  7.89\% \\
%ARFIMA	& 31.58\% &	 13.16\% &	 21.05\% &	  5.26\% &	 18.42\% &	  5.26\% &	  5.26\% \\
%\end{tabular}
%}
%\caption{Model rankings by mean for lead 16, 8:8 params}
%\label{fig:t-test-example}
%\normalsize
%\end{figure}

\section{Related work}

In application-centric scheduling~\cite{APPLES-HPDC96}, applications
schedule themselves, adapting to the availability of resources,
perhaps using a framework such as QuO~\cite{QUO-JOURNAL-98}.  Resource
monitoring systems such as Remos~\cite{REMOS-HPDC98}, the Network
Weather Service~\cite{FORECASTING-NETWORK-NWS-WOLSKI-HPDC97}, or
Topology-d~\cite{SERVICE-NET-AWARE-APPS-SPDT-98} provide measurements
and predictions to help applications make scheduling decisions.  This
paper characterized one such measurement (the Unix load average) and
studied how to predict it.

While considerable effort has gone into characterizing
workloads~\cite{LIMITED-BENEFIT-MIGRATION-EAGER,LOAD-BAL-HEURISTICS-PROCESS-BEHAVIOR-LELAND-OTT,EXPLOIT-PROC-LIFETIME-DIST-DYN-LB-SIGMETRICS},
the focus has been on load sharing
systems~\cite{ADAPTIVE-LOAD-SHARING-HOMOGENEOUS-EAGER}, which schedule
all of jobs in a distributed system, and not on application-centric
scheduling.  Mutka and Livny~\cite{WORKSTATION-AVAIL-STATS-CONDOR}
studied workstation availability as a binary function of load average
and other measures.  Our previous
paper~\cite{DINDA-STAT-PROPS-HOST-LOAD-LCR98} and the summary in this
paper are the first detailed study of the Unix load average we are
aware of.

Although linear time series methods are widely used in other areas,
including
networking~\cite{TIME-SERIES-MODELS-INTERNET-TRAFFIC-BASU-GT-TR-95,TIME-SERIES-MODEL-LONG-TERM-NSFNET-GROSCHWITZ-ICC94},
little work has been done in using time series methods for host load
prediction.  Samadani and Kalthofen's
work~\cite{PRED-BASED-SCHED-DIST-COMP-SAMADANI-UNPUB-96} is the
closest to ours.  They found that small ARIMA models were preferable
to single-point predictors and Bayesian predictors for predicting
load.  Their empirical study concentrated on coarse grain prediction
(one minute sampling interval) of four traces that measured load as
the number of non-idle processes shown by the Unix ``ps'' command.  In
contrast, we studied finer grain (one second sampling interval)
prediction of the Unix one minute load average on a much larger set of
machines using higher order models as well as the ARFIMA class of
models.  Additionally, our study was randomized with respect to models
instead of using the Box-Jenkins heuristic identification
procedure. Finally, we reached a different conclusion for our regime,
namely that AR models are sufficient.  The Network Weather
Service~\cite{FORECASTING-NETWORK-NWS-WOLSKI-HPDC97} uses windowed
mean, median, and AR filters to predict various measures of CPU load,
including the load average, but no detailed study of its load
predictors is available.  However, our result that simple prediction
algorithms are adequate agrees with their initial
results.~\footnote{Richard Wolski, private communication.}
Hailperin~\cite{LOAD-BAL-TIME-SERIES-STAT-PERIODIC-LOADS-HAILPERIN}
used the statistical periodicity of his sensor-driven workload to predict
remote load in his load balancing system.


The AR, MA, ARMA, and ARIMA models have long histories dating back to
the 19th century.  Box, et al~\cite{BOX-JENKINS-TS} is the standard
reference and Brockwell and Davis\cite{INTRO-TIME-SERIES-FORECASTING}
provide a useful example-driven approach.  ARFIMA models are a
relatively recent
innovation~\cite{HOSKING-FRACT-DIFF,GRANGER-JOYEUX-FRACT-DIFF}.
Bassingthwaighte, et al~\cite{FRACTAL-STRUCTURES-CHAOS-SCI-MED}
provide a good intuitive introduction to self-similarity and
Beran~\cite{BERAN-STAT-LONG-RANGE-METHODS} provides a good
introduction to long-range dependence.


\section{Conclusions and future work}

The main contribution of this paper was to show, in a rigorous manner,
that host load on real systems is predictable to a very useful degree
from past behavior by using linear time series techniques.  In
addition, we discovered that, while there are statistically
significant differences between the different classes of models we
studied, the marginal benefits of the more complex models do not
warrent their much higher run-time costs.  We reached the conclusion
that AR models of order 16 or higher are sufficient for predicting 1
Hz data up to 30 seconds in the future.  These models work very well
and are very inexpensive to fit to data and to use.

However, it is clear that host load signals do not emerge from linear
systems.  Other techniques, such as threshold autoregressive
models~\cite{TAR-TONG-BOOK} or Abarbanel's methodology for chaotic
data~\cite{ABARBANEL-CHAOTIC-DATA-BOOK} may be more appropriate.  We
have applied parts of Abarbanel's approach to our traces with
promising results.  However, it may be more appropriate for finer or
coarser grain load prediction --- we are satisfied with simple linear
methods for the regime we studied in this paper.  We are currently
studying bandwidth and latency prediction in local area networks using
the evaluation framework described in this paper.

\small 
\tinyspacing
\bibliography{texbib}
\normalsize
\normspacing

\end{document}

