\newcommand{\comment}[1]{}

\documentclass[twocolumn]{article}
\usepackage{fullpage}
\usepackage{epsf}
\usepackage{times}

\renewcommand{\dbltopfraction}{.9}
\renewcommand{\topfraction}{.9}
\renewcommand{\textfraction}{.1}
\renewcommand{\floatpagefraction}{.9}
\renewcommand{\dblfloatpagefraction}{.9}

\def\normspacing {\renewcommand{\baselinestretch}{0.9}}
\def\tinyspacing {\renewcommand{\baselinestretch}{0.7}}
%\def\normspacing {}
%\def\tinyspacing {}

\normspacing

\def\thesection{\arabic{section}}
\def\thefigure{\arabic{figure}}


\makeatletter
\long\def\unmarkedfootnote#1{{\long\def\@makefntext##1{##1}\footnotetext{#1}}}
\makeatother

\def\pagefigwidth {\epsfxsize=5in}
\def\colfigwidth {\epsfxsize=2.5in}
\def\figfontsize {\tiny}
\def\leadfigwidth{\epsfxsize=2in}
%\def\absfigsize {\epsfxsize=3.0in}
%\def\relfigsize {\epsfxsize=3.0in} 

\newcounter{ecount}
\newenvironment{ecompact}{
  \begin{list}%
    {\arabic{ecount}}{\usecounter{ecount}
    \parsep 0pt plus 1pt
    \partopsep 0pt plus 1pt
    \topsep 2pt plus 2pt minus 1pt
    \itemsep 0pt plus 1pt}}%
  {\end{list}}

\newenvironment{icompact}{
  \begin{list}{$\bullet$}{
    \parsep 0pt plus 1pt
    \partopsep 0pt plus 1pt
    \topsep 2pt plus 2pt minus 1pt
    \itemsep 0pt plus 1pt}}%
  {\end{list}}


\begin{document}

\bibliographystyle{acm}

\title{An Evaluation of Linear Models for Host Load Prediction\\
{\em Extended Abstract} }

\author{
Peter A. Dinda \hspace{0.5in} David R. O'Hallaron\\
School of Computer Science \\
Carnegie Mellon University \\
\{pdinda,droh\}@cs.cmu.edu
}

\maketitle

\begin{abstract}
This paper evaluates linear models for predicting the Digital Unix
five-second host load average from 1 to 30 seconds into the future.  A
detailed statistical study of a large number of long, fine grain load
traces from a variety of real machines leads to consideration of the
Box-Jenkins models (AR, MA, ARMA, ARIMA) and the ARFIMA models (due
to self-similarity).  These models, as well as a simple windowed-mean
scheme, are then rigorously evaluated by running a large number of
randomized testcases on the load traces and data-mining their
results. The main conclusions are that load is consistently
predictable to a useful degree, and that the simpler models such as AR
are sufficient for doing this prediction.
\end{abstract}

\section{Introduction}
\label{sec:intro}

Consider an application program that wants to schedule a compute-bound
soft real-time task in a typical distributed computing environment.
By using mechanisms such as CORBA~\cite{CORBA-20-ARCH-SPEC-TECHREPORT}
or Java RMI~\cite{JAVA-RMI-SPEC}, the task can be executed on any of
the host machines in the network.  Given this freedom, the application
can choose the host on which the task's deadline is most likely to be
met.

If the application could predict the exact running time of the task on
each of the hosts, choosing a host would be easy.  However, such
predictions are unlikely to be exact due to the dynamic nature of the
distributed system.  Each of the hosts is acting independently, its
vendor-supplied operating system scheduling tasks initiated by other
users, paying no special attention to the task the application is
trying to schedule.  The computational load on each of the hosts can
vary drastically over time.

Because of this dynamic nature, the application's predictions of the
task's running time on each of the hosts have confidence intervals
associated with them.  Better predictions lead to smaller confidence
intervals, which makes it easier for the application to choose between
the hosts, or to decide how likely a particular host is to meet the
deadline.

The running time of a compute-bound task on a host is strongly related
to the computational load on the host. If we could
predict the load that a task would encounter on some host, we could
predict the running time on that host.  Better
load predictions lead to better predictions of running time and thus to
smaller confidence intervals.  For this reason, we concentrate on host
load prediction here.

\begin{figure}
\centerline{
\colfigwidth
\epsfbox{unix16-simple.epsf}
}
\caption{Benefits of prediction: server machine with 40 users and
long term load average of 10.}
\label{fig:extreme}
\end{figure}

\begin{figure}
\centerline{
\colfigwidth
\epsfbox{axpfea-simple.epsf}
}
\caption{Benefits of prediction: interactive cluster machine with long term
load average of 0.17.}
\label{fig:argus-bm-better}
\end{figure}

Better load predictions can lead to drastically smaller confidence
intervals on real hosts, both on those with very high overall load,
such as shared servers, and those with very low overall load, such as
desktop workstations.  Figures~\ref{fig:extreme}
and~\ref{fig:argus-bm-better} show the benefits of prediction on
these two classes of machines. The figures plot the length
of the confidence interval for the running time of a one second task
as a function of how far ahead the load predictions are made.  {\em
The final paper will describe these graphs in more detail.}

The {\em existence} of clear benefits of this kind motivates the
following questions: Is host load {\em consistently} predictable or
are figures~\ref{fig:extreme} and~\ref{fig:argus-bm-better} merely
flukes?  If host load is indeed consistently predictable, what classes of
predictive models are appropriate for predicting it?  What are the
differences between these different classes?  Finally, what class can
be recommended for use in a real system?  This paper describes the
results of a large scale, real world study to provide statistically
rigorous answers to these questions. 

We found that host load is, in fact, consistently predictable to a
very useful degree from past behavior, and that simple, practical
linear time series models are sufficiently powerful load predictors.
These results are somewhat surprising because load has complex
behavior and exhibits properties such as self-similarity and epochal
behavior that suggest that more complex models would be more
appropriate.  As far as we are aware, this is the first study to
identify these properties of host load and then to rigorously evaluate
the practical predictive power of linear time series models that both
can and cannot capture them.  Furthermore, our evaluation approach,
while much more computationally intensive than those of traditional
time series analysis, is unbiased and therefore lets us realistically
gauge how the models actually behave when confronted with messy real
world host load measurements.  Our study is also unique in focusing on
predictability on the scale of seconds as opposed to minutes or longer
time scales, and thus is perhaps most useful in the context of
interactive applications running on distributed systems.  Finally, we
used the codebase of a real distributed prediction service for our
study.  The same codebase is used in the Remos resource measurement
system~\cite{REMOS-HPDC98} and in BBN's QuO distributed object quality
of service system~\cite{QUO-JOURNAL-98}.

We began by choosing to measure host load by the Digital Unix
five-second load average.  We found that this measure, which can be
realistically acquired by user-level programs, is closely related to
the execution time of short compute-bound tasks
(Section~\ref{sec:load_running_time}).   We collected a large number of
1 Hz benchmark load traces, which capture all the dynamics of the load
signal, and subjected them to a detailed statistical analysis, which
we summarize in Section~\ref{sec:stats}.

On the one hand, this analysis suggested that linear time series
models such as those in the Box-Jenkins~\cite{BOX-JENKINS-TS} AR, MA,
ARMA, and ARIMA classes might be appropriate for predicting load.  On
the other hand, the existence of self-similarity induced long-range
dependence suggested that such models might require an impractical
number of parameters or that the much more expensive ARFIMA model
class, which explicitly captures long-range dependence, might be more
appropriate.  Since it is not obvious which model is best, we
empirically evaluated the predictive power of the AR, MA, ARMA, ARIMA,
and ARFIMA model classes, as well as a simple ad hoc windowed-mean
predictor called BM and a long-term mean predictor called MEAN.  We
describe these model classes and our implementations of them in
Section~\ref{sec:ltsmodels}.  

Our evaluation methodology, which we describe in detail in
Section~\ref{sec:method}, was to run randomized testcases on the
benchmark load traces.  The testcases (152,000 in all, or about 4000
per trace) were randomized with respect to model class, the number of
model parameters and their distribution, the trace subsequence used to
fit the model, and the length of the subsequent trace subsequence used
to test the model.  We collected a large number of metrics for each
testcase, but we concentrate on the mean squared (prediction) error
metric in this paper.  This metric is directly comparable to the raw
variance in the load signal, and, with a normality assumption, can be
translated into a confidence interval for the running time of a task.
Moreover, by the central limit theorem, these estimates of the
expected mean squared error are themselves normally distributed, and
thus we can determine, to a given significance level, whether one
model provides better consistent predictability (has lower expected
mean squared error) than another.

The results of our evaluation are presented in
Section~\ref{sec:results}.  We found that host load is consistently
predictable to a useful degree from past behavior.  Except for the MA
models, which performed quite badly, the predictive models were all
roughly similar, although statistically significant differences were
indeed found.  These marginal differences do not seem to warrant the
use of the more sophisticated model classes because their run-time
costs are much higher.  Our recommendation is to use AR models of
relatively high order (16 or better) for load prediction within the
regime we studied.

\section{Host load and running time}
\label{sec:load_running_time}

\begin{figure}
\centerline{
\colfigwidth
\epsfbox{load2time.eps}
}
\caption{Relationship between average load during execution and execution time}
\label{fig:load2time}
\end{figure}

The running time of a CPU-bound task is strongly related to the host
load experienced during its execution.  If we treat load as being a
continuous signal the relationship can be
summarized as
\begin{displaymath}
\int_{t_{now}}^{t_{now}+t_{exec}}\frac{1}{1+z(t)}dt=t_{nominal}
\end{displaymath}
where $t_{now}$ is when the task begins executing, $t_{nominal}$ is
the execution time of the task on a completely unloaded machine,
$(1+z(t))$ is the load experienced during execution ($z(t)$ is the
continuous ``background'' load), and $t_{exec}$ is the task's
execution time.  Notice that the integral computes the average load
during execution.  In practice, we sample the load at 1 Hz,  so we have a
discrete time signal $\langle z_t \rangle
=\ldots,z_{t-1},z_t,z_{t+1},\ldots$   We will also refer to such a
sequence as a {\em load signal}.  
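As an illustrative sketch (not the codebase used in the study), the discrete form of this relationship yields a task's execution time from the 1 Hz load samples; the function name and trace format here are hypothetical:

```python
def execution_time(z, t_nominal):
    """Wall-clock time for a task needing t_nominal seconds of CPU,
    given 1 Hz background load samples z (a discrete sketch of the
    integral relationship above; names are hypothetical)."""
    remaining = t_nominal
    for t, zt in enumerate(z):
        share = 1.0 / (1.0 + zt)  # CPU fraction the task gets this second
        if remaining <= share:
            return t + remaining / share  # completes partway through second t
        remaining -= share
    raise ValueError("trace too short for the task to complete")
```

On an unloaded machine ($z_t = 0$) a one-second nominal task takes one second; under a constant background load of 1 it takes two.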

Figure~\ref{fig:load2time} plots execution time versus the average
load experienced during execution for groups of compute-bound tasks
running on an otherwise unloaded machine.  We can see that the
relationship is particularly strong.  In fact, $R^2>0.99$ for the
42,000 points in the figure.

{\em The final paper will describe this experiment in more detail.}

If we knew $z_{t+k}$ for $k>0$, we could directly compute the
execution time using this strong relationship.  If we predict these
values, high quality predictions will allow us to more tightly bound
the execution time of a task.  We measure prediction quality by the
{\em mean squared error}, which is the average of the square of the
difference between predicted values and actual values.  Note that
there is a mean squared error associated with every {\em lead time}
$k$.  
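A minimal sketch of this metric at a single lead time, assuming a hypothetical layout in which {\tt predicted[t]} holds the $k$-step-ahead forecast made at time $t$:

```python
def mse_at_lead(actual, predicted, k):
    """Mean squared prediction error at lead time k, where predicted[t]
    is the k-step-ahead forecast of actual[t + k] made at time t
    (a sketch; names and data layout are hypothetical)."""
    errs = [(predicted[t] - actual[t + k]) ** 2
            for t in range(len(actual) - k)]
    return sum(errs) / len(errs)
```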

Because the load signal may be less predictable at some times than
others, we aggregate the estimates of mean squared error for many
different intervals.  The central limit theorem tells us that the
resulting {\em expected mean squared error} is normally distributed,
and thus we use this quantity in our comparisons. For the simplest
prediction, the long-term mean of the signal, the mean squared error
is simply the {\em raw variance} of the load signal.

{\em The final paper will provide more detail here.}

\section{Statistical properties of load traces}
\label{sec:stats}

The load on a Unix system at any given instant is the number of
processes that are running or are ready to run, which in turn is the
length of the ready queue maintained by the scheduler.  The kernel
samples the length of the ready queue at some rate and exponentially
averages the samples to produce a load average which can be accessed
from a user program.  It is important to note that while this
filtering does tend to correlate the load average over the short term
(shorter than the time constant of the filter), the exponential
filter still exposes the full spectral content of the underlying signal.
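The exponential averaging can be sketched as a first-order filter; the update rule and decay formula below are the standard ones for a filter with time constant {\tt tau}, while the kernel's exact sampling details are an assumption:

```python
import math

def load_average(queue_lengths, tau=5.0, dt=1.0):
    """First-order exponential average of ready-queue samples, in the
    style of the five-second load average (a sketch; the kernel's exact
    sampling rate and update rule are assumptions)."""
    alpha = math.exp(-dt / tau)  # decay factor per sample interval
    avg, out = 0.0, []
    for q in queue_lengths:
        avg = alpha * avg + (1.0 - alpha) * q
        out.append(avg)
    return out
```

A constant ready-queue length of 1 drives the average monotonically toward 1, with most of the transient gone after a few time constants.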

The Unix we used was Digital Unix (DUX).  DUX is interesting because
the time constant is a mere five seconds, which minimizes the effect
of phantom correlation due to the filter.  We chose to sample at 1 Hz
in order to capture all of the load signal's dynamics made available
by the kernel.  {\em The final paper will elaborate.}  We collected
week-long load traces on 38 hosts belonging to our lab
%the Computing, Media, and Communication Laboratory (CMCL)
%at CMU
and to the Pittsburgh Supercomputing Center (PSC) at two different
times of the year.  In this paper, we describe and use the first set
of traces.

{\em A more detailed description of the hosts will appear in the final
paper.  Our analysis of the first set of traces has been
published~\cite{DINDA-STAT-PROPS-HOST-LOAD-LCR98}, and an extended
version covering both sets of traces has been accepted for journal
publication.  The traces themselves are available on the web and a URL
will appear in the final paper.}


\begin{figure}
\centerline{
\epsfxsize=3.5in
\epsfbox{Aug97BoxPlot.eps}
}
\caption{Statistical summary of load traces.}
\label{fig:trace_stats}
\end{figure}

Figure~\ref{fig:trace_stats} summarizes the traces in a Box plot
variant.  The central line in each box marks the median load, while
the lower and upper lines mark the 25th and 75th percentiles.  The
lower and upper ``whiskers'' extending from the box mark the actual
2.5th and 97.5th percentiles.  The circle marks the mean and the
triangles mark the 2.5th and 97.5th percentiles assuming a normal
distribution with the trace's mean and variance.  We will also use
this form of Box plot when we present our prediction results in
Section~\ref{sec:results}. 

The following points summarize the results of our statistical analysis
of these host load traces that are relevant to this study: 
\begin{enumerate}

\item Load has extremely high variability, which leads to high
variability of running times.  This suggests ample opportunity for
prediction algorithms to improve the situation.

\item This variability is correlated with the mean, so heavily loaded
machines generally have higher variability.  This suggests there is
more opportunity for prediction on heavily loaded machines.

\item Load follows relatively complex,
sometimes multimodal distributions that are not well fitted by common
analytic distributions.

\item Load is linearly correlated over time, but spectral analysis
suggests that the relationship of past load values to future load
values is complex.

\begin{figure}
\centerline{
\colfigwidth
\epsfbox{hurst_all.epsf}
}
\caption{Hurst parameter estimates.}
\label{fig:trace_hurst}
\end{figure}

\item Load exhibits self-similarity and long-range dependence.
We estimated the Hurst parameters~\cite{HURST51} of each of our traces
using four different nonparametric
methods~\cite{TAQQU-HURST-ESTIMATORS-COMPARE} which provided the
results in Figure~\ref{fig:trace_hurst}. These results suggest that
the fractional ARIMA (ARFIMA) modeling
approach~\cite{HOSKING-FRACT-DIFF,GRANGER-JOYEUX-FRACT-DIFF,BERAN-STAT-LONG-RANGE-METHODS}
may be appropriate.

\item The traces display what we term ``epochal behavior''.  The local
frequency content (measured by using a spectrogram) of the load signal
remains quite stable for long periods of time (150-450 seconds mean,
with high standard deviations), but changes abruptly at the boundaries
of such epochs.  This property suggests that nonlinear models, or
refitting linear models at epoch boundaries, may be appropriate.
\end{enumerate}

 {\em The final paper will
elaborate further on each of these points.}

\section{Linear time series models}
\label{sec:ltsmodels}

\begin{figure}
\centerline {
\epsfxsize=3in
\epsfbox{ltsmodel.eps}
}
\caption{Linear time series models for load prediction.}
\label{fig:ltsmodels}
\end{figure}

The main idea behind using a linear time series model in load
prediction is to treat the sequence of periodic samples of host load,
$\langle z_t \rangle$, as a realization of a stochastic process that 
can be modeled as a white noise source driving a linear filter.  The
filter coefficients can be estimated from past observations of the
sequence.  If most of the variability of the sequence results from the
action of the filter, we can use its coefficients to estimate future
sequence values with low mean squared error.

{\em The full paper will describe general linear time series models here with
reference to Figure~\ref{fig:ltsmodels}.}

The specific models we focus on are pole-zero filters which relate the
input white noise sequence $\langle a_t \rangle$ to the output load
sequence $\langle z_t
\rangle$ according to
\begin{equation}
z_t = \frac{\theta(B)}{\phi(B)(1-B)^d}a_t + \mu
\label{eqn:ts_poly}
\end{equation}
where $\theta(B)$ and $\phi(B)$ are polynomials in $B$, the backshift
operator ($B^kz_t = z_{t-k}$), $\mu$ is a constant, and $d$ is the
differencing order (an integer for ARIMA, fractional for ARFIMA).

{\em The complete paper will describe each class of model we studied:
AR, ARMA, ARIMA, and ARFIMA here. }

\subsubsection*{Simple models for comparison}
In addition to the Box-Jenkins models and the ARFIMA model described
above, we also use two very simple models for comparison, MEAN and
BM. MEAN has $z_t =\mu$ --- it predicts that future values will be the
global mean.  BM predicts that the future values will be the mean over
some number $p$ of past values, where $p$ is chosen to minimize the
error on the sequence to which the model is fitted.  Notice that BM
includes even simpler models such as ``predict that the next value
will be the same as the last value'' (i.e., $p=1$).
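The BM fitting rule described above can be sketched as follows; the default search bound on $p$ is an illustrative assumption:

```python
def fit_bm(z, max_p=64):
    """Choose the BM window length p that minimizes one-step-ahead
    squared error on the fitting sequence z (a sketch of the rule
    described above; the search bound max_p is an assumption)."""
    best_p, best_err = 1, float("inf")
    for p in range(1, min(max_p, len(z) - 1) + 1):
        err = sum((sum(z[t - p:t]) / p - z[t]) ** 2
                  for t in range(p, len(z)))
        err /= len(z) - p
        if err < best_err:
            best_p, best_err = p, err
    return best_p
```

On a sequence alternating between 0 and 1, for example, the search settles on $p=2$ (a window mean of 0.5 beats predicting the last value, which is always wrong).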

\subsubsection*{Making predictions}
After fitting one of the above models, we construct a predictor from
it.  The predictor maintains the current state implied by the model
and all the data seen so far.  Each new value is {\em stepped} into
the predictor as it becomes available.  At any point, the predictor
can be queried for the predicted next $k$ values of the sequence.
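This step/query interface can be sketched for the AR case as follows; the class and method names are hypothetical, and $k$-step-ahead forecasts are formed by feeding predictions back into the model:

```python
class ARPredictor:
    """Step/query wrapper around fitted AR coefficients (a sketch;
    names are hypothetical)."""
    def __init__(self, phi, mean):
        self.phi, self.mean = phi, mean  # phi[0] weights the newest value
        self.history = []
    def step(self, z):
        """Incorporate a newly observed load value."""
        self.history.append(z)
    def predict(self, k):
        """Return the next k predicted values; requires at least
        len(phi) values to have been stepped in."""
        h = [v - self.mean for v in self.history]
        out = []
        for _ in range(k):
            zhat = sum(c * h[-i] for i, c in enumerate(self.phi, start=1))
            h.append(zhat)  # feed the forecast back in for the next step
            out.append(zhat + self.mean)
        return out
```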

\section{Evaluation methodology}
\label{sec:method}
 
Our methodology is designed to determine whether there are consistent
differences in the {\em practical predictive power} of the different
model classes.  {\em The final paper will comment on other goals.}  To
assess the different model classes, we designed a randomized,
trace-driven simulation methodology that fits randomly selected models
to subsequences of a load trace and tests them on immediately
following subsequences.  The intention is to mimic, in an unbiased
way, what a load prediction daemon would see on a real system if
started at randomly selected times.  {\em The final paper will
elaborate.}  To avoid biases, we generate large numbers of testcases
whose parameters are chosen at random from an interesting region of
the evaluation space.  We evaluate each testcase and then data-mine the
results to make statistical comparisons between different options.

\begin{figure}
\centerline {
\begin{tabular}{c}
\colfigwidth
\epsfbox{testcase1.eps}
\\
\small
\begin{tabular}{|l|l|}
\hline
Model class & Number of parameters \\
\hline
MEAN  & none \\
BM    & none \\
\hline
AR(p) & p=1..32 \\
MA(q) & q=1..8 \\
ARMA(p,q) & p=1..4, q=1..4 \\
ARIMA(p,d,q) & p=1..4, d=1..2, q=1..4 \\
ARFIMA(p,d,q) & p=1..4, d by fit, q=1..4 \\
\hline
\end{tabular}
\normalsize
\end{tabular}
}
\caption{Testcase generation.}
\label{fig:testcasegen}
\end{figure}

Figure~\ref{fig:testcasegen} illustrates the parameter space from
which testcase parameters are chosen.  {\em In the final paper,
pseudocode will appear here showing precisely how a testcase is
generated and evaluated.}  
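In outline, a single testcase proceeds as sketched below: a model class and parameter set are drawn at random, the model is fitted on a random subsequence of the trace, and it is tested on the subsequence immediately following.  The subsequence-length bounds here are illustrative assumptions, not the values used in the study:

```python
import random

def run_testcase(trace, model_space, evaluate):
    """One randomized testcase (a sketch of the procedure described in
    the text).  model_space pairs a fitting function with a parameter
    choice; the length bounds below are illustrative assumptions."""
    fit_fn, params = random.choice(model_space)
    fit_len = random.randrange(600, 3600)   # assumed bounds, in samples
    test_len = random.randrange(600, 3600)  # assumed bounds, in samples
    start = random.randrange(len(trace) - fit_len - test_len)
    fit_seq = trace[start:start + fit_len]
    test_seq = trace[start + fit_len:start + fit_len + test_len]
    model = fit_fn(fit_seq, **params)       # fit on one subsequence...
    return evaluate(model, test_seq)        # ...test on the next
```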

{\em The final paper will explain why this
is an interesting and large section of the evaluation space.}

In addition to computing the mean squared error for each lead time, a
wide variety of other widely accepted evaluation
statistics~\cite[34--37]{INTRO-TIME-SERIES-FORECASTING} are
computed. {\em The final paper will elaborate.}

We ran approximately 152,000 such testcases, which amounted to about
4000 testcases per trace, or about 1000 per model class and parameter
set, or about 30 per trace, model class and parameter set.  
The results of the accepted testcases were committed to a SQL database
to simplify the analysis discussed in the following section.  
{\em The final paper will contain directions for how other researchers
can gain access to that database.}

\section{Results}
\label{sec:results}

This section addresses the following questions: Is load consistently
predictable? If so, what are the consistent differences between the
different model classes, and which class is preferable?  To answer
these questions we analyze the database of randomized testcases from
Section~\ref{sec:method}.  For the most part we will address only the
mean squared error results, although we will touch on the other
results as well.  It is important to reiterate the comments of
Section~\ref{sec:method}: we are examining the practical
predictive power of the models here, not their explanatory power,
or ``fit.''


\subsubsection*{Load is consistently predictable}

For a model to provide consistent predictability of load, it must
satisfy two requirements.  First, for the average testcase, the model
must have a considerably lower expected mean squared error than the
expected raw variance of the load signal (i.e., the expected mean
squared error of the MEAN model).  Second, this expectation must be
representative: there should be little variability from testcase to
testcase.  Intuitively, the first requirement says that the model
provides better predictions on average, while the second says that
most predictions are close to that average.

\begin{figure}
\centerline {
\colfigwidth
\epsfbox{all_lead1_8to8.eps} 
}
\caption{All traces, 1 second lead, 8 parameters}
\label{fig:all_lead1_8to8}
\end{figure}

Figure~\ref{fig:all_lead1_8to8} suggests that load is indeed
consistently predictable in this sense.  The figure is a Box plot as
described in Section~\ref{sec:stats}, except here instead of hosts,
we are comparing model classes.

Notice that the expected raw variance (MEAN) of a testcase is
approximately 0.05, while the expected mean squared error for {\em
all} of the model classes is nearly zero.  In terms of the length of
the 95\% confidence interval for the execution time of a one second
task as described in Section~\ref{sec:load_running_time}, this is a
reduction from $(2)(1.96)\sqrt{0.05} = 0.87$ seconds to virtually zero
for all of the classes of predictive models, including the
excruciatingly simple BM model.  The figure also shows that our second
requirement for consistent predictability is met.  We see that the
variability around the expected mean squared error is much lower for
the predictive models than for MEAN. For example, the 97.5th
percentile of the raw variance is almost 0.3 (2.2 second interval)
while it is about 0.02 (0.6 second interval) for the predictive
models.
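The interval lengths quoted above follow directly from the normality assumption: a mean squared error $\sigma^2$ implies a 95\% confidence interval of length $2(1.96)\sqrt{\sigma^2}$.  A one-line sketch (the function name is ours):

```python
import math

def ci_length(mse, z_crit=1.96):
    """Length of the 95% confidence interval implied by a mean squared
    error, under the normality assumption (z_crit = 1.96 for 95%)."""
    return 2.0 * z_crit * math.sqrt(mse)
```

For instance, {\tt ci\_length(0.05)} and {\tt ci\_length(0.3)} reproduce, up to rounding, the interval lengths quoted in the text.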

\begin{figure}
\centerline {
\colfigwidth
\epsfbox{all_lead15_8to8.eps}
}
\caption{All traces, 15 second lead, 8 parameters}
\label{fig:all_lead15_8to8}
\end{figure}

\begin{figure}
\centerline {
\colfigwidth
\epsfbox{all_lead30_8to8.eps}
}
\caption{All traces, 30 second lead, 8 parameters}
\label{fig:all_lead30_8to8}
\end{figure}

Figures~\ref{fig:all_lead15_8to8} and~\ref{fig:all_lead30_8to8} show
the results for 15 second predictions and 30 second predictions.
Notice that, except for the MA models, the predictive models are
consistently better than the raw load variance, even with 30 second
ahead predictions.  We also see that MA models perform quite badly,
especially at higher lead times.  This was also the case when we
considered the traces individually and broadened the number of
parameters.  MA models are clearly ineffective for load prediction.

\subsubsection*{Successful models have similar performance}

Surprisingly,
Figures~\ref{fig:all_lead1_8to8}~--~\ref{fig:all_lead30_8to8} also
show that the differences between the successful models are actually
quite small.  This is also the case if we expand to include testcases
with 2 to 8 parameters instead of just 8 parameters.  With longer lead
times, the differences do slowly increase.

\begin{figure}
\centerline {
\colfigwidth
\epsfbox{axp0_lead1_8to8.eps}
}
\caption{Axp0, 1 second lead, 8 parameters}
\label{fig:axp0_lead1_8to8}
\end{figure}

\begin{figure}
\centerline {
\colfigwidth
\epsfbox{axp0_lead15_8to8.eps}
}
\caption{Axp0, 15 second lead, 8 parameters}
\label{fig:axp0_lead15_8to8}
\end{figure}

\begin{figure}
\centerline {
\colfigwidth
\epsfbox{axp0_lead30_8to8.eps}
}
\caption{Axp0, 30 second lead, 8 parameters}
\label{fig:axp0_lead30_8to8}
\end{figure}

For more heavily loaded machines, the differences can be much more
dramatic.  For example, Figure~\ref{fig:axp0_lead1_8to8} shows the
distribution of one-step-ahead mean squared error measures for 8
parameter models on the axp0.psc trace.  Here we see an expected raw
variance (MEAN) of almost 0.3 (2.2 second confidence interval) reduced
to about 0.02 (0.6 second interval) by nearly all of the models.
Furthermore, the mean squared errors for the different model classes
are tightly clustered around the expected 0.02 value, quite unlike
with MEAN, where we can see a broad range of values and the 97.5th
percentile is almost 0.5 (2.8 second interval).  The axp0.psc trace
and others are also quite amenable to prediction with long lead times.
For example, Figures~\ref{fig:axp0_lead15_8to8}
and~\ref{fig:axp0_lead30_8to8} show 15 and 30 second ahead predictions
for 8 parameter models on the axp0.psc trace.  With the exception of
the MA models, even 30 second ahead predictions are consistently much
better than the raw signal variance.  These figures remain essentially
the same if we include testcases with 2 to 8 parameters instead of
just 8 parameters.

\begin{figure}
\figfontsize
\centerline{
\begin{tabular}{l|ccccccc}
	&MEAN	&BM	&AR	&MA	&ARMA	&ARIMA	&ARFIMA\\
\hline
MEAN	&$=$	&$>$	&$>$	&$>$	&$>$	&$>$	&$>$ \\
BM	&$<$	&$=$	&$=$	&$<$	&$=$	&$=$	&$=$ \\
\hline
AR	&$<$	&$=$	&$=$	&$<$	&$=$	&$=$	&$=$ \\ 
\hline
MA	&$<$	&$>$	&$>$	&$=$	&$>$	&$=$	&$>$ \\
ARMA	&$<$	&$=$	&$=$	&$<$	&$=$	&$=$	&$=$ \\
ARIMA	&$<$	&$=$	&$=$	&$<$	&$=$	&$=$	&$=$ \\
ARFIMA	&$<$	&$=$	&$=$	&$<$	&$=$	&$=$	&$=$ 
\end{tabular}
}
\caption{T-test comparisons - all traces aggregated, lead 1, 8
parameters.  Longer leads are essentially the same except MA is always
worse.}
\label{fig:t-test-all-lead1-8to8}
\normalsize
\end{figure}

Although the differences in performance between the successful models
are very small, they are generally statistically significant.  We
determined this using paired and un-paired t-tests.
{\em The final paper will elaborate with reference to
Figures~\ref{fig:t-test-all-lead1-8to8}
and~\ref{fig:t-test-axp0-lead16-8to8} 
}
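The comparison can be sketched as follows: given per-testcase mean squared errors for two model classes, the paired t statistic is computed on the per-testcase differences.  This is the standard formula; the study's exact test configuration (pairing, significance levels) will appear in the final paper:

```python
import math

def paired_t(a, b):
    """Paired t statistic on per-testcase mean squared errors of two
    model classes (the standard formula; a sketch of the comparison
    described above)."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)
    return mean / math.sqrt(var / n)
```

A large positive value indicates the first class has significantly higher error than the second (the ``$>$'' entries in the tables), and a large negative value the reverse.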

\begin{figure}
\figfontsize
\centerline{
\begin{tabular}{l|ccccccc}
	&MEAN	&BM	&AR	&MA	&ARMA	&ARIMA	&ARFIMA \\
\hline
MEAN   &$=$     & $>$  &    $>$  &    $=$   &   $>$   &   $>$   &   $>$\\
BM     &$<$      &$=$  &    $>$   &   $<$   &   $=$   &   $=$   &   $>$\\
\hline
AR     &$<$      &$<$  &    $=$   &   $<$   &   $=$   &   $<$   &   $=$\\
\hline
MA     &$=$      &$>$  &    $>$   &   $=$   &   $>$   &   $>$   &   $>$\\
ARMA   &$<$      &$=$  &    $=$   &   $<$   &   $=$   &   $=$   &   $>$\\
ARIMA  &$<$      &$=$  &    $>$   &   $<$   &   $=$   &   $=$   &   $>$\\
ARFIMA &$<$      &$<$  &    $=$   &   $<$   &   $<$   &   $<$   &   $=$\\
\end{tabular}
}
\caption{T-test comparisons - axp0, lead 16, 8 parameters.}
\label{fig:t-test-axp0-lead16-8to8}
\normalsize
\end{figure}

The results of these t-test comparisons can be summarized as follows:
(1) Except for the MA models, the models we tested are significantly
better (in a statistical sense) than the raw variance of the trace and
the difference in expected performance is significant from a systems
point of view.  (2) The AR, ARMA, ARIMA, and ARFIMA models are
significantly better than or equal to the BM model and occasionally
the difference in expected performance is also significant from a
systems point of view. (3) The AR, ARMA, ARIMA, and ARFIMA models have
similar expected performance.

\subsubsection*{AR models are appropriate}

Although the t-test comparisons show small differences between the
models, these differences do not seem to warrant the use of the more
expensive ARMA, ARIMA, or ARFIMA models.  Since AR models are at least
as good as BM models and of similar cost, we recommend using
AR($16$)s or better for load prediction.  

{\em The final paper will elaborate on the choice of order 16.}

\subsubsection*{Prediction errors are not IID normal}

{\em The final paper will elaborate.}


\section{Related work}
\label{sec:related_work}

{\em The final paper will elaborate each of these.}

Our work is of direct use in application-centric
scheduling~\cite{APPLES-HPDC96} such as that provided by QoS
frameworks~\cite{QUO-JOURNAL-98}.  We focus on predicting the
availability of the CPU resource as measured by a resource monitoring
system~\cite{REMOS-HPDC98,FORECASTING-NETWORK-NWS-WOLSKI-HPDC97,SERVICE-NET-AWARE-APPS-SPDT-98}.

Considerable effort has gone into characterizing
workloads~\cite{LIMITED-BENEFIT-MIGRATION-EAGER,LOAD-BAL-HEURISTICS-PROCESS-BEHAVIOR-LELAND-OTT,EXPLOIT-PROC-LIFETIME-DIST-DYN-LB-SIGMETRICS,WORKSTATION-AVAIL-STATS-CONDOR}.
This paper and our previous
paper~\cite{DINDA-STAT-PROPS-HOST-LOAD-LCR98} are the first detailed
study of the Unix load average we are aware of.  Most of workload
characterization is in the context of load sharing
systems~\cite{ADAPTIVE-LOAD-SHARING-HOMOGENEOUS-EAGER}.  A central
assumption of such systems is that current load is a predictor of
future load.  We have shown that this assumption is valid and have
explored the benefits of a number of predictive models.


Linear time series methods are widely used in other areas,
including
networking~\cite{TIME-SERIES-MODELS-INTERNET-TRAFFIC-BASU-GT-TR-95,TIME-SERIES-MODEL-LONG-TERM-NSFNET-GROSCHWITZ-ICC94},
but little work has been done in using them for host load
prediction.  Samadani and Kalthofen's
work~\cite{PRED-BASED-SCHED-DIST-COMP-SAMADANI-UNPUB-96} is the
closest to ours.  {\em The final paper will
elaborate on the significant differences between their study and ours.
Essentially, we studied a different load measure, worked on much finer
grain prediction, used additional and larger models, and performed a
much larger and more statistically rigorous study.}  The Network
Weather Service~\cite{FORECASTING-NETWORK-NWS-WOLSKI-HPDC97} uses
windowed mean, median, and AR filters to predict various measures of
CPU load.  Our result that simple prediction algorithms are adequate
agrees with their initial results.\footnote{Richard Wolski, private
communication.}

The AR, MA, ARMA, and ARIMA models have long histories dating back to
the 19th century~\cite{BOX-JENKINS-TS,INTRO-TIME-SERIES-FORECASTING}.
ARFIMA models are a more recent
innovation~\cite{HOSKING-FRACT-DIFF,GRANGER-JOYEUX-FRACT-DIFF} as is
the notion of self-similarity~\cite{FRACTAL-STRUCTURES-CHAOS-SCI-MED}
and long-range-dependence~\cite{BERAN-STAT-LONG-RANGE-METHODS}.

\section{Conclusions and future work}
\label{sec:conclusions}

{\em The final paper will contain a conclusion section.}

\small
\tinyspacing
\bibliography{texbib}
\normalsize
\normspacing

\end{document}

