\section{Summary statistics and distributions}
\label{sec:sumstat}

Our data consist of random samples (as we note in the next section,
there is almost no significant sequential correlation among them).
Each sample consists of a transfer time from a client to a
server and its rank relative to the other transfers in its group of
fetches.  This section summarizes these samples in terms of general
statistics and analytic distributions.  Conceptually, the analysis
gives some insight into what a random client can expect from a random
mirror site for different sizes and kinds of documents. There are two
main results here.  First, transfer times and server rankings exhibit
considerable variability.  Second, transfer times are well fit by a
Weibull distribution.

The analysis is from the point of view of a random client site (from
Figure~\ref{client-sites}) attempting to fetch a particular document
from a set of mirror sites (Figure~\ref{server-sites}).  There are 11
different combinations here: Apache and Mars each serve five different
documents, while News serves one virtual document.  For each of these
combinations, we examine the transfer times and corresponding ranks
for all the client fetches of the document from the set of mirror sites.
In effect, this factors out the set of mirrors and the document size.
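
To make the notion of rank concrete, the following is a minimal sketch of
how the transfers in one group of fetches could be ranked.  It is an
illustration only: the mirror names, the timings, the choice of rank 0 for
the fastest transfer, and the tie-breaking by input order are all
assumptions made for the example, not details taken from our measurement
tools.

\begin{verbatim}
```python
def rank_fetches(transfer_times):
    """Return the rank of each transfer within its group (0 = fastest).

    Ties are broken by input order, which is an arbitrary choice made
    for this sketch.
    """
    # Indices of the transfers, ordered from fastest to slowest.
    order = sorted(range(len(transfer_times)),
                   key=lambda i: transfer_times[i])
    ranks = [0] * len(transfer_times)
    for rank, index in enumerate(order):
        ranks[index] = rank
    return ranks


# Hypothetical group of fetches of one document: mirror -> seconds.
group = {"mirror-a": 2.4, "mirror-b": 0.9, "mirror-c": 17.3}
ranks = dict(zip(group, rank_fetches(list(group.values()))))
```
\end{verbatim}

In this made-up group, \texttt{mirror-b} is the fastest and receives rank
0, while \texttt{mirror-c} is the slowest and receives rank 2.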

Figure~\ref{fig:datasum} presents the summary statistics of transfer
times and ranks for each of the combinations.  Notice that mean
transfer times as well as standard deviations increase with increasing
document size.  Further, transfer times are highly variable:
standard deviations are about as large as means, and we see maxima
and minima near the limits we placed on observed transfer times (300
seconds).  It is important to note that the maximum transfer time of
638.98 seconds for the News/0 dataset is due to our normalizing the
transfer times for News documents according to their size to
approximate always fetching a 20 KB document.  In rare cases,
particularly slow fetches can result in normalized transfer times
exceeding 300 seconds.
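
As a rough illustration of how a normalized time can exceed the limit,
assume simple linear scaling by document size.  This scaling rule and the
symbols $t$, $s$, and $\hat{t}$ are assumptions introduced only for this
sketch.  A fetch of a document of size $s$~KB that took $t$ seconds would
then be reported as
\[
  \hat{t} = t \cdot \frac{20}{s} .
\]
Under that assumption, a hypothetical 8~KB News document fetched in 260
seconds, which is within the 300-second limit, would yield
$\hat{t} = 260 \cdot 20 / 8 = 650$ seconds.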

An interesting observation can be made about ranks.  Although ranks
are highly variable, this does not bode disaster for server selection
algorithms.  A random selection is likely to result in an
average-ranked server, which by Figure~\ref{mars-rank-separation} would seem
to indicate reasonable performance.  Further, it may well be the case
that some servers vary less in their ranking than others; for
example, the rankings of a few good servers may remain
stable while the remaining servers have more drastically varying
rankings.


\begin{figure}
\small
\centerline{
\begin{tabular}{|l|c|c|c|c|c|}
\hline
 & \multicolumn{5}{c|}{Transfer time (seconds)} \\
\cline{2-6}
Dataset/Doc & Mean & StdDev & Median & Min & Max \\
\hline
Apache/0	&   1.9632	&   5.8366	&   0.7	&   0.1000 & 230.5100 \\
Apache/1	&   3.9112	&   7.9753	&   2.0	&   0.3800 & 297.7000 \\
Apache/2	&   3.2929	&   6.3993	&   1.7	&   0.3000 & 293.9000 \\
Apache/3	&  15.4776	&  18.2385	&  10.7	&   1.3000 & 299.9000 \\
Apache/4	&  23.1960	&  22.9257	&  17.9	&   2.2000 & 298.2000 \\
Mars/0		&   1.5416	&   4.6808	&   0.7	&   0.1000 & 296.6000 \\
Mars/1		&   2.6929	&   6.5319	&   1.3	&   0.1000 & 292.6000 \\
Mars/2		&   5.8062	&   9.4102	&   3.3	&   0.3000 & 290.5000 \\
Mars/3		&   8.7380	&  12.3967	&   5.3	&   0.6000 & 297.3000 \\
Mars/4		&  19.9019	&  23.5427	&  13.9	&   1.6000 & 298.2000 \\
News/0		&   3.8185	&  11.8028	&   1.06	&   0.1200 & 638.9800 \\
\hline
\end{tabular}
}
\normalsize
\caption{Summary statistics of transfer time.}
\label{fig:datasum}
\end{figure}

While summary statistics provide some insight into the performance, both
absolute (transfer time) and relative (rank), that a client can expect to
receive, they provide a very limited view of the distribution of these
quantities.  To better understand the distribution of transfer times,
we attempted to fit a variety of analytic distributions to the data.
The quality of such a fit can be determined by a quantile-quantile
plot, which plots the quantiles of the data versus the quantiles of
the distribution~\cite[pp.~196--200]{JAIN-PERF-ANALYSIS}.  A good fit
results in a straight line, regardless of the parameters chosen for
the analytic distribution.
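
The construction of a quantile-quantile plot can be sketched in a few
lines of code.  The following is a minimal illustration on synthetic data,
not the procedure used to produce our results: the plotting positions
$(i - 0.5)/n$, the use of a correlation coefficient as a numerical stand-in
for ``how straight the line is,'' the sample sizes, and all function names
are assumptions made for the example.

\begin{verbatim}
```python
import math
import random


def qq_points(samples, inverse_cdf):
    """Pair each sorted sample with the matching theoretical quantile."""
    ordered = sorted(samples)
    n = len(ordered)
    # Plotting positions (i - 0.5) / n stay strictly inside (0, 1),
    # where the inverse CDF is finite.
    return [(inverse_cdf((i - 0.5) / n), x)
            for i, x in enumerate(ordered, start=1)]


def correlation(points):
    """Pearson correlation of the (theoretical, observed) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    return sxy / math.sqrt(sxx * syy)


def unit_exponential_quantile(p):
    """Inverse CDF of the exponential distribution with rate 1."""
    return -math.log(1.0 - p)


# Synthetic data only: an exponential sample with mean 5, and a
# heavier-tailed sample obtained by squaring the same values.
rng = random.Random(0)
matched = [rng.expovariate(0.2) for _ in range(2000)]
heavy = [x ** 2 for x in matched]

# Both samples are compared against *unit* exponential quantiles.
matched_r = correlation(qq_points(matched, unit_exponential_quantile))
heavy_r = correlation(qq_points(heavy, unit_exponential_quantile))
```
\end{verbatim}

The exponential sample has mean 5 but is compared against quantiles of the
unit exponential, so a value of \texttt{matched\_r} close to 1 illustrates
that the line stays straight when only the scale parameter differs.  The
heavier-tailed sample bends away from the line, which shows up as a lower
\texttt{heavy\_r}.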


We tried normal, exponential, and Poisson distributions.  None of
these fit the transfer time data very well, especially at the tails.
The distribution of transfer times is heavy-tailed compared to these
distributions.  Next, we tried the log-normal distribution by testing
if the logarithms of our data points were normally distributed.
Generally, log-normal was much better than the earlier distributions.
This result agrees with Balakrishnan et
al.~\cite{balakrishnan-sigmetrics97}, who also found that a single
client's observed throughput can be modeled reasonably well by a
log-normal distribution.

We next tried a power transformation, raising the data to a
fractional power, to see whether the transformed data could be
fitted with a common analytic distribution.  This
provided the best results.  The transformed data are well fit by an
exponential distribution, so the original data are distributed
according to a Weibull distribution.
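
The step from an exponential fit of the transformed data to a Weibull
distribution for the original data can be made explicit.  As a sketch, let
$X$ be a transfer time, let $k$ be the fractional power, and suppose that
$Y = X^{k}$ is exponentially distributed with rate $\lambda$; the symbols
$k$, $\lambda$, and $\beta$ are notation introduced here for illustration.
Then, for $x \ge 0$,
\[
  P(X \le x) = P(X^{k} \le x^{k}) = 1 - e^{-\lambda x^{k}}
             = 1 - e^{-(x/\beta)^{k}},
  \qquad \beta = \lambda^{-1/k},
\]
which is the Weibull distribution function with shape $k$ and scale
$\beta$.  A fractional power, $k < 1$, corresponds to a tail that is
heavier than that of the exponential distribution.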

It is important to note that because transfer times were artificially
truncated at 5 minutes (300 seconds), we do not have an accurate picture
of the full tail of the distribution.  It may be the case that the actual
distribution of server transfer times is much more heavy-tailed,
meaning that the Weibull distribution may not fit these data as well as
it seems to.

