\section{Introduction}

Mirror servers, which serve replicas of popular data items, have been
employed for many years on the Internet as a way to increase
reliability and performance in the presence of frequent accesses by
many clients.  While mirroring can provide much higher aggregate
throughput to a given data item, individual clients must choose a
mirror server carefully to achieve reasonable performance.
Unfortunately, only ad hoc mechanisms for choosing the appropriate
mirror server are currently employed.  However, a number of server
selection mechanisms have been proposed.  Partridge et al.\
\cite{rfc1546} have introduced a scheme called {\em anycast} that allows a
client to automatically reach the replica of a server that is the
fewest network hops away.  Others
\cite{carter-96-007,bhattacharjee-infocom97} have observed that
static metrics of proximity, such as distance in hops, are less
effective at finding a server that will deliver good performance than
metrics which take dynamically changing network and server conditions
into account.

In order to design an effective algorithm for mirror server selection,
an understanding of the actual behavior of Internet mirror servers is
necessary.  To contribute towards this understanding, we have
undertaken a large-scale measurement study involving 9 clients and 47
mirror servers scattered throughout the United States.  Although other
studies of mirror server behavior have appeared in the literature
before, we believe this is the first study of this scale.  This paper
presents a number of interesting properties that we have observed in
the data collected.  Our focus is on characterizing the performance an
individual client receives when transferring documents from mirror
servers.  We wish to answer four questions:
\begin{itemize}
\item Does performance observed by a client vary across mirror servers?
\item How dynamic is the set of servers that offer good performance?
\item How is the probability that a server's performance will drop
relative to other servers affected by time scale?
\item Does a drop in a server's performance indicate it has become
less likely to offer better performance than other servers?
\end{itemize}

To answer the first question, we have looked at the time required to
retrieve a document from each mirror server of a site.  We have found
that the difference in performance between the best and worst servers
is typically larger than an order of magnitude, and can grow larger
than two orders of magnitude on occasion.  This result shows that
performance does indeed vary widely from one server to another.

The second question explores how dynamically server performance
changes.  By counting the number of servers that a client must visit
over time in order to achieve good performance, we can see whether the
set of servers that offer good performance at any given time is small
or large.  We found that the set is usually fairly small, indicating
that server performance is not highly dynamic.
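To make this kind of counting concrete, the following Python sketch estimates how many distinct servers a client would have to visit to stay near the best observed transfer time in each measurement round.  This is an illustration of the analysis described above, not the paper's exact methodology: the data layout and the ``within a slack factor of the best time'' definition of good performance are assumptions for the example.

```python
def servers_needed(transfer_times, slack=1.1):
    """Count the distinct servers a client must visit over time.

    transfer_times: list of measurement rounds; each round is a dict
    mapping server name -> observed transfer time in seconds.
    slack: a server "offers good performance" in a round if its time
    is within `slack` times the best time in that round (an assumed
    threshold, not one taken from the paper).
    """
    visited = set()
    for round_times in transfer_times:
        best = min(round_times.values())
        good_now = {s for s, t in round_times.items() if t <= slack * best}
        # If a previously visited server is still good, stay with it.
        if visited & good_now:
            continue
        # Otherwise the client must visit this round's best server.
        visited.add(min(round_times, key=round_times.get))
    return len(visited)

# Three rounds over three hypothetical servers: server "a" is best at
# first, "b" briefly takes over, then "a" is good again.
rounds = [
    {"a": 1.0, "b": 5.0, "c": 1.05},
    {"a": 4.0, "b": 1.2, "c": 1.25},
    {"a": 1.1, "b": 6.0, "c": 1.0},
]
print(servers_needed(rounds))  # -> 2
```

A small count returned by such an analysis corresponds to the finding above: a client can do well by tracking only a few servers out of the whole mirror group.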

The third and fourth questions concern mechanisms that could
potentially be incorporated into a server selection system.  In the
third question, we consider whether the probability that a server's
performance drops relative to other servers depends on the time scale
considered.  The
fourth question considers the case in which a client has found a
server that offers good performance relative to the other servers but
then notices a drop in that server's performance.  The question is
whether or not that drop in performance indicates that the server's
performance is no longer good.  We found that large performance drops
do indicate an increased likelihood that a server no longer offers
good performance.

Finally, we will consider the effect of document choice on server
choice.  Though we assume that all mirrors of a site serve the same
set of documents, it might be the case that some factor such as
document size or popularity would affect the performance of a server.
We found that server choice is independent of document choice almost
all the time.

To summarize, we have five main results:
\begin{itemize}
\item Performance can vary widely from one server to another.
\item Clients can achieve near-optimal performance by considering only
a few servers out of the whole group of mirrors.
\item The probability of any server's rank change depends very little on the
time scale over which the rank change takes place.
\item There is a weak but detectable link between a server's change in
transfer time and its change in rank.
\item Server choice is independent of document choice in most instances.
\end{itemize}
We discuss the implications of these results in
Section~\ref{section:implications}.

\subsection{Related work}
\label{sec:related-work}

Previous work on server selection techniques can be divided into four
categories: network-layer server selection systems, application-layer
selection systems, metric evaluation, and measurement studies.  The
first includes work dealing with finding the closest server in terms
of number of network hops or in terms of network latency
\cite{rfc1546,idmaps,distributeddirector,basturk-ibm-tr,francis-hops,ipv6,levine-icnp97}.  The second consists of systems that take
application performance metrics into account
\cite{bhattacharjee-infocom97,clustercats,winddance,rosenberg-infocom98,srvloc,seshan-usits97,yoshikawa-usenix97}.
Most of these systems use a combination of server load and available
network throughput to select a server for a client.  The third
category consists of evaluations of server selection metrics
\cite{carter-96-007,fei-infocom98,guyton-95-762,sayal-wisp98}.
These studies propose new metrics and test them experimentally.

The fourth category, which includes this work, consists of studies
that characterize the behavior of existing mirror servers in order to
draw conclusions about the design of server selection systems.
Bhattacharjee et al.\ \cite{bhattacharjee-96-25} measured ``server
response time,'' defined to be the time required to send a query to a
server and receive a brief response, using clients at a single site to
visit two sets of web sites.  While neither set consisted of true
mirrors, each contained servers with similar content.
Bhattacharjee also measured the throughput between a client and four
FTP servers.  Carter and Crovella \cite{carter-96-007} measured ping
times and hop counts to 5262 web servers to determine how well one
approximated the other.  In contrast, our study is on a larger scale,
using multiple client sites, a longer measurement period, and a larger
number of groups of popular web servers that are true mirrors.

There have been several other web-related measurement studies.
Balakrishnan et al.\ \cite{balakrishnan-sigmetrics97} analyzed a trace
of web accesses to determine how stable network performance is through
time and from host to host.  Gribble and Brewer \cite{gribble-usits97}
looked at users' web browsing behavior, exploring server response
time, burstiness of offered load, and the link between time of day and
user activity.  Cunha et al.\ \cite{cunha-bu-tr-95-010} also collected
user traces via a customized version of Mosaic and looked at a number
of factors including document size and popularity.  Arlitt and
Williamson \cite{arlitt} searched for trends present in a variety of
different WWW workloads based on server access logs.  Finally,
Crovella and Bestavros \cite{crovella-ton97} have found evidence for
self-similarity in WWW traffic.

The rest of this paper consists of a description of our data
collection system (Section~\ref{sec:method}), a general picture of the
data we collected (Sections~\ref{sec:data} and \ref{sec:sumstat}), a
discussion of our findings (Sections~\ref{sec:ranktime}
through~\ref{section:document}), implications of our results
(Section~\ref{section:implications}), and conclusions
(Section~\ref{section:conclusion}).

