\section{Small server sets}
\label{section:working-sets}

The observation in Section~\ref{sec:ranktime} that most rank changes
are small leads us to ask how many servers must a client consider to
achieve optimal performance.  If server ranks never changed, a client
would only need to use one server, the one with the best rank.  But
because server ranks do change, a client will need to evaluate
multiple servers to find the current best.  We have found that a
client needs to evaluate only a very small number of servers, usually
less than half the total number of servers, to achieve near-optimal
performance.  In this section, we define a server's performance to be
near-optimal, or ``good,'' if it can deliver a document in no longer
than 10\% more than the best transfer time for the same document
observed in the same group of fetches.

We define a {\em server set} for client $C$ and document $D$ to be the
minimum subset of a site's mirrors such that for any group, at least
one server in the server set can deliver $D$ to $C$ with good
performance.  If a server set contains all the mirrors, it means that
a client must consider all mirrors when choosing a server.  From the
data we have, we can build a server set for each client-document
combination using a straightforward greedy algorithm: In each group of
fetches, all servers that deliver good performance are marked.  The
number of marks that each server accrues over all groups is computed,
and the server, $s$, with the highest total, is added to the server
set.  The groups where $s$ exhibited good performance are discarded,
and the procedure is repeated on the remaining groups.  The algorithm
terminates when there are no groups left.

\begin{figure}[t]
\begin{center}
\epsfig{file=wset.eps,width=3in}
\end{center}
\caption{Server sets for two client-data combinations: Wash. U.'s
Mars set and U. Mass.'s Apache}
\label{working-set}
\end{figure}

Figure~\ref{working-set} shows the composition of the server sets for
10 data sets composed of the five documents from U. Mass's Apache data
and the five documents from Washington U.'s Mars set.  Each stripe
from each column represents the proportion of time that one server
offers good performance.  For example, the first column of the graph
shows the server set for the Wash. U. client's fetches of document 0
from the Mars sites.  Each stripe in that column corresponds to one
server.  For purposes of this discussion, it does not matter which
server maps to which stripe.  What is significant is the size of each
stripe, which shows how often the corresponding server is able to
deliver good performance.  The first stripe in the column shows that
one server is good in almost 70\% of its fetches.  The next stripe
represents a server that is good in a little more than 10\% of the
remaining fetches.

The distribution and number of stripes show that client sites do not
have to consider every server in the mirror set to achieve good
performance.  Rather, a small number of servers can provide good
performance for a significant fraction of all fetches.  Looking at the
Washington U. data, we see that for documents 1 through 4, the client
can receive good performance over 90\% of the time by considering only
2 servers out of the group of 20.  For document 0, the client would
need to consider 5 servers to achieve good performance more than 90\%
of the time.  On the other hand, the client at U. Mass. requires more
servers to achieve good performance when fetching from Apache servers.
Seven servers are required for the first document while 5 are required
for the other documents.  This is a much higher proportion of servers
than for the Washington U. client (7 out of 11 vs. 5 out of 20).

\begin{figure}[t]
\begin{center}
{\small
\begin{tabular}{|l|c|c|}\hline
Doc. & Avg. for 90\% Good & Avg. for 100\% Good\\\hline%\hline
\multicolumn{3}{|c|}{Mars (20 Servers)}\\\hline
0 & 3.44 & 8.57 \\%\hline
1 & 2.67 & 5.83 \\%\hline
2 & 2.56 & 5.83 \\%\hline
3 & 2.67 & 5.67 \\%\hline
4 & 2.22 & 5.60 \\\hline%\hline
\multicolumn{3}{|c|}{Apache (11 servers)}\\\hline
0 & 3.89 & 6.25 \\%\hline
1 & 3.00 & 5.20 \\%\hline
2 & 3.11 & 5.25 \\%\hline
3 & 3.00 & 5.80 \\%\hline
4 & 3.00 & 6.00 \\\hline%\hline
\multicolumn{3}{|c|}{News (16 servers)}\\\hline
0 & 2.44 & 5.88 \\\hline
\end{tabular}
} %\small
\end{center}
\caption{Average (taken over all clients) number of servers 
	required to achieve good performance in 90\% 
	and 100\% of fetches}
\label{avg-num-server}
\end{figure}

Figure~\ref{avg-num-server} summarizes our findings over all
documents.  On average, less than half of all servers need to be
considered to achieve good performance most (or even all) of the time.
This result implies that when designing a server selection mechanism,
it is unnecessary to assume that all clients will need to contact or
evaluate all servers.
