\section{Data collection methodology}
\label{sec:method}

\begin{figure}[t]
\begin{center}
{
\small
%\tinyspacing
\begin{tabular}{|l|c|c|c|}\hline
Client Site & Avg. time (min.) & Fetches & Failure rate\\\hline
CMU   & 32.82 & 54695 & 10.18\%\\
Ga. Tech.         &  23.78 & 60021 & 11.55\%\\
ISI               & 36.52 & 53200 & 22.13\%\\
U. C. Berkeley & 32.55 & 55062 & 4.62\%\\
U. Kentucky          & 31.23  & 55091 & 12.76\%\\
U. Mass.       & 70.56 & 36542 & 10.95\%\\
U. T. Austin         & 39.56 & 51640 & 4.70\%\\
U. Virginia         & 19.32 & 62405 & 28.88\%\\
Wash. U., St. Louis & 23.27 & 62187 & 1.96\%\\\hline
\end{tabular} 
}  
\end{center}
\caption{Average time for one group of fetches,
number of fetches completed, and failure rate for each client site.}
\label{client-sites}
\end{figure}


\begin{figure}[t]
\begin{center}
{\small
%\tinyspacing
\begin{tabular}{|l|l|} \hline
\multicolumn{2}{|c|}{Mars sites}\\\hline
mars.sgi.com & www.sun.com/mars \\
entertainment.digital.com/mars/JPL & mars.novell.com \\
mars.primehost.com & mars.hp.com \\
mars.excite.com/mars & mars1.demonet.com \\
mars.wisewire.com & mars.ihighway.net \\
pathfinder.keyway.net/pathfinder & mpfwww.arc.nasa.gov \\
mars.jpl.nasa.gov & www.ncsa.uiuc.edu/mars \\
mars.sdsc.edu & laguerre.psc.edu/Mars \\
www.ksc.nasa.gov/mars & mars.nlanr.net \\
mars.catlin.edu & mars.pgd.hawaii.edu \\\hline
\multicolumn{2}{|c|}{News sites}\\\hline
www.cnn.com & www.nytimes.com/index.gif \\
www.latimes.com & www.washingtonpost.com\\
www.csmonitor.com & www.usatoday.com\\
www.abcnews.com & www.msnbc.com \\
www.s-t.com & nt.excite.com\\
news.bbc.co.uk & www.newscurrent.com\\
pathfinder.com/time/daily & www.sfgate.com/news\\
headlines.yahoo.com/Full\_Coverage & www.topnews.com\\\hline
\multicolumn{2}{|c|}{Apache sites}\\\hline
www.rge.com/pub/infosystems/apache & apache.compuex.com \\
apache.arctic.org & ftp.epix.net/apache \\
apache.iquest.net & www.apache.org \\
apache.utw.com & www.ameth.org/apache\\
apache.technomancer.com/ & apache.plinet.com\\
\multicolumn{2}{|l|}{fanying.eecs.stevens-tech.edu/pub/mirrors/apache} \\\hline
\end{tabular}
} %\small
\end{center}
\caption{Servers visited.}
\label{server-sites}
\end{figure}


\begin{figure}
\begin{center}{
\small
%\tinyspacing
\begin{tabular}{|l|l|c|}\hline
 & URL & Size (bytes) \\\hline
\multicolumn{3}{|c|}{Mars documents} \\ \hline
0 & /nav.html  & 2967 \\
1 & /2001/lander.jpg & 70503 \\
2 & /mgs/msss/camera/images/... & \\ 
  & ...12\_31\_97\_release/2303/2303p.jpg & 235982 \\
3 & /mgs/msss/camera/images/... & \\
  & ...12\_31\_97\_release/2201/2201p.jpg & 403973 \\
4 & /mgs/msss/camera/images/... & \\
  & ...12\_31\_97\_release/3104/3104p.jpg & 1174839 \\\hline
\multicolumn{3}{|c|}{Apache documents}\\\hline
0 & dist/patches/apply\_to\_1.2.4/... & \\
  & ...no2slash-loop-fix.patch & 1268 \\
1 & dist/CHANGES\_1.2 & 90631 \\
2 & dist/contrib/modules/mod\_conv.0.2.tar.gz & 74192 \\
3 & dist/apache\_1.2.6.tar.gz & 714976 \\
4 & dist/binaries/linux\_2.x/... & \\
  & ...apache\_1.2.4-i586-whatever-linux2.tar.Z & 1299105 \\\hline
\end{tabular} 
}
\end{center}
\caption{URLs of documents fetched from the Mars and Apache servers.}
\label{document-urls}
\end{figure}

At each of the nine client sites where we had guest accounts (listed in
Figure~\ref{client-sites}), a Perl script periodically fetched
documents from each server in three sets of mirrored web sites (the
Apache Web Server site, NASA's Mars site, and News Headlines), listed
in Figure~\ref{server-sites}.  The Apache and Mars web sites were true
mirrors: each of the servers in one set held the same documents at the
same time.  However, the News sites were an artificial mirror since
they did not contain the same documents.  The News servers were picked
from Yahoo's index (http://www.yahoo.com/).  Current headlines from
each of the News sites were fetched and the transfer times were
normalized so that all News documents appeared to be 20 KB long.  For
the Mars and Apache servers, we used five documents ranging in size
from 2 KB to 1.3 MB (listed in Figure~\ref{document-urls}).  We chose
these sites in order to capture three different ranges of site content
update frequency: the Apache site's content changed on the order of
weeks; the Mars site, on the order of days; and the News site, on the
order of minutes.
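
One natural way to realize this normalization is a linear scaling: a
page of $b$ bytes fetched in time $t$ is reported as if it were a
20~KB transfer,
\[
t_{\mathrm{norm}} = t \cdot \frac{20\,\mathrm{KB}}{b}.
\]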

%need figure depicting fetch order
Clients visited servers sequentially, fetching all documents from a
server before moving on to the next.  Similarly, all mirrors of one
site were visited before moving on to the next site.  For example, a
client would start by visiting http://mars.sgi.com/, the first Mars
mirror on the list, and fetching each of the Mars documents from it.
Then the client would fetch the Mars documents from the second Mars
server, then the third, and so on.  When all of the Mars servers had
been visited, the client would move on to the Apache mirrors, and
finally to the News sites.  We refer to the process of visiting all
servers and collecting all documents once as a {\em group} of fetches.
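
In outline, a group of fetches corresponds to three nested loops of the
following form (an illustrative sketch, not an excerpt from our
scripts; the data structures and the {\tt fetch\_and\_record} helper
are hypothetical):
\begin{verbatim}
# One group of fetches: visit every mirror of every site and fetch
# every document from it.  %mirrors maps a site name to its ordered
# list of mirror servers; %documents maps a site name to its URLs.
foreach my $site ('mars', 'apache', 'news') {
    foreach my $server (@{ $mirrors{$site} }) {
        foreach my $doc (@{ $documents{$site} }) {
            fetch_and_record($server, $doc);
        }
    }
}
\end{verbatim}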

After all servers were visited, the client would sleep for a random
amount of time drawn from an exponential distribution with a mean of
$1/2$ hour, added to a constant $1/2$ hour (a shifted exponential with
an overall mean of one hour).  By scheduling the next
group of fetches relative to the previous group's finish time (rather
than its start time), we avoided situations in which multiple fetches
from the same client interfered with each other, competing for
bandwidth on links near the client.
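
The inter-group delay can be generated as a shifted exponential, for
example (a sketch, not our exact code):
\begin{verbatim}
# Delay before the next group: a constant half hour plus an
# exponentially distributed term with a half-hour mean.
my $half_hour = 30 * 60;                               # seconds
my $delay = $half_hour - $half_hour * log(1 - rand());
sleep(int($delay));
\end{verbatim}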

We introduced the delay between fetches to limit the load our fetches
created on client and server sites.  A typical group of fetches
involved transferring more than 60 MB of data to a client.  If the
fetches finished in 30 minutes, the average transfer rate would have
been 266 Kbps, which is a noticeable share of the traffic on a LAN.
The delay between groups of fetches lowered the average bandwidth
consumption to roughly half that rate.
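
The 266~Kbps figure follows directly from the volume and duration of a
group of fetches:
\[
\frac{60\,\mathrm{MB} \times 8\,\mathrm{bits/byte}}
     {30\,\mathrm{min} \times 60\,\mathrm{s/min}}
\approx 266\,\mathrm{Kbps}.
\]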

We used the lynx\footnote{Available from http://lynx.browser.org/} web
browser to perform fetches.  Choosing lynx was a compromise between
realism and ease of implementation.  Lynx is an actual production web
browser that people use every day.  At the same time, it is easy to
control via command-line switches, allowing us to drive fetches from a
Perl script.  Implementing our own URL fetch code might not have
captured the characteristics of actual browsers.  Conversely, using a
more popular, and hence more realistic, browser such as Netscape would
have presented a significant programming challenge.

Our client script would invoke lynx to retrieve a URL and write the
document to standard output.  The number of bytes received by lynx was
counted and recorded along with the amount of time the fetch took to
complete.  If a fetch did not terminate within five minutes, it was
considered unsuccessful and the associated lynx process was killed.  We
chose five minutes as a compromise between achieving a complete
picture of a server's behavior and forcing groups of fetches to finish
in a reasonable amount of time.  The observable effect of such a
short timeout was a slightly higher failure rate, especially for
larger documents.  Possible causes of timeouts include network
partitions, client errors (lynx might have frozen), server errors (the
server might have stopped providing data), and shortages of available
bandwidth.  In our analysis, we treat these incidents as failures to
collect data, rather than as failures of servers.
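
The per-fetch logic can be sketched as follows (illustrative only; the
{\tt lynx -source} invocation shown is one way to dump a document to
standard output, and the {\tt record\_result} helper is hypothetical):
\begin{verbatim}
# One timed fetch: run lynx, count the bytes it writes to standard
# output, and kill it if it exceeds the five-minute limit.
sub fetch_and_record {
    my ($server, $doc) = @_;
    my $url   = "http://$server/$doc";
    my $start = time();
    my $bytes = 0;
    my $pid   = open(my $fh, '-|', 'lynx', '-source', $url)
        or return record_result($url, 0, 0, 'exec failed');
    eval {
        local $SIG{ALRM} = sub { die "timeout\n" };
        alarm(300);                        # five-minute limit
        local $/ = \8192;                  # read in 8 KB chunks
        while (my $chunk = <$fh>) { $bytes += length($chunk); }
        alarm(0);
    };
    kill 'KILL', $pid if $@;               # timed out: kill lynx
    close($fh);
    record_result($url, $bytes, time() - $start, $@ ? 'timeout' : 'ok');
}
\end{verbatim}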

Fetches could also be unsuccessful if the number of bytes returned was
incorrect.  We found that the wrong number of bytes usually indicated
a temporary failure such as a ``server too busy'' message, although in
some cases it signified that the server no longer existed (failed DNS
query) or was no longer mirroring the data.  We assumed that every
fetch that returned the proper number of bytes succeeded.

It was more difficult to identify failed fetches from the News sites.
Since we were retrieving news headlines, each page's content was
constantly changing, so we could not use a hard-coded size to determine
success.  A simple heuristic that worked well was to assume that all
fetches that returned fewer than 600 bytes were failures.  This value
was larger than typical error messages (200--300 bytes) and smaller
than typical page sizes (as low as 3~KB on some servers).
other servers, fetches lasting five minutes were considered failures.

While our fetch scripts were running, there were multiple occasions
on which client machines crashed or were rebooted.  To limit the
impact of these interruptions, we used the Unix {\tt cron} system to
run a ``nanny'' script every 10 minutes that restarted the fetch
script if necessary.  This kept the fetch scripts running as much of
the time as possible.
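
A {\tt crontab} entry of the following form implements the ten-minute
check (the script path is hypothetical):
\begin{verbatim}
# Run the nanny script every 10 minutes.
0,10,20,30,40,50 * * * * /home/fetcher/nanny.pl
\end{verbatim}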

\subsection{Limitations}
\label{section:limitations}

While our methodology was sufficient to capture the information in
which we were most interested, there were some data that we were not
able to capture.  Because of the relatively large, random gaps between
fetches to the same server, we were unable to capture shorter-term
periodic behavior.  Further, because each group of fetches finished in
a different amount of time because of variations in server load and
network congestion, the distribution of fetch interarrivals to a
single server from a client was extremely hard to characterize and
exploit.  Thus, we were unable to map the observed frequency of
network conditions to the actual frequency of occurrence of these
conditions.

No two fetches from a given client were done simultaneously, to prevent
the fetches from competing with each other.  At the same time, we
wanted to compare results across servers in order to rank servers
relative to one another.  There is a reasonable amount of evidence
suggesting that network performance changes over longer time scales
\cite{seshan-usits97,balakrishnan-sigmetrics97}, while our
measurements took place over shorter time scales.  On average, clients
visited all Mars mirrors in just over 17 minutes, all Apache mirrors
in under 13 minutes, and all News sites in less than one and a half
minutes.  Given these short durations, we believe it is valid to
treat a client's sequential fetches as if they had occurred
simultaneously.

Another artifact of sequential fetches is that periods of network
congestion may be underrepresented in the data.  As congestion
increases, fetches will take longer.  The result is that the number of
fetches completed during periods of congestion will be lower than the
number completed during periods with less congestion.  If periods of
congestion are short-lived, only a few fetches will reflect the
congestion.  If periods of congestion are long-lived, all fetches will
take longer but the total number of groups of fetches completed will
be smaller.

DNS caching effects could also potentially bias our results.
Depending on the DNS workload at a given client site, DNS entries for
the servers in our study may or may not remain in the local cache from
one group of fetches to another.  In fact, cache entries could even be
purged within a group of fetches.  The DNS lookups added a potentially
highly variable amount of time to each fetch we performed.  Performing
the lookups separately would have been possible, but less realistic.

Finally, we must consider inter-client effects.  Because each client's
fetches are independently scheduled, two clients could wind up
visiting the same server at the same time.  We refer to such an
incident as a {\em collision}.  We believe that collisions have a
negligible effect on fetch times; moreover, fewer than 10\% of all
fetches were involved in collisions.
