The popularity of Gnutella queries and its implications on scalability

Kunwadee Sripanidkulchai
Carnegie Mellon University
Email

Abstract

The surging increase in the popularity of peer-to-peer applications had led to a dramatic need for a scalable and high performance content location protocol. Gnutella, a peer-to-peer file-sharing protocol, broadcasts queries to locate content and, thus, suffers from an overwhelming amount of query and reply traffic. In this paper, we analyze the characteristics of queries on Gnutella and its implications on scaling. We find that the popularity of search strings follows a Zipf-like distribution. Taking advantage of such a popularity distribution by caching a small number of query results significantly decreases the amount of traffic seen on the network. We evaluate the effectiveness of caching and find that caching at one Gnutella node can result in up to a 3.7-time reduction in traffic while using only a few megabytes of memory. As more nodes implement caching, more traffic is reduced. Caching is a short-term solution to increasing the scalability of Gnutella.

1 Introduction

Gnutella is a file sharing application and protocol. End hosts join Gnutella by connecting to existing end hosts already on the Gnutella, forming an application-level overlay on top of the physical network. All end hosts, or peers, can serve files and retrieve files from one another. To facilitate file sharing, messages are sent between end hosts. Queries for files are broadcasted on the overlay network, and replies are routed back to the host which originally generated the query through the overlay network. Ping and pong (a reply to a ping) messages are used to discover hosts on the network.

Query broadcasts are not scalable. As more nodes join, more queries are sent. Currently, a peer connected to the network through a modem does not have enough bandwidth to support queries and replies, let alone retrieve content. In order to scale Gnutella to more than hundreds of users, it is imperative to stop the flooding of every query and reply.

In this study, we look at the characteristics of Gnutella queries. Given such characteristics, we ask whether there is a simple enhancement to the broadcast protocol that would go a long way towards reducing traffic on the network. In Section 2, we discuss the experimental set up and collected traces. We then analyze the trace, focusing on the popularity of query strings in Section 3. We present our approach in Section 4, and perform trace-based simulations to evaluate it in Section 5. We discuss related work and conclude in Section 6.

2 Experimental Setup

We modified an open source Gnutella client (gtk-gnutella) [8] to passively monitor and log all queries and results that were routed through it. End hosts on our local network, each running the modified Gnutella client, joined Gnutella using Clip2's gnutellahosts.com [6] service. The details of our traces are listed in Table 1.

After the collection period, we sent pings to all the IP addresses observed during December 10-13, 2000, from a set of end hosts at CMU. There were 18,708 (22%) successful pings 18,708 with an average latency of 268.6 msec.

We also recorded packet-level traces of the activity from Gnutella on our local network during the measurement to obtain bandwidth usage information with tcpdump. Eleven 5-10-second traces were recorded during Dec 10-13, 2000, and one 85-second trace was recorded on January 28, 2001. On average, Gnutella was consuming 3.5 Mbps.

In this paper, we explore one characteristic of the collected traces: the popularity of Gnutella queries. We leave the exploration of other characteristics as future work.

Trace Date Duration Number of monitoring hosts Total number of queries Total number of results Unique IP addresses Alpha
1-2 December 10-13, 2000 4 days 2 5,854,009/5,975,167 - 83,709 1.24/1.24
3 January 18, 2001 5 hours 1 570,361 5,282,668 - 1.13
4 January 19, 2001 3 hours 1 362,999 3,857,505 7,032 1.06
5 January 28, 2001 2.5 hours 1 352,396 2,083,615 7,288 0.78
6 January 29, 2001 5 hours 1 1,146,782 8,037,609 12,805 0.63
Table 1: Statistics of the traces

3 Popularity of queries

In this section, we look at the characteristics of queries on Gnutella. We find that 17% of queries contain non-ASCII strings which we believe is due to non-English queries and faulty clients. We removed such queries from the analysis. The most popular queries are for file extensions, artists and adult content. The top 20 queries for Trace 1 and 6 are categorized and shown in Figure 1. In the December trace, the most popular queries were for file extensions. In the later trace, a larger portion of the queries were for artists and adult content. The top 20 queries for other traces can be found at http://www.cs.cmu.edu/~kunwadee/research/p2p/content.html.

Dec13 Jan29
Figure 1: Top 20 most popular queries for two traces.

The number of times a query is observed versus the ranking of the query for Trace 1 and 6 are shown in Figure 2 in log scale. Rank 1 is the most popular query. If each curve were to be a straight line, then the popularity of queries follows a Zipf-like distribution with the probability of seeing a query for the i'th most popular query is proportional to 1/(i^alpha). The curve looks like two straight lines with an inflection point at around query rank 100. The first portion of the curve for queries rank 1 to 100 is flatter. This implies that the most popular queries are almost equally popular. The second portion of the curve, after query rank 100, fits a straight line reasonably well. We used Matlab's curve-fitting tools to estimate the value of alpha, for the second portion of the curve. The values of alpha were between 0.63 and 1.24, and are listed in Table 1. The curves for the other traces can be found at http://www.cs.cmu.edu/~kunwadee/research/p2p/zipf.html.

Dec13 1 Pop Jan 29 Pop
Figure 2: Frequency of query string observed versus query ranking.

Many studies have observed that the frequency by which a Web document is accessed follows a Zipf-like distribution [2,3,4,5,9]. Our findings indicate that the distribution of the popularity of Gnutella queries has two distinct phases. Very popular documents are equally popular. Less popular documents have a distribution which follows a Zipf-like distribution. Caching Gnutella query results, similar to caching web documents, to cope with Zipf-like distributions should significantly reduce the traffic on Gnutella, freeing up bandwidth to provide better content retrieval performance. In the following section, we analyze the benefits of caching query results.

4 Caching query results

In this section, we outline a simple protocol to implement caching of queries and results.

Gnutella nodes monitor and cache query strings and results that flow through it. Cached results are valid only up to a timeout period, after which it is removed from cache. If a query result has been cached, the node will answer the query and will not forward the query on. In the best case, this allows for a node to answer its own queries without needing to send any queries out on the network. If the cached result for a query is expired, the Gnutella node will not answer the query and will forward the query out to its neighbors.

5 Evaluation

Caching query results helps to reduce the time it takes for queries to be answered and also reduces the amount of traffic on the network. In this section, we evaluate the effectiveness of caching query results on both metrics using three caching policies based on timeouts: queries and results are cached for 1 minute, 5 minutes, and 10 minutes.

In our simulations, query strings are directly associated with query results. For example, is, if a query for "foo.mpg" is received, we cache the query string "foo.mpg" and the associated replies. When a subsequent query for "foo.mpg" is received before timeout, it is a cache hit. Any subsequent queries for "foo", or "mpg" are considered misses even if received before the timeout. This gives a worst-case bound on the benefits of caching. A real implementation should allow for partial string matchings. We also assumed that it takes t_max after a query is received for a query result to be cached, where t_max is the maximum time it takes to receive a query result from any node. This gives a worst-case estimate on when a query result can be answered from cache. Our evaluation is for one Gnutella node, our monitoring node, implementing caching. As more nodes cache results, less traffic is seen all together, and most queries will be answered within very few hops.

Hitrate
Figure 3: Hit rates for each caching policy.

Figure 3 depicts the cache hit rate for each trace, where hit rates are defined as the number of queries that were answered from cache over the total number of queries. When results are cached for longer periods, it is more likely that a larger number of queries can be answered from cache, and the amount of query and reply traffic on the network is reduced. The hit rate ranges from 3% up to 73%, where the highest hit rate was observed when using a 10 minute caching interval with trace 2. Traffic is reduced by 3.7-times when the hit rate is 73%. The hit rate for trace 3 using a 1 minute caching interval was the lowest. This is because t_max for this trace is close to 1 minute.

Although our findings suggest that having larger caching intervals results in a more significant reduction in traffic, there is a tradeoff. Cached results could become out-of-date under dynamic conditions where peers can join and leave the network at any time and content on peers can change at any time. The longer a result is cached, the more stale it becomes. Finding a balance between caching and inconsistency is key to achieving good performance. We plan to explore this in future work.

Overhead
Figure 4: Average amount of memory used for caching.

Any node that caches queries and replies is contributing storage. Figure 4 depicts the average amount of memory used for caching. The longer the caching interval is, the more memory is needed. The amount of memory used ranges from 195 kB to 4.3 MB. This is an acceptable amount of overhead given that most computers are equipped with at least 64 MB or RAM. However, if memory is scarce, a caching policy using limited memory, such as LRU or LFU, can be used.

6 Conclusions and related work

We studied the popularity of queries on Gnutella and find that it has 2 distinct phases. The most popular objects are almost equally popular, and the lesser popular objects have popularity characteristics which follow a Zipf-like distribution. We propose to cache queries and replies to take advantage of such a popularity distribution. Caching does significantly reduce the amount of traffic seen on Gnutella without using a large amount of memory. Therefore, caching is a simple short-term solution that should help to scale Gnutella. However, as Gnutella grows, more memory will be needed for good caching performance. There is still a need for a better long term solution for a content location protocol that does not broadcast queries.

There have been many recent studies on Gnutella [1,7,10]. Adar and Huberman observe that there is a significant amount of freeriding (peers not sharing files) which could impend on the usability of Gnutella. Distributed Search Solutions has several reports on characteristics of Gnutella, and how it has changed over time. More related to our paper, Ritter uses mathematical analysis to show that Gnutella cannot scale because of its use of broadcasting. We agree that there is a need for a scalable content location protocol. Caching can be used a short-term fix while we await a better architecture for enabling peer-to-peer applications.

References

[1] E. Adar and B. Huberman, "Free Riding on Gnutella," First Monday, Volume 5, Number 10, October 2000.
[2] V. Almeida, A. Bestavros, M. Crovella and A. de Oliveira, "Characterizing reference locality in the WWW, " in Proceedings of 1996 International Conference on parallel and Distributed Information Systems (PDIS '96), 1996.
[3] L. Breslau, P. Cao, L. Fan, G. Phillips and S. Shenker, "Web caching and Zipf-like distributions: Evidence and implications," in Proceedings of the IEEE INFOCOM '99, March 1999.
[4] C. Cunha, A. Bestavros and M. Crovella, "Characteristics of WWW client based traces," Tech. Rep. BU-CS-95-010, Computer Science Department, Boston University, 1995.
[5] S. Glassman, "A caching relay for the World Wide Web," in Proceedings of the First International Conference on the World Wide Web, 1994.
[6] Gnutella hosts web page, http://www.gnutellahosts.com
[7] Gnutella: To the Bandwidth Barrier and Beyond, Clip 2 Report, November 6, 2000.
[8] GTK-gnutella web page, http://gtk-gnutella.sourceforge.net
[9] T. Kroeger, J. Mogul and C. Maltzahn, "Digital's web proxy traces," Available at ftp://ftp.digital.com/pub/DEC/traces/proxy/webtraces.html, August 1996.
[10] J. Ritter, "Why Gnutella Can't Scale. No, Really.," http://www.monkey.org/~dugsong/mirror/gnutella.html, February 2001.

Kay Sripanidkulchai