Kleinberg et al, ICCC 1999
From ScribbleWiki: Analysis of Social Media
Contents |
The Web as a graph: Measurements, models and methods
Authors: Jon M. Kleinberg, Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew S. Tomkins
Review -by Shilpa Arora
Introduction
In this paper the authors describe two algorithms that operate on the Web graph, addressing problems from Web search and automatic community discovery: Kleinberg's HITS method (Kleinberg, 1999) and enumeration of bipartite cliques (Kumar et al., 1999). These two papers are motivated by the underlying idea that for any particular topic, there tends to be a set of "authoritative" pages focused on the topic and a set of "hub" pages each containing links to useful, relevant pages on the topic. Kleinberg's HITS method is a search algorithm that distills high quality pages for a topic query and the bipartite clique method enumerates all topics that are represented on the web in terms of dense sets of hub/authority pages.
HITS algorithm
HITS algorithm searches for high quality relevant pages for a given topic query.
HITS algorithm has two main steps:
- Sampling Step: The sampling step creates a focused collection of several thousand Web pages rich in relevant authorities for the topic queried. This subgraph is then searched for hubs and authorities on the topic. The algorithm first constructs a subset called root set (about 200 pages) using keyword queries. The root set is expanded into a base set by including pages that are linked to by the pages in the root set and all pages that link to pages in the root set. Intuition behind this is that the prominence of authoritative pages is normally due to the endorsement of many relevant pages that are not in themselves prominent. Base set normally consists of about 1000-3000 pages.
- Weight-propagation step: This step determines the numerical estimates of scores for the hub and authority pages using an iterative process. Each page is associated with a authority weight and hub weight. Authority weight is updated as a sum of hub weights of all the pages that point to it and similarly the hub weight of a page is updated as a sum of authority weights of all the pages it points to.
The output of the HITS algorithm is a short list consisting of pages with the largest hub and authority weights.
An interesting property of HITS algorithm is that after collecting the initial root set using the key terms. It doesn't look at the text of the web pages. Thus, HITS is a purely link based algorithm after collecting the initial root set with no further regard for query terms. And it was found to give relevant results for queries.
Trawling the web for cyber-communities
An example of a cyber-community is a community of aficionados of commercial aircraft manufactures, who create hub-like pages that co-cite the authoritative pages such as the boeing homepage, airbus homepage and embraer homepage. To identify such cyber-communities in the web-graph, (Kumar et al.,1999) use a trawling algorithm to detect bipartite cores in the graph. A bipartite clique K_{i,j} is a graph in which every one of the i nodes has an edge directed to each of the j nodes. A bipartite core C_{i,j} is defined as a graph on i+j nodes that contains at least one K_{i,j} as a sub-graph. A bipartite core represents a cyber-community. For any topic, sufficiently well represented, there will be a bipartite core in the Web graph. An example of a bipartite core is shown in Figure 1.
The naive algorithm for detecting bipartite cores can be computationally expensive. (Kumar et. al., 1999) propose an elimination-generation algorithm which performs a number of sequential passes over the web graph where elimination and generation processes are interleaved. The elimination process eliminates certain edges from the graph which do not satisfy the necessary (but not necessarily sufficient) conditions for it to be part of a bipartite core. An example of such a condition is that for identifying a C_{4,4}, the in-degree of the nodes on the right should be more than 3. The generation process on the other hand verifies the boundary cases with conditions such as: Considering a C_{4,4}, if u is a node with in-degree 4, then the four nodes of incoming links are verified for neighborhood overlap of at least four.
(Kumar et. al., 1999) enumerated C_{i,j}'s for values of i ranging from 3 to 9 and j ranging from 3 to 20. The results suggested that the web graph has several hundred to thousand such cores and only a small fraction of these are coincidences and the vast majority correspond to communities with a definite topic focus.
Measurements & Comparison to traditional random graph models
With the algorithms discussed above, the authors were able to study some interesting characteristics of the web graph which were not possible with the traditional random graph models. Some of these properties are:
- Degree distribution: In and Out degree on the web follow the Zipfian distribution and it was observed that the probability that a node has an in-degree of i is 1/i^alpha where alpha=2. The average out-degree was observed to be 7.2.
- Number of bipartite cores (C_{i,j}) for the web graph with average out-degree as 7.2 was observed to be
- Connectivity of local subgraphs: Analysis show that the web graph consists of a small biconnected nucleus which represents the top hubs and authorities computed by HITS. Bi-connectivity is defined as: two nodes u and v are said to be biconnected if there in no third node w so that w lies on all u-v paths.
Random graph model
In this paper, authors propose a random graph model to capture the intuition behind web page creation process, described as follows:
First a few scattered pages begin to appear about a topic. As this reaches a critical mass, people start to link together the web pages creating a locally dense subgraph around the topic of interest.
Creating a random graph model for the web would allow modeling various structural properties of the web graph such as node degrees, neighborhood structures and communities. It would also allow us to predict the behavior of algorithms on the web and also make predictions about the shape of the web graph in future. In their model they try to reflect the structural phenomena observed in the web graph.
The model generation process consists of four stochastic process: creation and deletion processes for nodes and edges. Probability distributions are sampled at each step to determine the node to add or delete the edges out of, the number of edges to add, delete or copy from one node to another.
References
J. Kleinberg, Authoritative sources in a hyperlinked environment, J. of ACM, 1999.
S.R. Kumar, P. Raghavan, S. Rajagopalan and A. Tomkins, Trawling emerging cyber-communities automatically. Proc 8th WWW conference, 1999.