Page et al, 1999

From ScribbleWiki: Analysis of Social Media

Jump to: navigation, search

Note: In progress

Contents

The PageRank citation ranking: Bringing order to the Web

Authors:Page, Lawrence; Brin, Sergey; Motwani, Rajeev; Winograd, Terry

paper

Review -by Shilpa Arora

Introduction

Efficient ranking of webpages is important for search engines and users in order to make efficient use of the heterogeneous web. There has been work done in academic citation analysis but there are a lot of difference in academic papers and web pages. Firstly, there is no quality control on the web, its easy to generate huge number of pages programatically and inflate the citation count. Secondly, web pages vary a lot in terms of quality, usage, citations and length etc. The average web page quality observed by the user is much higher than the quality of the average web page. Also, the web environment consists of profit seeking ventures, thus evaluation strategies based on replicable features of the web are prone to manipulation. This paper aims at differentiating between web pages based on overall relative importance.

Considering the links as citations, number of backlinks (link from outside to this page) can be used to rank the pages. But every link is not the same. If a web page has a link from Yahoo home page, this implies that it is an important page and it should be ranked higher. PageRank uses the link structure to get a good approximation of importance of a web page.

Definition of Page Rank : Propagation of ranking through links

A page has higher rank if sum of ranks of its back links is high. This handles both the cases when a page has many backlinks and when a page has a few highly ranked backlinks.For a web page u, let <math> N_{u} </math> be the number of links from u and let <math> B_{u} </math> be the set of pages that point to u and c be the normalization factor. The simplified page rank is defined as:

<math> R(u)=c\sum_{v\in B_{u}} \cfrac{R(v)}{N_{v}} </math>

The simplified page rank fails when there are loops in the link structure such as two pages pointing to each other and to no where else. This can result in a sort of trap called rank sink. To avoid this, the page rank can be modified using the source of the rank - E(u):

<math> R'(u)=c\sum_{v\in B_{u}} \cfrac{R'(v)}{N_{v}} + cE(u)</math>

such that c is maximized and <math>{\mid\mid R'\mid\mid}_{u} =1 </math>(L1 norm of R')

Looking at page rank from the point of view of random walks on graphs, if a real web surfer gets into a small loop of pages, it is unlikely that the surfer will get stuck in the loop forever. He/she would jump to some other page. The E(u) factor models the behavior that the surfer periodically gets bored and jumps to a random page based on the distribution in E. Normally, E is chosen as uniform over all web pages.

Dangling links: Links that point to a page with no outgoing links are called dangling links. The dangling links are handled by removing them until page ranks are calculated and adding them back in and doing some recalculations.

Implementation: Speed is very important for the web. For an efficient implementation, the link structure is sorted by parent ID to make access to the link data base structure linear. The initial assignment is chosen carefully as it can help speed up the convergence a lot.

Convergence Properties:Page rank scales very well even for extremely large collections as the scaling factor is roughly linear in log n. The graph below shows the rate of convergence for full-size and half-size link databases. One of the reasons Page Rank converges rapidly is that the web is like an expander-like graph. An expander graph is the one in which every subset of the nodes S has a neighborhood that is larger than some factor <math> \alpha </math> times <math> \mid S \mid </math>, where <math> \alpha </math> is called the expansion factor. A random walk through a graph is a stochastic process where at any node in the graph, you choose an outedge uniformly at random. A random walk is called rapidly-mixing if it converges to a limiting distribution on the set of nodes in a graph quickly (logarithmic in the size of the graph). PageRank computation determines the limiting distribution of a random walk on the web graph. The importance ranking of a node is essentially the limiting probability that the random walk will be at that node after a sufficiently large time. Saying that PageRank computation terminates in logarithmic time is equivalent to saying that the random walk is rapidly mixing or the web graph has a good expansion factor. A graph has a good expansion factor if and only if the largest eigen value is sufficiently larger than the second-largest eigenvalue.

Image:Page1.JPG

Searching with PageRank

PageRank was tested on two search engines - a simple title-based search engine and Google. Benefits of PageRank are greatest for underspecified queries.

Title Search

Search engine that used only titles of 16 million web pages was used to test Pagerank. The search engine finds the web pages which contain all the query words in its title. And then it sorts the results using Pagerank. The results were compared to Altavista and Pagerank based search engine performed better than Altavista by giving more relevant results at the top of the list.

Rank Merging

Title match ensures high precision and Pagerank ensures high quality and thus they got good results for title search. But in title search, for queries like University, recall is not that important because there is a lot more than what the user can look at. For searches where recall is important, the traditional IR scores over full-text and PageRank should be combined. This is called Rank Merging.

Common Case approach

Pagerank follows the common case approach i.e. a random user is quite likely to be looking for the page matching the query and regularly/commonly accessed by the users. This is different from finding a site that contains the most information about the topic. Pagerank also captures a kind of collaborative trust i.e. if a page is linked from a trustworthy or authoritative page, it is more likely to be trustworthy or authoritative as well.

Personalized PageRank

To make up for the rank sinks i.e. cycles with no outedges, vector E which refers to a vector over the web pages as a source of the rank is used. E can be understood as the distribution of web pages that a random surfer periodically jumps to. Choosing E as an uniform distribution would result in similar pages with many related links receiving very high ranks. For example, highly interlinked mailing list archives. If E is chosen to be a single web page, page ranks are generated based on the chosen webpage. The chosen webpage gets the highest rank, followed by its immediate links. This can be used to generate personalized pagerank for users given their bookmarks or homepage as input.

Applications

The difference between the PageRank and usage can be used to find things that the people like to look at but do not want to link to it. For example, the authors observed that the pornography sites have high usage but low pagerank.

Pagerank is a predictor for backlinks. It is better than citation counts as it avoids the local maxima that the citation counting gets stuck in. Citation counting tends to get stuck in local cluster of related pages, taking a long time to branch out, whereas the Pagerank gives preference to the related pages but doesn't get stuck there.

Pagerank can also be used to decide if a site is trustworthy or not based on the backlinks it has, since a page with backlinks from an important page like Stanford's website is highly likely to be trustworthy.

Conclusion

Pagerank provides a global ranking for all the web pages based on their location and relation with the other pages in the web graph. Backlinks provide a kind of peer review and intuitively, the review or link from an "important" page certifies the credibility of the given page. Pagerank can help in identifying a small subset of important webpages which an be used to answer most of the queries and thus full database of webpages doesn't need to be used. Pagerank can also be used to find representative pages to display for a cluster center.

Views
Personal tools
  • Log in / create account