Gyöngyi et al, VLDB 2004

From ScribbleWiki: Analysis of Social Media

Combating Web Spam with TrustRank

Summarized by - Yichia

This paper proposes an algorithm called TrustRank, which semi-automatically separates reputable web pages from spam, given a small set of seed pages.

Web spam is a well-known problem on the World Wide Web. It refers to hyperlinked pages that are created with the intention of misleading search engines. For instance, a spam page may try to boost its ranking by adding a large number of keywords to its home page. Another common spamming technique is to create many bogus web pages that all point to a single page, since search engines consider the number of inlinks when ranking pages. While most people can easily identify web spam, it is not easy for a computer to detect such pages. The TrustRank algorithm first selects a small seed set of pages whose spam status is determined by human experts, and then identifies other pages that are likely to be good based on their hyperlink connectivity to the seeds. More specifically, TrustRank consists of five steps:

1. Evaluating seed-desirability of pages: this step finds the pages that will be most useful as seeds for identifying additional good pages. The paper introduces two seed selection strategies. One is inverse PageRank, which prefers pages from which many other pages can be reached by following outlinks; the other is high PageRank, which gives preference to pages with high PageRank.

2. Generating the corresponding ordering: in this step, pages are sorted in decreasing order of their seed-desirability scores.

3. Selecting good seeds: the spam status of the top-ranked pages is annotated by human experts, and the pages judged to be good become the seed pages.

4. Normalizing the static score: the scores assigned to the good seeds are scaled so that they sum to 1.

5. Computing TrustRank scores: the TrustRank score is essentially a biased PageRank computation in which the normalized seed scores replace the uniform distribution. In each iteration, the trust score of a node is split among the pages it links to and dampened by a factor as it propagates.
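
The five steps above can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the function names, the toy graph `web`, the set `oracle_good` standing in for the human experts, and the defaults (damping factor 0.85, 20 iterations) are all assumptions made for the example. Inverse PageRank is approximated here by running the same PageRank routine on the graph with every edge reversed.

```python
from typing import Dict, List, Set

Graph = Dict[str, List[str]]  # page -> pages it links to


def transpose(graph: Graph) -> Graph:
    """Reverse every edge, so that outlinks become inlinks."""
    reversed_graph: Graph = {page: [] for page in graph}
    for page, targets in graph.items():
        for target in targets:
            reversed_graph[target].append(page)
    return reversed_graph


def biased_pagerank(graph: Graph, bias: Dict[str, float],
                    damping: float = 0.85, iterations: int = 20) -> Dict[str, float]:
    """Iterate: score = damping * (score pushed along links) + (1 - damping) * bias."""
    scores = dict(bias)
    for _ in range(iterations):
        updated = {page: (1.0 - damping) * bias[page] for page in graph}
        for page, targets in graph.items():
            if targets:
                # A page's score is split evenly among the pages it links to.
                share = damping * scores[page] / len(targets)
                for target in targets:
                    updated[target] += share
        scores = updated
    return scores


def trustrank(graph: Graph, oracle_good: Set[str], num_candidates: int,
              damping: float = 0.85, iterations: int = 20) -> Dict[str, float]:
    uniform = {page: 1.0 / len(graph) for page in graph}
    # Step 1: seed-desirability via inverse PageRank (PageRank on reversed edges).
    desirability = biased_pagerank(transpose(graph), uniform, damping, iterations)
    # Step 2: order pages by decreasing seed-desirability.
    ordering = sorted(graph, key=lambda page: desirability[page], reverse=True)
    # Step 3: the "experts" (oracle_good) inspect the top candidates; keep the good ones.
    seeds = [page for page in ordering[:num_candidates] if page in oracle_good]
    if not seeds:
        raise ValueError("no good seed among the inspected candidates")
    # Step 4: static score distribution over the seeds, normalized to sum to 1.
    static = {page: 0.0 for page in graph}
    for page in seeds:
        static[page] = 1.0 / len(seeds)
    # Step 5: biased PageRank, with the seed distribution as the bias.
    return biased_pagerank(graph, static, damping, iterations)


# Hypothetical toy web: three good pages plus a link farm that boosts "spam".
web: Graph = {
    "directory": ["news", "blog"],
    "news": ["blog", "directory"],
    "blog": ["news"],
    "farm1": ["spam"],
    "farm2": ["spam"],
    "farm3": ["spam"],
    "spam": ["farm1", "farm2", "farm3"],
}
oracle_good = {"directory", "news", "blog"}

uniform_bias = {page: 1.0 / len(web) for page in web}
rank = biased_pagerank(web, uniform_bias)                      # ordinary PageRank
trust = trustrank(web, oracle_good=oracle_good, num_candidates=3)

for page in sorted(web, key=lambda p: trust[p], reverse=True):
    print(f"{page:10s} PageRank={rank[page]:.3f} TrustRank={trust[page]:.3f}")
```

In this toy graph no good page links into the link farm, so no trust reaches it: ordinary PageRank ranks "spam" highest because of its many inlinks, while its TrustRank score stays at zero. On a real web graph some good pages do link to spam, so the separation would be less clean than in this sketch.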

The authors evaluated their algorithms on the complete set of pages crawled and indexed by the AltaVista search engine as of August 2003. They compared TrustRank with PageRank (which does not incorporate any knowledge about the quality of a site, nor does it explicitly penalize badness). The results show that TrustRank can effectively remove most of the spam from among the top-scored sites using only 178 selected good seeds.

In summary, this paper provides details about formalizing the problem of web spam, defining metrics for assessing detection algorithms, presenting schemes for seed page selection, and using TrustRank for determining the goodness of pages.
