Bao et al, WWW 2007
Optimizing Web Search Using Social Annotations
This paper presents two uses of "social annotations" (here, del.icio.us tags) for web retrieval. The first, SocialSimRank (SSR), uses these annotations as an additional feature in document-query similarity calculations. The second, SocialPageRank (SPR), gives a static ranking of documents based on their popularity on del.icio.us.
SocialSimRank uses the overlap between a document's tags and the query terms, as well as the network defined by the documents, users and tags on del.icio.us, to compute a similarity measure between the query and documents. This effectively expands each document's representation with its explicitly assigned tags and provides a method for calculating the similarity between the query terms and those tags. The heart of SSR is the calculation of a tag-tag similarity measure through an iterative algorithm defined over a bipartite graph of tags and documents, with edge weights given by the number of users who have assigned a tag to a page. Once this tag-tag similarity matrix is computed, the similarity of a document to the query can be computed by treating the query as a new tag set and comparing this query tag set with the tags explicitly assigned to the document.
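The iteration described above can be illustrated with a minimal sketch. This is a SimRank-style update over the tag-page bipartite graph, not the paper's exact formulation: the function names, the decay constant, the fixed iteration count and the min/max edge weighting are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product


def social_sim_rank(counts, decay=0.7, iterations=5):
    """Return tag-tag similarities from a SimRank-style iteration.

    counts maps (tag, page) -> number of users who assigned tag to page.
    decay and iterations are illustrative assumptions, not tuned values.
    """
    pages_of = defaultdict(list)  # tag -> pages it was assigned to
    tags_of = defaultdict(list)   # page -> tags assigned to it
    for tag, page in counts:
        pages_of[tag].append(page)
        tags_of[page].append(tag)
    tags, pages = sorted(pages_of), sorted(tags_of)

    # Start from the identity: every node is similar only to itself.
    tag_sim = {(a, b): float(a == b) for a, b in product(tags, repeat=2)}
    page_sim = {(p, q): float(p == q) for p, q in product(pages, repeat=2)}

    def ratio(x, y):
        # Assumed weighting: edges with similar user counts contribute more.
        return min(x, y) / max(x, y)

    for _ in range(iterations):
        new_tag_sim = {}
        for a, b in product(tags, repeat=2):
            if a == b:
                new_tag_sim[(a, b)] = 1.0
                continue
            total = sum(
                ratio(counts[(a, p)], counts[(b, q)]) * page_sim[(p, q)]
                for p in pages_of[a] for q in pages_of[b]
            )
            new_tag_sim[(a, b)] = decay * total / (len(pages_of[a]) * len(pages_of[b]))

        new_page_sim = {}
        for p, q in product(pages, repeat=2):
            if p == q:
                new_page_sim[(p, q)] = 1.0
                continue
            total = sum(
                ratio(counts[(a, p)], counts[(b, q)]) * tag_sim[(a, b)]
                for a in tags_of[p] for b in tags_of[q]
            )
            new_page_sim[(p, q)] = decay * total / (len(tags_of[p]) * len(tags_of[q]))

        tag_sim, page_sim = new_tag_sim, new_page_sim
    return tag_sim


def query_similarity(query_terms, doc_tags, tag_sim):
    """Treat the query terms as tags and sum their similarity to the document's tags."""
    return sum(tag_sim.get((q, t), 0.0) for q in query_terms for t in doc_tags)
```

With this sketch, a query term that never appears among a page's tags can still match the page if the term and the page's tags were applied to overlapping sets of pages, which is the expansion effect the summary describes.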
SocialPageRank defines a static ranking similar to PageRank, but uses the information provided by del.icio.us tags instead of hyperlinks. In SPR, information about the popularity of a document is propagated through a network of users, tags and documents until a convergent state is reached. The final weights assigned to pages define a static ranking over the pages and can be used as a feature in document retrieval algorithms.
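The propagation idea can be sketched as follows. This is a simplified illustration under stated assumptions rather than the authors' exact update rule: scores flow from pages to users to tags to pages and then back along the same path, and the page scores are renormalized after each round. The function name, the (user, tag, page) input format, the normalization and the stopping tolerance are all hypothetical choices.

```python
from collections import defaultdict


def social_page_rank(bookmarks, iterations=50, tol=1e-9):
    """Return a popularity score per page from (user, tag, page) triples."""
    pu = defaultdict(int)  # (page, user) -> number of annotations
    ua = defaultdict(int)  # (user, tag)  -> number of annotations
    ap = defaultdict(int)  # (tag, page)  -> number of annotations
    for user, tag, page in bookmarks:
        pu[(page, user)] += 1
        ua[(user, tag)] += 1
        ap[(tag, page)] += 1

    def propagate(weights, scores, reverse=False):
        # Push scores along weighted edges; reverse=True flips edge direction.
        out = defaultdict(float)
        for (src, dst), w in weights.items():
            if reverse:
                src, dst = dst, src
            out[dst] += w * scores.get(src, 0.0)
        return out

    def normalize(scores):
        total = sum(scores.values())
        return {k: v / total for k, v in scores.items()} if total else dict(scores)

    pages = {page for page, _ in pu}
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        users = propagate(pu, rank)                  # pages -> users
        tags = propagate(ua, users)                  # users -> tags
        mid = propagate(ap, tags)                    # tags  -> pages
        tags = propagate(ap, mid, reverse=True)      # pages -> tags
        users = propagate(ua, tags, reverse=True)    # tags  -> users
        new_rank = normalize(propagate(pu, users, reverse=True))  # users -> pages
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta < tol:  # stop once scores have effectively converged
            break
    return rank
```

Because each round is a fixed linear map followed by normalization, the loop behaves like a power iteration; a page bookmarked by more users under more popular tags ends up with a higher score.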
Experiments & Analysis
Experiments were conducted using two query sets: (1) a manually created query set of 50 queries and relevance judgments by computer science students and (2) an automatically created query set made from 3000 Open Directory Project category names with their corresponding documents as the relevant set.
The authors show improvement using SSR and SPR over a range of experiments. SSR provides retrieval performance improvements over a BM25 baseline algorithm, and also shows consistent improvement over a social annotation baseline of simple term-matching between the query terms and assigned tags. SPR likewise improves retrieval performance over the BM25 baseline and also over Google's PageRank (as provided by the Google Toolbar).
The authors acknowledge several drawbacks to the approach: (1) many pages don't have any del.icio.us tags, (2) tags can be ambiguous (e.g., airplane ticket vs. movie ticket) and (3) annotations are vulnerable to spam.