Halpin et al., WWW 2007
From ScribbleWiki: Analysis of Social Media
This page maintained by: Mahesh Joshi
The Complex Dynamics of Collaborative Tagging
Harry Halpin, Valentin Robu, Hana Shepherd
This paper is primarily concerned with empirically validating that the metadata generated through collaborative tagging stabilizes with time and use. Without such stability, the generated tags would be far less useful. The authors argue that collaborative tagging systems exhibit several traits associated with complex systems, such as a large number of users and the lack of a central coordinating authority. More importantly, such systems are known to produce "scale-free" power law distributions, which in this case implies a small set of frequently used tags and a large set of rarely used tags.
A generative framework is applied to the process of resource tagging, in which three entities interact: users, tags, and resources. In the process of tag assignment, the "rich get richer" phenomenon (preferential attachment) is observed: tags that are already used frequently are more likely to be used again. This implicitly creates a feedback cycle, which is itself a property of complex systems. Tags are, after all, a means of finding and re-finding resources efficiently. The efficacy of a tag is captured by an IDF-like measure called information value. A linear interpolation of a tag's information value and its preferential attachment is learned as the probability of that tag being added or reinforced.
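The interpolation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the mixing weight `lam`, and the way information values are normalized into a distribution are all assumptions, since this summary does not give the exact definitions.

```python
import random


def tag_probabilities(counts, info_value, lam):
    """Mix preferential attachment with information value.

    counts:     tag -> number of times the tag has been applied so far
    info_value: tag -> non-negative information value (hypothetical scores)
    lam:        assumed mixing weight in [0, 1]; 1.0 is pure preferential attachment
    """
    count_total = sum(counts.values())
    info_total = sum(info_value[t] for t in counts)
    return {
        t: lam * counts[t] / count_total
        + (1 - lam) * info_value[t] / info_total
        for t in counts
    }


def sample_tag(counts, info_value, lam, rng):
    """Draw one tag according to the interpolated probabilities."""
    probs = tag_probabilities(counts, info_value, lam)
    tags = list(probs)
    return rng.choices(tags, weights=[probs[t] for t in tags], k=1)[0]


# Toy example with made-up numbers.
counts = {"web": 3, "design": 1}
info_value = {"web": 1.0, "design": 1.0}
probs = tag_probabilities(counts, info_value, lam=0.5)
chosen = sample_tag(counts, info_value, 0.5, random.Random(0))
```

Reinforcing the sampled tag (incrementing its count) and sampling again would reproduce the feedback cycle: each use of a tag raises its probability of being used next time.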
The experimental dataset consists of 500 tagged websites from the "popular" section of del.icio.us and 250 from the "recent" section. The tag distributions for websites in the popular section follow a power law (with some error); those for websites in the recent section do not.
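One common, rough way to check for power law behavior is to fit a straight line to the rank-frequency curve on log-log axes. The sketch below does this with ordinary least squares. It is an assumption that this resembles the paper's procedure, which this summary does not describe, and the helper name `loglog_slope` is hypothetical. Least-squares fits on log-log data are known to be biased, so the slope is a diagnostic and not a rigorous estimate.

```python
import math


def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    A roughly constant negative slope suggests power law behavior;
    needs at least two positive frequencies.
    """
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic tag frequencies that follow frequency = 1000 / rank exactly.
synthetic = [1000.0 / rank for rank in range(1, 51)]
slope = loglog_slope(synthetic)
```

For the synthetic data the fitted slope is -1 up to floating-point error. Real tag counts will be noisier, especially in the tail of rarely used tags.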
KL divergence is used to estimate the time point at which the tag distribution for a website has stabilized. Two complementary methods are proposed:
1. Computing the KL divergence between every two consecutive time points (in this paper, a time point is one month).
2. Computing the KL divergence between each time point and the current (most recent) distribution.
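Both methods can be sketched over a list of monthly tag-count snapshots for one website. The names, the additive smoothing constant `eps` (used to avoid division by zero for tags missing from one snapshot), and the direction of the divergence are assumptions made for illustration; this summary does not specify how the paper handles them.

```python
import math


def kl_divergence(p_counts, q_counts, eps=1e-9):
    """KL(P || Q) between two tag-count dictionaries, with additive smoothing."""
    tags = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + eps * len(tags)
    q_total = sum(q_counts.values()) + eps * len(tags)
    total = 0.0
    for tag in tags:
        p = (p_counts.get(tag, 0) + eps) / p_total
        q = (q_counts.get(tag, 0) + eps) / q_total
        total += p * math.log(p / q)
    return total


def consecutive_kl(snapshots):
    """Method 1: divergence between each month and the month before it."""
    return [
        kl_divergence(snapshots[i + 1], snapshots[i])
        for i in range(len(snapshots) - 1)
    ]


def to_final_kl(snapshots):
    """Method 2: divergence between the most recent snapshot and each month."""
    final = snapshots[-1]
    return [kl_divergence(final, snap) for snap in snapshots]


# Made-up cumulative tag counts for three consecutive months.
snapshots = [
    {"web": 1, "design": 1},
    {"web": 3, "design": 1},
    {"web": 6, "design": 2},
]
step_kl = consecutive_kl(snapshots)
final_kl = to_final_kl(snapshots)
```

In this toy example the tag proportions stop changing after the second month, so the consecutive divergence drops to nearly zero. A series that falls toward zero and stays there is the signal that the distribution has stabilized.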
A small-scale qualitative analysis is performed by visualizing correlations between tags. The relationship between a pair of tags is quantified using a cosine-similarity-based measure computed from their co-occurrence statistics.
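A co-occurrence-based cosine measure of this kind can be sketched as below. The formula used here, the number of bookmarks carrying both tags divided by the square root of the product of the individual tag counts, is one standard form of cosine similarity over binary occurrence vectors. It is an assumption that it matches the paper's exact measure, and the function name and sample data are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def tag_cosine(posts):
    """Cosine similarity for every tag pair that co-occurs at least once.

    posts: iterable of tag lists, one list per bookmark.
    Returns {(tag_a, tag_b): similarity} with tag_a < tag_b alphabetically.
    """
    single = Counter()
    pair = Counter()
    for tags in posts:
        unique = sorted(set(tags))  # ignore repeated tags within one bookmark
        single.update(unique)
        pair.update(combinations(unique, 2))
    return {
        (a, b): n / math.sqrt(single[a] * single[b])
        for (a, b), n in pair.items()
    }


# Made-up bookmarks for illustration.
posts = [
    ["python", "code"],
    ["python", "code"],
    ["python", "web"],
]
similarity = tag_cosine(posts)
```

The resulting pairwise scores can serve as edge weights in a tag graph, which is the kind of structure a correlation visualization would draw.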