Wilkinson et al., ACM 2007
Cooperation and Quality in Wikipedia
- Authors: Wilkinson D. and Huberman B.
- Conference: Proceedings of the 2007 international symposium on Wikis, Montreal, Canada, 2007.
- Link: http://www.hpl.hp.com/research/idl/papers/wikipedia/wikipedia07.pdf
- Maintainer: Sameer Badaskar
Wikipedia is a good example of the "Wisdom of the Crowds", where collaboration between numerous authors results in articles of quality comparable to or better than an article composed by one individual (expert). Previous studies of Wikipedia articles have used accuracy, quality of language, and textual comparison to standard encyclopediae as metrics for article quality. This paper shows experimentally the correlation between the quality of an article and parameters such as the number of edits, the number of unique editors, and article age. The main contributions of this paper are:
- An assessment of the growth of articles in Wikipedia.
- A demonstrated correlation between Wikipedia article quality and parameters such as the number of edits and the number of editors.
Growth of Articles
When an article has many edits and editors, the editors debate its content, so edits made to an article tend to lead to more edits. Let <math> n(t) </math> be the total number of edits to an article at time <math> t </math>. The number of edits at time <math> t + \Delta t </math>, <math> n(t + \Delta t) </math>, is expressed in terms of a random fraction (in (0,1)) of the number of edits at time <math> t </math> as follows:
<math> n(t + \Delta t) = (a + f(t)) n(t) </math>
where <math> a </math> is a constant and <math> f(t) </math> is a zero-mean random value. It can be shown that the distribution of <math> n(t) </math> is lognormal: taking logarithms turns the repeated product into a sum of random terms, and such a sum tends toward a normal distribution. In other words, the number of edits per article follows a lognormal distribution. A histogram of the logarithm of the number of edits to articles conforms to this lognormal distribution, as shown in Figure 1.
In Figure 1, note that the mean of the lognormal distribution increases with time, from 120 weeks to 240 weeks. This means that articles with edits invite more edits, and the process does not converge as the article ages. The lognormal distribution implies that a small portion of the articles have a large number of edits, which has a bearing on the overall quality of Wikipedia.
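The growth model above can be explored with a short simulation. The following is a minimal sketch, not the paper's code: the values of `a` and `noise`, the uniform distribution chosen for f(t), the starting count of one edit, and the use of one step per week are all illustrative assumptions, and the edit count is treated as a continuous quantity for simplicity.

```python
import math
import random
import statistics


def simulate_log_edit_counts(num_articles, num_steps, a=1.05, noise=0.2, seed=0):
    """Simulate n(t + dt) = (a + f(t)) * n(t) for many independent articles.

    a and noise are illustrative values, not taken from the paper.
    f(t) is drawn uniformly from [-noise, noise], so it has zero mean,
    and a - noise > 0 keeps every multiplier positive.
    Returns one log edit count per article.
    """
    rng = random.Random(seed)
    log_counts = []
    for _ in range(num_articles):
        n = 1.0  # assumption: each article starts with a single edit
        for _ in range(num_steps):
            n *= a + rng.uniform(-noise, noise)
        log_counts.append(math.log(n))
    return log_counts


def skewness(values):
    """Population skewness; close to 0 for a symmetric (e.g. normal) sample."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Two simulated populations, standing in for articles aged 120 and 240 weeks.
younger = simulate_log_edit_counts(2000, 120)
older = simulate_log_edit_counts(2000, 240)

print("mean log edits (120 steps):", statistics.fmean(younger))
print("mean log edits (240 steps):", statistics.fmean(older))
print("skewness of log counts:", skewness(younger))
print("skewness of raw counts:", skewness([math.exp(v) for v in younger]))
```

Under these assumptions the log counts should look roughly symmetric while the raw counts are strongly right-skewed, which is the signature of a lognormal distribution, and the mean of the log counts should be larger for the older population.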
Given that the distribution of edits across articles is lognormal, the correlation between article quality and editing activity is analyzed. The number of edits to an article and the number of distinct editors are compared between equal-sized populations of Featured and Non-featured Wikipedia articles. Featured articles are chosen by the Wikipedia community based on accuracy, neutrality, and style of writing. Thus, Featured articles serve as a reference for high-quality articles.
An article about a popular topic will have more edits than an article about an unpopular one. Thus, a population of articles that contains more popular topics will have more edits. To compensate for this, both the Featured and Non-featured articles are grouped by popularity. The measure of popularity of an article is chosen to be its PageRank (Google).
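The popularity grouping can be sketched as follows. This is only an illustration of comparing Featured and Non-featured articles within the same popularity group: the function name, the tuple layout, the integer PageRank bins, and every number in `toy_articles` are invented for the example and are not data from the paper.

```python
import statistics
from collections import defaultdict


def mean_edits_by_popularity(articles):
    """Average edit counts per (PageRank bin, featured flag).

    articles: iterable of (pagerank_bin, is_featured, edit_count) tuples.
    This layout is hypothetical, chosen only for the example.
    Returns {pagerank_bin: {True: mean_edits, False: mean_edits}}.
    """
    groups = defaultdict(lambda: defaultdict(list))
    for pagerank_bin, is_featured, edit_count in articles:
        groups[pagerank_bin][is_featured].append(edit_count)
    return {
        pagerank_bin: {flag: statistics.fmean(counts) for flag, counts in by_flag.items()}
        for pagerank_bin, by_flag in groups.items()
    }


# Made-up records, NOT from the paper: (pagerank_bin, is_featured, edit_count).
toy_articles = [
    (5, True, 400), (5, True, 600), (5, False, 100), (5, False, 300),
    (7, True, 2000), (7, False, 900), (7, False, 1100),
]

result = mean_edits_by_popularity(toy_articles)
for pagerank_bin in sorted(result):
    print(pagerank_bin, result[pagerank_bin])
```

Comparing the two populations only within the same bin keeps a popular Non-featured article from being measured against an obscure Featured one.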
Differences in article age within both the Featured and Non-featured populations are compensated for using the mean and variance of the number of edits.
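One plausible way to carry out this age compensation is to standardize the logarithm of each article's edit count against the mean and standard deviation for articles of the same age. The sketch below assumes that approach, which may differ from the paper's exact procedure; the function name, the tuple layout, and the sample values are hypothetical.

```python
import math
import statistics
from collections import defaultdict


def age_normalized_log_edits(articles):
    """Standardize log edit counts within each age group.

    articles: list of (article_id, age_in_weeks, edit_count) tuples.
    This layout is hypothetical, chosen only for the example.
    Returns {article_id: z-score of log(edit_count) among same-age articles}.
    """
    logs_by_age = defaultdict(list)
    for _, age, edits in articles:
        logs_by_age[age].append(math.log(edits))

    # Mean and population standard deviation of log edits for each age.
    stats_by_age = {
        age: (statistics.fmean(logs), statistics.pstdev(logs))
        for age, logs in logs_by_age.items()
    }

    scores = {}
    for article_id, age, edits in articles:
        mean, sd = stats_by_age[age]
        # A group with no spread carries no information, so score it as 0.
        scores[article_id] = (math.log(edits) - mean) / sd if sd > 0 else 0.0
    return scores


# Made-up values: the older articles have far more edits in absolute terms.
sample = [("A", 120, 10), ("B", 120, 1000), ("C", 240, 100), ("D", 240, 10000)]
scores = age_normalized_log_edits(sample)
print(scores)
```

In this toy sample, articles B and D receive the same normalized score even though D has ten times as many edits, because each is measured only against articles of its own age.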
The following observations are then made:
- Featured (or high-quality) articles have more edits than Non-featured articles.
- Featured articles are edited by more distinct editors than Non-featured articles.
- Featured articles have more comments on their Talk pages than Non-featured articles.
To sum up, these three observations mean that higher editing activity is associated with higher article quality. In other words, article quality is well correlated with editing activity.
The paper makes two important observations:
- The distribution of edits among articles is lognormal, which means that a small proportion of articles accrue a large number of edits while most articles receive few edits.
- Articles with high editing activity are often of high quality.
It follows that:
- A small percentage of Wikipedia articles are of high quality.
- The average quality of Wikipedia increases with time, since the mean of the lognormal distribution increases with time.