Voss, ICISSI 2005
From ScribbleWiki: Analysis of Social Media
Created and maintained by Sachin Agarwal (User:Sachina)
Related page: Wikipedia - an always growing Social Media
Voss, J. Measuring Wikipedia. Proceedings of 10th International Conference of the International Society for Scientometrics and Informetrics, (Stockholm, Sweden), 2005.
Wikipedia is a collaborative project that uses wiki software to create an encyclopedia. It is editable by everyone and maintains the version history of every article. This paper is a general overview of Wikipedia research. The author analyzes the fundamental components of Wikipedia (authors, articles, edits and links) as well as content and quality. The author also measures general trends that are typical for all Wikipedias but vary in detail between languages.
Wikipedia runs on wiki software, a type of collaborative software first invented by Ward Cunningham in 1995 [ Cunningham & Leuf, 2001 ]. He created a simple knowledge-management tool and named it WikiWikiWeb, using the Hawaiian term “wiki” for “quick” and alluding to the WWW. A wiki is a collection of hypertext documents that anyone can edit directly. Every edit is recorded and can therefore be retraced by any other user. Each version of a document is available in its revision history and can be compared to other versions.
Wikipedia originated from the Nupedia project. Nupedia was founded by Jimmy Wales, who wanted to create an online encyclopaedia licensed under the GNU Free Documentation License. In January 2001 Wikipedia was started as a side project to allow collaboration on articles before they entered the lengthy peer review process. It soon grew faster and attracted more authors than Nupedia, which was closed in 2002. In March 2005 there were 195 registered languages; 60 languages had more than 1000 articles and 21 languages more than 10000. In June 2003 the Wikimedia Foundation was founded as an independent institution. It aims to develop and maintain open-content, wiki-based projects and to provide the full contents of these projects to the public free of charge. Meanwhile Wikimedia also hosts wiki projects to create dictionaries, textbooks, quotes, media etc. In 2004 a German non-profit association was founded to assist the project locally. Associations in other countries are following. Their main goals are collecting donations, promoting Wikipedia and organizing social activities within the community of “Wikipedians”.
Wikipedias are hosted by the Wikimedia Foundation using MediaWiki, a GPL-licensed wiki engine that is developed especially for Wikimedia projects but is also used elsewhere. The author presents an overview of Wikipedia's growth, followed by an analysis of Wikipedia’s fundamental components and a discussion of content and quality.
Wikipedia began to grow exponentially around 2002, soon after it started.
The above figure shows six fundamental metrics of Wikipedia’s growth. These are:
- Database size (combined size of all articles including redirects in bytes)
- Total number of words (excluding redirects and special markup)
- Total number of internal links (excluding redirects and stubs)
- Number of articles (containing at least one internal link)
- Number of active Wikipedians (contributed 5 times or more in a given month)
- Number of very active Wikipedians (contributed 100 times or more in a given month)
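The two activity thresholds in the list above can be sketched in a few lines. This is a minimal illustration, not the paper's tooling: the `edit_log` format (one user name per edit in a given month) and the sample names are hypothetical. Since both thresholds are lower bounds, every very active Wikipedian is counted as active too.

```python
from collections import Counter

ACTIVE_MIN_EDITS = 5          # "active": 5 or more contributions in a month
VERY_ACTIVE_MIN_EDITS = 100   # "very active": 100 or more contributions in a month


def classify_contributors(edit_log):
    """Return (active, very_active) counts for one month's edit log.

    edit_log is a hypothetical list of user names, one entry per edit.
    """
    edits_per_user = Counter(edit_log)
    active = sum(1 for n in edits_per_user.values() if n >= ACTIVE_MIN_EDITS)
    very_active = sum(1 for n in edits_per_user.values() if n >= VERY_ACTIVE_MIN_EDITS)
    return active, very_active


# Hypothetical month: alice made 120 edits, bob 7, carol 2.
sample_log = ["alice"] * 120 + ["bob"] * 7 + ["carol"] * 2
print(classify_contributors(sample_log))  # (2, 1)
```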
The graph can be divided into three phases. After a first, linear phase, exponential growth started around April 2002, once the project exceeded 10 active Wikipedians and 2000 articles. Until February 2005 all metrics increased at a similar rate of around 18 percent per month; only the number of articles grew more slowly, at 13.8 percent. Around March 2005 growth abruptly changed to linear, accompanied by a big jump in total growth. A similar effect can only be observed in the Japanese Wikipedia around November 2003. All other Wikipedias are in the first (linear) or second (exponential) phase, including the English one, whose rate of increase merely dropped around March 2005. The third (linear) phase of the German Wikipedia is probably caused by technical limits and may only be a transitional state: at regular intervals Wikipedia becomes slow because software and hardware cannot cope with the exploding number of users.
Generally, all measures except the number of articles increase in the same way. This means that article size, the number of links and the number of active Wikipedians per article rise continuously. The effect holds across languages, with different expansion rates. German is one of the faster-growing Wikipedias, with a word-number increase of 15.6% per article and month in 2004 (Japanese: 16%, Danish: 7.2%, Croatian: 18.6%, English: 7.8%).
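To get a feel for these monthly rates, it can help to convert them into doubling times. The sketch below assumes steady compound growth at the quoted rate, which is a simplification; the function name is illustrative only.

```python
import math


def doubling_time_months(monthly_growth_rate):
    """Months needed to double under constant compound growth.

    monthly_growth_rate is a fraction, e.g. 0.18 for 18 percent per month.
    """
    return math.log(2) / math.log(1 + monthly_growth_rate)


# Rates quoted in the text: about 18% for most metrics, 13.8% for articles.
print(round(doubling_time_months(0.18), 1))   # 4.2
print(round(doubling_time_months(0.138), 1))  # 5.4
```

Under that assumption, most metrics doubled roughly every four months, while the number of articles took a little over five.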
Wikipedia assigns a unique name to every article. Because the special wiki syntax can be learned quickly, articles or single chapters can be edited directly without knowledge of HTML. Extensions allow timelines, hieroglyphs and formulas in LaTeX. Easy graph drawing, map generation and music markup are also being developed. Each article has a Talk page for discussions. Articles on the project itself belong to a special Wikipedia namespace. Uploaded images and other media files can be described on pages in the Media namespace. Each logged-in user has a User page for introducing themselves. The Talk pages of User pages are essential for personal communication within the project. There is also the Template namespace for text modules that can be used in multiple articles, the Category namespace for categories that can be assigned to articles, and the MediaWiki and Help namespaces for localisation and documentation of the software.
The above figure compares the namespace sizes of the German (de), Japanese (ja), Danish (da) and Croatian (hr) Wikipedias, which is a simple method for getting an overview of a Wikipedia's structure. It compares the percentage of all pages (excluding redirects) outside the MediaWiki and Help namespaces. Talk pages are counted together with their corresponding pages, except for the talk pages of articles in the default namespace. Unsurprisingly, normal articles make up the majority of pages, at 60 to 80 percent. Together with talk and media pages they amount to around 90 percent. Categories and user pages are used to a varying extent. The German and Japanese Wikipedias have similar structures, but the German Wikipedia has fewer normal articles and more media, talk and other pages.
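The counting rules just described can be sketched as follows. The page counts are invented for illustration and are not figures from the paper; the namespace labels and the `" talk"` suffix convention are likewise assumptions of this sketch.

```python
EXCLUDED_NAMESPACES = {"MediaWiki", "Help"}  # left out of the comparison


def namespace_shares(page_counts):
    """Percentage of pages per namespace, following the rules in the text.

    page_counts maps a namespace label to its number of non-redirect pages.
    Talk namespaces such as "User talk" are folded into their base namespace;
    plain "Talk" (article talk) is kept as its own group.
    """
    merged = {}
    for namespace, count in page_counts.items():
        if namespace.endswith(" talk"):
            namespace = namespace[: -len(" talk")]
        if namespace in EXCLUDED_NAMESPACES:
            continue
        merged[namespace] = merged.get(namespace, 0) + count
    total = sum(merged.values())
    return {ns: 100.0 * n / total for ns, n in merged.items()}


# Hypothetical counts, for illustration only.
counts = {
    "Article": 700, "Talk": 100,
    "Media": 80, "Media talk": 20,
    "User": 40, "User talk": 30,
    "Category": 30, "MediaWiki": 500, "Help": 50,
}
shares = namespace_shares(counts)
print(shares["Article"], shares["User"])
```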
Wiki articles can have a large number of authors since anyone can edit an article. The number of distinct authors per Wikipedia article (diversity) follows a power law distribution.
The above figure shows the distribution of authors per article, based on the data of the first Wikipedia CD-ROM (Directmedia, 2004). It contains all articles of the German Wikipedia as of September 1, 2004. For articles with between 5 and 40 distinct authors, the number follows a power law with γ ≈ 2.7. Some special articles have more authors: ‘September 2003’, ‘Gerhard Schröder’, ‘März 2004’ and ‘2004’ were leading with more than 100 contributors each. Almost half of the articles (47.9%) have fewer than 5 distinct authors, and more than a quarter (27.6%) of all articles in the German Wikipedia of September 1, 2004 had been edited by only one logged-in user. Anonymous edits are omitted in this calculation. The share of anonymous edits varies by language between 10% (Italian) and 44% (Japanese).
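One simple way to estimate such an exponent is a least-squares line on a log-log scale. The sketch below is only an illustration of that idea: the summary does not say how the paper's fit was obtained, and the data here is synthetic, generated from an exact power law over the 5 to 40 author range rather than taken from the CD-ROM.

```python
import math


def fit_power_law_exponent(points):
    """Estimate gamma in y ~ x**(-gamma) by least squares on log-log values.

    points is a list of (x, y) pairs with x > 0 and y > 0.
    """
    xs = [math.log(x) for x, _ in points]
    ys = [math.log(y) for _, y in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return -covariance / variance  # slope of the line is -gamma


# Synthetic counts: number of articles with k distinct authors, k = 5..40.
synthetic = [(k, 1e6 * k ** -2.7) for k in range(5, 41)]
gamma = fit_power_law_exponent(synthetic)
print(round(gamma, 2))  # 2.7
```

On real, noisy counts a log-log regression can be biased, so the result should be read as a rough estimate.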
Every edit a person makes to an article is recorded. It is also listed in the article’s version history, where one can highlight the differences between selected versions. Users can add articles to their watch list to be notified of changes. All changes are listed on the recent changes page, which is an important place to observe new contributors and suspicious edits. Some patterns, such as two authors who revert each other's edits (an “edit war”), can be detected automatically. At the time the paper was written there were on average 16 edits per minute in the English Wikipedia, with daily oscillation (6.6 in the German version). The percentage of anonymous edits varies by language (English: 22%, German: 26%, Japanese: 42%, French: 15%, Croatian: 16%, up to the end of 2004). The author uses the shared information distance between two versions to measure the size of an edit. The shared information distance between two strings is calculated from their Kolmogorov complexities (the minimal number of bits needed to encode a string) and the complexity of the string that results from concatenating them. Shared information distance could also be used to determine main authors and to detect how parts of an article have been moved to another article.
The articles on Wikipedia and the links between them form a network of concepts. Many real-world networks have a strongly right-skewed degree distribution with power laws in their tails. Networks with a power-law degree distribution are also called scale-free networks.
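Kolmogorov complexity cannot be computed exactly, so in practice it is usually approximated by the output size of a real compressor. The sketch below uses the normalized compression distance with Python's `zlib`; this is a common stand-in and an assumption of this example, not necessarily the exact formula or compressor used in the paper. The sample strings are made up.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (a proxy for complexity)."""
    return len(zlib.compress(data, 9))


def compression_distance(a: str, b: str) -> float:
    """Normalized compression distance between two article versions.

    Uses the compressed sizes of each string and of their concatenation.
    Values near 0 mean the versions share most of their information.
    """
    x, y = a.encode("utf-8"), b.encode("utf-8")
    cx, cy, cxy = compressed_size(x), compressed_size(y), compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


old_version = (
    "Wikipedia is a collaborative project that uses wiki software to create "
    "an encyclopedia. It is editable by everyone and keeps the version "
    "history of all articles."
)
minor_edit = old_version.replace("everyone", "anyone")
unrelated = (
    "Networks with a power-law degree distribution are called scale-free "
    "networks. Links to missing pages are marked red to encourage people "
    "to write them."
)

# A small edit should come out much closer to the old version than new text.
print(compression_distance(old_version, minor_edit))
print(compression_distance(old_version, unrelated))
```

For very short strings the compressor's fixed overhead dominates, so the measure is more meaningful on full article texts.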
The above figure shows link distribution for existing articles of all namespaces in the German Wikipedia. Links can be removed and added any time and links may point to pages that do not exist yet. Links to such a page (“broken links”) are marked red to encourage people to create missing articles.
The above figure shows a log-log diagram of the number of wanted pages in the German Wikipedia against the number of broken links pointing to them, with a visual power-law fit of γ ≈ 3. The clean power law can be explained by the exponential growth of links and the assumption that the more often a nonexistent article is linked, the more likely someone is to create it. Most languages have similar values, but there are also some exceptions and differences between small and large wikis.
The content structure of Wikipedia articles can be identified using samples or using categories. Wikipedia’s category system is a special form of social tagging related to classification and subject indexing. Wikipedias generally have a variety of special pages for organization and navigation, such as subject portals, lists, articles for single years, months and dates, templates, and navigation bars. There are also attempts to add more semantic markup. The German Wikipedia has special templates for judging articles, marking links to external databases and collecting metadata on people. Besides such special pages, every article links to its most closely related topics. Tests have been done on deriving thesaurus structures from the links between Wikipedia articles. Detailed comparisons with existing thesauri and semantic networks, as well as linguistic methods, have also been tried.
The more people read an article, the more errors are corrected. However, we cannot be sure how many qualified people have read an article and how many errors remain. Edit history and user contributions are auxiliary clues, but they are very time-consuming to review. [ Lih, 2004 ] used the number of edits and the number of unique editors as a very simple and debatable approximation of quality. Wikipedia has been evaluated several times by journalists with very good results, and Wikipedia articles are regularly cited in the press. The overall quality of Wikipedia mainly depends on the definition of (information) quality and the time scale of interest.
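The quantity plotted in such a diagram can be derived from a plain list of links. The following sketch assumes a hypothetical input format, a list of `(source, target)` title pairs plus a set of existing page titles; it does not reproduce the paper's data.

```python
from collections import Counter


def wanted_page_distribution(links, existing_pages):
    """Map 'number of broken links' to 'number of wanted pages with that many'.

    links is an iterable of (source, target) page titles; a link is broken
    when its target is not in existing_pages.
    """
    broken_links_per_page = Counter(
        target for _, target in links if target not in existing_pages
    )
    return Counter(broken_links_per_page.values())


# Hypothetical toy wiki: X is wanted by three pages, Y and Z by one each.
existing = {"A", "B", "C"}
toy_links = [("A", "B"), ("A", "X"), ("B", "X"), ("C", "X"), ("A", "Y"), ("B", "Z")]
distribution = wanted_page_distribution(toy_links, existing)
print(sorted(distribution.items()))  # [(1, 2), (3, 1)]
```

Plotting these pairs on log-log axes would give the kind of diagram the text describes.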
- Cunningham, W. & Leuf, B. The Wiki Way. Quick Collaboration on the Web, Reading, Mass.: Addison-Wesley, 2001.
- Lih, A. Wikipedia as Participatory Journalism: Reliable Sources? Metrics for evaluating collaborative media as a news resource. Proceedings of the International Symposium on Online Journalism, 2004.