Kolari et al, WWE 2006
Characterizing the Splogosphere
Authors: Pranam Kolari, Akshay Java and Tim Finin
This paper aims to characterize splogs by comparing them with authentic blogs. The collected blogs are labeled by a splog detection system developed by the authors. Several characteristics of splogs are then contrasted with those of authentic blogs on the basis of these labels. The characteristics discussed include the content and link structure of splogs, as well as the timing of their pings to the ping server.
Motivation of splogs
1. Creation of fake blogs, containing gibberish or hijacked content from other blogs and news sources, with the sole purpose of hosting profitable context-based advertisements.
2. Creation of false blogs that form a link farm intended to unjustifiably increase the ranking of affiliated sites.
Blogosphere vs. Splogosphere
BlogPulse Dataset: A dataset spanning a period of 21 days in July of 2005, released by BlogPulse. The splog detection model developed by the authors (Kolari et al, AAAI 2006; Kolari, AAAI 2006) is used to label the blogs in the dataset. Blogs that were no longer available at the time of this work, LiveJournal blogs, and non-English blogs are filtered out. The detection model computes a splog probability for each remaining blog. A blog whose splog probability is less than 0.25 is labeled as an authentic blog, whereas a blog whose splog probability is more than 0.8 is labeled as a splog. After this processing, 27k splogs and 27k authentic blogs are sampled uniformly for analysis.
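The labeling and sampling step described above can be sketched roughly as follows. This is only an illustration: the function names, the toy scores and the fixed random seed are assumptions, and the detection model that produces the probabilities is not reproduced here. Blogs whose probability falls between the two thresholds are treated as unlabeled in this sketch.

```python
import random

AUTHENTIC_MAX = 0.25  # splog probability below this: authentic blog
SPLOG_MIN = 0.8       # splog probability above this: splog

def label_blog(p_splog):
    """Return 'authentic', 'splog', or None for the uncertain middle band."""
    if p_splog < AUTHENTIC_MAX:
        return "authentic"
    if p_splog > SPLOG_MIN:
        return "splog"
    return None

def balanced_sample(scored, per_class, seed=0):
    """Uniformly sample up to per_class URLs from each labeled class."""
    by_label = {"authentic": [], "splog": []}
    for url, p_splog in scored:
        label = label_blog(p_splog)
        if label is not None:
            by_label[label].append(url)
    rng = random.Random(seed)  # hypothetical fixed seed, for repeatability
    return {label: rng.sample(urls, min(per_class, len(urls)))
            for label, urls in by_label.items()}

# Toy (url, probability) pairs; the values are made up for illustration.
scored = [("a.example", 0.05), ("b.example", 0.5), ("c.example", 0.93),
          ("d.example", 0.2), ("e.example", 0.85)]
sample = balanced_sample(scored, per_class=2)
```

In the paper's setting, per_class would correspond to the 27k blogs drawn from each class.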
Characteristics of splogs
1. Frequency of words: The frequencies of the most common terms in blogs and splogs are analyzed. The results show that authentic blogs and splogs do not share the same top terms. The top blog terms occur more frequently in blogs than in splogs, and vice versa.
2. Link Structure: The distributions of inlinks for splogs and authentic blogs are analyzed. For authentic blogs, the inlink distribution follows a power law, which is typical of the Web in general. Splogs deviate from this norm: a relatively high fraction of splogs still have a large number of inlinks.
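As an illustration of the word-frequency comparison, the sketch below takes the most common terms of one collection and reports how often they occur in both. It assumes whitespace tokenization and document frequency (the fraction of documents containing a term) as the measure. The summary does not say how the authors tokenized or counted, so both choices, along with the function names and toy texts, are assumptions.

```python
from collections import Counter

def doc_frequency(docs):
    """Fraction of documents in which each term occurs at least once."""
    counts = Counter()
    for text in docs:
        counts.update(set(text.lower().split()))
    return {term: n / len(docs) for term, n in counts.items()}

def compare_top_terms(docs_a, docs_b, k):
    """For the k most common terms of docs_a, give their frequency in both sets."""
    freq_a = doc_frequency(docs_a)
    freq_b = doc_frequency(docs_b)
    # Sort by descending frequency, breaking ties alphabetically.
    top = sorted(freq_a, key=lambda t: (-freq_a[t], t))[:k]
    return [(t, freq_a[t], freq_b.get(t, 0.0)) for t in top]

# Made-up toy texts standing in for blog and splog homepages.
blogs = ["i went home today", "i think we should go", "today i read a book"]
splogs = ["cheap hotel deals online", "online casino deals",
          "find cheap loans online"]

for term, in_blogs, in_splogs in compare_top_terms(blogs, splogs, k=2):
    print(f"{term}: blogs={in_blogs:.2f} splogs={in_splogs:.2f}")
```

Swapping the two arguments gives the reverse view, that is, how the top splog terms fare in authentic blogs.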
Splogs and Ping Servers
Ping servers define standard interfaces that blogs can use to announce new (or updated) posts. Ping servers face two kinds of spam: 1) pings from non-blogs and 2) pings from splogs. Both are called spings.
BlogPulse datasets are filtered using URL-based heuristics for blog identification. For the pings that pass the filter, the homepages are fetched and judged by the splog detection model. Unlike the two thresholds used for the earlier analysis, a single threshold is used here: a blog whose splog probability is less than 0.5 is labeled as an authentic blog, whereas a blog whose splog probability is more than 0.5 is labeled as a splog.
Ping times are compared between blogs and splogs. Two characteristics are found.
1. Splog pings do not show the patterns associated with typical blog posting times; for example, pings from authentic blogs are relatively more frequent during the daytime.
2. The number of spings is approximately three times the number of authentic pings, suggesting that around 75% of pings from English blogs are from splogs.
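Both observations rest on simple counting, which can be sketched as follows. The hour-of-day bucketing is an assumed way of exposing posting-time patterns, and the timestamps are made up; only the three-to-one ratio comes from the text above.

```python
from collections import Counter
from datetime import datetime

def pings_per_hour(timestamps):
    """Count pings falling in each hour of the day (index 0 to 23)."""
    counts = Counter(ts.hour for ts in timestamps)
    return [counts.get(hour, 0) for hour in range(24)]

def sping_share(n_spings, n_authentic):
    """Fraction of all pings that are spings."""
    return n_spings / (n_spings + n_authentic)

# Made-up ping timestamps, for illustration only.
pings = [datetime(2005, 7, 1, 9, 5), datetime(2005, 7, 1, 9, 40),
         datetime(2005, 7, 2, 14, 0)]
histogram = pings_per_hour(pings)

# Three spings for every authentic ping gives 3 / (3 + 1) = 0.75.
share = sping_share(3, 1)
```

Plotting one such histogram for authentic pings and one for spings would show whether only the former peaks during the daytime.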
Several other observations are also made about ping servers:
1. Even though splogs constitute around 88% of all pinging URLs, they account for only 75% of all pings. This is attributed to the fact that many splogs ping only once.
2. Many of the URLs are from non-existent blogs. They constitute what could be termed zombie pings: spings that exist even though the splog they represent is non-existent in the blogosphere.
3. Since many search engines give particular importance to the URL tokens of pages, splogs exploit this ranking criterion by hosting blogs in the info domain, where registration is easier than in the com domain.
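The first observation implies that an average splog URL pings less often than an average authentic URL. A back-of-the-envelope check, using only the two percentages quoted above (the function name is hypothetical):

```python
def relative_ping_rate(url_share, ping_share):
    """Average pings per splog URL divided by average pings per authentic URL."""
    splog_rate = ping_share / url_share                  # pings per splog URL
    authentic_rate = (1 - ping_share) / (1 - url_share)  # pings per authentic URL
    return splog_rate / authentic_rate

# 88% of pinging URLs are splogs, but they send only 75% of all pings.
ratio = relative_ping_rate(url_share=0.88, ping_share=0.75)
print(round(ratio, 2))
```

The ratio comes out at roughly 0.41, so under these figures a splog URL sends well under half as many pings as an authentic blog URL, which is consistent with many splogs pinging only once.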
Detection points for splogs
1. At the update ping server. This can be the most effective approach. However, the judgment cannot be made with high confidence until sufficient posts from the blog have been observed.
2. Before indexing content. Blog search engines can detect splogs before indexing the content of the blogs.
3. After indexing. Later in the life cycle of a blog, search engines can still detect splogs.