# CASCADES project: Cost-effective Outbreak Detection in Networks.

by Jure Leskovec, Andreas Krause, Carlos Guestrin, Christos Faloutsos, Jeanne VanBriesen and Natalie Glance

### SCHOOL OF COMPUTER SCIENCE, CARNEGIE MELLON UNIVERSITY

CMU Press Release: Carnegie Mellon Algorithm Identifies Top 100 Blogs for News

### Blog rankings (based on Jan - Dec 2006 data)

Rankings are based on the following question: Which blogs should one read to be most up to date, i.e., to quickly know about important stories that propagate over the blogosphere?
• Budget=100 blogs: If I can read 100 blogs, which should I read to be most up to date? Unit cost (each blog costs 1 unit), optimizing the information captured -- population affected (we want to be the first to know about something with many people blogging about the story after us)

• Budget=5000 posts: If we can read the total of 5000 posts, which blogs should one read? Cost of reading a blog is the number of posts it has, we optimize the information captured

• Multicriterion solution: We want to read both a small number of blogs and a small number of posts. These results are from the experiment on figure 4(a) from the paper. We find the right budget where value of objective function is 40%. Cost of a blog is a combination of a number of posts (NP) a blog has plus a constant (UC).

• All blogs: List of all 45,000 blogs considered in our study together with statistics on number of posts, cascades and links of a blog. Note it is a large (7mb) file.

### Information propagation and comparison to current blog ranking techniques

 The spread of information in the blogosphere. First blog writes a post and then other blogs refer to it. The behavior (information) spreads (cascades) through the network of blogs. We plot the mount of information captured vs. the number of blogs read. We compare our algorithm to current blog ranking techniques, like reading blogs with most in-links, most posts, most out-links or just reading random blogs. See the paper for details.

### Water distribution networks

Same techniques and algorithms as used for blogs also apply to detecting disease outbreaks in water distribution networks. Consider a city water distribution network, delivering water to households via pipes and junctions. Intrusions can cause contaminants to spread over the network, and we want to select a few locations (pipe junctions) to install sensors, in order to detect these contaminations as quickly as possible.

The sensor placements obtained by our algorithm are provably near optimal, providing a constant fraction of the optimal solution. Our approach scales, achieving speedups and savings in storage of several orders of magnitude.

City water distribution network. Circles show the locations of placed sensors in order to detect water contaminations as quickly as possible.