Complex network phenomena - such as information cascades in online social networks - are hard to fully observe, model, and forecast.
In forecasting, a recent trend has been to forgo the use of parsimonious models in favor of models with increasingly large degrees of freedom that are trained to learn the behavior of a process from historical data.
Extrapolating this trend into the future, eventually we would renounce models altogether.
*But is it possible to forecast the evolution of a complex stochastic process directly from the data, without a model?*
In this work we show that model-free forecasting is possible.
We present SED, an algorithm that forecasts process statistics based on relationships of statistical equivalence, using two general axioms and historical data.
To the best of our knowledge, SED is the first method that can perform axiomatic, model-free forecasts of complex stochastic processes.
Our simulations using simple and complex evolving processes and tests performed on a large real-world dataset show promising results.
Bruno Ribeiro, Christos Faloutsos,
Modeling Website Popularity Competition in the Attention-Activity Marketplace, ACM International Conference on Web Search and Data Mining (WSDM'15), 2015 (to appear) [arXiv:1403.0600].
How does a new startup drive the popularity of competing websites into oblivion like Facebook famously did to MySpace? This question is of great interest to academics, technologists, and financial investors alike. In this work we exploit the singular way in which Facebook wiped out the popularity of MySpace, Hi5, Friendster, and Multiply to guide the design of a new popularity competition model. Our model provides new insights into what Nobel Laureate Herbert A. Simon called the "marketplace of attention," which we recast as the attention-activity marketplace. Our model design is further substantiated by user-level activity of 250,000 MySpace users obtained between 2004 and 2009. The resulting model not only accurately fits the observed Daily Active Users (DAU) of Facebook and its competitors but also predicts their fate four years into the future.
Pinghui Wang, John C.S. Lui, Bruno Ribeiro, Don Towsley, Junzhou Zhao, Xiaohong Guan,
Efficiently Estimating Motif Statistics of Large Networks, ACM Transactions on Knowledge Discovery from Data (TKDD), Sept. 2014 [pdf].
Older ArXiv version arXiv:1306.5288.
Exploring statistics of locally connected subgraph patterns (also known as network motifs) has helped researchers better understand the structure and function of biological and online social networks (OSNs). Nowadays, the massive size of some critical networks – often stored in already overloaded relational databases – effectively limits the rate at which nodes and edges can be explored, making it a challenge to accurately discover subgraph statistics. In this work, we propose sampling methods to accurately estimate subgraph statistics from as few queried nodes as possible. We present sampling algorithms that efficiently and accurately estimate subgraph properties of massive networks. Our algorithms require no pre-computation or complete network topology information. At the same time, we provide theoretical guarantees of convergence. We perform experiments using widely known data sets, and show that for the same accuracy, our algorithms require an order of magnitude fewer queries (samples) than current state-of-the-art algorithms.
Ting-Kai Huang, Bruno Ribeiro, Harsha V. Madhyastha, Michalis Faloutsos,
The Socio-monetary Incentives of Online Social Network Malware Campaigns, ACM Conference on Online Social Networks (COSN'14), 2014 [pdf] [slides.pdf] [slides.pptx].
Online social networks (OSNs) offer a rich medium of malware propagation. Unlike other forms of malware, OSN malware campaigns direct users to malicious websites that hijack their accounts, posting malicious messages on their behalf with the intent of luring their friends to the malicious website, thus triggering word-of-mouth infections that cascade through the network, compromising thousands of accounts. *But how are OSN users lured into clicking on the malicious links?*
In this work, we monitor 3.5 million Facebook accounts and explore the role of pure monetary, social, and combined socio-monetary psychological incentives in OSN malware campaigns. Among other findings we see that the majority of the malware campaigns rely on pure social incentives. However, we also observe that malware campaigns using socio-monetary incentives infect more accounts and last longer than campaigns with pure monetary or social incentives. The latter suggests the efficiency of an epidemic tactic surprisingly similar to the mechanism used by biological pathogens to cope with diverse gene pools.
Flavio Figueiredo, Jussara M Almeida, Yasuko Matsubara, Bruno Ribeiro, Christos Faloutsos,
Revisit Behavior in Social Media: The Phoenix-R Model and Discoveries, ECML/PKDD 2014 [arXiv:1405.1459].
How many listens will an artist receive on an online radio? How about plays on a YouTube video? How many of these visits are new or returning users? Modeling and mining the popularity dynamics of social activity has important implications for researchers, content creators and providers. Here we investigate and model the effect of revisits on popularity. A revisit can be defined as a repeated play (song or video) or a re-tweet of the same hashtag over time. Using four datasets of social activity, with up to tens of millions of media objects (e.g., YouTube videos, Twitter hashtags or LastFM artists), we first show the effect of revisits on the popularity evolution of such objects. We then propose the Phoenix-R model, which captures the popularity dynamics of individual objects. Phoenix-R has the desired properties of being: (1) parsimonious, being based on the minimum description length principle and achieving lower root mean squared error than state-of-the-art baselines; (2) applicable, being effective for predicting future popularity values of time series.
Driven by outstanding success stories of Internet startups such as Facebook and The Huffington Post, recent studies have thoroughly described their growth. These highly visible online success stories, however, overshadow an untold number of similar ventures that fail. The study of website popularity is ultimately incomplete without general mechanisms that can describe both successes and failures. In this work we present six years of the daily number of users (DAU) of twenty-two membership-based websites -- encompassing online social networks, grassroots movements, online forums, and membership-only Internet stores -- well balanced between successes and failures. We then propose a combination of reaction-diffusion-decay processes whose resulting equations not only describe the observed DAU time series well but also provide a means to roughly predict their evolution. This model allows an approximate automatic DAU-based classification of websites into self-sustainable vs. unsustainable, and of whether the startup growth is mostly driven by marketing & media campaigns or word-of-mouth adoptions.
(alphabetical) K. Avrachenkov, P. Basu, G. Neglia, B. Ribeiro*, and D. Towsley,
Pay Few, Influence Most: Online Myopic Network Covering, IEEE NetSciCom Workshop 2014 (Best Paper Award) [pdf] [slides.pptx]. *Corresponding author.
Older version as Technical Report arXiv.
Efficient marketing or awareness-raising campaigns seek to recruit a small number, $w$, of influential individuals -- where $w$ is the campaign budget -- that are able to cover the largest possible target audience through their social connections.
To date, most of the related literature on maximizing this network cover assumes that the social network topology is known.
Even in such a case the coverage problem is NP-hard. In practice, the network topology is generally unknown and needs to be discovered on-the-fly.
In this paper we assume that the topology is gradually discovered thanks to recruited individuals disclosing their social connections.
Our goal is to provide an efficient online algorithm that recruits individuals so as to maximize the size of target audience covered by the campaign.
We analyze the performance of a variety of online myopic algorithms (i.e., algorithms that have no a priori information about the topology) currently used to sample and search large networks, including the Flickr and YouTube networks.
We also propose a new greedy online algorithm, Maximum Expected Uncovered Degree (MEUD).
Our proposed algorithm greedily maximizes the expected size of the cover, but it requires the degree distribution to be known.
For a class of random power law networks we show that MEUD simplifies into a straightforward procedure, denoted as MOD because it requires only the knowledge of the Maximum Observed Degree.
We substantiate our analytical results with extensive simulations and show that MOD significantly outperforms all analyzed myopic algorithms.
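The MOD procedure described above lends itself to a compact sketch. Below is a minimal Python illustration of the greedy online covering idea (recruit the observed node with the largest observed degree, which then discloses its connections); this is not the authors' implementation, and the oracle-style `neighbors` dictionary standing in for the initially unknown topology is an assumption for illustration.

```python
def mod_cover(neighbors, budget, seed_node):
    """Greedy online covering: always recruit the unrecruited node with the
    Maximum Observed Degree; recruiting a node discloses its connections.

    neighbors: dict node -> set of neighbors, used only as an oracle for
    the (initially unknown) topology.  Returns the covered set:
    recruited nodes plus their disclosed neighbors.
    """
    recruited = set()
    observed = {seed_node}            # nodes we know exist so far
    observed_degree = {seed_node: 0}  # edges seen so far, per observed node
    for _ in range(budget):
        candidates = observed - recruited
        if not candidates:
            break
        node = max(candidates, key=lambda v: observed_degree[v])
        recruited.add(node)
        for nb in neighbors[node]:    # disclosure of social connections
            observed_degree[nb] = observed_degree.get(nb, 0) + 1
            observed.add(nb)
    covered = set(recruited)
    for v in recruited:
        covered |= neighbors[v]
    return covered
```

On a star graph, for instance, recruiting a leaf discloses the hub, which the next round recruits, covering the whole network with a budget of two.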
Bo Jiang, Liyuan Sun, Daniel R. Figueiredo, Bruno Ribeiro, Don Towsley,
On the duration and intensity of cumulative advantage competitions and the struggle of the fittest, [arXiv:1402.4536].
At the heart of competition for resources lie two simple principles: "survival of the fittest", which reflects how intrinsic fitnesses of competing agents influence their ability to gather resources; and "cumulative advantage", which reflects how accumulated resources help gather even more resources. In this letter we show that competitions embodying just the essence of these two principles exhibit a variety of surprising behaviors with respect to their duration, as measured by how long it takes for an indisputable winner to emerge, and intensity, as measured by how many times agents have equal amount of resources. We demonstrate that when the fitnesses of the agents are different, long-lasting competitions continue to occur in spite of a dramatic decrease in their intensity and the certainty that the fitter agent will win. We also unveil an unexpected phenomenon that we call "struggle of the fittest", where long-lasting competitions become more probable when one agent is slightly fitter than when agents are equally fit. These results have important implications for our understanding of competitions and various applications that leverage such models.
Peng Xia, Kun Tu, Bruno Ribeiro, Hua Jiang, Xiaodong Wang, Cindy Chen, Benyuan Liu, Don Towsley,
Who is Dating Whom: Characterizing User Behaviors of a Large Online Dating Site, in Social Network Analysis – Community Detection and Evolution, Springer [arXiv:1401.5710].
Online dating sites have become popular platforms for people to look for potential romantic partners. It is important to understand users' dating preferences in order to make better recommendations on potential dates. The message sending and replying actions of a user are strong indicators for what he/she is looking for in a potential date and reflect the user's actual dating preferences. We study how users' online dating behaviors correlate with various user attributes using a large real-world dataset from a major online dating site in China. Many of our results on user messaging behavior align with notions in social and evolutionary psychology: males tend to look for younger females while females put more emphasis on the socioeconomic status (e.g., income, education level) of a potential date. In addition, we observe that the geographic distance between two users and the photo count of users play an important role in their dating behaviors. Our results show that it is important to differentiate between users' true preferences and random selection. Some user behaviors in choosing attributes in a potential date may largely be a result of random selection. We also find that both males and females are more likely to reply to users whose attributes come closest to the stated preferences of the receivers, and there is significant discrepancy between a user's stated dating preference and his/her actual online dating behavior. These results can provide valuable guidelines to the design of a recommendation engine for potential dates.
Kun Tu, Bruno Ribeiro, H. Jiang, X. Wang, David Jensen, Benyuan Liu, Don Towsley,
Online Dating Recommendations: Matching Markets and Learning Preferences, 5th International Workshop on Social Recommender Systems (SRS 2014) with WWW 2014 [arXiv:1401.8042].
Recommendation systems for online dating have recently attracted much attention from the research community. In this paper we propose a two-sided matching framework for online dating recommendations and design an LDA model to learn user preferences from observed user messaging behavior and user profile features. Experimental results using data from a large online dating website show that two-sided matching significantly improves the rate of successful matches, by as much as 45%. Finally, using simulated matchings we show that the LDA model can correctly capture user preferences.
Yeon-sup Lim, Bruno Ribeiro, Don Towsley,
Classifying Latent Infection States in Complex Networks, SIMPLEX Workshop 2014 (@ WWW'14) [arXiv:1402.0013 ].
Algorithms for identifying the infection states of nodes in a network are crucial for understanding and containing infections. Often, however, only a relatively small set of nodes have a known infection state. Moreover, the length of time that each node has been infected is also unknown. This missing data – the infection state of most nodes and the infection time of the unobserved infected nodes – poses a challenge to the study of real-world cascades.
In this work, we develop techniques to identify latent infected nodes in the presence of missing infection time-and-state data. Based on the likely epidemic paths predicted by the simple susceptible-infected epidemic model, we propose a measure (Infection Betweenness) for uncovering these unknown infection states. Our experimental results using machine learning algorithms show that Infection Betweenness is the most effective feature for identifying latent infected nodes.
James Atwood, Bruno Ribeiro, Don Towsley,
Efficient Network Generation Under General Preferential Attachment, SIMPLEX Workshop 2014 (@ WWW'14) [arXiv:1403.4521].
James's source code.
Preferential attachment (PA) models of network structure are widely used due to their explanatory power and conceptual simplicity. PA models are able to account for the scale-free degree distribution observed in many real-world large networks through the remarkably simple mechanism of sequentially introducing nodes that attach preferentially to nodes with high degree. The ability to efficiently generate instances from PA models is a key asset in understanding both the models themselves and the real networks that they represent. Surprisingly, little attention has been paid to the problem of efficient instance generation. In this paper, we show that the complexity of generating network instances from a PA model depends on the preference function of the model, provide efficient data structures that work under any preference function, and present empirical results from an implementation based on these data structures. We demonstrate that, by indexing growing networks with a simple augmented heap, we can implement a network generator which scales many orders of magnitude beyond existing capabilities ($10^6$–$10^8$ nodes). We show the utility of an efficient and general PA network generator by investigating the consequences of varying the preference functions of an existing model. We also provide “quicknet”, a freely-available open-source implementation of the methods described in this work.
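The augmented-index idea can be sketched with a Fenwick (binary indexed) tree that maintains a weight pref(degree) per node and supports O(log n) weighted sampling. This is a hedged Python sketch of one way to generate a PA network under a general preference function `pref`, not the paper's quicknet implementation; the default linear-shift preference is an assumption for illustration.

```python
import random

class FenwickSampler:
    """Binary indexed tree over node weights: O(log n) weight updates and
    O(log n) sampling of a node with probability proportional to weight."""
    def __init__(self, capacity):
        self.n = capacity
        self.tree = [0.0] * (capacity + 1)

    def add(self, i, delta):          # add delta to the weight of node i
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & (-i)

    def total(self):                  # sum of all weights
        i, s = self.n, 0.0
        while i > 0:
            s += self.tree[i]
            i -= i & (-i)
        return s

    def sample(self, rng):            # weighted draw via tree descent
        u = rng.random() * self.total()
        pos, bitmask = 0, 1 << self.n.bit_length()
        while bitmask:
            nxt = pos + bitmask
            if nxt <= self.n and self.tree[nxt] < u:
                u -= self.tree[nxt]
                pos = nxt
            bitmask >>= 1
        return pos                    # 0-indexed node id

def generate_pa(n, m, pref=lambda d: d + 1.0, seed=0):
    """Sequentially attach nodes, each choosing up to m distinct targets
    with probability proportional to pref(degree)."""
    rng = random.Random(seed)
    sampler = FenwickSampler(n)
    degree = [0] * n
    edges = []
    sampler.add(0, pref(0))           # seed node
    for v in range(1, n):
        targets = set()
        while len(targets) < min(m, v):
            targets.add(sampler.sample(rng))
        for t in targets:
            edges.append((v, t))
            # keep t's weight consistent with its new degree
            sampler.add(t, pref(degree[t] + 1) - pref(degree[t]))
            degree[t] += 1
            degree[v] += 1
        sampler.add(v, pref(degree[v]))
    return edges
```

Because updates and draws are both logarithmic, the cost of growing the network is O(n m log n) regardless of the preference function, which is the point of indexing the growing network rather than rescanning node weights.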
Bruno Ribeiro, Nicola Perra, and Andrea Baronchelli,
Quantifying the Effect of Temporal Resolution on Time-varying Networks, Scientific Reports, 2013 [pdf]. arXiv:1211.7052.
Time-varying networks describe a wide array of systems whose constituents and interactions evolve in time. These networks are defined by an ordered stream of interactions between nodes.
However, they are often represented as a sequence of static networks, resulting from aggregating all edges and nodes appearing at time intervals of size $\Delta t$. In this work we investigate the consequences of this procedure. In particular, we address the impact of an arbitrary $\Delta t$ on the description of a dynamical process taking place upon a time-varying network. We focus on the elementary random walk, and put forth a mathematical framework that provides exact results in the context of synthetic activity driven networks. Remarkably, the equations turn out to also accurately describe the behavior observed on real datasets. Our results provide the first analytical description of the bias introduced by time-integrating techniques, and represent a step forward in the correct characterization of dynamical processes on time-varying graphs.
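The aggregation procedure whose bias is analyzed above can be made concrete in a few lines: turn an ordered stream of timestamped interactions into a sequence of static snapshots of window size $\Delta t$. The `(t, u, v)` event format is an assumption for illustration.

```python
from collections import defaultdict

def aggregate(stream, dt):
    """Aggregate an ordered stream of timestamped interactions into static
    snapshots: window k holds every undirected edge that appears at least
    once in the interval [k*dt, (k+1)*dt)."""
    snapshots = defaultdict(set)
    for t, u, v in stream:
        k = int(t // dt)                          # which window t falls in
        snapshots[k].add((min(u, v), max(u, v)))  # store edge undirected
    return dict(snapshots)
```

Note that aggregation discards the ordering of interactions inside each window, which is exactly the information loss whose effect on random-walk dynamics the paper quantifies.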
Peng Xia, Bruno Ribeiro, Cindy Chen, Benyuan Liu, Don Towsley,
A Study of User Behavior on an Online Dating Site (short paper), ASONAM 2013.
Online social networks (OSNs) have become a popular new vector for distributing malware and spam, which we refer to as socware. Unlike email spam, which is sent by spammers directly to intended victims, socware cascades through OSNs as compromised users spread it to their friends. In this paper, we analyze data from the walls of roughly 3 million Facebook users over five months, with the goal of developing a better understanding of socware cascades.
We study socware cascades to understand: (a) their spatio-temporal properties, (b) the underlying motivations and mechanisms, and (c) the social engineering tricks used to con users. First, we identify an evolving trend in which cascades appear to be throttling their rate of growth to evade detection, and thus, lasting longer. Second, our forensic investigation into the infrastructure that supports these cascades shows that, surprisingly, Facebook seems to be inadvertently enabling most cascades; 44% of cascades are disseminated via Facebook applications. At the same time, we observe large groups of synergistic Facebook apps (more than 144 groups of size 5 or more) that collaborate to support multiple cascades. Lastly, we find that hackers rely on two social engineering tricks in equal measure—luring users with free products and appealing to users’ social curiosity—to enable socware cascades. Our findings present several promising avenues towards reducing socware on Facebook, but also highlight associated challenges.
Pinghui Wang, Junzhou Zhao, Bruno Ribeiro, John C.S. Lui, Don Towsley, and Xiaohong Guan,
Practical Characterization of Large Networks Using Neighborhood Information, (under review) [arXiv:1311.3037].
Characterizing large online social networks (OSNs) through node querying is a challenging task. OSNs often impose severe constraints on the query rate, limiting the sample size to a small fraction of the total network. Various ad hoc subgraph sampling methods have been proposed, but many of them give biased estimates and provide no theoretical guarantees on accuracy. In this work, we focus on developing sampling methods for OSNs where querying a node also reveals partial structural information about its neighbors. Our methods are optimized for NoSQL graph databases (if the database can be accessed directly), or use the Web APIs available on most major OSNs for graph sampling. We show that our sampling method has provable convergence guarantees on being an unbiased estimator, and that it is more accurate than current state-of-the-art methods. We characterize metrics such as node label density and edge label density, two of the most fundamental network characteristics from which other network characteristics can be derived. We evaluate our methods on-the-fly over several live networks using their native APIs. Our simulation studies over a variety of offline datasets show that, by including neighborhood information, our method drastically (4-fold) reduces the number of samples required to achieve the same estimation accuracy as state-of-the-art methods.
Fabricio Murai, Bruno Ribeiro, Don Towsley, and Krista Gile,
Characterizing Branching Processes from Sampled Data, SIMPLEX Workshop 2013 (@ WWW'13). arXiv:1302.5847.
Branching processes model the evolution of populations of agents that randomly generate offspring. These processes, most notably the Galton-Watson process, are widely used to model biological, social, cognitive, and technological phenomena, such as the diffusion of ideas, knowledge, chain letters, and viruses, and the evolution of humans through their Y-chromosome DNA or mitochondrial RNA. A practical challenge of modeling real phenomena with a Galton-Watson process is choosing the offspring distribution, which must be measured from the population. In most cases, however, directly measuring the offspring distribution is unrealistic due to lack of resources or the death of agents. So far, researchers have relied on informed guesses to guide their choice of offspring distribution. In this work we propose a collection of methods to estimate the offspring distribution from real sampled data. Using a small sampled fraction of the agents, instrumented with the identity of the ancestors of the sampled agents, we show that accurate offspring distribution estimates can be obtained by sampling as little as 14% of the population.
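For concreteness, a Galton-Watson process itself can be simulated in a few lines. The sketch below (with an assumed pmf-as-list input format; it implements the forward process only, not the paper's estimators) shows how generation sizes evolve under a given offspring distribution.

```python
import random

def simulate_gw(offspring_pmf, generations, seed=0):
    """Simulate a Galton-Watson branching process from a single ancestor.

    offspring_pmf: list where offspring_pmf[k] is the probability that an
    agent generates k offspring.  Returns the population size at each
    generation (stopping early on extinction).
    """
    rng = random.Random(seed)
    ks = range(len(offspring_pmf))
    sizes = [1]
    pop = 1
    for _ in range(generations):
        # each agent draws its offspring count independently
        pop = sum(rng.choices(ks, weights=offspring_pmf)[0]
                  for _ in range(pop))
        sizes.append(pop)
        if pop == 0:   # extinction
            break
    return sizes
```

The estimation problem the paper addresses is the inverse of this simulation: given a sampled fraction of the agents plus ancestor identities, recover `offspring_pmf`.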
F. Murai, B. Ribeiro, D. Towsley, P. Wang,
On Set Size Distribution Estimation and the Characterization of Large Networks via Sampling, IEEE JSAC special issue on Network Science, 2013. [arXiv:1209.0736].
In this work we study the set size distribution estimation problem, where elements are randomly sampled from a collection of non-overlapping sets and we seek to recover the original set size distribution from the samples. This problem has applications in capacity planning and network theory, among other areas. Examples of real-world applications include characterizing in-degree distributions in large graphs and uncovering TCP/IP flow size distributions on the Internet. We demonstrate that it is hard to estimate the original set size distribution: its recoverability presents a sharp threshold with respect to the fraction of elements that remain in the sets. If this fraction remains below the threshold, typically half of the elements for power-law and heavier-than-exponential-tailed distributions, then the original set size distribution is unrecoverable. We also discuss practical implications of our findings.
D. Figueiredo, P. Nain, B. Ribeiro*, E. de Souza e Silva, and D. Towsley,
Characterizing Continuous Time Random Walks on Time-varying Graphs, SIGMETRICS 2012. *Corresponding author. [pdf]. Technical Report arXiv:1112.5762.
In this paper we study the behavior of a continuous time random walk (CTRW) on a stationary and ergodic time varying dynamic graph. We establish conditions under which the CTRW is a stationary and ergodic process. In general, the stationary distribution of the walker depends on the walker rate and is difficult to characterize. However, we characterize the stationary distribution in the following cases: i) the walker rate is significantly larger or smaller than the rate at which the graph changes (time-scale separation), ii) the walker rate is proportional to the degree of the node that it resides on (coupled dynamics), and iii) the degrees of nodes belonging to the same connected component are identical (structural constraints). We provide examples that illustrate our theoretical findings.
Bruno Ribeiro, Prithwish Basu, Don Towsley,
Multiple Random Walks to Uncover Short Paths in Power Law Networks, IEEE INFOCOM NetSciCom 2012. [pdf].
Developing simple distributed algorithms that allow nodes to perform topology discovery and message routing using incomplete topological information is a problem of great interest in network science. Consider the following routing problem in the context of a large power law network $G$. A small number of nodes want to exchange messages over short paths on $G$. These nodes do not have access to the topology of $G$ but are allowed to crawl the network subject to a budget constraint. Only crawlers whose paths cross are allowed to exchange topological information. In this work we study the use of random walks (RWs) to crawl $G$. We show that RWs have the ability to find short paths, and that this bears no relation to the paths that they take. Instead, it relies on two properties of RWs on power law networks: the ability of a RW to observe a sizable fraction of the network edges, and the near certainty that two distinct RW sample paths cross after a small percentage of the nodes have been visited. We show promising simulation results on several real-world networks.
Bruno Ribeiro, Don Towsley
On the Estimation Accuracy of Degree Distributions from Graph Sampling, IEEE Conference on Decision and Control (invited), 2012 [pdf].
This is a new version of the old TechReport UM-CS-2011-036. [pdf].
Estimating characteristics of large graphs via sampling is a vital part of the study of complex networks. In this work we present an in-depth study of the Mean Squared Error (MSE) of sampling methods such as independent random vertex (RV) and random edge (RE) sampling, and of crawling methods such as random walks (RWs), a.k.a. RDS, and a Metropolis-Hastings algorithm whose target distribution is uniform over vertices (MHRWu). This paper provides an upper bound on the MSE of a stationary RW as a function of the MSE of RE sampling and the absolute value of the second most dominant eigenvalue of the RW transition probability matrix. We see that RW and RV sampling are optimal with respect to different weighted MSE optimizations and show when RW is preferable to RV sampling. Finally, we present an approximation to the MHRWu MSE. We evaluate the accuracy of our approximations and bound using simulations on large real-world graphs.
Bruno Ribeiro, Pinghui Wang, Fabricio Murai, and Don Towsley,
Sampling Directed Graphs with Random Walks, INFOCOM 2012 [pdf].
A version is available as UMass CMPSCI Technical Report UMCS-2011-031 [pdf].
Despite recent efforts to characterize complex networks such as citation graphs or online social networks (OSNs), little attention has been given to developing tools that can be used to characterize directed graphs in the wild, where no pre-processed data is available. The presence of hidden incoming edges but observable outgoing edges poses a challenge to characterizing large directed graphs through crawling. Unless we can crawl the entire graph, or the directed graph edges are highly symmetrical, hidden incoming edges induce unknown biases in the sampled nodes. In this work we propose a random walk sampling algorithm that is less prone to these biases. The driving principle behind our random walk is to construct, in real time, an undirected graph from the directed graph in a way that is consistent with the sample path followed by the algorithm walking on either graph. We also study out-degree and in-degree distribution estimation. Out-degrees are visible to the walker while in-degrees are hidden (latent). This makes for strikingly different estimation accuracies of in- and out-degree distributions. We show that our algorithm can accurately estimate out-degree distributions, and show that no algorithm can accurately estimate unbiased in-degree distributions unless the directed graph is highly symmetrical.
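A minimal sketch of the driving principle: walk a directed graph while building, in real time, an undirected graph from the edges observed so far, so that an incoming edge becomes usable once it has been traversed in the outgoing direction. This is an illustrative simplification, not the paper's full algorithm; the `out_neighbors` adjacency dict is an assumed input format.

```python
import random
from collections import defaultdict

def directed_rw(out_neighbors, start, steps, seed=0):
    """Random walk on a directed graph that moves on an undirected graph
    constructed on the fly from every edge seen so far."""
    rng = random.Random(seed)
    known = defaultdict(set)          # undirected adjacency discovered so far
    v = start
    visited = [v]
    for _ in range(steps):
        for u in out_neighbors[v]:    # outgoing edges are observable
            known[v].add(u)
            known[u].add(v)           # remember the edge as undirected
        if not known[v]:              # dangling node with no known edges
            break
        v = rng.choice(sorted(known[v]))
        visited.append(v)
    return visited
```

On a directed cycle, for example, the walker can eventually move "backwards" along edges it has already traversed, which is precisely what makes the induced sample path consistent with a walk on the constructed undirected graph.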
Y. Lim, D. S. Menasche, B. Ribeiro, D. Towsley, and P. Basu, Online estimating the k central nodes of a network. IEEE Network Science Workshop (NSW), 2011 [pdf].
Estimating the most influential nodes in a network is a fundamental problem in network analysis. Influential nodes may be important spreaders of diseases in biological networks, key actors in terrorist networks, or marketing targets in social networks. By identifying such central nodes, one can devise efficient strategies for prevention of diseases or crime and efficient marketing strategies.
The goal of this paper is to estimate the k most central nodes in a network through parsimonious sampling. Centrality determines the relative importance of a particular node within the network. Conventional measures of node centrality include degree, betweenness, and closeness.
Bruno Ribeiro, Daniel Figueiredo, Edmundo de Souza e Silva, and Don Towsley,
Characterizing Dynamic Graphs with Continuous-time Random Walks, SIGMETRICS 2011 (Generalized to more general graph processes in our SIGMETRICS 2012 paper) [pdf].
In this paper we study the steady state behavior of continuous-time random walks (CTRW) on Markov dynamic networks. We consider two types of CTRWs: one that walks at a constant rate (CTRW-C) and another that walks with a rate proportional to the vertex degree (CTRW-D). We derive closed-form analytical expressions for the steady state (SS) distribution of these walkers. For CTRW-C we obtain the approximate SS distribution for either a very fast or very slow walker. We show that the behavior of CTRW-C and CTRW-D is strikingly different. Surprisingly, the steady state distribution of the fixed rate walker depends on the walker rate, which is not the case for the degree proportional walker. Such findings have direct implication on the design of algorithms to measure dynamic networks.
Konstantin Avrachenkov, Bruno Ribeiro, and Don Towsley,
Improving Random Walk Estimation Accuracy with Uniform Restarts, 7th Workshop on Algorithms and Models for the Web Graph (WAW 2010), Sept 2010.
A version is available as INRIA Technical Report no 7394 [pdf].
This work proposes and studies the properties of a hybrid sampling scheme that mixes independent uniform node sampling and random walk (RW)-based crawling. We show that our sampling method combines the strengths of both uniform and RW sampling while minimizing their drawbacks. In particular, our method increases the spectral gap of the random walk, and hence, accelerates convergence to the stationary distribution. The proposed method resembles PageRank but unlike PageRank preserves time-reversibility. Applying our hybrid RW to the problem of estimating degree distributions of graphs shows promising results.
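A simplified sketch of the hybrid idea: at each step, with probability `alpha`, jump to a uniformly chosen node; otherwise take an ordinary random-walk step. Note this plain restart walk (closer in spirit to PageRank) is only an illustration; the paper's actual construction adds uniform edges in a way that preserves time-reversibility.

```python
import random

def hybrid_rw(neighbors, steps, alpha, seed=0):
    """Random walk mixed with independent uniform node sampling: with
    probability alpha restart at a uniformly chosen node, otherwise move
    to a random neighbor.  Returns per-node visit counts."""
    rng = random.Random(seed)
    nodes = sorted(neighbors)
    v = rng.choice(nodes)
    counts = {u: 0 for u in nodes}
    for _ in range(steps):
        if rng.random() < alpha or not neighbors[v]:
            v = rng.choice(nodes)                 # uniform restart
        else:
            v = rng.choice(sorted(neighbors[v]))  # ordinary RW step
        counts[v] += 1
    return counts
```

The restarts let the walk escape disconnected or quasi-disconnected regions, which is the mechanism behind the increased spectral gap and faster convergence described above.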
William Gauvin, Bruno Ribeiro, Benyuan Liu, Don Towsley, Jie Wang, Measurement and Gender-Specific Analysis of User Publishing Characteristics on MySpace, IEEE Network special issue on Online Social Networks, vol 24, number 5, Oct 2010.
Online social networks have become popular platforms for people to make connections, share information, and interact with each other. In an online social network, user publishing activities such as sending messages and posting photos represent online interactions between friends that involve the use of network and system resources. As more and more businesses use online social networks as a means to promote their brand names, a good understanding of user publishing characteristics is important not only for capacity planning (server scaling estimation, network bandwidth provisioning, and performance tuning), but also for marketing analysis and security measures of online social networks. Recently there have been many efforts to measure and analyze various characteristics of online social networks. Most of these studies have focused on network graph properties such as node degree distribution, clustering coefficient, and connected components. In this work, we measure and analyze the gender-specific user publishing characteristics on MySpace. Our results show that there are recognizable patterns with respect to profile attributes, user interactions, temporal usages, and blog contents for users of different genders. In particular, gender neutral profiles (most of them are music bands, TV shows, and other commercial sites) differ significantly from normal male and female profiles for several publishing patterns.
Bruno Ribeiro and Don Towsley, Estimating and Sampling Graphs with Multidimensional Random Walks, ACM SIGCOMM Internet Measurement Conference, Nov, 2010. [src] [pdf]. IMC 2010 slides [pptx].
Estimating characteristics of large graphs via sampling is a vital part of the study of complex networks. Current sampling methods such as (independent) random vertex and random walks are useful but have drawbacks. Random vertex sampling may require too many resources (time, bandwidth, or money). Random walks, which normally require fewer resources per sample, can suffer from large estimation errors in the presence of disconnected or quasi-disconnected subgraphs. In this work we propose a new $m$-dimensional random walk that uses $m$ dependent random walkers. We prove that the proposed sampling method, which we call Frontier sampling, has all of the nice sampling properties of a regular random walk. At the same time, our simulations over large real world graphs show that, in the presence of disconnected or quasi-disconnected subgraphs, Frontier sampling exhibits lower estimation errors than regular random walks. We also argue that Frontier sampling is more suitable to sample power-law graphs than random vertex sampling.
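Frontier sampling itself is short to state; the sketch below (the toy two-clique graph and the parameters are mine, not from the paper) keeps a list of m walkers, picks a walker with probability proportional to its node's degree, moves it to a random neighbor, and records the traversed edge. The recorded edges are asymptotically uniform over edges, so endpoints are degree-biased and can be corrected.

```python
import random

random.seed(3)

# toy graph: two 5-cliques joined by a single bridge edge
G = {v: [u for u in range(5 * (v // 5), 5 * (v // 5) + 5) if u != v]
     for v in range(10)}
G[4].append(5)
G[5].append(4)

def frontier_sampling(G, m, steps):
    """m dependent walkers: pick a walker with probability proportional to
    its node's degree, move it to a random neighbor, record the edge."""
    L = random.sample(list(G), m)
    edges = []
    for _ in range(steps):
        u = random.choices(L, weights=[len(G[v]) for v in L])[0]
        v = random.choice(G[u])
        edges.append((u, v))
        L[L.index(u)] = v          # the walker moves from u to v
    return edges

edges = frontier_sampling(G, m=3, steps=20000)
# sampled edges are asymptotically uniform, so edge endpoints are
# degree-biased; the harmonic-mean correction recovers the mean degree (4.2)
est_mean_degree = len(edges) / sum(1.0 / len(G[v]) for _, v in edges)
```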
In this work we study the activity span of MySpace accounts and its connection to the distribution of the number of friends.
The activity span is the time elapsed since the creation of the account until the user's last login time.
We observe exponentially distributed activity spans.
We also observe that the distribution of the number of friends over accounts with the same activity span is well approximated by a lognormal with a fairly light tail.
These two findings shed light on the puzzling, as-yet-unexplained inflection point (knee) in the distribution of the number of friends in MySpace when plotted on a log-log scale.
We argue that the inflection point resembles the inflection point of Reed's (Double Pareto) Geometric Brownian Motion with Exponential Stopping Times model.
We also present evidence against the Dunbar number hypothesis of online social networks, which argues, without proof, that the inflection point is due to the Dunbar number (a theoretical limit on the number of people that a human brain can sustain active social contact with).
While we answer many questions, we leave many others open.
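Reed's mechanism is simple to simulate (the drift, volatility, and stopping-rate values below are illustrative): run a geometric Brownian motion but observe it at an exponentially distributed age, which produces a double-Pareto size distribution with power-law behavior in both tails.

```python
import math
import random

random.seed(42)
MU, SIGMA, RATE = 0.05, 0.3, 0.02   # illustrative drift, volatility, stopping rate

def reed_sample():
    """Geometric Brownian motion observed at an Exp(RATE) stopping time."""
    t = random.expovariate(RATE)
    z = random.gauss(0.0, 1.0)
    return math.exp(MU * t + SIGMA * math.sqrt(t) * z)

xs = sorted(reed_sample() for _ in range(50000))
# plotting the CCDF of xs on log-log axes shows straight-line (power-law)
# behavior in both the upper and lower tails: the double-Pareto shape,
# with an inflection point (knee) between the two regimes
```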
2009 & earlier work
Bruno Ribeiro, Daniel Figueiredo, and Don Towsley, Herding BitTorrent Traffic away from Expensive ISP Links, Jan, 2009.
A version is available as UMass CMPSCI Technical Report UM-CS-2008-029 [pdf].
The steady increase in access link capacity of broadband subscribers, coupled with the significant increase in traffic, partially due to popular peer-to-peer (P2P) file sharing applications, has had an adverse economic impact on last-mile Internet Service Providers (ISPs). Reducing the cost associated with P2P traffic has become a major issue for ISPs worldwide and has triggered several proposals. However, previous works advocate for changes to P2P software or require application-specific packet identification and manipulation. This paper presents a distinct approach. We leverage the observation that traffic costs can be reduced by diverting traffic away from expensive links and into cheaper links. We focus on BitTorrent (BT) traffic and propose a change to the ISP's access router packet schedulers in order to divert traffic away from links labeled as expensive by the ISP. Our proposed packet scheduler is stateless. Our approach requires no change to applications, is blind to packet payload, and only requires changes to access routers of ISPs that desire to reduce their BT traffic costs. Our simulation and experimental results indicate that our approach can successfully herd BT traffic towards cheaper links (up to 50% reduction), without sacrificing user-level performance.
Bruno Ribeiro, Tao Ye, and Don Towsley, A Resource-minimalist Flow Size Histogram Estimator,
ACM SIGCOMM Internet Measurement Conference, Oct, 2008.
A version is available as UMass CMPSCI Technical Report 29-08 [pdf].
The IMC 2008 slides: [ppt]
This work presents a data stream algorithm for coarse-grained Internet flow size histogram estimation.
This algorithm is a fast sub-optimal estimator that also has a small memory footprint.
The memory savings come from the application of a probabilistic counting technique proposed by Robert Morris in 1978 and from our hash folding technique.
In this work we present two tools that are possibly of interest in their own right:
hash folding (multiplexing two or more hash tables into the space of one table)
Section 3.1.2 has a typo: " ... its ownership bit changed from one to zero, g_j is decremented by one" should be " ... its ownership bit changed from one to zero, g_0 is decremented by one".
g_0 counts the number of virtual counters in the sketch and is decremented when a virtual counter is lost (which happens when the ownership bit changes from one to zero). Thanks to Chia-Mu Yu for pointing out this typo.
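For reference, Morris's 1978 probabilistic counting technique mentioned above can be sketched in a few lines (this is the textbook version, not the exact variant used in the paper): keep only a small exponent c, increment it with probability 2^-c per event, and report 2^c - 1, which is an unbiased estimate of the true count.

```python
import random

random.seed(7)

def morris_count(n_events):
    """Morris counter: store only c, incrementing it w.p. 2**-c per event;
    2**c - 1 is an unbiased estimate of the number of events."""
    c = 0
    for _ in range(n_events):
        if random.random() < 2.0 ** -c:
            c += 1
    return 2 ** c - 1

# a single counter is noisy; averaging independent counters reduces variance
n = 10000
estimate = sum(morris_count(n) for _ in range(200)) / 200
```

The counter c only needs about log2(log2(n)) bits, which is where the memory savings come from.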
Bruno Ribeiro, Weifeng Chen, Gerome Miklau, and Don Towsley Analyzing Privacy in Enterprise Packet Trace Anonymization,
Proceedings of the 15th Annual Network & Distributed System Security Symposium (NDSS), Feb, 2008 [pdf] and [ppt] An extended version is currently available as UMass CMPSCI Technical Report TR 48-07. (BibTeX) [pdf].
The source code of the software used in this work is available at [src].
In this work we show (and present the formal background of) an attack on the anonymized IP addresses of (fully and partially) prefix-preserving anonymized traces. The attack combines the use of external information (fingerprints) with the matching constraints imposed by prefix preservation.
Perhaps most importantly, we develop analysis tools that allow data publishers to quantify the worst-case
vulnerability of their trace given assumptions about the adversary's external information.
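A toy Crypto-PAn-style anonymizer (hypothetical code, not the software linked above) makes the matching constraints concrete: because output bit i depends only on the first i input bits, two addresses sharing a k-bit prefix always map to outputs sharing exactly a k-bit prefix, so a single deanonymized address immediately constrains every address that shares a prefix with it.

```python
import hashlib

BITS = 32

def pp_anon(addr, key):
    """Toy prefix-preserving anonymizer: output bit i is input bit i XOR a
    pseudorandom function of the key and the first i input bits."""
    out = 0
    for i in range(BITS):
        prefix = addr >> (BITS - i)                       # first i bits
        digest = hashlib.sha256(f"{key}:{i}:{prefix}".encode()).digest()
        bit = (addr >> (BITS - 1 - i)) & 1
        out = (out << 1) | (bit ^ (digest[0] & 1))
    return out

def common_prefix_len(a, b):
    diff = a ^ b
    return BITS if diff == 0 else BITS - diff.bit_length()

a, b = 0x0A000001, 0x0A0000FE   # 10.0.0.1 and 10.0.0.254 share a 24-bit prefix
ea, eb = pp_anon(a, "secret"), pp_anon(b, "secret")
# the shared prefix length survives anonymization exactly: deanonymizing
# one of the two addresses therefore reveals 24 bits of the other
```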
Bruno Ribeiro, Don Towsley, Tao Ye, and Jean Bolot, Fisher Information of Sampled Packets: an Application to Flow Size Estimation,
ACM SIGCOMM Internet Measurement Conference (IMC 2006), Rio de Janeiro, Brazil, October 2006. (BibTeX)
The camera-ready version and the IMC'06 slides are available (also an errata) here [pdf][ppt] (recommended).
An older version is also available as UMass CMPSCI Technical Report 06-37
Also available is a tar.gz file with the source code that computes the Cramer-Rao bounds found in the paper. It needs R 2.4 or newer: [src].
The source code to numerically obtain the MLE can be found at [src]. It contains the version of wnlib that I used. Wnlib needs csh to compile.
In this work we study how much statistical information about the flow size distribution survives probabilistic packet sampling.
We focus primarily on TCP flows and ask whether it is possible to extract enough information from a sampled packet stream to recover its original flow size distribution. We look at both single and multiple monitoring points.
This work also provides insight into which TCP/IP header fields can be used to improve flow size estimation.
We also provide tools to pre-compute the number of samples needed to achieve a given confidence interval.
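To illustrate the estimation problem (toy parameters; the sketch also assumes flows with zero sampled packets are still observed, a simplification the paper does not make), the binomial thinning of flow sizes can be inverted with a standard EM iteration:

```python
import math
import random

random.seed(0)

W, p = 5, 0.5                       # max flow size and packet sampling prob.
theta = [0.5, 0.2, 0.1, 0.1, 0.1]   # true distribution over flow sizes 1..W

def binom(j, i):
    """P(j sampled packets | flow of i packets) under independent sampling."""
    return math.comb(i, j) * p ** j * (1 - p) ** (i - j)

# simulate N flows and the binomially thinned packet counts we observe
N = 20000
counts = [0] * (W + 1)
for _ in range(N):
    i = random.choices(range(1, W + 1), weights=theta)[0]
    counts[sum(random.random() < p for _ in range(i))] += 1

# EM: compute the posterior over true sizes for each observed count,
# then reweight the size distribution
est = [1.0 / W] * W
for _ in range(200):
    new = [0.0] * W
    for j, n_j in enumerate(counts):
        if n_j == 0:
            continue
        post = [est[i - 1] * binom(j, i) for i in range(1, W + 1)]
        z = sum(post)
        for k in range(W):
            new[k] += n_j * post[k] / z
    est = [x / N for x in new]
# est is now close to theta
```

The Fisher-information analysis in the paper quantifies exactly how much the variance of such estimators must grow as p shrinks.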
Bruno Ribeiro, Edmundo de Souza e Silva, and Don Towsley, On the Efficiency of Path Diversity for Continuous Media Applications,
A version is currently available as UMass CMPSCI Technical Report 05-19
Path diversity is beneficial for multimedia applications in the presence of loss.
It ameliorates the effect of end-to-end burst losses by spreading such losses.
Multimedia applications using path diversity are known to achieve better error performance than the single-path case.
Packet interleaving is an older and more widespread technique that also spreads burst losses.
To the best of my knowledge, this work is the first to provide evidence (and a mathematical proof :) that packet interleaving can also be used to achieve good error performance for multimedia applications.
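A block interleaver makes the burst-spreading effect easy to see (the packet counts below are arbitrary): packets are written row by row and transmitted column by column, so a burst of consecutive transmission losses lands on original packets that are far apart.

```python
def interleave(packets, depth):
    """Block interleaver: write packets row by row, transmit column by column."""
    rows = [packets[i:i + depth] for i in range(0, len(packets), depth)]
    return [row[c] for c in range(depth) for row in rows if c < len(row)]

packets = list(range(12))
tx = interleave(packets, depth=4)   # transmission order
burst = tx[3:6]                     # a burst of 3 consecutive transmission losses
# in the original stream the burst becomes isolated losses, 4 packets apart
```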
Weifeng Chen, Yong Huang, Bruno Ribeiro, Kyoungwon Suh, Honggang Zhang, Edmundo de Souza e Silva, Jim Kurose, and Don Towsley, Exploiting the IPID field to infer network path and end-system characteristics,
Passive and Active Measurement Workshop (PAM'05), Boston, March 2005.
This work shows the many uses of the IPID field for network measurements.
Among many other applications, it presents a technique to measure end-to-end one way delay differences that does
not require the end host to be instrumented.
This means that two hosts A and B can measure their one-way delay difference to a
Windows 2000/XP end host C without the need to instrument host C -- or without the knowledge or consent of host C :-O .
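The delay-difference trick can be sketched with a toy simulation (the delay values are invented; in a real measurement the arrival-order test is answered by comparing the IPID values that C's globally incrementing counter stamps on its replies): bisect on B's send offset until the arrival order at C flips; the flip point equals the one-way delay difference.

```python
DELAY_A, DELAY_B = 10.0, 25.0   # invented one-way delays (ms) from A and B to C

def a_arrives_first(offset):
    """A sends at t = 0, B at t = offset; in practice this question is
    answered by checking which reply carries the smaller IPID from C."""
    return DELAY_A < offset + DELAY_B

lo, hi = -100.0, 100.0           # bracket for the unknown delay difference
for _ in range(60):
    mid = (lo + hi) / 2.0
    if a_arrives_first(mid):
        hi = mid                 # the order flip point lies at or below mid
    else:
        lo = mid
delay_diff = (lo + hi) / 2.0     # converges to DELAY_A - DELAY_B = -15 ms
```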
S. C. Coutinho, Bruno Ribeiro, On Holomorphic Foliations without Algebraic Solutions,
Experimental Mathematics, v.10, n.4, p.529 - 536, 2001.
There has been a rapid increase in the number and types of digital networks over the last decade, beginning with the Internet, its constituent networks, and the World Wide Web, and continuing with the addition of online social networks such as Facebook and Twitter. The difficulty of visualizing and understanding large complex networks demands sampling tools and metrics that can help us understand their structure and mechanics. In this talk, we will review the role of random walks in sampling and characterizing large graphs and look at future directions toward characterizing dynamic graphs.
Understanding Complex Networks through Incomplete Information: Mistakes, Myths, and Positive Steps. Performance Workshop at the Conference of the Society of Brazilian Computing (CSBC), the premier annual computer science event in Brazil, 2010. slides (in Portuguese) pdf