Modern ubiquitous platforms, such as mobile apps, web browsers, social networks, and IoT devices, are providing sophisticated services to users while also increasingly collecting privacy-sensitive data. Service providers do give users fine-grained privacy controls over these sensitive actions; however, the number of privacy settings has reached a point where it is overwhelming to users, preventing them from taking advantage of these controls.

This work addresses this user burden problem by studying to what extent machine learning techniques could simplify user decisions and help users configure their privacy settings. We chose mobile app permissions as a first domain to explore this topic.

Specifically, in our first study, we explored the power of different combinations of features to predict users’ mobile app permission settings based on a dataset of 200K real Android users. We evaluated the profile-based predictions together with individual prediction models and showed that, by selectively prompting users on 10% of the permission requests, the system could predict users’ app permission settings with 90% accuracy. We conducted a second study in which we applied nudging to motivate users to engage with the settings, to help develop strong predictive models even from a small sample of users. The two studies confirmed that a relatively small number of profiles can go a long way toward capturing users’ diverse privacy preferences and predicting their privacy settings.
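The idea of predicting most settings from a profile and prompting the user only on uncertain requests can be sketched as follows. This is a minimal illustration under stated assumptions, not the models used in the studies: the record layout, the majority-vote predictor, and the `min_confidence` threshold are all hypothetical.

```python
from collections import Counter, defaultdict


def build_profile_model(records):
    """Count past decisions per (profile, app category, permission).

    `records` is an iterable of (profile_id, category, permission, decision)
    tuples. This layout is an assumption made for illustration.
    """
    counts = defaultdict(Counter)
    for profile_id, category, permission, decision in records:
        counts[(profile_id, category, permission)][decision] += 1
    return counts


def predict_or_prompt(model, profile_id, category, permission, min_confidence=0.8):
    """Return (decision, confidence), or ("prompt", confidence) when unsure."""
    seen = model.get((profile_id, category, permission))
    if not seen:
        # No data for this combination: ask the user.
        return "prompt", 0.0
    decision, votes = seen.most_common(1)[0]
    confidence = votes / sum(seen.values())
    if confidence < min_confidence:
        # The profile is split on this request: ask the user.
        return "prompt", confidence
    return decision, confidence
```

Raising `min_confidence` sends more requests to the user and lowers the share that is predicted automatically, which is the trade-off the prompting budget controls.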

We then introduced an app that provides personalized recommendations for mobile app permission settings. Our results from a pilot study (N=72) conducted on real Android users showed that participants accepted 78.7% of the recommendations and kept 94.9% of these settings on their phones with comfort. In light of this, we propose a final study exploring the extent to which users’ privacy profiles learned from mobile app permissions could help predict users’ privacy settings across other domains.

Thesis Committee:
Norman Sadeh (Chair)
Lorrie Cranor
Alessandro Acquisti
Florian Schaub (University of Michigan)
Nina Taft (Google)

Join us for CMU Privacy Day 2017 at Carnegie Mellon University. CMU is celebrating the International Data Privacy Day by presenting privacy research and practical advice on protecting privacy online. Privacy Day is open to the public, and no registration is required.

Data Privacy Day is an international effort to empower and educate people to protect their privacy and control their digital footprint. For more information, please visit

Privacy Day will feature a Privacy Clinic. Come and learn how to protect your privacy. CMU’s information privacy and security students will educate you and answer your questions about privacy risks and remedies concerning many topics, including:

  • Web Application for Searching and Comparing Financial Companies' Privacy Practices
  • Are you being monitored at Carnegie Mellon?
  • Online Tracking and Targeted Ads
  • Private Browsing
  • The Decline of the Ad Blocker
  • Privacy for IoT Devices
  • How to Avoid In-App Tracking and Advertising
  • Encryption for Messenger Apps
  • Opting Out from Ad Targeting
  • Analyzing Privacy Requirements for Mobile Apps
  • Generating Privacy Policies for Websites and Apps

Refreshments will be provided.

Hosted by the MSIT-Privacy Engineering Program.

Structured probabilistic inference has been shown to be useful in modeling complex latent structures of data. One successful application of this technique is the discovery of latent topical structures in text data, usually referred to as topic modeling. With the recent popularity of mobile devices and social networking, we can now easily acquire text data attached to meta information, such as geo-spatial coordinates and time stamps. This metadata can provide rich and accurate information that is helpful in answering many research questions related to spatial and temporal reasoning. However, such data must be treated differently from text data. For example, spatial data is usually organized in terms of a two-dimensional region, while temporal information can exhibit periodicities. While some existing work in the topic modeling community utilizes some of the meta information, these models largely focus on incorporating metadata into text analysis, rather than providing models that make full use of the joint distribution of meta information and text.

In this thesis, I propose the event detection problem, which is a multi-dimensional latent clustering problem on spatial, temporal and topical data. I start with a simple parametric model to discover independent events using geo-tagged Twitter data. The model is then improved in two directions. First, I augmented the model using the Recurrent Chinese Restaurant Process (RCRP) to discover events that are dynamic in nature. Second, I studied a model that can detect events using data from multiple media sources. I studied the characteristics of different media in terms of reported event times and linguistic patterns.

The approaches studied in this thesis are largely based on Bayesian non-parametric methods, in order to deal with streaming data and an unpredictable number of clusters. The research will not only serve the event detection problem itself but also shed light on the more general structured clustering problem in spatial, temporal and textual data.
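To illustrate how a non-parametric prior copes with an unknown number of clusters, here is a minimal sketch of the standard Chinese Restaurant Process. It is the plain CRP, not the Recurrent CRP used in the thesis, and it ignores the spatial, temporal and textual likelihoods entirely. The function name `crp_assign` and the parameter values are assumptions for illustration.

```python
import random


def crp_assign(num_items, alpha, seed=0):
    """Sample cluster assignments from a Chinese Restaurant Process prior.

    Each new item joins an existing cluster with probability proportional
    to that cluster's size, or opens a new cluster with probability
    proportional to `alpha` (which must be positive).
    """
    rng = random.Random(seed)
    assignments = []
    cluster_sizes = []
    for _ in range(num_items):
        # Existing clusters are weighted by size; the last slot is "new cluster".
        weights = cluster_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(cluster_sizes):
            cluster_sizes.append(1)
        else:
            cluster_sizes[k] += 1
        assignments.append(k)
    return assignments, cluster_sizes
```

The number of clusters is not fixed in advance: it grows as items arrive, with larger `alpha` favoring more clusters. This is the property that makes such priors attractive for streaming data.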

Thesis Committee:
Kathleen M. Carley (Chair)
Tom Mitchell (MLD)
Alexander Smola (MLD/Amazon)
Huan Liu (Arizona State University)

Copy of Thesis Document

In our recent work, we aggressively modify Java bytecode in order to implement a novel technique called variational execution. As we delved deeply into the bytecode, we realized that bytecode manipulation is a powerful technique that could be applied to various application domains. It has its own unique advantages over similar techniques like source-to-source transformation. It can be useful for simple tasks like performance profiling, refactoring and runtime checking. It is also widely used in the research community for more complicated tasks like static analysis and dynamic analysis. In this talk, I am going to briefly introduce Java bytecode and show a few examples of how bytecode manipulation could be useful in a variety of scenarios. My hope is that, after this talk, you will have one more implementation option to consider for your research project.

Software crashes are a serious category of defects and are generally handled with high priority. To debug a software crash, companies such as Microsoft, Apple, Google and Synopsys collect function stack traces. Often the same issue in the code causes crashes at different customer sites, resulting in the submission of multiple crash reports for that issue. Having multiple traces for the same issue could increase turnaround time. Therefore, efficient management of stack traces is required: an approach is needed that can group (cluster) the crash reports that are caused by the same issue.
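As a rough illustration of the grouping problem, the sketch below buckets crash reports by their top few stack frames. This is a deliberately naive baseline, not the MSR approach discussed in the talk. The frame names, the `ignored` list of generic frames, and the `top_k` cutoff are hypothetical.

```python
from collections import defaultdict


def signature(frames, top_k=3, ignored=("abort", "raise", "__assert_fail")):
    """Build a grouping key from the innermost meaningful frames.

    `frames` lists function names, innermost first. Generic frames in
    `ignored` are skipped so they do not hide the real failure site.
    """
    meaningful = [f for f in frames if f not in ignored]
    return tuple(meaningful[:top_k])


def group_crashes(reports, top_k=3):
    """Group report ids whose stack traces share the same signature."""
    groups = defaultdict(list)
    for report_id, frames in reports.items():
        groups[signature(frames, top_k)].append(report_id)
    return dict(groups)
```

Exact matching on top frames misses traces that differ slightly yet share a root cause, which is why similarity-based clustering is usually preferred in practice.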

During a summer internship at Synopsys Inc., I worked on this problem of clustering crashes (based on stack traces) that belong to the same issue/component. The problem has already been studied in the past. Most recently, Microsoft Research (MSR) proposed a solution, which on the surface appeared to solve this problem. However, there turned out to be a number of reasons why it could not be applied directly to legacy products such as Synopsys-VCS.

This talk (1) provides an overview of the problem and the solution proposed by MSR, (2) discusses the challenges in applying the solution to a legacy product at Synopsys, and (3) reflects upon the lessons learned while working on this project.

Cyber-physical systems (CPSs) are systems that mix software and physical control, with equal prominence. Typically, several software (or cyber) models have been used for the management and control of CPSs. However, to fully realize the goals of CPSs, physical models too have to be treated as first-class models in these systems. This gives rise to three main challenges: (a) identifying and integrating physical and software models with different characteristics and semantics; (b) obtaining instances of physical models at a suitable level of abstraction for control; and (c) using and adapting physical models to control CPSs. In this talk, I will discuss these challenges and outline the steps that we have taken to address them in the context of developing power models for a robotic platform named TurtleBot.

The ability to specify immutability in a programming language is a powerful tool for developers, enabling them to better understand and more safely transform their code without fearing unintended changes to program state. The C++ programming language allows developers to specify a form of immutability using the const keyword. In this work, we characterize the meaning of the C++ const qualifier and present the ConstSanitizer tool, which dynamically verifies a stricter form of immutability than that defined in C++: it identifies const uses that are not consistent with transitive immutability, that write to mutable fields, or that write to formerly-const objects whose const-ness has been cast away.

We evaluate a set of 7 C++ benchmark programs to find writes-through-const, establish root causes for how they fail to respect our stricter definition of immutability, and assign attributes to each write (namely: synchronized, not visible, buffer/cache, delayed initialization, and incorrect). ConstSanitizer finds 17 archetypes for writes in these programs which do not respect our version of immutability. Over half of these seem unnecessary to us. Our classification and observations of behaviour in practice contribute to the understanding of a widely-used C++ language feature.

Jon Eyolfson is a PhD candidate at the University of Waterloo. His current research is on immutability in the presence of writes. His most recent work involves dynamic empirical analysis of immutability, while continuing work aims to statically analyze immutability. Previously, he investigated unread memory using dynamic analysis and conducted an empirical study on what time of the day buggy commits occur.

Faculty Host: Jonathan Aldrich

Despite decades of research into developing abstract security advice and improving interfaces, users still struggle to make passwords. Users frequently create passwords that are predictable for attackers or make other decisions (e.g., reusing the same password across accounts) that harm their security. In this thesis, I use data-driven methods to better understand how users choose passwords and how attackers guess passwords. I then combine these insights into a better password-strength meter that provides real-time, data-driven feedback about the user's candidate password.

I first quantify the impact on password security and usability of showing users different password-strength meters that score passwords using basic heuristics. I find in a 2,931-participant online study that meters that score passwords stringently and present their strength estimates visually lead users to create stronger passwords without significantly impacting password memorability. Second, to better understand how attackers guess passwords, I perform comprehensive experiments on password cracking approaches. I find that simply running these approaches in their default configuration is insufficient, but considering multiple well-configured approaches in parallel can serve as a proxy for guessing by an expert in password forensics. The third and fourth sections of this thesis delve further into how users choose passwords. Through a series of analyses, I pinpoint ways in which users structure semantically significant content in their passwords. I also examine the relationship between users' perceptions of password security and passwords' actual security, finding that while users often correctly judge the security impact of individual password characteristics, wide variance in their understanding of attackers may lead users to judge predictable passwords as sufficiently strong. Finally, I integrate these insights into an open-source password-strength meter that gives users data-driven feedback about their specific password. I validate this meter through a ten-participant laboratory study and 1,624-participant online study.

Thesis Committee:
Lorrie Faith Cranor (Chair)
Alessandro Acquisti (Heinz)
Lujo Bauer (ECE/ISR)
Jason Hong (HCII)
Michael K. Reiter (University of North Carolina at Chapel Hill)

Copy of Thesis Document

Developers often build regression test suites that are automatically run to check that code changes do not break any functionality. Nowadays, tests are usually run on a cloud-based continuous integration service (CIS), e.g., Travis CI. Although regression testing is important, it is also costly, and the cost is reportedly increasing. For example, Google recently reported that they observed a quadratic increase in test-suite run time (a linear increase in the number of changes and a linear increase in the number of tests). One approach to speeding up regression testing is regression test selection (RTS), which runs only a subset of tests that may be affected by the latest changes. To detect affected tests, RTS techniques statically analyze the latest changes to a codebase. To obtain an overall time saving compared to rerunning all the tests, RTS techniques have to balance the time spent on analysis against the time saved by not running non-selected tests. In addition, novel techniques are needed to reduce other extra costs when running tests on a CIS, such as the cost of library retrieval.

I proposed a new, lightweight RTS technique, called Ekstazi, that provides a sweet-spot balance between the analysis time and the time for running non-selected tests. Ekstazi is also the first RTS technique for software that uses modern distributed version-control systems, e.g., Git. I also proposed Molly, a novel technique that substantially reduces the library retrieval cost. Molly lazily retrieves (parts of) libraries only when the libraries are accessed by the language compiler/runtime. I implemented Ekstazi and Molly for Java, and evaluated them on several hundred revisions of 32 open-source projects (totaling 5M lines of code). Ekstazi reduced the overall time by 54% compared to running all tests. Furthermore, Molly reduced the library retrieval time by 50%. Finally, only a few months after the initial release, Ekstazi was adopted by several Apache projects.
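The general idea of regression test selection can be sketched as follows: remember which files each test depends on and a checksum for each file, then rerun only the tests whose dependencies changed. This is an illustrative sketch under stated assumptions, not Ekstazi's implementation. The dependency map, the file names, and the helper names `checksum` and `select_tests` are hypothetical.

```python
import hashlib


def checksum(content: bytes) -> str:
    """Content hash used to detect whether a file changed."""
    return hashlib.sha256(content).hexdigest()


def select_tests(test_deps, old_checksums, current_files):
    """Return the tests that may be affected by the latest changes.

    test_deps:     test name -> set of file paths the test depends on
    old_checksums: file path -> checksum recorded at the previous run
    current_files: file path -> current file content (bytes)
    """
    # A file counts as changed if it is new or its content hash differs.
    changed = {
        path
        for path, content in current_files.items()
        if old_checksums.get(path) != checksum(content)
    }
    # Files that disappeared since the previous run also count as changed.
    changed |= set(old_checksums) - set(current_files)
    return sorted(test for test, deps in test_deps.items() if deps & changed)
```

The analysis here costs one hash per file, so the time saved by skipping unaffected tests has to outweigh that overhead for the selection to pay off.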

Milos Gligoric is an Assistant Professor in Electrical and Computer Engineering at the University of Texas at Austin. His research interests are in software engineering and formal methods, especially in designing techniques and tools that improve software quality and developers' productivity. His PhD work has explored test-input generation, test-quality assessment, testing concurrent code, and regression testing. Two of his papers won ACM SIGSOFT Distinguished Paper awards (ICSE 2010 and ISSTA 2015), and three of his papers were invited for journal submissions. Milos was awarded a Google Faculty Research Award (2015) and an NSF CRII Award (2016). Milos’ PhD dissertation won the ACM SIGSOFT Outstanding Doctoral Dissertation Award (2016) and the UIUC David J. Kuck Outstanding Ph.D. Thesis Award (2016). Milos holds a PhD (2015) from UIUC, and an MS (2009) and BS (2007) from the University of Belgrade, Serbia.

Faculty Host: Claire Le Goues

