Despite decades of research into developing abstract security advice and improving interfaces, users still struggle to make passwords. Users frequently create passwords that are predictable for attackers or make other decisions (e.g., reusing the same password across accounts) that harm their security. In this thesis, I use data-driven methods to better understand how users choose passwords and how attackers guess passwords. I then combine these insights into a better password-strength meter that provides real-time, data-driven feedback about the user's candidate password.
I first quantify the impact on password security and usability of showing users different password-strength meters that score passwords using basic heuristics. I find in a 2,931-participant online study that meters that score passwords stringently and present their strength estimates visually lead users to create stronger passwords without significantly impacting password memorability. Second, to better understand how attackers guess passwords, I perform comprehensive experiments on password cracking approaches. I find that simply running these approaches in their default configuration is insufficient, but considering multiple well-configured approaches in parallel can serve as a proxy for guessing by an expert in password forensics. The third and fourth sections of this thesis delve further into how users choose passwords. Through a series of analyses, I pinpoint ways in which users structure semantically significant conteSocietal Computing, Ph.D. Student nt in their passwords. I also examine the relationship between users' perceptions of password security and passwords' actual security, finding that while users often correctly judge the security impact of individual password characteristics, wide variance in their understanding of attackers may lead users to judge predictable passwords as sufficiently strong. Finally, I integrate these insights into an open-source password-strength meter that gives users data-driven feedback about their specific password. I validate this meter through a ten-participant laboratory study and 1,624-participant online study.
Lorrie Faith Cranor (Chair)
Alessandro Acquisti (Heinz)
Lujo Bauer (ECE/ISR)
Jason Hong (HCII)
Michael K. Reiter (University of North Carolina at Chapel Hill)
Developers often build regression test suites that are automatically run to check that code changes do not break any functionality. Nowadays, tests are usually run on a could-based continuous integration service (CIS), e.g., Travis CI. Although regression testing is important, it is also costly, and the cost is reportedly increasing. For example, Google recently reported that they observed quadratic increase in test-suite run time (linear increase in the number of changes and linear increase in the number of tests). One approach to speed up regression testing is regression test selection (RTS), which runs only a subset of tests that may be affected by the latest changes. To detect affected tests, RTS techniques statically analyze the latest changes to a codebase. To obtain overall time saving, compared to rerunning all the tests, RTS techniques have to balance the time spent on analysis vs. the time saved from not running non-selected tests. In addition, novel techniques are needed to reduce other extra costs when running tests on CIS, such as the cost of library retrieval.
I proposed a new, lightweight RTS technique, called Ekstazi, that provides a sweet-spot balancing of the analysis time and time for running non-selected tests. Ekstazi is also the first RTS technique for software that uses modern distributed version-control systems, e.g., Git. I also proposed Molly, a novel technique that substantially reduces the library retrieval cost. Molly lazily retrieves (parts of) libraries only when the libraries are accessed by the language compiler/runtime. I implemented Ekstazi and Molly for Java, and evaluated them on several hundred revisions of 32 open-source projects (totaling 5M lines of code). Ekstazi reduced the overall time 54% compared to running all tests. Furthermore, Molly reduced the library retrieval time by 50%. Finally, only a few months after the initial release, Ekstazi was adopted by several Apache projects.
Milos Gligoric is an Assistant Professor in Electrical and Computer Engineering at the University of Texas at Austin. His research interests are in software engineering and formal methods, especially in designing techniques and tools that improve software quality and developers' productivity. His PhD work has explored test-input generation, test-quality assessment, testing concurrent code, and regression testing. Two of his papers won ACM SIGSOFT Distinguished Paper awards (ICSE 2010 and ISSTA 2015), and three of his papers were invited for journal submissions. Milos was awarded a Google Faculty Research Award (2015) and an NSF CRII Award (2016). Milos’ PhD dissertation won the ACM SIGSOFT Outstanding Doctoral Dissertation Award (2016) and the UIUC David J. Kuck Outstanding Ph.D. Thesis Award (2016). Milos holds a PhD (2015) from UIUC, and an MS (2009) and BS (2007) from the University of Belgrade, Serbia.
Faculty Host: Claire Le Goues
The rise of the Islamic State of Iraq and al-Sham (ISIS) has been watched by millions through the lens of social media. This “crowd” of social media users has given the group broad reach resulting in a massive online support community that is an essential element of their public affairs and resourcing strategies. Other extremist groups have begun to leverage social media as well. Online Extremist Community (OEC) detection holds great potential to enable social media intelligence (SOCMINT) mining as well as informed strategies to disrupt these online communities. I present Iterative Vertex Clustering and Classification (IVCC), a scalable analytic approach for OEC detection in annotated heterogeneous networks, and propose several extensions to this methodology to help provide policy makers the ability to identify these communities at scale to understand members’ interests and influence. Ultimately, these methods could provide unique insights to shape information operations, intervention strategies, and policy decisions.
In this thesis, I propose the following contributions:
- efficient search and discovery of OEC members via semi-supervised dense community detection
- an active learning framework to incorporate regional expertise and enable monitoring of highly dynamic OECs
- a formal study of mention strategies used to gain social influence in within a target online community
- an extended literature review of methods applicable to detection, analysis, and disruption of OECs
The contributions proposed in this thesis will be applied to four large Twitter corpora containing distinct online communities of interest. My goal is to provide a substantive foundation enabling follow-on work in this emergent area so critical to counter-terrorism and national security.
Kathleen M. Carley (Chair, ISR)
Daniel Neill (Heinz)
Dr. Zico Kolter (CSD/ISR)
Randy Garrett (IronNet Cybersecurity Inc.)
Social identities, the labels we use to describe ourselves and others, carry with them stereotypes that have significant impacts on our social lives. Our stereotypes, sometimes without us knowing, guide our decisions on whom to talk to and whom to stay away from, whom to befriend and whom to bully, whom to treat with reverence and whom to view with disgust.
Despite the omnipotent impact of identities and stereotypes on our lives, existing methods used to understand them are lacking. In this thesis, I first develop three novel computational tools that further our ability to test and utilize existing social theory on identity and stereotypes. These tools include a method to extract identities from Twitter data, a method to infer affective stereotypes from newspaper data and a method to infer both affective and semantic stereotypes from Twitter data. Case studies using these methods provide insights into Twitter data relevant to the Eric Garner and Michael Brown tragedies and both Twitter and newspaper data from the Arab Spring'.
Results from these case studies motivate the need for not only new methods for existing theory, but new social theory as well. To this end, I develop a new socio-theoretic model of identity labeling - how we choose which label to apply to others in a particular situation. The model combines data, methods and theory from the social sciences and machine learning, providing an important example of the surprisingly rich interconnections between these fields.
Kathleen M. Carley (Chair)
Jason Hong (HCII)
Eric Xing (MLD/LTI,/CSD)
Lynn Smith-Lovin (Duke University)
Robert L. Wilson (Department of Sociology)
It is essentially impossible to build modern software without extensive supply chains. Supply chains decrease development risk, but typically at the cost of increased security risk. In particular, it is often difficult to trust or verify that a component contains no unwanted behaviors, vulnerabilities, or malicious code that may harm an application or its users.
Sandboxes provide relief by encapsulating a component and imposing a security policy on it. Instead of trusting or verifying the component, we must trust or verify the relatively simple sandbox. Given this appealing prospect, researchers have spent decades developing new sandboxes. However, sandboxes are rarely used in practice. What is hampering adoption? This thesis advances our understanding of and overcomes some barriers to adoption.
We begin by systematically analyzing a decade of sandboxing research from relevant top-tier conferences. This work uncovers our first two challenges: sandbox researchers often (1) use relatively subjective validation techniques and (2) ignore usability.
We then focus on the Java sandbox to empirically study its use within the open source community. We find unnecessary complexity in the sandbox that has promoted a thriving exploit landscape. We develop run time monitors for the Java Virtual Machine (JVM) that disable these complexity points, stopping sandbox escaping exploits without breaking compatibility with benign applications. However, we found no cases where the sandbox was successfully used for security purposes, which motivated us to overcome necessary complexity via tooling. Our tools help users derive and refine a security policy and impose it on targeted Java application components.
Finally, we observe vulnerability-inducing design and implementation complexity in adopted sandboxes, and thus develop and evaluate a sandboxing technique that leverages cloud computing environments to execute untrusted computations. Any malicious outcomes produced by the computations are contained by ephemeral virtual machines.
We evaluate our tools and sandbox in a series of case studies that investigate hard-to-sandbox cases in collaboration with industrial partners.
Jonathan Aldrich (Co-chair)
William L. Scherlis (Co-chair)
Lujo Bauer (ECE/ISR)
Bruno Amizic (The Boeing Company)
With the emergence of the Internet of Things and a data-centric economy, where a growing number of products, services and business processes rely on the collection and processing of user data, people are increasingly confronted with an unmanageable number of privacy decisions. What is needed is a new, more scalable paradigm that empowers them to regain control over their data. Over the past ten years, my group at CMU has been working towards the development of personalized privacy assistants. To be effective, these assistants will have to be capable of incrementally learning the privacy preferences of their users, semi-automatically configure many privacy settings on their behalf and generally alert their users about privacy practices they may not be expecting. These assistants will have to be minimally disruptive, limiting their interactions with us to the bare minimum, yet knowing enough about us to make many decisions on our behalf.
Through these very selective interactions, they will alert us about practices we may not feel comfortable with, confirm privacy settings they are not sure how to configure, refine models of our preferences, and occasionally nudge us to carefully (re)consider the implications of some of our privacy decisions.
As part of this presentation, I will summarize our progress in this area, focusing in particular on a series of pilot studies conducted to evaluate personalized privacy assistant functionality developed to help smartphone users manage their mobile app privacy settings. I will also review some of the ethical and business challenges that the development of this type of technology needs to consider.
Norman Sadeh is a Professor in the School of Computer Science at Carnegie Mellon University. His research interests span mobile and pervasive computing, cybersecurity and privacy, human computer interaction, machine learning and personal assistants. Norman is director of the School of Computer Science’s Mobile Commerce Lab and also co-founder and co-director of the School of Computer Science’s Master’s Program in Privacy Engineering. He also co-founded and served as co-director of the School’s PhD Program in Societal Computing during its first ten years.
Dr. Sadeh is also co-founder, chairman and chief scientist of Wombat Security Technologies, a CMU spinoff that sells a unique suite of cybersecurity training products and anti-phishing filtering technologies. Norman has been on the faculty at Carnegie Mellon since 1991 and is also well-known for his work on the livehoods project as well as earlier work in scheduling, constraint satisfaction and constrained optimization, supply chain management and Semantic Web technologies. In the late nineties, he served as Chief Scientist of the European Union’s $600M e-Work and e-Commerce program, which at the time included all pan-European research in cyber security and online privacy.
Norman received his PhD in computer science from Carnegie Mellon University, an MSc, also in computer science, from the University of Southern California, and a BS/MSc in Electrical Engineering and Applied Physics from Brussels Free University.
We need better tools for C, such as source browsers, bug finders, and automated refactorings. The problem is that large C systems such as Linux are software product lines, containing thousands of configuration variables controlling every aspect of the software from architecture features to file systems and drivers. The challenge for software tools is to analyze all configurations of the source code without succumbing to exponential explosion due to variability.
Bug-finding is one area where such capability is useful. Even trivial bugs like linker errors become hard to find when they only appear in certain configurations, because testing each configuration one-at-a-time is infeasible. Previous work shows that such variability bugs do appear in the Linux kernel source but, lacking automated tools, finds them by examining patches to the source. To support variability-aware tools, we focus on two key subproblems, parsing and the build system, and show how these components work together to enable a simple variability-aware bug-finder for linker errors.
A configuration-preserving preprocessor and parser called SuperC collects all function calls and definitions along with their conditions. Then, a configuration-preserving Makefile evaluator called Kmax collects all compilation units and their conditions. To find linker errors, a boolean expression representing the potential linker error is constructed for each function call. The expression combines the conditions at the definition site and its compilation unit, the call site and its compilation unit, and the Linux feature model. A SAT solver then determines whether a linker error exists in any configuration. Evaluation shows that known bugs can be found and that the finder scales across the entire Linux kernel source code.
Paul Gazillo is currently a post-doctoral associate at Yale university where I research static analyses to find complexity attacks on software. I received my PhD from New York University December 2015, where my research focused on enabling software tools for C programs in the presence of the C preprocessor and variability. Before beginning the PhD, I was a research data analyst for Educational Testing Service.
Cyber-Physical Systems (CPS) are software-controlled systems that have complex interactions with the physical world. Many CPS, such as autonomous drones and self-driving cars, are becoming increasingly more embedded in our society, and therefore safety-critical and demanding of rigorous quality assurance. To this end, CPS engineering relies on modeling methods from diverse scientific and engineering fields, for example control theory and real-time scheduling. Diverse modeling methods are difficult to combine with each other due to their complexity and heterogeneity. Inconsistencies between models and analyses that come from different modeling methods often lead to implicit design errors, which subsequently can cause critical CPS failures with loss of lives and substantial material resources.
To detect and prevent inconsistencies between CPS modeling methods, this thesis investigates an improved architectural approach to integration of CPS modeling methods. This approach relies on architectural views (annotated component-and-connector models) to abstract out and check integration-relevant information from detailed models (e.g., hybrid programs). On top of these views I introduce a novel integration perspective based analyses -- algorithms and procedures that interpret and augment models. For each analysis I propose to specify a contract that captures inputs, outputs, assumptions and guarantees of the analysis in terms of view elements. A particularly challenging task is creating a language to express assumptions, guarantees, and consistency statements over heterogeneous models and views. This language needs to strike a balance between expressiveness and decidability to be effective in the CPS context.
The conceptual advances of this thesis enable a new level of automation for CPS modeling method integration. I will implement these advances in a toolset that will support automated model-view synchronization, analysis execution, and verification of semantic consistency between models. This toolset will serve as a means of evaluating the proposed integration approach in case studies of realistic CPS systems, such as autonomous spacecraft and collaborative robots. I will validate claims about correctness, effectiveness, and generality of my approach.
David Garlan (Chair, ISR)
André Platzer (CSD CMU)
Bruce Krogh (ECE CMU)
Dionisio de Niz (Software Engineering Institute, CMU)
John Day (Jet Propulsion Laboratory/NASA)
Solving the expression problem requires a language or system to allow one to add new variants to a datatype as well as new operations over it while ensuring strong static type safety without re-compiling the existing implementation. The independently-extensible version goes a step further to require that these extensions can be added in any order. In extensible languages this amounts to adding new (abstract) syntax to a language and new semantic analyses or translations to it. Attribute grammars solve these versions of the expression problem, but with forwarding and a modular well-definedness analysis they also solve a version that requires that no 'glue' code need be written to combine the various extensions. This means that the composition can always be done safely and automatically. Using these techniques we have built ableC, an extensible language framework for C in which non-export programmers can select the language extensions that best fit their task at hand with the confidence that the supporting tools can generate a working compiler for their custom language.
Eric Van Wyk's research focuses on programming languages, in particular extensible programming languages and compilers, applications of temporal logic, and algebraic compilers. In 2005 he was awarded a McKnight Land-Grant Professorship and the National Science Foundation's CAREER award in 2004.
He has authored or co-authored more than 25 publications, including journal and conference papers, articles and technical reports. Van Wyk has developed various software packages including the Silver attribute grammar specification and evaluation system, extensible specifications of Java 1.4 and ANSI C written in Silver, and various domain-specific language extensions for these Java and C specifications. He is a member of ACM, ACM SIGPLAN, IEEE, the IEEE Computer Society, and is involved in numerous conference committees. Van Wyk also does outreach, serving as a member of the St. Louis Park High School School Business and Information Technology Advisory Board.
Faculty Host: Jonathan Aldrich