Associate Professor · Carnegie Mellon University · Institute for Software Research
Key publications highlighted in yellow.
A practical and innovative textbook detailing how to build real-world software products with machine learning components, not just models. Traditional machine learning texts focus on how to train and evaluate the machine learning model, while MLOps books focus on how to streamline model development and deployment. But neither focus on how to build actual products that deliver value to users. This practical textbook, by contrast, details how to responsibly build products with machine learning components, covering the entire development lifecycle from requirements and design to quality assurance and operations. Machine Learning in Production brings an engineering mindset to the challenge of building systems that are usable, reliable, scalable, and safe within the context of real-world conditions of uncertainty, incomplete information, and resource constraints. Based on the author's popular class at Carnegie Mellon, this pioneering book integrates foundational knowledge in software engineering and machine learning to provide the holistic view needed to create not only prototype models but production-ready systems. • Integrates coverage of cutting-edge research, existing tools, and real-world applications • Provides students and professionals with an engineering view for production-ready machine learning systems • Proven in the classroom • Offers supplemental resources including slides, videos, exams, and further readings
Modern machine learning produces models that are impossible for users or developers to fully understand – raising concerns about trust, oversight and human dignity. Transparency and explainability methods aim to provide some help in understanding models, but it remains challenging for developers to design explanations that are understandable to target users and effective for their purpose. Emerging guidelines and regulations set goals but may not provide effective actionable guidance to developers. In a controlled experiment with 124 participants, we investigate whether and how specific forms of policy guidance help developers design explanations for an ML-powered screening tool for diabetic retinopathy. Contrary to our expectations, we found that participants across the board struggled to produce quality explanations, comply with the provided policy requirements for explainability, and provide evidence of compliance. We posit that participant noncompliance is in part due to a failure to imagine and anticipate the needs of their audience, particularly non-technical stakeholders. Drawing on cognitive process theory and the sociological imagination to contextualize participants' failure, we recommend educational interventions.
Machine learning models are often criticized as opaque from a lack of transparency in their decision-making process. This study examines how policy design impacts the quality of explanations in ML models. We conducted a classroom experiment with 124 participants and analyzed the effects of policy length and purpose on developer compliance with policy requirements. Our results indicate that while policy length affects engagement with some requirements, policy purpose has no effect, and explanation quality is generally poor. These findings highlight the challenge of effective policy development and the importance of addressing diverse stakeholder perspectives within explanations.
Recent high-profile incidents in open-source software have greatly raised practitioner attention on software supply chain attacks. To guard against potential malicious package updates, security practitioners advocate pinning dependency to specific versions rather than floating in version ranges. However, it remains controversial whether pinning carries a meaningful security benefit that outweighs the cost of maintaining outdated and possibly vulnerable dependencies. In this paper, we quantify, through counterfactual analysis and simulations, the security and maintenance impact of version constraints in the npm ecosystem. By simulating dependency resolutions over historical time points, we find that pinning direct dependencies not only (as expected) increases the cost of maintaining vulnerable and outdated dependencies, but also (surprisingly) even increases the risk of exposure to malicious package updates in larger dependency graphs due to the specifics of npm’s dependency resolution mechanism. Finally, we explore collective pinning strategies to secure the ecosystem against supply chain attacks, suggesting specific changes to npm to enable such interventions. Our study provides guidance for practitioners and tool designers to manage their supply chains more securely.
Machine learning (ML) components are increasingly integrated into software products, yet their complexity and inherent uncertainty often lead to unintended and potentially hazardous consequences, both for individuals and society at large. Despite these risks, practitioners rarely adopt proactive approaches to anticipate and mitigate potential hazards before they occur. Traditional safety engineering approaches, such as Failure Mode and Effects Analysis (FMEA) and System Theoretic Process Analysis (STPA), offer promising frameworks for systematic early risk identification but are rarely adopted. In this position paper, we argue that hazard analysis should be an integral part of developing any ML-powered software product and that greater support is needed to make this process manageable for developers. By using large language models (LLMs) to partially automate a modified STPA process with human oversight at critical steps, we expect to address two key challenges: the heavy dependency on highly experienced safety engineering experts, and the time-consuming, labor-intensive nature of traditional hazard analysis, which often impedes its integration into real-world development workflows. We illustrate our approach with a running example, demonstrating that many seemingly unanticipated issues can, in fact, be anticipated. We conclude with a call to action for the software engineering community to adopt proactive safety engineering practices for ML-powered systems.
Highly-configurable software often includes performance-sensitive configuration options. There are performance expectations across different configurations, but these expectations may not hold, due to inaccurate mental models, corner cases, or unanticipated interactions with other options. We propose differential performance fuzzing of configuration options, a fuzzing technique that uses differential performance feedback to automatically identify inputs that violate these expectations for specific configuration changes. By guiding fuzzing toward scenarios where a supposedly faster configuration performs worse, differential performance fuzzing reveals unexpected performance behavior effectively. In our preliminary evaluation, our method identified unexpected performance gains in configurations presumed slower for 4 configuration options in Closure, demonstrating the potential for detecting performance issues in real-world applications.
Machine learning in production needs to balance multiple objectives: This is particularly evident in ranking or recommendation models, where conflicting objectives such as user engagement, satisfaction, diversity, and novelty must be considered at the same time. However, designing multi-objective rankers is inherently a dynamic wicked problem – there is no single optimal solution, and the needs evolve over time. Effective design requires collaboration between cross-functional teams and careful analysis of a wide range of information. In this work, we introduce Orbit, a conceptual framework for Objective-centric Ranker Building and Iteration. The framework places objectives at the center of the design process, to serve as boundary objects for communication and guide practitioners for design and evaluation. We implement Orbit as an interactive system, which enables stakeholders to interact with objective spaces directly and supports real-time exploration and evaluation of design trade-offs. We evaluate Orbit through a user study involving twelve industry practitioners, showing that it supports efficient design space exploration, leads to more informed decision-making, and enhances awareness of the inherent trade-offs of multiple objectives. Orbit (1) opens up new opportunities of an objective-centric design process for any multi-objective ML models, as well as (2) sheds light on future designs that push practitioners to go beyond a narrow metric-centric or example-centric mindset.
Large Language Models (LLMs) are increasingly embedded into software products across diverse industries, enhancing user experiences, but at the same time introducing numerous challenges for developers. Unique characteristics of LLMs force developers, who are accustomed to traditional software development and evaluation, out of their comfort zones as the LLM components shatter standard assumptions about software systems. This study explores the emerging solutions that software developers are adopting to navigate the encountered challenges. Leveraging a mixed-method research, including 26 interviews and a survey with 332 responses, the study identifies 19 emerging solutions regarding quality assurance that practitioners across several product teams at Microsoft are exploring. The findings provide valuable insights that can guide the development and evaluation of LLM-based products more broadly in the face of these challenges.
Algorithmic fairness of machine learning (ML) models has raised significant concern in the recent years. Many testing, verification, and bias mitigation techniques have been proposed to identify and reduce fairness issues in ML models. The existing methods are model-centric and designed to detect fairness issues under static settings. However, many ML-enabled systems operate in a dynamic environment where the predictive decisions made by the system impact the environment, which in turn affects future decision-making. Such a self-reinforcing feedback loop can cause fairness violations in the long term, even if the immediate outcomes are fair. In this paper, we propose a simulation-based framework called FairSense to detect and analyze long-term unfairness in ML-enabled systems. In particular, the framework targets systems with an ML model that is trained over tabular data using supervised learning. Given a fairness requirement, FairSense performs Monte-Carlo simulation to enumerate evolution traces for each system configuration. Then, FairSense performs sensitivity analysis on the space of system parameters to understand the impact of configuration decisions on long-term fairness of the system. We demonstrate FairSense's potential utility through three real-world case studies: Loan lending, opioids risk scoring, and predictive policing.
The integrity of software builds is fundamental to the security of the software supply chain. While Thompson first raised the potential for attacks on build infrastructure in 1984, limited attention has been given to build integrity in the past 40 years, enabling recent attacks on SolarWinds, event-stream, and xz. The best-known defense against build system attacks is creating reproducible builds; however, achieving them can be complex for both technical and social reasons and thus is often viewed as impractical to obtain. In this paper, we analyze reproducibility of builds in a novel context: reusable components distributed as packages in six popular software ecosystems (npm, Maven, PyPI, Go, RubyGems, and Cargo). Our quantitative study on a representative sample of 4000 packages in each ecosystem raises concerns: Rates of reproducible builds vary widely between ecosystems, with some ecosystems having all packages reproducible whereas others have \issues in nearly every package. However, upon deeper investigation, we identified that with relatively straightforward infrastructure configuration and patching of build tools, we can achieve very high rates of reproducible builds in all studied ecosystems. We conclude that if the ecosystems adopt our suggestions, the build process of published packages can be independently confirmed for nearly all packages without individual developer actions, and doing so will prevent significant future software supply chain attacks.
Large Language Models (LLMs) are increasingly embedded into software products across diverse industries, enhancing user experiences, but at the same time introducing numerous challenges for developers. Unique characteristics of LLMs force developers, who are accustomed to traditional software development and evaluation, out of their comfort zones as the LLM components shatter standard assumptions about software systems. This study explores the emerging solutions that software developers are adopting to navigate the encountered challenges. Leveraging a mixed-method research, including 26 interviews and a survey with 332 responses, the study identifies 19 emerging solutions regarding quality assurance that practitioners across several product teams at Microsoft are exploring. The findings provide valuable insights that can guide the development and evaluation of LLM-based products more broadly in the face of these challenges.
Machine learning models make mistakes, yet sometimes it is difficult to identify the systematic problems behind the mistakes. Practitioners engage in various activities, including error analysis, testing, auditing, and red-teaming, to form hypotheses of what can go (or has gone) wrong with their models. To validate these hypotheses, practitioners employ data slicing to identify examples relevant to their hypotheses. However, traditional data slicing is limited by available features and programmatic slicing functions. In this work, we propose SemSlicer, a framework that supports semantic data slicing, which identifies a semantically coherent slice, without the need for existing features. SemSlicer uses Large Language Models (LLMs) to annotate datasets and generate slices from any user-defined slicing criteria. We show that SemSlicer generates accurate slices with low cost, allows flexible trade-offs between different design dimensions, reliably identifies under-performing data slices, and helps practitioners identify useful data slices that reflect systematic problems.
One of the major challenges in the use of opaque, complex AI models is the need or desire to provide an explanation to the end-user (and other stakeholders) as to how the system arrived at the answer it did. While there is significant research in the development of explainability techniques for AI, the question remains as to who needs an explanation, what an explanation consists of, and how to communicate this to a lay user who lacks direct expertise in the area. In this position paper, an interdisciplinary team of researchers argue that the example of clinical communications offers lessons to those interested in improving the transparency and interpretability of AI systems. We identify five lessons from clinical communications: (1) offering explanations for AI systems and disclosure of their use recognizes the dignity of those using and impacted by it; (2) AI explanations can be productively targeted rather than totally comprehensive; (3) AI explanations can be enforced through codified rules but also norms, guided by core values; (4) what constitutes a “good” AI explanation will require repeated updating due to changes in technology and social expectations; 5) AI explanations will have impacts beyond defining any one AI system, shaping and being shaped by broader perceptions of AI. We review the history, debates and consequences surrounding the institutionalization of one type of clinical communication, informed consent, in order to illustrate the challenges and opportunities that may await attempts to offer explanations of opaque AI models. We highlight takeaways and implications for computer scientists and policymakers in the context of growing concerns and moves toward AI governance.
Machine learning (ML) components are increasingly incorporated into software products for end-users, but developers face challenges in transitioning from ML prototypes to products. Academics have limited access to the source of commercial ML products, challenging research progress. In this study, first, we contribute a novel process to identify 262 open-source ML products among more than half a million ML-related projects on GitHub. Then, we qualitatively and quantitatively analyze 30 open-source ML products to answer six broad research questions about development practices and system architecture. We find that the majority of the ML products in our sample represent startup-style development reported in past interview studies. We report 21 findings, including limited involvement of data scientists in many ML products, unusually low modularity between ML and non-ML code, diverse architectural choices on incorporating models into products, and limited prevalence of industry best practices such as model testing, pipeline automation, and monitoring. Additionally, we discuss 7 implications of this study on research, development, and education, including the need for tools to assist teams without data scientists, education opportunities, and open-source-specific research for privacy-preserving telemetry.
Many developers relying on open-source digital infrastructure expect continuous maintenance, but even the most critical packages can become unmaintained. Despite this, there is little understanding of the prevalence of abandonment of widely-used packages, of subsequent exposure, and of reactions to abandonment in practice, or the factors that influence them. We perform a large-scale quantitative analysis of all widely-used npm packages and find that abandonment is common among them, that abandonment exposes many projects which often do not respond, that responses correlate with other dependency management practices, and that removal is significantly faster when a projects end-of-life status is explicitly stated. We end with recommendations to both researchers and practitioners who are facing dependency abandonment or are sunsetting projects, such as opportunities for low-effort transparency mechanisms to help exposed projects make better, more informed decisions.
With the rise of artificial intelligence (AI), concerns about AI applications causing unforeseen harms to safety, privacy, security, and fairness are intensifying. While attempts to create regulations are underway, with initiatives such as the EU AI Act and the 2023 White House executive order, skepticism abounds as to the efficacy of such regulations. This paper explores an interdisciplinary approach to designing policy for the explainability of AI-based products, as the widely discussed "right to explanation" in the EU General Data Protection Regulation is ambiguous. To develop practical guidance for explainability, we conducted an experimental study that involved continuous collaboration among a team of researchers with AI and policy backgrounds over the course of ten weeks. The objective was to determine whether, through interdisciplinary effort, we can reach consensus on a policy for explainability in AI—one that is clearer, and more actionable and enforceable than current provisions. We share nine observations, derived from an iterative policy design process, which included drafting the policy, attempting to comply with it (or circumvent it), and collectively evaluating its effectiveness on a weekly basis. The observations include: iterative and continuous feedback was useful to improve policy drafts over time, discussing what evidence would satisfy policy was necessary during policy design, and human-subject studies were found to be necessary evidence to ensure effectiveness. We conclude with a note of optimism, arguing that meaningful policies can be achieved within a moderate time frame and with limited experience in policy design, as demonstrated by our student researchers on the team. This holds promising implications for policymakers, signaling that practical and effective regulation for AI applications is attainable.
Past research suggests software should be continuously maintained in order to remain useful in our digital society. To determine whether these studies on software evolution are supported in modern-day software libraries, we conduct a natural experiment on 26,050 GitHub repositories, statistically modeling library usage based on their package-level downloads against different factors related to project maintenance.
Introducing new maintainers to established projects is critical to the long-term sustainability of open-source projects. Yet, we have little understanding of what motivates developers to join and maintain already established projects. Previous research on volunteering motivations emphasizes that individuals are motivated by a unique set of factors to volunteer in a specific area, suggesting that the motivations behind open-source contributions also depend on the nature of the work. We aim to determine correlations between types of open-source contributions and their specific motivators through surveys of open-source contributors.
Large Language Models (LLMs) are increasingly integrated into software applications. Downstream application developers often access LLMs through APIs provided as a service. However, LLM APIs are often updated silently and scheduled to be deprecated, forcing users to continuously adapt to evolving models. This can cause performance regression and affect prompt design choices, as evidenced by our case study on toxicity detection. Based on our case study, we emphasize the need for and re-examine the concept of regression testing for evolving LLM APIs. We argue that regression testing LLMs requires fundamental changes to traditional testing approaches, due to different correctness notions, prompting brittleness, and non-determinism in LLM APIs.
Trust is integral for the successful and secure functioning of software supply chains, making it important to measure the state and evolution of trust in open source communities. However, existing security and supply chain research often studies the concept of trust without a clear definition and relies on obvious and easily available signals like GitHub stars without deeper grounding. In this paper, we explore how to measure trust in open source supply chains with the goal of developing robust measures for trust based on the behaviors of developers in the community. To this end, we contribute a process for decomposing trust in a complex large-scale system into key trust relationships, systematically identifying behavior-based indicators for the components of trust for a given relationship, and in turn operationalizing data-driven metrics for those indicators, allowing for the wide-scale measurement of trust in practice.
Current model testing work has mostly focused on creating test cases. Identifying what to test is a step that is largely ignored and poorly supported. We propose Weaver, an interactive tool that supports requirements elicitation for guiding model testing. Weaver uses large language models to generate knowledge bases and recommends concepts from them interactively, allowing testers to elicit requirements for further testing. Weaver provides rich external knowledge to testers and encourages testers to systematically explore diverse concepts beyond their own biases. In a user study, we show that both NLP experts and non-experts identified more, as well as more diverse concepts worth testing when using Weaver. Collectively, they found more than 200 failing test cases for stance detection with zero-shot ChatGPT. Our case studies further show that Weaver can help practitioners test models in real-world settings, where developers define more nuanced application scenarios (e.g., code understanding and transcript summarization) using LLMs.
Machine learning (ML) components are increasingly incorporated into software products, yet developers face challenges in transitioning from ML prototypes to products. Academic researchers struggle to propose solutions to these challenges and evaluate interventions because they often do not have access to close-sourced ML products from industry. In this study, we define and identify open-source ML products, curating a dataset of 262 repositories from GitHub, to facilitate further research and education. As a start, we explore six broad research questions related to different development activities and report 21 findings from a sample of 30 ML products from the dataset. Our findings reveal a variety of development practices and architectural decisions surrounding different types and uses of ML models that offer ample opportunities for future research innovations. We also find very little evidence of industry best practices such as model testing and pipeline automation within the open-source ML products, which leaves room for further investigation to understand its potential impact on the development and eventual end-user experience for the products.
While lots of research has explored how to prevent maintainers from abandoning the open-source projects that serve as our digital infrastructure, there are very few insights on addressing abandonment when it occurs. We argue open-source sustainability research must expand its focus beyond trying to keep particular projects alive, to also cover the sustainable use of open source by supporting users when they face potential or actual abandonment. We perform an interview study with 33 developers who have experienced open-source dependency abandonment and analyze the data using iterative thematic analysis. Often, multiple strategies were used to cope with abandonment, for example, first reaching out to the community to find potential alternatives, then switching to a community-accepted alternative if one exists. We found many developers felt they had little to no support or guidance when facing abandonment, leaving them to figure out what to do through a trial-and-error process on their own. Abandonment introduces cost for otherwise seemingly free dependencies, but users can decide whether and how to prepare for abandonment through a number of different strategies, such as dependency monitoring, building abstraction layers, and community involvement. In many cases, community members can invest in resources that help others facing the same abandoned dependency, but often do not because of the many other competing demands on their time – in a form of the volunteer's dilemma. We discuss cost reduction strategies and ideas to overcome this volunteers dilemma. Our findings can be used directly by open-source users seeking resources on dealing with dependency abandonment, or by researchers to motivate future work supporting the sustainable use of open source.
Incorporating machine learning (ML) components into software products raises new software-engineering challenges and elevates already existing challenges. Many researchers have invested significant effort into understanding the challenges of industry practitioners working on building products with ML components through interviews and surveys with practitioners. With the intention to aggregate and present their collective findings, we conduct a meta-summary study: We collect 50 relevant papers that together interacted with over 4758 practitioners using guidelines for systematic literature reviews and subsequently group and organize the over 500 mentions of challenges within those papers. We highlight the most commonly reported challenges and how this meta-summary will be a useful resource for the research community to prioritize research and education in this field.
The documentation practice for machine-learned (ML) models often falls short of established practices for traditional software, which impedes model accountability and inadvertently abets inappropriate or misuse of models. Recently, model cards, a proposal for model documentation, have attracted notable attention, but their impact on the actual practice is unclear. In this work, we systematically study the model documentation in the field and investigate how to encourage more responsible and accountable documentation practice. Our analysis of publicly available model cards reveals a substantial gap between the proposal and the practice. We then designed a tool named DocML aiming to (1) nudge the data scientists to comply with the model cards proposal during the model development, especially the sections related to ethics, and (2) assess and manage the documentation quality. A lab study reveals the benefit of our tool towards long-term documentation quality and accountability.
In spite of machine learning’s rapid growth, its engineering support is scattered in many forms, and tends to favor certain engineering stages, stakeholders, and evaluation preferences. We envision a capability-based framework, which uses fine-grained specifications for ML model behaviors to unite existing efforts towards better ML engineering. We use concrete scenarios (model design, debugging, and maintenance) to articulate capabilities’ broad applications across various different dimensions, and their impact on building safer, more generalizable and more trustworthy models that reflect human needs. Through preliminary experiments, we show the potential of capabilities for reflecting model generalizability, which can provide guidance for the ML engineering process. We discuss challenges and opportunities for the integration of capabilities into ML engineering.
Many organizations seek to ensure that machine learning (ML) and artificial intelligence (AI) systems work as intended in production but currently do not have a cohesive methodology in place to do so. To fill this gap, we propose MLTE (Machine Learning Test and Evaluation, colloquially referred to as “melt”), a framework and implementation to evaluate ML models and systems. The framework compiles state-of-the-art evaluation techniques into an organizational process for interdisciplinary teams, including model developers, software engineers, system owners, and other stakeholders. The MLTE tooling supports this process by providing a domain-specific language that teams can use to express model requirements, an infrastructure to define, generate, and collect ML evaluation metrics, and the means to communicate results.
Data science pipelines to train and evaluate models with machine learning may contain bugs just like any other code. Leakage between training and test data can lead to overestimating the model's accuracy during offline evaluations, possibly leading to deployment of low-quality models in production. Such leakage can happen easily by mistake or by following poor practices but may be tedious and challenging to detect manually. We develop a static analysis approach to detect common forms of data leakage in data science code. Our evaluation shows that our analysis accurately detects data leakage and that such leakage is pervasive among over 100,000 analyzed public notebooks. We discuss how our static analysis approach can help both practitioners and educators, and how leakage prevention can be designed into the development process.
Data analysis is an exploratory, interactive, and often collaborative process. Computational notebooks have become a popular tool to support this process, among others because of their ability to interleave code, narrative text, and results. However, notebooks in practice are often criticized as hard to maintain and being of low code quality, including problems such as unused or duplicated code and out-of-order code execution. Data scientists can benefit from better tool support when maintaining and evolving notebooks. We argue that central to such tool support is identifying the structure of notebooks. We present a lightweight and accurate approach to extract notebook structure and outline several ways such structure can be used to improve maintenance tooling for notebooks, including navigation and finding alternatives.
One of the key differences between traditional software engineering and machine learning (ML) is the lack of specifications for ML models. Traditionally, specifications provide a cornerstone for compositional reasoning and for the divide-and-conquer strategy of how we build large and complex systems from components, but these are hard to come by for machine learned components. While the lack of specification seems like a fundamental new problem at first sight, in fact, software engineers routinely deal with iffy specifications in practice. We face weak specifications, wrong specifications, and unanticipated interactions among specifications. ML may push us further, but the problems are not fundamentally new. Rethinking ML model composition from the perspective of the feature-interaction problem highlights the importance of software design.
Talks at practitioner-focused open-source software conferences are a valuable source of information for software engineering researchers. They provide a pulse of the community and are valuable source material for grey literature analysis. We curated a dataset of 24,669 talks from 87 open-source conferences between 2010 and 2021. We stored all relevant metadata from these conferences and provide scripts to collect the transcripts. We believe this data is useful for answering many kinds of questions, such as: What are the important/highly discussed topics within practitioner communities? How do practitioners interact? And how do they present themselves to the public? We demonstrate the usefulness of this data by reporting our findings from two small studies: a topic model analysis providing an overview of open-source community dynamics since 2011 and a qualitative analysis of a smaller community-oriented sample within our dataset to gain a better understanding of why contributors leave open-source software.
Contributors are vital to the sustainability of open source ecosys- tems, and disengagement threatens that sustainability. We seek to protect and strengthen open source communities through a better and more robust way of defining and identifying contributor dis- engagement in open source communities. To do this we, collected a large amount of gray literature on contributor disengagement, and performed a qualitative analysis to better our understanding of why contributors disengage.
Open-source software has integrated itself into our daily lives, impacting 78% of US companies in 2015. Past studies of open-source community dynamics have found motivations behind contributions and the significance of community engagement, but there are still many aspects not well understood. There’s a direct correlation between the success of an open-source project and the social interactions within its community. Most projects depend on a small group. A study by Avelino et al. on the 133 most popular GitHub projects found that 86% will fail if one or two of its core contributors leave. To sustain open-source, we need to better understand how contributors interact, what infor- mation is shared, and what concerns practitioners have. We study common topics, how these have changed over time (2011 - 2021), and what social issues have appeared within open-source commu- nities. Our research is guided by the following questions: (1) How is open-source changing/evolving? (2) What changes do practitioners believe are necessary for open-source to be sustainable? ...
Interpersonal conflict in code review, such as toxic language or an unnecessary pushback, is associated with negative outcomes such as stress and turnover. Automatic detection is one approach to prevent and mitigate interpersonal conflict. Two recent automatic detection approaches were developed in different settings: a toxicity detector using text analytics for open source issue discussions and a pushback detector using logs-based metrics for corporate code reviews. This paper tests how the toxicity detector and the pushback detector can be generalized beyond their respective contexts and discussion types, and how the combination of the two can help improve interpersonal conflict detection. The results reveal connections between the two concepts.
The introduction of machine learning (ML) components in software projects has created the need for software engineers to collaborate with data scientists and other specialists. While collaboration can always be challenging, ML introduces additional challenges with its exploratory model development process, additional skills and knowledge needed, difficulties testing ML systems, need for continuous evolution and monitoring, and non-traditional quality requirements such as fairness and explainability. Through interviews with 45 practitioners from 28 organizations, we identified key collaboration challenges that teams face when building and deploying ML systems into production. We report on common collaboration points in the development of production ML systems for requirements, data, and integration, as well as corresponding team patterns and challenges. We find that most of these challenges center around communication, documentation, engineering, and process and collect recommendations to address these challenges
Online toxicity is ubiquitous across the internet and its negative impact on the people and online communities it effects has been well documented. However, toxicity manifests differently on various platforms and toxicity in open source communities, while frequently discussed, is not well understood. We take a first stride at understanding the characteristics of open source toxicity to better inform future work designing effective intervention and detection methods. To this end, we curate a sample of 100 toxic GitHub issue discussions combining multiple search and sampling strategies. We then qualitatively analyze the sample to gain an understanding of the characteristics of open-source toxicity. We find that the prevalent forms of toxicity in open source differ from those observed on other platforms like Reddit or Wikipedia. We find some of the most prevalent forms of toxicity in open source are entitled, demanding, and arrogant comments from project users and insults arising from technical disagreements. In addition, not all toxicity was written by people external to the projects, project members were also common authors of toxicity. We also provide in-depth discussions about the implications of our findings including patterns that may be useful for detection work and subsequent questions for future work.
Determining whether a configurable software system has a performance bug or the system was misconfigured is often challenging. While there are numerous debugging techniques that can support developers in this task, there is limited empirical evidence of how useful the techniques are to address the actual needs that developers have when debugging the performance of configurable systems; most techniques are often evaluated in terms of technical accuracy instead of their usability. In this paper, we take a human-centered approach to identify, design, implement, and evaluate a solution to support developers in the process of debugging the performance of configurable software systems. We first conduct an exploratory study with 19 developers to identify the information needs that developers have during this process. Subsequently, we design and implement a tailored tool, building on relevant information provided by Global and Local performance-influence models, CPU profiling, and program slicing, to support those needs. Two user studies, with a total of 20 developers, validate and confirm that the information that we provide help developers debug the performance of configurable software systems.
Data scientists commonly use computational notebooks because they provide a good environment for testing multiple models. However, once the scientist completes the code and finds the ideal model, he or she will have to dedicate time to clean up the code in order for others to easily understand it. In this paper, we perform a qualitative study on how scientists clean their code in hopes of being able to suggest a tool to automate this process. Our end goal is for tool builders to address possible gaps and provide additional aid to data scientists, who then can focus more on their actual work rather than the routine and tedious cleaning work. By sampling notebooks from GitHub and analyzing changes between subsequent commits, we identified common cleaning activities, such as changes to markdown (e.g., adding headers sections or descriptions) or comments (both deleting dead code and adding descriptions) as well as reordering cells. We also find that common cleaning activities differ depending on the intended purpose of the notebook. Our results provide a valuable foundation for tool builders and notebook users, as many identified cleaning activities could benefit from codification of best practices and dedicated tool support, possibly tailored depending on intended use.
In this paper, we study toxic online interactions in issue discussions of open-source communities. Our goal is to qualitatively understand how toxicity impacts an open-source community like GitHub. We are driven by users complaining about toxicity, which leads to burnout and disengagement from the site. We collect a substantial sample of toxic interactions and qualitatively analyze their characteristics to ground future discussions and intervention design.
Data scientists commonly use computational notebooks because they provide a good environment for testing multiple models. However, once the scientist completes the code and finds the ideal model, the data scientist will have to dedicate time to clean up the code in order for others to understand it. In this paper, we perform a qualitative study on how scientists clean their code in hopes of being able to suggest a tool to automate this process. Our end goal is for tool builders to address possible gaps and provide additional aid to data scientists, who can then focus more on their actual work rather than the routine and tedious cleaning duties.
Data scientists reportedly spend a significant amountof their time in their daily routines on data wrangling, i.e., cleaning data and extracting features. However, data wrangling code is often repetitive and error-prone to write. Moreover, itis easy to introduce subtle bugs when reusing and adopting existing code, which result not in crashes but reduce model quality. To support data scientists with data wrangling, we present a technique to generate interactive documentation for data wrangling code. We use (1) program synthesis techniques to automatically summarize data transformations and (2) test case selection techniques to purposefully select representative examples from the data based on execution information collected with tailored dynamic program analysis. We demonstrate that a JupyterLab extension with our technique can provide documentation for many cells in popular notebooks and find in a user study that users with our plugin are faster and more effective at finding realistic bugs in data wrangling code.
Automatically repairing a bugging program is essentially a search problem, searching for code transformations that pass a set of tests. Various search strategies have been explored, but they either navigate the search space in an ad hoc way using heuristics, or systemically but at the cost of limited edit expressiveness in the kinds of supported program edits. In this work, we explore the possibility of systematically navigating the search space without sacrificing edit expressiveness. The key enabler of this exploration is variational execution, a dynamic analysis technique that has been shown to be effective at exploring many similar executions in large search spaces. We evaluate our approach on IntroClassJava and Defects4J, showing that a systematic search is effective at leveraging and combining fixing ingredients to find patches, including many high-quality patches and multi-edit patches.
In collaborative software development, it is considered to be a best practice to submit code changes as a sequence of cohesive commits, each of which records the work result of a specific development activity, such as adding a new feature, bug fixing, and refactoring. However, rather than following this best practice, developers often submit a set of loosely-related changes serving for different development activities as a composite commit, due to the tedious manual work and lack of effective tool support to decompose such a tangled changeset. Composite commits often obfuscate the change history of software artifacts and bring challenges to efficient collaboration among developers. To encourage activity-oriented commits, we propose SmartCommit, a graph-partitioning-based interactive approach to tangled changeset decomposition that leverages not only the efficiency of algorithms but also the knowledge of developers. To evaluate the effectiveness of our approach, we (1) deployed SmartCommit in an international IT company, and analyzed usage data collected from a field study with 83 engineers over 9 months; and (2) conducted a controlled experiment on 3,000 synthetic composite commits from 10 diverse open-source projects. Results show that SmartCommit achieves a median accuracy between 71%–83% when decomposing composite commits without developer involvement, and significantly helps developers follow the best practice of submitting activity-oriented commits with acceptable interaction effort and time cost in real collaborative software development.
The lack of specifications is a key difference between traditional software engineering and machine learning. We discuss how it drastically impacts how we think about divide-and-conquer approaches to system design, and how it impacts reuse, testing and debugging activities. Traditionally, specifications provide a cornerstone for compositional reasoning and for the divide-and-conquer strategy of how we build large and complex systems from components, but those are hard to come by for machine-learned components. While the lack of specification seems like a fundamental new problem at first sight, in fact software engineers routinely deal with iffy specifications in practice: we face weak specifications, wrong specifications, and unanticipated interactions among components and their specifications. Machine learning may push us further, but the problems are not fundamentally new. Rethinking machine-learning model composition from the perspective of the feature interaction problem, we may even teach us a thing or two on how to move forward, including the importance of integration testing, of requirements engineering, and of design.
Open source software projects often rely on package management systems that help projects discover, incorporate, and maintain dependencies on other packages, maintained by other people. Such systems save a great deal of effort over adhoc ways of advertising, packaging, and transmitting useful libraries, but coordination among project teams is still needed when one package makes a breaking change affecting other packages. Ecosystems differ in their approaches to breaking changes, and there is no general theory to explain the relationships between features, behavioral norms, ecosystem outcomes, and motivating values. We address this through two empirical studies. In an interview case study we contrast Eclipse, NPM, and CRAN, demonstrating that these different norms for coordination of breaking changes shift the costs of using and maintaining the software among stakeholders, appropriate to each ecosystem’s mission. In a second study, we combine a survey, repository mining, and document analysis to broaden and systematize these observations across 18 ecosystems. We find that all ecosystems share values such as stability and compatibility, but differ in other values. Ecosystems’ practices often support their espoused values, but in surprisingly diverse ways. The data provides counterevidence against easy generalizations about why ecosystem communities do what they do.
Performance-influence models can help stakeholders understand how and where configuration options and their interactions influence the performance of a system. With this understanding, stakeholders can debug performance and make deliberate configuration decisions. Current black-box techniques to build such models combine various sampling and learning strategies, resulting in trade offs between measurement effort, accuracy, and interpretability. We present Comprex, a white-box approach to build performance-influence models for configurable systems, combining insights of local measurements, dynamic taint analysis to track options in the implementation, compositionality, and compression of the configuration space, without using machine learning to extrapolate incomplete samples. Our evaluation on 4 widely-used open-source projects demonstrates that Comprex builds similarly accurate performance-influence models to the most accurate and expensive black-box approach, but at a reduced cost and with additional benefits from interpretable and local models.
The large amount of third-party packages available in fast-moving software ecosystems, such as the Node.js/npm, enables attackers to compromise applications by pushing malicious updates to their package dependencies. Studying the npm repository, we observed that many packages perform only simple computations and do not need access to filesystem or network APIs. This offers the opportunity to enforce least-privilege design per package, protecting them from malicious updates. We discuss the design space and propose a lightweight permission system that protects Node.js/npm applications by enforcing package permissions at runtime. Our system makes a large number of packages much harder to be exploited, almost for free.
Mutation testing is a fault-based technique commonly used to evaluate the quality of test suites in software systems. It consists of introducing syntactical changes, called mutations, into source code and checking whether the test cases distinguish them. Since there are dozens of distinct mutation types, one of the most challenging problems is the high computational effort required to test the whole test suite against each mutant. Since mutation testing is proposed, researchers have presented techniques aiming at effort reduction in the phases of its process. This study focuses on the potential reduction in the number of mutants provided by a special set of mutants generated by the introduction of two syntactical changes (strongly subsuming second-order mutants). In this work, we exhaustively searched for those second-order mutants Our results show that they (i) are frequently generated by the "expression removal" mutation, (ii) are likely to be killed by the same test cases that kill their constituent mutants, and (iii) have the potential to reduce the number of mutants to be executed by about 22%.
For configurable systems, features developed and tested separately may present a different behavior when combined in a system. Since software products might be composed of thousands of features, developers should guarantee that all valid combinations work properly. However, features can interact in undesired ways, resulting in failures. A feature interaction is an unpredictable behavior that cannot be easily deduced from the individual features involved. We proposed VarXplorer to inspect feature interactions as they are detected and incrementally classify them as benign or problematic. Our approach provides an iterative analysis of feature interactions allowing developers to focus on suspicious cases. In this paper, we present an experimental study to evaluate our iterative process of tests execution. We aim to understand how VarXplorer could be used for a faster and more objective feature interaction analysis. Our results show that VarXplorer may reduce up to 50% the amount of interactions a developer needs to check during the testing process.
Developers often use the C preprocessor to handle variability and portability. However, many researchers and practitioners criticize the use of preprocessor directives because of their negative effect on code understanding, maintainability, and error proneness. This negative effect may lead to configuration-related code weaknesses, which appear only when we enable or disable certain configuration options. A weakness is a type of mistake in software that, in proper conditions, could contribute to the introduction of vulnerabilities within that software. Configuration-related code weaknesses may be harder to detect and fix than weaknesses that appear in all configurations, because variability increases complexity. To address this problem, we propose a sampling-based white-box technique to detect configuration-related weaknesses in configurable systems. To evaluate our technique, we performed an empirical study with 24 popular highly configurable systems that make heavy use of the C preprocessor, such as Apache Httpd and Libssh. Using our technique, we detected 57 configuration-related weaknesses in 16 systems. In total, we found occurrences of the following five kinds of weaknesses: 30 memory leaks, 10 uninitialized variables, 9 null pointer dereferences, 6 resource leaks, and 2 buffer overflows. The corpus of these weaknesses is a valuable source to better support further research on configuration-related code weaknesses.
In configurable software systems, stakeholders are often interested in knowing how configuration options influence the performance of a system to facilitate, for example, the debugging and optimization processes of these systems. There are several black-box approaches to obtain this information, but they either sample the system end-to-end with a large number of configurations to make accurate predictions or miss important performance-influencing interactions when sampling few configurations. In addition, these approaches cannot pinpoint the parts of a program that are responsible for performance differences among configurations. This paper proposes ConfigCrusher, a white-box performance analysis that analyzes the implementation of a system to guide the performance analysis and exploits several insights about configurable systems in the process. ConfigCrusher employs a static data-flow analysis to identify how configuration options may influence control-flow decisions and instruments code regions corresponding to these decisions to dynamically analyze the influence of configuration options on the regions' performance. Our evaluation shows the feasibility of our white-box approach to more efficiently build performance models that are similar to or more accurate than current state-of-the-art approaches on 10 configurable systems. Overall, we showcase the benefits of white-box performance analyses and their potential to outperform black-box approaches and provide additional information for analyzing configurable systems.
Automation tools have become essential in contemporary software development. Tools like continuous integration services, code coverage reporters, style checkers, dependency managers, etc. are all known to provide significant improvements in developer productivity and software quality. Some of these tools are widespread, others are not. How do these automation "best practices" spread? And how might we facilitate the diffusion process for those that have seen slower adoption? In this paper, we rely on a recent innovation in transparency on code hosting platforms like GitHub–the use of repository badges–to track how automation tools spread in open-source ecosystems through different social and technical mechanisms over time. Using a large longitudinal data set, multivariate network science techniques, and survival analysis, we study which socio-technical factors can best explain the observed diffusion process of a number of popular automation tools. Our results show that factors such as social exposure, competition, and observability affect the adoption of tools significantly, and they provide a roadmap for software engineers and researchers seeking to propagate best practices and tools.
Higher-order mutation has the potential for improving major drawbacks of traditional first-order mutation, such as by simulating more realistic faults or improving test optimization techniques. Despite interest in studying promising higher-order mutants, such mutants are difficult to find due to the exponential search space of mutation combinations. State-of-the-art approaches rely on genetic search, which is often incomplete and expensive due to its stochastic nature. First, we propose a novel way of finding a complete set of higher-order mutants by using variational execution, a technique that can, in many cases, explore large search spaces completely and often efficiently. Second, we use the identified complete set of higher-order mutants to study their characteristics. Finally, we use the identified characteristics to design and evaluate a new search strategy, independent of variational execution, that is highly effective at finding higher-order mutants even in large code bases.
Higher-order mutation has the potential for improving major drawbacks of traditional first-order mutation, such as by simulating more realistic faults or improving test optimization techniques. Despite interest in studying promising higher-order mutants, such mutants are difficult to find due to the exponential search space of mutation combinations. State-of-the-art approaches rely on genetic search, which is often incomplete and expensive due to its stochastic nature. First, we propose a novel way of finding a complete set of higher-order mutants by using variational execution, a technique that can, in many cases, explore large search spaces completely and often efficiently. Second, we use the identified complete set of higher-order mutants to study their characteristics. Finally, we use the identified characteristics to design and evaluate a new search strategy, independent of variational execution, that is highly effective at finding higher-order mutants even in large code bases.
Feature flags (a.k.a feature toggles) are a mechanism to keep new features hidden behind a boolean option during development. Flags are used for many purposes, such as A/B testing and turning off a feature more easily in case of failures. While software engineering feature flags research is burgeoning, examples of software projects using flags rarely come from outside commercial and private projects, stifling academic progress. To address this gap, in this paper we present a novel mining software repositories approach to detect feature flagging open-source projects, based on analyzing the projects' commit messages. We apply our approach to all open-source GitHub projects, identifying 231,223 candidate feature flagging projects, and manually validating 100. We also report on an initial analysis of feature flags in the validated sample of 100 projects, investigating practices that correlate with shorter flag lifespans (typically desirable to reduce technical debt), such as using the issue tracker and having the flag owner (the developer introducing a flag) also be the one removing it.
Developers from open-source communities have reported high stress levels from frequent demands for features and bug fixes and the sometimes aggressive tone of these demands. Toxic conversations may demotivate and burn out developers, creating challenges for sustaining open source. We outline a path toward finding, understanding, and possibly mitigating such unhealthy interactions. We take a first step toward finding them, by developing and demonstrating a measurement instrument (an SVM classifier tailored towards the software engineering domain) to detect toxic discussions in GitHub issues. We used our classifier to analyze trends over time and in different GitHub communities, finding that toxicity varies by community and that toxicity decreased from 2012-2018.
Feature flags for continuous deployment and configuration options for customizing software share many similarities, both conceptually and technically. However, neither academic nor practitioner publications seem to distinguish these two concepts. We argue that a distinction is valuable, as applications, goals, and challenges differ fundamentally between feature flags and configuration options. In this work, we explore the differences and commonalities of both concepts to help understand practices and challenges and to help transfer existing solutions (e.g., for testing). To better understand feature flags and how they relate to configuration options, we performed nine semi-structured interviews with feature-flag experts. We discovered a number of distinguishing characteristics but also opportunities for knowledge and technology transfer across both communities. Overall, we think that both communities can learn from each other.
Open source is ubiquitous and critical infrastructure, yet funding and sustaining it is challenging. While there are many different funding models for open-source donations and concerted efforts through foundations, donation platforms like Paypal, Patreon, or OpenCollective are popular and low-bar forms to raise funds for open-source development, for which GitHub recently even built explicit support. With a mixed-method study, we explore the emerging and largely unexplored phenomenon of donations in open source: We quantify how commonly open-source projects ask for donations, statistically model characteristics of projects that ask for and receive donations, analyze for what the requested funds are needed and used, and assess whether the received donations achieve the intended outcomes. We find 25,885 projects asking for donations on GitHub, often to support engineering activities; however, we also find no clear evidence that donations influence the activity level of a project. In fact, we find that donations are used in a multitude of ways, raising new research questions about effective funding.
The notion of forking has changed with the rise of distributed version control systems and social coding environments, like GitHub. Traditionally forking refers to splitting off an independent development branch (which we call hard forks); research on hard forks, conducted mostly in pre-GitHub days showed that hard forks were often seen critical as they may fragment a community. Today, in social forking environments, open-source developers are encouraged to fork a project in order to integrate contributions to the community (which we call social forks), which may have also influenced perceptions and practices around hard forks. To revisit hard forks, we identify, study and classify 15,306 hard forks on GitHub and interview 18 owners of hard forks or forked repositories. We find that, among others, hard forks often evolve out of social forks rather than being planned deliberately and that perception about hard forks have indeed changed dramatically, seeing them often as a positive non-competitive alternative to the original project.
Software engineers have significant expertise to offer when building intelligent systems, drawing on decades of experience and methods for building systems that scale and are responsive and robust, even when built on unreliable components. Systems with artificial-intelligence or machine-learning (ML) components raise new challenges and require careful engineering. We designed a new course to teach software-engineering skills to students with a background in ML. We specifically go beyond traditional ML courses that teach modeling techniques under artifical conditions and focus, in lecture and assignments, on realism with large and changing datasets, robust and evolvable infrastructure, and purposeful requirements engineering that considers also ethics and fairness. We describe the course and our infrastructure and share experience and all material from teaching the course for the first time.
Software developed in different platforms has different characteristics and needs. More specifically, code changes are differently performed in the mobile platform compared to non-mobile platforms (e.g., desktop and Web platforms). Prior works have investigated the differences in specific platforms. However, we still lack a deeper understanding of how code changes evolve across different software platforms. In this paper, we present a study aiming at investigating the frequency of changes and how source code changes, build changes and test changes co-evolve in mobile and non-mobile platforms. We developed linear regression models to explain which factors influence the frequency of changes in different platforms and applied the Apriori algorithm to find types of changes that frequently occur together. Our findings show that non-mobile repositories have a higher number of commits per month compared to mobile and our regression models suggest that being mobile significantly impacts on the number of commits in a negative direction when controlling for confound factors, such as code size. We also found that developers do not usually change source code files together with build files or test files. We argue that our results can provide valuable information for developers on how changes are performed in different platforms so that practices adopted in successful software systems can be followed.
Forking and pull requests have been widely used in open-source communities as a uniform development and contribution mechanisms, which gives developers the flexibility to modify their own fork without affecting others. However, some projects observe severe inefficiencies, including lost and duplicate contributions and fragmented communities. We observed that different communities experience these inefficiencies to widely different degrees and interviewed practitioners indicate several project characteristics and practices, including modularity and coordination mechanisms, that may encourage more efficient forking practices. In this paper, we explore how open-source projects on GitHub differ with regard to forking inefficiencies. Using logistic regression models, we analyzed the association of context factors with the inefficiencies and found that better modularity and centralized management can encourage more contributions and a higher fraction of accepted pull requests, suggesting specific good practices that project maintainers can adopt to reduce forking-related inefficiencies in their community.
Continuous integration (CI) is an established software quality assurance practice, and the focus of much prior research with a diverse range of methods and populations. In this paper, we conduct a literature review of 37 papers on CI pain points. We then conduct a conceptual replication study on results from these papers using a triangulation design consisting of a survey with 132 responses, 12 interviews, and two logistic regressions predicting CI abandonment and switching on a dataset of 6,239 GitHub projects. We report and discuss which past results we were able to replicate, those for which we found conflicting evidence, those for which we did not find evidence, and the implications of these cases on future CI research, CI tool builders, and CI users.
In many domains, software systems cannot be deployed until authorities judge them fit for use in an intended operating environment. Certification standards and processes have been devised and deployed to regulate operations of software systems and prevent their failures. However, practitioners are often unsatisfied with the efficiency and value proposition of certification efforts. In this study, we compare two certification standards, Common Criteria and DO-178C, and collect insights from literature and from interviews with subject-matter experts to identify design options relevant to the design of standards. The results of the comparison of certification efforts—leading to the identification of design dimensions that affect their quality—serve as a framework to guide the comparison, creation, and revision of certification standards and processes. This paper puts software engineering research in context and discusses key issues around process and quality assurance and includes observations from industry about relevant topics such as recertification, timely evaluations, but also technical discussions around model-driven approaches and formal methods. Our initial characterization of the design space of certification efforts can be used to inform technical discussions and to influence the directions of new or existing certification efforts. Practitioners, technical commissions, and government can directly benefit from our analytical framework.
In configurable software systems, stakeholders are often interested in knowing how configuration options influence the performance of a system to facilitate, for example, the debugging and optimization processes of these systems. There are several black-box approaches to obtain this information, but they usually require a large number of samples to make accurate predictions, whereas the few existing white-box approaches impose limitations on the systems that they can analyze. This paper proposes ConfigCrusher, a white-box performance analysis that exploits several insights of configurable systems. ConfigCrusher employs a static data-flow analysis to identify how configuration options may influence control-flow decisions and instruments code regions corresponding to these decisions to dynamically analyze the influence of configuration options on the regions’ performance. Our evaluation using 10 real-world configurable systems shows that ConfigCrusher is more efficient at building performance models that are similar to or more accurate than current state-of-the-art black-box and white-box approaches. Overall, this paper showcases the benefits and potential of whitebox performance analyses to outperform black-box approaches and provide additional information for analyzing configurable systems.
Detecting feature interactions is imperative for accurately predicting performance of highly-configurable systems. State-of-the-art performance prediction techniques rely on supervised machine learning for detecting feature interactions, which, in turn, relies on time-consuming performance measurements to obtain training data. By providing information about potentially interacting features, we can reduce the number of required performance measurements and make the overall performance prediction process more time efficient. We expect that information about potentially interacting features can be obtained by analyzing the source code of a highly-configurable system, which is computationally cheaper than performing multiple performance measurements. To this end, we conducted an in-depth qualitative case study on two real-world systems (mbedTLS and SQLite), in which we explored the relation between internal (precisely control-flow) feature interactions, detected through static program analysis, and external (precisely performance) feature interactions, detected by performance-prediction techniques using performance measurements. We found that a relation exists that can potentially be exploited to predict performance interactions.
We developed model-based adaptation, an approach that leverages models of software and its environment to enable automated adaptation. The goal of our approach is to build long-lasting software systems that can effectively adapt to changes in their environment.
Modern cyber-physical systems (e.g., robotics systems) are typically composed of physical and software components, the characteristics of which are likely to change over time. Assumptions about parts of the system made at design time may not hold at run time, especially when a system is deployed for long periods (e.g., over decades). Self-adaptation is designed to find reconfigurations of systems to handle such run-time inconsistencies. Planners can be used to find and enact optimal reconfigurations in such an evolving context. However, for systems that are highly configurable, such planning becomes intractable due to the size of the adaptation space. To overcome this challenge, in this paper we explore an approach that (a) uses machine learning to find Pareto-optimal configurations without needing to explore every configuration, and (b) restricts the search space to such configurations to make planning tractable. We explore this in the context of robot missions that need to consider task timeliness and energy consumption. An independent evaluation shows that our approach results in high-quality adaptation plans in uncertain and adversarial environments.
Since software engineering is not a homogeneous whole, we expect that development practices are differently adopted across domains. However, little is known about how practices are followed in different software domains (e.g., healthcare, banking, and Oil and gas). In this paper, we report the results of an exploratory and inductive research, in which we seek differences and similarities regarding the adoption of several practices across 13 domains. We interviewed 19 developers with experience in multiple domains (i.e., cross-domain developers) from large multinational companies, such as Facebook, Google and Macy's. We also run a Web survey to confirm (or not) the interview results. Our findings show that, in fact, different domains adopt practices in a different fashion. We identified that continuous integration practices are interrupted during important commerce periods (e.g., Black Friday) in the financial domains. We also noticed the company's culture and policies strongly influence the adopted practices, instead of the domain itself. Our study also has important implications for practice. For instance, companies should provide targeted training for their development teams and new interdisciplinary courses in software engineering and other domains, such as healthcare, are highly recommended.
With an increased level of automation provided bypackage managers, which sometimes allow updates to be installedautomatically, malicious package updates are becoming a realthreat in software ecosystems. To address this issue, we proposean approach based on anomaly detection, to identify suspiciousupdates based on security-relevant features that attackers coulduse in an attack. We evaluate our approach in the contextof Node.js/npm ecosystem, to show its feasibility in terms ofreduced review effort and the correct identification of a confirmedmalicious update attack. Although we do not expect it to bea complete solution in isolation, we believe it is an importantsecurity building block for software ecosystems.
Established contributors are the backbone of many free/libre open source software (FLOSS) projects. Previous research has shown that it is critically important to retain contributors and has also revealed motives behind why contributors choose to participate in FLOSS in the first place. However, there has been limited research done on the reasons why established contributors disengage, and factors (on an individual and project level) that predict their disengagement. In this paper, we conduct a mixed-methods empirical study, combining surveys and survival modeling, in order to identify reasons and predictive factors behind established contributor disengagement. We find that different groups of contributors tend to disengage for different reasons, however, overall contributors most commonly cite some kind of transition (e.g., switching jobs or leaving academia). We also find that factors such as the popularity of the projects a contributor works on, whether they have experienced a transition, when they work, and how much they work are all factors that can be used to predict their disengagement from open source.
Fork-based development is popular and easy to use, but makes it difficult to maintain an overview of the whole community when the number of forks increases, which leads to redundant development where multiple developers are solving the same problem in parallel without being aware of each other. Redundant development wastes effort for both maintainers and developers. In this paper, we designed an approach to identify redundant code changes in forks as early as possible by extracting clues indicating similarities between code changes, and building a machine learning model to predict redundancies. We evaluated the effectiveness from both the maintainer's and the developer's perspectives. The result shows that we achieve 57%-83% precision for detecting duplicate code changes from maintainer's perspective, and we could save developers' effort of 1.9-3.0 commits on average. Also, we show that our approach significantly outperforms existing state-of-art.
Maintenance consumes 40% to 80% of software development costs. So, it is essential to write source code that is easy to understand to reduce the costs with maintenance. Improving code understanding is important because developers often mistake the meaning of code, and misjudge the program behavior, which can lead to errors. There are patterns in source code, such as operator precedence, and comma operator, that have been shown to influence code understanding negatively. Despite initial results, these patterns have not been evaluated in a real-world setting, though. Thus, it is not clear whether developers agree that the patterns studied by researchers can cause substantial misunderstandings in real-world practice. To better understand the relevance of misunderstanding patterns, we applied a mixed research method approach, by performing repository mining and a survey with developers, to evaluate misunderstanding patterns in 50 C open-source projects, including Apache, OpenSSL, and Python. Overall, we found more than 109K occurrences of the 12 patterns in practice. Our study shows that according to developers only some patterns considered previously by researchers may cause misunderstandings. Our results complement previous studies by taking the perception of developers into account.
Mutation testing is an effective but time consuming method for gauging the quality of a test suite. It functions by repeatedly making changes, called mutants, to the source code and checking whether the test suite fails (i.e., whether the mutant is killed). Recent work has shown cases in which applying multiple changes, called a higher order mutation, is more difficult to kill than a single change, called a first order mutation. Specifically, a special kind of higher order mutation, called a strongly subsuming higher order mutation (SSHOM), can enable equivalent accuracy in assessing the quality of the test suite with fewer executions of tests. Little is known about these SSHOMs, as they are difficult to find. Our goal in this research is to identify a faster, more reliable method for finding SSHOMs in order to characterize them in the future. We propose an approach based on variational execution to find SSHOMs. Preliminary results indicate that variational execution performs better than the existing genetic algorithm in terms of speed and completeness of results. Out of a set of 33 first order mutations, our variational execution approach finds all 38 SSHOMs in 4.5 seconds, whereas the genetic algorithm only finds 36 of the 38 SSHOMs in 50 seconds.
Contemporary software development is characterized by increased reuse and speed. Open source software forges such as GitHub host millions of repositories of libraries and tools, which developers reuse liberally, creating complex and often fragile networks of interdependencies. Hence, developers must make more decisions at a higher speed, finding which libraries to depend on and which projects to contribute to. This decision making process is supported by the transparency provided by social coding platforms like GitHub, where user profile pages display information on a one's contributions, and repository pages provide information on a project's social standing (e.g., through stars and watchers).
Variational execution offers an avenue of efficiently analyzing configurable systems, but data structures like lists require special consideration. We implement automatic substitution of a more efficient list representation in a variational execution framework and evaluate its performance in micro-benchmarks. The results suggest that the substitution may offer substantial performance improvements to programs involving highly variational lists.
In highly configurable systems, features may interact unexpectedly and produce faulty behavior. Those faults are not easily identified from the analysis of each feature separately, especially when feature specifications are missing. We propose VarXplorer, a dynamic and iterative approach to detect suspicious interactions. It provides information on how features impact the control and data flow of the program. VarXplorer supports developers with a graph that visualizes this information, mainly showing suppress and require relations between features. To evaluate whether VarXplorer helps improve the performance of identifying suspicious interactions, we perform a controlled study with 24 subjects. We find that with our proposed feature-interaction graphs, participants are able to identify suspicious interactions more than 3 times faster compared to the state-of-the-art tool.
In software testing, different testers focus on different aspects of the software such as functionality, performance, design, and other attributes. While many tools and coverage metrics exist to support testers at the code level, not much support is targeted for testers who want to inspect the output of a program such as a dynamic web application. To support this category of testers, we propose a family of output-coverage metrics (similar to statement, branch, and path coverage metrics on code) that measure how much of the possible output has been produced by a test suite and what parts of the output are still uncovered. To do that, we first approximate the output universe using our existing symbolic execution technique. Then, given a set of test cases, we map the produced outputs onto the output universe to identify the covered and uncovered parts and compute output-coverage metrics. In our empirical evaluation on seven real-world PHP web applications, we show that selecting test cases by output coverage is more effective at identifying presentation faults such as HTML validation errors and spelling errors than selecting test cases by traditional code coverage. In addition, to help testers understand output coverage and augment test cases, we also develop a tool called WebTest that displays the output universe in one single web page and allows testers to visually explore covered and uncovered parts of the output.
The advent of variability management and generator technology enables users to derive individual system variants from a configurable code base by selecting desired configuration options. This approach gives rise to the generation of possibly billions of variants, which, however, cannot be efficiently analyzed for bugs and other properties with classic analysis techniques. To address this issue, researchers and practitioners have developed sampling heuristics and, recently, variability-aware analysis techniques. While sampling reduces the analysis effort significantly, the information obtained is necessarily incomplete, and it is unknown whether state-of-the-art sampling techniques scale to billions of variants. Variability-aware analysis techniques process the configurable code base directly, exploiting similarities among individual variants with the goal of reducing analysis effort. However, while being promising, so far, variability-aware analysis techniques have been applied mostly only to small academic examples. To learn about the mutual strengths and weaknesses of variability-aware and sample-based static-analysis techniques, we compared the two by means of seven concrete control-flow and data-flow analyses, applied to five real-world subject systems: BusyBox, OpenSSL, SQLite, the x86 Linux kernel, and uclibc. In particular, we compare the efficiency (analysis execution time) of the static analyses and their effectiveness (potential bugs found). Overall, we found that variability-aware analysis outperforms most sample-based static-analysis techniques with respect to efficiency and effectiveness. For example, checking all variants of OpenSSL with a variability-aware static analysis is faster than checking even only two variants with an analysis that does not exploit similarities among variants.
Variational execution is a novel dynamic analysis technique for exploring highly configurable systems and accurately tracking information flow. It is able to efficiently analyze many configurations by aggressively sharing redundancies of program executions. The idea of variational execution has been demonstrated to be effective in exploring variations in the program, especially when the configuration space grows out of control. Existing implementations of variational execution often require heavy lifting of the runtime interpreter, which is painstaking and error-prone. Furthermore, the performance of this approach is suboptimal. For example, the state-of-the-art variational execution interpreter for Java, VarexJ, slows down executions by 100 to 800~times over a single execution for small to medium size Java programs. Instead of modifying existing JVMs, we propose to transform existing bytecode to make it variational, so it can be executed on an unmodified commodity JVM. Our evaluation shows a dramatic improvement on performance over the state-of-the-art, with a speedup of 2 to 46 times, and high efficiency in sharing computations.
Variational execution is a novel dynamic analysis technique for exploring highly configurable systems and accurately tracking information flow. It is able to efficiently analyze many configurations by aggressively sharing redundancies of program executions. The idea of variational execution has been demonstrated to be effective in exploring variations in the program, especially when the configuration space grows out of control. Existing implementations of variational execution often require heavy lifting of the runtime interpreter, which is painstaking and error-prone. Furthermore, the performance of this approach is suboptimal. For example, the state-of-the-art variational execution interpreter for Java, VarexJ, slows down executions by 100 to 800~times over a single execution for small to medium size Java programs. Instead of modifying existing JVMs, we propose to transform existing bytecode to make it variational, so it can be executed on an unmodified commodity JVM. Our evaluation shows a dramatic improvement on performance over the state-of-the-art, with a speedup of 2 to 46 times, and high efficiency in sharing computations.
Program comprehension is an important, but hard to measure cognitive process. This makes it difficult to provide suitable programming languages, tools, or coding conventions to support developers in their everyday work. Here, we explore whether functional magnetic resonance imaging (fMRI) is feasible for soundly measuring program comprehension. To this end, we observed 17 participants inside an fMRI scanner while they were comprehending source code. The results show a clear, distinct activation of five brain regions, which are related to working memory, attention, and language processing, which all fit well to our understanding of program comprehension. Furthermore, we found reduced activity in the default mode network, indicating the cognitive effort necessary for program comprehension. We also observed that familiarity with Java as underlying programming language reduced cognitive effort during program comprehension. To gain confidence in the results and the method, we replicated the study with 11 new participants and largely confirmed our findings. Our results encourage us and, hopefully, others to use fMRI to observe programmers and, in the long run, answer questions, such as: How should we train programmers? Can we train someone to become an excellent programmer? How effective are new languages and tools for program comprehension?
One of the main challenges of debugging is to understand why the program fails for certain inputs but succeeds for others. This becomes especially difficult if the fault is caused by an interaction of multiple inputs. To debug such interaction faults, it is necessary to understand the individual effect of the input, how these inputs interact and how these interactions cause the fault. The differences between two execution traces can explain why one input behaves differently than the other. We propose to compare execution traces of all input options to derive explanations of the behavior of all options and interactions among them. To make the relevant information stand out, we represent them as variational traces that concisely represents control-flow and data-flow differences among multiple concrete traces. While variational traces can be obtained from brute-force execution of all relevant inputs, we use variational execution to scale the generation of variational traces to the exponential space of possible inputs. We further provide an Eclipse plugin Varviz that enables users to use variational traces for debugging and navigation. In a user study, we show that users of variational traces are more than twice as fast to finish debugging tasks than users of the standard Eclipse debugger. We further show that variational traces can be scaled to programs with many options
Most software systems provide options that allow users to tailor the system in terms of functionality and qualities. The increased flexibility raises challenges for understanding the configuration space and the effects of options and their interactions on performance and other non-functional properties. To identify how options and interactions affect the performance of a system, several sampling and learning strategies have been recently proposed. However, existing approaches usually assume a fixed environment (hardware, workload, version) such that learning has to be repeated when the environment changes. Repeating learning and measurement for each environment is expensive and often practically infeasible. Instead, we pursue a strategy that transfers knowledge across environments, but sidesteps heavyweight and expensive transfer-learning strategies. Based on empirical insights about common relationships regarding (i) influential options, (ii) their interactions, and (iii) their performance distributions, our approach L2S (Learning to Sample) selects better samples in the target environment based on information from the source environment. It progressively shrinks the configuration space and adaptively concentrates on interesting regions of the configuration space. With both synthetic benchmarks and several real systems, we demonstrate that L2S outperforms state of the art performance learning and transfer-learning approaches in terms of measurement effort and learning accuracy.
Software metrics and thresholds provide means to quantify several quality attributes of software systems. Indeed, they have been used in a wide variety of methods and tools for detecting different sorts of technical debts, such as code smells. Unfortunately, these methods and tools do not take into account characteristics of software domains, as the intrinsic complexity of geo-localization and scientific software systems or the simple protocols employed by messaging applications. Instead, they rely on generic thresholds that are derived from heterogeneous systems. Although derivation of reliable thresholds has long been a concern, we still lack empirical evidence about threshold variation across distinct software domains. To tackle this limitation, this paper investigates whether and how thresholds vary across domains by presenting a large-scale study on 3,107 software systems from 15 domains. We analyzed the derivation and distribution of thresholds based on 8 well-known source code metrics. As a result, we observed that software domain and size are relevant factors to be considered when building benchmarks for threshold derivation. Moreover, we also observed that domain-specific metric thresholds are more appropriated than generic ones for code smell detection.
Continuous Integration (CI) services, which can automatically build, test, and deploy software projects, are an invaluable asset in distributed teams, increasing productivity and helping to maintain code quality. Prior work has shown that CI pipelines can be sophisticated, and choosing and configuring a CI system involves tradeoffs. As CI technology matures, new CI tool offerings arise to meet the distinct wants and needs of software teams, as they negotiate a path through these tradeoffs, depending on their context. In this paper, we begin to uncover these nuances, and tell the story of open-source projects falling out of love with Travis, the earliest and most popular cloud-based CI system. Using logistic regression, we quantify the effects that open-source community factors and project technical factors have on the rate of Travis abandonment. We find that increased build complexity reduces the chances of abandonment, that larger projects abandon at higher rates, and that a project’s dominant language has significant but varying effects. Finally, we find the surprising result that metrics of configuration attempts and knowledge dispersion in the project do not affect the rate of abandonment.
Previous research shows that developers spend most of their time understanding code. Despite the importance of code understandability for maintenance-related activities, an objective measure of it remains an elusive goal. Recently, Scalabrino et al. reported on an experiment with 46 Java developers designed to evaluate metrics for code understandability. The authors collected and analyzed data on more than a hundred features describing the code snippets, the developers’ experience, and the developers’ performance on a quiz designed to assess understanding. They concluded that none of the metrics considered can individually capture understandability. Expecting that understandability is better captured by a combination of multiple features, we present a reanalysis of the data from the Scalabrino et al. study, in which we use different statistical modeling techniques. Our models suggest that some computed features of code, such as those arising from syntactic structure and documentation, have a small but significant correlation with understandability. Further, we construct a binary classifier of understandability based on various interpretable code features, which has a small amount of discriminating power. Our encouraging results, based on a small data set, suggest that a useful metric of understandability could feasibly be created, but more data is needed.
Modeling the performance of a highly-configurable software system requires capturing the influences of its configuration options and their interactions on the system's performance. Performance-influence models quantify these influences, explaining this way the performance behavior of a configurable system as a whole. To be useful in practice, a performance-influence model should have a low prediction error, small model size, and reasonable computation time. Because of the inherent tradeoffs among these properties, optimizing for one property may negatively influence the others. It is unclear, though, to what extent these tradeoffs manifest themselves in practice, that is, whether a large configuration space can be described accurately only with large models and significant resource investment. By means of 10 real-world highly-configurable systems from different domains, we have systematically studied the tradeoffs between the three properties. Surprisingly, we found that the tradeoffs between prediction error and model size and between prediction error and computation time are rather marginal. That is, we can learn accurate and small models in reasonable time, so that one performance-influence model can fit different use cases, such as program comprehension and performance prediction. We further investigated the reasons for why the tradeoffs are marginal. We found that interactions among four or more configuration options have only a minor influence on the prediction error and that ignoring them when learning a performance-influence model can save a substantial amount of computation time, while keeping the model small without considerably increasing the prediction error. This is an important insight for new sampling and learning techniques as they can focus on specific regions of the configuration space and find a sweet spot between accuracy and effort. We further analyzed the causes for the configuration options and their interactions having the observed influences on the systems' performance. We were able to identify several patterns across subject systems, such as dominant configuration options and data pipelines, that explain the influences of highly influential configuration options and interactions, and give further insights into the domain of highly-configurable systems.
Detecting feature interactions is imperative for accurately predicting performance of highly-configurable systems. State-of-the-art performance prediction techniques rely on supervised machine learning for detecting feature interactions, which, in turn, relies on time consuming performance measurements to obtain training data. By providing information about potentially interacting features, we can reduce the number of required performance measurements and make the overall performance prediction process more time efficient. We expect that the information about potentially interacting features can be obtained by statically analyzing the source code of a highly-configurable system, which is computationally cheaper than performing multiple performance measurements. To this end, we conducted a qualitative case study in which we explored the relation between control-flow feature interactions (detected through static program analysis) and performance feature interactions (detected by performance prediction techniques using performance measurements). We found that a relation exists, which can potentially be exploited to predict performance interactions.
In fast-paced, reuse-heavy software development, the transparency provided by social coding platforms like GitHub is essential to decision making. Developers infer the quality of projects using visible cues, known as signals, collected from personal profile and repository pages. We report on a large-scale, mixed-methods empirical study of npm packages that explores the emerging phenomenon of repository badges, with which maintainers signal underlying qualities about the project to contributors and users. We investigate which qualities maintainers intend to signal and how well badges correlate with those qualities. After surveying developers, mining 294,941 repositories, and applying statistical modeling and time series analysis techniques, we find that non-trivial badges, which display the build status, test coverage, and up-to-dateness of dependencies, are mostly reliable signals, correlating with more tests, better pull requests, and fresher dependencies. Displaying such badges correlates with best practices, but the effects do not always persist.
Fork-based development has been widely used both in open source community and industry, because it gives developers flexibility to modify their own fork without affecting others. Unfortunately, this mechanism has downsides; when the number of forks becomes large, it is difficult for developers to get or maintain an overview of activities in the forks. Current tools provide little help. We introduced INFOX, an approach to automatically identifies not-merged features in forks and generates an overview of active forks in a project. The approach clusters cohesive code fragments using code and network analysis techniques and uses information-retrieval techniques to label clusters with keywords. The clustering is effective, with 90% accuracy on a set of known features. In addition, a human-subject evaluation shows that INFOX can provide actionable insight for developers of forks.
Features in highly configurable systems can interact in undesiredways which may result in faults. However, most interactions arenot easily detectable as specifications of feature interactions areusually missing. In this paper, we aim to detect interactions and tohelp create feature-interaction specifications. We use variational ex-ecution to observe internal interactions on control and data flow ofhighly configurable systems. The number of potential interactionscan be large and hard to understand, especially as many interac-tions are benign. To help developers understand these interactions,we propose feature-interaction graphs as a concise representationof all pairwise interactions. We provide two analyses that reportsuspicious interactions, namely suppress and require interactionsFinally, we propose a specification language that enables develop-ers to define different kinds of allowed and forbidden interactions.Our tool, VarXplorer, provides a visualization of feature-interactiongraphs and supports the creation of feature interaction specifi-cations. VarXplorer also provides an iterative analysis of featureinteractions allowing developers to focus on suspicious cases.
While the creation of new branches and forks is easy and fast with modern version-control systems, merging is often time-consuming. Especially when dealing with many branches or forks, a prediction of merge costs based on lightweight indicators would be desirable to help developers recognize problematic merging scenarios before potential conflicts become too severe in the evolution of a complex software project. We analyze the predictive power of several indicators, such as the number, size or scattering degree of commits in each branch, derived either from the version-control system or directly from the source code. Based on a survey of 41 developers, we inferred 7 potential indicators to predict the number of merge conflicts. We tested corresponding hypotheses by studying 163 open-source projects, including 21,488 merge scenarios and comprising 49,449,773 lines of code. A notable (negative) result is that none of the 7 indicators suggested by the participants of the developer survey has a predictive power concerning the frequency of merge conflicts. We discuss this and other findings as well as perspectives thereof.
Highly configurable software systems are pervasive, although configuration options and their interactions raise complexity of the program and increase maintenance effort. Especially load-time configuration options, such as parameters from command-line options or configuration files, are used with standard programming constructs such as variables and if-statements intermixed with the program’s implementation; manually tracking configuration options from the time they are loaded to the point where they may influence control-flow decisions is tedious and error prone. We design and implement LOTRACK , an extended static taint analysis to track configuration options automatically. LOTRACK derives a configuration map that explains for each code fragment under which configurations it may be executed. An evaluation on Android apps and Java applications from different domains shows that LOTRACK yields high accuracy with reasonable performance. We use LOTRACK to empirically characterize how much of the implementation of Android apps depends on the platform’s configuration options or interactions of these options.
Build systems are crucial for software system development, however there is a lack of tool support to help with their high maintenance overhead. GNU Autotools are widely used in the open source community, but users face various challenges from its hard to comprehend nature and staging of multiple code generation steps, often leading to low quality and error-prone build code. In this paper, we present a platform AutoHaven to provide a foundation for developers to create analysis tools to help them understand, maintain, and migrate their GNU Autotools build systems. Internally it uses approximate parsing and symbolic analysis of the build logic. We illustrate the use of the platform with two tools: ACSense helps developers to better understand their build systems and ACSniff detects build smells to improve build code quality. Our evaluation shows that AutoHaven can support most GNU Autotools build systems and can detect build smells in the wild.
Modern software systems provide many configuration options which not only influence their functionality but also non-functional properties such as response-time. To understand and predict the effect of configuration options, several sampling, analysis, and learning strategies have been proposed, albeit often with significant cost to cover the highly dimensional configuration space. Recently, transfer learning has been applied to reduce the effort of constructing performance models by transferring knowledge about performance behavior across environments. While this line of research is promising to learn more accurate models at lower cost, it is unclear until now why and when transfer learning works for performance modeling and analysis in highly configurable systems. To shed light on when it is beneficial to apply transfer learning, we conducted an empirical study on four popular software systems, varying software configurations and environmental conditions, such as hardware, workload, and software versions, to identify the key knowledge pieces that can be exploited for transfer learning. Our results show that in small environmental changes (e.g., homogeneous workload change), by applying a linear transformation to the performance model of the source environment, we can understand the performance behavior of the target environment, while for severe environmental changes (e.g., drastic workload change) we can transfer only knowledge that makes sampling in the target environment more efficient, e.g., by reducing the dimensionality of the configuration space.
Diffing and merging of source-code artifacts is an essential task when integrating changes in software versions. While state-of-the-art line-based tools (e.g., git merge) are fast and independent of the programming language used, they have only a low precision. Recently, it has been shown that the precision of merging can be substantially improved by using a language-aware, structured approach that works on abstract syntax trees. But, precise structured merging is NP hard, especially, when considering the notoriously difficult scenarios of renamings and shifted code. To address these scenarios without compromising scalability, we propose a syntax-aware, heuristic optimization for structured merging that employs a lookahead mechanism during tree matching. The key idea is that renamings and shifted code are not arbitrarily distributed but their occurrence follows patterns, which we address with a syntax-specific lookahead. Our experiments with 48 real-world open-source projects (4878 merge scenarios with over 400 million lines of code) demonstrate that we can significantly improve matching precision in 28 percent while maintaining performance.
Most modern software programs cannot be understood in their entirety by a single programmer. Instead, programmers must rely on a set of cognitive processes that aid in seeking, filtering, and shaping relevant information for a given programming task. Several theories have been proposed to explain these processes, such as “beacons,” for locating relevant code, and “plans,” for encoding cognitive models. However, these theories are decades old and lack validation with modern cognitive-neuroscience methods. In this paper, we report on a study using functional magnetic resonance imaging (fMRI) with 11 participants who performed program comprehension tasks. We manipulated experimental conditions related to beacons and layout to isolate specific cognitive processes related to bottom-up comprehension and comprehension based on semantic cues. We found evidence of semantic chunking during bottom-up comprehension and lower activation of brain areas during comprehension based on semantic cues, confirming that beacons ease comprehension.
Differential testing to solve the oracle problem has been applied in many scenarios where multiple supposedly equivalent implementations exist, such as multiple implementations of a C compiler. If the multiple systems disagree on the output for a given test input, we have likely discovered a bug without every having to specify what the expected output is. Research on variational analyses (or variability-aware or family-based analyses) can benefit from similar ideas. The goal of most variational analyses is to perform an analysis, such as type checking or model checking, over a large number of configurations much faster than an existing traditional analysis could by analyzing each configuration separately. Variational analyses are very suitable for differential testing, since the existence nonvariational analysis can provide the oracle for test cases that would otherwise be tedious or difficult to write. In this experience paper, I report how differential testing has helped in developing KConfigReader, a tool for translating the Linux kernel's kconfig model into a propositional formula. Differential testing allows us to quickly build a large test base and incorporate external tests that avoided many regressions during development and made KConfigReader likely the most precise kconfig extraction tool available.
Transparent environments and social-coding platforms as GitHub help developers to stay abreast of changes during the development and maintenance phase of a project. Especially, notification feeds can help developers to learn about relevant changes in other projects. Unfortunately, transparent environments can quickly overwhelm developers with too many notifications, such that they loose the important ones in a sea of noise. Complementing existing prioritization and filtering strategies based on binary compatibility and code ownership, we develop an anomaly-detection mechanism to identify unusual commits in a repository, that stand out with respect to other changes in the same repository or by the same developer. Among others, we detect exceptionally large commits, commits at unusual times, and commits touching rarely changed file types given the characteristics of a particular repository or developer. We automatically flag unusual commits on GitHub through a browser plugin. In an interactive survey with 173 active GitHub users, rating commits in a project of their interest, we found that, though our unusual score is only a weak predictor of whether developers want to be notified about a commit, information about unusual characteristics of a commit change how developers regard commits. Our anomaly-detection mechanism is a building block for scaling transparent environments.
Modern software systems are now being built to be used in dynamic environments utilizing configuration capabilities to adapt to changes and external uncertainties. In a self-adaptation context, we are often interested in reasoning about the performance of the systems under different configurations. Usually, we learn a black-box model based on real measurements to predict the performance of the system given a specific configuration. However, as modern systems become more complex, there are many configuration parameters that may interact and, therefore, we end up learning an exponentially large configuration space. Naturally, this does not scale when relying on real measurements in the actual changing environment. We propose a different solution: Instead of taking the measurements from the real system, we learn the model using samples from other sources, such as simulators that approximate performance of the real system at low cost. We define a cost model that transform the traditional view of model learning into a multi-objective problem that not only takes into account model accuracy but also measurements effort as well. We evaluate our cost-aware transfer learning solution using real world configurable software including (i) a robotic system, (ii) 3 different stream processing applications, and (iii) a NoSQL database system. The experimental results demonstrate that our approach can achieve (a) high prediction accuracy as well as (b) high model reliability with only few samples from the target environment.
The C preprocessor is used in many C projects to support variability and portability. However, researchers and practitioners criticize the C preprocessor because of its negative effect on code understanding and maintainability and its error proneness. More importantly, the use of the preprocessor hinders the development of tool support that is standard in other languages, such as automated refactoring. Developers aggravate these problems when using the preprocessor in undisciplined ways (e.g., conditional blocks that do not align with the syntactic structure of the code). In this article, we proposed a catalogue of refactorings and we evaluated the number of application possibilities of the refactorings in practice, the opinion of developers about the usefulness of the refactorings, and whether the refactorings preserve behavior. Overall, we found 5670 application possibilities for the refactorings in 63 real-world C projects. In addition, we performed an online survey among 246 developers, and we submitted 28 patches to convert undisciplined directives into disciplined ones. According to our results, 63% of developers prefer to use the refactored (i.e., disciplined) version of the code instead of the original code with undisciplined preprocessor usage. To verify that the refactorings are indeed behavior preserving, we applied them to more than 36 thousand programs generated automatically using a model of a subset of the C language, running the same test cases in the original and refactored programs. Furthermore, we applied the refactorings to three real-world projects: BusyBox, OpenSSL, and SQLite. This way, we detected and fixed a few behavioral changes, 62% caused by unspecified behavior in the C programming language.
Many applications require not only representing variability in software and data, but also computing with it. To do so efficiently requires variational data structures that make the variability explicit in the underlying data and the operations used to manipulate it. Variational data structures have been developed ad hoc for many applications, but there is little general understanding of how to design them or what tradeoffs exist among them. In this paper, we take a first step towards a more systematic exploration and analysis of a variational data structure. We want to know how different design decisions affect the performance and scalability of a variational data structure, and what properties of the underlying data and operation sequences need to be considered. Specifically, we study several alternative designs of a variational stack, a data structure that supports efficiently representing and computing with multiple variants of a plain stack, and that is a common building block in many algorithms. The different variational stacks are presented as a small product line organized by three design decisions. We analyze how these design decisions affect the performance of a variational stack with different usage profiles. Finally, we evaluate how these design decisions affect the performance of the variational stack in a real-world scenario: in the interpreter VarexJ when executing real software containing variability.
GNU Autotools is a widely used build tool in the open source community. As open source projects grow more complex, maintaining their build systems becomes more challenges, due to the lack of tool support. Here we propose a platform to mitigate this problem, and aid developers by providing a platform to build support tools for GNU Autotools build systems. The platform would provide an abstract approximation for the build system to be used in different analysis techniques.
Quality assurance for highly-configurable systems is challenging due to the exponentially growing configuration space. Interactions among multiple options can lead to surprising behaviors, bugs, and security vulnerabilities. Analyzing all configurations systematically might be possible though if most options do not interact or interactions follow specific patterns that can be exploited by analysis tools. To better understand interactions in practice, we analyze program traces to identify where interactions occur on control flow and data. To this end, we developed a dynamic analysis for Java based on variability-aware execution and monitor executions of multiple mid-sized real-world programs. We find that the essential configuration complexity of these programs is indeed much lower than the combinatorial explosion of the configuration space indicates, but also that the interaction characteristics that allow scalable and complete analyses are more nuanced than what is exploited by existing state-of-the-art quality assurance strategies.
There has been a considerable growth in research and development of service robots in recent years. For deployment in diverse environment conditions for a wide range of service tasks, novel features and algorithms are developed and existing ones undergo change. However, developing and evolving the robot software requires making and revising many design decisions that can affect the quality of performance of the robots and that are non-trivial to reason about intuitively because of interactions among them. We propose to use sensitivity analysis to build models of the quality of performance to the different design decisions to ease design and evolution. Moreover, we envision these models to be used for run-time adaptation in response to changing goals or environment conditions. Constructing these models is challenging due to the exponential size of the decision space. We build on previous work on performance influence models of highly-configurable software systems using a machine-learning-based approach to construct influence models for robotic software.
Change introduces conflict into software ecosystems: breaking changes may ripple through the ecosystem and trigger rework for users of a package, but often developers can invest additional effort or accept opportunity costs to alleviate or delay downstream costs. We performed a multiple case study of three software ecosystems with different tooling and philosophies toward change, Eclipse, R/CRAN, and Node.js/npm, to understand how developers make decisions about change and change-related costs and what practices, tooling, and policies are used. We found that all three ecosystems differ substantially in their practices and expectations toward change and that those differences can be explained largely by different community values in each ecosystem. Our results illustrate that there is a large design space in how to build an ecosystem, its policies and its supporting infrastructure; and there is value in making community values and accepted tradeoffs explicit and transparent in order to resolve conflicts and negotiate change-related costs.
Preprocessors support the diversification of software products with #ifdefs, but also require additional effort from developers to maintain and understand variable code. We conjecture that #ifdefs cause developers to produce more vulnerable code because they are required to reason about multiple features simultaneously and maintain complex mental models of dependencies of configurable code. We extracted a variational call graph across all configurations of the Linux kernel, and used configuration complexity metrics to compare vulnerable and non-vulnerable functions considering their vulnerability history. Our goal was to learn about whether we can observe a measurable influence of configuration complexity on the occurrence of vulnerabilities. Our results suggest, among others, that vulnerable functions have higher variability than non-vulnerable ones and are also constrained by fewer configuration options. This suggests that developers are inclined to notice functions appear in frequently-compiled product variants. We aim to raise developers’ awareness to address variability more systematically, since configuration complexity is an important, but often ignored aspect of software product lines.
The Android platform is designed to support mutually untrusted third-party apps, which run as isolated processes but may interact via platform-controlled mechanisms, called Intents. Interactions among third-party apps are intended and can contribute to a rich user experience, for example, the ability to share pictures from one app with another. The Android platform presents an interesting point in a design space of module systems that is biased toward isolation, extensibility, and untrusted contributions. The Intent mechanism essentially provides message channels among modules, in which the set of message types is extensible. However, the module system has design limitations including the lack of consistent mechanisms to document message types, very limited checking that a message conforms to its specification, the inability to explicitly declare dependencies on other modules, and the lack of checks for backward compatibility as message types evolve over time. In order to understand the degree to which these design limitations result in real issues, we studied a broad corpus of apps and cross-validated our results against app documentation and Android support forums. Our findings suggest that design limitations do indeed cause development problems. Based on our results, we outline further research questions and propose possible mitigation strategies..
Almost every software system provides configuration options to tailor the system to the target platform and application scenario. Often, this configurability renders the analysis of every individual system configuration infeasible. To address this problem, researchers proposed a diverse set of sampling algorithms. We present a comparative study of 10 state-of-the-art sampling algorithms regarding their fault-detection capability and size of sample sets. The former is important to improve software quality and the latter to reduce the time of analysis. In a nutshell, we found that the sampling algorithms with larger sample sets detected higher numbers of faults. Furthermore, we observed that the limiting assumptions made in previous work influence the number of detected faults, the size of sample sets, and the ranking of algorithms. Finally, we identified a number of technical challenges when trying to avoid the limiting assumptions, which question the practicality of certain sampling algorithms.
Today’s social coding tools foreshadow a transformation of the software industry, as it increasingly relies on open libraries, frameworks, and code fragments. Our vision calls for new “intelligently transparent” services that support rapid development of innovative products while managing risk and receiving early warnings of looming failures. Intelligent transparency is enabled by an infrastructure that applies analytics to data from all phases of the lifecycle of open source projects, from development to deployment, bringing stakeholders the information they need when they need it.
Dependencies among software projects and libraries are an indicator of the often implicit collaboration among many developers in software ecosystems. Negotiating change can be tricky: changes to one module may cause ripple effects to many other modules that depend on it, yet insisting on only backward-compatible changes may incur significant opportunity cost and stifle change. We argue that awareness mechanisms based on various notions of stability can enable developers to make decisions that are independent yet wise and provide stewardship rather than disruption to the ecosystem. In ongoing interviews with developers in two software ecosystems (CRAN and Node.js), we are finding that developers in fact struggle with change, that they often use adhoc mechanisms to negotiate change, and that existing awareness mechanisms like Github notification feeds are rarely used due to information overload. We study the state of the art and current information needs and outline a vision toward a change-based awareness system.
Smart home automation and IoT promise to bring many advantages but they also expose their users to certain security and privacy vulnerabilities. For example, leaking the information about the absence of a person from home or the medicine somebody is taking may have serious security and privacy consequences for home users and potential legal implications for providers of home automation and IoT platforms. We envision that a new ecosystem within an existing smartphone ecosystem will be a suitable platform for distribution of apps for smart home and IoT devices. Android is increasingly becoming a popular platform for smart home and IoT devices and applications. Built-in security mechanisms in ecosystems such as Android have limitations that can be exploited by malicious apps to leak users’ sensitive data to unintended recipients. For instance, Android enforces that an app requires the Internet permission in order to access a web server but it does not control which servers the app talks to or what data it shares with other apps. Therefore, sub-ecosystems that enforce additional fine-grained custom policies on top of existing policies of the smartphone ecosystems are necessary for smart home or IoT platforms. To this end, we have built a tool that enforces additional policies on inter-app interactions and permissions of Android apps. We have done preliminary testing of our tool on three proprietary apps developed by a future provider of a home automation platform. Our initial evaluation demonstrates that it is possible to develop mechanisms that allow definition and enforcement of custom security policies appropriate for ecosystems of the like smart home automation and IoT.
In collaborative software development, when two or more developers incorporate their changes, a merge conflict may arise if the changes are incompatible. Previous research has shown that such conflicts are common and occur as textual conflicts or build/test failure, i.e., semantic conflicts. When a merge conflict occurs for a large number of parallel changes, it is desirable to identify the actual (minimum) set of changes that directly results in the conflict. Pinpointing the specific conflicting changes directly leading to test failure facilitates quick accountability and correction from developers. For semantic conflicts, to identify such subset of the changes is challenging. A naive approach trying all possible subsets would not scale due to the exponential number of possible combinations. We propose Semex, a novel approach to detect semantic conflicts using variability-aware execution. In the first step, we encode all parallel changes into a single program with variability in which we use symbolic variables to represent whether a given change is applied to the original program. In the second step, we run the test cases via variability-aware execution. Variability-aware execution explores all possible executions of the combined program with regard to all possible values of the symbolic variables representing all changes, and returns a propositional formula over the set of variables repre- senting the condition in which a test case fails. Due to our encoding algorithm, such a set corresponds to the minimum set of changes that are responsible for the conflict. In our preliminary experimental study on seven PHP applications with a total of 50 test cases and 19 semantic conflicts, Semex correctly detected all 19 conflicts.
Almost every complex software system today is configurable. While configurability has many benefits, it challenges performance prediction, optimization, and debugging. Often, the influences of the individual configurations options on performance is unknown. Worse, configuration options may interact, giving rise to a configuration space of possibly exponential size. Addressing this challenge, we propose an approach that derives a performance-influence model for a given configurable system, describing all relevant influences of configuration options and their interactions. Such a model shall be useful for automatic performance prediction and optimization, on the one hand, and performance debugging for developers, on the other hand. Our approach combines machine-learning and sampling technique in a novel way. Our approach improves over standard techniques in that it (1) represents influences of options and their interactions explicitly (which eases debugging), (2) smoothly integrates binary and numeric configuration options for the first time, (3) incorporates domain knowledge, if available (which eases learning and increases accuracy), (4) considers complex constraints among options, and (5) systematically reduces the solution space to a tractable size. A series of experiments demonstrates the feasibility of our approach in terms of the accuracy of the models learned as well as the accuracy of the performances predictions one can make with them. Using our approach, we were able to identify a number of real performance bugs and other problems in real-world systems.
During software maintenance, program slicing is a useful technique to assist developers in understanding the impact of their changes. While different program-slicing techniques have been proposed for traditional software systems, program slicing for dynamic web applications is challenging since the client-side code is generated from the server-side code and data entities are referenced across different languages and are often embedded in string literals in the server-side program. To address those challenges, we introduce WebSlice, an approach to compute program slices across different languages for web applications. We first identify data-flow dependencies among data entities for PHP code based on symbolic execution. We also compute SQL queries and a conditional DOM that represents client-code variations and construct the data flows for embedded languages: SQL, HTML, and JavaScript. Next, we connect the data flows across different languages and those across PHP pages. Finally, we compute a program slice for any given entity based on the established data flows. Running WebSlice on five real-world PHP systems, we found that out of 40,670 program slices, 10 % cross languages, 38 % cross files, and 13 % cross string fragments, demonstrating the potential benefit of tool support for cross-language program slicing in web applications.
Highly configurable systems allow users to tailor software to specific needs. Valid combinations of configuration options are often restricted by intricate constraints. Describing options and constraints in a variability model allows reasoning about the supported configurations. To automate creating and verifying such models, we need to identify the origin of such constraints. We propose a static analysis approach, based on two rules, to extract configuration constraints from code. We apply it on four highly configurable systems to evaluate the accuracy of our approach and to determine which constraints are recoverable from the code. We find that our approach is highly accurate (93 % and 77 % respectively) and that we can recover 28 % of existing constraints. We complement our approach with a qualitative study to identify constraint sources, triangulating results from our automatic extraction, manual inspections, and interviews with 27 developers. We find that, apart from low-level implementation dependencies, configuration constraints enforce correct runtime behavior, improve users’ configuration experience, and prevent corner cases. While the majority of constraints is extractable from code, our results indicate that creating a complete model requires further substantial domain knowledge and testing. Our results aim at supporting researchers and practitioners working on variability model engineering, evolution, and verification techniques.
The C preprocessor has received strong criticism in academia, among others regarding separation of concerns, error proneness, and code obfuscation, but is widely used in practice. Many (mostly academic) alternatives to the preprocessor exist, but have not been adopted in practice. Since developers continue to use the preprocessor despite all criticism and research, we ask how practitioners perceive the C preprocessor. We performed interviews with 40 developers, used grounded theory to analyze the data, and cross-validated the results with data from a survey among 202 developers, repository mining, and results from previous studies. In particular, we investigated four research questions related to why the preprocessor is still widely used in practice, common problems, alternatives, and the impact of undisciplined annotations. Our study shows that developers are aware of the criticism the C preprocessor receives, but use it nonetheless, mainly for portability and variability. They indicate that they regularly face preprocessor-related problems and preprocessor-related bugs. The majority of our interviewees do not see any current C-native technologies that can entirely replace the C preprocessor. However, developers tend to mitigate problems with guidelines, but those guidelines are not enforced consistently. We report the key insights gained from our study and discuss implications for practitioners and researchers on how to better use the C preprocessor to minimize its negative impact.
Build systems contain a lot of configuration knowledge about a software system, such as under which conditions specific files are compiled. Extracting such configuration knowledge is important for many tools analyzing highly-configurable systems, but very challenging due to the complex nature of build systems. We design an approach, based on SYMake, that symbolically evaluates Makefiles and extracts configuration knowledge in terms of file presence conditions and conditional parameters. We implement an initial prototype and demonstrate feasibility on small examples.
In software development, IDE services such as syntax highlighting, code completion, and “jump to declara- tion” are used to assist developers in programming tasks. In dynamic web applications, however, since the client-side code is dynamically generated from the server-side code and is embedded in the server-side program as string literals, providing IDE services for such embedded code is challenging. In this work, we introduce Varis, a tool that provides editor services on the client-side code of a PHP-based web application, while it is still embedded within server-side code. Technically, we first perform symbolic execution on a PHP program to approximate all possible variations of the generated client-side code and subsequently parse this client code into a VarDOM that compactly represents all its variations. Finally, using the VarDOM, we implement various types of IDE services for embedded client code including syntax highlighting, code completion, and “jump to declaration”.
Almost every sufficiently complex software system today is configurable. Conditional compilation is a simple variability-implementation mechanism that is widely used in open-source projects and industry. Especially, the C preprocessor (cpp) is very popular in practice, but it is also gaining (again) interest in academia. Although there have been several attempts to understand and improve cpp, there is a lack of understanding of how it is used in open-source and industrial systems and whether different usage patterns have emerged. The background is that much research on configurable systems and product lines concentrates on open-source systems, simply because they are available for study in the first place. This leads to the potentially problematic situation that it is unclear whether the results obtained from these studies are transferable to industrial systems. We aim at lowering this gap by comparing the use of cpp in open-source projects and industry—especially from the embedded-systems domain—, based on a substantial set of subject systems and well-known variability metrics, including size, scattering, and tangling metrics. A key result of our empirical study is that, regarding almost all aspects we studied, the analyzed open-source systems and the considered embedded systems from industry have comparable distributions regarding most metrics, including systems that have been developed in industry and made open source at some point. So, our study indicates that, regarding cpp as variability-implementation mechanism, insights, methods, and tools developed based on studies of open- source systems are transferable to industrial systems—at least, with respect to the metrics we considered.
Highly-configurable software systems are pervasive, although configuration options and their interactions raise complexity of the program and increase maintenance effort. Especially load-time configuration options, such as parameters from command-line options or configuration files, are used with standard programming constructs such as variables and if statements intermixed with the program’s implementation; manually tracking configuration options from the time they are loaded to the point where they may influence control-flow decisions is tedious and error prone. We design and implement Lotrack, an extended static taint analysis to automatically track configuration options. Lotrack derives a configuration map that explains for each code fragment under which configurations it may be executed. An evaluation on Android applications shows that Lotrack yields high accuracy with reasonable performance. We use Lotrack to empirically characterize how much of the implementation of Android apps depends on the platform’s configuration options or interactions of these options.
When developing and maintaining a software system, programmers often rely on IDEs to provide editor services such as syntax highlighting, auto-completion, and “jump to declaration”. In dynamic web applications, such tool support is currently limited to either the server-side code or to hand-written or generated client-side code. Our goal is to build a call graph for providing editor services on client-side code while it is still embedded as string literals within server-side code. First, we symbolically execute the server-side code to identify all possible client-side code variations. Subsequently, we parse the generated client-side code with all its variations into a VarDOM that compactly represents all DOM variations for further analysis. Based on VarDOM, we build conditional call graphs for embedded HTML, CSS, and JS. Our empirical evaluation on real-world web applications show that our analysis achieves 100 % precision in identifying call-graph edges. 62 % of the edges cross PHP strings, and 17 % of them cross files—in both situations, navigation without tool support is tedious and error prone.
Variation is everywhere, but in the construction and analysis of customizable software it is paramount. In this context, there arises a need for variational data structures for efficiently representing and computing with related variants of an underlying data type. So far, variational data structures have been explored and developed ad hoc. This paper is a first attempt and a call to action for systematic and foundational research in this area. Research on variational data structures will benefit not only customizable software, but the many other application domains that must cope with variability. In this paper, we show how support for variation can be understood as a general and orthogonal property of data types, data structures, and algorithms. We begin a systematic exploration of basic variational data structures, exploring the tradeoffs between different implementations. Finally, we retrospectively analyze the design decisions in our own previous work where we have independently encountered problems requiring variational data structures.
We performed an empirical study to explore how closely well-known, open source C programs follow the safe C standards for integer behavior, with the goal of understanding how difficult it is to migrate legacy code to these stricter standards. We performed an automated analysis on fifty-two releases of seven C programs (6 million lines of preprocessed C code), as well as releases of Busybox and Linux (nearly one billion lines of partially-preprocessed C code). We found that integer issues, that are allowed by the C standard but not by the safer C standards, are ubiquitous—one out of four integers were inconsistently declared, and one out of eight integers were inconsistently used. Integer issues did not improve over time as the programs evolved. Also, detecting the issues is complicated by a large number of integers whose types vary under different preprocessor configurations. Most of these issues are benign, but the chance of finding fatal errors and exploitable vulnerabilities among these many issues remains significant. A preprocessor-aware, tool-assisted approach may be the most viable way to migrate legacy C code to comply with the standards for secure programming.
Software-product-line engineering has gained considerable momentum in recent years, both in industry and in academia. A software product line is a set of software products that share a common set of features. Software product lines challenge traditional analysis techniques, such as type checking, model checking, and theorem proving, in their quest of ensuring correctness and reliability of software. Simply creating and analyzing all products of a product line is usually not feasible, due to the potentially exponential number of valid feature combinations. Recently, researchers began to develop analysis techniques that take the distinguishing properties of software product lines into account, for example, by checking feature-related code in isolation or by exploiting variability information during analysis. The emerging field of product-line analyses is both broad and diverse, such that it is difficult for researchers and practitioners to understand their similarities and differences. We propose a classification of product-line analyses to enable systematic research and application. Based on our insights with classifying and comparing a corpus of 76 articles, we infer a research agenda to guide future research on product-line analyses.
Program comprehension is an important cognitive process that inherently eludes direct measurement. Thus, researchers are struggling with providing optimal programming languages, tools, or coding conventions to support developers in their everyday work. With our approach, we explore whether functional magnetic resonance imaging (fMRI), which is well established in cognitive neuroscience, is feasible to directly measure program comprehension. To this end, we observed 17 participants inside an fMRI scanner while comprehending short source-code snippets, which we contrasted with locating syntax errors. We found a clear, distinct activation pattern of five brain regions, which are related to working memory, attention, and language processing—all processes that fit well to our understanding of program comprehension. Based on the results, we propose a model of program comprehension. Our results encourage us to use fMRI in future studies to measure program comprehension and, in the long run, answer questions, such as: Can we predict whether someone will be an excellent programmer? How effective are new languages and tools for program understanding? How do we train someone to become an excellent programmer?
In plugin-based systems, plugin conflicts may occur when two or more plugins interfere with one another, changing their expected behaviors. It is highly challenging to detect plugin conflicts due to the exponential explosion of the combinations of plugins (i.e., configurations). In this paper, we address the challenge of executing a test case over many configurations. Leveraging the fact that many executions of a test are similar, our variability-aware execution runs common code once. Only when encountering values that are different depending on specific configurations will the execution split to run for each of them. To evaluate the scalability of variability-aware execution on a large real-world setting, we built a prototype PHP interpreter called Varex and ran it on the popular WordPress blogging Web application. The results show that while plugin interactions exist, there is a significant amount of sharing that allows variability-aware execution to scale to 2^50 configurations within seven minutes of running time. During our study, with Varex, we were able to detect two plugin conflicts: one was recently reported on WordPress forum, and another one is not yet discovered.
Highly-configurable systems allow users to tailor the software to their specific needs. Not all combinations of configuration options are valid though, and constraints arise for technical or non-technical reasons. Explicitly describing these constraints in a variability model allows reasoning about the supported configurations. To automate creating variability models, we need to identify the origin of such configuration constraints. We propose an approach which uses build-time errors and a novel feature-effect heuristic to automatically extract configuration constraints from C code. We conduct an empirical study on four highly-configurable open-source systems with existing variability models having three objectives in mind: evaluate the accuracy of our approach, determine the recoverability of existing variability-model constraints using our analysis, and classify the sources of variability-model constraints. We find that both our extraction heuristics are highly accurate (93 % and 77 % respectively), and that we can recover 19 % of the existing variability-models using our approach. However, we find that many of the remaining constraints require expert knowledge or more expensive analyses. We argue that our approach, tooling, and experimental results support researchers and practitioners working on variability model re-engineering, evolution, and consistency-checking techniques.
Hidden code dependencies are responsible for many complications in maintenance tasks. With the introduction of variable features in product lines, dependencies may even cross feature boundaries and related problems are prone to be detected late. Many current implementation techniques for product lines lack proper interfaces, which could make such dependencies explicit. As alternative to changing the implementation approach, we provide a comprehensive tool-based solution to support developers in recognizing and dealing with feature dependencies: emergent interfaces. Emergent interfaces are computed on demand, based on feature-sensitive interprocedural data-flow analysis. They emerge in the IDE and emulate benefits of modularity not available in the host language. To evaluate the potential of emergent interfaces, we conducted and replicated a controlled experiment, and found, in the studied context, that emergent interfaces can improve performance of code change tasks by up to 3 times while also reducing the number of errors.
Programming experience is an important confounding parameter in controlled experiments regarding program comprehension. In literature, ways to measure or control programming experience vary. Often, researchers neglect it or do not specify how they controlled for it. We set out to find a well-defined understanding of programming experience and a way to measure it. From published comprehension experiments, we extracted questions that assess programming experience. In a controlled experiment, we compare the answers of computer-science students to these questions with their performance in solving program-comprehension tasks. We found that self estimation seems to be a reliable way to measure programming experience. Furthermore, we applied exploratory and confirmatory factor analyses to extract and evaluate a model of programming experience. With our analysis, we initiate a path toward validly and reliably measuring and describing programming experience to better understand and control its influence in program-comprehension experiments.
The feature-interaction problem has been keeping researchers and practitioners in suspense for years. Although there has been substantial progress in developing approaches for modeling, detecting, managing, and resolving feature interactions, we lack sufficient knowledge on the kind of feature interactions that occur in real-world systems. In this position paper, we set out the goal to explore the nature of feature interactions systematically and comprehensively, classified in terms of order and visibility. Understanding this nature will have significant implications on research in this area, for example, on the efficiency of interaction-detection or performance-prediction techniques. A set of preliminary results as well as a discussion of possible experimental setups and corresponding challenges give us confidence that this endeavor is within reach but requires a collaborative effort of the community.
Software product line engineering is an efficient means to generate a set of tailored software products from a common implementation. However, adopting a product-line approach poses a major challenge and significant risks, since typically legacy code must be migrated toward a product line. Our aim is to lower the adoption barrier by providing semiautomatic tool support—called variability mining—to support developers in locating, documenting, and extracting implementations of product-line features from legacy code. Variability mining combines prior work on concern location, reverse engineering, and variability-aware type systems, but is tailored specifically for the use in product lines. Our work pursues three technical goals: (1) we provide a consistency indicator based on a variability-aware type system, (2) we mine features at a fine level of granularity, and (3) we exploit domain knowledge about the relationship between features when available. With a quantitative study, we demonstrate that variability mining can efficiently support developers in locating features.
The advent of proper variability management and generator technology enables users to derive individual variants from a variable code base solely based on a selection of desired configuration options. This approach gives rise to a huge configuration space, but the high degree of variability comes at a cost: classic analysis methods do not scale any more; there are simply too many potential variants to analyze. To address this issue, researchers and practitioners usually apply sampling techniques—only a subset of all possible variants is analyzed. While sampling promises to reduce the analysis effort significantly, the information obtained is necessarily incomplete. Furthermore, it is unknown whether sampling strategies scale to billions of variants, because even samples may be huge and expensive to compute. Recently, researchers have begun to develop variability-aware analyses that analyze the variable code base directly with the goal to exploit the similarities among individual variants to reduce analysis effort. However, while being promising, so far, variability-aware analyses have been applied mostly only to small academic systems. To learn about the mutual strengths and weaknesses of variability-aware and sampling-based analyses of large-scale, real-world software systems, we compared the two by means of two concrete analysis implementations (type checking and liveness analysis) applied to three subject systems: the Busybox tool suite, the x86 Linux kernel, and the cryptographic library OpenSSL. A key result is that in these settings already setting up sampling techniques is challenging while variability-aware analysis even outperforms most sampling approximations with respect to analysis time.
While standardization has empowered the software industry to substantially scale software development and to provide affordable software to a broad market, it often does not address smaller market segments, nor the needs and wishes of individual customers. Software product lines reconcile mass production and standardization with mass customization in software engineering. Ideally, based on a set of reusable parts, a software manufacturer can generate a software product based on the requirements of its customer. The concept of features is central to achieving this level of automation, because features bridge the gap between the requirements the customer has and the functionality a product provides. Thus features are a central concept in all phases of product-line development. The authors take a developer’s viewpoint, focus on the development, maintenance, and implementation of product-line variability, and especially concentrate on automated product derivation based on a user’s feature selection. The book consists of three parts. Part I provides a general introduction to feature-oriented software product lines, describing the product-line approach and introducing the product-line development process with its two elements of domain and application engineering. The pivotal Part II covers a wide variety of implementation techniques including design patterns, frameworks, components, feature-oriented programming, and aspect-oriented programming, as well as tool-based approaches including preprocessors, build systems, version-control systems, and virtual separation of concerns. Finally, Part III is devoted to advanced topics related to feature-oriented product lines like refactoring, feature interaction, and analysis tools specific to product lines. In addition, an Appendix lists various helpful tools for software product-line development, along with a description of how they relate to the topics covered in this book. To tie the book together, the authors use two running examples that are well documented in the product-line literature: data management for embedded systems, and variations of graph data structures. They start every chapter by explicitly stating the respective learning goals and finish it with a set of exercises; additional teaching material is also available online. All these features make the book ideally suited for teaching – both for academic classes and for professionals interested in self-study.
Formal specification and verification techniques have been used successfully to detect feature interactions. We investigate whether feature-based specifications can be used for this task. Feature-based specifications are a special class of specifications that aim at modularity in open-world, feature-oriented systems. The question we address is whether modularity of specifications impairs the ability to detect feature interactions, which cut across feature boundaries. In an exploratory study on 10 feature-oriented systems, we found that the majority of feature interactions could be detected based on feature-based specifications, but some specifications have not been modularized properly and require undesirable workarounds to modularization. Based on the study, we discuss the merits and limitations of feature-based specifications, as well as open issues and perspectives. A goal that underlies our work is to raise awareness of the importance and challenges of feature-based specification.
Program comprehension plays a crucial role during the software-development life cycle: Maintenance programmers spend most of their time with comprehending source code, and maintenance is the main cost factor in software development. Thus, if we can improve program comprehension, we can save considerable amount of time and cost. To improve program comprehension, we have to measure it first. However, program comprehension is a complex, internal cognitive process that we cannot observe directly. Typically, we need to conduct controlled experiments to soundly measure program comprehension. However, empirical research is applied only reluctantly in software engineering. To close this gap, we set out to support researchers in planning and conducting experiments regarding program comprehension. We report our experience with experiments that we conducted and present the resulting framework to support researchers in planning and conducting experiments. Additionally, we discuss the role of teaching for the empirical researchers of tomorrow.
The collections API of a programming language forms an embedded domain-specific language to express queries and operations on collections. Unfortunately, the ordinary style of implementing such APIs does not allow automatic domain-specific analyses and optimizations such as fusion of collection traversals, usage of indexing, or reordering of filters. Performance-critical code using collections must instead be hand-optimized, leading to non-modular, brittle, and redundant code. We propose SQuOpt, the Scala Query Optimizer—a deep embedding of the Scala collections API that allows such analyses and optimizations to be defined and executed within Scala, with- out relying on external tools or compiler extensions. SQuOpt provides the same “look and feel” (syntax and static typing guar- antees) as the standard collections API. We evaluate SQuOpt by re-implementing several code analyses of the Findbugs tool using SQuOpt and demonstrate that SQuOpt can reconcile modularity and efficiency in real-world applications.
Software product-line engineering aims at the development of families of related products that share common assets. An important aspect is that customers are often interested not only in particular functionalities (i.e., features), but also in non-functional quality attributes such as performance, reliability, and footprint. A naive approach is to measure quality attributes of every single product, and to deliver the products that fit the customers' needs. However, as product lines may consist of millions of products, this approach does not scale. In this research-in-progress report, we propose a systematic approach for the efficient and scalable prediction of quality attributes of products that consists of two steps. First, we generate predictors for certain categories of quality attributes (e.g., a predictor for low performance) based on software and network measures, and receiver operating characteristic analysis. Second, we use these predictors to guide a sampling process that takes the asset base of a product line as input and efficiently determines the products that fall into the category denoted by a given predictor (e.g., products with low performance). In other words, we use predictors to make the process of finding “acceptable” products more efficient. We discuss and compare several strategies to incorporate predictors in the sampling process.
Software product-line engineering aims at the development of families of related products that share common assets. An important aspect is that customers are often interested not only in particular functionalities (i.e., features), but also in non-functional quality attributes such as performance, reliability, and footprint. A naive approach is to measure quality attributes of every single product, and to deliver the products that fit the customers' needs. However, as product lines may consist of millions of products, this approach does not scale. In this research-in-progress report, we propose a systematic approach for the efficient and scalable prediction of quality attributes of products that consists of two steps. First, we generate predictors for certain categories of quality attributes (e.g., a predictor for low performance) based on software and network measures, and receiver operating characteristic analysis. Second, we use these predictors to guide a sampling process that takes the asset base of a product line as input and efficiently determines the products that fall into the category denoted by a given predictor (e.g., products with low performance). In other words, we use predictors to make the process of finding “acceptable” products more efficient. We discuss and compare several strategies to incorporate predictors in the sampling process.
Product-line analysis has received considerable attention in the past. As it is often infeasible to analyze each product of a product line individually, researchers have developed analyses, called variability-aware analyses, that consider and exploit variability manifested in a code base. Variability-aware analyses are often significantly more efficient than traditional analyses, but each of them has certain weaknesses regarding applicability or scalability, as we discuss in this paper. We present the Product-Line-Analysis Model, a formal model for the classification and comparison of existing analyses, including traditional and variability-aware analyses, and lay a foundation for formulating and exploring further, combined analyses. As a proof of concept, we discuss different examples of analyses in the light of our model, and demonstrate its benefits for systematic comparison and exploration of product-line analyses.
Program comprehension is an often evaluated, internal cognitive process. In neuroscience, functional magnetic resonance (fMRI) imaging is used to visualize such internal cognitive processes. We propose an experimental design to measure program comprehension based on fMRI. In the long run, we hope to answer questions like What distinguishes good programmers from bad programmers? or What makes a good programmer?
We investigate how to execute a unit test in all configurations of a product line without generating each product in isolation in a brute-force fashion. Learning from variability-aware analyses, we (a) design and implement a variability-aware interpreter and (b) reencode variability of the product line to simulate the test cases with a model checker. The interpreter internally reasons about variability, executing paths not affected by variability only once for the whole product line. The model checker achieves similar results by reusing powerful off-the-shelf analyses. We experimented with a prototype implementation for each strategy. We compare both strategies and discuss trade-offs and future directions.
It is common believe that separating source code along concerns or features improves program comprehension of source code. However, empirical evidence is mostly missing. In this paper, we design a controlled experiment to evaluate that believe for feature-oriented programming based on maintenance tasks with human participants. We validate our experiment with a pilot study, which already preliminarily confirms that students use different strategies to complete maintenance tasks.
The theory of context-free languages is well-understood and context-free parsers can be used as off-the-shelf tools in practice. In particular, to use a context-free parser framework, a user does not need to understand its internals but can specify a language declaratively as a grammar. However, many languages in practice are not context-free. One particularly important class of such languages is layout-sensitive languages, in which the structure of code depends on indentation and whitespace. For example, Python, Haskell, F\#, and Markdown use indentation instead of curly braces to determine the block structure of code. Their parsers (and lexers) are not declaratively specified but hand-tuned to account for layout-sensitivity. To support declarative specifications of layout-sensitive languages, we propose a parsing framework in which a user can annotate layout in a grammar as constraints on the relative positioning of tokens in the parsed subtrees. For example, a user can declare that a block consists of statements that all start on the same column. We have integrated layout constraints into SDF and implemented a layout-sensitive generalized parser as an extension of generalized LR parsing. We evaluate the correctness and performance of our parser by parsing 33290 open-source Haskell files. Layout-sensitive generalized parsing is easy to use, and its performance overhead compared to layout-insensitive parsing is small enough for most practical applications.
Context: A software product line is a family of related software products, typically created from a set of common assets. Users select features to derive a product that fulfills their needs. Users often expect a product to have specific non-functional properties, such as a small footprint or a bounded response time. Because a product line may have an exponential number of products with respect to its features, it is usually not feasible to generate and measure non-functional properties for each possible product. Objective: Our overall goal is to derive optimal products with respect to non-functional requirements by showing customers which features must be selected. Method: We propose an approach to predict a product’s non-functional properties based on the product’s feature selection. We aggregate the influence of each selected feature on a non-functional property to predict a product’s properties. We generate and measure a small set of products and, by comparing measurements, we approximate each feature’s influence on the non-functional property in question. As a research method, we conducted controlled experiments and evaluated prediction accuracy for the non-functional properties footprint and main-memory consumption. But, in principle, our approach is applicable for all quantifiable non-functional properties. Results: With nine software product lines, we demonstrate that our approach predicts the footprint with an average accuracy of 94\,\%, and an accuracy of over 99\,\% on average if feature interactions are known. In a further series of experiments, we predicted main memory consumption of six customizable programs and achieved an accuracy of 89\,\% on average. Conclusion: Our experiments suggest that, with only few measurements, it is possible to accurately predict non-functional properties of products of a product line. Furthermore, we show how already little domain knowledge can improve predictions and discuss trade-offs between accuracy and required number of measurements. With this technique, we provide a basis for many reasoning and product-derivation approaches.
Feature-oriented software development is a paradigm for the construction, customization, and synthesis of large-scale and variable software systems, focusing on structure, reuse and variation. In this tutorial, we provide a gentle introduction to software product lines, feature oriented programming, virtual separation of concerns, and variability- aware analysis. We provide an overview, show connections between the different lines of research, and highlight possible future research directions.
FeatureIDE is an open-source framework for feature-oriented software development (FOSD) based on Eclipse. FOSD is a paradigm for the construction, customization, and synthesis of software systems. Code artifacts are mapped to features, and a customized software system can be generated given a selection of features. The set of software systems that can be generated is called a software product line (SPL). FeatureIDE supports several FOSD implementation techniques such as feature-oriented programming, aspect-oriented programming, delta-oriented programming, and preprocessors. All phases of FOSD are supported in FeatureIDE, namely domain analysis, requirements analysis, domain implementation, and software generation.
Module systems enable a divide and conquer strategy to software development. To implement compile-time variability in software product lines, modules can be composed in different combinations. However, this way variability dictates a dominant decomposition. Instead, we introduce a variability-aware module system that supports compile-time variability inside a module and its interface. This way, each module can be considered a product line that can be type checked in isolation. Variability can crosscut multiple modules. The module system breaks with the antimodular tradition of a global variability model in product-line development and provides a path toward software ecosystems and product lines of product lines developed in an open fashion. We discuss the design and implementation of such a module system on a core calculus and provide an implementation for C, which we use to type check the open source product line Busybox with 811 compile-time options.
Module systems enable a divide and conquer strategy to software development. To implement compile-time variability in software product lines, modules can be composed in different combinations. However, this way variability dictates a dominant decomposition. Instead, we introduce a variability-aware module system that supports compile-time variability inside a module and its interface. This way, each module can be considered a product line that can be type checked in isolation. Variability can crosscut multiple modules. The module system breaks with the antimodular tradition of a global variability model in product-line development and provides a path toward software ecosystems and product lines of product lines developed in an open fashion. We discuss the design and implementation of such a module system on a core calculus and provide an implementation for C, which we use to type check the open source product line Busybox with 811 compile-time options.
Software-product-line engineering has gained considerable momentum in recent years, both in industry and in academia. A software product line is a set of software products that share a common set of features. Software product lines challenge traditional analysis techniques, such as type checking, testing, and formal verification, in their quest of ensuring correctness and reliability of software. Simply creating and analyzing all products of a product line is usually not feasible, due to the potentially exponential number of valid feature combinations. Recently, researchers began to develop analysis techniques that take the distinguishing properties of software product lines into account, for example, by checking feature-related code in isolation or by exploiting variability information during analysis. The emerging field of product-line analysis techniques is both broad and diverse such that it is difficult for researchers and practitioners to understand their similarities and differences (e.g., with regard to variability awareness or scalability), which hinders systematic research and application. We classify the corpus of existing and ongoing work in this field, we compare techniques based on our classification, and we infer a research agenda. A short-term benefit of our endeavor is that our classification can guide research in product-line analysis and, to this end, make it more systematic and efficient. A long-term goal is to empower developers to choose the right analysis technique for their needs out of a pool of techniques with different strengths and weaknesses.
Software-product-line engineering aims at the development of variable and reusable software systems. In practice, software product lines are often implemented with preprocessors. Preprocessor directives are easy to use, and many mature tools are available for practitioners. However, preprocessor directives have been heavily criticized in academia and even referred to as “#ifdef hell”, because they introduce threats to program comprehension and correctness. There are many voices that suggest to use other implementation techniques instead, but these voices ignore the fact that a transition from preprocessors to other languages and tools is tedious, erroneous, and expensive in practice. Instead, we and others propose to increase the readability of preprocessor directives by using background colors to highlight source code annotated with ifdef directives. In three controlled experiments with over 70 subjects in total, we evaluate whether and how background colors improve program comprehension in preprocessor-based implementations. Our results demonstrate that background colors have the potential to improve program comprehension, independently of size and programming language of the underlying product. Additionally, we found that subjects generally favor background colors. We integrate these and other findings in a tool called FeatureCommander, which facilitates program comprehension in practice and which can serve as a basis for further research.
Background: Software product line engineering provides an effective mechanism to implement variable software. However, the usage of preprocessors to realize variability, which is typical in industry, is heavily criticized, because it often leads to obfuscated code. Using background colours to highlight preprocessor statements to support comprehensibility has shown effective, however, scalability to large software product lines (SPLs) is questionable. Aim: Our goal is to implement and evaluate scalable usage of background colours for industrial-sized SPLs. Method: We designed and implemented scalable concepts in a tool called FeatureCommander. To evaluate its effectiveness, we conducted a controlled experiment with a large real-world SPL with over 99,000 lines of code and 340 features. We used a within-subjects design with treatments colours and no colours. We compared correctness and response time of tasks for both treatments. Results: For certain kinds of tasks, background colours improve program comprehension. Furthermore, subjects generally favour background colours compared to no background colours. Additionally, subjects who worked with background colours had to use the search functions less frequently. Conclusion: We show that background colours can improve program comprehension in large SPLs. Based on these encouraging results, we will continue our work on improving program comprehension in large SPLs.
Programming experience is an important confounding parameter in controlled experiments regarding program comprehension. In literature, ways to measure or control programming experience vary. Often, researchers neglect it or do not specify how they controlled it. We set out to find a well-defined understanding of programming experience and a way to measure it. From published comprehension experiments, we extracted questions that assess programming experience. In a controlled experiment, we compare the answers of 128 students to these questions with their performance in solving program-comprehension tasks. We found that self estimation seems to be a reliable way to measure programming experience. Furthermore, we applied exploratory factor analysis to extract a model of programming experience. With our analysis, we initiate a path toward measuring programming experience with a valid and reliable tool, so that we can control its influence on program comprehension.
Customizable programs and program families provide user-selectable features to tailor a program to an application scenario. Knowing in advance which feature selection yields the best performance is difficult because a direct measurement of all possible feature combinations is infeasible. Our work aims at predicting program performance based on selected features. The challenge is predicting performance accurately when features interact. An interaction occurs when a feature combination has an unexpected influence on performance. We present a method that automatically detects performance feature interactions to improve prediction accuracy. To this end, we propose three heuristics to reduce the number of measurements required to detect interactions. Our evaluation consists of six real-world case studies from varying domains (e.g. databases, compression libraries, and web server) using different configuration techniques (e.g., configuration files and preprocessor flags). Results show, on average, a prediction accuracy of 95 %.
Superimposition is a composition technique that has been applied successfully in many areas of software development. Although superimposition is a general-purpose concept, it has been (re)invented and implemented individually for various kinds of software artifacts. We unify languages and tools that rely on superimposition by using the language-independent model of feature structure trees (FSTs). On the basis of the FST model, we propose a general approach to the composition of software artifacts written in different languages. Furthermore, we offer a supporting framework and tool chain, called FeatureHouse. We use attribute grammars to automate the integration of additional languages. In particular, we have integrated Java, C#, C, Haskell, Alloy, and JavaCC. A substantial number of case studies demonstrate the practicality and scalability of our approach and reveal insights into the properties that a language must have in order to be ready for superimposition. We discuss perspectives of our approach and demonstrate how we extended FeatureHouse with support for XML languages (in particular, XHTML, XMI/UML, and Ant) and alternative composition approaches (in particular, aspect weaving). Rounding off our previous work, we provide here a holistic view of the FeatureHouse approach based on rich experience with numerous languages and case studies and reflections on several years of research.
Software is changed frequently during its life cycle. New requirements come and bugs must be fixed. To update an application it usually must be stopped, patched, and restarted. This causes time periods of unavailability which is always a problem for highly available applications. Even for the development of complex applications restarts to test new program parts can be time consuming and annoying. Thus, we aim at dynamic software updates to update programs at runtime. There is a large body of research on dynamic software updates, but so far, existing approaches have shortcomings either in terms of flexibility or performance. In addition, some of them depend on specific runtime environments and dictate the program’s architecture. We present JavAdaptor, the first runtime update approach based on Java that (a) offers flexible dynamic software updates, (b) is platform independent, (c) introduces only minimal performance overhead, and (d) does not dictate the program architecture. JavAdaptor combines schema changing class replacements by class renaming and caller updates with Java HotSwap using containers and proxies. It runs on top of all major standard Java virtual machines. We evaluate our approach’s applicability and performance in non-trivial case studies and compare it to existing dynamic software update approaches.
Software product line engineering is an efficient means to generate a set of tailored software products from a common implementation. However, adopting a product-line approach poses a major challenge and significant risks, since typically legacy code must be migrated toward a product line. Our aim is to lower the adoption barrier by providing semiautomatic tool support—called variability mining—to support developers in locating, documenting, and extracting implementations of product-line features from legacy code. Variability mining combines prior work on concern location, reverse engineering, and variability-aware type systems, but is tailored specifically for the use in product lines. Our work extends prior work in three important aspects: (1) we provide a consistency indicator based on a variability-aware type system, (2) we mine features at a fine level of granularity, and (3) we exploit domain knowledge about the relationship between features when available. With a quantitative study, we demonstrate that variability mining can efficiently support developers in locating features.
Large software projects consist of code written in a multitude of different (possibly domain-specific) languages, which are often deeply interspersed even in single files. While many proposals exist on how to integrate languages semantically and syntactically, the question of how to support this scenario in integrated development environments (IDEs) remains open: How can standard IDE services, such as syntax highlighting, outlining, or reference resolving, be provided in an extensible and compositional way, such that an open mix of languages is supported in a single file? Based on our library-based syntactic extension language for Java, SugarJ, we propose to make IDEs extensible by organizing editor services in editor libraries. Editor libraries are libraries written in the object language, SugarJ, and hence activated and composed through regular import statements on a file-by-file basis. We have implemented an IDE for editor libraries on top of SugarJ and the Eclipse-based Spoofax language workbench. We have validated editor libraries by evolving this IDE into a fully-fledged and schema-aware XML editor as well as an extensible Latex editor, which we used for writing this paper.
Modularity of feature representations has been a long standing goal of feature-oriented software development. While some researchers regard feature modules and corresponding composition mechanisms as a modular solution, other researchers have challenged the notion of feature modularity and pointed out that most feature-oriented implementation mechanisms lack proper interfaces and support neither modular type checking nor separate compilation. We step back and reflect on the feature-modularity discussion. We distinguish two notions of modularity, cohesion without interfaces and information hiding with interfaces, and point out the different expectations that, we believe, are the root of many heated discussions. We discuss whether feature interfaces should be desired and weigh their potential benefits and costs, specifically regarding crosscutting, granularity, feature interactions, and the distinction between closed-world and open-world reasoning. Because existing evidence for and against feature modularity and feature interfaces is shaky and inconclusive, more research is needed, for which we outline possible directions.
In many projects, lexical preprocessors are used to manage different variants of the project (using conditional compilation) and to define compile-time code transformations (using macros). Unfortunately, while being a simply way to implement variability, conditional compilation and lexical macros hinder automatic analysis, even though such analysis would be urgently needed to combat variability-induced complexity. To analyze code with its variability, we need to parse it without preprocessing it. However, current parsing solutions use heuristics, support only a subset of the language, or suffer from exponential explosion. As part of the TypeChef project, we contribute a novel variability-aware parser that can parse unpreprocessed code without heuristics in practicable time. Beyond the obvious task of detecting syntax errors, our parser paves the road for further analysis, such as variability-aware type checking. We implement variabilityaware parsers for Java and GNU C and demonstrate practicability by parsing the product line MobileMedia and the entire X86 architecture of the Linux kernel with 6065 variable features.
Existing approaches to extend a programming language with syntactic sugar often leave a bitter taste, because they cannot be used with the same ease as the main extension mechanism of the programming language—libraries. Sugar libraries are a novel approach for syntactically extending a programming language within the language. A sugar library is like an ordinary library, but can, in addition, export syntactic sugar for using the library. Sugar libraries maintain the composability and scoping properties of ordinary libraries and are hence particularly well-suited for embedding a multitude of domain-specific languages into a host language. They also inherit the self-applicability of libraries, which means that the syntax extension mechanism can be applied in the definition of sugar libraries themselves. To demonstrate the expressiveness and applicability of sugar libraries, we have developed SugarJ, a language on top of Java, SDF and Stratego that supports syntactic extensibility. SugarJ employs a novel incremental parsing mechanism that allows changing the syntax within a source file. We demonstrate SugarJ by five language extensions, including embeddings of XML and closures in Java, all available as sugar libraries. We illustrate the utility of self-applicability by embedding XML Schema, a metalanguage to define XML languages.
An ongoing problem in revision control systems is how to resolve conflicts in a merge of independently developed revisions. Unstructured revision control systems are purely text-based and solve conflicts based on textual similarity. Structured revision control systems are tailored to specific languages and use language-specific knowledge for conflict resolution. We propose semistructured revision control systems that inherit the strengths of both classes of systems: generality and expressiveness. The idea is to provide structural information of the underlying software artifacts—declaratively, in the form of annotated grammars. This way, a wide variety of languages can be supported and the information provided can assist the automatic resolution of two classes of conflicts: ordering conflicts and semantic conflicts. The former can be resolved independently of the language and the latter can be resolved using specific conflict handlers supplied by the user. We have been developing a tool that supports semistructured merge and conducted an empirical study on 24 software projects developed in Java, C#, and Python comprising 180 merge scenarios. We found that semistructured merge reduces the number of conflicts in 60 % of the sample merge scenarios by, on average, 34 %. Our study reveals that renaming is challenging in that it can significantly increase the number of conflicts during semistructured merge, which we discuss.
A software product line (SPL) is a family of related programs of a domain. The programs of an SPL are distinguished in terms of features, which are end-uservisible characteristics of programs. Based on a selection of features, stakeholders can derive tailor-made programs that satisfy functional requirements. Besides functional requirements, different application scenarios raise the need for optimizing non-functional properties of a variant. The diversity of application scenarios leads to heterogeneous optimization goals with respect to non-functional properties (e.g., performance vs. footprint vs. energy optimized variants). Hence, an SPL has to satisfy different and sometimes contradicting requirements regarding non-functional properties. Usually, the actually required non-functional properties are not known before product derivation and can vary for each application scenario and customer. Allowing stakeholders to derive optimized variants requires to measure non-functional properties after the SPL is developed. Unfortunately, the high variability provided by SPLs complicates measurement and optimization of non-functional properties due to a large variant space. With SPL Conqueror, we provide a holistic approach to optimize non-functional properties in SPL engineering. We show how non-functional properties can be qualitatively specified and quantitatively measured in the context of SPLs. Furthermore, we discuss the variant-derivation process in SPL Conqueror that reduces the effort of computing an optimal variant. We demonstrate the applicability of our approach by means of nine case studies of a broad range of application domains (e.g., database management and operating systems). Moreover, we show that SPL Conqueror is implementation and language independent by using SPLs that are implemented with different mechanisms, such as conditional compilation and feature-oriented programming.
Bedingte Kompilierung ist ein einfaches und häufig benutztes Mittel zur Implementierung von Variabilität in Softwareproduktlinien, welches aber aufgrund negativer Auswirkungen auf Codequalität und Wartbarkeit stark kritisiert wird. Wir zeigen wie Werkzeugunterstützung – Sichten, Visualisierung, kontrollierte Annotationen, Produktlinien-Typsystem – die wesentlichen Probleme beheben kann und viele Vorteile einer modularen Entwicklung emuliert. Wir bieten damit eine Alternative zur klassischen Trennung von Belangen mittels Modulen. Statt Quelltext notwendigerweise in Dateien zu separieren erzielen wir eine virtuelle Trennung von Belangen durch entsprechender Werkzeugunterstüzung.
Software measures are often used to assess program comprehension, although their applicability is discussed controversially. Often, their application is based on plausibility arguments, which however is not sufficient to decide whether and how software measures are good predictors for program comprehension. Our goal is to evaluate whether and how software measures and program comprehension correlate. To this end, we carefully designed an experiment. We used four different measures that are often used to judge the quality of source code: complexity, lines of code, concern attributes, and concern operations. We measured how subjects understood two comparable software systems that differ in their implementation, such that one implementation promised considerable benefits in terms of better software measures. We did not observe a difference in program comprehension of our subjects as the software measures suggested it. To explore how software measures and program comprehension could correlate, we used several variants of computing the software measures. This brought them closer to our observed result, however, not as close as to confirm a relationship between software measures and program comprehension. Having failed to establish a relationship, we present our findings as an open issue to the community and initiate a discussion on the role of software measures as comprehensibility predictors.
A software product line is a set of program variants, typically generated from a common code base. Feature models describe variability in product lines by documenting features and their valid combinations. In product-line engineering, we need to reason about variability and program variants for many different tasks. For example, given a feature model, we might want to determine the number of all valid feature combinations or detect specific feature combinations for testing. However, we found that contemporary reasoning approaches can only reason about feature combinations, not about program variants, because they do not take abstract features into account. Abstract features are features used to structure a feature model that, however, do not have any impact at implementation level. Using existing feature-model reasoning mechanisms for product variants leads to incorrect results. We raise awareness of the problem of abstract features for different kinds of analyses on feature models. We argue that, in order to reason about program variants, abstract features should be made explicit in feature models. We present a technique based on propositional formulas to reason about program variants. In practice, our technique can save effort that is caused by considering the same program variant multiple times, for example, in product-line testing.
A software product line (SPL) is a family of related software products, from which users can derive a product that fulfills their needs. Often, users expect a product to have specific non-functional properties, for example, to not exceed a footprint limit or to respond in a given time frame. Unfortunately, it is usually not feasible to generate and measure non-functional properties for each possible product of an SPL in isolation, because an SPL can contain millions of products. Hence, we propose an approach to estimate each product's non-functional properties in advance, based on the product's configuration. To this end, we approximate non-functional properties per features and per feature interaction. We generate and measure a small set of products and approximated non-functional properties by comparing the measurements. Our approach is implementation independent and language independent. We present three different approaches with different trade-offs regarding accuracy and required number of measurements. With nine case studies, we demonstrate that our approach can predict non-functional properties with an accuracy of 2\%.
What is modularity? Which kind of modularity should developers strive for? Despite decades of research on modularity, these basic questions have no definite answer. We submit that the common understanding of modularity, and in particular its notion of information hiding, is deeply rooted in classical logic. We analyze how classical modularity, based on classical logic, fails to address the needs of developers of large software systems, and encourage researchers to explore alternative visions of modularity, based on nonclassical logics, and henceforth called nonclassical modularity.
Background: Software product line engineering provides an effective mechanism to implement variable software. However, the usage of preprocessors, which is typical in industry, is heavily criticized, because it often leads to obfuscated code. Using background colors to support comprehensibility has shown effective, however, scalability to large software product lines (SPLs) is questionable. Aim: Our goal is to implement and evaluate scalable usage of background colors for industrial-sized SPLs. Method: We designed and implemented scalable concepts in a tool called FeatureCommander. To evaluate its effectiveness, we conducted a controlled experiment with a large real-world SPL with over 160,000 lines of code and 340 features. We used a within-subjects design with treatments colors and no colors. We compared correctness and response time of tasks for both treatments. Results: For certain kinds of tasks, background colors improve program comprehension. Furthermore, subjects generally favor background colors. Conclusion: We show that background colors can improve program comprehension in large SPLs. Based on these encouraging results, we will continue our work improving program comprehension in large SPLs.
Software-product-line engineering is an efficient means to generate a family of program variants for a domain from a single code base. However, because of the potentially high number of possible program variants, it is difficult to test them all and ensure properties like type safety for the entire product line. We present a product-line–aware type system that can type check an entire software product line without generating each variant in isolation. Specifically, we extend the Featherweight Java calculus with feature annotations for product-line development and prove formally that all program variants generated from a well-typed product line are well-typed. Furthermore, we present a solution to the problem of typing mutually exclusive features. We discuss how results from our formalization helped implementing our own product-line tool CIDE for full Java and report of experience with detecting type errors in four existing software-product-line implementations.
The C preprocessor cpp is a widely used tool for implementing variable software. It enables programmers to express variable code of features that may crosscut the entire implementation with conditional compilation. The C preprocessor relies on simple text processing and is independent of the host language (C, C++, Java, and so on). Language independent text processing is powerful and expressive|programmers can make all kinds of annotations in the form of #ifdefs but can render unpreprocessed code difficult to process automatically by tools, such as code aspect refactoring, concern management, and also static analysis and variability-aware type checking. We distinguish between disciplined annotations, which align with the underlying source-code structure, and undisciplined annotations, which do not align with the structure and hence complicate tool development. This distinction raises the question of how frequently programmers use undisciplined annotations and whether it is feasible to change them to disciplined annotations to simplify tool development and to enable programmers to use a wide variety of tools in the first place. By means of an analysis of 40 mediumsized to large-sized C programs, we show empirically that programmers use cpp mostly in a disciplined way: about 85 % of all annotations respect the underlying source-code structure. Furthermore, we analyze the remaining undisciplined annotations, identify patterns, and discuss how to transform them into a disciplined form.
The C preprocessor is commonly used to implement variability. Given a feature selection, code fragments can be excluded from compilation with #ifdef and similar directives. However, the token-based nature of the C preprocessor makes variability implementation difficult and errorprone. Additionally, variability mechanisms are intertwined with macro definitions, macro expansion, and file inclusion. To determine whether a code fragment is compiled, the entire file must be preprocessed. We present a partial preprocessor that preprocesses file inclusion and macro expansion, but retains variability information for further analysis. We describe the mechanisms of the partial preprocessor, provide a full implementation, and present some initial experimental results. The partial preprocessor is part of a larger endeavor in the TypeChef project to check variability implementations (syntactic correctness, type correctness) in C projects such as the Linux kernel.
Software product lines have gained momentum as an approach to generate many variants of a program, each tailored to a specific use case, from a common code base. However, the implementation of product lines raises new challenges, as potentially millions of program variants are developed in parallel. In prior work, we and others have developed product-line–aware type systems to detect type errors in a product line, without generating all variants. With TypeChef, we build a similar type checker for product lines written in C that implements variability with #ifdef directives of the C preprocessor. However, a product-line–aware type system for C is more difficult than expected due to several peculiarities of the preprocessor, including lexical macros and unrestricted use of #ifdef directives. In this paper, we describe the problems faced and our progress to solve them with TypeChef. Although TypeChef is still under development and cannot yet process arbitrary C code, we demonstrate its capabilities so far with a case study: By type checking the open-source web server Boa with potentially 2^110 variants, we found type errors in several variants.
Feature-Oriented Software Development (FOSD) is a paradigm for the development of software product lines. A challenge in FOSD is to guarantee that all software systems of a software product line are correct. Recent work on type checking product lines can provide a guarantee of type correctness without generating all possible systems. We generalize previous results by abstracting from the specifics of particular programming languages. In a first attempt, we present a reference-checking algorithm that performs key tasks of product-line type checking independently of the target programming language. Experiments with two sample product lines written in Java and C are encouraging and give us confidence that this approach is promising.
In feature-oriented programming (FOP) a programmer decomposes a program in terms of features. Ideally, features are implemented modularly so that they can be developed in isolation. Access control is an important ingredient to attain feature modularity as it provides mechanisms to hide and expose internal details of a module's implementation. But developers of contemporary feature-oriented languages have not considered access control mechanisms so far. The absence of a well-defined access control model for FOP breaks encapsulation of feature code and leads to unexpected program behaviors and inadvertent type errors. We raise awareness of this problem, propose three feature-oriented access modifiers, and present a corresponding access modifier model. We offer an implementation of the model on the basis of a fully-fledged feature-oriented compiler. Finally, by analyzing ten feature-oriented programs, we explore the potential of feature-oriented modifiers in FOP.
Feature-oriented software development (FOSD) aims at the construction, customization, and synthesis of large-scale software systems. We propose a novel software design paradigm, called feature-oriented design, which takes the distinguishing properties of FOSD into account, especially the clean and consistent mapping between features and their implementations as well as the tendency of features to interact inadvertently. We extend the lightweight modeling language Alloy with support for feature-oriented design and call the extension FeatureAlloy. By means of an implementation and four case studies, we demonstrate how feature-oriented design with FeatureAlloy facilitates separation of concerns, variability, and reuse of models of individual features and helps in defining and detecting semantic dependences and interactions between features.
Some limitations of object-oriented mechanisms are known to cause code clones (e.g., extension using inheritance). Novel programming paradigms such as feature-oriented programming (FOP) aim at alleviating these limitations. However, it is an open issue whether FOP is really able to avoid code clones or whether it even facilitates (FOP-specific) clones. To address this issue, we conduct an empirical analysis on ten feature-oriented software product lines with respect to code cloning. We found that there is a considerable amount of clones in feature-oriented software product lines and that a large fraction of these clones is FOP-specific (i.e., caused by limitations of feature-oriented mechanisms). Based on our results, we initiate a discussion on the reasons for FOP-specific clones and on how to cope with them. We exemplary show how such clones can be removed by the application of refactoring.
Conditional compilation with preprocessors such as cpp is a simple but effective means to implement variability. By annotating code fragments with #ifdef and #endif directives, different program variants with or without these annotated fragments can be created, which can be used (among others) to implement software product lines. Although, such annotation-based approaches are frequently used in practice, researchers often criticize them for their negative effect on code quality and maintainability. In contrast to modularized implementations such as components or aspects, annotation-based implementations typically neglect separation of concerns, can entirely obfuscate the source code, and are prone to introduce subtle errors. Our goal is to rehabilitate annotation-based approaches by showing how tool support can address these problems. With views, we emulate modularity; with a visual representation of annotations, we reduce source code obfuscation and increase program comprehension; and with disciplined annotations and a product-line–aware type system, we prevent or detect syntax and type errors in the entire software product line. At the same time we emphasize unique benefits of annotations, including simplicity, expressiveness, and being language independent. All in all, we provide tool-based separation of concerns without necessarily dividing source code into physically separated modules; we name this approach virtual separation of concerns. We argue that with these improvements over contemporary preprocessors, virtual separation of concerns can compete with modularized implementation mechanisms. Despite our focus on annotation-based approaches, we do intend not give a definite answer on how to implement software product lines. Modular implementations and annotation-based implementations both have their advantages; we even present an integration and migration path between them. Our goal is to rehabilitate preprocessors and show that they are not a lost cause as many researchers think. On the contrary, we argue that – with the presented improvements – annotation-based approaches are a serious alternative for product-line implementation.
The C preprocessor is often used in practice to implement variability in software product lines. Using #ifdef statements provokes problems such as obfuscated source code, yet they will still be used in practice at least in the medium-term future. With CIDE, we demonstrate a tool to improve understanding and maintaining code that contains #ifdef statements by visualizing them with colors and providing different views on the code.
Feature-Oriented Software Development (FOSD) provides a multitude of formalisms, methods, languages, and tools for building variable, customizable, and extensible software. Along different lines of research, different notions of a feature have been developed. Although these notions have similar goals, no common basis for evaluation, comparison, and integration exists. We present a feature algebra that captures the key ideas of feature orientation and provides a common ground for current and future research in this field, in which also alternative options can be explored. Furthermore, our algebraic framework is meant to serve as a basis for the upcoming development paradigms automatic feature-based program synthesis and architectural metaprogramming.
A feature-oriented product line is a family of programs that share a common set of features. A feature implements a stakeholder's requirement and represents a design decision or configuration option. When added to a program, a feature involves the introduction of new structures, such as classes and methods, and the refinement of existing ones, such as extending methods. A feature-oriented decomposition enables a generator to create an executable program by composing feature code solely on the basis of the feature selection of a user – no other information needed. A key challenge of product line engineering is to guarantee that only well-typed programs are generated. As the number of valid feature combinations grows combinatorially with the number of features, it is not feasible to type check all programs individually. The only feasible approach is to have a type system check the entire code base of the feature-oriented product line. We have developed such a type system on the basis of a formal model of a feature-oriented Java-like language. The type system guaranties type safety for feature-oriented product lines. That is, it ensures that every valid program of a well-typed product line is well-typed. Our formal model including type system is sound and complete.
Over 30 years ago, the preprocessor cpp was developed to extend the programming language C by lightweight metaprogramming capabilities. Despite its error-proneness and low abstraction level, the cpp is still widely being used in presentday software projects to implement variable software. However, not much is known about how the cpp is employed to implement variability. To address this issue, we have analyzed forty open-source software projects written in C. Specifically, we answer the following questions: How does program size influence variability? How complex are extensions made via cpp's variability mechanisms? At which level of granularity are extensions applied? What is the general type of extensions? These questions revive earlier discussions on understanding and refactoring of the preprocessor. To answer them, we introduce several metrics measuring the variability, complexity, granularity, and type of extensions. Based on the data obtained, we suggest alternative implementation techniques. The data we have collected can influence other research areas, such as language design and tool support.
Revision control systems are a major means to manage versions and variants of today's software systems. An ongoing problem in these systems is how to resolve conflicts when merging independently developed revisions. Unstructured revision control systems are purely text-based and solve conflicts based on textual similarity. Structured revision control systems are tailored to specific languages and use language-specific knowledge for conflict resolution. We propose semistructured revision control systems to inherit the strengths of both classes of systems: generality and expressiveness. The idea is to provide structural information of the underlying software artifacts in the form of annotated grammars, which is motivated by recent work on software product lines. This way, a wide variety of languages can be supported and the information provided can assist the resolution of conflicts. We have implemented a preliminary tool and report on our experience with merging Java artifacts. We believe that drawing a connection between revision control systems and product lines has benefits for both fields.
Bedingte Kompilierung mit Präprozessoren wie cpp ist ein einfaches, aber wirksames Mittel zur Implementierung von Variabilität in Softwareproduktlinien. Durch das Annotieren von Code-Fragmenten mit #ifdef und #endif können verschiedene Programmvarianten mit oder ohne diesen Fragmenten generiert werden. Obwohl Präprozessoren häufig in der Praxis verwendet werden, werden sie oft für ihre negativen Auswirkungen auf Codequalität und Wartbarkeit kritisiert. Im Gegensatz zu modularen Implementierungen, etwa mit Komponenten oder Aspekte, vernachlässigen Präprozessoren die Trennung von Belangen im Quelltext, sind anfällig für subtile Fehler und verschlechtern die Lesbarkeit des Quellcodes. Wir zeigen, wie einfache Werkzeugunterstützung diese Probleme adressieren und zum Teil beheben bzw. die Vorteile einer modularen Implementierung emulieren kann. Gleichzeitig zeigen wir Vorteile von Präprozessoren wie Einfachheit und Sprachunabhängigkeit auf.
There are many different implementation approaches to realize the vision of feature oriented software development, ranging from simple preprocessors, over feature-oriented programming, to sophisticated aspect-oriented mechanisms. Their impact on readability and maintainability (or program comprehension in general) has caused a debate among researchers, but sound empirical results are missing. We report experience from our endeavor to conduct experiments to measure the influence of different implementation mechanisms on program comprehension. We describe how to design such experiments and report from possibilities and pitfalls we encountered. Finally, we present some early results of our first experiment on comparing CPP with CIDE.
In feature-oriented programming (FOP), a programmer decomposes a program in terms of features. Ideally, features are implemented modularly so that they can be developed in isolation. Access control is an important ingredient to attain feature modularity as it provides mechanisms to hide and expose internal details of a module's implementation. But developers of contemporary feature-oriented languages did not consider access control mechanisms so far. The absence of a well-defined access control model for FOP breaks the encapsulation of feature code and leads to unexpected and undefined program behaviors as well as inadvertent type errors, as we will demonstrate. The reason for these problems is that common object-oriented modifiers, typically provided by the base language, are not expressive enough for FOP and interact in subtle ways with feature-oriented language mechanisms. We raise awareness of this problem, propose three feature-oriented modifiers for access control, and present an orthogonal access modifier model.
Conditional compilation with preprocessors like cpp is a simple but effective means to implement variability. By annotating code fragments with #ifdef and #endif directives, different program variants with or without these fragments can be created, which can be used (among others) to implement software product lines. Although, preprocessors are frequently used in practice, they are often criticized for their negative effect on code quality and maintainability. In contrast to modularized implementations, for example using components or aspects, preprocessors neglect separation of concerns, are prone to introduce subtle errors, can entirely obfuscate the source code, and limit reuse. Our aim is to rehabilitate the preprocessor by showing how simple tool support can address these problems and emulate some benefits of modularized implementations. At the same time we emphasize unique benefits of preprocessors, like simplicity and language independence. Although we do not have a definitive answer on how to implement variability, we want highlight opportunities to improve preprocessors and encourage research toward novel preprocessor-based approaches.
Physical separation with class refinements and method refinements à la AHEAD and virtual separation using annotations à la #ifdef or CIDE are two competing groups of implementation approaches for software product lines with complementary advantages. Although both groups have been mainly discussed in isolation, we strive for an integration to leverage the respective advantages. In this paper, we provide the basis for such an integration by providing a model that supports both, physical and virtual separation, and by describing refactorings in both directions. We prove the refactorings complete, such that every virtually separated product line can be automatically transformed into a physically separated one (replacing annotations by refinements) and vice versa. To demonstrate the feasibility of our approach, we have implemented the refactorings in our tool CIDE and conducted four case studies.
Programs can be composed from features. We want to verify automatically that all legal combinations of features can be composed safely without errors. Prior work on this problem assumed that features add code monotonically. We generalize prior work to enable features to both add and remove code, describe our analyses and implementation, and review case studies. We observe that more expressive features can increase the complexity of developed programs rapidly – up to the point where automated concepts as presented in this paper are not a helpful tool but a necessity for verification.
Feature-oriented software development (FOSD) is a paradigm for the construction, customization, and synthesis of large-scale software systems. In this survey, we give an overview and a personal perspective on the roots of FOSD, connections to other software development paradigms, and recent developments in this field. Our aim is to point to connections between different lines of research and to identify open issues.
A software product-line is a family of related programs that are distinguished in terms of features. A feature implements a stakeholders' requirement. Different program variants specified by distinct feature selections are produced from a common code base. The optional feature problem describes a common mismatch between variability intended in the domain and dependencies in the implementation. When this occurs, some variants that are valid in the domain cannot be produced due to implementation issues. There are many different solutions to the optional feature problem, but they all suffer from drawbacks such as reduced variability, increased development effort, reduced efficiency, or reduced source code quality. In this paper, we examine the impact of the optional feature problem in two case studies in the domain of embedded database systems, and we survey different state-of-the-art solutions and their trade-offs. Our intension is to raise awareness of the problem, to guide developers in selecting an appropriate solution for their product-line project, and to identify opportunities for future research.
Through implicit invocation, procedures are called without explicitly referencing them. Implicit announcement adds to this implicitness by not only keeping implicit which procedures are called, but also where or when – under implicit invocation with implicit announcement, the call site contains no signs of that, or what it calls. Recently, aspect-oriented programming has popularized implicit invocation with implicit announcement as a possibility to separate concerns that lead to interwoven code if conventional programming techniques are used. However, as has been noted elsewhere, as currently implemented it establishes strong implicit dependencies between components, hampering independent software development and evolution. To address this problem, we present a type-based modularization of implicit invocation with implicit announcement that is inspired by how interfaces and exceptions are realized in JAVA. By extending an existing compiler and by rewriting several programs to make use of our proposed language constructs, we found that the imposed declaration clutter tends to be moderate; in particular, we found that for general applications of implicit invocation with implicit announcement, fears that programs utilizing our form of modularization become unreasonably verbose are unjustified.
In software product line engineering, feature composition generates software tailored to specific requirements from a common set of artifacts. Superimposition is a popular technique to merge code pieces belonging to different features. The advent of model-driven development raises the question of how to support the variability of software product lines in modeling techniques. We propose to use superimposition as a model composition technique in order to support variability. We analyze the feasibility of superimposition as a model composition technique, offer a corresponding tool for model composition, and discuss our experiences with three case studies (including one industrial study) using this tool.
The separation of concerns is a fundamental principle in software engineering. Crosscutting concerns are concerns that do not align with hierarchical and block decomposition supported by mainstream programming languages. In the past, crosscutting concerns have been studied mainly in the context of object orientation. Feature orientation is a novel programming paradigm that supports the (de)composition of crosscutting concerns in a system with a hierarchical block structure. By means of two case studies we explore the problem of crosscutting concerns in functional programming and propose two solutions based on feature orientation.
Based on a general model of feature composition, we present a composition language that enables programmers by means of quantification and weaving to formulate extensions to programs written in different languages. We explore the design space of composition languages that rely on quantification and weaving and discuss our choices. We outline a tool that extends an existing infrastructure for feature composition and discuss results of three initial case studies.
A software product line (SPL) is a family of related program variants in a well-defined domain, generated from a set of features. A fundamental difference from classical application development is that engineers develop not a single program but a whole family with hundreds to millions of variants. This makes it infeasible to separately check every distinct variant for errors. Still engineers want guarantees on the entire SPL. A further challenge is that an SPL may contain artifacts in different languages (code, documentation, models, etc.) that should be checked. In this paper, we present CIDE, an SPL development tool that guarantees syntactic correctness for all variants of an SPL. We show how CIDE's underlying mechanism abstracts from textual representation and we generalize it to arbitrary languages. Furthermore, we automate the generation of safe plug-ins for additional languages from annotated grammars. To demonstrate the language-independent capabilities, we applied CIDE to a series of case studies with artifacts written in Java, C++, C, Haskell, ANTLR, HTML, and XML.
Tools support is crucial for the acceptance of a new programming language. However, providing such tool support is a huge investment that can usually not be provided for a research language. With FeatureIDE, we have built an IDE for AHEAD that integrates all phases of featureoriented software development. To reuse this investment for other tools and languages, we refactored FeatureIDE into an open source framework that encapsulates the common ideas of feature-oriented software development and that can be reused and extended beyond AHEAD. Among others, we implemented extensions for FeatureC++ and FeatureHouse, but in general, FeatureIDE is open for everybody to showcase new research results and make them usable to a wide audience of students, researchers, and practitioners.
The size of the structured query language (SQL) continuously increases. Extensions of SQL for special domains like stream processing or sensor networks come with own extensions, more or less unrelated to the standard. In general, underlying DBMS support only a subset of SQL plus vendor specific extensions. In this paper, we analyze application domains where special SQL dialects are needed or are already in use. We show how SQL can be decomposed to create an extensible family of SQL dialects. Concrete dialects, e.g., a dialect for web databases, can be generated from such a family by choosing SQL features à la carte. A family of SQL dialects simplifies analysis of the standard when deriving a concrete dialect, makes it easy to understand parts of the standard, and eases extension for new application domains. It is also the starting point for developing tailor-made data management solutions that support only a subset of SQL. We outline how such customizable DBMS can be developed and what benefits, e.g., improved maintainability and performance, we can expect from this.
Database schemas are used to describe the logical design of a database. Diverse groups of users have different perspectives on the schema which leads to different local schemas. Research has focused on view integration to generate a global, consistent schema out of different local schemas or views. However, this approach seems to be too constrained when the generated global view should be variable and only a certain subset is needed. Variable schemas are needed in software product lines in which products are tailored to the needs of stakeholders. We claim that traditional modeling techniques are not sufficient for expressing a variable database schema. We show that software product line methodologies, when applied to the database schemas, overcome existing limitations and allow the generation of tailor-made database schemas.
Es gibt eine Vielzahl sehr unterschiedlicher Techniken, Sprachen und Werkzeuge zur Entwicklung von Softwareproduktlinien. Trotzdem liegen gemeinsame Mechanismen zu Grunde, die eine Klassifikation in Kompositions- und Annotationsansatz erlauben. Während der Kompositionsansatz in der Forschung große Beachtung findet, kommt im industriellen Umfeld hauptsächlich der Annotationsansatz zur Anwendung. Wir analysieren und vergleichen beide Ansätze anhand von drei repräsentativen Vertretern und identifizieren anhand von sechs Kriterien individuelle Stärken und Schwächen. Wir stellen fest, dass die jeweiligen Stärken und Schwächen komplementär sind. Aus diesem Grund schlagen wir die Integration des Kompositions- und Annotationsansatzes vor, um so die Vorteile beider zu vereinen, dem Entwickler eine breiteres Spektrum an Implementierungsmechanismen zu Verfügung zu stellen und die Einführung von Produktlinientechnologie in bestehende Softwareprojekte zu erleichtern.
Features express the variabilities and commonalities among programs in a software product line (SPL). A feature model defines the valid combinations of features, where each combination corresponds to a program in an SPL. SPLs and their feature models evolve over time. We classify the evolution of a feature model via modifications as refactorings, specializations, generalizations, or arbitrary edits. We present an algorithm to reason about feature model edits to help designers determine how the program membership of an SPL has changed. Our algorithm takes two feature models as input (before and after edit versions), where the set of features in both models are not necessarily the same, and it automatically computes the change classification. Our algorithm is able to give examples of added or deleted products and efficiently classifies edits to even large models that have thousands of features.
Superimposition is a composition technique that has been applied successfully in many areas of software development. Although superimposition is a general-purpose concept, it has been (re)invented and implemented individually for various kinds of software artifacts. We unify languages and tools that rely on superimposition by using the language-independent model of feature structure trees (FSTs). On the basis of the FST model, we propose a general approach to the composition of software artifacts written in different languages, Furthermore, we offer a supporting framework and tool chain, called FEATUREHOUSE. We use attribute grammars to automate the integration of additional languages, in particular, we have integrated Java, C#, C, Haskell, JavaCC, and XML. Several case studies demonstrate the practicality and scalability of our approach and reveal insights into the properties a language must have in order to be ready for superimposition.
Software product lines can be implemented with many different approaches. However, there are common underlying mechanisms which allow a classification into compositional and annotative approaches. While research focuses mainly on composition approaches like aspect- or feature-oriented programming because those support feature traceability and modularity, in practice annotative approaches like preprocessors are common as they are easier to adopt. In this paper, we compare both groups of approaches and find complementary strengths. We propose an integration of compositional and annotative approaches to combine advantages, increase flexibility for the developer, and ease adoption.
Software product line development is a mature technique to implement similar programs tailored to serve the needs of multiple users while providing a high degree of reuse. This approach also scales for larger product lines that use smaller product lines to fulfill special tasks. In such compositions of SPLs, the interacting product lines depend on each other and programs generated from these product lines have to be correctly configured to ensure correct communication between them. Constraints between product lines can be used to allow only valid combinations of generated programs. This, however, is not sufficient if multiple instances of one product line are involved. In this paper we present an approach that uses UML and OO concepts to model compositions of SPLs. The model extends the approach of constraints between SPLs to constraints between instances of SPLs and integrates SPL specialization. Based on this model we apply a feature-oriented approach to simplify the configuration of complete SPL compositions.
Software product lines (SPLs) enable stakeholders to derive different software products for a domain while providing a high degree of reuse of their code units. Software products are derived in a configuration process by combining different code units. This configuration process becomes complex if SPLs contain hundreds of features. In many cases, a stakeholder is not only interested in functional but also in resulting non-functional properties of a desired product. Because SPLs can be used in different application scenarios alternative implementations of already existing functionality are developed to meet special nonfunctional requirements, like restricted binary size and performance guarantees. To enable these complex configurations we discuss and present techniques to measure nonfunctional properties of software modules and use these values to compute SPL configurations optimized to the users needs.
Modifying an application usually means to stop the application, apply the changes, and start the application again. That means, the application is not available for at least a short time period. This is not acceptable for highly available applications. One reasonable approach which faces the problem of unavailability is to change highly available applications at runtime. To allow extensive runtime adaptation the application must be enabled for unanticipated changes even of already executed program parts. This is due to the fact that it is not predictable what changes become necessary and when they have to be applied. Since Java is commonly used for developing highly available applications, we discuss its shortcomings and opportunities regarding unanticipated runtime adaptation. We present an approach based on Java HotSwap and object wrapping which overcomes the identified shortcomings and evaluate it in a case study.
Implementing software product lines is a challenging task. Depending on the implementation technique the code that realizes a feature is often scattered across multiple code units. This way it becomes difficult to trace features in source code which hinders maintenance and evolution. While previous effort on visualization technologies in software product lines has focused mainly on the feature model, we suggest tool support for feature traceability in the code base. With our tool CIDE, we propose an approach based on filters and views on source code in order to visualize and trace features in source code.
Feature-oriented programming (FOP) is a paradigm that incorporates programming language technology, program generation techniques, and stepwise refinement. In their GPCE'07 paper, Thaker et al. suggest the development of a type system for FOP to guarantee safe feature composition, i.e, to guarantee the absence of type errors during feature composition. We present such a type system along with a calculus for a simple feature-oriented, Java-like language, called Feature Featherweight Java (FFJ). Furthermore, we explore four extensions of FFJ and how they affect type soundness.
A functional aspect is an aspect that has the semantics of a transformation; it is a function that maps a program to an advised program. Functional aspects are composed by function composition. In this paper, we explore functional aspects in the context of aspect-oriented refactoring. We show that refactoring legacy applications using functional aspects is just as flexible as traditional aspects in that (a) the order in which aspects are refactored does not matter, and (b) the number of potential aspect interactions is decreased. We analyze several aspect-oriented programs of different sizes to support our claims.
Feature modules are the building blocks of programs in software product lines (SPLs). A foundational assumption of feature-based program synthesis is that features are composed in a predefined order. Recent work on virtual separation of concerns reveals a new model of feature interactions that shows that feature modules can be quantized as compositions of smaller modules called derivatives. We present this model and examine some of its unintuitive consequences, namely, that (1) a given program can be reconstructed by composing features in any order, and (2) the contents of a feature module (as expressed as a composition of derivatives) is determined automatically by a feature order. We show that different orders allow one to “adjust” the contents of a feature module to isolate and study the impact of interactions that a feature has with other features. Using derivatives, we show the utility of generalizing safe composition (SC), a basic analysis of SPLs that verifies program type-safety, to prove that every legal composition of derivatives (and thus any composition order of features) produces a typesafe program, which is a much stronger SC property.
A software product line (SPL) is an efficient means to generate a family of program variants for a domain from a single code base. However, because of the potentially high number of possible program variants, it is difficult to test all variants and ensure properties like type-safety for the entire SPL. While first steps to type-check an entire SPL have been taken, they are informal and incomplete. In this paper, we extend the Featherweight Java (FJ) calculus with feature annotations to be used for SPLs. By extending FJ's type system, we guarantee that – given a well-typed SPL – all possible program variants are welltyped as well. We show how results from this formalization reflect and help implementing our own language-independent SPL tool CIDE.
Feature-Oriented Software Development (FOSD) provides a multitude of formalisms, methods, languages, and tools for building variable, customizable, and extensible software. Along different lines of research, different notions of a feature have been developed. Although these notions have similar goals, no common basis for evaluation, comparison, and integration exists. We present a feature algebra that captures the key ideas of feature orientation and provides a common ground for current and future research in this field, in which also alternative options can be explored.
We present a feature-based approach, known from software product lines, to the development of service-oriented architectures. We discuss five benefits of such an approach: improvements in modularity, variability, uniformity, specifiability, and typeability. Subsequently, we review preliminary experiences and results, and propose an agenda for further research in this direction.
Building software product lines (SPLs) with features is a challenging task. Many SPL implementations support features with coarse granularity - e.g., the ability to add and wrap entire methods. However, fine-grained extensions, like adding a statement in the middle of a method, either require intricate workarounds or obfuscate the base code with annotations. Though many SPLs can and have been implemented with the coarse granularity of existing approaches, fine-grained extensions are essential when extracting features from legacy applications. Furthermore, also some existing SPLs could benefit from fine-grained extensions to reduce code replication or improve readability. In this paper, we analyze the effects of feature granularity in SPLs and present a tool, called Colored IDE (CIDE), that allows features to implement coarse-grained and fine-grained extensions in a concise way. In two case studies, we show how CIDE simplifies SPL development compared to traditional approaches.
Software product lines (SPL) allow to generate tailormade software by manually configuring reusable core assets. However, SPLs with hundreds of features and millions of possible products require an appropriate support for semi-automated product derivation. This derivation has to be based on non-functional properties that are related to core assets and domain features. Both elements are part of different models connected via complex mappings. We propose a model that integrates features and core assets in order to allow semi-automated product derivation.
Aspect-Oriented Programming (AOP) aims at modularizing crosscutting concerns. AspectJ is a popular AOP language extension for Java that includes numerous sophisticated mechanisms for implementing crosscutting concerns modularly in one aspect. The language allows to express complex extensions, but at the same time the complexity of some of those mechanisms hamper the writing of simple and recurring extensions, as they are often needed especially in software product lines. In this paper we propose an AspectJ extension that introduces a simplified syntax for simple and recurring extensions. We show that our syntax proposal improves evolvability and modularity in AspectJ programs by avoiding those mechanisms that may harm evolution and modularity if misused. We show that the syntax is applicable for up to 74\% of all pointcut and advice mechanisms by analysing three AspectJ case studies.
Aspect-oriented programming (AOP) is a novel programming paradigm that aims at modularizing complex software. It embraces several mechanisms including (1) pointcuts and advice as well as (2) refinements and collaborations. Though all these mechanisms deal with crosscutting concerns, i.e., a special class of design and implementation problems that challenge traditional programming paradigms, they do so in different ways. In this article we explore their relationship and their impact on software modularity. This helps researchers and practitioners to understand their differences and guides to use the right mechanism for the right problem.
Software product line is a paradigm to develop a family of software products with the goal of reuse. In this paper, we focus on a scenario in which different products from different product lines are combined together in a third product line to yield more elaborate products, i.e., a product line consumes products from third product line suppliers. The issue is not how different products can be produced separately, but how they can be combined together. We propose a service-oriented architecture where product lines are regarded as services, yielding a service-oriented product line. This paper illustrates the approach with an example for a service-oriented architecture of a web portal product line supplied by portlet product lines.
Creating a software product line from a legacy application is a difficult task. We propose a tool that helps automating tedious tasks of refactoring legacy applications into features and frees the developer from the burden of performing laborious routine implementations.
Taking an extractive approach to decompose a legacy application into features is difficult and laborious with current approaches and tools. We present a prototype of a tooldriven approach that largely hides the complexity of the task.
Software product lines aim to create highly configurable programs from a set of features. Common belief and recent studies suggest that aspects are well-suited for implementing features. We evaluate the suitability of AspectJ with respect to this task by a case study that refactors the embedded database system Berkeley DB into 38 features. Contrary to our initial expectations, the results were not encouraging. As the number of aspects in a feature grows, there is a noticeable decrease in code readability and maintainability. Most of the unique and powerful features of AspectJ were not needed. We document where AspectJ is unsuitable for implementing features of refactored legacy applications and explain why.
Stepwise refinement (SWR) is fundamental to software engineering. As aspectoriented programming (AOP) is gaining momentum in software development, aspects should be considered in the light of SWR. In this paper, we elaborate the notion of aspect refinement that unifies AOP and SWR at the architectural level. To reflect this unification to the programming language level, we present an implementation technique for refining aspects based on mixin composition along with a set of language mechanisms for refining all kinds of structural elements of aspects in a uniform way (methods, pointcuts, advice). To underpin our proposal, we contribute a fully functional compiler on top of AspectJ, present a non-trivial, medium-sized case study, and derive a set of programming guidelines.
Collaborations are a frequently occurring class of crosscutting concerns. Prior work has argued that collaborations are better implemented using Collaboration Languages (CLs) rather than AspectJ-like Languages (ALs). The main argument is that aspects flatten the objectoriented structure of a collaboration, and introduce more complexity rather than benefits – in other words, CLs and ALs differ with regard to program comprehension. To explore the effects of CL and AL modularization mechanisms on program comprehension, we propose to conduct a series of experiments. We present ideas on how to arrange such experiments that should serve as a starting point and foster a discussion with other researchers.
Schon seit einigen Jahren macht die aspektorientierte Programmierung von sich reden. Daneben zieht in jüngster Zeit die merkmalsorientierte Programmierung die Aufmerksamkeit auf sich. Beide verfolgen mit der Verbesserung der Modularität von Softwarebausteinen ähnliche Ziele, realisieren dies aber auf unterschiedliche Art und Weise - jeweils mit Vor- und Nachteilen.}
The integration of aspects into the methodology of stepwise software development and evolution is still an open issue. This paper focuses on the global quantification mechanism of nowadays aspect-oriented languages that contradicts basic principles of this methodology. One potential solution to this problem is to bound the potentially global effects of aspects to a set of local development steps. We discuss several alternatives to implement such bounded aspect quantification in AspectJ. Afterwards, we describe a concrete approach that relies on meta-data and pointcut restructuring in order to control the quantification of aspects. Finally, we discuss open issues and further work.
Copyright Notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.