I am a 5th year PhD at Institute for Software Research, School of Computer Science at Carnegie Mellon University.
My advisor is Professor Christian Kästner.
I am interested in understanding how developers collaborate in the modern social coding situation, such as fork-based software development, helping developers better collaborate with less inefficiencies, helping open source communities better evolve. I discover and evaluate existing interventions and develop new ones that steer collaborative development with forks toward better practices, such as better coordination among otherwise independent developers.
Our paper What the Fork: A Study of Inefficient and Efficient Forking Practices in Social Coding got accepted for ESEC/FSE 2019 !
Invited to Dagstuhl Seminar 19191 on Software Evolution in Time and Space: Unifying Version and Variability Management.
Our paper Identifying Redundancies in Fork-based Development got accepted for SANER 2019 !
Our paper Poster: Forks Insight: Providing an Overview of GitHub Forks. got accepted for ICSE 2018 Poster Track !
Our paper Adding Sparkle to Social Coding: An Empirical Study of Repository Badges in the npm Ecosystem got accepted for ICSE 2018 !
Forking and pull requests have been widely used in open-source communities as a uniform development and contribution mechanisms, which gives developers the flexibility to modify their own fork without affecting others. However, some projects observe severe inefficiencies, including lost and duplicate contributions and fragmented communities.We observed that different communities experience these inefficiencies to widely different degrees and interviewed practitioners indicate several project characteristics and practices, including modularity and coordination mechanisms, that may encourage more efficient forking practices. In this paper, we explore how open-source projects on GitHub differ with regard to forking inefficiencies. Using logistic regression models, we analyzed the association of context factors with the inefficiencies and found that better modularity and centralized management can encourage more contributions and a higher fraction of accepted pull requests, suggesting specific good practices that project maintainers can adopt to reduce forking-related inefficiencies in their community.
Fork-based development is popular and easy to use, but makes it difficult to maintain an overview of the whole community when the number of forks increases. This may lead to redundant development where multiple developers are solving the same problem in parallel without being aware of each other. Redundant development wastes effort for both maintainers and developers. In this paper, we designed an approach to identify redundant code changes in forks as early as possible by extracting clues indicating similarities between code changes, and building a machine learning model to predict redundancies. We evaluated the effectiveness from both the maintainer's and the developer's perspectives. The result shows that we achieve 57-83% precision for detecting duplicate code changes from maintainer's perspective, and we could save developers' effort of 1.9-3.0 commits on average. Also, we show that our approach significantly outperforms existing state-of-art.
Fork-based development has been widely used both in open source community and industry, because it gives developers flexibility to modify their own fork without affecting others. Unfortunately, this mechanism has downsides; when the number of forks becomes large, it is difficult for developers to get or maintain an overview of activities in the forks. Current tools provide little help. We introduced INFOX, an approach to automatically identifies not-merged features in forks and generates an overview of active forks in a project. The approach clusters cohesive code fragments using code and network analysis techniques and uses information-retrieval techniques to label clusters with keywords. The clustering is effective, with 90% accuracy on a set of known features. In addition, a human-subject evaluation shows that INFOX can provide actionable insight for developers of forks.
In fast-paced, reuse-heavy software development, the transparency provided by social coding platforms like GitHub is essential to decision making. Developers infer the quality of projects using visible cues, known as signals, collected from personal profile and repository pages. We report on a large-scale, mixed-methods empirical study of npm packages that explores the emerging phenomenon of repository badges, with which maintainers signal underlying qualities about the project to contributors and users. We investigate which qualities maintainers intend to signal and how well badges correlate with those qualities. After surveying developers, mining 294,941 repositories, and applying statistical modeling and time series analysis techniques, we find that non-trivial badges, which display the build status, test coverage, and up-to-dateness of dependencies, are mostly reliable signals, correlating with more tests, better pull requests, and fresher dependencies. Displaying such badges correlates with best practices, but the effects do not always persist.
Build systems contain a lot of configuration knowledge about a software system, such as under which conditions specific files are compiled. Extracting such configuration knowledge is important for many tools analyzing highly-configurable systems, but very challenging due to the complex nature of build systems. We design an approach, based on SYMake, that symbolically evaluates Makefiles and extracts configuration knowledge in terms of file presence conditions and conditional parameters. We implement an initial prototype and demonstrate feasibility on small examples.
Elastic resource management is one of the key characteristics of cloud computing systems. Existing elastic approaches focus mainly on single resource consumption such as CPU consumption, rarely considering comprehensively various features of applications. Applications deployed on a PaaS are usually heterogeneous. While sharing the same resource, these applications are usually quite different in resource consuming. How to deploy these heterogeneous applications on the smallest size of hardware thus becomes a new research topic. In this paper, we take into consideration application's CPU consumption, I/O consumption, consumption of other server resources and application's request rate, all of which are defined as application features. This paper proposes a practical and effective elasticity approach based on the analysis of application features. The evaluation experiment shows that, compared with traditional approach, our approach can save up to 32.8% VMs without significant increase of average response time and SLA violation.