I am interested in understanding how developers collaborate in modern social coding settings, such as fork-based software development. My goal is to help developers collaborate with fewer inefficiencies and to help open source communities evolve. I discover and evaluate existing interventions and develop new ones that steer collaborative development with forks toward better practices, such as better coordination among otherwise independent developers.
Our paper "How Has Forking Changed in the Last 20 Years? A Study of Hard Forks on GitHub" was accepted at ICSE 2020!
Our paper "Improving Collaboration Efficiency in Fork-based Development" was accepted at the ASE 2019 Doctoral Symposium!
Our paper "What the Fork: A Study of Inefficient and Efficient Forking Practices in Social Coding" was accepted at ESEC/FSE 2019!
Our paper "Identifying Redundancies in Fork-based Development" was accepted at SANER 2019!
Our paper "Poster: Forks Insight: Providing an Overview of GitHub Forks" was accepted at the ICSE 2018 Poster Track!
Our paper "Adding Sparkle to Social Coding: An Empirical Study of Repository Badges in the npm Ecosystem" was accepted at ICSE 2018!
Forking and pull requests are widely used in open-source communities as a uniform development and contribution mechanism, which gives developers the flexibility to modify their own fork without affecting others. However, some projects observe severe inefficiencies, including lost and duplicate contributions and fragmented communities. We observed that different communities experience these inefficiencies to widely different degrees, and the practitioners we interviewed pointed to several project characteristics and practices, including modularity and coordination mechanisms, that may encourage more efficient forking practices. In this paper, we explore how open-source projects on GitHub differ with regard to forking inefficiencies. Using logistic regression models, we analyzed the association of context factors with the inefficiencies and found that better modularity and centralized management can encourage more contributions and a higher fraction of accepted pull requests, suggesting specific good practices that project maintainers can adopt to reduce forking-related inefficiencies in their community.
Bugs are inevitable in software development and maintenance. Recently, much research effort has been devoted to automatic program repair, aiming to reduce the effort of debugging. However, since it is difficult to ensure that generated patches meet all quality requirements, such as correctness, developers still need to review each patch. In addition, current techniques produce only patches without explanations, making it difficult for developers to understand them. Therefore, we believe a more desirable approach should generate not only the patch but also an explanation of the patch. To generate a patch explanation, it is important to first understand how developers explain patches. In this paper, we explored how developers explain their patches by manually analyzing 300 merged bug-fixing pull requests from six projects on GitHub. Our contribution is twofold. First, we built a patch explanation model, which summarizes the elements in a patch explanation and their corresponding expressive forms. Second, we conducted a quantitative analysis to understand the distribution of elements and the correlation between elements and their expressive forms.
Fork-based development is popular and easy to use, but it makes it difficult to maintain an overview of the whole community as the number of forks increases. This may lead to redundant development, where multiple developers solve the same problem in parallel without being aware of each other. Redundant development wastes effort for both maintainers and developers. In this paper, we designed an approach to identify redundant code changes in forks as early as possible by extracting clues indicating similarities between code changes and building a machine learning model to predict redundancies. We evaluated the effectiveness from both the maintainer's and the developer's perspectives. The results show that we achieve 57-83% precision for detecting duplicate code changes from the maintainer's perspective, and that we could save developers 1.9-3.0 commits of effort on average. We also show that our approach significantly outperforms the existing state of the art.
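As a rough illustration of what "clues indicating similarities between code changes" could look like, the sketch below computes three simple overlap scores for a pair of changes. The clue names (title_sim, file_overlap, code_sim), the dictionary layout of a change, and the use of Jaccard similarity are all assumptions made for this example, not the features used in the paper. In a full pipeline such scores would be fed to a classifier, which is omitted here.

```python
import re


def tokens(text):
    """Return the set of lowercased identifier-like tokens in a string."""
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_clues(change_a, change_b):
    """Compute hypothetical similarity clues for a pair of code changes.

    Each change is a dict with the keys 'title' (str), 'files' (list of
    paths) and 'diff' (changed lines joined into one string). This layout
    is an assumption for the example.
    """
    return {
        "title_sim": jaccard(tokens(change_a["title"]), tokens(change_b["title"])),
        "file_overlap": jaccard(set(change_a["files"]), set(change_b["files"])),
        "code_sim": jaccard(tokens(change_a["diff"]), tokens(change_b["diff"])),
    }


if __name__ == "__main__":
    a = {"title": "Fix null check in parser",
         "files": ["src/parser.py"],
         "diff": "if node is None: return"}
    b = {"title": "Fix null check in parser",
         "files": ["src/parser.py", "README.md"],
         "diff": "if node is None: return"}
    print(similarity_clues(a, b))
```

Pairs whose clues are all high would be candidates for a redundancy warning; where to set the threshold, or how to weight the clues, is exactly what a trained model would decide.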
Fork-based development is widely used in both the open source community and industry, because it gives developers the flexibility to modify their own fork without affecting others. Unfortunately, this mechanism has downsides: when the number of forks becomes large, it is difficult for developers to get or maintain an overview of activities in the forks. Current tools provide little help. We introduced INFOX, an approach that automatically identifies unmerged features in forks and generates an overview of active forks in a project. The approach clusters cohesive code fragments using code and network analysis techniques and uses information-retrieval techniques to label clusters with keywords. The clustering is effective, with 90% accuracy on a set of known features. In addition, a human-subject evaluation shows that INFOX can provide actionable insight for developers of forks.
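The keyword-labeling step can be illustrated with a small TF-IDF sketch. This is not INFOX's implementation: the abstract only says that information-retrieval techniques are used, so the choice of TF-IDF, the function name label_clusters, and the sample tokens are assumptions made for the example. Each cluster is treated as one document, and the terms that are frequent in it but rare in other clusters become its label.

```python
import math
from collections import Counter


def label_clusters(clusters, top_k=2):
    """Label each cluster with its top_k terms ranked by TF-IDF.

    clusters maps a cluster name to the list of tokens extracted from
    its code fragments (a hypothetical input format).
    """
    n = len(clusters)
    # Document frequency: in how many clusters does each term occur?
    df = Counter()
    for toks in clusters.values():
        df.update(set(toks))

    labels = {}
    for name, toks in clusters.items():
        tf = Counter(toks)
        scores = {
            term: (count / len(toks)) * math.log(n / df[term])
            for term, count in tf.items()
        }
        # Sort by descending score, then alphabetically for stable output.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        labels[name] = ranked[:top_k]
    return labels


if __name__ == "__main__":
    example = {
        "cluster_1": ["bluetooth", "bluetooth", "pair", "config"],
        "cluster_2": ["lcd", "lcd", "menu", "config"],
    }
    print(label_clusters(example))
```

In this example "config" occurs in every cluster, so its inverse document frequency is zero and it is never chosen as a label, which is the behavior one wants from a distinguishing keyword.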
In fast-paced, reuse-heavy software development, the transparency provided by social coding platforms like GitHub is essential to decision making. Developers infer the quality of projects using visible cues, known as signals, collected from personal profile and repository pages. We report on a large-scale, mixed-methods empirical study of npm packages that explores the emerging phenomenon of repository badges, with which maintainers signal underlying qualities about the project to contributors and users. We investigate which qualities maintainers intend to signal and how well badges correlate with those qualities. After surveying developers, mining 294,941 repositories, and applying statistical modeling and time series analysis techniques, we find that non-trivial badges, which display the build status, test coverage, and up-to-dateness of dependencies, are mostly reliable signals, correlating with more tests, better pull requests, and fresher dependencies. Displaying such badges correlates with best practices, but the effects do not always persist.
Build systems contain a lot of configuration knowledge about a software system, such as under which conditions specific files are compiled. Extracting such configuration knowledge is important for many tools analyzing highly-configurable systems, but very challenging due to the complex nature of build systems. We design an approach, based on SYMake, that symbolically evaluates Makefiles and extracts configuration knowledge in terms of file presence conditions and conditional parameters. We implement an initial prototype and demonstrate feasibility on small examples.
Elastic resource management is one of the key characteristics of cloud computing systems. Existing elasticity approaches focus mainly on the consumption of a single resource, such as CPU, and rarely consider the various features of applications comprehensively. Applications deployed on a PaaS are usually heterogeneous: while sharing the same resources, they often differ considerably in resource consumption. How to deploy these heterogeneous applications on the smallest amount of hardware thus becomes a new research topic. In this paper, we take into consideration an application's CPU consumption, I/O consumption, consumption of other server resources, and request rate, all of which are defined as application features. We propose a practical and effective elasticity approach based on the analysis of application features. The evaluation shows that, compared with the traditional approach, our approach can save up to 32.8% of VMs without a significant increase in average response time or SLA violations.
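To make the idea of placing heterogeneous applications by several features at once more concrete, here is a minimal sketch that packs applications onto VMs using a first-fit-decreasing heuristic over multiple resource dimensions. The heuristic is a textbook technique chosen for illustration and is not the algorithm from the paper. The function name pack_applications, the feature names "cpu" and "io", and all demand values are assumptions.

```python
def pack_applications(apps, capacity):
    """Place applications on as few VMs as a greedy heuristic manages.

    apps maps an application name to its demand per feature, and
    capacity gives the per-VM capacity for the same features (both are
    hypothetical input formats). Returns a list of VMs, each a list of
    application names.
    """
    def size(demand):
        # An application's size is its largest demand relative to capacity.
        return max(demand[f] / capacity[f] for f in capacity)

    vms = []  # each VM: {"apps": [names], "used": {feature: total demand}}
    # First-fit decreasing: consider the largest applications first.
    for name in sorted(apps, key=lambda n: size(apps[n]), reverse=True):
        demand = apps[name]
        if size(demand) > 1:
            raise ValueError(f"{name} does not fit on an empty VM")
        for vm in vms:
            fits = all(vm["used"][f] + demand[f] <= capacity[f] for f in capacity)
            if fits:
                vm["apps"].append(name)
                for f in capacity:
                    vm["used"][f] += demand[f]
                break
        else:
            # No existing VM has room in every dimension, so start a new one.
            vms.append({"apps": [name], "used": {f: demand[f] for f in capacity}})
    return [vm["apps"] for vm in vms]


if __name__ == "__main__":
    capacity = {"cpu": 1.0, "io": 1.0}
    apps = {
        "web": {"cpu": 0.6, "io": 0.1},
        "db": {"cpu": 0.2, "io": 0.7},
        "batch": {"cpu": 0.5, "io": 0.2},
    }
    print(pack_applications(apps, capacity))
```

In the example, the CPU-heavy "web" application and the I/O-heavy "db" application share one VM because their demands are complementary, while a placement that looked only at CPU would not see that "db" leaves CPU headroom. Request rate, which the abstract also lists as a feature, is left out of this sketch.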