Shurui Zhou

[pronunciation: Shu-ray Joe]


5000 Forbes Ave · Pittsburgh, PA 15213
Email: shuruiz (at)

I am on the job market!

I am a final-year Ph.D. student at the Institute for Software Research, School of Computer Science, Carnegie Mellon University.
My advisor is Professor Christian Kästner.

I am interested in understanding how developers collaborate in modern social coding settings, such as fork-based software development. My goal is to help developers collaborate with fewer inefficiencies and to help open source communities evolve. I discover and evaluate existing interventions and develop new ones that steer collaborative development with forks toward better practices, such as better coordination among otherwise independent developers.


  • Dec. 8 2019

    Our paper How Has Forking Changed in the Last 20 Years? A Study of Hard Forks on GitHub got accepted by ICSE 2020!

  • Aug. 5 2019

    Our paper Improving Collaboration Efficiency in Fork-based Development got accepted by ASE 2019 Doctoral Symposium!

  • May 24 2019

    Our paper What the Fork: A Study of Inefficient and Efficient Forking Practices in Social Coding got accepted for ESEC/FSE 2019!

  • May 5 2019

    Invited to Dagstuhl Seminar 19191 on Software Evolution in Time and Space: Unifying Version and Variability Management. [abstract]

  • Nov. 30 2018

    Our paper Identifying Redundancies in Fork-based Development got accepted for SANER 2019!

  • Mar. 4 2018

    Our paper Poster: Forks Insight: Providing an Overview of GitHub Forks got accepted for ICSE 2018 Poster Track!

  • Dec. 13 2017

    Our paper Identifying Features in Forks got accepted for ICSE 2018!

    Our paper Adding Sparkle to Social Coding: An Empirical Study of Repository Badges in the npm Ecosystem got accepted for ICSE 2018!

Service

  • Organizer of FOSD 2018 meeting.
  • Sub-Reviewer for ICSE 2020, FSE 2019, ASE 2019, ICSE 2018, ASE 2017, FSE 2017, SPLC 2017, ICSE 2017, VAMOS 2017, SPLC 2016, ASE 2015 and TSE 2015.
  • Reviewer for TSE 2019.

Publications

    ASE 2019 Doctoral Symposium

    S. Zhou. Improving Collaboration Efficiency in Fork-based Development. In Proceedings of the Companion of the International Conference on Automated Software Engineering (ASE), New York, NY: ACM Press, 2019. [pdf] [poster] [slides]

    FSE 2019

    S. Zhou, B. Vasilescu, C. Kästner. What the Fork: A Study of Inefficient and Efficient Forking Practices in Social Coding. Proceedings of the 27th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), 2019. Acceptance rate: 24% (74/303) [pdf] [slides]

    Forking and pull requests have been widely used in open-source communities as a uniform development and contribution mechanism, which gives developers the flexibility to modify their own fork without affecting others. However, some projects observe severe inefficiencies, including lost and duplicate contributions and fragmented communities. We observed that different communities experience these inefficiencies to widely different degrees, and the practitioners we interviewed pointed to several project characteristics and practices, including modularity and coordination mechanisms, that may encourage more efficient forking practices. In this paper, we explore how open-source projects on GitHub differ with regard to forking inefficiencies. Using logistic regression models, we analyzed the association of context factors with the inefficiencies and found that better modularity and centralized management can encourage more contributions and a higher fraction of accepted pull requests, suggesting specific good practices that project maintainers can adopt to reduce forking-related inefficiencies in their community.

    ISSRE 2019

    J. Liang, Y. Hou, S. Zhou, J. Chen, Y. Xiong, G. Huang. How to Explain a Patch: An Empirical Study of Patch Explanations in Open Source Projects. The 30th International Symposium on Software Reliability Engineering (ISSRE), Berlin, Germany, October 2019. [pdf]

    Bugs are inevitable in software development and maintenance processes. Recently, much research effort has been devoted to automatic program repair, aiming to reduce the effort of debugging. However, since it is difficult to ensure that the generated patches meet all quality requirements such as correctness, developers still need to review the patch. In addition, current techniques produce only patches without explanation, making it difficult for developers to understand the patch. Therefore, we believe a more desirable approach should generate not only the patch but also an explanation of the patch. To generate a patch explanation, it is important to first understand how patches have been explained. In this paper, we explored how developers explain their patches by manually analyzing 300 merged bug-fixing pull requests from six projects on GitHub. Our contribution is twofold. First, we built a patch explanation model, which summarizes the elements in a patch explanation and their corresponding expressive forms. Second, we conducted a quantitative analysis to understand the distributions of elements and the correlation between elements and their expressive forms.

    SANER 2019

    L. Ren, S. Zhou, C. Kästner, and A. Wąsowski. Identifying Redundancies in Fork-based Development. In Proceedings of the 27th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), 2019. Acceptance rate: 27% (40/148) [pdf] [slides]

    Fork-based development is popular and easy to use, but makes it difficult to maintain an overview of the whole community when the number of forks increases. This may lead to redundant development where multiple developers are solving the same problem in parallel without being aware of each other. Redundant development wastes effort for both maintainers and developers. In this paper, we designed an approach to identify redundant code changes in forks as early as possible by extracting clues indicating similarities between code changes, and building a machine learning model to predict redundancies. We evaluated the effectiveness from both the maintainer's and the developer's perspectives. The results show that we achieve 57-83% precision for detecting duplicate code changes from the maintainer's perspective, and that we could save developers 1.9-3.0 commits of effort on average. We also show that our approach significantly outperforms the existing state of the art.

    ICSE 2018 Poster Track

    L. Ren, S. Zhou, and C. Kästner. Poster: Forks Insight: Providing an Overview of GitHub Forks. In Proceedings of the Companion of the International Conference on Software Engineering (ICSE), New York, NY: ACM Press, 2018. Poster. [pdf]

    ICSE 2018

    S. Zhou, Ș. Stănciulescu, O. Leßenich, Y. Xiong, A. Wąsowski, and C. Kästner. Identifying Features in Forks. In Proceedings of the 40th International Conference on Software Engineering (ICSE), New York, NY: ACM Press, May 2018. Acceptance rate: 21% (105/502) [pdf] [slides]

    Fork-based development has been widely used both in the open source community and in industry, because it gives developers the flexibility to modify their own fork without affecting others. Unfortunately, this mechanism has downsides: when the number of forks becomes large, it is difficult for developers to get or maintain an overview of activities in the forks. Current tools provide little help. We introduced INFOX, an approach that automatically identifies unmerged features in forks and generates an overview of active forks in a project. The approach clusters cohesive code fragments using code and network analysis techniques and uses information-retrieval techniques to label clusters with keywords. The clustering is effective, with 90% accuracy on a set of known features. In addition, a human-subject evaluation shows that INFOX can provide actionable insight for developers of forks.

    ICSE 2018

    A. Trockman, S. Zhou, C. Kästner, and B. Vasilescu. Adding Sparkle to Social Coding: An Empirical Study of Repository Badges in the npm Ecosystem. In Proceedings of the 40th International Conference on Software Engineering (ICSE), New York, NY: ACM Press, May 2018. Acceptance rate: 21% (105/502) [pdf]

    In fast-paced, reuse-heavy software development, the transparency provided by social coding platforms like GitHub is essential to decision making. Developers infer the quality of projects using visible cues, known as signals, collected from personal profile and repository pages. We report on a large-scale, mixed-methods empirical study of npm packages that explores the emerging phenomenon of repository badges, with which maintainers signal underlying qualities about the project to contributors and users. We investigate which qualities maintainers intend to signal and how well badges correlate with those qualities. After surveying developers, mining 294,941 repositories, and applying statistical modeling and time series analysis techniques, we find that non-trivial badges, which display the build status, test coverage, and up-to-dateness of dependencies, are mostly reliable signals, correlating with more tests, better pull requests, and fresher dependencies. Displaying such badges correlates with best practices, but the effects do not always persist.

    Releng 2015

    S. Zhou, J. Al-Kofahi, T. Nguyen, C. Kästner, and S. Nadi. Extracting Configuration Knowledge from Build Files with Symbolic Analysis. In Proceedings of the 3rd International Workshop on Release Engineering (Releng), pages 20-23, New York, NY: ACM Press, May 2015.

    Build systems contain a lot of configuration knowledge about a software system, such as under which conditions specific files are compiled. Extracting such configuration knowledge is important for many tools analyzing highly-configurable systems, but very challenging due to the complex nature of build systems. We design an approach, based on SYMake, that symbolically evaluates Makefiles and extracts configuration knowledge in terms of file presence conditions and conditional parameters. We implement an initial prototype and demonstrate feasibility on small examples.

    Internetware 2013

    W. Hao, S. Zhou, T. Yang, R. Zhang, and Q. Wang. 2013. Elastic resource management for heterogeneous applications on PaaS. In Proceedings of the 5th Asia-Pacific Symposium on Internetware (Internetware '13). ACM, New York, NY, USA.

    Elastic resource management is one of the key characteristics of cloud computing systems. Existing elastic approaches focus mainly on the consumption of a single resource, such as CPU, and rarely consider the full range of application features. Applications deployed on a PaaS are usually heterogeneous: while sharing the same resources, they often differ considerably in resource consumption. How to deploy these heterogeneous applications on the smallest amount of hardware thus becomes a new research topic. In this paper, we take into consideration an application's CPU consumption, I/O consumption, consumption of other server resources, and request rate, all of which are defined as application features. This paper proposes a practical and effective elasticity approach based on the analysis of application features. The evaluation experiment shows that, compared with a traditional approach, our approach can save up to 32.8% of VMs without a significant increase in average response time or SLA violations.


    Identifying Redundant PRs on GitHub

    We monitor the incoming pull requests of each GitHub project and detect potentially redundant pull request pairs to save maintainers' and contributors' effort. This is intended to be a GitHub bot, still under implementation, for our [SANER'19] paper. [code]

    This project is designed as a complement to the GitHub network view. We sift through all active forks and summarize their changes with statistics and representative keywords. It is a lightweight, programming-language-independent web service for our [ICSE'18] INFOX paper and the [ICSE'18 poster] paper.