Shurui Zhou

[pronunciation: Shu-ray Joe]


5000 Forbes Ave · Pittsburgh, PA 15213
Email: shuruiz (at)

I will be joining the Department of Electrical & Computer Engineering (ECE) at the University of Toronto as an Assistant Professor this Fall!

I am interested in helping distributed and interdisciplinary software teams collaborate more efficiently, especially in the context of modern open-source collaboration forms, fork-based development, and interdisciplinary teams building AI-enabled systems or scientific software. To achieve my goals, I combine advances in tooling and software engineering principles with insights from other disciplines that study human collaboration, drawing on a wide range of research methods. I discover and evaluate existing interventions and develop new ones that steer collaborative development toward better practices.

I received my Ph.D. degree in May 2020 from the Institute for Software Research, School of Computer Science at Carnegie Mellon University. I am very fortunate to work with my advisor, Professor Christian Kästner, and my ‘informal’ advisor and collaborator, Professor Bogdan Vasilescu. I received my Master's degree in Software Engineering from Peking University in 2014, and my Bachelor's degree in Software Engineering from Xi'an Jiaotong University in 2011.


Apr. 28, 2020
Successfully defended my Ph.D. thesis:
"Improving Collaboration Efficiency in Fork-based Development"!

Dec. 2019 - Apr. 2020
[Invited talk:] Improving Collaboration Efficiency for Software Development at:
- Peking University
- Rochester Institute of Technology
- Stevens Institute of Technology
- University of Illinois Urbana-Champaign
- Oregon State University
- Drexel University
- George Mason University
- Stony Brook University
- University of British Columbia
- University of Toronto
- University of Texas at Austin


May 5, 2019
Dagstuhl Seminar 19191 -- Software Evolution in Time and Space: Unifying Version and Variability Management. [Seminar abstract] [lightning talk - Version Control For AI]


  • [Organizer] FOSD 2018 meeting
  • [PC] VariVolution 2020 Workshop
  • [Reviewer] TSE 2020, TSE 2019
  • [Sub-Reviewer] ICSE 2020, FSE 2019, ASE 2019, ICSE 2018, ASE 2017, FSE 2017, SPLC 2017, ICSE 2017, VAMOS 2017, SPLC 2016, ASE 2015 and TSE 2015
Publications


    Ph.D. Thesis
    Committee Members: Christian Kästner, James D. Herbsleb, Laura A. Dabbish, Andrzej Wąsowski.

    MSR 2020 - Mining Challenge
    A. Bhattacharjee, S. Nath, S. Zhou, D. Chakroborti, B. Roy, C. Roy, and K. Schneider. An Exploratory Study to Find Motives behind Cross-platform Forks from Software Heritage Dataset. In Proceedings of the 17th International Conference on Mining Software Repositories (MSR) - Mining Challenge Track, 2020.
    The fork-based development mechanism provides the flexibility and the unified processes for software teams to collaborate easily in a distributed setting without too much coordination overhead. Currently, multiple social coding platforms support fork-based development, such as GitHub, GitLab, and Bitbucket. Although these platforms share virtually the same features, they have different emphases. As GitHub is the most popular platform and its data is publicly available, most current studies focus on GitHub-hosted projects. However, we observed anecdotal evidence that people are confused about choosing among these platforms and that some projects are migrating from one platform to another; the reasons behind these activities remain unknown. With the advances of the Software Heritage Graph Dataset (SWHGD), we have the opportunity to investigate forking activities across platforms. In this paper, we conduct an exploratory study on 10 popular open-source projects to identify cross-platform forks and investigate the motivation behind them. Preliminary results show that cross-platform forks do exist. For the 10 subject systems used in this study, we found 81,357 forks in total, among which 179 forks are on GitLab. Based on our qualitative analysis, we found that most of the cross-platform forks that we identified are mirrors of the repositories on another platform, but we still find cases that were created due to a preference for certain functionalities (e.g., Continuous Integration (CI)) supported by different platforms. This study lays the foundation for future research directions, such as understanding the differences between platforms and supporting cross-platform collaboration.
    ICGSE 2020
    K. Constantino, S. Zhou, M. Souza, E. Figueiredo, and C. Kästner. Understanding Collaborative Software Development: An Interview Study. In Proceedings of the 15th ACM/IEEE International Conference on Global Software Engineering (ICGSE), 2020. [to appear]
    In globally distributed software development, many software developers have to collaborate and deal with issues of collaboration. Although collaboration is challenging, collaborative development produces better software than any developer could produce alone. Unlike previous work, which focuses on the proposal and evaluation of models and tools to support collaborative work, this paper presents an interview study aiming to understand (i) the motivations for, (ii) the ways of, and (iii) the challenges and barriers to collaborative software development. After interviewing twelve experienced software developers from GitHub, we found different types of collaborative contributions, such as in management of requests for changes. Our analysis also indicates that the main barriers to collaboration are related to non-technical rather than technical issues.
    ICSE 2020
    S. Zhou, B. Vasilescu, C. Kästner. How Has Forking Changed in the Last 20 Years? A Study of Hard Forks on GitHub. In Proceedings of the 42nd International Conference on Software Engineering (ICSE), 2020. Acceptance rate: 20.9% (129/617)
    The notion of forking has changed with the rise of distributed version control systems and social coding environments, like GitHub. Traditionally, forking refers to splitting off an independent development branch (which we call hard forks); research on hard forks, conducted mostly in pre-GitHub days, showed that hard forks were often viewed critically, as they may fragment a community. Today, in social coding environments, open-source developers are encouraged to fork a project in order to contribute to the community (which we call social forks), which may have also influenced perceptions and practices around hard forks. To revisit hard forks, we identify, study, and classify 15,306 hard forks on GitHub and interview 18 owners of hard forks or forked repositories. We find that, among other things, hard forks often evolve out of social forks rather than being planned deliberately, and that perceptions of hard forks have indeed changed dramatically, with hard forks now often seen as a positive, noncompetitive alternative to the original project.


    ASE 2019 Doctoral Symposium
    S. Zhou. Improving Collaboration Efficiency in Fork-based Development. In Proceedings of the Companion of the International Conference on Automated Software Engineering (ASE), New York, NY: ACM Press, 2019.
    FSE 2019
    S. Zhou, B. Vasilescu, C. Kästner. What the Fork: A Study of Inefficient and Efficient Forking Practices in Social Coding. In Proceedings of the 27th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), 2019. Acceptance rate: 24% (74/303)
    Forking and pull requests have been widely used in open-source communities as a uniform development and contribution mechanism, which gives developers the flexibility to modify their own fork without affecting others. However, some projects observe severe inefficiencies, including lost and duplicate contributions and fragmented communities. We observed that different communities experience these inefficiencies to widely different degrees, and interviewed practitioners indicate several project characteristics and practices, including modularity and coordination mechanisms, that may encourage more efficient forking practices. In this paper, we explore how open-source projects on GitHub differ with regard to forking inefficiencies. Using logistic regression models, we analyzed the association of context factors with the inefficiencies and found that better modularity and centralized management can encourage more contributions and a higher fraction of accepted pull requests, suggesting specific good practices that project maintainers can adopt to reduce forking-related inefficiencies in their community.
    ISSRE 2019
    J. Liang, Y. Hou, S. Zhou, J. Chen, Y. Xiong, G. Huang. How to Explain a Patch: An Empirical Study of Patch Explanations in Open Source Projects. The 30th International Symposium on Software Reliability Engineering (ISSRE), Berlin, Germany, October 2019.
    Bugs are inevitable in software development and maintenance processes. Recently a lot of research efforts have been devoted to automatic program repair, aiming to reduce the efforts of debugging. However, since it is difficult to ensure that the generated patches meet all quality requirements such as correctness, developers still need to review the patch. In addition, current techniques produce only patches without explanation, making it difficult for the developers to understand the patch. Therefore, we believe a more desirable approach should generate not only the patch but also an explanation of the patch. To generate a patch explanation, it is important to first understand how patches were explained. In this paper, we explored how developers explain their patches by manually analyzing 300 merged bug-fixing pull requests from six projects on GitHub. Our contribution is twofold. First, we build a patch explanation model, which summarizes the elements in a patch explanation, and corresponding expressive forms. Second, we conducted a quantitative analysis to understand the distributions of elements, and the correlation between elements and their expressive forms.
    SANER 2019
    L. Ren, S. Zhou, C. Kästner, and A. Wąsowski. Identifying Redundancies in Fork-based Development. In Proceedings of the 27th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), 2019. Acceptance rate: 27% (40/148)
    Fork-based development is popular and easy to use, but makes it difficult to maintain an overview of the whole community when the number of forks increases. This may lead to redundant development where multiple developers are solving the same problem in parallel without being aware of each other. Redundant development wastes effort for both maintainers and developers. In this paper, we designed an approach to identify redundant code changes in forks as early as possible by extracting clues indicating similarities between code changes, and building a machine learning model to predict redundancies. We evaluated the effectiveness from both the maintainer's and the developer's perspectives. The results show that we achieve 57-83% precision for detecting duplicate code changes from the maintainer's perspective, and that we could save developers 1.9-3.0 commits of effort on average. We also show that our approach significantly outperforms the existing state of the art.


    ICSE 2018 - Poster
    L. Ren, S. Zhou, and C. Kästner. Poster: Forks Insight: Providing an Overview of GitHub Forks. In Proceedings of the Companion of the International Conference on Software Engineering (ICSE), New York, NY: ACM Press, 2018. Poster.
    ICSE 2018
    S. Zhou, Ș. Stănciulescu, O. Leßenich, Y. Xiong, A. Wąsowski, and C. Kästner. Identifying Features in Forks. In Proceedings of the 40th International Conference on Software Engineering (ICSE), New York, NY: ACM Press, May 2018. Acceptance rate: 21 % (105/502)
    Fork-based development has been widely used in both the open-source community and industry, because it gives developers the flexibility to modify their own fork without affecting others. Unfortunately, this mechanism has downsides; when the number of forks becomes large, it is difficult for developers to get or maintain an overview of activities in the forks. Current tools provide little help. We introduced INFOX, an approach that automatically identifies unmerged features in forks and generates an overview of active forks in a project. The approach clusters cohesive code fragments using code and network analysis techniques and uses information-retrieval techniques to label clusters with keywords. The clustering is effective, with 90% accuracy on a set of known features. In addition, a human-subject evaluation shows that INFOX can provide actionable insight for developers of forks.
    ICSE 2018
    A. Trockman, S. Zhou, C. Kästner, and B. Vasilescu. Adding Sparkle to Social Coding: An Empirical Study of Repository Badges in the npm Ecosystem. In Proceedings of the 40th International Conference on Software Engineering (ICSE), New York, NY: ACM Press, May 2018. Acceptance rate: 21 % (105/502)
    In fast-paced, reuse-heavy software development, the transparency provided by social coding platforms like GitHub is essential to decision making. Developers infer the quality of projects using visible cues, known as signals, collected from personal profile and repository pages. We report on a large-scale, mixed-methods empirical study of npm packages that explores the emerging phenomenon of repository badges, with which maintainers signal underlying qualities about the project to contributors and users. We investigate which qualities maintainers intend to signal and how well badges correlate with those qualities. After surveying developers, mining 294,941 repositories, and applying statistical modeling and time series analysis techniques, we find that non-trivial badges, which display the build status, test coverage, and up-to-dateness of dependencies, are mostly reliable signals, correlating with more tests, better pull requests, and fresher dependencies. Displaying such badges correlates with best practices, but the effects do not always persist.


    Releng 2015
    S. Zhou, J. Al-Kofahi, T. Nguyen, C. Kästner, and S. Nadi. Extracting Configuration Knowledge from Build Files with Symbolic Analysis. In Proceedings of the 3rd International Workshop on Release Engineering (Releng), 2015.
    Build systems contain a lot of configuration knowledge about a software system, such as under which conditions specific files are compiled. Extracting such configuration knowledge is important for many tools analyzing highly-configurable systems, but very challenging due to the complex nature of build systems. We design an approach, based on SYMake, that symbolically evaluates Makefiles and extracts configuration knowledge in terms of file presence conditions and conditional parameters. We implement an initial prototype and demonstrate feasibility on small examples.


    Internetware 2013
    W. Hao, S. Zhou, T. Yang, R. Zhang, and Q. Wang. 2013. Elastic resource management for heterogeneous applications on PaaS. In Proceedings of the 5th Asia-Pacific Symposium on Internetware (Internetware '13). ACM, New York, NY, USA.
    Elastic resource management is one of the key characteristics of cloud computing systems. Existing elastic approaches focus mainly on the consumption of a single resource, such as CPU, and rarely consider the various features of applications comprehensively. Applications deployed on a PaaS are usually heterogeneous. While sharing the same resources, these applications usually differ considerably in resource consumption. How to deploy these heterogeneous applications on the smallest amount of hardware thus becomes a new research topic. In this paper, we take into consideration an application's CPU consumption, I/O consumption, consumption of other server resources, and request rate, all of which are defined as application features. This paper proposes a practical and effective elasticity approach based on the analysis of application features. The evaluation experiment shows that, compared with the traditional approach, our approach can save up to 32.8% of VMs without a significant increase in average response time or SLA violations.


    Identifying Redundant PRs on GitHub

    We monitor incoming pull requests for each GitHub project and detect potentially redundant pull request pairs to save maintainers' and contributors' effort. This is intended to be a GitHub bot, still under implementation, for our [SANER'19] paper. [code]

    This project is designed as a complement to the GitHub network view. We sift through all active forks and summarize changes with statistics and representative keywords. It is a lightweight, programming-language-independent web service for our [ICSE'18] INFOX paper and the [ICSE'18 poster] paper.