Computer Science Thesis Proposal
- Remote Access Enabled - Zoom
- Virtual Presentation
- AURICK QIAO
- Ph.D. Student
- Computer Science Department
- Carnegie Mellon University
Co-adaptive Resource Management for Distributed Machine Learning
Over the past decade, machine learning (ML) has found unprecedented success in solving practical problems across diverse application domains, such as recommendation systems, ad-click prediction, sentiment analysis, object detection, and more. Behind this success is an ever-increasing demand for computational resources, which can be leveraged to train larger and more complex models on larger datasets. As the availability of hardware resources trends toward shared and dynamic computing environments such as clouds and data centers, efficient and automatic resource management is quickly becoming a key requirement for machine learning in the real world.
Historically, software frameworks built to support high-performance computing (HPC) or big-data processing workloads, such as MPI and Hadoop, have been re-purposed to additionally support distributed machine learning workloads. More recent frameworks are designed with ML workloads in mind, and have proven to significantly improve ML training time and resource utilization. This thesis proposal takes an evolutionary step in that direction. Most ML-oriented resource management systems view the training algorithm as an application-level procedure that must be exactly preserved. We challenge that notion by presenting new systems which deliberately alter their applications during training. Doing so results in better adaptivity to failures, more efficient resource utilization, and automatic configuration of ML applications in dynamic-resource environments.
Thesis Committee:
Eric P. Xing (Chair)
Gregory R. Ganger
Phillip B. Gibbons
Joseph E. Gonzalez (University of California, Berkeley)
Additional Proposal Information
Zoom Participation. See announcement.