Elie Krevat

Department of Computer Science
Carnegie Mellon University
ekrevat at cs dot cmu dot edu

I'm currently working on the future of transportation with self-driving vehicles, developing distributed ML and deep learning systems for computer vision and autonomy. Previously, I was a Ph.D. student in computer science at Carnegie Mellon, researching many flavors of distributed systems, analytical modeling, applied machine learning, and large-scale data analysis. As a graduate research assistant in the Parallel Data Lab, I was advised by Greg Ganger.


Research

My research interests include combining ML and systems techniques to create smarter, automated, and reactive systems. I'm excited about building tools that surface and learn from complex system relationships through large-scale data analysis. A great application of this is building out the autonomy platform for self-driving cars and trucks.

Distributed ML training and autonomy pipelines for self-driving vehicles

Processing a huge and rich corpus of vehicle logs to extract features and develop ML models for self-driving vehicles poses many unique challenges. It requires efficient distributed training and scoring of models, along with analysis of their performance metrics both on and off the vehicle. This work combines computer vision applications, classical ML methods, and deep learning.
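
To give a flavor of the scoring half of such a pipeline, here is a minimal Python sketch that fans model evaluation out across log partitions with a process pool. Everything in it is hypothetical (the score_partition function, the shard count, the stand-in model output); a production pipeline would run on a cluster scheduler rather than a single host.

    # Minimal sketch: fan out model scoring over sharded vehicle logs.
    # All names are hypothetical; a real pipeline would distribute this
    # work across a cluster rather than one machine's process pool.
    from concurrent.futures import ProcessPoolExecutor

    def score_partition(partition_id: int) -> dict:
        """Load one log shard, run the model, and return summary metrics."""
        # Placeholder: a real implementation would deserialize sensor logs,
        # extract features, and run model inference here.
        predictions = [0.5] * 100  # stand-in model output
        return {"partition": partition_id,
                "mean_score": sum(predictions) / len(predictions)}

    if __name__ == "__main__":
        partitions = range(8)  # stand-in for the set of log shards
        with ProcessPoolExecutor() as pool:
            for metrics in pool.map(score_partition, partitions):
                print(metrics)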

Automated analysis and mitigation of performance problems in service-oriented architectures

Responding to resource-sensitive performance problems is becoming increasingly difficult for system administrators, and increasingly expensive in unnecessary overprovisioning, as distributed and cloud computing applications are built across larger numbers of interconnected shared services. Performance problems arise from many sources, such as service upgrades, configuration errors, and the continuous flow of changing user requests that take different paths through the system. Unfortunately, root-cause diagnosis can take hours or days to isolate and fix a problem, even for system experts.

My dissertation proposes an automated approach to analyze and mitigate performance problems through the reactive provisioning of machines. This "quick fix" leverages end-to-end flow analysis to determine the critical path of requests, to help classify performance issues, and to direct an efficient allocation of resources to the services that affect client-perceived delays. These automated tools surface and learn from complex data relationships; they apply ML, data mining, and graphical and statistical analyses to predict and measure corrective actions. In many cases, a problem can be mitigated within a few minutes of detection, returning client performance to acceptable levels and allowing other diagnosis efforts to continue unconstrained.
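
The critical-path idea can be illustrated with a small, self-contained sketch: model one request's end-to-end flow as a DAG of service spans and find the slowest path through it. This is a generic longest-path computation over made-up services and latencies, not the dissertation's actual implementation.

    # Hypothetical sketch: the critical path of one request, modeled as a
    # DAG of service spans with per-span latencies (in seconds).
    from graphlib import TopologicalSorter

    def critical_path(latency, deps):
        """Return (slowest path, its total latency) through the span DAG.

        latency: span -> seconds; deps: span -> set of upstream spans.
        """
        dist, prev = {}, {}
        for span in TopologicalSorter(deps).static_order():
            up = max(deps.get(span, ()), key=lambda u: dist[u], default=None)
            dist[span] = latency[span] + (dist[up] if up else 0.0)
            prev[span] = up
        end = max(dist, key=dist.get)  # slowest terminal span
        path, total = [], dist[end]
        while end is not None:
            path.append(end)
            end = prev[end]
        return path[::-1], total

    deps = {"frontend": {"auth", "db"}, "auth": set(),
            "db": {"cache"}, "cache": set()}
    latency = {"frontend": 0.010, "auth": 0.004,
               "db": 0.030, "cache": 0.002}
    # Prints the slow path cache -> db -> frontend (~0.042 s); provisioning
    # the services on this path is what shortens client-perceived delay.
    print(critical_path(latency, deps))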

Seeking Efficient Data-Intensive Computing

New programming frameworks for scale-out parallel analysis, such as MapReduce and Hadoop, have become a cornerstone for exploiting large datasets. However, there has been little analysis of how these systems perform relative to the capabilities of the hardware on which they run. We developed simple models of I/O resource consumption and applied them to a map-reduce workload to produce ideal lower bounds on runtime, exposing the inefficiency of popular scale-out systems. Using a simplified dataflow processing tool called Parallel DataSeries (PDS), we also demonstrated that the model's ideal can be approached within 20%, and we explored the reasons for the gap between ideal and actual performance faced by any DISC system built atop standard OS and networking services. We found that disk stragglers and network slowdown effects are the primary culprits for lost efficiency.
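
A back-of-the-envelope version of such a bound, under strong simplifying assumptions (a sort-like job that reads, shuffles, and writes its dataset exactly once across perfectly balanced nodes), might look like the sketch below; the parameter values are illustrative, not the paper's.

    # Illustrative lower-bound model: each phase of a sort-like map-reduce
    # job can go no faster than its most constrained I/O resource.
    def ideal_runtime(input_bytes, nodes, disk_bw, net_bw):
        """Ideal runtime (seconds) assuming perfect balance and overlap
        within each phase: read input, all-to-all shuffle, write output."""
        per_node = input_bytes / nodes
        read = per_node / disk_bw                  # map phase: read input
        shuffle = per_node / min(disk_bw, net_bw)  # all-to-all exchange
        write = per_node / disk_bw                 # reduce phase: write output
        return read + shuffle + write

    # Example: 1 TB over 16 nodes, 100 MB/s disks, ~125 MB/s (1 Gb/s) NICs
    # gives a 1875-second (about 31-minute) ideal; measured runtimes above
    # this bound quantify a system's inefficiency.
    print(ideal_runtime(1e12, 16, 100e6, 125e6))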

Incast: TCP Throughput Collapse in Cluster-based Storage Systems

Building cluster-based storage systems using commodity TCP/IP and Ethernet networks is attractive because of their low cost, ease of use, and the desire to combine routing infrastructures for LAN, SAN, and high-performance computing. Yet an important barrier to their use is the TCP Incast problem, where bursty traffic from synchronized reads in cluster-based storage systems produces a one-to-two order of magnitude TCP throughput collapse. We have studied the network conditions that cause this throughput collapse in both simulation and real-world deployments, examined the effectiveness of TCP- and Ethernet-level solutions, and, in our latest publication, found a practical solution: high-resolution timers that implement a microsecond-granularity TCP retransmission timeout. This solution is both feasible and practical for fast storage networks while remaining safe for wide-area networks, revisiting an older assumption about spurious TCP retransmissions that no longer holds.
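
For concreteness, the sketch below implements the standard TCP RTO estimator (RFC 6298) with its minimum timeout lowered from the conventional hundreds of milliseconds into the microsecond range, which is the essence of the high-resolution-timer fix; the RTT samples and the exact minimum are illustrative, not taken from the paper.

    # Standard RTO estimation (RFC 6298 / Jacobson), but with a minimum
    # timeout of 200 us rather than the conventional ~200 ms, as enabled
    # by microsecond-granularity timers. Sample values are illustrative.
    class RtoEstimator:
        ALPHA, BETA = 1 / 8, 1 / 4  # RFC 6298 smoothing gains

        def __init__(self, min_rto=200e-6):
            self.min_rto = min_rto
            self.srtt = None
            self.rttvar = None

        def on_rtt_sample(self, rtt):
            """Fold in one RTT measurement; return the new RTO (seconds)."""
            if self.srtt is None:  # first sample
                self.srtt, self.rttvar = rtt, rtt / 2
            else:  # RTTVAR is updated before SRTT, per the RFC
                self.rttvar = ((1 - self.BETA) * self.rttvar
                               + self.BETA * abs(self.srtt - rtt))
                self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
            return max(self.min_rto, self.srtt + 4 * self.rttvar)

    est = RtoEstimator()
    for sample in (120e-6, 150e-6, 110e-6):  # datacenter-scale RTTs
        print(f"RTO = {est.on_rtt_sample(sample) * 1e6:.0f} us")
    # The RTO tracks a few hundred microseconds instead of being pinned at
    # a 200 ms floor, so retransmits after Incast losses happen quickly.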


Publications


Other Projects and Presentations


Teaching

At CMU, I TAed 15-712: Advanced Operating Systems and Distributed Systems with Dave Andersen, and 15-213: Introduction to Computer Systems with Greg Ganger and Randy Bryant.

At MIT, I TAed 6.033: Computer System Engineering with Frans Kaashoek while earning my M.Eng. degree.


Background

Before CMU, I completed a B.S. and M.Eng. in computer science at MIT, with a minor in economics. My master's thesis included work from a few summers and a semester of research at IBM T.J. Watson Research Center on system software for the Blue Gene supercomputer. I also spent a few years honing my product development skills at Microsoft as a software design engineer, where I played around with pre-alpha Windows technologies and developed the first two versions of Office Accounting Professional, a stand-alone product and third-party development platform for small businesses.
