In this talk, we first introduce Alluxio, a memory-speed distributed storage system. Next, we enumerate the different types of I/O operations supported by Alluxio, contrasting their advantages and disadvantages with respect to performance and fault tolerance. We then zero in on asynchronous writes, which allow applications to write data to Alluxio at memory-speed (or network-speed for fault tolerance) while Alluxio asynchronously writes the data to remote storage. We highlight interesting design and implementation details and present experimental evaluation of performance and fault tolerance.

Jiří Šimša is a software engineer at Alluxio, Inc. and one of the top committers and a project management committee member of the Alluxio open source project. Prior to working on Alluxio, he was a software engineer at Google, where he worked on a framework for the Internet of Things. Jiri is a CSD PhD program graduate and PDL alumnus. Go Pens!

Faculty Host: Garth Gibson

As more data and computation move from local to cloud environments, datacenter distributed systems have become a dominant backbone for many modern applications. However, the complexity of cloud-scale hardware and software ecosystems has outpaced existing testing, debugging, and verification tools.

I will describe three new classes of bugs that often appear in large-scale datacenter distributed systems: (1) distributed concurrency bugs, caused by non-deterministic timings of distributed events such as message arrivals as well as multiple crashes and reboots; (2) limpware-induced performance bugs, design bugs that surface in the presence of "limping" hardware and cause cascades of performance failures; and (3) scalability bugs, latent bugs that are scale dependent, typically only surface in large-scale deployments (100+ nodes) but not necessarily in small/medium-scale deployments.

The findings above are based on our long, large-scale cloud bug study (3000+ bugs) and cloud outage study (500+ outages). I will present some of our work in understanding and combating the three classes of bugs above, including semantic-aware model checking (SAMC), taxonomy of distributed concurrency bugs (TaxDC), impacts of limpware ("Limplock"), path-based speculative execution (PBSE), and scalability checks (SCk).

Haryadi Gunawi is a Neubauer Family Assistant Professor in the Department of Computer Science at the University of Chicago where he leads the UCARE research group (UChicago systems research on Availability, Reliability, and Efficiency). He received his Ph.D. in Computer Science from the University of Wisconsin, Madison in 2009. He was a postdoctoral fellow at the University of California, Berkeley from 2010 to 2012. His current research focuses on cloud computing reliability and new storage technology. He has won numerous awards including NSF CAREER award, NSF Computing Innovation Fellowship, Google Faculty Research Award, NetApp Faculty Fellowships, and Honorable Mention for the 2009 ACM Doctoral Dissertation Award.

Faculty Host: Garth Gibson

His research focus is in improving dependability of storage and cloud computing systems in the context of (1) performance stability, wherein he is interested in building storage and distributed systems that are robust to "limping" hardware, (2) reliability, wherein he is interested in combating non-deterministic concurrency bugs in cloud-scale distributed systems, and (3) scalability, wherein he is interested in developing approaches to find latent scalability bugs that only appear in large-scale deployments.

CNTK is Microsoft's open-source deep learning framework designed from the ground up for scalability and flexibility. The latest v2.0 release contains both C++ and Python APIs, with a built-in Layer Library for high-level model description. The toolkit has been extensively used internally at Microsoft for video, image, text and speech data. CNTK is well-regarded as the fastest toolkit for recurrent neural networks, which can easily be 5-10x faster than other toolkits such as TensorFlow and Torch. It was the key to Microsoft Research's recent breakthrough in speech recognition by reaching human parity in conversational speech recognition. In this talk, I will explain how CNTK achieves significant speed up through symbolic loops, sequence batching, and new schemes of data-parallel training.

For more details about CNTK.

Cha Zhang, Principal Researcher, Microsoft Research Dr. Cha Zhang is a Principal Researcher at Microsoft Research, and he currently manages the CNTK team. He received the B.S. and M.S. degrees from Tsinghua University, Beijing, China in 1998 and 2000, respectively, both in Electronic Engineering, and the Ph.D. degree in Electrical and Computer Engineering from Carnegie Mellon University, in 2004. Before joining the CNTK team, he spent over 10 years developing audio/image/video processing and machine learning techniques, and has published over 80 technical papers and held 20+ US patents. He won the best paper award at ICME 2007 and the best student paper award at ICME 2010. He was the Program Co-Chair for VCIP 2012, and the General Co-Chair for ICME 2016. He currently serves as an Associate Editor for IEEE Trans. on Circuits and Systems for Video Technology, and IEEE Trans. on Multimedia.

Faculty Host: Garth Gibson

TensorFlow is an open-source machine learning system, originally developed by the Google Brain team, which operates at large scale and in heterogeneous environments. TensorFlow trains and executes a variety of machine learning models at Google, including deep neural networks for image recognition and machine translation. The system uses dataflow graphs to represent stateful computations, and achieves high performance by mapping these graphs across clusters of machines containing multi-core CPUs, general-purpose GPUs, and custom-designed ASICs known as Tensor Processing Units (TPUs). In this talk, I will give a brief overview of how TensorFlow was designed, and how it differs from contemporary systems. I will also highlight the opportunities for future systems research on top of TensorFlow.

For more details about TensorFlow.

Derek Murray is a Software Engineer in the Google Brain team, working on TensorFlow. Previously, he was a researcher at Microsoft Research Silicon Valley where he worked on the Naiad project, and he received his PhD in Computer Science from the University of Cambridge.

Faculty Host: Garth Gibson

Discrete GPUs provide massive parallelism to support today’s most interesting high throughput workloads such as deep learning, computational finance, and visual analytics. Intel is making strides in increasing the capability of the GPU on the SoC to support these workloads and there are cases where an integrated GPU can be a compelling solution with a lower total cost of ownership for GPGPU computing.  In this talk we will go into the architectural details of the GPGPU Architecture of Intel Processor Graphics and address the question: How do I program the full teraflop GPU integrated with my CPU?

Adam Lake is a member of Intel’s GPGPU architecture team with a current focus on Compute/GPGPU Architecture. He represented Intel for OpenCL 1.2 and 2.0 and was instrumental in the design of features including shared virtual memory, device side enqueue, improving the execution and memory models, and driving support for an intermediate representation. He was a Sr. Software Architect on Larrabee, now known as Xeon Phi, and has over 40 patents or patents pending. Adam worked previously in non-photorealistic rendering and the design of stream programming systems which included the implementation of simulators, assemblers, and compilers. He did his undergraduate work at the University of Evansville, his graduate studies at UNC Chapel Hill, and spent time at Los Alamos National Laboratory. He has been a co-author on 2 SIGGRAPH papers, numerous book chapters and other peer reviewed papers in the field of computer graphics, and was the editor of Game Programming Gems 8.

Girish Ravunnikutty is a member of  GPGPU architecture team at Intel. During his career at Intel, Girish’s major focus has been GPU compute performance analysis and path finding features for future GPU architectures. His analysis and optimizations efforts led to multiple software design wins for Intel Graphics. Girish architected the first OpenCL performance analysis tool from Intel. Before joining Intel, Girish worked with Magma Design Automation and IBM labs. He did his Master’s specializing in GPU Compute at University of Florida, Gainesville, and he worked with Oakridge National Laboratories accelerating Particle in cell algorithm on GPU’s.

Visitor Host: Mike Kozuch

In AWS, we are running large scale cloud services that are the core platform for and millions of AWS customers. Building and operating these systems at scale has taught us several lessons and best practices: 1, how does one determine what they are building is right for their customer 2, how does one architect for scale and ensure correctness, 3, how does one test these large scale systems and 4, how do you deploy systems globally, etc.? Throughout this presentation, I will talk through these learnings as they apply to various systems I have built (such as DynamoDB, Paxos based systems, AI services like Rekognition). We will finish the talk with discussion on how AWS is exposing AI technologies to our customers to drive the development of cutting edge technology solutions.

Swami is VP in AWS in charge of all Amazon AI and Machine learning initiatives. More  details on Amazon AI.  (blog)

Previously, Swami was the General Manager of NoSQL and some big data services in AWS. He managed the engineering, product management and operations for core AWS database services that are the foundational building blocks for AWS: DynamoDB, Amazon ElastiCache (in-memory engines), Amazon QuickSight, SimpleDB and a few other big data services in the works. Swami has been awarded more than 200 patents, authored around 40 referred scientific papers and journals, and participate in several academic circles and conferences.

In addition to these, he also built more than 15 AWS Cloud Services like CloudFront, Amazon RDS, Amazon S3, Amazon's Paxos based lock service, original Amazon Dynamo etc. He was also one of the main authors for Amazon Dynamo paper along with Werner Vogels. Amazon Dynamo now is the foundation for many other NoSQL systems like Riak, Cassandra and Voldemort.

Faculty Host: Majd Sakr

Big data analytics requires high programmer productivity and high performance simultaneously on large-scale clusters. However, current big data analytics frameworks (e.g. Apache Spark) have prohibitive runtime overheads since they are library-based. We introduce an auto-parallelizing compiler approach that exploits the characteristics of the data analytics domain and is accurate, unlike previous auto-parallelization methods. We build High Performance Analytics Toolkit (HPAT), which parallelizes high-level scripting (Julia) programs automatically, generates efficient MPI/C++ code, and provides resiliency. Furthermore, HPAT provides automatic optimizations for scripting programs, such as fusion of array operations. Thus, HPAT is 369x to 2033x faster than Spark on the Cori supercomputer at LBL/NERSC and 20x-256x on Amazon AWS for machine learning benchmarks.

We also propose a compiler-based approach for integrating data frames into HPAT to build HiFrames. It automatically parallelizes and compiles relational operations along with other array computations in end-to-end data analytics programs, and generates efficient MPI/C++ code. HiFrames is 3.6x to 70x faster than Spark SQL for basic relational operations and can be several orders of magnitude faster for advanced operations.

Ehsan Totoni is a Research Scientist at Intel Labs. He develops programming systems for large-scale HPC and big data analytics applications with a focus on productivity and performance. He received his Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign in 2014. During his Ph.D. studies, he was a member of the Charm++/AMPI team working on performance and energy efficiency of HPC applications using adaptive runtime techniques.

Faculty Host: MIke Kozuch

The Square Kilometer Array radio telescope, currently under design by institutions in 10 countries for deployment in remote deserts around 2022, will be a revolutionary scientific instrument to observe the universe. Ultra large HPC systems will transform a massive stream of antenna data -- as much as an exa-byte per day -- into scientific data placed into an archive for worldwide consumption. The steepest challenges include extreme parallelism in the algorithms and providing 200 PB/sec of memory bandwidth under strict power constraints. This presentation covers an overview of the telescope and of the software and system architecture that is currently under development.

Peter Braam is a scientist and entrepreneur focused on problems in large scale computing. Originally trained as a mathematician, he has worked at several academic institutions including Oxford, CMU and Cambridge. One of his startup companies developed the Lustre file system which is widely used. During the last few years he has focused on computing for the SKA telescope.

Faculty Host: M. Satyanarayanan

Alluxio, formerly Tachyon, is an open source memory speed virtual distributed storage system. In this talk, I will first introduce the Alluxio project and then describe the Alluxio architecture, focusing on two of its distinguishing features: tiered storage and unified namespace. Tiered storage provides applications running on top of Alluxio with the ability to store data in local storage tiers (memory, SSDs, and hard drives), transparently managing data based on pluggable policies. Unified namespace provides applications running on top of Alluxio with the ability to access data from heterogenous set of remote storage systems (such as HDFS, S3, or GCS) through the same API and namespace.

Jiří Šimša is a software engineer at Alluxio, Inc. and one of the top committers and a project management committee member of the Alluxio open source project. Prior to working on Alluxio, he was a software engineer at Google, where he worked on a framework for the Internet of Things. Jiri is a CMU and PDL alumnus, earning a PhD for his work on automated testing of concurrent systems under the guidance of professors Garth Gibson and Randy Bryant. He is a big fan of the Pittsburgh Penguins.

Jiří welcomes discussion about opportunities at Alluxio after their tech talk.


We will overview current and future work on building foundations for scaling machine learning and graph processing in Apache Spark.

Apache Spark is the most active open source Big Data project, with 1000+ contributors. The ability to scale is a key benefit of Spark: the same code should run on a laptop or 100's to 1000's of machines. Another big attraction is integration of analytics libraries for machine learning (ML) and graph processing.

This talk will cover the juncture between the low-level (scaling) and high-level (analytics) components of Spark. The most important change for ML and graphs on Spark in the past year has been a migration of analytics libraries to use Spark DataFrames instead of RDDs. This ongoing migration is laying the groundwork for future speedups and scaling. In addition to API impacts, we will discuss the integration of analytics with two key initiatives in Spark. The first is Catalyst, the query optimizer underlying DataFrames and Datasets. The second is Tungsten, the project for approaching bare-metal speeds in Spark via memory management and code generation.

Joseph Bradley is an Apache Spark committer and PMC member, working as a Software Engineer at Databricks. He focuses on Spark MLlib, GraphFrames, and other advanced analytics on Spark. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon in 2013.

Faculty Host: Majd Sakr


Subscribe to SDI/ISTC