Discrete GPUs provide massive parallelism to support today's most interesting high-throughput workloads, such as deep learning, computational finance, and visual analytics. Intel is making strides in increasing the capability of the GPU on the SoC to support these workloads, and there are cases where an integrated GPU can be a compelling solution with a lower total cost of ownership for GPGPU computing. In this talk we will go into the architectural details of the GPGPU architecture of Intel Processor Graphics and address the question: how do I program the full-teraflop GPU integrated with my CPU?

Adam Lake is a member of Intel's GPGPU architecture team with a current focus on Compute/GPGPU architecture. He represented Intel for OpenCL 1.2 and 2.0 and was instrumental in the design of features including shared virtual memory, device-side enqueue, improving the execution and memory models, and driving support for an intermediate representation. He was a Sr. Software Architect on Larrabee, now known as Xeon Phi, and has over 40 patents or patents pending. Adam worked previously in non-photorealistic rendering and the design of stream programming systems, which included the implementation of simulators, assemblers, and compilers. He did his undergraduate work at the University of Evansville, his graduate studies at UNC Chapel Hill, and spent time at Los Alamos National Laboratory. He has been a co-author on two SIGGRAPH papers, numerous book chapters, and other peer-reviewed papers in the field of computer graphics, and was the editor of Game Programming Gems 8.

Girish Ravunnikutty is a member of the GPGPU architecture team at Intel. During his career at Intel, Girish's major focus has been GPU compute performance analysis and pathfinding features for future GPU architectures. His analysis and optimization efforts led to multiple software design wins for Intel Graphics. Girish architected the first OpenCL performance analysis tool from Intel. Before joining Intel, Girish worked with Magma Design Automation and IBM Labs. He did his Master's specializing in GPU compute at the University of Florida, Gainesville, and worked with Oak Ridge National Laboratory accelerating particle-in-cell algorithms on GPUs.

Visitor Host: Mike Kozuch

In AWS, we are running large-scale cloud services that are the core platform for millions of AWS customers. Building and operating these systems at scale has taught us several lessons and best practices: (1) how does one determine that what they are building is right for their customer; (2) how does one architect for scale and ensure correctness; (3) how does one test these large-scale systems; and (4) how does one deploy systems globally? Throughout this presentation, I will talk through these learnings as they apply to various systems I have built (such as DynamoDB, Paxos-based systems, and AI services like Rekognition). We will finish the talk with a discussion of how AWS is exposing AI technologies to our customers to drive the development of cutting-edge technology solutions.

Swami is a VP at AWS in charge of all Amazon AI and machine learning initiatives; more details are available on the Amazon AI blog.

Previously, Swami was the General Manager of NoSQL and several big data services in AWS. He managed the engineering, product management, and operations for core AWS database services that are foundational building blocks for AWS: DynamoDB, Amazon ElastiCache (in-memory engines), Amazon QuickSight, SimpleDB, and a few other big data services in the works. Swami has been awarded more than 200 patents, has authored around 40 refereed scientific papers and journal articles, and participates in several academic circles and conferences.

In addition, he built more than 15 AWS cloud services, such as CloudFront, Amazon RDS, Amazon S3, Amazon's Paxos-based lock service, and the original Amazon Dynamo. He was also one of the main authors of the Amazon Dynamo paper, along with Werner Vogels. Amazon Dynamo is now the foundation for many other NoSQL systems, such as Riak, Cassandra, and Voldemort.

Faculty Host: Majd Sakr

Big data analytics requires high programmer productivity and high performance simultaneously on large-scale clusters. However, current big data analytics frameworks (e.g., Apache Spark) have prohibitive runtime overheads since they are library-based. We introduce an auto-parallelizing compiler approach that exploits the characteristics of the data analytics domain and is accurate, unlike previous auto-parallelization methods. We build the High Performance Analytics Toolkit (HPAT), which parallelizes high-level scripting (Julia) programs automatically, generates efficient MPI/C++ code, and provides resiliency. Furthermore, HPAT provides automatic optimizations for scripting programs, such as fusion of array operations. As a result, HPAT is 369x to 2033x faster than Spark on the Cori supercomputer at LBL/NERSC and 20x to 256x faster on Amazon AWS for machine learning benchmarks.
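As a toy illustration of the kind of array-operation fusion mentioned above (this sketch is ours, not HPAT's generated code, and HPAT emits MPI/C++ rather than Python), fusing elementwise operations eliminates intermediate arrays and extra passes over the data:

```python
def unfused(a, b, c):
    # Library-style evaluation: each operation materializes a temporary.
    t1 = [x + y for x, y in zip(a, b)]     # temporary array holding a + b
    return [t * z for t, z in zip(t1, c)]  # second pass over the data

def fused(a, b, c):
    # A fusing compiler combines the elementwise ops into a single pass,
    # eliminating the temporary and roughly halving memory traffic.
    return [(x + y) * z for x, y, z in zip(a, b, c)]

a, b, c = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [2.0, 2.0, 2.0]
assert unfused(a, b, c) == fused(a, b, c) == [10.0, 14.0, 18.0]
```

The same idea applies to chains of whole-array operations in scripting languages, where each operator would otherwise allocate and traverse a fresh array.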

We also propose a compiler-based approach for integrating data frames into HPAT to build HiFrames. It automatically parallelizes and compiles relational operations along with other array computations in end-to-end data analytics programs, and generates efficient MPI/C++ code. HiFrames is 3.6x to 70x faster than Spark SQL for basic relational operations and can be several orders of magnitude faster for advanced operations.

Ehsan Totoni is a Research Scientist at Intel Labs. He develops programming systems for large-scale HPC and big data analytics applications with a focus on productivity and performance. He received his Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign in 2014. During his Ph.D. studies, he was a member of the Charm++/AMPI team working on performance and energy efficiency of HPC applications using adaptive runtime techniques.

Faculty Host: Mike Kozuch

The Square Kilometre Array radio telescope, currently under design by institutions in 10 countries for deployment in remote deserts around 2022, will be a revolutionary scientific instrument to observe the universe. Ultra-large HPC systems will transform a massive stream of antenna data -- as much as an exabyte per day -- into scientific data placed into an archive for worldwide consumption. The steepest challenges include extreme parallelism in the algorithms and providing 200 PB/sec of memory bandwidth under strict power constraints. This presentation covers an overview of the telescope and of the software and system architecture that is currently under development.

Peter Braam is a scientist and entrepreneur focused on problems in large scale computing. Originally trained as a mathematician, he has worked at several academic institutions including Oxford, CMU and Cambridge. One of his startup companies developed the Lustre file system which is widely used. During the last few years he has focused on computing for the SKA telescope.

Faculty Host: M. Satyanarayanan

Alluxio, formerly Tachyon, is an open source memory-speed virtual distributed storage system. In this talk, I will first introduce the Alluxio project and then describe the Alluxio architecture, focusing on two of its distinguishing features: tiered storage and unified namespace. Tiered storage provides applications running on top of Alluxio with the ability to store data in local storage tiers (memory, SSDs, and hard drives), transparently managing data based on pluggable policies. Unified namespace provides applications running on top of Alluxio with the ability to access data from a heterogeneous set of remote storage systems (such as HDFS, S3, or GCS) through the same API and namespace.
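The unified-namespace idea can be sketched as follows (our illustration, not Alluxio's implementation; the mount points and URIs are made up): a mount table maps paths in one virtual namespace to URIs in the underlying stores, resolved by longest-prefix matching so applications use a single path scheme regardless of where the data lives.

```python
# Hypothetical mount table: virtual paths -> underlying store URIs.
MOUNT_TABLE = {
    "/data/warehouse": "hdfs://namenode:9000/warehouse",
    "/data/logs": "s3://my-bucket/logs",
}

def resolve(path):
    """Translate a virtual path to the URI of the backing store,
    using the longest matching mount point."""
    for mount, ufs in sorted(MOUNT_TABLE.items(),
                             key=lambda kv: len(kv[0]), reverse=True):
        if path == mount or path.startswith(mount + "/"):
            return ufs + path[len(mount):]
    raise KeyError("no mount point covers " + path)

assert resolve("/data/logs/2017/01.gz") == "s3://my-bucket/logs/2017/01.gz"
```

An application asking for `/data/logs/2017/01.gz` never needs to know the bytes are in S3; swapping the backing store only changes the mount table.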

Jiří Šimša is a software engineer at Alluxio, Inc. and one of the top committers and a project management committee member of the Alluxio open source project. Prior to working on Alluxio, he was a software engineer at Google, where he worked on a framework for the Internet of Things. Jiri is a CMU and PDL alumnus, earning a PhD for his work on automated testing of concurrent systems under the guidance of professors Garth Gibson and Randy Bryant. He is a big fan of the Pittsburgh Penguins.

Jiří welcomes discussion about opportunities at Alluxio after their tech talk.


We will overview current and future work on building foundations for scaling machine learning and graph processing in Apache Spark.

Apache Spark is the most active open source Big Data project, with 1000+ contributors. The ability to scale is a key benefit of Spark: the same code should run on a laptop or on hundreds to thousands of machines. Another big attraction is the integration of analytics libraries for machine learning (ML) and graph processing.

This talk will cover the juncture between the low-level (scaling) and high-level (analytics) components of Spark. The most important change for ML and graphs on Spark in the past year has been a migration of analytics libraries to use Spark DataFrames instead of RDDs. This ongoing migration is laying the groundwork for future speedups and scaling. In addition to API impacts, we will discuss the integration of analytics with two key initiatives in Spark. The first is Catalyst, the query optimizer underlying DataFrames and Datasets. The second is Tungsten, the project for approaching bare-metal speeds in Spark via memory management and code generation.

Joseph Bradley is an Apache Spark committer and PMC member, working as a Software Engineer at Databricks. He focuses on Spark MLlib, GraphFrames, and other advanced analytics on Spark. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon in 2013.

Faculty Host: Majd Sakr

This talk discusses the need for, and a framework for, distributed end-to-end video analytics. Video cameras are ubiquitous nowadays, but limited Internet bandwidth has prevented the video data from being useful in an economically efficient manner. Additionally, many video analytics tasks mandate quick turnaround, demanding (near) real-time decision making. Distributed end-to-end video analytics, which collaboratively uses devices, edge servers, and the cloud as the computing platform, is the way to go. However, distributed visual workloads can be difficult to develop and manage, especially for startups or SMEs, who have been playing an integral role in the Internet of Things. We present a framework for facilitating and expediting the development and management of distributed video analytics workloads across an end-to-end system comprising cameras, gateways, and the cloud.

Dr. Yen-Kuang Chen is a Principal Engineer at Intel Corporation. His research areas span from emerging applications that can utilize the true potential of the Internet of Things to computer architecture that can embrace emerging applications. He has more than 50 US patents and around 100 technical publications. He is one of the key contributors to Supplemental Streaming SIMD Extensions 3 and Advanced Vector Extensions in Intel microprocessors. He is the Editor-in-Chief of the IEEE Journal on Emerging and Selected Topics in Circuits and Systems and an IEEE CAS Distinguished Lecturer. He received his Ph.D. degree from Princeton University and is an IEEE Fellow.

Faculty Host: Kayvon Fatahalian

Graph clustering has many important applications in computing, but due to growing sizes of graphs, even traditionally fast clustering methods such as spectral partitioning can be computationally expensive for real-world graphs of interest. Motivated partly by this, so-called local algorithms for graph clustering have received significant interest due to the fact that they can find good clusters in a graph with work proportional to the size of the cluster rather than that of the entire graph. This feature has proven to be crucial in making such graph clustering and many of its downstream applications efficient in practice. While local clustering algorithms are already faster than traditional algorithms that touch the entire graph, they are sequential and there is an opportunity to make them even more efficient via parallelization. In this talk, we show how to parallelize many of these algorithms in the shared-memory multicore setting, and we analyze the parallel complexity of these algorithms. We present comprehensive experiments on large-scale graphs showing that our parallel algorithms achieve good parallel speedups on a modern multicore machine, thus significantly speeding up the analysis of local graph clusters in the very large-scale setting.
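The flavor of these local algorithms can be seen in a sequential sketch of push-based approximate personalized PageRank in the style of Andersen, Chung, and Lang (our simplified illustration; the talk's contribution is parallelizing such methods, e.g., by processing all above-threshold vertices in each round concurrently):

```python
from collections import defaultdict

def approx_ppr(graph, seed, alpha=0.15, eps=1e-4):
    """Push-based approximate personalized PageRank from `seed`.
    Work is proportional to the cluster around the seed, not the
    whole graph: only vertices holding enough residual are touched."""
    p = defaultdict(float)   # PageRank estimates
    r = defaultdict(float)   # residual (unprocessed) probability mass
    r[seed] = 1.0
    queue = [seed]
    while queue:
        u = queue.pop()
        deg = len(graph[u])
        if r[u] < eps * deg:
            continue                        # stale entry: already small
        mass, r[u] = r[u], 0.0
        p[u] += alpha * mass                # keep a fraction at u
        share = (1.0 - alpha) * mass / deg  # spread the rest to neighbors
        for v in graph[u]:
            r[v] += share
            if r[v] >= eps * len(graph[v]):
                queue.append(v)
    return p

# Two triangles joined by the edge (2, 3); mass concentrates near seed 0.
g = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
     3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
scores = approx_ppr(g, seed=0)
assert scores[0] > scores[4]        # seed's side scores higher
assert sum(scores.values()) < 1.0   # some mass remains as residual
```

A sweep over vertices sorted by these scores then yields a low-conductance cluster around the seed; parallelizing the per-round pushes is where the speedups in the talk come from.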

Julian Shun is currently a Miller Research Fellow (post-doc) at UC Berkeley. He obtained his Ph.D. in Computer Science from Carnegie Mellon University, and his undergraduate degree in Computer Science from UC Berkeley. He is interested in developing large-scale parallel algorithms for graph processing, and parallel text algorithms and data structures. He is also interested in designing methods for writing deterministic parallel programs and benchmarking parallel programs. He has received the ACM Doctoral Dissertation Award, CMU School of Computer Science Doctoral Dissertation Award, Miller Research Fellowship, Facebook Graduate Fellowship, and a best student paper award at the Data Compression Conference.

Faculty Hosts: Guy Blelloch, Phil Gibbons

Most major content providers use content delivery networks (CDNs) to serve web content to their users. CDNs achieve high performance by using a large distributed system of caching servers. The first and fastest caching level in a CDN server is the memory-resident Hot Object Cache (HOC). A major goal of a CDN is to maximize the object hit ratio (OHR) of its HOCs. Maximizing the OHR is challenging because web object sizes are highly variable and HOCs have a small capacity. This challenge has led to a wealth of sophisticated cache eviction policies. In contrast, cache admission policies have received little attention.

This talk presents AdaptSize: a new HOC caching system based on a size-aware cache admission policy. AdaptSize is based on a new statistical cache tuning method that continuously adapts the parameters of its cache admission policy to the request traffic. In experiments with Akamai production traces, AdaptSize improves the OHR by 30-44% over Nginx and by 49-92% over Varnish, which are two widely-used production systems. Further, AdaptSize's tuning method consistently achieves about 80% of the OHR of offline parameter tuning, and is significantly more robust than state-of-the-art cache tuning methods based on hill climbing. To demonstrate feasibility in a production setting, we show that AdaptSize can be incorporated into Varnish with low processing and memory overheads and negligible impact on cache server throughput.
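The core admission idea can be sketched as follows (a simplified illustration based on the description above: admit an object of size s with probability exp(-s/c) in front of an LRU eviction policy; real AdaptSize continuously retunes the parameter c with its statistical tuning method, which this sketch omits by fixing c):

```python
import math
import random
from collections import OrderedDict

class SizeAwareCache:
    """Sketch of size-aware admission in the spirit of AdaptSize:
    on a miss, admit an object of size s with probability exp(-s / c),
    evicting from the LRU end to make room. Small objects are almost
    always admitted; huge ones rarely displace many small ones."""

    def __init__(self, capacity, c, seed=0):
        self.capacity, self.c = capacity, c
        self.used = 0
        self.lru = OrderedDict()       # key -> object size, in LRU order
        self.rng = random.Random(seed)

    def request(self, key, size):
        if key in self.lru:
            self.lru.move_to_end(key)  # hit: move to MRU position
            return True
        admit = self.rng.random() < math.exp(-size / self.c)
        if admit and size <= self.capacity:
            while self.used + size > self.capacity:
                _, evicted_size = self.lru.popitem(last=False)
                self.used -= evicted_size
            self.lru[key] = size
            self.used += size
        return False

# With a huge c the policy degenerates to plain LRU (admit everything);
# with a tiny c large objects are effectively never admitted.
cache = SizeAwareCache(capacity=100, c=1e9)
cache.request("small", 10)
assert cache.request("small", 10)   # second request is a hit
```

The parameter c trades off admitting large objects against protecting many small ones, which is exactly what AdaptSize's Markov-model-based tuning adapts to the observed traffic.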

Joint work with Ramesh K. Sitaraman (University of Massachusetts at Amherst & Akamai Technologies) and Mor Harchol-Balter (Carnegie Mellon University).

Daniel S. Berger is a Ph.D. student in computer science at the University of Kaiserslautern, Germany. His research interests intersect systems, mathematical modeling, and performance testing. As part of his Ph.D. work, Daniel is exploring the boundaries of achievable cache hit ratios in Internet content delivery. Daniel has spent several months each year on research visits: at CMU (2015), Warwick University (2014), T-Labs Berlin (2013), ETH Zurich (2012), and the University of Waterloo (2011). He received his B.Sc. (2012) and M.Sc. (2014) in computer science from the University of Kaiserslautern. Previously, he worked as a data scientist at the German Cancer Research Center (2008-2010).

Faculty Host: Mor Harchol-Balter

This talk describes the Splice Machine RDBMS, designed to power a new class of modern applications that require high scalability and high availability while simultaneously executing OLTP and OLAP workloads. Splice Machine is a full ANSI SQL database that is ACID compliant and supports secondary indexes, constraints, triggers, and stored procedures. It uses a unique distributed snapshot isolation algorithm that preserves transactional integrity and avoids the latency of two-phase commit (2PC) methods. The talk will also present a variety of distributed join algorithms implemented in the Splice Machine executor, and how the optimizer automatically evaluates each query and sends it to the right data flow engine. OLTP queries such as small reads/writes and range queries are executed on HBase, and OLAP queries such as large joins or aggregations are executed on Spark. The system can ensure that OLAP queries do not interfere with OLTP queries because the engines run in separate processes, each with tiered and prioritized resource management. We will also describe a few use cases where Splice Machine has been deployed commercially.
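The idea behind snapshot isolation can be illustrated with a minimal single-node multi-version (MVCC) sketch (our illustration, not Splice Machine's distributed algorithm): every write gets a commit timestamp, and a transaction reads the newest version committed at or before its begin timestamp, so readers never block writers.

```python
class VersionedStore:
    """Toy MVCC store: keeps every committed version of each key."""

    def __init__(self):
        self.versions = {}   # key -> list of (commit_ts, value), ascending
        self.ts = 0          # stand-in for a global commit clock

    def write(self, key, value):
        self.ts += 1         # assign a commit timestamp
        self.versions.setdefault(key, []).append((self.ts, value))
        return self.ts

    def begin(self):
        return self.ts       # a transaction's snapshot = clock at start

    def read(self, key, snapshot_ts):
        # Newest version committed at or before the snapshot.
        for commit_ts, value in reversed(self.versions.get(key, [])):
            if commit_ts <= snapshot_ts:
                return value
        return None

store = VersionedStore()
store.write("x", "v1")
snap = store.begin()              # a long-running reader starts here
store.write("x", "v2")            # a concurrent writer commits
assert store.read("x", snap) == "v1"            # reader still sees v1
assert store.read("x", store.begin()) == "v2"   # new snapshot sees v2
```

A distributed version, as in the talk, must additionally make timestamp assignment and visibility checks work across nodes without the coordination cost of 2PC.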

Monte Zweben is CEO of Splice Machine -- maker of the first dual-engine RDBMS on HBase and Spark. Monte's early career was spent with the NASA Ames Research Center as the Deputy Chief of the Artificial Intelligence Branch, where he won the prestigious Space Act Award for his work on the Space Shuttle program. Monte then founded and was CEO of Red Pepper Software, a leading supply chain optimization company, which merged in 1996 with PeopleSoft, where he became VP and General Manager of the Manufacturing Business Unit.

In 1998, Monte was the founder and CEO of Blue Martini Software -- the leader in e-commerce and multi-channel marketing systems. Blue Martini went public on NASDAQ in one of the most successful IPOs of 2000, and is now part of JDA. Following Blue Martini, he was the chairman of SeeSaw Networks, a digital, place-based media company. Monte is also the co-author of Intelligent Scheduling and has published articles in the Harvard Business Review and various computer science journals and conference proceedings. Zweben currently serves as Chairman of Rocket Fuel Inc. (NASDAQ:FUEL) and serves on the Dean's Advisory Board for Carnegie Mellon's School of Computer Science.

