SDI/ISTC

Alluxio, formerly Tachyon, is an open source memory-speed virtual distributed storage system. In this talk, I will first introduce the Alluxio project and then describe the Alluxio architecture, focusing on two of its distinguishing features: tiered storage and unified namespace. Tiered storage provides applications running on top of Alluxio with the ability to store data in local storage tiers (memory, SSDs, and hard drives), transparently managing data based on pluggable policies. Unified namespace provides applications running on top of Alluxio with the ability to access data from a heterogeneous set of remote storage systems (such as HDFS, S3, or GCS) through the same API and namespace.
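
To make the unified-namespace idea concrete, here is a minimal, purely illustrative sketch (not Alluxio's actual API; the UnifiedNamespace class, mount points, and backend URIs below are invented for this example) of how a single logical path hierarchy can be resolved against several mounted under-storage systems:

    # Illustrative only: a toy "unified namespace" that maps one logical path
    # hierarchy onto several backing stores, in the spirit of Alluxio's mount
    # table. Class and method names here are invented for this sketch.
    class UnifiedNamespace:
        def __init__(self):
            self._mounts = {}  # logical prefix -> (backend name, backend prefix)

        def mount(self, logical_prefix, backend, backend_prefix):
            self._mounts[logical_prefix] = (backend, backend_prefix)

        def resolve(self, logical_path):
            # Pick the longest mounted prefix that matches the logical path.
            match = max(
                (p for p in self._mounts if logical_path.startswith(p)),
                key=len,
                default=None,
            )
            if match is None:
                raise FileNotFoundError(logical_path)
            backend, backend_prefix = self._mounts[match]
            return backend, backend_prefix + logical_path[len(match):]

    ns = UnifiedNamespace()
    ns.mount("/warehouse", "hdfs", "hdfs://nn:9000/user/hive/warehouse")
    ns.mount("/logs", "s3", "s3://my-bucket/logs")

    # Applications see one namespace; the mount table decides where data lives.
    print(ns.resolve("/warehouse/sales/part-0"))  # ('hdfs', 'hdfs://nn:9000/user/hive/warehouse/sales/part-0')
    print(ns.resolve("/logs/2016/01/01.gz"))      # ('s3', 's3://my-bucket/logs/2016/01/01.gz')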

Jiří Šimša is a software engineer at Alluxio, Inc. and one of the top committers and a project management committee member of the Alluxio open source project. Prior to working on Alluxio, he was a software engineer at Google, where he worked on a framework for the Internet of Things. Jiří is a CMU and PDL alumnus, earning a PhD for his work on automated testing of concurrent systems under the guidance of professors Garth Gibson and Randy Bryant. He is a big fan of the Pittsburgh Penguins.

Jiří welcomes discussion about opportunities at Alluxio after his tech talk.

 

We will give an overview of current and future work on building foundations for scaling machine learning and graph processing in Apache Spark.

Apache Spark is the most active open source Big Data project, with more than 1,000 contributors. The ability to scale is a key benefit of Spark: the same code should run on a laptop or on hundreds to thousands of machines. Another big attraction is the integration of analytics libraries for machine learning (ML) and graph processing.

This talk will cover the juncture between the low-level (scaling) and high-level (analytics) components of Spark. The most important change for ML and graphs on Spark in the past year has been a migration of analytics libraries to use Spark DataFrames instead of RDDs. This ongoing migration is laying the groundwork for future speedups and scaling. In addition to API impacts, we will discuss the integration of analytics with two key initiatives in Spark. The first is Catalyst, the query optimizer underlying DataFrames and Datasets. The second is Tungsten, the project for approaching bare-metal speeds in Spark via memory management and code generation.
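
As a small, concrete illustration of the DataFrame-based API that the analytics libraries are migrating to, the PySpark sketch below (toy data and column names made up for this example) fits a logistic regression against a DataFrame; Catalyst plans the DataFrame operations and Tungsten manages memory layout and code generation under the hood:

    # Minimal sketch: DataFrame-based MLlib (spark.ml) instead of RDD-based mllib.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("df-mllib-sketch").getOrCreate()

    # Toy DataFrame; Catalyst optimizes the query plan, Tungsten handles the
    # memory layout and code generation underneath.
    df = spark.createDataFrame(
        [(0.0, 1.0, 0.0), (1.0, 0.0, 1.0), (0.0, 1.5, 0.0), (1.0, 0.2, 1.0)],
        ["x1", "x2", "label"],
    )

    assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
    train = assembler.transform(df)
    model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
    model.transform(train).select("label", "prediction").show()
    spark.stop()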

BIO:
Joseph Bradley is an Apache Spark committer and PMC member, working as a Software Engineer at Databricks. He focuses on Spark MLlib, GraphFrames, and other advanced analytics on Spark. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon in 2013.

Faculty Host: Majd Sakr

This talk discusses the need for, and a framework for, distributed end-to-end video analytics. Video cameras are ubiquitous nowadays, but limited Internet bandwidth has prevented video data from being put to use in an economically efficient manner. Additionally, many video analytics tasks mandate quick turnaround times, demanding (near) real-time decision making. Distributed end-to-end video analytics, which collaboratively uses devices, edge servers, and the cloud as the computing platform, is the way forward. However, distributed visual workloads can be difficult to develop and manage, especially for startups and SMEs, who have been playing an integral role in the Internet of Things. We present a framework for facilitating and expediting the development and management of distributed video analytics workloads across an end-to-end system comprising cameras, gateways, and the cloud.
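
As a rough, hypothetical illustration of the device/edge/cloud split (not the framework presented in the talk; the detector, classifier, and upload functions are stand-ins), the sketch below drops uninteresting frames on the device and forwards only small metadata records to the cloud, so upstream bandwidth scales with events rather than with raw video:

    # Hypothetical sketch: split a video-analytics pipeline across a device,
    # an edge server, and the cloud so that raw frames never leave the device.
    import random

    def detect_motion(frame):
        # Stand-in for a cheap on-device detector (e.g., frame differencing).
        return frame["activity"] > 0.8

    def classify(event_frame):
        # Stand-in for a heavier model that would run on an edge server.
        return "person" if event_frame["activity"] > 0.9 else "other"

    def upload_to_cloud(record):
        # Stand-in for shipping only small metadata records upstream.
        print("cloud <-", record)

    frames = [{"id": i, "activity": random.random()} for i in range(1000)]

    for frame in frames:
        if not detect_motion(frame):          # device: drop uninteresting frames
            continue
        label = classify(frame)               # edge: run the expensive model
        upload_to_cloud({"frame": frame["id"], "label": label})  # cloud: store metadata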


Dr. Yen-Kuang Chen is a Principal Engineer at Intel Corporation. His research areas span from emerging applications that can utilize the true potential of the Internet of Things to computer architectures that can embrace those emerging applications. He has more than 50 US patents and around 100 technical publications. He is one of the key contributors to Supplemental Streaming SIMD Extensions 3 and Advanced Vector Extensions in Intel microprocessors. He is the Editor-in-Chief of the IEEE Journal on Emerging and Selected Topics in Circuits and Systems and an IEEE CAS Distinguished Lecturer. He received his Ph.D. degree from Princeton University and is an IEEE Fellow.

Faculty Host: Kayvon Fatahalian

Graph clustering has many important applications in computing, but due to growing sizes of graphs, even traditionally fast clustering methods such as spectral partitioning can be computationally expensive for real-world graphs of interest. Motivated partly by this, so-called local algorithms for graph clustering have received significant interest due to the fact that they can find good clusters in a graph with work proportional to the size of the cluster rather than that of the entire graph. This feature has proven to be crucial in making such graph clustering and many of its downstream applications efficient in practice. While local clustering algorithms are already faster than traditional algorithms that touch the entire graph, they are sequential and there is an opportunity to make them even more efficient via parallelization. In this talk, we show how to parallelize many of these algorithms in the shared-memory multicore setting, and we analyze the parallel complexity of these algorithms. We present comprehensive experiments on large-scale graphs showing that our parallel algorithms achieve good parallel speedups on a modern multicore machine, thus significantly speeding up the analysis of local graph clusters in the very large-scale setting.
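
For readers unfamiliar with local clustering, the sketch below gives a sequential, toy implementation of one such algorithm: an approximate personalized-PageRank push procedure followed by a sweep cut, in the style of Andersen, Chung, and Lang. The parameters and example graph are made up, and the talk's contribution, parallelizing this family of algorithms, is not reflected here:

    # Illustrative sequential sketch of local clustering via approximate
    # personalized PageRank (push) plus a sweep cut. Work is proportional to
    # the cluster found, not the whole graph.
    from collections import defaultdict

    def approximate_ppr(graph, seed, alpha=0.15, eps=1e-4):
        """graph: dict vertex -> list of neighbors (undirected, no self-loops)."""
        p, r = defaultdict(float), defaultdict(float)
        r[seed] = 1.0
        queue = [seed]
        while queue:
            v = queue.pop()
            deg_v = len(graph[v])
            if r[v] < eps * deg_v:
                continue
            p[v] += alpha * r[v]                      # move mass to the PPR estimate
            share = (1 - alpha) * r[v] / (2 * deg_v)  # spread half of the rest to neighbors
            r[v] = (1 - alpha) * r[v] / 2
            for u in graph[v]:
                old = r[u]
                r[u] += share
                if old < eps * len(graph[u]) <= r[u]:  # u just crossed the push threshold
                    queue.append(u)
            if r[v] >= eps * deg_v:
                queue.append(v)
        return p

    def sweep_cut(graph, p):
        """Return the prefix of the p/deg ordering with the lowest conductance."""
        total_vol = sum(len(nbrs) for nbrs in graph.values())
        order = sorted(p, key=lambda v: p[v] / len(graph[v]), reverse=True)
        in_set, vol, cut = set(), 0, 0
        best, best_phi = None, float("inf")
        for v in order:
            in_set.add(v)
            vol += len(graph[v])
            cut += len(graph[v]) - 2 * sum(1 for u in graph[v] if u in in_set)
            denom = min(vol, total_vol - vol)
            if denom > 0 and cut / denom < best_phi:
                best_phi, best = cut / denom, set(in_set)
        return best, best_phi

    # Two triangles joined by a single edge; seeding in one triangle should
    # recover that triangle as the low-conductance cluster.
    g = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
    cluster, phi = sweep_cut(g, approximate_ppr(g, seed=0))
    print(cluster, phi)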

Julian Shun is currently a Miller Research Fellow (post-doc) at UC Berkeley. He obtained his Ph.D. in Computer Science from Carnegie Mellon University, and his undergraduate degree in Computer Science from UC Berkeley. He is interested in developing large-scale parallel algorithms for graph processing, and parallel text algorithms and data structures. He is also interested in designing methods for writing deterministic parallel programs and benchmarking parallel programs. He has received the ACM Doctoral Dissertation Award, CMU School of Computer Science Doctoral Dissertation Award, Miller Research Fellowship, Facebook Graduate Fellowship, and a best student paper award at the Data Compression Conference.

Faculty Hosts: Guy Blelloch, Phil Gibbons

Most major content providers use content delivery networks (CDNs) to serve web content to their users. CDNs achieve high performance by using a large distributed system of caching servers. The first and fastest caching level in a CDN server is the memory-resident Hot Object Cache (HOC). A major goal of a CDN is to maximize the object hit ratio (OHR) of its HOCs. Maximizing the OHR is challenging because web object sizes are highly variable and HOCs have a small capacity. This challenge has led to a wealth of sophisticated cache eviction policies. In contrast, cache admission policies have received little attention.

This talk presents AdaptSize: a new HOC caching system based on a size-aware cache admission policy. AdaptSize is based on a new statistical cache tuning method that continuously adapts the parameters of its cache admission policy to the request traffic. In experiments with Akamai production traces, AdaptSize improves the OHR by 30-44% over Nginx and by 49-92% over Varnish, which are two widely-used production systems. Further, AdaptSize's tuning method consistently achieves about 80% of the OHR of offline parameter tuning, and is significantly more robust than state-of-the-art cache tuning methods based on hill climbing. To demonstrate feasibility in a production setting, we show that AdaptSize can be incorporated into Varnish with low processing and memory overheads and negligible impact on cache server throughput.

Joint work with Ramesh K. Sitaraman (University of Massachusetts at Amherst & Akamai Technologies) and Mor Harchol-Balter (Carnegie Mellon University).
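
To make the idea of size-aware admission concrete, here is a minimal sketch of an LRU hot-object cache behind a probabilistic admission filter that admits an object of size s with probability roughly exp(-s/c), in the spirit of AdaptSize. The constant c and the toy trace below are made up; the actual system continuously retunes this parameter against the request traffic using a statistical model, which is omitted here:

    # Minimal sketch of a size-aware admission policy in front of an LRU cache.
    import math
    import random
    from collections import OrderedDict

    class SizeAwareLRU:
        def __init__(self, capacity_bytes, c):
            self.capacity = capacity_bytes
            self.c = c                      # admission parameter (tuned online in AdaptSize)
            self.cache = OrderedDict()      # object id -> size, kept in LRU order
            self.used = 0

        def request(self, obj_id, size):
            """Returns True on a hit, False on a miss."""
            if obj_id in self.cache:
                self.cache.move_to_end(obj_id)        # refresh LRU position
                return True
            # Miss: admit with probability exp(-size / c), so very large objects
            # rarely displace the many small objects that drive the hit ratio.
            if random.random() < math.exp(-size / self.c):
                while self.used + size > self.capacity and self.cache:
                    _, evicted_size = self.cache.popitem(last=False)  # evict LRU
                    self.used -= evicted_size
                if size <= self.capacity:
                    self.cache[obj_id] = size
                    self.used += size
            return False

    # Toy trace: many small objects plus occasional very large ones.
    cache = SizeAwareLRU(capacity_bytes=1 << 20, c=64 * 1024)
    hits = requests = 0
    for _ in range(100_000):
        if random.random() < 0.99:
            obj_id, size = random.randint(0, 2000), random.randint(1, 32) * 1024
        else:
            obj_id, size = -random.randint(0, 50), 512 * 1024
        hits += cache.request(obj_id, size)
        requests += 1
    print("object hit ratio:", hits / requests)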


Daniel S. Berger is a Ph.D. student in computer science at the University of Kaiserslautern, Germany. His research interests intersect systems, mathematical modeling, and performance testing. As part of his PhD work, Daniel is exploring the boundaries of achievable cache hit ratios in Internet content delivery. Daniel has spent several months each year on research visits: at CMU (2015), Warwick University (2014), T-Labs Berlin (2013), ETH Zurich (2012), and at the University of Waterloo (2011). He received his B.Sc. (2012) and M.Sc. (2014) in computer science from the University of Kaiserslautern. Previously, he worked as a data scientist at the German Cancer Research Center (2008-2010).

Faculty Host: Mor Harchol-Balter

This talk describes the Splice Machine RDBMS designed to power today's new class of modern applications that require high scalability and high-availability while simultaneously executing OLTP and OLAP workloads. Splice Machine is a full ANSI SQL database that is ACID compliant, supports secondary indexes, constraints, triggers, and stored procedures. It uses a unique, distributed snapshot isolation algorithm that preserves transactional integrity, and avoids the latency of 2PC methods.

The talk will also present a variety of distributed join algorithms implemented in the Splice Machine executor and how the optimizer automatically evaluates each query and sends it to the right data flow engine. OLTP queries such as small reads/writes and range queries are executed on HBase, and OLAP queries such as large joins or aggregations are executed on Spark. The system ensures that OLAP queries do not interfere with OLTP queries because the engines run in separate processes, each with tiered and prioritized resource management. We will also describe a few use cases where Splice Machine has been deployed commercially.
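
The routing idea can be pictured with the toy sketch below (not Splice Machine's actual optimizer; the cost model, threshold, and engine stubs are invented for illustration): plans with a small estimated cost are dispatched to the OLTP path, while expensive joins and aggregations go to the OLAP path.

    # Hypothetical sketch of dual-engine query routing: estimate each plan's
    # cost and dispatch cheap point/range queries to an OLTP engine and
    # expensive analytical queries to an OLAP engine.
    OLAP_COST_THRESHOLD = 10_000  # invented threshold on estimated cost

    def estimate_cost(plan):
        # Stand-in for a real cost model: rows touched times a per-operator factor.
        factor = {"index_lookup": 1, "range_scan": 2, "full_scan": 10, "join": 20, "aggregate": 5}
        return plan["rows"] * factor[plan["operator"]]

    def execute_oltp(plan):
        return f"OLTP engine (HBase) executed {plan['operator']} (cost {estimate_cost(plan)})"

    def execute_olap(plan):
        return f"OLAP engine (Spark) executed {plan['operator']} (cost {estimate_cost(plan)})"

    def route(plan):
        return execute_olap(plan) if estimate_cost(plan) > OLAP_COST_THRESHOLD else execute_oltp(plan)

    print(route({"operator": "index_lookup", "rows": 5}))   # small read -> OLTP path
    print(route({"operator": "join", "rows": 2_000_000}))   # large join -> OLAP path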

Monte Zweben is CEO of Splice Machine -- maker of the first dual-engine RDBMS on HBase and Spark. Monte's early career was spent with the NASA Ames Research Center as the Deputy Chief of the Artificial Intelligence Branch, where he won the prestigious Space Act Award for his work on the Space Shuttle program. Monte then founded and was CEO of Red Pepper Software, a leading supply chain optimization company, which merged in 1996 with PeopleSoft, where he became VP and General Manager of the Manufacturing Business Unit.

In 1998, Monte was the founder and CEO of Blue Martini Software -- the leader in e-commerce and multi-channel marketing systems. Blue Martini went public on NASDAQ in one of the most successful IPOs of 2000, and is now part of JDA. Following Blue Martini, he was the chairman of SeeSaw Networks, a digital, place-based media company. Monte is also the co-author of Intelligent Scheduling and has published articles in the Harvard Business Review and various computer science journals and conference proceedings. Zweben currently serves as Chairman of Rocket Fuel Inc. (NASDAQ: FUEL) and on the Dean's Advisory Board for Carnegie Mellon's School of Computer Science.

Faculty Host: Andy Pavlo

This talk will provide an overview of LinkedIn's distributed stream processing platform, built on Samza, Kafka, and Databus. It will first cover the high-level scenarios for stream processing at LinkedIn, followed by detailed requirements around scalability, re-processing, accuracy of results, and ease of programmability. We will then focus on the requirements of stateful stream processing applications and explain how Samza's state management allows us to build applications that meet all of the above requirements. The talk will explain the key concepts, architecture, and usage of LinkedIn's stream processing pipeline, including state management in Samza and the use and configuration of Kafka and Databus as input/output and as a changelog.

We will also discuss in detail how we leverage a reliable, replayable messaging system (i.e., Kafka) together with durable state management in Samza to build a Lambda-less stream processing platform. The key mechanism for achieving a unified processing model across batch and real-time streams is windowing. We will also dive into the requirements and our solutions for windowing real-time streams.
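
As a rough illustration (plain Python, not the Samza API) of stateful, windowed processing backed by a changelog, the sketch below counts keyed events in tumbling windows using a local state store and appends every update to a changelog so the store can be rebuilt after a failure; in Samza the changelog is a compacted Kafka topic and the store is a local key-value store:

    # Illustrative sketch only: tumbling-window counts per key kept in a local
    # state store, with every update appended to a changelog for recovery.
    WINDOW_MS = 60_000

    store = {}       # (key, window_start) -> count   (local state)
    changelog = []   # append-only log of state updates (stand-in for a Kafka topic)

    def process(event):
        window_start = (event["ts"] // WINDOW_MS) * WINDOW_MS
        state_key = (event["key"], window_start)
        store[state_key] = store.get(state_key, 0) + 1
        changelog.append((state_key, store[state_key]))   # durably record the new value

    def restore(log):
        """Rebuild the local store from the changelog after a crash."""
        rebuilt = {}
        for state_key, value in log:
            rebuilt[state_key] = value                    # last write wins
        return rebuilt

    for e in [{"key": "page:/home", "ts": 1_000},
              {"key": "page:/home", "ts": 2_000},
              {"key": "page:/jobs", "ts": 61_000}]:
        process(e)

    assert restore(changelog) == store
    print(store)   # {('page:/home', 0): 2, ('page:/jobs', 60000): 1}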

Yi Pan graduated from UCI with a Ph.D. in Computer Science in 2008. Since then, he has worked on distributed platforms for Internet applications for 8 years. He started at Yahoo!, working on Yahoo!'s NoSQL database project and leading the development of multiple features, such as real-time notification of database updates, secondary indexes, and live migration from legacy systems to NoSQL databases. Later, he joined and led the development of the Cloud Messaging System, which is used heavily as a pub-sub service and transaction log for distributed databases at Yahoo!. In 2014, he joined LinkedIn, where he quickly became the lead of the Apache Samza team, which provides a scalable stream processing service for the whole company.

Faculty Hosts: Majd Sakr, Garth Gibson

Partially funded by Yahoo! Labs

Today's cloud computing infrastructure requires substantial trust. Cloud users rely on both the provider's staff and its globally-distributed software/hardware platform not to expose any of their private data. We introduce the notion of shielded execution, which protects the confidentiality and integrity of a program and its data from the platform on which it runs (i.e., the cloud operator's OS, VM and firmware). The talk presents two prototype systems that allow applications to run in the cloud without having to trust it.

Haven is the first system to achieve shielded execution of unmodified legacy applications, including SQL Server and Apache, on a commodity OS (Windows) and commodity hardware. Haven addresses the dual challenges of executing unmodified legacy binaries and protecting them from a malicious host. VCCC is the first practical framework that allows users to run distributed MapReduce computations in the cloud while keeping their code and data secret, and ensuring the correctness and completeness of their results. Both systems leverage the hardware protections of Intel SGX to defend against privileged code and physical attacks such as memory probes.

Marcus Peinado is an Architect in the Platform Infrastructure Group at Microsoft Research, Redmond. His interests include Operating Systems, Trusted Computing and System Security. His past and current projects in these areas include Haven, VCCC, Hyper-V, Windows Media security, Controlled Channel attacks and the MAS rootkit detector. Marcus holds a Ph.D. from Boston University.

Faculty Hosts: Majd Sakr, Garth Gibson

User-facing, latency-critical services account for a large fraction of the servers in large-scale datacenters, but typically exhibit low resource utilization during off-peak times. An effective approach for extracting more value from these servers is to co-locate the services with batch workloads. Thus, in this work, we build systems that harvest spare compute cycles and storage space from datacenters for such batch workloads. The main challenge is minimizing the performance impact on the services while being aware of, and resilient to, their utilization and management patterns. To overcome this challenge, we propose techniques for giving the services priority over the resources and for leveraging historical information about them. Our results characterize the dynamics of how services are utilized and managed in ten large-scale production datacenters. Using real experiments and simulations, we also show that our techniques eliminate data loss and unavailability in many scenarios, while protecting the co-located services and significantly improving batch job execution time.

BIO: 
Marcus Fontoura is currently a Partner Architect at Microsoft, where he works on the end-to-end architecture for Azure compute. In his previous roles at Microsoft, Marcus worked on the production infrastructure for Bing and on several Bing Ads projects. Prior to Microsoft, Marcus was a Research Scientist at Google (2011-2013), where he worked on the Search Infrastructure team. His focus was on the serving systems powering Google.com search. At Google, Marcus worked on many projects, including performance and scalability of retrieval engines, novel compression schemes, indexing systems, and networking. He also worked on retrieval techniques for large-scale machine learning systems.

Before joining Google, Marcus was a Research Scientist at Yahoo! Research (2005-2010), working on several advertising projects. Marcus also worked as the architect for a large-scale software platform for indexing and content serving, which is used in several of Yahoo!'s display and textual advertising systems. He was recognized as a Yahoo! Superstar and was awarded two Yahoo! You Rock awards. Prior to Yahoo!, Marcus worked at the IBM Almaden Research Center (2000-2005), where he co-developed a query processor for XPath queries over XML streams. This was one of the key components of the implementation of the XML data type in the IBM DB2 Relational Database System. In another project at IBM, he was one of the key researchers developing an Enterprise Search Engine that resulted in a new software product for IBM - the IBM OmniFind Enterprise Search.

Marcus finished his Ph.D. studies in 1999 at the Pontifical Catholic University of Rio de Janeiro, Brazil (PUC-Rio), in a joint program with the Computer Systems Group at the University of Waterloo, Canada. The main contributions of his Ph.D. thesis have been condensed in the book The UML Profile for Framework Architectures, published by Addison-Wesley in 2001. After finishing his Ph.D., Marcus was a post-doctoral researcher in the Computer Science Department at Princeton University for one year (1999-2000). Marcus is an ACM Distinguished Member and an IEEE Senior Member. He has more than 25 issued patents and more than 50 published papers. Marcus has served on several program committees over the years, including SIGIR, WWW, WSDM, KDD, and CIKM. Recently, Marcus was a co-chair of the WWW 2013 developers track.

Faculty Hosts: Garth Gibson, Majd Sakr
