CMU's New Graduate Database System Internals Course // Blog // Andy Pavlo

TL;DR

I created a new course at CMU on database system internals. It was a lot of work. All of the lectures are now available on YouTube.
One of the key aspects of the course was practical experience on writing code in a real-world DBMS but I should have done more code reviews.
The student's final projects were quite good. Many of them made it into the Peloton code base.

It was a wild semester. Development of CMU's new self-driving DBMS is moving forward. I got some death threats on Twitter (unfortunately they were taken down before I could take screenshots). We submitted seven SIGMOD papers and got three accepted. KB ran the Pittsburgh Marathon. Two of my PhD students won fellowships: Joy Arulraj won one from Samsung and Dana Van Aken won one from NSF.

I spent most of the last five months teaching a new graduate-level course at CMU: 15-721 Database Systems. This course was last taught in 2005 by Nastasia Ailamaki (who is now at EPFL). Since the previous incarnation was over 10 years old, there was no material that I could reuse. I decided to not make it a typical seminar course (i.e., read papers, work on a single project) and instead chose to do all of the lectures myself. And because I wanted to put the videos for all of my lectures on-line, I decided to spend a lot of time to develop the course materials from scratch.

The following is my reflection on how I thought the course went and what I plan to do to make it better in the future.

Preparation

I have been preparing to teach 15-721 for the last 12 months. As part of this, I had conversations with people in industry to find out what they thought was the important topics to cover in a graduate-level course on DBMS internals. This included both large companies, research groups, and start-ups. The way that I couched the question is to ask them what skills did they want a student to have when they applied for a job as a DBMS developer at their organization. Without me prompting them, they all said the same thing: they want students that can start working on a large system's codebase with little hand holding. They felt that if a person is smart enough, then can learn what they need to know about DBMS internals along the way. One particular DBMS company that I talked to said that they test an applicant's ability to do this during the interview by having them implement a new (small) feature in Memcached. They told me that this task is the most informative signal of an applicant's ability. Given this, I decided that the majority of the students' grades would be derived from programming projects and that I would try to teach them how to apply the concepts that we discussed in class in a real-world DBMS.

Course Overview

The course was focused on the design and implementation of in-memory, multi-core DBMSs. I only covered single-node systems because the DBMS that the students would use for their projects is not distributed (yet). Another overarching theme for the course was how to design a DBMS to support hybrid workloads (i.e., HTAP) that are comprised of both OLTP operations and OLAP queries. Throughout the semester, I also tried to discuss these topics in the context of both academic (e.g., HyPer, Silo, HYRISE) and commercial (e.g., Microsoft Hekaton, MemSQL, VoltDB) DBMSs^[1].

The first half of the course was on what I call the fundamentals of an in-memory DBMS architecture. These are the components that one would need to always implement when building a new DBMS from scratch: storage, concurrency control, query execution, indexes, join algorithms, and logging. Such topics are typically covered in an undergraduate-level introduction course (and this great survey), but I tailored the readings and lectures to discuss how to implement these things efficiently on multi-core CPUs. Although there was not a specific lecture on hardware advancements (e.g., NUMA, SIMD), I tried to sprinkle that discussion in throughout the semester. I had originally hoped to discuss hardware transactional memory but we ran out of time at the end.

The second half of the semester was on additional modern techniques and optimizations, such as compression, query compilation, vectorization, and handling larger-than-memory databases. These are things that are not necessary but are expected in any new system coming out today. Good examples of these optimizations can be found in well-known systems like MemSQL and HyPer, as well as newer upstarts like ScyllaDB and Vitesse DB.

Programming Projects

All of the projects in my course are based on the new full-featured DBMS (Peloton) that we are developing at CMU. Peloton is an in-memory relational DBMS that is designed for HTAP. It is currently based on Postgres's front-end, but we are in the process of ripping those parts out this summer so that we can get better performance and go distributed. The big thing about Peloton that makes it different than existing systems is that it is designed to be completely autonomous (i.e., no DBAs). I will discuss more about that part at a later date.

I assigned the students three projects during the semester. The first one was an individual assignment where everyone had to implement a single-threaded hash join operator. This was meant to just have the students set up their development environment and get their feet wet in the system. All of the students did well on this. The second project was to have everyone implement an in-memory, lock-free Bw-Tree from Microsoft Hekaton. It is my opinion that the Bw-Tree is the state-of-the-art concurrent OLTP index^[2], but there is no (complete) open source implementation. I assigned the students into groups of three for this project. Nobody completed the whole thing, but some groups finished everything but garbage collection. In retrospect, this was too ambitious for a course project and I will probably use a simpler data structure next year (e.g., concurrent skip list).

For the final project, I let the students choose their own topic. I told them the goal was to write high-quality code that could then be merged into the full system. To ensure this, we did the following things:

We required them to follow Google's C++ coding style and provided a clang-format script to check this.
We gave them Jenkins accounts so that it would test their code whenever they pushed a change to Github.
We required each group to write test cases (using gtest) and provided them with an automated LCOV-generated code coverage report.
We required each group to perform a code review of another group's project. Then each group is required to address the feedback and update their code accordingly.

The last item turned out to be the most important. With 10 different projects, I nor my baller TA (Joy) had enough time to look over everyone's code. The students really took to this part of the course and did a good job with it. One common problem that the students found was not using C++11 smart pointers.

Not all of the projects were merged into our master branch. Some of them turned out to be too experimental (e.g., multi-threading, a new Cascades-style query planner). One group went "rogue" and started optimizing the entire system to remove the stupid bottlenecks that we had from when we first started writing the system. This was quite impressive. The CMU MS students that did this (Jiexi Lin, Ran Xian, and Runshen Zhu) along with a visiting PhD from Singapore (Yingjun Wu) were able to improve Peloton's ability to scale up to up to 80 cores. The class voted them as the best project and they were awarded the grand prize: a CMU DB shirt, a copy of the The Manga Guide to Databases, and a rare autographed copy of Larry Ellison's biography where he professes his love for the Wu Tang Clan^[3].

Jiexi Lin, Ran Xian, and Runshen Zhu were voted by the class as having the best final project for their work on improving Peloton's scalability.

Lessons Learned

The biggest mistake in this course that I made was having the students do the code review at the end of the semester. I also only had them do one round of reviews. I now realize that should have had them do their code reviews half way through the project and then re-assigned them to do a second round with another group again near the end.

The second mistake was that our master branch was changing a lot during the semester because we had a lot of students fixing bugs and adding new features. I should have frozen the master during this time and had the students push changes to separate branches.

I also felt that I tried to cover too much in each lecture. I always had a single reading assignment per class, but then also included additional "optional" readings. I'm sure that nobody read those papers (since they had other classes). I suppose that if I had a text book that I could reference than I felt that my lectures could have been more directed.

I am pleased with how things turned out with putting all of the lecture videos on-line and making them publicly available. I think that every professor should do this. I had the option to make them only available to CMU students, but I felt that it was important to have these things open to anyone. I have gotten some good feedback about some parts, and bad feedback about others. I did get two emails from two different people working at DBMS companies (that I won't name) asking me to cover their system. I am always interested in suggestions like this but I felt that these systems were outside the scope of the course.

What's Next?

I will teach this course again in Spring 2017. I plan to only have two projects next time and these projects will have multiple rounds of code reviews. I hope to also have a more comprehensive testing suite for Peloton. This would include both unit and regression testing (using Postgres as an oracle), as well as speed tests using something like Codespeed. As for topics, I felt that I could have probably dropped the first lecture on index locking and latching. I might also drop the lecture on OCC since we really ended up only talking about MVCC for the rest of the semester (although you need to understand Silo in order to understand SiloR).

Preparing for each lecture took a lot of time, but now they are done once it won't take me as much time to prepare next year. Instead, I plan to write comprehensive lecture notes. Some random person in China already started doing this. I will make this available on-line with the videos and potentially turn this into an open source textbook^[4].

Another idea I had is to have a guest speaker from industry come and give a guest lecture about how they do testing for their DBMS. Although 15-721 is technically not a software engineering course, I think that testing is more important in databases then other areas in CS because people get pissed if you lose their data.

I am also planning to revamp CMU's introductory DB course in Fall 2017. This is going to be another year long effort. We currently teach from the Raghu and Johannes' famed cow textbook, but it is 10 years old and I think that I can do it better. There are a lot of topics that I think are critical for students to know now but are not covered in their book (e.g., column stores, in-memory DBMSs).

Acknowledgements

I first want to thank my incredible PhD student Joy Arulraj for TA-ing the class. Joy is the Ultimate Warrior of databases. He is going on the academic job market at the end of 2017. Everyone else will be crushed and driven before him. He will then hear the lamentations of their men and/or women.

Second, I would like to thank MemSQL and Micron for their generous hardware donations for the class. All of the students developed their projects using this donated equipment. I could not have done this without their support.

Lastly, thanks to all of the DBMS companies that donated shirts for me to give out to the students in recognition of their final projects: Snowflake, VoltDB, SpliceMachine, NuoDB, Altibase, and MongoDB.

Footnotes

I wanted to talk about some of the "classic" in-memory DBMSs too, such as P*TIME, SolidDB, and Altibase, but I really found it hard to find useful technical documentation about what they actually do. I had the same problems with Hekaton, TimesTen, and MemSQL, but I cheated and emailed the developers at those companies directly.
MassTree is probably better for string keys, but my intuition is that the Bw-Tree will be better for numeric keys. I haven't measured this yet though.
ODB was a DBA.
I got this idea from Kayvon Fatahalian and his Parallel Computer Architecture and Programming course.