An Updated Guide on Where to Apply for a PhD in Databases in the US (2018) // Blog // Andy Pavlo

TL;DR

This is an updated version of my 2016 guide on where to apply for a Ph.D. in databases. I tried to expand the list to include other areas beyond systems. Again, don't just look at the rankings. Whether an advisor is the right fit for you matters more.

The Ph.D. admissions season is upon us once again! That means applicants will repeat the same clichés as people in previous years. This includes things like students emailing to tell me about how they want to come to CMU to work on graph databases (bad). The SOPs with overly effusive language that talk about how "Prof. Pavlo is the world's premiere expert in databases" (good) or how Berkeley/Stanford/MIT is the "dream school for me" even though I work at CMU (bad). Lastly there is the shady behavior from some people trying to get in by making up stories (bad) or lying on their CV (bad); more on that in the next blog article. I love this excitement and uncertainty. I don't watch TV other than baseball, news, or illegitimate theater, so this is my source of drama.

I originally posted my guide on what schools to apply to for a Ph.D. in database systems in 2016 because I was writing recommendation letters for six students that year. I figured it would be useful material for others to help them figure out where to apply for graduate school. But that was two years ago and a lot of things have changed since then. LeBron is not with Cleveland anymore, and some of the professors that I listed in my old guide are no longer at the same universities. I am writing letters for four students this year, so I figured it was time to update my guide.

Given this, the following is my list of database groups in the US that I am encouraging my students to apply to in 2018 (other than CMU). Note that I do not know whether these professors are taking new students. I am describing which aspects of their research projects I find interesting or why I like them as a person.

My first Ph.D. student graduated this summer and is now faculty at GA Tech to be part of Atlanta's crunk scene. I am taking more students for fall 2019, but by "more" I mean only one or two. You can apply to CMU to work with me (early deadline is December 1st) but you should also consider other schools. I am the only "core" DB professor at CMU^[1]. Other DB groups have more faculty members and admit more students than I can. And you should consider other schools beyond the "top four" (i.e., MIT, CMU, Stanford, Berkeley). I believe that it is a mistake when applicants say that they only applied to these four schools. The database professors at the other schools that I list below are doing awesome work. For example, Wisconsin is traditionally the database research powerhouse.

It is important for you to also be aware that some professors may end up leaving their university before fall 2019. Make sure you ask around to find out what their plans are so that you do not come to a university only to find that they have left. They could be on a (temporary?) leave of absence for start-ups (e.g., Peter Bailis, Amol Deshpande), industry (e.g., Yannis Papakonstantinou, Feifei Li), or they may have moved to another university. Some professors are still at their respective schools but may be in a (temporary?) leadership role that is eating up most of their time. For example, Mike Franklin, Ugur Cetintemel, and Zack Ives are currently the department chairs at Chicago, Brown, and Penn, respectively.

Disclaimers

This list is not exhaustive. I do not have formal criteria for whom I choose to include in my list. My only judgment call was whether a person recently wrote a paper that I admire and that I wish I had written.

There are other professors doing novel database research in non-systems areas, such as Immanuel Trummer (Cornell), Jun Yang (Duke), Ashwin Machanavajjhala (Duke), Mirek Riedewald (Northeastern), Tiark Rompf (Purdue), and Dan Suciu (UW). I am only listing the ones that are doing work that is somewhat related to the research problems that I am focusing on now. There are also newer faculty, like Joy Arulraj (GT), Stephen Bach (Brown), and Marco Serafini (UMass), who just started and have yet to establish their research agendas. They are definitely interested in taking new students.

There are also awesome people doing good systems research who do not typically publish in the same conferences that I do. Notable examples include Dave Andersen (CMU), Eddie Kohler (Harvard), Raluca Ada Popa (Berkeley), Adam Belay (MIT), and Joey Gonzalez (Berkeley). Again, I am only listing my SIGMOD/VLDB fam here.

Lastly, I am again only including schools in the United States. I have stomach problems when I travel abroad that Imodium does not seem to remedy, so I think that this is fine. If you don't want to live in the US, then you should consider applying to work with Ken Salem (Waterloo), Hannes Muhleisen (CWI), Peter Boncz (CWI), Natassa Ailamaki (EPFL), Christoph Koch (EPFL), Thomas Neumann (TUM), Gustavo Alonso (ETH), Viktor Leis (???), Holger Pirk (ICL), or Jana Giceva (ICL). And yes, I realize that most of these people are either German or work at a university in a German-speaking country. This should not be surprising. After the United States, Germany has produced more DBMSs than any other country in the world.

By Research Area

The list of schools kept growing as I was writing this article, so I generated an approximate grouping to cross-reference the schools by database research area:

Cleaning: Columbia, Illinois, MIT, Wisconsin
Concurrency Control: Chicago, CMU, Maryland, Wisconsin
Distributed Databases: Maryland, UC Berkeley, UC Santa Cruz
Hardware Acceleration: Columbia, Ohio, UC San Diego
Machine Learning (Applied): CMU, MIT, UC Berkeley, UC San Diego, Washington, Wisconsin
Query Optimization: Chicago, CMU, Washington
Security: Northwestern
System Internals: CMU, Harvard, Maryland, Ohio, Princeton, UC Berkeley, Washington, Wisconsin
Visualization/Interaction: Columbia, Ohio, Illinois

The List (2018)

Chicago — Chicago's CS department is on the rise. This is the breakout story of the last two years in the DB world. They have a new building. Database legend Mike Franklin is now the department chair. Aaron Elmore is doing great work in a variety of areas (e.g., anomalies, data dedup). He also has a joint VLDB 2017 paper with Aditya Parameswaran on a git-style versioned DBMS called OrpheusDB. Chicago also just hired Sanjay Krishnan from Berkeley, who has some of the first work on using reinforcement learning for query optimization.
Columbia — If you want to work on hardware-related problems for DB systems, then you want to work with Ken Ross. If you want to do work on systems for better visualization and interaction with data, then you want Eugene Wu. Eugene started as a professor in 2016 but he has been cranking out VLDB papers since 2004. This was so long ago that Ronald Reagan was still alive. That's how you know Eugene is on point with his research skills.
Harvard — So Margo Seltzer left Harvard for Canada. This was a surprise to many people. To put this in perspective for people outside of the research community, this is equivalent to Ice Cube leaving NWA. Despite this loss, Stratos Idreos is still doing trill DB systems research. Stratos' CrimsonDB project and latest papers are remarkable. Rather than make one-off data structures that are better than previous ones, he is codifying the first principles for designing DBMS' internal data structures. He then uses these principles to guide his research. I highly encourage you to go read his SIGMOD 2018 paper on this topic.
Illinois — I admire Aditya Parameswaran's research and productivity. He has a lot of active projects in data exploration, cleaning, integration, and crowdsourcing. He does the dirty work of talking to real people and then building usable systems that solve real-world problems (e.g., Zenvisage, DataSpread). Talking with people is not always possible in our field because some DB professors have restraining orders against them or they are under house arrest. The other DB professor at UIUC is Kevin Chang. I am not that familiar with Kevin's work except where it overlaps with Aditya's. But Kevin cranks out a lot of papers at the top DB conferences.
Maryland — UMD is another school that is blowing up for databases. The two big changes last year were that (1) DB heartthrob Dan Abadi left Yale for UMD and (2) he was soon joined by Stonebraker's student Leilani Battle from MIT. Dan works on the same type of systems problems that both Stratos and I work on. Leilani works on interactive systems for data exploration. Both would be excellent advisers. As I mentioned above, Amol Deshpande is away doing a start-up. UMD is also getting a new building for the CS department.
Michigan — Everything that I said about the Michigan DB group in 2016 is more true now than it was before. Mike Cafarella has been on a LOA since his start-up with Chris Re got bought by Apple. But that is over and he's back at Michigan to ramp up his research again. Barzan Mozafari is still doing great work on systems that is having real-world impact. The lock-request scheduling algorithm from his VLDB 2018 paper is now the default scheduling algorithm in MySQL v8. He is also building an approximate query engine called VerdictDB that works on top of existing SQL-on-Hadoop engines. Lastly, there is H.V. Jagadish; he's one of the more senior people listed here and he is still actively involved in research.
MIT — Mike Stonebraker and Sam Madden are two of my favorite people in the world. Mike is unstoppable. He is 75 years old and still on fire. If I send him an email after midnight on a random weekday, I get a response right away or first thing the next morning. Two years ago I went to visit MIT and he still came to our meeting even though he was in the hospital the night before because he broke his leg. His opinion is still highly valued in the community. People stop and listen to him anytime he speaks in a meeting. He also is one of the nicest and most down-to-earth people that I have met. Not only has Mike made me a better scientist, he has made me strive to be a better person and try to help people as much as I have seen him help others. As for Sam, there were rumors going around that he was on cocaine because he is such a prolific researcher. I've seen the results of his urine tests and I can assure you that Sam is clean. He is that good. Not that the number of publications matters, but Sam is tied for first in the US with the most SIGMOD/VLDB publications in the last five years according to CSrankings.org. But Sam also has publications in non-DB fields while also raising two kids. It is hard to compete with that kind of energy^[2].
Northwestern — To the best of my knowledge, Jennie Rogers is the only person on my list who is looking at security and privacy issues related to database systems. This is an important problem that I wish I could make myself more interested in^[3]. See her VLDB 2017 paper on building a privacy-preserving query engine for federated databases.
Ohio — I have been quite public about my distaste for Ohio due to a romantic relationship that ended badly and caused me to lose a winter coat that I was quite fond of. If you can overlook my personal hangups, then you should check out Spyros Blanas and Arnab Nandi. Together they have been looking at building scalable query engines for supercomputing environments with hardware-accelerated networking.
Princeton — Technically there is not a database group at Princeton. The last SIGMOD/VLDB paper published from their department was in 2007. So why am I including them here? Well, it's because one of the hottest database start-ups right now (TimescaleDB) was founded by Princeton professor Mike Freedman. He acts coy about not being a database person. But deep down Mike knows that he has been living a lie all of his life by focusing on networking research and that his true calling is to write SIGMOD papers^[4].
Stanford — There are so many impressive things about the Stanford crew. It is more than the systems (e.g., Snorkel, DeepDive, MacroBase), start-ups (e.g., SambaNova, Sisu), industry consortiums (e.g., DAWN), and awards (e.g., MacArthur) that come out of their squad. What is more profound is their forward-thinking vision about how the world should be and their ability to bend the world to that vision. A good example of this is Chris Re's push on how "Software 2.0" will have ML at the center of future software stacks (see his CIDR 2019 paper). Matei Zaharia's Weld project seeks to build a common execution substrate for the various libraries and ML frameworks that data scientists are using today. As I mentioned above, Peter Bailis (aka the "Don Johnson of Databases") is away running his new start-up.
UC Berkeley — Joe Hellerstein still remains one of the best. This man does not stop. Joe does systems, he does theory, he does applications. He is the full package. His squad just dropped one of the most scalable distributed key-value stores (Anna). The core idea is to have a single codebase that can elastically scale from single-socket machines to thousands of nodes. Another one of Joe's projects that I am excited about is the data provenance platform Ground. The scope of the Ground project has expanded and it is now used in the Flor project for tracking models and meta-data in ML workflows. The amount of intellectual prowess that Joe wields is still slightly intimidating to me. This is not because of anything he does on purpose, but rather his mannerisms and methodical style of speaking (you'll know it when you see it). But he is down to earth and I know he is a thoughtful advisor.
UC San Diego — Arun Kumar is the junior database faculty member at UCSD who is cranking out great papers. He has mostly been focused on building better management tools and systems for ML workloads, such as for training models and operating over unstructured data (e.g., images, audio). Arun came out of Jignesh Patel's group at Wisconsin, so he is also applying his Madison-bred system skills to hardware acceleration for ML workloads. I think that Arun has good taste in picking interesting and timely problems to work on.
UC Santa Cruz — The DB faculty at UCSC are among the few people (along with UCSB) who are still doing "core" distributed database research^[5]. Peter Alvaro is working on building more reliable distributed applications in his Disorderly Labs. Faisal Nawab started in 2017 and is focused on distributed concurrency control protocols that include edge (IoT) devices.
Washington — The Myria DBMS project concluded in 2018. The new big systems project that Alvin Cheung and Magda Balazinska are working on is LightDB. It is a DBMS for storing and querying virtual reality (VR) videos. This is part of a larger research theme for Magda looking into how to support analytics over image and video data sets using deep learning. Like Sanjay at Chicago, Magda also has a student looking into early techniques for applying ML to query optimization. Alvin also has an interesting PL-related project for automatically checking the equivalency of SQL statements called Cosette. The demo is impressive.
Wisconsin — The Wisconsin DB group is probably the strongest that it has been in a long time. They have hired so many great researchers in the last two years. And this was on top of already having two of the best senior DB researchers. First of all, you have Theo Rekatsinas working on applying ML methods for data cleaning, integration, and fusion (see his HoloClean project). AnHai Doan is doing similar work on using ML for entity resolution. Then you have Paris Koutris doing theory work that I appreciate but don't understand. Lastly, Jignesh Patel is still crushing systems research. I always learn so much every time I read one of his papers (e.g., see his VLDB 2017 paper on aggressively using Bloom filters for join reordering without an optimizer). Jignesh's raw DB skills have caused him to never be afraid of anything in his life except for one thing...

Footnotes

[1] CMU has other professors that dabble in databases, like Dave Andersen, Todd Mowry, and Greg Ganger. We also have the beloved Christos Faloutsos but he has been mostly publishing in data mining conferences (WWW, SIGKDD) for the last decade.
[2] I have only published in database conferences, I have zero children, and I have been struggling to keep my shirt on around campus due to my ongoing battle with vivacious HPD.
[3] I don't care about security because I like to live dangerously. I don't wear seat belts. I shower without a bath mat. And I use the password 1234 whenever I can for my online accounts.
[4] I may be grossly misinterpreting Mike's body language when I talk about databases with him. You should really check with him first whether he wants to write DB papers before you join Princeton.
[5] By "core" research, I mean classic problems like consensus protocols, concurrency control, and consistency issues.