15-859: Algorithms for Big Data, Fall 2021

Instructor: David Woodruff
Lecture time: Thursdays, 3:05pm-6:00, GHC 4303
TA: Taisuke (Tai) Yasuda
David's office hours: Mondays, 10:30-11:30am ET, GHC 7217, or by appointment
Tai's recitation: Fridays, 4-5pm ET, on Zoom Zoom (will be recorded - ask Tai for the recording if you can't make it)
Tai's office hours: Wednesdays, 6-7pm ET, on Zoom
Piazza site: piazza.com/cmu/fall2021/15859/home


Grading Latex Course Description    Lectures    Recitations    Problem sets    References   

Grading

Grading is based on problem sets, scribing a lecture, and a presentation/project. There will be no exams. General information about the breakdown for the grading is available here: grading.pdf

Latex

Homework solutions, scribe notes, and final projects must be typeset in LaTeX. If you are not familiar with LaTeX, see this introduction.

A template for your scribe notes is here: template.tex

Lectures


Tai's Recitations


Problem sets


Course Description

With the growing number of massive datasets in applications such as machine learning and numerical linear algebra, classical algorithms for processing such datasets are often no longer feasible. In this course we will cover algorithmic techniques, models, and lower bounds for handling such data. A common theme is the use of randomized methods, such as sketching and sampling, to provide dimensionality reduction. In the context of optimization problems, this leads to faster algorithms, and we will see examples of this in the form of least squares regression and low rank approximation of matrices and tensors, as well as robust variants of these problems. In the context of distributed algorithms, dimensionality reduction leads to communication-efficient protocols, while in the context of data stream algorithms, it leads to memory-efficient algorithms. We will study some of the above problems in such models, such as low rank approximation, but also consider a variety of classical streaming problems such as counting distinct elements, finding frequent items, and estimating norms. Finally we will study lower bound methods in these models showing that many of the algorithms we covered are optimal or near-optimal. Such methods are often based on communication complexity and information-theoretic arguments.

References

One recommended reference book is the lecturer's monograph Sketching as a Tool for Numerical Linear Algebra.

This course was previously taught at CMU in the following links. In Fall 2020, all lectures were recorded with Panopto, which you have access to:
Fall 2017
Fall 2019
Fall 2020

Some videos from a shorter version of this course I taught are available here: videos . Note that mine start on 27-02-2017.

Materials from the following related courses might be useful in various parts of the course:
Intended audience: The course is indended for both graduate students and advanced undegraduate students with mathematical maturity and comfort with algorithms, discrete probability, and linear algebra. No other prerequisites are required.
Maintained by David Woodruff