Algorithms for Big Data, Fall 2020.

15-859: Algorithms for Big Data, Fall 2020

Instructor: David Woodruff

Lecture time: Thursdays, 16:00-19:00, on Zoom, starting Sep 10

TA: Praneeth Kacham

David's office hours: Mondays, 10:30-11:30am ET, or by appointment

Praneeth's recitation: Fridays, 8-9pm ET, on Zoom, starting Sep 11

Praneeth's office hours: Wednesdays, 8-9pm ET, or by appointment

Piazza site: piazza.com/cmu/fall2020/15859/home

Grading

Latex

Course Description

Lectures

Recitations

Problem sets

References

Grading

Grading is based on problem sets, scribing a lecture, and a presentation/project. There will be no exams. General information about the breakdown for the grading is available here: grading.pdf

Latex

Homework solutions, scribe notes, and final projects must be typeset in LaTeX. If you are not familiar with LaTeX, see this introduction.

A template for your scribe notes is here: template.tex

Lectures

Lecture 1 slides , video (Least squares regression, subspace embeddings, net arguments, matrix Chernoff)
Scribe notes 1
Scribe notes 2
Lecture 2 slides , video (Subsampled Randomized Hadamard Transform, Approximate Matrix Product, CountSketch)
Scribe notes 3
Scribe notes 4
Lecture 3 slides , video (CountSketch, Affine Embeddings, Low Rank Approximation - please also see lecture 2 slides, where we will start)
Scribe notes 5
Scribe notes 6
Lecture 4 slides , video (Low Rank Approximation continued, Sketching-based Preconditioning, Leverage Scores)
Scribe notes 7
Scribe notes 8
Lecture 5 slides , video (Coresets, Distributed Low Rank Approximation)
Scribe notes 9
Scribe notes 10
Lecture 6 slides , video (Distributed Low Rank Approximation, L1 Regression)
Scribe notes 11
Scribe notes 12
Lecture 7 slides , video (L1 Regression, Data Stream Model)
Scribe notes 13
Scribe notes 14
Lecture 8 slides , video (Estimating Norms in a Stream)
Scribe notes 15
Scribe notes 16
Lecture 9 slides , video (Heavy Hitters)
Scribe notes 17 and 18
Lecture 10 slides , video (Estimating Number of Non-Zeros, Information Theory)
Scribe notes 19
Scribe notes 20
Lecture 11 slides , video (Streaming Lower Bounds, Gaussian Width, Compressed Sensing)
Scribe notes 21
Scribe notes 22

Recitations

See links/handouts on piazza

Problem sets

Course Description

With the growing number of massive datasets in applications such as machine learning and numerical linear algebra, classical algorithms for processing such datasets are often no longer feasible. In this course we will cover algorithmic techniques, models, and lower bounds for handling such data. A common theme is the use of randomized methods, such as sketching and sampling, to provide dimensionality reduction. In the context of optimization problems, this leads to faster algorithms, and we will see examples of this in the form of least squares regression and low rank approximation of matrices and tensors, as well as robust variants of these problems. In the context of distributed algorithms, dimensionality reduction leads to communication-efficient protocols, while in the context of data stream algorithms, it leads to memory-efficient algorithms. We will study some of the above problems in such models, such as low rank approximation, but also consider a variety of classical streaming problems such as counting distinct elements, finding frequent items, and estimating norms. Finally we will study lower bound methods in these models showing that many of the algorithms we covered are optimal or near-optimal. Such methods are often based on communication complexity and information-theoretic arguments.

References

One recommended reference book is the lecturer's monograph Sketching as a Tool for Numerical Linear Algebra.

Some videos from a shorter version of this course I taught are available here: videos . Note that mine start on 27-02-2017.

This course was previously taught at CMU in both Fall 2017 and Fall 2019

Materials from the following related courses might be useful in various parts of the course:

Algorithms for Big Data 2013 Jelani Nelson (Harvard)
Algorithms for Big Data 2015 Jelani Nelson (Harvard)
Algorithmic Techniques for Massive Data: Alexandr Andoni (Columbia).
Algorithms for Big Data: Chandra Chekuri (UIUC).
Algorithms for Big Data: Grigory Yaroslavtsev (UPenn).
Algorithms for Modern Data Models: Ashish Goel (Stanford).
Data Streams Algorithms: Andrew McGregor (UMass Amherst).
Data Stream Algorithms: Amit Chakrabarti (Dartmouth).
Dealing with Massive Data: Sergei Vassilvitskii (Columbia).
I/O-algorithms: Lars Arge (Aarhus).
Mining Massive Data Sets: Jure Leskovec (Stanford).
Models of Computation for Massive Data: Jeff Phillips (Utah).
Randomized Algorithms for Matrices and Data: Michael Mahoney (UC Berkeley).
Seminar on Data Streams: Hung Ngo, Atri Rudra (Buffalo).
Sublinear algorithms: Piotr Indyk, Ronitt Rubinfeld (MIT).
Sublinear and streaming algorithms: Paul Beame (University of Washington).
The Modern Algorithmic Toolbox: Tim Roughgarden, Gregory Valiant (Stanford).
Sublinear Algorithms for Big Data: Qin Zhang (University of Indiana Bloomington)
A list of compressed sensing courses, compiled by Igor Carron.

Intended audience: The course is indended for both graduate students and advanced undegraduate students with mathematical maturity and comfort with algorithms, discrete probability, and linear algebra. No other prerequisites are required.

Maintained by David Woodruff