Everything you always wanted to know about Cardinality Estimation* (*but were afraid to ask)
April 30, 2025 (GHC 8102)

Abstract: The Cardinality Estimation/Distinct Elements problem is to approximate

the number of distinct elements in a data stream using a small

probabilistic data structure called a "sketch". This problem has been

studied for 40 years, has many industrial applications, and is

featured prominently in most courses on Big Data algorithmics. It is

therefore a real puzzle to explain why research on this popular and

fundamental problem has been unusually slow.

This talk presents a complete history of the Cardinality Estimation

problem from Flajolet and Martin's seminal 1983 paper to the present,

and includes an account of how the research community became

fractured, delaying many natural developments by decades. I will

present our recent efforts to achieve information-theoretically

optimal cardinality sketches, which draws on two notions of

"information" developed in the 20th century: Fisher information

(governing optimal point estimation) and Shannon entropy (governing

optimal space/communication).

Joint work with Dingyu Wang.