Abstract: The Cardinality Estimation/Distinct Elements problem is to approximate
the number of distinct elements in a data stream using a small
probabilistic data structure called a "sketch". This problem has been
studied for 40 years, has many industrial applications, and is
featured prominently in most courses on Big Data algorithmics. It is
therefore a real puzzle to explain why research on this popular and
fundamental problem has been unusually slow.
This talk presents a complete history of the Cardinality Estimation
problem from Flajolet and Martin's seminal 1983 paper to the present,
and includes an account of how the research community became
fractured, delaying many natural developments by decades. I will
present our recent efforts to achieve information-theoretically
optimal cardinality sketches, which draws on two notions of
"information" developed in the 20th century: Fisher information
(governing optimal point estimation) and Shannon entropy (governing
optimal space/communication).
Joint work with Dingyu Wang.