CS-15-829 D
Paper Write-Up Guidelines
For each required paper on the reading list, please compose a brief write-up.
Each write-up should include a short overview and summary of the main idea of
the paper, a list of negative points, a list of positive points, and a list of
research topics for discussion. Please use the following sample write-up
as a guideline:
Access Path Selection in a Relational DBMS
by P.G. Selinger, M.M. Astrahan, D.D. Chamberlin, R.A. Lorie, T.G.
Price
One-line summary: Optimize queries by selecting from a tree of
possible query step orderings, by using statistics of existing relations,
selectivity factors that estimate how many tuples will be returned by a
predicate, and appropriately weighted CPU and disk costs of accessing tuples.
Overview/Main Points
- Degrees of freedom for optimization: Permutations of join orders and
method of joining; evaluation order among query blocks.
- Accessing tuples via RSS:
- Segment scan: scan group of disk pages, which may contain tuples from
many relations; return only those tuples from a desired relation.
- Index scan (if this relation has an index which maps keys to tuple-ID's
containing that key): scan the leaf pages of the index B-tree, then get the
tuple corresponding to each TID. The index is clustered if this can
be done without scanning any tuple page more than once.
- Scans can take a set of predicates used to filter out some tuples. A
sargable predicate is one that can be put into the form "column-name rel-op
value", e.g. "Salary < 2000".
- Overall cost being optimized is:
(page fetches) + W * (RSI I/O's), where W is a weighting factor between CPU
and disk.
- Optimizer uses the following stats, which are kept for each relation T and
index I:
- NCARD(T): cardinality of T
- TCARD(T): number of pages that hold tuples of T
- P(T): fraction of data pages in a segment that hold tuples of T, i.e.
TCARD(T)/(no. of pages in segment)
- ICARD(I): number of distinct keys in index
- NINDX(I): number of pages in index
- Selectivity factor: An estimate of what fraction of input tuples
will be returned by a relation. Table 1 shows how to estimate F for various
query elements ("column = value", "column between val1 and val2", "expr1 OR
expr2", etc.). As shown in the table, if not enough is known about the query
element to estimate its F, a "rule of thumb constant" is used (e.g. 1/10).
- QCARD(Q): Query cardinality is computed as the product of the
cardinalities of the query's FROM list and the selectivity factors of the
query block's Boolean factors.
- Number of expected RSI calls is product of relation cardinalities and
selectivity factors of the sargable predicates.
- Join ordering. A tuple order is interesting if it is
specified by the query. Some queries may specify no interesting orders (i.e.
return results unordered). To optimize join cost, need to choose between
cheapest method that produces an interesting order and cheapest method that
produces unordered result plus cost of sorting. Note that multiway joins can
be pipelined.
- Heuristic: find best join order for successively larger subsets of tables.
(Justification: once you've joined the first k relations, joining the
result to the k+1st relation is independent of the order of joining the
first k.)
- To reduce search space, only consider join orders that have join pro=edicates
relating the inner relation to other relations already joined. E.g. if
T1 and T2 are joined on column C1, and T2 and T3 are joined on column C2, then
the orderings {T1 T3 T2} and {T3 T1 T2} will be ignored.
- Cost of nested loop join or merge join is:
(Outer-cost(path1) + N * Inner-cost(path2)), where outer-cost and inner-cost
are the cost of scanning a tuple (applying all applicable predicates) from the
respective paths, and N is the product of the cardinalities and selectivity
factors of all relations joined so far (i.e. a product-of-products).
- Observation: nested-loop and merge join costs are the same formula!
Merging is sometimes better because inner-cost may be much less.
Positive points (what I like about this paper)
(provide at least 5)
- The seminal work on query optimization relying on table statistics and a
model of CPU and disk utilization.
- The "sargable predicates" and the "interesting order" are nailed down as
the most important factors for correct query optimization.
- The method covers all algorithms (sequential scan, index, join) of query
execution in a neat way.
- The method really finds the cheapest plan assuming statistics are correct.
- It is trivially extendible to incorporate new join/select algorithms
Negative points (what I don't like about this paper)
(provide at least 5)
- Cost models are too simplistic given today's multiheaded CPU's and
synchronous nonblocking I/O subsystems. (Compilers have the same problem in
trying to optimize code for hairy new architectures; seems like there would be
many parallels between query optimization and the back end of a compiler.)
- Enumeration of most join orders expensive as ntables & nmethods goes up
- Only considers left deep trees - wouldn't consider plan: (a ø b) ø (c ø d)
- Weighting factor changes over time, method for calculating it unclear.
- Cost model assumes even distributions (not always true).
- Inaccuracy of estimates may lead to a totally different path than the
correct one.
Research questions and points for discussion
(provide about 5)
- What do cost models in today's database systems take into account? Is that
enough?
- How can we refine the CPU component to incorporate cache and memory
hierarchy behavior?
- Are all I/O operations created equal? How are cost models affected?
- How can we evaluate cost without using selectivities?
-
This sample has much borrowed text written by Steve Gribble, Armando Fox,
Eric Anderson, and Anastassia Ailamaki.