Query Plan

A query plan is an expression tree with relational algebra operators as internal nodes and file access paths as links. Leaf nodes are relations (tables). Query optimisation is used to reduce the cost of each operation in the tree.


Query plans are evaluated in a ‘bottom-up’ approach, with the subtrees of a node being evaluated before the node itself. This is similar to post-order traversal, except nodes are evaluated instead of just being visited.
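The bottom-up evaluation can be sketched in Python as follows. This is only an illustrative sketch: the `PlanNode` class, the `evaluate` function and the tiny relation `A` are hypothetical names invented for the example, not part of any real database engine.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, object]

@dataclass
class PlanNode:
    """Hypothetical plan node: an operator applied to its children's results."""
    name: str
    op: Callable[[List[List[Row]]], List[Row]]
    children: List["PlanNode"] = field(default_factory=list)

def evaluate(node: PlanNode, trace: List[str]) -> List[Row]:
    # Post-order: evaluate every subtree first, then the node itself.
    inputs = [evaluate(child, trace) for child in node.children]
    trace.append(node.name)
    return node.op(inputs)

# Made-up relation used as the leaf of the tree.
A = [{"name": "ann", "age": 12}, {"name": "bob", "age": 7}]

scan = PlanNode("scan(A)", lambda _inputs: list(A))
select = PlanNode(
    "select(age > 10)",
    lambda inputs: [row for row in inputs[0] if row["age"] > 10],
    [scan],
)
project = PlanNode(
    "project(name)",
    lambda inputs: [{"name": row["name"]} for row in inputs[0]],
    [select],
)

trace: List[str] = []
result = evaluate(project, trace)
```

Here `trace` records the order in which operators ran: the leaf scan first, then the selection, then the projection at the root.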

Generation

Query Blocks

The query optimiser simplifies a complex query into its basic query blocks:

A query block is any statement that starts with a SELECT operation.
The optimiser attempts to optimise innermost blocks first, that is subqueries first.

-- Outer block
SELECT name FROM A
	WHERE age IN (
		-- Inner block
		SELECT MAX(age) FROM B
	)

Conversion to Relational Algebra expressions

Each query block is then converted to its corresponding relational algebra expression:

SELECT attribute FROM S is converted to $\pi_{attribute}(S)$
WHERE condition is converted to $\sigma_{condition}(S)$
- Any WHERE condition that equates the primary keys of two relations is converted to a natural join
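As a worked illustration, take a hypothetical query (not from these notes), `SELECT name FROM A, B WHERE A.id = B.id AND A.age > 10`, and assume `id` is the primary key of both relations. Applying the rules above gives a projection over two selections on the cross product, and the key comparison then becomes a join:

```latex
\pi_{name}\big(\sigma_{A.age > 10}(\sigma_{A.id = B.id}(A \times B))\big)
\;\equiv\;
\pi_{name}\big(\sigma_{A.age > 10}(A \bowtie_{A.id = B.id} B)\big)
```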

Relational algebra equivalences are used to control the order of operations:

Equivalences

Automatic Conversions

Cross Product + Selection over equal ids = Natural Join:

$$\sigma_{A.id = B.id}(A \times B) \equiv A \bowtie_{A.id = B.id} B$$

Selection is distributed over joins (to reduce join cost) when possible:

$$\sigma_{A.age > 10}(A \bowtie_{A.id = B.id} B) \equiv (\sigma_{A.age > 10}(A)) \bowtie_{A.id = B.id} B$$
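A small Python sketch can make the benefit of pushing the selection down concrete. The relations `A` and `B`, the helper names and the naive nested-loop join are all assumptions made for illustration. Both plans return the same rows, but the early selection feeds fewer tuples into the join.

```python
# Made-up relations: four tuples each, joined on id.
A = [{"id": i, "age": age} for i, age in enumerate([5, 12, 30, 8])]
B = [{"id": i, "city": city} for i, city in enumerate(["x", "y", "z", "w"])]

def join(left, right):
    # Naive nested-loop equi-join on id (the shared id is kept once).
    return [{**l, **r} for l in left for r in right if l["id"] == r["id"]]

def select_age(rows):
    # The selection age > 10.
    return [row for row in rows if row["age"] > 10]

# Plan 1: join first, then select.
late = select_age(join(A, B))

# Plan 2: select on A first, then join the smaller input.
filtered_A = select_age(A)
early = join(filtered_A, B)

# Tuple pairs the nested-loop join has to compare in each plan.
comparisons_late = len(A) * len(B)
comparisons_early = len(filtered_A) * len(B)
```

With this data the join in the second plan compares half as many tuple pairs (8 instead of 16) while producing an identical result.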

Projection is distributed over joins when possible. The inner projections must keep the join attributes (here $id$) alongside the requested ones, and the final projection is applied on the outside:

$$\pi_{name}(A \bowtie_{A.id = B.id} B) = \pi_{name}\left[(\pi_{name, id}(A)) \bowtie_{A.id = B.id} (\pi_{name, id}(B))\right]$$

Cost Estimation

For every generated query plan, the cost estimator must estimate:

The size of the result for each operation in the Query Plan Tree (in pages)
The cost of each operation in the tree. See query optimisation for various methods

It uses catalogues to do this:

For each block, the maximum size of the result is given by the product of the number of tuples of each relation in the FROM clause. This maximum is reached when a Cartesian product (cross product) is performed.

$$\max(n_{tuples}) = n_{tuples}(A) \cdot n_{tuples}(B) \cdot \dots$$

Like in query optimisation, we can use the Reducing Factor:

When selecting from a single table $A$ , with $k$ predicates, we have the estimated size:

$$\text{Estimated Size} = n_{tuples}(A) \cdot \prod_{i=1}^{k} RF_{i}$$

When the query joins $m$ tables $T_{1}, \dots, T_{m}$, we have:

$$\text{Estimated Size} = \left(\prod_{i=1}^{m} n_{tuples}(T_{i})\right) \cdot \left(\prod_{i=1}^{k} RF_{i}\right)$$

Note that there are no reducing factors if no selections are performed.
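The size estimate can be sketched as a short Python function. The function name and the sample numbers are illustrative assumptions. `Fraction` is used only to keep the arithmetic exact.

```python
from fractions import Fraction
from math import prod

def estimated_size(tuple_counts, reducing_factors=()):
    """Product of the tuple counts, scaled by every reducing factor.

    With no reducing factors the empty product is 1, so the estimate
    equals the maximum size (the full cross product).
    """
    return prod(tuple_counts) * prod(reducing_factors)

# Two made-up relations with 1000 and 500 tuples.
max_size = estimated_size([1000, 500])

# The same relations with two predicates of RF 1/10 and 1/100.
size = estimated_size([1000, 500], [Fraction(1, 10), Fraction(1, 100)])
```

For these made-up numbers the maximum size is 500000 tuples and the estimate after both predicates is 500 tuples.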

Reducing Factor

| Predicate | $RF =$ | Why? |
| --- | --- | --- |
| $attr = val$ | $\frac{1}{n_{keys}(attr)}$ | $n_{keys}$ is the number of distinct values. If we match against a certain key, then only elements equal to that key are returned. |
| $attr > val$ | $\frac{attr_{max} - val}{attr_{max} - attr_{min}}$ | The domain of values is $[attr_{min}, attr_{max}]$. Since we only match the part of the domain from $val$ to $attr_{max}$, we take that fraction of the whole domain. |
| $attr < val$ | $\frac{val - attr_{min}}{attr_{max} - attr_{min}}$ | Opposite of the above. |
| $A.attr = B.attr$ (Joins) | $\frac{1}{\max(n_{keys}(A_{attr}), n_{keys}(B_{attr}))}$ | |
| Else | A magic number, $\frac{1}{10}$ | Chosen only when no other option is available; not very accurate. |
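The table translates fairly directly into code. The sketch below is one possible transcription. The function names are made up for illustration, and the catalogue statistics (distinct-key counts, minimum and maximum values) are assumed to be available as plain integers.

```python
from fractions import Fraction

def rf_equals(n_keys):
    # attr = val: one distinct value out of n_keys.
    return Fraction(1, n_keys)

def rf_greater(val, attr_min, attr_max):
    # attr > val: the part of the domain from val up to attr_max.
    return Fraction(attr_max - val, attr_max - attr_min)

def rf_less(val, attr_min, attr_max):
    # attr < val: the part of the domain from attr_min up to val.
    return Fraction(val - attr_min, attr_max - attr_min)

def rf_join(n_keys_a, n_keys_b):
    # A.attr = B.attr: governed by the side with more distinct values.
    return Fraction(1, max(n_keys_a, n_keys_b))

# Fallback "magic number" when nothing better is known.
RF_DEFAULT = Fraction(1, 10)

# Example: age ranges over [0, 100] and the predicate is age > 30.
rf_age = rf_greater(30, 0, 100)
```

For the example predicate the reducing factor is 7/10, and the matching `rf_less(30, 0, 100)` is 3/10, so the two range predicates together cover the whole domain.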

Single-Relation Queries Cost (No Joins)

When querying over a single relation/table, the query optimiser tries multiple access paths to obtain the final result:

Linearly scan through the unsorted data, checking all predicates on each tuple. $Cost = n_{pages}$
Indexed searching over a primary key
- Either search in a B+ tree for tree-based indices. $Cost = h + 1$, where $h$ is the height of the tree
- Or search through the hashmap for hash-based indices.
Clustered index matching one or more predicates. $I/O\ Operations = (i_{pages} + n_{pages}) \cdot \prod_{i = 1}^{k} RF_{i}$
Unclustered index matching one or more predicates. $I/O\ Operations = (i_{pages} + n_{tuples}) \cdot \prod_{i = 1}^{k} RF_{i}$
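To compare the access paths, the cost formulas can be evaluated side by side. The sketch below uses invented statistics and hypothetical function names. It also assumes the unclustered formula counts tuples rather than pages, since each matching tuple may live on a different page.

```python
from fractions import Fraction
from math import prod

def full_scan_cost(n_pages):
    # Read every data page once.
    return n_pages

def btree_key_lookup_cost(height):
    # Walk the tree (h pages), then read the one data page.
    return height + 1

def clustered_index_cost(index_pages, n_pages, rfs):
    return (index_pages + n_pages) * prod(rfs)

def unclustered_index_cost(index_pages, n_tuples, rfs):
    # Assumption: worst case is one I/O per matching tuple.
    return (index_pages + n_tuples) * prod(rfs)

# Invented catalogue statistics for one relation.
n_pages, n_tuples, index_pages = 1000, 100_000, 50
rfs = [Fraction(1, 100)]  # a single predicate with RF = 1/100

costs = {
    "full scan": full_scan_cost(n_pages),
    "clustered index": clustered_index_cost(index_pages, n_pages, rfs),
    "unclustered index": unclustered_index_cost(index_pages, n_tuples, rfs),
}
cheapest = min(costs, key=costs.get)
```

With these numbers the clustered index costs about 10.5 I/Os, while the unclustered index (about 1000.5 I/Os) is slightly worse than a full scan of 1000 pages, which is why the optimiser has to cost every path rather than assume an index always helps.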

Multi-relation Queries Cost (Joins)

For join operations:

Select the order of joining (since join is commutative and associative). $A \times B \times C$, $C \times A \times B$, etc. There are $n!$ orderings (where $n$ is the number of relations)
For each join, select a joining algorithm. Commonly used are Hash Join, Sort-Merge Join, and Block-Based Nested Loop Join
For each relation, treat as a single-relation query and use access methods as above

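In total, the number of possible query plans is:

$$\#\text{Query Plans} = n! \times 3^{n - 1} \times (i + 1)^{n}$$

Here $n!$ counts the join orderings and $3^{n-1}$ the choice among three join algorithms for each of the $n - 1$ joins. The factor $(i + 1)^{n}$ counts the access paths per relation, presumably $i$ available indices plus one full scan.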
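A quick calculation shows how fast this count grows. The function below is a direct transcription of the formula. Its name and the sample values of `n` and `i` are illustrative assumptions.

```python
from math import factorial

def count_query_plans(n, i, join_algorithms=3):
    """n! join orders * algorithms^(n-1) * (i + 1)^n access-path choices."""
    return factorial(n) * join_algorithms ** (n - 1) * (i + 1) ** n

small = count_query_plans(n=3, i=1)  # 6 * 9 * 8
large = count_query_plans(n=5, i=2)  # 120 * 81 * 243
```

Going from three relations with one access-path option beyond the scan to five relations with two raises the count from 432 to 2361960.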

Since there are $n!$ possible join orderings, the possible configurations for the query plan tree increase very quickly. As such, a property is enforced on the tree, namely that it must be a left-deep join tree. This also removes the need for any auxiliary arrays, since each join result can be passed directly into the next join.
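The shape of a left-deep tree can be illustrated by enumerating the plans for a few relations. In this sketch a plan is just a nested tuple `(left, right)`. The representation and the function name are assumptions made for the example.

```python
from functools import reduce
from itertools import permutations

def left_deep_plans(relations):
    """One left-deep tree per join order: ((R1 join R2) join R3) ...

    The right child of every join is always a base relation, so the
    running result on the left is the only intermediate input.
    """
    return [
        reduce(lambda left, right: (left, right), order)
        for order in permutations(relations)
    ]

plans = left_deep_plans(["A", "B", "C"])
first_plan = plans[0]
```

For three relations this yields 3! = 6 plans, the first being `(("A", "B"), "C")`. Bushy shapes such as `(("A", "B"), ("C", "D"))` are never generated.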

#todo Examples:

Questionably Accurate Notes
