y query plan is an expression tree with relational algebra operators as internal nodes and file access paths as links. Leaf nodes are databases. Query optimisation is used to reduce costs of each operation on the tree.
%%🖋 Edit in Excalidraw, and the dark exported image%%
They are evaluated i n a ‘bottom-up’ approach, with the subtrees of a node being evaluated before the node itself. This is similar to post-order traversal, except nodes are evaluated instead of just being visited.
Generation
Query Blocks
The query optimiser simplifies a complex query into it’s basic query blocks:
- A query block is any statement that starts with a
SELECT
operation. - The optimiser attempts to optimise innermost blocks first, that is subqueries first.
---Outer block
SELECT name FROM A
WHERE age IN
--Inner block
SELECT MAX(age) FROM B
Conversion to Relational Algebra expressions
Each query block is then converted to it’s corresponding relational algebra expression:
SELECT attribute FROM S
is converted toWHERE conditiion
is converted to- Any where condition which compares primary keys gets converted to a natural join
Relational algebra equivalences are used to control the order of operations:
Automatic Conversations
- Cross Product + Selection over equal ids = Natural Join:
- Selection is distributed over joins (to reduce join cost) when possible:
- Projection is distributed over joins when possible. Ensure that the subset of attributes to be projected is in the ‘outer’ loop:
Cost Estimation
For every generated query plan, the cost estimator must estimate:
- The size of the result for each operation in the Query Plan Tree (in pages)
- The cost of each operation in the tree. See query optimisation for various methods
It uses catalogues to do this:
For each block, the maximum size of the result is given by the product of the number of tuples in the FROM
clause. The maximum is when a cartesian product (cross product) is performed.
Like in query optimisation, we can use the Reducing Factor:
When selecting from a single table , with predicates, we have the estimated size:
When combining it with joins, of tables, we have:
Note that there are no reduction factors if there is no selections being performed.
Reducing Factor
Predicate | Why? | |
---|---|---|
is the number of distinct values If we match against a certain key, then only elements which are equal to that key are returned. | ||
The domain of values is . Since we only match the fraction of the entire domain from to , we do the final fraction | ||
Opposite of the above | ||
(Joins) | ||
Else | A magic number, | Chosen only when no other option is available, is not very accurate |
Single-Relation Queries Cost (No Joins)
When querying over a single relation/table, the query optimiser tries multiple access paths to obtain the final result:
- Linearly scan through the unsorted data, checking all predicates at each instance.
- Indexed searching over a primary key
- Either search in b+ tree for tree-based indices.
- Or search through the hashmap for hash-based indices.
- Clustered Index matching one or more predicates.
- Unclustered index matching one or more predicates.
Multi-relation Queries Cost (Joins)
For join operations:
- Select the order of joining (since join is commutative). There are orderings (where is the number of relations)
- For each join, select a joining algorithm. Commonly used are Hash Sort, Sort-Merge Join, and Block-Based Nested Loop Join
- For each relation, treat as a single-relation query and use access methods as above
In total, the number of possible query plans are:
Since there are possible join orderings, the possible configurations for the query plan tree increase very quickly. As such, a property is enforced on the tree, namely that it becomes a left-deep join tree. This also removes the need for any auxiliary arrays
%%🖋 Edit in Excalidraw, and the dark exported image%%
#todo Examples: