To efficiently process a query, an RDBMS needs to be able to optimise certain relational operators, which are the backbone of every query. The main operations it needs to perform and optimise are selection, projection and joins.
We measure the cost of every operation in the same basic way: by the number of page I/O operations (reads and writes of pages between disk and RAM).
Also note that:
- $N$ = Number of records in the table
- $B$ = Number of pages in the table
Selection
Recall what a selection is: $\sigma_{\theta}(R)$ returns every record of $R$ that satisfies the condition (predicate) $\theta$.
In order to optimise a selection, a DBMS needs to know:
- All available indexes/pointers to records (to actually find the records)
- The expected size of the result, i.e. the number of pages (to know when to stop looking)
Reducing Factor
A DBMS uses its optimiser to estimate the expected size using reducing factors: one fraction per predicate, where a predicate is a basic relational statement (i.e. it involves only one relational comparator). The estimated result size is
$$\left\lceil B \cdot \prod_{i=1}^{k} \text{RF}_i \right\rceil \text{ pages}$$
where:
- $\text{RF}_i$: Reducing factor (also called selectivity), a fraction that states how many records satisfy the associated predicate
- $k$: Number of predicates / basic relational statements
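As a small worked example with made-up numbers: suppose a table has $B = 1{,}000$ pages and the selection has two predicates with reducing factors $\text{RF}_1 = 0.1$ and $\text{RF}_2 = 0.5$. The optimiser then estimates
$$\left\lceil 1{,}000 \cdot 0.1 \cdot 0.5 \right\rceil = 50 \text{ pages}$$
of matching records.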
Once the estimated size is obtained, the DBMS can search through the database file structure to find records matching the selection criteria:
Searching: Heap Files
For heap files (unsorted files), the DBMS needs to linearly scan through every single record so it doesn't miss anything.
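Under the page-I/O cost model above, a heap-file selection therefore costs a full scan regardless of the reducing factor:
$$\text{Cost} = B$$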
Searching: Sorted Files
For sorted files, the DBMS can simply binary search to find the first record matching the condition, then iterate until the estimated number of matching records has been scanned.
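Assuming the cost model above, the binary search touches about $\lceil \log_2 B \rceil$ pages and the subsequent scan touches roughly the estimated number of matching pages:
$$\text{Cost} \approx \left\lceil \log_2 B \right\rceil + \left\lceil \text{RF} \cdot B \right\rceil$$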
Searching: Indexed Files
First, the DBMS needs to traverse the B+ tree (or use the hash table, if the index is hash-based) to find the index entries that satisfy the criteria.
Hash-based Indexes Only Work On Equality Checks!
Then, linearly iterate through the matching index entries (the leaf nodes of a B+ tree are doubly linked, so they can be scanned sequentially). Note that the index entries are stored in their own pages, which need to be brought into RAM. Let $B_I$ be the number of pages occupied by the index entries.
If the index is clustered, then the total cost is:
$$\text{Cost} \approx \left\lceil \text{RF} \cdot (B_I + B) \right\rceil$$
Why? Every index entry points to a record, so if only $\lceil \text{RF} \cdot N \rceil$ records satisfy the criteria, then we only need to iterate through $\lceil \text{RF} \cdot N \rceil$ index entries (roughly $\lceil \text{RF} \cdot B_I \rceil$ index pages). For each entry, we need to load up the page it points to, but because the index is clustered, many of the entries point to the same page, so only about $\lceil \text{RF} \cdot B \rceil$ data pages are read.
If the index is unclustered, then the total cost is:
$$\text{Cost} \approx \left\lceil \text{RF} \cdot (B_I + N) \right\rceil$$
This is because the index lookup does not guarantee that matching entries point to records in the same page (due to the unclustered nature), so in the worst case we need one I/O operation per matching record ($\lceil \text{RF} \cdot N \rceil$ records in total).
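To see how much clustering matters, take some made-up numbers: $N = 10{,}000$ records, $B = 100$ data pages, $B_I = 20$ index pages and $\text{RF} = 0.1$:
$$\text{clustered: } \left\lceil 0.1 \cdot (20 + 100) \right\rceil = 12 \text{ I/Os} \qquad \text{unclustered: } \left\lceil 0.1 \cdot (20 + 10{,}000) \right\rceil = 1{,}002 \text{ I/Os}$$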
Composed Selection Criteria
In most cases, the selection criteria contain multiple predicates involving multiple attributes/columns.
A tree-based index can only match predicates that involve attributes forming a prefix of the search key fields. What this means is that if we have a composed selection criteria of multiple predicates, the index is only useful if at least one predicate involves a prefix of the search key field(s).
Predicates must involve the search key fields to be optimised
If we have an index on column $a$, we can only optimise selections if they have predicates involving $a$.
On-the-fly checking means checking predicates that are not part of the search key fields. It does not add to the I/O cost, because the page is already loaded in memory when the key-field predicates are checked.
In terms of reduction factors, we only have them if the predicates involve search keys:
Example: Five different selections
Assume we have a table containing columns $a$, $b$, $c$, with an unclustered tree-based index on the columns $\langle a, b \rangle$ (i.e. our search key is $\langle a, b \rangle$). Look at the following selection criteria:
$$\sigma_{a=5} \qquad \sigma_{a=5 \wedge b=3} \qquad \sigma_{b=3} \qquad \sigma_{a=5 \wedge c=1} \qquad \sigma_{c=1}$$
The size estimations would be as follows:
- $\sigma_{a=5}$: Reduction factor involving $a$, because $\langle a \rangle$ is a prefix of $\langle a, b \rangle$
- $\sigma_{a=5 \wedge b=3}$: Reduction factors involving both $a$ and $b$, because $\langle a, b \rangle$ is a prefix of $\langle a, b \rangle$
- $\sigma_{b=3}$: No reduction factors, because $\langle b \rangle$ is not a prefix of $\langle a, b \rangle$. Would need to perform a full table scan
- $\sigma_{a=5 \wedge c=1}$: Reduction factor only involves $a$, and $c = 1$ needs to be checked on the fly
- $\sigma_{c=1}$: No reduction factors, because $c$ is not in $\langle a, b \rangle$
Projection
Recall what a projection is: $\pi_{a_1, \ldots, a_n}(R)$ keeps only the columns $a_1, \ldots, a_n$ of each record in $R$, and eliminates any duplicate records that result.
Notice the duplicate elimination? That's the expensive part, and it is what the DBMS aims to optimise: eliminating duplicates.
Projecting Factor
Like the reducing factor for selection, the projecting factor is the ratio of projected columns to total columns, which tells us what fraction of the attributes is being kept. The DBMS uses this to estimate how much space the final output will take.
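Under the column-count definition above, with made-up numbers:
$$\text{PF} = \frac{\#\text{projected columns}}{\#\text{total columns}}, \qquad \text{projected size} \approx \left\lceil \text{PF} \cdot B \right\rceil \text{ pages}$$
For example, projecting 2 of 8 columns from a $B = 1{,}000$ page table gives $\text{PF} = 0.25$, i.e. roughly 250 pages of projected data.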
Duplicate elimination can be done either by sorting the data (which guarantees that duplicates end up next to each other) or by using hash collisions (duplicates hash to the same value).
The main complication with sorting is that the data stored on disk is usually much larger than the available space in RAM (where it must be processed). To solve this, we use an external sorting algorithm.
External Merge
The main method for this is external merge sort: read the data $M$ pages at a time (where $M$ is the number of available RAM pages), sort each chunk in memory and write it back out as a sorted 'run', then repeatedly merge up to $M - 1$ runs at a time until only one sorted run remains.
In terms of I/O operations, external merge sort over $B$ pages, with $M$ pages of RAM available, has a cost of:
$$2B \cdot \left(1 + \left\lceil \log_{M-1} \left\lceil \tfrac{B}{M} \right\rceil \right\rceil \right)$$
There is a factor of two because we need to both read and write every page on each sorting pass.
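As a worked example with made-up numbers, sorting $B = 1{,}000$ pages with $M = 10$ buffer pages: pass 0 produces $\lceil 1{,}000 / 10 \rceil = 100$ sorted runs, and each merge pass combines up to $M - 1 = 9$ runs, so $\lceil \log_9 100 \rceil = 3$ merge passes are needed. That is $1 + 3 = 4$ passes in total, each reading and writing every page:
$$2 \cdot 1{,}000 \cdot 4 = 8{,}000 \text{ I/Os}$$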
Hash-based
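A minimal sketch of the hashing idea (not necessarily the exact variant these notes intend): partition the projected records by a hash of their values, so duplicates always land in the same partition, then eliminate duplicates within each partition, which is small enough to process in RAM. The partition count and data below are made up for illustration.

```python
from collections import defaultdict

def hash_based_projection(records, columns, num_partitions=4):
    """Project `records` (dicts) onto `columns` and remove duplicates via hashing."""
    # Phase 1: project each record and partition by a hash of the projected values.
    # In a real DBMS each partition would be written out to disk.
    partitions = defaultdict(list)
    for record in records:
        projected = tuple(record[c] for c in columns)
        partitions[hash(projected) % num_partitions].append(projected)

    # Phase 2: deduplicate each partition independently (each fits in RAM);
    # identical tuples always hash to the same partition, so no duplicates survive.
    result = []
    for partition in partitions.values():
        result.extend(set(partition))
    return result

# Example usage with made-up data:
rows = [{"name": "Ann", "dept": "CS"}, {"name": "Bob", "dept": "CS"},
        {"name": "Ann", "dept": "CS"}]
print(hash_based_projection(rows, ["dept"]))  # [('CS',)] -- duplicates removed
```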
Operation Cost
The total cost to remove duplicates with external merge sort is the sum of:
- Reading the entire table, keeping only the attributes that need to be projected: $B$
- Writing the pages with projected attributes back to disk: $\lceil \text{PF} \cdot B \rceil$
- Sorting the pages with projected attributes using external merge sort: $2 \cdot \lceil \text{PF} \cdot B \rceil \cdot \left(1 + \left\lceil \log_{M-1} \left\lceil \tfrac{\text{PF} \cdot B}{M} \right\rceil \right\rceil \right)$
- Reading the sorted projected pages to discard adjacent duplicates: $\lceil \text{PF} \cdot B \rceil$
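Continuing the made-up numbers from above ($B = 1{,}000$, $\text{PF} = 0.25$, $M = 10$), the projected data occupies $250$ pages and sorting it takes $1 + \lceil \log_9 \lceil 250/10 \rceil \rceil = 3$ passes:
$$1{,}000 + 250 + 2 \cdot 250 \cdot 3 + 250 = 3{,}000 \text{ I/Os}$$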
Joins
Joins are generally very expensive to perform, and in the worst case result in the cross product. Out of all the possible joins, the natural join is the most commonly queried one, and hence needs to be the most optimised. An RDBMS can implement a join in many ways:
- Nested-loop joins
	- Simple nested-loop join
	- Page-based nested-loop join
	- Block nested-loop join
- Sort-merge join
- Hash join
Simple Nested-Loop Join
The most straightforward method of performing a natural join over unsorted files is using a nested for-loop:
Algorithm SimpleNestedLoopJoin(A: Table, B: Table):
	For each tuple a in A do:
		For each tuple b in B do:
			If a.attribute == b.attribute then:
				Add (a, b) to the Join
			End If
		End do
	End do
	Return Join
End Algorithm
The final join is unsorted.
Total cost, where $B_A$ and $N_A$ are the number of pages and records of the outer table $A$, and $B_B$ is the number of pages of the inner table $B$:
$$\text{Cost} = B_A + N_A \cdot B_B$$
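For a rough sense of scale, take made-up sizes $B_A = 100$, $N_A = 10{,}000$ and $B_B = 50$:
$$100 + 10{,}000 \cdot 50 = 500{,}100 \text{ I/Os}$$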
Page-based Nested-Loop Join
Page-based nested-loop joins exploit the fact that multiple tuples are stored in a single page: we load a page of each table and compare every pair of tuples in the two loaded pages before moving on, which minimises the number of I/O operations.
Algorithm PageBasedNestedLoopJoin(A: Table, B: Table):
	For each page a_page in A do:
		For each page b_page in B do:
			For each tuple a in a_page do:
				For each tuple b in b_page do:
					If a.attribute == b.attribute then:
						Add (a, b) to the Join
					End If
				End do
			End do
		End do
	End do
	Return Join
End Algorithm
The final join is unsorted.
Because we measure our total runtime in terms of page accesses, once we access a page, any tuples in it do not add extra cost. Hence the final cost is:
$$\text{Cost} = B_A + B_A \cdot B_B$$
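With the same made-up sizes as before ($B_A = 100$, $B_B = 50$):
$$100 + 100 \cdot 50 = 5{,}100 \text{ I/Os}$$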
Block Nested-Loop Join
A further optimisation is to exploit as much of the available main memory (RAM) as possible. The block nested-loop join first reserves two pages of RAM as an input buffer and an output buffer, to reduce the number of I/O operations.
Assume we are treating $A$ as the outer loop table, and $B$ as the inner loop table.
If the maximum capacity of the RAM (in pages) is $M$, then the block-based approach reads $(M-2)$-page 'blocks' of $A$, comparing each block against the input buffer holding a page of $B$. Whenever the join condition is satisfied, the result is written to the output buffer page.
- Whenever the current block has been compared against all of $B$, read in a new block (of $A$) from disk
- Whenever the input buffer has been fully compared, read in a new page (of $B$) from disk
- Whenever the output buffer is full, flush it (write it to disk)
The total number of I/O operations is:
$$\text{Cost} = B_A + \left\lceil \frac{B_A}{M - 2} \right\rceil \cdot B_B$$
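With the same made-up sizes and, say, $M = 12$ buffer pages (so blocks of $M - 2 = 10$ pages):
$$100 + \left\lceil \frac{100}{10} \right\rceil \cdot 50 = 600 \text{ I/Os}$$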
Sort-Merge Join
The sort-merge join is an improvement over nested-loop joins because, once $A$ and $B$ are sorted on the join attribute, there is no need to check all of $B$ for a single tuple of $A$: as soon as we reach a value of $B$ greater than the current value of $A$, every later value of $B$ also fails to satisfy the condition, so we can stop checking.
The 'merging' operation simply takes the matching tuples of $A$ and $B$ and merges them to form the join $A \bowtie B$.
$A$ is always scanned only once, and $B$ is (approximately) scanned once. Technically, parts of $B$ would be scanned multiple times when $A$ has duplicate join values, but the amortised cost is that of scanning $B$ only once.
The total I/O cost is:
$$\text{Cost} = \text{Cost}(\text{sorting } A) + \text{Cost}(\text{sorting } B) + B_A + B_B$$
Assuming we use external merge sort:
$$\text{Cost} = 2B_A\left(1 + \left\lceil \log_{M-1} \left\lceil \tfrac{B_A}{M} \right\rceil \right\rceil\right) + 2B_B\left(1 + \left\lceil \log_{M-1} \left\lceil \tfrac{B_B}{M} \right\rceil \right\rceil\right) + B_A + B_B$$
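With the same made-up sizes and $M = 12$ buffer pages, each table sorts in two passes ($\lceil 100/12 \rceil = 9$ and $\lceil 50/12 \rceil = 5$ initial runs, each mergeable in a single pass):
$$2 \cdot 100 \cdot 2 + 2 \cdot 50 \cdot 2 + 100 + 50 = 750 \text{ I/Os}$$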
Hash Join
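A minimal in-memory sketch of the idea: build a hash table on the join attribute of the smaller table, then probe it with each tuple of the larger one. A real disk-based hash join first partitions both tables on a hash of the join attribute so each partition fits in RAM; the attribute and data below are made up for illustration.

```python
from collections import defaultdict

def hash_join(smaller, larger, attribute):
    """Equi-join two lists of dicts on `attribute` by building a hash table
    on the smaller input and probing it with the larger one."""
    # Build phase: hash every tuple of the smaller table by its join value.
    table = defaultdict(list)
    for s in smaller:
        table[s[attribute]].append(s)

    # Probe phase: each tuple of the larger table is only compared against
    # tuples that share its hash bucket, instead of against the whole table.
    result = []
    for l in larger:
        for s in table.get(l[attribute], []):
            result.append({**s, **l})
    return result

# Example usage with made-up data:
depts = [{"dept": "CS", "building": "168"}]
staff = [{"name": "Ann", "dept": "CS"}, {"name": "Bob", "dept": "Maths"}]
print(hash_join(depts, staff, "dept"))  # [{'dept': 'CS', 'building': '168', 'name': 'Ann'}]
```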
General Join Conditions
For composed join conditions that consist of only equality checks, i.e. equi-joins, we can use any of the join techniques, with sort-merge join and hash join giving the best runtimes:
- For sort-merge join, sort over the combination of columns used in the join. For example, for the join $A \bowtie_{A.x = B.x \,\wedge\, A.y = B.y} B$, we sort $A$ over the combination of $(x, y)$ (this can be done by string concatenation, for example), and likewise sort $B$ over $(x, y)$.
- For hash join, we partition over the same combination of columns.
For joins containing inequalities in their join conditions, i.e. theta-joins, we generally cannot use hash join or sort-merge join! The best option is to use a block nested-loop join.