Context-free grammar (CFG) are a recursive set of rules that generate a language, similar to their weaker counterpart, regular expressions. Unlike regular expressions, they are capable of generating non-regular languages, such as the ubiquitous language.
CFGs are used in many markdown file formats, such as JSON and XML. In fact, the definition of regular expression is actually a CFG. The ability to recursively define a set of rules gives CFGs much more power than regular expressions.
Definition
Context-free Grammar (CFG) ^definition
A context-free grammar is defined as a 4-tuple:
- = An alphabet. In the context of CFGs, this is also referred to a set of terminals (since they do not recursively depend on anything, think of base cases in recursion
- = A finite set of variables (also called non-terminals, for obvious reasons). These act like recursive expressions.
- = A finite set of production rules, written in the form . takes in a variable and returns a string containing terminals (belonging to ) and/or more variables (belonging to ).
- can (and usually is) self-referential. A production rule is valid.
- Another common notation used is that when we use a production rule, on a string to get , we can write it as :
- = The start variable
Note the similarities between production rules and logical substitution, with the only difference being that a logical substitution applies to the entire formula, replacing every instance, while a production rule replaces once instance at a time.
Derivation ^definition-derivation
Let and be two strings containing variables and/or terminals (). Also let be similar strings containing variables and/or terminals. Then derives iff:
z
x \Rightarrow^* z
* We say that $c$ is *derived from* $p$ * The derivation must be *finite* (as in $y_{k}$ is defined for some natural number) * Note the similarities to [resolution proofs](Resolution%20Proof.md)
- Strings in are known to be in sentential form (because they still need to recursed further, so not quite a sentence)
- A string in is a sentence (since it contains all terminals).
Language of a CFG ^definition-language
The language, generated by a CFG, (with start variable ) is given by:
Ambiguity
Ambiguity
A string is ambiguous with respect to a grammar, , iff has more than one parse tree for . That is, the parse tree for is not unique.
A grammar, is ambiguous if it contains at least one ambiguous string.
Backus–Naur Form (Shorthand Notation)
A way to simplify writing the production rules, , is by using the Backus-Naur Form, where:
Becomes:
A further simplification is to only define the production rules for a CFG, because anything on the left of the arrow must be a variable, and anything on the right (that isn’t already a variable), must belong to the alphabet.
For example:
results in:
Generating A Language
Generating a language from a CFG is very straightforward. Let be a (initially empty) string and
- , set the string to be the start variable of
- While contains at least one variable:
- Pick any one production rule from and apply it to one instance/occurrence of the variable. If contains multiple variable, pick one. For example, if a variable is encountered and contains , replace one occurrence of in with .
- When has no more variables (contains only terminals), stop.
Applying every permutation production rules across every variable occurrence will generate a langage from .
Parse Tree
Context-free grammars can easily be depicted by an expression tree, due to its recursive nature.
The expression tree is a rooted tree where
- An expression tree for a CFG must have the start variable, as the root.
- Each node is either a terminal or variable.
- Each non-leaf node and its children is associated with a production rule. In particular, if the production rule is A→wA→w then AA is the label of the non-leaf node and the symbols in ww are the labels of the children, in left to right order.
- Each leaf is a terminal. The concatenation of the leaf labels, in left to right order, gives the string
For example, take the CFG defining arithmetic expressions. The expression tree (for a given expression) would be of the form:
%%🖋 Edit in Excalidraw, and the dark exported image%%
Examples
E1: Arithmetic BODMAS as a CFG ^e1
Any arithmetic expression can easily be expressed as a CFG:
- needs to also include our operators and brackets, since being a language, we have no notion of ‘evaluating’ an expression (no semantics), but we have a very strict syntax.
- : For defined arthmatic operators versus expressions
R:\quad\text{op} \to + \ \vert \ - \ \vert \ \times \ \vert \ \div \ \vert \ \quad\quad \text{expr} \to 0 \ \vert \ 1 \ \vert \ \dots \ \vert \ 9 \ \vert \ \text{(expr)}\ \vert \ \text{expr op expr}
* $S = \text{expr}$
E2: CFG for equal number of 0s and 1s in order
The context-free language given below can be expressed as a CFG:
We can express this using only the production rules: