A neural network is ‘trained’ by being fed a dataset, which is usually split into smaller subsets: some are used for training, and some are held out for testing. In k-fold cross-validation, the dataset is divided into k equal subsets called folds, and each fold takes a turn as the test set while the remaining folds are used for training.
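As a rough sketch of such a split (a minimal illustration in plain NumPy; the array shapes and fold count are arbitrary assumptions, not from these notes):

```python
import numpy as np

# Toy dataset: 100 samples with 3 features each (values are arbitrary).
X = np.random.rand(100, 3)
y = np.random.rand(100)

k = 5  # number of folds
indices = np.random.permutation(len(X))  # shuffle sample indices
folds = np.array_split(indices, k)

# Each fold takes a turn as the test set; the rest form the training set.
for i in range(k):
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]
    # ...train on (X_train, y_train), evaluate on (X_test, y_test)...
```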

Similar to Simulated Annealing, the neural network makes a guess and, using a loss function, measures how far its guess was from the actual value. It then adjusts its weights so as to push the overall loss towards a minimum. A common loss function is the mean squared error (MSE): the average of the squared differences between the predictions and the actual values.
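Written out, with $y_i$ the actual values, $\hat{y}_i$ the predictions, and $n$ the number of samples:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$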

There are various methods for finding the minimum of a loss function (ideally the global one); two of them, gradient descent and stochastic gradient descent, are described below.

A neural network learns at a rate set by the learning rate η (the Greek letter eta). Over many iterations (called epochs) through the training data, it gradually adjusts its weights so that the outputs map to the actual values correctly.

Gradient descent

Gradient descent is an iterative numerical method used to train neural networks. It tries to find the minimum of a multivariable loss function, which represents the error, by repeatedly adjusting the weights on the inputs in the direction that reduces the loss.
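At each step, every weight is nudged against its gradient, scaled by the learning rate η:

$$w \leftarrow w - \eta \frac{\partial L}{\partial w}$$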

Because a neural network’s final result is the activation value, a function of the weighted sum of the inputs, gradient descent uses chain-rule derivatives (expanded just after this list):

  • L = Loss Function
  • w = weight of a given input
  • A = Activation value (output)
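
With those definitions, the chain rule splits the gradient of the loss with respect to a weight into factors that are each easy to compute on their own:

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial A} \cdot \frac{\partial A}{\partial w}$$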

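Putting the pieces together, here is a minimal sketch of gradient descent for a single linear neuron with an MSE loss (the data, learning rate, and epoch count are arbitrary illustrations, not from these notes):

```python
import numpy as np

# Toy data: y is a noisy linear function of x (arbitrary illustration).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(100, 2))   # 100 samples, 2 input features
true_w = np.array([2.0, -3.0])
y = x @ true_w + rng.normal(0, 0.1, size=100)

w = np.zeros(2)   # weights, initialised to zero
eta = 0.1         # learning rate (eta)
epochs = 200      # full passes over the training data

for epoch in range(epochs):
    A = x @ w                    # activation: weighted sum of inputs
    error = A - y
    loss = np.mean(error ** 2)   # MSE loss (for monitoring)
    # Chain rule: dL/dw = dL/dA * dA/dw, averaged over all samples.
    grad = 2 * (x.T @ error) / len(y)
    w -= eta * grad              # step against the gradient

print(w)  # should end up close to [2.0, -3.0]
```
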
Gradient descent optimisation

#todo

Stochastic Gradient Descent
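
Standard (‘batch’) gradient descent computes the gradient over the whole training set before each update; stochastic gradient descent (SGD) instead updates the weights after each individual sample (or a small mini-batch), which is noisier but much cheaper per step. A minimal sketch, reusing the same toy setup as above (sample-at-a-time updates; the shuffling each epoch and the smaller learning rate are common conventions, not from these notes):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(100, 2))
y = x @ np.array([2.0, -3.0]) + rng.normal(0, 0.1, size=100)

w = np.zeros(2)
eta = 0.05  # per-sample updates often use a smaller learning rate

for epoch in range(50):
    for i in rng.permutation(len(y)):   # visit samples in random order
        A_i = x[i] @ w                  # activation for one sample
        grad = 2 * (A_i - y[i]) * x[i]  # gradient from that sample alone
        w -= eta * grad                 # update immediately

print(w)  # should end up close to [2.0, -3.0]
```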