Notes on Neural Networks and Deep Learning

These notes cover the first course of the Deep Learning Specialization on Coursera, taught by Andrew Ng.

Definitions

Forward Propagation

For the $l^\text{th}$ hidden layer, forward propagation is

$$Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}, \qquad A^{[l]} = g^{[l]}\left( Z^{[l]} \right),$$

and the cost function is

$$J = \frac{1}{m} \sum_{i=1}^{m} L_i\left( \hat{y}^{(i)}, y^{(i)} \right),$$

where $L_i$ is the loss for the $i$th instance.
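
As a rough illustration of the two formulas above, here is a minimal NumPy sketch of one forward step and of the cost, assuming a sigmoid activation and the binary cross-entropy loss used in the course; the function names and signatures are illustrative, not taken from the course assignments.

```python
import numpy as np

def sigmoid(z):
    """Element-wise sigmoid activation (an assumption; any activation g works)."""
    return 1.0 / (1.0 + np.exp(-z))

def linear_activation_forward(A_prev, W, b):
    """Compute Z^[l] = W^[l] A^[l-1] + b^[l] and A^[l] = g(Z^[l])."""
    Z = W @ A_prev + b      # shape (n_l, m)
    A = sigmoid(Z)          # element-wise activation
    return Z, A

def compute_cost(A_K, Y):
    """Average the per-instance losses L_i over the m training instances."""
    m = Y.shape[1]
    losses = -(Y * np.log(A_K) + (1 - Y) * np.log(1 - A_K))
    return float(np.sum(losses) / m)
```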

Backward Propagation

In the backward propagation process, we care about how a change in each variable of the network affects the cost, i.e., we compute the derivative of the cost with respect to each variable. At the last layer (layer $K$), the activation is essentially the output, i.e., $A^{[K]} = \hat{y}$, and its derivative is

$$\mathrm{d} A^{[K]} = \frac{\mathrm{d} J}{\mathrm{d} \hat{y}},$$

where the notation $\mathrm{d} X$ represents the derivative of the cost function $J$ with respect to $X$¹.
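
As a concrete example: assuming the binary cross-entropy loss $L_i = -\left( y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log \left(1 - \hat{y}^{(i)}\right) \right)$ used in the course, this convention gives, component-wise,

$$\mathrm{d} A^{[K](i)} = \frac{\mathrm{d} J}{\mathrm{d} \hat{y}^{(i)}} = -\frac{1}{m} \left( \frac{y^{(i)}}{\hat{y}^{(i)}} - \frac{1 - y^{(i)}}{1 - \hat{y}^{(i)}} \right),$$

with the $1/m$ appearing because the derivative is taken of the cost $J$ rather than of a single-instance loss.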

Following the chain rule, for the $l^\text{th}$ hidden layer², we can find the derivative with respect to the linear product

$$\mathrm{d} Z^{[l]} = \mathrm{d} A^{[l]} \odot g^{[l]\prime}\left( Z^{[l]} \right).$$

The Hadamard product $\odot$ shows up because $g$ is a scalar function applied element-wise. The same procedure can be applied to obtain the derivatives with respect to the weight matrix and the bias vector of the hidden layer

$$\mathrm{d} W^{[l]} = \mathrm{d} Z^{[l]} A^{[l-1]\mathrm{T}}, \qquad \mathrm{d} b^{[l]} = \sum_{i=1}^{m} \mathrm{d} Z^{[l](i)}.$$

The first equation above can be broken down to

$$\mathrm{d} W^{[l]} = \sum_{i=1}^{m} \mathrm{d} Z^{[l](i)} A^{[l-1](i)\mathrm{T}},$$

and the linearity of the matrix product collapses this per-instance sum back into the concise form above. The derivative of the activation of the next layer in the backward direction is then computed as

$$\mathrm{d} A^{[l-1]} = W^{[l]\mathrm{T}} \mathrm{d} Z^{[l]}.$$
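
A minimal NumPy sketch of this backward step for one layer, under the same sigmoid assumption as the forward sketch above and with illustrative names; note that, following the convention of these notes, no extra $1/m$ factor is applied here.

```python
import numpy as np

def sigmoid_prime(Z):
    """Derivative of the sigmoid activation (assuming g is the sigmoid)."""
    s = 1.0 / (1.0 + np.exp(-Z))
    return s * (1.0 - s)

def linear_activation_backward(dA, Z, A_prev, W):
    """Given dA^[l], return dW^[l], db^[l], and dA^[l-1]."""
    dZ = dA * sigmoid_prime(Z)              # Hadamard product with g'(Z^[l])
    dW = dZ @ A_prev.T                      # dW^[l] = dZ^[l] A^[l-1]^T
    db = np.sum(dZ, axis=1, keepdims=True)  # sum dZ^[l](i) over the m instances
    dA_prev = W.T @ dZ                      # dA^[l-1] = W^[l]^T dZ^[l]
    return dW, db, dA_prev
```
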
Gradient Descent

A simple gradient descent step is applied to update the weight matrix and the bias vector,

$$W^{[l]} := W^{[l]} - \alpha \, \mathrm{d} W^{[l]}, \qquad b^{[l]} := b^{[l]} - \alpha \, \mathrm{d} b^{[l]},$$

where $\alpha$ is the learning rate.
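
A minimal sketch of this update step, assuming the parameters and gradients are stored in dictionaries keyed as `W1`, `b1`, `dW1`, `db1`, and so on; this layout is an illustrative assumption, not necessarily the course's data structure.

```python
def update_parameters(params, grads, learning_rate=0.01):
    """Take one gradient descent step for every layer l."""
    num_layers = len(params) // 2  # each layer contributes a W and a b
    for l in range(1, num_layers + 1):
        params[f"W{l}"] -= learning_rate * grads[f"dW{l}"]  # W^[l] := W^[l] - alpha dW^[l]
        params[f"b{l}"] -= learning_rate * grads[f"db{l}"]  # b^[l] := b^[l] - alpha db^[l]
    return params
```
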
This forms a complete process of the fundamental computation in a neural network.
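
To see how the pieces fit together, here is a hedged sketch of a training loop for a two-layer network that wires up the illustrative helpers from the sketches above; sigmoid is used in both layers only to keep the example short (the course typically uses tanh or ReLU in hidden layers), and `X`, `Y`, and `params` are assumed to be prepared by the caller.

```python
def train_two_layer(X, Y, params, num_iterations=1000, learning_rate=0.01):
    """One possible wiring of the forward, backward, and update sketches."""
    cost = None
    for _ in range(num_iterations):
        # forward propagation through both layers
        Z1, A1 = linear_activation_forward(X, params["W1"], params["b1"])
        Z2, A2 = linear_activation_forward(A1, params["W2"], params["b2"])
        cost = compute_cost(A2, Y)

        # backward propagation, starting from dA^[K] = dJ/d(y_hat)
        m = Y.shape[1]
        dA2 = -(Y / A2 - (1.0 - Y) / (1.0 - A2)) / m
        dW2, db2, dA1 = linear_activation_backward(dA2, Z2, A1, params["W2"])
        dW1, db1, _ = linear_activation_backward(dA1, Z1, X, params["W1"])

        # gradient descent update for both layers
        grads = {"dW1": dW1, "db1": db1, "dW2": dW2, "db2": db2}
        params = update_parameters(params, grads, learning_rate)
    return params, cost
```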

References

  1. This definition is different from that in the course. There, Prof. Ng actually defines $\mathrm{d} A^{[l]} = \frac{\mathrm{d} L}{\mathrm{d} A^{[l]}}$ but $\mathrm{d} W^{[l]} = \frac{\mathrm{d} J}{\mathrm{d} W^{[l]}}$, which causes a bit of confusion. 

  2. Following the previous note, because of the different definition here, the $1/m$ factor that appears in the course's equations for $\mathrm{d} W$ and $\mathrm{d} b$ does not show up here.