Building your own neural network from scratch can be challenging and confusing.
It can often be insightful to look at something simple to help grasp the concepts.
In this tutorial we'll be doing exactly that. We will work on creating a neural network to solve the simple XOR problem in Python.

When given two binary inputs $x_1$ and $x_2$, we want to output a 1 if exactly one of them is a 1. Otherwise we want to output a 0. This is represented by the truth table below:

| $x_1$ | $x_2$ | Output |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |

If we plot the XOR problem, we can see that the data is not linearly separable. In other words, we can't separate the blue dots from the red dots using a straight line.

This means that our neural network will need at least one hidden layer with a non-linear activation function to classify all inputs correctly.

The sigmoid function, $\sigma(x) = \frac{1}{1 + e^{-x}}$, is a commonly used activation function and is the one we will use for the purpose of this tutorial. It maps any real input to a value between 0 and 1.
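In code (using NumPy), the sigmoid is a one-liner:

```python
import numpy as np

def sigmoid(x):
    """Map any real input to the range (0, 1)."""
    return 1 / (1 + np.exp(-x))
```

`sigmoid(0)` is 0.5; large positive inputs approach 1 and large negative inputs approach 0.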

We design the neural network taking the following points into consideration:

- From experience, we know that the XOR problem can be written as a combination of the "AND" function and the "OR" function: $A$ XOR $B$ = ($A$ OR $B$) AND NOT ($A$ AND $B$). This allows us to engineer our network to have 2 nodes in the hidden layer.
- Our network has to output one value between 0 and 1, so we will have one node in the output layer (we could have two nodes representing the probability of 0 and 1 respectively but let's keep it simple for now).
- In general adding a bias term is a good idea because it lets a neural network learn a function that does not have to go through the origin.
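The decomposition in the first point is easy to verify exhaustively in Python (`^` is Python's bitwise XOR):

```python
# Check A XOR B == (A OR B) AND NOT (A AND B) for all four input pairs.
for a in (0, 1):
    for b in (0, 1):
        decomposed = (a or b) and not (a and b)
        assert int(decomposed) == (a ^ b)
```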

The architecture of our network now looks as follows:

Our goal is to learn the weights in the network that will produce the correct output given our inputs ($x_1$ and $x_2$).
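Assuming NumPy, the inputs and random weights can be set up along these lines (the array names and layout are my own choice; a leading 1 in each input row feeds the bias weight, and the exact random values depend on the seed, so your numbers will differ):

```python
import numpy as np

rng = np.random.default_rng()

# All four rows of the XOR truth table, each prefixed with a 1 for the bias.
X = np.array([[1, 0, 0],
              [1, 0, 1],
              [1, 1, 0],
              [1, 1, 1]])
y = np.array([0, 1, 1, 0])

# Layer 1: one column of weights per hidden node (w1..w3 and w4..w6).
W1 = rng.uniform(size=(3, 2))
# Layer 2: bias weight plus one weight per hidden node (w7..w9).
W2 = rng.uniform(size=(3,))
```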

Weights $w_1$-$w_6$ correspond to the layer 1 weights and weights $w_7$-$w_9$ correspond to the layer 2 weights. Suppose they have been randomly initialized as follows:

| $\color{red}{w_1}$ | $\color{red}{w_2}$ | $\color{red}{w_3}$ | $\color{blue}{w_4}$ | $\color{blue}{w_5}$ | $\color{blue}{w_6}$ |
|---|---|---|---|---|---|
| 0.42572141 | 0.73539729 | 0.38338077 | 0.81458374 | 0.8680032 | 0.97945663 |

| $\color{green}{w_7}$ | $\color{green}{w_8}$ | $\color{green}{w_9}$ |
|---|---|---|
| 0.89319435 | 0.20971517 | 0.74182765 |

$\begin{align} h_1 & = \sigma(1 \cdot \color{red}{w_1} + x_1 \cdot \color{red}{w_2} + x_2 \cdot \color{red}{w_3}) \\\\ & = \sigma(1 \cdot 0.42572141 + 0 \cdot 0.73539729 + 0 \cdot 0.38338077) \\\\ & = \sigma(0.42572141) \\\\ & = 0.60485152 \end{align}$

$\begin{align} h_2 & = \sigma(1 \cdot \color{blue}{w_4} + x_1 \cdot \color{blue}{w_5} + x_2 \cdot \color{blue}{w_6}) \\\\ & = \sigma(1 \cdot 0.81458374 + 0 \cdot 0.8680032 + 0 \cdot 0.97945663) \\\\ & = \sigma(0.81458374) \\\\ & = 0.69308541 \end{align}$

$\begin{align} \hat{y} & = \sigma(1 \cdot \color{green}{w_7} + h_1 \cdot \color{green}{w_8} + h_2 \cdot \color{green}{w_9}) \\\\ & = \sigma(1 \cdot 0.89319435 + 0.60485152 \cdot 0.20971517 + 0.69308541 \cdot 0.74182765) \\\\ & = \sigma(1.53419081) \\\\ & = 0.82261865 \end{align}$

So when $x_1$ and $x_2$ are both zero, the output is 0.82261865.
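This forward pass can be checked in code, reusing the initial weights above and a `sigmoid` helper:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Initial weights from the tables above.
w1, w2, w3 = 0.42572141, 0.73539729, 0.38338077
w4, w5, w6 = 0.81458374, 0.8680032, 0.97945663
w7, w8, w9 = 0.89319435, 0.20971517, 0.74182765

x1, x2 = 0, 0
h1 = sigmoid(1 * w1 + x1 * w2 + x2 * w3)      # hidden node 1: 0.60485152
h2 = sigmoid(1 * w4 + x1 * w5 + x2 * w6)      # hidden node 2: 0.69308541
y_hat = sigmoid(1 * w7 + h1 * w8 + h2 * w9)   # network output: 0.82261865
```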

Similarly, we can do the same for the other input rows:

| $x_1$ | $x_2$ | $\hat{y}$ | Expected Output |
|---|---|---|---|
| 0 | 0 | 0.82261865 | 0 |
| 0 | 1 | 0.84215617 | 1 |
| 1 | 0 | 0.84269835 | 1 |
| 1 | 1 | 0.85314043 | 0 |

We can see that the error grows larger when the difference between the expected output ($y$) and the calculated output ($\hat{y}$) is larger. It makes sense that we are taking the difference between the produced and expected values, but why is there a squared term in there?

- It keeps everything we add to our sum positive so that our error makes sense. If we didn't keep it positive, then predicting a 1 as a 0 would give an error of 1 and predicting a 0 as a 1 would give an error of -1. Adding these together would tell us that we had an error of 0, which is not what we want. Of course, we could have just used an absolute value.
- More importantly, we want to penalise predictions that are further away from the expected values non-linearly. For example, if we expect a 0 and get a 0.1 this might be okay, but predicting a 0.5 for that same value would be a problem. Without squaring, the second error is only 5 times worse. If we square the numbers, we get errors of 0.01 and 0.25, so the second prediction is now 25 times worse.
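Putting this together, the mean squared error over the four rows of the table above can be computed as:

```python
import numpy as np

y = np.array([0, 1, 1, 0])  # expected XOR outputs
y_hat = np.array([0.82261865, 0.84215617, 0.84269835, 0.85314043])  # network outputs

mse = np.mean((y - y_hat) ** 2)  # mean squared error over the four rows
```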

What we want to do is obtain the weights that minimize this mean squared error. We can do this by using a technique called gradient descent.

$w_7 = w_7 - \alpha \frac{\partial E}{\partial w_7}$,

where $\alpha$ is our learning rate and $E$ is the MSE. For now, let's look at the error for the case where both $x_1$ and $x_2$ are zero to make things simpler:
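For that single example, the error is:

$\begin{align} E & = (y - \hat{y})^2 \\\\ & = (0 - 0.82261865)^2 \\\\ & \approx 0.6767 \end{align}$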

If we look at how $w_7$ affects the error, we see that it contributes to it directly through $\hat{y}$.

Let's make a quick substitution to make things easier:

$\begin{align} \hat{y} & = \sigma(1 \cdot \color{green}{w_7} + h_1 \cdot \color{green}{w_8} + h_2 \cdot \color{green}{w_9}) \\\\ & = \sigma(net_3) \\\\ \end{align}$

where $net_3 = 1 \cdot \color{green}{w_7} + h_1 \cdot \color{green}{w_8} + h_2 \cdot \color{green}{w_9}$

We see that:

- $E$ is a function of $\hat{y}$.
- $\hat{y}$ is a function of $net_3$.
- $net_3$ is a function of $w_7$.

$\sigma'(x) = \sigma(x) \cdot (1-\sigma(x))$,

where $\sigma'(x)$ denotes the derivative of the sigmoid function.

For a single training example, the error is

$E = (y - \hat{y})^2$

Using the chain rule:

$\frac{\partial E}{\partial w_7} = \frac{\partial E}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial net_3} \frac{\partial net_3}{\partial w_7}$

$\frac{\partial E}{\partial \hat{y}} = 2(\hat{y} - y) = 2(0.82261865 - 0) = 1.6452373$

$\frac{\partial \hat{y}}{\partial net_3} = \hat{y} \cdot (1 - \hat{y}) = 0.82261865 (1 - 0.82261865) = 0.1459172067$

$\frac{\partial net_3}{\partial w_7} = 1$

$\frac{\partial E}{\partial w_7} = (1.6452373)(0.1459172067)(1) = 0.2400684312 $

Now we know how to update weight $w_7$:

$w_7 = w_7 - \alpha \frac{\partial E}{\partial w_7}$

It is important to note that we do not update this weight immediately, otherwise we could influence our other weight updates. Instead we memorise the amount that our weight needs to be updated by. We do this by making use of a temporary matrix. We must also do the same calculations for weights $w_8$ and $w_9$ with the small difference that these weights are not attached to bias nodes:

$\frac{\partial E}{\partial w_8} = \frac{\partial E}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial net_3} \frac{\partial net_3}{\partial w_8}$

$\frac{\partial net_3}{\partial w_8} = h_1 = 0.60485152$

$\frac{\partial E}{\partial w_8} = (1.6452373)(0.1459172067)(0.60485152) = 0.1452057555 $

$\frac{\partial E}{\partial w_9} = \frac{\partial E}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial net_3} \frac{\partial net_3}{\partial w_9}$

$\frac{\partial net_3}{\partial w_9} = h_2 = 0.69308541$

$\frac{\partial E}{\partial w_9} = (1.6452373)(0.1459172067)(0.69308541) = 0.166387927 $

The code for this is as follows:
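(A sketch; the variable names are my own, and the numbers are those from the worked example above. `delta_out` is the shared factor $\frac{\partial E}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial net_3}$.)

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Values from the forward pass for x1 = x2 = 0.
h1, h2 = 0.60485152, 0.69308541
y, y_hat = 0, 0.82261865

# Output-node delta: dE/dy_hat * dy_hat/dnet3.
delta_out = 2 * (y_hat - y) * y_hat * (1 - y_hat)

# Gradients for the layer 2 weights; the bias input is 1.
grad_w7 = delta_out * 1
grad_w8 = delta_out * h1
grad_w9 = delta_out * h2
```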

Next we need the gradients for the layer 1 weights. Again we will make a quick substitution to make things easier:

$\begin{align} h_1 & = \sigma(1 \cdot \color{red}{w_1} + x_1 \cdot \color{red}{w_2} + x_2 \cdot \color{red}{w_3}) \\\\ & = \sigma(net_1) \\\\ \end{align}$

where $net_1 = 1 \cdot \color{red}{w_1} + x_1 \cdot \color{red}{w_2} + x_2 \cdot \color{red}{w_3}$

We see that:

- $E$ is a function of $\hat{y}$.
- $\hat{y}$ is a function of $net_3$.
- $net_3$ is a function of $h_1$.
- $h_1$ is a function of $net_1$.
- $net_1$ is a function of $w_1$.

So now using the chain rule:

$ \frac{\partial E}{ \partial w_1} = \frac{\partial E}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial net_3} \frac{\partial net_3}{\partial h_1} \frac{\partial h_1}{\partial net_1} \frac{\partial net_1}{\partial w_1}$

We've already calculated $\frac{\partial E}{\partial \hat{y}}$ and $\frac{\partial \hat{y}}{\partial net_3}$, so we only need the remaining three factors:

$\frac{\partial net_3}{\partial h_1} = w_8 = 0.20971517 $

$\begin{align} \frac{\partial h_1}{\partial net_1} & = \frac{\partial \sigma(net_1)}{\partial net_1} \\\\ & = \sigma(net_1) (1 - \sigma(net_1)) \\\\ & = h_1 (1 - h_1) \\\\ & = 0.60485152 (1 - 0.60485152) \\\\ & = 0.2390061588 \\\\ \end{align}$

$\frac{\partial net_1}{\partial w_1} = 1$

$ \begin{align} \frac{\partial E}{\partial w_1} &= (1.6452373)(0.1459172067)(0.20971517)(0.2390061588)(1) \\\\ &= 0.012033 \end{align}$

We repeat the above procedure for weights $w_2$ and $w_3$, noting that:

$\frac{\partial net_1}{\partial w_2} = x_1 = 0$

$\frac{\partial net_1}{\partial w_3} = x_2 = 0$

This means that:

$\frac{\partial E}{\partial w_2} = 0$

$\frac{\partial E}{\partial w_3} = 0$

For weights $w_4-w_6$ there is a slight change. Since these weights contribute to $h_2$ and not $h_1$, we need to differentiate $net_3$ with respect to $h_2$ instead. When this is complete, we should obtain the following values:

| Gradient | Value |
|---|---|
| $\frac{\partial E}{\partial w_1}$ | 0.012033 |
| $\frac{\partial E}{\partial w_2}$ | 0 |
| $\frac{\partial E}{\partial w_3}$ | 0 |
| $\frac{\partial E}{\partial w_4}$ | 0.03788282 |
| $\frac{\partial E}{\partial w_5}$ | 0 |
| $\frac{\partial E}{\partial w_6}$ | 0 |
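These values can be reproduced in code (a sketch reusing the numbers from the forward pass; `delta_out` is the output-layer factor computed earlier):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

w8, w9 = 0.20971517, 0.74182765
h1, h2 = 0.60485152, 0.69308541
x1, x2 = 0, 0
y, y_hat = 0, 0.82261865

delta_out = 2 * (y_hat - y) * y_hat * (1 - y_hat)

# Hidden-node deltas: backpropagate delta_out through each layer 2 weight.
delta_h1 = delta_out * w8 * h1 * (1 - h1)
delta_h2 = delta_out * w9 * h2 * (1 - h2)

# Gradients for the layer 1 weights; the inputs are (1, x1, x2).
grad_w1, grad_w2, grad_w3 = delta_h1 * 1, delta_h1 * x1, delta_h1 * x2
grad_w4, grad_w5, grad_w6 = delta_h2 * 1, delta_h2 * x1, delta_h2 * x2
```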

After training with gradient descent, the weights have caused the network to learn $h_1$ as an $AND$ function and $h_2$ as an $OR$ function:

| $x_1$ | $x_2$ | $h_1$ | $x_1$ $AND$ $x_2$ |
|---|---|---|---|
| 0 | 0 | 0.001 | 0 |
| 0 | 1 | 0.093 | 0 |
| 1 | 0 | 0.094 | 0 |
| 1 | 1 | 0.879 | 1 |

| $x_1$ | $x_2$ | $h_2$ | $x_1$ $OR$ $x_2$ |
|---|---|---|---|
| 0 | 0 | 0.065 | 0 |
| 0 | 1 | 0.969 | 1 |
| 1 | 0 | 0.971 | 1 |
| 1 | 1 | 1 | 1 |
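The training code itself isn't shown above, but a loop that trains the network this way might look like the following sketch (the seed, learning rate, and iteration count are my own choices, so the learnt weights and outputs will differ from the tables above):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])

W1 = rng.uniform(size=(3, 2))   # w1..w6: bias + 2 inputs, for 2 hidden nodes
W2 = rng.uniform(size=(3,))     # w7..w9: bias + 2 hidden nodes

def forward(W1, W2):
    H = sigmoid(X @ W1)                          # hidden activations, (4, 2)
    Hb = np.hstack([np.ones((len(X), 1)), H])    # prepend the bias input, (4, 3)
    return H, Hb, sigmoid(Hb @ W2)               # outputs, (4,)

initial_mse = np.mean((y - forward(W1, W2)[2]) ** 2)

alpha = 0.5
for _ in range(20000):
    H, Hb, y_hat = forward(W1, W2)
    # Output-layer factor, then gradients; both gradients are computed from
    # the current weights before either update is applied, as described above.
    delta_out = 2 * (y_hat - y) * y_hat * (1 - y_hat)
    grad_W2 = Hb.T @ delta_out
    delta_h = np.outer(delta_out, W2[1:]) * H * (1 - H)
    grad_W1 = X.T @ delta_h
    W2 = W2 - alpha * grad_W2
    W1 = W1 - alpha * grad_W1

final_mse = np.mean((y - forward(W1, W2)[2]) ** 2)
```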

Let's review the function learnt for weights $w_7$-$w_9$ independently from the rest of the network. We do this by having $h_1$ and $h_2$ take on binary values instead of their actual values. Add this to the end of the previous code.
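A sketch of that check (the trained values of $w_7$-$w_9$ aren't reproduced here, so the stand-in weights below are hypothetical values that realise the same learnt function; your trained numbers will differ):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Hypothetical stand-ins for the trained weights w7, w8, w9.
w7, w8, w9 = -5.0, -10.0, 10.0

# Feed binary values for h1 and h2 into the output node only.
for h1 in (0, 1):
    for h2 in (0, 1):
        y_hat = sigmoid(1 * w7 + h1 * w8 + h2 * w9)
        print(h1, h2, round(float(y_hat), 3))
```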

The weights $w_7$-$w_9$ have learnt the function: $NOT(h_1)$ $AND$ $h_2$

| $h_1$ | $NOT(h_1)$ | $h_2$ | $\hat{y}$ | $NOT(h_1)$ $AND$ $h_2$ |
|---|---|---|---|---|
| 0 | 1 | 0 | 0.02 | 0 |
| 0 | 1 | 1 | 0.99 | 1 |
| 1 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0.011 | 0 |

$ \begin{align} \hat{y} &= NOT(h_1) \, AND \, h_2 \\\\ &= NOT(x_1 \, AND \, x_2) \, AND \, (x_1 \, OR \, x_2) \end{align}$

This represents the $XOR$ function!