Understand the concept of Gradient Descent and Back-propagation to get some idea of how Neural Networks work.
Deep Learning Series
In this series, with each blog post we dive deeper into the world of deep learning and along the way build intuition and understanding. This series has been inspired from and references several sources. It is written with the hope that my passion for Data Science is contagious. The design and flow of this series is explained here.
Neural Networks &
Mechanics of Deep Learning
We have already covered some of the basics of the architecture and the respective components in the previous posts. But we need to understand one of the most important concepts.
How do Neural networks exactly work?
How are the weights updated in Neural networks?
Well, let’s get into the algorithms behind Neural Networks.
Gradient Descent
For most machine learning algorithms, optimization is used to minimize the cost/error function. Gradient Descent is one of the most popular optimization algorithms used in Machine Learning. There are many powerful ML algorithms that use gradient descent such as linear regression, logistic regression, support vector machine (SVM) and neural networks.
Intuition
Let’s take the classic mountain valley example with a twist, you meet a pirate and in your travels you discover a map to the golden chalice of wisdom. The secret location is the lowest point in a very dark and deep valley. Given that there is no possible sources of natural or artificial light in this magical valley, both the pirate and you are in a race to reach the bottom of the valley in pitch darkness. The pirate decides to take steps forward randomly with the hope of eventually reaching the lowest point.
Both of you have the same starting point, you think there must be a smarter way. At every step you decide to feel the gradient (slope) around you, and take the steepest step possible. By taking the best possible step every time, you win!
That is analogous to the gradient descent technique. We are operating in the blind trying to take a step in the most optimal direction.
Let us say that we fit a regression model on our dataset. We need a cost function to minimize the error between our prediction and the actual value. The plot of our cost function will look like:
Gradient is another word for slope and the first step in gradient descent is to pick a starting value at random or set it to 0. Now, a gradient has the following characteristics:
- Direction
- Magnitude
Let’s take a mathematical function to further understand the same.
In mathematical terms, if our function is:
$
f(x) = e^{2}\sin(x)
$
The derivative:
$
\frac {\partial f}{\partial x} = e^2\cos(x)
$
If x = 0
$
\frac{\partial f}{\partial x} (0) = e^2 \approx 7.4
$
So when you start at 0 and move a little (take a step), the function changes by about 7.4 times (magnitude) the amount that you changed. Similarly, if you have multiple variables we take partial derivatives:
$
z = f(x,y) = xy + x^2
$
For a function such as the one above we first take y as a constant and follow differentiate it in terms of x ( Here: y + 2x). Then we take x as a constant and take the derivative in terms of y (Here: x). Consider if x = 3 and y = -3 then f(x,y) = 9. The final value is obtained from the use of the chain rule of calculus.
$\nabla f $
the sign of the final gradient points in the direction of greatest change of the function.
In a feed-forward network, we are learning how does the error vary as the weight is adjusted. The relationship between the net’s error and a single weight will look something like the image below (we will get into more detail a little later):
As a neural network learns, it slowly adjusts several weights by calculating (dE/dw) the derivative of network Error with respect to the weights.
Gradient descent algorithms multiply the gradient by a scalar known as the learning rate (also sometimes called step size) to determine the next point. For example:
Metric | Value |
---|---|
Gradient Magnitude | 2.5 |
Learning Rate | 0.01 |
Then the gradient descent algorithm will pick the next point 0.025 away from the previous point. A small learning rate will take too long and a very large learning rate the algorithm might diverge away from the minimum point (miss the minimum completely).
Finally, the weights are updated incrementally after each epoch (pass over the training dataset) till we get the best results.
Stochastic Gradient Descent
In gradient descent, a batch is the total number of examples you use to calculate the gradient in a single iteration. So far, we have assumed that the batch has been the entire data set. But for large datasets, the gradient computation might be expensive.
stochastic gradient descent offers a lighter-weight solution. At each iteration, rather than computing the gradient ∇f(x), stochastic gradient descent randomly samples i at uniform and computes ∇fi(x) instead.
Back-propagation
Back-propagation is simply a technique or method of updating the weights. We are aware of partial derivatives, chain rule and most importantly gradient descent. But with Neural networks having multiple layers and different activation functions make it difficult to visualize how everything comes together. Consider, a simple example with the following architecture:
Forward Pass
Step 1: Initialization Let us initialize the weights and the bias.Table 1 a: Weight Initialization Example
Weights | Value |
---|---|
w1 | 0.10 |
w2 | 0.15 |
w3 | 0.03 |
w4 | 0.08 |
w5 | 0.18 |
w6 | 0.06 |
w7 | 0.11 |
w8 | 0.26 |
Table 1: Dominated/Non-Dominated Example
Bias | Value |
---|---|
b1 | 0.05 |
b2 | 0.42 |
Assume take the initial input values to be [0.95,0.06] and the target value [0.05,0.82].
Step 2: Calculations
To get the value of H1:
H1 = w1 * x1 + w2 * x2 + b1 = 0.1 * 0.95 + 0.15 * 0.06 + 0.05 = 0.154
As we have a sigmoid activation function:
$
\frac{1}{1+e^{-X}}
H1 = \frac{1}{1+e^{-H1}} = \frac{1}{1+e^{-0.154}} = 0.538
$
Similarly, we can calculate H2.
H1 = 0.538 and H2 = 0.52
Now we calculate the value for output nodes Y1 and Y2.
Y1 = w5 * H1 + w6 * H2 + b2 = 0.18 * 0.538 + 0.06 * 0.52 + 0.42 = 0.548
$
Y1 = \frac{1}{1+e^{-Y1}} = \frac{1}{1+e^{-0.548}} = 0.633
$
Upon calculation:
Y1 = 0.633 & Y2 = 0.648
Step 3: Error Function Let the error function be:
$
J( \theta ) = {( {target – {output}})^2}
$
Total Error (E) = E1 + E2 = 0.184972 E1 = 0.5 * (0.05 - 0.63368)^2 = 0.17 E2 = 0.5 * (0.82 - 0.64893)^2 = 0.014
Backward Pass
Back-propagate the Errors to update the weights.
Error at W5:
$
\partial E \over \partial W5
$
$
= ({\partial E \over \partial output Y1}) * ({\partial output Y1 \over \partial Y1}) * ({\partial Y1 \over \partial W5})
$
Component 1: The Cost/Error Function
target: T output: out E = 0.5 * (T1 - out Y1)^2 + 0.5 * (T2 - out Y2)^2 Differentiating: - (T1 - out Y1) = - (0.05 - 0.63368) = 0.58368
Component 2: The Activation function
output: out out Y1 = 1/(1 + exp(-Y1)) Differentiating: out Y1 * (1 - out Y1) = 0.63368 * (1 - 0.63368) = 0.23213
Component 3: The Function of Weights
Y1 = w5 * H1 + w6 * H2 + b2 Differentiating: H1 * 1 = 0.538
Finally, we have the change in W5:
$ \partial E \over \partial W5
$
=0.58368∗0.23213∗0.538
=0.07289
In order to update W5 recall the discussion on gradient descent. Let alpha be learning rate with a chosen value of 0.01.
Updated W5 will be:
$
W5 + \alpha * ({\partial E \over \partial W5})
$
=0.18+0.01∗0.07289
=0.1807289
Similarly, we can update the remaining weights. Let’s have a look at the formula to update W1:
\frac{\partial E}{\partial w1}
equals
$
(\sum\limits_{i}{\frac{\partial E}{\partial out_{i}} * \frac{\partial out_{i}}{\partial Y_{i}} * \frac{\partial Y_{i}}{\partial out_{h1}}}) * \frac{\partial out_{h1}}{\partial H1} * \frac{\partial H1}{\partial w_{1}}
$
It feels like it is complicated, but really we are going back layer by layer to get the respective value. As w1 feeds into neuron H1 and H1 is connected to Y1 and Y2. Moving backwards, we are differentiating the error function following which Y1 and Y2 (the activation function and the function of Weights) . That leads us to H1 where we differentiate its activation function and its respective function of weights.
This is how we back-propagate the errors and update all the weights. Once we update all the weights, that is one epoch or pass over the dataset. Further, we start the entire process of forward pass and backward pass again. This process is repeated for multiple times with the purpose of minimizing error.
When do we stop?
We stop prior to over-fitting that is we want the minimum validation error but we do not want the training error to be lower than the validation error.
Hopefully, this explains the entire process of how neural networks actually work and sheds some light on gradient descent and back-propagation.
What’s Next?
Activation: We have talked about activation functions in the past posts, but let’s understand in more detail the different types of activation functions and explore their characteristics.