{"id":511,"date":"2018-12-04T13:11:03","date_gmt":"2018-12-04T13:11:03","guid":{"rendered":"https:\/\/datagradient.com\/?p=511"},"modified":"2021-09-22T17:18:35","modified_gmt":"2021-09-22T17:18:35","slug":"mechanics-of-deep-learning","status":"publish","type":"post","link":"https:\/\/datasciencediscovery.com\/index.php\/2018\/12\/04\/mechanics-of-deep-learning\/","title":{"rendered":"Mechanics of Deep Learning"},"content":{"rendered":"\n<p>Understand the concept of Gradient Descent and Back-propagation to get some idea of how Neural Networks work.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"deep-learning-series\">Deep Learning Series<\/h3>\n\n\n\n<p>In this series, with each blog post we dive deeper into the world of deep learning and along the way build intuition and understanding. This series has been inspired from and references several sources. It is written with the hope that my passion for Data Science is contagious. The design and flow of this series is explained&nbsp;<a href=\"https:\/\/datasciencediscovery.com\/index.php\/2018\/11\/18\/deep-learning-introduction\/\">here<\/a>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Neural Networks &amp;<\/h2>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"mechanics-of-deep-learning\">Mechanics of Deep Learning<\/h2>\n\n\n\n<p>We have already covered some of the basics of the architecture and the respective components in the previous posts. But we need to understand one of the most important concepts.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote\"><p>How do Neural networks exactly work?<\/p><p>How are the weights updated in Neural networks?<\/p><\/blockquote>\n\n\n\n<p>Well, let\u2019s get into the algorithms behind Neural Networks.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"gradient-descent\">Gradient Descent<\/h3>\n\n\n\n<p>For most machine learning algorithms, optimization is used to minimize the cost\/error function. 
Gradient Descent is one of the most popular optimization algorithms used in Machine Learning. Many powerful ML algorithms use gradient descent, such as linear regression, logistic regression, support vector machines (SVMs) and neural networks.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote\"><p>Intuition<\/p><\/blockquote>\n\n\n\n<p>Let\u2019s take the classic mountain valley example with a twist: in your travels you meet a pirate and discover a map to the golden chalice of wisdom. Its secret location is the lowest point of a very dark and deep valley. Given that there are no sources of natural or artificial light in this magical valley, the pirate and you are in a race to reach the bottom of the valley in pitch darkness. The pirate decides to take random steps forward in the hope of eventually reaching the lowest point.<\/p>\n\n\n\n<p>Both of you have the same starting point, but you think there must be a smarter way. At every step you feel the gradient (slope) around you and take the steepest step possible. By taking the best possible step every time, you win!<\/p>\n\n\n\n<p>That is analogous to the gradient descent technique: we are operating blind, trying to take a step in the most optimal direction each time.<\/p>\n\n\n\n<p>Let us say that we fit a regression model on our dataset. We need a cost function that measures the error between our prediction and the actual value, which we then minimize. 
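<\/p>\n\n\n\n<p>As a concrete sketch (illustrative Python; the function name is ours), a common choice is the half squared error, which is also the error function used later in this post:<\/p>\n\n\n\n

```python
def squared_error_cost(target, output):
    # Half squared error: 0.5 * (target - output)^2.
    # The factor of 0.5 simplifies the derivative during back-propagation.
    return 0.5 * (target - output) ** 2

# A prediction of 0.63368 against a target of 0.05:
print(squared_error_cost(0.05, 0.63368))  # about 0.17
```

\n\n\n\n<p>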
The plot of our cost function will look like:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" loading=\"lazy\" width=\"291\" height=\"300\" src=\"https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2018\/12\/Gradient_Descent.jpg?resize=291%2C300\" alt=\"gradient descent\" class=\"wp-image-798\" title=\"\" data-recalc-dims=\"1\"><figcaption><\/figcaption><\/figure>\n\n\n\n<p><a href=\"https:\/\/www.quora.com\/Does-Gradient-Descent-Algo-always-converge-to-the-global-minimum\" target=\"_blank\" rel=\"noopener\">Source<\/a>.<\/p>\n\n\n\n<p>Gradient is another word for slope, and the first step in gradient descent is to pick a starting value at random or set it to 0. A gradient has the following characteristics:<br><\/p>\n\n\n\n<ul><li>Direction <\/li><li>Magnitude<\/li><\/ul>\n\n\n\n<p>Let\u2019s take a mathematical function to understand this further.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote\"><p>In mathematical terms, if our function is:<\/p><\/blockquote>\n\n\n<p>$<br \/>\nf(x) = e^{2}\\sin(x)<br \/>\n$<\/p>\n\n\n\n<blockquote class=\"wp-block-quote\"><p>The derivative:<\/p><\/blockquote>\n\n\n<p>$<br \/>\n\\frac {\\partial f}{\\partial x} = e^2\\cos(x)<br \/>\n$<\/p>\n\n\n\n<blockquote class=\"wp-block-quote\"><p>If x = 0<\/p><\/blockquote>\n\n\n<p>$<br \/>\n\\frac{\\partial f}{\\partial x} (0) = e^2 \\approx 7.4<br \/>\n$<\/p>\n\n\n\n<p>So when you start at 0 and move a little (take a step), the function changes by about 7.4 times (the magnitude) the amount that you moved. Similarly, if a function has multiple variables, we take partial derivatives:<\/p>\n\n\n<p>$<br \/>\nz = f(x,y) = xy + x^2<br \/>\n$<\/p>\n\n\n\n<p>For a function such as the one above we first treat y as a constant and differentiate with respect to x (here: y + 2x). Then we treat x as a constant and differentiate with respect to y (here: x). Consider x = 3 and y = -3: then f(x,y) = 3(-3) + 3^2 = 0, and the partial derivatives evaluate to y + 2x = 3 and x = 3. 
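<\/p>\n\n\n\n<p>These partial derivatives are easy to check numerically; a minimal sketch (illustrative Python, function names are ours):<\/p>\n\n\n\n

```python
def f(x, y):
    # z = f(x, y) = x*y + x^2
    return x * y + x ** 2

def grad_f(x, y):
    # Treat y as a constant for df/dx, and x as a constant for df/dy.
    df_dx = y + 2 * x
    df_dy = x
    return (df_dx, df_dy)

print(grad_f(3, -3))  # (3, 3)
```

\n\n\n\n<p>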
When functions are composed, as they are in a neural network, the derivatives of the individual pieces are combined using the chain rule of calculus.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" loading=\"lazy\" width=\"118\" height=\"45\" src=\"https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2018\/12\/chain_rule.jpg?resize=118%2C45\" alt=\"chain rule\" class=\"wp-image-800\" title=\"\" data-recalc-dims=\"1\"><figcaption><\/figcaption><\/figure>\n\n\n<p>The vector of partial derivatives is the gradient, $\\nabla f $, and it points in the direction of greatest increase of the function; gradient descent therefore steps in the opposite direction.<\/p>\n\n\n\n<p>In a feed-forward network, we are learning how the error varies as each weight is adjusted. The relationship between the net\u2019s error and a single weight will look something like the image below (we will get into more detail a little later):<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" loading=\"lazy\" width=\"370\" height=\"74\" src=\"https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2018\/12\/backprop_chain_rule.jpg?resize=370%2C74\" alt=\"back propagation Chain Rule Derivative\" class=\"wp-image-801\" srcset=\"https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2018\/12\/backprop_chain_rule.jpg?w=370&amp;ssl=1 370w, https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2018\/12\/backprop_chain_rule.jpg?resize=300%2C60&amp;ssl=1 300w\" sizes=\"(max-width: 370px) 100vw, 370px\" title=\"\" data-recalc-dims=\"1\"><figcaption><\/figcaption><\/figure>\n\n\n\n<p>As a neural network learns, it gradually adjusts many weights by calculating dE\/dw, the derivative of the network\u2019s error with respect to each weight.<\/p>\n\n\n\n<p>Gradient descent algorithms multiply the gradient by a scalar known as the learning rate (also sometimes called the step size) to determine the next point. 
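<\/p>\n\n\n\n<p>The update rule can be sketched in a few lines of Python (an illustrative sketch with names of our own, minimizing f(x) = x^2, whose gradient is 2x):<\/p>\n\n\n\n

```python
def grad(x):
    # Gradient of f(x) = x^2.
    return 2 * x

x = 5.0             # starting point, picked arbitrarily
learning_rate = 0.1
for _ in range(100):
    x = x - learning_rate * grad(x)  # step against the gradient

print(x)  # very close to the minimum at 0
```

\n\n\n\n<p>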
For example: <\/p>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table class=\"has-fixed-layout\"><thead><tr><th>Metric<\/th><th>Value<\/th><\/tr><\/thead><tbody><tr><td>Gradient Magnitude<\/td><td>2.5<\/td><\/tr><tr><td>Learning Rate<\/td><td>0.01<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>Then the gradient descent algorithm will pick the next point 0.025 (2.5 * 0.01) away from the previous point. With too small a learning rate, convergence will take too long; with a very large learning rate, the algorithm might overshoot and diverge away from the minimum point (miss the minimum completely).<\/p>\n\n\n\n<p>Finally, the weights are updated incrementally after each epoch (pass over the training dataset) until we get the best results.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"stochastic-gradient-descent\">Stochastic Gradient Descent<\/h3>\n\n\n\n<p>In gradient descent, a batch is the total number of examples you use to calculate the gradient in a single iteration. So far, we have assumed that the batch has been the entire data set. But for large datasets, the gradient computation can be expensive.<br>Stochastic gradient descent offers a lighter-weight solution: at each iteration, rather than computing the full gradient \u2207f(x), it samples an index i uniformly at random and computes \u2207fi(x), the gradient on a single training example, instead.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"back-propagation\">Back-propagation<\/h3>\n\n\n\n<p>Back-propagation is simply the technique for computing the gradients used to update the weights. We are already familiar with partial derivatives, the chain rule and, most importantly, gradient descent. But neural networks with multiple layers and different activation functions make it difficult to visualize how everything comes together. 
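<\/p>\n\n\n\n<p>Two building blocks appear repeatedly in the walkthrough below: a weighted sum of inputs plus a bias, and a sigmoid activation. A minimal sketch (illustrative Python, helper names are ours):<\/p>\n\n\n\n

```python
import math

def sigmoid(z):
    # Squashes any real number into the interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    # Weighted sum of the inputs plus the bias, passed through the activation.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)
```

\n\n\n\n<p>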
Consider a simple example with the following architecture:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" loading=\"lazy\" width=\"791\" height=\"388\" src=\"https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2018\/12\/neural_network_layer.jpg?resize=791%2C388\" alt=\"neural networks\" class=\"wp-image-802\" srcset=\"https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2018\/12\/neural_network_layer.jpg?w=791&amp;ssl=1 791w, https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2018\/12\/neural_network_layer.jpg?resize=300%2C147&amp;ssl=1 300w, https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2018\/12\/neural_network_layer.jpg?resize=768%2C377&amp;ssl=1 768w\" sizes=\"(max-width: 791px) 100vw, 791px\" title=\"\" data-recalc-dims=\"1\"><figcaption><\/figcaption><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"forward-pass\">Forward Pass<\/h4>\n\n\n\n<p><strong>Step 1:<\/strong>&nbsp;Initialization. Let us initialize the weights and the biases.<\/p>\n\n\n\n<p>Table 1a: Weight Initialization Example<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-background\" style=\"background-color:#e7f5fe\"><thead><tr><th>Weights<\/th><th>Value<\/th><\/tr><\/thead><tbody><tr><td>w1<\/td><td>0.10<\/td><\/tr><tr><td>w2<\/td><td>0.15<\/td><\/tr><tr><td>w3<\/td><td>0.03<\/td><\/tr><tr><td>w4<\/td><td>0.08<\/td><\/tr><tr><td>w5<\/td><td>0.18<\/td><\/tr><tr><td>w6<\/td><td>0.06<\/td><\/tr><tr><td>w7<\/td><td>0.11<\/td><\/tr><tr><td>w8<\/td><td>0.26<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>Table 1b: Bias Initialization Example<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><thead><tr><th>Bias<\/th><th>Value<\/th><\/tr><\/thead><tbody><tr><td>b1<\/td><td>0.05<\/td><\/tr><tr><td>b2<\/td><td>0.42<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>Assume the initial input values are&nbsp;<strong>[0.95,0.06]<\/strong>&nbsp;and the target 
value&nbsp;<strong>[0.05,0.82]<\/strong>.<\/p>\n\n\n\n<p><strong>Step 2:<\/strong>&nbsp;Calculations<\/p>\n\n\n\n<p>To get the value of H1:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">H1 = w1 * x1 + w2 * x2 + b1\n   = 0.1 * 0.95 + 0.15 * 0.06 + 0.05\n   = 0.154\n<\/pre>\n\n\n\n<p>As we have a sigmoid activation function:<\/p>\n\n\n<p>$<br \/>\n\\sigma(x) = \\frac{1}{1+e^{-x}}<br \/>\nH1 = \\frac{1}{1+e^{-0.154}} = 0.538<br \/>\n$<\/p>\n\n\n\n<p>Similarly, we can calculate H2.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote\"><p>H1 = 0.538 and H2 = 0.52<\/p><\/blockquote>\n\n\n\n<p>Now we calculate the values of the output nodes Y1 and Y2.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">Y1 = w5 * H1 + w6 * H2 + b2\n   = 0.18 * 0.538 + 0.06 * 0.52 + 0.42\n   = 0.548\n<\/pre>\n\n\n<p>$<br \/>\nY1 = \\frac{1}{1+e^{-0.548}} = 0.633<br \/>\n$<\/p>\n\n\n\n<p>Upon calculation:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote\"><p>Y1 = 0.633 &amp; Y2 = 0.648<\/p><\/blockquote>\n\n\n\n<p><strong>Step 3:<\/strong>&nbsp;Error Function. Let the error function (with a factor of one-half that cancels neatly when differentiating) be:<\/p>\n\n\n<p>$<br \/>\nJ( \\theta ) = \\frac{1}{2}( {target - output})^2<br \/>\n$<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">E1 = 0.5 * (0.05 - 0.63368)^2 = 0.17\nE2 = 0.5 * (0.82 - 0.64893)^2 = 0.014\nTotal Error (E) = E1 + E2 = 0.184972\n<\/pre>\n\n\n\n<h4 class=\"wp-block-heading\" 
id=\"backward-pass\">Backward Pass<\/h4>\n\n\n\n<p>Back-propagate the Errors to update the weights.<\/p>\n\n\n\n<p>Error at W5:<\/p>\n\n\n<p>$<br \/>\n\\partial E \\over \\partial W5<br \/>\n$<\/p>\n\n\n<p>$<br \/>\n= ({\\partial E \\over \\partial output Y1}) * ({\\partial output Y1 \\over \\partial Y1}) * ({\\partial Y1 \\over \\partial W5})<br \/>\n$<\/p>\n\n\n\n<p>Component 1: The Cost\/Error Function<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">target: T\noutput: out\nE = 0.5 * (T1 - out Y1)^2 + 0.5 * (T2 - out Y2)^2\nDifferentiating:\n- (T1 - out Y1) = - (0.05 - 0.63368) = 0.58368\n<\/pre>\n\n\n\n<p>Component 2: The Activation function<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">output: out\nout Y1 = 1\/(1 + exp(-Y1))\nDifferentiating:\nout Y1 * (1 - out Y1) = 0.63368 * (1 - 0.63368) = 0.23213\n<\/pre>\n\n\n\n<p>Component 3: The Function of Weights<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">Y1 = w5 * H1 + w6 * H2 + b2\nDifferentiating:\nH1 * 1 = 0.538\n<\/pre>\n\n\n\n<p>Finally, we have the change in W5:<\/p>\n\n\n<p>$ \\partial E \\over \\partial W5<br \/>\n$<\/p>\n\n\n\n<p><code>=0.58368\u22170.23213\u22170.538<\/code><\/p>\n\n\n\n<p><code>=0.07289<\/code><\/p>\n\n\n\n<p>In order to update W5 recall the discussion on gradient descent. 
Let alpha be the learning rate, with a chosen value of 0.01.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote\"><p>Since we descend along the negative gradient, the updated W5 will be:<\/p><\/blockquote>\n\n\n<p>$<br \/>\nW5 - \\alpha * ({\\frac{\\partial E}{\\partial W5}})<br \/>\n$<\/p>\n\n\n\n<p><code>= 0.18 - 0.01 * 0.07289<\/code><\/p>\n\n\n\n<p><code>= 0.1792711<\/code><\/p>\n\n\n\n<p>Similarly, we can update the remaining weights. Let\u2019s have a look at the formula to update W1:<\/p>\n\n\n<p>$<br \/>\n\\frac{\\partial E}{\\partial w1}<br \/>\n$<\/p>\n\n\n\n<blockquote class=\"wp-block-quote\"><p>equals<\/p><\/blockquote>\n\n\n<p>$<br \/>\n(\\sum\\limits_{i}{\\frac{\\partial E}{\\partial out_{i}} * \\frac{\\partial out_{i}}{\\partial Y_{i}} * \\frac{\\partial Y_{i}}{\\partial out_{h1}}}) * \\frac{\\partial out_{h1}}{\\partial H1} * \\frac{\\partial H1}{\\partial w_{1}}<br \/>\n$<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" loading=\"lazy\" width=\"791\" height=\"388\" src=\"https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2018\/12\/neural_network_layer.jpg?resize=791%2C388\" alt=\"neural networks\" class=\"wp-image-802\" srcset=\"https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2018\/12\/neural_network_layer.jpg?w=791&amp;ssl=1 791w, https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2018\/12\/neural_network_layer.jpg?resize=300%2C147&amp;ssl=1 300w, https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2018\/12\/neural_network_layer.jpg?resize=768%2C377&amp;ssl=1 768w\" sizes=\"(max-width: 791px) 100vw, 791px\" title=\"\" data-recalc-dims=\"1\"><figcaption><\/figcaption><\/figure>\n\n\n\n<p>It looks complicated, but we are simply going back layer by layer to collect each factor: w1 feeds into neuron H1, and H1 is connected to both Y1 and Y2. Moving backwards, we first differentiate the error function, then Y1 and Y2 (their activation functions and their functions of weights). 
That leads us to H1, where we differentiate its activation function and its respective function of weights.<\/p>\n\n\n\n<p>This is how we back-propagate the errors and update all the weights. Once we update all the weights, that is one&nbsp;<strong>epoch<\/strong>&nbsp;or pass over the dataset. Then we start the entire process of forward pass and backward pass again. This process is repeated multiple times with the purpose of minimizing the error.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote\"><p>When do we stop?<\/p><\/blockquote>\n\n\n\n<p>We stop before over-fitting sets in: we want the minimum validation error, so we stop once the validation error stops decreasing even though the training error continues to fall.<\/p>\n\n\n\n<p>Hopefully, this explains the entire process of how neural networks actually work and sheds some light on gradient descent and back-propagation.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"what-s-next\">What\u2019s Next?<\/h3>\n\n\n\n<p><a href=\"https:\/\/datasciencediscovery.com\/index.php\/2018\/11\/21\/deep-learning-activation-function\/\">Activation<\/a>: We have talked about activation functions in past posts, but let\u2019s understand in more detail the different types of activation functions and explore their characteristics.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Understand the concept of Gradient Descent and Back-propagation to get some idea of how Neural Networks work. Deep Learning Series In this series, with each blog post we dive deeper into the world of deep learning and along the way build intuition and understanding. This series has been inspired by and references several sources. 
It [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":549,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_mi_skip_tracking":false,"spay_email":"","jetpack_publicize_message":"","jetpack_is_tweetstorm":false,"jetpack_publicize_feature_enabled":true},"categories":[72,111],"tags":[107,104,80,106,108,75,109,105],"jetpack_featured_media_url":"https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/09\/brain-gears.jpg?fit=1280%2C853&ssl=1","jetpack_publicize_connections":[],"_links":{"self":[{"href":"https:\/\/datasciencediscovery.com\/index.php\/wp-json\/wp\/v2\/posts\/511"}],"collection":[{"href":"https:\/\/datasciencediscovery.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/datasciencediscovery.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/datasciencediscovery.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/datasciencediscovery.com\/index.php\/wp-json\/wp\/v2\/comments?post=511"}],"version-history":[{"count":12,"href":"https:\/\/datasciencediscovery.com\/index.php\/wp-json\/wp\/v2\/posts\/511\/revisions"}],"predecessor-version":[{"id":803,"href":"https:\/\/datasciencediscovery.com\/index.php\/wp-json\/wp\/v2\/posts\/511\/revisions\/803"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/datasciencediscovery.com\/index.php\/wp-json\/wp\/v2\/media\/549"}],"wp:attachment":[{"href":"https:\/\/datasciencediscovery.com\/index.php\/wp-json\/wp\/v2\/media?parent=511"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/datasciencediscovery.com\/index.php\/wp-json\/wp\/v2\/categories?post=511"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/datasciencediscovery.com\/index.php\/wp-json\/wp\/v2\/tags?post=511"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}