What does a neural network actually do?

 

As one of the most commonly used algorithms in machine learning, and more specifically in deep learning, neural networks are a powerful tool that has proved very successful on many problems in a wide variety of domains.

From the first day I was introduced to this powerful tool, while reading an article about Artificial Intelligence in 2018, I was captivated by the idea of creating an artificial neural network that can mimic the human brain. Upon further reading, I became more fascinated by the different architectures of neural networks and the results they achieved on many complex tasks. However, one question was always at the back of my mind, and it took me a while to fully grasp its answer: What does a neural network actually do?

In other words, how does a neural network compute a certain prediction?


There are many frameworks and libraries that have made machine learning accessible to a bigger community, and not only to researchers who code their algorithms from scratch. Nowadays, many machine learning practitioners are not interested in learning the mathematical background of machine learning and are satisfied with understanding the architecture and workflow of a certain model. This is understandable if the task at hand is relatively simple and the available tools can achieve good results by simply adjusting some hyper-parameters.

However, when trying to solve more complicated problems using machine learning, it is highly recommended to have a deeper understanding of how the chosen algorithm works. Knowing how a certain optimizer works, the difference between loss functions, or the specifics of the common activation functions for neural networks can make a tangible difference in the results obtained.

In this blog post, I will try to clarify the work a neural network does in order to make a prediction by simplifying the mathematics behind it, so that you can have a deeper understanding of what's happening when you train your next neural network. To do so, we will study an example of binary classification using a neural network with one hidden layer and only one unit. We will use this over-simplified neural network to predict whether a picture represents the number "9" or not.

This 1-unit neural network can also be thought of as a logistic regression algorithm, since we're going to use a sigmoid function as the activation function. This choice will be clarified later in this post.

Note: The code fragments shown in this post are available on my GitHub.


Neural Networks

Since the early days of artificial neural networks (ANNs), the back-propagation algorithm has been used to train ANNs by calculating the derivatives of the loss function using the chain rule. This algorithm is still used by the most famous machine learning frameworks, such as scikit-learn, TensorFlow and PyTorch, because of its effectiveness, accuracy, and consistency.

So the work done by an ANN during training can be split into two consecutive phases, as shown in the figure below:
  1. The Forward-Propagation
  2. The Back-Propagation

Simplified ANNs diagram

Forward-Propagation

During this first phase of training, the neural network passes the input through the hidden layers consecutively until it reaches the output layer, from which a prediction is calculated and a loss is assigned to that prediction.
The loss function is used to penalize the neural network in such a way that, when the prediction is "close" to the true target, the loss is small, and when the prediction is "distant" from the true target, the loss is big.

So, can you guess what the mission of the neural network is during training?
Yes! It's to minimize the loss function, because by doing so, we make sure that the resulting predictions are the best we can achieve.

And since the loss is a function (a.k.a. a cost function) and the task is to minimize it, this is where mathematics comes into play. There is a huge field of mathematics called Optimization, which focuses on studying problems of minimizing or maximizing a function under certain constraints.

Most real-life optimization problems are complicated and can't be solved analytically, that is, by using calculus and algebra to obtain an exact solution. This is due to many factors, such as the complexity and irregularity of the cost function.
Over the years, mathematicians have proposed many algorithms that can solve optimization problems numerically (i.e. using computers), and one of the most famous is Gradient Descent. This algorithm is now available in the majority of machine learning frameworks, and it has been upgraded and improved since it was introduced by Cauchy in 1847.

Generally, the gradient descent algorithm minimizes a function by advancing step by step towards the minimum over a certain number of iterations, where each step is taken in the direction opposite to the derivative of the function at the current position. The following graphic shows the steps taken by gradient descent to minimize a parabola over the interval [-10,10].

Gradient descent in 1D

For more details about the gradient descent algorithm, please refer to this post.

As shown in the graphic above, the parameter x is updated at each step of the algorithm until it reaches the optimal solution. We then say that the algorithm converges towards a minimum, which in this case is x*=0.
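The 1D case can be sketched in a few lines of Python. This is only an illustration under assumptions that the post does not state: the parabola is taken to be f(x) = x², the starting point is x = 8, and the learning rate is 0.1.

```python
def f(x):
    """Parabola to minimize (assumed here to be f(x) = x**2)."""
    return x ** 2


def df(x):
    """Derivative of f."""
    return 2 * x


def gradient_descent(x0, learning_rate=0.1, n_steps=50):
    """Take n_steps steps, each in the direction opposite to the derivative."""
    x = x0
    path = [x]  # keep every visited position, e.g. for plotting
    for _ in range(n_steps):
        x = x - learning_rate * df(x)
        path.append(x)
    return x, path


# Assumed starting point inside the interval [-10, 10].
x_final, path = gradient_descent(x0=8.0)
print(f"x after {len(path) - 1} steps: {x_final:.6f}, f(x) = {f(x_final):.2e}")
```

With these values, each step multiplies x by 0.8, so the iterates shrink towards the minimum at x* = 0 without ever overshooting it. A learning rate above 1 would make the steps diverge for this particular function.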
This is a simplified illustration of what happens when training a neural network. The parameters of a neural network are the weights and biases, and they are updated step by step until convergence is reached.
The process of updating the internal parameters of an ANN is done through what is known as Backward-propagation, or simply Back-prop.

Backward-Propagation

This is the stage where the learning actually happens: the derivatives of the loss function with respect to the internal variables of the neural network are calculated, and the gradient descent algorithm updates the weights and biases. The chain rule makes it possible to compute the derivatives with respect to the variables of layer n from the derivatives with respect to the variables of layer n+1, so the derivatives are calculated while propagating from the output layer towards the input layer. That is why this method is called Backward-Propagation.

Binary classification example

Training a neural network to predict whether or not a picture represents a certain object is an example of binary classification, and it can be achieved with a simple architecture. In this example, we will use a neural network with only one hidden layer and one unit. This can be viewed as a logistic regression algorithm, although we will use the concepts of forward and backward propagation to train the neural network. The following image shows the architecture of the neural network:

Binary Classification Neural Network example

MNIST is a publicly available dataset of hand-written digits that can be found here, or in a .csv format here.
The image is treated as an array of 784 pixels (28 × 28). This array representation of the image is passed as input to the neural network, and the weighted sum of the array is calculated in the single unit of the next layer. Finally, the activation of the weighted sum is computed, and a prediction is made with an assigned loss. This is the forward propagation under the chosen configuration.
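As a rough sketch, this forward pass for a single image could look as follows in plain Python. The names `forward`, `W`, `b` and `x`, the random stand-in image and the small random initialization are all assumptions made for illustration. They are not taken from the code that accompanies this post, and a real implementation would most likely use a vectorized library.

```python
import math
import random


def sigmoid(z):
    """Sigmoid activation, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def forward(x, W, b):
    """Forward pass of the 1-unit network: weighted sum, then activation."""
    z = sum(w_j * x_j for w_j, x_j in zip(W, x)) + b
    return sigmoid(z)


random.seed(0)
n_pixels = 784  # a 28x28 image flattened into one array

# Stand-in for a real MNIST image: random pixel intensities in [0, 1].
x = [random.random() for _ in range(n_pixels)]

# Assumed initialization: small random weights and a zero bias.
W = [random.gauss(0.0, 0.01) for _ in range(n_pixels)]
b = 0.0

y_hat = forward(x, W, b)  # predicted probability that the image is a "9"
print(f"P(image is a 9) = {y_hat:.3f}")
```

Because the weights are untrained, the printed probability carries no meaning yet. It only becomes informative once the weights and bias have been adjusted during training.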

The computational graph of this forward propagation is presented in the following image:

Forward Propagation computational graph


The \(\sigma\) function is the sigmoid activation function, and \(\mathcal{L}\) is the log loss function for one sample defined by: \[\mathcal{L}(y,\hat{y})=-[y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})]\]
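And the loss over the whole training set of images is given by the mean of the per-sample losses, \[\mathcal{L}_N(y,\hat{y})=-\frac{1}{N} \sum_{i=1}^N[y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)],\quad N:\mbox{number of samples}\] where \(y_i\) and \(\hat{y}_i\) are the true label and the prediction for the i-th sample.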
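To make the behaviour of the log loss concrete, here is a small sketch in plain Python. The function names, the clipping constant `eps` and the sample values are illustrative assumptions, not part of the original code.

```python
import math


def log_loss(y, y_hat, eps=1e-12):
    """Log loss for one sample; y is 0 or 1, y_hat is a probability."""
    # Clip the prediction so that log(0) is never evaluated.
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))


def mean_log_loss(ys, y_hats):
    """Mean of the per-sample losses over N samples."""
    return sum(log_loss(y, p) for y, p in zip(ys, y_hats)) / len(ys)


# Illustrative values: the true label is 1 ("this is a 9").
print(log_loss(1, 0.9))  # prediction close to the target: small loss
print(log_loss(1, 0.1))  # prediction far from the target: large loss
print(mean_log_loss([1, 0, 1], [0.9, 0.2, 0.6]))
```

The first two calls show the penalty at work: for the same true label, the loss grows sharply as the predicted probability moves away from it.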
 

Once the loss is calculated, the Back-prop begins. The following graph shows the computations made during this phase to determine the derivatives of the loss function with respect to the internal variables using the chain rule.
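For this 1-unit network, the chain rule can be worked out by hand. The notation here is an assumption made for the derivation: \(x\) is the input array, \(z = W \cdot x + b\) is the weighted sum, and \(\hat{y} = \sigma(z)\) is the prediction. Differentiating the log loss and the sigmoid separately gives \[\frac{\partial \mathcal{L}}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}},\quad \frac{\partial \hat{y}}{\partial z} = \sigma(z)(1 - \sigma(z)) = \hat{y}(1 - \hat{y})\]
Multiplying the two, the denominators cancel: \[\frac{\partial \mathcal{L}}{\partial z} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} = -y(1 - \hat{y}) + (1 - y)\hat{y} = \hat{y} - y\]
One more application of the chain rule, using \(\partial z / \partial W = x\) and \(\partial z / \partial b = 1\), yields the derivatives with respect to the parameters: \[\frac{\partial \mathcal{L}}{\partial W} = (\hat{y} - y)\, x,\quad \frac{\partial \mathcal{L}}{\partial b} = \hat{y} - y\]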



Finally, using the gradient descent algorithm, the weights and biases are updated as follows: \[W_{i+1} = W_i  - \alpha \frac{\partial \mathcal{L}}{\partial W_i},\quad  b_{i+1} = b_i  - \alpha \frac{\partial \mathcal{L}}{\partial b_i}\]
where i is the iteration number of the gradient descent, and \(\alpha\) is the learning rate.
Once the weights and biases are updated, the next iteration of the gradient descent begins, and the same process is repeated until convergence is reached according to a stopping criterion.
Stopping criteria are conditions that must be met in order to stop the execution of the algorithm. Some of the most common ones are: execution time, total number of iterations, number of non-improving iterations, an optimal solution being found (a lower bound for a minimization, an upper bound for a maximization), etc.
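Putting the pieces together, a full training loop might be sketched as below. Everything specific in it is an assumption made for illustration: the function `train`, the zero initialization, the learning rate, the tolerance `tol`, and above all the tiny synthetic dataset that stands in for the MNIST images. The gradient used is the standard one for a sigmoid unit with log loss, (ŷ − y)·x for the weights and (ŷ − y) for the bias, averaged over the samples.

```python
import math
import random


def sigmoid(z):
    """Sigmoid activation, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_loss(y, y_hat, eps=1e-12):
    """Log loss for one sample, with clipping to avoid log(0)."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))


def train(X, ys, alpha=0.5, max_iters=1000, tol=1e-6):
    """Gradient descent on the mean log loss of a 1-unit network.

    Stops after max_iters iterations, or earlier when the loss improves
    by less than tol between two consecutive iterations.
    """
    n_samples, n_features = len(X), len(X[0])
    W = [0.0] * n_features
    b = 0.0
    history = []  # mean loss at each iteration
    for _ in range(max_iters):
        # Forward propagation: predictions and mean loss.
        y_hats = [sigmoid(sum(w * v for w, v in zip(W, x)) + b) for x in X]
        loss = sum(log_loss(y, p) for y, p in zip(ys, y_hats)) / n_samples

        # Stopping criterion: non-improving iteration.
        stalled = bool(history) and history[-1] - loss < tol
        history.append(loss)
        if stalled:
            break

        # Back-propagation: dL/dz = y_hat - y, averaged over the samples.
        errors = [p - y for p, y in zip(y_hats, ys)]
        grad_W = [sum(e * x[j] for e, x in zip(errors, X)) / n_samples
                  for j in range(n_features)]
        grad_b = sum(errors) / n_samples

        # Gradient descent update of the weights and the bias.
        W = [w - alpha * g for w, g in zip(W, grad_W)]
        b = b - alpha * grad_b
    return W, b, history


# Synthetic stand-in for the image data: 2 features, label 1 if x0 + x1 > 1.
random.seed(0)
X = [[random.random(), random.random()] for _ in range(200)]
ys = [1 if x[0] + x[1] > 1.0 else 0 for x in X]

W, b, history = train(X, ys)
print(f"iterations: {len(history)}, first loss: {history[0]:.4f}, "
      f"last loss: {history[-1]:.4f}")
```

The loop combines two of the stopping conditions listed above: a cap on the total number of iterations and a check for non-improving iterations. Swapping the synthetic data for flattened MNIST images and "is it a 9" labels would leave the loop itself unchanged.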

What to keep in mind?

  • ANNs learn by minimizing a loss/cost function.
  • The minimization process is carried out by an optimization algorithm such as Gradient Descent.
  • At each iteration of the optimization algorithm, the weights and biases are updated using Back-prop.
  • The training process stops when the algorithm converges.
In this post, a simplified representation of neural networks was presented in order to clarify the calculations done behind the scenes during the training phase. The neural networks commonly used nowadays are more complex and have other parameters that come into play, but the main concepts behind them are similar, and keeping that in mind is very helpful when trying to create machine learning models.
The code for the example described above is available here.



Machine Learning
July 05, 2021