Introduction to Deep Learning
Deep learning is an area of machine learning research based on a particular type of learning mechanism. It is characterized by the effort to build a learning model with several levels, in which the deeper levels take as input the outputs of the previous levels, transforming them and making them progressively more abstract. This idea of levels of learning is inspired by the way the brain processes information and learns in response to external stimuli.
Each learning level corresponds, hypothetically, to one of the different areas which make up the cerebral cortex.
The visual cortex, which handles image recognition, is organized as a sequence of sectors arranged in a hierarchy. Each of these areas receives an input representation by means of signals flowing from the other sectors connected to it.
Each level of this hierarchy represents a different level of abstraction, with the most abstract features defined in terms of those of the lower levels. When the brain receives an input image, processing goes through several phases, for example, edge detection and then the perception of forms (from primitive ones to gradually more complex ones).
Just as the brain learns by trial and error and activates new neurons as it learns from experience, in deep learning architectures the feature extraction stages, or layers, are adjusted based on the information received at the input.
The following scheme illustrates this for an image classification system. Each block gradually extracts features of the input image, processing data already preprocessed by the previous blocks and extracting increasingly abstract features, thus building the hierarchical representation of data that characterizes a deep learning based system.
More precisely, it builds the layers as follows along with the figure representation:
Here is the visual representation of the process:
Figure 1: A deep learning system at work on a facial classification problem
The development of deep learning occurred in parallel with the study of artificial intelligence, and especially of neural networks. After its beginnings in the 1950s, it was mainly in the 1980s that this area grew, thanks to Geoff Hinton and the machine learning specialists who collaborated with him. In those years, computer technology was not sufficiently advanced to allow real progress in this direction, so we had to wait until the present day to see even more significant developments, made possible by the availability of data and computing power.
As for areas of application, deep learning is employed in the development of speech recognition systems, in the search for patterns, and especially in image recognition, thanks to its level-by-level learning, which enables it to focus, step by step, on the various areas of an image to be processed and classified.
Artificial Neural Networks (ANNs) are one of the main tools that take advantage of the concept of deep learning. They are an abstract representation of our nervous system, which contains a collection of neurons that communicate with each other through connections called axons. The first artificial neuron model was proposed in 1943 by McCulloch and Pitts as a computational model of nervous activity. This model was followed by others, proposed by John von Neumann, Marvin Minsky, Frank Rosenblatt (the so-called perceptron), and many others.
As you can see in the following figure, a biological neuron is composed of the following:
This is what a biological neuron model looks like:
Figure 2: Biological neuron model
The activity of a neuron alternates between sending a signal (active state) and resting while it receives signals from other neurons (inactive state).
The transition from one phase to the other is caused by external stimuli, represented by signals that are picked up by the dendrites. Each signal has an excitatory or inhibitory effect, conceptually represented by a weight associated with the stimulus. The neuron in an idle state accumulates all the signals received until they reach a certain activation threshold.
Similar to the biological one, the artificial neuron consists of the following:
The following figure represents the artificial neuron:
Figure 3: Artificial neuron model
The output, that is, the signal by which the neuron transmits its activity to the outside, is calculated by applying the activation function, also called the transfer function, to the weighted sum of the inputs. These functions typically have an output range between -1 and 1, or between 0 and 1.
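The computation just described can be sketched in a few lines of Python. This is only a minimal illustration: the input values, the weights, and the `bias` term are hypothetical (the text does not mention a bias), and `neuron_output` is a name chosen here for the example.

```python
import math

def sigmoid(z):
    """Logistic activation: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias, activation=sigmoid):
    """Apply the activation function to the weighted sum of the inputs."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)

# Hypothetical values, chosen only for illustration.
inputs = [1.0, 0.5, -1.0]
weights = [0.4, -0.6, 0.2]   # positive = excitatory, negative = inhibitory
bias = 0.1                   # assumption: an extra constant input

y_sigmoid = neuron_output(inputs, weights, bias)            # lies in (0, 1)
y_tanh = neuron_output(inputs, weights, bias, math.tanh)    # lies in (-1, 1)
```

Swapping the `activation` argument changes the output range without touching the weighted sum, which is why the choice of transfer function can be treated as a separate design decision.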
There is a set of activation functions that differs in complexity and output:
From the simplest forms, used in the prototyping of the first artificial neurons, we then move on to more complex ones that allow greater characterization of the functioning of the neuron. The following are just a few:
It should be recalled that the network, and therefore the weights that feed the activation functions, will then be trained. Although the selection of the activation function is an important task in the implementation of the network architecture, studies indicate only marginal differences in output quality if the training phase is carried out properly.
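To make the differences in output concrete, here is a small sketch of three commonly used transfer functions. The selection (step, sigmoid, and hyperbolic tangent) is an assumption made for this example and is not necessarily the set shown in the figure; `tabulate` is a helper name introduced here.

```python
import math

def step(z, threshold=0.0):
    """Simplest form: the neuron fires (1) once the threshold is reached."""
    return 1.0 if z >= threshold else 0.0

def sigmoid(z):
    """Smooth version of the step function, with outputs in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# math.tanh has outputs in (-1, 1).
ACTIVATIONS = {"step": step, "sigmoid": sigmoid, "tanh": math.tanh}

def tabulate(zs):
    """Evaluate every activation function on the same weighted sums."""
    return {name: [f(z) for z in zs] for name, f in ACTIVATIONS.items()}

table = tabulate([-2.0, 0.0, 2.0])
```

The step function jumps abruptly at the threshold, while the sigmoid and the hyperbolic tangent change gradually, which is what makes them usable with gradient-based training.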
Figure 4: Most used transfer functions
In the preceding figure, the functions are labeled as follows:
The learning process of a neural network considered here is an iterative optimization of the weights and is of the supervised type. The weights are modified based on the network's performance on a set of examples belonging to the training set, for which the correct category is known. The aim is to minimize a loss function, which indicates the degree to which the behavior of the network deviates from the desired one. The performance of the network is then verified on a test set consisting of objects (for example, images in an image classification problem) other than those of the training set.
One widely used supervised learning algorithm is the backpropagation algorithm.
The basic steps of the training procedure are as follows:
The availability of efficient algorithms for weight optimization is therefore essential for the construction of neural networks. The problem can be solved with an iterative numerical technique called gradient descent (GD).
This technique works according to the following algorithm:
The gradient G of the error function E gives the direction in which the error function increases most steeply at the current values, so to decrease E, we have to take some small steps in the opposite direction, -G (see the following figure).
By repeating this operation iteratively, we move down toward a minimum of the function E, where the gradient G approaches zero (see the following figure):
Figure 5: Gradient descent procedure
As you can see, each step brings us closer to a minimum of the function E, where the gradient G vanishes.
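The procedure can be sketched on a one-dimensional example. The error function E(w) = (w - 3)^2, the starting point, the learning rate, and the number of steps are all hypothetical choices made for this illustration.

```python
def gradient_descent(grad, w0, learning_rate=0.1, steps=100):
    """Repeatedly take a small step against the gradient, starting from w0."""
    w = w0
    history = [w]
    for _ in range(steps):
        w = w - learning_rate * grad(w)   # move in the direction -G
        history.append(w)
    return w, history

# Hypothetical error function with its minimum at w = 3.
def E(w):
    return (w - 3.0) ** 2

# Its gradient, G(w) = dE/dw.
def G(w):
    return 2.0 * (w - 3.0)

w_min, history = gradient_descent(G, w0=0.0)
```

With these settings every step shrinks the distance to the minimum by a constant factor, so `w_min` ends up very close to 3. A learning rate that is too large would instead overshoot the minimum, which is why the steps are kept small.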
In GD optimization, we compute the cost gradient based on the complete training set; hence, we sometimes also call it batch GD. For very large datasets, GD can be quite costly, since we take only a single step for each pass over the training set. Thus, the larger the training set, the more slowly our algorithm updates the weights and the longer it may take to converge to the global cost minimum.
An alternative and faster variant of gradient descent, which for this reason is widely used in DNNs, is Stochastic Gradient Descent (SGD).
In SGD, we use only one training sample from the training set to do the update for a parameter in a particular iteration. Here, the term stochastic comes from the fact that the gradient based on a single training sample is a stochastic approximation of the true cost gradient.
Due to its stochastic nature, the path toward the global cost minimum is not direct, as in GD, but may zigzag when we visualize the cost surface in a 2D space (see the following figure, (b) Stochastic Gradient Descent - SGD).
We can compare these optimization procedures in the next figure. Gradient descent (see the following figure, (a) Gradient Descent - GD) ensures that each update of the weights is done in the right direction, the one that minimizes the cost function. As datasets grew in size and the computations in each step became more complex, SGD came to be preferred. Here, the weights are updated as each sample is processed and, as such, subsequent calculations already use improved weights. Nonetheless, this same property causes some misdirection in minimizing the error function:
Figure 6: GD versus SGD
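The difference between the two update schemes can be sketched on a toy problem: fitting a single weight w so that w * x approximates y. The four data points, the learning rate, and the number of passes are hypothetical values chosen for this example.

```python
import random

# Hypothetical noisy samples of roughly y = 2x.
data = [(0.5, 1.1), (1.0, 1.9), (1.5, 3.2), (2.0, 3.9)]

def sample_gradient(w, x, y):
    """Gradient of the squared error 0.5 * (w*x - y)**2 with respect to w."""
    return (w * x - y) * x

def batch_gd(w, epochs, lr):
    """Batch GD: one weight update per full pass over the training set."""
    updates = 0
    for _ in range(epochs):
        g = sum(sample_gradient(w, x, y) for x, y in data) / len(data)
        w -= lr * g
        updates += 1
    return w, updates

def sgd(w, epochs, lr, seed=0):
    """SGD: one weight update per training sample, visited in random order."""
    rng = random.Random(seed)
    updates = 0
    for _ in range(epochs):
        samples = data[:]
        rng.shuffle(samples)
        for x, y in samples:
            w -= lr * sample_gradient(w, x, y)
            updates += 1
    return w, updates

# Least-squares optimum for this one-weight model, used as a reference.
w_star = sum(x * y for x, y in data) / sum(x * x for x, _ in data)

w_gd, n_gd = batch_gd(0.0, epochs=50, lr=0.1)
w_sgd, n_sgd = sgd(0.0, epochs=50, lr=0.1)
```

For the same number of passes, SGD performs four times as many updates here (one per sample). Batch GD settles on `w_star`, while SGD with a constant learning rate keeps being pulled toward whichever sample it saw last, so it hovers near the optimum rather than landing exactly on it.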
The way the nodes are connected, the number of layers present (that is, the levels of nodes between input and output), and the number of neurons per layer define the architecture of a neural network.
In multilayer networks, one can identify the artificial neurons of layers such that:
The input and output layers define the inputs and outputs; between them are hidden layers, whose complexity gives rise to the different behaviors of the network. Finally, the connections between neurons are represented by as many matrices as there are pairs of adjacent layers. Each matrix contains the weights of the connections between the pairs of nodes of two adjacent layers. Feed-forward networks are networks whose connections form no loops.
The following is a graphical representation of a multilayer perceptron architecture:
Figure 7: A multilayer perceptron architecture
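The matrix view of the connections can be sketched as follows. The layer sizes, the random initialization, and the use of tanh as activation are assumptions made for this example, and bias terms are left out to keep it short.

```python
import math
import random

def make_weights(layer_sizes, seed=0):
    """Build one weight matrix per pair of adjacent layers.

    Each matrix has one row per neuron of the next layer and one column
    per neuron of the previous layer.
    """
    rng = random.Random(seed)
    return [
        [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    ]

def forward(x, weights):
    """Feed-forward pass: activations flow layer by layer, with no loops."""
    activation = x
    for matrix in weights:
        activation = [
            math.tanh(sum(w * a for w, a in zip(row, activation)))
            for row in matrix
        ]
    return activation

layer_sizes = [3, 4, 2]   # hypothetical: input, one hidden layer, output
weights = make_weights(layer_sizes)
output = forward([0.5, -0.2, 0.1], weights)
```

A network with three layers therefore needs two matrices, here of shapes 4x3 and 2x4. Adding a hidden layer adds one more matrix to the list without changing `forward`.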
Deep Neural Networks (DNNs) are artificial neural networks strongly oriented toward deep learning. Where ordinary analysis procedures are inapplicable due to the complexity of the data to be processed, such networks are an excellent modeling tool. DNNs are very similar to the neural networks we have discussed, but they implement a more complex model (a greater number of neurons, hidden layers, and connections), while following the learning principles that apply to all machine learning problems (for example, supervised learning).
By construction, DNNs work in parallel, so they are able to process a lot of data. They are a sophisticated statistical system with good tolerance to errors.
Unlike algorithmic systems where you can examine the output generation step by step, in neural networks, you can also have very reliable results, but sometimes without the ability to understand the reasons for those results. There are no theorems to generate optimal neural networks--the likelihood of getting a good network is all in the hands of its creator, who must be familiar with statistical concepts, and particular attention must be given to the choice of predictor variables.
Finally, we observe that, in order to be productive, DNNs require training that properly tunes the weights. Training can take a long time if the data to be examined and the variables involved are numerous, as is often the case when you want optimal results.
Convolutional Neural Networks (CNNs) have been designed specifically for image recognition. Each image used in learning is divided into compact topological portions, each of which is processed by filters that search for particular patterns. Formally, each image is represented as a three-dimensional matrix of pixels (width, height, and color channels), and every sub-portion is convolved with the filter set. In other words, sliding each filter along the image computes the inner product of that filter and the input at each position. This procedure produces a set of feature maps (activation maps), one for each filter. By stacking the feature maps for the same portion of the image, we get an output volume. This type of layer is called a convolutional layer.
The following figure shows a typical CNN architecture:
Figure 8: Convolutional neural network architecture
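The sliding inner product performed by a convolutional layer can be sketched as follows. To keep the example short, it assumes a single-channel image, a stride of 1, and no padding; the image and the two filters are hypothetical. As is common in deep learning code, the filter is applied without flipping it.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image and take the inner product of the
    kernel and the underlying image patch at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# Hypothetical 4x4 image with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Two hypothetical 2x2 filters.
vertical_edge = [[-1, 1], [-1, 1]]
horizontal_edge = [[-1, -1], [1, 1]]

# One feature map per filter; stacking them gives the output volume.
output_volume = [convolve2d(image, k) for k in (vertical_edge, horizontal_edge)]
```

The first feature map is non-zero only in its middle column, where the vertical edge lies, while the second stays at zero everywhere because the image contains no horizontal edge. Each filter thus responds to its own pattern.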
A Restricted Boltzmann Machine (RBM) consists of a visible and a hidden layer of nodes, but without visible-visible or hidden-hidden connections, hence the term restricted. These restrictions allow more efficient network training (training that can be supervised or unsupervised).
This type of neural network can represent a large number of features of the inputs with a small network; in fact, n hidden nodes can represent up to 2^n features. The network can be trained to respond to a single question (yes/no), up to (again, in binary terms) a total of 2^n questions.
The architecture of the RBM is as follows, with neurons arranged according to a symmetrical bipartite graph:
Figure 9: Restricted Boltzmann Machine architecture
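Two properties of the RBM can be sketched in code: the count of 2^n binary configurations of n hidden nodes, and the fact that, with no hidden-hidden connections, each hidden node can be evaluated independently from the visible layer. The sketch assumes binary units with a sigmoid activation probability, which is the usual formulation; the weights and biases are hypothetical.

```python
import math
from itertools import product

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_probabilities(visible, weights, hidden_bias):
    """Probability that each hidden node is on, given the visible layer.

    weights[i][j] connects visible node i to hidden node j. Because the
    graph is bipartite, each hidden node depends only on the visible nodes.
    """
    return [
        sigmoid(b + sum(v * weights[i][j] for i, v in enumerate(visible)))
        for j, b in enumerate(hidden_bias)
    ]

n_hidden = 3
# All binary configurations of the hidden layer: 2**n_hidden of them.
hidden_states = list(product([0, 1], repeat=n_hidden))

# Hypothetical weights for 2 visible and 3 hidden nodes.
weights = [[0.5, -0.3, 0.0], [0.1, 0.8, -0.6]]
probs = hidden_probabilities([1, 0], weights, [0.0, 0.0, 0.0])
```

With three hidden nodes, `hidden_states` holds eight configurations. A positive weight raises the probability that a hidden node switches on and a negative weight lowers it.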
Stacked autoencoders are DNNs that are typically used for data compression. Their particular hourglass structure clearly shows the first part of the process, where the input data is compressed, up to the so-called bottleneck, from which the decompression starts.
The output is then an approximation of the input. These networks are not supervised in the pretraining (compression) phase, and the fine-tuning (decompression) phase is supervised:
Figure 10: Stacked autoencoder architecture
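The hourglass structure can be sketched as a stack of layers whose sizes shrink to a bottleneck and then grow back to the input size. The layer sizes, the random untrained weights, and the tanh activation are assumptions made for this example; training, which would reduce the reconstruction error, is not shown.

```python
import math
import random

def dense(x, matrix):
    """One fully connected layer with a tanh activation (no bias)."""
    return [math.tanh(sum(w * a for w, a in zip(row, x))) for row in matrix]

def random_matrix(n_in, n_out, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]

def run(x, stack):
    """Pass x through a list of layers in order."""
    for matrix in stack:
        x = dense(x, matrix)
    return x

rng = random.Random(0)
sizes = [8, 4, 2, 4, 8]   # hypothetical hourglass; the bottleneck has 2 units
layers = [random_matrix(a, b, rng) for a, b in zip(sizes, sizes[1:])]

# Split the stack at the bottleneck into compression and decompression parts.
bottleneck_index = sizes.index(min(sizes))
encoder, decoder = layers[:bottleneck_index], layers[bottleneck_index:]

x = [0.1 * i for i in range(8)]
code = run(x, encoder)                 # compressed representation
reconstruction = run(code, decoder)    # same size as the input

# Mean squared difference between the input and its reconstruction.
error = sum((a - b) ** 2 for a, b in zip(x, reconstruction)) / len(x)
```

Because the weights are random, the reconstruction here is poor. The point of the sketch is only the shape of the data as it flows through: 8 values in, 2 values at the bottleneck, 8 values out.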
The fundamental feature of a Recurrent Neural Network (RNN) is that the network contains at least one feedback connection, so the activations can flow around in a loop. It enables the networks to do temporal processing and learn sequences, for example, perform sequence recognition/reproduction or temporal association/prediction. RNN architectures can have many different forms. One common type consists of a standard multilayer perceptron (MLP) plus added loops. These can exploit the powerful non-linear mapping capabilities of the MLP, and also have some form of memory. Others have more uniform structures, potentially with every neuron connected to all the others, and may also have stochastic activation functions. For simple architectures and deterministic activation functions, learning can be achieved using similar gradient descent procedures to those leading to the backpropagation algorithm for feed-forward networks.
The following figure shows a few of the most important types and features of RNNs:
Figure 11: Recurrent Neural Network architecture
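The effect of a feedback connection can be sketched with a single recurrent unit whose state at each time step depends on the current input and on its own previous state. The scalar weights `w_in` and `w_rec`, the tanh activation, and the input sequence are hypothetical choices for this example.

```python
import math

def run_rnn(sequence, w_in=0.5, w_rec=0.8, bias=0.0, h0=0.0):
    """Process a sequence one element at a time with one recurrent unit.

    The new state is tanh(w_in * x + w_rec * h + bias), where h is the
    state from the previous time step (the feedback connection).
    """
    h = h0
    states = []
    for x in sequence:
        h = math.tanh(w_in * x + w_rec * h + bias)
        states.append(h)
    return states

# A single pulse followed by silence.
states = run_rnn([1.0, 0.0, 0.0, 0.0])

# The same input with the feedback connection switched off.
states_no_feedback = run_rnn([1.0, 0.0, 0.0, 0.0], w_rec=0.0)
```

With feedback, the state stays positive and decays gradually after the pulse, so the unit retains a trace of an input it saw several steps earlier. Without feedback, the state drops to zero as soon as the input does, which is the behavior of an ordinary feed-forward unit.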
Almost all libraries provide the possibility of using the graphics processor to speed up the learning process, are released under an open license, and are the result of implementation by university research groups. Before starting the comparison, refer to figure 12, which is one of the most complete charts of neural networks to date. If you look at the URL and related papers, you will find that the idea of neural networks is pretty old, and that the software frameworks we are going to compare below also adopt similar architectures in their development:
Finally, the following table provides a summary of each framework's salient features, including TensorFlow, which will be described (of course!) in our courses:
|                           | TensorFlow     | Torch | Caffe | Theano         |
|---------------------------|----------------|-------|-------|----------------|
| Programming language used | Python and C++ | Lua   | C++   | Python         |
| GPU card support          | Yes            | Yes   | Yes   | By default, no |
| Pros                      |                |       |       |                |
| Cons                      |                |       |       |                |
Deep learning framework comparison
We introduced some of the fundamental themes of deep learning. It consists of a set of methods that allow a machine learning system to obtain a hierarchical representation of data on multiple levels. This is achieved by combining simple units, each of which transforms the representation at its own level, starting from the input level, into a representation at a higher, slightly more abstract level.
In recent years, these techniques have provided results never seen before in many applications, such as image recognition and speech recognition. One of the main reasons for the spread of these techniques has been the development of GPU architectures, which considerably reduced the training time of DNNs. There are different DNN architectures, each of which has been developed for a specific problem.