The Top 10 Deep Learning Methods

September 29, 2020

Here I would like to share 10 powerful deep learning methods that AI engineers can apply to their machine learning problems.

Back Propagation

Back propagation is a method for computing the partial derivatives of a function. It applies the chain rule in a specific order of operations, working backwards from the output, which makes it highly efficient. At its heart is an expression for the partial derivative ∂C/∂w of the cost function C with respect to any weight w (or bias b) in the network. The expression tells us how quickly the cost changes when we change the weights and biases, and therefore how those changes affect the overall behavior of the network. Back propagation itself only computes the gradients; an optimizer such as gradient descent (in its simplest form, the delta rule) then uses them to search for a minimum of the error function in weight space.
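
To make the chain rule concrete, here is a minimal sketch for a single sigmoid neuron with a quadratic cost. The network size, the cost function and the numbers are assumptions chosen for illustration, not something prescribed above. The analytic gradient is compared with a finite-difference estimate as a sanity check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, b, x):
    """Forward pass of a single sigmoid neuron: returns (z, a)."""
    z = w * x + b
    return z, sigmoid(z)

def cost(w, b, x, y):
    """Quadratic cost C = 0.5 * (a - y)^2 (an assumed choice of cost)."""
    _, a = forward(w, b, x)
    return 0.5 * (a - y) ** 2

def backprop(w, b, x, y):
    """Return (dC/dw, dC/db) by applying the chain rule from the output backwards."""
    _, a = forward(w, b, x)
    dC_da = a - y              # derivative of the cost w.r.t. the activation
    da_dz = a * (1.0 - a)      # derivative of the sigmoid w.r.t. its input
    delta = dC_da * da_dz      # dC/dz, the "error" of the neuron
    return delta * x, delta    # dz/dw = x and dz/db = 1

# Illustrative values only
w, b, x, y = 0.6, 0.9, 1.0, 0.0
grad_w, grad_b = backprop(w, b, x, y)

# Numerical check with a central finite difference
eps = 1e-6
numeric_grad_w = (cost(w + eps, b, x, y) - cost(w - eps, b, x, y)) / (2 * eps)
numeric_grad_b = (cost(w, b + eps, x, y) - cost(w, b - eps, x, y)) / (2 * eps)
```

In a real multi-layer network the same delta is passed back layer by layer, so every weight's gradient is obtained in a single backward sweep.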

Stochastic Gradient Descent

Stochastic gradient descent is a very popular algorithm used to train numerous machine learning models. A gradient is the slope of a function: it measures how much one variable changes in response to changes in another. Mathematically, the gradient is the vector of partial derivatives of the cost function with respect to its parameters. The greater the gradient, the steeper the slope. Plain (batch) gradient descent computes this gradient over the entire data set at every iteration, which takes a huge amount of computation and makes it slow on large data sets.

Stochastic gradient descent comes to the rescue. The word ‘stochastic’ refers to a process that involves random probability.

Hence, in stochastic gradient descent, a few samples are selected randomly from the whole data set for each iteration, which reduces the computation immensely.
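
As a rough sketch of the idea, the snippet below fits a straight line with mini-batch stochastic gradient descent. The model (y = w*x + b), the mean squared error loss, the synthetic data and all hyperparameters are assumptions made up for this example.

```python
import random

def sgd_linear_regression(xs, ys, lr=0.1, batch_size=10, epochs=50, seed=0):
    """Fit y = w*x + b with mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = grad_b = 0.0
            for i in batch:
                err = (w * xs[i] + b) - ys[i]
                grad_w += 2.0 * err * xs[i]
                grad_b += 2.0 * err
            # Update using only this small batch, not the whole data set
            w -= lr * grad_w / len(batch)
            b -= lr * grad_b / len(batch)
    return w, b

# Synthetic, noise-free data generated from y = 2x + 1 (illustrative only)
data_rng = random.Random(42)
xs = [data_rng.uniform(-1.0, 1.0) for _ in range(200)]
ys = [2.0 * x + 1.0 for x in xs]

w_fit, b_fit = sgd_linear_regression(xs, ys)
```

Each update touches only 10 of the 200 samples, yet the fitted parameters should end up close to the true slope and intercept.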

Learning Rate Decay

To improve performance and reduce training time, it is better to adapt the learning rate used by stochastic gradient descent. This is also called learning rate annealing or an adaptive learning rate. The most common technique is to reduce the learning rate over time: larger learning rates are used at the beginning of training and are gradually reduced, so that smaller updates are made to the weights later on. The result is that good weights are learned quickly early on and fine-tuned later.

Two popular learning rate decay schedules are:

Decrease the learning rate slowly based on the epoch.
Decrease the learning rate using punctuated large drops at specific epochs.
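
The two schedules above can be sketched as follows. The function names, the exact formulas (time-based decay and step decay are common choices) and the parameter values are assumptions for illustration.

```python
def time_based_decay(initial_lr, decay, epoch):
    """Decrease the learning rate slowly as the epoch number grows."""
    return initial_lr / (1.0 + decay * epoch)

def step_decay(initial_lr, drop, epochs_per_drop, epoch):
    """Multiply the learning rate by `drop` once every `epochs_per_drop` epochs."""
    return initial_lr * (drop ** (epoch // epochs_per_drop))

# Learning rates at a few sample epochs (hypothetical settings)
slow_schedule = [time_based_decay(0.1, 0.01, e) for e in (0, 50, 100)]
step_schedule = [step_decay(0.1, 0.5, 10, e) for e in (0, 9, 10, 25)]
```

With these settings the first schedule shrinks smoothly, while the second stays flat for ten epochs and then halves.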

Dropout

Deep neural nets with a large number of parameters are powerful machine learning systems, but overfitting is a serious problem in such networks. Large networks are also slow to use, making it hard to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique that addresses this problem.

The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. Dropout has been used to improve the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark datasets.
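
Here is a minimal sketch of that train/test asymmetry for one layer of activations. The keep probability of 0.5 and the activation values are assumptions. Scaling the activations by the keep probability at test time has the same effect as using smaller outgoing weights. Note that many libraries instead use "inverted" dropout, which rescales during training; that variant is not shown here.

```python
import random

def dropout_train(activations, keep_prob, rng):
    """Training: keep each unit with probability keep_prob, otherwise zero it."""
    return [a if rng.random() < keep_prob else 0.0 for a in activations]

def dropout_test(activations, keep_prob):
    """Test: keep every unit but scale it by keep_prob."""
    return [a * keep_prob for a in activations]

rng = random.Random(0)
activations = [1.0, 2.0, 3.0, 4.0]   # illustrative layer outputs
keep_prob = 0.5

# Average the outputs of many randomly thinned passes
n_trials = 20000
sums = [0.0] * len(activations)
for _ in range(n_trials):
    for i, value in enumerate(dropout_train(activations, keep_prob, rng)):
        sums[i] += value
mean_train = [s / n_trials for s in sums]

# A single unthinned, scaled pass
test_out = dropout_test(activations, keep_prob)
```

For a single layer the scaled pass matches the average of the thinned passes in expectation; for a deep nonlinear network it is only an approximation.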

Max Pooling

To maintain a proper balance between computing resources and extracting meaningful features from images, downsampling must be done at suitable intervals. We use a concept called pooling to achieve this. Pooling provides a method to downsample feature maps by summarizing the presence of features in patches of the feature maps.

Max Pooling is the most commonly used pooling method.

Max pooling is a sample-based discretization process. The objective is to down-sample an input representation, reducing its dimensionality and allowing assumptions to be made about the features contained in the sub-regions.

Max pooling works like a convolution in that a kernel (window) slides over the input, but instead of computing a weighted sum, the kernel extracts the maximum value of the area it covers. In effect, max pooling tells the convolutional neural network to carry forward only the largest value in each region, that is, the strongest activation. It is done by applying a max filter to (usually) non-overlapping subregions of the initial representation.
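
A small sketch of 2x2 max pooling with stride 2 on non-overlapping subregions. The function name, the window size and the 4x4 feature map are assumptions for this example.

```python
def max_pool_2d(feature_map, pool_size=2, stride=2):
    """Slide a pool_size x pool_size window over the map and keep each window's maximum."""
    rows = len(feature_map)
    cols = len(feature_map[0])
    pooled = []
    for r in range(0, rows - pool_size + 1, stride):
        pooled_row = []
        for c in range(0, cols - pool_size + 1, stride):
            window = [feature_map[r + i][c + j]
                      for i in range(pool_size)
                      for j in range(pool_size)]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled

# Illustrative 4x4 feature map
feature_map = [
    [1, 3, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]

pooled = max_pool_2d(feature_map)
# pooled == [[6, 8], [3, 4]]
```

The 4x4 input becomes a 2x2 output, so the spatial size is halved in each dimension while the strongest response in each region survives.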

Batch Normalization

Batch normalization reduces the need for careful tuning of weight initialization and learning parameters in neural networks, including deep networks.

Weights problem:

Whatever the initialization of the weights, whether random or empirically chosen, they start far away from the learned weights. If we take a mini batch during the initial epochs, there will be many outliers.

A small perturbation in the initial layers results in a large change in the later layers.

During back-propagation, the gradients have to compensate for the outliers before learning the weights that produce the required outputs. This means additional epochs are needed to converge.

Batch normalization regularizes these gradients so that they are not distracted by outliers and instead flow towards the common goal, by normalizing the activations within the range of the mini batch.
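
The normalization step itself can be sketched for a single feature over one mini batch. The scale and shift parameters (gamma, beta), the small constant eps and the sample values are assumptions for this example; in a real network gamma and beta are learned, and running statistics are used at inference time.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini batch, then scale by gamma and shift by beta."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

# Illustrative mini batch containing one outlier
batch = [0.5, 1.0, 1.5, 50.0]
normalized = batch_norm(batch)

norm_mean = sum(normalized) / len(normalized)
norm_var = sum((v - norm_mean) ** 2 for v in normalized) / len(normalized)
```

After normalization the batch has roughly zero mean and unit variance, so the outlier no longer dominates the scale of the values passed to the next layer.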

Long Short-Term Memory

An LSTM unit has the following three features that differentiate it from a usual neuron in a recurrent neural network:

It has control over deciding when to let the input enter the neuron.
It has control over deciding when to remember what was computed in the previous time step.
It has control over deciding when to let the output pass on to the next time step.

The beauty of the LSTM is that it learns to make all of these decisions itself: each gate is computed from the current input and the previous hidden state.
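
The three controls correspond to what are usually called the input, forget and output gates. Below is a minimal sketch of one time step of a scalar LSTM cell; the weights and biases are arbitrary hypothetical numbers, whereas a real LSTM learns them and works with vectors and matrices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def make_params(input_bias=0.0, forget_bias=0.0, output_bias=0.0):
    """Hypothetical scalar weights; only the gate biases are adjustable here."""
    return {
        "wi": 0.5, "ui": 0.5, "bi": input_bias,    # input gate
        "wf": 0.5, "uf": 0.5, "bf": forget_bias,   # forget gate
        "wo": 0.5, "uo": 0.5, "bo": output_bias,   # output gate
        "wg": 0.5, "ug": 0.5, "bg": 0.0,           # candidate value
    }

def lstm_step(x, h_prev, c_prev, p):
    """One time step: returns the new hidden state h and cell state c."""
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # let the input in?
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # remember the previous state?
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # let the output through?
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate new content
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c

# Default gates: old and new information are mixed
h_mixed, c_mixed = lstm_step(1.0, 0.0, 0.8, make_params())

# Input gate nearly closed, forget gate nearly open: the old cell state is kept
h_kept, c_kept = lstm_step(1.0, 0.0, 0.8, make_params(input_bias=-20.0, forget_bias=20.0))
```

In the second call the cell state stays almost exactly at its previous value of 0.8, which is how an LSTM can carry information across many time steps.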

Skip-gram

The aim of word embedding models is to learn a dense, relatively low-dimensional representation for each vocabulary term, in which the similarity between embedding vectors reflects the semantic or syntactic similarity between the corresponding words. Skip-gram is one model for learning word embeddings.

The main idea behind the skip-gram model is as follows: two vocabulary terms are similar if they share similar contexts.

For example, assume that you have a sentence like “cats are mammals”. If you use the term “dogs” rather than “cats”, the sentence remains meaningful. So, in this example, “dogs” and “cats” can share the same context, i.e., “are mammals”.

Based on the above hypothesis, you can consider a context window, i.e., a window containing k consecutive terms. Pick one word in the window as the center word and train a neural network that takes this word as input and predicts the other terms in the window. (The reverse setup, where the network sees the surrounding terms and predicts the one left out, is the continuous bag of words model.) As a result, if two words repeatedly share similar contexts within a large corpus, their embedding vectors will be close to each other.
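
In the usual word2vec formulation of skip-gram, the training examples are (center word, context word) pairs drawn from the context window. The sketch below only builds those pairs; the function name and the window size are assumptions, and the neural network that would be trained on the pairs is omitted.

```python
def skipgram_pairs(tokens, window=1):
    """Build (center, context) pairs for every word within `window` positions of the center."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["cats", "are", "mammals"], window=1)
# pairs == [("cats", "are"), ("are", "cats"), ("are", "mammals"), ("mammals", "are")]
```

Replacing “cats” with “dogs” would produce the same context word “are”, which is why the two terms end up with similar embeddings after training on a large corpus.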

Continuous Bag Of Words

In natural language processing problems, we would like to learn to represent each word in a document as a vector of numbers such that words that appear in similar contexts have vectors that are close to each other. In the continuous bag of words model, the goal is to use the context surrounding a specific word to predict that word.

We do this by taking lots and lots of sentences from a large corpus and, every time we see a word, taking its surrounding words. Then we feed the context words into a neural network and predict the word at the center of this context.

Each set of context words together with its center word forms one instance of the dataset for the neural network, and with thousands of such instances we can train it. Eventually, the encoded hidden layer output represents the embedding for a particular word. It so happens that once we train this over a large number of sentences, words in similar contexts get similar vectors.
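
As a sketch of how those dataset instances can be built, the snippet below collects (context words, center word) examples and averages the context vectors, which is one common way to form the hidden layer input. The function names, the window size and the tiny 2-dimensional embedding table are hypothetical; the prediction layer and the training loop are omitted.

```python
def cbow_examples(tokens, window=1):
    """Build (context_words, center_word) examples from a list of tokens."""
    examples = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            examples.append((context, center))
    return examples

def average_context(context, embeddings):
    """Average the embedding vectors of the context words."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for word in context:
        for k, value in enumerate(embeddings[word]):
            total[k] += value
    return [t / len(context) for t in total]

tokens = ["the", "cat", "sat", "on", "mat"]
examples = cbow_examples(tokens, window=1)

# Hypothetical 2-dimensional embeddings, for illustration only
toy_embeddings = {"the": [1.0, 0.0], "cat": [0.0, 0.0], "sat": [0.0, 1.0],
                  "on": [1.0, 1.0], "mat": [0.0, 0.0]}
hidden_input = average_context(examples[1][0], toy_embeddings)
# examples[1] == (["the", "sat"], "cat") and hidden_input == [0.5, 0.5]
```

The network would then be trained to predict “cat” from that averaged vector, and the learned embedding table is what we keep afterwards.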

Transfer Learning

Let’s think about how an image would run through a Convolutional Neural Network. Say you have an image, you apply convolution to it, and you get combinations of pixels as outputs. Let’s say they’re edges. Now apply convolution again, so now your output is combinations of edges… or lines. Now apply convolution again, so your output is combinations of lines, and so on. You can think of it as each layer looking for a specific pattern. The last layer of your neural network tends to get very specialized. Perhaps if you were working on ImageNet, your network’s last layer would be looking for children or dogs or airplanes or whatever. A few layers back you might see the network looking for eyes or ears or mouths or wheels.

Each layer in a deep CNN progressively builds up higher- and higher-level representations of features. The last couple of layers tend to be specialized on whatever data you fed into the model. On the other hand, the first layers are much more generic: there are many simple patterns common to a much larger class of images.

Transfer learning is when you take a CNN trained on one dataset, cut off the last layer(s), and retrain the model’s last layer(s) on a different dataset. Intuitively, you’re retraining the model to recognize different higher-level features. As a result, training time gets cut down a lot, so transfer learning is a helpful tool when you don’t have enough data or when training takes too many resources.
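
The freeze-and-retrain idea can be sketched with a toy example. Everything here is an assumption for illustration: the "pretrained" layer is a tiny fixed linear layer with ReLU standing in for the generic early layers of a CNN, the new last layer is a logistic regression, and the data set is invented. A real project would load an actual pretrained CNN from a deep learning library instead.

```python
import math

# Hypothetical "pretrained" weights: they stand in for generic early layers and stay frozen
FROZEN_WEIGHTS = [[1.0, -1.0], [-1.0, 1.0]]

def frozen_features(x):
    """Frozen feature extractor: a fixed linear layer followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_WEIGHTS]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_head(inputs, labels, lr=0.5, epochs=200):
    """Train only the new last layer (logistic regression on the frozen features)."""
    feats = [frozen_features(x) for x in inputs]  # computed once, the base never changes
    head_w = [0.0] * len(feats[0])
    head_b = 0.0
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            p = sigmoid(sum(w * fi for w, fi in zip(head_w, f)) + head_b)
            err = p - y  # gradient of the log loss w.r.t. the pre-activation
            head_w = [w - lr * err * fi for w, fi in zip(head_w, f)]
            head_b -= lr * err
    return head_w, head_b

def predict(x, head_w, head_b):
    f = frozen_features(x)
    return sigmoid(sum(w * fi for w, fi in zip(head_w, f)) + head_b)

# Small invented data set for the new task
new_inputs = [[2.0, 0.0], [1.5, 0.5], [0.0, 2.0], [0.5, 1.5]]
new_labels = [1, 1, 0, 0]

head_w, head_b = train_head(new_inputs, new_labels)
p_positive = predict([2.0, 0.0], head_w, head_b)
p_negative = predict([0.0, 2.0], head_w, head_b)
```

Only the handful of head parameters are updated, which is why retraining is cheap compared with training the whole network from scratch.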