Support Vector Machine
Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression problems, though it is mostly used for classification. In SVM classification, we plot each data item as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. Classification is then performed by finding the hyperplane that best separates the two classes.
Hyperplanes are decision boundaries that help classify the data points. Data points falling on either side of the hyperplane can be attributed to different classes. In SVM, a hyperplane is selected to separate the points in the input variable space by their respective class, as either class 0 or class 1. The dimension of the hyperplane depends on the number of features: if the number of input features is 2, the hyperplane is just a line (one-dimensional); if the number of input features is 3, the hyperplane becomes a two-dimensional plane.
To separate the two classes of data points, there are many possible hyperplanes that could be selected. For better performance, we want the plane with the maximum margin, that is, the maximum distance between the data points of both classes. Maximizing the margin provides some reinforcement, so that future data points can be classified with more confidence.
Data points that are closest to the hyperplane are called support vectors, and they influence the position and orientation of the hyperplane. Using these support vectors, we maximize the margin of the classifier; deleting a support vector will change the position of the hyperplane. Support vectors are the points that help us build our SVM.
In two-dimensional space, the hyperplane is visualized as a line; let us assume that all of our input points can be completely separated by this line.
B0 + (B1 * X1) + (B2 * X2) = 0
where the coefficients B1 and B2 (which determine the slope of the line) and the intercept B0 are found by the learning algorithm, and X1 and X2 are the two input variables.
We can make classifications using this hyperplane: by plugging input values into the line equation, you can calculate whether a new point is above or below the line.
- If the equation returns a value greater than 0, the point is above the hyperplane and belongs to the first class (class 0).
- If the equation returns a value less than 0, the point is below the hyperplane and belongs to the second class (class 1).
- If the value is zero, the point lies on the hyperplane and may be difficult to classify.
- If the magnitude of the value is large, the model can have more confidence in the prediction.
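As a minimal sketch, this decision rule can be written directly in Python; the coefficient values used in the example are made up for illustration, not learned:

```python
def hyperplane_value(b0, b1, b2, x1, x2):
    """Evaluate B0 + (B1 * X1) + (B2 * X2) for a point (x1, x2)."""
    return b0 + b1 * x1 + b2 * x2

def classify(b0, b1, b2, x1, x2):
    """Class 0 if the point lies above the line (value > 0), class 1 otherwise."""
    value = hyperplane_value(b0, b1, b2, x1, x2)
    if value > 0:
        return 0  # above the hyperplane
    return 1      # below (or exactly on) the hyperplane

# Example with assumed coefficients B0 = -3, B1 = 1, B2 = 1,
# i.e. the separating line is x1 + x2 = 3.
print(classify(-3, 1, 1, 4, 4))  # a point well above the line
print(classify(-3, 1, 1, 0, 0))  # a point well below the line
```

The magnitude of `hyperplane_value` also captures the last bullet above: the further it is from zero, the more confident the prediction.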
The distance between the line and the closest data points is referred to as the margin. The best or optimal line separating the two classes is the line with the largest margin. This is called the Maximal-Margin hyperplane.
The margin is calculated as the perpendicular distance from the line to the closest points, the support vectors. Only these points are relevant in defining the line and in the construction of the classifier.
The margin is the separation between the hyperplane and the closest class points, the support vectors. A good margin has a large separation for both classes: it allows the points to stay in their respective classes without crossing into another class. The figures below give an example of a good and a bad margin.
Gamma is a parameter that must be specified to the learning algorithm. A good default value for gamma is 0.1, and gamma is often varied in the range 0 < gamma < 1. The radial kernel is very local and can create complex regions within the feature space, even in two-dimensional space.
In SVM, we take the output of the linear function to classify values: if that output is greater than or equal to 1, we assign the point to one class, and if the output is less than or equal to -1, we assign it to the other class. Since the threshold values are set to 1 and -1 in SVM, we obtain the range of values [-1, 1], which acts as the margin.
In the SVM algorithm, we try to maximize the margin between the data points and the hyperplane. The loss function that helps maximize the margin is called the hinge loss.
The loss is 0 if the predicted value and the actual value are of the same sign; if they are not, we calculate the loss value. We also add a regularization parameter to the cost function, whose main objective is to balance margin maximization against the loss. After adding the regularization parameter, the cost function becomes:

J = lambda * ||w||^2 + (1/n) * sum(max(0, 1 - yi * (w * xi + b)))
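As a sketch, the regularized hinge-loss cost can be computed in plain Python; the names `lambda_reg`, `points`, and `labels` are assumptions for illustration, and labels are taken to be +1 or -1:

```python
def hinge_cost(w, b, points, labels, lambda_reg=0.01):
    """Regularized hinge loss for a linear SVM.

    Cost = lambda * ||w||^2 + mean(max(0, 1 - y * f(x))),
    where f(x) = w . x + b and labels y are +1 or -1.
    """
    total = 0.0
    for x, y in zip(points, labels):
        fx = sum(wi * xi for wi, xi in zip(w, x)) + b
        total += max(0.0, 1.0 - y * fx)  # zero when y and f(x) agree with margin
    reg = lambda_reg * sum(wi * wi for wi in w)  # balances margin vs. loss
    return reg + total / len(points)
```

With `w = [1, 0]` and `b = 0`, points at `[2, 0]` (label +1) and `[-2, 0]` (label -1) are classified with margin, so only the regularization term contributes.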
Now that we have the loss function, we take partial derivatives with respect to the weights to find the gradients. Using these gradients, we can update our weights.
When there is no misclassification, meaning our model correctly predicts the class of the data point, we only update the weights using the gradient from the regularization parameter.
Gradient Update — No Misclassification

w = w - alpha * (2 * lambda * w)
When there is a misclassification, meaning our model makes a mistake on the prediction of the class of the data point, we include the hinge loss along with the regularization parameter in the gradient update.
Gradient Update — Misclassification

w = w + alpha * (yi * xi - 2 * lambda * w)
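Putting the two update rules together gives a toy stochastic-gradient trainer. This is a sketch, not a production implementation: the learning rate, regularization strength, epoch count, and toy data below are all assumed values, and labels must be +1 or -1.

```python
def train_svm(points, labels, lr=0.01, lambda_reg=0.01, epochs=1000):
    """Train a linear SVM with SGD on the regularized hinge loss.

    Returns the learned weights w and intercept b.
    """
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            fx = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * fx >= 1:
                # correct with margin: only the regularization gradient applies
                w = [wi - lr * (2 * lambda_reg * wi) for wi in w]
            else:
                # misclassified or inside the margin: include the hinge-loss gradient
                w = [wi - lr * (2 * lambda_reg * wi - y * xi) for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    """Classify by the sign of the linear function."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Tiny linearly separable toy dataset (assumed for illustration).
points = [[2, 2], [3, 3], [-2, -2], [-3, -1]]
labels = [1, 1, -1, -1]
w, b = train_svm(points, labels)
```

After training, every toy point should be on the correct side of the learned hyperplane.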
The learning of the hyperplane in linear SVM is done by transforming the problem using linear algebra, and this is where the kernel comes into the picture.
There are different types of kernels:
- Linear Kernel
- Polynomial Kernel
- Radial Kernel
Linear Kernel SVM
For linear SVM, the kernel is the dot product and can be written as:
K(x, xi) = sum(x * xi)
The kernel defines the similarity or distance measure between new data and the support vectors. The dot product is the similarity measure used for linear SVM (a linear kernel) because the distance is a linear combination of the inputs.
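As a sketch, the linear kernel is nothing more than the dot product of the two vectors:

```python
def linear_kernel(x, xi):
    """K(x, xi) = sum(x * xi): the dot product of the two vectors."""
    return sum(a * b for a, b in zip(x, xi))

print(linear_kernel([1, 2, 3], [4, 5, 6]))  # 4 + 10 + 18 = 32
```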
Other kernels can also be used that transform the input space into higher dimensions, such as the Polynomial Kernel and the Radial Kernel. This is called the Kernel Trick.
More complex kernels are useful because they allow the boundaries separating the classes to be curved, or even more complex, which can lead to more accurate classifiers.
Polynomial Kernel SVM
K(x, xi) = (1 + sum(x * xi))^d
Here the degree d of the polynomial must be specified by hand to the learning algorithm. When d = 1, it becomes the linear kernel. The polynomial kernel allows curved boundaries in the input space.
Radial Kernel SVM
Using the radial kernel, we can obtain even more complex boundaries than with the polynomial kernel.
K(x, xi) = exp(-gamma * sum((x - xi)^2))
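The polynomial and radial kernel formulas above can be sketched directly in Python; the default values for `d` and `gamma` are illustrative assumptions:

```python
import math

def polynomial_kernel(x, xi, d=2):
    """K(x, xi) = (1 + sum(x * xi))^d."""
    return (1 + sum(a * b for a, b in zip(x, xi))) ** d

def radial_kernel(x, xi, gamma=0.1):
    """K(x, xi) = exp(-gamma * sum((x - xi)^2)).

    The sum term is the squared Euclidean distance between x and xi,
    so identical points always give a kernel value of 1.
    """
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, xi)))

print(polynomial_kernel([1, 2], [3, 4], d=2))  # (1 + 11)^2 = 144
print(radial_kernel([1, 2], [1, 2]))           # exp(0) = 1.0
```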
The regularization parameter, denoted by C in the SVM algorithm, tells the SVM optimization how much you want to avoid misclassifying each training example.
For large values of C, the optimization will choose a smaller-margin hyperplane if that hyperplane classifies more training points correctly. Conversely, a very small value of C will cause the optimizer to look for a larger-margin hyperplane, even if that hyperplane misclassifies more points.
The figure below shows some misclassification due to a lower regularization value.
The figure below shows a higher regularization value.
The gamma parameter defines how far the influence of a single training example reaches. With low gamma, points far away from the separating hyperplane are still considered in the calculation, whereas with high gamma, only points close to the hyperplane are considered.
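This effect can be seen directly from the radial kernel formula: a distant point keeps a noticeable kernel value (influence) under low gamma, but its influence almost vanishes under high gamma. The distances and gamma values below are illustrative:

```python
import math

def rbf(x, xi, gamma):
    """Radial kernel K(x, xi) = exp(-gamma * squared distance)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, xi)))

far = rbf([0.0], [3.0], gamma=0.1)     # low gamma: far point still has influence
far_hi = rbf([0.0], [3.0], gamma=10)   # high gamma: its influence vanishes
print(far, far_hi)
```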