Support Vector Machines Overview

Support Vector Machines (SVMs) are a supervised machine learning classification method. They are linear separators, which assumes the data is linearly separable in some dimensional space. If the data isn't linearly separable in its current dimensional space, then kernels can be used to transform the data; this will be explained in more detail later. Each support vector machine separates the data into two groups, so if the data has more than two labels, more than one Support Vector Machine will be needed. In general, if there are 'n' labels (where 'n' is greater than 2), several binary SVMs must be combined: a chain of binary splits needs `n – 1` of them, while the common one-vs-rest and one-vs-one strategies train `n` and `n(n – 1)/2` machines, respectively.
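
As a quick illustration of the multi-label point above, here is a minimal sketch assuming scikit-learn; its built-in iris data (three labels) and the linear-kernel choice are just convenient placeholders.

```python
# A sketch of combining binary SVMs for a 3-label problem (scikit-learn's iris
# data is used purely as a convenient example dataset).
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)            # 3 labels, so one SVM is not enough

ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)

print(len(ovr.estimators_))   # n = 3 binary SVMs (one per label)
print(len(ovo.estimators_))   # n(n-1)/2 = 3 binary SVMs (one per pair of labels)
```

Note that scikit-learn's SVC already applies a one-vs-one scheme internally when given more than two labels; the wrappers above just make the individual binary machines visible.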

Algorithm

Courtesy of Professor Ami Gates, Dept. Applied Math, Data Science, University of Colorado

The goal is to determine the hyperplane that best separates the two classes. There are a few important things to note in the figure above. The dashed lines are the margin lines, and the data points that lie on them are the support vectors. w is the weight vector and is perpendicular (normal) to the hyperplane, so in 2D its slope is the negative reciprocal of the hyperplane's slope. b is the bias or translation; in 2D this can be thought of as the y-intercept. x_i is a row (vector) in the dataset and y_i is its label (either -1 or 1). The margin m is the distance between the two margin lines. Any data vector below the separating hyperplane is labeled -1 and any data vector above it is labeled +1. Mathematically, this is the constraint y_i(wTx_i + b) >= 1, the equation highlighted in yellow above, and it must hold when finding the optimal separating hyperplane. Since there are an infinite number of possible hyperplanes, the optimal one is the one that maximizes the margin while satisfying this constraint. It is found through a quadratic optimization using the equations below.
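
The symbols above can be read directly off a fitted linear SVM. Below is a minimal sketch, assuming scikit-learn and a small made-up two-class dataset: coef_ plays the role of w, intercept_ the role of b, and the margin m comes out as 2/||w||.

```python
# A sketch tying the figure's symbols to a fitted linear SVM: coef_ is w,
# intercept_ is b, and the margin m is 2 / ||w||. The data below is made up.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])          # two linearly separable groups

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # very large C approximates a hard margin

w = clf.coef_[0]                             # weight vector, normal to the hyperplane
b = clf.intercept_[0]                        # bias / translation
margin = 2.0 / np.linalg.norm(w)             # distance between the two margin lines

print("w =", w, "b =", b, "margin =", margin)
print(y * (X @ w + b) >= 1 - 1e-6)           # the constraint y_i(wTx_i + b) >= 1 holds
```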

Courtesy of Professor Ami Gates, Dept. Applied Math, Data Science, University of Colorado

Primal Equation Definitions: w is the weight vector and b is the bias. In text form, the primal problem is to minimize (1/2)||w||^2 subject to y_i(wTx_i + b) >= 1 for every row x_i.

Dual Equation Definitions: alpha_i and alpha_j are Lagrange multipliers, y_i and y_j are labels (1 or -1), and x_i·x_j is the dot product of two rows (vectors) from the dataset. In text form, the dual problem is to maximize SUM_i alpha_i – (1/2) SUM_i SUM_j alpha_i alpha_j y_i y_j (x_i·x_j) subject to alpha_i >= 0 and SUM_i alpha_i y_i = 0.

The primal equation can be obtained from the equations for the margin lines (wx + b = +1 and wx + b = -1). However, optimizing the primal equation must be done under the constraints, and the value of w is unknown at the start. This is handled with Lagrange multipliers, which help find the local maxima or minima of a function subject to constraints. Introducing one Lagrange multiplier per constraint and working through some algebra leads to the dual equation. There will be one Lagrange multiplier for every vector in the dataset. An important note is that for any data point that is not a support vector (i.e., not on a margin line), the corresponding Lagrange multiplier will be zero. So, another way to think about it is: if a Lagrange multiplier is found for every data point, then every data point with a non-zero Lagrange multiplier is a support vector. Notice that the dual equation doesn't include w or b. It requires only the Lagrange multipliers, the labels of the row vectors, and the dot products of the row vectors. These are all things that are known or can be solved for (the dual equation itself is optimized to solve for the Lagrange multipliers). So, optimizing the dual equation allows w to be found.
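
Here is a small sketch of that multiplier/support-vector relationship, assuming scikit-learn and the same kind of made-up toy data as above: the fitted model only stores the points whose Lagrange multipliers are non-zero, and those are exactly the support vectors.

```python
# A sketch of the multiplier/support-vector relationship on made-up toy data:
# the fitted model keeps only the points whose Lagrange multipliers are
# non-zero, and those are exactly the support vectors.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

print(clf.support_)           # indices of points with non-zero multipliers
print(clf.support_vectors_)   # the support vectors themselves
print(clf.dual_coef_)         # alpha_i * y_i for each support vector; every
                              # other training point has alpha_i = 0
```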

Once the Lagrange multipliers are found, then w can be solved for (w = SUM_i alpha_i y_i x_i). Once w is known, b can be solved for using any support vector (y_i(wTx_i + b) – 1 = 0, so b = y_i – wTx_i). Once w and b are solved for, the optimal separating hyperplane is found.

So, overall, the dual equation is needed in order to find w. The Lagrange multipliers are found by optimizing the dual equation, which requires the dot product of every pair of data vectors and the label of every data vector.
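
Continuing the earlier sketch (same made-up toy data and scikit-learn assumptions), w and b can be recovered from the multipliers and checked against what the library reports.

```python
# A sketch of recovering w and b from the multipliers on the same made-up toy
# data: w = SUM_i alpha_i y_i x_i (summed over the support vectors), and b
# comes from y_i(wTx_i + b) = 1 at any support vector.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_ already stores alpha_i * y_i for the support vectors.
w = clf.dual_coef_ @ clf.support_vectors_     # SUM_i alpha_i y_i x_i
sv = clf.support_vectors_[0]                  # any support vector works
sv_label = y[clf.support_[0]]
b = sv_label - w @ sv                         # since 1 / y_i = y_i for labels +-1

print(w, clf.coef_)                           # should match (up to tolerance)
print(b, clf.intercept_)                      # should match (up to tolerance)
```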

Once everything is solved for, new data vectors (rows) can be classified. To classify a new vector, it is projected onto w by taking the inner product wTx_new, and then the bias is added. If wTx_new + b is greater than 0, the vector is classified as +1, and if it is less than 0, it is classified as -1. This stems from the fact that the equation of the separating hyperplane is wTx + b = 0: any training vector labeled -1 falls on or below the line wTx + b = -1 and any training vector labeled +1 falls on or above the line wTx + b = +1.
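
A short sketch of classifying a new point, under the same toy-data assumptions as above: the sign of wTx_new + b is computed by hand and compared with the library's decision_function.

```python
# A sketch of classifying a new point on the same made-up toy data: the sign of
# wTx_new + b decides the label, which should agree with decision_function.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
x_new = np.array([4.2, 3.8])                  # made-up query point

score = w @ x_new + b                         # projection onto w, plus the bias
label = 1 if score > 0 else -1
print(score, label)
print(clf.decision_function([x_new]), clf.predict([x_new]))   # should agree
```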

Kernels

Support Vector Machines assume that the data is linearly separable. So, what happens when the data isn't linearly separable in its current state? This is where kernels come in. Kernels are functions that transform data into a different dimensional space where it may become linearly separable. For a function to be considered a kernel, it needs to produce a dot product, because the kernel function replaces the x_i·x_j term in the dual equation. Since x_i·x_j is a dot product, the kernel function has to take in two data vectors and output the dot product they would have in the transformed space. It is important to note that the kernel doesn't actually transform the data; it just returns the dot product of two data vectors as if they had been transformed.
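
One way to see that the kernel only supplies dot products is to hand scikit-learn's SVC a plain Python function as its kernel. This is a minimal sketch; the RBF formula and the make_moons data are just illustrative choices.

```python
# A sketch showing that the kernel only has to supply the values that replace
# x_i.x_j in the dual equation: SVC accepts a plain Python callable that
# returns the matrix of kernel values, so no transformed data is ever built.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

def rbf_kernel(A, B, gamma=1.0):
    """Matrix of exp(-gamma * ||a - b||^2) values, without transforming the data."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

custom = SVC(kernel=rbf_kernel).fit(X, y)
builtin = SVC(kernel="rbf", gamma=1.0).fit(X, y)

# Both feed the optimizer the same dot products in the implicit feature space,
# so their predictions should agree (the printed fraction should be 1.0).
print(np.mean(custom.predict(X) == builtin.predict(X)))
```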

Some of the most common types of kernels are:

  • Linear: a·b
    • a is a vector, b is a vector
  • RBF (Gaussian): exp(-gamma*||a-b||^2)
    • a is a vector, b is a vector, gamma is 1/(2sigma^2)
  • Sigmoid: tanh(gamma*(a·b) + r)
    • a is a vector, b is a vector, r is the coefficient, gamma determines curvature
  • Polynomial: (a·b + r)^d
    • d is the degree, a is a vector, b is a vector, r is the coefficient
    • Need to specify a degree of the polynomial or it will default to 3 in sklearn
    • The degree doesn't correspond exactly to the dimensional space of the transformation (this is due to how the vector math works out to return a dot product)
    • For example, with 2D data, r = 0, and d = 2, the data is transformed into 3D space (see the sklearn sketch after this list)
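
Here is a rough sketch of how the kernels listed above map onto scikit-learn's SVC parameters; the make_circles data and the specific gamma, coef0 (the r above), and degree (the d above) values are arbitrary choices for illustration.

```python
# A sketch of how the kernels listed above map onto scikit-learn's SVC
# parameters; the dataset and parameter values are only illustrative.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

models = {
    "linear":  SVC(kernel="linear"),
    "rbf":     SVC(kernel="rbf", gamma=2.0),                 # gamma = 1/(2 sigma^2)
    "sigmoid": SVC(kernel="sigmoid", gamma=0.5, coef0=1.0),  # coef0 plays the role of r
    "poly":    SVC(kernel="poly", degree=2, coef0=1.0),      # degree defaults to 3 if unset
}

for name, clf in models.items():
    print(name, clf.fit(X, y).score(X, y))                   # training accuracy per kernel
```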

Here is an example of casting two 2D points where a=(2,1) and b=(3,4) into a new dimensional space, using a polynomial kernel with r=1 and d=2.

  • From the kernel function, plugging in 1 for r and 2 for d and then expanding: (a·b + 1)^2
    • (a·b + 1)^2 = a1^2b1^2 + 2a1a2b1b2 + a2^2b2^2 + 2a1b1 + 2a2b2 + 1
  • This expansion is the dot product of the two transformed vectors
    • [a1^2, sqrt(2)a1a2, a2^2, sqrt(2)a1, sqrt(2)a2, 1] dot [b1^2, sqrt(2)b1b2, b2^2, sqrt(2)b1, sqrt(2)b2, 1]
  • Then plugging in the values from the points a and b
    • [4, sqrt(2)(2), 1, sqrt(2)(2), sqrt(2)(1), 1] dot [9, sqrt(2)(12), 16, sqrt(2)(3), sqrt(2)(4), 1]
  • The resulting dot product between a and b that the kernel outputs is then
    • (4)(9) + sqrt(2)(2)sqrt(2)(12) + (1)(16) + sqrt(2)(2)sqrt(2)(3) + sqrt(2)(1)sqrt(2)(4) + (1)(1) = 36 + 48 + 16 + 12 + 8 + 1 = 121
    • As a check, computing the kernel directly gives (a·b + 1)^2 = (2·3 + 1·4 + 1)^2 = 11^2 = 121 (this arithmetic is verified numerically in the sketch after this list)
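
A quick numerical check of the arithmetic above (plain NumPy, nothing else assumed): the explicit feature map and the kernel shortcut give the same number for a=(2,1) and b=(3,4).

```python
# A numerical check of the worked example: the explicit feature map and the
# kernel shortcut give the same value for a=(2,1) and b=(3,4).
import numpy as np

def phi(v):
    """Explicit feature map for the polynomial kernel with r=1, d=2 in 2D."""
    v1, v2 = v
    return np.array([v1**2, np.sqrt(2)*v1*v2, v2**2,
                     np.sqrt(2)*v1, np.sqrt(2)*v2, 1.0])

a = np.array([2.0, 1.0])
b = np.array([3.0, 4.0])

print(phi(a) @ phi(b))    # dot product in the transformed space -> 121
print((a @ b + 1) ** 2)   # kernel computed directly, (a.b + 1)^2 -> 121
```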

Here is a visual example of why kernels are useful. They can solve the XOR problem.

  • The transformation being used here is actually the polynomial kernel with r=0 and d = 2.
  • The points are originally not linearly separable (left plot). However, when the transformation is applied and the resulting dot products are obtained, the points become linearly separable in the new dimensional space (right plot), as sketched in the code after this list.
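
Here is a minimal sketch of the XOR example, assuming scikit-learn; placing the four XOR points at (±1, ±1) is a common convention and an assumption here, not something taken from the plots.

```python
# A sketch of the XOR example: a linear kernel cannot separate the four points,
# but the polynomial kernel with r=0, d=2 can, because its feature map contains
# a sqrt(2)*x1*x2 cross term.
import numpy as np
from sklearn.svm import SVC

X = np.array([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])
y = np.array([-1, 1, 1, -1])                                   # XOR labels

linear = SVC(kernel="linear").fit(X, y)
poly = SVC(kernel="poly", degree=2, coef0=0.0, gamma=1.0).fit(X, y)   # k(a,b) = (a.b)^2

print(linear.score(X, y))   # below 1.0: no separating line exists in 2D
print(poly.score(X, y))     # 1.0: separable after the implicit transformation

# The cross-term coordinate of the r=0, d=2 feature map already splits the two
# classes by sign:
print(np.sqrt(2) * X[:, 0] * X[:, 1])
```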