These days, when talking about machine learning, oftentimes we hear about the power of deep learning and neural networks. But prior to this recent hype of neural networks, support vector machines (or SVMs) were rocking the scene. Though they aren’t necessarily as hyped up as neural networks and deep learning are now, don’t let this fool you. SVMs are still frequently used, high-performing, and extremely dependable. What’s more, there are situations in which one might prefer an SVM over any sort of neural network because of the unique benefits they have to offer, such as lower prediction memory cost. Let’s take a look.
The idea of an SVM is drawn from the concept of a maximal margin classifier. Simply put, given a series of data points (typically of two classes, but we’ll discuss extensions to multiclass scenarios as well), an SVM aims to put a hyperplane (or, in two dimensions, a dividing line) between the two classes of points that creates the maximum distance between that line and the closest points of each class. In two dimensions, the line would look something like this:
As we can see, there are actually many places that we could put a dividing plane between these points. All of these lines would be valid decision boundaries. But with an SVM, we don’t want to settle for any valid decision boundary: We want the maximal margin decision boundary.
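To make this concrete, here is a minimal sketch of fitting a maximal-margin linear SVM with scikit-learn. The toy points are made up for illustration, and the very large C value is an assumption used to approximate a hard margin:

```python
import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable clusters of 2-D points (illustrative data).
X = np.array([[1.0, 1.0], [1.5, 0.5], [2.0, 1.5],
              [5.0, 5.0], [5.5, 4.5], [6.0, 5.5]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates the hard maximal-margin classifier.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

# The hyperplane is w . x + b = 0; in two dimensions it is a line.
w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("predictions:", clf.predict([[1.2, 1.0], [5.8, 5.0]]))
```

Of all the valid separating lines, this is the one that maximizes the gap to the nearest points of each class.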
Now, you may have noticed that when talking about this decision boundary, we only mention the points closest to the boundary. That’s because with an SVM, they’re the only points that actually matter; we can think of them as the points that are hardest to classify. These points are referred to as support vectors (which is where SVMs get their name) because they support the decision boundary: remove any of them or shift them in any way, and the hyperplane shifts too. All the other points can be ignored (at least until one of them moves closer to the boundary than the support vectors), because moving or removing them does not change the distance between the closest points and the decision boundary.
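We can verify this property directly: refitting on only the support vectors recovers essentially the same boundary. A hedged sketch, again with made-up points and a large C to approximate a hard margin:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [0.0, 0.0],
              [5.0, 5.0], [4.0, 4.5], [7.0, 7.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)
print("support vectors:\n", clf.support_vectors_)

# Keep only the support vectors and refit.
mask = np.zeros(len(X), dtype=bool)
mask[clf.support_] = True
clf2 = SVC(kernel="linear", C=1e6).fit(X[mask], y[mask])

# The boundary (w, b) is essentially unchanged.
print(np.allclose(clf.coef_, clf2.coef_, atol=1e-3))
```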
So far, we’ve looked at a set of points that cleanly separate with a decision boundary. However, in the real world, data tends not to be so clean. Take this example:
Because of this, we need to relax the assumption that there exists a line that separates each class both maximally and perfectly. To do so, we introduce the concept of a soft margin classifier, which simply means we allow some number of points to be misclassified. In most implementations of SVMs, this is tuned by a regularization parameter called C. High values of C penalize misclassification heavily, so the SVM tolerates few errors; low values of C give the SVM more freedom to misclassify points in exchange for a wider margin. The best value for C is usually found with cross-validation.
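A minimal sketch of tuning C by cross-validation, using a synthetic dataset and an illustrative grid of candidate values:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic data stands in for a real, messy dataset.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Try a logarithmic grid of C values with 5-fold cross-validation.
grid = GridSearchCV(SVC(kernel="linear"),
                    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                    cv=5)
grid.fit(X, y)
print("best C:", grid.best_params_["C"])
print("cross-validated accuracy:", grid.best_score_)
```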
Under the hood, SVMs use the dot product between pairs of points as a notion of similarity to figure out where to place the boundary. For a new point x, the classification is a weighted sum of dot products with the support vectors: f(x) = sign(Σ α_i y_i (x_i · x) + b), where the x_i are the support vectors, the y_i are their labels, and the α_i are learned weights.
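We can check this formula against scikit-learn directly: the model’s `dual_coef_` attribute already stores the products α_i·y_i, so reconstructing the decision value by hand is a couple of dot products. The toy data is illustrative:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 0.5], [3.0, 3.0], [4.0, 3.5]])
y = np.array([0, 0, 1, 1])
clf = SVC(kernel="linear", C=10.0).fit(X, y)

x_new = np.array([2.0, 2.0])

# Decision value = sum_i (alpha_i * y_i) * <sv_i, x_new> + b.
manual = clf.dual_coef_[0] @ (clf.support_vectors_ @ x_new) + clf.intercept_[0]

print(manual, clf.decision_function([x_new])[0])  # these should match
```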
Training an SVM amounts to solving a convex quadratic programming problem, which means there is only one global minimum. To find it, we can use various optimization strategies; in practice, the problem is typically set up as a constrained optimization and solved with non-linear optimization techniques.
What happens when our data can’t simply have a straight-line decision boundary drawn through it? What if our data looks like this?
SVMs handle data like this through something called the kernel trick. We mentioned before that SVMs utilize the dot product between support vectors and new points to determine classification. The kernel trick extends this, giving the effect of transforming data into a higher dimension and supplying the dot product in that new dimension. So instead of having our data classified like above, we can project it onto the z-axis and easily slice it with a plane:
This can be done with an equation as simple as z = x² + y², mapping each point (x, y) to (x, y, x² + y²).
To make this happen, we don’t have to transform the data and then take the dot product in that higher-dimensional space. That’s why it’s a trick. We simply apply the kernel function, which computes what the dot product would be in that space, giving our SVM the similarity measure it needs. Popular kernels include the polynomial kernel and the Gaussian radial basis function (RBF) kernel.
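Here is a hedged sketch of the kernel trick in action on exactly this kind of data: concentric circles that no straight line can separate. A linear SVM struggles, while an RBF kernel handles it with no change to the calling code:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# One class inside a ring of the other: not linearly separable.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)

print("linear kernel accuracy:", linear.score(X, y))
print("RBF kernel accuracy:", rbf.score(X, y))
```

Switching kernels is a one-argument change, which is part of what makes SVMs so adaptable.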
SVMs offer a lot of benefits. They’re fast at prediction time and dependable. They work very well in domains with limited data and, uniquely enough, work well when the dimensionality of the data is higher than the number of samples. They have a low memory cost at prediction time because they only need to store the support vectors and their associated weights to reconstruct the decision boundary. Finally, with the kernel trick, SVMs easily adapt to complex and diverse situations with little change in calculation cost.
However, SVMs do have a few drawbacks. For one, they’re completely deterministic: once the decision boundary is set, there is no built-in concept of probabilistic classification. SVMs also do not scale well as the number of samples grows; training is at best quadratic in the number of samples and at worst cubic. They’re also very dependent on the value of C chosen for the soft margin. Another interesting downside is that SVMs store their support vectors, so if any of that training data should be private, it is readily exposed to anyone who inspects the model.
SVMs are also only defined for binary classification, but that has not stopped them from being extended to the multiclass scenario. In the multiclass setting, one can employ a one-versus-all strategy or a series of one-versus-one classifications and select the top-scoring class for each new point. Each of these strategies has computational tradeoffs, so it’s important to be aware of which one you’re using.
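As a sketch of the two strategies, note that scikit-learn’s `SVC` trains one-versus-one pairs internally, while `OneVsRestClassifier` wraps any binary classifier into a one-versus-all scheme; the Iris dataset here is just a convenient three-class example:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # 3 classes

# One-versus-one: SVC fits k*(k-1)/2 = 3 pairwise classifiers internally.
ovo = SVC(kernel="linear").fit(X, y)

# One-versus-rest: k = 3 classifiers, one per class against all others.
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)

print("one-vs-rest classifiers trained:", len(ovr.estimators_))
print("OvO accuracy:", ovo.score(X, y), "OvR accuracy:", ovr.score(X, y))
```

One-versus-one trains more (but smaller) models, which is the computational tradeoff to keep in mind.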
SVMs can be used anywhere you have need for classification. They’ve been used in the image domain for things like face classification (usually with some sort of pre-processing features instead of raw pixel values). They’re used frequently in the natural language processing domain, particularly when using features like word counts or TF-IDF scores with stop word removal. These features tend to produce very sparse, high dimensional vectors for text samples, which is where SVMs shine.
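A minimal text-classification sketch along these lines, using TF-IDF features with stop-word removal and a linear SVM. The tiny corpus and its labels are entirely made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus: two documents per topic (illustrative only).
texts = ["the cat sat on the mat",
         "dogs are loyal pets",
         "stock prices rose sharply",
         "the market closed higher today"]
labels = ["animals", "animals", "finance", "finance"]

# TF-IDF with English stop-word removal produces sparse,
# high-dimensional vectors, which is where linear SVMs shine.
model = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
model.fit(texts, labels)

print(model.predict(["the dog chased the cat"]))
print(model.predict(["prices fell on the market"]))
```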
Hopefully, after reading this article, you see why SVMs are so powerful and still relevant today. There are situations in which an SVM is superior to any deep-learning approach, and it can often be tested right out of the box with most machine learning libraries. They do have their trade-offs, however. A good approach tends to be to see how far you can get on a task with a simple regression model (logistic or otherwise) and then move to an SVM if the need presents itself.