SVM P.1 – Loading Sklearn Datasets

Support Vector Machines (SVM)

SVM stands for support vector machine. SVMs are typically used for classification tasks, similar to what we did with K Nearest Neighbors. They work very well for high dimensional data and allow us to classify data that is not linearly separable, for example a data set like the one below.
[Figure: SVM Python tutorial data]
Attempting to use K Nearest Neighbors on data like this would likely give us a low accuracy score. This is where SVMs are useful.
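
To make the idea of data without a linear boundary concrete, here is a minimal sketch. It does not use the data set from the figure; the synthetic make_circles data and the parameter values are assumptions chosen for illustration. It compares an SVM restricted to a straight-line boundary with one using sklearn's default RBF kernel, scoring both on the data they were trained on.

```python
from sklearn import datasets, svm

# Synthetic example data (an assumption, not the tutorial's data set):
# two concentric rings, so no straight line separates the classes.
x, y = datasets.make_circles(n_samples=200, noise=0.05, factor=0.5, random_state=0)

# An SVM limited to a linear boundary versus one with the RBF kernel.
linear_clf = svm.SVC(kernel="linear").fit(x, y)
rbf_clf = svm.SVC(kernel="rbf").fit(x, y)

# Accuracy on the training data itself, just to compare the two boundaries.
linear_acc = linear_clf.score(x, y)
rbf_acc = rbf_clf.score(x, y)
print("linear:", linear_acc, "rbf:", rbf_acc)
```

On data like this the RBF kernel should score clearly higher than the linear one. How kernels work is left for the later parts of this series.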

Importing Modules

Before we start we need to import a few things from sklearn.

import sklearn
from sklearn import svm
from sklearn import datasets
from sklearn import model_selection  # makes sklearn.model_selection.train_test_split available

Loading Data

In previous tutorials we did quite a bit of work to load our data sets from places like the UCI Machine Learning Repository. That is a very useful skill and something you will often have to do when applying these algorithms to your own data. However, now that we have learned this, we will use the data sets that come with sklearn. These are much nicer to work with and come with helper functions that make loading the data very quick.

For this tutorial we will be using a breast cancer data set. It consists of many features describing a tumor and labels each tumor as either malignant (cancerous) or benign (non-cancerous).

To load our data we will simply do the following.

cancer = datasets.load_breast_cancer()
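
As a quick sanity check, the loaded object can be inspected before going further. This is a minimal sketch; the data and target attributes are part of the object sklearn returns, while the choice of what to print is just for illustration.

```python
from sklearn import datasets

cancer = datasets.load_breast_cancer()

# The returned object works like a dictionary with attribute access.
print(cancer.data.shape)    # (number of samples, number of features)
print(cancer.target.shape)  # one label per sample
print(cancer.data[0][:3])   # first three feature values of the first sample
```

For this data set there are 569 samples with 30 features each.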

To see a list of the features in the data set we can do:

print("Features: ", cancer.feature_names)

Similarly for the labels.

print("Labels: ", cancer.target_names)

The output should list the 30 feature names, followed by the two labels, 'malignant' and 'benign'.
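
The labels themselves are stored as integers in cancer.target, and cancer.target_names maps each integer to its name. A small sketch of how the two relate, and of how many samples fall into each class (using numpy for the counting is my choice here, not something the tutorial requires):

```python
import numpy as np
from sklearn import datasets

cancer = datasets.load_breast_cancer()

# Count how many samples carry each integer label.
counts = np.bincount(cancer.target)

# The position of a name in target_names is the integer code used in target.
for code, name in enumerate(cancer.target_names):
    print(code, name, counts[code])
```

So a 0 in the labels means malignant and a 1 means benign.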

Splitting Data

Now that we have loaded our data set it is time to split it into training and testing data. We will do this the same way as in previous tutorials.

x = cancer.data  # All of the features
y = cancer.target  # All of the labels

x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, test_size=0.2)
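
Because the split is shuffled randomly, every run gives a different result. If you want a repeatable split, here is a hedged sketch using two optional parameters of train_test_split. The value 42 and the decision to stratify are assumptions for illustration, not part of the tutorial's code.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

cancer = datasets.load_breast_cancer()
x = cancer.data    # all of the features
y = cancer.target  # all of the labels

# random_state fixes the shuffle so the same split comes back every run;
# stratify=y keeps the class ratio similar in the training and testing sets.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42, stratify=y
)

print(x_train.shape, x_test.shape)
```

With test_size=0.2, 114 of the 569 samples go to the testing set and 455 to the training set.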

If we want to have a look at our data we can print the first few instances.

print(x_train[:5], y_train[:5])

Full Code

import sklearn
from sklearn import datasets
from sklearn import svm
from sklearn import model_selection  # makes sklearn.model_selection.train_test_split available

cancer = datasets.load_breast_cancer()


x = cancer.data  # All of the features
y = cancer.target  # All of the labels

x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, test_size=0.2)

print(x_train, y_train)

The next tutorial will explain how an SVM works and the math behind it. Following that, I will go over implementing the algorithm.