Introduction to KNN
KNN stands for K-Nearest Neighbors. It is a machine learning algorithm used for classifying data. Rather than coming up with a numerical prediction, such as a student's grade or a stock price, it attempts to classify data into certain categories. In the next few tutorials we will be using this algorithm to classify cars into four categories based upon certain features.
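To make the idea of classifying by nearest neighbors concrete, here is a minimal sketch in plain Python. It assumes Euclidean distance and a simple majority vote among the k closest points; the function name `knn_predict`, the toy points, and the labels "A" and "B" are made up for illustration and are not part of the car data set.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Return the most common label among the k training points closest to query."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k smallest distances and let their labels vote.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated clusters, one per category.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels = ["A", "A", "A", "B", "B", "B"]

prediction = knn_predict(points, labels, query=(2, 2), k=3)
print(prediction)
```

In the tutorials that follow we will rely on sklearn's implementation rather than writing this by hand.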
Downloading the Data
Download Data: Download Now
*IMPORTANT* If you choose to download the file from the UCI website, you must make the following change (if you clicked the download button, it has been done for you).
CHANGE: Add the following line to the top of your file and click save.
Your file should now look like the following:
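If you would rather make this change with a script than by hand, a sketch like the one below could do it. The header shown is an assumption: it uses the attribute names from the UCI data set description plus a `class` column, so adjust it if your header differs. The helper name `add_header` is hypothetical.

```python
from pathlib import Path

# Assumed header: attribute names from the UCI description, plus "class".
HEADER = "buying,maint,doors,persons,lug_boot,safety,class"

def add_header(path, header=HEADER):
    """Prepend the header line to the file unless it is already the first line."""
    file = Path(path)
    text = file.read_text()
    # Only write when the first line is not already the header.
    if text.splitlines()[:1] != [header]:
        file.write_text(header + "\n" + text)

# Only touch car.data if it is present in the current directory.
if Path("car.data").exists():
    add_header("car.data")
```

Running it twice is harmless, since the header is only added when it is missing.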
Before we start, we need to import a few modules. Most of these should be familiar to you. The only one we have yet to import is the preprocessing module from sklearn.
This will be used to normalize our data and convert non-numeric values into numeric values.
Now our imports should include the following.
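A plausible set of imports for this tutorial is sketched here. The exact list is an assumption based on the steps that follow (pandas for loading, preprocessing for encoding, model_selection for splitting, and the KNN classifier itself); your own list may contain more from earlier tutorials.

```python
import numpy as np
import pandas as pd
from sklearn import model_selection, preprocessing
from sklearn.neighbors import KNeighborsClassifier
```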
After placing our car.data file into our current script directory, we can load our data. To do so, we will use the pandas module, as in previous tutorials.
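Loading might look like the sketch below. So that it runs without the file, it reads two illustrative rows from an in-memory string; the header names are assumed from the UCI attribute list. With the real file in place you would pass the file name to `read_csv` instead.

```python
import io
import pandas as pd

# Two sample rows in the format of car.data, with an assumed header on top.
sample = io.StringIO(
    "buying,maint,doors,persons,lug_boot,safety,class\n"
    "vhigh,vhigh,2,2,small,low,unacc\n"
    "low,low,4,4,big,high,vgood\n"
)

# With the real file in the script directory: data = pd.read_csv("car.data")
data = pd.read_csv(sample)
print(data.head())
```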
As you may have noticed, much of our data is not numeric. In order to train the K-Nearest Neighbor Classifier, we must convert any string data into some kind of number. Luckily for us, sklearn has a method that can do this.
We will start by creating a label encoder object and then use that to encode each column of our data into integers.
The method fit_transform() takes a list (each of our columns) and returns an array containing our new integer values.
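As a small illustration of the encoding step, the sketch below runs `LabelEncoder.fit_transform()` on a short made-up column. The variable names and the sample values are assumptions for the example; on the real data you would pass each column of the loaded frame in turn.

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()

# A column of string values, standing in for one column of the data.
safety_column = ["low", "med", "high", "low"]

# fit_transform learns the distinct values and maps each one to an integer.
safety = le.fit_transform(safety_column)

print(list(safety))       # integer codes, one per input value
print(list(le.classes_))  # the original strings, indexed by their code
```

Note that the encoder assigns codes in sorted order of the strings, so the integers do not reflect any natural ordering such as low < med < high.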
Now we need to recombine our data into a feature list and a label list. We can use the zip() function to make things easier.
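The recombination can be sketched as follows. The short lists stand in for the encoded columns and their values are made up; only three feature columns are shown to keep the example small.

```python
# Encoded columns (short made-up examples standing in for the real ones).
buying = [3, 3, 1]
maint = [3, 2, 1]
safety = [1, 2, 0]
cls = [2, 2, 3]

# zip pairs up the i-th entry of every column, giving one tuple per row.
X = list(zip(buying, maint, safety))  # features
y = list(cls)                         # labels

print(X)
print(y)
```

Each tuple in `X` is one car's features, and the entry at the same index in `y` is that car's class.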
Finally, we will split our data into training and testing data using the same process seen previously.
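A sketch of the split using sklearn's `train_test_split` is shown below. The placeholder data and the `test_size=0.1` value (10% held out for testing) are assumptions; use whatever proportion you used in the earlier tutorials.

```python
from sklearn.model_selection import train_test_split

# Placeholder features and labels so the example runs on its own.
X = [(i, i % 3) for i in range(20)]
y = [i % 2 for i in range(20)]

# test_size=0.1 is an assumed value: 10% of the rows go to the test set.
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

print(len(x_train), len(x_test))
```

Because the split is random, the rows in each set will differ from run to run unless you also pass a fixed `random_state`.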