
KNN P.1 – Irregular Data

Introduction to KNN

KNN stands for K-Nearest Neighbors. KNN is a machine learning algorithm used for classifying data. Rather than coming up with a numerical prediction, such as a student's grade or a stock price, it attempts to classify data into certain categories. In the next few tutorials we will be using this algorithm to classify cars into four categories based upon certain features.

Downloading the Data

The data set we will be using is the Car Evaluation Data Set from the UCI Machine Learning Repository. You can download the .data file below.

Download Data

IMPORTANT: If you choose to download the file from the UCI website, you must make the following change (if you clicked the download button, it has already been done for you).

CHANGE: Add the following line to the top of your file and save it:

buying,maint,door,persons,lug_boot,safety,class

Your file should now start with the header row shown above, followed by the original data rows.

Importing Modules

Before we start, we need to import a few modules. Most of these should be familiar to you. The only one we have not yet imported is the following:

from sklearn import preprocessing

This will be used to normalize our data and convert non-numeric values into numeric values.

Now our imports should include the following.

import sklearn
import sklearn.model_selection  # needed for train_test_split below
from sklearn.utils import shuffle
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import numpy as np
from sklearn import linear_model, preprocessing

Loading Data

After placing our car.data file into our current script directory, we can load our data. To load our data we will use the pandas module, as seen in previous tutorials.

data = pd.read_csv("car.data")
print(data.head())  # To check if our data is loaded correctly
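The header line we added earlier is what gives pandas its column names. As a quick sketch of what read_csv does with it, here is a miniature version of the file built in memory with io.StringIO (the data row is just one sample in the dataset's format, not necessarily the real first row):

```python
import io
import pandas as pd

# A miniature stand-in for car.data: header row plus one sample row.
csv_text = (
    "buying,maint,door,persons,lug_boot,safety,class\n"
    "vhigh,vhigh,2,2,small,low,unacc\n"
)
data = pd.read_csv(io.StringIO(csv_text))
print(list(data.columns))  # ['buying', 'maint', 'door', 'persons', 'lug_boot', 'safety', 'class']
print(data["buying"][0])   # vhigh
```

Without that header line, pandas would treat the first data row as column names, and lookups like data["buying"] would fail.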

Converting Data

As you may have noticed, much of our data is not numeric. In order to train the K-Nearest Neighbors classifier, we must convert any string data into some kind of number. Luckily for us, sklearn has a method that can do this for us.

We will start by creating a label encoder object and then use that to encode each column of our data into integers.

le = preprocessing.LabelEncoder()

The method fit_transform() takes a list (each of our columns) and returns an array containing our new integer values.

buying = le.fit_transform(list(data["buying"]))
maint = le.fit_transform(list(data["maint"]))
door = le.fit_transform(list(data["door"]))
persons = le.fit_transform(list(data["persons"]))
lug_boot = le.fit_transform(list(data["lug_boot"]))
safety = le.fit_transform(list(data["safety"]))
cls = le.fit_transform(list(data["class"]))
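To see what fit_transform() is doing, here is a toy example (made-up values, not our actual columns). Each distinct string gets an integer code, assigned in alphabetical order of the unique values:

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
# Unique values sorted alphabetically: 'high' -> 0, 'low' -> 1, 'med' -> 2
encoded = le.fit_transform(["low", "med", "high", "med"])
print(list(encoded))       # [1, 2, 0, 2]
print(list(le.classes_))   # ['high', 'low', 'med']
```

Note that because we reuse the same encoder object for every column, each call to fit_transform() refits it on that column's values.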

Now we need to recombine our data into a feature list and a label list. We can use the zip() function to make things easier.

X = list(zip(buying, maint, door, persons, lug_boot, safety))  # features
y = list(cls)  # labels
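The zip() call pairs up the i-th element of every column, so each entry of X is one row's worth of features. A minimal sketch with made-up columns:

```python
# Three hypothetical columns, two rows each.
buying = [1, 2]
maint = [3, 4]
door = [5, 6]

# zip groups the columns element-wise into per-row tuples.
rows = list(zip(buying, maint, door))
print(rows)  # [(1, 3, 5), (2, 4, 6)]
```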

Finally, we will split our data into training and testing data using the same process seen previously.

x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.1)
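With test_size=0.1, roughly 10% of the rows are held out for testing. A quick sanity check with made-up data (100 hypothetical rows, not our car data):

```python
import sklearn.model_selection

# 100 hypothetical feature rows and labels.
X = [[i, i + 1] for i in range(100)]
y = [i % 4 for i in range(100)]

x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.1)
print(len(x_train), len(x_test))  # 90 10
```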

Full Code

import sklearn
import sklearn.model_selection  # needed for train_test_split below
from sklearn.utils import shuffle
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import numpy as np
from sklearn import linear_model, preprocessing

data = pd.read_csv("car.data")
print(data.head())

le = preprocessing.LabelEncoder()
buying = le.fit_transform(list(data["buying"]))
maint = le.fit_transform(list(data["maint"]))
door = le.fit_transform(list(data["door"]))
persons = le.fit_transform(list(data["persons"]))
lug_boot = le.fit_transform(list(data["lug_boot"]))
safety = le.fit_transform(list(data["safety"]))
cls = le.fit_transform(list(data["class"]))

predict = "class"  # optional

X = list(zip(buying, maint, door, persons, lug_boot, safety))
y = list(cls)

x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.1)