This tutorial is dedicated to understanding how to properly manipulate data sets and get data into a useful form. Since we will be using large data sets throughout the future tutorials, this is very important to understand.
Installing Necessary Packages
To make loading in our data easier we need to install another package called pandas. We will do this the same way we installed the other packages from the previous tutorial.
Simply activate your environment and type pip install pandas from the command prompt.
This is a list of packages you should have installed before starting this tutorial.
Downloading Our Data
In this specific tutorial we will be implementing the linear regression algorithm to predict students' final grades based on a series of attributes. To do this we need some data!
We are going to be using the Student Performance data set from the UCI Machine Learning Repository. You can download the data set here or from the direct link below:
Download Data Set: Download Now
This data set consists of 33 attributes for each student. You can see a description of each attribute here. It is great that there are many attributes, but we likely don't want to consider all of them when trying to predict a student's grade. Therefore, we will trim this data set down so we only have the attributes we need.
Before we start coding we should import all of the following.
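import pandas as pd
import numpy as np
import sklearn
import sklearn.model_selection  # loaded explicitly so sklearn.model_selection.train_test_split is available
from sklearn import linear_model
from sklearn.utils import shuffle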
Loading in Our Data
Once you've downloaded the data set and placed it into your main directory you can load it in using the pandas module.
# Since our data is separated by semicolons we need to pass sep=";"
data = pd.read_csv("student-mat.csv", sep=";")
To see our data frame we can type:
print(data.head())
This will print out the first 5 students in our data frame.
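To make the role of sep=";" concrete, here is a minimal sketch that reads a few semicolon-separated rows from memory instead of from student-mat.csv. The rows and the names sample_csv, sample and wrong are made up for illustration; only the column names come from this tutorial.

```python
import io

import pandas as pd

# Made-up rows that mimic the semicolon-separated layout of student-mat.csv.
sample_csv = io.StringIO(
    "G1;G2;G3;studytime;failures;absences\n"
    "5;6;6;2;0;6\n"
    "7;8;10;2;3;10\n"
    "15;14;15;3;0;2\n"
)

# With sep=";" every attribute lands in its own column.
sample = pd.read_csv(sample_csv, sep=";")
print(sample.shape)    # (rows, columns)
print(sample.head(2))  # head(n) shows the first n rows; the default is 5

# Without sep, pandas splits on commas and finds none,
# so each line ends up in one wide column.
sample_csv.seek(0)
wrong = pd.read_csv(sample_csv)
print(wrong.shape)
```

If your own data frame shows a single column after loading, the separator is the first thing to check.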
Trimming Our Data
Since we have so many attributes and not all are relevant we need to select the ones we want to use. We can do this by typing the following.
data = data[["G1", "G2", "G3", "studytime", "failures", "absences"]]
Now our data frame only has the information associated with those 6 attributes.
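The double square brackets can look odd at first: the inner pair is an ordinary Python list of column names, and the outer pair is the indexing operation on the data frame. A small sketch with a made-up data frame (toy, wanted and trimmed are hypothetical names) shows the effect:

```python
import pandas as pd

# A made-up data frame with two columns we do not need.
toy = pd.DataFrame({
    "school": ["GP", "GP", "MS"],
    "age": [18, 17, 16],
    "G1": [5, 7, 15],
    "G2": [6, 8, 14],
    "G3": [6, 10, 15],
})

wanted = ["G1", "G2", "G3"]  # the inner list of column names
trimmed = toy[wanted]        # same as toy[["G1", "G2", "G3"]]

print(toy.shape)              # all five columns
print(trimmed.shape)          # only the three selected columns
print(list(trimmed.columns))  # columns come back in the order we listed them
```

Indexing with a list always gives back a data frame, even when the list holds a single name.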
Separating Our Data
Now that we've trimmed our data set down we need to separate it into 4 arrays. However, before we can do that we need to define which attribute we are trying to predict. This attribute is known as a label. The other attributes, which will determine our label, are known as features. Once we've done this we will use numpy to create two arrays: one that contains all of our features and one that contains our labels.
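predict = "G3"
# drop(columns=...) returns a copy without the label column; recent pandas versions no longer accept the axis as a positional argument
X = np.array(data.drop(columns=[predict]))  # Features
y = np.array(data[predict])  # Labels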
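It can help to check the shapes of the two arrays before moving on. The sketch below repeats the same separation on a made-up four-row data frame (toy, label, features and labels are hypothetical names) and uses the keyword form drop(columns=...):

```python
import numpy as np
import pandas as pd

# Made-up grades for four students.
toy = pd.DataFrame({
    "G1": [5, 7, 15, 6],
    "G2": [6, 8, 14, 10],
    "G3": [6, 10, 15, 10],
    "absences": [6, 10, 2, 4],
})

label = "G3"
features = np.array(toy.drop(columns=[label]))  # every column except the label
labels = np.array(toy[label])                   # only the label column

print(features.shape)  # 2-D: one row per student, one column per feature
print(labels.shape)    # 1-D: one value per student
print("G3" in toy.columns)  # drop() returned a copy, so toy is unchanged
```

The features array is two-dimensional and the labels array is one-dimensional, and both have the same number of rows.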
After this we need to split our data into testing and training data. We will use 90% of our data to train and the other 10% to test. The reason we do this is so that we do not test our model on data that it has already seen.
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size = 0.1)
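As a rough illustration of what the split returns, the sketch below runs train_test_split on 20 made-up rows. The names X_demo, y_demo and the x_tr/x_te/y_tr/y_te variables are hypothetical, and random_state=0 is an addition not used in this tutorial; it only makes the shuffle repeatable between runs.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 made-up samples: row i holds [2*i, 2*i + 1] and its label is i,
# so we can verify that features and labels stay paired after shuffling.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=0
)

print(x_tr.shape, x_te.shape)  # 18 training rows and 2 testing rows
print(y_tr.shape, y_te.shape)

# Each feature row still matches its label.
print(np.array_equal(x_tr[:, 0] // 2, y_tr))

# No sample appears in both the training and the testing set.
print(set(y_tr.tolist()) & set(y_te.tolist()))
```

Without random_state the rows are shuffled differently on every run, so the exact contents of the training and testing sets will vary.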
Now we are ready to implement the linear regression algorithm.
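import pandas as pd
import numpy as np
import sklearn
import sklearn.model_selection  # loaded explicitly so sklearn.model_selection.train_test_split is available
from sklearn import linear_model
from sklearn.utils import shuffle

# Load the semicolon-separated data set
data = pd.read_csv("student-mat.csv", sep=";")

# Keep only the attributes we need
data = data[["G1", "G2", "G3", "studytime", "failures", "absences"]]

predict = "G3"

X = np.array(data.drop(columns=[predict]))  # Features
y = np.array(data[predict])  # Labels

# Use 90% of the data for training and 10% for testing
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size = 0.1)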