Linear Regression P.1


This tutorial is dedicated to understanding how to properly manipulate data sets and get data into a useful form. Since we will be using large data sets throughout the future tutorials, this is very important to understand.

Installing Necessary Packages

To make loading in our data easier we need to install another package called pandas. We will do this the same way we installed the other packages from the previous tutorial.

Simply activate your environment and type pip install pandas from the command prompt.

This is a list of packages you should have installed before starting this tutorial.
- numpy
- pandas
- sklearn (installed with pip under the name scikit-learn)
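
If you want to confirm that these packages are visible to your environment before going further, a quick check like the one below can help. This is only a sketch: the helper name check_packages is made up for this example, and it only tests whether each module can be found, not which version is installed.

```python
import importlib.util

def check_packages(names):
    """Map each module name to True if Python can locate it, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

# These are the import names, which is why we check "sklearn" here.
status = check_packages(["numpy", "pandas", "sklearn"])
for name, found in status.items():
    print(f"{name}: {'installed' if found else 'missing'}")
```

Any package reported as missing can be installed with pip from the same activated environment.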

Downloading Our Data

In this specific tutorial we will be implementing the linear regression algorithm to predict a student's final grade based on a series of attributes. To do this we need some data!

We are going to be using the Student Performance data set from the UCI Machine Learning Repository. Download the data set from the repository; the file we use in this tutorial is student-mat.csv.

This data set consists of 33 attributes for each student. A description of each attribute is available on the data set's page. It is great that there are many attributes, but we likely don't want to consider all of them when trying to predict a student's grade. Therefore, we will trim this data set down so we only have the attributes we need.

Importing Modules/Packages

Before we start coding we should import all of the following.

import pandas as pd
import numpy as np
import sklearn
import sklearn.model_selection  # loads the submodule we call later as sklearn.model_selection
from sklearn import linear_model
from sklearn.utils import shuffle

Loading in Our Data

Once you've downloaded the data set and placed it into your main directory you can load it in using the pandas module.

data = pd.read_csv("student-mat.csv", sep=";")
# Since our data is separated by semicolons we need to pass sep=";"
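
To see why the sep argument matters, here is a small sketch that reads the same semicolon-separated text twice, once with the default separator and once with sep=";". The rows are made up for illustration and are not taken from student-mat.csv.

```python
import io
import pandas as pd

# Made-up sample in the same semicolon-separated layout (not real data).
sample = "school;sex;G3\nGP;F;6\nGP;M;10\n"

# Default separator is a comma, so each line is read as one single column.
wrong = pd.read_csv(io.StringIO(sample))

# With sep=";" the three columns are split as intended.
right = pd.read_csv(io.StringIO(sample), sep=";")

print(wrong.shape)
print(right.shape)
print(list(right.columns))
```

If your data frame ever shows up with a single, very wide column, the separator is the first thing to check.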

To see our data frame we can type:

print(data.head())

This will print out the first 5 students in our data frame.
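
Besides the first few rows, it is often useful to check how many rows and columns were loaded. The sketch below uses a small made-up data frame in place of the real file, so the numbers are illustrative only.

```python
import pandas as pd

# Made-up rows for illustration only; they are not taken from student-mat.csv.
data = pd.DataFrame({
    "G1": [5, 7, 15, 6, 12, 9, 14],
    "G2": [6, 5, 14, 10, 12, 8, 15],
    "G3": [6, 6, 15, 10, 11, 9, 16],
})

print(data.head())         # first 5 rows (the default)
print(data.head(3))        # first 3 rows
print(data.shape)          # (number of rows, number of columns)
print(list(data.columns))  # column names
```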

Trimming Our Data

Since we have so many attributes and not all are relevant we need to select the ones we want to use. We can do this by typing the following.

data = data[["G1", "G2", "G3", "studytime", "failures", "absences"]]

Now our data frame only has the information associated with those 6 attributes.
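
As a quick illustration of what the double brackets do, the sketch below trims a small made-up data frame. The values are invented, and the "school" and "age" columns simply stand in for attributes we are not keeping.

```python
import pandas as pd

# Made-up rows for illustration only.
data = pd.DataFrame({
    "school": ["GP", "GP", "MS"],
    "age": [18, 17, 16],
    "G1": [5, 7, 15],
    "G2": [6, 5, 14],
    "G3": [6, 6, 15],
    "studytime": [2, 2, 3],
    "failures": [0, 0, 1],
    "absences": [6, 4, 0],
})

# Passing a list of column names returns a new data frame with only those columns.
wanted = ["G1", "G2", "G3", "studytime", "failures", "absences"]
trimmed = data[wanted]

print(type(trimmed).__name__)
print(list(trimmed.columns))
```

The outer brackets index the data frame and the inner brackets are just a Python list of column names.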

Separating Our Data

Now that we've trimmed our data set down we need to separate it into 4 arrays: training and testing features, and training and testing labels. However, before we can do that we need to define which attribute we are trying to predict. This attribute is known as a label. The other attributes that will determine our label are known as features. Once we've done this we will use numpy to create two arrays: one that contains all of our features and one that contains our labels.

predict = "G3"

X = np.array(data.drop(columns=[predict])) # Features (every column except the label)
y = np.array(data[predict]) # Labels
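
To make the shapes of these two arrays concrete, here is a small sketch on made-up rows. It drops the label with the keyword form columns=[predict], which removes the same column; the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
data = pd.DataFrame({
    "G1": [5, 7, 15, 6],
    "G2": [6, 5, 14, 10],
    "G3": [6, 6, 15, 10],
    "studytime": [2, 2, 3, 1],
    "failures": [0, 0, 1, 0],
    "absences": [6, 4, 0, 2],
})

predict = "G3"

X = np.array(data.drop(columns=[predict]))  # features: one row per student, 5 columns
y = np.array(data[predict])                 # labels: one value per student

print(X.shape)
print(y.shape)
```

Each row of X lines up with the entry of y at the same position, which is what lets us split them together in the next step.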

After this we need to split our data into testing and training data. We will use 90% of our data to train and the other 10% to test. The reason we do this is so that we do not test our model on data that it has already seen.

x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size = 0.1)
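
To see what the split produces, here is a small sketch on 20 made-up samples. The random_state argument is an addition for this example only, so that the split is repeatable; the line above does not set it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 made-up samples with 2 features each; labels are just 0..19.
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# test_size=0.1 holds back 10% of the samples for testing.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)

print(len(x_train), len(x_test))
print(sorted(y_test))  # labels of the held-out samples
```

Every sample lands in exactly one of the two groups, so the test data is never seen during training.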

Now we are ready to implement the linear regression algorithm.

Full Code

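import pandas as pd
import numpy as np
import sklearn
import sklearn.model_selection  # loads the submodule we call below as sklearn.model_selection
from sklearn import linear_model
from sklearn.utils import shuffle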

data = pd.read_csv("student-mat.csv", sep=";")

data = data[["G1", "G2", "G3", "studytime", "failures", "absences"]]

predict = "G3"

X = np.array(data.drop(columns=[predict]))
y = np.array(data[predict])

x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size = 0.1)