Text Classification P1
Text Classification
Another large application of neural networks is text classification. In these next few tutorials we will use a neural network to classify movie reviews as either positive or negative.
Install Previous Version of Numpy
There is a bug when using this specific dataset that requires us to install a previous version of numpy. We can do this by running the following in our cmd:
pip install numpy==1.16.1
This is the current working solution as of May 14, 2019. If you are reading this after that date you may not need to do this.
Loading Data
The dataset we will use for these next tutorials is the IMDB movie dataset from keras. To load and split the data we will do the same as we did in the previous tutorial.
import tensorflow as tf
from tensorflow import keras
import numpy

imdb = keras.datasets.imdb

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
Integer Encoded Data
Having a look at our data we'll notice that our reviews are integer encoded. This means that each word in our reviews is represented as a positive integer, where each integer stands for a specific word. This is necessary because we cannot pass strings to our neural network. However, if we (as humans) want to be able to read our reviews and see what they look like, we'll have to find a way to turn those integer encoded reviews back into strings. The following code will do this for us:
# A dictionary mapping words to an integer index
_word_index = imdb.get_word_index()
word_index = {k: (v + 3) for k, v in _word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

# this function will return the decoded (human readable) reviews
def decode_review(text):
    return " ".join([reverse_word_index.get(i, "?") for i in text])
We start by getting a dictionary that maps all of our words to integers, add a few special keys to it like <PAD>, <START>, <UNK> and <UNUSED>, and then reverse that dictionary so we can use integers as keys that map to each word. The function defined above takes an integer encoded review as a list and returns the human readable version.
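To see the idea in isolation, here is a toy sketch of the same reverse-mapping trick with a made-up five-word vocabulary (not the real IMDB index, which has tens of thousands of entries):

```python
# Toy vocabulary standing in for imdb.get_word_index() after the special
# keys have been added, purely for illustration
word_index = {"<PAD>": 0, "<START>": 1, "<UNK>": 2, "great": 4, "movie": 5}

# Flip the mapping so integers become the keys
reverse_word_index = {v: k for k, v in word_index.items()}

def decode_review(text):
    # Integers with no known word fall back to "?", as in the tutorial code
    return " ".join(reverse_word_index.get(i, "?") for i in text)

print(decode_review([1, 4, 5, 99]))  # <START> great movie ?
```

The integer 99 has no entry in the toy vocabulary, so it decodes to "?" exactly the way out-of-range integers would with the real index.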
Preprocessing Data
If we have a look at some of our loaded reviews we'll notice that they are different lengths. This is an issue: we cannot pass different length data into our neural network. Therefore we must make each review the same length. To do this we will follow the procedure below:
- if the review is longer than 250 words, trim off the extra words
- if the review is shorter than 250 words, add the necessary number of <PAD> tokens to make it 250 words long
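Conceptually, the two steps above boil down to the following plain-Python sketch (pad_review is a hypothetical helper written just for illustration, using 0 as the <PAD> value as in our word_index):

```python
def pad_review(review, maxlen=250, pad_value=0):
    # Reviews longer than maxlen get trimmed down to maxlen words
    if len(review) > maxlen:
        return review[:maxlen]
    # Shorter reviews get pad_value appended at the end ("post" padding)
    return review + [pad_value] * (maxlen - len(review))

print(pad_review([5, 7, 9], maxlen=5))          # [5, 7, 9, 0, 0]
print(pad_review([1, 2, 3, 4, 5, 6], maxlen=5))  # [1, 2, 3, 4, 5]
```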
Luckily for us, keras has a function that can do this for us:
train_data = keras.preprocessing.sequence.pad_sequences(
    train_data, value=word_index["<PAD>"], padding="post", maxlen=250)

test_data = keras.preprocessing.sequence.pad_sequences(
    test_data, value=word_index["<PAD>"], padding="post", maxlen=250)
Defining the Model
Finally we will define our model! This model is a little bit different and will be discussed in depth in the next tutorial.
model = keras.Sequential()
model.add(keras.layers.Embedding(88000, 16))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(16, activation="relu"))
model.add(keras.layers.Dense(1, activation="sigmoid"))

model.summary()  # prints a summary of the model
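The layers are covered in depth in the next tutorial, but as a quick preview of the GlobalAveragePooling1D step: it averages the 16-dimensional embedding vectors over all the words in a review, producing one fixed-size vector per review. A plain-Python sketch (not keras, and using tiny 2-dimensional embeddings for readability):

```python
# Three "words", each represented by a 2-dimensional embedding vector,
# standing in for a review's 250 words of 16-dimensional embeddings
embeddings = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
]

# Average each dimension across the words, like GlobalAveragePooling1D
pooled = [sum(dim) / len(embeddings) for dim in zip(*embeddings)]
print(pooled)  # [3.0, 4.0]
```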
Full Code
import tensorflow as tf
from tensorflow import keras
import numpy as np

data = keras.datasets.imdb

(train_data, train_labels), (test_data, test_labels) = data.load_data(num_words=88000)

word_index = data.get_word_index()
word_index = {k: (v + 3) for k, v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2
word_index["<UNUSED>"] = 3

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_review(text):
    return " ".join([reverse_word_index.get(i, "?") for i in text])

train_data = keras.preprocessing.sequence.pad_sequences(
    train_data, value=word_index["<PAD>"], padding="post", maxlen=250)

test_data = keras.preprocessing.sequence.pad_sequences(
    test_data, value=word_index["<PAD>"], padding="post", maxlen=250)

model = keras.Sequential()
model.add(keras.layers.Embedding(88000, 16))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(16, activation="relu"))
model.add(keras.layers.Dense(1, activation="sigmoid"))

model.summary()  # prints a summary of the model