Don’t get alarmed: we are going to put what we have learned into practice on a playground Kaggle dataset, explaining the code along the way.
Deep Learning Series
In this series, with each blog post we dive deeper into the world of deep learning and along the way build intuition and understanding. This series has been inspired from and references several sources. It is written with the hope that my passion for Data Science is contagious. The design and flow of this series is explained here.
Update (Dec 2018)
This was coded some time back and uses version 0.7 of the fastai library; since then there have been updates to the library as well as new PyTorch releases. The code will no longer run as-is with fastai v1, but there are still some important concepts to take from it, such as:
- Practical Application of Deep Learning
- Better modeling practices, such as data augmentation and image standardization
- Hyperparameter tuning
- Transfer Learning
Invasive Species
We have covered some basic concepts regarding what neural networks are and how they work. However, I feel it has been too much theory, and while learning any new concept it is also important to see that theory in action. Let’s start!!!
Let’s pick up a playground problem from Kaggle. Invasive species can have damaging effects on the environment, the economy, and even human health. Consider the tangles of kudzu that overwhelm trees in Georgia, or the cane toads that threaten habitats in over a dozen countries worldwide. It is therefore very important to track and stop the spread of these invasive species. Think of how costly and difficult it would be to undertake this task at a large scale. Trained scientists would be required to visit designated areas and take note of the species inhabiting them. Using such a highly qualified workforce is expensive, time-inefficient, and insufficient, since humans cannot cover large areas when sampling.
Looks like a very interesting use case for Deep Learning.
What we need is a labeled dataset of images marked as invasive or safe; our algorithm will take care of the rest. You can start a kernel (a Python Jupyter notebook) using this link and follow along. A few settings to keep in mind: make sure that you have the GPU and internet enabled. There are several Python libraries for deep learning; here we will use fastai.
The full code is available here.
Let’s start coding!!!
```python
# Get automatic reloading and inline plotting
%reload_ext autoreload
%autoreload 2
%matplotlib inline
```
Just some basic housekeeping: autoreload reloads modules automatically before executing code, and %matplotlib inline is a magic command that renders plots inline in the notebook.
```python
### Import Required Libraries
# Using Fastai Libraries
from fastai.imports import *
from fastai.transforms import *
from fastai.conv_learner import *
from fastai.model import *
from fastai.dataset import *
from fastai.sgdr import *
from fastai.plots import *

import numpy as np
import pandas as pd
import torch
import os

PATH = "../input"
print(os.listdir(PATH))
TMP_PATH = "/tmp/tmp"
MODEL_PATH = "/tmp/model/"

sz = 224
bs = 58
arch = resnet34
```
Defining some variables:
- Path: Location/path to the dataset
- sz: size that the images will be resized to in order to ensure that the training runs quickly.
- bs: the batch size, i.e. we break the data up into smaller chunks that are processed one at a time.
- arch: the selected architecture of the neural network model.
I know in this series we have not yet covered the convolution operation, and in particular how CNNs work. For now, all we need to know is that a CNN is a type of neural network that is popular for image classification, and that ResNet is a particular architecture; ResNet-34 has 34 layers!
The programming framework used behind the scenes to work with NVIDIA GPUs is called CUDA. Further, to improve performance, we check for the NVIDIA package called cuDNN (specially accelerated functions for deep learning).
```python
### Checking GPU set up
print(torch.cuda.is_available())
print(torch.backends.cudnn.enabled)
```
Both of these should be true.
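If you want to confirm which GPU the kernel picked up, a small optional check (standard PyTorch calls) is shown below.

```python
# Optional: print the model of the GPU this kernel is running on.
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```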
Now let’s look at what form the data is in: how the data directories are structured, what the labels are, and what some sample images look like. (The f'...' syntax is a convenient way to build a path string.)
```python
files = os.listdir(f'{PATH}/train')[:5]  ## train contains image names
print(files)

img = plt.imread(f'{PATH}/train/{files[0]}')
plt.imshow(img);
print(img.shape)
```
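The notebook also peeks at the raw pixel values; a minimal sketch of that inspection, assuming img from the cell above:

```python
# img is a (height, width, 3) array; peek at a small 4x4 corner of the RGB values.
print(img[:4, :4])
```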
img.shape gives us the height, width, and number of channels. img is a 3-dimensional array holding the red, green, and blue pixel values, and img[:4,:4] shows a small corner of it. The image above gives us an idea of what the photos look like. Now, let’s split the data into a train and a validation set.
```python
label_csv = f'{PATH}/train_labels.csv'
n = len(list(open(label_csv))) - 1  # header is not counted (-1)
val_idxs = get_cv_idxs(n)           # random 20% of the data for the validation set

print(n)              # total data size
print(len(val_idxs))  # validation set size
```
```python
label_df = pd.read_csv(label_csv)

### Count of both classes
label_df.pivot_table(index="invasive", aggfunc=len).sort_values('name', ascending=False)
```
The label CSV contains the image name and the corresponding label (1 or 0), where 1 means the image carries the invasive tag.

Table 1: Target Variable Distribution
Label | Count |
---|---|
1 | 1448 |
0 | 847 |
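An equivalent, slightly more direct way to get these counts is a small sketch using standard pandas:

```python
# Count the number of images in each class directly from the label column.
print(label_df['invasive'].value_counts())
```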
```python
tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
data = ImageClassifierData.from_csv(PATH, 'train', f'{PATH}/train_labels.csv',
                                    test_name='test', val_idxs=val_idxs,
                                    suffix='.jpg', tfms=tfms, bs=bs)
```
tfms stands for transformations. tfms_from_model takes care of resizing, image cropping, initial normalization, and more. transforms_side_on applies a pre-defined list of augmentation functions, and the max_zoom parameter additionally applies random zooming of the images up to the specified scale.
With ImageClassifierData.from_csv we are just putting together everything (train, validation set, the labels and batch size).
```python
fn = f'{PATH}/train' + data.trn_ds.fnames[0]
# img = PIL.Image.open(fn)

size_d = {k: PIL.Image.open(f'{PATH}/' + k).size for k in data.trn_ds.fnames}
row_sz, col_sz = list(zip(*size_d.values()))
row_sz = np.array(row_sz); col_sz = np.array(col_sz)
plt.hist(row_sz);
```
A plot of the distribution of the size of the images. Ideally, we want all images to have a standard size to allow easier computation.
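A quick numeric summary of the same distribution can also be useful; a small sketch, assuming row_sz and col_sz from the cell above:

```python
# Summary statistics of the two image dimensions
# (PIL's .size returns (width, height), so row_sz holds widths and col_sz heights).
print(row_sz.min(), int(row_sz.mean()), row_sz.max())
print(col_sz.min(), int(col_sz.mean()), col_sz.max())
```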
Our first model: to make the process quick, we will first run a pre-trained model and observe the results; then we can tweak it for improvements. A pre-trained model is a model created by someone else to solve a different problem, whose weights were trained on their dataset. Instead of coming up with our own weights specific to our dataset, we will simply use theirs as a starting point. This is what we call transfer learning.
Is that a good idea?
Well, usually these weights are obtained by training on a very large dataset, for example ImageNet. Reusing them helps speed up your training process.
We have a train set with 1836 images and a test set with 1531, which is not much data for training a high-accuracy model from scratch. Further, in the article on the black box we observed how gradients and edges are detected in the initial layers of a neural network. That information is useful for our use case as well.
Let us form a function to get the data and resize images if necessary.
```python
def get_data(sz, bs):
    # sz: image size, bs: batch size
    tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
    data = ImageClassifierData.from_csv(PATH, 'train', f'{PATH}/train_labels.csv',
                                        test_name='test', val_idxs=val_idxs,
                                        suffix='.jpg', tfms=tfms, bs=bs)
    # Reading the jpgs and resizing is slow for big images, so resizing them all
    # to a standard size first saves time.
    return data if sz > 500 else data.resize(512, TMP_PATH)
```
```python
data = get_data(sz, bs)
learn = ConvLearner.pretrained(arch, data, precompute=True,
                               tmp_name=TMP_PATH, models_name=MODEL_PATH)
learn.fit(1e-2, 3)
```
ConvLearner.pretrained builds a learner that contains a pre-trained model. The last layer of the model needs to be replaced with a layer of the right dimensions: the pretrained model was trained for 1000 classes, therefore its final layer predicts a vector of 1000 probabilities, whereas what we need is only a two-dimensional output. The diagram below shows, for one of the earliest successful CNNs, how this is done; the layer “FC8” there would get replaced with a new layer with 2 outputs.
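fastai handles this replacement internally; purely as an illustration, here is a minimal sketch of the same idea in plain torchvision (an assumption for illustration, not what fastai does under the hood):

```python
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained ResNet-34 and swap its 1000-way classification
# head for a new fully connected layer with 2 outputs.
model = models.resnet34(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 2)  # in_features is 512 for ResNet-34
```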
Parameters are learned by fitting a model to the data. Hyperparameters are another kind of parameter that cannot be directly learned from the regular training process: they express “higher-level” properties of the model, such as its complexity or how fast it should learn. In learn.fit we provide the learning rate and the number of epochs (the number of times we pass over the complete dataset).
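As a rough sanity check on what one epoch means here, a back-of-the-envelope calculation using the class counts from Table 1 (1448 + 847 = 2295 labeled images), the 20% validation split, and bs = 58:

```python
import math

# ~1836 training images split into mini-batches of 58: one epoch is ~32 batches.
n_train = 2295 - 459               # total rows minus the 20% validation split
batches_per_epoch = math.ceil(n_train / 58)
print(n_train, batches_per_epoch)  # 1836 32
```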
The output of learn.fit is:

Table 2: Loss/Accuracy By Epoch
epoch | trn_loss | val_loss | accuracy |
---|---|---|---|
0 | 0.379021 | 0.196531 | 0.932462 |
1 | 0.285149 | 0.168239 | 0.947712 |
2 | 0.229199 | 0.14343 | 0.947712 |
94% accuracy on our first model!!!
Error Analysis
Let’s write some functions to understand what the model gets right and wrong. We will explore:
- A few correct labels at random
- A few incorrect labels at random
- The most correct labels of each class (i.e. those with highest probability that are correct)
- The most incorrect labels of each class (i.e. those with highest probability that are incorrect)
- The most uncertain labels (i.e. those with probability closest to 0.5).
```python
# This gives predictions for the validation set. Predictions are in log scale.
log_preds = learn.predict()
print(log_preds.shape)

preds = np.argmax(log_preds, axis=1)  # from log probabilities to 0 or 1
probs = np.exp(log_preds[:, 1])       # pr(1)
# Where Species = Invasive is class 1

def rand_by_mask(mask):
    return np.random.choice(np.where(mask)[0], min(len(preds), 4), replace=False)

def rand_by_correct(is_correct):
    return rand_by_mask((preds == data.val_y) == is_correct)

def plots(ims, figsize=(12, 6), rows=1, titles=None):
    f = plt.figure(figsize=figsize)
    for i in range(len(ims)):
        sp = f.add_subplot(rows, len(ims) // rows, i + 1)
        sp.axis('Off')
        if titles is not None:
            sp.set_title(titles[i], fontsize=16)
        plt.imshow(ims[i])

def load_img_id(ds, idx):
    return np.array(PIL.Image.open(f'{PATH}/' + ds.fnames[idx]))

def plot_val_with_title(idxs, title):
    imgs = [load_img_id(data.val_ds, x) for x in idxs]
    title_probs = [probs[x] for x in idxs]
    print(title)
    return plots(imgs, rows=1, titles=title_probs, figsize=(16, 8)) if len(imgs) > 0 else print('Not Found.')

def most_by_mask(mask, mult):
    idxs = np.where(mask)[0]
    return idxs[np.argsort(mult * probs[idxs])[:4]]

def most_by_correct(y, is_correct):
    mult = -1 if (y == 1) == is_correct else 1
    return most_by_mask(((preds == data.val_y) == is_correct) & (data.val_y == y), mult)
```
Let’s take a look at what we get if we were to call these functions. Keep in mind our classification threshold is 0.5.
```python
# 1. A few correct labels at random
plot_val_with_title(rand_by_correct(True), "Correctly classified")

# 2. A few incorrect labels at random
plot_val_with_title(rand_by_correct(False), "Incorrectly classified")

# Most correct classifications: Class 0
plot_val_with_title(most_by_correct(0, True), "Most correct classifications: Class 0")

# Most correct classifications: Class 1
plot_val_with_title(most_by_correct(1, True), "Most correct classifications: Class 1")

# Most incorrect classifications: Actual Class 0, Predicted Class 1
plot_val_with_title(most_by_correct(0, False), "Most incorrect classifications: Actual Class 0 Predicted Class 1")

# Most incorrect classifications: Actual Class 1, Predicted Class 0
plot_val_with_title(most_by_correct(1, False), "Most incorrect classifications: Actual Class 1 Predicted Class 0")

# Most uncertain predictions
most_uncertain = np.argsort(np.abs(probs - 0.5))[:4]
plot_val_with_title(most_uncertain, "Most uncertain predictions")
```
Scope of Improvement:
- Find an Optimal Learning Rate
- Use Data Augmentation techniques
- Instead of using a Pre-trained model, train more layers of the neural network based on our dataset
```python
## How does the loss change with changes in learning rate (for the last layer)?
learn.lr_find()
learn.sched.plot_lr()
```
The method learn.lr_find() helps you find an optimal learning rate. It uses the technique developed in the 2015 paper Cyclical Learning Rates for Training Neural Networks, where we simply keep increasing the learning rate from a very small value, until the loss stops decreasing.
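For intuition, here is a tiny sketch of the kind of schedule such a learning-rate range test sweeps through; the values below are illustrative assumptions, not fastai's internals:

```python
import numpy as np

# Grow the learning rate geometrically from lr_min to lr_max over N mini-batches,
# logging the loss at each step; pick an LR a bit below where the loss stops improving.
lr_min, lr_max, N = 1e-5, 1e-1, 100
lrs = lr_min * (lr_max / lr_min) ** (np.arange(N) / (N - 1))
print(lrs[0], lrs[-1])  # 1e-05 ... 0.1
```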
```python
# Note that the loss still clearly improves until lr=1e-2 (0.01).
# The LR can vary as a part of stochastic gradient descent over time.
learn.sched.plot()
```
We can see the plot of loss versus learning rate to see where our loss stops decreasing:
Now we have an idea of how to select our learning rate. To set the number of epochs, we just need to ensure that there is no over-fitting. Next, let’s talk about data augmentation.
Data augmentation is a good step to prevent over-fitting. That is, by cropping/zooming/rotating the image, we can ensure that the model does not learn patterns specific to the train data and generalizes well to new data.
```python
def get_augs():
    tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
    data = ImageClassifierData.from_csv(PATH, 'train', f'{PATH}/train_labels.csv',
                                        bs=2, tfms=tfms, suffix='.jpg',
                                        val_idxs=val_idxs, test_name='test')
    x, _ = next(iter(data.aug_dl))
    return data.trn_ds.denorm(x)[1]

# An example of data augmentation
ims = np.stack([get_augs() for i in range(6)])
plots(ims, rows=2)
```
With precompute=True, all layers of the neural network are frozen except the last layer, so we have only been updating the weights of the last layer on our dataset. Now we will train the model with precompute set to False and cycle_len enabled. cycle_len uses a technique called stochastic gradient descent with restarts (SGDR), a variant of learning-rate annealing that gradually decreases the learning rate as training progresses. In other words, SGDR reduces the learning rate every mini-batch, and a reset occurs every cycle_len epochs. This is helpful because as we get closer to the optimal weights, we want to take smaller steps.
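For intuition, the annealing within a single cycle can be sketched as a cosine schedule; the shape below mirrors the idea (the values are assumptions for illustration, not fastai's exact implementation):

```python
import numpy as np

# Cosine annealing within one SGDR cycle: the LR decays from lr_max towards zero
# over the cycle's mini-batches, then jumps back to lr_max at the next restart.
lr_max, n_batches = 1e-2, 100
t = np.arange(n_batches) / n_batches
lrs = lr_max * (1 + np.cos(np.pi * t)) / 2
print(lrs[0], lrs[-1])
```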
```python
learn.precompute = False
learn.fit(1e-2, 3, cycle_len=1)
```
Table 3: Loss/Accuracy By Epoch
epoch | trn_loss | val_loss | accuracy |
---|---|---|---|
0 | 0.221001 | 0.1623 | 0.943355 |
1 | 0.232999 | 0.179043 | 0.941176 |
2 | 0.224435 | 0.148815 | 0.947712 |
Calling learn.sched.plot_lr() once again:
To train the earlier layers as well, we call unfreeze(). We will also try differential learning rates for the respective layer groups.
```python
learn.unfreeze()
lr = np.array([1e-4, 1e-3, 1e-2])
learn.fit(lr, 3, cycle_len=1, cycle_mult=2)
```
Table 4: Loss/Accuracy By Epoch
epoch | trn_loss | val_loss | accuracy |
---|---|---|---|
0 | 0.323539 | 0.178492 | 0.923747 |
1 | 0.247502 | 0.132352 | 0.949891 |
2 | 0.192528 | 0.128903 | 0.954248 |
3 | 0.165231 | 0.101978 | 0.962963 |
4 | 0.141049 | 0.106319 | 0.960784 |
5 | 0.121947 | 0.103018 | 0.960784 |
6 | 0.107445 | 0.100944 | 0.965142 |
We improved our model: 96.5% accuracy…
Above, we set the learning rate of the final layers. The learning rates of the earlier layers are fixed at the multiples of the final-layer rate we requested: the first layers get a 100x smaller learning rate and the middle layers a 10x smaller one, since we set lr=np.array([1e-4,1e-3,1e-2]).
To get a better picture, we can use test-time augmentation (learn.TTA()): we apply data augmentation techniques to our validation set as well. By making predictions on both the validation set images and their augmented copies and combining them, we get a more reliable estimate.
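A hedged sketch of how those TTA predictions can be scored (the exact return shape of learn.TTA() differs between fastai 0.7 point releases; averaging over the augmentation axis is one common pattern):

```python
# learn.TTA() returns log-probabilities for the validation images plus augmented
# copies, along with the true labels. Depending on the release it may be a
# per-augmentation 3-D array or an already-averaged 2-D array.
log_preds_tta, y = learn.TTA()
log_preds_tta = np.array(log_preds_tta)
if log_preds_tta.ndim == 3:                       # (n_aug, n_samples, n_classes)
    probs_tta = np.exp(log_preds_tta).mean(axis=0)
else:                                             # already averaged
    probs_tta = np.exp(log_preds_tta)
print(accuracy_np(probs_tta, y))                  # accuracy_np is a fastai 0.7 metric
```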
Our confusion matrix:
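The confusion matrix figure from the original notebook is not reproduced here; one way to recompute it is the following sketch using scikit-learn (assuming probs_tta and y from the TTA step above):

```python
from sklearn.metrics import confusion_matrix

# Hard 0/1 predictions from the TTA probabilities, then the 2x2 confusion matrix
# (rows: actual class, columns: predicted class).
preds_tta = np.argmax(probs_tta, axis=1)
print(confusion_matrix(y, preds_tta))
```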
Our final accuracy was 96.73% and upon submission to the public leader-board we got 98%.
Code Summary and Explanation of Steps
Data Exploration:
- Explore the data size and get an idea of what the images look like.
- Check the distribution of image sizes. Resizing of Images (Standardizing) might be required to speed up the process.
Models Tweaking:
- Run a quick model (a small number of epochs) with precompute=True, i.e. only updating the weights of the last layer.
- Evaluate the Performance by observing the train and validation loss and the overall accuracy.
- Explore the Images of the most correct/incorrect classifications to understand if there are any visible patterns/reasons of wrong classification. It helps to get more comfortable with what the model is doing.
- Find optimal Learning Rate using lr_find(). We want a learning rate where loss is improving.
- Train last layer from precomputed activations for 1-2 epochs.
- Use data augmentation and train the last layer again (cycle_len = 1).
- Unfreeze layers and retrain the model. Set the earlier layers to 3x-10x lower learning rate than next higher layer.
- Recheck the Learning Rate (lr_find).
- Train full network with cycle_mult=2 until over-fitting.
- Use Test time augmentation to get a better picture regarding the accuracy.