import pickle,gzip,math,os,time,shutil,torch,matplotlib as mpl, numpy as np
import pandas as pd,matplotlib.pyplot as plt
from pathlib import Path
from torch import tensor
from torch.utils.data import DataLoader
from typing import Mapping
This is not my content; it's part of fast.ai's From Deep Learning Foundations to Stable Diffusion course. I've added some notes to help my own understanding, that's all. For the source, check the fast.ai course page.
Convolutions
::: {.cell 0=‘e’ 1=‘x’ 2=‘p’ 3=‘o’ 4=‘r’ 5=‘t’}
import torch
from torch import nn
from torch.utils.data import default_collate
from typing import Mapping
from miniai.training import *
from miniai.datasets import *
:::
fit
<function miniai.training.fit(epochs, model, loss_func, opt, train_dl, valid_dl)>
mpl.rcParams['image.cmap'] = 'gray'
path_data = Path('data')
path_gz = path_data/'mnist.pkl.gz'
with gzip.open(path_gz, 'rb') as f: ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(f, encoding='latin-1')
x_train, y_train, x_valid, y_valid = map(tensor, [x_train, y_train, x_valid, y_valid])
In the context of an image, a feature is a visually distinctive attribute. For example, the number 7 is characterized by a horizontal edge near the top of the digit, and a top-right to bottom-left diagonal edge underneath that.
It turns out that finding the edges in an image is a very common task in computer vision, and is surprisingly straightforward. To do it, we use a convolution. A convolution requires nothing more than multiplication and addition.
Understanding the Convolution Equations
To explain the math behind convolutions, fast.ai student Matt Kleinsmith came up with the very clever idea of showing CNNs from different viewpoints.
Here’s the input:
Here’s our kernel:
Since the filter fits in the image four times, we have four results:
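To make those four results concrete, here's a tiny sketch (my own example, with made-up numbers; only the shapes match the diagrams): a 2×2 kernel fits into a 3×3 image four times, so F.conv2d returns a 2×2 output.

import torch, torch.nn.functional as F
img    = torch.arange(9.).view(1,1,3,3)               # a toy 3x3 "image", shape batch x channel x h x w
kernel = torch.tensor([[1.,0.],[0.,-1.]])[None,None]  # a made-up 2x2 kernel
F.conv2d(img, kernel).shape                           # torch.Size([1, 1, 2, 2]) -- four results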
x_imgs = x_train.view(-1,28,28)
xv_imgs = x_valid.view(-1,28,28)
mpl.rcParams['figure.dpi'] = 30
im3 = x_imgs[7]
show_image(im3);
top_edge = tensor([[-1,-1,-1],
                   [ 0, 0, 0],
                   [ 1, 1, 1]]).float()
We’re going to call this our kernel (because that’s what fancy computer vision researchers call these).
show_image(top_edge, noframe=False);
The filter will take any window of size 3×3 in our images, and if we name the pixel values like this:
\[\begin{matrix} a1 & a2 & a3 \\ a4 & a5 & a6 \\ a7 & a8 & a9 \end{matrix}\]
it will return \(-a1-a2-a3+a7+a8+a9\).
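As a quick sanity check (a toy window I made up, not from the notebook): when dark pixels (near 0) are above and bright pixels (near 1) are below, i.e. the window sits on the top edge of a bright stroke, the result is at its maximum of 3.

window = tensor([[0.,0.,0.],
                 [0.,0.,0.],
                 [1.,1.,1.]])
(window * top_edge).sum()   # tensor(3.): -0-0-0 + 1+1+1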
df = pd.DataFrame(im3[:13,:23])
df.style.format(precision=2).set_properties(**{'font-size':'7pt'}).background_gradient('Greys')
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
2 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
3 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
4 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
5 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.15 | 0.17 | 0.41 | 1.00 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.68 | 0.02 | 0.00 |
6 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.17 | 0.54 | 0.88 | 0.88 | 0.98 | 0.99 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.62 | 0.05 |
7 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.70 | 0.98 | 0.98 | 0.98 | 0.98 | 0.99 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.23 |
8 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.43 | 0.98 | 0.98 | 0.90 | 0.52 | 0.52 | 0.52 | 0.52 | 0.74 | 0.98 | 0.98 | 0.98 | 0.98 | 0.23 |
9 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.02 | 0.11 | 0.11 | 0.09 | 0.00 | 0.00 | 0.00 | 0.00 | 0.05 | 0.88 | 0.98 | 0.98 | 0.67 | 0.03 |
10 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.33 | 0.95 | 0.98 | 0.98 | 0.56 | 0.00 |
11 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.34 | 0.74 | 0.98 | 0.98 | 0.98 | 0.05 | 0.00 |
12 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.36 | 0.83 | 0.96 | 0.98 | 0.98 | 0.98 | 0.80 | 0.04 | 0.00 |
(im3[3:6,14:17] * top_edge).sum()
tensor(2.9727)
(im3[7:10,14:17] * top_edge).sum()
tensor(-2.9570)
def apply_kernel(row, col, kernel): return (im3[row-1:row+2,col-1:col+2] * kernel).sum()
apply_kernel(4,15,top_edge)
tensor(2.9727)
[[(i,j) for j in range(5)] for i in range(5)]
[[(0, 0), (0, 1), (0, 2), (0, 3), (0, 4)],
[(1, 0), (1, 1), (1, 2), (1, 3), (1, 4)],
[(2, 0), (2, 1), (2, 2), (2, 3), (2, 4)],
[(3, 0), (3, 1), (3, 2), (3, 3), (3, 4)],
[(4, 0), (4, 1), (4, 2), (4, 3), (4, 4)]]
rng = range(1,27)
top_edge3 = tensor([[apply_kernel(i,j,top_edge) for j in rng] for i in rng])
show_image(top_edge3);
left_edge = tensor([[-1,0,1],
                    [-1,0,1],
                    [-1,0,1]]).float()
show_image(left_edge, noframe=False);
left_edge3 = tensor([[apply_kernel(i,j,left_edge) for j in rng] for i in rng])
show_image(left_edge3);
Convolutions in PyTorch
import torch.nn.functional as F
import torch
What to do if you have 2 months to complete your thesis? Use im2col.
Here’s a sample numpy implementation.
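The sample implementation referenced above appears as an image in the original; here is a rough reconstruction of the idea (my own sketch, not the course's code): every 3×3 window of the image becomes one column of a matrix, so the whole convolution turns into a single matrix multiplication.

def im2col(x, ks=3):
    # Gather every ks x ks window of a 2D image as a flattened column.
    h,w = x.shape
    cols = [x[i:i+ks, j:j+ks].reshape(-1)
            for i in range(h-ks+1) for j in range(w-ks+1)]
    return torch.stack(cols, dim=1)

im2col(im3.float()).shape   # torch.Size([9, 676]) -- same as F.unfold below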
inp = im3[None,None,:,:].float()
inp_unf = F.unfold(inp, (3,3))[0]
inp_unf.shape
torch.Size([9, 676])
w = left_edge.view(-1)
w.shape
torch.Size([9])
out_unf = w@inp_unf
out_unf.shape
torch.Size([676])
out = out_unf.view(26,26)
show_image(out);
%timeit -n 1 tensor([[apply_kernel(i,j,left_edge) for j in rng] for i in rng]);
7.14 ms ± 150 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit -n 100 (w@F.unfold(inp, (3,3))[0]).view(26,26);
27.2 µs ± 1.51 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit -n 100 F.conv2d(inp, left_edge[None,None])
15.7 µs ± 1.06 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
diag1_edge = tensor([[ 0,-1, 1],
                     [-1, 1, 0],
                     [ 1, 0, 0]]).float()
show_image(diag1_edge, noframe=False);
diag2_edge = tensor([[ 1,-1, 0],
                     [ 0, 1,-1],
                     [ 0, 0, 1]]).float()
show_image(diag2_edge, noframe=False);
xb = x_imgs[:16][:,None]
xb.shape
torch.Size([16, 1, 28, 28])
edge_kernels = torch.stack([left_edge, top_edge, diag1_edge, diag2_edge])[:,None]
edge_kernels.shape
torch.Size([4, 1, 3, 3])
batch_features = F.conv2d(xb, edge_kernels)
batch_features.shape
torch.Size([16, 4, 26, 26])
The output shape shows we gave 16 images in the mini-batch, 4 kernels, and 26×26 edge maps (we started with 28×28 images, but lost one pixel from each side as discussed earlier). We can see we get the same results as when we did this manually:
img0 = xb[1,0]
show_image(img0);
show_images([batch_features[1,i] for i in range(4)])
Strides and Padding
With appropriate padding, we can ensure that the output activation map is the same size as the original image.
With a 5×5 input, 4×4 kernel, and 2 pixels of padding, we end up with a 6×6 activation map.
If we add a kernel of size ks by ks (with ks an odd number), the necessary padding on each side to keep the same shape is ks//2.
We could move over two pixels after each kernel application. This is known as a stride-2 convolution.
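Putting padding and stride together, the output grid size follows the usual convolution-arithmetic formula (my own summary, easy to verify with F.conv2d): out = (n + 2*padding - ks) // stride + 1.

def conv_out_size(n, ks, stride=1, pad=0): return (n + 2*pad - ks)//stride + 1

conv_out_size(5,  ks=4, stride=1, pad=2)   # 6: the 5x5 input, 4x4 kernel, 2-pixel padding example above
conv_out_size(28, ks=3, stride=1, pad=1)   # 28: padding of ks//2 keeps the size unchanged
conv_out_size(28, ks=3, stride=2, pad=1)   # 14: a stride-2 convolution halves the grid size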
Creating the CNN
n,m = x_train.shape
c = y_train.max()+1
nh = 50
model = nn.Sequential(nn.Linear(m,nh), nn.ReLU(), nn.Linear(nh,10))
broken_cnn = nn.Sequential(
    nn.Conv2d(1,30, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(30,10, kernel_size=3, padding=1)
)
broken_cnn(xb).shape
torch.Size([16, 10, 28, 28])
::: {.cell 0=‘e’ 1=‘x’ 2=‘p’ 3=‘o’ 4=‘r’ 5=‘t’}
def conv(ni, nf, ks=3, stride=2, act=True):
    res = nn.Conv2d(ni, nf, stride=stride, kernel_size=ks, padding=ks//2)
    if act: res = nn.Sequential(res, nn.ReLU())
    return res
:::
Refactoring parts of your neural networks like this makes it much less likely you’ll get errors due to inconsistencies in your architectures, and makes it more obvious to the reader which parts of your layers are actually changing.
simple_cnn = nn.Sequential(
    conv(1 ,4),             #14x14
    conv(4 ,8),             #7x7
    conv(8 ,16),            #4x4
    conv(16,16),            #2x2
    conv(16,10, act=False), #1x1
    nn.Flatten(),
)
simple_cnn(xb).shape
torch.Size([16, 10])
x_imgs = x_train.view(-1,1,28,28)
xv_imgs = x_valid.view(-1,1,28,28)
train_ds,valid_ds = Dataset(x_imgs, y_train),Dataset(xv_imgs, y_valid)
::: {.cell 0=‘e’ 1=‘x’ 2=‘p’ 3=‘o’ 4=‘r’ 5=‘t’}
def_device = 'mps' if torch.backends.mps.is_available() else 'cuda' if torch.cuda.is_available() else 'cpu'

def to_device(x, device=def_device):
    if isinstance(x, torch.Tensor): return x.to(device)
    if isinstance(x, Mapping): return {k:v.to(device) for k,v in x.items()}
    return type(x)(to_device(o, device) for o in x)

def collate_device(b): return to_device(default_collate(b))
:::
from torch import optim
bs = 256
lr = 0.4
train_dl,valid_dl = get_dls(train_ds, valid_ds, bs, collate_fn=collate_device)
opt = optim.SGD(simple_cnn.parameters(), lr=lr)
loss,acc = fit(5, simple_cnn.to(def_device), F.cross_entropy, opt, train_dl, valid_dl)
0 0.3630618950843811 0.8875999997138977
1 0.16439641580581665 0.9496000003814697
2 0.24622697901725768 0.9316000004768371
3 0.25093305287361145 0.9335999998092651
4 0.13128829071521758 0.9618000007629395
opt = optim.SGD(simple_cnn.parameters(), lr=lr/4)
loss,acc = fit(5, simple_cnn.to(def_device), F.cross_entropy, opt, train_dl, valid_dl)
0 0.08451943595409393 0.9756999996185303
1 0.08082638642787933 0.9777999995231629
2 0.08050601842403411 0.9778999995231629
3 0.08200360851287841 0.9773999995231628
4 0.08405050563812255 0.9761999994277955
Understanding Convolution Arithmetic
In an input of size 64x1x28x28 the axes are batch,channel,height,width. This is often represented as NCHW (where N refers to batch size). TensorFlow, on the other hand, uses NHWC axis order (aka “channels-last”). Channels-last is faster for many models, so recently it’s become more common to see this as an option in PyTorch too.
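As a rough illustration (not part of the course code): in PyTorch, channels-last is exposed as a memory format rather than a different logical axis order, so indexing stays NCHW while the data is stored NHWC in memory.

x = torch.randn(16, 3, 28, 28)                   # logical layout is still NCHW
x_cl = x.to(memory_format=torch.channels_last)   # physically laid out as NHWC
x.stride(), x_cl.stride()                        # ((2352, 784, 28, 1), (2352, 1, 84, 3))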
We have 1 input channel, 4 output channels, and a 3×3 kernel.
simple_cnn[0][0]
Conv2d(1, 4, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
conv1 = simple_cnn[0][0]
conv1.weight.shape
torch.Size([4, 1, 3, 3])
conv1.bias.shape
torch.Size([4])
The receptive field is the area of an image that is involved in the calculation of a layer. conv-example.xlsx shows the calculation of two stride-2 convolutional layers using an MNIST digit. Here’s what we see if we click on one of the cells in the conv2 section, which shows the output of the second convolutional layer, and click trace precedents.
The blue highlighted cells are its precedents—that is, the cells used to calculate its value. These cells are the corresponding 3×3 area of cells from the input layer (on the left), and the cells from the filter (on the right). Click trace precedents again:
In this example, we have just two convolutional layers. We can see that a 7×7 area of cells in the input layer is used to calculate the single green cell in the Conv2 layer. This 7×7 area is the receptive field of that activation.
The deeper we are in the network (specifically, the more stride-2 convs we have before a layer), the larger the receptive field for an activation in that layer.
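As a rough sanity check (my own sketch, not from the spreadsheet), the receptive field can be tracked layer by layer: each layer adds (ks-1) times the product of all previous strides.

def receptive_field(layers):
    # layers is a list of (kernel_size, stride) pairs, from first to last.
    rf, jump = 1, 1
    for ks, stride in layers:
        rf += (ks-1) * jump
        jump *= stride
    return rf

receptive_field([(3,2), (3,2)])   # 7 -- matching the 7x7 area traced above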
Color Images
A colour picture is a rank-3 tensor:
from torchvision.io import read_image
im = read_image('images/grizzly.jpg')
im.shape
torch.Size([3, 1000, 846])
show_image(im.permute(1,2,0));
_,axs = plt.subplots(1,3)
for bear,ax,color in zip(im,axs,('Reds','Greens','Blues')): show_image(255-bear, ax=ax, cmap=color)
The results of applying the kernel to each input channel are then all added together, to produce a single number for each grid location, for each output feature.
We have ch_out filters like this, so in the end, the result of our convolutional layer will be a batch of images with ch_out channels.
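For example (a minimal sketch with made-up channel counts): a conv layer applied to a colour image has one ch_in x 3 x 3 kernel per output channel, and the per-channel results are summed into a single activation map per filter.

conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
conv.weight.shape              # torch.Size([8, 3, 3, 3]): 8 filters, each 3 channels x 3x3
conv(im[None].float()).shape   # torch.Size([1, 8, 1000, 846]): one summed map per filter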
Export -
import nbdev; nbdev.nbdev_export()