Advanced Machine Learning
                      24: Convolutional Neural Networks
	            
	          
	          
	            Outline for the lecture
                    
                      -  History of CNNs
                      
-  Building Blocks
                      
-  Skip Connections
                      
-  Fully Convolutional Neural Nets
                      
-  Semantic Segmentation with Twists
                      
-  (even more) Advanced Uses of CNN
	            
Cat's brain 1962 (Hubel and Wiesel)
                     
                    
                  
                  
                    
Fukushima's Neocognitron 1979
                     
                    
                  
                  
                    
                    Time Delay Neural Network 1989
                     
                    
                    
                  
                  
                    
                    CNN 1989
                     
                    
                    
                  
                  
                    
                    CNN 1998
                     
                    
                    
                  
                  
                    
                    CNN+GPU+MaxPooling 2011
                     
                    
                    
                  
                  
                    
                    AlexNet 2012
                     
                    
                  
                  
                
                
	        
                  
                  
                    Convolving a kernel with an image
                    
                      
                         \[
                        \left(
                        \begin{array}{ccc}
                        0 & 1 & 2 \\
                        2 & 2 & 0 \\
                        0 & 1 & 2 \\
                        \end{array}
                        \right)
                        \]
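A minimal NumPy sketch of the sliding-window computation, using the 3×3 kernel from the slide (the 5×5 input is made up for illustration; note that deep-learning libraries compute cross-correlation, i.e. they slide the kernel without flipping it, and the same convention is used here):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image with no padding ("valid" mode)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Elementwise product of the kernel with the patch under it.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

kernel = np.array([[0., 1., 2.],
                   [2., 2., 0.],
                   [0., 1., 2.]])
image = np.arange(25.0).reshape(5, 5)  # illustrative 5x5 input
print(conv2d_valid(image, kernel))     # 3x3 output: a 3x3 kernel shrinks a 5x5 map by 2
```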
                      
                      
                         
                      
                    
                  
                  
                    Convolving a kernel with an image
                         
                    
                  
                  
                    Padding and symmetries
                    
                      
                         
                      
                      
                         
                      
                    
                  
                  
                    Padding and symmetries
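The effect of zero padding is easiest to see from the output-size formula; a minimal sketch (the sizes are illustrative):

```python
def output_size(n, k, p, s=1):
    """Spatial output size for input width n, kernel k, zero padding p, stride s."""
    return (n + 2 * p - k) // s + 1

print(output_size(5, 3, p=0))  # 3: without padding every layer shrinks the map
print(output_size(5, 3, p=1))  # 5: p = (k - 1) // 2 keeps the size ("same" padding)
```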
                         
                    
                  
                  
                    How do the channels look?
                      
                  
                  
                    Pooling: maxpooling
                         
                  
                  
                    Pooling: maxpooling
                         
                  
                  
                    Pooling: average
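Both pooling variants can be written as one NumPy sketch, assuming the common setup of non-overlapping windows (stride equal to the window size):

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping size x size pooling with stride equal to the window size."""
    h, w = x.shape
    x = x[:h - h % size, :w - w % size]  # drop rows/cols that do not fit a full window
    blocks = x.reshape(h // size, size, w // size, size)
    # Reduce each size x size block to a single number.
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

x = np.array([[1, 2, 5, 6],
              [3, 4, 7, 8],
              [0, 0, 2, 2],
              [0, 4, 2, 2]], float)
print(pool2d(x, 2, "max"))   # [[4. 8.] [4. 2.]]
print(pool2d(x, 2, "mean"))  # [[2.5 6.5] [1.  2. ]]
```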
                         
                  
                  
                    How do we produce a class prediction?
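One standard recipe, sketched here with PyTorch (the layer sizes are made up): collapse the final feature maps with global average pooling, then map the resulting vector to class scores with a linear layer.

```python
import torch
import torch.nn as nn

features = torch.randn(8, 64, 7, 7)  # (batch, channels, H, W) leaving the conv stack
head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),         # global average pooling -> (8, 64, 1, 1)
    nn.Flatten(),                    # -> (8, 64)
    nn.Linear(64, 10),               # -> (8, 10) scores for 10 classes
)
print(head(features).shape)          # torch.Size([8, 10])
```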
                  
                  
1×1 convolution
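A 1×1 kernel applies the same linear map across channels at every pixel, so it mixes or reduces channels without touching the spatial layout; a minimal PyTorch sketch (sizes are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 32, 32)          # 256 input channels
mix = nn.Conv2d(256, 64, kernel_size=1)  # 1x1 kernel: per-pixel linear map over channels
print(mix(x).shape)                      # torch.Size([1, 64, 32, 32]): spatial size untouched
```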
                    
                      
                      
                      
                         
                      
                    
                  
                  
                    
                    Upconvolution
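Upconvolution (transposed convolution) is a learnable upsampling; a PyTorch sketch with made-up sizes:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 16, 16)
up = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)  # learnable 2x upsampling
print(up(x).shape)  # torch.Size([1, 32, 32, 32]): spatial size doubled
```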
                         
                    
                  
                  
                    Dilated convolution
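Dilation inserts gaps between kernel taps, enlarging the receptive field without adding weights; a PyTorch sketch (sizes are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)
# A 3x3 kernel with dilation 2 covers the area of a 5x5 kernel
# but still has only 9 weights; padding 2 keeps the spatial size.
dilated = nn.Conv2d(16, 16, kernel_size=3, dilation=2, padding=2)
print(dilated(x).shape)  # torch.Size([1, 16, 32, 32])
```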
                         
                  
                  
                  
                  
                    Basic building blocks
                    
                      -  Convolution with a filter
                      
-  Zero Padding
                      
-  Channels and channel-kernel relationship
                      
-  Pooling (max and average)
                      
-  Moving from convolution layers to predictions
                      
-  1×1 convolution
                      
-  Upconvolution
                      
-  Dilated convolution
                    
Skip connections
                    
                  
                  
                    Dark knowledge
                    
                      
                      
                      
                         
                      
                    
                    
                    
                  
                  
Highway networks (May 2015 on arXiv)
                    
-  A plain layer:
  $$
  \vec{y} = H(\vec{x}, \bm{W}_H)
  $$

-  Add a transform gate $T$ and a carry gate $C$:
  $$
  \vec{y} = H(\vec{x}, \bm{W}_H) \odot T(\vec{x}, \bm{W}_T) + \vec{x} \odot C(\vec{x}, \bm{W}_C)
  $$

-  Couple the gates by setting $C = 1 - T$:
  $$
  \vec{y} = H(\vec{x}, \bm{W}_H) \odot T(\vec{x}, \bm{W}_T) + \vec{x} \odot (1 - T(\vec{x}, \bm{W}_T))
  $$

-  The limiting cases:
  $$
  \vec{y} =
  \left\{
  \begin{array}{ll}
  \vec{x} & \mbox{if }\;\;T(\vec{x}, \bm{W}_T)=0,\\
  H(\vec{x}, \bm{W}_H) & \mbox{if }\;\;T(\vec{x}, \bm{W}_T)=1
  \end{array}
  \right.
  $$
                      
-  What if an untrained gate stays open ($T \approx 1$), shutting off the identity path so gradients cannot flow through it?
                      
-  Initialize the gate biases to large negative values, so that $T \approx 0$ and each layer starts out close to the identity (see the sketch below)!
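A minimal PyTorch sketch of one highway layer with the coupled gate $C = 1 - T$ from above (the $-2$ bias and the ReLU inside $H$ are illustrative choices):

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """y = H(x) * T(x) + x * (1 - T(x))."""

    def __init__(self, dim, gate_bias=-2.0):
        super().__init__()
        self.H = nn.Linear(dim, dim)
        self.T = nn.Linear(dim, dim)
        # Large negative bias => T ~ 0 at the start, so the untrained
        # layer is close to the identity and gradients flow through x.
        nn.init.constant_(self.T.bias, gate_bias)

    def forward(self, x):
        t = torch.sigmoid(self.T(x))
        return torch.relu(self.H(x)) * t + x * (1 - t)

x = torch.randn(4, 128)
print(HighwayLayer(128)(x).shape)  # torch.Size([4, 128])
```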
                    
                  
This made it possible to train models with hundreds of layers, instead of the roughly ten that was feasible before.
                     
                    
                  
                  
                    Residual Networks (block)
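A residual block drops the gates entirely: the skip path is a plain identity added to the block's output. A minimal PyTorch sketch (the conv-BN-ReLU body is a common choice, but its exact layout is illustrative):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = F(x) + x: the skip path is an ungated identity."""

    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(self.body(x) + x)

x = torch.randn(1, 64, 32, 32)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```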
                    
                      
                      
                      
                         
                      
                    
                    
                  
                  
                    Residual Networks (full)
                     
                    
                  
                  
                    Residual Networks (performance)
                     
                    
                  
                  
                    Error surface effect of skip connection
                     
                    
                  
                  
                  
                    Dense Networks (architecture)
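In a dense block, every layer receives the concatenation of all earlier feature maps rather than their sum; a minimal PyTorch sketch (growth rate and depth are illustrative):

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer sees the concatenation of all previous feature maps."""

    def __init__(self, in_channels, growth=12, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Conv2d(in_channels + i * growth, growth, 3, padding=1)
            for i in range(num_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            # Append this layer's new features to everything seen so far.
            x = torch.cat([x, torch.relu(layer(x))], dim=1)
        return x

x = torch.randn(1, 16, 32, 32)
print(DenseBlock(16)(x).shape)  # torch.Size([1, 52, 32, 32]): 16 + 3 * 12 channels
```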
                     
                    
                  
                  
                  
                    Dense Networks (effect)
                     
                    
                  
                  
                  
                    Take Away Concepts
                    
                      -  Skip connections
                      
-  Gates
                    
Fully convolutional networks
                  
                  
                    The task of Semantic segmentation
                  
                  
                    Semantic segmentation task
                     
                    
                  
                  
Replacing feed-forward layers with convolutional ones
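The key observation: a fully connected layer over a k×k feature map is just a k×k convolution, so the same weights can slide over larger inputs and emit a score map. A PyTorch sketch with made-up sizes:

```python
import torch
import torch.nn as nn

# A fully connected head fixes the input size; the convolutional version
# computes the same function but runs on any spatial size, producing a
# coarse score map instead of a single vector.
fc_head = nn.Linear(256 * 7 * 7, 10)           # only accepts 7x7 feature maps
conv_head = nn.Conv2d(256, 10, kernel_size=7)  # same weights viewed as a 7x7 kernel

x_small = torch.randn(1, 256, 7, 7)
x_large = torch.randn(1, 256, 14, 14)
print(conv_head(x_small).shape)  # torch.Size([1, 10, 1, 1])
print(conv_head(x_large).shape)  # torch.Size([1, 10, 8, 8]): a score map
```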
                    
                  
                  
                    Fully Convolutional Model (2014)
                     
                    
                  
                  
                    Examples
                     
                    
                  
                  
                    Take Away Point
                    
-  When the target and the input have the same spatial dimensions, it may be better to use convolutions everywhere.
                    
(even more) "Advanced" uses of CNN
                  
                  
                  
                  
                    Wavenet: $\ge$16kHz audio
                     
                    
                  
                  
                    Wavenet: sample by sample
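Generating sample by sample requires causal (masked) convolutions: the output at time $t$ may only depend on inputs up to $t$. A PyTorch sketch of a causal dilated stack in the spirit of Wavenet (channel counts and depth are illustrative):

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """1-D convolution that never looks at future samples."""

    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation  # pad only on the left
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                        # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.pad, 0))
        return self.conv(x)

x = torch.randn(1, 32, 16000)                   # one second of audio at 16 kHz
stack = nn.Sequential(*[CausalConv1d(32, dilation=2 ** i) for i in range(8)])
print(stack(x).shape)  # torch.Size([1, 32, 16000]); receptive field: 2^8 = 256 samples
```

Doubling the dilation at every layer makes the receptive field grow exponentially with depth, which is what lets the network cover the thousands of past samples that raw audio requires.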
                     
                    
                  
                  
                    Wavenet: conditioned on text
                    
                      
                    
                      
                        | Model | "The blue lagoon..." | 
                      
                        | Parametric |  | 
                      
                        | Concatenative |  | 
                      
                        | Wavenet |  | 
                    
                      
                      
                    
                      
                        | Model | "English poetry and ..." | 
                      
                        | Parametric |  | 
                      
                        | Concatenative |  | 
                      
                        | Wavenet |  | 
                    
                      
                    
                    
                    
                  
                  
                    Deformable Convolutions
                     
                    
                  
                  
                    Deformable Convolutions
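A sketch using torchvision's DeformConv2d (assuming torchvision is available): a small convolution predicts a (dy, dx) offset for each of the 3×3 sampling locations at every output position, and the deformable convolution samples there instead of on the rigid grid.

```python
import torch
from torchvision.ops import DeformConv2d

x = torch.randn(1, 16, 32, 32)
# One (dy, dx) pair per kernel tap per output position: 2 * 3 * 3 = 18 channels.
offset_net = torch.nn.Conv2d(16, 2 * 3 * 3, kernel_size=3, padding=1)
deform = DeformConv2d(16, 16, kernel_size=3, padding=1)
out = deform(x, offset_net(x))
print(out.shape)  # torch.Size([1, 16, 32, 32])
```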
                     
                    
                  
                  
                  
                    Take Away Points
                    
                      -  Masked convolution
                      
-  Pixel-based generation
                      
-  Deformable convolution (can be rotation invariant)