C3-ConvolutionalNeuralNetworks.qmd

---
output: html_document
editor_options:
  chunk_output_type: console
---

# Convolutional Neural Networks (CNN)

```{r}
#| echo: false
#| include: false
#| results: false
reticulate::use_virtualenv("r-keras")
library(keras3)
library(tensorflow)
tf
tf$abs(3.)
```

The main purpose of convolutional neural networks is image recognition. (Sound can be understood as an image as well!) In a convolutional neural network, we have at least one convolution layer, additional to the normal, fully connected deep neural network layers.

Neurons in a convolution layer are connected only to a small spatially contiguous area of the input layer (*receptive field*). We use this structure (*feature map*) to scan the **entire** features / neurons (e.g. picture). Think of the feature map as a *kernel* or *filter* (or imagine a sliding window with weighted pixels) that is used to scan the image. As the name is already indicating, this operation is a convolution in mathematics. The kernel weights are optimized, but we use the same weights across the entire input neurons (*shared weights*).

The resulting (hidden) convolutional layer after training is called a *feature map*. You can think of the feature map as a map that shows you where the "shapes" expressed by the kernel appear in the input. One kernel / feature map will not be enough, we typically have many shapes that we want to recognize. Thus, the input layer is typically connected to several feature maps, which can be aggregated and followed by a second layer of feature maps, and so on.

You get one convolution map/layer for each kernel of one convolutional layer.

## Example MNIST

We will show the use of convolutional neural networks with the MNIST data set. This data set is maybe one of the most famous image data sets. It consists of 60,000 handwritten digits from 0-9.

To do so, we define a few helper functions:

```{r chunk_chapter5_9}
library(keras3)

rotate = function(x){ t(apply(x, 2, rev)) }

imgPlot = function(img, title = ""){
  col = grey.colors(255)
  image(rotate(img), col = col, xlab = "", ylab = "", axes = FALSE,
     main = paste0("Label: ", as.character(title)))
}
```

The MNIST data set is so famous that there is an automatic download function in Keras:

```{r chunk_chapter5_10}
data = dataset_mnist()
train = data$train
test = data$test
```

Let's visualize a few digits:

```{r chunk_chapter5_11}
oldpar = par(mfrow = c(1, 3))
.n = sapply(1:3, function(x) imgPlot(train$x[x,,], train$y[x]))
par(oldpar)
```

Similar to the normal machine learning workflow, we have to scale the pixels (from 0-255) to the range of $[0, 1]$ and one hot encode the response. For scaling the pixels, we will use arrays instead of matrices. Arrays are called tensors in mathematics and a 2D array/tensor is typically called a matrix.

```{r chunk_chapter5_12}
train_x = array(train$x/255, c(dim(train$x), 1))
test_x = array(test$x/255, c(dim(test$x), 1))
train_y = keras3::to_categorical(train$y)
test_y = keras3::to_categorical(test$y)
```

The last dimension denotes the number of channels in the image. In our case we have only one channel because the images are black and white.

Most times, we would have at least 3 color channels, for example RGB (red, green, blue) or HSV (hue, saturation, value), sometimes with several additional dimensions like transparency.

To build our convolutional model, we have to specify a kernel. In our case, we will use 16 convolutional kernels (filters) of size $2\times2$. These are 2D kernels because our images are 2D. For movies for example, one would use 3D kernels (the third dimension would correspond to time and not to the color channels).

::: panel-tabset
## Keras

```{r chunk_chapter5_13}
model = keras_model_sequential(shape(28L, 28L, 1L))
model |>
 layer_conv_2d(filters = 16L,
               kernel_size = c(2L, 2L), activation = "relu") |>
 layer_max_pooling_2d() |>
 layer_conv_2d(filters = 16L, kernel_size = c(3L, 3L), activation = "relu") |>
 layer_max_pooling_2d() |>
 layer_flatten() |>
 layer_dense(100L, activation = "relu") |>
 layer_dense(10L, activation = "softmax")
summary(model)
plot(model)
```

## Torch

Prepare/download data:

```{r chunk_chapter5_14}
library(torch)
library(torchvision)
torch_manual_seed(321L)
set.seed(123)

train_ds = mnist_dataset(
  ".",
  download = TRUE,
  train = TRUE,
  transform = transform_to_tensor
)

test_ds = mnist_dataset(
  ".",
  download = TRUE,
  train = FALSE,
  transform = transform_to_tensor
)
```

Build dataloader:

```{r chunk_chapter5_15}
train_dl = dataloader(train_ds, batch_size = 32, shuffle = TRUE)
test_dl = dataloader(test_ds, batch_size = 32)
first_batch = train_dl$.iter()
df = first_batch$.next()

df$x$size()
```

Build convolutional neural network: We have here to calculate the shapes of our layers on our own:

**We start with our input of shape (batch_size, 1, 28, 28)**

```{r chunk_chapter5_16}
sample = df$x
sample$size()
```

**First convolutional layer has shape (input channel = 1, number of feature maps = 16, kernel size = 2)**

```{r chunk_chapter5_17}
conv1 = nn_conv2d(1, 16L, 2L, stride = 1L)
(sample |> conv1())$size()
```

Output: batch_size = 32, number of feature maps = 16, dimensions of each feature map = $(27 , 27)$ Wit a kernel size of two and stride = 1 we will lose one pixel in each dimension... Questions:

-   What happens if we increase the stride?
-   What happens if we increase the kernel size?

**Pooling layer summarizes each feature map**

```{r chunk_chapter5_18}
(sample |> conv1() |> nnf_max_pool2d(kernel_size = 2L, stride = 2L))$size()
```

kernel_size = 2L and stride = 2L halfs the pixel dimensions of our image.

**Fully connected layer**

Now we have to flatten our final output of the convolutional neural network model to use a normal fully connected layer, but to do so we have to calculate the number of inputs for the fully connected layer:

```{r chunk_chapter5_19}
dims = (sample |> conv1() |>
          nnf_max_pool2d(kernel_size = 2L, stride = 2L))$size()
# Without the batch size of course.
final = prod(dims[-1]) 
print(final)
fc = nn_linear(final, 10L)
(sample |> conv1() |> nnf_max_pool2d(kernel_size = 2L, stride = 2L)
  |> torch_flatten(start_dim = 2L) |> fc())$size()
```

Build the network:

```{r chunk_chapter5_20, eval=FALSE}
net = nn_module(
  "mnist",
  initialize = function(){
    self$conv1 = nn_conv2d(1, 16L, 2L)
    self$conv2 = nn_conv2d(16L, 16L, 3L)
    self$fc1 = nn_linear(400L, 100L)
    self$fc2 = nn_linear(100L, 10L)
  },
  forward = function(x){
    x |>
      self$conv1() |>
      nnf_relu() |>
      nnf_max_pool2d(2) |>
      self$conv2() |>
      nnf_relu() |>
      nnf_max_pool2d(2) |>
      torch_flatten(start_dim = 2) |>
      self$fc1() |>
      nnf_relu() |>
      self$fc2()
  }
)
```
:::

We additionally used a pooling layer for downsizing the resulting feature maps. Without further specification, a $2\times2$ pooling layer is taken automatically. Pooling layers take the input feature map and divide it into (in our case) parts of $2\times2$ size. Then the respective pooling operation is executed. For every input map/layer, you get one (downsized) output map/layer.

As we are using the max pooling layer (there are sever other methods like the mean pooling), only the maximum value of these 4 parts is taken and forwarded further. Example input:

```         
1   2   |   5   8   |   3   6
6   5   |   2   4   |   8   1
------------------------------
9   4   |   3   7   |   2   5
0   3   |   2   7   |   4   9
```

We use max pooling for every field:

```         
max(1, 2, 6, 5)   |   max(5, 8, 2, 4)   |   max(3, 6, 8, 1)
-----------------------------------------------------------
max(9, 4, 0, 3)   |   max(3, 7, 2, 7)   |   max(2, 5, 4, 9)
```

So the resulting pooled information is:

```         
6   |   8   |   8
------------------
9   |   7   |   9
```

In this example, a $4\times6$ layer was transformed to a $2\times3$ layer and thus downsized. This is similar to the biological process called *lateral inhibition* where active neurons inhibit the activity of neighboring neurons. It's a loss of information but often very useful for aggregating information and prevent overfitting.

After another convolution and pooling layer, we flatten the output. This means that the following dense layer treats the previous layer as a full layer (so the dense layer is connected to all the weights from the last feature maps). You can think of this as transforming a matrix (2D) into a simple 1D vector. The full vector is then used. After flattening the layer, we can simply use our typical output layer.

::: panel-tabset
## Keras

The rest is as usual:

First we compile the model:

```{r chunk_chapter5_21}
model |>
  keras3::compile(
      optimizer = keras3::optimizer_adamax(0.01),
      loss = keras3::loss_categorical_crossentropy
  )
summary(model)
```

Then, we train the model:

```{r chunk_chapter5_22, eval=FALSE}
library(tensorflow)
library(keras3)

epochs = 5L
batch_size = 32L
model |>
  fit(
    x = train_x, 
    y = train_y,
    epochs = epochs,
    batch_size = batch_size,
    shuffle = TRUE,
    validation_split = 0.2
  )
```

## Torch

Train model:

```{r chunk_chapter5_23, eval=FALSE}
library(torch)
torch_manual_seed(321L)
set.seed(123)

model_torch = net()
opt = optim_adam(params = model_torch$parameters, lr = 0.01)

for(e in 1:3){
  losses = c()
  coro::loop(
    for(batch in train_dl){
      opt$zero_grad()
      pred = model_torch(batch[[1]])
      loss = nnf_cross_entropy(pred, batch[[2]], reduction = "mean")
      loss$backward()
      opt$step()
      losses = c(losses, loss$item())
    }
  )
  cat(sprintf("Loss at epoch %d: %3f\n", e, mean(losses)))
}
```

Evaluation:

```{r chunk_chapter5_24, eval=FALSE}
model_torch$eval()

test_losses = c()
total = 0
correct = 0

coro::loop(
  for(batch in test_dl){
    output = model_torch(batch[[1]])
    labels = batch[[2]]
    loss = nnf_cross_entropy(output, labels)
    test_losses = c(test_losses, loss$item())
    predicted = torch_max(output$data(), dim = 2)[[2]]
    total = total + labels$size(1)
    correct = correct + (predicted == labels)$sum()$item()
  }
)

mean(test_losses)
test_accuracy =  correct/total
test_accuracy
```
:::

## Example CIFAR

CIFAR10 is another famous image classification dataset. It consists of ten classes with colored images (see https://www.cs.toronto.edu/\~kriz/cifar.html).

```{r chunk_chapter5_24_cifar, eval=FALSE}
library(keras3)
data = keras3::dataset_cifar10()
train = data$train
test = data$test
image = train$x[1,,,]
image %>%  
 image_to_array() %>%
 `/`(., 255) %>%
 as.raster() %>%
 plot()
## normalize pixel to 0-1
train_x = array(train$x/255, c(dim(train$x)))
test_x = array(test$x/255, c(dim(test$x)))
train_y = keras3::to_categorical(train$y)
test_y = keras3::to_categorical(test$y)
model = keras_model_sequential(shape(32L, 32L,3L))
model |> 
 layer_conv_2d(input_shape = c(32L, 32L,3L),filters = 16L, kernel_size = c(2L,2L), activation = "relu") |> 
 layer_max_pooling_2d() |> 
 layer_dropout(0.3) |> 
 layer_conv_2d(filters = 16L, kernel_size = c(3L,3L), activation = "relu") |> 
 layer_max_pooling_2d() |> 
 layer_flatten() |> 
 layer_dense(10, activation = "softmax")
summary(model)
model |> 
 compile(
 optimizer = optimizer_adamax(),
 loss = keras3::loss_categorical_crossentropy
 )
early = callback_early_stopping(patience = 5L)
epochs = 1L
batch_size =20L
model |> fit(
 x = train_x, 
 y = train_y,
 epochs = epochs,
 batch_size = batch_size,
 shuffle = TRUE,
 validation_split = 0.2,
 callbacks = c(early)
)
```

## Exercise

::: {.callout-caution icon="false"}
#### Task: CNN for flower dataset

The next exercise is based on the flower dataset in the Ecodata package.

Follow the steps above and build your own convolutional neural network.

Finally, submit your predictions to the submission server. If you have extra time, take a look at kaggle and find the flower dataset challenge for specific architectures tailored for this dataset.

Tasks:

-   If you are unsure how do it, take a look at the solution and try to make the model more complex (e.g. add convolutional layers, regularization, etc.)
-   Take a look at this [notebook from kaggle](https://www.kaggle.com/code/rajmehra03/flower-recognition-cnn-keras) , try to copy their architecture (it is the same dataset but upsized (i.e. more pixels, the only difference is the input dimension))

Prepare data:

```{r}
library(tensorflow)
library(keras3)

train = EcoData::dataset_flower()$train/255
test = EcoData::dataset_flower()$test/255
labels = EcoData::dataset_flower()$labels
```

Plot flower:

```{r}
train[100,,,] |>
  image_to_array() |>
  as.raster() |>
  plot()
```

**Tip:** Take a look at the dataset chapter.

`r hide("Click here to see the solution for a minimal example")`

Build model:

```{r}
model = keras_model_sequential(shape(80L, 80L, 3L))
model |> 
  layer_conv_2d(filters = 4L, kernel_size = 2L) |> 
  layer_max_pooling_2d() |> 
  layer_flatten() |> 
  layer_dense(units = 5L, activation = "softmax")

### Model fitting ###

model |> 
  compile(loss = loss_categorical_crossentropy, 
          optimizer = optimizer_adamax(learning_rate = 0.01))

model |> 
  fit(x = train, y = keras::k_one_hot(labels, 5L))
```

Predictions:

```{r}
# Prediction on training data:
pred = apply(model |> predict(train), 1, which.max)
Metrics::accuracy(pred - 1L, labels)
table(pred)

# Prediction for the submission server:
pred = model |> predict(test) |> apply(1, which.max) - 1L
table(pred)
```

Submission:

```{r}
write.csv(data.frame(y = pred), file = "cnn.csv")
```

`r unhide()`
:::

## Advanced Training Techniques

### Data Augmentation

Having to train a convolutional neural network using very little data is a common problem. Data augmentation helps to artificially increase the number of images.

The idea is that a convolutional neural network learns specific structures such as edges from images. Rotating, adding noise, and zooming in and out will preserve the overall key structure we are interested in, but the model will see new images and has to search once again for the key structures.

Luckily, it is very easy to use data augmentation in Keras.

To show this, we will use our flower data set. We have to define a generator object (a specific object which infinitely draws samples from our data set). In the generator we can turn on the data augmentation.

::: panel-tabset
## Keras

```{r chunk_chapter5_25, eval=TRUE}
reticulate::use_virtualenv("r-keras")
library(keras3)
library(tensorflow)
library(tfdatasets)

data = EcoData::dataset_flower()
train = data$train/255
labels = data$labels


# data augemention model
aug = keras_model_sequential()
aug %>% 
  layer_random_flip() %>% 
  layer_random_rotation(factor = c(-0.5, 0.5)) %>% 
  layer_random_brightness(factor = c(-0.5, 0.5), value_range = c(0, 1))
aug


# Prepare datasets
dataset = tfdatasets::tensor_slices_dataset(list(train, to_categorical(labels)))
dataset <- dataset %>%
  dataset_map(function(x, y) list(aug(x), y))

batch_size <- 64

train_ds <- dataset |>
  dataset_batch(batch_size) |>
  dataset_prefetch(buffer_size = tf$data$AUTOTUNE)

# test!
batch <- train_ds |>
  dataset_take(1) |>
  as_iterator() |>
  iter_next()

as.array(batch[[1]][1,,,]) |>
  image_to_array() |>
  as.raster() |>
  plot()

# model fitting
model = keras_model_sequential(shape(80L, 80L, 3L))
model |> 
  layer_conv_2d(filters = 4L, kernel_size = 2L) |> 
  layer_max_pooling_2d() |> 
  layer_flatten() |> 
  layer_dense(units = 5L, activation = "softmax")


model |>
  keras3::compile(loss = keras3::loss_categorical_crossentropy,
                  optimizer = keras3::optimizer_adamax(learning_rate = 0.01))


model %>% fit(train_ds, epochs = 2L)

```

Using data augmentation we can artificially increase the number of images.

## Torch

In Torch, we have to change the transform function (but only for the train dataloader):

```{r chunk_chapter5_26, eval=FALSE}
library(torch)
torch_manual_seed(321L)
set.seed(123)

train_transforms = function(img){
  img |>
    transform_to_tensor() |>
    transform_random_horizontal_flip(p = 0.3) |>
    transform_random_resized_crop(size = c(28L, 28L)) |>
    transform_random_vertical_flip(0.3)
}

train_ds = mnist_dataset(".", download = TRUE, train = TRUE,
                         transform = train_transforms)
test_ds = mnist_dataset(".", download = TRUE, train = FALSE,
                        transform = transform_to_tensor)

train_dl = dataloader(train_ds, batch_size = 100L, shuffle = TRUE)
test_dl = dataloader(test_ds, batch_size = 100L)

model_torch = net()
opt = optim_adam(params = model_torch$parameters, lr = 0.01)

for(e in 1:1){
  losses = c()
  coro::loop(
    for(batch in train_dl){
      opt$zero_grad()
      pred = model_torch(batch[[1]])
      loss = nnf_cross_entropy(pred, batch[[2]], reduction = "mean")
      loss$backward()
      opt$step()
      losses = c(losses, loss$item())
    }
  )
  
  cat(sprintf("Loss at epoch %d: %3f\n", e, mean(losses)))
}

model_torch$eval()

test_losses = c()
total = 0
correct = 0

coro::loop(
  for(batch in test_dl){
    output = model_torch(batch[[1]])
    labels = batch[[2]]
    loss = nnf_cross_entropy(output, labels)
    test_losses = c(test_losses, loss$item())
    predicted = torch_max(output$data(), dim = 2)[[2]]
    total = total + labels$size(1)
    correct = correct + (predicted == labels)$sum()$item()
  }
)

test_accuracy =  correct/total
print(test_accuracy)
```
:::

### Transfer Learning {#sec-transfer}

Another approach to reduce the necessary number of images or to speed up convergence of the models is the use of transfer learning.

The main idea of transfer learning is that all the convolutional layers have mainly one task - learning to identify highly correlated neighboring features. This knowledge is then used for new tasks. The convolutional layers learn structures such as edges in images and only the top layer, the dense layer is the actual classifier of the convolutional neural network for a specific task. Thus, one could think that we could only train the top layer as classifier. To do so, it will be confronted by sets of different edges/structures and has to decide the label based on these.

Again, this sounds very complicated but it is again quite easy with Keras and Torch.

::: panel-tabset
## Keras

We will do this now with the CIFAR10 data set, so we have to prepare the data:

```{r chunk_chapter5_27, eval=TRUE}
library(tensorflow)
library(keras3)

data = keras3::dataset_cifar10()
train = data$train
test = data$test

rm(data)

image = train$x[5,,,]
image %>%
  image_to_array() %>% 
  `/`(., 255) %>%
  as.raster() %>%
  plot()

train_x = array(train$x/255, c(dim(train$x)))
test_x = array(test$x/255, c(dim(test$x)))
train_y = keras3::to_categorical(train$y)
test_y = keras3::to_categorical(test$y)

rm(train, test)
```

Keras provides download functions for all famous architectures/convolutional neural network models which are already trained on the imagenet data set (another famous data set). These trained networks come already without their top layer, so we have to set include_top to false and change the input shape.

```{r chunk_chapter5_28, eval=TRUE}
densenet = keras3::application_densenet201(include_top = FALSE,
                                   input_shape  = c(32L, 32L, 3L))
```

Now, we will not use a sequential model but just a "keras_model" where we can specify the inputs and outputs. Thereby, the output is our own top layer, but the inputs are the densenet inputs, as these are already pre-trained.

```{r chunk_chapter5_29, eval=TRUE}
model = keras3::keras_model(
  inputs = densenet$input,
  outputs = layer_flatten(
    layer_dense(densenet$output, units = 10L, activation = "softmax")
  )
)

# Notice that this snippet just creates one (!) new layer.
# The densenet's inputs are connected with the model's inputs.
# The densenet's outputs are connected with our own layer (with 10 nodes).
# This layer is also the output layer of the model.
```

In the next step we want to freeze all layers except for our own last layer. Freezing means that these are not trained: We do not want to train the complete model, we only want to train the last layer. You can check the number of trainable weights via summary(model).

```{r chunk_chapter5_30, eval=TRUE}
model |> keras3::freeze_weights(from = 1, to = length(model$layers) - 2)
summary(model)
```

And then the usual training:

```{r chunk_chapter5_31, eval=FALSE}
library(tensorflow)
library(keras3)

model |>
  keras3::compile(loss = keras3::loss_categorical_crossentropy, 
                 optimizer = optimizer_adamax())

model |>
  fit(
    x = train_x, 
    y = train_y,
    epochs = 1L,
    batch_size = 32L,
    shuffle = TRUE,
    validation_split = 0.2
  )
```

We have seen, that transfer learning can easily be done using Keras.

## Torch

```{r chunk_chapter5_32, eval=FALSE}
library(torchvision)
library(torch)
torch_manual_seed(321L)
set.seed(123)

train_ds = cifar10_dataset(".", download = TRUE, train = TRUE,
                           transform = transform_to_tensor)
test_ds = cifar10_dataset(".", download = TRUE, train = FALSE,
                          transform = transform_to_tensor)

train_dl = dataloader(train_ds, batch_size = 100L, shuffle = TRUE)
test_dl = dataloader(test_ds, batch_size = 100L)

model_torch = model_resnet18(pretrained = TRUE)

# We will set all model parameters to constant values:
model_torch$parameters |>
  purrr::walk(function(param) param$requires_grad_(FALSE))

# Let's replace the last layer (last layer is named 'fc') with our own layer:
inFeat = model_torch$fc$in_features
model_torch$fc = nn_linear(inFeat, out_features = 10L)

opt = optim_adam(params = model_torch$parameters, lr = 0.01)

for(e in 1:1){
  losses = c()
  coro::loop(
    for(batch in train_dl){
      opt$zero_grad()
      pred = model_torch(batch[[1]])
      loss = nnf_cross_entropy(pred, batch[[2]], reduction = "mean")
      loss$backward()
      opt$step()
      losses = c(losses, loss$item())
    }
  )
  
  cat(sprintf("Loss at epoch %d: %3f\n", e, mean(losses)))
}

model_torch$eval()

test_losses = c()
total = 0
correct = 0

coro::loop(
  for(batch in test_dl){
    output = model_torch(batch[[1]])
    labels = batch[[2]]
    loss = nnf_cross_entropy(output, labels)
    test_losses = c(test_losses, loss$item())
    predicted = torch_max(output$data(), dim = 2)[[2]]
    total = total + labels$size(1)
    correct = correct + (predicted == labels)$sum()$item()
  }
)

test_accuracy =  correct/total
print(test_accuracy)
```
:::

### Example: Flower dataset

Let's do that with our flower data set:

```{r chunk_chapter5_33, eval=FALSE}
library(keras3)
library(tensorflow)

data = EcoData::dataset_flower()
train = data$train/255
labels = data$labels


# data augemention model
aug = keras_model_sequential()
aug %>% 
  layer_random_flip() %>% 
  layer_random_rotation(factor = c(-0.5, 0.5)) %>% 
  layer_random_brightness(factor = c(-0.5, 0.5), value_range = c(0, 1))
aug


# Prepare datasets
dataset = tfdatasets::tensor_slices_dataset(list(train, to_categorical(labels)))
dataset <- dataset %>%
  dataset_map(function(x, y) list(aug(x), y))

batch_size <- 64

train_ds <- dataset |>
  dataset_batch(batch_size) |>
  dataset_prefetch(buffer_size = tf$data$AUTOTUNE)

# Transfer learning

# weights were trained to imagenet
pretrained_model = keras3::application_convnext_small(include_top = FALSE,
                                                       input_shape = c(80L, 80L, 3L))
# pretrained_model

keras3::freeze_weights(pretrained_model)
pretrained_model

# Build model

model = keras_model(inputs = pretrained_model$input,
                    outputs = pretrained_model$output |> 
                      layer_flatten() |> 
                      layer_dropout(0.2) |> 
                      layer_dense(units = 5L, activation = "softmax")
)
model %>% freeze_weights(from = 1, to = length(model$layers)-1)

model |>
  keras3::compile(loss = loss_categorical_crossentropy,
                  optimizer = keras3::optimizer_rmsprop(learning_rate = 0.0005))


model |> 
  fit(train_ds, epochs = 5L)


```