In this project, a CNN-LSTM encoder-decoder model was used to generate captions for images automatically. A complex deep learning model is used comprising of two components: a Convolutional Neural Network (CNN) that transforms an input image into a set of features - encoding the information in an image into a vector (known as embedding, which is a featurized representation of the image), and an RNN that turns those features into descriptive language.
Notebook 0:
Initializes the COCO API (the "pycocotools" library), used to access data from the MS COCO (Common Objects in Context) dataset.
Notebook 1:
uses the pycocotools, torchvision transforms, and NLTK to preprocess the images and the captions for network training. It also explores the EncoderCNN and DecoderRNN.
model.py:
Architecture and forward()
functions for encoder and decoder.
Notebook 2:
Selection of hyperparameter values and training. The hyperparameter selections are explained.
Notebook 3:
Using the trained model to generate captions for images in the test dataset.
data_loader.py:
Custom data loader for PyTorch combining the dataset and the sampler.
vocabulary.py:
Vocabulary constructor built from the captions in the training dataset.
This is the complete architecture of the image captioning model. The CNN encoder basically finds patterns in images and encodes it into a vector that is passed to the LSTM decoder that outputs a word at each time step to best describe the image. Upon reaching the <end>
token, the entire caption is generated and that is our output for that particular image.
The encoder is based on a Convolutional Neural Network that encodes an image into a featurized compact representation (in the form of an embedding).
The CNN-Encoder is a ResNet (Residual Network). These kind of networks help in diminishing the vanishing and exploding gradient problems. I used the ResNet-152 pre-trained model.
The CNN encoder is followed by a recurrent neural network (LSTM) that generates a corresponding sentence.
The feature vector is fed into the "DecoderRNN" (which is "unfolded" in time). Each word appearing as output at the top is fed back to the network as input (at the bottom) in a subsequent time step, until the entire caption is generated. The arrow pointing to the right that connects the LSTM boxes together represents hidden state information, which represents the network's "memory", also fed back to the LSTM at each time step.