Deep Neural Networks & Image Captioning

Part 1: Helping our users find love on Badoo

Laura Mitchell
Feb 28, 2019 · 5 min read


Badoo is the largest dating network in the world, with millions of users across 190 countries who upload more than 10 million photos per day to our platform. These images provide us with a rich dataset from which we can derive a wealth of insights.

Our Data Science team use image captioning to describe what is in these images. Image captioning is essentially the process of generating a textual description of a picture’s content.

With these descriptions we are delivering insights to the business, which in turn helps us to improve the user experience. For example, if we are able to identify that two users are both keen tennis players, we can give them a helping hand in finding each other on Badoo. Ultimately, in this way we’re helping people across the world find love!

In this blog I will explain how the Badoo Data Science team use Deep Neural Networks to create these image captions.

CNN and LSTM architecture

The Deep Neural Network model we have in place is motivated by the ‘Show and Tell: A Neural Image Caption Generator’ paper. It uses a combination of a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) network.

In its simplest terms, the union of a CNN and an LSTM takes an image as input and outputs a description of what that image depicts. The CNN acts as an encoder and the LSTM as a decoder.

Role of the CNN

CNNs (or ConvNets) have proven very effective in image classification and object detection tasks. Whereas many other algorithms discard the spatial relationships between pixels, a CNN exploits the information in adjacent pixels, convolving and downsampling the image first. For the purposes of image captioning, we can take a CNN that has been pre-trained to identify objects in images. Popular pre-trained models for such tasks are VGG-16, ResNet-50 and InceptionV3.

It is the CNN that creates the feature vector from the image, which is called an embedding. The embedding from the CNN layer is then fed as an input into the LSTM. In other words, the encoded image is transformed to create the initial hidden state for the LSTM decoder.
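
As an illustration, here is a minimal sketch of this encoding step using a pre-trained InceptionV3 from Keras. This is not our production pipeline, and the file path is purely hypothetical:

```python
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.preprocessing import image

# Load InceptionV3 pre-trained on ImageNet, dropping the classification head
# so the network outputs a feature vector (embedding) rather than class scores.
encoder = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

def encode_image(path):
    """Return a 2048-dimensional embedding for the image at `path`."""
    img = image.load_img(path, target_size=(299, 299))   # InceptionV3 input size
    x = image.img_to_array(img)
    x = preprocess_input(np.expand_dims(x, axis=0))       # scale pixels to [-1, 1]
    return encoder.predict(x)[0]                          # shape: (2048,)

embedding = encode_image("profile_photo.jpg")             # hypothetical file path
```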

Role of the LSTM

LSTMs are a type of Recurrent Neural Network (RNN) and, unlike many other deep learning architectures, RNNs can form a much deeper understanding of a sequence as they keep a memory of the previous elements. They are able to consider a past sequence of events in order to make a prediction or classification about the next event. For image captioning, the LSTM takes the extracted features from the CNN and produces a sequence of words that describe what was in the image.

At each output timestep the LSTM generates a new word in the sequence.
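
To make this concrete, here is a minimal sketch of such a decoder in Keras. All of the dimensions (vocabulary size, embedding size, caption length) are illustrative rather than the values we use in production:

```python
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model

vocab_size = 10000   # illustrative vocabulary size
embed_dim = 256
lstm_units = 512
max_len = 20         # maximum caption length

# Image branch: project the 2048-d CNN embedding into the LSTM's state size.
img_input = Input(shape=(2048,))
init_state = Dense(lstm_units, activation="relu")(img_input)

# Text branch: the caption generated so far, as a sequence of word indices.
cap_input = Input(shape=(max_len,))
cap_embed = Embedding(vocab_size, embed_dim, mask_zero=True)(cap_input)

# Decoder LSTM: its initial state comes from the image embedding; at each
# timestep it consumes the previous words and predicts the next one.
lstm_out = LSTM(lstm_units)(cap_embed, initial_state=[init_state, init_state])
next_word = Dense(vocab_size, activation="softmax")(lstm_out)

decoder = Model(inputs=[img_input, cap_input], outputs=next_word)
decoder.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
```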

Beam search

LSTMs can use the beam search method to construct sentences. At each decoding step, a linear layer transforms the decoder’s output into a score for each word in the vocabulary. A greedy approach would be to simply choose the word with the highest score at every step. However, this is suboptimal: if the first word is incorrect, the rest of the sequence will hinge on that mistake. Beam search, on the other hand, is not greedy in the sense that it does not commit to a single word at any point in the sequence until it has finished decoding.

At each iteration, the previous state of the LSTM is used to generate the softmax vector over the vocabulary. The most probable words are kept and carried forward to the next inference step.

More specifically, at the first decoding step the top k candidate words are kept. For each of those k words, the top k second words are then generated, and only the k highest-scoring partial sequences are retained. This process is repeated at each time step until the sequences terminate, and the sequence with the best overall score is chosen.
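
The sketch below shows the idea in plain Python. Here `predict_next`, `start_id` and `end_id` are hypothetical placeholders for the decoder’s next-word log-probability function and the special start/end tokens:

```python
import numpy as np

def beam_search(predict_next, start_id, end_id, beam_width=3, max_len=20):
    """Keep the `beam_width` highest-scoring partial captions at each step.

    `predict_next(seq)` is assumed to return log-probabilities over the
    vocabulary for the next word, given the partial sequence `seq`.
    """
    beams = [([start_id], 0.0)]          # (sequence, cumulative log-probability)
    completed = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            log_probs = predict_next(seq)
            top_ids = np.argsort(log_probs)[-beam_width:]
            for word_id in top_ids:
                candidates.append((seq + [int(word_id)], score + log_probs[word_id]))
        # Keep only the best `beam_width` partial sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            (completed if seq[-1] == end_id else beams).append((seq, score))
        if not beams:
            break
    completed.extend(beams)
    return max(completed, key=lambda c: c[1])[0]   # best-scoring sequence
```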

Datasets and training

There are many open source datasets containing images and their captions that can be used for training your model. Popular choices for such tasks are the MS-COCO and Flickr 8k datasets.

During training, the network back-propagates the errors and updates its weights based on the gradient of the loss function.
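
Assuming a decoder model like the sketch above and some hypothetical training arrays, a minimal Keras training call might look like this:

```python
# Hypothetical training data:
#   image_feats:  (N, 2048) CNN embeddings
#   partial_caps: (N, max_len) captions generated so far, as word indices
#   next_words:   (N,) index of the word each example should predict next
decoder.fit(
    x=[image_feats, partial_caps],
    y=next_words,
    batch_size=64,
    epochs=10,
    validation_split=0.1,
)
```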

Model evaluation

To evaluate how similar the predicted sentence is to the ground truth, an evaluation metric such as the Bilingual Evaluation Understudy (BLEU) score can be considered. The fundamental concept of the BLEU score is that it considers the correspondence between the model’s output and that of a human.
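
For example, using NLTK’s implementation (the captions below are made up purely for illustration):

```python
from nltk.translate.bleu_score import sentence_bleu

# Hypothetical ground-truth captions (each a tokenised reference) and a prediction.
references = [
    ["two", "people", "playing", "tennis", "on", "a", "court"],
    ["a", "man", "and", "a", "woman", "playing", "tennis"],
]
candidate = ["two", "people", "playing", "tennis"]

# BLEU measures n-gram overlap between the prediction and the references;
# weights=(0.5, 0.5) restricts it to unigrams and bigrams (BLEU-2), since
# the example captions are short.
score = sentence_bleu(references, candidate, weights=(0.5, 0.5))
print(f"BLEU-2 score: {score:.3f}")
```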

Summary

In this blog, I gave an overview of how we have combined a CNN and a LSTM to generate textual descriptions of images, and in turn how we are helping our users connect with others with similar interests. Keep an eye out for part two of this article, which covers how the incorporation of attention networks can enhance the accuracy of our descriptions.

If cool data science projects like this one interest you, we are hiring, so get in touch and join the Badoo family! 🙂

Any suggestions/feedback are welcome, so please feel free to comment below.
