Radenko and Srdjan: Image Captioning PSI:ML 2018

Our project was about image captioning: building a machine learning algorithm that produces a one-sentence description of a given input image. We found computer vision to be a very interesting topic overall, and the idea of an algorithm that understands what is going on in an image sounded pretty amazing.

First of all, we did not invent an algorithm for solving this problem. During our research we found a couple of papers that solve it, analyzed them, and found deep learning to be the answer. Using a deep neural network pretrained on an image classification task, we extract a set of features for an image from one of its last fully connected (FC) layers.
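As a minimal sketch of this idea (with made-up layer sizes and random weights, not our actual network), taking the activation of a penultimate FC layer as the image's feature vector looks like this in plain NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a pretrained CNN: the convolutional part is abstracted
# away as an already-flattened activation map; two FC layers follow it.
flat_activations = rng.standard_normal(2048)                      # conv stack output
W1, b1 = rng.standard_normal((512, 2048)) * 0.01, np.zeros(512)   # penultimate FC
W2, b2 = rng.standard_normal((1000, 512)) * 0.01, np.zeros(1000)  # classifier head

hidden = np.maximum(0, W1 @ flat_activations + b1)  # penultimate FC layer (ReLU)
logits = W2 @ hidden + b2                           # class scores we discard

features = hidden  # the 512-d vector handed to the caption decoder
print(features.shape)  # (512,)
```

The key point is that the final classification layer is thrown away; the layer just before it is a compact description of the image.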

The second part of the problem required a recurrent neural network. The network receives the image features as input and processes them with the goal of predicting the next word of the caption. This is the classic sequence-to-sequence setup, except that in our case the first sequence (the image) is encoded into a feature vector. On top of this architecture we added a layer which “explains” which part of the image affects each word of the description. That layer is called an attention layer. In the end, our algorithm understood what is going on in the image. Of course, it made (pretty funny) mistakes, but in more than half of the cases it gave a meaningful result.
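A rough sketch of one attention step (additive attention over per-region image features; the shapes, names, and random weights here are illustrative, not our exact implementation):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)

L, D, H = 64, 512, 256                   # image regions, feature dim, decoder state dim
features = rng.standard_normal((L, D))   # one feature vector per image region (from the CNN)
h = rng.standard_normal(H)               # current decoder hidden state

# Additive attention: score each image region against the decoder state.
W_f = rng.standard_normal((H, D)) * 0.01
W_h = rng.standard_normal((H, H)) * 0.01
v = rng.standard_normal(H) * 0.01

scores = np.tanh(features @ W_f.T + W_h @ h) @ v  # (L,) one score per region
alpha = softmax(scores)                           # attention weights, sum to 1
context = alpha @ features                        # (D,) weighted sum of regions

# `context` is what the RNN uses (together with the previous word) to
# predict the next word; `alpha` shows which regions it "looked at".
print(alpha.sum(), context.shape)
```

Visualizing `alpha` over the image is what lets the model “explain” which region produced each word.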

In the image above, you can see the algorithm's output for an image we took ourselves.

We struggled at a couple of points during our work. We needed to do a lot of data preprocessing, which turned out to be harder than it sounds. TensorFlow is a great framework for machine learning projects, but we did not have much hands-on experience with it, so we hit some bugs and problems that took us quite some time to resolve.

We learned a lot during the project phase. First of all, we improved our hands-on experience and our TensorFlow skill set. Most importantly, we deepened our understanding of deep neural networks and recurrent neural networks.