My PhD thesis is divided into two parts. The subject of the first part was to use deep learning in order to devise methods that are able to semantically interpret the scene depicted in an image, i.e., to recognize the types of objects that make up a scene and to localize those objects in the scene. For example, given an image taken from the front view of a car, such a method should be able to estimate where the road is and where the pavement is in the scene it sees, as well as to localize and recognize objects of interest, such as other cars, humans, or obstacles on the road. The goal of the first part was to advance the state of the art in this very interesting and practical type of image understanding problem.
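To make the two outputs described above concrete, here is a minimal sketch using off-the-shelf torchvision models: an object detector that returns bounding boxes for objects of interest, and a semantic segmentation network that assigns a class label to every pixel. These are not the models developed in the thesis; the model choices, the image file name, and the use of a recent torchvision release with the `weights` argument are assumptions made purely for illustration.

```python
import torch
import torchvision
from torchvision import transforms
from PIL import Image

# Hypothetical front-view image from a car-mounted camera.
image = Image.open("street_scene.jpg").convert("RGB")
tensor = transforms.ToTensor()(image)  # 3 x H x W, values in [0, 1]

# 1) Object detection: one bounding box, class label, and score per object.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()
with torch.no_grad():
    detections = detector([tensor])[0]   # dict with 'boxes', 'labels', 'scores'
print(detections["boxes"].shape)         # (num_objects, 4) box coordinates

# 2) Semantic segmentation: one class label per pixel (road, pavement, ...).
segmenter = torchvision.models.segmentation.deeplabv3_resnet50(weights="DEFAULT")
segmenter.eval()
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
with torch.no_grad():
    logits = segmenter(normalize(tensor).unsqueeze(0))["out"]  # (1, classes, H, W)
label_map = logits.argmax(dim=1)[0]      # (H, W) per-pixel class ids
```

The per-pixel label map and the list of detected boxes are exactly the two kinds of scene interpretation that the first part of the thesis is concerned with.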
Deep learning-based image understanding models, such as those I developed for the first part of my thesis, have proven very successful. However, they have a major limitation: in order to successfully learn to perform such image understanding tasks, they require millions of manually annotated training images. By manual annotation I mean that, for each training image, a human must specify the desired output that an image understanding system should produce for that image. So, in the scene understanding case, the human must annotate the objects in the image with bounding boxes and pixel-wise labels. This is a very tedious and error-prone task that can take several minutes per image. Therefore, as you can understand, it is very difficult and expensive to deploy deep learning-based systems in real-life applications, such as self-driving cars or automatic diagnosis from medical images. So, the goal of the second part of my thesis was to explore and propose methods that make it possible to apply deep learning to image understanding problems using only a very limited amount of manually annotated training data.
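To make the annotation burden concrete, here is a sketch of what the manual annotation of a single training image amounts to: a dense pixel-wise label map plus one bounding box and class label per object instance. The field names, class ids, and image size below are illustrative assumptions, not the schema of any particular dataset.

```python
import numpy as np

# Hypothetical class vocabulary for a driving scene.
CLASS_IDS = {"road": 0, "pavement": 1, "car": 2, "person": 3}

annotation = {
    "image": "street_scene.jpg",  # hypothetical file name
    # Pixel-wise labels: one class id per pixel, drawn by hand as polygons/masks.
    "label_map": np.zeros((1024, 2048), dtype=np.uint8),
    # Object instances: axis-aligned boxes (x_min, y_min, x_max, y_max) plus class.
    "objects": [
        {"box": (312, 410, 540, 590), "class": CLASS_IDS["car"]},
        {"box": (905, 430, 960, 585), "class": CLASS_IDS["person"]},
    ],
}
```

Producing such an annotation for every one of millions of training images is what makes fully supervised training so expensive, and reducing this requirement is the focus of the second part of the thesis.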