Bootcamp Tech Blog #3: Monolithic approach
by Quanglong in Cinnamon Student AI Bootcamp 2020
Tired of assembling modules for varied inputs? Let’s build a fixed architecture!
In the previous post, we discussed an approach to the VQA problem based on compositional models, which connects distinct modules designed for specific capabilities. In this blog, let’s delve into a more straightforward approach: the monolithic one.
The word “monolithic” points to the main difference from module-based architectures: compositional models construct their structure dynamically and flexibly to process varied inputs, whereas networks in monolithic methods remain unchanged.
Monolithic approaches are also known as “joint embedding approaches”: they use embedding techniques to extract visual and textual features and learn these embeddings in a common space. This again differs from compositional models, which decompose the question to construct an appropriate network.
Motivations and Groundbreaking Works
The monolithic approach did not appear out of nowhere. Every method builds on some foundation, and in this case the foundation comes from the problem of image captioning.
The concept of jointly embedding visual and textual content was first explored for the task of image captioning. The motivation behind this idea was the success of deep learning methods in both computer vision and natural language processing, which made it possible to learn cross-modal representations in a common feature space. This allows the model to learn interactions and perform inference over the two modalities, something a VQA system requires to a greater extent than image captioning does.
For visual feature extraction, pre-trained CNN models are prevalent; obvious examples include VGGNet, ResNet, and GoogLeNet. There are two main ways of extracting image features: regional extraction and global extraction. Nowadays, the former is more popular because it supports a higher level of reasoning. The feature extraction techniques for question embedding and representation, however, are much more varied, thanks to the development of different sophisticated methods; examples include bag-of-words (BOW), long short-term memory (LSTM) encoders, gated recurrent units (GRU), and skip-thought vectors. The word embeddings are then fed to a recurrent neural network to capture syntactic patterns and handle variable-length sequences. Most algorithms now treat VQA as a classification problem, while a few generate the answer with recurrent neural networks.
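To make this concrete, here is a minimal PyTorch sketch of such a joint-embedding baseline, assuming a frozen pre-trained ResNet for global image features, an LSTM question encoder, element-wise fusion, and a classifier over a fixed answer vocabulary. All names and dimensions are illustrative, not taken from a specific paper:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class JointEmbeddingVQA(nn.Module):
    """Minimal joint-embedding baseline: CNN image features + LSTM question
    features, fused in a common space and classified over a fixed answer set."""
    def __init__(self, vocab_size, num_answers, embed_dim=300, hidden_dim=1024):
        super().__init__()
        # Global visual features from a pre-trained CNN (ResNet-152 here).
        resnet = models.resnet152(weights="IMAGENET1K_V1")
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])  # drop the fc layer
        self.img_proj = nn.Linear(2048, hidden_dim)
        # Question encoder: word embeddings fed to an LSTM.
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Treat VQA as classification over the most frequent answers.
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, image, question_tokens):
        with torch.no_grad():                      # keep the CNN frozen
            v = self.cnn(image).flatten(1)         # (B, 2048) global image feature
        v = torch.tanh(self.img_proj(v))           # (B, hidden_dim)
        _, (h, _) = self.lstm(self.embed(question_tokens))
        q = h[-1]                                  # (B, hidden_dim) question feature
        joint = v * q                              # element-wise fusion in the common space
        return self.classifier(joint)              # answer logits
```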
The main difference between models lies in how the embeddings from the two modalities are integrated. However, due to the lack of explicit alignment information between visual and textual content, learning cross-modal representations becomes a weakly-supervised learning task. Different methods have been proposed to ease the learning of image-text semantic alignments.
Early attempts. (Malinowski et al., ICCV, 2015) proposed “Neural-Image-QA”, a recurrent neural network using Long Short-Term Memory (LSTM). Question and image features are fed together into a first “encoder” LSTM, producing a fixed-size feature vector that is then passed to a second “decoder” LSTM. The decoder generates variable-length answers, one word per recurrent iteration. During answer generation, the previously predicted word is fed back into the LSTM until a special <END> symbol is predicted. Several variants of this method were proposed, such as the “VIS+LSTM” model (Ren et al., 2015).
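As an illustration of the decoder side, here is a hedged PyTorch sketch of this word-by-word generation loop; the token ids, dimensions, and greedy decoding are assumptions made for the example, not details from the paper:

```python
import torch
import torch.nn as nn

class AnswerDecoder(nn.Module):
    """Sketch of a Neural-Image-QA style decoder: an LSTM that emits one answer
    word per step, fed with its own previous prediction, until <END> appears."""
    def __init__(self, vocab_size, hidden_dim=512, end_token=2, max_len=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.cell = nn.LSTMCell(hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)
        self.end_token, self.max_len = end_token, max_len

    @torch.no_grad()
    def generate(self, joint_feature, start_token=1):
        """joint_feature: (B, hidden_dim) fixed-size vector from the encoder LSTM."""
        h, c = joint_feature, torch.zeros_like(joint_feature)
        word = torch.full((joint_feature.size(0),), start_token,
                          dtype=torch.long, device=joint_feature.device)
        answer = []
        for _ in range(self.max_len):
            h, c = self.cell(self.embed(word), (h, c))
            word = self.out(h).argmax(dim=-1)      # greedy pick of the next word
            answer.append(word)
            if (word == self.end_token).all():     # stop once <END> is predicted
                break
        return torch.stack(answer, dim=1)          # (B, T) generated answer tokens
```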
(Fukui et al., 2016) proposed a pooling method called “Multimodal Compact Bilinear pooling” (MCB) to jointly embed visual and textual features. They randomly project the two features into a higher-dimensional space and then convolve the resulting vectors, carrying out the convolution as element-wise multiplication in Fourier space for efficiency. (Kim et al., NIPS, 2016) extended deep residual learning to the visual question answering task, proposing Multimodal Residual Networks (MRN) to learn cross-modal representations.
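Below is a rough sketch of how MCB pooling can be implemented: each feature vector is projected with a Count Sketch, and the two sketches are convolved via element-wise multiplication of their FFTs. The output dimension (16000) and the way the hash functions are generated here are illustrative choices; in a real model the hashes would be sampled once and kept fixed.

```python
import torch

def count_sketch(x, h, s, d):
    """Project x (B, n) into d dims with a Count Sketch: input index i is
    hashed to bucket h[i] with a random sign s[i]."""
    sketch = x.new_zeros(x.size(0), d)
    sketch.index_add_(1, h, x * s)               # scatter-add signed values into buckets
    return sketch

def mcb_pooling(v, q, d=16000, seed=0):
    """Sketch of Multimodal Compact Bilinear pooling: sketch both features, then
    circular-convolve them via element-wise multiplication in the Fourier domain."""
    g = torch.Generator().manual_seed(seed)      # fixed hashes for reproducibility
    h_v = torch.randint(0, d, (v.size(1),), generator=g)
    s_v = torch.randint(0, 2, (v.size(1),), generator=g).float() * 2 - 1
    h_q = torch.randint(0, d, (q.size(1),), generator=g)
    s_q = torch.randint(0, 2, (q.size(1),), generator=g).float() * 2 - 1
    fft_v = torch.fft.rfft(count_sketch(v, h_v, s_v, d))
    fft_q = torch.fft.rfft(count_sketch(q, h_q, s_q, d))
    return torch.fft.irfft(fft_v * fft_q, n=d)   # (B, d) fused bilinear feature
```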
(Saito et al., 2015) proposed “DualNet”, which attempts to fully exploit the discriminative information of two feature streams. Most previous attempts used only one method to fuse the features, yet different fusing methods carry different information. This motivated them to integrate two kinds of operations, namely element-wise summation and element-wise multiplication.
DualNet is applicable to both real images and abstract scenes, while most previous works tackled only one domain, since the two domains have fundamentally different characteristics.
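A minimal sketch of this dual-fusion idea, with assumed layer names and dimensions, might look like this:

```python
import torch
import torch.nn as nn

class DualFusion(nn.Module):
    """Sketch of the DualNet idea: fuse image and question features with both
    element-wise summation and multiplication, then combine the two streams."""
    def __init__(self, dim=1024, num_answers=3000):
        super().__init__()
        self.sum_proj = nn.Linear(dim, dim)
        self.mul_proj = nn.Linear(dim, dim)
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, v, q):
        added = torch.relu(self.sum_proj(v + q))        # additive stream
        multiplied = torch.relu(self.mul_proj(v * q))   # multiplicative stream
        return self.classifier(added + multiplied)      # integrate both kinds of information
```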
Vision-language pre-training. Nowadays, large-scale pre-training methods that learn cross-modal representations from visual-textual pairs are becoming popular for vision-language tasks. This is motivated by the availability of large, high-quality datasets containing both visual and textual content. Recent studies on vision-language pre-training (VLP) have shown its effectiveness in learning generic representations from massive image-text data, and fine-tuning VLP models on task-specific data achieves state-of-the-art results on many vision-language tasks. These VLP models are based on multi-layer Transformers. However, a VLP model is normally very large and takes a long time to train.
(Tan et al., 2019) proposed a large-scale Transformer model called LXMERT, consisting of three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder. The first two encoders each focus on a single modality, aiming to learn intra-modality relationships, while the third learns cross-modality representations. After the embedding layers, image and question features are fed into the object relationship encoder and the language encoder, respectively. The outputs of these two encoders then become the input of the cross-modality encoder, whose output feature is finally used to predict the answer.
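For a feel of how such a model is used in practice, here is a small usage sketch based on the Hugging Face transformers implementation of LXMERT. The checkpoint name is the commonly used public one, and the random 36-region tensors stand in for features that would normally come from a pre-trained Faster R-CNN detector:

```python
import torch
from transformers import LxmertTokenizer, LxmertModel

tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
model = LxmertModel.from_pretrained("unc-nlp/lxmert-base-uncased")

question = tokenizer("What color is the cat?", return_tensors="pt")
visual_feats = torch.randn(1, 36, 2048)   # 36 detected regions, 2048-d features (placeholder)
visual_pos = torch.rand(1, 36, 4)         # normalized bounding-box coordinates (placeholder)

outputs = model(input_ids=question.input_ids,
                attention_mask=question.attention_mask,
                visual_feats=visual_feats,
                visual_pos=visual_pos)
# pooled_output is the cross-modality feature that a VQA head would classify over.
print(outputs.pooled_output.shape)        # torch.Size([1, 768])
```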
At the time of writing, the OSCAR model holds the top rank on the VQA-v2 dataset with an overall accuracy of 73.82%. It introduces object tags detected in images as anchor points to improve the learning of semantic alignments between images and texts, motivated by the observation that salient objects in an image can be detected accurately and are often mentioned in the paired text. In addition, visual regions are often over-sampled, noisy, and ambiguous compared with textual content.
MCAN. Another outstanding method is the Deep Modular Co-Attention Network (MCAN), which won the VQA Challenge 2019. The model consists of Modular Co-Attention (MCA) layers cascaded in depth, and these layers build on the Transformer’s attention mechanism. Each MCA layer models the self-attention of questions and images, as well as the question-guided attention of images, using a modular composition of two basic attention units: self-attention (SA) and guided-attention (GA).  Two alternative strategies for deep co-attention learning were proposed, namely stacking and encoder-decoder.
After the deep co-attention learning stage, the output image and question features already contain rich information about the attention weights over question words and image regions. These features are then passed through an attentional reduction module, followed by multimodal fusion and finally a classification stage. In experiments, six cascaded MCA layers with the encoder-decoder strategy appear to be optimal. MCAN is monolithic because its overall architecture remains unchanged across inputs, even though it is built from several components.
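To illustrate these building blocks, here is a simplified PyTorch sketch of the SA and GA units and one MCA layer in the stacking style; it omits the feed-forward sub-layers and other details of the full model, and the dimensions are assumptions for the example:

```python
import torch
import torch.nn as nn

class SA(nn.Module):
    """Self-Attention unit: attends within one modality (questions or images)."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        out, _ = self.attn(x, x, x)            # queries, keys, values all from x
        return self.norm(x + out)

class GA(nn.Module):
    """Guided-Attention unit: one modality attends to the other."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, y):
        out, _ = self.attn(x, y, y)            # x guided by y (question-guided image attention)
        return self.norm(x + out)

class MCALayer(nn.Module):
    """One Modular Co-Attention layer: SA on questions, SA then GA on images."""
    def __init__(self, dim=512):
        super().__init__()
        self.q_sa, self.v_sa, self.v_ga = SA(dim), SA(dim), GA(dim)

    def forward(self, q, v):
        q = self.q_sa(q)                       # question self-attention
        v = self.v_ga(self.v_sa(v), q)         # image self-attention, then question-guided attention
        return q, v
```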
Monolithic approaches are straightforward in principle and form the basis of most current approaches to VQA. In the next post, we will discuss and compare the two main approaches to VQA (monolithic and module-based). Stay tuned!
— — — — — —
About Quanglong: he’s an excellent candidate participating in Cinnamon Student AI Bootcamp 2020. His main focus in the Bootcamp is Computer Vision.
About “Cinnamon Student AI Bootcamp 2020: Ideas to Reality”: this is a scholarship program with a new format that gives young people in the AI/Deep Learning field a solid foundation to put their ideas into practice and develop their own product from scratch. More info: here.