Bootcamp Tech Blog #3: Monolithic approach

Figure 1: Key difference between monolithic approach and module-based approach
Figure 2: VQA grows out of image captioning and requires a higher level of image reasoning.
Figure 3: The flow of the monolithic approach. The answer generation process can take the form of a classifier that predicts a short answer from a predefined set, or a recurrent network that produces variable-length phrases. (Top) Features of the question and features of the whole image are combined. (Bottom) Different region-specific features of the image are selected, and attention mechanisms allow the output stage to focus on relevant parts of the image.
Figure 4: Flowchart of Neural-Image-QA.
Figure 5: Basic network architecture for VQA
Figure 6: Dual Net for real images (left) and abstract scenes (right)
Figure 7: Pre-training and fine-tuning VLP models for different vision-language tasks
Figure 8: LXMERT model with 3 encoders. NR, NL and NX are the number of cascaded layers in each encoder. ‘Self’ and ‘Cross’ are abbreviations for self-attention sub-layers and cross-attention sub-layers, respectively. ‘FF’ denotes a feed-forward sub-layer.
Figure 9: An example of an image-text pair (left) and visualizations in visual feature space and word embedding space (right). It can be seen that the word semantic space is more representative than the image region space.
Figure 10: OSCAR represents an image-text pair as a triple [word tokens, object tags, region features].
Figure 11: From left to right: Two basic attention units; Architecture of an MCA layer; Two deep co-attention learning strategies.
Figure 12: Overall flowchart of MCAN
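
The bottom path described in the Figure 3 caption (region-specific features, attention, then a classifier over a predefined answer set) can be illustrated with a small sketch. This is only a toy under stated assumptions: the dot-product attention score, the element-wise fusion, the linear classifier, and every name and number below are hypothetical choices made for illustration, not the design of any particular model shown in the figures.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(question_vec, region_feats):
    """Weight each image region by its similarity to the question
    (assumed dot-product score), then take the weighted average."""
    weights = softmax([dot(question_vec, r) for r in region_feats])
    dim = len(region_feats[0])
    attended = [sum(w * r[i] for w, r in zip(weights, region_feats))
                for i in range(dim)]
    return attended, weights

def predict_answer(question_vec, region_feats, answer_weights, answers):
    """Fuse question and attended image features, then classify
    over a predefined answer set."""
    attended, weights = attend(question_vec, region_feats)
    # Assumed fusion: element-wise product of the two feature vectors.
    fused = [q * v for q, v in zip(question_vec, attended)]
    # Assumed classifier: one linear score per candidate answer.
    probs = softmax([dot(w, fused) for w in answer_weights])
    best = max(range(len(answers)), key=lambda i: probs[i])
    return answers[best], probs, weights

# Toy inputs (hypothetical): a 2-d question vector and two image regions.
question_vec = [1.0, 0.0]
region_feats = [[2.0, 0.0], [0.0, 2.0]]
answer_weights = [[1.0, 0.0], [-1.0, 0.0]]
answers = ["yes", "no"]

answer, probs, weights = predict_answer(
    question_vec, region_feats, answer_weights, answers)
print(answer, weights)
```

In this toy setup the first region aligns with the question vector, so it receives the larger attention weight and dominates the fused feature. Real systems replace the hand-written vectors with learned image and question encoders, and the top path of Figure 3 corresponds to skipping `attend` and fusing a single whole-image feature instead.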



