Bootcamp Tech Blog #2: Modular Method in Visual Question Answering
By Khoa.N.A.Nguyen from Cinnamon Student AI Bootcamp 2020
In the previous blog, we captured a general view of the VQA problem. In this one, we will discuss a natural approach called the "modular method" through two models: Neural Module Networks and the Neuro-Symbolic Concept Learner.
Neural Module Networks
Instead of learning a single function that maps questions and images to answers, it is perhaps useful to consider VQA as a multitask learning process, in which each problem instance is associated with a different task (or question). A simple question requires only one piece of information in the image to answer, whereas a complex one needs multiple steps. Intuitively, if each of these broken-down sub-questions can be answered by some core module, and the combination of modules is specified, producing the final answer should be straightforward.
Figure: The overall architecture of Modular Networks for Visual Question Answering. The input question is first parsed and converted to a canonical form. The core modules are arranged dynamically based on the canonical form.
In the next sections, we will go into the details of how the input is encoded, how the modules are constructed, and how the questions are parsed in the two chosen architectures.
Model architecture
Figure 1: Overall architecture of Neural Module Networks.
The Neural Module Network approach begins by dynamically parsing the question into a canonical form (i.e., a dependency tree) (Figure 2). Given the question in this canonical form, a layout of modules is generated and used to construct a prediction pipeline. This pipeline then predicts the answer module by module in a bottom-up manner. The modules take image features as input and produce a vector that can be combined with the encoded question to determine the answer, framed as a classification problem over a built-in answer dictionary. An initial set of image features is produced by passing the input image through the convolutional part of LeNet, which is jointly trained with the LSTM question encoder and the question answering part of the model. The overall architecture is given in Figure 1.
Figure 2: An example of a tree of question modules.
Modules
We will define a small set of modules that can be combined to form most of the configurations necessary for VQA tasks. These modules operate on three basic data types: images, unnormalized attention matrices, and labels. Below are a few common modules used in neural module network-based approaches.
Attention. An attention module attend[c] convolves the input image features with a weight vector specific to the concept c, producing an unnormalized attention heatmap over instances of c. For example, the output of attend[dog] is a matrix whose entries should be large in regions of the image containing dogs and small everywhere else, as shown in Figure 3.
Figure 3: Attention module with an example of dog in Neural Module Networks.
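To make the idea concrete, here is a minimal PyTorch sketch of an attend[c] module, assuming the concept-specific weight vector is realized as a 1×1 convolution over the image feature map; the names and dimensions are illustrative rather than the paper's exact implementation:

```python
import torch
import torch.nn as nn

class Attend(nn.Module):
    """attend[c]: convolve image features with a weight vector specific
    to the concept c to produce an unnormalized attention heatmap."""
    def __init__(self, feature_dim: int):
        super().__init__()
        # A 1x1 convolution is a per-location dot product with the
        # concept-specific weight vector (e.g., for "dog").
        self.conv = nn.Conv2d(feature_dim, 1, kernel_size=1)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, feature_dim, H, W)
        # output heatmap: (batch, 1, H, W), unnormalized
        return self.conv(image_features)

# One instance is created per concept, e.g. attend[dog]:
attend_dog = Attend(feature_dim=512)
heatmap = attend_dog(torch.randn(1, 512, 14, 14))  # shape (1, 1, 14, 14)
```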
Re-attention. A re-attention module re-attend[c] consists of two fully connected layers with ReLU activation that map the current attention heatmap to another one appropriate for the task. For example, re-attend[above] should take an attention matrix and shift the regions of highest activation upward (Figure 4), while re-attend[not] should move attention away from the active areas.
Figure 4: Re-attention module with an example of above in Neural Module Networks.
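Unlike attend, a re-attend[c] module manipulates only attention maps, not image features. A minimal sketch, assuming the heatmap is flattened before the two fully connected layers (the spatial size and hidden width are illustrative):

```python
import torch
import torch.nn as nn

class ReAttend(nn.Module):
    """re-attend[c]: two fully connected layers with a ReLU in between
    that map one attention heatmap to another (e.g., shifted upward
    for re-attend[above])."""
    def __init__(self, height: int, width: int, hidden_dim: int = 256):
        super().__init__()
        self.height, self.width = height, width
        self.net = nn.Sequential(
            nn.Linear(height * width, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, height * width),
        )

    def forward(self, attention: torch.Tensor) -> torch.Tensor:
        # attention: (batch, 1, H, W) -> flatten, transform, reshape back
        b = attention.size(0)
        out = self.net(attention.view(b, -1))
        return out.view(b, 1, self.height, self.width)
```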
Combination. A combination module combine[c] stacks two attention maps and convolves over the stack, followed by a ReLU activation, to merge them into a single attention map. For example, combine[and] should be active only in regions that are active in both inputs, while combine[except] should be active where the first input is active and the second is inactive (Figure 5).
Figure 5: Combine module with an example of except in Neural Module Networks.
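Following the same pattern, a combine[c] module can treat the two incoming attention maps as a two-channel image; a minimal sketch (the kernel size is an assumption):

```python
import torch
import torch.nn as nn

class Combine(nn.Module):
    """combine[c]: stack two attention maps channel-wise and convolve
    over the stack, followed by a ReLU, to merge them into one map."""
    def __init__(self, kernel_size: int = 1):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)
        self.relu = nn.ReLU()

    def forward(self, att1: torch.Tensor, att2: torch.Tensor) -> torch.Tensor:
        # att1, att2: (batch, 1, H, W) -> merged map: (batch, 1, H, W)
        stacked = torch.cat([att1, att2], dim=1)  # (batch, 2, H, W)
        return self.relu(self.conv(stacked))
```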
Classification. A classification module classify[c], which consists of a fully connected layer followed by a softmax activation, maps an attention matrix and the input image to a distribution over the answer dictionary. For example, classify[where] should return a distribution over the regions and objects attended to, as illustrated in Figure 6.
Figure 6: Classification module with an example of where in Neural Module Networks.
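A classify[c] module has to look back at the image, not just the attention map. One plausible sketch, assuming attention-weighted average pooling of the image features before the fully connected layer (the pooling choice is our assumption, not necessarily the paper's):

```python
import torch
import torch.nn as nn

class Classify(nn.Module):
    """classify[c]: pool image features under the attention map, then a
    fully connected layer + softmax over the answer dictionary."""
    def __init__(self, feature_dim: int, num_answers: int):
        super().__init__()
        self.fc = nn.Linear(feature_dim, num_answers)

    def forward(self, attention: torch.Tensor,
                image_features: torch.Tensor) -> torch.Tensor:
        # attention: (batch, 1, H, W); image_features: (batch, D, H, W)
        weights = torch.softmax(attention.flatten(2), dim=-1)  # over locations
        feats = image_features.flatten(2)                      # (batch, D, H*W)
        pooled = (feats * weights).sum(dim=-1)                 # (batch, D)
        return torch.softmax(self.fc(pooled), dim=-1)          # answer dist.
```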
Measurement. In the measurement module measure[c], an attention matrix is fed through two fully connected layers, with a ReLU and a softmax as activation functions respectively, to produce a distribution over labels. Because the attention matrices passed between modules are unnormalized, the measure module is suitable for evaluating whether a detected object exists (Figure 7) or for counting objects.
Figure 7: Measurement module with an example of exists in Neural Module Networks.
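Since measure[c] inspects only the attention map itself, it can be sketched directly from the description above (the spatial size is again illustrative):

```python
import torch
import torch.nn as nn

class Measure(nn.Module):
    """measure[c]: map an unnormalized attention map to a distribution
    over labels, e.g. {yes, no} for measure[exists]."""
    def __init__(self, height: int, width: int,
                 hidden_dim: int = 256, num_labels: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(height * width, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_labels),
            nn.Softmax(dim=-1),
        )

    def forward(self, attention: torch.Tensor) -> torch.Tensor:
        # attention: (batch, 1, H, W) -> label distribution (batch, num_labels)
        return self.net(attention.flatten(1))
```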
Question parsing
After defining the set of modules, we need to combine them into a layout, which is specified by each question. The transformation consists of finding the modules necessary to answer a given question and the connections between them. This specific layout is then used as a modular network in the question answering process.
Parsing. The Stanford Parser is used to convert the natural language question into a simple nested symbolic form, from which a modular network can be derived by applying the layout rules below. For example, "what material are the boxes made of?" becomes what(material, box), and "is the red shape blue?" becomes is(and(red, blue)). The parser also performs basic lemmatization to reduce the sparsity of module instances; as shown in the example above, "boxes" is turned into "box".
Layout. All leaves become attend modules, all internal nodes become re-attend or combine modules depending on their arity, and root nodes become measure modules for yes/no questions and classify modules for all other question types, as in the sketch below.
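Putting the rules together, here is a hypothetical sketch of how a parsed question could be assembled and evaluated bottom-up with the modules above (the dictionary layout and function signatures are our own illustration):

```python
# Parses are nested tuples, e.g. ("is", ("and", "red", "blue")) for
# "is the red shape blue?". `mods` maps module type -> instance name ->
# module, e.g. mods["attend"]["red"] is an Attend instance.

def run_layout(parse, img, mods, yes_no, root=True):
    if isinstance(parse, str):                    # leaf -> attend[c]
        return mods["attend"][parse](img)
    head, *args = parse
    atts = [run_layout(a, img, mods, yes_no, root=False) for a in args]
    if root:                                      # root -> measure / classify
        if yes_no:
            return mods["measure"][head](atts[0])
        return mods["classify"][head](atts[0], img)
    if len(atts) == 1:                            # unary node -> re-attend[c]
        return mods["re_attend"][head](atts[0])
    return mods["combine"][head](*atts)           # binary node -> combine[c]
```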
Question encoding
Given the neural module network, it may seem that the answer can be predicted directly, without the LSTM question encoder. However, the encoder is essential for two reasons. First, the parser discards grammatical cues that do not significantly change the semantics of the question but might still affect the answer; for example, the answer to the question "What are on the right of the old man?" should be in plural form. Second, the question encoder helps the model produce reasonable answers for simple questions when the image data is missing or of low quality. For instance, it is reasonable to answer green to "What color is the tree?" but not pink or purple.
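As a rough sketch of this idea (the paper's exact fusion scheme may differ, so treat this purely as an illustration), the encoder output can be combined with the root module's scores before producing the final answer distribution:

```python
import torch
import torch.nn as nn

class AnswerFusion(nn.Module):
    """Hypothetical fusion of the module network output with an LSTM
    question encoding; the exact combination is our assumption."""
    def __init__(self, hidden_dim: int, num_answers: int, embed_dim: int = 300):
        super().__init__()
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim + num_answers, num_answers)

    def forward(self, question_embeddings, module_scores):
        # question_embeddings: (batch, seq_len, embed_dim) word vectors
        # module_scores: (batch, num_answers) from the root module
        _, (h, _) = self.encoder(question_embeddings)
        fused = torch.cat([h[-1], module_scores], dim=-1)
        return torch.softmax(self.fc(fused), dim=-1)
```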
Results
Here are some examples:
Figure 8: Successful cases of Neural Module Networks.
Figure 9: Failure cases of Neural Module Networks.
Neuro-Symbolic Concept Learner
Inspired by the way humans combine knowledge of concepts with the input image to answer a question, the Neuro-Symbolic Concept Learner maintains a built-in concept dictionary alongside object representations to guide the model toward the correct answer.
Model architecture
The visual parser converts the input image into vectors representing the features of all detected objects, while the semantic parser converts the natural language question into an executable program of connected operations. The symbolic program executor then runs this program over the visual representations and concept embeddings to derive the final answer. As with Neural Module Networks, we can think of the executable program as a tree and execute it from right to left (equivalently, from the leaves to the root). The overall architecture is given in Figure 10.
Figure 10: Overall architecture of Neuro-Symbolic Concept Learner.
Visual parser. The input image is sent to a pre-trained Mask R-CNN to detect the bounding boxes of all objects. Region-based and image-based features are then obtained by feeding the bounding boxes and the original image through a ResNet-34. Finally, the two are concatenated to form the feature representation of each object, as illustrated in Figure 11.
Figure 11: Visualization of object representations in Neuro-Symbolic Concept Learner.
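Below is a sketch of this recipe with off-the-shelf torchvision models. Note the substitutions: torchvision ships a Mask R-CNN with a ResNet-50 backbone (not the paper's setup), and the region features here come from simply cropping and resizing each box, so treat this as an approximation of the described pipeline:

```python
import torch
import torch.nn.functional as F
import torchvision

# Off-the-shelf detector for object proposals; ResNet-34 for 512-d features.
detector = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT").eval()
backbone = torchvision.models.resnet34(weights="DEFAULT")
backbone.fc = torch.nn.Identity()   # keep the pooled 512-d features
backbone.eval()

def object_representations(image: torch.Tensor) -> torch.Tensor:
    """image: (3, H, W), values in [0, 1]. Returns (num_objects, 1024):
    region-based features concatenated with image-based features."""
    with torch.no_grad():
        boxes = detector([image])[0]["boxes"]        # (N, 4) detected boxes
        img_feat = backbone(image.unsqueeze(0))      # (1, 512) whole image
        reps = []
        for x1, y1, x2, y2 in boxes.round().int().tolist():
            crop = image[:, y1:y2, x1:x2].unsqueeze(0)
            crop = F.interpolate(crop, size=(224, 224))  # resize the region
            region_feat = backbone(crop)             # (1, 512) per object
            reps.append(torch.cat([region_feat, img_feat], dim=-1))
    return torch.cat(reps, dim=0)                    # (N, 1024)
```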
Concept quantization. The purpose of this part is to determine the attribute concepts (e.g., shape and color) as well as the relational concepts (e.g., left and behind). Consider the ShapeOf attribute with the visual concepts Cube, Sphere, and Cylinder. ShapeOf is a function mapping object representations into the Shape space (different attributes belong to different spaces). Probabilities between ShapeOf(Object) and each visual concept embedding are then computed, via a shifted and scaled cosine similarity passed through a sigmoid, to detect the predicted Shape of the Object, as demonstrated in Figure 12.
Figure 12: Process of concept quantization in Neuro-Symbolic Concept Learner.
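A minimal sketch of this scoring scheme is below; the dimensions and the γ, τ values are illustrative defaults rather than the paper's exact hyperparameters:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptQuantizer(nn.Module):
    """Map an object representation into an attribute space (here Shape)
    and score it against each visual concept embedding."""
    def __init__(self, obj_dim, space_dim, concepts, gamma=0.2, tau=0.1):
        super().__init__()
        self.shape_of = nn.Linear(obj_dim, space_dim)  # the ShapeOf operator
        # One learnable embedding per visual concept (Cube, Sphere, ...).
        self.concepts = nn.Parameter(torch.randn(len(concepts), space_dim))
        self.gamma, self.tau = gamma, tau

    def forward(self, obj: torch.Tensor) -> torch.Tensor:
        # obj: (batch, obj_dim) -> concept probabilities (batch, n_concepts)
        v = F.normalize(self.shape_of(obj), dim=-1)
        c = F.normalize(self.concepts, dim=-1)
        cos = v @ c.t()                                # cosine similarities
        return torch.sigmoid((cos - self.gamma) / self.tau)

quantizer = ConceptQuantizer(256, 64, ["Cube", "Sphere", "Cylinder"])
probs = quantizer(torch.randn(2, 256))  # e.g. p(shape = Cube) per object
```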
Semantic parser. We build a domain-specific language (DSL) covering a collection of fundamental operations for VQA, similar to the modules in Neural Module Networks (e.g., filtering by certain concepts or querying an attribute, as shown in Figure 13). These operations share the same input and output interface, so the parser can compose programs of arbitrary complexity.
A bidirectional GRU encodes the input question into a fixed-length embedding, and a GRU-based decoder converts this embedding into a hierarchical program.
Figure 13: Example of an executable program in Neuro-Symbolic Concept Learner.
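Here is a minimal sketch of such an encoder-decoder, assuming a greedy decoder over a flat sequence of program tokens (a real parser would enforce the DSL's hierarchical structure):

```python
import torch
import torch.nn as nn

class SemanticParser(nn.Module):
    """Bidirectional GRU encoder + GRU decoder emitting program tokens."""
    def __init__(self, vocab_size, program_size, embed_dim=256, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.prog_embed = nn.Embedding(program_size, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden,
                              batch_first=True, bidirectional=True)
        self.decoder = nn.GRUCell(embed_dim, 2 * hidden)
        self.out = nn.Linear(2 * hidden, program_size)

    def forward(self, question_tokens, max_len=20):
        # question_tokens: (batch, seq_len) word indices
        _, h = self.encoder(self.embed(question_tokens))
        state = torch.cat([h[0], h[1]], dim=-1)   # merge both directions
        # Assume index 0 is a start-of-program token.
        token = torch.zeros(question_tokens.size(0), dtype=torch.long)
        program = []
        for _ in range(max_len):                  # greedy decoding
            state = self.decoder(self.prog_embed(token), state)
            token = self.out(state).argmax(dim=-1)
            program.append(token)
        return torch.stack(program, dim=1)        # (batch, max_len) tokens
```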
Symbolic program executor. The executor is a set of functional modules that realize the DSL operations over the object representations and concept embeddings. Intermediate results are kept as vectors of probabilities (one per object), so the execution is differentiable with respect to the visual representations. Figure 14 illustrates this process.
Figure 14: Example of executing a symbolic program in Neuro-Symbolic Concept Learner.
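Because the program state is just a probability per object, the core operations take only a few lines each; a minimal sketch under that representation (the operation names follow the spirit of the DSL, not its exact definition):

```python
import torch

def op_scene(num_objects: int) -> torch.Tensor:
    """scene: start with every object selected (probability 1)."""
    return torch.ones(num_objects)

def op_filter(mask: torch.Tensor, concept_probs: torch.Tensor) -> torch.Tensor:
    """filter(concept): keep each object to the degree it matches the
    concept; concept_probs comes from concept quantization."""
    return mask * concept_probs

def op_count(mask: torch.Tensor) -> torch.Tensor:
    """count: the expected number of selected objects."""
    return mask.sum()

# "How many red objects are there?" ~ count(filter(scene(), Red))
red_probs = torch.tensor([0.9, 0.1, 0.8])   # from the concept quantizer
answer = op_count(op_filter(op_scene(3), red_probs))  # tensor(1.8)
```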
Curriculum concept learning
Consider the process of human concept learning: a child progresses from fundamental concepts, such as object attributes, to relations between objects, and then to complex questions. Motivated by this process, a curriculum (shown in Figure 15) is built to help the concept dictionary cover the definitions necessary for the VQA task. With this curriculum, training requires no explicit concept-level supervision: the model learns merely by looking at images and reading paired questions and answers.
Figure 15: Curriculum concept learning in Neuro-Symbolic Concept Learner.
Results
Here are some examples:
Figure 16: Successful and failure cases of Neuro-Symbolic Concept Learner, respectively.
Conclusion
The common idea of the modular approach lies in a semantic parser that converts natural language questions into connected modules. The modules, each answering a simple sub-question, are executed from the innermost outward, which can potentially answer questions of any complexity.
In the next blog, we will discuss the monolithic method for the VQA task, which uses an end-to-end model to find a single function mapping questions and images to answers.
References
- Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural Module Networks. CVPR 2016.
- Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B. Tenenbaum, and Jiajun Wu. The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision. ICLR 2019.
About Khoa.N.A.Nguyen: a superb candidate participating in Cinnamon Student AI Bootcamp 2020. His main focus in the Bootcamp is Computer Vision.
About “Cinnamon Student AI Bootcamp 2020: Ideas to Reality”: this is a scholarship program with a new format that gives young people in the AI/Deep Learning field a solid foundation to put their ideas into practice and develop their own products from scratch. More info: here.