After the Document Object Detection step, we already have the locations of the text lines and their content. The goal of the system is to provide an information extraction process for any type of administrative document: contracts, medical receipts, biological graphs, etc. This article explains how we do it.
Let us take the case of reading an invoice or receipt. The next step is a model that can classify the detected text lines into 5 predefined categories.
As humans, when deciding whether a component (a text line) belongs to a category, we take both its location and its textual content into account. We therefore need a model that carries both spatial and textual information.
Moreover, each value in an invoice can depend heavily on its neighbors.
A search for a good model
A possible solution is a segmentation-based architecture used as a pixel labeler (U-Net, DeepLab, Mask R-CNN, etc.). Our team proposed a “Multi-Stage Attentional U-Net” (MSAU) with exactly these characteristics, but the model was not robust enough across different scenarios due to data limitations.
While MSAU is good, a graph is generally better at detecting sparse keys while keeping the representation compact. For instance, when a key (say, Total) and its value (say, $10) are far apart, the graph generally fares better, provided the text lines have been extracted accurately.
In this article, we focus on the graph-based key-value detector, since it is a crucial component in making the Flax pipeline explainable.
Given the data-collection limitations and the need for a robust model, we adopt a more classical yet compact representation of receipts for this modern problem: the graph.
Graph representation of receipts
Using Graphs enables:
- Spatial information to be encoded in node connections
- Textual information to be represented by the content of the node itself
Given the output from Layout Detection and OCR, we can build the graph in which:
- left, right, top, and bottom connections are represented by adjacency matrices
- node features are represented by a feature vector of the content of each text line
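To make the construction concrete, here is a minimal NumPy sketch of building directional adjacency matrices and node features from OCR output. The `box`/`text` schema, the overlap heuristic, and the bag-of-characters featurizer are all illustrative assumptions, not the actual pipeline; in practice the node features would come from a proper text embedding.

```python
import numpy as np

# Hypothetical OCR output: each text line has a bounding box and its text.
lines = [
    {"box": (40, 10, 120, 30), "text": "Total"},   # (x1, y1, x2, y2)
    {"box": (140, 10, 190, 30), "text": "$10"},
    {"box": (40, 40, 100, 60), "text": "Coke"},
]

def horizontal_overlap(a, b):
    """Amount of x-axis overlap between two boxes."""
    return min(a[2], b[2]) - max(a[0], b[0])

def vertical_overlap(a, b):
    """Amount of y-axis overlap between two boxes."""
    return min(a[3], b[3]) - max(a[1], b[1])

n = len(lines)
# One adjacency matrix per direction: right, left, down, up.
adj = {d: np.zeros((n, n), dtype=np.float32)
       for d in ("right", "left", "down", "up")}

for i in range(n):
    for j in range(n):
        if i == j:
            continue
        bi, bj = lines[i]["box"], lines[j]["box"]
        # j is "right of" i if the boxes share a row and j starts after i ends
        if vertical_overlap(bi, bj) > 0 and bj[0] >= bi[2]:
            adj["right"][i, j] = 1.0
            adj["left"][j, i] = 1.0
        # j is "below" i if the boxes share a column and j starts below i
        if horizontal_overlap(bi, bj) > 0 and bj[1] >= bi[3]:
            adj["down"][i, j] = 1.0
            adj["up"][j, i] = 1.0

# Toy node features: a bag-of-characters vector stands in for a real
# text embedding (word vectors, a character CNN, etc.).
def char_features(text, dim=64):
    v = np.zeros(dim, dtype=np.float32)
    for c in text.lower():
        v[ord(c) % dim] += 1.0
    return v

features = np.stack([char_features(l["text"]) for l in lines])
```

With these three lines, "$10" ends up as the right-neighbor of "Total", and "Coke" as its down-neighbor, which is exactly the spatial structure the classifier will exploit.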
The class of each node depends heavily on its neighboring nodes.
For example, if “$10” appears right next to “Total”, it is most likely the total price of the receipt; if it sits next to “Coke” or “Burger”, it very well may not be.
Here’s the key: locality matters. Classifying a node must take all adjacent nodes into account. Since we want to exploit this locality, we aggregate over all adjacent neighbors with a graph convolution. Here is how:
Convolution on a graph done by matrix multiplication
Matrix multiplication is a mathematically and computationally convenient way to express convolution on graphs.
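As a sketch of what "convolution as matrix multiplication" means, here is one graph-convolution layer in the widely used Kipf–Welling form, written in plain NumPy. This is a generic GCN propagation rule assumed for illustration, not the actual model in the pipeline:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution layer: H' = ReLU(A_norm @ H @ W), where
    A_norm is the degree-normalized adjacency with self-loops added.
    Aggregating A_norm @ H mixes each node's features with those of
    all its adjacent neighbors — the "locality" step."""
    n = A.shape[0]
    A_hat = A + np.eye(n, dtype=A.dtype)        # add self-loops
    deg = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))    # D^{-1/2}
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt    # symmetric normalization
    return np.maximum(A_norm @ H @ W, 0.0)      # linear map + ReLU

# Tiny example: 3 nodes in a chain, 4-dim features, 2-dim output.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=np.float32)     # symmetric adjacency
H = rng.normal(size=(3, 4)).astype(np.float32)  # node feature vectors
W = rng.normal(size=(4, 2)).astype(np.float32)  # learned weights

out = gcn_layer(A, H, W)                        # shape (3, 2)
```

Stacking a few such layers lets information from farther-away nodes flow into each node's representation before the final per-node classification.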
This is a demo of key-value extraction using a GCN. We traverse the different fields to see which class each text line corresponds to (bounded by green boxes), along with all the neighboring nodes that explain that decision (bounded by red boxes). The “explainability” part of the decision will be covered in other articles on key-value extraction.
The key takeaway here is locality: the class of a text line is decided by taking into account both its position in the receipt and its own “meaning”. Using graph convolutional networks combines the best of both worlds!
Written by: Patrick
Consultant team: Marc, Sonny, Toni