After the Document Object Detection step, we already have the locations of the text lines and their content. The goal of the system is to provide an information extraction process for any type of administrative document: contracts, medical receipts, biological graphs, etc. This article explains how we do it.
Let us take the case of reading an invoice or receipt. The next step is a model that can classify the detected text lines into 5 predefined categories.
As humans, when deciding whether a component (a text line) belongs to a category, we take both its location and its textual content into account. We therefore need a model that carries both spatial and textual information.
Moreover, each value in an invoice can depend heavily on its neighbors.
A search for a good model
A possible solution is a segmentation-based architecture used as a pixel labeler (U-Net, DeepLab, Mask R-CNN, etc.). Our team proposed a “Multi-Stage Attentional U-Net” (MSAU) with exactly these characteristics, but the model was not robust enough across different scenarios due to data limitations.
While MSAU is good, a graph is generally better at detecting sparse keys while keeping the representation compact. For instance, when a key (say, Total) and its value (say, $10) are far apart, the graph generally fares better, provided the text lines have been extracted accurately.
In this article, we focus on the graph-based key-value detector, since it is a crucial component in making the Flax pipeline explainable.
Given the data-collection limitations and the need for a robust model, we adopt a more classical yet compact representation of receipts for this modern problem: the graph.
Graph representation of receipts
Using Graphs enables:
- Spatial information to be encoded in node connections
- Textual information to be represented by the content of the node itself
Given the output from Layout Detection and OCR, we can build the graph in which:
- left, right, top, and bottom connections are represented by adjacency matrices
- node features are represented by a feature vector of the content of each text line
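To make the construction concrete, here is a minimal NumPy sketch of building directional adjacency matrices and node features from OCR output. The `box`/`text` schema, the overlap heuristic, and the bag-of-characters featurizer are all illustrative assumptions, not the actual pipeline; in practice the node features would come from a proper text embedding.

```python
import numpy as np

# Hypothetical OCR output: each text line has a bounding box and its text.
lines = [
    {"box": (40, 10, 120, 30), "text": "Total"},   # (x1, y1, x2, y2)
    {"box": (140, 10, 190, 30), "text": "$10"},
    {"box": (40, 40, 100, 60), "text": "Coke"},
]

def horizontal_overlap(a, b):
    """Amount of x-axis overlap between two boxes."""
    return min(a[2], b[2]) - max(a[0], b[0])

def vertical_overlap(a, b):
    """Amount of y-axis overlap between two boxes."""
    return min(a[3], b[3]) - max(a[1], b[1])

n = len(lines)
# One adjacency matrix per direction: right, left, down, up.
adj = {d: np.zeros((n, n), dtype=np.float32)
       for d in ("right", "left", "down", "up")}

for i in range(n):
    for j in range(n):
        if i == j:
            continue
        bi, bj = lines[i]["box"], lines[j]["box"]
        # j is "right of" i if the boxes share a row and j starts after i ends
        if vertical_overlap(bi, bj) > 0 and bj[0] >= bi[2]:
            adj["right"][i, j] = 1.0
            adj["left"][j, i] = 1.0
        # j is "below" i if the boxes share a column and j starts below i
        if horizontal_overlap(bi, bj) > 0 and bj[1] >= bi[3]:
            adj["down"][i, j] = 1.0
            adj["up"][j, i] = 1.0

# Toy node features: a bag-of-characters vector stands in for a real
# text embedding (word vectors, a character CNN, etc.).
def char_features(text, dim=64):
    v = np.zeros(dim, dtype=np.float32)
    for c in text.lower():
        v[ord(c) % dim] += 1.0
    return v

features = np.stack([char_features(l["text"]) for l in lines])
```

With these three lines, "$10" ends up as the right-neighbor of "Total", and "Coke" as its down-neighbor, which is exactly the spatial structure the classifier will exploit.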
The class of each node depends heavily on its neighboring nodes.
For example, if “$10” appears right next to “Total”, it is most likely the total price of the receipt; if it sits next to “Coke” or “Burger”, it very well may not be.
Here’s the key: locality matters. Classifying a node must take all adjacent nodes into account. Since we want to exploit this locality, we aggregate over all adjacent neighbors with a graph convolution. Here is how:
Convolution on a graph done by matrix multiplication
Matrix multiplication is a mathematically and computationally convenient way to express convolution on graphs.
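As a sketch of what "convolution as matrix multiplication" means, here is one graph-convolution layer in the widely used Kipf–Welling form, written in plain NumPy. This is a generic GCN propagation rule assumed for illustration, not the actual model in the pipeline:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution layer: H' = ReLU(A_norm @ H @ W), where
    A_norm is the degree-normalized adjacency with self-loops added.
    Aggregating A_norm @ H mixes each node's features with those of
    all its adjacent neighbors — the "locality" step."""
    n = A.shape[0]
    A_hat = A + np.eye(n, dtype=A.dtype)        # add self-loops
    deg = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))    # D^{-1/2}
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt    # symmetric normalization
    return np.maximum(A_norm @ H @ W, 0.0)      # linear map + ReLU

# Tiny example: 3 nodes in a chain, 4-dim features, 2-dim output.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=np.float32)     # symmetric adjacency
H = rng.normal(size=(3, 4)).astype(np.float32)  # node feature vectors
W = rng.normal(size=(4, 2)).astype(np.float32)  # learned weights

out = gcn_layer(A, H, W)                        # shape (3, 2)
```

Stacking a few such layers lets information from farther-away nodes flow into each node's representation before the final per-node classification.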
This is a demo of key-value extraction using a GCN. We traverse the different fields to see which class each text line corresponds to (bounded by green boxes), along with all the neighboring nodes that explain that decision (bounded by red boxes). The “explainability” part of the decision will be covered in other articles on key-value extraction.
The key takeaway here is locality: the class of a text line is decided by taking into account both its position in the receipt and its own “meaning”. Using graph convolutional networks combines the best of both worlds!
Written by: Patrick
Consultant team: Marc, Sonny, Toni