Amazon Textract is a machine learning (ML) service that automatically extracts printed text, handwriting, and other data from scanned documents that goes beyond simple optical character recognition (OCR) to identify and extract data from forms and tables.
Currently, thousands of customers are using Amazon Textract to process different types of documents. Many include tables across one or multiple pages, such as bank statements and financial reports.
Many developers expressed interest in merging Amazon Textract responses where tables exist across multiple pages. This post demonstrates how you can use the amazon-textract-response-parser utility to accomplish this and highlights a few tricks to optimize the process.
Solution overview
When tables span multiple pages, a series of steps and validations are required to determine the linkage across pages correctly.
These include analyzing the table structure similarities across pages (columns, headers, margins) and determining if any additional contents like headers or footers exist that may