With the Table Transformer (TATR) inference pipeline you can:
- Detect all tables in a document image.
- Recognize the structure of a table in a cropped table image and output to HTML or CSV (and other formats).
- Extract (detect and recognize) all tables in a document image in a single step.
Recognized/extracted tables can be output in multiple formats:
1. Raw object bounding boxes
2. List of cells, with properties
3. HTML
4. CSV
Of these, the list of cells (#2) contains the most information, as it includes the bounding box of every cell. The HTML format is somewhat lossy because it does not include bounding box information for the cells. The CSV format is slightly lossier still: only the first row can contain the column headers/names, so multi-row column headers must be "flattened" into a single header row.
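For example, a header that spans two rows has to be merged column by column when exporting to CSV. The snippet below is purely illustrative of that idea; it is not how the library itself performs the flattening.
# Illustrative only: flatten a two-row column header into a single CSV header
# row by joining each column's header text from top to bottom.
multi_row_header = [
    ["Revenue", "Revenue", "Costs", "Costs"],  # first header row (spanning cells repeated per column)
    ["2020",    "2021",    "2020",  "2021"],   # second header row
]
flattened = [" ".join(dict.fromkeys(column)) for column in zip(*multi_row_header)]
print(flattened)  # ['Revenue 2020', 'Revenue 2021', 'Costs 2020', 'Costs 2021']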
Converting a cropped table image to HTML or CSV:
from inference import TableExtractionPipeline
# Create inference pipeline
pipe = TableExtractionPipeline(det_config_path='detection_config.json',
                               det_model_path='../pubtables1m_detection_detr_r18.pth',
                               det_device='cuda',
                               str_config_path='structure_config.json',
                               str_model_path='../pubtables1m_structure_detr_r18.pth',
                               str_device='cuda')
# Recognize table(s) from image
extracted_tables = pipe.recognize(img, tokens, out_objects=True, out_cells=True, out_html=True, out_csv=True)
# Select table (there could be more than one)
extracted_table = extracted_tables[0]
# Get output in desired format
objects = extracted_table['objects']
cells = extracted_table['cells']
csv = extracted_table['csv']
html = extracted_table['html']
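To persist the results, something like the following minimal sketch should work; it assumes the 'html' and 'csv' entries are plain strings (check the types returned by your version of inference.py).
# Sketch: write every recognized table to disk (assumes string outputs).
for i, table in enumerate(extracted_tables):
    with open('table_{}.html'.format(i), 'w', encoding='utf-8') as f:
        f.write(table['html'])
    with open('table_{}.csv'.format(i), 'w', encoding='utf-8') as f:
        f.write(table['csv'])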
To run the pipelines, you need to provide config and model checkpoint files.
For table detection you need:
- A detection model config JSON file
- A pre-trained detection model checkpoint file
For table structure recognition you need:
- A structure model config JSON file
- A pre-trained structure model checkpoint file
For end-to-end table extraction you need all four of the above files.
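If you only need one of the two steps, it should be possible to construct the pipeline with just the corresponding pair of files. Below is a sketch for structure recognition only; it assumes the constructor accepts the structure arguments on their own, so verify against your version of inference.py.
from inference import TableExtractionPipeline

# Sketch: structure recognition only, so only the structure config/checkpoint
# are passed (assumption: the detection arguments can be omitted).
str_pipe = TableExtractionPipeline(str_config_path='structure_config.json',
                                   str_model_path='../pubtables1m_structure_detr_r18.pth',
                                   str_device='cuda')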
In the sample code above, img is expected to be of type PIL.Image and tokens is expected to be a list of dictionaries. tokens contains a list of words and their bounding boxes in image coordinates, and is assumed to be sorted in reading order. The format for tokens is:
[
    {
        'bbox': [0.0, 0.0, 50.0, 50.0],
        'text': 'First'
    },
    {
        'bbox': [52.0, 0.0, 102.0, 50.0],
        'text': 'next'
    }
]
where bbox is in [xmin, ymin, xmax, ymax] format.
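One way to produce tokens in this format is to run an OCR engine over the image and convert its word boxes. The sketch below uses pytesseract purely as an example; pytesseract is not required by TATR, and any OCR source works as long as its output is converted to the format above and sorted in reading order.
from PIL import Image
import pytesseract
from pytesseract import Output

# Hypothetical image path; for pipe.recognize this would be a cropped table image.
img = Image.open('page.png')

# Run word-level OCR; pytesseract returns words roughly in reading order.
ocr = pytesseract.image_to_data(img, output_type=Output.DICT)

tokens = []
for text, left, top, width, height in zip(ocr['text'], ocr['left'], ocr['top'],
                                           ocr['width'], ocr['height']):
    if text.strip():  # skip empty OCR entries
        tokens.append({
            'bbox': [float(left), float(top), float(left + width), float(top + height)],
            'text': text,
        })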
Change to the src directory:
cd src
To run table detection on a folder of document page images:
python inference.py --mode detect --detection_config_path detection_config.json --detection_model_path ../pubtables1m_detection_detr_r18.pth --detection_device cuda --image_dir [PATH TO DOCUMENT PAGE IMAGES] --words_dir [OPTIONAL PATH TO WORDS (ex. OCR) EXTRACTED FROM DOCUMENT PAGE IMAGES] --out_dir [PATH TO SAVE DETECTION OUTPUT] [FLAGS: -o,-z,-v,-p] --crop_padding 20
where:
- --words_dir is optional. But text content cannot be included in the final result without it.
- --crop_padding determines how much padding in pixels will be added around a detected table before outputting a cropped image of the table. This amount should be adjusted based on the structure recognition model that will be used on these images.
- -o means to output detected objects (with bounding boxes).
- -p means to save cropped images of the detected tables and save a JSON with cropped words/tokens (if they were supplied as input).
- -v means to print (verbose) output to the console.
- -z means to create and save (visualize) figures depicting the detected tables.
To run table structure recognition on a folder of cropped table images:
python inference.py --mode recognize --structure_config_path structure_config.json --structure_model_path ../pubtables1m_structure_detr_r18.pth --structure_device cuda --image_dir [PATH TO CROPPED TABLE IMAGES] --words_dir [OPTIONAL PATH TO WORDS (ex. OCR) EXTRACTED FROM CROPPED TABLE IMAGES] --out_dir [PATH TO SAVE RECOGNITION OUTPUT] [FLAGS: -o,-c,-l,-m,-z,-v]
where:
- --words_dir is optional. But text content cannot be included in the final result without it. You can still use the visualize flag (-z) to see a depiction of the recognized table structure even without supplying text as input.
- -o means to output detected objects (with bounding boxes).
- -l means to output a list of recognized cells and their properties.
- -m means to output the recognized table in HTML format.
- -c means to output the recognized table in CSV format.
- -v means to print (verbose) output to the console.
- -z means to create and save (visualize) figures depicting the recognized tables and recognized cells in the tables.
To run table extraction (detection and recognition combined end-to-end) on a folder of document page images:
python inference.py --mode extract --detection_config_path detection_config.json --detection_model_path ../pubtables1m_detection_detr_r18.pth --detection_device cuda --structure_config_path structure_config.json --structure_model_path ../pubtables1m_structure_detr_r18.pth --structure_device cuda --image_dir [PATH TO DOCUMENT PAGE IMAGES] --words_dir [OPTIONAL PATH TO WORDS (ex. OCR) EXTRACTED FROM DOCUMENT PAGE IMAGES] --out_dir [PATH TO SAVE EXTRACTION OUTPUT] [FLAGS: -o,-c,-l,-m,-z,-v,-p] --crop_padding 20
where:
- --words_dir is optional. But text content cannot be included in the final result without it. You can still use the visualize flag (-z) to see a depiction of the recognized table structure even without supplying text as input.