The code in this repository scrapes data tables from a PDF file. Once extracted from the PDF file, the data is cleaned, analyzed, and mapped. The map allows the user to easily understand the data.
This repository contains a single Jupyter notebook:
Scraping tables from a PDF file GH.ipynb.
Input: A single URL to a PDF file on a publicly available website.
Output: Three data tables as pandas dataframes that can be exported.
The code does the following:
- Reads in the names of the data tables in a PDF file. This allows the user to confirm that all data tables are being read.
- Extracts data tables as lists from the PDF.
- Accesses and flattens sublists.
- Filters data to separate the data tables, adds headings, and removes NaN values. This is repeated three times to separate out three different data tables.
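
The flatten, filter, add-headings, and NaN-removal steps above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the notebook's actual code: the sample `pages` data, the helper names `flatten`, `is_nan`, and `split_table`, and the idea that each row's first cell identifies its table are all assumptions made for the example.

```python
import math


def flatten(nested):
    """Flatten one level of nesting: a list of row-lists becomes one list of rows."""
    return [row for sublist in nested for row in sublist]


def is_nan(value):
    """Return True only for float NaN values (strings and ints are never NaN)."""
    return isinstance(value, float) and math.isnan(value)


def split_table(rows, label, headings):
    """Keep rows whose first cell equals `label`, drop NaN cells,
    and pair the remaining cells with `headings`."""
    table = []
    for row in rows:
        if row and row[0] == label:
            cells = [cell for cell in row[1:] if not is_nan(cell)]
            table.append(dict(zip(headings, cells)))
    return table


# Hypothetical extraction output: one sublist of rows per PDF page.
pages = [
    [["A", "north", float("nan"), 10], ["B", "x", 1.5]],
    [["A", "south", 20, float("nan")]],
]

rows = flatten(pages)
# Repeat this call with a different label and headings for each data table.
table_a = split_table(rows, "A", ["region", "count"])
```

Each resulting list of dictionaries can be passed to `pandas.DataFrame(...)` to obtain a dataframe. In the real notebook the filter condition will depend on how the tables appear in the PDF, so the label-based test here is only a stand-in.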