Comparison with other PDF Table Extraction libraries and tools
This page of the wiki compares Camelot's output (qualitatively) with that of other open-source libraries and tools. Chances are that you've already used one of the libraries or tools mentioned below, had trouble getting the desired output, and are here to see if Camelot can extract tables from your PDFs better.
We believe that Camelot works better than the other open-source alternatives out there. We still try to avoid bias and to be fair and accurate here by listing the advantages other tools might have over Camelot, along with the configuration parameters Camelot can use to make up for them.
We would like your help to keep this document up to date. If you notice any inconsistency, please let us know by opening an issue.
The names of Camelot's parsing methods (Lattice and Stream) were inspired by Tabula. Lattice is used to parse tables that have demarcated lines between cells, while Stream is used to parse tables that use whitespace between cells to simulate a table structure.
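
To make the Stream idea concrete, here is a minimal sketch of whitespace-based column detection on plain fixed-width text. It is only an illustration of the general principle, not Camelot's or Tabula's actual algorithm, and the function name `split_on_blank_columns` and the sample rows are made up for this example.

```python
def split_on_blank_columns(lines):
    """Split fixed-width text rows into cells at character positions
    that are blank in every row (a toy version of the Stream idea)."""
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]

    # A character position is a gap if every row has a space there.
    is_gap = [all(row[i] == " " for row in padded) for i in range(width)]

    # Collect (start, end) spans of consecutive non-gap positions.
    spans, start = [], None
    for i, gap in enumerate(is_gap):
        if not gap and start is None:
            start = i
        elif gap and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, width))

    return [[row[a:b].strip() for a, b in spans] for row in padded]


# Hypothetical sample data, aligned with spaces only (no ruling lines).
rows = [
    "State    Year  Rate",
    "Kerala   2011  4.9",
    "Goa      2011  8.2",
]
table = split_on_blank_columns(rows)
print(table)
```

A real PDF parser works with glyph coordinates rather than character positions, so text that sits very close together (as in m27.pdf below) can leave no common gap to split on.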
We took 10 PDFs of each type (lines between cells for Lattice, whitespace between table cells for Stream) and passed them through Tabula's web interface and Camelot's command-line interface. The CSV outputs were pushed to this repo as is. We found that Camelot works better than Tabula in all Lattice cases. Tabula does better table detection for Stream cases, but it still fails to give good parsing output, which Camelot addresses with its configuration parameters.
Note: We have better table detection for Stream cases in the works. #102
We put a ✔️ in the "Table detected correctly?" column if the table was detected accurately and ❌ if it was not (providing an image of the detected table in both cases). The reasoning behind which output is better is provided in the "Comments" column.
n | PDF | Notes | Table detected correctly? (Tabula) | Table detected correctly? (Camelot) | Extra configuration used? (Tabula) | Extra configuration used? (Camelot) | Result (Tabula) | Result (Camelot) | Which has better output? | Comments |
---|---|---|---|---|---|---|---|---|---|---|
1. | agstat.pdf | Header text is vertical, columns span multiple cells. | ❌ image | ✔️ image | NA | No | csv | csv | Camelot | Tabula doesn't output all the header text. Camelot gets all the headers in the correct cells, albeit in reverse order in some cases. |
2. | background_lines_1.pdf | The lines are in the background. | ❌ image | ✔️ image | NA | `-back` | csv | csv | Both | |
3. | background_lines_2.pdf | The lines are in the background. | ✔️ image | ✔️ image | NA | `-scale 40 -back` | csv | csv | Camelot | Tabula shifts some of the data points towards the left. Camelot gets the table as is. |
4. | column_span_1.pdf | Columns span multiple cells. | ✔️ image | ✔️ image | NA | No | csv | csv | Camelot | Tabula moves some headers on the top-right to the left. Camelot gets them in the correct cells. |
5. | column_span_2.pdf | Columns span multiple cells. | ✔️ image | ✔️ image | NA | `-scale 40` | csv | csv | Camelot | Tabula shifts some of the data points towards the left (for example, the number 1728). Camelot gets the table as is. |
6. | electoral_roll.pdf | Very unusual table. | ✔️ (almost) image | ✔️ image | NA | `-scale 40 -I 1` | csv | csv | Camelot | Tabula doesn't give an output. Camelot is able to get all the text out while preserving the table structure, which is usable after some cleaning with pattern matching. |
7. | rotated.pdf | The table is rotated counter-clockwise. | ❌ image | ✔️ image | NA | No | csv | csv | Camelot | Tabula output is unusable. Camelot gets the table out as is. |
8. | row_span_1.pdf | Rows span multiple cells. | ✔️ image | ✔️ image | NA | `-scale 40 -block 99 -const -20` | csv | csv | Camelot | Tabula shifts some of the data points towards the left. Camelot gets the table as is. Check out the totals near the bottom-right. |
9. | twotables_1.pdf | There are two tables on a single page. | ✔️ (almost) image | ✔️ image | NA | No | csv | | Camelot | Tabula output is unusable. Camelot gets the tables out as they are. |
10. | twotables_2.pdf | There are two tables on a single page. | ✔️ image | ✔️ image | | No | | | Both | |
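
Several comments above mention data points being shifted into a neighbouring column. When eyeballing two CSV outputs gets tedious, a cell-by-cell diff can point at the exact cells that disagree. The sketch below uses only the standard library; the helper name `diff_cells` and the two sample CSV strings are hypothetical and are not taken from the outputs in the repo.

```python
import csv
import io


def diff_cells(csv_a, csv_b):
    """Return (row, col, value_a, value_b) for every cell that differs
    between two CSV texts. Missing cells are treated as empty strings."""
    rows_a = list(csv.reader(io.StringIO(csv_a)))
    rows_b = list(csv.reader(io.StringIO(csv_b)))
    diffs = []
    for r in range(max(len(rows_a), len(rows_b))):
        row_a = rows_a[r] if r < len(rows_a) else []
        row_b = rows_b[r] if r < len(rows_b) else []
        for c in range(max(len(row_a), len(row_b))):
            a = row_a[c] if c < len(row_a) else ""
            b = row_b[c] if c < len(row_b) else ""
            if a != b:
                diffs.append((r, c, a, b))
    return diffs


# Made-up example: the same value lands one column to the left.
as_is = "Name,Q1,Q2\nA,,1728\n"
shifted = "Name,Q1,Q2\nA,1728,\n"
print(diff_cells(as_is, shifted))
```

A pair of adjacent differences where one side is empty and the other holds the same value is the typical signature of a shifted data point.
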
n | PDF | Notes | Table detected correctly? (Tabula) | Table detected correctly? (Camelot) | Extra configuration used? (Tabula) | Extra configuration used? (Camelot) | Result (Tabula) | Result (Camelot) | Which has better output? | Comments |
---|---|---|---|---|---|---|---|---|---|---|
1. | 12s0324.pdf | There are two tables on a single page. | ✔️ | NA | NA | | | | Both | |
2. | birdisland.pdf | PDF is encrypted. | ✔️ | NA | NA | | csv | csv | Tabula | Camelot detects two tables, and even though the structure is correct, duplicate strings are found in the same cells. Bug filed. #103. |
3. | budget.pdf | | ✔️ | NA | NA | No | csv | csv | Camelot | Tabula merges the last two columns into one. Camelot gets them correctly. |
4. | district_health.pdf | | ✔️ | NA | NA | No | csv | csv | Camelot | Tabula merges all the columns. Camelot assigns the data points to the correct cells. |
5. | health.pdf | | ✔️ | NA | NA | No | csv | csv | Camelot | Same as above. |
6. | m27.pdf | The text is very close together, which makes it difficult to differentiate between columns. | ✔️ | NA | NA | `-C 72,95,209,327,442,529,566,606,683 -split` | csv | csv | Camelot | Tabula merges some columns. Camelot uses its `-split` feature along with column separators to cut the text strings at those coordinates and put them in the correct cells. |
7. | mexican_towns.pdf | | ✔️ | NA | NA | No | csv | csv | Both | |
8. | missing_values.pdf | Two columns don't have any values. | ✔️ | NA | NA | No | csv | csv | Camelot | Tabula merges some columns. Camelot gets them correctly. |
9. | population_growth.pdf | | ✔️ | NA | NA | No | csv | csv | Both | |
10. | superscript.pdf | A number has another number in superscript. (See the 2nd column of the row starting with Kerala.) | ✔️ | NA | NA | `-flag` | csv | csv | Camelot | Tabula merges the superscript with the number, which doesn't matter in this case due to the decimal point but could change the number by 10x without it. Camelot uses a configuration parameter to delimit the superscripts with `<s></s>` tags, so that they can be handled during cleaning. |
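
The m27.pdf row above relies on cutting a text string at user-supplied column separators. The following sketch shows the basic idea under a strong simplifying assumption, namely that every character in the string has the same width. It is not Camelot's implementation; the function name `split_at_separators` and the coordinates of the sample string are hypothetical (only the separator values 72 and 95 are borrowed from the `-C` list above).

```python
def split_at_separators(text, x0, x1, separators):
    """Cut `text`, drawn from x-coordinate x0 to x1, at every separator
    that falls strictly inside (x0, x1).

    Assumes uniform character width, which real PDF fonts rarely satisfy.
    """
    if not text:
        return [""]
    char_width = (x1 - x0) / len(text)
    pieces, start = [], 0
    for sep in sorted(separators):
        if x0 < sep < x1:
            # Map the separator's x-coordinate to the nearest character index.
            cut = round((sep - x0) / char_width)
            pieces.append(text[start:cut].strip())
            start = cut
    pieces.append(text[start:].strip())
    return pieces


# A merged string spanning x = 47..92; only the separator at 72 lies inside it.
cells = split_at_separators("Alpha Beta", 47, 92, [72, 95])
print(cells)  # ['Alpha', 'Beta']
```

Separators outside the string's horizontal extent are ignored, so the same separator list can be applied to every string on the page.
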
From the tables above, we selected 5 PDFs of each type for which Camelot required no extra configuration. Tables from the selected PDFs were parsed using this script (which uses pdfplumber) and Camelot's command-line interface.
The reasoning behind which output is better is provided in the "Comments" column.
n | PDF | Notes | Result (pdfplumber) | Result (Camelot) | Which has better output? | Comments |
---|---|---|---|---|---|---|
1. | agstat.pdf | Header text is vertical, columns span multiple cells. | csv | csv | Camelot | pdfplumber messes up header text. |
2. | column_span_1.pdf | Columns span multiple cells. | csv | csv | Both | |
3. | rotated.pdf | The table is rotated counter-clockwise. | csv | csv | Camelot | pdfplumber output unusable. |
4. | twotables_1.pdf | There are two tables on a single page. | csv | | Camelot | pdfplumber doesn't identify two tables and output is unusable. |
5. | twotables_2.pdf | There are two tables on a single page. | csv | | Camelot | pdfplumber doesn't identify two tables and output is unusable. |
6. | budget.pdf | | errored | csv | Camelot | |
7. | district_health.pdf | | csv | csv | Camelot | pdfplumber output unusable, merged columns. |
8. | health.pdf | | csv | csv | Camelot | pdfplumber output unusable, merged columns. |
9. | mexican_towns.pdf | | errored | csv | Camelot | |
10. | missing_values.pdf | Two columns don't have any values. | csv | csv | Camelot | pdfplumber output unusable, merged columns. |
The open-source development for pdftables was stopped in September 2013, when it became a closed-source paid tool.
Again, we selected 5 PDFs of each type from the tables above for which Camelot required no extra configuration. Tables from the selected PDFs were parsed using this script (which uses pdftables) and Camelot's command-line interface.
Again, the reasoning behind which output is better is provided in the "Comments" column.
n | PDF | Notes | Result (pdftables) | Result (Camelot) | Which has better output? | Comments |
---|---|---|---|---|---|---|
1. | agstat.pdf | Header text is vertical, columns span multiple cells. | csv | csv | Camelot | pdftables output unusable, merged columns. |
2. | column_span_1.pdf | Columns span multiple cells. | csv | csv | Camelot | pdftables output unusable, merged columns. |
3. | rotated.pdf | The table is rotated counter-clockwise. | csv | csv | Camelot | pdftables output unusable. |
4. | twotables_1.pdf | There are two tables on a single page. | csv | | Camelot | pdftables doesn't combine multi-line rows. |
5. | twotables_2.pdf | There are two tables on a single page. | csv | | Camelot | pdftables output unusable, merged columns. |
6. | budget.pdf | | csv | csv | Camelot | pdftables output unusable, merged columns. |
7. | district_health.pdf | | csv | csv | Camelot | pdftables output unusable, merged columns. |
8. | health.pdf | | csv | csv | Camelot | pdftables output unusable, merged columns. |
9. | mexican_towns.pdf | | csv | csv | Both | |
10. | missing_values.pdf | Two columns don't have any values. | csv | csv | Camelot | pdftables output unusable, merged columns. |
As before, we selected 5 PDFs of each type from the tables above for which Camelot required no extra configuration. Tables from the selected PDFs were parsed using this script (which uses pdf-table-extract) and Camelot's command-line interface.
The reasoning behind which output is better is provided in the "Comments" column.
n | PDF | Notes | Result (pdf-table-extract, pte) | Result (Camelot) | Which has better output? | Comments |
---|---|---|---|---|---|---|
1. | agstat.pdf | Header text is vertical, columns span multiple cells. | csv | csv | Both | Camelot puts vertical headers in reverse order. Bug filed. [#105] |
2. | column_span_1.pdf | Columns span multiple cells. | csv | csv | Camelot | pte gives extra columns. |
3. | rotated.pdf | The table is rotated counter-clockwise. | csv | csv | Camelot | pte doesn't account for table rotation. |
4. | twotables_1.pdf | There are two tables on a single page. | csv | | Camelot | pte output unusable. |
5. | twotables_2.pdf | There are two tables on a single page. | csv | | Camelot | pte detects one table and merges first row with header. |
6. | budget.pdf | | csv | csv | Camelot | pte output unusable. |
7. | district_health.pdf | | csv | csv | Camelot | pte output unusable. |
8. | health.pdf | | csv | csv | Camelot | pte output unusable. |
9. | mexican_towns.pdf | | csv | csv | Camelot | pte output unusable. |
10. | missing_values.pdf | Two columns don't have any values. | csv | csv | Camelot | pte output unusable. |