jakep-allenai committed Oct 2, 2024
1 parent 68b9ee8 commit 6d8e638
Showing 1 changed file with 19 additions and 0 deletions.
19 changes: 19 additions & 0 deletions README.md
@@ -1 +1,20 @@
# pdelfin

Toolkit for truly understanding PDF documents in the wild.

<img src="https://github.com/user-attachments/assets/984a645c-096d-4b9a-9c5b-44063004cd8c" alt="image" width="300"/>

Things supported:
- A prompting strategy for high-quality natural text parsing using GPT-4o (silver_data)
- An eval toolkit for comparing different pipeline versions
- Basic filtering by language and SEO spam removal
- Finetuning code for Qwen2-VL (and soon other VLMs)

### Note: Font installation

You will probably need to install some fonts on your machine so that any PDFs you render come out looking right.

```bash
sudo apt-get install ttf-mscorefonts-installer msttcorefonts fonts-crosextra-caladea fonts-crosextra-carlito gsfonts
```
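
To confirm the fonts were actually picked up, something like the sketch below may help. It is not part of pdelfin: it assumes fontconfig's `fc-list` is on your PATH, and both the `missing_families` helper and the `EXPECTED_FAMILIES` list are hypothetical names chosen for illustration (the families listed are ones the packages above are expected to provide, so adjust them to your needs).

```python
import shutil
import subprocess

# Hypothetical list of families to check; adjust to the fonts your PDFs need.
EXPECTED_FAMILIES = ["Arial", "Times New Roman", "Carlito", "Caladea"]


def missing_families(fc_list_output, expected=EXPECTED_FAMILIES):
    """Return the expected families absent from `fc-list : family` output."""
    installed = set()
    for line in fc_list_output.splitlines():
        # One line can hold several comma-separated names for the same font.
        for name in line.split(","):
            name = name.strip().lower()
            if name:
                installed.add(name)
    return [family for family in expected if family.lower() not in installed]


def installed_families_output():
    """Run fontconfig's fc-list; return an empty string if it is unavailable."""
    if shutil.which("fc-list") is None:
        return ""
    result = subprocess.run(
        ["fc-list", ":", "family"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


if __name__ == "__main__":
    print("Missing font families:", missing_families(installed_families_output()))
```

An empty list means every family you asked about is visible to fontconfig. If `fc-list` is not installed, every expected family is reported as missing.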
