generated from allenai/python-package-template
1 parent 68b9ee8, commit 6d8e638
Showing 1 changed file with 19 additions and 0 deletions.
@@ -1 +1,20 @@
# pdelfin

Toolkit for truly understanding PDF documents in the wild.

<img src="https://github.com/user-attachments/assets/984a645c-096d-4b9a-9c5b-44063004cd8c" alt="image" width="300"/>

Things supported:
- A prompting strategy to get really good natural text parsing using ChatGPT 4o (silver_data)
- An eval toolkit for comparing different pipeline versions
- Basic filtering by language and SEO spam removal
- Finetuning code for Qwen2-VL (and soon other VLMs)

### Note: Font installation

You will probably need to install some fonts on your computer so that any pdfs you render come out looking nice.

```
sudo apt-get install ttf-mscorefonts-installer msttcorefonts fonts-crosextra-caladea fonts-crosextra-carlito gsfonts
```
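
After installing, it may be useful to confirm that the fonts are actually visible to the system. The following is a minimal sketch, not part of pdelfin itself: it assumes fontconfig's `fc-list` tool is available, and the `EXPECTED_FAMILIES` list and all function names are hypothetical choices made for illustration. Adjust the list to the fonts your PDFs rely on.

```python
import shutil
import subprocess

# Assumption: a few families that the packages above are expected to provide.
# This list is illustrative only, not something the project specifies.
EXPECTED_FAMILIES = ["Arial", "Times New Roman", "Carlito", "Caladea"]


def parse_families(fc_list_output):
    """Collect lower-cased family names from `fc-list : family` output."""
    families = set()
    for line in fc_list_output.splitlines():
        # One line can hold several comma-separated names for the same font.
        for name in line.split(","):
            name = name.strip().lower()
            if name:
                families.add(name)
    return families


def missing_families(expected, installed):
    """Return the expected families that are not in the installed set."""
    return [name for name in expected if name.lower() not in installed]


def installed_families():
    """Query fontconfig for installed family names."""
    if shutil.which("fc-list") is None:
        raise RuntimeError("fc-list not found; install fontconfig first")
    result = subprocess.run(
        ["fc-list", ":", "family"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_families(result.stdout)
```

Calling `missing_families(EXPECTED_FAMILIES, installed_families())` returns the names that were not found, so an empty list suggests the installation worked. Parsing is kept separate from the `fc-list` call so it can be checked without fontconfig present.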