Commit 9381bf8 — docs
jakep-allenai committed Nov 18, 2024
1 parent f287f24
Showing 2 changed files with 66 additions and 1 deletion.
README.md (64 additions, 0 deletions)
@@ -33,6 +33,70 @@ You will also need to install the latest pypdf, which contains some fixes regard
pip install git+https://github.com/py-pdf/pypdf.git@9e0fce7b9810d3e09e2af66481ea3429c42e0d11
```

### Beaker Usage

If you want to linearize millions of PDFs efficiently using [beaker](https://www.beaker.org), follow these instructions.
This is the preferred method for best performance, and it lets you get results back quickly for iterating and debugging.

It runs at 2,800+ tokens per second per H100 GPU.
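To get a feel for what that throughput figure means in practice, here is a back-of-envelope estimate. The 2,800 tokens/sec/GPU number comes from the text above; the average tokens per page and pages per PDF are assumptions for illustration only, and the `hours_to_process` helper is hypothetical, not part of pdelfin:

```python
# Rough wall-clock estimate for linearizing a corpus of PDFs.
# Only TOKENS_PER_SEC_PER_GPU comes from the docs; the per-page
# and per-PDF averages below are illustrative assumptions.
TOKENS_PER_SEC_PER_GPU = 2800
AVG_TOKENS_PER_PAGE = 600   # assumption
AVG_PAGES_PER_PDF = 10      # assumption

def hours_to_process(num_pdfs: int, num_gpus: int) -> float:
    """Approximate wall-clock hours to linearize num_pdfs PDFs on num_gpus GPUs."""
    total_tokens = num_pdfs * AVG_PAGES_PER_PDF * AVG_TOKENS_PER_PAGE
    return total_tokens / (TOKENS_PER_SEC_PER_GPU * num_gpus) / 3600

est = hours_to_process(1_000_000, num_gpus=8)  # roughly 74 hours under these assumptions
```

Under these (assumed) averages, a million PDFs on 8 GPUs is a multi-day job, which is why scaling out with more GPU replicas matters.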

For example:
```bash
python -m pdelfin.beakerpipeline s3://ai2-oe-data/[your username]/pdfworkspaces/[workspacename] --pdfs s3://ai2-oe-data/jakep/gnarly_pdfs/*.pdf --beaker
```

This will convert all the PDFs at `s3://ai2-oe-data/jakep/gnarly_pdfs/*.pdf` and output dolma-formatted documents to `s3://ai2-oe-data/[your username]/pdfworkspaces/[workspacename]/results`.
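The results are dolma-format documents, i.e. one JSON object per line in `.jsonl` files. The sketch below shows what such a record might look like; the exact fields and ids the pipeline emits may differ, and the values here are made up for illustration:

```python
import json

# Sketch of a dolma-style document record (one JSON object per line of a
# .jsonl results file). Field names follow common dolma conventions
# ("id", "text", "source", "metadata"); values are hypothetical.
doc = {
    "id": "gnarly_pdfs/example.pdf",  # hypothetical document id
    "text": "Linearized plain text extracted from the PDF...",
    "source": "pdelfin",
    "metadata": {"Source-File": "s3://ai2-oe-data/jakep/gnarly_pdfs/example.pdf"},
}

line = json.dumps(doc)     # serialize to a single line
parsed = json.loads(line)  # consumers read the file line by line
```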

You can specify more GPUs with `--beaker_gpus [int]` to get through the work faster. You can also choose the beaker workspace with `--beaker_workspace` and restrict which clusters are used with `--beaker_cluster`.
With the default settings, it should run fine on any available GPUs.


```bash
python -m pdelfin.beakerpipeline --help
usage: beakerpipeline.py [-h] [--pdfs PDFS] [--workspace_profile WORKSPACE_PROFILE] [--pdf_profile PDF_PROFILE] [--pages_per_group PAGES_PER_GROUP] [--workers WORKERS] [--stats]
                         [--model MODEL] [--model_max_context MODEL_MAX_CONTEXT] [--model_chat_template MODEL_CHAT_TEMPLATE] [--target_longest_image_dim TARGET_LONGEST_IMAGE_DIM]
                         [--target_anchor_text_len TARGET_ANCHOR_TEXT_LEN] [--beaker] [--beaker_workspace BEAKER_WORKSPACE] [--beaker_cluster BEAKER_CLUSTER] [--beaker_gpus BEAKER_GPUS]
                         [--beaker_priority BEAKER_PRIORITY]
                         workspace

Manager for running millions of PDFs through a batch inference pipeline

positional arguments:
  workspace             The S3 path where work will be done e.g., s3://bucket/prefix/

options:
  -h, --help            show this help message and exit
  --pdfs PDFS           Path to add pdfs stored in s3 to the workspace, can be a glob path s3://bucket/prefix/*.pdf or path to file containing list of pdf paths
  --workspace_profile WORKSPACE_PROFILE
                        S3 configuration profile for accessing the workspace
  --pdf_profile PDF_PROFILE
                        S3 configuration profile for accessing the raw pdf documents
  --pages_per_group PAGES_PER_GROUP
                        Aiming for this many pdf pages per work item group
  --workers WORKERS     Number of workers to run at a time
  --stats               Instead of running any job, reports some statistics about the current workspace
  --model MODEL         List of paths where you can find the model to convert this pdf. You can specify several different paths here, and the script will try to use the one which is fastest to access
  --model_max_context MODEL_MAX_CONTEXT
                        Maximum context length that the model was fine tuned under
  --model_chat_template MODEL_CHAT_TEMPLATE
                        Chat template to pass to sglang server
  --target_longest_image_dim TARGET_LONGEST_IMAGE_DIM
                        Dimension on longest side to use for rendering the pdf pages
  --target_anchor_text_len TARGET_ANCHOR_TEXT_LEN
                        Maximum amount of anchor text to use (characters)
  --beaker              Submit this job to beaker instead of running locally
  --beaker_workspace BEAKER_WORKSPACE
                        Beaker workspace to submit to
  --beaker_cluster BEAKER_CLUSTER
                        Beaker clusters you want to run on
  --beaker_gpus BEAKER_GPUS
                        Number of gpu replicas to run
  --beaker_priority BEAKER_PRIORITY
                        Beaker priority level for the job
```
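The `--pages_per_group` option controls how PDFs are packed into work items of roughly equal page counts. A minimal sketch of that kind of greedy grouping is below; this is not the pipeline's actual implementation, and `group_pdfs` is a hypothetical helper:

```python
# Greedy sketch of pages_per_group-style batching: accumulate PDFs until a
# group reaches the target page count, then start a new group. Illustrative
# only; the real pipeline's grouping logic may differ.
def group_pdfs(pdf_page_counts: dict[str, int], pages_per_group: int) -> list[list[str]]:
    """Pack PDFs into work groups of roughly pages_per_group pages each."""
    groups: list[list[str]] = []
    current: list[str] = []
    current_pages = 0
    for path, pages in pdf_page_counts.items():
        current.append(path)
        current_pages += pages
        if current_pages >= pages_per_group:
            groups.append(current)
            current, current_pages = [], 0
    if current:  # flush the final, possibly smaller, group
        groups.append(current)
    return groups

groups = group_pdfs({"a.pdf": 5, "b.pdf": 7, "c.pdf": 3, "d.pdf": 10}, pages_per_group=10)
```

Larger groups mean fewer, bigger work items; smaller groups give finer-grained parallelism across workers.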


### Batch Inference Usage

If you want to run a fine-tuned model in order to linearize millions of PDFs, you need to use the [birrpipeline.py](https://github.com/allenai/pdelfin/blob/main/pdelfin/birrpipeline.py) script.
pdelfin/beakerpipeline.py (2 additions, 1 deletion)
@@ -790,9 +790,10 @@ async def main():
asyncio.run(main())

# TODO
# - Add logging of failed pages and have the stats function read them
# - Fall back to a different method if < 2% of pages fail; make that configurable
# - Sglang commit a fix for the context length issue
# - pypdf fix for the 'v' error
# - Get a solid benchmark on the stream vs non-stream approach
# - Fix loading of the model checkpoints, it's flaky now, maybe use datasets
