Commit 9381bf8 — docs
jakep-allenai committed Nov 18, 2024
1 parent f287f24
Showing 2 changed files with 66 additions and 1 deletion.
README.md (64 additions, 0 deletions)
@@ -33,6 +33,70 @@ You will also need to install the latest pypdf, which contains some fixes regard
pip install git+https://github.com/py-pdf/pypdf.git@9e0fce7b9810d3e09e2af66481ea3429c42e0d11
```

### Beaker Usage

If you want to linearize millions of PDFs efficiently using [beaker](https://www.beaker.org), follow these instructions.
This is the preferred method for best performance, and it lets you get results back quickly for iterating and debugging.

It runs at 2,800+ tokens per second per H100 GPU.
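To get a feel for what that throughput figure means in practice, here is a back-of-envelope estimate. The 2,800 tokens/sec/GPU number comes from the text above; the average tokens per page and pages per PDF are assumptions for illustration only, and the `hours_to_process` helper is hypothetical, not part of pdelfin:

```python
# Rough wall-clock estimate for linearizing a corpus of PDFs.
# Only TOKENS_PER_SEC_PER_GPU comes from the docs; the per-page
# and per-PDF averages below are illustrative assumptions.
TOKENS_PER_SEC_PER_GPU = 2800
AVG_TOKENS_PER_PAGE = 600   # assumption
AVG_PAGES_PER_PDF = 10      # assumption

def hours_to_process(num_pdfs: int, num_gpus: int) -> float:
    """Approximate wall-clock hours to linearize num_pdfs PDFs on num_gpus GPUs."""
    total_tokens = num_pdfs * AVG_PAGES_PER_PDF * AVG_TOKENS_PER_PAGE
    return total_tokens / (TOKENS_PER_SEC_PER_GPU * num_gpus) / 3600

est = hours_to_process(1_000_000, num_gpus=8)  # roughly 74 hours under these assumptions
```

Under these (assumed) averages, a million PDFs on 8 GPUs is a multi-day job, which is why scaling out with more GPU replicas matters.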

For example:
```bash
python -m pdelfin.beakerpipeline s3://ai2-oe-data/[your username]/pdfworkspaces/[workspacename] --pdfs s3://ai2-oe-data/jakep/gnarly_pdfs/*.pdf --beaker
```

This will convert all the PDFs at `s3://ai2-oe-data/jakep/gnarly_pdfs/*.pdf` and output dolma-formatted documents to `s3://ai2-oe-data/[your username]/pdfworkspaces/[workspacename]/results`.
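The results are dolma-format documents, i.e. one JSON object per line in `.jsonl` files. The sketch below shows what such a record might look like; the exact fields and ids the pipeline emits may differ, and the values here are made up for illustration:

```python
import json

# Sketch of a dolma-style document record (one JSON object per line of a
# .jsonl results file). Field names follow common dolma conventions
# ("id", "text", "source", "metadata"); values are hypothetical.
doc = {
    "id": "gnarly_pdfs/example.pdf",  # hypothetical document id
    "text": "Linearized plain text extracted from the PDF...",
    "source": "pdelfin",
    "metadata": {"Source-File": "s3://ai2-oe-data/jakep/gnarly_pdfs/example.pdf"},
}

line = json.dumps(doc)     # serialize to a single line
parsed = json.loads(line)  # consumers read the file line by line
```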

You can specify more GPUs with `--beaker_gpus [int]` to get through the work faster. You can also choose the beaker workspace with `--beaker_workspace` and restrict which clusters are used with `--beaker_cluster`.
With the default settings, it should run fine on any available GPUs.


```bash
python -m pdelfin.beakerpipeline --help
usage: beakerpipeline.py [-h] [--pdfs PDFS] [--workspace_profile WORKSPACE_PROFILE] [--pdf_profile PDF_PROFILE] [--pages_per_group PAGES_PER_GROUP] [--workers WORKERS] [--stats]
                         [--model MODEL] [--model_max_context MODEL_MAX_CONTEXT] [--model_chat_template MODEL_CHAT_TEMPLATE] [--target_longest_image_dim TARGET_LONGEST_IMAGE_DIM]
                         [--target_anchor_text_len TARGET_ANCHOR_TEXT_LEN] [--beaker] [--beaker_workspace BEAKER_WORKSPACE] [--beaker_cluster BEAKER_CLUSTER] [--beaker_gpus BEAKER_GPUS]
                         [--beaker_priority BEAKER_PRIORITY]
                         workspace

Manager for running millions of PDFs through a batch inference pipeline

positional arguments:
  workspace             The S3 path where work will be done e.g., s3://bucket/prefix/

options:
  -h, --help            show this help message and exit
  --pdfs PDFS           Path to add pdfs stored in s3 to the workspace, can be a glob path s3://bucket/prefix/*.pdf or path to file containing list of pdf paths
  --workspace_profile WORKSPACE_PROFILE
                        S3 configuration profile for accessing the workspace
  --pdf_profile PDF_PROFILE
                        S3 configuration profile for accessing the raw pdf documents
  --pages_per_group PAGES_PER_GROUP
                        Aiming for this many pdf pages per work item group
  --workers WORKERS     Number of workers to run at a time
  --stats               Instead of running any job, reports some statistics about the current workspace
  --model MODEL         List of paths where you can find the model to convert this pdf. You can specify several different paths here, and the script will try to use the one which is fastest to access
  --model_max_context MODEL_MAX_CONTEXT
                        Maximum context length that the model was fine tuned under
  --model_chat_template MODEL_CHAT_TEMPLATE
                        Chat template to pass to sglang server
  --target_longest_image_dim TARGET_LONGEST_IMAGE_DIM
                        Dimension on longest side to use for rendering the pdf pages
  --target_anchor_text_len TARGET_ANCHOR_TEXT_LEN
                        Maximum amount of anchor text to use (characters)
  --beaker              Submit this job to beaker instead of running locally
  --beaker_workspace BEAKER_WORKSPACE
                        Beaker workspace to submit to
  --beaker_cluster BEAKER_CLUSTER
                        Beaker clusters you want to run on
  --beaker_gpus BEAKER_GPUS
                        Number of gpu replicas to run
  --beaker_priority BEAKER_PRIORITY
                        Beaker priority level for the job
```
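The `--pages_per_group` option controls how PDFs are packed into work items of roughly equal page counts. A minimal sketch of that kind of greedy grouping is below; this is not the pipeline's actual implementation, and `group_pdfs` is a hypothetical helper:

```python
# Greedy sketch of pages_per_group-style batching: accumulate PDFs until a
# group reaches the target page count, then start a new group. Illustrative
# only; the real pipeline's grouping logic may differ.
def group_pdfs(pdf_page_counts: dict[str, int], pages_per_group: int) -> list[list[str]]:
    """Pack PDFs into work groups of roughly pages_per_group pages each."""
    groups: list[list[str]] = []
    current: list[str] = []
    current_pages = 0
    for path, pages in pdf_page_counts.items():
        current.append(path)
        current_pages += pages
        if current_pages >= pages_per_group:
            groups.append(current)
            current, current_pages = [], 0
    if current:  # flush the final, possibly smaller, group
        groups.append(current)
    return groups

groups = group_pdfs({"a.pdf": 5, "b.pdf": 7, "c.pdf": 3, "d.pdf": 10}, pages_per_group=10)
```

Larger groups mean fewer, bigger work items; smaller groups give finer-grained parallelism across workers.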


### Batch Inference Usage

If you want to run a fine-tuned model in order to linearize millions of PDFs, you need to use the [birrpipeline.py](https://github.com/allenai/pdelfin/blob/main/pdelfin/birrpipeline.py) script.
pdelfin/beakerpipeline.py (2 additions, 1 deletion)
@@ -790,9 +790,10 @@ async def main():
asyncio.run(main())

# TODO
# - Add logging of failed pages and have the stats function read them
# - Fall back to a different method if < 2% of pages fail; make that configurable
# - Sglang commit a fix for the context length issue
# - pypdf fix for the 'v' error
# - Get a solid benchmark on the stream vs non-stream approach
# - Fix loading of the model checkpoints, it's flaky now, maybe use datasets
