Massive refactor from pdelfin to olmocr
jakep-allenai committed Jan 27, 2025
1 parent 7261bfc commit b2894d0
Showing 94 changed files with 184 additions and 184 deletions.
44 changes: 22 additions & 22 deletions .github/CONTRIBUTING.md
@@ -6,10 +6,10 @@ Thanks for considering contributing! Please read this document to learn the vari

### Did you find a bug?

First, do [a quick search](https://github.com/allenai/pdelfin/issues) to see whether your issue has already been reported.
First, do [a quick search](https://github.com/allenai/olmocr/issues) to see whether your issue has already been reported.
If your issue has already been reported, please comment on the existing issue.

Otherwise, open [a new GitHub issue](https://github.com/allenai/pdelfin/issues). Be sure to include a clear title
Otherwise, open [a new GitHub issue](https://github.com/allenai/olmocr/issues). Be sure to include a clear title
and description. The description should include as much relevant information as possible. The description should
explain how to reproduce the erroneous behavior as well as the behavior you expect to see. Ideally you would include a
code sample or an executable test case demonstrating the expected behavior.
@@ -21,7 +21,7 @@ We use GitHub issues to track feature requests. Before you create a feature requ
* Make sure you have a clear idea of the enhancement you would like. If you have a vague idea, consider discussing
it first on a GitHub issue.
* Check the documentation to make sure your feature does not already exist.
* Do [a quick search](https://github.com/allenai/pdelfin/issues) to see whether your feature has already been suggested.
* Do [a quick search](https://github.com/allenai/olmocr/issues) to see whether your feature has already been suggested.

When creating your request, please:

@@ -41,31 +41,31 @@ When you're ready to contribute code to address an open issue, please follow the

Then clone your fork locally with

git clone https://github.com/USERNAME/pdelfin.git
git clone https://github.com/USERNAME/olmocr.git

or

git clone git@github.com:USERNAME/pdelfin.git
git clone git@github.com:USERNAME/olmocr.git

At this point the local clone of your fork only knows that it came from *your* repo, github.com/USERNAME/pdelfin.git, but doesn't know anything the *main* repo, [https://github.com/allenai/pdelfin.git](https://github.com/allenai/pdelfin). You can see this by running
At this point the local clone of your fork only knows that it came from *your* repo, github.com/USERNAME/olmocr.git, but doesn't know anything the *main* repo, [https://github.com/allenai/olmocr.git](https://github.com/allenai/olmocr). You can see this by running

git remote -v

which will output something like this:

origin https://github.com/USERNAME/pdelfin.git (fetch)
origin https://github.com/USERNAME/pdelfin.git (push)
origin https://github.com/USERNAME/olmocr.git (fetch)
origin https://github.com/USERNAME/olmocr.git (push)

This means that your local clone can only track changes from your fork, but not from the main repo, and so you won't be able to keep your fork up-to-date with the main repo over time. Therefore you'll need to add another "remote" to your clone that points to [https://github.com/allenai/pdelfin.git](https://github.com/allenai/pdelfin). To do this, run the following:
This means that your local clone can only track changes from your fork, but not from the main repo, and so you won't be able to keep your fork up-to-date with the main repo over time. Therefore you'll need to add another "remote" to your clone that points to [https://github.com/allenai/olmocr.git](https://github.com/allenai/olmocr). To do this, run the following:

git remote add upstream https://github.com/allenai/pdelfin.git
git remote add upstream https://github.com/allenai/olmocr.git

Now if you do `git remote -v` again, you'll see

origin https://github.com/USERNAME/pdelfin.git (fetch)
origin https://github.com/USERNAME/pdelfin.git (push)
upstream https://github.com/allenai/pdelfin.git (fetch)
upstream https://github.com/allenai/pdelfin.git (push)
origin https://github.com/USERNAME/olmocr.git (fetch)
origin https://github.com/USERNAME/olmocr.git (push)
upstream https://github.com/allenai/olmocr.git (fetch)
upstream https://github.com/allenai/olmocr.git (push)

Finally, you'll need to create a Python 3 virtual environment suitable for working on this project. There a number of tools out there that making working with virtual environments easier.
The most direct way is with the [`venv` module](https://docs.python.org/3.7/library/venv.html) in the standard library, but if you're new to Python or you don't already have a recent Python 3 version installed on your machine,
@@ -77,8 +77,8 @@ When you're ready to contribute code to address an open issue, please follow the

Then you can create and activate a new Python environment by running:

conda create -n pdelfin python=3.9
conda activate pdelfin
conda create -n olmocr python=3.9
conda activate olmocr

Once your virtual environment is activated, you can install your local clone in "editable mode" with

@@ -93,7 +93,7 @@ When you're ready to contribute code to address an open issue, please follow the

<details><summary>Expand details 👇</summary><br/>

Once you've added an "upstream" remote pointing to [https://github.com/allenai/python-package-temlate.git](https://github.com/allenai/pdelfin), keeping your fork up-to-date is easy:
Once you've added an "upstream" remote pointing to [https://github.com/allenai/python-package-temlate.git](https://github.com/allenai/olmocr), keeping your fork up-to-date is easy:

git checkout main # if not already on main
git pull --rebase upstream main
@@ -119,7 +119,7 @@ When you're ready to contribute code to address an open issue, please follow the

<details><summary>Expand details 👇</summary><br/>

Our continuous integration (CI) testing runs [a number of checks](https://github.com/allenai/pdelfin/actions) for each pull request on [GitHub Actions](https://github.com/features/actions). You can run most of these tests locally, which is something you should do *before* opening a PR to help speed up the review process and make it easier for us.
Our continuous integration (CI) testing runs [a number of checks](https://github.com/allenai/olmocr/actions) for each pull request on [GitHub Actions](https://github.com/features/actions). You can run most of these tests locally, which is something you should do *before* opening a PR to help speed up the review process and make it easier for us.

First, you should run [`isort`](https://github.com/PyCQA/isort) and [`black`](https://github.com/psf/black) to make sure you code is formatted consistently.
Many IDEs support code formatters as plugins, so you may be able to setup isort and black to run automatically everytime you save.
@@ -137,9 +137,9 @@ When you're ready to contribute code to address an open issue, please follow the

mypy .

We also strive to maintain high test coverage, so most contributions should include additions to [the unit tests](https://github.com/allenai/pdelfin/tree/main/tests). These tests are run with [`pytest`](https://docs.pytest.org/en/latest/), which you can use to locally run any test modules that you've added or changed.
We also strive to maintain high test coverage, so most contributions should include additions to [the unit tests](https://github.com/allenai/olmocr/tree/main/tests). These tests are run with [`pytest`](https://docs.pytest.org/en/latest/), which you can use to locally run any test modules that you've added or changed.

For example, if you've fixed a bug in `pdelfin/a/b.py`, you can run the tests specific to that module with
For example, if you've fixed a bug in `olmocr/a/b.py`, you can run the tests specific to that module with

pytest -v tests/a/b_test.py

@@ -152,9 +152,9 @@ When you're ready to contribute code to address an open issue, please follow the

If the build fails, it's most likely due to small formatting issues. If the error message isn't clear, feel free to comment on this in your pull request.

And finally, please update the [CHANGELOG](https://github.com/allenai/pdelfin/blob/main/CHANGELOG.md) with notes on your contribution in the "Unreleased" section at the top.
And finally, please update the [CHANGELOG](https://github.com/allenai/olmocr/blob/main/CHANGELOG.md) with notes on your contribution in the "Unreleased" section at the top.

After all of the above checks have passed, you can now open [a new GitHub pull request](https://github.com/allenai/pdelfin/pulls).
After all of the above checks have passed, you can now open [a new GitHub pull request](https://github.com/allenai/olmocr/pulls).
Make sure you have a clear description of the problem and the solution, and include a link to relevant issues.

We look forward to reviewing your PR!
4 changes: 2 additions & 2 deletions .github/ISSUE_TEMPLATE/bug_report.yml
@@ -6,7 +6,7 @@ body:
- type: markdown
attributes:
value: >
#### Before submitting a bug, please make sure the issue hasn't been already addressed by searching through [the existing and past issues](https://github.com/allenai/pdelfin/issues?q=is%3Aissue+sort%3Acreated-desc+).
#### Before submitting a bug, please make sure the issue hasn't been already addressed by searching through [the existing and past issues](https://github.com/allenai/olmocr/issues?q=is%3Aissue+sort%3Acreated-desc+).
- type: textarea
attributes:
label: 🐛 Describe the bug
@@ -17,7 +17,7 @@ body:
```python
# All necessary imports at the beginning
import pdelfin
import olmocr
# A succinct reproducing example trimmed down to the essential parts:
assert False is True, "Oh no!"
4 changes: 2 additions & 2 deletions .github/ISSUE_TEMPLATE/documentation.yml
@@ -1,13 +1,13 @@
name: 📚 Documentation
description: Report an issue related to https://pdelfin.readthedocs.io/latest
description: Report an issue related to https://olmocr.readthedocs.io/latest
labels: 'documentation'

body:
- type: textarea
attributes:
label: 📚 The doc issue
description: >
A clear and concise description of what content in https://pdelfin.readthedocs.io/latest is an issue.
A clear and concise description of what content in https://olmocr.readthedocs.io/latest is an issue.
validations:
required: true
- type: textarea
4 changes: 2 additions & 2 deletions .github/pull_request_template.md
@@ -10,9 +10,9 @@ Changes proposed in this pull request:
## Before submitting

<!-- Please complete this checklist BEFORE submitting your PR to speed along the review process. -->
- [ ] I've read and followed all steps in the [Making a pull request](https://github.com/allenai/pdelfin/blob/main/.github/CONTRIBUTING.md#making-a-pull-request)
- [ ] I've read and followed all steps in the [Making a pull request](https://github.com/allenai/olmocr/blob/main/.github/CONTRIBUTING.md#making-a-pull-request)
section of the `CONTRIBUTING` docs.
- [ ] I've updated or added any relevant docstrings following the syntax described in the
[Writing docstrings](https://github.com/allenai/pdelfin/blob/main/.github/CONTRIBUTING.md#writing-docstrings) section of the `CONTRIBUTING` docs.
[Writing docstrings](https://github.com/allenai/olmocr/blob/main/.github/CONTRIBUTING.md#writing-docstrings) section of the `CONTRIBUTING` docs.
- [ ] If this PR fixes a bug, I've added a test that will fail without my fix.
- [ ] If this PR adds a new feature, I've added tests that sufficiently cover my new functionality.
2 changes: 1 addition & 1 deletion .github/workflows/main.yml
@@ -97,7 +97,7 @@ jobs:
if: always()
run: |
. .venv/bin/activate
pip uninstall -y pdelfin
pip uninstall -y olmocr
release:
name: Release
2 changes: 1 addition & 1 deletion .github/workflows/pr_checks.yml
@@ -9,7 +9,7 @@ on:
branches:
- main
paths:
- 'pdelfin/**'
- 'olmocr/**'

jobs:
changelog:
22 changes: 11 additions & 11 deletions README.md
@@ -7,12 +7,12 @@ Toolkit for training language models to work with PDF documents in the wild.


What is included:
- A prompting strategy to get really good natural text parsing using ChatGPT 4o - [buildsilver.py](https://github.com/allenai/pdelfin/blob/main/pdelfin/data/buildsilver.py)
- An eval toolkit for comparing different pipeline versions - [runeval.py](https://github.com/allenai/pdelfin/blob/main/pdelfin/eval/runeval.py)
- Basic filtering by language and SEO spam removal - [filter.py](https://github.com/allenai/pdelfin/blob/main/pdelfin/filter/filter.py)
- Finetuning code for Qwen2-VL (and soon other VLMs) - [train.py](https://github.com/allenai/pdelfin/blob/main/pdelfin/train/train.py)
- Processing millions of PDFs through a finetuned model using Sglang - [beakerpipeline.py](https://github.com/allenai/pdelfin/blob/main/pdelfin/beakerpipeline.py)
- Viewing Dolma Docs created from PDFs - [dolmaviewer.py](https://github.com/allenai/pdelfin/blob/main/pdelfin/viewer/dolmaviewer.py)
- A prompting strategy to get really good natural text parsing using ChatGPT 4o - [buildsilver.py](https://github.com/allenai/olmocr/blob/main/olmocr/data/buildsilver.py)
- An eval toolkit for comparing different pipeline versions - [runeval.py](https://github.com/allenai/olmocr/blob/main/olmocr/eval/runeval.py)
- Basic filtering by language and SEO spam removal - [filter.py](https://github.com/allenai/olmocr/blob/main/olmocr/filter/filter.py)
- Finetuning code for Qwen2-VL (and soon other VLMs) - [train.py](https://github.com/allenai/olmocr/blob/main/olmocr/train/train.py)
- Processing millions of PDFs through a finetuned model using Sglang - [beakerpipeline.py](https://github.com/allenai/olmocr/blob/main/olmocr/beakerpipeline.py)
- Viewing Dolma Docs created from PDFs - [dolmaviewer.py](https://github.com/allenai/olmocr/blob/main/olmocr/viewer/dolmaviewer.py)

### Installation

@@ -22,10 +22,10 @@ You will need to install poppler-utils and then also some fonts on your computer
sudo apt-get install poppler-utils ttf-mscorefonts-installer msttcorefonts fonts-crosextra-caladea fonts-crosextra-carlito gsfonts lcdf-typetools
```

Then, clone and install the pdelfin package
Then, clone and install the olmocr package
```bash
git clone https://github.com/allenai/pdelfin.git
cd pdelfin
git clone https://github.com/allenai/olmocr.git
cd olmocr
pip install -e .
```

@@ -43,7 +43,7 @@ It also runs at 2,800+ tokens per second per H100 GPU.

For example:
```bash
python -m pdelfin.beakerpipeline s3://ai2-oe-data/[your username]/pdfworkspaces/[workspacename] --pdfs s3://ai2-oe-data/jakep/gnarly_pdfs/*.pdf --beaker
python -m olmocr.beakerpipeline s3://ai2-oe-data/[your username]/pdfworkspaces/[workspacename] --pdfs s3://ai2-oe-data/jakep/gnarly_pdfs/*.pdf --beaker
```

This will convert all the pdfs at `s3://ai2-oe-data/jakep/gnarly_pdfs/*.pdf` and output dolma formatted documents at `s3://ai2-oe-data/[your username]/pdfworkspaces/[workspacename]/results`
@@ -53,7 +53,7 @@ With default settings, it should work fine on any available GPUs.


```bash
python -m pdelfin.beakerpipeline --help
python -m olmocr.beakerpipeline --help
usage: beakerpipeline.py [-h] [--pdfs PDFS] [--workspace_profile WORKSPACE_PROFILE] [--pdf_profile PDF_PROFILE] [--pages_per_group PAGES_PER_GROUP]
[--max_page_retries MAX_PAGE_RETRIES] [--max_page_error_rate MAX_PAGE_ERROR_RATE] [--workers WORKERS] [--stats]
[--model MODEL] [--model_max_context MODEL_MAX_CONTEXT] [--model_chat_template MODEL_CHAT_TEMPLATE]
2 changes: 1 addition & 1 deletion RELEASE_PROCESS.md
@@ -2,7 +2,7 @@

## Steps

1. Update the version in `pdelfin/version.py`.
1. Update the version in `olmocr/version.py`.

3. Run the release script:

8 changes: 4 additions & 4 deletions docs/source/conf.py
@@ -18,11 +18,11 @@

sys.path.insert(0, os.path.abspath("../../"))

from pdelfin import VERSION, VERSION_SHORT # noqa: E402
from olmocr import VERSION, VERSION_SHORT # noqa: E402

# -- Project information -----------------------------------------------------

project = "pdelfin"
project = "olmocr"
copyright = f"{datetime.today().year}, Allen Institute for Artificial Intelligence"
author = "Allen Institute for Artificial Intelligence"
version = VERSION_SHORT
@@ -82,7 +82,7 @@
#
html_theme = "furo"

html_title = f"pdelfin v{VERSION}"
html_title = f"olmocr v{VERSION}"

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
@@ -97,7 +97,7 @@
"footer_icons": [
{
"name": "GitHub",
"url": "https://github.com/allenai/pdelfin",
"url": "https://github.com/allenai/olmocr",
"html": """
<svg stroke="currentColor" fill="currentColor" stroke-width="0" viewBox="0 0 16 16">
<path fill-rule="evenodd" d="M8 0C3.58 0 0 3.58 0 8c0 3.54 2.29 6.53 5.47 7.59.4.07.55-.17.55-.38 0-.19-.01-.82-.01-1.49-2.01.37-2.53-.49-2.69-.94-.09-.23-.48-.94-.82-1.13-.28-.15-.68-.52-.01-.53.63-.01 1.08.58 1.23.82.72 1.21 1.87.87 2.33.66.07-.52.28-.87.51-1.07-1.78-.2-3.64-.89-3.64-3.95 0-.87.31-1.59.82-2.15-.08-.2-.36-1.02.08-2.12 0 0 .67-.21 2.2.82.64-.18 1.32-.27 2-.27.68 0 1.36.09 2 .27 1.53-1.04 2.2-.82 2.2-.82.44 1.1.16 1.92.08 2.12.51.56.82 1.27.82 2.15 0 3.07-1.87 3.75-3.65 3.95.29.25.54.73.54 1.48 0 1.07-.01 1.93-.01 2.2 0 .21.15.46.55.38A8.013 8.013 0 0 0 16 8c0-4.42-3.58-8-8-8z"></path>
6 changes: 3 additions & 3 deletions docs/source/index.md
@@ -1,4 +1,4 @@
# **pdelfin**
# **olmocr**

```{toctree}
:maxdepth: 2
@@ -15,8 +15,8 @@ overview
CHANGELOG
CONTRIBUTING
License <https://raw.githubusercontent.com/allenai/pdelfin/main/LICENSE>
GitHub Repository <https://github.com/allenai/pdelfin>
License <https://raw.githubusercontent.com/allenai/olmocr/main/LICENSE>
GitHub Repository <https://github.com/allenai/olmocr>
```

## Indices and tables
12 changes: 6 additions & 6 deletions docs/source/installation.md
@@ -1,23 +1,23 @@
Installation
============

**pdelfin** supports Python >= 3.8.
**olmocr** supports Python >= 3.8.

## Installing with `pip`

**pdelfin** is available [on PyPI](https://pypi.org/project/pdelfin/). Just run
**olmocr** is available [on PyPI](https://pypi.org/project/olmocr/). Just run

```bash
pip install pdelfin
pip install olmocr
```

## Installing from source

To install **pdelfin** from source, first clone [the repository](https://github.com/allenai/pdelfin):
To install **olmocr** from source, first clone [the repository](https://github.com/allenai/olmocr):

```bash
git clone https://github.com/allenai/pdelfin.git
cd pdelfin
git clone https://github.com/allenai/olmocr.git
cd olmocr
```

Then run
File renamed without changes.
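
A package rename like this one is, at the text level, a search-and-replace of `pdelfin` with `olmocr` across docs, configs, and imports. The following is only a minimal sketch of one way to do that replacement safely, not the method used for this commit; the helper name `rename_package` is hypothetical. It treats the old name as a literal string (in a raw regex, a trailing `.` would match any character and consume the `/` or space that follows the name) and only matches the name when it is not part of a longer identifier.

```python
import re


def rename_package(text: str, old: str, new: str) -> str:
    """Replace standalone occurrences of `old` with `new` in `text`.

    Hypothetical helper for illustration only.
    """
    # re.escape() makes `old` a literal, so regex metacharacters in the
    # name (or accidentally appended to it) are never interpreted.
    # The lookarounds keep us from rewriting longer identifiers such as
    # `pdelfin_extra`, while still matching `pdelfin/`, `pdelfin.git`, etc.
    pattern = re.compile(
        rf"(?<![A-Za-z0-9_]){re.escape(old)}(?![A-Za-z0-9_])"
    )
    # A function replacement inserts `new` verbatim (no backslash handling).
    return pattern.sub(lambda _match: new, text)


samples = [
    "https://github.com/allenai/pdelfin/issues",
    "git clone https://github.com/USERNAME/pdelfin.git",
    "conda create -n pdelfin python=3.9",
    "from pdelfin import VERSION, VERSION_SHORT",
]

for line in samples:
    print(rename_package(line, "pdelfin", "olmocr"))
```

Because the characters after the package name are never part of the match, separators such as `/`, `.`, and spaces are left untouched in the output.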