Commit

Viewer cleanup
jakep-allenai committed Jan 29, 2025
1 parent a243c89 commit 86267d8
Showing 2 changed files with 3 additions and 7 deletions.
2 changes: 2 additions & 0 deletions README.md
@@ -61,6 +61,8 @@ You can also bulk convert many PDFS with a glob pattern:
 python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/*.pdf
 ```

+#### Viewing Results
+
 Once that finishes, output is stored as [Dolma](https://github.com/allenai/dolma)-style JSONL inside of the `./localworkspace/results` directory.

 ```bash
8 changes: 1 addition & 7 deletions olmocr/viewer/dolmaviewer.py
@@ -11,20 +11,14 @@
 from concurrent.futures import ThreadPoolExecutor, as_completed
 import markdown2

-from olmocr.s3_utils import get_s3_bytes
+from olmocr.s3_utils import get_s3_bytes, parse_s3_path
 from olmocr.data.renderpdf import render_pdf_to_base64webp

 def read_jsonl(path):
     with smart_open.smart_open(path, 'r', encoding='utf-8') as f:
         for line in f:
             yield line.strip()

-def parse_s3_path(path):
-    # s3://bucket_name/key_name
-    path = path[5:] # Remove 's3://'
-    bucket_name, key_name = path.split('/', 1)
-    return bucket_name, key_name
-
 def generate_presigned_url(s3_client, bucket_name, key_name):
     try:
         response = s3_client.generate_presigned_url('get_object',
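
The helper removed from `dolmaviewer.py` strips the first five characters without checking that they are actually `s3://`. The version now imported from `olmocr.s3_utils` is not shown in this commit, so the following is only a standalone sketch of a stricter variant, not the library's implementation. The name `parse_s3_path_checked` is hypothetical, and the choice to return an empty key for a bucket-only path is an assumption (the removed helper would raise on unpacking in that case).

```python
from typing import Tuple


def parse_s3_path_checked(path: str) -> Tuple[str, str]:
    """Split 's3://bucket_name/key_name' into (bucket_name, key_name).

    Hypothetical helper for illustration; not part of olmocr.
    """
    prefix = "s3://"
    if not path.startswith(prefix):
        raise ValueError(f"Not an S3 path: {path!r}")

    # Split on the first '/' only, so keys that contain '/' stay intact.
    bucket_name, _, key_name = path[len(prefix):].partition("/")
    if not bucket_name:
        raise ValueError(f"Missing bucket name in: {path!r}")

    # Assumption: a bucket-only path yields an empty key rather than an error.
    return bucket_name, key_name
```

Plain string handling is used here rather than `urllib.parse`, because object keys may contain `?` or `#`, which a URL parser would split off as a query or fragment.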