v0.1.53
What's new
- Fixed git checks
Commits
08f7612 Bump version to v0.1.53 for release
58bdfa5 CI
25ec87b CI
c05e015 Hopefully CI runs now
15f9b8b Install poppler in CI
229da8c unused imports
32aa359 Formatting fix
0dcdbcc Update README.md
6583fb6 hfupload scripts
8297955 Making my parquets
51cfdbd Better converter
e369569 Update README.md
91eef27 Adding some gnarly 1 pager pdfs from kyle
87cb957 First pass at dataset builder script
6ed6f85 Generating parquets for hugging face
84c0c71 Merge branch 'main' of https://github.com/allenai/olmocr
7d67a59 Remove unused
6471f28 Random git ignores, remove unused code
c74d47a Pipeline fixes
04844b3 More beaker and docker fixes
9df86da Beaker fixes
cf6673c Pipeline fixes
7fbbb57 Remove mypy for now
d36e556 Hopefully fixes build
c69e0d6 More cleanup, removing dead adv anchor code
d4d711d Nicer glob handling for pipeline.py
84477b5 More formatting
e3d04ee Merge branch 'main' of https://github.com/allenai/olmocr into main
c37e545 running isort again
2c29533 Fixing most ruff errors
5690377 Ruff
fb40229 Isort and black update
cdb10a9 Python 3.11
dcaca8a Black formatting
4a1762d isort
0628d31 Some unit test cleanup
7d2403d More infos
8dd006d Merge branch 'main' of https://github.com/allenai/olmocr into main
04615d7 More logging on sglang server
0ccb99c readme
2e4ef95 Readme
2192505 Update README.md
9a1be7e Readme
496e162 Update README.md
b574766 Viewer and gitignore
86267d8 Viewer cleanup
a243c89 Update README.md
dbf6477 viewer fix
4c35105 More readme improvements
f16acec Readme improvements
dee494a Local file stuff
7882944 Local pdf support
dbe5487 Support stats feature later
48447b6 Can use remote s3 files, and local workspace now
50f9a6a Name refactor
e0afb93 Better check for separate sglang installation step
00e3aac Inference test for qwen2 and 2.5, work queue fixes, build current still broken
4d0d924 Merge branch 'main' of https://github.com/allenai/olmocr
b28aad6 More test docs
96ae2dd Refactoring
c606267 Cleaning up some unused code
d8c13d0 Readmes and version updates
b2894d0 Massive refactor from pdelfin to olmocr
7261bfc Update README.md
cbfc803 Merge pull request #27 from allenai/molmo
aa59d38 Merge branch 'main' of https://github.com/allenai/pdelfin
eacd044 csv output
201fec3 Config update
72d2fa2 Reviewing molmo training
0311b44 Some small updates
6586744 Building some data summary tools
c74e3d1 ELO stuff
18f72b4 New ELO building stuff finished up I think
50464c1 build elo v1
3a28955 Added ELO scores
a8d9a55 Fixes for elo
00f2a67 More elo scoring stuff
834e91c runelo start
ef4167d Test set script
683be68 Better error handling on expand_s3_glob
5e633e0 Merge branch 'main' of https://github.com/allenai/pdelfin
0d1fc08 Small fixes
2190f61 Merge branch 'main' of https://github.com/allenai/pdelfin
e2bbd0e Adding some long context stats
0b72eda Move form check into exception handler, don't mark the work item as done if it had an exception on it
fa318da New version with s3 fix in it
84c53c2 Merge branch 'main' of https://github.com/allenai/pdelfin
e9c3c21 Skipping files which are not found
3e33ce1 Ignores
37cdb9e Merge branch 'main' of https://github.com/allenai/pdelfin
1eda300 Dolma viewer niceties
fe04db8 Better error handling
35502bc Limit the number of retries on the server process
b3ca86a More robust to errors when reading logs which had caused freezes
d4f3cff More reliable weka
6872105 Merge branch 'main' of https://github.com/allenai/pdelfin
c93fc36 Missing import
dd17185 More things to try
46fe4ac Trying fixes for live lock
41accfe Error out if you see a broken process pool, might need a better check for this
a95487e Adding check for possible sglang livelock
cff9799 Moving to official sglang release
f8dcdf6 Better catching of httpx errors and retrying them
d6a0013 Faster init by caching pdf filter
a91befc Fix for fallback stuff
8c858a9 New version
66fff4f Merge branch 'main' of https://github.com/allenai/pdelfin
212d391 More conservative filtering
cb800d6 Merge branch 'main' of https://github.com/allenai/pdelfin into main
7dd2046 New version
af8ce51 Merge branch 'main' of https://github.com/allenai/pdelfin into main
9112d81 No keep alive connection to try to resolve sglang livelock
53a5104 Merge branch 'main' of https://github.com/allenai/pdelfin into main
67d11ec TODOs and client fix
3153aea Merge branch 'main' of https://github.com/allenai/pdelfin into main
9b8d58b Better stats and metadata
273a8b0 Logging fallback pages
b0acfa8 Adding support for fallback pages
204a4a8 Better stats
3ef4609 Fixing args
27d2352 Claude recommends httpx instead of aiohttp, seeing if that will help with straggler timeouts
4469f4b Version patch
9e2e09b More fixes
8793fc7 Adding more retries, and it was able to process more complicated books
2f55a3d fix
d4d4736 more gcs
e48d4be Fix
8c3b575 Gcs support better
9381bf8 docs
f287f24 Fixing a few stats things
e499413 Better work queue
04429b2 Basic work queue from claude
995b1d1 Fixes, mocking out queue into separate file
fcabb8e Handling more error cases
96984fc Fix a reliability issue
0af29f1 Adding page rotation
e2303f2 Running on l40s, fixing queue
68543d4 Adding stats
b4ca563 Decent set of todos for monday
2f1664f Stop everything on a NaN
eac3b10 allow weka from augusta through vpn
370dbba new build
9ce243e no weka on augusta
eefb045 Single cluster fix
2e1d0b6 Fix
748b095 Fix
80ba562 Fixing timeout situation
65763de Don't retry accessdenied errors
2c52664 Cleaner exit
77c82fd New version with aiohttp fixes
ae1e4bc More realistic results
770da2b Docker
bfe4211 Debugging timeout errors and other things
fd17652 Trying to make it faster
278422b Fixing one max context issue
62de9fe weka fix
9a1e82f Logging
fe0574c Cleanup code, s3 retries
2c7686f I think I have error handling better now
8217e49 Page calc
4eab90f Fixing bugs
b67d8e7 Fixing work queue population
827b77e Working on task groups
a58efea better logging
a9cf2e0 Allow setting beaker priority
41c8d55 exponential backoff
4dcf9ed more fixes
06331d7 Fix timeout
8e16780 Beaker stuff
4c3bf70 Beaker fixes
3172a1c Shuffling
fe3c9a2 Creds and other things
a3b6962 fix
83bb1dc Dockerfile fixes
6c9c785 Using version strings
9610eac Secrets management
39256c1 Beaker running
867e2c9 Docker builds
a091412 Starting to play with docker too
bce85e6 pipeline
a085e8c Beaker test
910c2eb Downloads from s3 based on hash
6598e2d Control http session at the worker level
fbacdd0 Stuff
ae9b1c4 Better stats
9ce28c0 Measuring metrics better now
193e521 Semaphore timeout
102c0e4 new version of sglang, server restarts, semaphore timeouts
918e2f3 Pipeline stuff
691cc5a A few items
4f2f4fd Quicker results by limited workers via semaphore while still utilizing gpu
6154095 Logging and perf stuff
ade3580 Fixes
732300a Some errors dealt with
24a9d23 Trying to get reliability up
fedda40 Small fixes
a9a94f2 Code to get stats
6b625b2 Bugfixes
9fb464c Refactoring to assemble docs
da1b23f Minor fixes
9ff107b Merge branch 'main' of https://github.com/allenai/pdelfin into main
299819e Reqs
9d51935 some cleanups
6590164 Starting to work
82ec249 Progress
37dc412 Working on script
e5fb7c0 Organization
ee72b36 Starting up server and workers async now
a39350e Reworking to be async
a103ce7 Some small things
b15bff6 Work queue coalescing
57186c7 Doing some more stuff
923231e exit handlers
051a7b4 Prepping work script
a65e12b Model download stuff
12a91ff Starting on a new approach
faf8659 Putting aside redis
3d6be3c Work queue sharing thing
75d4a0e Experimental beaker pipeline self organizing redis idea
a14febc sglang support for runeval
592cc50 More docs
03f5b25 Docs good now
d89ea6b docs
0362ce6 docs
b2b3f06 docs
46ccab3 More docs
93d7068 More docs
73bd961 Logger fix
3778228 More docs
ef2e4d6 Adding more docs
5ebc8cd Checkfix
9f010e6 Add check for poppler installation
be8fb28 Update README.md
426fda1 Removing some logs
500bd2d flash attn
d45b34f Trust remote code
cda0ad7 Config typo
cf3b377 train script
8f001bf Config updates
6a4a55f Hopefully working molmo HF trainer config
bede854 Starting to write molmo formatters
e65747e Some better logging
a0e0917 Merge branch 'main' of https://github.com/allenai/pdelfin into main
43aa4f2 Proper selection of LORA weights
bcb4794 Starting on molmo changes
232c445 Pipeline stability fixes hopefully and logging
ce2e4ba Applying rotation corrections
08d51b7 Adding some rotation retry control
7678f31 Fixing some reliability issues with the pipeline script
45269fa Switching to logging vs prints
a3e7654 Update all docs at once
062abff Adding some skip logic
8e6d0c6 Switching to orjson, some better json error handling
48a3aff Reindexing
f13d0a5 List configs to list
ffe470b Fix
180dde0 dataprep sampling tests
64041bd Allow sampling different anchor text lens
6a22900 Allow for sampling anchor and other params
999f64d Adding empty anchor support
f8c5aac Some cleanup
a1a4798 Some crazy idea I had to simplify futures and memory limits
f6ac591 vllm benchmarker
4047258 Fixing one old bug to make update_static atomic
38dc5a2 Refactored to have a more efficient batchwriter, and also not allow too many running futures
d99096e Adding vllm profile script for reference
0a5c506 index
7c78676 Fix pipeline bug with indexing
31becaf S2orc dataset extractor
302eee3 Yay matches between birr and hf
f44dbd1 Small fixes
a482271 train more steps
c9ac48b Try to save at the last second only
9d35d3c Birr tokenization test
77f0b9f help text
7dbcbc1 Birr tests that don't do anything but help me understand the universe
492a3f6 Adding parameters for target image and anchor text sizes
1c8602c Removing rotation invalid ones to see what happens
dd4f967 Filter refactor
3ecbeae Trying save to s3 but with threaded saver
5ba78ed Fix
89fcff2 Fixing saving bug again
7d4cff5 Nice test for picking proper page in birrpipeline
a4d7620 Choosing proper page
529d51d Put LR back, need to save larger checkpoints to weka to prevent timeouts
e141c91 Try lora run higher LR
2826bca Yay all unit tests pass cleanly now too
124aaf5 Hmm, can't repro failing anchor case
1c42a08 Fixes to prevent errors later in dataloading
f13bcad Adding check that pdfs are valid in the new anchor text generation format
5018d59 will try lower lr
5c36c22 Prepping for more training
063be21 New image
90cb80f Docker update
277723f Adding cache
87182ab Ensuring unique names
4884b82 Full dataset
51f1669 fix
d94713e Truncation handled in a custom collator
cbc667c Prepping to train
9d647b1 fix
446773d First part of new dataloader
202d81c Merge branch 'main' of https://github.com/allenai/pdelfin into main
e2552b2 Adding test case
7b16153 Code to do local inference on fine tuned models for testing
5a7377a Refactoring
4fd6066 gpt cleanup
a45f86e More cleanup
53fdb61 More pipeline code
10b7a58 fix
f477a68 dbmanager
2dccc4b Oops removing print
aea3f7f Fix for anchor generation on pdfs with no text elements
af03358 assemble
312847a Ok, finally working nicely to build the page index
312ee8d pipeline script
49b5b23 Working on new pipeline script
a8b50ae Preloading the datasets directly
85f2dc6 Fixes
2864f90 Dataloader fix with nicer tests
b7c80cd Fix up some tests but I don't see why this isn't working
3245990 Faster eval script
931f48c Allow eval script to support one more type of jsonls, runpipeline multiglobs, other fixes
c6bdf69 First stab at document assembly
847064f Taking notes, starting on document assembly
8e5809d runpipeline
a90feda bugfixes
c2909f3 run pipeline
954b19a Stuff
991b213 Refactoring, starting to write run_pipeline
4bf6e7a Refactoring
0c56dec Adding diff to tinyhost
400e921 Unifying some of the pdf rendering stuff
dc6440d Cleaning up anchor text to deal with abnormally long lines
b6b74b7 Rewriting prompts to eval with new model
7c19a9a fix
ad10add try lower lr
230c8a9 Trying new run that will rewrite the prompts as it goes
97291b3 Anchor is fixed to sample text elements better
c8a4d14 Adding image merging to pdf report/hint/anchor
57d9a21 Adding prompt length histogram to a script
adc702c Fixing wandb key
0859378 Lower lr
4b30dd8 Fixing eval script, working FSDP config
f5fd9ff Trying grad checkpoint
4fb7e9b Updated eval script
fb4e585 Trying out non-lora training
ec09408 Filtering based on cpu count
a90eb94 Fix dataloader bug
3d36545 loading fix for parquets again...
fdcd77e typo
7416b42 Adding support for parquet datasets which are precached
dc26541 Starting code to build parquets...
4557a5b Typo
e973de7 Typo
ebd40f9 Hopefully fixing dataloader for now
5d35461 Fix for unicode errors in big datasets for the future
44bcdc7 Hopefully can use weka for the train datasets now
d8e459c Weird issue with surrogate pairs in json
98020ca Allow loading files locally
13123dd Pinning datasets to work around weird issue
568dd48 Prepping for qwen2vl full training run
6065da2 Hopefully working better
a2ff849 checkpoint on new runner for openai batches
2da901d new better runopenaibatch script
35ec67c Hopefully finishing touches
db36608 Fix
f25cb6c Fixes
4630f7b Bugfixes
e87729a New send silver script for testing
6e1094e Support for more evals and output formats
974ddd3 I'm pretty sure we only need to save on rank0
8f1fa4f Running a mini config again with metric
046d4a4 Adding eval on start and seed params
2227605 Mini train config
4505a49 Pinning to normal transformers version now
78e3a94 Adding pluto ib
0ddaf90 Getting ready to launch a new training run
1686790 Checking filtering logic
b340ae5 A few notes, starting to test dataloader with new structured response format
8315162 Merge branch 'main' of https://github.com/allenai/pdelfin
6d8e638 Readme
68b9ee8 Small prompt fix
a5c2721 Need more token output due to structured outputs
d05832e Fixes and evals for structured outputs
802632c Building openai prompt with structured output
be00ccf Switching buildsilver to use new anchor code
0071cbd Appears as if the report method works really well, might need one last step to detect rotated pages
5703a59 Fix for voting on multiple docs in the same eval page
73fb81e Review page size option, fixing mkdirs in convertsilver script
276465a Adding flag to allow skipping filter
549e07b filtering out stupid ads
6ef8226 Can spit out anchor text for a gpt engine using pypdf, showing locations of images and text
e42cecf Adding anchor code based off of pypdf that visits each text block, hopefully so we can make it output good bboxes
09e8840 coherency based anchor text
28fe314 prepping anchor text generation code
7795f65 Fixing bug where we were not showing all the worst alignments
9d6e2fa Runeval is much improved now
8a66ece Script to rerun openai prompts on the same data
f99f6a6 Prompt utils
b6543a4 Qwen checkpoint fixer script
2c7323d Convert silver adjustments
80bb0cb Open ai to openai comparison now supported, new prompts
e179453 Fixing qwen checkpoint script
963e946 Convertsilver birr script can go in and out of S3 now
b856b45 Fixes to convertsilver to birr script
da1982a Refactoring prompts into their own new folder
d74f9a3 Send silver script tries to open file first, before sending an API requests
1216d9c retrieve silver script reports errors better
b4e9d6a Buildsilver script supports reservoir sampling so it can sample 100M+ paths now efficiently
8ec9e35 dataprep issue
e53f782 Datasetdict fix
decfd7f Fixing the refiner input prompt to something simpler that doesn't depend on the training data. Fixing beaker job workspace and bumping priority to high.
22b765e Going back to non iterable dataset, so shuffling works better, applying a light filter
65a9c99 Hopefully will train now
e864b9d weird dataloader stuff now
37f1005 typo
c00e40d More fixes
d098a87 Column name fix
84e9da6 Removing lambda due to pickling errors
61dd7bb Fix for map in iterable mode
49efa5c Typo
cf1aa01 Proper use of iterable_dataset
05fdb81 map and filter on iterable dataset
f14e910 bnb
7707bc0 trying cheaper optimizer to solve ooms
a0bec4e 7b scripto
385c1bf Lora config
24b30b2 Prepping for 7b training
5f9b234 Some prompt tweaks I thought of for next time
8ebe751 Merge branch 'main' of https://github.com/allenai/pdelfin
ed50254 Adding script to convert silver data that we send to openai into something we can run through mise/birr
4fb78c2 Fixing runeval to work with qwen2vl batch inferences
2579931 Merge branch 'main' of https://github.com/allenai/pdelfin
a50ffe2 Adding in eval scripts from oe-data-internal now all in one place
e64d4f7 More pip stuff
bf1239d Use mini dataset now for testing
596fc55 Enabling model eval
5a0bcb7 batch inference slowness
28bcf72 Hoping to get a quick batch inference pipeline rolling
45f691c Starting batch inference script to measure performance, train script using proper model from config now
b0777dc missing libaio
1bb222b Datasets version
7b76b66 extra index
5287ba5 Back to pip... sigh
357f2c6 More env stuff
491b738 env fix
1cf3cd8 Had to switch to conda env override for gantry due to cu118 compat
cb0b97a Gantry requirements
1579397 Merge branch 'main' of https://github.com/allenai/pdelfin
0691e1a chmodding
79feb98 Merge branch 'main' of https://github.com/allenai/pdelfin into main
a3feca0 Setting up for a real train run
0812b0d Prepping for gantry
f78d021 Should be merging the LORA adapters back into the model for the final checkpoint
5967a52 Flash attention and mixed precision training, works quite a bit faster
a778225 Merge branch 'main' of https://github.com/allenai/pdelfin into main
45e5823 Much happier gpu utilization
dc71b28 No need to save tokenizer
5916239 typos
ea3af01 Loading dataset from config now
ab9458b Basic LORA trainer, doesn't seem to make any speed difference
3ed14a9 Prepping new training stuff
b915e7d Smaller config for now, fixing a few requirements
256d77c Hoping to get a basic hf Trainer to run
55035b0 Tries to run a forward pass but OOMs
4eddb1b Okay, reasonably happy with the dataprep pipeline
a47afe5 Adding test to make sure the training and inference time tokenization stays identical, currently failing
fcb67eb Prepping data to be in a trainable format
dc86a99 Pyproject dependency cleanup
962fb7e merge
0cc2b5d Pyproject stuff
84e68f3 Basic forward generation pass with openai dataset and qwen2vl
7d2c447 Importing core training config stuff from dolma refine
bab32aa Formatting
f4d18cb Dataloader capable of loading 38k rows reasonably fast
d22b311 Starting to write dataloader for visual lm data
fb4fc42 Fixing close file warning
af2126d 450tok/sec/core with smollm that appears to work well
2f71cb9 Using SmolLM, seems a lot better and is able to pass some tests
57e80aa Testing coherence with distilgpt2, but it doesn't work great
cb9b6ef Trying distilgpt2 instead of kenlm
01bc0b2 Moving a whole bunch of code over, still broken
a534a01 Moving pdf filter code over with tests
9662718 Running personalize script on template
7d71e2d Update README.md
68b2c0e Initial commit