Steffen/cleanup #1

Open
wants to merge 386 commits into
base: main
Changes from 1 commit
386 commits
ec23196
Implement cleanup loop in validator and associated local model store …
Dec 27, 2023
8b59645
Add implementations for storing/retrieving data on chain and in Huggi…
Dec 27, 2023
5730065
Format all files for consistency. (#3)
Dec 27, 2023
beeb0c1
Refactor to use hotkeys not uids for miner identification. (#4)
Dec 27, 2023
f4a0ad3
Adds the Perf Monitor
Dec 28, 2023
3bb2289
Merge pull request #5 from RaoFoundation/perf-tracker
Dec 28, 2023
1b180c2
Merge branch 'dev' into miner_tracker
Dec 28, 2023
cac0feb
Improve model tracker comments and logging.
Dec 27, 2023
bcb801d
Delete .vscode/settings.json which is now in the ..gitignore.
Dec 29, 2023
b8a3193
Merge pull request #6 from RaoFoundation/miner_tracker
Dec 29, 2023
07eaedb
Merge pull request #7 from RaoFoundation/model_cleaner
Dec 30, 2023
fac0eb0
Add helper to get hash of directory.
Dec 30, 2023
71e789f
Add logic to redownload and get hash in upload_model.
Dec 30, 2023
9768cb7
Update to only store model for hash in a tmp folder.
Dec 30, 2023
762792d
Address PR feedback.
Dec 30, 2023
6d674ae
Merge pull request #8 from RaoFoundation/dirHash
Dec 30, 2023
3f90a28
Update the Miner
Dec 30, 2023
56e8137
Address feedback
Dec 30, 2023
b6e4e47
More PR feedback
Dec 30, 2023
2755cf9
Merge pull request #9 from RaoFoundation/miner-updates
Dec 30, 2023
f78cfd3
Update model tracker to track metadata.
Dec 30, 2023
c891add
Update validator eval loop to use new stores.
Dec 30, 2023
7f17118
Miner fixes
Dec 30, 2023
6d656fa
Merge pull request #11 from RaoFoundation/miner-fixes
Dec 30, 2023
d33d4fb
Use AutoModelForCausalLM.
Dec 30, 2023
8c1b7ac
Also update mining test to use same model type.
Dec 30, 2023
c8855a5
Merge pull request #12 from RaoFoundation/autoModelLM
Dec 30, 2023
29ae1a0
Pass netuid to the chain store
Dec 31, 2023
c94f391
Handle exceptions calculating miner losses.
Dec 31, 2023
b4d2325
Support loading a non hugging face saved model
Dec 31, 2023
d3d2f3d
Make a new wandb run for the validator if logging there.
Dec 31, 2023
38a11a5
Merge pull request #14 from RaoFoundation/miner-fixes2
Dec 31, 2023
44a3e73
Address PR fixes.
Dec 31, 2023
67574c9
Merge branch 'dev' into valEval
Dec 31, 2023
ced19a6
Add size check before downloading from hugging face.
Dec 30, 2023
cd0e717
Merge pull request #10 from RaoFoundation/valEval
Dec 31, 2023
4f773e8
Merge pull request #13 from RaoFoundation/checkRepoSize
Dec 31, 2023
7f66a39
Add checks in Model Updater for bad models.
Dec 31, 2023
b0fc67d
Merge pull request #15 from RaoFoundation/exceptOnBadModels
Dec 31, 2023
d6f2904
Improve test logging.
Dec 31, 2023
db56c27
Collected fixes.
Dec 31, 2023
de6edac
Exception handling improvements.
Dec 31, 2023
3004f06
Fix update loop sleep logic when revisiting recently.
Dec 31, 2023
928b61f
Uid state handling fixes.
Dec 31, 2023
a8e8a2f
Sleep in run step for readability.
Dec 31, 2023
3d22ac1
Align local and remote directory pathing.
Dec 31, 2023
81edb9d
Compute_losses on the pt_model not the Model.
Dec 31, 2023
695e738
Validator wandb run logging fixes.
Dec 31, 2023
359fa9c
Update comments on expected directory structure.
Dec 31, 2023
d5b7825
Merge pull request #16 from RaoFoundation/vali-fixes
Dec 31, 2023
adaf416
Add a new tool to upload a trained model
Dec 31, 2023
a3e073f
Merge pull request #17 from RaoFoundation/miner-push-only
Dec 31, 2023
88c8216
Clean-up
Dec 31, 2023
7d80098
Create a new validator wandb run every 100 run steps.
Dec 31, 2023
87d4f88
Merge pull request #18 from RaoFoundation/clean-up
Dec 31, 2023
8fdb630
Add auto-update script
Dec 31, 2023
28e5769
Fix directory hash after downloading models.
Dec 31, 2023
4b08bdf
Merge pull request #20 from RaoFoundation/auto-update
Dec 31, 2023
da72955
Merge pull request #21 from RaoFoundation/hash_location_fix
Dec 31, 2023
ee0b22e
Merge pull request #19 from RaoFoundation/new_wandb_runs
Dec 31, 2023
3b27b56
Remove unused import
Dec 31, 2023
c57f2ef
Merge pull request #22 from RaoFoundation/logs
Dec 31, 2023
9da0a5c
Split out miner/vali docs and update.
Dec 31, 2023
0278385
Improve Miner docs.
Jan 1, 2024
175a58c
Merge pull request #23 from RaoFoundation/docs
Jan 1, 2024
ae42a47
Update scoring temperature to 0.04.
Jan 3, 2024
ef67494
Merge pull request #24 from RaoFoundation/temp_update
surcyf123 Jan 3, 2024
0ddead7
Update validator score boosting of earlier models.
Jan 3, 2024
43d6a6a
Merge pull request #25 from RaoFoundation/epsilon_update
surcyf123 Jan 3, 2024
972950a
Merge pull request #26 from RaoFoundation/dev
Jan 3, 2024
93a1d98
Formatting fixes for miner docs
Jan 3, 2024
8e91a9f
Merge pull request #27 from RaoFoundation/doc-format
Jan 3, 2024
2aac764
Merge pull request #28 from RaoFoundation/dev
Jan 3, 2024
4c9f60f
Fix for pending uids to eval in next loop.
Jan 5, 2024
3d65475
Merge pull request #29 from RaoFoundation/updatedEvalCheck
Jan 5, 2024
5e2aaa9
Also update to a new uids file.
Jan 5, 2024
9958cd4
Merge pull request #30 from RaoFoundation/updatedEvalCheck
Jan 5, 2024
9cb69dc
Merge pull request #31 from RaoFoundation/dev
Jan 5, 2024
edeac8d
Realize symlinks on download from remote store.
Jan 8, 2024
0ac4c65
Update to improve error logging around failures to parse the metadata…
Jan 9, 2024
bf1dc9d
Model_id locality fix.
Jan 9, 2024
b3dde1a
Merge pull request #32 from RaoFoundation/log_improvements
Jan 9, 2024
173ad38
Merge pull request #33 from RaoFoundation/remove_symlink
Jan 9, 2024
88ee418
Merge pull request #34 from RaoFoundation/dev
Jan 9, 2024
a8abc8b
Add a notebook to check latest vali perf
Jan 13, 2024
62ca7e9
Clear all outputs
Jan 13, 2024
143f0cd
Merge pull request #35 from RaoFoundation/vali-perf
Jan 14, 2024
014531d
Increase max model size to 186M
Jan 15, 2024
960163e
Perform a full eval after vali upgrade
Jan 15, 2024
560d8e6
Make the clean loop delay larger
Jan 15, 2024
2eea7a5
Update the miner docs
Jan 15, 2024
8b46e81
Keep losses to math.inf when failing to evaluate model.
Jan 15, 2024
6647082
Merge pull request #38 from RaoFoundation/model_loss_none_fix
Jan 15, 2024
ddc0e58
Merge pull request #36 from RaoFoundation/vali-updates
Jan 15, 2024
d1d4b50
Include repo_id in error messages
Jan 15, 2024
6ed4577
Merge pull request #39 from RaoFoundation/improve-errors
Jan 15, 2024
25b91ed
Read back the metadata commit after writing
Jan 15, 2024
2300785
Merge pull request #40 from RaoFoundation/dev
Jan 15, 2024
ae65103
Merge pull request #41 from RaoFoundation/read-metadata
Jan 15, 2024
b1a0bdd
Update setup.py to point to new version location.
Jan 16, 2024
714dff7
Correct the docs
Jan 17, 2024
56b4a52
Merge pull request #37 from RaoFoundation/model-increase
Jan 17, 2024
27fa33b
Merge pull request #42 from RaoFoundation/setup_fix
Jan 17, 2024
edf58fb
Bump version
Jan 17, 2024
9c25951
Merge pull request #43 from RaoFoundation/bump-version
Jan 17, 2024
06eecdd
Merge pull request #44 from RaoFoundation/dev
Jan 17, 2024
c9ec6bc
Simplify the mining API
Jan 20, 2024
5d45fc7
Merge pull request #45 from RaoFoundation/api
Jan 20, 2024
ac31bb6
Run each eval in a subprocess to avoid a bad model being able to corr…
Feb 2, 2024
bdae9e6
Merge pull request #46 from RaoFoundation/debug
Feb 2, 2024
198e103
Remove model with inf loss
Feb 2, 2024
1f96e89
Fix dict .get()
Feb 2, 2024
45595cc
Merge pull request #47 from RaoFoundation/remove-bad-miners
Feb 2, 2024
65b29aa
Clean-up accidental test code
Feb 2, 2024
563dfdb
Merge pull request #48 from RaoFoundation/clean-up2
Feb 2, 2024
4402b91
Merge pull request #49 from RaoFoundation/dev
Feb 2, 2024
4d09328
Correctly call is_dir() method.
Feb 2, 2024
9a6695d
Add test for is_dir() behavior.
Feb 3, 2024
c563c26
Log but do not throw for expected model sync failures.
Feb 3, 2024
3ab91cd
Only keep hotkeys to be evaluated in storage.
Feb 3, 2024
c2c8f6a
Only allow at most 10 new models to be pending eval.
Feb 3, 2024
eb6b471
Merge pull request #50 from RaoFoundation/is_dir_fix
Feb 3, 2024
34e08e0
Merge pull request #51 from RaoFoundation/downgrade_model_size_log
Feb 3, 2024
7b1e494
Add lock around metagraph for sub threads and remove grace period on …
Feb 3, 2024
f877806
Merge pull request #52 from RaoFoundation/limit_stored_models
Feb 3, 2024
8dba6f3
Merge pull request #53 from RaoFoundation/limit_pending_models
Feb 3, 2024
69c2749
Only filter out uids with weights at 0 in addition to inf loss.
Feb 4, 2024
c496bf2
Merge pull request #54 from RaoFoundation/inf_and_weight_check
Feb 4, 2024
45d9bc1
Move state file to the model dir
Feb 4, 2024
bcb696e
Merge pull request #55 from RaoFoundation/perplexity
Feb 4, 2024
1f7345d
Revert "Only allow at most 10 new models to be pending eval."
Feb 4, 2024
172e4e3
Merge pull request #56 from RaoFoundation/revert-53-limit_pending_models
Feb 4, 2024
c8a9eba
Only allow at most 20 new models to be pending eval.
Feb 3, 2024
47a444c
PR Feedback.
Feb 4, 2024
c247220
Handle shutil.rmtree FIleNotFoundError.
Feb 4, 2024
a89a67f
Merge pull request #58 from RaoFoundation/shutil_exception
Feb 4, 2024
4c313ce
Merge pull request #57 from RaoFoundation/limit_pending_models
Feb 4, 2024
2d86ecd
Catch all exceptions from shutil rmtree.
Feb 4, 2024
613fe76
Merge pull request #59 from RaoFoundation/catch_all_rmtree
Feb 4, 2024
c952148
Reapply grace period of 300s.
Feb 4, 2024
56e1665
Catch exceptions in the clean-up loop.
Feb 4, 2024
47e166d
Add handling around computation of file timestamps if the file no lon…
Feb 4, 2024
f6206de
Merge pull request #60 from RaoFoundation/grace_reapply
Feb 4, 2024
7702da1
Merge pull request #61 from RaoFoundation/catch-cleanup
Feb 4, 2024
78864de
Update docs to point to the leaderboard
Feb 4, 2024
4321f85
Fix get_newest_datetime_under_path to get newest not oldest.
Feb 4, 2024
fbbd159
Merge pull request #63 from RaoFoundation/get_latest_under_path_fix
Feb 5, 2024
40f31f8
Standardize the loss function
Feb 5, 2024
5a4ebd0
Bump version
Feb 5, 2024
fb44be8
Merge pull request #66 from RaoFoundation/loss
Feb 5, 2024
71dd311
Merge pull request #65 from RaoFoundation/bump_version
Feb 5, 2024
1dffefc
Merge pull request #62 from RaoFoundation/update-docs
Feb 5, 2024
7e3b2c4
Merge pull request #67 from RaoFoundation/dev
Feb 5, 2024
430cb5a
Require models have max_position_embeddings=1024.
Feb 11, 2024
ccba669
Also reduce severity of logs when failing to download model.
Feb 11, 2024
3b2d967
Update spec version to 2.2.1 to ensure validators get new state.
Feb 11, 2024
b341ed6
Restrict model types.
Feb 11, 2024
cdb622d
Move list of allowed models to constants.
Feb 11, 2024
c8573f9
Merge pull request #69 from RaoFoundation/restrict_model_types
Feb 11, 2024
bd1f026
Merge pull request #70 from RaoFoundation/dev
Feb 11, 2024
a8da485
Update docs for allowed model types.
Feb 11, 2024
d06c77e
Merge pull request #71 from RaoFoundation/doc_update
Feb 11, 2024
e18fdd4
Add tool for running a benchmark
Feb 13, 2024
4cf9e0b
Remove test notebook
Feb 13, 2024
f94fc93
Merge pull request #72 from RaoFoundation/benchmarks
Feb 13, 2024
28a1afe
Allow larger models after a defined block
Feb 14, 2024
81c8b78
Increase max repo size
Feb 14, 2024
d8a7bdc
Add gpt2-large to benchmark
Feb 16, 2024
2234478
Merge pull request #73 from RaoFoundation/block-max
Feb 16, 2024
937afae
Merge pull request #74 from RaoFoundation/add-gpt2-large
Feb 16, 2024
7f1ec1e
Merge pull request #75 from RaoFoundation/dev
Feb 16, 2024
4e5cc6d
Update README.md
dougsillars Feb 16, 2024
7140510
Load model in the subprocess to avoid pickling
Feb 21, 2024
0f237c2
Fix missing method
Feb 21, 2024
725365f
Bump ttl to 150 seconds
Feb 21, 2024
3ab1102
Bump tranformers version
Feb 21, 2024
dd26bcc
Merge pull request #78 from RaoFoundation/bump-transformers
Feb 21, 2024
abb7496
Track total eval perf
Feb 21, 2024
6122352
Don't bump spec version
Feb 21, 2024
7c5fe35
Clean-up vali-perf notebook
Feb 21, 2024
4091575
Merge pull request #77 from RaoFoundation/qol
Feb 21, 2024
98a21b5
Revert "Merge pull request #77 from RaoFoundation/qol"
Feb 22, 2024
4fccab7
Merge pull request #80 from RaoFoundation/undo-77
Feb 22, 2024
c935923
Increase alpha. Log weight failures
Feb 22, 2024
c06992f
Merge pull request #81 from RaoFoundation/alpha
Feb 22, 2024
d2faaec
Merge pull request #79 from RaoFoundation/dev
Feb 22, 2024
5409309
Update model size on downloads based on block.
Mar 17, 2024
8c13811
Use optimizations at new block for inference.
Mar 18, 2024
c8cb2b8
Limit model types based on block.
Mar 18, 2024
e8206a7
Run inference with sequence length based on block.
Mar 18, 2024
9f8ae23
Doc updates.
Mar 18, 2024
3f5748c
Adjust temperature to prioritize top 1 model.
Mar 19, 2024
bd7501c
Adjust to only keep 10 best models + eval up to 15 new per loop.
Mar 19, 2024
90b870f
Check for updates to models with incentive first.
Mar 19, 2024
28916ff
Remove notebook and update cadence for check.
Mar 19, 2024
ea91667
Update to only 6 min, 14 max models by default.
Mar 19, 2024
c83a787
Fix docs + increase time for eval + adjust sample model parameters.
Mar 19, 2024
4957e80
Refactor to use ModelParameters + pass sequence length.
Mar 20, 2024
e520ff1
Rename to Model Criteria for clarity.
Mar 20, 2024
d8af206
Update docs to point to correct line for ModelCriteria.
Mar 20, 2024
2e9d6dd
Check generated outputs before calculating losses.
Mar 22, 2024
82e74a3
Send inputs to the same device as the model.
Mar 22, 2024
7eb4b4e
Refactor check out to a helper function.
Mar 22, 2024
1177610
Bump spec version to force reload of models.
Mar 22, 2024
6160d49
Pass tokenizer eos token id to remove warning message.
Mar 22, 2024
d80f965
Start iterator at 200 for fresh start.
Mar 22, 2024
b4d1207
Merge pull request #86 from RaoFoundation/disallow_attn
Mar 22, 2024
706f659
Merge pull request #87 from RaoFoundation/dev
Mar 22, 2024
99afe25
Update to use 6.9 params, 8192 seqeuence length, and block 2735661.
Mar 23, 2024
56a3713
Update to 24 pages and add clarify TFLOPs required.
Mar 23, 2024
0f26862
Update documentation on vali requirements and flash-attn requirements.
Mar 24, 2024
fadbe82
Merge branch 'dev' into next_milestone
Mar 24, 2024
7ae6d0c
Merge pull request #83 from RaoFoundation/next_milestone
Mar 24, 2024
5213654
Merge branch 'dev' into eval_loop_adjustments
Mar 24, 2024
3c1c44a
Merge pull request #76 from dougsillars/main
Mar 24, 2024
99e0588
Merge pull request #84 from RaoFoundation/eval_loop_adjustments
Mar 24, 2024
fd4681c
Add a new tokenizer for 7B
Mar 21, 2024
cd9819a
Bump to 6 minute timeouts and go back to random iterator start.
Mar 24, 2024
fe2a0c3
Update to 4k seq length + lower pages + adjust tokenizer.
Mar 24, 2024
fca0dd4
Pass pad token id to avoid instantiating new tokenizer every loss com…
Mar 24, 2024
732f904
Add Model Criteria for block 0 and improve logging.
Mar 24, 2024
4309982
Calculate average loss correctly in log_step.
Mar 25, 2024
e8bfe81
Move to GPT4 tokenizer instead of GPT3_5.
Mar 27, 2024
0771aaa
Push switchover block out by a week.
Mar 27, 2024
c0cf96c
Merge pull request #88 from RaoFoundation/update_tokenizer
Mar 28, 2024
18f0056
Merge pull request #89 from RaoFoundation/dev
Mar 28, 2024
d9fe3a1
Raise threshhold for unreasonable output and keep models with weights.
Mar 28, 2024
ae1fd35
Also prioritize keeping higher weights when filtering.
Mar 28, 2024
8b1e8bb
Adjust output lengths and check reptitiveness for all outputs.
Mar 29, 2024
bf5cc6e
Handle failures to load tracker state gracefully.
Mar 29, 2024
43b2428
Also test redownloading works as expected.
Mar 29, 2024
24f4b76
Merge pull request #91 from RaoFoundation/handle_corrupt_state
Mar 29, 2024
e72efee
Refactor model prioritization for clarity + correctness.
Mar 29, 2024
63271e1
Handle failures to load uids to eval state gracefully.
Mar 29, 2024
2b71a5d
Wipe tracker state in case of no uids to eval.
Mar 29, 2024
8a73df8
Also wipe the state in case of multiple bad restarts.
Mar 29, 2024
63bd73e
Merge pull request #90 from RaoFoundation/improve_model_check
Apr 1, 2024
d0716fd
Merge pull request #93 from RaoFoundation/eval_state
Apr 1, 2024
fee6b41
Retry evaluation for discarded models with incentive periodically.
Apr 1, 2024
9347268
Merge pull request #94 from RaoFoundation/retry_incentive
Apr 1, 2024
2c377cf
Merge pull request #95 from RaoFoundation/dev
Apr 1, 2024
9a6e0c0
Initialize uids_to_eval as set().
Apr 2, 2024
5a36e47
Fix docstring
steffencruz Apr 3, 2024
feb620c
Enable uploading a model with bfloat 16.
Apr 12, 2024
1e6e6ef
Add 7b models to the benchmark script
Apr 12, 2024
7b8e7b5
Default to upload with b16 for manual upload.
Apr 12, 2024
b1247e8
Merge pull request #96 from RaoFoundation/type_fix
Apr 12, 2024
6948e7a
Merge pull request #100 from RaoFoundation/benchmark-7b
Apr 12, 2024
57e5f82
Merge pull request #98 from RaoFoundation/upload_arg_opt
Apr 12, 2024
2477d4a
Merge branch 'dev' of github.com:RaoFoundation/pretraining into steff…
steffencruz Apr 13, 2024
Update to improve error logging around failures to parse the metadata string.
Sid committed Jan 9, 2024
commit 0ac4c653a84250357bd1e6b4fc2ba5f8fea7ab30
4 changes: 3 additions & 1 deletion model/model_updater.py
@@ -33,7 +33,9 @@ async def sync_model(self, hotkey: str) -> bool:
         metadata = await self._get_metadata(hotkey)

         if not metadata:
-            bt.logging.trace(f"No metadata found on the chain for hotkey {hotkey}")
+            bt.logging.trace(
+                f"No valid metadata found on the chain for hotkey {hotkey}"
+            )
             return False

         # Check what model id the model tracker currently has for this hotkey.
12 changes: 11 additions & 1 deletion model/storage/chain/chain_model_metadata_store.py
@@ -46,6 +46,7 @@ async def retrieve_model_metadata(self, hotkey: str) -> Optional[ModelMetadata]:
         partial = functools.partial(
             bt.extrinsics.serving.get_metadata, self.subtensor, self.subnet_uid, hotkey
         )
+
         metadata = utils.run_in_subprocess(partial, 60)

         if not metadata:
@@ -55,7 +56,16 @@ async def retrieve_model_metadata(self, hotkey: str) -> Optional[ModelMetadata]:
         hex_data = commitment[list(commitment.keys())[0]][2:]

         chain_str = bytes.fromhex(hex_data).decode()
-        model_id = ModelId.from_compressed_str(chain_str)
+
+        try:
+            model_id = ModelId.from_compressed_str(chain_str)
+        except:
+            # If the metadata format is not correct on the chain then we return None.
+            bt.logging.trace(
+                f"Failed to parse the metadata on the chain for hotkey {hotkey}."
+            )
+            return None
+
         model_metadata = ModelMetadata(id=model_id, block=metadata["block"])

         return model_metadata
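
Review note: the change above decodes a `0x`-prefixed hex payload and returns `None` when the string cannot be parsed. A bare `except:` also catches `KeyboardInterrupt` and `SystemExit`, so a narrower clause may be preferable. The sketch below is one way to express the same decode-then-parse step, not the repository's implementation. `SketchModelId`, its `namespace:name` string layout, and `parse_chain_commitment` are hypothetical stand-ins, since the real `ModelId` format is not shown in this diff.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SketchModelId:
    """Hypothetical stand-in for ModelId; the real field layout is an assumption."""

    namespace: str
    name: str

    @classmethod
    def from_compressed_str(cls, cs: str) -> "SketchModelId":
        # Assumed format "namespace:name"; anything else raises ValueError.
        parts = cs.split(":")
        if len(parts) != 2 or not all(parts):
            raise ValueError(f"Unexpected metadata format: {cs!r}")
        return cls(namespace=parts[0], name=parts[1])


def parse_chain_commitment(hex_with_prefix: str) -> Optional[SketchModelId]:
    """Decode a 0x-prefixed hex payload and parse it, returning None on bad data."""
    try:
        # Drop the leading "0x", mirroring the [2:] slice in the diff.
        chain_str = bytes.fromhex(hex_with_prefix[2:]).decode()
        return SketchModelId.from_compressed_str(chain_str)
    except (ValueError, UnicodeDecodeError):
        # Invalid hex, non-UTF-8 bytes, or an unexpected string layout.
        return None


# Usage: a well-formed payload parses; a malformed one yields None.
payload = "0x" + "alice:tiny-model".encode().hex()
print(parse_chain_commitment(payload))
print(parse_chain_commitment("0xzz"))
```

Unlike the diff, this sketch also guards the hex and UTF-8 decoding, which can fail on malformed chain data as well. Whether that wider guard is wanted is a design choice for the maintainers.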