Steffen/cleanup #1

Open · wants to merge 386 commits into base: main

386 commits
ec23196
Implement cleanup loop in validator and associated local model store …
Dec 27, 2023
8b59645
Add implementations for storing/retrieving data on chain and in Huggi…
Dec 27, 2023
5730065
Format all files for consistency. (#3)
Dec 27, 2023
beeb0c1
Refactor to use hotkeys not uids for miner identification. (#4)
Dec 27, 2023
f4a0ad3
Adds the Perf Monitor
Dec 28, 2023
3bb2289
Merge pull request #5 from RaoFoundation/perf-tracker
Dec 28, 2023
1b180c2
Merge branch 'dev' into miner_tracker
Dec 28, 2023
cac0feb
Improve model tracker comments and logging.
Dec 27, 2023
bcb801d
Delete .vscode/settings.json which is now in the ..gitignore.
Dec 29, 2023
b8a3193
Merge pull request #6 from RaoFoundation/miner_tracker
Dec 29, 2023
07eaedb
Merge pull request #7 from RaoFoundation/model_cleaner
Dec 30, 2023
fac0eb0
Add helper to get hash of directory.
Dec 30, 2023
71e789f
Add logic to redownload and get hash in upload_model.
Dec 30, 2023
9768cb7
Update to only store model for hash in a tmp folder.
Dec 30, 2023
762792d
Address PR feedback.
Dec 30, 2023
6d674ae
Merge pull request #8 from RaoFoundation/dirHash
Dec 30, 2023
3f90a28
Update the Miner
Dec 30, 2023
56e8137
Address feedback
Dec 30, 2023
b6e4e47
More PR feedback
Dec 30, 2023
2755cf9
Merge pull request #9 from RaoFoundation/miner-updates
Dec 30, 2023
f78cfd3
Update model tracker to track metadata.
Dec 30, 2023
c891add
Update validator eval loop to use new stores.
Dec 30, 2023
7f17118
Miner fixes
Dec 30, 2023
6d656fa
Merge pull request #11 from RaoFoundation/miner-fixes
Dec 30, 2023
d33d4fb
Use AutoModelForCausalLM.
Dec 30, 2023
8c1b7ac
Also update mining test to use same model type.
Dec 30, 2023
c8855a5
Merge pull request #12 from RaoFoundation/autoModelLM
Dec 30, 2023
29ae1a0
Pass netuid to the chain store
Dec 31, 2023
c94f391
Handle exceptions calculating miner losses.
Dec 31, 2023
b4d2325
Support loading a non hugging face saved model
Dec 31, 2023
d3d2f3d
Make a new wandb run for the validator if logging there.
Dec 31, 2023
38a11a5
Merge pull request #14 from RaoFoundation/miner-fixes2
Dec 31, 2023
44a3e73
Address PR fixes.
Dec 31, 2023
67574c9
Merge branch 'dev' into valEval
Dec 31, 2023
ced19a6
Add size check before downloading from hugging face.
Dec 30, 2023
cd0e717
Merge pull request #10 from RaoFoundation/valEval
Dec 31, 2023
4f773e8
Merge pull request #13 from RaoFoundation/checkRepoSize
Dec 31, 2023
7f66a39
Add checks in Model Updater for bad models.
Dec 31, 2023
b0fc67d
Merge pull request #15 from RaoFoundation/exceptOnBadModels
Dec 31, 2023
d6f2904
Improve test logging.
Dec 31, 2023
db56c27
Collected fixes.
Dec 31, 2023
de6edac
Exception handling improvements.
Dec 31, 2023
3004f06
Fix update loop sleep logic when revisiting recently.
Dec 31, 2023
928b61f
Uid state handling fixes.
Dec 31, 2023
a8e8a2f
Sleep in run step for readability.
Dec 31, 2023
3d22ac1
Align local and remote directory pathing.
Dec 31, 2023
81edb9d
Compute_losses on the pt_model not the Model.
Dec 31, 2023
695e738
Validator wandb run logging fixes.
Dec 31, 2023
359fa9c
Update comments on expected directory structure.
Dec 31, 2023
d5b7825
Merge pull request #16 from RaoFoundation/vali-fixes
Dec 31, 2023
adaf416
Add a new tool to upload a trained model
Dec 31, 2023
a3e073f
Merge pull request #17 from RaoFoundation/miner-push-only
Dec 31, 2023
88c8216
Clean-up
Dec 31, 2023
7d80098
Create a new validator wandb run every 100 run steps.
Dec 31, 2023
87d4f88
Merge pull request #18 from RaoFoundation/clean-up
Dec 31, 2023
8fdb630
Add auto-update script
Dec 31, 2023
28e5769
Fix directory hash after downloading models.
Dec 31, 2023
4b08bdf
Merge pull request #20 from RaoFoundation/auto-update
Dec 31, 2023
da72955
Merge pull request #21 from RaoFoundation/hash_location_fix
Dec 31, 2023
ee0b22e
Merge pull request #19 from RaoFoundation/new_wandb_runs
Dec 31, 2023
3b27b56
Remove unused import
Dec 31, 2023
c57f2ef
Merge pull request #22 from RaoFoundation/logs
Dec 31, 2023
9da0a5c
Split out miner/vali docs and update.
Dec 31, 2023
0278385
Improve Miner docs.
Jan 1, 2024
175a58c
Merge pull request #23 from RaoFoundation/docs
Jan 1, 2024
ae42a47
Update scoring temperature to 0.04.
Jan 3, 2024
ef67494
Merge pull request #24 from RaoFoundation/temp_update
surcyf123 Jan 3, 2024
0ddead7
Update validator score boosting of earlier models.
Jan 3, 2024
43d6a6a
Merge pull request #25 from RaoFoundation/epsilon_update
surcyf123 Jan 3, 2024
972950a
Merge pull request #26 from RaoFoundation/dev
Jan 3, 2024
93a1d98
Formatting fixes for miner docs
Jan 3, 2024
8e91a9f
Merge pull request #27 from RaoFoundation/doc-format
Jan 3, 2024
2aac764
Merge pull request #28 from RaoFoundation/dev
Jan 3, 2024
4c9f60f
Fix for pending uids to eval in next loop.
Jan 5, 2024
3d65475
Merge pull request #29 from RaoFoundation/updatedEvalCheck
Jan 5, 2024
5e2aaa9
Also update to a new uids file.
Jan 5, 2024
9958cd4
Merge pull request #30 from RaoFoundation/updatedEvalCheck
Jan 5, 2024
9cb69dc
Merge pull request #31 from RaoFoundation/dev
Jan 5, 2024
edeac8d
Realize symlinks on download from remote store.
Jan 8, 2024
0ac4c65
Update to improve error logging around failures to parse the metadata…
Jan 9, 2024
bf1dc9d
Model_id locality fix.
Jan 9, 2024
b3dde1a
Merge pull request #32 from RaoFoundation/log_improvements
Jan 9, 2024
173ad38
Merge pull request #33 from RaoFoundation/remove_symlink
Jan 9, 2024
88ee418
Merge pull request #34 from RaoFoundation/dev
Jan 9, 2024
a8abc8b
Add a notebook to check latest vali perf
Jan 13, 2024
62ca7e9
Clear all outputs
Jan 13, 2024
143f0cd
Merge pull request #35 from RaoFoundation/vali-perf
Jan 14, 2024
014531d
Increase max model size to 186M
Jan 15, 2024
960163e
Perform a full eval after vali upgrade
Jan 15, 2024
560d8e6
Make the clean loop delay larger
Jan 15, 2024
2eea7a5
Update the miner docs
Jan 15, 2024
8b46e81
Keep losses to math.inf when failing to evaluate model.
Jan 15, 2024
6647082
Merge pull request #38 from RaoFoundation/model_loss_none_fix
Jan 15, 2024
ddc0e58
Merge pull request #36 from RaoFoundation/vali-updates
Jan 15, 2024
d1d4b50
Include repo_id in error messages
Jan 15, 2024
6ed4577
Merge pull request #39 from RaoFoundation/improve-errors
Jan 15, 2024
25b91ed
Read back the metadata commit after writing
Jan 15, 2024
2300785
Merge pull request #40 from RaoFoundation/dev
Jan 15, 2024
ae65103
Merge pull request #41 from RaoFoundation/read-metadata
Jan 15, 2024
b1a0bdd
Update setup.py to point to new version location.
Jan 16, 2024
714dff7
Correct the docs
Jan 17, 2024
56b4a52
Merge pull request #37 from RaoFoundation/model-increase
Jan 17, 2024
27fa33b
Merge pull request #42 from RaoFoundation/setup_fix
Jan 17, 2024
edf58fb
Bump version
Jan 17, 2024
9c25951
Merge pull request #43 from RaoFoundation/bump-version
Jan 17, 2024
06eecdd
Merge pull request #44 from RaoFoundation/dev
Jan 17, 2024
c9ec6bc
Simplify the mining API
Jan 20, 2024
5d45fc7
Merge pull request #45 from RaoFoundation/api
Jan 20, 2024
ac31bb6
Run each eval in a subprocess to avoid a bad model being able to corr…
Feb 2, 2024
bdae9e6
Merge pull request #46 from RaoFoundation/debug
Feb 2, 2024
198e103
Remove model with inf loss
Feb 2, 2024
1f96e89
Fix dict .get()
Feb 2, 2024
45595cc
Merge pull request #47 from RaoFoundation/remove-bad-miners
Feb 2, 2024
65b29aa
Clean-up accidental test code
Feb 2, 2024
563dfdb
Merge pull request #48 from RaoFoundation/clean-up2
Feb 2, 2024
4402b91
Merge pull request #49 from RaoFoundation/dev
Feb 2, 2024
4d09328
Correctly call is_dir() method.
Feb 2, 2024
9a6695d
Add test for is_dir() behavior.
Feb 3, 2024
c563c26
Log but do not throw for expected model sync failures.
Feb 3, 2024
3ab91cd
Only keep hotkeys to be evaluated in storage.
Feb 3, 2024
c2c8f6a
Only allow at most 10 new models to be pending eval.
Feb 3, 2024
eb6b471
Merge pull request #50 from RaoFoundation/is_dir_fix
Feb 3, 2024
34e08e0
Merge pull request #51 from RaoFoundation/downgrade_model_size_log
Feb 3, 2024
7b1e494
Add lock around metagraph for sub threads and remove grace period on …
Feb 3, 2024
f877806
Merge pull request #52 from RaoFoundation/limit_stored_models
Feb 3, 2024
8dba6f3
Merge pull request #53 from RaoFoundation/limit_pending_models
Feb 3, 2024
69c2749
Only filter out uids with weights at 0 in addition to inf loss.
Feb 4, 2024
c496bf2
Merge pull request #54 from RaoFoundation/inf_and_weight_check
Feb 4, 2024
45d9bc1
Move state file to the model dir
Feb 4, 2024
bcb696e
Merge pull request #55 from RaoFoundation/perplexity
Feb 4, 2024
1f7345d
Revert "Only allow at most 10 new models to be pending eval."
Feb 4, 2024
172e4e3
Merge pull request #56 from RaoFoundation/revert-53-limit_pending_models
Feb 4, 2024
c8a9eba
Only allow at most 20 new models to be pending eval.
Feb 3, 2024
47a444c
PR Feedback.
Feb 4, 2024
c247220
Handle shutil.rmtree FIleNotFoundError.
Feb 4, 2024
a89a67f
Merge pull request #58 from RaoFoundation/shutil_exception
Feb 4, 2024
4c313ce
Merge pull request #57 from RaoFoundation/limit_pending_models
Feb 4, 2024
2d86ecd
Catch all exceptions from shutil rmtree.
Feb 4, 2024
613fe76
Merge pull request #59 from RaoFoundation/catch_all_rmtree
Feb 4, 2024
c952148
Reapply grace period of 300s.
Feb 4, 2024
56e1665
Catch exceptions in the clean-up loop.
Feb 4, 2024
47e166d
Add handling around computation of file timestamps if the file no lon…
Feb 4, 2024
f6206de
Merge pull request #60 from RaoFoundation/grace_reapply
Feb 4, 2024
7702da1
Merge pull request #61 from RaoFoundation/catch-cleanup
Feb 4, 2024
78864de
Update docs to point to the leaderboard
Feb 4, 2024
4321f85
Fix get_newest_datetime_under_path to get newest not oldest.
Feb 4, 2024
fbbd159
Merge pull request #63 from RaoFoundation/get_latest_under_path_fix
Feb 5, 2024
40f31f8
Standardize the loss function
Feb 5, 2024
5a4ebd0
Bump version
Feb 5, 2024
fb44be8
Merge pull request #66 from RaoFoundation/loss
Feb 5, 2024
71dd311
Merge pull request #65 from RaoFoundation/bump_version
Feb 5, 2024
1dffefc
Merge pull request #62 from RaoFoundation/update-docs
Feb 5, 2024
7e3b2c4
Merge pull request #67 from RaoFoundation/dev
Feb 5, 2024
430cb5a
Require models have max_position_embeddings=1024.
Feb 11, 2024
ccba669
Also reduce severity of logs when failing to download model.
Feb 11, 2024
3b2d967
Update spec version to 2.2.1 to ensure validators get new state.
Feb 11, 2024
b341ed6
Restrict model types.
Feb 11, 2024
cdb622d
Move list of allowed models to constants.
Feb 11, 2024
c8573f9
Merge pull request #69 from RaoFoundation/restrict_model_types
Feb 11, 2024
bd1f026
Merge pull request #70 from RaoFoundation/dev
Feb 11, 2024
a8da485
Update docs for allowed model types.
Feb 11, 2024
d06c77e
Merge pull request #71 from RaoFoundation/doc_update
Feb 11, 2024
e18fdd4
Add tool for running a benchmark
Feb 13, 2024
4cf9e0b
Remove test notebook
Feb 13, 2024
f94fc93
Merge pull request #72 from RaoFoundation/benchmarks
Feb 13, 2024
28a1afe
Allow larger models after a defined block
Feb 14, 2024
81c8b78
Increase max repo size
Feb 14, 2024
d8a7bdc
Add gpt2-large to benchmark
Feb 16, 2024
2234478
Merge pull request #73 from RaoFoundation/block-max
Feb 16, 2024
937afae
Merge pull request #74 from RaoFoundation/add-gpt2-large
Feb 16, 2024
7f1ec1e
Merge pull request #75 from RaoFoundation/dev
Feb 16, 2024
4e5cc6d
Update README.md
dougsillars Feb 16, 2024
7140510
Load model in the subprocess to avoid pickling
Feb 21, 2024
0f237c2
Fix missing method
Feb 21, 2024
725365f
Bump ttl to 150 seconds
Feb 21, 2024
3ab1102
Bump tranformers version
Feb 21, 2024
dd26bcc
Merge pull request #78 from RaoFoundation/bump-transformers
Feb 21, 2024
abb7496
Track total eval perf
Feb 21, 2024
6122352
Don't bump spec version
Feb 21, 2024
7c5fe35
Clean-up vali-perf notebook
Feb 21, 2024
4091575
Merge pull request #77 from RaoFoundation/qol
Feb 21, 2024
98a21b5
Revert "Merge pull request #77 from RaoFoundation/qol"
Feb 22, 2024
4fccab7
Merge pull request #80 from RaoFoundation/undo-77
Feb 22, 2024
c935923
Increase alpha. Log weight failures
Feb 22, 2024
c06992f
Merge pull request #81 from RaoFoundation/alpha
Feb 22, 2024
d2faaec
Merge pull request #79 from RaoFoundation/dev
Feb 22, 2024
5409309
Update model size on downloads based on block.
Mar 17, 2024
8c13811
Use optimizations at new block for inference.
Mar 18, 2024
c8cb2b8
Limit model types based on block.
Mar 18, 2024
e8206a7
Run inference with sequence length based on block.
Mar 18, 2024
9f8ae23
Doc updates.
Mar 18, 2024
3f5748c
Adjust temperature to prioritize top 1 model.
Mar 19, 2024
bd7501c
Adjust to only keep 10 best models + eval up to 15 new per loop.
Mar 19, 2024
90b870f
Check for updates to models with incentive first.
Mar 19, 2024
28916ff
Remove notebook and update cadence for check.
Mar 19, 2024
ea91667
Update to only 6 min, 14 max models by default.
Mar 19, 2024
c83a787
Fix docs + increase time for eval + adjust sample model parameters.
Mar 19, 2024
4957e80
Refactor to use ModelParameters + pass sequence length.
Mar 20, 2024
e520ff1
Rename to Model Criteria for clarity.
Mar 20, 2024
d8af206
Update docs to point to correct line for ModelCriteria.
Mar 20, 2024
2e9d6dd
Check generated outputs before calculating losses.
Mar 22, 2024
82e74a3
Send inputs to the same device as the model.
Mar 22, 2024
7eb4b4e
Refactor check out to a helper function.
Mar 22, 2024
1177610
Bump spec version to force reload of models.
Mar 22, 2024
6160d49
Pass tokenizer eos token id to remove warning message.
Mar 22, 2024
d80f965
Start iterator at 200 for fresh start.
Mar 22, 2024
b4d1207
Merge pull request #86 from RaoFoundation/disallow_attn
Mar 22, 2024
706f659
Merge pull request #87 from RaoFoundation/dev
Mar 22, 2024
99afe25
Update to use 6.9 params, 8192 seqeuence length, and block 2735661.
Mar 23, 2024
56a3713
Update to 24 pages and add clarify TFLOPs required.
Mar 23, 2024
0f26862
Update documentation on vali requirements and flash-attn requirements.
Mar 24, 2024
fadbe82
Merge branch 'dev' into next_milestone
Mar 24, 2024
7ae6d0c
Merge pull request #83 from RaoFoundation/next_milestone
Mar 24, 2024
5213654
Merge branch 'dev' into eval_loop_adjustments
Mar 24, 2024
3c1c44a
Merge pull request #76 from dougsillars/main
Mar 24, 2024
99e0588
Merge pull request #84 from RaoFoundation/eval_loop_adjustments
Mar 24, 2024
fd4681c
Add a new tokenizer for 7B
Mar 21, 2024
cd9819a
Bump to 6 minute timeouts and go back to random iterator start.
Mar 24, 2024
fe2a0c3
Update to 4k seq length + lower pages + adjust tokenizer.
Mar 24, 2024
fca0dd4
Pass pad token id to avoid instantiating new tokenizer every loss com…
Mar 24, 2024
732f904
Add Model Criteria for block 0 and improve logging.
Mar 24, 2024
4309982
Calculate average loss correctly in log_step.
Mar 25, 2024
e8bfe81
Move to GPT4 tokenizer instead of GPT3_5.
Mar 27, 2024
0771aaa
Push switchover block out by a week.
Mar 27, 2024
c0cf96c
Merge pull request #88 from RaoFoundation/update_tokenizer
Mar 28, 2024
18f0056
Merge pull request #89 from RaoFoundation/dev
Mar 28, 2024
d9fe3a1
Raise threshhold for unreasonable output and keep models with weights.
Mar 28, 2024
ae1fd35
Also prioritize keeping higher weights when filtering.
Mar 28, 2024
8b1e8bb
Adjust output lengths and check reptitiveness for all outputs.
Mar 29, 2024
bf5cc6e
Handle failures to load tracker state gracefully.
Mar 29, 2024
43b2428
Also test redownloading works as expected.
Mar 29, 2024
24f4b76
Merge pull request #91 from RaoFoundation/handle_corrupt_state
Mar 29, 2024
e72efee
Refactor model prioritization for clarity + correctness.
Mar 29, 2024
63271e1
Handle failures to load uids to eval state gracefully.
Mar 29, 2024
2b71a5d
Wipe tracker state in case of no uids to eval.
Mar 29, 2024
8a73df8
Also wipe the state in case of multiple bad restarts.
Mar 29, 2024
63bd73e
Merge pull request #90 from RaoFoundation/improve_model_check
Apr 1, 2024
d0716fd
Merge pull request #93 from RaoFoundation/eval_state
Apr 1, 2024
fee6b41
Retry evaluation for discarded models with incentive periodically.
Apr 1, 2024
9347268
Merge pull request #94 from RaoFoundation/retry_incentive
Apr 1, 2024
2c377cf
Merge pull request #95 from RaoFoundation/dev
Apr 1, 2024
9a6e0c0
Initialize uids_to_eval as set().
Apr 2, 2024
5a36e47
Fix docstring
steffencruz Apr 3, 2024
feb620c
Enable uploading a model with bfloat 16.
Apr 12, 2024
1e6e6ef
Add 7b models to the benchmark script
Apr 12, 2024
7b8e7b5
Default to upload with b16 for manual upload.
Apr 12, 2024
b1247e8
Merge pull request #96 from RaoFoundation/type_fix
Apr 12, 2024
6948e7a
Merge pull request #100 from RaoFoundation/benchmark-7b
Apr 12, 2024
57e5f82
Merge pull request #98 from RaoFoundation/upload_arg_opt
Apr 12, 2024
2477d4a
Merge branch 'dev' of github.com:RaoFoundation/pretraining into steff…
steffencruz Apr 13, 2024
Align local and remote directory pathing.
Sid committed Dec 31, 2023
commit 3d22ac1764343941db40b458bd8deba122a51d0a
4 changes: 2 additions & 2 deletions model/model_updater.py
@@ -43,8 +43,8 @@ async def sync_model(self, hotkey: str) -> bool:
         if metadata == tracker_model_metadata:
             return False

-        # Get the local path based on the local store.
-        path = self.local_store.get_path(hotkey, metadata.id)
+        # Get the local path based on the local store to download to (top level hotkey path)
+        path = self.local_store.get_path(hotkey)

         # Otherwise we need to download the new model based on the metadata.
         model = await self.remote_store.download_model(metadata.id, path)
54 changes: 40 additions & 14 deletions model/storage/disk/disk_model_store.py
@@ -13,15 +13,19 @@ class DiskModelStore(LocalModelStore):
     def __init__(self, base_dir: str):
         self.base_dir = base_dir

-    def get_path(self, hotkey: str, model_id: ModelId) -> str:
-        """Returns the path to where this store would locate this model."""
-        return utils.get_local_model_dir(self.base_dir, hotkey, model_id)
+    def get_path(self, hotkey: str) -> str:
+        """Returns the path to where this store would locate this hotkey."""
+        return utils.get_local_miner_dir(self.base_dir, hotkey)

     def store_model(self, hotkey: str, model: Model) -> ModelId:
         """Stores a trained model locally."""

+        # Note that the revision argument here does not affect the directory path like with hugging face downloads.
         model.pt_model.save_pretrained(
-            save_directory=utils.get_local_model_dir(self.base_dir, hotkey, model.id),
+            save_directory=utils.get_local_model_snapshot_dir(
+                self.base_dir, hotkey, model.id
+            ),
+            revision=model.id.commit,
             safe_serialization=True,
         )

@@ -32,7 +36,7 @@ def retrieve_model(self, hotkey: str, model_id: ModelId) -> Model:
         """Retrieves a trained model locally."""

         model = AutoModelForCausalLM.from_pretrained(
-            pretrained_model_name_or_path=utils.get_local_model_dir(
+            pretrained_model_name_or_path=utils.get_local_model_snapshot_dir(
                 self.base_dir, hotkey, model_id
             ),
             revision=model_id.commit,

@@ -50,7 +54,7 @@ def delete_unreferenced_models(
         valid_model_paths = set()
         for hotkey, model_id in valid_models_by_hotkey.items():
             valid_model_paths.add(
-                utils.get_local_model_dir(self.base_dir, hotkey, model_id)
+                utils.get_local_model_snapshot_dir(self.base_dir, hotkey, model_id)
             )

         # For each hotkey path on disk using listdir to go one level deep.

@@ -63,20 +67,42 @@ def delete_unreferenced_models(
             # If it is not in valid_hotkeys and out of grace period remove it.
             if hotkey not in valid_models_by_hotkey:
-                bt.logging.trace(
-                    f"Removing directory for unreferenced hotkey: {hotkey} if out of grace."
-                )
-                utils.remove_dir_out_of_grace(hotkey_path, grace_period_seconds)
+                deleted_hotkey = utils.remove_dir_out_of_grace(
+                    hotkey_path, grace_period_seconds
+                )
+                if deleted_hotkey:
+                    bt.logging.trace(
+                        f"Removed directory for unreferenced hotkey: {hotkey}."
+                    )
             else:
                 # Check all the model subfolder paths.
                 hotkey_dir = Path(hotkey_path)
                 model_subfolder_paths = [
                     str(d) for d in hotkey_dir.iterdir() if d.is_dir
                 ]

+                # Check all the snapshot subfolder paths
                 for model_path in model_subfolder_paths:
-                    if model_path not in valid_model_paths:
-                        bt.logging.trace(
-                            f"Removing directory for unreferenced model at: {model_path} if out of grace."
-                        )
-                        utils.remove_dir_out_of_grace(model_path, grace_period_seconds)
+                    model_dir = Path(model_path)
+                    snapshot_subfolder_paths = [
+                        str(d) for d in model_dir.iterdir() if d.is_dir
+                    ]
+
+                    # Check all the actual model snapshot paths
+                    for snapshot_path in snapshot_subfolder_paths:
+                        snapshot_dir = Path(snapshot_path)
+                        commit_subfolder_paths = [
+                            str(d) for d in snapshot_dir.iterdir() if d.is_dir
+                        ]
+
+                        # Reached the end. Check all the actual commit subfolders.
+                        for commit_path in commit_subfolder_paths:
+                            if commit_path not in valid_model_paths:
+                                deleted_model = utils.remove_dir_out_of_grace(
+                                    commit_path, grace_period_seconds
+                                )
+                                if deleted_model:
+                                    bt.logging.trace(
+                                        f"Removing directory for unreferenced model at: {commit_path}."
+                                    )
17 changes: 15 additions & 2 deletions model/storage/disk/utils.py
@@ -15,10 +15,19 @@ def get_local_miner_dir(base_dir: str, hotkey: str) -> str:
     return os.path.join(get_local_miners_dir(base_dir), hotkey)


+# Hugging face stores models under models--namespace--name/snapshots/commit when downloading.
 def get_local_model_dir(base_dir: str, hotkey: str, model_id: ModelId) -> str:
     return os.path.join(
         get_local_miner_dir(base_dir, hotkey),
-        model_id.namespace + "_" + model_id.name + "_" + model_id.commit,
+        "models" + "--" + model_id.namespace + "--" + model_id.name,
     )


+def get_local_model_snapshot_dir(base_dir: str, hotkey: str, model_id: ModelId) -> str:
+    return os.path.join(
+        get_local_model_dir(base_dir, hotkey, model_id),
+        "snapshots",
+        model_id.commit,
+    )


@@ -39,12 +48,16 @@ def get_newest_datetime_under_path(path: str) -> datetime.datetime:
     return datetime.datetime.fromtimestamp(newest_filetime)


-def remove_dir_out_of_grace(path: str, grace_period_seconds: int):
+def remove_dir_out_of_grace(path: str, grace_period_seconds: int) -> bool:
+    """Removes a dir if the last modified time is out of grace period secs. Returns if it was deleted."""
     last_modified = get_newest_datetime_under_path(path)
     grace = datetime.timedelta(seconds=grace_period_seconds)

     if last_modified < datetime.datetime.now() - grace:
         shutil.rmtree(path=path, ignore_errors=True)
+        return True
+
+    return False


 def get_hash_of_file(path: str) -> str:
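The new helpers mirror the cache layout Hugging Face uses on download: `models--<namespace>--<name>/snapshots/<commit>`. A minimal sketch of the resulting paths, using plain strings in place of `ModelId` (the `models/` segment comes from `get_local_miners_dir` in the real module):

```python
import os

def get_local_miner_dir(base_dir: str, hotkey: str) -> str:
    # <base_dir>/models/<hotkey>: the top-level directory for one miner.
    return os.path.join(base_dir, "models", hotkey)

def get_local_model_dir(base_dir: str, hotkey: str, namespace: str, name: str) -> str:
    # Matches Hugging Face's cache naming scheme: models--<namespace>--<name>.
    return os.path.join(
        get_local_miner_dir(base_dir, hotkey), "models--" + namespace + "--" + name
    )

def get_local_model_snapshot_dir(
    base_dir: str, hotkey: str, namespace: str, name: str, commit: str
) -> str:
    # The actual weights land under snapshots/<commit>, as with HF downloads.
    return os.path.join(
        get_local_model_dir(base_dir, hotkey, namespace, name), "snapshots", commit
    )

print(get_local_model_snapshot_dir("test-models", "hotkey0", "ns", "gpt2", "abc123"))
```

On POSIX this prints the nested `test-models/models/hotkey0/models--ns--gpt2/snapshots/abc123` path, matching the expected path built up in the new unit test further down.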
6 changes: 3 additions & 3 deletions model/storage/hugging_face/hugging_face_model_store.py
@@ -105,7 +105,7 @@ async def test_roundtrip_model():
     # Retrieve the model from hf.
     retrieved_model = await hf_model_store.download_model(
         model_id=model.id,
-        local_path=utils.get_local_model_dir("test-models", "hotkey0", model.id),
+        local_path=utils.get_local_miner_dir("test-models", "hotkey0"),
     )

     # Check that they match.

@@ -129,7 +129,7 @@ async def test_retrieve_model():
     # Retrieve the model from hf (first run) or cache.
     model = await hf_model_store.download_model(
         model_id=model_id,
-        local_path=utils.get_local_model_dir("test-models", "hotkey0", model_id),
+        local_path=utils.get_local_miner_dir("test-models", "hotkey0"),
     )

     print(f"Finished retrieving the model with id: {model.id}")

@@ -149,7 +149,7 @@ async def test_retrieve_oversized_model():
     try:
         model = await hf_model_store.download_model(
             model_id=model_id,
-            local_path=utils.get_local_model_dir("test-models", "hotkey0", model_id),
+            local_path=utils.get_local_miner_dir("test-models", "hotkey0"),
         )
     except ValueError as ve:
         print(f"Caught expected exception for downloading too large of a model: {ve}")
2 changes: 1 addition & 1 deletion model/storage/local_model_store.py
@@ -12,7 +12,7 @@ def store_model(self, hotkey: str, model: Model) -> ModelId:
         pass

     @abc.abstractmethod
-    def get_path(self, hotkey: str, model_id: ModelId) -> str:
+    def get_path(self, hotkey: str) -> str:
         """Returns the path to the appropriate location based on implementation."""
         pass
10 changes: 2 additions & 8 deletions tests/model/storage/disk/test_disk_model_store.py
@@ -16,15 +16,9 @@ def tearDown(self):

     def test_get_path(self):
         hotkey = "hotkey0"
-        model_id = ModelId(
-            namespace="test_model",
-            name="test_name",
-            commit="test_commit",
-            hash="test_hash",
-        )
-
-        expected_path = utils.get_local_model_dir("test-models", hotkey, model_id)
-        actual_path = self.disk_store.get_path(hotkey, model_id)
+        expected_path = utils.get_local_miner_dir("test-models", hotkey)
+        actual_path = self.disk_store.get_path(hotkey)

         self.assertEqual(expected_path, actual_path)
38 changes: 34 additions & 4 deletions tests/model/storage/disk/test_utils.py
@@ -48,10 +48,38 @@ def test_get_local_model_dir(self):
             + self.sep
             + hotkey
             + self.sep
+            + "models--"
             + namespace
-            + "_"
+            + "--"
             + name
-            + "_"
         )
         self.assertEqual(model_dir, expected_path)

+    def test_get_local_model_snapshot_dir(self):
+        hotkey = "test-hotkey"
+        namespace = "test-namespace"
+        name = "test-name"
+        commit = "test-commit"
+        model_id = ModelId(
+            namespace=namespace, name=name, hash="test-hash", commit=commit
+        )
+
+        model_dir = utils.get_local_model_snapshot_dir(self.base_dir, hotkey, model_id)
+
+        expected_path = (
+            self.base_dir
+            + self.sep
+            + "models"
+            + self.sep
+            + hotkey
+            + self.sep
+            + "models--"
+            + namespace
+            + "--"
+            + name
+            + self.sep
+            + "snapshots"
+            + self.sep
+            + commit
+        )
+        self.assertEqual(model_dir, expected_path)

@@ -91,7 +119,8 @@ def test_remove_dir_out_of_grace(self):
         time.sleep(1)

         self.assertTrue(os.path.exists(self.base_dir))
-        utils.remove_dir_out_of_grace(self.base_dir, 0)
+        deleted = utils.remove_dir_out_of_grace(self.base_dir, 0)
+        self.assertTrue(deleted)
         self.assertFalse(os.path.exists(self.base_dir))

     def test_remove_dir_out_of_grace_in_grace(self):

@@ -104,7 +133,8 @@ def test_remove_dir_out_of_grace_in_grace(self):
         file.close()

         self.assertTrue(os.path.exists(self.base_dir))
-        utils.remove_dir_out_of_grace(self.base_dir, 60)
+        deleted = utils.remove_dir_out_of_grace(self.base_dir, 60)
+        self.assertFalse(deleted)
         self.assertTrue(os.path.exists(self.base_dir))

     def test_get_hash_of_file(self):
16 changes: 15 additions & 1 deletion tests/model/storage/fake_model_metadata_store.py
@@ -12,7 +12,21 @@ def __init__(self):
         self.metadata = dict()
         self.store_errors = deque()

-    async def store_model_metadata(self, hotkey: str, model_metadata: ModelMetadata):
+    async def store_model_metadata(self, hotkey: str, model_id: ModelId):
+        """Fake stores model metadata for a specific hotkey."""
+
+        # Return an injected error if we have one.
+        if len(self.store_errors) > 0:
+            raise self.store_errors.popleft()
+
+        model_metadata = ModelMetadata(id=model_id, block=self.current_block)
+        self.current_block += 1
+
+        self.metadata[hotkey] = model_metadata
+
+    async def store_model_metadata_exact(
+        self, hotkey: str, model_metadata: ModelMetadata
+    ):
         """Fake stores model metadata for a specific hotkey."""

         # Return an injected error if we have one.
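The fake now assigns increasing block numbers itself, while the `_exact` variant keeps the old behavior for tests that need a specific block. A synchronous sketch of that split, with a minimal `ModelMetadata` stand-in rather than the repo's real class:

```python
from dataclasses import dataclass

@dataclass
class ModelMetadata:  # minimal stand-in for model.data.ModelMetadata
    id: str
    block: int

class FakeMetadataStore:
    def __init__(self):
        self.metadata = {}
        self.current_block = 1

    def store_model_metadata(self, hotkey: str, model_id: str) -> None:
        # Auto-assign the next block, like a chain commit would.
        self.metadata[hotkey] = ModelMetadata(id=model_id, block=self.current_block)
        self.current_block += 1

    def store_model_metadata_exact(self, hotkey: str, metadata: ModelMetadata) -> None:
        # Store exactly what the caller provides (used by the updater tests).
        self.metadata[hotkey] = metadata
```

Tests that only care that metadata exists use the first method; tests that assert on a particular block pin it with the second.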
8 changes: 7 additions & 1 deletion tests/model/storage/fake_remote_model_store.py
@@ -1,4 +1,5 @@
 from model.data import Model, ModelId
+from model.storage.disk import utils
 from model.storage.remote_model_store import RemoteModelStore


@@ -21,9 +22,14 @@ async def download_model(self, model_id: ModelId, local_path: str) -> Model:
         model = self.remote_models[model_id]

+        # Parse out the hotkey and the base path from local_path to replicate hugging face logic.
+        split_string = local_path.split("/")
+
         # Store it at the local_path
         model.pt_model.save_pretrained(
-            save_directory=local_path,
+            save_directory=utils.get_local_model_snapshot_dir(
+                split_string[0], split_string[2], model_id
+            ),
             safe_serialization=True,
         )
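The fake store recovers the base directory and hotkey positionally from `local_path`, which only works when the path has the exact `<base_dir>/models/<hotkey>` shape produced by `get_local_miner_dir`. A sketch of that assumption (`os.path.normpath` plus splitting on `os.sep` would be more portable, but the `split("/")` below mirrors the fake's logic):

```python
# Hypothetical path shaped like get_local_miner_dir's output: <base>/models/<hotkey>.
local_path = "test-models/models/hotkey0"

# Mirrors the fake store's positional parsing; brittle if the shape changes.
split_string = local_path.split("/")
base_dir, hotkey = split_string[0], split_string[2]
print(base_dir, hotkey)  # test-models hotkey0
```

Index 0 is the base directory and index 2 the hotkey because index 1 is always the fixed `models` segment.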
24 changes: 18 additions & 6 deletions tests/model/test_model_updater.py
@@ -38,7 +38,9 @@ def test_get_metadata(self):
         )
         model_metadata = ModelMetadata(id=model_id, block=1)

-        asyncio.run(self.metadata_store.store_model_metadata(hotkey, model_metadata))
+        asyncio.run(
+            self.metadata_store.store_model_metadata_exact(hotkey, model_metadata)
+        )

         metadata = asyncio.run(self.model_updater._get_metadata(hotkey))

@@ -56,7 +58,9 @@ def test_sync_model_bad_metadata(self):
         model_metadata = ModelMetadata(id=model_id, block=1)

         # Setup the metadata with a commit that doesn't exist in the remote store.
-        asyncio.run(self.metadata_store.store_model_metadata(hotkey, model_metadata))
+        asyncio.run(
+            self.metadata_store.store_model_metadata_exact(hotkey, model_metadata)
+        )

         # FakeRemoteModelStore raises a KeyError but HuggingFace may raise other exceptions.
         with self.assertRaises(Exception):

@@ -77,7 +81,9 @@ def test_sync_model_same_metadata(self):
         model = Model(id=model_id, pt_model=pt_model)

         # Setup the metadata, local, and model_tracker to match.
-        asyncio.run(self.metadata_store.store_model_metadata(hotkey, model_metadata))
+        asyncio.run(
+            self.metadata_store.store_model_metadata_exact(hotkey, model_metadata)
+        )
         self.local_store.store_model(hotkey, model)

         self.model_tracker.on_miner_model_updated(hotkey, model_metadata)

@@ -105,7 +111,9 @@ def test_sync_model_new_metadata(self):
         model = Model(id=model_id, pt_model=pt_model)

         # Setup the metadata and remote store but not local or the model_tracker.
-        asyncio.run(self.metadata_store.store_model_metadata(hotkey, model_metadata))
+        asyncio.run(
+            self.metadata_store.store_model_metadata_exact(hotkey, model_metadata)
+        )
         asyncio.run(self.remote_store.upload_model(model))

         self.assertIsNone(

@@ -148,7 +156,9 @@ def test_sync_model_bad_hash(self):
         model = Model(id=model_id, pt_model=pt_model)

         # Setup the metadata and remote store and but not local or the model tracker.
-        asyncio.run(self.metadata_store.store_model_metadata(hotkey, model_metadata))
+        asyncio.run(
+            self.metadata_store.store_model_metadata_exact(hotkey, model_metadata)
+        )
         self.remote_store.inject_mismatched_model(model_id_chain, model)

         # Assert we fail due to the hash mismatch between the model in remote store and the metadata on chain.

@@ -177,7 +187,9 @@ def test_sync_model_over_max_parameters(self):
         model = Model(id=model_id, pt_model=pt_model)

         # Setup the metadata and remote store but not local or the model_tracker.
-        asyncio.run(self.metadata_store.store_model_metadata(hotkey, model_metadata))
+        asyncio.run(
+            self.metadata_store.store_model_metadata_exact(hotkey, model_metadata)
+        )
         asyncio.run(self.remote_store.upload_model(model))

         # Assert we fail due to exceeding the maximum allowed parameter size.