
Retraining a brainy model (unet) on kwyk dataset using nobrainer API #277

Closed
hvgazula opened this issue Mar 4, 2024 · 4 comments


hvgazula commented Mar 4, 2024

Based on a user request (#271).

@hvgazula hvgazula self-assigned this Mar 4, 2024

hvgazula commented Mar 4, 2024

  • setup kwyk data
  • unet model (nobrainer library model)
  • unet model (standard tf model)
  • environment setup
  • training on dgx-100 (volta) -- fails
  • training on a100 -- success with batch size = 1
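The Volta-vs-A100 outcome above is consistent with activation memory at full 256³ resolution: even at batch size 1, the early full-resolution feature maps of a 3D unet are large. A back-of-the-envelope sketch (the 32-channel count is an illustrative assumption, not the actual nobrainer unet configuration):

```python
# Rough activation-memory estimate for a 3D U-Net on a full 256^3
# volume. Channel counts are illustrative assumptions, not
# measurements from the actual nobrainer model.

def volume_bytes(shape, channels, dtype_bytes=4):
    """Bytes needed for one feature map of the given spatial shape."""
    n = channels * dtype_bytes
    for dim in shape:
        n *= dim
    return n

# A single-channel float32 input of shape (256, 256, 256):
input_mib = volume_bytes((256, 256, 256), channels=1) / 2**20   # 64 MiB

# A first encoder block producing 32 feature maps at full resolution
# already needs 2 GiB for a single activation tensor, before counting
# the decoder, skip connections, gradients, and optimizer state.
first_block_mib = volume_bytes((256, 256, 256), channels=32) / 2**20  # 2048 MiB

print(f"input volume: {input_mib:.0f} MiB")
print(f"32-channel feature map: {first_block_mib:.0f} MiB")
```

This is why a 16/32 GB Volta card can run out of memory on full volumes where a 40/80 GB A100 squeezes through at batch size 1.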


satra commented Mar 4, 2024

@hvgazula - the request was about retraining the brainy model, not the kwyk model (you may want to edit the title). The kwyk model uses a Bayesian neural network. I would love to see that trained as well, but that's a different task.

hvgazula changed the title from "Retraining kwyk using nobrainer" to "Retraining a brainy model (unet) on kwyk dataset using nobrainer API" on Mar 5, 2024

hvgazula commented Mar 5, 2024

Setting block_shape = None runs successfully on the test data (whose shape is (256, 256, 256)). However, the process gets killed when running the same on the kwyk tfrecords, even with the batch size set to 1 or 2 (depending on the number of GPUs). This may have to do with how large the tfrecord shards are. To test, create a new set of tfrecords with fewer shards and see if that works.
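One way to test the sharding hypothesis is to reshard: assign volumes round-robin to a different number of shard files and retry. A minimal sketch of just the assignment and naming logic (the prefix and shard counts are hypothetical; in practice each shard would be written out with tf.io.TFRecordWriter):

```python
# Sketch of resharding bookkeeping for a TFRecord dataset.
# Only the index math and filenames are shown; actual serialization
# of volumes into tf.train.Example records is omitted.

def shard_assignments(n_examples, n_shards):
    """Round-robin mapping: example index -> shard index."""
    return [i % n_shards for i in range(n_examples)]

def shard_filenames(prefix, n_shards):
    """Conventional TFRecord shard names, e.g. kwyk-00000-of-00004.tfrec."""
    return [f"{prefix}-{i:05d}-of-{n_shards:05d}.tfrec"
            for i in range(n_shards)]

# Example: distribute 10 volumes over 4 shards.
assignments = shard_assignments(10, 4)
names = shard_filenames("kwyk", 4)
print(assignments)
print(names[0])
```

Fewer, smaller shards make it easier to isolate whether the OOM kill is tied to shard size rather than to the model itself.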


hvgazula commented Mar 5, 2024

Note: The successful run was on 2 A100s.
