[core] Ensure loading mp first #252
Conversation
Thanks Sayak, makes sense. Just FYI, I'm in the process of removing accelerate in #245 so that we can use our own distributed implementation directly in pure PyTorch. With accelerate, it is becoming very hard for me to dive into the code and make the changes needed for PP/TP/CP to work, which has led to several NCCL hangs. There is a lot of code that accelerate uses to glue together DeepSpeed, FSDP, DDP, etc., so you can imagine how much of a problem debugging is (not saying that's a bad thing, just that we're limited until they provide APIs for PP/TP/CP). Having our own backend, built with all parallelisms in mind and following the fastest inference backends like xDiT and ParaAttention, is easier to understand and more easily tailored to our use case. This is absolutely necessary if we want to be doing multi-node training for distillation, etc. in the near future, imo.
We can do a stable release before the major codebase change lands. If you've tested this well for the common cases, it's good to merge for the stable release.
On second thought, maybe I could write a dispatch layer that allows using the accelerate backend as well as the one I'm working on. Hopefully it shouldn't be too hard, and this way we'll maintain backwards compatibility with whatever existing support we have.
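Roughly, such a dispatch layer might look like this (purely a hypothetical sketch; the `ParallelBackend` enum and `init_backend` function are illustrative names, not the actual design in #245):

```python
from enum import Enum


class ParallelBackend(str, Enum):
    """Hypothetical identifiers for the two backends being discussed."""

    ACCELERATE = "accelerate"
    PTD = "ptd"  # in-house backend built directly on torch.distributed


def init_backend(name: str):
    backend = ParallelBackend(name)
    if backend is ParallelBackend.ACCELERATE:
        # Keep the existing accelerate path for backwards compatibility.
        from accelerate import Accelerator

        return Accelerator()
    if backend is ParallelBackend.PTD:
        # New pure-PyTorch path, designed with PP/TP/CP in mind
        # (placeholder for the implementation landing in #245).
        import torch.distributed as dist

        if not dist.is_initialized():
            dist.init_process_group(backend="nccl")
        return dist
    raise ValueError(f"Unknown backend: {name}")
```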
I echo that and will support you!
Yeah, can confirm this works for the currently supported use cases.
Well, we know that we need a framework for doing parallelism and other shenanigans easily, so #245 is a no-brainer here. Maybe it's easier to have the dispatcher in a separate PR so that your PR focuses mainly on non-accelerate support? Just offering a perspective.
Referencing: https://huggingface.co/docs/accelerate/main/en/concept_guides/deferring_execution
This PR ensures:
We download the files on the main process first; the other processes then load the already-cached files afterward.
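For reference, the pattern from the linked guide looks roughly like this (a minimal sketch; the model id is a placeholder, and the actual call site in this PR may differ):

```python
from accelerate import PartialState
from diffusers import DiffusionPipeline

state = PartialState()

# The main process enters the block first and populates the HF cache;
# the other processes wait at the barrier and then load the cached files.
with state.main_process_first():
    pipe = DiffusionPipeline.from_pretrained("some/model-id")  # placeholder id
```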
I have run the tests, too. Absolutely okay if you don't want it.