[core] Ensure loading mp first #252
Conversation
Thanks Sayak, makes sense. Just FYI, I'm in the process of removing accelerate in #245 so that we can use our own distributed implementation directly in pure PyTorch. With accelerate, it is becoming very hard for me to dive into the code and make the changes needed for PP/TP/CP to work, which has led to several NCCL hangs. There is a lot of code that accelerate uses to glue together DeepSpeed, FSDP, DDP, etc., so you can imagine how much of a problem debugging is (not saying that's a bad thing, just that we're limited until they provide APIs for PP/TP/CP). Having our own backend, built with all parallelisms in mind and following the fastest inference backends like xDiT and ParaAttention, is easier to understand and more easily tailored to our use case. This is absolutely necessary if we want to be doing multi-node training for distillation, etc. in the near future, imo.
We can do a stable release before the major codebase change lands. If you've tested this well for the common cases, it's good to merge for the stable release.
On second thought, maybe I could write a dispatch layer that allows using the accelerate backend as well as the one I'm working on. Hopefully it shouldn't be too hard, and this way we'll maintain backwards compatibility with whatever existing support we have.
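Roughly, such a dispatch layer might look like this (purely a hypothetical sketch; the `ParallelBackend` enum and `init_backend` function are illustrative names, not the actual design in #245):

```python
from enum import Enum


class ParallelBackend(str, Enum):
    """Hypothetical identifiers for the two backends being discussed."""

    ACCELERATE = "accelerate"
    PTD = "ptd"  # in-house backend built directly on torch.distributed


def init_backend(name: str):
    backend = ParallelBackend(name)
    if backend is ParallelBackend.ACCELERATE:
        # Keep the existing accelerate path for backwards compatibility.
        from accelerate import Accelerator

        return Accelerator()
    if backend is ParallelBackend.PTD:
        # New pure-PyTorch path, designed with PP/TP/CP in mind
        # (placeholder for the implementation landing in #245).
        import torch.distributed as dist

        if not dist.is_initialized():
            dist.init_process_group(backend="nccl")
        return dist
    raise ValueError(f"Unknown backend: {name}")
```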
I echo that and will support you!
Yeah, can confirm this works for the currently supported use cases.
Well, we know that we need a framework for doing parallelism and other shenanigans easily, so #245 is a no-brainer here. Maybe it's easier to have the dispatcher in a separate PR so that your PR focuses mainly on non-accelerate support? Just offering a perspective.
Referencing: https://huggingface.co/docs/accelerate/main/en/concept_guides/deferring_execution
This PR ensures:
We download the files on the main process first; the other processes then load the already-cached files afterward.
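For reference, the pattern from the linked guide looks roughly like this (a minimal sketch; the model id is a placeholder, and the actual call site in this PR may differ):

```python
from accelerate import PartialState
from diffusers import DiffusionPipeline

state = PartialState()

# The main process enters the block first and populates the HF cache;
# the other processes wait at the barrier and then load the cached files.
with state.main_process_first():
    pipe = DiffusionPipeline.from_pretrained("some/model-id")  # placeholder id
```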
I have run the tests, too. Absolutely okay if you don't want it.