Small changes to reduce peak memory. #389
Conversation
This is actually really nice @robieta! Peak memory is one of the big issues in practice when finetuning LLaMA 65B / Falcon 40B as well. Once this is merged, we should also apply this to lit-parrot!
@carmocca definitely, let's add the same fixes to all scripts using an optimizer and FSDP.
Ok, added the same tweaks to full finetuning and redpajama pretraining, which are the two cases where the optimizer state is chunkier. For adapter and LoRA I'm not expecting this to be a problem; @robieta please keep me honest here.
I can confirm it about foreach from experiments in lit-gpt; haven't tested limit_all_gathers yet.
Whoa, I just ran those and I'm seeing a huge speedup (95%!) by setting
The finetuning scripts in this repo use DeepSpeed, unlike lit-gpt, where I swapped it for FSDP: Lightning-AI/litgpt#118. So we can merge this.
LLaMA 7B is very close to the memory limit when sharded across 4x 40GB cards, and several recent library changes have pushed it over that limit, so it now OOMs when run. (Notably, it turns out we were cheating and using a bit less memory than a correct implementation would.) This PR introduces two small changes which prevent Shakespeare pretraining from OOMing and greatly improve its performance:
foreach=False

Some PyTorch optimizers can group parameter updates, which generally improves performance by reducing the number of kernel launches and, consequently, host-side latency. However, this grouping increases the peak memory footprint: instead of allocating and freeing a size-`X` Tensor `k` times, you allocate and free a single `k * X` sized Tensor.
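For illustration, here is a minimal sketch of disabling the grouped update path on a stock PyTorch optimizer; the model and learning rate are placeholders, not the values used by the scripts in this PR.

```python
import torch

# Placeholder module; the actual scripts build a full GPT-style model.
model = torch.nn.Linear(4096, 4096)

# foreach=False disables the grouped ("foreach") update path, so parameters are
# stepped one tensor at a time. This costs some extra kernel-launch overhead but
# avoids materializing the large fused buffer, lowering peak memory.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, foreach=False)
```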
limit_all_gathers=True

Breaking up the optimizer step brings the logical memory back below the OOM threshold, but the memory statistics still show reserved memory maxing out. The reason is that FSDP is overzealous about launching all-gathers in an attempt to overlap as much communication and compute as possible. The trouble is that all of those concurrent in-flight requests take up memory; more specifically, the CUDACachingAllocator caches per stream, so it's not straightforward for it to reclaim memory from the communication stream and reuse it in the compute stream. By restricting the number of in-flight all-gathers we get about a 3x performance improvement (6-8 seconds/step -> 1.5-3 seconds/step).
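For reference, a minimal sketch of passing this flag to raw PyTorch FSDP, assuming a process group has already been initialized (e.g. via torchrun); the scripts in this repo wire FSDP up through their own launcher, so this is illustrative only.

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Placeholder module; assumes torch.distributed is already initialized.
model = torch.nn.Linear(4096, 4096).cuda()

# limit_all_gathers=True rate-limits FSDP's parameter prefetching so only a
# bounded number of all-gathered shards are in flight at once, which keeps the
# communication stream from hoarding blocks in the CUDACachingAllocator.
sharded_model = FSDP(model, limit_all_gathers=True)
```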