
Tracking issue for SPHINX quantization & other memory issues #114

Open · linziyi96 opened this issue Nov 23, 2023 · 4 comments

linziyi96 (Contributor) commented Nov 23, 2023

We have received several requests (#112, #110, #97) to run SPHINX inference on GPUs with less memory. We also believe that fitting the model within a 24GB memory budget benefits a broad range of users who would like to run it locally on commodity GPUs like the RTX 3090 or 4090.

With the latest update #113, NF4 quantization now runs on SPHINX without errors (i.e., resolving #97). Memory usage is a bit under 23GB, so the model should fit on a single 24GB GPU (3090, 4090, or A5000) even with ECC turned on.

[Screenshot: GPU memory usage of the NF4-quantized model, just under 23GB]
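For reference, NF4 here is the 4-bit NormalFloat scheme from bitsandbytes. SPHINX is loaded through LLaMA2-Accessory rather than Hugging Face transformers, but the sketch below shows how the same quantization scheme is typically requested through the transformers API; the model name is a placeholder, not SPHINX's actual loading path:

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # NF4 quantization config: 4-bit NormalFloat weights, fp16 compute.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",             # NormalFloat4 data type
        bnb_4bit_compute_dtype=torch.float16,  # matmuls still run in fp16
    )

    # "some-org/some-llm" is a placeholder; SPHINX is not loaded this way.
    model = AutoModelForCausalLM.from_pretrained(
        "some-org/some-llm",
        quantization_config=bnb_config,
        device_map="auto",
    )

Weights are quantized to 4 bits at load time while activations and matmuls stay in fp16, which is roughly where the "a bit under 23GB" figure comes from.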

We are still running a complete benchmark of the quantized model and will post updates under this issue. Meanwhile, questions are welcome :)

quizD commented Nov 27, 2023

When I used 4*16GB VRAM GPUs to run the script from SPHINX/README.md -> Multi-GPU inference, or

    torchrun --master_port=1112 --nproc_per_node=2 inference.py

I still got an OutOfMemoryError, and I could see 2 GPUs at 100% utilization. Am I using the multi-GPU script correctly? I tried another way, but it also failed.

File "/xxxx/miniconda3/envs/accessory/lib/python3.10/site-packages/transformers/models/blip_2/modeling_blip_2.py", line 427, in __init__ self.layers = nn.ModuleList([Blip2EncoderLayer(config) for _ in range(config.num_hidden_layers)]) File "/xxxx/miniconda3/envs/accessory/lib/python3.10/site-packages/transformers/models/blip_2/modeling_blip_2.py", line 427, in <listcomp> self.layers = nn.ModuleList([Blip2EncoderLayer(config) for _ in range(config.num_hidden_layers)]) File "/xxxx/miniconda3/envs/accessory/lib/python3.10/site-packages/transformers/models/blip_2/modeling_blip_2.py", line 223, in __init__ self.self_attn = Blip2Attention(config) File "/xxxx/miniconda3/envs/accessory/lib/python3.10/site-packages/transformers/models/blip_2/modeling_blip_2.py", line 139, in __init__ self.qkv = nn.Linear(self.embed_dim, 3 * self.embed_dim, bias=False) File "/xxxx/miniconda3/envs/accessory/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 96, in __init__ self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs)) File "/xxxx/miniconda3/envs/accessory/lib/python3.10/site-packages/torch/utils/_device.py", line 62, in __torch_function__ return func(*args, **kwargs) torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 0; 15.57 GiB total capacity; 14.47 GiB already allocated; 90.12 MiB free; 14.89 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Looking forward to your reply!

linziyi96 changed the title from "Tracking issue for SPHINX quantization" to "Tracking issue for SPHINX quantization & other memory issues" on Nov 27, 2023
linziyi96 (Contributor, Author) commented Nov 27, 2023

> (quoting quizD's OutOfMemoryError report above)

With 16GB of memory per GPU, SPHINX will need to run on all 4 GPUs: 26GB for LM params, 6GB for visual params, 4GB for kv-cache, and 3GB for SAM add up to roughly 39GB, which exceeds the 32GB available on two GPUs. We plan to add this support in the next few days (currently, we only support running on 1 or 2 GPUs).
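A quick back-of-envelope check of those numbers (the per-component figures are the estimates quoted above; real usage adds activation and allocator overhead):

    # Rough memory budget, in GB, from the figures above.
    lm_params, visual_params, kv_cache, sam = 26, 6, 4, 3
    total = lm_params + visual_params + kv_cache + sam  # 39 GB
    assert total > 2 * 16   # does not fit on two 16GB GPUs
    assert total <= 4 * 16  # fits on four 16GB GPUs, ignoring overhead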

linziyi96 (Contributor, Author) commented:

#116 fixes inference memory usage with image inputs.

We have now moved inference development to small (24/16GB) GPUs so that such errors no longer slip by on the large training GPUs.

linziyi96 (Contributor, Author) commented Nov 27, 2023

Ongoing developments as of 27 Nov:

FP16 inference memory optimizations

  • Support re-sharding the model to larger tensor-parallel degrees (currently, only re-sharding to smaller degrees is supported) so that many small GPUs (e.g., 4*16GB) can be used ([WIP] Further memory optimization of SPHINX series models #118); a simplified sketch of the idea follows after this list
  • Shard the vision encoders across the tensor-parallel workers (currently, they are replicated among the tensor-parallel workers, which becomes inefficient with many small GPUs)
  • Better handling of SAM: allow SAM to be sharded or disabled (at the cost of losing the segmentation functionality) to avoid uneven GPU memory usage
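For intuition, re-sharding to a larger tensor-parallel degree amounts to concatenating the existing shards back into the full weight and splitting it into more pieces. The sketch below is illustrative only; reshard is a hypothetical helper, not the actual LLaMA2-Accessory implementation:

    import torch

    def reshard(shards, new_degree, dim=0):
        # Hypothetical helper: merge the old tensor-parallel shards,
        # then split the full weight into `new_degree` shards along `dim`.
        full = torch.cat(shards, dim=dim)
        assert full.size(dim) % new_degree == 0
        return list(torch.chunk(full, new_degree, dim=dim))

    # Example: a 4096x4096 weight held as two degree-2 shards becomes
    # four 1024x4096 shards for degree-4 (for a dim-0-sharded layer).
    old_shards = [torch.randn(2048, 4096) for _ in range(2)]
    new_shards = reshard(old_shards, new_degree=4)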

If you have other feature requests about SPHINX inference, please feel free to reply under this issue.
