
Tracking issue for SPHINX quantization & other memory issues #114

Open · linziyi96 opened this issue Nov 23, 2023 · 4 comments

linziyi96 (Contributor) commented Nov 23, 2023

We have received several requests (#112, #110, #97) to run SPHINX inference on GPUs with less memory. We also believe that fitting the model within a 24GB memory budget benefits a broad range of users who would like to run it locally on commodity GPUs like the RTX 3090 or 4090.

With the latest update #113, NF4 quantization now runs on SPHINX without errors (i.e., resolving #97). Memory usage is a bit under 23GB, so the model should fit on a single 24GB GPU (3090, 4090, or A5000) even with ECC turned on.

[Screenshot: GPU memory usage of the NF4-quantized model, just under 23GB]
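For reference, NF4 here is the 4-bit NormalFloat scheme from bitsandbytes. SPHINX is loaded through LLaMA2-Accessory rather than Hugging Face transformers, but the sketch below shows how the same quantization scheme is typically requested through the transformers API; the model name is a placeholder, not SPHINX's actual loading path:

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # NF4 quantization config: 4-bit NormalFloat weights, fp16 compute.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",             # NormalFloat4 data type
        bnb_4bit_compute_dtype=torch.float16,  # matmuls still run in fp16
    )

    # "some-org/some-llm" is a placeholder; SPHINX is not loaded this way.
    model = AutoModelForCausalLM.from_pretrained(
        "some-org/some-llm",
        quantization_config=bnb_config,
        device_map="auto",
    )

Weights are quantized to 4 bits at load time while activations and matmuls stay in fp16, which is roughly where the "a bit under 23GB" figure comes from.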

We are still running a complete benchmark of the quantized model and will post updates under this issue. Meanwhile, questions are welcome :)

quizD commented Nov 27, 2023

When I used 4*16GB VRAM GPUs to run the script from SPHINX/README.md -> Multi-GPU inference, or

    torchrun --master_port=1112 --nproc_per_node=2 inference.py

I still got an OutOfMemoryError, and I could see 2 GPUs at 100% utilization. Am I using the multi-GPU script correctly? I tried another way, but it also failed.

File "/xxxx/miniconda3/envs/accessory/lib/python3.10/site-packages/transformers/models/blip_2/modeling_blip_2.py", line 427, in __init__ self.layers = nn.ModuleList([Blip2EncoderLayer(config) for _ in range(config.num_hidden_layers)]) File "/xxxx/miniconda3/envs/accessory/lib/python3.10/site-packages/transformers/models/blip_2/modeling_blip_2.py", line 427, in <listcomp> self.layers = nn.ModuleList([Blip2EncoderLayer(config) for _ in range(config.num_hidden_layers)]) File "/xxxx/miniconda3/envs/accessory/lib/python3.10/site-packages/transformers/models/blip_2/modeling_blip_2.py", line 223, in __init__ self.self_attn = Blip2Attention(config) File "/xxxx/miniconda3/envs/accessory/lib/python3.10/site-packages/transformers/models/blip_2/modeling_blip_2.py", line 139, in __init__ self.qkv = nn.Linear(self.embed_dim, 3 * self.embed_dim, bias=False) File "/xxxx/miniconda3/envs/accessory/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 96, in __init__ self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs)) File "/xxxx/miniconda3/envs/accessory/lib/python3.10/site-packages/torch/utils/_device.py", line 62, in __torch_function__ return func(*args, **kwargs) torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 0; 15.57 GiB total capacity; 14.47 GiB already allocated; 90.12 MiB free; 14.89 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Looking forward to your reply!

linziyi96 changed the title from "Tracking issue for SPHINX quantization" to "Tracking issue for SPHINX quantization & other memory issues" on Nov 27, 2023
linziyi96 (Contributor, Author) commented Nov 27, 2023

> (quoting quizD's OutOfMemoryError report above)

With 16GB of memory per GPU, SPHINX will need to run on all 4 GPUs: 26GB for LM params, 6GB for visual params, 4GB for kv-cache, and 3GB for SAM add up to roughly 39GB, which exceeds the 32GB available on two GPUs. We plan to add this support in the next few days (currently, we only support running on 1 or 2 GPUs).
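A quick back-of-envelope check of those numbers (the per-component figures are the estimates quoted above; real usage adds activation and allocator overhead):

    # Rough memory budget, in GB, from the figures above.
    lm_params, visual_params, kv_cache, sam = 26, 6, 4, 3
    total = lm_params + visual_params + kv_cache + sam  # 39 GB
    assert total > 2 * 16   # does not fit on two 16GB GPUs
    assert total <= 4 * 16  # fits on four 16GB GPUs, ignoring overhead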

linziyi96 (Contributor, Author) commented:

#116 fixes inference memory usage with image inputs.

We have now moved inference development to small (24/16GB) GPUs so that such errors no longer slip by on the large training GPUs.

linziyi96 (Contributor, Author) commented Nov 27, 2023

Ongoing developments as of 27 Nov:

FP16 inference memory optimizations

  • Support re-sharding the model to larger tensor-parallel degrees (currently, only re-sharding to smaller degrees is supported) so that many small GPUs (e.g., 4*16GB) can be used ([WIP] Further memory optimization of SPHINX series models #118); a simplified sketch of the idea follows after this list
  • Shard the vision encoders across the tensor-parallel workers (currently, they are replicated among the tensor-parallel workers, which becomes inefficient with many small GPUs)
  • Better handling of SAM: allow SAM to be sharded or disabled (at the cost of losing the segmentation functionality) to avoid uneven GPU memory usage
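For intuition, re-sharding to a larger tensor-parallel degree amounts to concatenating the existing shards back into the full weight and splitting it into more pieces. The sketch below is illustrative only; reshard is a hypothetical helper, not the actual LLaMA2-Accessory implementation:

    import torch

    def reshard(shards, new_degree, dim=0):
        # Hypothetical helper: merge the old tensor-parallel shards,
        # then split the full weight into `new_degree` shards along `dim`.
        full = torch.cat(shards, dim=dim)
        assert full.size(dim) % new_degree == 0
        return list(torch.chunk(full, new_degree, dim=dim))

    # Example: a 4096x4096 weight held as two degree-2 shards becomes
    # four 1024x4096 shards for degree-4 (for a dim-0-sharded layer).
    old_shards = [torch.randn(2048, 4096) for _ in range(2)]
    new_shards = reshard(old_shards, new_degree=4)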

If you have other feature requests about SPHINX inference, please feel free to reply under this issue.
