Reusing the same pipeline (FluxPipeline) increases the inference duration #10705
Comments
Removed the LoRA-related code and still the same issue: 100%|█████████████████████████████████████████████████████████████| 40/40 [06:24<00:00, 9.61s/it]
On an 8GB GPU + 8GB system RAM, I suspect this is overflowing to swap. Can you confirm the VRAM and RAM usage during generation? You can try precomputing prompt_embeds, which should reduce RAM requirements for generation and hopefully avoid swap.

```python
import torch
from diffusers import (
    BitsAndBytesConfig as DiffusersBitsAndBytesConfig,
    FluxTransformer2DModel,
    FluxPipeline,
)
from transformers import T5EncoderModel

bfl_repo = "black-forest-labs/FLUX.1-dev"
dtype = torch.bfloat16

quantization_config = DiffusersBitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
)

# Load only the text encoders and precompute the prompt embeddings.
text_encoder_2 = T5EncoderModel.from_pretrained(
    bfl_repo,
    subfolder="text_encoder_2",
    quantization_config=quantization_config,
    torch_dtype=dtype,
)
pipe = FluxPipeline.from_pretrained(
    bfl_repo,
    transformer=None,
    vae=None,
    text_encoder_2=text_encoder_2,
    torch_dtype=dtype,
)
pipe.enable_model_cpu_offload()

prompt_embeds, pooled_prompt_embeds, _ = pipe.encode_prompt(
    prompt="Photograph capturing a woman seated in a car, looking straight ahead. Her face is partially obscured, making her expression hard to read, adding an air of mystery. Natural light filters through the car window, casting subtle reflections and shadows on her face and the interior. The colors are muted yet realistic, with a slight grain that evokes a 1970s film quality. The scene feels intimate and contemplative, capturing a quiet, introspective moment, mj",
    prompt_2=None,
)

# Free the text-encoder pipeline before loading the transformer.
del pipe
torch.cuda.empty_cache()

transformer_4bit = FluxTransformer2DModel.from_pretrained(
    bfl_repo,
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
)
pipe = FluxPipeline.from_pretrained(
    bfl_repo,
    transformer=transformer_4bit,
    text_encoder=None,
    text_encoder_2=None,
    torch_dtype=dtype,
)
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()

# Reuse the same pipeline for multiple generations with the precomputed embeddings.
image = pipe(
    prompt_embeds=prompt_embeds,
    pooled_prompt_embeds=pooled_prompt_embeds,
    width=1072,
    height=1920,
    max_sequence_length=512,
    num_inference_steps=40,
    guidance_scale=50,
    generator=torch.Generator().manual_seed(1349562290),
).images[0]
image.save("out_majicbeauty5.png")

torch.cuda.empty_cache()

image = pipe(
    prompt_embeds=prompt_embeds,
    pooled_prompt_embeds=pooled_prompt_embeds,
    width=1072,
    height=1920,
    max_sequence_length=512,
    num_inference_steps=50,
    guidance_scale=40,
    generator=torch.Generator().manual_seed(1349562290),
).images[0]
image.save("out_majicbeauty6.png")
```

Lowering the resolution may also help as it will reduce intermediary tensor sizes.
Hello @hlky, I think I didn't explain the issue well. What I have implemented is: the pipe is initialized with the models and then reused for multiple generations. This approach saves the model loading / quantization time, etc. If I use `del pipe`, the same pipe can't be reused, which means a longer generation time for each image because the models have to be offloaded and loaded again for every generation. Let me add one example which works.
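(To illustrate the reuse pattern being described, a minimal hypothetical sketch; the repo id, prompts, and settings are placeholders rather than the exact script from this report.)

```python
import torch
from diffusers import FluxPipeline

# Build the pipeline once; the loading / quantization cost is paid only here.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()

# Reuse the same pipeline object for every generation.
for i, prompt in enumerate(["placeholder prompt 1", "placeholder prompt 2"]):
    image = pipe(prompt, num_inference_steps=40).images[0]
    image.save(f"out_{i}.png")
```

With this pattern every call after the first should take roughly the same time, which is the behavior being reported for the other pipelines.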
@nitinmukesh On an 8GB GPU + 8GB system RAM, I suspect this is overflowing to swap. Can you confirm the VRAM and RAM usage during generation? The code example is a demonstration of a possible workaround: by precomputing the prompt embeds we can remove the text encoders from the total pipeline requirements, which may help avoid overflowing to swap. Lowering the resolution may also help, as it will reduce intermediary tensor sizes.
You can still reuse the pipeline that holds the transformer for multiple generations; only the text-encoder pipeline is deleted after the embeds are computed. We are also working on optimizations for both low-VRAM and low-system-RAM users; for example, check out #10503 #10623.
Even though I reuse my Flux pipeline to avoid reloading it over and over, at some point you still run out of VRAM and get OOM errors.

```python
if pipe is None:
    ....
```
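(A hedged sketch of the lazy-initialization idea the snippet above hints at, with explicit cache cleanup between generations; the repo id, prompt handling, and `generate` helper are hypothetical, not code from this thread.)

```python
import gc

import torch
from diffusers import FluxPipeline

pipe = None  # created lazily on the first request, then reused

def generate(prompt: str, path: str) -> None:
    global pipe
    if pipe is None:
        # The first call pays the full loading cost; later calls reuse the object.
        pipe = FluxPipeline.from_pretrained(
            "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
        )
        pipe.enable_model_cpu_offload()
    image = pipe(prompt, num_inference_steps=40).images[0]
    image.save(path)
    # Release cached allocations so repeated calls don't creep toward OOM.
    gc.collect()
    torch.cuda.empty_cache()
```

Note that `torch.cuda.empty_cache()` only returns cached blocks to the driver; it will not help if the weights plus activations genuinely exceed the available VRAM.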
Hi @ukaprch, I ran your first code and it's a miracle you even got it running on an 8 GB VRAM and 8 GB RAM machine. Without LoRAs, and not counting the quantization (which needs a lot more VRAM and RAM than you have), since you're using a higher resolution the inference needs at least 9 GB of VRAM and 15 GB of RAM. As @hlky guessed, you're swapping to disk; on Linux you would get an OOM, but the reason you're getting those slow times is that you're using the RAM and the disk to do inference, which is really bad. Also, just to clarify: you're using more steps on the second run, which will make the second run take longer.
I would like to mention that this issue is not about slow inference speed or how much VRAM is required; rather, the inference time should remain the same, if,
Here is the example code working fine. As I mentioned earlier, this approach works with several other t2i, t2v, i2v, and v2v pipelines without issue; the only problem is with FluxPipeline.
I generated 2 videos, and both took almost the same time (01:20 and 01:16).
Here is the VRAM usage for FluxPipeline: [screenshot]
@nitinmukesh The other pipelines are not hitting the limit of your system; Flux is exceeding that limit, which drastically affects performance. All pipelines share the same core code and follow the same design principles; with regard to reusing a pipeline there is no difference. The only limit is system resources, which is confirmed by your screenshot: VRAM is full, RAM is full, and the NVMe has high activity because it is offloading to disk.
Hunyuan (39.0 GB) is much bigger than Flux (31.4 GB). I still think there is either a memory leak or some other issue with FluxPipeline.
This is the same for all pipelines. I use 40 GB for cache, without which none of the pipelines work, so the SSD is used for all pipelines considering I only have an 8+8 setup.
@nitinmukesh You are using a very small resolution and number of frames for Hunyuan, which reduces requirements. With Flux you are using a large resolution, which increases requirements. The cost of inference is weights + intermediary tensors, whose size is affected by resolution; in this case the size of Flux's transformer plus the intermediary tensors for the large resolution exceeds your system's limit, which causes offloading to disk and drastically affects performance. Can you try lowering the resolution, or precomputing the prompt embeds?
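(As a rough illustration of the resolution argument above, a back-of-the-envelope sketch. It assumes Flux's usual 8x VAE downsampling and 2x2 latent packing, i.e. roughly one image token per 16x16 pixel block; those numbers are an assumption on my part, not something stated in this thread.)

```python
# Approximate number of image tokens the transformer attends over, assuming an
# 8x VAE downsample followed by 2x2 latent packing.
def approx_flux_image_tokens(height: int, width: int) -> int:
    return (height // 16) * (width // 16)

print(approx_flux_image_tokens(1920, 1072))  # ~8040 tokens at the reported resolution
print(approx_flux_image_tokens(512, 512))    # 1024 tokens, roughly 8x fewer
```

Since activation and attention memory scale with the token count, dropping from 1920x1072 to around 512px shrinks the intermediary tensors by roughly an order of magnitude.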
With prompt_embeds I get a CUDA OOM even at the lowest resolution; not sure why. The issue is, however, resolved after restarting the laptop. Could this have to do with the model being stored/offloaded in virtual memory? I will do more testing. Thank you for your help.
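(Not from the thread: a small helper one could use to answer the VRAM/RAM question asked earlier; it assumes `psutil` is installed and a CUDA device is available.)

```python
import psutil
import torch

def log_memory(tag: str = "") -> None:
    # VRAM allocated / reserved by PyTorch on the current CUDA device.
    vram_alloc = torch.cuda.memory_allocated() / 1024**3
    vram_reserved = torch.cuda.memory_reserved() / 1024**3
    # System RAM usage as reported by the OS.
    ram = psutil.virtual_memory()
    print(
        f"[{tag}] VRAM allocated {vram_alloc:.2f} GiB "
        f"(reserved {vram_reserved:.2f} GiB), "
        f"RAM {ram.used / 1024**3:.2f}/{ram.total / 1024**3:.2f} GiB"
    )

# Example: call log_memory("before") / log_memory("after") around pipe(...).
```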
Just to make this issue clearer for future searches: there are a number of issues happening here at the same time. As I stated before, the VRAM you have isn't enough for Flux at that resolution.

The first issue is that you're using Windows with the default NVIDIA configuration, which means it never OOMs as long as you have enough RAM or swap space; it just makes inference really slow. You were using CPU offload, which offloaded the models to the CPU (in this case, the only important one is the T5, as it is really big) and filled your RAM during inference. At this step it doesn't matter whether you encoded the prompt beforehand or not; you need to delete the T5 from memory to be able to free the RAM (not VRAM). During the denoise, the whole Flux model plus the resolution you were using didn't fit in VRAM, which made Windows use the RAM, but your RAM was also full, so it then used the disk, which made everything really slow.

It will probably work for you if you use @hlky's solution (deleting the text encoders after encoding the prompt) and a lower resolution (with your VRAM you can probably do 512px), and if you use the same args (at least the same number of steps) you will get the same inference speed both times. You just need to make sure you don't go above the VRAM you have (7.5 GB in your case). We can't control what the drivers or Windows do, which is the problem here.
You're probably freeing the disk and the swap when you restart, but again, as I said, this is a Windows issue, not a diffusers, PyTorch, or even a Python one. On any other OS you would simply get an OOM.
Describe the bug
So I create the pipe and use it to generate multiple images with the same settings. The first inference takes 8 minutes, the next one 30 minutes. VRAM usage remains the same.
Tested on 8 GB VRAM + 8 GB RAM.
P.S. I have used AuraFlow, Sana, Hunyuan, LTX, Cog, and several other pipelines but didn't encounter this issue with any of them.
Reproduction
Logs
System Info
Who can help?
@yiyixuxu @DN6