Diffusion Models Overview
This wiki page provides detailed technical information about the diffusion models supported by OneTrainer, focusing on SD1.5, SDXL, and Flux. The information is aimed at advanced users.
SD1.5
SD1.5 uses a UNet architecture with an encoder-decoder structure, based on a hierarchy of denoising autoencoders.
The final training resolution was effectively 512x512.
Tokenization and Max Tokens
- Uses the CLIP tokenizer
- Max tokens per caption in OneTrainer: 75 (see the token-counting sketch below)
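To check whether a caption fits inside that limit, the CLIP tokenizer can be queried directly. A minimal sketch, assuming the transformers library and the openai/clip-vit-large-patch14 tokenizer (the tokenizer SD1.5's text encoder is based on); the 77-token window minus the BOS/EOS special tokens leaves the 75 usable caption tokens:

```python
from transformers import CLIPTokenizer

# SD1.5's text encoder is CLIP ViT-L/14; its tokenizer is published under this id.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

caption = "a photo of an astronaut riding a horse on the moon"
ids = tokenizer(caption).input_ids  # includes the BOS and EOS special tokens

usable = tokenizer.model_max_length - 2  # 77 - 2 = 75 caption tokens
caption_tokens = len(ids) - 2
print(f"caption tokens: {caption_tokens} / {usable}")
if caption_tokens > usable:
    print("caption will be truncated")
```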
LoRA Full Set of Blocks / Layer Keys
A working example of a custom layer set for SD1.5 LoRA training is:
down_blocks.1.attentions.0,down_blocks.1.attentions.1,down_blocks.2.attentions.0,down_blocks.2.attentions.1,mid_block.attentions.0
The complete set of blocks for SD1.5 can be referenced here or here, or resolved programmatically as sketched below.
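As a sanity check, the comma-separated preset above can be resolved against a loaded SD1.5 UNet. A minimal sketch using diffusers; the checkpoint id is only an example, and any SD1.5-style model with a diffusers "unet" subfolder should work:

```python
from diffusers import UNet2DConditionModel

preset = ("down_blocks.1.attentions.0,down_blocks.1.attentions.1,"
          "down_blocks.2.attentions.0,down_blocks.2.attentions.1,"
          "mid_block.attentions.0")

# Example checkpoint id (assumption); substitute the base model you actually train on.
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

for prefix in preset.split(","):
    block = unet.get_submodule(prefix)  # raises AttributeError if a key is misspelled
    print(prefix, "->", type(block).__name__)
```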
VAE Compression
- Compression factor: 8x per spatial dimension (e.g. a 512x512 image maps to a 64x64 latent; see the sketch below)
- VAE trained at 256x256 resolution
- Number of latent channels: 4
Paper: https://arxiv.org/pdf/2112.10752
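The compression factor and channel count above can be checked empirically by encoding a dummy image with the SD1.5 VAE. A small sketch using diffusers' AutoencoderKL; the checkpoint id is only an example:

```python
import torch
from diffusers import AutoencoderKL

# Example checkpoint id (assumption); any SD1.5-style model with a "vae" subfolder works.
vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

image = torch.randn(1, 3, 512, 512)  # dummy RGB image tensor
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()

# Expect [1, 4, 64, 64]: 4 latent channels, 512/8 = 64 per spatial dimension.
print(latents.shape)
```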
SDXL
SDXL uses an enhanced UNet architecture that is significantly larger than SD1.5's.
SDXL is trained at higher resolutions, effectively 1024x1024.
Tokenization and Max Tokens
- Uses two CLIP text encoders (CLIP ViT-L & OpenCLIP ViT-bigG); see the sketch below
- Max tokens per caption in OneTrainer: 75, as with SD1.5
VAE Compression
- Compression factor: 8x per spatial dimension
- VAE trained at 256x256 resolution
- Uses the same VAE architecture as SD1.5, but retrained with a larger batch size and with EMA enabled
Paper: https://arxiv.org/pdf/2307.01952
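To see the dual-text-encoder setup concretely, both tokenizers and text encoder configs can be loaded from the SDXL base repository. A sketch, assuming the stabilityai/stable-diffusion-xl-base-1.0 diffusers layout (tokenizer/text_encoder for CLIP ViT-L, tokenizer_2/text_encoder_2 for OpenCLIP ViT-bigG):

```python
from transformers import CLIPTextConfig, CLIPTokenizer

repo = "stabilityai/stable-diffusion-xl-base-1.0"  # assumed repo id / diffusers layout

tok_l = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")    # CLIP ViT-L
tok_g = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer_2")  # OpenCLIP ViT-bigG

cfg_l = CLIPTextConfig.from_pretrained(repo, subfolder="text_encoder")
cfg_g = CLIPTextConfig.from_pretrained(repo, subfolder="text_encoder_2")

# Both tokenizers use a 77-token window (75 usable caption tokens);
# the two encoders differ in embedding width.
print(tok_l.model_max_length, tok_g.model_max_length)
print(cfg_l.hidden_size, cfg_g.hidden_size)
```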
Flux
Placeholder: there is not much public information, nor a paper, available yet.
Training resolution: unknown, but at least the same as or higher than SDXL.
Tokenization and Max Tokens
- Same as SDXL: 75 tokens max in OneTrainer; anything longer gets truncated.
- Text encoders: CLIP L/14 and T5-v1 (see the sketch below)
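For a concrete look at the two tokenizers, both can be loaded from the Flux repository and compared on the same caption. A sketch, assuming the black-forest-labs/FLUX.1-dev diffusers layout (tokenizer = CLIP, tokenizer_2 = T5); access to the repo may require accepting its license:

```python
from transformers import CLIPTokenizer, T5TokenizerFast

repo = "black-forest-labs/FLUX.1-dev"  # assumed repo id / diffusers layout

clip_tok = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")    # CLIP L/14
t5_tok = T5TokenizerFast.from_pretrained(repo, subfolder="tokenizer_2")  # T5 encoder

caption = "a detailed studio photograph of a vintage camera on a wooden desk"
# The CLIP side counts against the 77-token window (75 usable caption tokens);
# the T5 side has a much longer context.
print("CLIP tokens:", len(clip_tok(caption).input_ids))
print("T5 tokens:  ", len(t5_tok(caption).input_ids))
```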
Full Attention Layer Keys (SD1.5)
For reference, the complete set of UNet attention layer keys for SD1.5 (the full set referenced in the LoRA section above) is:
[
"down_blocks.0.attentions.0.transformer_blocks.0.attn1",
"down_blocks.0.attentions.0.transformer_blocks.0.attn2",
"down_blocks.0.attentions.1.transformer_blocks.0.attn1",
"down_blocks.0.attentions.1.transformer_blocks.0.attn2",
"down_blocks.1.attentions.0.transformer_blocks.0.attn1",
"down_blocks.1.attentions.0.transformer_blocks.0.attn2",
"down_blocks.1.attentions.1.transformer_blocks.0.attn1",
"down_blocks.1.attentions.1.transformer_blocks.0.attn2",
"down_blocks.2.attentions.0.transformer_blocks.0.attn1",
"down_blocks.2.attentions.0.transformer_blocks.0.attn2",
"down_blocks.2.attentions.1.transformer_blocks.0.attn1",
"down_blocks.2.attentions.1.transformer_blocks.0.attn2",
"up_blocks.1.attentions.0.transformer_blocks.0.attn1",
"up_blocks.1.attentions.0.transformer_blocks.0.attn2",
"up_blocks.1.attentions.1.transformer_blocks.0.attn1",
"up_blocks.1.attentions.1.transformer_blocks.0.attn2",
"up_blocks.1.attentions.2.transformer_blocks.0.attn1",
"up_blocks.1.attentions.2.transformer_blocks.0.attn2",
"up_blocks.2.attentions.0.transformer_blocks.0.attn1",
"up_blocks.2.attentions.0.transformer_blocks.0.attn2",
"up_blocks.2.attentions.1.transformer_blocks.0.attn1",
"up_blocks.2.attentions.1.transformer_blocks.0.attn2",
"up_blocks.2.attentions.2.transformer_blocks.0.attn1",
"up_blocks.2.attentions.2.transformer_blocks.0.attn2",
"up_blocks.3.attentions.0.transformer_blocks.0.attn1",
"up_blocks.3.attentions.0.transformer_blocks.0.attn2",
"up_blocks.3.attentions.1.transformer_blocks.0.attn1",
"up_blocks.3.attentions.1.transformer_blocks.0.attn2",
"up_blocks.3.attentions.2.transformer_blocks.0.attn1",
"up_blocks.3.attentions.2.transformer_blocks.0.attn2",
"mid_block.attentions.0.transformer_blocks.0.attn1",
"mid_block.attentions.0.transformer_blocks.0.attn2"
]
The full set of layers can also be seen here, or regenerated programmatically as sketched below.
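The same list can be rebuilt from a loaded SD1.5 UNet by filtering its modules for the attn1/attn2 attention layers. A sketch using diffusers; the checkpoint id is only an example:

```python
from diffusers import UNet2DConditionModel

# Example checkpoint id (assumption); substitute your SD1.5 base model.
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

# Every self-attention (attn1) and cross-attention (attn2) module in the UNet.
attn_keys = [name for name, _ in unet.named_modules()
             if name.endswith((".attn1", ".attn2"))]
for key in attn_keys:
    print(key)
```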