This guide provides a user-friendly breakdown of the command-line options available in SimpleTuner's `train_sdxl.py` script. These options offer a high degree of customization, allowing you to train your model to suit your specific requirements.
- What (`--model_type`): Choices: `full`, `lora`, `deepfloyd-full`, `deepfloyd-lora`, `deepfloyd-stage2`, `deepfloyd-stage2-lora`. Default: `lora`
- Why: Selects whether a LoRA or a full fine-tune is created. LoRA is only supported for SDXL.
Note: DeepFloyd uses the `train_sd2x.sh`/`train_sd21.py` training script and the `sd2x-env.sh` configuration file. See DEEPFLOYD.md for more information.
- What (`--sd3`): Enable Stable Diffusion 3 training quirks/overrides.
- Why: SD3 has three text encoders, is fairly hefty, and needs specific validation-time options taken into consideration. The equivalent option for this in the `sdxl-env.sh` environment file is `STABLE_DIFFUSION_3`.
Note: Stable Diffusion 3 uses the `train_sdxl.sh`/`train_sdxl.py` training script and the `sdxl-env.sh` configuration file.
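In the environment file, that might look like the following sketch (the `true` value is an assumption; check the example env file for the accepted format):

```bash
# sdxl-env.sh (sketch): enable the SD3-specific code paths.
# Equivalent to passing --sd3 on the command line; the value format
# shown here is an assumption.
export STABLE_DIFFUSION_3=true
```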
- What (`--pixart_sigma`): Enable PixArt Sigma training quirks/overrides.
- Why: PixArt shares traits with both SD3 and DeepFloyd, and needs special treatment at validation, training, and inference time. Use this option to enable PixArt training support. PixArt does not support ControlNet, LoRA, or `--validation_using_datasets`.
Note: Like SDXL and SD3, PixArt Sigma also uses the `train_sdxl.sh`/`train_sdxl.py` training script and the `sdxl-env.sh` configuration file.
- What (`--pretrained_model_name_or_path`): Path to the pretrained model or its identifier from huggingface.co/models.
- Why: To specify the base model you'll start training from. Use `--revision` and `--variant` to select specific versions from a repository.
- What (`--pretrained_t5_model_name_or_path`): Path to the pretrained T5 model or its identifier from huggingface.co/models.
- Why: When training PixArt, you might want to use a specific source for your T5 weights so that you can avoid downloading them multiple times when switching the base model you train from.
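For example, when training PixArt you might pin the T5 weights to one location (a sketch; the repository id and local path are illustrative):

```bash
# Sketch: reuse a single local copy of T5 across PixArt training runs.
# The model id and T5 path below are placeholders, not recommendations.
python train_sdxl.py \
  --pixart_sigma \
  --pretrained_model_name_or_path=PixArt-alpha/PixArt-Sigma-XL-2-1024-MS \
  --pretrained_t5_model_name_or_path=/models/t5-v1_1-xxl
```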
- What (`--hub_model_id`): The name of the Hugging Face Hub model and local results directory.
- Why: This value is used as the directory name under the location specified by `--output_dir`. If `--push_to_hub` is provided, this will become the name of the model on Hugging Face Hub.
- What (`--push_to_hub`): If provided, your model will be uploaded to Hugging Face Hub once training completes. Supplying `--push_checkpoints_to_hub` will additionally push every intermediary checkpoint.
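Together, the Hub options might be combined like so (a sketch; the repository name is illustrative):

```bash
# Sketch: upload the finished model, plus every intermediary checkpoint,
# to a Hub repository named by --hub_model_id.
python train_sdxl.py \
  --push_to_hub \
  --push_checkpoints_to_hub \
  --hub_model_id=my-sdxl-lora \
  --output_dir=output
```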
- What: Enables training a custom mixture-of-experts model series. See Mixture-of-Experts for more information on these options.
- What (`--data_backend_config`): Path to your SimpleTuner dataset configuration.
- Why: Multiple datasets on different storage media may be combined into a single training session.
- Example: See [multidatabackend.json.example](/multidatabackend.json.example) for an example configuration, and this document for more information on configuring the data loader.
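A minimal single-dataset configuration might look like the sketch below; the keys and values shown are illustrative, and multidatabackend.json.example remains the authoritative reference:

```bash
# Sketch: write a minimal dataloader config for one local dataset.
# All keys and values here are illustrative placeholders.
cat > multidatabackend.json <<'EOF'
[
  {
    "id": "my-dataset",
    "type": "local",
    "instance_data_dir": "/training/data",
    "resolution": 1.0,
    "resolution_type": "area",
    "crop": true,
    "caption_strategy": "textfile"
  }
]
EOF
```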
- What (`--override_dataset_config`): When provided, will allow SimpleTuner to ignore differences between the cached config inside the dataset and the current values.
- Why: When SimpleTuner is run for the first time on a dataset, it creates a cache document containing information about everything in that dataset, including the dataset config and its "crop"- and "resolution"-related values. Changing these arbitrarily or by accident could cause your training jobs to crash randomly, so it's highly recommended not to use this parameter, and instead to resolve the differences you'd like to apply in your dataset some other way.
- What (`--vae_cache_scan_behaviour`): Configure the behaviour of the integrity scan check.
- Why: A dataset could have incorrect settings applied at multiple points of training, e.g. if you accidentally delete the `.json` cache files from your dataset and switch the data backend config to use square images rather than aspect-crops. This will result in an inconsistent data cache, which can be corrected by setting `scan_for_errors` to `true` in your `multidatabackend.json` configuration file. When this scan runs, it relies on the setting of `--vae_cache_scan_behaviour` to determine how to resolve the inconsistency: `recreate` (the default) will remove the offending cache entry so that it can be recreated, and `sync` will update the bucket metadata to reflect the reality of the real training sample. Recommended value: `recreate`.
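Running the scan might look like this (a sketch):

```bash
# Sketch: with "scan_for_errors": true set in multidatabackend.json,
# instruct the scan to drop and rebuild any inconsistent cache entries.
python train_sdxl.py \
  --data_backend_config=multidatabackend.json \
  --vae_cache_scan_behaviour=recreate
```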
A lot of settings are instead set through the dataloader config, but the following options apply globally.
- What (`--resolution`): Input image resolution. Can be expressed as pixels or megapixels.
- Why: All images in the dataset will have their smaller edge resized to this resolution for training. It is recommended to use a value of 1.0 if also using `--resolution_type=area`. When using `--resolution_type=pixel` and `--resolution=1024px`, the images may become very large and use an excessive amount of VRAM. The recommended configuration is to combine `--resolution_type=area` with `--resolution=1` (or lower - `.25` would be a 512px model with data bucketing).
- What (`--resolution_type`): Tells SimpleTuner whether to use `area` size calculations or `pixel` edge calculations.
- Why: SimpleTuner's default `pixel` behaviour is to resize the image, keeping the aspect ratio. Setting the type to `area` instead uses the given megapixel value as the target size for the image, keeping the aspect ratio.
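For example (a sketch of the two resizing modes described above):

```bash
# Sketch: area mode - bucket images to roughly 1.0 megapixel each.
python train_sdxl.py --resolution_type=area --resolution=1.0

# Sketch: pixel mode - resize each image's smaller edge to 1024px.
python train_sdxl.py --resolution_type=pixel --resolution=1024px
```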
- What (`--validation_resolution`): Output image resolution, measured in pixels.
- Why: All images generated during validation will be this resolution. Useful if the model is being trained with a different resolution.
- What (`--caption_strategy`): Strategy for deriving image captions. Choices: `textfile`, `filename`, `parquet`, `instanceprompt`
- Why: Determines how captions are generated for training images.
  - `textfile` will use the contents of a `.txt` file with the same filename as the image.
  - `filename` will apply some cleanup to the filename before using it as the caption.
  - `parquet` requires a parquet file to be present in the dataset, and will use the `caption` column as the caption unless `parquet_caption_column` is provided. All captions must be present unless a `parquet_fallback_caption_column` is provided.
  - `instanceprompt` will use the value for `instance_prompt` in the dataset config as the prompt for every image in the dataset.
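As an illustration of the `textfile` strategy (a sketch; the filenames are hypothetical):

```bash
# Sketch: with the textfile strategy, each image is captioned by the
# .txt file that shares its name:
#   /training/data/dog-on-beach.png
#   /training/data/dog-on-beach.txt  <- contains the caption
python train_sdxl.py --caption_strategy=textfile
```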
- What (`--crop`): When `--crop=true` is supplied, SimpleTuner will crop all (new) images in the training dataset. It will not re-process old images.
- Why: Training on cropped images seems to result in better fine detail learning, especially on SDXL models.
- What (`--crop_style`): When `--crop=true`, the trainer may be instructed to crop in different ways.
- Why: The `crop_style` option can be set to `center` (or `centre`) for a classic centre-crop, `corner` to elect for the lowest-right corner, `face` to detect and centre upon the largest subject face, and `random` for a random image slice. Default: `random`.
- What (`--crop_aspect`): When using `--crop=true`, the `--crop_aspect` option may be supplied with a value of `square` or `preserve`.
- Why: The default crop behaviour is to crop all images to a square aspect ratio, but when `--crop_aspect=preserve` is supplied, the trainer will crop images to a size matching their original aspect ratio. This may help to keep multi-resolution support, but it may also harm training quality. Your mileage may vary.
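Putting the three cropping options together (a sketch, written with the flags as documented above; depending on your version, these settings may instead belong in the dataloader config):

```bash
# Sketch: crop every newly-processed image with a random slice,
# preserving each image's original aspect ratio.
python train_sdxl.py --crop=true --crop_style=random --crop_aspect=preserve
```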
- What (`--num_train_epochs`): Number of training epochs (the number of times that all images are seen). Setting this to 0 allows `--max_train_steps` to take precedence.
- Why: Determines the number of image repeats, which impacts the duration of the training process. More epochs tend to result in overfitting, but may be required to pick up the concepts you wish to train in. A reasonable value might be from 5 to 50.
- What (`--max_train_steps`): Number of training steps to exit training after. If set to 0, allows `--num_train_epochs` to take priority.
- Why: Useful for shortening the length of training.
- What (`--train_batch_size`): Batch size for the training data loader.
- Why: Affects the model's memory consumption, convergence quality, and training speed. The higher the batch size, the better the results will be, but a very high batch size might result in overfitting or destabilized training, as well as increasing the duration of the training session unnecessarily. Experimentation is warranted, but in general, you want to try to max out your video memory while not decreasing the training speed.
- What (`--gradient_accumulation_steps`): Number of update steps to accumulate before performing a backward/update pass, essentially splitting the work over multiple batches to save memory at the cost of a higher training runtime.
- Why: Useful for handling larger models or datasets.
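The effective batch size per GPU is the product of the two; for example (a sketch):

```bash
# Sketch: 4 images per micro-batch, accumulated over 4 micro-batches,
# gives an effective batch size of 16 per GPU.
python train_sdxl.py --train_batch_size=4 --gradient_accumulation_steps=4
```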
- What (`--learning_rate`): Initial learning rate after potential warmup.
- Why: The learning rate behaves as a sort of "step size" for gradient updates - too high, and we overstep the solution; too low, and we never reach the ideal solution. A minimal value for a `full` tune might be as low as `1e-7`, to a maximum of `1e-6`, while for `lora` tuning a minimal value might be `1e-5`, with a maximal value as high as `1e-3`. When a higher learning rate is used, it's advantageous to use an EMA network with a learning rate warmup - see `--use_ema`, `--lr_warmup_steps`, and `--lr_scheduler`.
- What (`--lr_scheduler`): How to scale the learning rate over time.
- Choices: constant, constant_with_warmup, cosine, cosine_with_restarts, polynomial (recommended), linear
- Why: Models benefit from continual learning rate adjustments to further explore the loss landscape. A cosine schedule is used as the default; this allows the training to smoothly transition between two extremes. If using a constant learning rate, it is common to select a too-high or too-low value, causing divergence (too high) or getting stuck in a local minimum (too low). A polynomial schedule is best paired with a warmup, where it will gradually approach the `learning_rate` value before slowing down and approaching `--lr_end` by the end.
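A LoRA run combining these options might look like the following sketch (the values are illustrative, not recommendations):

```bash
# Sketch: warm up for 1,000 steps, then decay polynomially toward --lr_end,
# with an EMA copy of the weights to smooth out the higher learning rate.
python train_sdxl.py \
  --learning_rate=1e-4 \
  --lr_scheduler=polynomial \
  --lr_warmup_steps=1000 \
  --lr_end=1e-8 \
  --use_ema
```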
- What (`--snr_gamma`): Utilise the min-SNR weighted loss factor.
- Why: Minimum SNR gamma weights the loss factor of a timestep by its position in the schedule. Overly noisy timesteps have their contributions reduced, and less-noisy timesteps have theirs increased. The value recommended by the original paper is 5, but you can use values as low as 1 or as high as 20, which is typically seen as the maximum value - beyond 20, the math does not change things much. A value of 1 is the strongest.
- What (`--use_soft_min_snr`): Train a model using a more gradual weighting on the loss landscape.
- Why: Pixel diffusion models simply degrade without a specific loss-weighting schedule. This is the case with DeepFloyd, where soft-min-snr-gamma was found to be essentially mandatory for good results. You may find success with latent diffusion model training, but in small experiments it was found to potentially produce blurry results.
- What (`--checkpointing_steps`): Interval at which training state checkpoints are saved.
- Why: Useful for resuming training and for inference. Every n iterations, a partial checkpoint will be saved in the `.safetensors` format, via the Diffusers filesystem layout.
- What (`--resume_from_checkpoint`): Specifies if and from where to resume training.
- Why: Allows you to continue training from a saved state, either manually specified or the latest available. A checkpoint is composed of a `unet` and, optionally, an `ema_unet`. The `unet` may be dropped into any Diffusers-layout SDXL model, allowing it to be used as a normal model would be.
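For example (a sketch):

```bash
# Sketch: checkpoint every 500 steps, keep at most 5 checkpoints on disk,
# and resume from the most recent one when the job restarts.
python train_sdxl.py \
  --checkpointing_steps=500 \
  --checkpoints_total_limit=5 \
  --resume_from_checkpoint=latest
```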
- What (`--logging_dir`): Directory for TensorBoard logs.
- Why: Allows you to monitor training progress and performance metrics.
- What (`--report_to`): Specifies the platform for reporting results and logs.
- Why: Enables integration with platforms like TensorBoard, wandb, or comet_ml for monitoring.
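For example (a sketch; the project and run names are illustrative):

```bash
# Sketch: report metrics to Weights & Biases under a named project/run.
python train_sdxl.py \
  --report_to=wandb \
  --tracker_project_name=sdxl-experiments \
  --tracker_run_name=lora-rank16
```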
This is a basic overview meant to help you get started. For a complete list of options and more detailed explanations, please refer to the full specification:
```
usage: train_sdxl.py [-h] [--snr_gamma SNR_GAMMA] [--use_soft_min_snr]
[--soft_min_snr_sigma_data SOFT_MIN_SNR_SIGMA_DATA]
[--model_type {full,lora,deepfloyd-full,deepfloyd-lora,deepfloyd-stage2,deepfloyd-stage2-lora}]
[--pixart_sigma] [--sd3] [--sd3_uses_diffusion]
[--weighting_scheme {sigma_sqrt,logit_normal,mode}]
[--logit_mean LOGIT_MEAN] [--logit_std LOGIT_STD]
[--mode_scale MODE_SCALE] [--lora_type {Standard}]
[--lora_init_type {default,gaussian,loftq}]
[--lora_rank LORA_RANK] [--lora_alpha LORA_ALPHA]
[--lora_dropout LORA_DROPOUT] [--controlnet]
[--controlnet_model_name_or_path]
--pretrained_model_name_or_path
PRETRAINED_MODEL_NAME_OR_PATH
[--pretrained_vae_model_name_or_path PRETRAINED_VAE_MODEL_NAME_OR_PATH]
[--pretrained_t5_model_name_or_path PRETRAINED_T5_MODEL_NAME_OR_PATH]
[--prediction_type {epsilon,v_prediction,sample}]
[--snr_weight SNR_WEIGHT]
[--training_scheduler_timestep_spacing {leading,linspace,trailing}]
[--inference_scheduler_timestep_spacing {leading,linspace,trailing}]
[--refiner_training] [--refiner_training_invert_schedule]
[--refiner_training_strength REFINER_TRAINING_STRENGTH]
[--timestep_bias_strategy {earlier,later,range,none}]
[--timestep_bias_multiplier TIMESTEP_BIAS_MULTIPLIER]
[--timestep_bias_begin TIMESTEP_BIAS_BEGIN]
[--timestep_bias_end TIMESTEP_BIAS_END]
[--timestep_bias_portion TIMESTEP_BIAS_PORTION]
[--disable_segmented_timestep_sampling]
[--rescale_betas_zero_snr]
[--vae_dtype {default,fp16,fp32,bf16}]
[--vae_batch_size VAE_BATCH_SIZE]
[--vae_cache_scan_behaviour {recreate,sync}]
[--vae_cache_preprocess]
[--aspect_bucket_disable_rebuild] [--keep_vae_loaded]
[--skip_file_discovery SKIP_FILE_DISCOVERY]
[--revision REVISION] [--variant VARIANT]
[--preserve_data_backend_cache] [--use_dora]
[--override_dataset_config]
[--cache_dir_text CACHE_DIR_TEXT]
[--cache_dir_vae CACHE_DIR_VAE] --data_backend_config
DATA_BACKEND_CONFIG [--write_batch_size WRITE_BATCH_SIZE]
[--enable_multiprocessing]
[--aspect_bucket_worker_count ASPECT_BUCKET_WORKER_COUNT]
[--cache_dir CACHE_DIR]
[--cache_clear_validation_prompts]
[--caption_strategy {filename,textfile,instance_prompt,parquet}]
[--parquet_caption_column PARQUET_CAPTION_COLUMN]
[--parquet_filename_column PARQUET_FILENAME_COLUMN]
[--instance_prompt INSTANCE_PROMPT]
[--output_dir OUTPUT_DIR] [--seed SEED]
[--seed_for_each_device SEED_FOR_EACH_DEVICE]
[--resolution RESOLUTION]
[--resolution_type {pixel,area}]
[--aspect_bucket_rounding {1,2,3,4,5,6,7,8,9}]
[--aspect_bucket_alignment {8,64}]
[--minimum_image_size MINIMUM_IMAGE_SIZE]
[--maximum_image_size MAXIMUM_IMAGE_SIZE]
[--target_downsample_size TARGET_DOWNSAMPLE_SIZE]
[--train_text_encoder]
[--tokenizer_max_length TOKENIZER_MAX_LENGTH]
[--train_batch_size TRAIN_BATCH_SIZE]
[--num_train_epochs NUM_TRAIN_EPOCHS]
[--max_train_steps MAX_TRAIN_STEPS]
[--checkpointing_steps CHECKPOINTING_STEPS]
[--checkpoints_total_limit CHECKPOINTS_TOTAL_LIMIT]
[--resume_from_checkpoint RESUME_FROM_CHECKPOINT]
[--gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS]
[--gradient_checkpointing]
[--learning_rate LEARNING_RATE]
[--text_encoder_lr TEXT_ENCODER_LR] [--lr_scale]
[--lr_scheduler {linear,sine,cosine,cosine_with_restarts,polynomial,constant,constant_with_warmup}]
[--lr_warmup_steps LR_WARMUP_STEPS]
[--lr_num_cycles LR_NUM_CYCLES] [--lr_power LR_POWER]
[--use_ema] [--ema_decay EMA_DECAY]
[--non_ema_revision NON_EMA_REVISION]
[--offload_param_path OFFLOAD_PARAM_PATH]
[--use_8bit_adam] [--use_adafactor_optimizer]
[--adafactor_relative_step ADAFACTOR_RELATIVE_STEP]
[--use_prodigy_optimizer] [--prodigy_beta3 PRODIGY_BETA3]
[--prodigy_decouple PRODIGY_DECOUPLE]
[--prodigy_use_bias_correction PRODIGY_USE_BIAS_CORRECTION]
[--prodigy_safeguard_warmup PRODIGY_SAFEGUARD_WARMUP]
[--prodigy_learning_rate PRODIGY_LEARNING_RATE]
[--prodigy_weight_decay PRODIGY_WEIGHT_DECAY]
[--prodigy_epsilon PRODIGY_EPSILON]
[--use_dadapt_optimizer]
[--dadaptation_learning_rate DADAPTATION_LEARNING_RATE]
[--adam_beta1 ADAM_BETA1] [--adam_beta2 ADAM_BETA2]
[--adam_weight_decay ADAM_WEIGHT_DECAY]
[--adam_epsilon ADAM_EPSILON] [--adam_bfloat16]
[--max_grad_norm MAX_GRAD_NORM] [--push_to_hub]
[--push_checkpoints_to_hub] [--hub_model_id HUB_MODEL_ID]
[--logging_dir LOGGING_DIR]
[--validation_torch_compile VALIDATION_TORCH_COMPILE]
[--validation_torch_compile_mode {max-autotune,reduce-overhead,default}]
[--allow_tf32] [--validation_using_datasets]
[--webhook_config WEBHOOK_CONFIG] [--report_to REPORT_TO]
[--tracker_run_name TRACKER_RUN_NAME]
[--tracker_project_name TRACKER_PROJECT_NAME]
[--validation_prompt VALIDATION_PROMPT]
[--validation_prompt_library]
[--user_prompt_library USER_PROMPT_LIBRARY]
[--validation_negative_prompt VALIDATION_NEGATIVE_PROMPT]
[--num_validation_images NUM_VALIDATION_IMAGES]
[--validation_steps VALIDATION_STEPS]
[--num_eval_images NUM_EVAL_IMAGES]
[--eval_dataset_id EVAL_DATASET_ID]
[--validation_num_inference_steps VALIDATION_NUM_INFERENCE_STEPS]
[--validation_resolution VALIDATION_RESOLUTION]
[--validation_noise_scheduler {ddim,ddpm,euler,euler-a,unipc}]
[--validation_disable_unconditional] [--disable_compel]
[--enable_watermark] [--mixed_precision {bf16,no}]
[--local_rank LOCAL_RANK]
[--enable_xformers_memory_efficient_attention]
[--set_grads_to_none] [--noise_offset NOISE_OFFSET]
[--noise_offset_probability NOISE_OFFSET_PROBABILITY]
[--validation_guidance VALIDATION_GUIDANCE]
[--validation_guidance_rescale VALIDATION_GUIDANCE_RESCALE]
[--validation_randomize]
[--validation_seed VALIDATION_SEED]
[--fully_unload_text_encoder]
[--freeze_encoder_before FREEZE_ENCODER_BEFORE]
[--freeze_encoder_after FREEZE_ENCODER_AFTER]
[--freeze_encoder_strategy FREEZE_ENCODER_STRATEGY]
[--freeze_unet_strategy {none,bitfit}]
[--unet_attention_slice] [--print_filenames]
[--print_sampler_statistics]
[--metadata_update_interval METADATA_UPDATE_INTERVAL]
[--debug_aspect_buckets] [--debug_dataset_loader]
[--freeze_encoder FREEZE_ENCODER] [--save_text_encoder]
[--text_encoder_limit TEXT_ENCODER_LIMIT]
[--prepend_instance_prompt] [--only_instance_prompt]
[--data_aesthetic_score DATA_AESTHETIC_SCORE]
[--sdxl_refiner_uses_full_range]
[--caption_dropout_probability CAPTION_DROPOUT_PROBABILITY]
[--delete_unwanted_images] [--delete_problematic_images]
[--offset_noise] [--lr_end LR_END]
[--i_know_what_i_am_doing]
```
The following SimpleTuner command-line options are available:
```
options:
-h, --help show this help message and exit
--snr_gamma SNR_GAMMA
SNR weighting gamma to be used if rebalancing the
loss. Recommended value is 5.0. More details here:
https://arxiv.org/abs/2303.09556.
--use_soft_min_snr If set, will use the soft min SNR calculation method.
This method uses the sigma_data parameter. If not
provided, the method will raise an error.
--soft_min_snr_sigma_data SOFT_MIN_SNR_SIGMA_DATA
The standard deviation of the data used in the soft
min weighting method. This is required when using the
soft min SNR calculation method.
--model_type {full,lora,deepfloyd-full,deepfloyd-lora,deepfloyd-stage2,deepfloyd-stage2-lora}
The training type to use. 'full' will train the full
model, while 'lora' will train the LoRA model. LoRA is
a smaller model that can be used for faster training.
--pixart_sigma This must be set when training a PixArt Sigma model.
--sd3 This option must be provided when training a Stable
Diffusion 3 model.
--sd3_uses_diffusion The rectified flow objective of stable diffusion 3
seems to hold few advantages, yet is very difficult to
train with. If this option is supplied, a normal DDPM-
based diffusion schedule will be used to train,
instead of flow-matching. This will take a lot of data
and even more compute to resolve. If possible, use a
pretrained SD3 Diffusion model.
--weighting_scheme {sigma_sqrt,logit_normal,mode}
Stable Diffusion 3 used either uniform sampling of
timesteps with post-prediction loss weighting, or a
weighted timestep selection by mode or log-normal
distribution. The default for SD3 is logit_normal,
though upstream Diffusers training examples use
sigma_sqrt. The mode option is experimental, as it is
the most difficult to implement cleanly. In short
experiments, logit_normal produced the best results.
--logit_mean LOGIT_MEAN
As outlined in the Stable Diffusion 3 paper, using a
logit_mean of -0.5 produced the highest quality FID
results. The default here is 0.0.
--logit_std LOGIT_STD
Stable Diffusion 3-specific training parameters.
--mode_scale MODE_SCALE
Stable Diffusion 3-specific training parameters.
--lora_type {Standard}
When training using --model_type=lora, you may specify
a different type of LoRA to train here. Currently,
only 'Standard' type is supported. This option exists
for compatibility with Kohya configuration files.
--lora_init_type {default,gaussian,loftq}
The initialization type for the LoRA model. 'default'
will use Microsoft's initialization method, 'gaussian'
will use a Gaussian scaled distribution, and 'loftq'
will use LoftQ initialization. In short experiments,
'default' produced accurate results earlier in
training, 'gaussian' had slightly more creative
outputs, and LoftQ produces an entirely different
result with worse quality at first, taking potentially
longer to converge than the other methods.
--lora_rank LORA_RANK
The dimension of the LoRA update matrices.
--lora_alpha LORA_ALPHA
The alpha value for the LoRA model. This is the
learning rate for the LoRA update matrices.
--lora_dropout LORA_DROPOUT
LoRA dropout randomly ignores neurons during training.
This can help prevent overfitting.
--controlnet If set, ControlNet style training will be used, where
a conditioning input image is required alongside the
training data.
--controlnet_model_name_or_path
When provided alongside --controlnet, this will
specify ControlNet model weights to preload from the
hub.
--pretrained_model_name_or_path PRETRAINED_MODEL_NAME_OR_PATH
Path to pretrained model or model identifier from
huggingface.co/models.
--pretrained_vae_model_name_or_path PRETRAINED_VAE_MODEL_NAME_OR_PATH
Path to an improved VAE to stabilize training. For
more details check out:
https://github.com/huggingface/diffusers/pull/4038.
--pretrained_t5_model_name_or_path PRETRAINED_T5_MODEL_NAME_OR_PATH
T5-XXL is a huge model, and starting from many
different models will download a separate one each
time. This option allows you to specify a specific
location to retrieve T5-XXL v1.1 from, so that it only
downloads once.
--prediction_type {epsilon,v_prediction,sample}
The type of prediction to use for the u-net. Choose
between ['epsilon', 'v_prediction', 'sample']. For SD
2.1-v, this is v_prediction. For 2.1-base, it is
epsilon. SDXL is generally epsilon. SD 1.5 is epsilon.
--snr_weight SNR_WEIGHT
When training a model using
`--prediction_type=sample`, one can supply an SNR
weight value to augment the loss with. If a value of
0.5 is provided here, the loss is taken half from the
SNR and half from the MSE.
--training_scheduler_timestep_spacing {leading,linspace,trailing}
(SDXL Only) Spacing timesteps can fundamentally alter
the course of history. Er, I mean, your model weights.
For all training, including epsilon, it would seem
that 'trailing' is the right choice. SD 2.x always
uses 'trailing', but SDXL may do better in its default
state when using 'leading'.
--inference_scheduler_timestep_spacing {leading,linspace,trailing}
(SDXL Only) The Bytedance paper on zero terminal SNR
recommends inference using 'trailing'. SD 2.x always
uses 'trailing', but SDXL may do better in its default
state when using 'leading'.
--refiner_training When training or adapting a model into a mixture-of-
experts 2nd stage / refiner model, this option should
be set. This will slice the timestep schedule defined
by --refiner_training_strength proportion value
(default 0.2)
--refiner_training_invert_schedule
While the refiner training strength is applied to the
end of the schedule, this option will invert the
result for training a **base** model, eg. the first
model in a mixture-of-experts series. A
--refiner_training_strength of 0.35 will result in the
refiner learning timesteps 349-0. Setting
--refiner_training_invert_schedule then would result
in the base model learning timesteps 999-350.
--refiner_training_strength REFINER_TRAINING_STRENGTH
When training a refiner / 2nd stage mixture of experts
model, the refiner training strength indicates how
much of the *end* of the schedule it will be trained
on. A value of 0.2 means timesteps 199-0 will be the
focus of this model, and 0.3 would be 299-0 and so on.
The default value is 0.2, in line with the SDXL
refiner pretraining.
--timestep_bias_strategy {earlier,later,range,none}
The timestep bias strategy, which may help direct the
model toward learning low or high frequency details.
Choices: ['earlier', 'later', 'range', 'none']. The default is
'none', which means no bias is applied, and training
proceeds normally. The value of 'later' will prefer to
generate samples for later timesteps.
--timestep_bias_multiplier TIMESTEP_BIAS_MULTIPLIER
The multiplier for the bias. Defaults to 1.0, which
means no bias is applied. A value of 2.0 will double
the weight of the bias, and a value of 0.5 will halve
it.
--timestep_bias_begin TIMESTEP_BIAS_BEGIN
When using `--timestep_bias_strategy=range`, the
beginning timestep to bias. Defaults to zero, which
equates to having no specific bias.
--timestep_bias_end TIMESTEP_BIAS_END
When using `--timestep_bias_strategy=range`, the final
timestep to bias. Defaults to 1000, which is the
number of timesteps that SDXL Base and SD 2.x were
trained on.
--timestep_bias_portion TIMESTEP_BIAS_PORTION
The portion of timesteps to bias. Defaults to 0.25,
meaning 25 percent of timesteps will be biased. A value
of 0.5 will bias one half of the timesteps. The value
provided for `--timestep_bias_strategy` determines
whether the biased portions are in the earlier or
later timesteps.
--disable_segmented_timestep_sampling
By default, the timestep schedule is divided into
roughly `train_batch_size` number of segments, and
then each of those are sampled from separately. This
improves the selection distribution, but may not be
desired in certain training scenarios, eg. when
limiting the timestep selection range.
--rescale_betas_zero_snr
If set, will rescale the betas to zero terminal SNR.
This is recommended for training with v_prediction.
For epsilon, this might help with fine details, but
will not result in contrast improvements.
--vae_dtype {default,fp16,fp32,bf16}
The dtype of the VAE model. Choose between ['default',
'fp16', 'fp32', 'bf16']. The default VAE dtype is
bfloat16, due to NaN issues in SDXL 1.0. Using fp16 is
not recommended.
--vae_batch_size VAE_BATCH_SIZE
When pre-caching latent vectors, this is the batch
size to use. Decreasing this may help with VRAM
issues, but if you are at that point of contention,
it's possible that your GPU has too little RAM.
Default: 4.
--vae_cache_scan_behaviour {recreate,sync}
When a mismatched latent vector is detected, a scan
will be initiated to locate inconsistencies and
resolve them. The default setting 'recreate' will
delete any inconsistent cache entries and rebuild them.
Alternatively, 'sync' will update the bucket
configuration so that the image is in a bucket that
matches its latent size. The recommended behaviour is
to use the default value and allow the cache to be
recreated.
--vae_cache_preprocess
By default, will encode images during training. For
some situations, pre-processing may be desired. To
revert to the old behaviour, supply
--vae_cache_preprocess=false.
--aspect_bucket_disable_rebuild
When using a randomised aspect bucket list, the VAE
and aspect cache are rebuilt on each epoch. With a
large and diverse enough dataset, rebuilding the
aspect list may take a long time, and this may be
undesirable. This option will not override
vae_cache_clear_each_epoch. If both options are
provided, only the VAE cache will be rebuilt.
--keep_vae_loaded If set, will keep the VAE loaded in memory. This can
reduce disk churn, but consumes VRAM during the
forward pass.
--skip_file_discovery SKIP_FILE_DISCOVERY
Comma-separated values of which stages to skip
discovery for. Skipping any stage will speed up
resumption, but will increase the risk of errors, as
missing images or incorrectly bucketed images may not
be caught. 'vae' will skip the VAE cache process,
'aspect' will not build any aspect buckets, and 'text'
will avoid text embed management. Valid options:
aspect, vae, text, metadata.
--revision REVISION Revision of pretrained model identifier from
huggingface.co/models. Trainable model components
should be at least bfloat16 precision.
--variant VARIANT Variant of pretrained model identifier from
huggingface.co/models. Trainable model components
should be at least bfloat16 precision.
--preserve_data_backend_cache
For very large cloud storage buckets that will never
change, enabling this option will prevent the trainer
from scanning it at startup, by preserving the cache
files that we generate. Be careful when using this,
as switching datasets can result in the preserved
cache being used, which would be problematic.
Currently, cache is not stored in the dataset itself
but rather, locally. This may change in a future
release.
--use_dora If set, will use the DoRA-enhanced LoRA training. This
is an experimental feature, may slow down training,
and is not recommended for general use.
--override_dataset_config
When provided, the dataset's config will not be
checked against the live backend config. This is
useful if you want to simply update the behaviour of
an existing dataset, but the recommendation is to not
change the dataset configuration after caching has
begun, as most options cannot be changed without
unexpected behaviour later on. Additionally, it
prevents accidentally loading an SDXL configuration on
a SD 2.x model and vice versa.
--cache_dir_text CACHE_DIR_TEXT
This is the path to a local directory that will
contain your text embed cache.
--cache_dir_vae CACHE_DIR_VAE
This is the path to a local directory that will
contain your VAE outputs. Unlike the text embed cache,
your VAE latents will be stored in the AWS data
backend. Each backend can have its own value, but if
that is not provided, this will be the default value.
--data_backend_config DATA_BACKEND_CONFIG
The relative or fully-qualified path for your data
backend config. See multidatabackend.json.example for
an example.
--write_batch_size WRITE_BATCH_SIZE
When using certain storage backends, it is better to
batch smaller writes rather than continuous
dispatching. In SimpleTuner, write batching is
currently applied during VAE caching, when many small
objects are written. This mostly applies to S3, but
some shared server filesystems may benefit as well,
eg. Ceph. Default: 64.
--enable_multiprocessing
If set, will use processes instead of threads during
metadata caching operations. For some systems,
multiprocessing may be faster than threading, but will
consume a lot more memory. Use this option with
caution, and monitor your system's memory usage.
--aspect_bucket_worker_count ASPECT_BUCKET_WORKER_COUNT
The number of workers to use for aspect bucketing.
This is a CPU-bound task, so the number of workers
should be set to the number of CPU threads available.
If you use an I/O bound backend, an even higher value
may make sense. Default: 12.
--cache_dir CACHE_DIR
The directory where the downloaded models and datasets
will be stored.
--cache_clear_validation_prompts
When provided, any validation prompt entries in the
text embed cache will be recreated. This is useful if
you've modified any of the existing prompts, or,
disabled/enabled Compel, via `--disable_compel`
--caption_strategy {filename,textfile,instance_prompt,parquet}
The default captioning strategy, 'filename', will use
the filename as the caption, after stripping some
characters like underscores. The 'textfile' strategy
will use the contents of a text file with the same
name as the image. The 'parquet' strategy requires a
parquet file with the same name as the image,
containing a 'caption' column.
--parquet_caption_column PARQUET_CAPTION_COLUMN
When using caption_strategy=parquet, this option will
allow you to globally set the default caption field
across all datasets that do not have an override set.
--parquet_filename_column PARQUET_FILENAME_COLUMN
When using caption_strategy=parquet, this option will
allow you to globally set the default filename field
across all datasets that do not have an override set.
--instance_prompt INSTANCE_PROMPT
This is unused. Filenames will be the captions
instead.
--output_dir OUTPUT_DIR
The output directory where the model predictions and
checkpoints will be written.
--seed SEED A seed for reproducible training.
--seed_for_each_device SEED_FOR_EACH_DEVICE
By default, a unique seed will be used for each GPU.
This is done deterministically, so that each GPU will
receive the same seed across invocations. If
--seed_for_each_device=false is provided, then we will
use the same seed across all GPUs, which will almost
certainly result in the over-sampling of inputs on
larger datasets.
--resolution RESOLUTION
The resolution for input images, all the images in the
train/validation dataset will be resized to this
resolution. If using --resolution_type=area, this
float value represents megapixels.
--resolution_type {pixel,area}
Resizing images maintains aspect ratio. This defines
the resizing strategy. If 'pixel', the images will be
resized to the resolution by pixel edge. If 'area',
the images will be resized so the pixel area is this
many megapixels.
--aspect_bucket_rounding {1,2,3,4,5,6,7,8,9}
The number of decimal places to round the aspect ratio
to. This is used to create buckets for aspect ratios.
For higher precision, ensure the image sizes remain
compatible. Higher precision levels result in a
greater number of buckets, which may not be a
desirable outcome.
--aspect_bucket_alignment {8,64}
When training diffusion models, the image sizes
generally must align to a 64 pixel interval. This is
an exception when training models like DeepFloyd that
use a base resolution of 64 pixels, as aligning to 64
pixels would result in a 1:1 or 2:1 aspect ratio,
overly distorting images. For DeepFloyd, this value is
set to 8, but all other training defaults to 64. You
may experiment with this value, but it is not
recommended.
--minimum_image_size MINIMUM_IMAGE_SIZE
The minimum resolution for both sides of input images.
If --delete_unwanted_images is set, images smaller
than this will be DELETED. The default value is None,
which means no minimum resolution is enforced. If this
option is not provided, it is possible that images
will be destructively upsampled, harming model
performance.
--maximum_image_size MAXIMUM_IMAGE_SIZE
When cropping images that are excessively large, the
entire scene context may be lost, eg. the crop might
just end up being a portion of the background. To
avoid this, a maximum image size may be provided,
which will result in very-large images being
downsampled before cropping them. This value uses
--resolution_type to determine whether it is a pixel
edge or megapixel value.
--target_downsample_size TARGET_DOWNSAMPLE_SIZE
When using --maximum_image_size, very-large images
exceeding that value will be downsampled to this
target size before cropping. If --resolution_type=area
and --maximum_image_size=4.0,
--target_downsample_size=2.0 would result in a 4
megapixel image being resized to 2 megapixel before
cropping to 1 megapixel.
--train_text_encoder (SD 2.x only) Whether to train the text encoder. If
set, the text encoder should be float32 precision.
--tokenizer_max_length TOKENIZER_MAX_LENGTH
The maximum length of the tokenizer. If not set, will
default to the tokenizer's max length.
--train_batch_size TRAIN_BATCH_SIZE
Batch size (per device) for the training dataloader.
--num_train_epochs NUM_TRAIN_EPOCHS
--max_train_steps MAX_TRAIN_STEPS
Total number of training steps to perform. If
provided, overrides num_train_epochs.
--checkpointing_steps CHECKPOINTING_STEPS
Save a checkpoint of the training state every X
updates. Checkpoints can be used for resuming training
via `--resume_from_checkpoint`. In the case that the
checkpoint is better than the final trained model, the
checkpoint can also be used for inference. Using a
checkpoint for inference requires separate loading of
the original pipeline and the individual checkpointed
model components. See https://huggingface.co/docs/diffusers/main/en/training/dreambooth#performing-inference-using-a-saved-checkpoint
for step by step instructions.
--checkpoints_total_limit CHECKPOINTS_TOTAL_LIMIT
Max number of checkpoints to store.
--resume_from_checkpoint RESUME_FROM_CHECKPOINT
Whether training should be resumed from a previous
checkpoint. Use a path saved by
`--checkpointing_steps`, or `"latest"` to
automatically select the last available checkpoint.
--gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS
Number of updates steps to accumulate before
performing a backward/update pass.
--gradient_checkpointing
Whether or not to use gradient checkpointing to save
memory at the expense of slower backward pass.
--learning_rate LEARNING_RATE
Initial learning rate (after the potential warmup
period) to use. When using a cosine or sine schedule,
--learning_rate defines the maximum learning rate.
--text_encoder_lr TEXT_ENCODER_LR
Learning rate for the text encoder. If not provided,
the value of --learning_rate will be used.
--lr_scale Scale the learning rate by the number of GPUs,
gradient accumulation steps, and batch size.
--lr_scheduler {linear,sine,cosine,cosine_with_restarts,polynomial,constant,constant_with_warmup}
The scheduler type to use. Default: sine
--lr_warmup_steps LR_WARMUP_STEPS
Number of steps for the warmup in the lr scheduler.
--lr_num_cycles LR_NUM_CYCLES
Number of hard resets of the lr in
cosine_with_restarts scheduler.
--lr_power LR_POWER Power factor of the polynomial scheduler.
--use_ema Whether to use EMA (exponential moving average) model.
--ema_decay EMA_DECAY
The closer to 0.9999 this gets, the fewer updates will
occur over time. Setting it to a lower value, such as
0.990, will allow greater influence of later updates.
--non_ema_revision NON_EMA_REVISION
Revision of pretrained non-ema model identifier. Must
be a branch, tag or git identifier of the local or
remote repository specified with
--pretrained_model_name_or_path.
--offload_param_path OFFLOAD_PARAM_PATH
When using DeepSpeed ZeRO stage 2 or 3 with NVMe
offload, this may be specified to provide a path for
the offload.
--use_8bit_adam Whether or not to use 8-bit Adam from bitsandbytes.
--use_adafactor_optimizer
Whether or not to use the Adafactor optimizer.
--adafactor_relative_step ADAFACTOR_RELATIVE_STEP
When set, will use the experimental Adafactor mode for
relative step computations instead of the value set by
--learning_rate. This is an experimental feature, and
you are on your own for support.
--use_prodigy_optimizer
Whether or not to use the Prodigy optimizer.
--prodigy_beta3 PRODIGY_BETA3
Coefficients for computing the Prodigy stepsize using
running averages. If set to None, uses the value of
square root of beta2. Ignored if optimizer is adamW
--prodigy_decouple PRODIGY_DECOUPLE
Use AdamW style decoupled weight decay
--prodigy_use_bias_correction PRODIGY_USE_BIAS_CORRECTION
Turn on Adam's bias correction. True by default.
Ignored if optimizer is adamW
--prodigy_safeguard_warmup PRODIGY_SAFEGUARD_WARMUP
Remove lr from the denominator of D estimate to avoid
issues during warm-up stage. True by default. Ignored
if optimizer is adamW
--prodigy_learning_rate PRODIGY_LEARNING_RATE
Though this is called the prodigy learning rate, it
corresponds to the d_coef parameter in the Prodigy
optimizer. This acts as a coefficient in the
expression for the estimate of d. Default for this
trainer is 0.5, but the Prodigy default is 1.0, which
ends up over-cooking models.
--prodigy_weight_decay PRODIGY_WEIGHT_DECAY
Weight decay to use. Prodigy default is 0, but
SimpleTuner uses 1e-2.
--prodigy_epsilon PRODIGY_EPSILON
Epsilon value for the Adam optimizer
--use_dadapt_optimizer
Whether or not to use the D-Adaptation optimizer.
--dadaptation_learning_rate DADAPTATION_LEARNING_RATE
Learning rate for D-Adaptation.
Default: 1.0
--adam_beta1 ADAM_BETA1
The beta1 parameter for the Adam and other optimizers.
--adam_beta2 ADAM_BETA2
The beta2 parameter for the Adam and other optimizers.
--adam_weight_decay ADAM_WEIGHT_DECAY
Weight decay to use.
--adam_epsilon ADAM_EPSILON
Epsilon value for the Adam optimizer
--adam_bfloat16 Whether or not to use stochastic bf16 in Adam.
Currently the only supported optimizer.
--max_grad_norm MAX_GRAD_NORM
Clipping the max gradient norm can help prevent
exploding gradients, but may also harm training by
introducing artifacts or making it hard to train
artifacts away.
--push_to_hub Whether or not to push the model to the Hub.
--push_checkpoints_to_hub
When set along with --push_to_hub, all intermediary
checkpoints will be pushed to the hub as if they were
a final checkpoint.
--hub_model_id HUB_MODEL_ID
The name of the repository to keep in sync with the
local `output_dir`.
--logging_dir LOGGING_DIR
[TensorBoard](https://www.tensorflow.org/tensorboard)
log directory. Will default to
*output_dir/runs/**CURRENT_DATETIME_HOSTNAME***.
--validation_torch_compile VALIDATION_TORCH_COMPILE
Supply `--validation_torch_compile=true` to enable the
use of torch.compile() on the validation pipeline. For
some setups, torch.compile() may error out. This is
dependent on PyTorch version, phase of the moon, but
if it works, you should leave it enabled for a great
speed-up.
--validation_torch_compile_mode {max-autotune,reduce-overhead,default}
PyTorch provides different modes for the Torch
Inductor when compiling graphs. max-autotune, the
default mode, provides the most benefit.
--allow_tf32 Whether or not to allow TF32 on Ampere GPUs. Can be
used to speed up training. For more information, see
https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices
--validation_using_datasets
When set, validation will use images sampled randomly
from each dataset for validation. Be mindful of
privacy issues when publishing training data to the
internet.
--webhook_config WEBHOOK_CONFIG
The path to the webhook configuration file. This file
should be a JSON file with the following format:
{"url": "https://your.webhook.url", "webhook_type":
"discord"}}
--report_to REPORT_TO
The integration to report the results and logs to.
Supported platforms are `"tensorboard"` (default),
`"wandb"` and `"comet_ml"`. Use `"all"` to report to
all integrations.
--tracker_run_name TRACKER_RUN_NAME
The name of the run to track with the tracker.
--tracker_project_name TRACKER_PROJECT_NAME
The name of the project for WandB or Tensorboard.
--validation_prompt VALIDATION_PROMPT
A prompt that is used during validation to verify that
the model is learning.
--validation_prompt_library
If this is provided, the SimpleTuner prompt library
will be used to generate multiple images.
--user_prompt_library USER_PROMPT_LIBRARY
This should be a path to the JSON file containing your
prompt library. See user_prompt_library.json.example.
--validation_negative_prompt VALIDATION_NEGATIVE_PROMPT
When validating images, a negative prompt may be used
to guide the model away from certain features. When
this value is set to --validation_negative_prompt='',
no negative guidance will be applied. Default: blurry,
cropped, ugly
--num_validation_images NUM_VALIDATION_IMAGES
Number of images that should be generated during
validation with `validation_prompt`.
--validation_steps VALIDATION_STEPS
Run validation every X steps. Validation consists of
running the prompt `args.validation_prompt` multiple
times (`args.num_validation_images`) and logging the
images.
--num_eval_images NUM_EVAL_IMAGES
If possible, this many eval images will be selected
from each dataset. This is used when training super-
resolution models such as DeepFloyd Stage II, which
will upscale input images from the training set.
--eval_dataset_id EVAL_DATASET_ID
When provided, only this dataset's images will be used
as the eval set, to keep the training and eval images
split.
--validation_num_inference_steps VALIDATION_NUM_INFERENCE_STEPS
The default scheduler, DDIM, benefits from more steps.
UniPC can do well with just 10-15. For more speed
during validations, reduce this value. For better
quality, increase it. For model distillation, you will
likely want to keep this low.
--validation_resolution VALIDATION_RESOLUTION
Square resolution images will be output at this
resolution (256x256).
--validation_noise_scheduler {ddim,ddpm,euler,euler-a,unipc}
When validating the model at inference time, a
different scheduler may be chosen. UniPC can offer
better speed, and Euler A can put up with
instabilities a bit better. For zero-terminal SNR
models, DDIM is the best choice. Choices: ['ddim',
'ddpm', 'euler', 'euler-a', 'unipc'], Default: ddim
--validation_disable_unconditional
When set, the validation pipeline will not generate
unconditional samples. This is useful to speed up
validations with a single prompt on slower systems, or
if you are not interested in unconditional space
generations.
--disable_compel If provided, validation pipeline prompts will be
handled using the typical prompt encoding strategy.
Otherwise, the default behaviour is to use Compel for
prompt embed generation. Note that the training input
text embeds are not generated using Compel, and will
be truncated to 77 tokens.
--enable_watermark The SDXL 0.9 and 1.0 licenses both require a watermark
be used to identify any images created to be shared.
Since the images created during validation typically
are not shared, and we want the most accurate results,
this watermarker is disabled by default. If you are
sharing the validation images, it is up to you to
ensure that you are complying with the license,
whether that is through this watermarker, or another.
--mixed_precision {bf16,no}
SimpleTuner only supports bf16 training. Bf16 requires
PyTorch >= 1.10 on an NVIDIA Ampere or later GPU, and
PyTorch 2.3 or newer for Apple Silicon. Defaults to the
value of the accelerate config of the current system or
the flag passed with the `accelerate.launch` command.
Use this argument to override the accelerate config.
--local_rank LOCAL_RANK
For distributed training: local_rank
--enable_xformers_memory_efficient_attention
Whether or not to use xformers.
--set_grads_to_none Save more memory by using setting grads to None
instead of zero. Be aware, that this changes certain
behaviors, so disable this argument if it causes any
problems. More info:
https://pytorch.org/docs/stable/generated/torch.optim.Optimizer.zero_grad.html
--noise_offset NOISE_OFFSET
The scale of noise offset. Default: 0.1
--noise_offset_probability NOISE_OFFSET_PROBABILITY
When training with --offset_noise, the value of
--noise_offset will only be applied probabilistically.
The default behaviour is for offset noise (if enabled)
to be applied 25 percent of the time.
--validation_guidance VALIDATION_GUIDANCE
CFG value for validation images. Default: 7.5
--validation_guidance_rescale VALIDATION_GUIDANCE_RESCALE
CFG rescale value for validation images. Default: 0.0,
max 1.0
--validation_randomize
If supplied, validations will be random, ignoring any
seeds.
--validation_seed VALIDATION_SEED
If not supplied, the value for --seed will be used. If
neither those nor --validation_randomize are supplied,
a seed of zero is used.
--fully_unload_text_encoder
If set, will fully unload the text_encoder from memory
when not in use. This currently has the side effect of
crashing validations, but it is useful for initiating
VAE caching on GPUs that would otherwise be too small.
--freeze_encoder_before FREEZE_ENCODER_BEFORE
When using 'before' strategy, we will freeze layers
earlier than this.
--freeze_encoder_after FREEZE_ENCODER_AFTER
When using 'after' strategy, we will freeze layers
later than this.
--freeze_encoder_strategy FREEZE_ENCODER_STRATEGY
When freezing the text_encoder, we can use the
'before', 'between', or 'after' strategy. The
'between' strategy will freeze layers between those
two values, leaving the outer layers unfrozen. The
default strategy is to freeze all layers from 17 up.
This can be helpful when fine-tuning Stable Diffusion
2.1 on a new style.
--freeze_unet_strategy {none,bitfit}
When freezing the UNet, we can use the 'none' or
'bitfit' strategy. The 'bitfit' strategy will freeze
all weights, and leave bias thawed. The default
strategy is to leave the full u-net thawed. Freezing
the weights can improve convergence for finetuning.
--unet_attention_slice
If set, will use attention slicing for the SDXL UNet.
This is an experimental feature and is not recommended
for general use. SD 2.x makes use of attention slicing
on the Apple MPS platform to avoid an NDArray size crash,
but SDXL does not seem to require attention slicing on
MPS. If memory constrained, try enabling it anyway.
--print_filenames If any image files are stopping the process eg. due to
corruption or truncation, this will help identify
which is at fault.
--print_sampler_statistics
If provided, will print statistics about the dataset
sampler. This is useful for debugging. The default
behaviour is to not print sampler statistics.
--metadata_update_interval METADATA_UPDATE_INTERVAL
When generating the aspect bucket indices, we want to
save it every X seconds. The default is to save it
every 1 hour, such that progress is not lost on
clusters where runtime is limited to 6-hour increments
(e.g. the JUWELS Supercomputer). The minimum value is
60 seconds.
--debug_aspect_buckets
If set, will print excessive debugging for aspect
bucket operations.
--debug_dataset_loader
If set, will print excessive debugging for data loader
operations.
--freeze_encoder FREEZE_ENCODER
Whether or not to freeze the text_encoder. The default
is true.
--save_text_encoder If set, will save the text_encoder after training.
This is useful if you're using --push_to_hub so that
the final pipeline contains all necessary components
to run.
--text_encoder_limit TEXT_ENCODER_LIMIT
When training the text_encoder, we want to limit how
long it trains for to avoid catastrophic loss.
--prepend_instance_prompt
When determining the captions from the filename,
prepend the instance prompt as an enforced keyword.
--only_instance_prompt
Use the instance prompt instead of the caption from
filename.
--data_aesthetic_score DATA_AESTHETIC_SCORE
Since currently we do not calculate aesthetic scores
for data, we will statically set it to one value. This
is only used by the SDXL Refiner.
--sdxl_refiner_uses_full_range
If set, the SDXL Refiner will use the full range of
the model, rather than the design value of 20 percent.
This is useful for training models that will be used
for inference from end-to-end of the noise schedule.
You may use this for example, to turn the SDXL refiner
into a full text-to-image model.
--caption_dropout_probability CAPTION_DROPOUT_PROBABILITY
Caption dropout will randomly drop captions and, for
SDXL, size conditioning inputs based on this
probability. When set to a value of 0.1, it will drop
approximately 10 percent of the inputs. Maximum
recommended value is probably less than 0.5, or 50
percent of the inputs. Maximum technical value is 1.0.
The default is to use zero caption dropout, though for
better generalisation, a value of 0.1 is recommended.
--delete_unwanted_images
If set, will delete images that are not of a minimum
size to save on disk space for large training runs.
Default behaviour: Unset, remove images from bucket
only.
--delete_problematic_images
If set, any images that error out during load will be
removed from the underlying storage medium. This is
useful to prevent repeatedly attempting to cache bad
files on a cloud bucket.
--offset_noise Fine-tuning against a modified noise. See
https://www.crosslabs.org//blog/diffusion-with-offset-noise
for more information.
--lr_end LR_END A polynomial learning rate will end up at this value
after the specified number of warmup steps. A sine or
cosine wave will use this value as its lower bound for
the learning rate.
--i_know_what_i_am_doing
If you are using an optimizer other than AdamW, you
must set this flag to continue. This is a safety
feature to prevent accidental use of an unsupported
optimizer, as weights are stored in bfloat16.
```