Releases: bghira/SimpleTuner
v1.2.2
Features
Sana
Training Sana is now supported and requires very few config changes.

To set up a multi-training environment, create a config subdirectory per environment:

```bash
mkdir config/environment_name
```

where `environment_name` may be something like the model name or the concept you are working on.

- Example:

  ```bash
  mkdir config/flux
  ```

- Move all of your current configurations into the new environment:

  ```bash
  mv config/*.json config/flux
  ```

- Run `configure.py` to create new configs for Sana:

  ```bash
  mkdir config/sana
  mv config/*.json config/sana
  ```

When launching you can now use:

```bash
ENV=sana ./train.sh
# or
ENV=flux ./train.sh
```

Note: you'll have to adjust the paths to `multidatabackend.json` and other config files inside the nested `config.json` files so they point to their new location, e.g. `config/flux/multidatabackend.json`.
Gradient clipping by max value
When using `--max_grad_norm`, the previous behaviour was to scale the entire gradient vector such that its norm maxed out at the given value. The new behaviour is to clip individual values within the gradient to avoid outliers. This can be swapped back with `--grad_clip_method=norm`.

This was found to stabilise training across a range of batch sizes, and noticeably enabled more learning to occur with fewer disasters.
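For reference, a minimal PyTorch sketch of the two clipping modes; this is not SimpleTuner's internal code, and `max_grad_norm` here simply stands in for the configured value:

```python
import torch

model = torch.nn.Linear(8, 8)
loss = model(torch.randn(4, 8)).pow(2).mean()
loss.backward()

max_grad_norm = 1.0

# --grad_clip_method=value (new default): clamp each gradient element to [-max, +max]
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=max_grad_norm)

# --grad_clip_method=norm (previous behaviour): rescale the whole gradient vector
# so that its global norm does not exceed max_grad_norm
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_grad_norm)
```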
Stable Diffusion 3.5 fixes
The eternal problem child SD3.5 has received some training parameter fixes that make it worth attempting training again.

The T5 text encoder was previously claimed by StabilityAI to use a sequence length of 256, but is now understood to have actually used a sequence length of 154. Updating this results in more likeness being trained into the model with less degradation.
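As a hedged sketch of what the change amounts to when encoding prompts (the checkpoint name and call shape are illustrative, not SimpleTuner's exact code):

```python
from transformers import T5TokenizerFast

# illustrative tokenizer; SD3.5 ships its own T5-XXL tokenizer alongside the CLIP encoders
tokenizer = T5TokenizerFast.from_pretrained("google/t5-v1_1-xxl")

prompt = "a photo of a cat"
# previously assumed sequence length
old = tokenizer(prompt, max_length=256, padding="max_length", truncation=True, return_tensors="pt")
# corrected sequence length for SD3.5 training
new = tokenizer(prompt, max_length=154, padding="max_length", truncation=True, return_tensors="pt")
print(old.input_ids.shape, new.input_ids.shape)  # torch.Size([1, 256]) torch.Size([1, 154])
```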
Some checkpoints are available here, and the EMA model weights here are a noticeably better starting point for use with `--init_lora`. Note that this is a Lycoris adapter, not a PEFT LoRA; you may have to adjust your configuration to use `lora_type=lycoris` and `--init_lora=path/to/the/ema_model.safetensors`.
SD3.5 also now supports `--gradient_checkpointing_interval`, which allows trading additional VRAM for faster training by checkpointing fewer blocks.
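A rough sketch of the idea behind interval checkpointing, assuming a plain list of transformer blocks (SimpleTuner's actual wiring differs):

```python
import torch
from torch.utils.checkpoint import checkpoint

def run_blocks(blocks, hidden_states, interval: int = 2):
    # Checkpoint only every `interval`-th block; the remaining blocks keep their
    # activations in VRAM, so less work is recomputed during the backward pass.
    for i, block in enumerate(blocks):
        if interval > 0 and i % interval == 0:
            hidden_states = checkpoint(block, hidden_states, use_reentrant=False)
        else:
            hidden_states = block(hidden_states)
    return hidden_states
```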
DeepSpeed
Stage 3 offload has some experimental fixes which allow running the text and image encoders without sharding them.
Pull requests
- support Sana training by @bghira in #1187
- update sana toc link by @bghira in #1188
- update sd3 seqlen to 154 max for t5 by @bghira in #1190
- chore; log cleanup by @bghira in #1192
- add --grad_clip_method to allow different forms of max_grad_norm clipping by @bghira in #1205
- max_grad_norm value limit removal for sd3 by @bghira in #1207
- local backend: use atomicwrites library to resolve rename errors and parallel overwrites by @bghira in #1206
- apple: update quanto dependency to upstream repository by @bghira in #1208
- switch clip method to "value" by default by @bghira in #1210
- add vae in example by @MrTuanDao in #1212
- sana: use bf16 weights and update class names to latest PR by @bghira in #1213
- configurator should avoid asking about checkpointing intervals when the model family does not support it by @bghira in #1214
- vaecache: sana should grab .latent object by @bghira in #1215
- safety_check: Fix gradient checkpointing interval error message by @clayne in #1221
- sana: add complex human instruction to user prompts by default (untested) by @bghira in #1216
- flux: use rank 0 for h100 detection since that is the most realistic setup by @bghira in #1225
- diffusers: bump to main branch instead of Sana branch by @bghira in #1226
- torchao: bump version to 0.7.0 by @bghira in #1224
- deepspeed from 0.15 to 0.16.1 by @bghira in #1227
- accelerate: from v0.34 to v1.2 by @bghira in #1228
- more dependency updates by @bghira in #1229
- sd3: allow setting grad checkpointing interval by @bghira in #1230
- merge by @bghira in #1232
- remove sana complex human instruction from tensorboard args (#1234) by @bghira in #1235
- merge by @bghira in #1242
- deepspeed stage 3 needs validations disabled thoroughly by @bghira in #1243
- merge by @bghira in #1244
New Contributors
- @MrTuanDao made their first contribution in #1212
- @clayne made their first contribution in #1221
Full Changelog: v1.2.1...v1.2.2
v1.2.1 - free lunch edition
Features
This release will speed up all validations without any config changes.
- SageAttention (NVIDIA-only; must be installed manually for now)
  - By default, it only speeds up inference; SDXL benefits more than Flux due to differences in their respective bottlenecks.
  - Use `--attention_mechanism=sageattention` to enable this, and `--sageattention_usage=training+inference` to enable it for training as well as validations. The latter will probably make your model worse or collapse, though. (A sketch of the underlying idea follows this list.)
- Optimised `--gradient_checkpointing` implementation
  - No longer applies during validations, so even without SageAttention we get a speedup (on a 4090 + 5800X3D) from 29 seconds for a Flux image down to 15 seconds (SDXL goes from 15 seconds to 6 seconds).
- Added `--gradient_checkpointing_interval`, which you can use to speed up Flux training at the cost of some additional VRAM.
  - Makes NF4 even more attractive for a 4090, where you can then use the SOAP optimiser in a meaningful way.
  - See the options guide for more information.
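For context, a minimal sketch of what an inference-only attention swap amounts to, assuming the `sageattention` package exposes its drop-in `sageattn(q, k, v)` kernel; this is not SimpleTuner's actual integration:

```python
import torch
import torch.nn.functional as F

try:
    from sageattention import sageattn  # assumed drop-in replacement for SDPA
except ImportError:
    sageattn = None

_original_sdpa = F.scaled_dot_product_attention

def _patched_sdpa(query, key, value, *args, **kwargs):
    # Route attention through SageAttention only when gradients are not required,
    # i.e. during validation/inference passes (attention masks ignored for brevity).
    if sageattn is not None and not torch.is_grad_enabled():
        return sageattn(query, key, value)
    return _original_sdpa(query, key, value, *args, **kwargs)

F.scaled_dot_product_attention = _patched_sdpa
```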
What's Changed
- Add SageAttention for substantial training speed-up by @bghira in #1182
- SageAttention: make it inference-only by default by @bghira in #1183
- gradient checkpointing speed-up by @bghira in #1184
- add gradient checkpointing option to docs by @bghira in #1185
- merge by @bghira in #1186
Full Changelog: v1.2...v1.2.1
v1.2 - EMA for LoRA/Lycoris training
Features
- EMA is reworked. Previous training runs using EMA should not update to this release. Your checkpoints will not load the EMA weights correctly.
- EMA now works fully for PEFT Standard LoRA and Lycoris adapters (tested LoKr only)
- When EMA is enabled, side-by-side comparisons are now done by default (can be disabled with `--ema_validation=ema_only` or `none`)

Example: the starting model benchmark is on the left as before, the centre is the training Lycoris adapter, and the right side is the EMA weights. (SD3.5 Medium)
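For reference, a minimal sketch of the EMA update as applied to adapter weights; the decay value and structure are illustrative, not SimpleTuner's exact implementation:

```python
import torch

@torch.no_grad()
def update_ema(ema_params, params, decay: float = 0.999):
    # Exponential moving average: ema <- decay * ema + (1 - decay) * current
    for ema_p, p in zip(ema_params, params):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)
```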
Bugfixes
- Text encoders are now properly quantised when the parameter is given; previously they were left in bf16
- Updated doc reference link to caption filter example
What's Changed
- quantise text encoders upon request correctly by @bghira in #1167
- merge minor follow-up fixes by @bghira in #1168
- (experimental) Allow EMA on LoRA/Lycoris networks by @bghira in #1170
- Update `caption_filter_list.txt.example` reference by @emmanuel-ferdman in #1178
- merge EMA LoRA/Lycoris support by @bghira in #1176
New Contributors
- @emmanuel-ferdman made their first contribution in #1178
Full Changelog: v1.1.5...v1.2
v1.1.5 - better validations for SD3.5M and Lycoris users
Features
- Flow-matching models like SD3 and Flux can use uniform schedule sampling again, mirroring the v0.9.x release cycle from early August
- More model card details for Hugging Face Hub
- SD3.5 Medium: skip-layer guidance for validation outputs to more closely match usual workflow results
- SD3.x: Allow configuring T5 and CLIP padding values (default to empty string)
- Added `--vae_enable_tiling` for reducing VAE overhead during 2048px training for SD3.5 Medium on smaller GPUs
- CLIP score tracking for validations by adding `--evaluation_type=clip` to your config (a rough sketch of the metric follows this list)
- LyCORIS training can now have a specific strength set during validations using `--validation_lycoris_strength`, to mirror the typical workflows found in ComfyUI etc. A recommended value is 1.0 (default) or 1.3. Using a value lower than 1.0 can help avoid seeing a model "blow up" when you intend to use it at a lower weight later anyway.
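For context, a rough sketch of how a CLIP score can be computed for a validation image against its prompt; the model choice is illustrative and SimpleTuner's evaluator may differ:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str) -> float:
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image is the image-text cosine similarity scaled by CLIP's logit scale.
    return outputs.logits_per_image.item()
```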
Bugfixes
- Torch compile for validation fixed, now works (it did nothing before)
- Torch compile disabled for LyCORIS models
- Better SD3 quantisation performance via quanto by excluding layers from the quantisation
- Flux: default shift value to 3 instead of 1
- SD1.5 LoRA save fixed
- Quanto typo for FP8 fixed
- Multi-caption parquet backend crashing fixed
- Fixed a file-locking issue with concurrent text embed writes on multi-GPU systems
Pull requests
- experimental: remove some layers from quanto by @bghira in #1085
- merge by @bghira in #1086
- flux: modify the quanto default excluded layers to be different from sd3 by @bghira in #1087
- sd3: allow configuring clip and t5 uncond values by @bghira in #1088
- merge by @bghira in #1089
- fix SD3 text embed creation; downgrade to pytorch 2.4.1 by @bghira in #1093
- update docs and sd3 parameter defaults by @bghira in #1094
- Small link update in TUTORIAL by @rootonchair in #1095
- (#1097) resolve sd15 lora save error by @bghira in #1102
- fix(typo): correct arg name in warning by @Jannchie in #1099
- merge by @bghira in #1103
- Add deduplication of captions by @mhirki in #1104
- Throw an error if both --flux_schedule_auto_shift and --flux_schedule_shift are enabled. by @mhirki in #1106
- Fix unit test failure after PR #1106 by @mhirki in #1107
- Fix gO variable name by @samedii in #1108
- disable caption deduplication as it prevents multigpu caching; add warning for sd3 using wrong VAE; cleanly terminate and restart batch text embed writing thread by @bghira in #1111
- merge by @bghira in #1112
- sd3: revert enforcement of sd35 flow_matching_loss values by @bghira in #1115
- Updating Flux Quickstart Doc with Pre-Trained Model Info by @riffmaster-2001 in #1116
- merge by @bghira in #1123
- Fix missing docker dependencies by @Putzzmunta in #1126
- Fix multi-caption parquets crashing in multiple locations (Closes #1092) by @AmericanPresidentJimmyCarter in #1109
- sd3: add skip layer guidance by @bghira in #1125
- sd3: model card detail expansion by @bghira in #1130
- flux and sd3 could use uniform sampling instead of beta or sigmoid by @bghira in #1129
- Fix random validation errors for good (and restore torch.compile for the validation pipeline at the same time) by @mhirki in #1131
- merge by @bghira in #1132
- revamp model card to work by default and provide quanto hints by @bghira in #1133
- validation: disable compile for lycoris by @bghira in #1136
- add --vae_enable_tiling to encode large res images with less vram used by @bghira in #1141
- s3: when file does not exist, handle generic 404 error for headobject by @bghira in #1142
- trainer: enable vae tiling when enabled by @bghira in #1143
- validation: fix error when torch compile is disabled for lycoris by @bghira in #1144
- add clip score tracking by @bghira in #1146
- add documentation updates by @bghira in #1150
- merge by @bghira in #1151
- metadata: add more ddpm related schedule info to the model card by @bghira in #1152
- local data backend should have file locking for writes and reads by @bghira in #1160
- chore: ignore rmtree errors by @bghira in #1162
- validation: allow setting a non-default strength for validation with lycoris by @bghira in #1161
- add more info to model card, refine contents by @bghira in #1163
- error out when cache dir path is not found by @bghira in #1164
- merge by @bghira in #1165
New Contributors
- @rootonchair made their first contribution in #1095
- @Jannchie made their first contribution in #1099
- @samedii made their first contribution in #1108
- @Putzzmunta made their first contribution in #1126
Full Changelog: v1.1.4...v1.1.5
v1.1.4
Support for SD 3.5 fine-tuning.
Stability AI has provided a tutorial on using SimpleTuner for this task here, and the SD3 quickstart provided by SimpleTuner is available here.
What's Changed
- update to diffusers v0.31 for SD3.5 by @bghira in #1082
- merge masked loss + reg image fixes by @bghira in #1080
- update rocm, mps and nvidia to torch 2.5 by @bghira in #1081
- merge by @bghira in #1083
Full Changelog: v1.1.3...v1.1.4
v1.1.3
- Nested subdir datasets will now have their caches nested in subdirectories as well, which unfortunately will most likely require regenerating these entries. Sorry - it was not feasible to keep the old structure working in parallel.
- FlashAttention3 fixes for H100 nodes by downgrading default torch version to 2.4.1
- Resume fixes for multi-gpu/multi-node state/epoch tracking
- Other misc bugfixes
What's Changed
- fix flux attn masked transformer modeling code by @bghira in #1055
- merge by @bghira in #1056
- fix rope function for FA3 by @bghira in #1057
- merge by @bghira in #1058
- lokr: resume by default training state if not found by @bghira in #1060
- merge by @bghira in #1061
- Restore init_lokr_norm functionality by @imit8ed in #1065
- refactor how masks are retrieved by @bghira in #1066
- nvidia dependency update for pytorch-triton / aiohappyeyeballs by @bghira in #1062
- downgrade cuda to pt241 by default by @bghira in #1067
- add nightly build for pt26 by @bghira in #1068
- Add recropping script for image JSON metadata backends by @AmericanPresidentJimmyCarter in #1063
- merge by @bghira in #1069
- bugfix: restore sampler state on rank 0 correctly by @bghira in #1071
- merge by @bghira in #1072
- fix vae cache dir creation for subdirs by @bghira in #1076
- fix for nested image subdirs w/ duplicated filenames across subdirs by @bghira in #1078
New Contributors
Full Changelog: v1.1.2...v1.1.3
v1.1.2 - masked loss and strong prior preservation
New stuff
- New `is_regularisation_data` option for datasets, works great
- H100 or greater now has better torch compile support
- SDXL ControlNet training is back, now with a quantised base model (int8)
- Multi-node training works now, with a guide to deploy it easily
- configure.py can now generate a very rudimentary user prompt library for you if you are in a hurry
- Flux model cards now have more useful information about your Flux training setup
- Masked loss training & a demo script in the toolkit dir for generating a folder of image masks (a brief sketch of the masked loss idea follows this list)
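A brief, hedged sketch of the masked-loss idea: the per-pixel loss is weighted by a mask so that only the masked region contributes (function and argument names are illustrative, not the trainer's exact code):

```python
import torch
import torch.nn.functional as F

def masked_mse_loss(model_pred: torch.Tensor, target: torch.Tensor,
                    mask: torch.Tensor) -> torch.Tensor:
    # mask is broadcastable to the prediction, with 1.0 where the loss should apply.
    per_element = F.mse_loss(model_pred, target, reduction="none")
    weighted = per_element * mask
    # Normalise by the number of active mask elements to keep the loss scale stable.
    return weighted.sum() / mask.sum().clamp(min=1.0)
```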
What's Changed
- quanto: improve support for SDXL training by @bghira in #1027
- Fix attention masking transformer for flux by @AmericanPresidentJimmyCarter in #1032
- merge by @bghira in #1036
- H100/H200/B200 FlashAttention3 for Flux + TorchAO improvements by @bghira in #1033
- utf8 fix for emojis in dataset configs by @bghira in #1037
- fix venv instructions and edge case for aspect crop bucket list by @bghira in #1038
- merge by @bghira in #1039
- multi-node training fixes for state tracker by @bghira in #1040
- merge bugfixes by @bghira in #1041
- configure.py can configure caption strategy by @bghira in #1042
- regression by @bghira in #1043
- fix multinode state resumption by @bghira in #1044
- merge by @bghira in #1045
- validations can crash when sending updates to wandb by @bghira in #1046
- aws: do not give up on fatal errors during exists() by @bghira in #1047
- merge by @bghira in #1048
- add prompt expander based on 1B Llama model by @bghira in #1049
- implement regularisation dataset parent-student loss for LyCORIS training by @bghira in #1050
- metadata: add more flux model card details by @bghira in #1051
- merge by @bghira in #1052
- fix controlnet training for sdxl and introduce masked loss preconditioning by @bghira in #1053
- merge by @bghira in #1054
Full Changelog: v1.1.1...v1.1.2
v1.1.1 - bring on the potato models
Trained with NF4 via PagedLion8Bit.
- New custom timestep distribution for Flux via `--flux_use_beta_schedule`, `--flux_beta_schedule_alpha`, `--flux_beta_schedule_beta` (#1023). A brief sketch of the sampling idea follows this list.
- The trendy AdEMAMix and its 8bit and paged counterparts are all now available as `bnb-ademamix`, `bnb-ademamix-8bit`, and `bnb-ademamix8bit-paged`
- All low-bit optimisers from Bits n Bytes are now included for NVIDIA and ROCm systems
- NF4 training on NVIDIA systems down to 9090M total using Lion8Bit and 512px training at 1.5 sec/iter on a 4090
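A hedged sketch of what a Beta-distributed timestep schedule looks like; the mapping to Flux's flow-matching sigmas is simplified here, and alpha/beta correspond to the new flags:

```python
import torch

def sample_beta_timesteps(batch_size: int, alpha: float = 2.0, beta: float = 2.0,
                          num_train_timesteps: int = 1000) -> torch.Tensor:
    # Draw u ~ Beta(alpha, beta) on [0, 1], then map to discrete timestep indices.
    # alpha and beta shape where the sampling density concentrates.
    u = torch.distributions.Beta(alpha, beta).sample((batch_size,))
    return (u * (num_train_timesteps - 1)).long()

print(sample_beta_timesteps(4))
```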
What's Changed
- int8-quanto followup fixes (batch size > 1) by @bghira in #1016
- merge by @bghira in #1018
- update doc by @bghira in #1019
- update docs by @bghira in #1025
- Add the ability to use a Beta schedule to select Flux timesteps by @AmericanPresidentJimmyCarter in #1023
- AdEMAMix, 8bit Adam/AdamW/Lion/Adagrad, Paged optimisers by @bghira in #1026
- Bits n Bytes NF4 training by @bghira in #1028
- merge by @bghira in #1029
Full Changelog: v1.1...v1.1.1
v1.1 - API-friendly edition
Features
Performance
- Improved launch speed for large datasets (>1M samples)
- Improved speed for quantising on CPU
- Optional support for directly quantising on GPU near-instantly (`--quantize_via`)
Compatibility
- SDXL, SD1.5 and SD2.x compatibility with LyCORIS training
- Updated documentation to make multiGPU configuration a bit more obvious.
- Improved support for `torch.compile()`, including automatically disabling it when e.g. `fp8-quanto` is enabled
  - Enable via `accelerate config` or `config/config.env` via `TRAINER_DYNAMO_BACKEND=inductor`
- TorchAO for quantisation as an alternative to Optimum Quanto for int8 weight-only quantisation (`int8-torchao`); a minimal int8 weight-only sketch using Optimum Quanto follows this list
- `f8uz-quanto`, a compatibility level for AMD users to experiment with FP8 training dynamics
- Support for multigpu PEFT LoRA training with Quanto enabled (not `fp8-quanto`)
  - Previously, only LyCORIS would reliably work with quantised multigpu training sessions.
- Ability to quantise models when full-finetuning, without warning or error. Previously, this configuration was blocked. Your mileage may vary; it's an experimental configuration.
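For reference, a minimal sketch of int8 weight-only quantisation with Optimum Quanto; this is a standalone example, not SimpleTuner's configuration path:

```python
import torch
from optimum.quanto import quantize, freeze, qint8

model = torch.nn.Sequential(
    torch.nn.Linear(64, 64),
    torch.nn.GELU(),
    torch.nn.Linear(64, 64),
)

# Replace eligible weights with int8 quantised versions, then freeze the model so
# the quantised tensors are materialised and used for subsequent forward passes.
quantize(model, weights=qint8)
freeze(model)

print(model(torch.randn(1, 64)).shape)
```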
Integrations
- Images now get logged to tensorboard (thanks @anhi)
- FastAPI endpoints for integrations (undocumented)
- "raw" webhook type that sends a large number of HTTP requests containing events, useful for push notification type service
Optims
- SOAP optimiser support
  - Uses fp32 gradients; nice and accurate, but uses more memory than other optims, and by default slows down every 10 steps as it preconditions
- New 8bit and 4bit optimiser options from TorchAO (`ao-adamw8bit`, `ao-adamw4bit`, etc.)
Pull Requests
- Fix flux cfg sampling bug by @AmericanPresidentJimmyCarter in #981
- merge by @bghira in #982
- FastAPI endpoints for managing trainer as a service by @bghira in #969
- constant lr resume fix for optimi-stableadamw by @bghira in #984
- clear data backends before configuring new ones by @bghira in #992
- update to latest quanto main by @bghira in #994
- log images in tensorboard by @anhi in #998
- merge by @bghira in #999
- torchao: add int8; quanto: add NF4; torch compile fixes + ability to compile optim by @bghira in #986
- update flux quickstart by @bghira in #1000
- compile optimiser by @bghira in #1001
- optimizer compile step only by @bghira in #1002
- remove optimiser compilation arg by @bghira in #1003
- remove optim compiler from options by @bghira in #1004
- remove optim compiler from options by @bghira in #1005
- SOAP optimiser; int4 fixes for 4090 by @bghira in #1006
- torchao: install 0.5.0 from pytorch source by @bghira in #1007
- update safety check warning with guidance toward cache clear interval for OOM issues by @bghira in #1008
- fix webhook contents for discord by @bghira in #1011
- fp8-quanto fixes, unblocking of PEFT multigpu LoRA training for other precision levels by @bghira in #1013
- quanto: activations sledgehammer by @bghira in #1014
- 1.1 merge window by @bghira in #1010
Full Changelog: v1.0.1...v1.1
v1.0.1
This is a maintenance release with few new features.
What's Changed
- fix reference error to use_dora by @bghira in #929
- fix merge error by @bghira in #930
- fix use of --num_train_epochs by @bghira in #932
- merge fixes by @bghira in #934
- documentation updates, deepspeed config reference error fix by @bghira in #935
- Fix caption_with_cogvlm.py for cogvlm2 + textfile strategy by @burgalon in #936
- dependency updates, cogvlm fixes, peft/lycoris resume fix by @bghira in #939
- feature: zero embed padding for t5 on request by @bghira in #941
- merge by @bghira in #942
- comet_ml validation images by @burgalon in #944
- Allow users to init their LoKr with perturbed normal w2 by @AmericanPresidentJimmyCarter in #943
- merge by @bghira in #948
- fix typo in PR by @bghira in #949
- update arg name for norm init by @bghira in #950
- configure script should not set dropout by default by @bghira in #955
- VAECache: improve startup speed for large sets by @bghira in #956
- Update FLUX.md by @anae-git in #957
- mild bugfixes by @bghira in #963
- fix bucket worker not waiting for all queue worker to finish by @burgalon in #967
- merge by @bghira in #968
- fix DDP for PEFT LoRA & minor exit error by @bghira in #974
New Contributors
Full Changelog: v1.0...v1.0.1