Releases: bghira/SimpleTuner

v1.2.2

24 Dec 20:56
f56900d

Features

Sana

Training Sana is now supported and requires very few config changes.

Example of setting up a multi-environment training layout:

  1. mkdir config/environment_name, where environment_name may be something like the model name or concept you were working on.
    • Example: mkdir config/flux
  2. Move all of your current configurations into the new environment: mv config/*.json config/flux
  3. Run configure.py to create new configs for Sana
  4. mkdir config/sana
  5. mv config/*.json config/sana

When launching you can now use:

ENV=sana ./train.sh
# or
ENV=flux ./train.sh

Note: You'll have to adjust the paths to multidatabackend.json and other config files inside the nested config.json files so that they point to their new location, e.g. config/flux/multidatabackend.json.
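The path adjustment described in the note can be scripted. The following is a minimal sketch and not part of SimpleTuner: it assumes config.json is a flat JSON object and that any string value of the form config/<file> should be repointed to config/<env>/<file> when that file exists in the environment directory. The function name repoint_config_paths is hypothetical.

```python
import json
from pathlib import Path


def repoint_config_paths(config_dir: Path, env: str) -> dict:
    """Rewrite 'config/<file>' values in config/<env>/config.json to
    'config/<env>/<file>' when <file> exists in the environment directory.

    Assumption: config.json is a flat JSON object of key/value pairs.
    """
    env_dir = config_dir / env
    config_path = env_dir / "config.json"
    config = json.loads(config_path.read_text())
    prefix = f"{config_dir.name}/"

    for key, value in config.items():
        if not isinstance(value, str) or not value.startswith(prefix):
            continue
        filename = value[len(prefix):]
        # Skip paths that are already nested, and paths whose target
        # was not actually moved into the environment directory.
        if "/" not in filename and (env_dir / filename).is_file():
            config[key] = f"{prefix}{env}/{filename}"

    config_path.write_text(json.dumps(config, indent=4))
    return config
```

Calling repoint_config_paths(Path("config"), "flux") after the mv step would rewrite the file in place. Because already-nested paths are skipped, running it twice is harmless. Back up config.json first, since the sketch overwrites it.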

Gradient clipping by max value

When using --max_grad_norm, the previous behaviour was to scale the entire gradient vector so that its norm did not exceed the given value. The new default behaviour is to clip individual values within the gradient to avoid outliers. The previous behaviour can be restored with --grad_clip_method=norm.

This was found to stabilise training across a range of batch sizes, and it noticeably enabled more learning to occur with fewer disasters.
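The difference between the two clipping methods can be illustrated with a small, dependency-free sketch. It is not the trainer's code: the function names are hypothetical, and the gradient is a plain list of floats. PyTorch exposes the same two strategies as torch.nn.utils.clip_grad_norm_ and torch.nn.utils.clip_grad_value_; whether SimpleTuner calls exactly those is an assumption.

```python
import math


def clip_by_norm(grads, max_norm):
    """Scale the whole gradient vector so its L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)
    scale = max_norm / total_norm
    return [g * scale for g in grads]


def clip_by_value(grads, max_value):
    """Clamp each gradient element into [-max_value, max_value]."""
    return [max(-max_value, min(max_value, g)) for g in grads]


# One outlier (8.0) next to two small, well-behaved components.
grads = [0.1, -0.2, 8.0]

print("norm :", clip_by_norm(grads, 1.0))
print("value:", clip_by_value(grads, 1.0))
```

With norm clipping, the single outlier forces every component to shrink by the same factor, so the small components nearly vanish. With value clipping, only the outlier is clamped and the other components pass through unchanged.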

Stable Diffusion 3.5 fixes

The eternal problem child SD3.5 has some training parameter fixes that make it worth reattempting training for.

The T5 text encoder was previously claimed by StabilityAI to use a sequence length of 256, but is now understood to have actually used a sequence length of 154. Updating this results in more likeness being trained into the model with less degradation. (The original release page shows five comparison images here.)

Some checkpoints are available here, and the EMA model weights here are a noticeably better starting point for use with --init_lora. Note that this is a LyCORIS adapter, not a PEFT LoRA. You may have to adjust your configuration to use lora_type=lycoris and --init_lora=path/to/the/ema_model.safetensors

SD3.5 also now supports --gradient_checkpointing_interval which allows the use of more VRAM to speed up training by checkpointing fewer blocks.
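One way to picture the trade-off is sketched below. It assumes that an interval of n means only every n-th transformer block is checkpointed; that interpretation and the function name are assumptions made for illustration, so consult the options guide for the actual semantics.

```python
def checkpointed_blocks(num_blocks, interval):
    """Return the indices of blocks that would be checkpointed if only
    every `interval`-th block is wrapped (an assumed interpretation)."""
    if interval < 1:
        raise ValueError("interval must be >= 1")
    return [i for i in range(num_blocks) if i % interval == 0]


# A hypothetical model with 12 blocks.
for interval in (1, 2, 4):
    blocks = checkpointed_blocks(12, interval)
    print(f"interval={interval}: {len(blocks)} of 12 blocks checkpointed")
```

Blocks that are not checkpointed keep their activations in memory, which costs VRAM, but they do not need to be recomputed during the backward pass, which saves time.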

DeepSpeed

Stage 3 offload has some experimental fixes which allow running the text and image encoders without sharding them.

Pull requests

  • support Sana training by @bghira in #1187
  • update sana toc link by @bghira in #1188
  • update sd3 seqlen to 154 max for t5 by @bghira in #1190
  • chore; log cleanup by @bghira in #1192
  • add --grad_clip_method to allow different forms of max_grad_norm clipping by @bghira in #1205
  • max_grad_norm value limit removal for sd3 by @bghira in #1207
  • local backend: use atomicwrites library to resolve rename errors and parallel overwrites by @bghira in #1206
  • apple: update quanto dependency to upstream repository by @bghira in #1208
  • switch clip method to "value" by default by @bghira in #1210
  • add vae in example by @MrTuanDao in #1212
  • sana: use bf16 weights and update class names to latest PR by @bghira in #1213
  • configurator should avoid asking about checkpointing intervals when the model family does not support it by @bghira in #1214
  • vaecache: sana should grab .latent object by @bghira in #1215
  • safety_check: Fix gradient checkpointing interval error message by @clayne in #1221
  • sana: add complex human instruction to user prompts by default (untested) by @bghira in #1216
  • flux: use rank 0 for h100 detection since that is the most realistic setup by @bghira in #1225
  • diffusers: bump to main branch instead of Sana branch by @bghira in #1226
  • torchao: bump version to 0.7.0 by @bghira in #1224
  • deepspeed from 0.15 to 0.16.1 by @bghira in #1227
  • accelerate: from v0.34 to v1.2 by @bghira in #1228
  • more dependency updates by @bghira in #1229
  • sd3: allow setting grad checkpointing interval by @bghira in #1230
  • merge by @bghira in #1232
  • remove sana complex human instruction from tensorboard args (#1234) by @bghira in #1235
  • merge by @bghira in #1242
  • deepspeed stage 3 needs validations disabled thoroughly by @bghira in #1243
  • merge by @bghira in #1244

Full Changelog: v1.2.1...v1.2.2

v1.2.1 - free lunch edition

03 Dec 23:17
07d9ea7

Features

This release will speed up all validations without any config changes.

  • SageAttention (NVIDIA-only; must be installed manually for now)
    • By default, this only speeds up inference. SDXL benefits more than Flux due to differences in their respective bottlenecks.
    • Use --attention_mechanism=sageattention to enable this, and --sageattention_usage=training+inference to enable it for training as well as validations. Using it for training will probably make your model worse or cause it to collapse, though.
  • Optimised --gradient_checkpointing implementation
    • No longer applies during validations, so even without SageAttention there is a speedup (on a 4090+5800X3D): a Flux image goes from 29 seconds to 15 seconds, and an SDXL image from 15 seconds to 6 seconds.
  • Added --gradient_checkpointing_interval which you can use to speed up Flux training at the cost of some additional VRAM.
    • Makes NF4 even more attractive for a 4090, where you can then use the SOAP optimiser in a meaningful way.
    • See the options guide for more information.

Full Changelog: v1.2...v1.2.1

v1.2 - EMA for LoRA/Lycoris training

25 Nov 13:51
2f8fc6e

Features

  • EMA is reworked. Previous training runs using EMA should not update to this release. Your checkpoints will not load the EMA weights correctly.
  • EMA now works fully for PEFT Standard LoRA and Lycoris adapters (tested LoKr only)
  • When EMA is enabled, side-by-side comparisons are now done by default (can be disabled with --ema_validation=ema_only or none)

Example (SD3.5 Medium): in the side-by-side validation image, the starting model benchmark is on the left as before, the centre is the training Lycoris adapter, and the right side is the EMA weights.
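For readers unfamiliar with EMA, the EMA weights are a running average of the training weights, which is why they are validated separately from the adapter being trained. Below is a minimal sketch of the standard update rule. The decay value and the function name are hypothetical, and the trainer's actual decay schedule and weight storage are not described in these notes.

```python
def ema_update(ema_weights, weights, decay=0.999):
    """Blend the current training weights into the running average."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


# Toy example: the training weight jumps around, the EMA follows slowly.
ema = [0.0]
for step_weight in (1.0, 1.2, 0.8, 1.1):
    ema = ema_update(ema, [step_weight], decay=0.9)
    print(f"training={step_weight:.2f} ema={ema[0]:.4f}")
```

A decay closer to 1.0 makes the average smoother and slower to react to changes in the training weights.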

Bugfixes

  • Text encoders are now properly quantised if the parameter is given; previously they remained in bf16
  • Updated doc reference link to caption filter example

Full Changelog: v1.1.5...v1.2

v1.1.5 - better validations for SD3.5M and Lycoris users

16 Nov 23:45
04a5a74

Features

  • Flow-matching models like SD3 and Flux can use uniform schedule sampling again, mirroring the v0.9.x release cycle from early August
  • More model card details for Hugging Face Hub
  • SD3.5 Medium: skip-layer guidance for validation outputs to more closely match usual workflow results
  • SD3.x: Allow configuring T5 and CLIP padding values (default to empty string)
  • Added --vae_enable_tiling for reducing VAE overhead on 2048px training for SD3.5 Medium on smaller GPUs
  • CLIP score tracking for validations by adding --evaluation_type=clip to your config
  • LyCORIS training can now have a specific strength set during validations using --validation_lycoris_strength to mirror the typical workflows found in ComfyUI etc. A recommended value is 1.0 (default) or 1.3. Using a value lower than 1.0 can help to avoid seeing a model "blow up" when you intend on using it at a lower weight later, anyway.

Bugfixes

  • Torch compile for validation fixed, now works (it did nothing before)
  • Torch compile disabled for LyCORIS models
  • Better SD3 quantisation performance via quanto by excluding layers from the quantisation
  • Flux: default shift value to 3 instead of 1
  • SD1.5 LoRA save fixed
  • Quanto typo for FP8 fixed
  • Multi-caption parquet backend crashing fixed
  • File locking issue with concurrent text embed writes on multi-GPU systems fixed

Full Changelog: v1.1.4...v1.1.5

v1.1.4

22 Oct 15:07
71bea97

Support for SD 3.5 fine-tuning.

Stability AI has provided a tutorial on using SimpleTuner for this task here, and the SD3 quickstart provided by SimpleTuner is available here.

Full Changelog: v1.1.3...v1.1.4

v1.1.3

18 Oct 17:31
8bf644f

  • Nested subdir datasets will now have their caches nested in subdirectories as well, which unfortunately means these cache entries will most likely need to be regenerated. Sorry, it was not feasible to keep the old structure working in parallel.
  • FlashAttention3 fixes for H100 nodes by downgrading default torch version to 2.4.1
  • Resume fixes for multi-gpu/multi-node state/epoch tracking
  • Other misc bugfixes

Full Changelog: v1.1.2...v1.1.3

v1.1.2 - masked loss and strong prior preservation

13 Oct 04:36
dddaf4f

New stuff

  • New is_regularisation_data option for datasets, works great
  • H100 or greater now has better torch compile support
  • SDXL ControlNet training is back, now with quantised base model (int8)
  • Multi-node training works now, with a guide to deploy it easily
  • configure.py can now generate a very rudimentary user prompt library for you if you are in a hurry
  • Flux model cards now have more useful information about your Flux training setup
  • Masked loss training & a demo script in the toolkit dir for generating a folder of image masks

Full Changelog: v1.1.1...v1.1.2

v1.1.1 - bring on the potato models

05 Oct 00:37
01de5d0

Sample image (omitted here), trained with NF4 via PagedLion8Bit.

  • New custom timestep distribution for Flux via --flux_use_beta_schedule, --flux_beta_schedule_alpha, --flux_beta_schedule_beta (#1023)
  • The trendy AdEMAMix, its 8bit and paged counterparts are all now available as bnb-ademamix, bnb-ademamix-8bit, and bnb-ademamix8bit-paged
  • All low-bit optimisers from Bits n Bytes are now included for NVIDIA and ROCm systems
  • NF4 training on NVIDIA systems down to 9090M total using Lion8Bit and 512px training at 1.5 sec/iter on a 4090
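As background for the --flux_use_beta_schedule options listed above, the sketch below shows how the two shape parameters of a Beta distribution change where samples in (0, 1) concentrate. It uses only Python's standard library. The parameter values are hypothetical, and how the trainer maps such samples onto Flux timesteps is not described in these notes.

```python
import random


def sample_beta_schedule(n, alpha, beta, seed=0):
    """Draw n values in [0, 1] from Beta(alpha, beta).

    alpha > beta skews samples towards 1.0, alpha < beta towards 0.0,
    and alpha == beta keeps them symmetric around 0.5.
    """
    rng = random.Random(seed)
    return [rng.betavariate(alpha, beta) for _ in range(n)]


# Hypothetical shape values, chosen only to show the skew.
for alpha, beta in ((2.0, 2.0), (2.0, 5.0), (5.0, 2.0)):
    samples = sample_beta_schedule(10_000, alpha, beta)
    mean = sum(samples) / len(samples)
    print(f"alpha={alpha} beta={beta} mean={mean:.3f}")
```

The sample mean approaches alpha / (alpha + beta), so the two parameters together control which part of the schedule receives the most training attention.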

Full Changelog: v1.1...v1.1.1

v1.1 - API-friendly edition

01 Oct 20:51
696760e

Features

Performance

  • Improved launch speed for large datasets (>1M samples)
  • Improved speed for quantising on CPU
  • Optional support for directly quantising on GPU near-instantly (--quantize_via)

Compatibility

  • SDXL, SD1.5 and SD2.x compatibility with LyCORIS training
  • Updated documentation to make multiGPU configuration a bit more obvious.
  • Improved support for torch.compile(), including automatically disabling it when eg. fp8-quanto is enabled
    • Enable via accelerate config or config/config.env via TRAINER_DYNAMO_BACKEND=inductor
  • TorchAO for quantisation as an alternative to Optimum Quanto for int8 weight-only quantisation (int8-torchao)
  • f8uz-quanto, a compatibility level for AMD users to experiment with FP8 training dynamics
  • Support for multigpu PEFT LoRA training with Quanto enabled (not fp8-quanto)
    • Previously, only LyCORIS would reliably work with quantised multigpu training sessions.
  • Ability to quantise models when full-finetuning, without warning or error. Previously, this configuration was blocked. Your mileage may vary; it is an experimental configuration.

Integrations

  • Images now get logged to tensorboard (thanks @anhi)
  • FastAPI endpoints for integrations (undocumented)
  • "raw" webhook type that sends a large number of HTTP requests containing events, useful for push notification type service

Optims

  • SOAP optimiser support
    • Uses fp32 gradients, which is accurate but needs more memory than other optimisers. By default, it slows down every 10 steps while it preconditions.
  • New 8bit and 4bit optimiser options from TorchAO (ao-adamw8bit, ao-adamw4bit etc)

Full Changelog: v1.0.1...v1.1

v1.0.1

14 Sep 18:45
a5ca5a2

This is a maintenance release with few new features.

Full Changelog: v1.0...v1.0.1