# SimpleTuner 💹

> ℹ️ No data is sent to any third parties except through the opt-in flags `report_to` and `push_to_hub`, or via webhooks, all of which must be manually configured.

**SimpleTuner** is geared towards simplicity, with a focus on making the code easy to understand. This codebase serves as a shared academic exercise, and contributions are welcome.

If you'd like to join our community, we can be found [on Discord](https://discord.com/invite/eq3cAMZtCC) via Terminus Research Group. If you have any questions, please feel free to reach out to us there.

## Table of Contents

- [Design Philosophy](#design-philosophy)
- [Tutorial](#tutorial)
- [Features](#features)
  - [HiDream](#hidream)
  - [Flux](#flux1)
  - [Wan Video](#wan-video)
  - [LTX Video](#ltx-video)
  - [PixArt Sigma](#pixart-sigma)
  - [NVLabs Sana](#nvlabs-sana)
  - [Stable Diffusion 3](#stable-diffusion-3)
  - [Kwai Kolors](#kwai-kolors)
  - [Lumina2](#lumina2)
  - [Cosmos2 Predict](#cosmos2-predict-image)
  - [Legacy Stable Diffusion](#legacy-stable-diffusion-models)
- [Hardware Requirements](#hardware-requirements)
  - [HiDream](#hidream-dev-full)
  - [Flux](#flux1-dev-schnell)
  - [SDXL](#sdxl-1024px)
  - [Stable Diffusion (Legacy)](#stable-diffusion-2x-768px)
- [Toolkit](#toolkit)
- [Setup](#setup)
- [Troubleshooting](#troubleshooting)

## Design Philosophy

- **Simplicity**: Aiming to have good default settings for most use cases, so less tinkering is required.
- **Versatility**: Designed to handle a wide range of image quantities - from small datasets to extensive collections.
- **Cutting-Edge Features**: Only incorporates features that have proven efficacy, avoiding the addition of untested options.

## Tutorial

Please fully explore this README before embarking on [the tutorial](/TUTORIAL.md), as it contains vital information that you might need to know first.

For a quick start without reading the full documentation, you can use the [Quick Start](/documentation/QUICKSTART.md) guide.

For memory-constrained systems, see the [DeepSpeed document](/documentation/DEEPSPEED.md), which explains how to use 🤗Accelerate to configure Microsoft's DeepSpeed for optimiser state offload.

For multi-node distributed training, [this guide](/documentation/DISTRIBUTED.md) will help you adapt the configurations from the INSTALL and Quickstart guides for multi-node training, and for optimising image datasets numbering in the billions of samples.

---

## Features

- Multi-GPU training
- Image, video, and caption features (embeds) are cached to the hard drive in advance, so that training runs faster and with less memory consumption
- Aspect bucketing: support for a variety of image/video sizes and aspect ratios, enabling widescreen and portrait training
- Refiner LoRA or full u-net training for SDXL
- Most models are trainable on a 24G GPU, or even down to 16G at lower base resolutions
- LoRA/LyCORIS training for PixArt, SDXL, SD3, and SD 2.x that uses less than 16G VRAM
- DeepSpeed integration allowing for [training SDXL's full u-net on 12G of VRAM](/documentation/DEEPSPEED.md), albeit very slowly
- Quantised NF4/INT8/FP8 LoRA training, using a low-precision base model to reduce VRAM consumption
- Optional EMA (Exponential Moving Average) weight network to counteract model overfitting and improve training stability
- Train directly from an S3-compatible storage provider, eliminating the requirement for expensive local storage (tested with Cloudflare R2 and Wasabi S3)
- For SDXL, SD 1.x/2.x, and Flux, full or LoRA-based [ControlNet model training](/documentation/CONTROLNET.md) (not ControlLite)
- Training [Mixture of Experts](/documentation/MIXTURE_OF_EXPERTS.md) for lightweight, high-quality diffusion models
- [Masked loss training](/documentation/DREAMBOOTH.md#masked-loss) for superior convergence and reduced overfitting on any model
- Strong [prior regularisation](/documentation/DATALOADER.md#is_regularisation_data) training support for LyCORIS models
- Webhook support for updating e.g. Discord channels with your training progress, validations, and errors
- Integration with the [Hugging Face Hub](https://huggingface.co) for seamless model upload and nice automatically-generated model cards
- Use the [datasets library](/documentation/data_presets/preset_subjects200k.md) ([more info](/documentation/HUGGINGFACE_DATASETS.md)) to load compatible datasets directly from the Hub

### HiDream

Full training support for HiDream is included:

- Custom ControlNet implementation for training via full-rank, LoRA, or Lycoris
- Memory-efficient training for NVIDIA GPUs (AMD support is planned)
- Dev and Full are both functioning and trainable; Fast is untested
- Optional MoEGate loss augmentation
- Lycoris or full tuning via DeepSpeed ZeRO on a single GPU
- Quantise the base model by setting `--base_model_precision` to `int8-quanto` or `fp8-quanto` for major memory savings
- Quantise the Llama LLM by setting `--text_encoder_4_precision` to `int4-quanto` or `int8-quanto` to run on 24G cards (see the sketch after this list)

See [hardware requirements](#hidream-dev-full) or the [quickstart guide](/documentation/quickstart/HIDREAM.md).
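As a rough illustration, the two quantisation switches above could be combined like so. This is a minimal sketch, not a tested recipe: the flag names and values come from this README, while the `TRAINER_EXTRA_ARGS` variable is an assumption about how your `config/config.env` (the env-file style shown under [Troubleshooting](#troubleshooting)) forwards extra flags to the trainer.

```bash
# Hypothetical excerpt from config/config.env for HiDream on a 24G card.
# Assumption: the launch script forwards TRAINER_EXTRA_ARGS to the trainer.
export TRAINER_EXTRA_ARGS="--base_model_precision=int8-quanto \
  --text_encoder_4_precision=int4-quanto"
```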
### Flux.1

Full training support for Flux.1 is included:

- Double the training speed of Flux.1 with the `--fuse_qkv_projections` option, taking advantage of Flash Attention 3 on Hopper systems
- ControlNet training via full-rank, LoRA, or Lycoris
- Instruct fine-tuning for the Kontext \[dev] editing model, an implementation generously provided by [Runware](https://runware.ai)
- Classifier-free guidance training
  - Leave it disabled and preserve the dev model's distillation qualities
  - Or, reintroduce CFG to the model and improve its creativity at the cost of inference speed and training time
- (Optional) T5 attention-masked training for superior fine details and generalisation capabilities
- LoRA or full tuning via DeepSpeed ZeRO on a single GPU
- Quantise the base model by setting `--base_model_precision` to `int8-quanto` or `fp8-torchao` for major memory savings (see the sketch after this list)

See [hardware requirements](#flux1-dev-schnell) or the [quickstart guide](/documentation/quickstart/FLUX.md).
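As with HiDream, the speed and memory options above can be sketched as a config excerpt. The flag names appear in this README; the `TRAINER_EXTRA_ARGS` wiring is the same assumption as before.

```bash
# Hypothetical excerpt from config/config.env for Flux.1 training.
# --fuse_qkv_projections targets Flash Attention 3 on Hopper GPUs;
# --base_model_precision quantises the base model to save VRAM.
export TRAINER_EXTRA_ARGS="--base_model_precision=fp8-torchao \
  --fuse_qkv_projections"
```

### Wan Video

SimpleTuner has preliminary training integration for Wan 2.1, which comes in 14B and 1.3B variants, both of which work.

- Text-to-video training is supported
- Image-to-video training is not yet supported
- Text encoder training is not supported
- VAE training is not supported
- LyCORIS, PEFT, and full tuning all work as expected
- ControlNet training is not yet supported

See the [Wan Video Quickstart](/documentation/quickstart/WAN.md) guide to start training.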
### LTX Video

SimpleTuner has preliminary training integration for LTX Video, efficiently training on less than 16G.

- Text encoder training is not supported
- VAE training is not supported
- LyCORIS, PEFT, and full tuning all work as expected
- ControlNet training is not yet supported

See the [LTX Video Quickstart](/documentation/quickstart/LTXVIDEO.md) guide to start training.

### PixArt Sigma

SimpleTuner has extensive training integration with PixArt Sigma - both the 600M & 900M models load without modification.

- Text encoder training is not supported
- LyCORIS and full tuning both work as expected
- ControlNet training is supported for full and PEFT LoRA training
- [Two-stage PixArt](https://huggingface.co/ptx0/pixart-900m-1024-ft-v0.7-stage1) training support (see: [MIXTURE_OF_EXPERTS](/documentation/MIXTURE_OF_EXPERTS.md))

See the [PixArt Quickstart](/documentation/quickstart/SIGMA.md) guide to start training.

### NVLabs Sana

SimpleTuner has extensive training integration with NVLabs Sana. This is a lightweight, fun, and fast model that makes getting into model training highly accessible to a wider audience.

- LyCORIS and full tuning both work as expected
- Text encoder training is not supported
- PEFT Standard LoRA is not supported
- ControlNet training is not yet supported

See the [NVLabs Sana Quickstart](/documentation/quickstart/SANA.md) guide to start training.

### Stable Diffusion 3

- LoRA and full finetuning are supported as usual
- ControlNet training via full-rank, PEFT LoRA, or Lycoris
- Certain features such as segmented timestep selection and Compel long prompt weighting are not yet supported
- Parameters have been optimised to get the best results, validated through from-scratch training of SD3 models

See the [Stable Diffusion 3 Quickstart](/documentation/quickstart/SD3.md) to get going.

### Kwai Kolors

An SDXL-based model with ChatGLM (General Language Model) 6B as its text encoder, **doubling** the hidden dimension size and substantially increasing the level of local detail included in the prompt embeds.

Kolors support is almost as deep as SDXL's, minus ControlNet training support.

### Lumina2

A 2B-parameter flow-matching model that uses the 16ch Flux VAE.

- LoRA, Lycoris, and full finetuning are supported
- ControlNet training is not yet supported

A [Lumina2 Quickstart](/documentation/quickstart/LUMINA2.md) is available with example configurations.

### Cosmos2 Predict (Image)

A 2B / 14B parameter model that can do video as well as text-to-image.

- Currently, only the text-to-image variant is supported
- Lycoris or full-rank tuning are supported, but PEFT LoRAs are currently not
- ControlNet training is not yet supported

A [Cosmos2 Predict Quickstart](/documentation/quickstart/COSMOS2IMAGE.md) is available with full example configuration and dataset.

### Legacy Stable Diffusion models

RunwayML's SD 1.5 and StabilityAI's SD 2.x are both trainable under the `legacy` designation, as sketched below.
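This README does not show which option selects the `legacy` family, so the line below uses a hypothetical `--model_family` flag purely for illustration; consult [this documentation](/OPTIONS.md) for the authoritative option name.

```bash
# Hypothetical excerpt from config/config.env for an SD 1.5 / SD 2.x run.
# Assumption: a model-family option (name unverified) accepts `legacy`.
export TRAINER_EXTRA_ARGS="--model_family=legacy"
```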
---

## Hardware Requirements

### NVIDIA

Pretty much anything from the 3080 up is a safe bet. YMMV.

### AMD

LoRA and full-rank tuning are verified working on a 7900 XTX 24GB and MI300X.

Lacking `xformers`, it will use more memory than equivalent NVIDIA hardware.

### Apple

LoRA and full-rank tuning are tested to work on an M3 Max with 128G memory, taking about **12G** of "Wired" memory and **4G** of system memory for SDXL.

- You likely need a 24G or greater machine for machine learning with M-series hardware due to the lack of memory-efficient attention.
- Subscribing to PyTorch issues for MPS is probably a good idea, as random bugs will make training stop working.

### HiDream [dev, full]

- A100-80G (full tune with DeepSpeed)
- A100-40G (LoRA, LoKr)
- 3090 24G (LoRA, LoKr)

HiDream has not been tested on 16G cards, but with aggressive quantisation and pre-caching of embeds you might make it work, though even 24G is pushing the limits.

### Flux.1 [dev, schnell]

- A100-80G (full tune with DeepSpeed)
- A100-40G (LoRA, LoKr)
- 3090 24G (LoRA, LoKr)
- 4060 Ti 16G, 4070 Ti 16G, 3080 16G (int8, LoRA, LoKr)
- 4070 Super 12G, 3080 10G, 3060 12G (NF4, LoRA, LoKr)

Flux prefers being trained with multiple large GPUs, but a single 16G card should be able to do it with quantisation of the transformer and text encoders. Kontext requires somewhat beefier compute and memory allocation; a 4090 will go from roughly 3 to roughly 6 seconds per step when it is enabled.

### Auraflow

- A100-80G (full tune with DeepSpeed)
- A100-40G (LoRA, LoKr)
- 3090 24G (LoRA, LoKr)
- 4060 Ti 16G, 4070 Ti 16G, 3080 16G (int8, LoRA, LoKr)
- 4070 Super 12G, 3080 10G, 3060 12G (NF4, LoRA, LoKr)

### SDXL, 1024px

- A100-80G (EMA, large batches, LoRA @ insane batch sizes)
- A6000-48G (EMA @ 768px, no EMA @ 1024px, LoRA @ high batch sizes)
- A100-40G (EMA @ 1024px, EMA @ 768px, EMA @ 512px, LoRA @ high batch sizes)
- 4090-24G (EMA @ 1024px, batch size 1-4, LoRA @ medium-high batch sizes)
- 4080-12G (LoRA @ low-medium batch sizes)

### Stable Diffusion 2.x, 768px

- 16G or better

## Toolkit

For more information about the associated toolkit distributed with SimpleTuner, refer to [the toolkit documentation](/toolkit/README.md).

## Setup

Detailed setup information is available in the [installation documentation](/INSTALL.md).

## Troubleshooting

Enable debug logs for more detailed insight by adding `export SIMPLETUNER_LOG_LEVEL=DEBUG` to your environment (`config/config.env`) file.

For performance analysis of the training loop, setting `SIMPLETUNER_TRAINING_LOOP_LOG_LEVEL=DEBUG` will produce timestamps that highlight any issues in your configuration.

For a comprehensive list of options available, consult [this documentation](/OPTIONS.md).
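Both logging switches can live side by side in the environment file:

```bash
# Add to config/config.env; both variable names are given in this README.
export SIMPLETUNER_LOG_LEVEL=DEBUG                  # verbose debug logging
export SIMPLETUNER_TRAINING_LOOP_LOG_LEVEL=DEBUG    # timestamped training-loop timing
```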