I heard that LoRA training for SDXL 1.0 (Stable Diffusion XL 1.0), that is, DreamBooth fine-tuning via LoRA, can be done with 16 GB of VRAM. So of course I had to try it.

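Introduction

This training method is introduced as "DreamBooth fine-tuning of the SDXL UNet via LoRA".

It appears to differ from what is usually called LoRA training.

If it runs in 16 GB, I assume that means it can run on Google Colab.

I took the opportunity to finally put my underused RTX 4090 to work.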
touch-sp.hatenablog.com

Environment

The bitsandbytes library is used to reduce VRAM usage.

bitsandbytes was not usable on Windows at the time of writing, so I used WSL2.

Ubuntu 22.04 on WSL2
Python 3.10
CUDA 11.8

pip install torch==2.0.1+cu118 torchvision==0.15.2+cu118 --index-url https://download.pytorch.org/whl/cu118
pip install git+https://github.com/huggingface/diffusers
pip install accelerate transformers ftfy tensorboard Jinja2 xformers==0.0.22 bitsandbytes scipy
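
Before moving on, it can help to confirm that the packages installed above are visible to the Python interpreter inside WSL2. The following is a minimal sketch that uses only the standard library; the helper name `check_packages` and the `REQUIRED` list are my own illustration and not part of any of these libraries.

```python
from importlib.util import find_spec

# Import names of the packages installed by the pip commands above.
# (This list and the function name are hypothetical, for illustration only.)
REQUIRED = ["torch", "torchvision", "diffusers", "accelerate",
            "transformers", "xformers", "bitsandbytes"]

def check_packages(names):
    """Return {name: True/False} depending on whether the package can be found."""
    return {name: find_spec(name) is not None for name in names}

if __name__ == "__main__":
    for name, found in check_packages(REQUIRED).items():
        print(f"{name:15s} {'OK' if found else 'MISSING'}")
```

This only checks that each package can be located. It does not verify that CUDA is actually usable.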

Downloading the sample images

from huggingface_hub import snapshot_download

local_dir = "./dog"
snapshot_download(
    "diffusers/dog-example",
    local_dir=local_dir, repo_type="dataset",
    ignore_patterns=".gitattributes",
)

This downloads five images of a dog.
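
To check that the download produced the expected number of images, a small sketch like the one below can list the image files in the folder. The helper `list_images` and the set of file extensions are assumptions of mine; adjust them to the actual file types in the dataset.

```python
from pathlib import Path

# File extensions treated as images (an assumption; extend as needed).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}

def list_images(folder):
    """Return the sorted image files located directly inside `folder`."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )

if __name__ == "__main__":
    dog_dir = Path("./dog")  # same folder as local_dir in the download script
    if dog_dir.is_dir():
        images = list_images(dog_dir)
        print(f"{len(images)} image(s) found")
        for path in images:
            print(" ", path.name)
```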

Configuring accelerate

$ accelerate config
------------------------------------------------------------------------------------------------------------------------
In which compute environment are you running?
This machine
------------------------------------------------------------------------------------------------------------------------
Which type of machine are you using?
No distributed training
Do you want to run your training on CPU only (even if a GPU / Apple Silicon device is available)? [yes/NO]:NO
Do you wish to optimize your script with torch dynamo?[yes/NO]:NO
Do you want to use DeepSpeed? [yes/NO]: NO
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:all
------------------------------------------------------------------------------------------------------------------------
Do you wish to use FP16 or BF16 (mixed precision)?
fp16

Running the training

"stable-diffusion-xl-base-1.0" has already been downloaded to a local directory.

accelerate launch train_dreambooth_lora_sdxl.py \
  --pretrained_model_name_or_path="stable-diffusion-xl-base-1.0"  \
  --instance_data_dir="dog" \
  --pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" \
  --output_dir="lora-trained-xl" \
  --mixed_precision="fp16" \
  --instance_prompt="a photo of sks dog" \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --learning_rate=1e-4 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=600 \
  --checkpointing_steps=200 \
  --seed="0" \
  --enable_xformers_memory_efficient_attention \
  --use_8bit_adam
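
To get a feel for what these settings mean with only five images, the arithmetic can be sketched as follows. This assumes that one training step corresponds to one optimizer update after gradient accumulation, which is my reading of the script rather than something stated above; the function name `training_budget` is hypothetical.

```python
def training_budget(max_train_steps, train_batch_size, grad_accum_steps, num_images):
    """Rough bookkeeping for the command-line settings above.

    Assumption: each counted step is one optimizer update that consumes
    train_batch_size * grad_accum_steps images.
    """
    effective_batch = train_batch_size * grad_accum_steps
    samples_seen = max_train_steps * effective_batch
    passes_over_data = samples_seen / num_images
    return effective_batch, samples_seen, passes_over_data

# Values from the command above, with the five sample images.
print(training_budget(600, 1, 4, 5))  # (4, 2400, 480.0)
```

Under that assumption, 600 steps means each of the five images is seen about 480 times, which is one reason to compare the intermediate checkpoints.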

Inference (results)

from diffusers import DiffusionPipeline, AutoencoderKL
import torch

checkpoint = 200
#checkpoint = 400
#checkpoint = 600
lora_model_id = f"lora-trained-xl/checkpoint-{checkpoint}"

vae = AutoencoderKL.from_pretrained(
    "madebyollin/sdxl-vae-fp16-fix",
    torch_dtype=torch.float16)

pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-xl-base-1.0",
    vae=vae,
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True).to("cuda")

# The keyword for choosing the weight file is weight_name, not file_name.
pipe.load_lora_weights(lora_model_id, weight_name="pytorch_lora_weights.bin")

seed = 20000
generator = torch.manual_seed(seed)

image = pipe(
    "A picture of a sks dog in a bucket",
    num_inference_steps=25,
    generator=generator).images[0]

image.save(f"result_checkpoint{checkpoint}.png")

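From left to right, the results after 200, 400, and 600 training steps.

Even though the training used only five images, the output is already close to the dog in the training data after a small number of steps.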

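Notes

"--max_train_steps=600" sets the number of training steps.

"--checkpointing_steps=200" saves the training state every 200 steps.

For reasons I have not identified, an OOM (out of memory) error can occur with 16 GB of VRAM when the state is saved at a checkpointing_steps interval.

If that happens, setting checkpointing_steps to a value larger than max_train_steps avoids the problem, because no intermediate checkpoint is written. In that case the checkpoint-200/400/600 directories are not created, so load the final LoRA weights from the output directory (lora-trained-xl) for inference instead.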
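
The effect of these two options can be sketched as below. This assumes a checkpoint is written whenever the step count is a multiple of checkpointing_steps, which matches the checkpoint-200, checkpoint-400, and checkpoint-600 directories used in the inference code; the function name `checkpoint_steps` is hypothetical.

```python
def checkpoint_steps(max_train_steps, checkpointing_steps):
    """List the steps at which an intermediate checkpoint would be saved.

    Assumption: a save happens at every multiple of checkpointing_steps
    up to and including max_train_steps.
    """
    return list(range(checkpointing_steps, max_train_steps + 1, checkpointing_steps))

print(checkpoint_steps(600, 200))   # [200, 400, 600]
print(checkpoint_steps(600, 1000))  # [] (no intermediate saves, so the OOM at save time is avoided)
```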

Follow-up post

touch-sp.hatenablog.com
