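Trying DreamBooth with Diffusers (fine-tuning Stable Diffusion v1.4)

Introduction

This post fine-tunes Stable Diffusion v1.4 with a technique called DreamBooth.

I wanted to run it locally, so I tried a range of settings to find out which ones actually work.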

Environment setup

PC environment

This is the PC I used.
It is a model equipped with an RTX 3080 Laptop GPU (16 GB VRAM).

bitsandbytes does not run on Windows, so I used WSL2.

Ubuntu 20.04 on WSL2 (Windows 11)
CUDA 11.6.2
Python 3.8.10

Setting up the Python environment

pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
pip install git+https://github.com/huggingface/diffusers.git
pip install git+https://github.com/huggingface/transformers.git
pip install accelerate==0.15.0 scipy==1.10.0 datasets==2.8.0 ftfy==6.1.1 tensorboard==2.11.2
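
Once the packages are installed, it can help to confirm which versions are actually active and how much VRAM PyTorch sees. The following is a minimal sketch, not part of the original setup; the helper name `describe_environment` is made up for illustration, and it falls back gracefully if torch is not importable.

```python
import importlib
import platform


def describe_environment():
    """Collect the versions that matter for this setup (Python, torch, CUDA)."""
    info = {"python": platform.python_version()}
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # torch is not installed; report that instead of failing.
        info["torch"] = None
        return info
    info["torch"] = torch.__version__
    info["cuda_build"] = torch.version.cuda  # CUDA version torch was built against
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        props = torch.cuda.get_device_properties(0)
        info["gpu"] = props.name
        info["vram_gib"] = round(props.total_memory / 1024**3, 2)
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```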

Downloading the Python script

I used the train_dreambooth.py script linked here.

Configuration

accelerate config
------------------------------------------------------------------------------------------------------------------------
In which compute environment are you running?
This machine
------------------------------------------------------------------------------------------------------------------------
Which type of machine are you using?
No distributed training
Do you want to run your training on CPU only (even if a GPU is available)? [yes/NO]:NO
Do you wish to optimize your script with torch dynamo?[yes/NO]:NO
Do you want to use DeepSpeed? [yes/NO]: NO
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:all
------------------------------------------------------------------------------------------------------------------------
Do you wish to use FP16 or BF16 (mixed precision)?
bf16

On the RTX 3080, memory usage did not differ much between FP16 and BF16.

Without prior-preservation loss

no use_8bit_adam

no gradient_checkpointing, no set_grads_to_none

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="stable-diffusion-v1-4" \
  --instance_data_dir="RoboData" \
  --output_dir="dreambooth_robo" \
  --instance_prompt="a photo of sks robo" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --checkpointing_steps=100 \
  --max_train_steps=200
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 16.00 GiB total capacity; 14.08 GiB already allocated; 0 bytes free; 15.07 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.

Failed (out of memory).
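
The error message itself reports how much memory was requested, allocated, and reserved. As a rough aid for comparing the failed runs, the sketch below pulls those numbers out of the message with a regular expression. The pattern is written against the message format shown above and is an assumption about that format, not a PyTorch API; `parse_oom` is a hypothetical helper name.

```python
import re

# Matches the numeric fields of the OOM message as it appears in this article.
OOM_PATTERN = re.compile(
    r"Tried to allocate (?P<requested>[\d.]+) (?P<unit>MiB|GiB) "
    r"\(GPU \d+; (?P<total>[\d.]+) GiB total capacity; "
    r"(?P<allocated>[\d.]+) GiB already allocated; .*? free; "
    r"(?P<reserved>[\d.]+) GiB reserved"
)


def parse_oom(message):
    """Return the memory figures (in GiB) from an OOM message, or None."""
    m = OOM_PATTERN.search(message)
    if m is None:
        return None
    requested = float(m["requested"])
    if m["unit"] == "MiB":
        requested /= 1024  # convert MiB to GiB
    allocated = float(m["allocated"])
    reserved = float(m["reserved"])
    return {
        "requested_gib": requested,
        "total_gib": float(m["total"]),
        "allocated_gib": allocated,
        "reserved_gib": reserved,
        # Memory held by PyTorch's caching allocator but not used by tensors.
        "cached_unused_gib": round(reserved - allocated, 2),
    }


message = (
    "torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate "
    "512.00 MiB (GPU 0; 16.00 GiB total capacity; 14.08 GiB already "
    "allocated; 0 bytes free; 15.07 GiB reserved in total by PyTorch)"
)
print(parse_oom(message))
```

In the run above the gap between reserved and allocated memory is only about 1 GiB, so fragmentation (the case the `max_split_size_mb` hint addresses) is probably not the main cause; the model simply does not fit with these settings.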

with gradient_checkpointing, no set_grads_to_none

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="stable-diffusion-v1-4" \
  --instance_data_dir="RoboData" \
  --output_dir="dreambooth_robo" \
  --instance_prompt="a photo of sks robo" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --checkpointing_steps=100 \
  --max_train_steps=200 \
  --gradient_checkpointing
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 16.00 GiB total capacity; 14.08 GiB already allocated; 0 bytes free; 15.07 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.

Failed (out of memory).

no gradient_checkpointing, with set_grads_to_none

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="stable-diffusion-v1-4" \
  --instance_data_dir="RoboData" \
  --output_dir="dreambooth_robo" \
  --instance_prompt="a photo of sks robo" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --checkpointing_steps=100 \
  --max_train_steps=200 \
  --set_grads_to_none
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 16.00 GiB total capacity; 14.05 GiB already allocated; 0 bytes free; 14.90 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.

Failed (out of memory).

with gradient_checkpointing, with set_grads_to_none

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="stable-diffusion-v1-4" \
  --instance_data_dir="RoboData" \
  --output_dir="dreambooth_robo" \
  --instance_prompt="a photo of sks robo" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --checkpointing_steps=100 \
  --max_train_steps=200 \
  --gradient_checkpointing \
  --set_grads_to_none
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 16.00 GiB total capacity; 14.05 GiB already allocated; 0 bytes free; 14.90 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.

Failed (out of memory).

with use_8bit_adam

Install bitsandbytes, then add the --use_8bit_adam flag.

pip install bitsandbytes-cuda116==0.26.0.post2
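
To see why an 8-bit optimizer can matter on a 16 GB GPU, a back-of-the-envelope estimate of Adam's state size helps. The sketch below assumes roughly 860 million trainable UNet parameters for Stable Diffusion v1.x and two optimizer states per parameter (4 bytes each in 32-bit, 1 byte each in 8-bit). Both figures are approximations, and the bookkeeping overhead of quantization is ignored.

```python
GIB = 1024**3

# Assumption: approximate UNet parameter count for Stable Diffusion v1.x.
UNET_PARAMS = 860_000_000
# Adam keeps two running statistics (first and second moment) per parameter.
STATES_PER_PARAM = 2


def optimizer_state_gib(num_params, bytes_per_state):
    """Estimate the memory taken by Adam's states, in GiB."""
    return num_params * STATES_PER_PARAM * bytes_per_state / GIB


adam_32bit = optimizer_state_gib(UNET_PARAMS, 4)  # 32-bit states
adam_8bit = optimizer_state_gib(UNET_PARAMS, 1)   # 8-bit states

print(f"32-bit Adam states: {adam_32bit:.2f} GiB")
print(f" 8-bit Adam states: {adam_8bit:.2f} GiB")
print(f"estimated saving:   {adam_32bit - adam_8bit:.2f} GiB")
```

Under these assumptions the optimizer states shrink from roughly 6.4 GiB to roughly 1.6 GiB, which is a large share of the 16 GB available.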

no gradient_checkpointing, no set_grads_to_none

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="stable-diffusion-v1-4" \
  --instance_data_dir="RoboData" \
  --output_dir="dreambooth_robo" \
  --instance_prompt="a photo of sks robo" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --checkpointing_steps=100 \
  --max_train_steps=200 \
  --use_8bit_adam
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 16.00 GiB total capacity; 14.38 GiB already allocated; 0 bytes free; 14.81 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.

Failed (out of memory).

with gradient_checkpointing, no set_grads_to_none

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="stable-diffusion-v1-4" \
  --instance_data_dir="RoboData" \
  --output_dir="dreambooth_robo" \
  --instance_prompt="a photo of sks robo" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --checkpointing_steps=100 \
  --max_train_steps=200 \
  --use_8bit_adam \
  --gradient_checkpointing

This ran successfully.

no gradient_checkpointing, with set_grads_to_none

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="stable-diffusion-v1-4" \
  --instance_data_dir="RoboData" \
  --output_dir="dreambooth_robo" \
  --instance_prompt="a photo of sks robo" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --checkpointing_steps=100 \
  --max_train_steps=200 \
  --use_8bit_adam \
  --set_grads_to_none

This ran successfully.



The results so far show that bitsandbytes is essential on this 16 GB GPU.

They also show that both --gradient_checkpointing and --set_grads_to_none help reduce VRAM usage.

--gradient_checkpointing seems to be more effective than --set_grads_to_none.
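
The outcomes of the runs without prior-preservation loss can be collected in one place. The sketch below only restates the results reported above as a small lookup table; the names `RESULTS`, `format_table`, and `runnable` are made up for illustration.

```python
# (use_8bit_adam, gradient_checkpointing, set_grads_to_none) -> reported outcome
RESULTS = {
    (False, False, False): "OOM",
    (False, True, False): "OOM",
    (False, False, True): "OOM",
    (False, True, True): "OOM",
    (True, False, False): "OOM",
    (True, True, False): "OK",
    (True, False, True): "OK",
}


def format_table(results):
    """Render the results as a fixed-width text table."""
    rows = [f"{'8bit_adam':<11}{'grad_ckpt':<11}{'grads_none':<12}result"]
    for (adam8, ckpt, grads_none), outcome in results.items():
        rows.append(
            f"{str(adam8):<11}{str(ckpt):<11}{str(grads_none):<12}{outcome}"
        )
    return "\n".join(rows)


def runnable(results):
    """Return the flag combinations that trained without running out of memory."""
    return [flags for flags, outcome in results.items() if outcome == "OK"]


print(format_table(RESULTS))
```

Every combination that ran has `use_8bit_adam` enabled together with at least one of the other two flags.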

With prior-preservation loss

Building on the findings so far, I tried training with prior-preservation loss.

with use_8bit_adam

with gradient_checkpointing, no set_grads_to_none

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="stable-diffusion-v1-4" \
  --instance_data_dir="RoboData" \
  --output_dir="dreambooth_robo" \
  --instance_prompt="a photo of sks robo" \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --class_data_dir="robo-class-images" \
  --class_prompt="a photo of robo" \
  --num_class_images=200 \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --checkpointing_steps=100 \
  --max_train_steps=200 \
  --use_8bit_adam \
  --gradient_checkpointing

This ran successfully.

no gradient_checkpointing, with set_grads_to_none

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="stable-diffusion-v1-4" \
  --instance_data_dir="RoboData" \
  --output_dir="dreambooth_robo" \
  --instance_prompt="a photo of sks robo" \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --class_data_dir="robo-class-images" \
  --class_prompt="a photo of robo" \
  --num_class_images=200 \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --checkpointing_steps=100 \
  --max_train_steps=200 \
  --use_8bit_adam \
  --set_grads_to_none
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 0; 16.00 GiB total capacity; 13.80 GiB already allocated; 0 bytes free; 14.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.

Failed (out of memory).

with gradient_checkpointing, with set_grads_to_none

--gradient_checkpointing and --set_grads_to_none can also be used together. The result with both enabled is shown here.
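
The command for this combination is the prior-preservation command from above with both flags appended. As a sketch only, the argument list can be assembled in Python so that flag combinations are easy to vary; `BASE_ARGS` and `build_command` are hypothetical names, and the values are copied from the commands in this article.

```python
import shlex

# Arguments shared by the prior-preservation runs in this article.
BASE_ARGS = [
    "accelerate", "launch", "train_dreambooth.py",
    "--pretrained_model_name_or_path=stable-diffusion-v1-4",
    "--instance_data_dir=RoboData",
    "--output_dir=dreambooth_robo",
    "--instance_prompt=a photo of sks robo",
    "--with_prior_preservation", "--prior_loss_weight=1.0",
    "--class_data_dir=robo-class-images",
    "--class_prompt=a photo of robo",
    "--num_class_images=200",
    "--resolution=512",
    "--train_batch_size=1",
    "--gradient_accumulation_steps=1",
    "--learning_rate=5e-6",
    "--lr_scheduler=constant",
    "--lr_warmup_steps=0",
    "--checkpointing_steps=100",
    "--max_train_steps=200",
    "--use_8bit_adam",
]


def build_command(extra_flags):
    """Return the full argument list with the given memory-saving flags appended."""
    return BASE_ARGS + list(extra_flags)


cmd = build_command(["--gradient_checkpointing", "--set_grads_to_none"])
# shlex.join quotes arguments containing spaces, so the line can be pasted into a shell.
print(shlex.join(cmd))
```

The list form can also be passed directly to `subprocess.run(cmd, check=True)` if launching from Python is preferred.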

Fine-tuning the text encoder

Finally, I also tried fine-tuning the text encoder.

with use_8bit_adam

with gradient_checkpointing, no set_grads_to_none

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="stable-diffusion-v1-4" \
  --instance_data_dir="RoboData" \
  --output_dir="dreambooth_robo4" \
  --instance_prompt="a photo of sks robo" \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --class_data_dir="robo-class-images" \
  --class_prompt="a photo of robo" \
  --num_class_images=200 \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --checkpointing_steps=100 \
  --max_train_steps=200 \
  --use_8bit_adam \
  --gradient_checkpointing \
  --train_text_encoder
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 0; 16.00 GiB total capacity; 12.90 GiB already allocated; 0 bytes free; 14.70 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.

Failed (out of memory).

with gradient_checkpointing, with set_grads_to_none

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="stable-diffusion-v1-4" \
  --instance_data_dir="RoboData" \
  --output_dir="dreambooth_robo" \
  --instance_prompt="a photo of sks robo" \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --class_data_dir="robo-class-images" \
  --class_prompt="a photo of robo" \
  --num_class_images=200 \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --checkpointing_steps=100 \
  --max_train_steps=200 \
  --use_8bit_adam \
  --set_grads_to_none \
  --gradient_checkpointing \
  --train_text_encoder

This ran successfully.

Results

The results with both prior-preservation loss and text encoder fine-tuning are posted here.
touch-sp.hatenablog.com

Using xFormers

I wrote a separate article about this.
touch-sp.hatenablog.com

The problem of xFormers not working

An issue has been reported where training does not progress when xformers is used. (This has probably been resolved.)
github.com
github.com

Using DeepSpeed

I wrote a separate article about this.
touch-sp.hatenablog.com

Inference

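import argparse
import os

import torch
from diffusers import StableDiffusionPipeline

parser = argparse.ArgumentParser()
parser.add_argument(
    '--model',
    required=True,
    type=str,
    help='path or id of the fine-tuned model'
)
parser.add_argument(
    '--seed',
    type=int,
    default=200,
    help='base seed'
)
opt = parser.parse_args()

model_id = opt.model
# Load the fine-tuned model in half precision with the safety checker disabled.
pipe = StableDiffusionPipeline.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    safety_checker=None).to("cuda")

prompt = "A photo of sks robo on the beach"

seed = opt.seed

# Use only the last path component in file names so that a model path
# containing slashes does not end up inside the output file name.
out_prefix = os.path.basename(os.path.normpath(model_id))

# Generate four images with consecutive seeds so each one is reproducible.
for i in range(4):
    new_seed = seed + i
    generator = torch.Generator(device="cuda").manual_seed(new_seed)
    image = pipe(
        prompt=prompt,
        num_inference_steps=50,
        generator=generator,
        num_images_per_prompt=1).images[0]
    image.save(f'{out_prefix}_{new_seed}.png')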

Official tutorials

huggingface.co
github.com
huggingface.co
