Trying DreamBooth with Diffusers (fine-tuning Stable Diffusion v1.4)

Introduction
Environment setup
- PC environment
- Setting up the Python environment
Downloading the Python script
Configuration
Without prior-preservation loss
- no use_8bit_adam
- with use_8bit_adam
With prior-preservation loss
- with use_8bit_adam
Fine-tuning the text encoder
- with use_8bit_adam
  - with gradient_checkpointing, no set_grads_to_none
  - with gradient_checkpointing, with set_grads_to_none
Results
I want to use xFormers
The problem of xFormers being unusable
I want to use DeepSpeed
Inference
Official tutorials
Related articles

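Introduction

I fine-tune Stable Diffusion v1.4 using a technique called DreamBooth.

I wanted to run it locally, so I tried various settings to find out which ones actually work.

Environment setup

PC environment

This is the PC I used.
It is a model equipped with an RTX 3080 Laptop GPU (16 GB VRAM).

bitsandbytes does not work on Windows, so I am using WSL2.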

Ubuntu 20.04 on WSL2 (Windows 11)
CUDA 11.6.2
Python 3.8.10

Setting up the Python environment

pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
pip install git+https://github.com/huggingface/diffusers.git
pip install git+https://github.com/huggingface/transformers.git
pip install accelerate==0.15.0 scipy==1.10.0 datasets==2.8.0 ftfy==6.1.1 tensorboard==2.11.2

Downloading the Python script

I used the script from here.

Configuration

accelerate config

------------------------------------------------------------------------------------------------------------------------
In which compute environment are you running?
This machine
------------------------------------------------------------------------------------------------------------------------
Which type of machine are you using?
No distributed training
Do you want to run your training on CPU only (even if a GPU is available)? [yes/NO]:NO
Do you wish to optimize your script with torch dynamo?[yes/NO]:NO
Do you want to use DeepSpeed? [yes/NO]: NO
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:all
------------------------------------------------------------------------------------------------------------------------
Do you wish to use FP16 or BF16 (mixed precision)?
bf16

On my RTX 3080 setup, memory usage was about the same with FP16 and BF16.

Without prior-preservation loss

no use_8bit_adam

no gradient_checkpointing, no set_grads_to_none

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="stable-diffusion-v1-4" \
  --instance_data_dir="RoboData" \
  --output_dir="dreambooth_robo" \
  --instance_prompt="a photo of sks robo" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --checkpointing_steps=100 \
  --max_train_steps=200

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 16.00 GiB total capacity; 14.08 GiB already allocated; 0 bytes free; 15.07 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.

No good!
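Since the same kind of error message shows up in most of the runs below, it can help to pull the numbers out of it for comparison. The following is a minimal sketch, assuming the message format quoted above; `parse_oom` and `OOM_PATTERN` are hypothetical names introduced here, not part of PyTorch.

```python
import re

# Pattern for the CUDA OOM message format quoted above (an assumption:
# other PyTorch versions may word the message differently).
OOM_PATTERN = re.compile(
    r"Tried to allocate (?P<requested>[\d.]+) (?P<requested_unit>MiB|GiB) "
    r"\(GPU \d+; (?P<total>[\d.]+) GiB total capacity; "
    r"(?P<allocated>[\d.]+) GiB already allocated; "
    r".*?; (?P<reserved>[\d.]+) GiB reserved in total by PyTorch\)"
)


def parse_oom(message: str) -> dict:
    """Return the sizes (in GiB) found in a CUDA OOM message."""
    match = OOM_PATTERN.search(message)
    if match is None:
        raise ValueError("not a recognised CUDA OOM message")
    requested = float(match["requested"])
    if match["requested_unit"] == "MiB":
        requested /= 1024
    allocated = float(match["allocated"])
    reserved = float(match["reserved"])
    return {
        "requested_gib": requested,
        "total_gib": float(match["total"]),
        "allocated_gib": allocated,
        "reserved_gib": reserved,
        # Memory held by the caching allocator but not used by tensors.
        "cached_unused_gib": round(reserved - allocated, 2),
    }


message = (
    "torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB "
    "(GPU 0; 16.00 GiB total capacity; 14.08 GiB already allocated; 0 bytes free; "
    "15.07 GiB reserved in total by PyTorch)"
)
info = parse_oom(message)
print(info)
```

The gap between reserved and allocated memory is what the message's hint about `max_split_size_mb` refers to. Here it is about 1 GiB, so fragmentation alone probably does not explain the failure.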

with gradient_checkpointing, no set_grads_to_none

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="stable-diffusion-v1-4" \
  --instance_data_dir="RoboData" \
  --output_dir="dreambooth_robo" \
  --instance_prompt="a photo of sks robo" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --checkpointing_steps=100 \
  --max_train_steps=200 \
  --gradient_checkpointing

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 16.00 GiB total capacity; 14.08 GiB already allocated; 0 bytes free; 15.07 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.

No good!

no gradient_checkpointing, with set_grads_to_none

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="stable-diffusion-v1-4" \
  --instance_data_dir="RoboData" \
  --output_dir="dreambooth_robo" \
  --instance_prompt="a photo of sks robo" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --checkpointing_steps=100 \
  --max_train_steps=200 \
  --set_grads_to_none

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 16.00 GiB total capacity; 14.05 GiB already allocated; 0 bytes free; 14.90 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.

No good!

with gradient_checkpointing, with set_grads_to_none

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="stable-diffusion-v1-4" \
  --instance_data_dir="RoboData" \
  --output_dir="dreambooth_robo" \
  --instance_prompt="a photo of sks robo" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --checkpointing_steps=100 \
  --max_train_steps=200 \
  --gradient_checkpointing \
  --set_grads_to_none

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 16.00 GiB total capacity; 14.05 GiB already allocated; 0 bytes free; 14.90 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.

No good!
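The four runs above differ only in whether `--gradient_checkpointing` and `--set_grads_to_none` are passed. As a rough sketch, the combinations could be generated rather than typed by hand. `BASE_ARGS`, `OPTIONAL_FLAGS`, and `build_commands` are hypothetical names introduced for illustration; the arguments themselves are copied from the commands above.

```python
import itertools
import shlex

# Arguments shared by every run in this section (copied from the commands above).
BASE_ARGS = [
    "accelerate", "launch", "train_dreambooth.py",
    "--pretrained_model_name_or_path=stable-diffusion-v1-4",
    "--instance_data_dir=RoboData",
    "--output_dir=dreambooth_robo",
    "--instance_prompt=a photo of sks robo",
    "--resolution=512",
    "--train_batch_size=1",
    "--gradient_accumulation_steps=1",
    "--learning_rate=5e-6",
    "--lr_scheduler=constant",
    "--lr_warmup_steps=0",
    "--checkpointing_steps=100",
    "--max_train_steps=200",
]

# The two memory-related switches compared in this section.
OPTIONAL_FLAGS = ["--gradient_checkpointing", "--set_grads_to_none"]


def build_commands(base_args, optional_flags):
    """Yield one argument list per on/off combination of the optional flags."""
    for switches in itertools.product([False, True], repeat=len(optional_flags)):
        extra = [flag for flag, on in zip(optional_flags, switches) if on]
        yield base_args + extra


commands = list(build_commands(BASE_ARGS, OPTIONAL_FLAGS))
for command in commands:
    # shlex.join quotes arguments that contain spaces, such as the prompt.
    print(shlex.join(command))
```

Each printed line can be pasted into the shell, or the argument list can be handed to `subprocess.run(command, check=True)` to launch the run directly.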

with use_8bit_adam

Install bitsandbytes, then pass --use_8bit_adam.

pip install bitsandbytes-cuda116==0.26.0.post2
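To get a feel for why an 8-bit optimizer matters on a 16 GB GPU, here is a back-of-the-envelope estimate of Adam's state memory. It is only a sketch. The parameter count (about 860 million for the UNet) is an assumption rather than something measured in this article, and the calculation ignores the bookkeeping overhead of the 8-bit optimizer.

```python
# Rough optimizer-state estimate. The parameter count is an assumption
# (roughly 860 million for the Stable Diffusion v1 UNet), not a measurement.
UNET_PARAMS = 860_000_000
ADAM_STATES_PER_PARAM = 2  # first and second moment estimates


def adam_state_gib(num_params: int, bytes_per_state: int) -> float:
    """Memory in GiB for Adam's per-parameter states."""
    return num_params * ADAM_STATES_PER_PARAM * bytes_per_state / 1024**3


state_32bit = adam_state_gib(UNET_PARAMS, bytes_per_state=4)
state_8bit = adam_state_gib(UNET_PARAMS, bytes_per_state=1)
print(f"32-bit Adam states: {state_32bit:.2f} GiB")
print(f" 8-bit Adam states: {state_8bit:.2f} GiB")
print(f"difference:         {state_32bit - state_8bit:.2f} GiB")
```

Under these assumptions the optimizer states shrink to a quarter of their size, a saving of several GiB on a card that was only slightly short of memory in the runs above.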

no gradient_checkpointing, no set_grads_to_none

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="stable-diffusion-v1-4" \
  --instance_data_dir="RoboData" \
  --output_dir="dreambooth_robo" \
  --instance_prompt="a photo of sks robo" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --checkpointing_steps=100 \
  --max_train_steps=200 \
  --use_8bit_adam

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 16.00 GiB total capacity; 14.38 GiB already allocated; 0 bytes free; 14.81 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.

No good.

with gradient_checkpointing, no set_grads_to_none

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="stable-diffusion-v1-4" \
  --instance_data_dir="RoboData" \
  --output_dir="dreambooth_robo" \
  --instance_prompt="a photo of sks robo" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --checkpointing_steps=100 \
  --max_train_steps=200 \
  --use_8bit_adam \
  --gradient_checkpointing

This ran successfully.

no gradient_checkpointing, with set_grads_to_none

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="stable-diffusion-v1-4" \
  --instance_data_dir="RoboData" \
  --output_dir="dreambooth_robo" \
  --instance_prompt="a photo of sks robo" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --checkpointing_steps=100 \
  --max_train_steps=200 \
  --use_8bit_adam \
  --set_grads_to_none

This ran successfully.

At this point it is clear that bitsandbytes is essential in this environment.

I also found that both --gradient_checkpointing and --set_grads_to_none help reduce VRAM usage.

--gradient_checkpointing seems to be more effective than --set_grads_to_none.

With prior-preservation loss

Based on the findings so far, I will now try prior-preservation loss.
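For readers unfamiliar with the term, prior-preservation loss adds a second loss computed on generated class images ("a photo of robo") to the loss on the instance images ("a photo of sks robo"), weighted by `--prior_loss_weight`. The sketch below shows that combination with plain Python lists standing in for tensors. `mse` and `dreambooth_loss` are hypothetical helper names, and this simplified illustration is not the training script's actual code.

```python
def mse(pred, target):
    """Mean squared error over two equal-length sequences of floats."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def dreambooth_loss(instance_pred, instance_target,
                    class_pred, class_target, prior_loss_weight=1.0):
    """Instance loss plus the weighted prior-preservation loss."""
    instance_loss = mse(instance_pred, instance_target)
    prior_loss = mse(class_pred, class_target)
    return instance_loss + prior_loss_weight * prior_loss


# Toy numbers standing in for noise predictions and their targets.
loss = dreambooth_loss(
    instance_pred=[0.5, 1.0], instance_target=[0.0, 1.0],
    class_pred=[1.0, 3.0], class_target=[1.0, 1.0],
    prior_loss_weight=1.0,
)
print(loss)
```

As far as I understand it, each training step then processes a class image alongside each instance image, so more activations are held in memory than in the runs without prior preservation.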

with use_8bit_adam

with gradient_checkpointing, no set_grads_to_none

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="stable-diffusion-v1-4" \
  --instance_data_dir="RoboData" \
  --output_dir="dreambooth_robo" \
  --instance_prompt="a photo of sks robo" \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --class_data_dir="robo-class-images" \
  --class_prompt="a photo of robo" \
  --num_class_images=200 \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --checkpointing_steps=100 \
  --max_train_steps=200 \
  --use_8bit_adam \
  --gradient_checkpointing

This ran successfully.

no gradient_checkpointing, with set_grads_to_none

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="stable-diffusion-v1-4" \
  --instance_data_dir="RoboData" \
  --output_dir="dreambooth_robo" \
  --instance_prompt="a photo of sks robo" \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --class_data_dir="robo-class-images" \
  --class_prompt="a photo of robo" \
  --num_class_images=200 \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --checkpointing_steps=100 \
  --max_train_steps=200 \
  --use_8bit_adam \
  --set_grads_to_none

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 0; 16.00 GiB total capacity; 13.80 GiB already allocated; 0 bytes free; 14.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.

No good!

with gradient_checkpointing, with set_grads_to_none

It is also possible to use --gradient_checkpointing and --set_grads_to_none together. This is what it looks like with both enabled.

Fine-tuning the text encoder

Finally, I also tried fine-tuning the text encoder.

with use_8bit_adam

with gradient_checkpointing, no set_grads_to_none

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="stable-diffusion-v1-4" \
  --instance_data_dir="RoboData" \
  --output_dir="dreambooth_robo4" \
  --instance_prompt="a photo of sks robo" \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --class_data_dir="robo-class-images" \
  --class_prompt="a photo of robo" \
  --num_class_images=200 \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --checkpointing_steps=100 \
  --max_train_steps=200 \
  --use_8bit_adam \
  --gradient_checkpointing \
  --train_text_encoder

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 0; 16.00 GiB total capacity; 12.90 GiB already allocated; 0 bytes free; 14.70 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.

No good!

with gradient_checkpointing, with set_grads_to_none

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="stable-diffusion-v1-4" \
  --instance_data_dir="RoboData" \
  --output_dir="dreambooth_robo" \
  --instance_prompt="a photo of sks robo" \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --class_data_dir="robo-class-images" \
  --class_prompt="a photo of robo" \
  --num_class_images=200 \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --checkpointing_steps=100 \
  --max_train_steps=200 \
  --use_8bit_adam \
  --set_grads_to_none \
  --gradient_checkpointing \
  --train_text_encoder

This ran successfully.
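The outcomes reported above can be collected into one small table for reference. The sketch below encodes only the runs for which this article states an explicit outcome (success or CUDA out of memory) on the 16 GB GPU. `RESULTS`, `FLAG_NAMES`, and `describe` are hypothetical names used for illustration.

```python
# Flag order used in the keys below.
FLAG_NAMES = (
    "with_prior_preservation",
    "train_text_encoder",
    "use_8bit_adam",
    "gradient_checkpointing",
    "set_grads_to_none",
)

# Outcomes reported in this article (True = ran, False = CUDA out of memory).
RESULTS = {
    (False, False, False, False, False): False,
    (False, False, False, True, False): False,
    (False, False, False, False, True): False,
    (False, False, False, True, True): False,
    (False, False, True, False, False): False,
    (False, False, True, True, False): True,
    (False, False, True, False, True): True,
    (True, False, True, True, False): True,
    (True, False, True, False, True): False,
    (True, True, True, True, False): False,
    (True, True, True, True, True): True,
}


def describe(config):
    """Return a readable list of the flags enabled in a configuration."""
    enabled = [name for name, on in zip(FLAG_NAMES, config) if on]
    return ", ".join(enabled) if enabled else "(no optional flags)"


runnable = [describe(config) for config, ok in RESULTS.items() if ok]
for line in runnable:
    print(line)
```

Every configuration that ran uses `use_8bit_adam`, and the one that also trains the text encoder needed both `gradient_checkpointing` and `set_grads_to_none`.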

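Results

The results with prior-preservation loss and text encoder fine-tuning are posted here.
touch-sp.hatenablog.com

I want to use xFormers

I wrote a separate article.
touch-sp.hatenablog.com

The problem of xFormers being unusable

~~It has been reported that training does not make progress when xformers is used.~~ (← probably resolved)
github.com
github.com

I want to use DeepSpeed

I wrote a separate article.
touch-sp.hatenablog.com

Inference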

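import argparse
from pathlib import Path

import torch
from diffusers import StableDiffusionPipeline

parser = argparse.ArgumentParser()
parser.add_argument(
    '--model',
    required=True,
    type=str,
    help='model id or path to the fine-tuned output directory'
)
parser.add_argument(
    '--seed',
    type=int,
    default=200,
    help='seed for the first image'
)
opt = parser.parse_args()

model_id = opt.model
# Load the fine-tuned pipeline in half precision with the safety checker disabled.
pipe = StableDiffusionPipeline.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    safety_checker=None).to("cuda")

prompt = "A photo of sks robo on the beach"

seed = opt.seed
# Use only the last path component so a model path containing "/" still
# produces a valid file name.
output_prefix = Path(model_id).name

# Generate four images with consecutive seeds so each one can be reproduced.
for i in range(4):
    new_seed = seed + i
    generator = torch.Generator(device="cuda").manual_seed(new_seed)
    image = pipe(
        prompt=prompt,
        num_inference_steps=50,
        generator=generator,
        num_images_per_prompt=1).images[0]
    image.save(f'{output_prefix}_{new_seed}.png')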

Official tutorials

huggingface.co
github.com
huggingface.co

Related articles

touch-sp.hatenablog.com
touch-sp.hatenablog.com
