- Introduction
- Environment setup
- Downloading the Python script
- Configuration
- Without prior-preservation loss
- With prior-preservation loss
- Fine-tuning the text encoder
- Results
- Using xFormers
- When xFormers does not work
- Using DeepSpeed
- Inference
- Official tutorials
- Related articles
Introduction
In this post I fine-tune Stable Diffusion v1.4 with the DreamBooth method. I wanted to run it locally, so I explored which settings actually work on my machine.
Environment setup
PC environment
The PC I used is the following: a model with an RTX 3080 Laptop GPU (16 GB VRAM).
Because bitsandbytes does not run on Windows, I use WSL2.
Ubuntu 20.04 on WSL2 (Windows 11)
CUDA 11.6.2
Python 3.8.10
Setting up the Python environment
pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
pip install git+https://github.com/huggingface/diffusers.git
pip install git+https://github.com/huggingface/transformers.git
pip install accelerate==0.15.0 scipy==1.10.0 datasets==2.8.0 ftfy==6.1.1 tensorboard==2.11.2
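Before moving on, it may be worth a quick check that the CUDA build of PyTorch can actually see the GPU from inside WSL2. This snippet is my own addition, not part of the original setup:

# Sanity check: the CUDA build of PyTorch is installed and the GPU is visible from WSL2.
import torch

print(torch.__version__)              # expected: 1.13.1+cu116
print(torch.cuda.is_available())      # should be True
print(torch.cuda.get_device_name(0))  # e.g. the RTX 3080 Laptop GPU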
Downloading the Python script
I used the script available here.
Configuration
accelerate config
------------------------------------------------------------------------------------------------------------------------
In which compute environment are you running? This machine
------------------------------------------------------------------------------------------------------------------------
Which type of machine are you using? No distributed training
Do you want to run your training on CPU only (even if a GPU is available)? [yes/NO]:NO
Do you wish to optimize your script with torch dynamo?[yes/NO]:NO
Do you want to use DeepSpeed? [yes/NO]: NO
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:all
------------------------------------------------------------------------------------------------------------------------
Do you wish to use FP16 or BF16 (mixed precision)? bf16
On the RTX 3080 there was not much difference in memory usage between FP16 and BF16.
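For reference, a minimal sketch of what bf16 mixed precision amounts to. accelerate handles this internally based on the config above; this is only an illustration, not part of the training script:

# Illustration only: mixed precision roughly means running the forward pass under autocast.
import torch

conv = torch.nn.Conv2d(4, 4, kernel_size=3, padding=1).cuda()
x = torch.randn(1, 4, 64, 64, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):  # use torch.float16 for fp16
    y = conv(x)

print(y.dtype)  # torch.bfloat16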
prior-preservation lossなし
no use_8bit_adam
no gradient_checkpointing, no set_grads_to_none
accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="stable-diffusion-v1-4" \
  --instance_data_dir="RoboData" \
  --output_dir="dreambooth_robo" \
  --instance_prompt="a photo of sks robo" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --checkpointing_steps=100 \
  --max_train_steps=200
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 16.00 GiB total capacity; 14.08 GiB already allocated; 0 bytes free; 15.07 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
No good!
with gradient_checkpointing, no set_grads_to_none
accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="stable-diffusion-v1-4" \
  --instance_data_dir="RoboData" \
  --output_dir="dreambooth_robo" \
  --instance_prompt="a photo of sks robo" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --checkpointing_steps=100 \
  --max_train_steps=200 \
  --gradient_checkpointing
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 16.00 GiB total capacity; 14.08 GiB already allocated; 0 bytes free; 15.07 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
No good!
no gradient_checkpointing, with set_grads_to_none
accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="stable-diffusion-v1-4" \
  --instance_data_dir="RoboData" \
  --output_dir="dreambooth_robo" \
  --instance_prompt="a photo of sks robo" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --checkpointing_steps=100 \
  --max_train_steps=200 \
  --set_grads_to_none
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 16.00 GiB total capacity; 14.05 GiB already allocated; 0 bytes free; 14.90 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
No good!
with gradient_checkpointing, with set_grads_to_none
accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="stable-diffusion-v1-4" \
  --instance_data_dir="RoboData" \
  --output_dir="dreambooth_robo" \
  --instance_prompt="a photo of sks robo" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --checkpointing_steps=100 \
  --max_train_steps=200 \
  --gradient_checkpointing \
  --set_grads_to_none
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 16.00 GiB total capacity; 14.05 GiB already allocated; 0 bytes free; 14.90 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
No good!
with use_8bit_adam
Install bitsandbytes, then use --use_8bit_adam.
pip install bitsandbytes-cuda116==0.26.0.post2
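As I understand it, --use_8bit_adam swaps the standard AdamW optimizer for a bitsandbytes 8-bit Adam variant, which keeps the optimizer state in 8 bits and therefore needs far less VRAM. A minimal sketch of using such an optimizer directly (illustration only; the exact class the training script picks may differ depending on the bitsandbytes version):

# Sketch: an 8-bit bitsandbytes optimizer in place of torch.optim.AdamW (illustration only).
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(768, 768).cuda()
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=5e-6)  # 8-bit optimizer state

loss = model(torch.randn(4, 768, device="cuda")).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()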
no gradient_checkpointing, no set_grads_to_none
accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="stable-diffusion-v1-4" \
  --instance_data_dir="RoboData" \
  --output_dir="dreambooth_robo" \
  --instance_prompt="a photo of sks robo" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --checkpointing_steps=100 \
  --max_train_steps=200 \
  --use_8bit_adam
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 16.00 GiB total capacity; 14.38 GiB already allocated; 0 bytes free; 14.81 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
No good.
with gradient_checkpointing, no set_grads_to_none
accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="stable-diffusion-v1-4" \
  --instance_data_dir="RoboData" \
  --output_dir="dreambooth_robo" \
  --instance_prompt="a photo of sks robo" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --checkpointing_steps=100 \
  --max_train_steps=200 \
  --use_8bit_adam \
  --gradient_checkpointing
This ran successfully.
no gradient_checkpointing, with set_grads_to_none
accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="stable-diffusion-v1-4" \
  --instance_data_dir="RoboData" \
  --output_dir="dreambooth_robo" \
  --instance_prompt="a photo of sks robo" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --checkpointing_steps=100 \
  --max_train_steps=200 \
  --use_8bit_adam \
  --set_grads_to_none
This ran successfully.
So far it is clear that bitsandbytes is essential.
It is also clear that --gradient_checkpointing and --set_grads_to_none both help reduce VRAM usage.
--gradient_checkpointing appears to be more effective than --set_grads_to_none.
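For reference, my understanding of what the two flags correspond to in plain PyTorch / diffusers terms. This is only a sketch with the local model path assumed from the commands above, not code from the training script:

# Sketch only: what the two memory-saving flags roughly map to.
import torch
from diffusers import UNet2DConditionModel

# local model path as used in the commands above
unet = UNet2DConditionModel.from_pretrained("stable-diffusion-v1-4", subfolder="unet").cuda()

# --gradient_checkpointing: recompute activations during the backward pass instead of storing them
unet.enable_gradient_checkpointing()

optimizer = torch.optim.AdamW(unet.parameters(), lr=5e-6)
# --set_grads_to_none: release gradient tensors instead of filling them with zeros
optimizer.zero_grad(set_to_none=True)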
With prior-preservation loss
Based on the findings so far, let's try training with prior-preservation loss. (A rough sketch of how this loss is combined is shown at the end of this section.)
with use_8bit_adam
with gradient_checkpointing, no set_grads_to_none
accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="stable-diffusion-v1-4" \
  --instance_data_dir="RoboData" \
  --output_dir="dreambooth_robo" \
  --instance_prompt="a photo of sks robo" \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --class_data_dir="robo-class-images" \
  --class_prompt="a photo of robo" \
  --num_class_images=200 \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --checkpointing_steps=100 \
  --max_train_steps=200 \
  --use_8bit_adam \
  --gradient_checkpointing
This ran successfully.
no gradient_checkpointing, with set_grads_to_none
accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="stable-diffusion-v1-4" \
  --instance_data_dir="RoboData" \
  --output_dir="dreambooth_robo" \
  --instance_prompt="a photo of sks robo" \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --class_data_dir="robo-class-images" \
  --class_prompt="a photo of robo" \
  --num_class_images=200 \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --checkpointing_steps=100 \
  --max_train_steps=200 \
  --use_8bit_adam \
  --set_grads_to_none
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 0; 16.00 GiB total capacity; 13.80 GiB already allocated; 0 bytes free; 14.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
No good!
with gradient_checkpointing, with set_grads_to_none
It is also possible to use --gradient_checkpointing and --set_grads_to_none together. Here is what it looked like when both were used.
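As mentioned above, here is my reading of how the script combines the two losses when --with_prior_preservation is set: each batch stacks instance images and class images, the model prediction is split in half, and the class half is weighted by --prior_loss_weight. A sketch, not the actual code:

# Sketch of the prior-preservation loss (my reading of train_dreambooth.py, not its code verbatim).
import torch
import torch.nn.functional as F

def dreambooth_loss(model_pred, target, prior_loss_weight=1.0):
    # first half of the batch: instance images, second half: class ("prior") images
    model_pred, model_pred_prior = torch.chunk(model_pred, 2, dim=0)
    target, target_prior = torch.chunk(target, 2, dim=0)

    instance_loss = F.mse_loss(model_pred.float(), target.float(), reduction="mean")
    prior_loss = F.mse_loss(model_pred_prior.float(), target_prior.float(), reduction="mean")
    return instance_loss + prior_loss_weight * prior_loss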
Fine-tuning the text encoder
Finally, I also tried fine-tuning the text encoder. (A short sketch of what --train_text_encoder changes is shown at the end of this section.)
with use_8bit_adam
with gradient_checkpointing, no set_grads_to_none
accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="stable-diffusion-v1-4" \
  --instance_data_dir="RoboData" \
  --output_dir="dreambooth_robo4" \
  --instance_prompt="a photo of sks robo" \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --class_data_dir="robo-class-images" \
  --class_prompt="a photo of robo" \
  --num_class_images=200 \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --checkpointing_steps=100 \
  --max_train_steps=200 \
  --use_8bit_adam \
  --gradient_checkpointing \
  --train_text_encoder
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 0; 16.00 GiB total capacity; 12.90 GiB already allocated; 0 bytes free; 14.70 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
No good!
with gradient_checkpointing, with set_grads_to_none
accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="stable-diffusion-v1-4" \
  --instance_data_dir="RoboData" \
  --output_dir="dreambooth_robo" \
  --instance_prompt="a photo of sks robo" \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --class_data_dir="robo-class-images" \
  --class_prompt="a photo of robo" \
  --num_class_images=200 \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --checkpointing_steps=100 \
  --max_train_steps=200 \
  --use_8bit_adam \
  --set_grads_to_none \
  --gradient_checkpointing \
  --train_text_encoder
This ran successfully.
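As mentioned above, my understanding of what --train_text_encoder adds: the CLIP text encoder is unfrozen and its parameters are optimized together with the UNet's, which is why VRAM usage goes up. A sketch with the local model path assumed, not the training script itself:

# Sketch: training the text encoder together with the UNet (illustration only).
import itertools
import torch
from diffusers import UNet2DConditionModel
from transformers import CLIPTextModel

unet = UNet2DConditionModel.from_pretrained("stable-diffusion-v1-4", subfolder="unet")
text_encoder = CLIPTextModel.from_pretrained("stable-diffusion-v1-4", subfolder="text_encoder")

# both parameter sets go into one optimizer (the runs above replace this with the 8-bit optimizer)
params = itertools.chain(unet.parameters(), text_encoder.parameters())
optimizer = torch.optim.AdamW(params, lr=5e-6)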
Results
The results with prior-preservation loss and text-encoder fine-tuning are posted here. touch-sp.hatenablog.com
Using xFormers
I wrote a separate article about it. touch-sp.hatenablog.com
When xFormers does not work
github.com
github.com
Using DeepSpeed
I wrote a separate article about it. touch-sp.hatenablog.com
Inference
from diffusers import StableDiffusionPipeline
import torch
import argparse

# command-line arguments: model directory and base seed
parser = argparse.ArgumentParser()
parser.add_argument(
    '--model',
    required=True,
    type=str,
    help='model id'
)
parser.add_argument(
    '--seed',
    type=int,
    default=200,
    help='seed'
)
opt = parser.parse_args()

model_id = opt.model

# load the fine-tuned pipeline in fp16 with the safety checker disabled
pipe = StableDiffusionPipeline.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    safety_checker=None).to("cuda")

prompt = "A photo of sks robo on the beach"

seed = opt.seed

# generate four images with consecutive seeds
for i in range(4):
    new_seed = seed + i
    generator = torch.Generator(device="cuda").manual_seed(new_seed)
    image = pipe(
        prompt = prompt,
        num_inference_steps = 50,
        generator = generator,
        num_images_per_prompt = 1).images[0]
    image.save(f'{model_id}_{new_seed}.png')
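If the script above is saved as, say, infer.py (any file name works), it can be run with "python infer.py --model dreambooth_robo --seed 200". It generates four images with consecutive seeds and saves each one as a PNG named after the model directory and the seed.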
Official tutorials
huggingface.co
github.com
huggingface.co
Related articles
touch-sp.hatenablog.com
touch-sp.hatenablog.com