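Introduction
The training method used here is presented as "DreamBooth fine-tuning of the SDXL UNet via LoRA". It appears to differ from ordinary LoRA training. It is said to run in 16GB of VRAM, which I take to mean it can run on Google Colab. I took the opportunity to put my otherwise underused RTX 4090 to work.
touch-sp.hatenablog.com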
Environment
To reduce VRAM usage, I use the bitsandbytes library. bitsandbytes does not work on Windows, so I used WSL2.
Ubuntu 22.04 on WSL2
Python 3.10
CUDA 11.8
pip install torch==2.0.1+cu118 torchvision==0.15.2+cu118 --index-url https://download.pytorch.org/whl/cu118
pip install git+https://github.com/huggingface/diffusers
pip install accelerate transformers ftfy tensorboard Jinja2 xformers==0.0.22 bitsandbytes scipy
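Before going further, it can help to confirm that the packages actually landed in the active environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` and the `REQUIRED` list are my own choices for illustration, not part of any of the libraries above.

```python
from importlib import metadata

# Distribution names taken from the pip commands above.
REQUIRED = [
    "torch", "torchvision", "diffusers", "accelerate",
    "transformers", "xformers", "bitsandbytes",
]


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name:15s} {version or 'NOT INSTALLED'}")
```

This only checks that the packages are installed. It does not check that the GPU is visible from WSL2.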
Downloading the sample images
from huggingface_hub import snapshot_download

local_dir = "./dog"
snapshot_download(
    "diffusers/dog-example",
    local_dir=local_dir,
    repo_type="dataset",
    ignore_patterns=".gitattributes",
)
This downloads five images of a dog.
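To check that the download produced the expected five images, something like the following should do. It is a small sketch that assumes the images sit directly inside `./dog` with common image extensions; `list_images` and `IMAGE_SUFFIXES` are hypothetical names introduced here.

```python
from pathlib import Path

# Assumed set of extensions; adjust if the dataset uses something else.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def list_images(folder):
    """Return sorted image paths directly inside folder ([] if it is missing)."""
    root = Path(folder)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


if __name__ == "__main__":
    images = list_images("./dog")
    print(f"{len(images)} image(s) found")
    for path in images:
        print(" ", path.name)
```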
Configuring accelerate
$ accelerate config
------------------------------------------------------------------------------------------------------------------------
In which compute environment are you running?
This machine
------------------------------------------------------------------------------------------------------------------------
Which type of machine are you using?
No distributed training
Do you want to run your training on CPU only (even if a GPU / Apple Silicon device is available)? [yes/NO]:NO
Do you wish to optimize your script with torch dynamo?[yes/NO]:NO
Do you want to use DeepSpeed? [yes/NO]: NO
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:all
------------------------------------------------------------------------------------------------------------------------
Do you wish to use FP16 or BF16 (mixed precision)?
fp16
Running the training
"stable-diffusion-xl-base-1.0" had already been downloaded to a local folder beforehand.
accelerate launch train_dreambooth_lora_sdxl.py \
  --pretrained_model_name_or_path="stable-diffusion-xl-base-1.0" \
  --instance_data_dir="dog" \
  --pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" \
  --output_dir="lora-trained-xl" \
  --mixed_precision="fp16" \
  --instance_prompt="a photo of sks dog" \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --learning_rate=1e-4 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=600 \
  --checkpointing_steps=200 \
  --seed="0" \
  --enable_xformers_memory_efficient_attention \
  --use_8bit_adam
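The numbers in the command can be turned into a rough picture of the schedule. The sketch below assumes that `--max_train_steps` counts optimizer updates, with one update per accumulated batch; I have not verified this against the script, so treat the results as an estimate. The function name `training_schedule` is hypothetical.

```python
def training_schedule(train_batch_size, gradient_accumulation_steps,
                      max_train_steps, checkpointing_steps, num_images):
    """Estimate the effective batch size, data passes, and checkpoint steps.

    Assumption: one training step is one optimizer update, made after
    `gradient_accumulation_steps` forward/backward passes.
    """
    effective_batch = train_batch_size * gradient_accumulation_steps
    samples_seen = effective_batch * max_train_steps
    return {
        "effective_batch": effective_batch,
        "samples_seen": samples_seen,
        "passes_over_data": samples_seen / num_images,
        "checkpoints": list(range(checkpointing_steps,
                                  max_train_steps + 1,
                                  checkpointing_steps)),
    }


if __name__ == "__main__":
    # Values from the command above, with the five dog images.
    print(training_schedule(1, 4, 600, 200, num_images=5))
```

Under that assumption, the effective batch size is 4 and checkpoints are written at steps 200, 400, and 600, which matches the three checkpoints used for inference.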
Inference (results)
from diffusers import DiffusionPipeline, AutoencoderKL
import torch

checkpoint = 200
#checkpoint = 400
#checkpoint = 600

lora_model_id = f"lora-trained-xl/checkpoint-{checkpoint}"

vae = AutoencoderKL.from_pretrained(
    "madebyollin/sdxl-vae-fp16-fix",
    torch_dtype=torch.float16)

pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-xl-base-1.0",
    vae=vae,
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True).to("cuda")

pipe.load_lora_weights(lora_model_id, file_name="pytorch_lora_weights.bin")

seed = 20000
generator = torch.manual_seed(seed)
image = pipe(
    "A picture of a sks dog in a bucket",
    num_inference_steps=25,
    generator=generator).images[0]
image.save(f"result_checkpoint{checkpoint}.png")
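From left to right: 200, 400, and 600 training steps.
Although the model was trained on only five images, even a small number of steps gives results close to the dog in the training data.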
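Instead of editing the `checkpoint` variable by hand for each run, the saved checkpoints can be discovered from the output folder. This is a minimal sketch that assumes the training run leaves subfolders named `checkpoint-<step>` under `lora-trained-xl`, as the inference script's path suggests; `find_checkpoints` is a hypothetical helper.

```python
import re
from pathlib import Path


def find_checkpoints(output_dir):
    """Return [(step, path), ...] for 'checkpoint-<step>' subfolders, sorted by step."""
    root = Path(output_dir)
    if not root.is_dir():
        return []
    found = []
    for child in root.iterdir():
        match = re.fullmatch(r"checkpoint-(\d+)", child.name)
        if child.is_dir() and match:
            found.append((int(match.group(1)), child))
    # Sort numerically so that checkpoint-1000 comes after checkpoint-200.
    return sorted(found, key=lambda item: item[0])


if __name__ == "__main__":
    for step, path in find_checkpoints("lora-trained-xl"):
        print(step, path)
```

Each returned path could then be used in place of `lora_model_id` in a loop around the inference code.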