Introduction
DiffEdit (Zero-shot Diffusion-based Semantic Image Editing with Mask Guidance) has recently become available in Diffusers, so I gave it a try. As the title says, the task is to turn a dog in a photo into a cat. Several similar approaches have come up before:

pix2pix-zero
touch-sp.hatenablog.com

Instruct-Pix2Pix
touch-sp.hatenablog.com

Environment
Ubuntu 22.04 on WSL2
CUDA 11.8
Python 3.10
Python environment
pip install torch==2.0.0+cu118 torchvision==0.15.1+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
pip install git+https://github.com/huggingface/diffusers.git
pip install transformers accelerate
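Before running the script it is worth confirming that the CUDA build of PyTorch was actually installed. A minimal sketch of such a check (the helper name `torch_cuda_available` is my own, not part of any library):

```python
import importlib.util

def torch_cuda_available():
    """Return torch.cuda.is_available() if torch is installed, else None."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch
    return torch.cuda.is_available()

# On the environment above this should print True.
print(torch_cuda_available())
```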
Python script
import torch
from diffusers.utils import load_image
from diffusers import DDIMScheduler, DDIMInverseScheduler, StableDiffusionDiffEditPipeline

sd_model_ckpt = "stabilityai/stable-diffusion-2-1"

pipeline = StableDiffusionDiffEditPipeline.from_pretrained(
    sd_model_ckpt,
    torch_dtype=torch.float16,
    safety_checker=None,
)
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config)
pipeline.enable_model_cpu_offload()
pipeline.enable_vae_slicing()

generator = torch.manual_seed(1000)

img_url = "dog.png"
raw_image = load_image(img_url).convert("RGB").resize((768, 768))

source_prompt = "a dog sitting on a park bench"
target_prompt = "a cat sitting on a park bench"

# Step 1: generate an edit mask by contrasting the source and target prompts.
mask_image = pipeline.generate_mask(
    image=raw_image,
    source_prompt=source_prompt,
    target_prompt=target_prompt,
    generator=generator,
)

# Step 2: invert the input image into latents with DDIM inversion.
inv_latents = pipeline.invert(
    prompt=source_prompt,
    image=raw_image,
    generator=generator,
).latents

# Step 3: denoise with the target prompt, editing only inside the mask.
image = pipeline(
    prompt=target_prompt,
    mask_image=mask_image,
    image_latents=inv_latents,
    generator=generator,
    negative_prompt=source_prompt,
).images[0]

image.save("result_diffedit.png")
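It can also be interesting to look at the mask itself. As far as I can tell, `generate_mask` returns the mask as a NumPy array at latent resolution (the image size divided by the VAE downscale factor of 8, so 96×96 for a 768×768 input), with values in [0, 1]. A minimal sketch of converting it to a viewable grayscale image; the random array here only stands in for the real `mask_image` from the script above:

```python
import numpy as np
from PIL import Image

# Stand-in for the (1, 96, 96) float mask that generate_mask would return.
mask_image = (np.random.rand(1, 96, 96) > 0.5).astype(np.float32)

# Scale to 0-255, drop the batch dimension, and upsample back to image size.
mask_pil = Image.fromarray((mask_image.squeeze() * 255).astype("uint8"), mode="L")
mask_pil = mask_pil.resize((768, 768))
mask_pil.save("mask.png")
```

White regions of the saved image mark where DiffEdit decided the dog-to-cat edit should be applied; the rest of the picture is reconstructed from the inverted latents.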