MasaCtrl now partially supports SDXL 1.0 (Stable Diffusion XL 1.0), so I tried it out right away.

Introduction

I have written about MasaCtrl before, so please take a look at those posts if you are interested.
touch-sp.hatenablog.com
touch-sp.hatenablog.com
Put simply, it is a technique for generating consistent images, for example the same character in different poses.

Since support for SDXL 1.0 (Stable Diffusion XL 1.0) has been added, this post covers my first attempt with it.

Environment

Windows 11
CUDA 11.7
Python 3.10

I made a requirements.txt so that the Python environment can be set up with a single command.

pip install -r https://raw.githubusercontent.com/dai-ichiro/myEnvironments/main/MasaCtrl/requirements_sdxl.txt
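
To confirm that the install worked, a quick standard-library check like the one below can list what ended up in the environment. This is only a sketch: `installed_versions` is a hypothetical helper, and the package names are assumptions taken from the imports in the script later in this post, not from the requirements file itself.

```python
from importlib import metadata


def installed_versions(names):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Package names are assumptions based on the script's imports.
    wanted = ["torch", "diffusers", "pytorch-lightning"]
    for name, version in installed_versions(wanted).items():
        print(f"{name}: {version or 'not installed'}")
```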

Running it

After cloning the repository, you run "run_synthesis_sdxl.py". By default it uses float32.

I edited the script so that it uses float16 instead.

model = DiffusionPipeline.from_pretrained(
    model_path, 
    scheduler=scheduler,
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True
    ).to(device)

# initialize the noise map
start_code = torch.randn([1, 4, 128, 128], dtype=torch.float16, device=device)
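
The shape [1, 4, 128, 128] follows from the image size: SDXL generates 1024x1024 images by default, its VAE downsamples by a factor of 8, and the latent has 4 channels. The helpers below are a hypothetical illustration of that arithmetic (they are not part of MasaCtrl or diffusers).

```python
def latent_shape(height=1024, width=1024, batch=1, channels=4, vae_scale=8):
    """Shape of the initial noise tensor for an image of the given size."""
    if height % vae_scale or width % vae_scale:
        raise ValueError("height and width must be multiples of the VAE scale factor")
    return [batch, channels, height // vae_scale, width // vae_scale]


def tensor_mebibytes(shape, bytes_per_element):
    """Memory taken by a dense tensor of this shape, in MiB."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element / (1024 * 1024)


shape = latent_shape()
print(shape)                       # [1, 4, 128, 128]
print(tensor_mebibytes(shape, 2))  # float16: 0.125
print(tensor_mebibytes(shape, 4))  # float32: 0.25
```

The noise map itself is tiny either way. The memory saved by float16 comes mainly from the model weights and activations, but the noise tensor still has to use the same dtype as the pipeline.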

If you load the model from a single "safetensors" file instead, it looks like this.

model = StableDiffusionXLPipeline.from_single_file(
    model_path, 
    scheduler=scheduler,
    extract_ema=True,
    torch_dtype=torch.float16,
    variant="fp16",
    ).to(device)
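
If you switch between the two formats often, the choice of loader could be made from the path itself. The following is a minimal sketch under the assumption that a single-file checkpoint always ends in ".safetensors" and anything else is a diffusers-format directory. `loader_name` and the example file name are hypothetical.

```python
from pathlib import PurePath


def loader_name(model_path):
    """Pick the loading method from the path's suffix alone (no filesystem access)."""
    suffix = PurePath(model_path).suffix.lower()
    if suffix == ".safetensors":
        return "from_single_file"
    return "from_pretrained"


print(loader_name("model/OsorubeshiMerge_v1.0_ema"))         # from_pretrained
print(loader_name("model/example_checkpoint.safetensors"))   # from_single_file
```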

Results

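The goal is to have the same person stand up or sit down.

Ordinary image generation models struggle with this kind of task.

The prompts and seed are written directly into "run_synthesis_sdxl.py".

I used a model called "OsorubeshiMerge v1.0".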

prompt: 1boy, anime, casual, outdoors, sitting, plain white t-shirt, blue jeans, best quality
prompt: 1boy, anime, casual, outdoors, standing, plain white t-shirt, blue jeans, best quality
negative_prompt: worst quality, low quality, bad anatomy
seed: 40000
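
Because the prompts and seed are hard-coded, every change means editing the script. One possible alternative is to pass them on the command line. The sketch below is hypothetical (the original script has no such options). It only builds the two prompt lists, with the source prompt first as in the script.

```python
import argparse

# The only part that differs between the two prompts above is the pose.
BASE_PROMPT = "1boy, anime, casual, outdoors, {pose}, plain white t-shirt, blue jeans, best quality"


def build_parser():
    """Command-line options; all option names here are made up for illustration."""
    parser = argparse.ArgumentParser(description="Consistent synthesis settings (sketch)")
    parser.add_argument("--seed", type=int, default=40000)
    parser.add_argument("--source-pose", default="sitting")
    parser.add_argument("--target-pose", default="standing")
    parser.add_argument("--negative-prompt", default="worst quality, low quality, bad anatomy")
    return parser


def build_prompts(args):
    """Return (prompts, negative_prompts) with the source prompt first."""
    prompts = [
        BASE_PROMPT.format(pose=args.source_pose),
        BASE_PROMPT.format(pose=args.target_pose),
    ]
    negative_prompts = [args.negative_prompt] * len(prompts)
    return prompts, negative_prompts


# Parse an explicit (empty) argument list so the defaults are used.
args = build_parser().parse_args([])
prompts, negative_prompts = build_prompts(args)
print(args.seed)
print(prompts[0])
print(prompts[1])
```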

I varied the layer setting (LAYER in the script) across 44, 49, 54, 59, and 64.

The results at 59 and 64 look best.
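
To get a feel for what the layer value changes, here is a small sketch. It assumes that MasaCtrl applies its mutual self-attention only to attention layers whose index is at or above the start layer, and that it counts 70 such layers for SDXL. Both points, along with the constant and the helper `controlled_layers`, are assumptions, so please check them against the repository.

```python
ASSUMED_SDXL_LAYERS = 70  # assumption: layer count MasaCtrl uses for model_type="SDXL"


def controlled_layers(start_layer, total_layers=ASSUMED_SDXL_LAYERS):
    """Indices of the attention layers that would be controlled for a given start layer."""
    if not 0 <= start_layer <= total_layers:
        raise ValueError("start_layer must be between 0 and total_layers")
    return list(range(start_layer, total_layers))


for layer in [44, 49, 54, 59, 64]:
    print(layer, len(controlled_layers(layer)))
```

If that reading is right, a larger value means fewer controlled layers, so the sweep moves from stronger to weaker control.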

layer 44

layer 49

layer 54

layer 59

layer 64

Python script

import os
import torch

from diffusers import DDIMScheduler, DiffusionPipeline

from masactrl.masactrl_utils import AttentionBase
from masactrl.masactrl_utils import regiter_attention_editor_diffusers
from masactrl.masactrl import MutualSelfAttentionControl

from pytorch_lightning import seed_everything

torch.cuda.set_device(0)
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

model_path = "model/OsorubeshiMerge_v1.0_ema"
model_name = os.path.basename(model_path)
scheduler = DDIMScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", clip_sample=False, set_alpha_to_one=False)
model = DiffusionPipeline.from_pretrained(
    model_path, 
    scheduler=scheduler,
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True
    ).to(device)

def consistent_synthesis():
    seed = 40000
    seed_everything(seed)

    out_dir_ori = "./results"
    os.makedirs(out_dir_ori, exist_ok=True)

    prompts = [
        "1boy, anime, casual, outdoors, sitting, plain white t-shirt, blue jeans, best quality",
        "1boy, anime, casual, outdoors, standing, plain white t-shirt, blue jeans, best quality"
        ]

    negative_prompts = [
        "worst quality, low quality, bad anatomy",
        "worst quality, low quality, bad anatomy"
        ]   

    STEP = 4
    LAYER_LIST = [44, 49, 54, 59, 64]  # run the synthesis with MasaCtrl at five different layer configs
    #LAYER_LIST = [49]  

    # initialize the noise map
    start_code = torch.randn([1, 4, 128, 128], dtype=torch.float16, device=device)
    # start_code = None
    start_code = start_code.expand(len(prompts), -1, -1, -1)

    # inference the synthesized image without MasaCtrl
    editor = AttentionBase()
    regiter_attention_editor_diffusers(model, editor)
    image_ori = model(
        prompts, 
        negative_prompt=negative_prompts,
        latents=start_code, 
        guidance_scale=7.5
        ).images

    for LAYER in LAYER_LIST:
        # hijack the attention module
        editor = MutualSelfAttentionControl(STEP, LAYER, model_type="SDXL")
        regiter_attention_editor_diffusers(model, editor)

        # inference the synthesized image
        image_masactrl = model(
            prompts, 
            negative_prompt=negative_prompts,
            latents=start_code,
            guidance_scale=7.5
            ).images
        
        out_dir = os.path.join(out_dir_ori, f"{model_name}_seed{seed}_layer{LAYER}")
        os.makedirs(out_dir, exist_ok=True)
        image_ori[0].save(os.path.join(out_dir, f"source_step{STEP}_layer{LAYER}.png"))
        image_masactrl[-1].save(os.path.join(out_dir, f"masactrl_step{STEP}_layer{LAYER}.png"))
        with open(os.path.join(out_dir, f"prompts.txt"), "w") as f:
            f.write(f"{model_name}\n")
            for i, p in enumerate(prompts):
                f.write(f"prompt{i+1}: {p}\n")
            f.write(f"negative_prompt: {negative_prompts[0]}\n")
            f.write(f"seed: {seed}\n")
        print("Synthesized images are saved in", out_dir)

if __name__ == "__main__":
    consistent_synthesis()
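
For reference, the output layout that the script produces can be summarized as follows. `output_files` is a hypothetical helper that mirrors the f-strings in the script. It uses posixpath so that the printed paths look the same on every OS, whereas the script itself uses os.path.join.

```python
import posixpath


def output_files(model_name, seed, step, layer, root="./results"):
    """Paths the script writes for one LAYER value."""
    out_dir = posixpath.join(root, f"{model_name}_seed{seed}_layer{layer}")
    return [
        posixpath.join(out_dir, f"source_step{step}_layer{layer}.png"),
        posixpath.join(out_dir, f"masactrl_step{step}_layer{layer}.png"),
        posixpath.join(out_dir, "prompts.txt"),
    ]


for path in output_files("OsorubeshiMerge_v1.0_ema", seed=40000, step=4, layer=59):
    print(path)
```

Each layer value gets its own directory. The source image (generated without MasaCtrl) is saved again in each one, so every directory can be viewed on its own.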

Other

Additional results are posted on my Google Blogger site.
support-touchsp.blogspot.com
