[Diffusers] Creating a video with AnimateDiff + Multi-ControlNet

Introduction

I previously wrote an article about using a single ControlNet.
touch-sp.hatenablog.com
This time I tried Multi-ControlNet.

Pitfall

It appears that at most 32 frames are supported.

Trying to generate a longer video raises the following error.

RuntimeError: The size of tensor a (48) must match the size of tensor b (32) at non-singleton dimension 1

I wasted a lot of time before I figured out what was causing this error.
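
Because the error message itself gives no hint that the frame count is the cause, a small guard placed before the pipeline call can make the failure obvious. The following is only a minimal sketch: `MAX_FRAMES` and `check_num_frames` are hypothetical names introduced here, and the value 32 is simply the limit observed above, not something read from Diffusers.

```python
# Hypothetical guard. 32 is the limit observed in this article,
# not a value queried from Diffusers or the motion adapter.
MAX_FRAMES = 32


def check_num_frames(num_frames: int, max_frames: int = MAX_FRAMES) -> int:
    """Return num_frames unchanged, or raise a readable error if it is out of range."""
    if num_frames < 1:
        raise ValueError(f"num_frames must be at least 1, got {num_frames}")
    if num_frames > max_frames:
        raise ValueError(
            f"num_frames={num_frames} exceeds the observed limit of {max_frames} frames"
        )
    return num_frames
```

Calling `num_frames = check_num_frames(48)` would then fail immediately with a message that names the frame count, instead of the tensor size mismatch shown above.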

Python script

This is the script I ended up with.

from PIL import Image
import torch
from diffusers import DiffusionPipeline, AutoencoderKL, ControlNetModel, MotionAdapter, DPMSolverMultistepScheduler

adapter = MotionAdapter.from_pretrained("animatediff-motion-adapter-v1-5-2")

controlnets = [
    ControlNetModel.from_pretrained(
        "controlnet/control_v11p_sd15_openpose",
        torch_dtype=torch.float16
    ),
    ControlNetModel.from_pretrained(
        "controlnet/control_v11f1e_sd15_tile",
        torch_dtype=torch.float16
    )
]

vae = AutoencoderKL.from_single_file(
    "vae/vae-ft-mse-840000-ema-pruned.safetensors",
    torch_dtype=torch.float16
)

model_id = "model/mistoonAnimev20_ema"
pipe = DiffusionPipeline.from_pretrained(
    model_id,
    motion_adapter=adapter,
    controlnet=controlnets,
    vae=vae,
    custom_pipeline="pipeline_animatediff_controlnet",
    torch_dtype=torch.float16
).to("cuda")

pipe.scheduler = DPMSolverMultistepScheduler.from_pretrained(
    model_id,
    subfolder="scheduler", 
    beta_schedule="linear",
    clip_sample=False,
    timestep_spacing="linspace",
    steps_offset=1,
    algorithm_type="sde-dpmsolver++",
    use_karras_sigmas=True
)

pipe.load_textual_inversion("embeddings/easynegative.safetensors", token="easynegative")

pipe.enable_vae_slicing()

openpose_filename = "openpose.gif"
tile_filename = "tile.gif"

num_frames = 32
openpose_frames = []
gif_images = Image.open(openpose_filename)
for i in range(num_frames):
    gif_images.seek(i)
    image = gif_images.copy()
    openpose_frames.append(image)

tile_frames = []
gif_images = Image.open(tile_filename)
for i in range(num_frames):
    gif_images.seek(i)
    image = gif_images.copy()
    tile_frames.append(image)

controlimage = [openpose_frames, tile_frames]

prompt = "anime style, high quality, best quality, man, wearing sunglasses, dancing"
negative_prompt = "easynegative, worst quality, low quality"

seed = 222
controlnet1_scale = 0.8
controlnet2_scale = 0.8
result = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_frames=num_frames,
    conditioning_frames=controlimage,
    num_inference_steps=25,
    generator=torch.manual_seed(seed),
    controlnet_conditioning_scale=[controlnet1_scale, controlnet2_scale],
    #clip_skip=1
).frames[0]

from diffusers.utils import export_to_gif
export_to_gif(result, f"result_seed{seed}_scale{controlnet1_scale}_{controlnet2_scale}.gif")
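
The two frame-loading loops in the script do the same thing, and they fail with an `EOFError` if a GIF has fewer frames than `num_frames`. As a possible tidy-up, the loops could be folded into one helper that checks the frame count first. This is a sketch under my own assumptions: `load_gif_frames` is a hypothetical name, not part of Pillow or Diffusers.

```python
from PIL import Image


def load_gif_frames(path, num_frames):
    """Return the first num_frames frames of a GIF as independent PIL images.

    Hypothetical helper: raises ValueError when the GIF is too short.
    """
    with Image.open(path) as gif:
        # n_frames is provided by Pillow for multi-frame formats such as GIF.
        available = getattr(gif, "n_frames", 1)
        if available < num_frames:
            raise ValueError(
                f"{path} has only {available} frames, but {num_frames} were requested"
            )
        frames = []
        for i in range(num_frames):
            gif.seek(i)                # move to frame i
            frames.append(gif.copy())  # copy so the frame survives the next seek
    return frames
```

With this helper, the conditioning input could be built as `controlimage = [load_gif_frames("openpose.gif", num_frames), load_gif_frames("tile.gif", num_frames)]`.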

The "openpose.gif" used in the script was created from the source video with "preprocess.py", a script I wrote earlier.

python preprocess.py --video dance512.mp4 --type openpose --to_gif

touch-sp.hatenablog.com
The "tile.gif" used in the script is simply the source video converted to a GIF.

Results

I used the same video as when I previously did Video2Video with animatediff-cli-prompt-travel, and aimed for a similar result.
touch-sp.hatenablog.com
The results are posted on Google Blogger.
support-touchsp.blogspot.com
The quality is comparable.

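However, animatediff-cli-prompt-travel has no upper limit on the number of frames, whereas with Diffusers the limit is 32 frames, as mentioned above.

To create a video longer than 32 frames, please refer to the following article.
touch-sp.hatenablog.com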
