[Diffusers] Trying out video generation with ConsisID (1)

Introduction

ConsisID is a video generation model developed to preserve a person's identity across the generated frames.

I loosely think of it as a video version of IP-Adapter FaceID.
touch-sp.hatenablog.com

PC used

Processor	Intel(R) Core(TM) i7-14700K
Installed RAM	96.0 GB
GPU		RTX 4090 (VRAM 24GB)
Ubuntu 24.04 on WSL2
Python 3.12
CUDA 12.4

Python environment setup

Apart from PyTorch and torchvision, onnxruntime-gpu is the only package I pinned to a specific version.

pip install torch==2.5.1+cu124 torchvision==0.20.1+cu124 xformers --index-url https://download.pytorch.org/whl/cu124
pip install git+https://github.com/huggingface/diffusers
pip install accelerate transformers sentencepiece
pip install onnxruntime-gpu==1.19.2 insightface
pip install consisid_eva_clip
pip install imageio-ffmpeg

apex could not be installed with a plain pip install, so I built it from source.

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./

The full list of installed libraries is at the end of this article.
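Before running the script, it can help to confirm what actually got installed. The following is a minimal sketch and not part of the original setup: `installed_versions` is a hypothetical helper name, and it relies only on the standard library's `importlib.metadata`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {distribution name: version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Distribution names taken from the pip commands above.
    wanted = [
        "torch", "torchvision", "xformers", "diffusers",
        "onnxruntime-gpu", "insightface", "consisid_eva_clip", "apex",
    ]
    for name, ver in installed_versions(wanted).items():
        print(f"{name}: {ver or 'NOT INSTALLED'}")
```

The printed versions can then be compared by eye against the package list at the end of the article.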

Python script

import torch
from diffusers import ConsisIDPipeline
from diffusers.pipelines.consisid.consisid_utils import prepare_face_models, process_face_embeddings_infer
from diffusers.utils import export_to_video
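# gpu_monitor, time_monitor and print_memory come from a local decorator.py
# (the benchmark script linked near the end of this article), not from the PyPI "decorator" package.
from decorator import gpu_monitor, time_monitor, print_memory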

# model was downloaded from https://huggingface.co/BestWishYsh/ConsisID-preview

@gpu_monitor(interval=0.5)
@time_monitor
def main():

    face_helper_1, face_helper_2, face_clip_model, face_main_model, eva_transform_mean, eva_transform_std = prepare_face_models(
        "BestWishYsh/ConsisID-preview",
        device="cuda",
        dtype=torch.bfloat16
    )
    pipe = ConsisIDPipeline.from_pretrained(
        "BestWishYsh/ConsisID-preview",
        torch_dtype=torch.bfloat16
    )

    pipe.enable_model_cpu_offload()
    #pipe.enable_sequential_cpu_offload()
    #pipe.vae.enable_tiling()

    prompt = "The video captures a boy walking along a city street, filmed in black and white on a classic 35mm camera. His expression is thoughtful, his brow slightly furrowed as if he's lost in contemplation. The film grain adds a textured, timeless quality to the image, evoking a sense of nostalgia. Around him, the cityscape is filled with vintage buildings, cobblestone sidewalks, and softly blurred figures passing by, their outlines faint and indistinct. Streetlights cast a gentle glow, while shadows play across the boy's path, adding depth to the scene. The lighting highlights the boy's subtle smile, hinting at a fleeting moment of curiosity. The overall cinematic atmosphere, complete with classic film still aesthetics and dramatic contrasts, gives the scene an evocative and introspective feel."
    image = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_input.png?download=true"

    id_cond, id_vit_hidden, image, face_kps = process_face_embeddings_infer(
        face_helper_1,
        face_clip_model,
        face_helper_2,
        eva_transform_mean,
        eva_transform_std,
        face_main_model,
        "cuda",
        torch.bfloat16,
        image,
        is_align_face=True,
    )
    video = pipe(
        image=image,
        prompt=prompt,
        num_inference_steps=50,
        guidance_scale=6.0,
        use_dynamic_cfg=False,
        id_vit_hidden=id_vit_hidden,
        id_cond=id_cond,
        kps_cond=face_kps,
        generator=torch.Generator("cuda").manual_seed(42),
    )
    export_to_video(video.frames[0], "output.mp4", fps=8)
    
    print_memory()    

if __name__ == "__main__":
    main()
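The script switches between memory settings by commenting lines in and out. As a sketch of an alternative, the same three calls (`enable_model_cpu_offload`, `enable_sequential_cpu_offload`, `vae.enable_tiling`) could be chosen through arguments. The helper name `apply_memory_options` and its parameters are hypothetical and do not appear in the original script.

```python
def apply_memory_options(pipe, offload="model", vae_tiling=False):
    """Apply one CPU offload mode and, optionally, VAE tiling to a pipeline.

    offload="model"      -> pipe.enable_model_cpu_offload()
    offload="sequential" -> pipe.enable_sequential_cpu_offload()
    """
    if offload == "model":
        pipe.enable_model_cpu_offload()
    elif offload == "sequential":
        pipe.enable_sequential_cpu_offload()
    else:
        raise ValueError(f"unknown offload mode: {offload!r}")

    if vae_tiling:
        # Decode the video latents tile by tile to lower peak VRAM.
        pipe.vae.enable_tiling()
    return pipe
```

For example, `apply_memory_options(pipe, offload="sequential", vae_tiling=True)` would replace the three toggled lines with sequential offload plus VAE tiling.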

Results

The generated videos are posted on Google Blogger.
support-touchsp.blogspot.com
Below are the results of running the script with several different settings.
"vae.enable_slicing()" had almost no effect on memory usage.

Method 1

pipe.enable_model_cpu_offload()
#pipe.enable_sequential_cpu_offload()
#pipe.vae.enable_tiling()
max_memory=16.57 GB
max_reserved=25.87 GB
time: 447.39 sec
GPU 0 - Used memory: 23.85/23.99 GB

Method 2

pipe.enable_model_cpu_offload()
#pipe.enable_sequential_cpu_offload()
pipe.vae.enable_tiling()
max_memory=16.04 GB
max_reserved=18.29 GB
time: 319.91 sec
GPU 0 - Used memory: 21.48/23.99 GB

Method 3

#pipe.enable_model_cpu_offload()
pipe.enable_sequential_cpu_offload()
#pipe.vae.enable_tiling()
max_memory=16.27 GB
max_reserved=23.21 GB
time: 637.43 sec
GPU 0 - Used memory: 23.95/23.99 GB

Method 4

#pipe.enable_model_cpu_offload()
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_tiling()
max_memory=5.29 GB
max_reserved=7.11 GB
time: 666.22 sec
GPU 0 - Used memory: 8.58/23.99 GB
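To compare the four runs side by side, the measurements can be put into a small table. The sketch below only restates the numbers reported above. The field names and the `summarize` helper are made up for illustration.

```python
# Measurements copied from the four runs reported in this section.
runs = [
    {"method": 1, "offload": "model", "vae_tiling": False,
     "max_memory_gb": 16.57, "max_reserved_gb": 25.87, "time_sec": 447.39, "gpu_used_gb": 23.85},
    {"method": 2, "offload": "model", "vae_tiling": True,
     "max_memory_gb": 16.04, "max_reserved_gb": 18.29, "time_sec": 319.91, "gpu_used_gb": 21.48},
    {"method": 3, "offload": "sequential", "vae_tiling": False,
     "max_memory_gb": 16.27, "max_reserved_gb": 23.21, "time_sec": 637.43, "gpu_used_gb": 23.95},
    {"method": 4, "offload": "sequential", "vae_tiling": True,
     "max_memory_gb": 5.29, "max_reserved_gb": 7.11, "time_sec": 666.22, "gpu_used_gb": 8.58},
]


def summarize(runs):
    """Return (fastest run, run with the lowest peak allocated memory)."""
    fastest = min(runs, key=lambda r: r["time_sec"])
    leanest = min(runs, key=lambda r: r["max_memory_gb"])
    return fastest, leanest


fastest, leanest = summarize(runs)
for r in runs:
    slowdown = r["time_sec"] / fastest["time_sec"]  # relative to the fastest run
    print(f"method {r['method']}: {r['offload']:<10} tiling={r['vae_tiling']!s:<5} "
          f"max_memory={r['max_memory_gb']:>5.2f} GB  time={r['time_sec']:.2f} sec ({slowdown:.2f}x)")
print(f"fastest: method {fastest['method']}, lowest memory: method {leanest['method']}")
```

In these measurements, model offload with VAE tiling is the fastest run, and sequential offload with VAE tiling has by far the lowest peak memory.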

Follow-up

The follow-up article is here.
touch-sp.hatenablog.com

Benchmark

For benchmarking I used the script from this article.
touch-sp.hatenablog.com
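The actual benchmark helpers are in the linked article and are not reproduced here. As a rough sketch of what such helpers might look like, the code below shows a timing decorator and a peak-memory printer that produce lines in the same shape as the results. This is an assumption-based reconstruction, not the author's code: `format_memory` is a hypothetical helper, the division by 1024**3 is a guess at how "GB" was computed, and the GPU polling decorator (`gpu_monitor`) is left out.

```python
import functools
import time


def time_monitor(func):
    """Print the wall-clock time of each call to func and return its result."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            print(f"time: {time.perf_counter() - start:.2f} sec")
    return wrapper


def format_memory(allocated_bytes, reserved_bytes):
    """Format two byte counts like the max_memory / max_reserved result lines."""
    gib = 1024 ** 3  # assumption: the reported "GB" values are GiB
    return (f"max_memory={allocated_bytes / gib:.2f} GB\n"
            f"max_reserved={reserved_bytes / gib:.2f} GB")


def print_memory():
    """Print PyTorch's peak allocated and peak reserved CUDA memory."""
    import torch  # imported lazily so time_monitor works without PyTorch installed
    print(format_memory(torch.cuda.max_memory_allocated(),
                        torch.cuda.max_memory_reserved()))
```

`torch.cuda.max_memory_allocated()` and `torch.cuda.max_memory_reserved()` report the peak values seen by PyTorch's caching allocator, which is why they can differ from the overall "Used memory" figure reported for the GPU.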

Package versions

accelerate==1.3.0
albucore==0.0.23
albumentations==2.0.0
annotated-types==0.7.0
apex @ file:///mnt/wsl/PHYSICALDRIVE3p1/consisid/env/apex
asttokens==3.0.0
certifi==2024.12.14
charset-normalizer==3.4.1
coloredlogs==15.0.1
comm==0.2.2
consisid_eva_clip==1.0.2
contourpy==1.3.1
cycler==0.12.1
Cython==3.0.11
decorator==5.1.1
diffusers @ git+https://github.com/huggingface/diffusers@328e0d20a7b996f9bdb04180457eb08c1b42a76e
easydict==1.13
einops==0.8.0
executing==2.1.0
facexlib==0.3.0
filelock==3.13.1
filterpy==1.4.5
flatbuffers==24.12.23
fonttools==4.55.3
fsspec==2024.2.0
ftfy==6.3.1
huggingface-hub==0.27.1
humanfriendly==10.0
idna==3.10
imageio==2.37.0
imageio-ffmpeg==0.6.0
importlib_metadata==8.5.0
insightface==0.7.3
ipython==8.31.0
ipywidgets==8.1.5
jedi==0.19.2
Jinja2==3.1.3
joblib==1.4.2
jupyterlab_widgets==3.0.13
kiwisolver==1.4.8
lazy_loader==0.4
llvmlite==0.43.0
MarkupSafe==2.1.5
matplotlib==3.10.0
matplotlib-inline==0.1.7
mpmath==1.3.0
networkx==3.2.1
numba==0.60.0
numpy==2.0.2
nvidia-cublas-cu12==12.4.5.8
nvidia-cuda-cupti-cu12==12.4.127
nvidia-cuda-nvrtc-cu12==12.4.127
nvidia-cuda-runtime-cu12==12.4.127
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.2.1.3
nvidia-curand-cu12==10.3.5.147
nvidia-cusolver-cu12==11.6.1.9
nvidia-cusparse-cu12==12.3.1.170
nvidia-nccl-cu12==2.21.5
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu12==12.4.127
onnx==1.17.0
onnxruntime-gpu==1.19.2
opencv-python==4.11.0.86
opencv-python-headless==4.11.0.86
packaging==24.2
parso==0.8.4
pexpect==4.9.0
pillow==10.2.0
prettytable==3.12.0
prompt_toolkit==3.0.49
protobuf==5.29.3
psutil==6.1.1
ptyprocess==0.7.0
pure_eval==0.2.3
pydantic==2.10.5
pydantic_core==2.27.2
pyfacer==0.0.5
Pygments==2.19.1
pyparsing==3.2.1
python-dateutil==2.9.0.post0
PyYAML==6.0.2
regex==2024.11.6
requests==2.32.3
safetensors==0.5.2
scikit-image==0.25.0
scikit-learn==1.6.1
scipy==1.15.1
sentencepiece==0.2.0
setuptools==75.8.0
simsimd==6.2.1
six==1.17.0
stack-data==0.6.3
stringzilla==3.11.3
sympy==1.13.1
threadpoolctl==3.5.0
tifffile==2025.1.10
timm==1.0.14
tokenizers==0.21.0
torch==2.5.1+cu124
torchvision==0.20.1+cu124
tqdm==4.67.1
traitlets==5.14.3
transformers==4.48.0
triton==3.1.0
typing_extensions==4.12.2
urllib3==2.3.0
validators==0.34.0
wcwidth==0.2.13
wheel==0.45.1
widgetsnbextension==4.0.13
xformers==0.0.29.post1
zipp==3.21.0