[Diffusers] Creating a video (Text2Video) with "Mochi 1 Preview"

Introduction

I previously wrote an article about CogVideoX.

With Diffusers, "Mochi 1 Preview" can be run with almost the same script I used then.
touch-sp.hatenablog.com

PC used

OS		Windows 11
Processor	Core(TM) i7-14700K
Installed RAM	96.0 GB
GPU		RTX 4090 (VRAM 24GB)

CUDA 12.4
Python 3.12

Python environment setup

pip install torch==2.5.1+cu124 --index-url https://download.pytorch.org/whl/cu124
pip install diffusers[torch]
pip install transformers sentencepiece imageio imageio-ffmpeg
pip install torchao

diffusers==0.32.1
imageio==2.36.1
imageio-ffmpeg==0.6.0
sentencepiece==0.2.0
torch==2.5.1+cu124
torchao==0.8.0
transformers==4.48.0
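
To compare a local environment against the versions listed above, a small check along the following lines may help. This is only a sketch: the helper names `installed_version` and `compare_versions` are made up for illustration, and the listed versions are treated as a reference point rather than a hard requirement.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed in this post; a reference point, not a strict requirement.
EXPECTED = {
    "diffusers": "0.32.1",
    "imageio": "2.36.1",
    "imageio-ffmpeg": "0.6.0",
    "sentencepiece": "0.2.0",
    "torch": "2.5.1+cu124",
    "torchao": "0.8.0",
    "transformers": "4.48.0",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def compare_versions(expected):
    """Map each package name to (expected, installed) without importing it."""
    return {name: (want, installed_version(name)) for name, want in expected.items()}


if __name__ == "__main__":
    for name, (want, have) in compare_versions(EXPECTED).items():
        status = "OK" if want == have else "DIFFERENT"
        print(f"{name}: expected {want}, installed {have} [{status}]")
```

Because the check reads package metadata instead of importing the packages, it runs quickly and does not initialize CUDA.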

Python script

import torch
from diffusers import MochiPipeline, AutoencoderKLMochi, MochiTransformer3DModel, TorchAoConfig
from diffusers.utils import export_to_video
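# gpu_monitor / time_monitor are benchmark helpers from a local decorator.py (see the link at the end of this post)
from decorator import gpu_monitor, time_monitor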
import gc

def flush():
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

@gpu_monitor(interval=0.5)
@time_monitor
def main():
    # Stage 1: load only the tokenizer and text encoder (transformer and VAE are skipped)
    pipe = MochiPipeline.from_pretrained(
        "genmo/mochi-1-preview",
        transformer=None,
        vae=None,
        torch_dtype=torch.bfloat16
    )
    pipe.enable_model_cpu_offload()

    prompt = "An aerial shot of a parade of elephants walking across the African savannah. The camera showcases the herd and the surrounding landscape."
    with torch.no_grad():
        prompt_embeds, prompt_attention_mask, negative_prompt_embeds, negative_prompt_attention_mask = (
            pipe.encode_prompt(prompt=prompt)
        )

    print("text_encoder:")
    print(f"torch.cuda.max_memory_allocated: {torch.cuda.max_memory_allocated()/ 1024**3:.2f} GB")

    # Free the text encoder before loading the transformer and VAE
    del pipe
    flush()

    # torchao int8 weight-only quantization
    quantization_config = TorchAoConfig("int8wo")

    transformer = MochiTransformer3DModel.from_pretrained(
        "genmo/mochi-1-preview",
        variant="bf16",
        subfolder="transformer",
        quantization_config=quantization_config,
        torch_dtype=torch.bfloat16
    )

    vae = AutoencoderKLMochi.from_pretrained(
        "genmo/mochi-1-preview",
        variant="bf16",
        subfolder="vae",
        quantization_config=quantization_config,
        torch_dtype=torch.bfloat16
    )

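    # Stage 2: rebuild the pipeline without the text encoder and generate from the precomputed embeddings
    pipe = MochiPipeline.from_pretrained(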
        "genmo/mochi-1-preview",
        transformer=transformer,
        vae=vae,
        text_encoder=None,
        tokenizer=None,
        torch_dtype=torch.bfloat16
    )
    pipe.enable_model_cpu_offload()
    pipe.enable_vae_tiling()

    frames = pipe(
        prompt_embeds=prompt_embeds,
        prompt_attention_mask=prompt_attention_mask,
        negative_prompt_embeds=negative_prompt_embeds,
        negative_prompt_attention_mask=negative_prompt_attention_mask,
        num_frames=85,
        num_inference_steps=64,
        generator=torch.Generator().manual_seed(42)
    ).frames[0]

    export_to_video(frames, "mochi.mp4", fps=30)

    print("transformer and vae:")
    print(f"torch.cuda.max_memory_allocated: {torch.cuda.max_memory_allocated()/ 1024**3:.2f} GB")

if __name__ == "__main__":
    main()

Results

The following two errors were reported, but ignoring them does not appear to affect execution.

import error: No module named 'triton'

Expected types for vae: ['AutoencoderKL'], got AutoencoderKLMochi.

text_encoder:
torch.cuda.max_memory_allocated: 8.94 GB

transformer and vae:
torch.cuda.max_memory_allocated: 13.99 GB

time: 916.85 sec
GPU 0 - Used memory: 19.37/23.99 GB
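
For a rough sense of scale, the numbers above can be combined with the settings in the script (`num_frames=85`, `fps=30`). The sketch below is plain arithmetic. The function names are illustrative, and the elapsed time covers the whole of `main()`, including model loading and text encoding, so the per-frame figure is an upper bound on pure generation cost.

```python
NUM_FRAMES = 85        # num_frames passed to the pipeline
FPS = 30               # fps passed to export_to_video
ELAPSED_SEC = 916.85   # "time" reported in the results above


def clip_seconds(num_frames, fps):
    """Playback length of the exported clip in seconds."""
    return num_frames / fps


def seconds_per_frame(elapsed, num_frames):
    """Wall-clock seconds spent per generated frame (whole run, not just denoising)."""
    return elapsed / num_frames


print(f"clip length: {clip_seconds(NUM_FRAMES, FPS):.2f} s")
print(f"wall-clock per frame: {seconds_per_frame(ELAPSED_SEC, NUM_FRAMES):.2f} s")
```

In other words, about 15 minutes of computation produced a clip a little under 3 seconds long on this machine.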

The generated video is posted on Google Blogger.
support-touchsp.blogspot.com

Miscellaneous

The benchmark was run with the script described in the following post.
touch-sp.hatenablog.com
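
The `decorator` module imported by the script is not reproduced in this post. As a rough idea of what such helpers might look like, here is a minimal sketch. It is not the actual implementation from the linked post. The `sampler` parameter and the `_torch_used_memory` helper are assumptions added for illustration, and the default sampler relies on `torch.cuda.mem_get_info`, which reports device-wide usage rather than usage by this process alone.

```python
import functools
import threading
import time


def time_monitor(func):
    """Print the wall-clock time of one call to func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            print(f"time: {time.perf_counter() - start:.2f} sec")
    return wrapper


def _torch_used_memory():
    """Hypothetical default sampler: (used_GB, total_GB) for GPU 0."""
    import torch  # imported lazily so this module loads without PyTorch
    free, total = torch.cuda.mem_get_info(0)
    return (total - free) / 1024**3, total / 1024**3


def gpu_monitor(interval=1.0, sampler=_torch_used_memory):
    """Poll sampler() every `interval` seconds while func runs; print the peak."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            peak = [0.0, 0.0]  # [peak used GB, total GB]
            stop = threading.Event()

            def poll():
                while True:
                    used, total = sampler()
                    peak[0] = max(peak[0], used)
                    peak[1] = total
                    # wait() returns True once stop is set, ending the loop
                    if stop.wait(interval):
                        break

            thread = threading.Thread(target=poll, daemon=True)
            thread.start()
            try:
                return func(*args, **kwargs)
            finally:
                stop.set()
                thread.join()
                print(f"GPU 0 - Used memory: {peak[0]:.2f}/{peak[1]:.2f} GB")
        return wrapper
    return decorator
```

Stacked as in the script (`@gpu_monitor(interval=0.5)` above `@time_monitor`), the inner decorator prints the elapsed time first and the outer one prints the peak memory afterwards, which matches the order of the output shown in the results.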