【Diffusers】Text2Video with "CogVideoX"

Setting up the Python environment

pip install torch==2.4.0+cu118 --index-url https://download.pytorch.org/whl/cu118
pip install git+https://github.com/huggingface/diffusers
pip install git+https://github.com/huggingface/accelerate
pip install transformers sentencepiece opencv-python

Python script

import torch
from diffusers import CogVideoXPipeline, CogVideoXDDIMScheduler
from diffusers.utils import export_to_video

prompt = (
    "A panda sits on a wooden stool in a serene bamboo forest. "
    "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. "
    "Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
    "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."
)
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",
    torch_dtype=torch.float16
)
pipe.scheduler = CogVideoXDDIMScheduler.from_config(
    pipe.scheduler.config,
    timestep_spacing="trailing"
)
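For context, "trailing" timestep spacing counts the inference steps down from the end of the 1000-step training schedule, so the noisiest timestep 999 is always sampled. A rough sketch of the selection (an illustration of the idea, not diffusers' exact code path):

```python
# Sketch of "trailing" timestep selection for a DDIM-style scheduler.
num_train_timesteps = 1000   # training schedule length (scheduler config)
num_inference_steps = 50     # steps requested at sampling time

step_ratio = num_train_timesteps // num_inference_steps  # 20
# Count down from the final training timestep so 999 is always included;
# the schedule "trails off" near (but not at) timestep 0.
timesteps = list(range(num_train_timesteps - 1, -1, -step_ratio))[:num_inference_steps]

print(timesteps[0], timesteps[-1], len(timesteps))  # 999 19 50
```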

pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

video = pipe(
    prompt=prompt,
    num_frames=48,
    guidance_scale=6,
    num_inference_steps=50,
    generator=torch.Generator().manual_seed(42),
).frames[0]
export_to_video(video, "output_tiling.mp4", fps=8)
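For reference, the clip length follows directly from these two parameters: 48 frames exported at 8 fps yields a 6-second video.

```python
# Clip duration implied by the generation/export settings above.
num_frames = 48   # frames requested from the pipeline
fps = 8           # playback rate passed to export_to_video
duration_sec = num_frames / fps
print(duration_sec)  # 6.0
```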

Results

The results are posted on Google Blogger.
support-touchsp.blogspot.com
VRAM usage stayed under 8 GB.
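To check the memory footprint on your own machine, PyTorch's torch.cuda.max_memory_allocated reports the peak allocation. A minimal sketch (call it after the pipeline runs; it reports 0.0 without a CUDA GPU):

```python
import torch

def peak_vram_gib() -> float:
    """Peak VRAM allocated by this process, in GiB (0.0 without a GPU)."""
    if not torch.cuda.is_available():
        return 0.0
    return torch.cuda.max_memory_allocated() / 1024**3

# After generation, e.g.:
# print(f"peak VRAM: {peak_vram_gib():.2f} GiB")
```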


