Introduction
I ran Lumina Image 2.0 with various combinations of to("cuda"), enable_model_cpu_offload(), enable_sequential_cpu_offload(), enable_vae_slicing(), and enable_vae_tiling().
Test Machine
CPU: Intel(R) Core(TM) i7-12700H / RAM: 32.0 GB / GPU: RTX 3080 Laptop (16 GB VRAM)
Python 3.12 / CUDA 12.6
Python Environment
accelerate==1.3.0
diffusers @ git+https://github.com/huggingface/diffusers@81440fd47493b9f9e817411ca0499d0bf06fde95
torch==2.6.0+cu126
transformers==4.48.3
Results
Results will vary depending on your machine, so treat these numbers as a rough reference only. A benchmark script is included at the end of this article; try measuring in your own environment.
Ranked by generation speed (fastest first)
1st place
to("cuda")
enable_vae_slicing()
enable_vae_tiling()
time: 76.81 sec
memory: 14.84 GB

2nd place
to("cuda")
enable_vae_slicing()
time: 80.57 sec
memory: 14.84 GB

3rd place and below
to("cuda")
enable_vae_tiling()
time: 80.63 sec
memory: 14.84 GB

to("cuda")
time: 81.02 sec
memory: 14.84 GB

to("cuda")
enable_model_cpu_offload()
enable_vae_slicing()
enable_vae_tiling()
time: 81.65 sec
memory: 10.36 GB

to("cuda")
enable_model_cpu_offload()
enable_vae_slicing()
time: 83.38 sec
memory: 10.38 GB

to("cuda")
enable_model_cpu_offload()
enable_vae_tiling()
time: 84.3 sec
memory: 10.38 GB

to("cuda")
enable_model_cpu_offload()
time: 84.81 sec
memory: 10.38 GB

enable_model_cpu_offload()
enable_vae_slicing()
enable_vae_tiling()
time: 85.24 sec
memory: 5.95 GB

enable_model_cpu_offload()
enable_vae_slicing()
time: 87.81 sec
memory: 5.95 GB

enable_model_cpu_offload()
enable_vae_tiling()
time: 88.41 sec
memory: 5.95 GB

enable_model_cpu_offload()
time: 88.79 sec
memory: 5.95 GB
Ranked by VRAM usage (lowest first)
1st place (tie)
enable_model_cpu_offload()
enable_vae_slicing()
enable_vae_tiling()
memory: 5.95 GB
time: 85.24 sec

enable_model_cpu_offload()
enable_vae_slicing()
memory: 5.95 GB
time: 87.81 sec

enable_model_cpu_offload()
enable_vae_tiling()
memory: 5.95 GB
time: 88.41 sec

enable_model_cpu_offload()
memory: 5.95 GB
time: 88.79 sec

2nd place and below
to("cuda")
enable_model_cpu_offload()
enable_vae_slicing()
enable_vae_tiling()
memory: 10.36 GB
time: 81.65 sec

to("cuda")
enable_model_cpu_offload()
enable_vae_slicing()
memory: 10.38 GB
time: 83.38 sec

to("cuda")
enable_model_cpu_offload()
enable_vae_tiling()
memory: 10.38 GB
time: 84.3 sec

to("cuda")
enable_model_cpu_offload()
memory: 10.38 GB
time: 84.81 sec

to("cuda")
enable_vae_slicing()
enable_vae_tiling()
memory: 14.84 GB
time: 76.81 sec

to("cuda")
enable_vae_slicing()
memory: 14.84 GB
time: 80.57 sec

to("cuda")
enable_vae_tiling()
memory: 14.84 GB
time: 80.63 sec

to("cuda")
memory: 14.84 GB
time: 81.02 sec
Takeaways (Discussion)
With enable_sequential_cpu_offload(), image generation fails outright. enable_vae_slicing() and enable_vae_tiling() do not reduce VRAM usage, but enabling them does speed up generation. For reducing VRAM usage, enable_model_cpu_offload() is effective; however, combining it with to("cuda") weakens that effect.
Python Script
I wrote a script that runs this benchmark. If it takes too long in your environment, reduce num_inference_steps below 50.

```python
import gc
import time
from itertools import product
from typing import Tuple, TypedDict

import torch
from diffusers import Lumina2Text2ImgPipeline


def reset_memory():
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_accumulated_memory_stats()
    torch.cuda.reset_peak_memory_stats()


class ResultDict(TypedDict):
    memory: float
    time_required: float
    combination: str


def main(i: int, combination: Tuple[bool, bool, bool, bool, bool]) -> ResultDict:
    # Skip runs where none of the three device-placement options is enabled.
    if sum(combination[:3]) == 0:
        return None
    pipe = Lumina2Text2ImgPipeline.from_pretrained(
        "Alpha-VLLM/Lumina-Image-2.0", torch_dtype=torch.bfloat16
    )
    try:
        combination_list = []
        if combination[0]:
            pipe.to("cuda")
            combination_list.append('to("cuda")')
        if combination[1]:
            pipe.enable_model_cpu_offload()
            combination_list.append("enable_model_cpu_offload()")
        if combination[2]:
            pipe.enable_sequential_cpu_offload()
            combination_list.append("enable_sequential_cpu_offload()")
        if combination[3]:
            pipe.enable_vae_slicing()
            combination_list.append("enable_vae_slicing()")
        if combination[4]:
            pipe.enable_vae_tiling()
            combination_list.append("enable_vae_tiling()")

        prompt = (
            "A serene photograph capturing the golden reflection of the sun on a vast expanse of water. "
            "The sun is positioned at the top center, casting a brilliant, shimmering trail of light across the rippling surface. "
            "The water is textured with gentle waves, creating a rhythmic pattern that leads the eye towards the horizon. "
            "The entire scene is bathed in warm, golden hues, enhancing the tranquil and meditative atmosphere. "
            "High contrast, natural lighting, golden hour, photorealistic, expansive composition, reflective surface, peaceful, visually harmonious."
        )
        start_time = time.time()
        image = pipe(
            prompt,
            height=1024,
            width=1024,
            guidance_scale=4.0,
            num_inference_steps=50,
            cfg_trunc_ratio=0.25,
            cfg_normalization=True,
            generator=torch.Generator("cpu").manual_seed(0),
        ).images[0]
        image.save(f"{i}.jpg")
        end_time = time.time()
        result: ResultDict = {
            "memory": round(torch.cuda.max_memory_reserved() / 1024**3, 2),
            "time_required": round(end_time - start_time, 2),
            "combination": "\n".join(combination_list),
        }
    except Exception as e:
        print("\n".join(combination_list))
        print(e)
        return None
    print("success!!")
    print("\n".join(combination_list))
    print(f"saved image as {i}.jpg")
    return result


if __name__ == "__main__":
    combinations = list(product([True, False], repeat=5))
    result_list = []
    for i, combination in enumerate(combinations):
        reset_memory()
        result = main(i, combination)
        if result is not None:
            result_list.append(result)

    print("Sorted by time taken")
    for r in sorted(result_list, key=lambda x: x["time_required"]):
        print(r["combination"])
        print(f"time: {r['time_required']} sec")
        print(f"memory: {r['memory']} GB")
        print()

    print("Sorted by memory used")
    for r in sorted(result_list, key=lambda x: x["memory"]):
        print(r["combination"])
        print(f"memory: {r['memory']} GB")
        print(f"time: {r['time_required']} sec")
        print()
```
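The script sweeps every on/off combination of the five options with itertools.product and skips runs where none of the first three (device-placement) options is enabled. That enumeration logic can be checked in isolation, with no GPU or model download; the names `OPTIONS` and `benchmarked_combinations` below are mine, not part of the script:

```python
from itertools import product

# The five options toggled by the benchmark script.
OPTIONS = [
    'to("cuda")',
    "enable_model_cpu_offload()",
    "enable_sequential_cpu_offload()",
    "enable_vae_slicing()",
    "enable_vae_tiling()",
]

def benchmarked_combinations():
    """Enumerate the option combinations the script actually measures."""
    combos = []
    for flags in product([True, False], repeat=len(OPTIONS)):
        # Mirrors the `sum(combination[:3]) == 0` guard in main():
        # skip runs with no device-placement option enabled.
        if sum(flags[:3]) == 0:
            continue
        combos.append([name for name, on in zip(OPTIONS, flags) if on])
    return combos

print(len(benchmarked_combinations()))  # 28 of the 32 combinations are run
```

This is why fewer result entries appear above than the 32 raw combinations: 4 of them (the ones with only VAE options, or nothing, enabled) are filtered out before the pipeline is ever loaded.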