PC Environment
Windows 11
RTX 3080 Laptop (16 GB VRAM)
CUDA 11.8
Python 3.12
Python Environment Setup
pip install torch==2.5.1+cu118 --index-url https://download.pytorch.org/whl/cu118
pip install gguf
pip install git+https://github.com/huggingface/diffusers
pip install accelerate transformers protobuf sentencepiece
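To confirm that pip picked up the CUDA 11.8 build rather than the CPU-only one, a quick check like the following works (the commented values are what the commands above should produce, not output copied from this machine):

import torch

print(torch.__version__)              # expect 2.5.1+cu118
print(torch.version.cuda)             # expect 11.8
print(torch.cuda.is_available())      # expect True
print(torch.cuda.get_device_name(0))  # expect the RTX 3080 Laptop GPU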
Results
I tried three quantization levels; the loss in image quality was negligible in every case.

flux1-dev-Q8_0.gguf
text_encoder:
torch.cuda.max_memory_allocated: 9.33 GB
transformer:
torch.cuda.max_memory_allocated: 14.53 GB
time: 149.64 sec
GPU 0 - Used memory: 15.98/16.00 GB
flux1-dev-Q6_K.gguf
text_encoder:
torch.cuda.max_memory_allocated: 9.33 GB
transformer:
torch.cuda.max_memory_allocated: 11.79 GB
time: 168.83 sec
GPU 0 - Used memory: 14.55/16.00 GB
flux1-dev-Q4_K_S.gguf
text_encoder:
torch.cuda.max_memory_allocated: 9.33 GB
transformer:
torch.cuda.max_memory_allocated: 9.33 GB
time: 147.56 sec
GPU 0 - Used memory: 11.51/16.00 GB
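Summarizing the three runs (all numbers taken from the logs above):

GGUF file               peak VRAM (transformer)   time         GPU used memory
flux1-dev-Q8_0.gguf     14.53 GB                  149.64 sec   15.98/16.00 GB
flux1-dev-Q6_K.gguf     11.79 GB                  168.83 sec   14.55/16.00 GB
flux1-dev-Q4_K_S.gguf   9.33 GB                   147.56 sec   11.51/16.00 GB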
Python Script
import gc

import torch
from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig

from decorator import gpu_monitor, time_monitor


def flush():
    gc.collect()
    torch.cuda.empty_cache()


@gpu_monitor(interval=0.5)
@time_monitor
def main():
    # downloaded from https://huggingface.co/city96/FLUX.1-dev-gguf
    gguf_file = "flux1-dev-Q8_0.gguf"
    model_id = "black-forest-labs/Flux.1-Dev"

    # Stage 1: load only the text encoders and encode the prompt
    pipeline = FluxPipeline.from_pretrained(
        model_id,
        transformer=None,
        vae=None,
        torch_dtype=torch.bfloat16
    ).to("cuda")

    prompt = "A cat holding a sign that says hello world"

    with torch.no_grad():
        prompt_embeds, pooled_prompt_embeds, text_ids = pipeline.encode_prompt(
            prompt=prompt,
            prompt_2=None,
        )
    print("text_encoder:")
    print(f"torch.cuda.max_memory_allocated: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")

    # Free the text encoders before loading the transformer
    del pipeline
    flush()

    # Stage 2: load the GGUF-quantized transformer and generate the image
    transformer = FluxTransformer2DModel.from_single_file(
        gguf_file,
        quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
        torch_dtype=torch.bfloat16
    )
    pipeline = FluxPipeline.from_pretrained(
        model_id,
        transformer=transformer,
        text_encoder=None,
        text_encoder_2=None,
        tokenizer=None,
        tokenizer_2=None,
        torch_dtype=torch.bfloat16
    ).to("cuda")

    image = pipeline(
        prompt_embeds=prompt_embeds,
        pooled_prompt_embeds=pooled_prompt_embeds,
        generator=torch.manual_seed(0)
    ).images[0]

    save_file = gguf_file.replace(".gguf", ".jpg")
    image.save(save_file)

    print("transformer:")
    print(f"torch.cuda.max_memory_allocated: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")


if __name__ == "__main__":
    main()
Other
The benchmarks were taken with the monitoring script described in this post: touch-sp.hatenablog.com
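For readers who do not want to open the linked post, here is a minimal sketch of decorators with the same interface as the gpu_monitor and time_monitor imported above. This is an illustrative approximation, not the exact code from the linked post; the polling approach and print formats are assumptions.

# Illustrative approximations of the monitoring decorators used above.
import functools
import threading
import time

import torch


def time_monitor(func):
    # Print the wall-clock time the wrapped function takes.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print(f"time: {time.perf_counter() - start:.2f} sec")
        return result
    return wrapper


def gpu_monitor(interval=0.5):
    # Sample GPU memory usage every `interval` seconds on a background
    # thread and report the peak when the wrapped function returns.
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            peak = 0
            stop = threading.Event()

            def poll():
                nonlocal peak
                while not stop.is_set():
                    free, total = torch.cuda.mem_get_info()
                    peak = max(peak, total - free)
                    time.sleep(interval)

            thread = threading.Thread(target=poll, daemon=True)
            thread.start()
            try:
                return func(*args, **kwargs)
            finally:
                stop.set()
                thread.join()
                total = torch.cuda.mem_get_info()[1]
                print(f"GPU 0 - Used memory: {peak / 1024**3:.2f}/{total / 1024**3:.2f} GB")
        return wrapper
    return decorator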