Getting "imp-v1-3b", rumored to be powerful despite being lightweight, to describe a photo (English only, unfortunately)

The image I prepared

I used this image from Pakutaso.
www.pakutaso.com

Result

This is the model's answer to the prompt "Describe the person in the image".

The person in the image is a young woman wearing a blue kimono, which is a traditional Japanese garment. She is holding a white fan in her hand, possibly using it to cool herself or as a decorative accessory. The woman appears to be smiling, indicating that she is happy or content.

Here is the same answer translated into Japanese with DeepL.

画像の人物は、日本の伝統衣装である青い着物を着た若い女性。手には白い扇子を持っており、涼をとるためか、装飾品として使っているのだろう。女性は微笑んでいるように見え、幸せか満足していることを示している。

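It is a little disappointing that the yukata was described as a kimono, but the result is still surprisingly good.

The uchiwa (flat fan) turning into a 扇子 (folding fan) is, I think, a DeepL translation issue, since the model's English output only says "fan".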

Python script

I had no idea large language models were this easy to use.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from diffusers.utils import load_image

user_query = "Describe the person in the image"

# Load the model in half precision and let device_map place it on the GPU.
# trust_remote_code=True is needed because the model ships its own code.
model = AutoModelForCausalLM.from_pretrained(
    # from https://huggingface.co/MILVLG/imp-v1-3b
    "imp-v1-3b",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "imp-v1-3b",
    trust_remote_code=True
)

# Single-turn prompt; <image> marks where the image is inserted.
text = f"A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\n{user_query}. ASSISTANT:"
image = load_image("https://user0514.cdnw.net/shared/img/thumb/IMARIB20160807451515_TP_V4.jpg")

# Tokenize the prompt and preprocess the image with the model's own helper.
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
image_tensor = model.image_preprocess(image).to(model.device)

# Generate up to 100 new tokens and take the first (only) sequence.
output_ids = model.generate(
    input_ids,
    max_new_tokens=100,
    images=image_tensor,
    use_cache=True
)[0]

# Drop the prompt tokens and decode only the generated answer.
print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
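To ask other questions about the same image, the prompt string from the script can be wrapped in a small helper. This is only a sketch: `SYSTEM_PROMPT` and `build_prompt` are names made up here for illustration and are not part of imp-v1-3b or Transformers. It assumes the single-turn format used above stays the same.

```python
# Hypothetical helper; these names are not part of imp-v1-3b or transformers.
SYSTEM_PROMPT = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)


def build_prompt(user_query: str) -> str:
    """Return the single-turn prompt string in the format used by the script."""
    query = user_query.strip()
    # The script appends a period after the query; skip that when the
    # query already ends with sentence punctuation.
    if not query.endswith((".", "?", "!")):
        query += "."
    return f"{SYSTEM_PROMPT} USER: <image>\n{query} ASSISTANT:"


text = build_prompt("Describe the person in the image")
```

The resulting `text` can be passed to the tokenizer exactly as in the script.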

PC environment

Windows 11
CUDA 11.8
Python 3.11

Python environment setup

These commands are for CUDA 11.8.
Diffusers is installed only so that its "load_image" can be used.

pip install torch==2.1.2+cu118 --index-url https://download.pytorch.org/whl/cu118
pip install diffusers[torch]
pip install transformers einops
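If installing Diffusers just for "load_image" feels heavy, a similar loader can probably be written with Pillow and the standard library. The following is a rough sketch, not something I ran with imp-v1-3b: `read_image_bytes` and `load_rgb_image` are hypothetical names, and it assumes Pillow is already installed.

```python
import io
import urllib.request


def read_image_bytes(source: str) -> bytes:
    """Return raw image bytes from an http(s) URL or a local file path."""
    if source.startswith(("http://", "https://")):
        with urllib.request.urlopen(source, timeout=30) as response:
            return response.read()
    with open(source, "rb") as f:
        return f.read()


def load_rgb_image(source: str):
    """Load an image as an RGB PIL image (hypothetical load_image stand-in)."""
    # Imported here so Pillow is only required when an image is decoded.
    from PIL import Image, ImageOps

    image = Image.open(io.BytesIO(read_image_bytes(source)))
    image = ImageOps.exif_transpose(image)  # apply the EXIF orientation, if any
    return image.convert("RGB")
```

With this, `image = load_rgb_image(url)` would take the place of the `load_image(url)` call. Some servers reject the default urllib user agent, in which case a custom request header would be needed.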

Note

If you want to do something similar in Japanese, see this post.
touch-sp.hatenablog.com
