Image search in Japanese with Recruit's "japanese-clip-vit-b-32-roberta-base"

Continuing from the previous post, I am again using "japanese-clip-vit-b-32-roberta-base", the model released by Recruit.
touch-sp.hatenablog.com
This time I tried image search with Japanese queries.

I did the same thing more than a year ago with OpenAI's CLIP.
touch-sp.hatenablog.com

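Images used

The images come from the "Dogs vs Cats" dataset on Kaggle.

I used only the test data, but that alone contains 12,500 images, so I narrowed it down to 200 of them.

Dog and cat images are mixed together.

I placed the photos in a folder named "animal_images".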

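Results

This is the answer I got when I asked the script to extract photos matching "芝生の上にいる犬" (a dog on the grass) from the 200 images. The first line of the output means "8 photos were extracted".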

8枚の写真が抽出されました
animal_images/4.jpg
animal_images/18.jpg
animal_images/42.jpg
animal_images/65.jpg
animal_images/73.jpg
animal_images/95.jpg
animal_images/110.jpg
animal_images/167.jpg

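Here are the extracted images.

One borderline image is mixed in, but overall the extraction seems to work well.

That said, many matching images were missed.

Lowering the threshold would reduce the misses, but it would also let in images that should not be extracted.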
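The trade-off is easier to see by counting how many images pass at several thresholds. The following is only a minimal sketch: the scores are random numbers standing in for the real similarities (they are not the results from this post), and `count_above` is a hypothetical helper name.

```python
import numpy as np


def count_above(similarity, thresholds):
    """Return {threshold: number of scores strictly above it}."""
    scores = np.asarray(similarity).ravel()
    return {t: int((scores > t).sum()) for t in thresholds}


# Synthetic scores standing in for the 200 image/text similarities (assumption).
rng = np.random.default_rng(0)
fake_similarity = rng.uniform(0.1, 0.5, size=200)

counts = count_above(fake_similarity, [0.30, 0.35, 0.40, 0.45])
for t, n in counts.items():
    print(f"threshold {t:.2f}: {n} images")
```

With real similarities, looking at the images that appear or disappear between two neighbouring thresholds is a simple way to pick a value.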

Python scripts

Step 1

I first converted all 200 images into vectors and saved them with numpy's savez_compressed.

import torch
import numpy as np
from transformers import AutoModel, CLIPImageProcessor
from diffusers.utils import load_image
from pathlib import Path
from tqdm import tqdm

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "recruit-jp/japanese-clip-vit-b-32-roberta-base"
model = AutoModel.from_pretrained(model_name, trust_remote_code=True).to(device)

# Image preprocessing settings are loaded from the LAION CLIP ViT-B/32 checkpoint
image_processor = CLIPImageProcessor.from_pretrained("laion/CLIP-ViT-B-32-laion2B-s34B-b79K")

# animal_images/1.jpg ... animal_images/200.jpg
image_paths = [Path("animal_images", f"{x}.jpg").as_posix() for x in range(1, 201)]

image_count = len(image_paths)
batch = 10
loop_count = (image_count - 1) // batch + 1  # ceiling division

print(f"batch size: {batch}")
image_features_list = []
for i in tqdm(range(loop_count)):
    first_index = i * batch
    last_index = min(first_index + batch, image_count)
    pilimages_list = [load_image(x) for x in image_paths[first_index:last_index]]
    images = image_processor(pilimages_list, return_tensors="pt").pixel_values.to(device)
    with torch.inference_mode():
        image_features = model.get_image_features(images)
        # L2-normalize so that a dot product later gives cosine similarity
        image_features /= image_features.norm(dim=-1, keepdim=True)
        image_features_list.append(image_features.cpu().numpy())

image_features = np.concatenate(image_features_list, axis=0)

# Written to imageData.npz (numpy appends the .npz extension)
np.savez_compressed(
    'imageData',
    image_paths=np.array(image_paths),
    image_features=image_features
)
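Before moving on, it can be worth checking that the archive reloads with the same paths and vectors. The sketch below does that round trip with small random unit vectors in place of real image features; the helper name `save_and_reload`, the 4 x 8 array shape, and the temporary directory are all assumptions for illustration.

```python
import tempfile
from pathlib import Path

import numpy as np


def save_and_reload(paths, features, stem):
    """Save paths and features with savez_compressed, then load them back."""
    np.savez_compressed(stem, image_paths=np.array(paths), image_features=features)
    # savez_compressed appends ".npz" when the name does not already end with it
    with np.load(f"{stem}.npz") as npz:
        return npz["image_paths"], npz["image_features"]


# Random unit vectors standing in for real image features (assumption).
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 8)).astype(np.float32)
features /= np.linalg.norm(features, axis=-1, keepdims=True)
paths = [f"animal_images/{i}.jpg" for i in range(1, 5)]

with tempfile.TemporaryDirectory() as tmp:
    stem = str(Path(tmp) / "imageData")
    loaded_paths, loaded_features = save_and_reload(paths, features, stem)

print(loaded_paths.shape, loaded_features.shape)
```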

Step 2

This step loads the saved image vectors and performs the extraction.

import torch
import numpy as np
from transformers import AutoTokenizer, AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "recruit-jp/japanese-clip-vit-b-32-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True).to(device)

input_text = "芝生の上にいる犬"  # "a dog on the grass"
# [CLS] is prepended by hand, so automatic special tokens are disabled
text = tokenizer(
    text=["[CLS]" + input_text],
    max_length=77,
    padding="max_length",
    truncation=True,
    add_special_tokens=False,
    return_tensors="pt"
).input_ids.to(device)

with torch.inference_mode():
    text_features = model.get_text_features(input_ids=text)
    # L2-normalize, same as the image vectors
    text_features /= text_features.norm(dim=-1, keepdim=True)
    text_features = text_features.cpu().numpy()

npz = np.load('imageData.npz')
image_paths = npz['image_paths']
image_features = npz['image_features']

# Both sides are unit vectors, so the dot product is the cosine similarity
similarity = image_features @ text_features.T

# Keep only the images whose similarity exceeds the threshold of 0.40
result = list(image_paths[similarity.squeeze() > 0.40])

print(f"{len(result)}枚の写真が抽出されました")  # "N photos were extracted"
print('\n'.join(result))
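A fixed threshold is not the only way to pick results. An alternative, which is not what the script above does, is to sort by similarity and take the top k. Here is a minimal sketch with toy 2-D unit vectors instead of real features; `top_k` is a hypothetical helper name.

```python
import numpy as np


def top_k(image_paths, image_features, text_features, k=5):
    """Return the k (path, score) pairs with the highest similarity."""
    similarity = (image_features @ text_features.T).squeeze(-1)
    order = np.argsort(similarity)[::-1][:k]  # indices sorted by descending score
    return [(str(image_paths[i]), float(similarity[i])) for i in order]


# Toy unit vectors standing in for the saved features (assumption).
image_paths = np.array(["a.jpg", "b.jpg", "c.jpg"])
image_features = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
text_features = np.array([[0.0, 1.0]])

for path, score in top_k(image_paths, image_features, text_features, k=2):
    print(path, round(score, 2))
```

Top-k always returns something, even when nothing matches well, so it suits browsing; a threshold suits cases where "no match" is a valid answer.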

PC environment

Windows 11
Python 3.11

The scripts also run without a GPU.

Python environment setup

These commands are for CUDA 11.8.
Diffusers is installed only so that its "load_image" function can be used.

pip install torch==2.1.2+cu118 --index-url https://download.pytorch.org/whl/cu118
pip install diffusers[torch]
pip install transformers protobuf sentencepiece
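If installing Diffusers just for "load_image" feels heavy, Pillow alone can do roughly the same job for local files. The following is a sketch under that assumption (local paths only, whereas load_image also accepts URLs); `load_rgb_image` is a hypothetical name, not part of any library.

```python
from PIL import Image, ImageOps


def load_rgb_image(path):
    """Open a local image file, apply its EXIF orientation, and convert to RGB."""
    with Image.open(path) as img:
        img = ImageOps.exif_transpose(img)
        # convert() returns a new, fully loaded image, so it is safe to
        # use after the file has been closed
        return img.convert("RGB")
```

In Step 1, `load_image(x)` would then become `load_rgb_image(x)`, and the Diffusers install line could be replaced with `pip install pillow`.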
