I tried image captioning with "Versatile-Diffusion", but the accuracy was underwhelming

Introduction
Environment
Method
Results

Introduction

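Image-generation AI such as "Stable Diffusion" has been getting a lot of attention lately. These models generate images from text (prompts, often called "spells").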

"Versatile-Diffusion" can reportedly do the reverse: generate text from an image.

So I went ahead and tried image captioning with "Versatile-Diffusion".


Environment

Windows 11
CUDA 11.6.2
Python 3.9.13

Setting up the Python environment takes the following two lines.

pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
pip install -r https://raw.githubusercontent.com/SHI-Labs/Versatile-Diffusion/master/requirement.txt
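After installing, it can be worth confirming that the interpreter actually sees the expected PyTorch build and a CUDA device. The following is a minimal sketch, not part of the Versatile-Diffusion repository; the function name environment_report is hypothetical, and it simply reports None for the PyTorch fields if torch is not installed.

```python
import importlib.util
import platform


def environment_report():
    """Collect the Python version and, if torch is installed, its version and CUDA status."""
    report = {
        "python": platform.python_version(),
        "torch": None,
        "cuda_available": None,
    }
    # Import torch only when it is present, so the check itself never crashes.
    if importlib.util.find_spec("torch") is not None:
        import torch

        report["torch"] = torch.__version__
        report["cuda_available"] = torch.cuda.is_available()
    return report
```

Calling print(environment_report()) in the environment described above should show a torch version matching the pip command and, if the GPU is visible, cuda_available set to True.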

Method

Download the repository

git clone https://github.com/SHI-Labs/Versatile-Diffusion.git
cd Versatile-Diffusion

Create the folders

Create two new folders, "pretrained" and "log", inside the cloned repository.
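The two folders can be created by hand, or with a few lines of Python that behave the same on Windows and other systems. This is only a sketch; the helper name make_folders is hypothetical, and it assumes it is called with the repository root as the working directory (or with the root passed explicitly).

```python
from pathlib import Path


def make_folders(repo_root=".", names=("pretrained", "log")):
    """Create each named folder under repo_root; existing folders are left untouched."""
    created = []
    for name in names:
        folder = Path(repo_root) / name
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

Running make_folders() from inside the Versatile-Diffusion directory creates both folders, and running it again is harmless.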

Download the pretrained parameters

Download the parameters from here and save them in the "pretrained" folder.

Run

python inference.py --gpu 0 --app image-to-text --image assets/boy_and_girl.jpg --seed 0 --nsample 1 --fp16
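To repeat this command with several different seeds, a small wrapper along the following lines could be used. It is a sketch rather than anything provided by the repository: the names build_command and run_seeds are hypothetical, the flags are copied from the command above, and it assumes the script is run from the repository root with the parameters already in place.

```python
import subprocess
import sys


def build_command(seed, image="assets/boy_and_girl.jpg"):
    """Return the inference command from this post as an argument list for the given seed."""
    return [
        sys.executable, "inference.py",
        "--gpu", "0",
        "--app", "image-to-text",
        "--image", image,
        "--seed", str(seed),
        "--nsample", "1",
        "--fp16",
    ]


def run_seeds(seeds):
    """Run inference once per seed; raises if any run exits with an error."""
    for seed in seeds:
        print(f"--- seed {seed} ---")
        subprocess.run(build_command(seed), check=True)
```

For example, run_seeds(range(5)) would run seeds 0 through 4 one after another; that range is only an illustration, not the set of seeds used in this post.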

Results

I ran it with a variety of seed values. The generated captions were:

children helping their child with a kids
two girls and a boy standing
two young girls standing on a star
four girls playing a star behind a child
two girls and a boy standing on a sky

Unfortunately, the results are underwhelming.