MMDetection can now do Image Captioning

Environment

Ubuntu 22.04 on WSL2
Python 3.10
CUDA 11.8

pip install torch==2.0.1+cu118 torchvision==0.15.2+cu118 --index-url https://download.pytorch.org/whl/cu118
pip install openmim==0.3.9
mim install mmcv==2.0.1
mim install mmdet[multimodal]==3.1.0
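
Before moving on, it may be worth confirming that the pinned versions were actually installed. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of MMDetection or OpenMIM.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as used in the pip/mim commands above.
PACKAGES = ("torch", "torchvision", "openmim", "mmcv", "mmdet")


def installed_versions(packages=PACKAGES):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'NOT INSTALLED'}")
```

If any entry is reported as not installed, or differs from the versions listed above, re-running the corresponding install command is probably the first thing to try.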

Running it

Clone the repository

git clone https://github.com/open-mmlab/mmdetection
cd mmdetection

Download the model

wget https://download.openmmlab.com/mmdetection/v3.0/xdecoder/xdecoder_focalt_last_novg.pt

Download a test image

wget https://raw.githubusercontent.com/SHI-Labs/Versatile-Diffusion/master/assets/demo/reg_example/boy_and_girl.jpg
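
If `wget` is not available, the two downloads above can also be done from Python. This is only a sketch using the standard library; `filename_from_url` and `download` are hypothetical helper names, and the URLs are the same ones used in the `wget` commands.

```python
import os
import urllib.request
from urllib.parse import urlparse

MODEL_URL = "https://download.openmmlab.com/mmdetection/v3.0/xdecoder/xdecoder_focalt_last_novg.pt"
IMAGE_URL = "https://raw.githubusercontent.com/SHI-Labs/Versatile-Diffusion/master/assets/demo/reg_example/boy_and_girl.jpg"


def filename_from_url(url):
    """Return the last path component of the URL, used as the local file name."""
    return os.path.basename(urlparse(url).path)


def download(url, dest_dir="."):
    """Download url into dest_dir unless a file with that name already exists."""
    path = os.path.join(dest_dir, filename_from_url(url))
    if not os.path.exists(path):
        urllib.request.urlretrieve(url, path)
    return path


if __name__ == "__main__":
    for url in (MODEL_URL, IMAGE_URL):
        print(download(url))
```

Run it from inside the `mmdetection` directory so the files end up where the later command expects them.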

Write the script

import numpy as np
from argparse import ArgumentParser
from mmdet.apis.det_inferencer import DetInferencer, InputsType, PredType
from typing import List, Union

class ImageCaptionInferencer(DetInferencer):
    """DetInferencer variant that prints captions instead of drawing results."""

    def visualize(self,
                  inputs: InputsType,
                  preds: PredType,
                  show: bool = False,
                  wait_time: int = 0,
                  draw_pred: bool = True,
                  pred_score_thr: float = 0.3,
                  **kwargs) -> Union[List[np.ndarray], None]:
        # Override the default visualization: print each predicted caption
        # and return None, so nothing is drawn.
        for pred in preds:
            print(pred.pred_caption)

def parse_args():
    parser = ArgumentParser()
    parser.add_argument('inputs', type=str, help='Input image file or folder path.')
    parser.add_argument('model', type=str, help='Config file name')
    parser.add_argument('--weights', type=str, help='Checkpoint file')
    parser.add_argument('--device', type=str, default='cuda:0', help='Device used for inference')

    call_args = vars(parser.parse_args())

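    # Keys consumed by the inferencer constructor; the remaining keys
    # (here only 'inputs') are passed when the inferencer is called.
    init_kws = ['model', 'weights', 'device']

    init_args = {}
    for init_kw in init_kws:
        init_args[init_kw] = call_args.pop(init_kw)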

    init_args['palette'] = None

    return init_args, call_args

def main():
    init_args, call_args = parse_args()

    inferencer = ImageCaptionInferencer(**init_args)

    inferencer(**call_args)

if __name__ == '__main__':
    main()

Save the script above as "demo.py".
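
To make the argument handling in `parse_args` easier to follow, here is a standalone sketch of the same splitting step that needs no MMDetection install. The function name `split_args` is hypothetical, the argument list is passed in explicitly instead of being read from the command line, and the `palette` entry from the script is left out for brevity.

```python
from argparse import ArgumentParser


def split_args(argv):
    """Parse argv and split it into constructor args and call-time args."""
    parser = ArgumentParser()
    parser.add_argument('inputs', type=str)
    parser.add_argument('model', type=str)
    parser.add_argument('--weights', type=str)
    parser.add_argument('--device', type=str, default='cuda:0')

    call_args = vars(parser.parse_args(argv))
    # Move constructor-only keys out; whatever remains is passed at call time.
    init_args = {key: call_args.pop(key) for key in ('model', 'weights', 'device')}
    return init_args, call_args


init_args, call_args = split_args([
    'boy_and_girl.jpg',
    'projects/XDecoder/configs/xdecoder-tiny_zeroshot_caption_coco2014.py',
    '--weights', 'xdecoder_focalt_last_novg.pt',
])
print(init_args)
print(call_args)
```

With these arguments, `init_args` holds the config path, the checkpoint path and the default device `cuda:0`, while `call_args` holds only `inputs`. That is why the script can pass the two dictionaries directly to the constructor and to the call.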

Run the script

python demo.py \
  boy_and_girl.jpg \
  projects/XDecoder/configs/xdecoder-tiny_zeroshot_caption_coco2014.py \
  --weights xdecoder_focalt_last_novg.pt

Result

children sitting on the ground and watching a starry sky

Other articles on Image Captioning

touch-sp.hatenablog.com
