japanese-clip-stair

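A CLIP model specialized for Japanese, trained on the STAIR Captions dataset.

Model Overview

This is a multimodal model that computes the similarity between images and text.

Image encoder: ResNet50
Text encoder: cl-tohoku/bert-base-japanese-v3
Training data: STAIR Captions
Embedding dimension: 512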

Required Libraries

pip install torch torchvision transformers pillow requests

Usage

Basic Usage Example

from transformers import AutoTokenizer, AutoModel
from PIL import Image
import torch
from torchvision import transforms
import requests
from io import BytesIO

# Load the model and tokenizer
model = AutoModel.from_pretrained("AoiNoGeso/japanese-clip-stair", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("AoiNoGeso/japanese-clip-stair")

# Image preprocessing: resize, convert to a tensor, normalize with ImageNet statistics
def preprocess_image(image, size=224):
    transform = transforms.Compose([
        transforms.Resize((size, size)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])
    if image.mode != 'RGB':
        image = image.convert('RGB')
    return transform(image).unsqueeze(0)

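# Prepare the image and the candidate texts
image_url = "https://images.pexels.com/photos/2253275/pexels-photo-2253275.jpeg"
response = requests.get(image_url, timeout=30)
response.raise_for_status()  # fail early if the download did not succeed
image = Image.open(BytesIO(response.content))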
pixel_values = preprocess_image(image)

texts = ["犬", "猫", "象", "鳥"]  # dog, cat, elephant, bird
text_inputs = tokenizer(texts, padding=True, return_tensors="pt")

# Run inference
with torch.no_grad():
    outputs = model(
        pixel_values=pixel_values,
        input_ids=text_inputs.input_ids,
        attention_mask=text_inputs.attention_mask
    )
    
    # Convert the image-to-text logits into probabilities
    probs = outputs['logits_per_image'].softmax(dim=-1)
    
    # Print the results
    for text, prob in zip(texts, probs[0]):
        print(f"{text}: {prob:.4f} ({prob*100:.2f}%)")
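
To report only the best-matching texts rather than every score, the probabilities can be ranked. The following is a minimal sketch, not part of this repository: top_labels is a hypothetical helper, and example_probs is a made-up stand-in for probs[0] from the example above.

```python
import torch

def top_labels(probs: torch.Tensor, labels: list[str], k: int = 2) -> list[tuple[str, float]]:
    """Return the k most probable (label, probability) pairs for one image.

    Hypothetical helper: `probs` is a 1-D tensor with one entry per label.
    """
    k = min(k, len(labels))  # never ask for more entries than exist
    values, indices = torch.topk(probs, k)
    return [(labels[i], v) for v, i in zip(values.tolist(), indices.tolist())]

# Made-up probabilities standing in for probs[0]
example_probs = torch.tensor([0.10, 0.70, 0.15, 0.05])
example_labels = ["犬", "猫", "象", "鳥"]

ranked = top_labels(example_probs, example_labels, k=2)
for label, p in ranked:
    print(f"{label}: {p:.4f}")
```

With real model output, pass probs[0] and texts in place of the stand-in values.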

Retrieving Features Separately

with torch.no_grad():
    # Get only the image features
    image_features = model.get_image_features(pixel_values)
    
    # Get only the text features
    text_features = model.get_text_features(
        text_inputs.input_ids, 
        text_inputs.attention_mask
    )
    
    # Compute the similarity manually
    similarity = torch.matmul(image_features, text_features.T)
    probs = similarity.softmax(dim=-1)
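
The manual computation above takes a plain dot product. CLIP-style models usually compare L2-normalized features (cosine similarity) and scale the result by a temperature before the softmax. Whether get_image_features and get_text_features already return normalized vectors is not stated here, so the sketch below normalizes explicitly. The cosine_probs helper and the logit_scale value of 100.0 are assumptions for illustration, not values read from this model, and random tensors stand in for real features.

```python
import torch
import torch.nn.functional as F

def cosine_probs(image_features: torch.Tensor,
                 text_features: torch.Tensor,
                 logit_scale: float = 100.0) -> torch.Tensor:
    """Softmax over scaled cosine similarities (hypothetical helper).

    image_features: (num_images, dim), text_features: (num_texts, dim).
    logit_scale is an assumed temperature, not this model's learned value.
    """
    image_features = F.normalize(image_features, dim=-1)  # unit length per row
    text_features = F.normalize(text_features, dim=-1)
    logits = logit_scale * image_features @ text_features.T
    return logits.softmax(dim=-1)  # each row sums to 1 over the texts

# Random stand-ins with the 512-dimensional embedding size listed above
torch.manual_seed(0)
fake_image_features = torch.randn(1, 512)
fake_text_features = torch.randn(4, 512)

demo_probs = cosine_probs(fake_image_features, fake_text_features)
print(demo_probs.shape)
```

Because of the normalization, rescaling the features leaves the probabilities unchanged; only the direction of each embedding matters.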

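Model Performance

The model was trained on the STAIR Captions dataset and is optimized for Japanese image-caption tasks.

Limitations

Images are resized to 224x224 before encoding
The model is optimized for Japanese text
PyTorch and torchvision are required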

License

Apache 2.0

Citation

@dataset{stair_captions,
  title={STAIR Captions: Constructing a Large-Scale Japanese Image Caption Dataset},
  author={Yoshikawa, Yuya and Shigeto, Yutaro and Takeuchi, Akikazu},
  year={2017}
}

Examples

See usage_example.py for detailed usage examples.

Troubleshooting

KeyError: 'japanese-clip'

If you hit this error, update Transformers to the latest version with the following command:

pip install --upgrade transformers

If that does not resolve it, pass the trust_remote_code=True parameter when loading the model:

model = AutoModel.from_pretrained("AoiNoGeso/japanese-clip-stair", trust_remote_code=True)