# Multimodal ViT-L14
## 📘 Model Introduction
This repository provides a multimodal model that automatically scores mathematics homework by leveraging both visual information (screenshots of problems and student responses) and textual information (problem descriptions and student answers). The model outputs embeddings suitable for retrieval tasks and also provides a regression-based scoring mechanism.
The model consists of two main components:
| Component | Description |
|---|---|
| Base Encoder | ViT-L-14 from OpenCLIP |
| Scorer (Regression Head) | A fully connected network (3 layers with ReLU activations and dropout); a sketch follows the table |
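The released scorer ships as a TorchScript module (see the inference example below), so its exact layer sizes are not visible here. As a rough sketch, a 3-layer head of this kind could look as follows; the hidden widths (512, 128) and dropout rate are illustrative assumptions, not the checkpoint's actual hyper-parameters:

```python
import torch.nn as nn

class RegressionHead(nn.Module):
    """Illustrative 3-layer MLP scorer; hidden sizes and dropout are assumed."""
    def __init__(self, embed_dim: int = 768, p_drop: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 512), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(512, 128), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(128, 1),  # scalar score; targets lie in [0, 1]
        )

    def forward(self, x):
        return self.net(x)
```

The input width of 768 matches the embedding dimension of OpenCLIP's ViT-L-14.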
A structured textual prompt combining problem metadata with the student's answer can be constructed using the following Python function:
```python
import pandas as pd

def build_prompt(r: pd.Series) -> str:
    # Truncate each field so the combined prompt fits CLIP's context window.
    return (f"1.{str(r.get('skill_name', ''))[:50]} "
            f"2.{str(r.get('grade level', ''))[:20]} "
            f"3.{str(r.get('problem_body', ''))[:300]} "
            f"4.{str(r.get('student_response', ''))[:200]}")
```
During inference, the model first uses CLIP to extract embeddings from both images and text. These embeddings can be employed for retrieval tasks by calculating cosine similarity with historical examples, inspired by the methodology described in:
Li, H., Xing, W., Zhu, W., Li, C., Lyu, B., Liu, Z., & Heffernan, N. (2025). Leveraging multi-modality and collaborative filtering for supporting automatic scoring in mathematics education. Proceedings of the 26th International Conference on Artificial Intelligence in Education.
The model also provides a simple scoring mechanism: image embeddings are passed through the regression head to produce scores. This fusion of retrieval and scoring enables a flexible inference pipeline; future work could use the scoring head as a supervised signal to optimize the retrieval embeddings.
In summary:
- Cosine similarity can be used for nearest-neighbor retrieval or for building an inverted index. For example, the grades of the most similar past student assignments can inform the grading of a current assignment.
- The regression score provides an automated evaluation of student answers, suitable as a reference for teachers. The two signals can be combined as needed, as sketched below.
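As a sketch of how the two signals could be combined, assuming a bank of L2-normalised embeddings and teacher scores for past assignments (the helper names, `k`, and the mixing weight `alpha` are illustrative, not part of the released pipeline):

```python
import torch

def knn_score(query_emb: torch.Tensor,
              bank_embs: torch.Tensor,
              bank_scores: torch.Tensor,
              k: int = 5) -> torch.Tensor:
    """Estimate a score from the teacher scores of the k most similar
    historical assignments (all embeddings assumed L2-normalised)."""
    sims = bank_embs @ query_emb              # (N,) cosine similarities
    top_sim, top_idx = sims.topk(k)
    weights = torch.softmax(top_sim, dim=0)   # similarity-weighted average
    return (weights * bank_scores[top_idx]).sum()

def combined_score(query_emb, bank_embs, bank_scores, reg_score, alpha=0.5):
    # Blend retrieval-based and regression-based scores; alpha is a free choice.
    return alpha * knn_score(query_emb, bank_embs, bank_scores) + (1 - alpha) * reg_score
```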
## 🗂️ Dataset and Metrics
- Data source: a multimodal mathematics dataset containing screenshots, problem texts, answers, and teacher-assigned scores (floating-point values between 0 and 1).
- Training objective: minimizing mean squared error (PyTorch `MSELoss`).
- Final result: MSE = 0.0837 on the training set.
- Fine-tuning: parameter-efficient fine-tuning with LoRA and the AdamW optimizer; a minimal training sketch follows.
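The training script itself is not included in this repository. The following is a minimal sketch of the regression objective only (MSE on the scorer's outputs, optimized with AdamW), using the `RegressionHead` sketched earlier; the LoRA adapters on the CLIP backbone are omitted, and `loader` is assumed to yield `(image_embedding, teacher_score)` batches:

```python
import torch

head = RegressionHead()                      # illustrative head from the sketch above
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)  # lr is an assumption
loss_fn = torch.nn.MSELoss()

head.train()
for img_emb, target in loader:               # teacher scores in [0, 1]
    pred = head(img_emb).squeeze(-1)
    loss = loss_fn(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```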
## 🛠️ Inference Example
```python
import torch, open_clip
from PIL import Image

# ---------- 0. Paths ----------
root_dir = "./"
clip_ckpt = f"{root_dir}/full_clip.pt"
reg_ts = f"{root_dir}/reg_head.ts"
pp_ckpt = f"{root_dir}/preprocess.pt"
image_paths = ["path_to_image1", "path_to_image2"]
text_prompts = ["prompt_q+ans_1", "prompt_q+ans_2"]
device = "cuda" if torch.cuda.is_available() else "cpu"

# ---------- 1. CLIP Backbone ----------
clip_name = "ViT-L-14"
clip = open_clip.create_model(clip_name, device=device, pretrained=None)
clip.load_state_dict(torch.load(clip_ckpt, map_location=device), strict=False)
clip.eval(); clip.requires_grad_(False)

# ---------- 2. Regression Head ----------
reg_head = torch.jit.load(reg_ts, map_location=device)
reg_head.eval()

# ---------- 3. Pre-processing ----------
preprocess = torch.load(pp_ckpt, weights_only=False)

# ---------- 4. Inference (batch size = 2) ----------
# 4-1. Build image & text batches
imgs = torch.stack(
    [preprocess(Image.open(p).convert("RGB")) for p in image_paths]
).to(device)                                                              # (2, 3, H, W)
toks = open_clip.tokenize(
    text_prompts, context_length=clip.context_length
).to(device)                                                              # (2, ctx_len)

with torch.no_grad():
    # ----- Encode -----
    img_emb = torch.nn.functional.normalize(clip.encode_image(imgs), dim=1)  # (2, D)
    txt_emb = torch.nn.functional.normalize(clip.encode_text(toks), dim=1)   # (2, D)

    # ----- Fuse image & text into one embedding per sample -----
    # Here we simply average the two L2-normalised vectors, then renormalise.
    fused_emb = torch.nn.functional.normalize(img_emb + txt_emb, dim=1)      # (2, D)

    # ----- Similarity between the two fused samples -----
    # fused_emb[0] · fused_emb[1] (equivalent to (fused_emb @ fused_emb.T)[0, 1])
    pair_sim = (fused_emb[0] * fused_emb[1]).sum().item()

    # ----- Regression scores -----
    scores = reg_head(img_emb).squeeze(-1)                                   # (2,)

# ---------- 5. Output ----------
print(f"Cosine similarity between sample-1 and sample-2: {pair_sim:.4f}\n")
print(f"[Sample 1] Regression score: {scores[0]:.4f}")
print(f"[Sample 2] Regression score: {scores[1]:.4f}")
```
## 🤝 Citation and License
- The model is provided for research and educational use only; for citation, see Li et al. (2025) above.