RocqStar Ranker Embedder

A self-attentive embedding model for premise and proof selection in the Rocq interactive theorem prover. RocqStar is fine-tuned from CodeBERT so that distances in the embedding space correlate with proof similarity rather than with surface-level statement overlap. It replaces BM25/Jaccard similarity in retrieval-augmented generation (RAG) pipelines such as CoqPilot, leading to higher proof success rates on the IMM-300 benchmark.

Model Details

Model Description

RocqStar is a 125 M‑parameter Transformer encoder (768‑dim hidden size) with multi‑head self‑attention and a learned self‑attentive pooling head. It is trained with an InfoNCE contrastive objective so that the cosine similarity of two statement embeddings approximates the similarity of their proofs, measured by a hybrid Levenshtein + Jaccard distance. The model takes tokenised Rocq (Gallina) theorem statements as input and outputs a 768‑d embedding.

  • Model type: Transformer encoder with self‑attentive pooling
  • Language(s): Rocq / Coq (Gallina) syntax (tokens)
  • License: Apache‑2.0
  • Fine‑tuned from: microsoft/codebert-base
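
The supervision signal is the proof-level distance mentioned above. Below is a minimal sketch of such a hybrid Levenshtein + Jaccard distance; the exact tokenisation and weighting used to build the training pairs are not stated on this card, so the whitespace tokenisation and the 50/50 mix are assumptions.

```python
# Hedged sketch of a hybrid proof distance (normalised Levenshtein + Jaccard).
# Function names, the alpha weight, and whitespace tokenisation are assumptions.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def proof_distance(proof_a: str, proof_b: str, alpha: float = 0.5) -> float:
    """Blend normalised edit distance with Jaccard distance over proof tokens."""
    lev = levenshtein(proof_a, proof_b) / max(len(proof_a), len(proof_b), 1)
    ta, tb = set(proof_a.split()), set(proof_b.split())
    jac = 1.0 - len(ta & tb) / max(len(ta | tb), 1)
    return alpha * lev + (1.0 - alpha) * jac
```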

Model Sources

  • Repository: https://github.com/JetBrains-Research/big-rocq (see the ranker-server subdirectory)
  • Hugging Face model repo: JetBrains-Research/rocq-language-theorem-embeddings

How to Get Started

The inference server lives in the ranker-server subdirectory of https://github.com/JetBrains-Research/big-rocq. Run it to use the model from the CoqPilot plugin; its code also serves as a reference example of how to run inference with the model.
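
For quick experiments outside the server, the checkpoint can also be loaded directly with the transformers library. The sketch below assumes the checkpoint loads as a plain RoBERTa-style encoder; masked mean pooling stands in for the learned self-attentive pooling head (which is implemented in the ranker-server), so similarities may differ slightly from what the server returns.

```python
# Rough inference sketch; mean pooling is a stand-in for the learned
# self-attentive pooling head, which is served by the ranker-server.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "JetBrains-Research/rocq-language-theorem-embeddings"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

def embed(statement: str) -> torch.Tensor:
    """Return a 768-d embedding for one Rocq theorem statement."""
    enc = tokenizer(statement, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state        # (1, seq_len, 768)
    mask = enc["attention_mask"].unsqueeze(-1)          # (1, seq_len, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # masked mean pooling
    return torch.nn.functional.normalize(pooled, dim=-1).squeeze(0)

a = embed("Lemma plus_comm : forall n m : nat, n + m = m + n.")
b = embed("Lemma mult_comm : forall n m : nat, n * m = m * n.")
print("cosine similarity:", torch.dot(a, b).item())
```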

Training Details

Training Data

  • Corpus: 76 524 statement–proof pairs mined from 4 large Rocq projects.
  • Positive/negative mining: pairs with proof distance < τ₊ = 0.35 are positives; pairs with distance > τ₋ = 0.6 are negatives, with 4 hard negatives per anchor (sketched below).
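
A hedged sketch of this mining rule follows. The card only gives the two thresholds and the number of hard negatives; treating "hard" negatives as the negatives whose proofs are closest to the anchor's, and the record layout, are assumptions.

```python
# Hypothetical mining sketch; `anchor` and corpus items are assumed to be
# records with a `.proof` attribute, and `distance` is the proof distance.
POS_TAU = 0.35        # proof distance below this -> positive pair
NEG_TAU = 0.60        # proof distance above this -> negative pair
N_HARD_NEGATIVES = 4  # hard negatives kept per anchor

def mine_pairs(anchor, corpus, distance):
    """Return (positives, hard_negatives) for one anchor statement."""
    positives, negatives = [], []
    for other in corpus:
        if other is anchor:
            continue
        d = distance(anchor.proof, other.proof)
        if d < POS_TAU:
            positives.append(other)
        elif d > NEG_TAU:
            negatives.append((d, other))
    # "Hard" negatives: the negatives whose proofs are closest to the anchor's.
    negatives.sort(key=lambda pair: pair[0])
    return positives, [other for _, other in negatives[:N_HARD_NEGATIVES]]
```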

Training Procedure

  • Objective: InfoNCE contrastive loss (sketched after this list)
  • Batch size: 32
  • Optimizer / LR: AdamW, lr = 4e‑6, linear warm‑up 10 %, 22k steps
  • Hardware: 1× NVIDIA H100 GPU, 160 GB RAM, 14 h wall‑clock
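
A minimal sketch of the InfoNCE objective on statement embeddings is given below. The card only states the objective, batch size, and optimizer; the temperature value and the use of in-batch negatives are assumptions.

```python
# Hedged InfoNCE sketch with in-batch negatives; temperature is an assumption.
import torch
import torch.nn.functional as F

def info_nce_loss(anchor_emb, positive_emb, temperature: float = 0.07):
    """anchor_emb, positive_emb: (batch, dim) L2-normalised embeddings.
    Each anchor's positive is the matching row; other rows act as negatives."""
    logits = anchor_emb @ positive_emb.T / temperature      # (batch, batch) similarity logits
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

# Example with random embeddings standing in for model outputs:
a = F.normalize(torch.randn(32, 768), dim=-1)
p = F.normalize(torch.randn(32, 768), dim=-1)
print(info_nce_loss(a, p).item())
```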

Evaluation

Testing Data, Factors & Metrics

  • Dataset: IMM‑300 (300 Rocq theorems) from the IMM project
  • Metrics: Downstream proof success rate of CoqPilot when given top‑7 retrieved premises; averaged over 12 generations.

Results

| Model (back-end) | Bucket (proof length) | Jaccard baseline | RocqStar |
|---|---|---|---|
| GPT-4o | ≤ 4 | 48 ± 5 % | 51 ± 5 % |
| GPT-4o | 5–8 | 18 ± 4 % | 25 ± 3 % |
| GPT-4o | 9–20 | 11 ± 4 % | 11 ± 5 % |
| Claude 3.5 | ≤ 4 | 58 ± 5 % | 61 ± 4 % |
| Claude 3.5 | 5–8 | 28 ± 5 % | 36 ± 5 % |
| Claude 3.5 | 9–20 | 16 ± 5 % | 21 ± 5 % |

Summary

RocqStar delivers consistent gains of up to 28 % relative improvement over Jaccard-index-based retrieval, with the largest gains on medium-length theorems, where proof similarity diverges most from statement similarity.
