Node-JEPA V2 — Heterogeneous Code Graph World Model

A 427K-parameter graph world model with 10 heterogeneous edge types (AST + CFG + DFG) that observes, models, and edits source code through an observe → think → act loop.


What's New in V2

| Feature | V1 | V2 | Impact |
|---|---|---|---|
| Edge types | 5 (child, next, calls, control, refers_to) | 10 (+ data_flow, computed_from, returns_to, scope, guarded_by) | +5 semantic edge types from GraphCodeBERT/FA-AST/GREPO |
| Edge handling | Scalar edge-type ID | One-hot [E, 10] encoding + edge-type-aware attention | GNN sees different weights per relationship type |
| Data flow | ❌ None | ✅ Variable def → use edges | "Where does the value come from?" (GraphCodeBERT's key insight) |
| Control flow | Basic (if→body) | + returns_to, guarded_by, scope | Full CFG: which statements gate which, return paths |
| GNN attention | Type-agnostic | Edge-type attention bias + value modulation | From GREPO: attention > GCN; removing any edge type hurts |
| [REG] node | Only in GNNEncoder.forward() | Also in DomainConditionalEncoder | Fixes NaN on disconnected subgraph components |
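The one-hot encoding plus attention bias can be sketched in a few lines. This is a minimal numpy illustration rather than the project's actual GNN code; `edge_type_logits`, `a`, and `type_bias` are illustrative names:

```python
import numpy as np

NUM_EDGE_TYPES = 10

def edge_type_logits(h_src, h_dst, edge_type_ids, a, type_bias):
    """GAT-style attention logits with an edge-type-aware bias.

    h_src, h_dst  : [E, D] features of each edge's source/target node
    edge_type_ids : [E] integer edge-type IDs in [0, 10)
    a             : [2*D] attention vector
    type_bias     : [10] learned scalar bias, one per edge type
    """
    one_hot = np.eye(NUM_EDGE_TYPES)[edge_type_ids]      # [E, 10] one-hot encoding
    logits = np.concatenate([h_src, h_dst], axis=1) @ a  # type-agnostic V1-style logit
    # V2: shift each logit by a learned amount for its edge type
    # (a real GAT then softmax-normalizes these per destination node)
    return logits + one_hot @ type_bias
```

With this shape, two edges connecting identical node features still get different attention weights when their edge types differ, which is the point of the V2 change.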

V2 Edge Type Distribution (87K-node code graph)

| Edge Type | Count | % | Source |
|---|---:|---:|---|
| ast_child | 84,886 | 85.7% | Tree-sitter AST parent→child |
| next_stmt | 6,426 | 6.5% | Sequential statement order |
| calls | 3,113 | 3.1% | Function call → definition |
| guarded_by | 1,708 | 1.7% | Statement → gating condition |
| control_flow | 1,087 | 1.1% | Control structure → body/branch |
| data_flow | 553 | 0.6% | Variable definition → use |
| computed_from | 553 | 0.6% | Reverse data flow |
| scope | 323 | 0.3% | Definition → enclosing scope |
| returns_to | 283 | 0.3% | Return statement → function |
| refers_to | 137 | 0.1% | Cross-file references |
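The percentages above follow directly from the raw counts; reproducing them is a one-liner and doubles as a consistency check (the counts sum to 99,069, matching the ~99K-edge graph):

```python
# Edge counts copied from the table above.
edge_counts = {
    "ast_child": 84_886, "next_stmt": 6_426, "calls": 3_113,
    "guarded_by": 1_708, "control_flow": 1_087, "data_flow": 553,
    "computed_from": 553, "scope": 323, "returns_to": 283, "refers_to": 137,
}
total = sum(edge_counts.values())  # 99,069 edges
for etype, n in sorted(edge_counts.items(), key=lambda kv: -kv[1]):
    print(f"{etype:13s} {n:7,d} {100 * n / total:5.1f}%")
```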

Verified Results

V2 Model (with heterogeneous edges)

  • Embedding rank: 31/32 (96.9%) — no collapse
  • Variance alive: 32/32 (100%)
  • Parameters: 427,553 (within 4M Pi 5 budget)
  • Edge types: 10 heterogeneous (one-hot encoded)
  • Transition MSE: 0.635 (synthetic edit pairs)
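"Embedding rank: 31/32" measures the numerical rank of the 32-D embedding matrix; rank near the full dimension means no dimensional collapse. One standard way to compute it (the exact tolerance the project uses is an assumption):

```python
import numpy as np

def embedding_rank(z, rtol=1e-3):
    """Numerical rank of an embedding matrix z of shape [N, D]:
    the number of singular values above rtol * largest singular value."""
    s = np.linalg.svd(z, compute_uv=False)
    return int((s > rtol * s[0]).sum())

rng = np.random.default_rng(0)
# Collapsed embedding: every row lies on one direction -> rank 1
collapsed = np.outer(rng.normal(size=128), rng.normal(size=32))
# Healthy embedding: generic Gaussian matrix -> full rank 32
healthy = rng.normal(size=(128, 32))
print(embedding_rank(collapsed), embedding_rank(healthy))  # 1 32
```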

V1 Agent Loop (full integration verified)

  • Agent.observe() → 122-node graph, rank 31/32
  • ActionExecutor → add_import, add_docstring, edit_file with syntax validation
  • OnlineLearner → 5 episodes, prediction error ↓ 26%
  • Undo system → all edits reversed perfectly

Install

```bash
pip install torch torch-geometric sentence-transformers tree-sitter tree-sitter-python
```

Quickstart

```python
from node_jepa import setup_agent

agent = setup_agent('/path/to/your/repo')
agent.observe()   # Parse code → heterogeneous graph (10 edge types)
agent.think()     # Detect surprises, plan actions
agent.act()       # Execute (dry-run by default)
```

V2 Training

```bash
python scripts/train_v2_scaled.py
```

This pre-trains the JEPA encoder on your code's heterogeneous graph, then trains the transition model on synthetic edit pairs.
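Concretely, the JEPA pre-training step pairs an online (context) encoder with a target encoder whose weights are an exponential moving average of the online weights, and regresses predicted embeddings onto target embeddings. A schematic numpy sketch; the momentum value and function names are assumptions, not the script's actual hyperparameters:

```python
import numpy as np

def ema_update(target_params, online_params, tau=0.996):
    """Target encoder weights = exponential moving average of online weights.
    The target side never receives gradients; it only tracks the online encoder."""
    return [tau * t + (1 - tau) * o for t, o in zip(target_params, online_params)]

def jepa_loss(predicted, target):
    """Regress context-branch predictions onto target-encoder embeddings
    (a real implementation stops gradients through `target`)."""
    return float(np.mean((predicted - target) ** 2))
```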

How It Works

Source Code → Tree-sitter AST → Heterogeneous Graph (10 edge types)
                    ↓
         ┌─ ast_child (parent→child)
         ├─ data_flow (def→use) ← GraphCodeBERT's key insight
         ├─ control_flow (if→body)
         ├─ next_stmt (sequential)
         ├─ calls (call→definition)
         ├─ returns_to (return→function)
         ├─ guarded_by (stmt→condition)
         ├─ scope (def→enclosing)
         ├─ computed_from (use→def)
         └─ refers_to (cross-file)
                    ↓
         Edge-Type-Aware GAT Encoder
         (different attention per edge type)
                    ↓
         Latent Space (32D, rank 31/32)
                    ↓
    ┌───────────────┼───────────────┐
    ↓               ↓               ↓
Goal Embedding  Transition Model  Action Executor
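The data_flow edges in the diagram (GraphCodeBERT-style def → use links) can be illustrated on a toy, already-linearized program. The real extractor walks the Tree-sitter AST, so this is a deliberately simplified sketch with hypothetical names:

```python
def data_flow_edges(statements):
    """statements: list of (defined_vars, used_vars) per statement, in order.
    Returns (def_stmt_idx, use_stmt_idx) edges linking the most recent
    definition of each variable to each later use of it."""
    last_def = {}
    edges = []
    for i, (defs, uses) in enumerate(statements):
        for v in uses:
            if v in last_def:
                edges.append((last_def[v], i))  # data_flow: def -> use
        for v in defs:
            last_def[v] = i                     # this statement redefines v
    return edges

# x = 1; y = x + 2; return y
prog = [({"x"}, set()), ({"y"}, {"x"}), (set(), {"y"})]
print(data_flow_edges(prog))  # [(0, 1), (1, 2)]
```

The computed_from edges are simply each data_flow edge reversed.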

Architecture

| Layer | Component | Status |
|---|---|---|
| 1+2 | Perception — Tree-sitter → 10-type heterogeneous graph | ✅ V2 |
| 3 | World Model — JEPA encoder + edge-type-aware GAT + EMA | ✅ V2 |
| 3+ | Transition — Trained on edit pairs, AdaLN action conditioning | ✅ V2 |
| 4 | Agent Harness — Memory + surprise + planner + actions | |
| 4+ | Executor — File edits (syntax-checked, undoable) | |
| 4+ | Goals — 8 pre-defined goal embeddings (SentenceTransformer) | |
| 5 | Learning — Experience buffer + imagination + EWC + online loop | |
| 6 | Pi 5 Infra — 427K params, ONNX-ready | |
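The AdaLN action conditioning in layer 3+ follows the usual adaptive-LayerNorm pattern: the action embedding produces a per-dimension scale and shift applied after normalization. A numpy sketch; the names, shapes, and projection form are assumptions about the implementation:

```python
import numpy as np

def ada_layer_norm(h, action_emb, W_scale, W_shift, eps=1e-5):
    """Adaptive LayerNorm: the action embedding modulates normalized features.

    h          : [N, D] node features entering the transition model
    action_emb : [A]    embedding of the proposed edit action
    W_scale, W_shift : [A, D] learned projections
    """
    mu = h.mean(axis=-1, keepdims=True)
    sigma = h.std(axis=-1, keepdims=True)
    h_norm = (h - mu) / (sigma + eps)       # plain LayerNorm (no affine params)
    gamma = 1.0 + action_emb @ W_scale      # action-dependent per-dimension scale
    beta = action_emb @ W_shift             # action-dependent per-dimension shift
    return h_norm * gamma + beta
```

With a zero action embedding this reduces to plain LayerNorm, so the conditioning starts as a no-op and is learned from the edit pairs.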

Pre-trained Checkpoints

| Checkpoint | Description | Params |
|---|---|---|
| pretrained/v2_scaled.pt2 | V2: heterogeneous edges, edge-type-aware GAT | 427K |
| pretrained/git_transition_2000.pt2 | V1: git history transition model | 424K |
| pretrained/md17_transition_v2.pt2 | V1: molecular domain | 424K |

Literature Grounding

V2 improvements are grounded in published results:

| Paper | Key Contribution | How We Use It |
|---|---|---|
| GraphCodeBERT | Data flow edges > AST alone | data_flow + computed_from edge types |
| FA-AST | AST + CFG + DFG → F1≈1.0 on clones | All 10 heterogeneous edge types |
| GREPO | GNNs beat GPT-4o on bug localization | Edge-type-aware attention, GAT backbone |
| VISION | Counterfactual vuln detection (97.8%) | Architecture for future fine-tuning |
| GNN-Coder | +20% zero-shot code search | ASTGPool-style retrieval pipeline |
| CWM | Code World Model (execution traces) | Transition model paradigm |
| Graph-JEPA | JEPA for graph-level representations | Core self-supervised framework |

Next Steps (Training to SOTA)

  1. Train on CommitPack — 4TB of git commits as (before, message, after) triples → train the transition model at scale (or its smaller, instruction-filtered subset CommitPackFT)
  2. Pre-train on The Stack v2 — 619 languages, millions of files → better foundation encoder
  3. Fine-tune for tasks — vulnerability detection (DiverseVul), bug localization (GREPO), code search (CodeSearchNet)
  4. Scale model — use full 4M param budget: hidden=256, embed=128, 4 layers (needs A100 for stable training)

Pi 5 Compliance

| Constraint | Target | V2 Actual |
|---|---|---|
| Parameters | ≤4M | 427K (10.7%) |
| RAM | ≤6GB | ~200MB (3.3%) |
| ONNX | Compatible | |
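The RAM figure is easy to sanity-check: the weights themselves are tiny at fp32, so the ~200MB footprint is presumably dominated by the Python/PyTorch runtime rather than the model. Arithmetic only, using the parameter count reported above:

```python
PARAMS = 427_553   # parameter count from the results above
BYTES_FP32 = 4     # bytes per fp32 weight

weights_mb = PARAMS * BYTES_FP32 / 1e6
print(f"fp32 weights: {weights_mb:.2f} MB")                   # 1.71 MB
print(f"share of 6 GB RAM budget: {weights_mb / 6_000:.4%}")  # 0.0285%
```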

License

MIT

Citation

```bibtex
@software{node_jepa_v2,
  title = {Node-JEPA V2: Heterogeneous Code Graph World Model},
  author = {EPSAGR},
  year = {2026},
  url = {https://huggingface.co/EPSAGR/Node-JEPA}
}
```