# Node-JEPA V2 — Heterogeneous Code Graph World Model

A 427K-parameter graph world model with 10 heterogeneous edge types (AST + CFG + DFG) that autonomously understands and modifies source code.

## What's New in V2

| Feature | V1 | V2 | Impact |
|---|---|---|---|
| Edge types | 5 (child, next, calls, control, refers_to) | 10 (+ data_flow, computed_from, returns_to, scope, guarded_by) | +5 semantic edge types from GraphCodeBERT/FA-AST/GREPO |
| Edge handling | Scalar edge type ID | One-hot [E, 10] encoding + edge-type-aware attention (sketched below) | GNN sees different weights per relationship type |
| Data flow | ❌ None | ✅ Variable def → use edges | "Where does the value come from?" — GraphCodeBERT's key insight |
| Control flow | Basic (if→body) | + returns_to, guarded_by, scope | Full CFG: which statements gate which, return paths |
| GNN attention | Type-agnostic | Edge-type attention bias + value modulation | From GREPO: attention > GCN, removing any edge type hurts |
| [REG] node | Only in GNNEncoder.forward() | Also in DomainConditionalEncoder | Fixes NaN on disconnected subgraph components |
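As a rough sketch of the V2 edge handling described above (one-hot [E, 10] edge features, a per-type attention bias, and per-type value modulation), here is a minimal single-head version in PyTorch. Module and dimension names are illustrative, not the repository's actual API:

```python
# Minimal single-head sketch of edge-type-aware attention. Assumes the 10 edge
# types arrive one-hot encoded as edge_attr of shape [E, 10], as in the table
# above; names and dimensions are illustrative.
import torch
import torch.nn as nn
from torch_geometric.utils import softmax  # per-destination softmax over edges

NUM_EDGE_TYPES = 10

class EdgeTypeAwareAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # One learned attention bias per edge type ("attention bias").
        self.type_bias = nn.Parameter(torch.zeros(NUM_EDGE_TYPES))
        # One learned per-channel gate per edge type ("value modulation").
        self.type_gate = nn.Linear(NUM_EDGE_TYPES, dim, bias=False)

    def forward(self, x, edge_index, edge_attr):
        src, dst = edge_index                      # x: [N, dim], edge_index: [2, E]
        logits = (self.q(x)[dst] * self.k(x)[src]).sum(-1) / x.size(-1) ** 0.5
        logits = logits + edge_attr @ self.type_bias        # type-specific bias
        alpha = softmax(logits, dst, num_nodes=x.size(0))   # normalize per target
        msg = self.v(x)[src] * torch.sigmoid(self.type_gate(edge_attr))
        out = torch.zeros_like(x)
        out.index_add_(0, dst, alpha.unsqueeze(-1) * msg)   # aggregate messages
        return out

# Smoke test: 4 nodes, 3 edges of 2 different types.
x = torch.randn(4, 32)
edge_index = torch.tensor([[0, 1, 2], [1, 2, 3]])
edge_attr = torch.eye(NUM_EDGE_TYPES)[torch.tensor([0, 5, 5])]  # one-hot types
print(EdgeTypeAwareAttention(32)(x, edge_index, edge_attr).shape)  # [4, 32]
```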
## V2 Edge Type Distribution (87K-node code graph)

| Edge Type | Count | % | Source |
|---|---|---|---|
| ast_child | 84,886 | 85.7% | Tree-sitter AST parent→child |
| next_stmt | 6,426 | 6.5% | Sequential statement order |
| calls | 3,113 | 3.1% | Function call → definition |
| guarded_by | 1,708 | 1.7% | Statement → gating condition |
| control_flow | 1,087 | 1.1% | Control structure → body/branch |
| data_flow | 553 | 0.6% | Variable definition → use (extraction sketched below) |
| computed_from | 553 | 0.6% | Reverse data flow |
| scope | 323 | 0.3% | Definition → enclosing scope |
| returns_to | 283 | 0.3% | Return statement → function |
| refers_to | 137 | 0.1% | Cross-file references |
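As a simplified illustration of where the data_flow rows above come from, the sketch below derives def → use edges with tree-sitter by linking each identifier use to the most recent prior assignment of that name. The real pipeline presumably also tracks scopes and control flow; parser setup follows the current py-tree-sitter and tree-sitter-python bindings:

```python
# Toy extraction of data_flow (def -> use) edges with tree-sitter. This ignores
# scoping and control flow, which the real pipeline would have to handle.
import tree_sitter_python as tspython
from tree_sitter import Language, Parser

parser = Parser(Language(tspython.language()))
tree = parser.parse(b"x = 1\ny = x + 2\nprint(y)\n")

last_def = {}    # variable name -> start_byte of its latest definition
data_flow = []   # (def_pos, use_pos) edges

def walk(node):
    """Yield nodes in document order."""
    yield node
    for child in node.children:
        yield from walk(child)

for node in walk(tree.root_node):
    if node.type != "identifier":
        continue
    name = node.text.decode()
    parent = node.parent
    left = (parent.child_by_field_name("left")
            if parent is not None and parent.type == "assignment" else None)
    if left is not None and left.start_byte == node.start_byte:
        last_def[name] = node.start_byte   # a new definition shadows the old one
    elif name in last_def:
        data_flow.append((last_def[name], node.start_byte))  # def -> use edge

print(len(data_flow), "data_flow edges")  # 2: x into 'x + 2', y into print(y)
```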
## Verified Results

### V2 Model (with heterogeneous edges)

- Embedding rank: 31/32 (96.9%) — no collapse (diagnostic sketch below)
- Variance alive: 32/32 (100%)
- Parameters: 427,553 (within the 4M Pi 5 budget)
- Edge types: 10 heterogeneous (one-hot encoded)
- Transition MSE: 0.635 (synthetic edit pairs)
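The rank and variance-alive numbers above can be reproduced with a standard collapse diagnostic along these lines (the thresholds here are illustrative; the repository's exact cutoffs may differ):

```python
# Collapse diagnostics for an [N, 32] embedding matrix (illustrative thresholds;
# the repository's exact cutoffs may differ).
import torch

def embedding_health(z: torch.Tensor, sval_tol: float = 1e-3, var_tol: float = 1e-4):
    z = z - z.mean(0)                         # center before measuring spread
    s = torch.linalg.svdvals(z)               # singular values of the embeddings
    rank = int((s / s[0] > sval_tol).sum())   # dims carrying non-negligible energy
    alive = int((z.var(0) > var_tol).sum())   # dims whose variance hasn't collapsed
    return rank, alive

z = torch.randn(122, 32)                      # e.g. one 122-node graph's embeddings
rank, alive = embedding_health(z)
print(f"rank {rank}/32, variance alive {alive}/32")
```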
### V1 Agent Loop (full integration verified)

- Agent.observe() → 122-node graph, rank 31/32
- ActionExecutor → add_import, add_docstring, edit_file with syntax validation
- OnlineLearner → 5 episodes, prediction error ↓ 26%
- Undo system → all edits reversed perfectly
## Install

```bash
pip install torch torch-geometric sentence-transformers tree-sitter tree-sitter-python
```
## Quickstart

```python
from node_jepa import setup_agent

agent = setup_agent('/path/to/your/repo')
agent.observe()
agent.think()
agent.act()
```
## V2 Training

```bash
python scripts/train_v2_scaled.py
```

This pre-trains the JEPA encoder on your code's heterogeneous graph, then trains the transition model on synthetic edit pairs.
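In outline, the JEPA stage follows the usual context/target scheme: an online encoder embeds a masked view, an EMA copy embeds the full view, and a predictor regresses one onto the other. A schematic of that loop, with stand-in modules rather than the script's real entry points:

```python
# Schematic of one JEPA pre-training step with an EMA target encoder.
# `encoder`, `predictor`, and the masking are stand-ins for the real modules
# used by scripts/train_v2_scaled.py.
import copy
import torch
import torch.nn.functional as F

encoder = torch.nn.Linear(64, 32)          # stand-in for the GNN encoder
predictor = torch.nn.Linear(32, 32)        # stand-in latent predictor
target_encoder = copy.deepcopy(encoder)    # EMA copy, never trained directly
for p in target_encoder.parameters():
    p.requires_grad_(False)

opt = torch.optim.AdamW([*encoder.parameters(), *predictor.parameters()], lr=1e-3)

def jepa_step(x, mask, ema_decay=0.996):
    z_ctx = encoder(x * mask)              # context view: masked node features
    with torch.no_grad():
        z_tgt = target_encoder(x)          # target view: the full graph
    loss = F.mse_loss(predictor(z_ctx), z_tgt)
    opt.zero_grad()
    loss.backward()
    opt.step()
    # Keep the target encoder a slow-moving average of the online encoder.
    with torch.no_grad():
        for p, q in zip(encoder.parameters(), target_encoder.parameters()):
            q.mul_(ema_decay).add_(p, alpha=1 - ema_decay)
    return loss.item()

x = torch.randn(122, 64)                   # toy node-feature matrix
mask = (torch.rand(122, 1) > 0.3).float()  # hide ~30% of nodes from the context
print(jepa_step(x, mask))
```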
## How It Works

```text
Source Code → Tree-sitter AST → Heterogeneous Graph (10 edge types)
                                      ↓
                  ┌─ ast_child     (parent→child)
                  ├─ data_flow     (def→use)  ← GraphCodeBERT's key insight
                  ├─ control_flow  (if→body)
                  ├─ next_stmt     (sequential)
                  ├─ calls         (call→definition)
                  ├─ returns_to    (return→function)
                  ├─ guarded_by    (stmt→condition)
                  ├─ scope         (def→enclosing)
                  ├─ computed_from (use→def)
                  └─ refers_to     (cross-file)
                                      ↓
                      Edge-Type-Aware GAT Encoder
                   (different attention per edge type)
                                      ↓
                      Latent Space (32D, rank 31/32)
                                      ↓
                  ┌───────────────────┼───────────────────┐
                  ↓                   ↓                   ↓
          Goal Embedding      Transition Model     Action Executor
```
## Architecture

| Layer | Component | Status |
|---|---|---|
| 1+2 | Perception — Tree-sitter → 10-type heterogeneous graph | ✅ V2 |
| 3 | World Model — JEPA encoder + edge-type-aware GAT + EMA | ✅ V2 |
| 3+ | Transition — Trained on edit pairs, AdaLN action conditioning (sketched below) | ✅ V2 |
| 4 | Agent Harness — Memory + surprise + planner + actions | ✅ |
| 4+ | Executor — File edits (syntax-checked, undoable) | ✅ |
| 4+ | Goals — 8 pre-defined goal embeddings (SentenceTransformer) | ✅ |
| 5 | Learning — Experience buffer + imagination + EWC + online loop | ✅ |
| 6 | Pi 5 Infra — 427K params, ONNX-ready | ✅ |
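The AdaLN action conditioning in layer 3+ follows the usual adaptive-LayerNorm pattern: the action embedding produces a per-channel scale and shift applied to the normalized latent state. A minimal sketch with made-up dimensions and names:

```python
# Minimal AdaLN-style action conditioning for a transition model
# (made-up dimensions and names; the repository's module may differ).
import torch
import torch.nn as nn

class AdaLNTransition(nn.Module):
    def __init__(self, latent_dim=32, action_dim=16):
        super().__init__()
        self.norm = nn.LayerNorm(latent_dim, elementwise_affine=False)
        # Action embedding -> per-channel (scale, shift) for the normalized state.
        self.to_scale_shift = nn.Linear(action_dim, 2 * latent_dim)
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.GELU(), nn.Linear(64, latent_dim)
        )

    def forward(self, z, action):
        scale, shift = self.to_scale_shift(action).chunk(2, dim=-1)
        h = self.norm(z) * (1 + scale) + shift  # action modulates the state
        return z + self.mlp(h)                  # residual next-state prediction

z_next = AdaLNTransition()(torch.randn(8, 32), torch.randn(8, 16))
print(z_next.shape)  # torch.Size([8, 32])
```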
## Pre-trained Checkpoints

| Checkpoint | Description | Params |
|---|---|---|
| pretrained/v2_scaled.pt2 | V2: heterogeneous edges, edge-type-aware GAT | 427K |
| pretrained/git_transition_2000.pt2 | V1: git history transition model | 424K |
| pretrained/md17_transition_v2.pt2 | V1: molecular domain | 424K |
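The checkpoint layout is not documented here; assuming the files above are ordinary torch.save state dicts (treating the .pt2 suffix as just a naming convention), loading would look like:

```python
# Assumes the checkpoint is a plain torch.save state dict; adjust if the
# actual layout (e.g. a torch.export artifact) differs.
import torch

state = torch.load("pretrained/v2_scaled.pt2", map_location="cpu")
print(type(state), list(state)[:5] if isinstance(state, dict) else state)
```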
## Literature Grounding

V2 improvements are grounded in published results:

| Paper | Key Contribution | How We Use It |
|---|---|---|
| GraphCodeBERT | Data flow edges > AST alone | data_flow + computed_from edge types |
| FA-AST | AST + CFG + DFG → F1≈1.0 on clones | All 10 heterogeneous edge types |
| GREPO | GNNs beat GPT-4o on bug localization | Edge-type-aware attention, GAT backbone |
| VISION | Counterfactual vuln detection (97.8%) | Architecture for future fine-tuning |
| GNN-Coder | +20% zero-shot code search | ASTGPool-style retrieval pipeline |
| CWM | Code World Model (execution traces) | Transition model paradigm |
| Graph-JEPA | JEPA for graph-level representations | Core self-supervised framework |
## Next Steps (Training to SOTA)

- Train on CommitPackFT — 4TB of git commits as (before, message, after) triples → train the transition model at scale
- Pre-train on The Stack v2 — 619 languages, millions of files → a better foundation encoder
- Fine-tune for tasks — vulnerability detection (DiverseVul), bug localization (GREPO), code search (CodeSearchNet)
- Scale the model — use the full 4M-parameter budget: hidden=256, embed=128, 4 layers (needs an A100 for stable training; see the config sketch below)
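The scaled configuration from the last bullet, written out as an illustrative dataclass (field names are ours, not the repository's):

```python
# The "full budget" configuration from the last bullet, as an illustrative
# dataclass; field names are ours, not the repository's.
from dataclasses import dataclass

@dataclass
class ScaledConfig:
    hidden_dim: int = 256          # GAT hidden width
    embed_dim: int = 128           # embedding width
    num_layers: int = 4            # GAT layers
    num_edge_types: int = 10       # unchanged from V2
    param_budget: int = 4_000_000  # Pi 5 ceiling

print(ScaledConfig())
```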
## Pi 5 Compliance

| Constraint | Target | V2 Actual |
|---|---|---|
| Parameters | ≤4M | 427K (10.7%) |
| RAM | ≤6GB | ~200MB (3.3%) |
| ONNX | Compatible | ✅ (export sketch below) |
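ONNX readiness can be smoke-tested with a standard torch.onnx.export call; the sketch below traces a stand-in module, since a real export would trace the repository's actual encoder and its dynamic graph shapes:

```python
# Smoke-test ONNX export with a stand-in encoder (a real export would trace
# the repository's actual GNN encoder).
import torch

model = torch.nn.Sequential(torch.nn.Linear(64, 32), torch.nn.GELU())
dummy = torch.randn(122, 64)  # toy node-feature matrix
torch.onnx.export(
    model, (dummy,), "encoder.onnx",
    input_names=["node_features"], output_names=["embeddings"],
    dynamic_axes={"node_features": {0: "num_nodes"}, "embeddings": {0: "num_nodes"}},
)
print("exported encoder.onnx")
```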
## License

MIT
## Citation

```bibtex
@software{node_jepa_v2,
  title  = {Node-JEPA V2: Heterogeneous Code Graph World Model},
  author = {EPSAGR},
  year   = {2026},
  url    = {https://huggingface.co/EPSAGR/Node-JEPA}
}
```