Node-JEPA V2 — Heterogeneous Code Graph World Model

A 427K-parameter graph world model with 10 heterogeneous edge types (AST + CFG + DFG) that observes, models, and edits source code through an observe → think → act loop.


What's New in V2

| Feature | V1 | V2 | Impact |
|---|---|---|---|
| Edge types | 5 (child, next, calls, control, refers_to) | 10 (+ data_flow, computed_from, returns_to, scope, guarded_by) | +5 semantic edge types from GraphCodeBERT/FA-AST/GREPO |
| Edge handling | Scalar edge-type ID | One-hot [E, 10] encoding + edge-type-aware attention | GNN sees different weights per relationship type |
| Data flow | ❌ None | ✅ Variable def → use edges | "Where does the value come from?" (GraphCodeBERT's key insight) |
| Control flow | Basic (if→body) | + returns_to, guarded_by, scope | Full CFG: which statements gate which, return paths |
| GNN attention | Type-agnostic | Edge-type attention bias + value modulation | From GREPO: attention > GCN; removing any edge type hurts |
| [REG] node | Only in GNNEncoder.forward() | Also in DomainConditionalEncoder | Fixes NaN on disconnected subgraph components |
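The one-hot encoding plus attention bias can be sketched in a few lines. This is a minimal numpy illustration rather than the project's actual GNN code; `edge_type_logits`, `a`, and `type_bias` are illustrative names:

```python
import numpy as np

NUM_EDGE_TYPES = 10

def edge_type_logits(h_src, h_dst, edge_type_ids, a, type_bias):
    """GAT-style attention logits with an edge-type-aware bias.

    h_src, h_dst  : [E, D] features of each edge's source/target node
    edge_type_ids : [E] integer edge-type IDs in [0, 10)
    a             : [2*D] attention vector
    type_bias     : [10] learned scalar bias, one per edge type
    """
    one_hot = np.eye(NUM_EDGE_TYPES)[edge_type_ids]      # [E, 10] one-hot encoding
    logits = np.concatenate([h_src, h_dst], axis=1) @ a  # type-agnostic V1-style logit
    # V2: shift each logit by a learned amount for its edge type
    # (a real GAT then softmax-normalizes these per destination node)
    return logits + one_hot @ type_bias
```

With this shape, two edges connecting identical node features still get different attention weights when their edge types differ, which is the point of the V2 change.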

V2 Edge Type Distribution (87K-node code graph)

| Edge Type | Count | % | Source |
|---|---:|---:|---|
| ast_child | 84,886 | 85.7% | Tree-sitter AST parent→child |
| next_stmt | 6,426 | 6.5% | Sequential statement order |
| calls | 3,113 | 3.1% | Function call → definition |
| guarded_by | 1,708 | 1.7% | Statement → gating condition |
| control_flow | 1,087 | 1.1% | Control structure → body/branch |
| data_flow | 553 | 0.6% | Variable definition → use |
| computed_from | 553 | 0.6% | Reverse data flow |
| scope | 323 | 0.3% | Definition → enclosing scope |
| returns_to | 283 | 0.3% | Return statement → function |
| refers_to | 137 | 0.1% | Cross-file references |
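The percentages above follow directly from the raw counts; reproducing them is a one-liner and doubles as a consistency check (the counts sum to 99,069, matching the ~99K-edge graph):

```python
# Edge counts copied from the table above.
edge_counts = {
    "ast_child": 84_886, "next_stmt": 6_426, "calls": 3_113,
    "guarded_by": 1_708, "control_flow": 1_087, "data_flow": 553,
    "computed_from": 553, "scope": 323, "returns_to": 283, "refers_to": 137,
}
total = sum(edge_counts.values())  # 99,069 edges
for etype, n in sorted(edge_counts.items(), key=lambda kv: -kv[1]):
    print(f"{etype:13s} {n:7,d} {100 * n / total:5.1f}%")
```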

Verified Results

V2 Model (with heterogeneous edges)

  • Embedding rank: 31/32 (96.9%) — no collapse
  • Variance alive: 32/32 (100%)
  • Parameters: 427,553 (within 4M Pi 5 budget)
  • Edge types: 10 heterogeneous (one-hot encoded)
  • Transition MSE: 0.635 (synthetic edit pairs)
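"Embedding rank: 31/32" measures the numerical rank of the 32-D embedding matrix; rank near the full dimension means no dimensional collapse. One standard way to compute it (the exact tolerance the project uses is an assumption):

```python
import numpy as np

def embedding_rank(z, rtol=1e-3):
    """Numerical rank of an embedding matrix z of shape [N, D]:
    the number of singular values above rtol * largest singular value."""
    s = np.linalg.svd(z, compute_uv=False)
    return int((s > rtol * s[0]).sum())

rng = np.random.default_rng(0)
# Collapsed embedding: every row lies on one direction -> rank 1
collapsed = np.outer(rng.normal(size=128), rng.normal(size=32))
# Healthy embedding: generic Gaussian matrix -> full rank 32
healthy = rng.normal(size=(128, 32))
print(embedding_rank(collapsed), embedding_rank(healthy))  # 1 32
```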

V1 Agent Loop (full integration verified)

  • Agent.observe() → 122-node graph, rank 31/32
  • ActionExecutor → add_import, add_docstring, edit_file with syntax validation
  • OnlineLearner → 5 episodes, prediction error ↓ 26%
  • Undo system → all edits reversed perfectly

Install

```bash
pip install torch torch-geometric sentence-transformers tree-sitter tree-sitter-python
```

Quickstart

```python
from node_jepa import setup_agent

agent = setup_agent('/path/to/your/repo')
agent.observe()   # Parse code → heterogeneous graph (10 edge types)
agent.think()     # Detect surprises, plan actions
agent.act()       # Execute (dry-run by default)
```

V2 Training

```bash
python scripts/train_v2_scaled.py
```

This pre-trains the JEPA encoder on your code's heterogeneous graph, then trains the transition model on synthetic edit pairs.
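Concretely, the JEPA pre-training step pairs an online (context) encoder with a target encoder whose weights are an exponential moving average of the online weights, and regresses predicted embeddings onto target embeddings. A schematic numpy sketch; the momentum value and function names are assumptions, not the script's actual hyperparameters:

```python
import numpy as np

def ema_update(target_params, online_params, tau=0.996):
    """Target encoder weights = exponential moving average of online weights.
    The target side never receives gradients; it only tracks the online encoder."""
    return [tau * t + (1 - tau) * o for t, o in zip(target_params, online_params)]

def jepa_loss(predicted, target):
    """Regress context-branch predictions onto target-encoder embeddings
    (a real implementation stops gradients through `target`)."""
    return float(np.mean((predicted - target) ** 2))
```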

How It Works

Source Code → Tree-sitter AST → Heterogeneous Graph (10 edge types)
                    ↓
         ┌─ ast_child (parent→child)
         ├─ data_flow (def→use) ← GraphCodeBERT's key insight
         ├─ control_flow (if→body)
         ├─ next_stmt (sequential)
         ├─ calls (call→definition)
         ├─ returns_to (return→function)
         ├─ guarded_by (stmt→condition)
         ├─ scope (def→enclosing)
         ├─ computed_from (use→def)
         └─ refers_to (cross-file)
                    ↓
         Edge-Type-Aware GAT Encoder
         (different attention per edge type)
                    ↓
         Latent Space (32D, rank 31/32)
                    ↓
    ┌───────────────┼───────────────┐
    ↓               ↓               ↓
Goal Embedding  Transition Model  Action Executor
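The data_flow edges in the diagram (GraphCodeBERT-style def → use links) can be illustrated on a toy, already-linearized program. The real extractor walks the Tree-sitter AST, so this is a deliberately simplified sketch with hypothetical names:

```python
def data_flow_edges(statements):
    """statements: list of (defined_vars, used_vars) per statement, in order.
    Returns (def_stmt_idx, use_stmt_idx) edges linking the most recent
    definition of each variable to each later use of it."""
    last_def = {}
    edges = []
    for i, (defs, uses) in enumerate(statements):
        for v in uses:
            if v in last_def:
                edges.append((last_def[v], i))  # data_flow: def -> use
        for v in defs:
            last_def[v] = i                     # this statement redefines v
    return edges

# x = 1; y = x + 2; return y
prog = [({"x"}, set()), ({"y"}, {"x"}), (set(), {"y"})]
print(data_flow_edges(prog))  # [(0, 1), (1, 2)]
```

The computed_from edges are simply each data_flow edge reversed.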

Architecture

| Layer | Component | Status |
|---|---|---|
| 1+2 | Perception — Tree-sitter → 10-type heterogeneous graph | ✅ V2 |
| 3 | World Model — JEPA encoder + edge-type-aware GAT + EMA | ✅ V2 |
| 3+ | Transition — Trained on edit pairs, AdaLN action conditioning | ✅ V2 |
| 4 | Agent Harness — Memory + surprise + planner + actions | |
| 4+ | Executor — File edits (syntax-checked, undoable) | |
| 4+ | Goals — 8 pre-defined goal embeddings (SentenceTransformer) | |
| 5 | Learning — Experience buffer + imagination + EWC + online loop | |
| 6 | Pi 5 Infra — 427K params, ONNX-ready | |
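The AdaLN action conditioning in layer 3+ follows the usual adaptive-LayerNorm pattern: the action embedding produces a per-dimension scale and shift applied after normalization. A numpy sketch; the names, shapes, and projection form are assumptions about the implementation:

```python
import numpy as np

def ada_layer_norm(h, action_emb, W_scale, W_shift, eps=1e-5):
    """Adaptive LayerNorm: the action embedding modulates normalized features.

    h          : [N, D] node features entering the transition model
    action_emb : [A]    embedding of the proposed edit action
    W_scale, W_shift : [A, D] learned projections
    """
    mu = h.mean(axis=-1, keepdims=True)
    sigma = h.std(axis=-1, keepdims=True)
    h_norm = (h - mu) / (sigma + eps)       # plain LayerNorm (no affine params)
    gamma = 1.0 + action_emb @ W_scale      # action-dependent per-dimension scale
    beta = action_emb @ W_shift             # action-dependent per-dimension shift
    return h_norm * gamma + beta
```

With a zero action embedding this reduces to plain LayerNorm, so the conditioning starts as a no-op and is learned from the edit pairs.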

Pre-trained Checkpoints

| Checkpoint | Description | Params |
|---|---|---|
| pretrained/v2_scaled.pt2 | V2: heterogeneous edges, edge-type-aware GAT | 427K |
| pretrained/git_transition_2000.pt2 | V1: git history transition model | 424K |
| pretrained/md17_transition_v2.pt2 | V1: molecular domain | 424K |

Literature Grounding

V2 improvements are grounded in published results:

| Paper | Key Contribution | How We Use It |
|---|---|---|
| GraphCodeBERT | Data flow edges > AST alone | data_flow + computed_from edge types |
| FA-AST | AST + CFG + DFG → F1≈1.0 on clones | All 10 heterogeneous edge types |
| GREPO | GNNs beat GPT-4o on bug localization | Edge-type-aware attention, GAT backbone |
| VISION | Counterfactual vuln detection (97.8%) | Architecture for future fine-tuning |
| GNN-Coder | +20% zero-shot code search | ASTGPool-style retrieval pipeline |
| CWM | Code World Model (execution traces) | Transition model paradigm |
| Graph-JEPA | JEPA for graph-level representations | Core self-supervised framework |

Next Steps (Training to SOTA)

  1. Train on CommitPack — 4TB of git commits as (before, message, after) triples → train the transition model at scale (or its smaller, instruction-filtered subset CommitPackFT)
  2. Pre-train on The Stack v2 — 619 languages, millions of files → better foundation encoder
  3. Fine-tune for tasks — vulnerability detection (DiverseVul), bug localization (GREPO), code search (CodeSearchNet)
  4. Scale model — use full 4M param budget: hidden=256, embed=128, 4 layers (needs A100 for stable training)

Pi 5 Compliance

| Constraint | Target | V2 Actual |
|---|---|---|
| Parameters | ≤4M | 427K (10.7%) |
| RAM | ≤6GB | ~200MB (3.3%) |
| ONNX | Compatible | |
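The RAM figure is easy to sanity-check: the weights themselves are tiny at fp32, so the ~200MB footprint is presumably dominated by the Python/PyTorch runtime rather than the model. Arithmetic only, using the parameter count reported above:

```python
PARAMS = 427_553   # parameter count from the results above
BYTES_FP32 = 4     # bytes per fp32 weight

weights_mb = PARAMS * BYTES_FP32 / 1e6
print(f"fp32 weights: {weights_mb:.2f} MB")                   # 1.71 MB
print(f"share of 6 GB RAM budget: {weights_mb / 6_000:.4%}")  # 0.0285%
```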

License

MIT

Citation

```bibtex
@software{node_jepa_v2,
  title = {Node-JEPA V2: Heterogeneous Code Graph World Model},
  author = {EPSAGR},
  year = {2026},
  url = {https://huggingface.co/EPSAGR/Node-JEPA}
}
```