Pablo Montalvo‑Leroux · ML Engineer @ Hugging Face
Static graphs require you to compile, wait, and cross your fingers that the bug reproduces. Dynamic graphs mean you can drop pdb.set_trace() anywhere and keep iterating.
torch.compile gives the best of both worlds: write dynamically, ship something ahead-of-time optimised.
Research cadence is measured in hours; any friction kills momentum.
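A minimal sketch of that workflow, using gpt2 as a small stand-in checkpoint: debug the eager module, then compile the very same object.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# debug eagerly (pdb.set_trace() works inside forward), then compile the same module
compiled = torch.compile(model)  # same weights, same Python code path

inputs = tokenizer("Dynamic where it helps, compiled where it counts.", return_tensors="pt")
with torch.no_grad():
    logits = compiled(**inputs).logits  # first call triggers tracing and codegen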
# modeling_bert.py — single source of truth 🗄️
class BertConfig(PretrainedConfig):
    ...

class BertSelfAttention(nn.Module):
    ...

class BertLayer(nn.Module):
    ...

class BertModel(PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.embeddings = BertEmbeddings(config)
        self.encoder = nn.ModuleList(
            [BertLayer(config) for _ in range(config.num_hidden_layers)]
        )
        self.init_weights()
Config, layers, and from_pretrained() loading logic live together; a one-line load is sketched below. Compose new blocks via subclassing and selective override, as in the LoRA example that follows.
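A minimal load, assuming the standard bert-base-uncased checkpoint on the Hub:

from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")  # config + weights resolved in one call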
class LlamaRotaryLoRA(LlamaAttention):
    def __init__(...):
        super().__init__(...)
        self.q_proj = LoRA(self.q_proj)  # swap in LoRA
        self.apply_rotary()              # keep RoPE
Everything stays a plain nn.Module, so you can hook any layer and dump logits layer by layer; a sketch follows.
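A minimal debugging sketch, reusing the BertModel above (assume model is an instance of it; names are illustrative):

def dump(name):
    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        print(f"{name}: {tuple(hidden.shape)}, norm={hidden.float().norm():.3f}")
    return hook

for i, layer in enumerate(model.encoder):  # nn.ModuleList from the sketch above
    layer.register_forward_hook(dump(f"layer_{i}"))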
A tp_plan JSON keeps model code pristine and declarative:

{
    "layer.*.self_attn.q_proj": "colwise",
    "layer.*.self_attn.k_proj": "colwise",
    "layer.*.self_attn.v_proj": "colwise",
    "layer.*.self_attn.o_proj": "rowwise"
}
from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel

def translate_to_torch_parallel_style(style: str):
    if style == "colwise":
        return ColwiseParallel()
    elif style == "rowwise":
        return RowwiseParallel()
    raise ValueError(f"Unknown parallel style: {style}")
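A sketch of how such a plan could be applied with PyTorch's tensor-parallel API; assume this runs under torchrun, tp_plan is the parsed JSON above, and model is an already-loaded module (the wildcard expansion shown is illustrative):

import fnmatch
import os
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import parallelize_module

world_size = int(os.environ["WORLD_SIZE"])  # set by torchrun
mesh = init_device_mesh("cuda", (world_size,))

# expand "layer.*" wildcards against the real module tree, then shard in place
expanded = {
    name: translate_to_torch_parallel_style(style)
    for name, _ in model.named_modules()
    for pattern, style in tp_plan.items()
    if fnmatch.fnmatch(name, pattern)
}
parallelize_module(model, mesh, expanded)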
One JSON file loads a 17‑billion‑parameter Llama‑4 on 8 GPUs; tweak the plan, not the network.
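Recent transformers releases can also consume such a plan directly at load time; a hedged sketch (launch with torchrun, model id shown for illustration):

import torch
from transformers import AutoModelForCausalLM

# torchrun --nproc-per-node 8 load_tp.py
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    tp_plan="auto",                  # pick up the tensor-parallel plan shipped with the checkpoint
    torch_dtype=torch.bfloat16,
)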
Zero‑copy weight sharding shaves 15 % VRAM on A100 while cutting load time below 60 s for a 100‑B model.
class GlmMLP(Phi3MLP):
    pass

class GlmAttention(LlamaAttention):
    def __init__(self, config, layer_idx=None):
        super().__init__(config, layer_idx)
        self.o_proj = nn.Linear(
            config.num_attention_heads * self.head_dim,
            config.hidden_size,
            bias=False,
        )

def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
    # Slightly different RoPE
    ...

class GlmForCausalLM(LlamaForCausalLM):
    pass
AST magic expands this 40‑line prototype into a full modelling file, ready for training.
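Concretely, the expansion is driven by a converter utility in the transformers repository; roughly (exact flag spelling varies between versions):

# regenerate modeling_glm.py and friends from the modular prototype
python utils/modular_model_converter.py --files_to_parse src/transformers/models/glm/modular_glm.py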
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-8B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B")
Same API across text, vision, and audio: learn once, apply everywhere.
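The same two lines work for an audio checkpoint, with Whisper shown purely as an illustration:

from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("openai/whisper-small")
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-small")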
Mitigations: Triton, compiled custom ops, compile‑time fallbacks, and callable kernels.
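As an illustration of the first of those mitigations, here is the canonical Triton element-wise kernel, a sketch rather than anything shipped by transformers: the loop body runs fused on the GPU instead of going through Python per element.

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out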
Kernel Hub lets any Python program download and hot‑load compiled CUDA/C++ kernels directly from the Hugging Face Hub at runtime.
No local builds and nothing bolted onto your PYTHONPATH. 🚀 Quick start (requires torch >= 2.5):
pip install kernels
import torch
from kernels import get_kernel
# Download optimised kernels from the Hugging Face Hub
activation = get_kernel("kernels-community/activation")
x = torch.randn(10, 10, dtype=torch.float16, device="cuda")
y = torch.empty_like(x)
activation.gelu_fast(y, x)
print(y)
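A quick sanity check, assuming gelu_fast implements the tanh approximation of GELU (fp16 tolerances deliberately loose):

import torch.nn.functional as F

ref = F.gelu(x, approximate="tanh")
torch.testing.assert_close(y, ref, rtol=1e-2, atol=1e-2)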
Same Transformer code — now with a 3× faster GELU on A100s.
We tune radios without learning RF theory; ML frameworks should feel just as frictionless.
PyTorch and transformers have grown symbiotically for eight years; expect the spiral to continue.