PyTorch × Transformers Journey

Pythonicity, Autodiff & Modularity in Modern AI

Pablo Montalvo‑Leroux · ML Engineer @ Hugging Face

2016‑2018: Backprop & Birth Pangs

  • Hand‑crafted chain‑rule; frameworks such as Theano and CNTK appeared then vanished.
  • MLPs → RNNs → LSTMs — until BERT detonated the field in 2018.
  • Reproducibility was painful ✗ — until Transformers met PyTorch ✓.

Static vs Dynamic Graphs

Static graphs require you to compile, wait, and cross your fingers that the bug reproduces.

Dynamic graphs mean you can drop pdb.set_trace() anywhere and continue iterating.

torch.compile gives the best of both worlds: write dynamically, ship something ahead‑of‑time optimised.
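
A minimal sketch of that workflow, using a toy module that is purely illustrative: iterate on it eagerly (breakpoints and all), then hand the unchanged code to torch.compile.

import torch
import torch.nn as nn

class TinyMLP(nn.Module):
    # Illustrative stand-in for whatever module you are iterating on.
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(16, 16)

    def forward(self, x):
        # While debugging, drop `import pdb; pdb.set_trace()` right here.
        return torch.relu(self.fc(x))

model = TinyMLP()
y_eager = model(torch.randn(2, 16))   # eager: step through line by line
compiled = torch.compile(model)       # same code, now traced and optimised
y_fast = compiled(torch.randn(2, 16))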

Dynamic Graphs Enabled Contribution

  • Developers debug at line‑rate — no cold‑start recompiles.
  • Pull‑requests remained reproducible overnight, which accelerated trust.
  • Static‑graph alternatives stalled and the community consolidated around PyTorch.

Clone the Paper Tonight → Tweak Tomorrow

Research cadence is measured in hours; any friction kills momentum.

  • 2018: BERT fine‑tuning required printing tensors live rather than recompiling graphs.
  • Community PRs merged overnight — credibility snowballed for both PyTorch and Transformers.

“One Model · One File” — Why it Matters


# modeling_bert.py  — single source of truth 🗄️
class BertConfig(PretrainedConfig):
    ...

class BertSelfAttention(nn.Module):
    ...

class BertLayer(nn.Module):
    ...

class BertModel(PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.embeddings = BertEmbeddings(config)
        self.encoder = nn.ModuleList(
            [BertLayer(config) for _ in range(config.num_hidden_layers)]
        )
        self.init_weights()
        
  • All layers, forward pass, and from_pretrained() logic live together (see the loading example after this list).
  • No cross‑file inheritance maze — copy to Colab, hack, and run.
  • Reviewers diff one file; merge time dropped from days to hours.
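
Everything above is exercised by a plain checkpoint load; the model name below is simply the canonical BERT checkpoint, used as an illustration:

from transformers import BertModel

# Weights, config, and architecture all resolve inside modeling_bert.py.
model = BertModel.from_pretrained("bert-base-uncased")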

Transformers Grew with Python

  • The library prioritises hackability, which in turn accelerates adoption.
  • Python is slow by default, so we lean on compiled CUDA kernels and Triton for raw speed.
  • The new Kernel Hub means Transformers automatically uses a faster op the moment it is published — no application changes required.

Back to Python: Modular “Mary Shelley” Mode

Compose new blocks via subclassing and selective override.


class LlamaRotaryLoRA(LlamaAttention):
    def __init__(self, config, layer_idx=None):
        super().__init__(config, layer_idx)
        self.q_proj = LoRA(self.q_proj)  # swap in LoRA
        self.apply_rotary()              # keep RoPE
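
The LoRA wrapper used above is assumed rather than defined; a minimal sketch of such a module, with illustrative rank and scaling (not the PEFT implementation), could look like this:

import torch.nn as nn

class LoRA(nn.Module):
    # Low-rank adapter wrapped around an existing nn.Linear.
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base.requires_grad_(False)                # freeze the pretrained projection
        self.A = nn.Linear(base.in_features, r, bias=False)   # down-projection
        self.B = nn.Linear(r, base.out_features, bias=False)  # up-projection
        nn.init.zeros_(self.B.weight)                         # start as a zero delta
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.B(self.A(x))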
        

Logit Debugger: Trust but Verify

  • Attach a hook to every nn.Module; dump logits layer‑by‑layer (see the sketch after this list).
  • Spot ε‑level drifts — LayerNorm precision, FP16 underflow, etc.
  • JSON traces are diffable in CI, so regressions stay caught.
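
A minimal sketch of such a debugger, assuming we only keep per-module summary statistics (the helper name is hypothetical):

import torch
import torch.nn as nn

def attach_logit_debugger(model: nn.Module, trace: dict):
    # Register a forward hook on every submodule and record output statistics.
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor):
                trace[name] = {
                    "mean": output.float().mean().item(),
                    "std": output.float().std().item(),
                }
        return hook

    return [m.register_forward_hook(make_hook(n)) for n, m in model.named_modules()]

# Run a forward pass, then json.dumps(trace) and diff it against a reference in CI;
# call .remove() on the returned handles when finished.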

DTensor & Tensor‑Parallel API

  • Logical tensor views unlock device‑mesh sharding.
  • The tp_plan JSON keeps model code pristine and declarative.
  • We regularly validate 100‑billion‑parameter checkpoints inside HF test infra.
[Figure: device mesh]

Zero‑Config Parallelism

{
  "layer.*.self_attn.q_proj": "colwise",
  "layer.*.self_attn.k_proj": "colwise",
  "layer.*.self_attn.v_proj": "colwise",
  "layer.*.self_attn.o_proj": "rowwise"
}

from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel

def translate_to_torch_parallel_style(style: str):
    if style == "colwise":
        return ColwiseParallel()
    elif style == "rowwise":
        return RowwiseParallel()
    else:
        raise ValueError(f"Unknown tensor-parallel style: {style}")
        

One JSON file loads a 17‑billion‑parameter Llama‑4 on 8 GPUs; tweak the plan, not the network.
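
A sketch of what such a plan translates to on the PyTorch side, using the public tensor-parallel API; the toy block and mesh size are illustrative, and the real integration wires this up automatically from the tp_plan JSON:

import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

class ToyAttention(nn.Module):
    # Illustrative stand-in for a decoder layer's attention projections.
    def __init__(self, hidden=4096):
        super().__init__()
        self.q_proj = nn.Linear(hidden, hidden, bias=False)
        self.o_proj = nn.Linear(hidden, hidden, bias=False)

# Launch under torchrun with one process per GPU (8 here).
mesh = init_device_mesh("cuda", (8,))
block = parallelize_module(
    ToyAttention(),
    mesh,
    {"q_proj": ColwiseParallel(), "o_proj": RowwiseParallel()},
)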

Load Faster & Stronger: Cache Allocator

Zero‑copy weight sharding shaves 15 % VRAM on A100 while cutting load time below 60 s for a 100‑B model.

[Figure: memory bars]
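
These gains apply transparently to an ordinary from_pretrained call; a typical sharded load looks like this (the checkpoint name is illustrative):

import torch
from transformers import AutoModelForCausalLM

# Sharded weights stream straight onto the devices chosen by device_map.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)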

Modular Transformers: GLM by Example


class GlmMLP(Phi3MLP):
    pass

class GlmAttention(LlamaAttention):
    def __init__(self, config, layer_idx=None):
        super().__init__(config, layer_idx)
        self.o_proj = nn.Linear(
            config.num_attention_heads * self.head_dim,
            config.hidden_size,
            bias=False,
        )

def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
    # Slightly different RoPE
    ...

class GlmForCausalLM(LlamaForCausalLM):
    pass
        

AST magic expands this 40‑line prototype into a full modelling file, ready for training.

Rise of Multimodality


processor = AutoProcessor.from_pretrained("Qwen/Qwen3-8B")
model = AutoModelForConditionalGeneration.from_pretrained("Qwen/Qwen3-8B")
        

Same API across text, vision, and audio: learn once, apply everywhere.
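
The pipeline API makes the point concretely; the task names are real, the checkpoints merely illustrative:

from transformers import pipeline

# One calling convention across modalities.
text = pipeline("text-generation", model="gpt2")
vision = pipeline("image-classification", model="google/vit-base-patch16-224")
audio = pipeline("automatic-speech-recognition", model="openai/whisper-small")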

Why Python Wins

  • Low entry barrier attracts newcomers and domain specialists alike.
  • High‑level semantics concisely express low‑level intent.
  • The C++/Rust back‑end remains accessible for critical paths.

Where Python can bite 🐍

  • Interpreter overhead hurts microkernels (token‑by‑token decoding).
  • The GIL throttles concurrent host‑side work.
  • Fresh research code is easy to leave unoptimised.

Mitigations: Triton, compiled custom ops, compile‑time fallbacks, and callable kernels.

Kernel Hub: Optimised Ops from the Community

Kernel Hub lets any Python program download and hot‑load compiled CUDA/C++ kernels directly from the Hugging Face Hub at runtime.

  • Portable – kernels work from arbitrary paths outside PYTHONPATH.
  • Unique – load multiple versions of the same op side‑by‑side in one process.
  • Compatible – every kernel targets all recent PyTorch wheels (CUDA, ROCm, CPU) and C‑library ABIs.

🚀 Quick start (requires torch >= 2.5):

pip install kernels

import torch
from kernels import get_kernel

# Download optimised kernels from the Hugging Face Hub
activation = get_kernel("kernels-community/activation")

x = torch.randn(10, 10, dtype=torch.float16, device="cuda")
y = torch.empty_like(x)
activation.gelu_fast(y, x)
print(y)
        

Same Transformer code — now with a 3× faster GELU on A100s.

API Design Lessons

  • Make easy things obvious, and hard things merely possible.
  • Keep the paper‑to‑repository delta minimal for new models.
  • Hide sharding mechanics; expose developer intent.

We tune radios without learning RF theory — ML frameworks should feel as frictionless.

Model Growth by Modality

Takeaways & The Future

  • PyTorch and transformers have grown symbiotically for eight years—expect the spiral to continue.
  • Pythonicity plus pragmatism keeps the barrier to innovation low.
  • Open‑source models are shipping faster, larger, and more multimodal than ever.

hf.co/transformers/contribute