PyTorch × Transformers Journey

Pythonicity, Autodiff & Modularity in Modern AI

Pablo Montalvo‑Leroux · ML Engineer @ Hugging Face

2016‑2018: Backprop & Birth Pangs

  • Hand‑crafted chain‑rule; frameworks such as Theano and CNTK appeared then vanished.
  • MLPs → RNNs → LSTMs — until BERT detonated the field in 2018.
  • Reproducibility was painful ✗ — until Transformers met PyTorch ✓.

Static vs Dynamic Graphs

Static graphs require you to compile, wait, and cross your fingers that the bug reproduces.

Dynamic graphs mean you can drop pdb.set_trace() anywhere and continue iterating.

torch.compile gives the best of both worlds: write dynamically, ship something ahead‑of‑time optimised.
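
A minimal sketch of that workflow, using a toy module that is purely illustrative: iterate on it eagerly (breakpoints and all), then hand the unchanged code to torch.compile.

import torch
import torch.nn as nn

class TinyMLP(nn.Module):
    # Illustrative stand-in for whatever module you are iterating on.
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(16, 16)

    def forward(self, x):
        # While debugging, drop `import pdb; pdb.set_trace()` right here.
        return torch.relu(self.fc(x))

model = TinyMLP()
y_eager = model(torch.randn(2, 16))   # eager: step through line by line
compiled = torch.compile(model)       # same code, now traced and optimised
y_fast = compiled(torch.randn(2, 16))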

Dynamic Graphs Enabled Contribution

  • Developers debug at line‑rate — no cold‑start recompiles.
  • Pull‑requests remained reproducible overnight, which accelerated trust.
  • Static‑graph alternatives stalled and the community consolidated around PyTorch.

Clone the Paper Tonight → Tweak Tomorrow

Research cadence is measured in hours; any friction kills momentum.

  • 2018: BERT fine‑tuning required printing tensors live rather than recompiling graphs.
  • Community PRs merged overnight — credibility snowballed for both PyTorch and Transformers.

“One Model · One File” — Why it Matters


# modeling_bert.py  — single source of truth 🗄️
class BertConfig(PretrainedConfig):
    ...

class BertSelfAttention(nn.Module):
    ...

class BertLayer(nn.Module):
    ...

class BertModel(PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.embeddings = BertEmbeddings(config)
        self.encoder = nn.ModuleList(
            [BertLayer(config) for _ in range(config.num_hidden_layers)]
        )
        self.init_weights()
        
  • All layers, forward pass, and from_pretrained() logic live together (see the loading example after this list).
  • No cross‑file inheritance maze — copy to Colab, hack, and run.
  • Reviewers diff one file; merge time dropped from days to hours.
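
Everything above is exercised by a plain checkpoint load; the model name below is simply the canonical BERT checkpoint, used as an illustration:

from transformers import BertModel

# Weights, config, and architecture all resolve inside modeling_bert.py.
model = BertModel.from_pretrained("bert-base-uncased")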

Transformers Grew with Python

  • The library prioritises hackability, which in turn accelerates adoption.
  • Python is slow by default, so we lean on compiled CUDA kernels and Triton for raw speed.
  • The new Kernel Hub means Transformers automatically uses a faster op the moment it is published — no application changes required.

Back to Python: Modular “Mary Shelley” Mode

Compose new blocks via subclassing and selective override.


class LlamaRotaryLoRA(LlamaAttention):
    def __init__(self, config, layer_idx=None):
        super().__init__(config, layer_idx)
        self.q_proj = LoRA(self.q_proj)  # swap in LoRA
        self.apply_rotary()              # keep RoPE
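
The LoRA wrapper used above is assumed rather than defined; a minimal sketch of such a module, with illustrative rank and scaling (not the PEFT implementation), could look like this:

import torch.nn as nn

class LoRA(nn.Module):
    # Low-rank adapter wrapped around an existing nn.Linear.
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base.requires_grad_(False)                # freeze the pretrained projection
        self.A = nn.Linear(base.in_features, r, bias=False)   # down-projection
        self.B = nn.Linear(r, base.out_features, bias=False)  # up-projection
        nn.init.zeros_(self.B.weight)                         # start as a zero delta
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.B(self.A(x))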
        

Logit Debugger: Trust but Verify

  • Attach a hook to every nn.Module; dump logits layer‑by‑layer (see the sketch after this list).
  • Spot ε‑level drifts — LayerNorm precision, FP16 underflow, etc.
  • JSON traces are diffable in CI, so regressions stay caught.
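
A minimal sketch of such a debugger, assuming we only keep per-module summary statistics (the helper name is hypothetical):

import torch
import torch.nn as nn

def attach_logit_debugger(model: nn.Module, trace: dict):
    # Register a forward hook on every submodule and record output statistics.
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor):
                trace[name] = {
                    "mean": output.float().mean().item(),
                    "std": output.float().std().item(),
                }
        return hook

    return [m.register_forward_hook(make_hook(n)) for n, m in model.named_modules()]

# Run a forward pass, then json.dumps(trace) and diff it against a reference in CI;
# call .remove() on the returned handles when finished.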

DTensor & Tensor‑Parallel API

  • Logical tensor views unlock device‑mesh sharding.
  • The tp_plan JSON keeps model code pristine and declarative.
  • We regularly validate 100‑billion‑parameter checkpoints inside HF test infra.
[Figure: device mesh]

Zero‑Config Parallelism

{
  "layer.*.self_attn.q_proj": "colwise",
  "layer.*.self_attn.k_proj": "colwise",
  "layer.*.self_attn.v_proj": "colwise",
  "layer.*.self_attn.o_proj": "rowwise"
}

from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel

def translate_to_torch_parallel_style(style: str):
    if style == "colwise":
        return ColwiseParallel()
    elif style == "rowwise":
        return RowwiseParallel()
    else:
        raise ValueError(f"Unknown tensor-parallel style: {style}")
        

One JSON file loads a 17‑billion‑parameter Llama‑4 on 8 GPUs; tweak the plan, not the network.
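
A sketch of what such a plan translates to on the PyTorch side, using the public tensor-parallel API; the toy block and mesh size are illustrative, and the real integration wires this up automatically from the tp_plan JSON:

import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

class ToyAttention(nn.Module):
    # Illustrative stand-in for a decoder layer's attention projections.
    def __init__(self, hidden=4096):
        super().__init__()
        self.q_proj = nn.Linear(hidden, hidden, bias=False)
        self.o_proj = nn.Linear(hidden, hidden, bias=False)

# Launch under torchrun with one process per GPU (8 here).
mesh = init_device_mesh("cuda", (8,))
block = parallelize_module(
    ToyAttention(),
    mesh,
    {"q_proj": ColwiseParallel(), "o_proj": RowwiseParallel()},
)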

Load Faster & Stronger: Cache Allocator

Zero‑copy weight sharding shaves 15 % VRAM on A100 while cutting load time below 60 s for a 100‑B model.

[Figure: memory bars]
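
These gains apply transparently to an ordinary from_pretrained call; a typical sharded load looks like this (the checkpoint name is illustrative):

import torch
from transformers import AutoModelForCausalLM

# Sharded weights stream straight onto the devices chosen by device_map.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)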

Modular Transformers: GLM by Example


class GlmMLP(Phi3MLP):
    pass

class GlmAttention(LlamaAttention):
    def __init__(self, config, layer_idx=None):
        super().__init__(config, layer_idx)
        self.o_proj = nn.Linear(
            config.num_attention_heads * self.head_dim,
            config.hidden_size,
            bias=False,
        )

def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
    # Slightly different RoPE
    ...

class GlmForCausalLM(LlamaForCausalLM):
    pass
        

AST magic expands this 40‑line prototype into a full modelling file, ready for training.

Rise of Multimodality


processor = AutoProcessor.from_pretrained("Qwen/Qwen3-8B")
model = AutoModelForConditionalGeneration.from_pretrained("Qwen/Qwen3-8B")
        

Same API across text, vision, and audio: learn once, apply everywhere.
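
The pipeline API makes the point concretely; the task names are real, the checkpoints merely illustrative:

from transformers import pipeline

# One calling convention across modalities.
text = pipeline("text-generation", model="gpt2")
vision = pipeline("image-classification", model="google/vit-base-patch16-224")
audio = pipeline("automatic-speech-recognition", model="openai/whisper-small")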

Why Python Wins

  • Low entry barrier attracts newcomers and domain specialists alike.
  • High‑level semantics concisely express low‑level intent.
  • The C++/Rust back‑end remains accessible for critical paths.

Where Python can bite 🐍

  • Interpreter overhead hurts microkernels (token‑by‑token decoding).
  • The GIL throttles concurrent host‑side work.
  • Fresh research code is easy to leave unoptimised.

Mitigations: Triton, compiled custom ops, compile‑time fallbacks, and callable kernels.

Kernel Hub: Optimised Ops from the Community

Kernel Hub lets any Python program download and hot‑load compiled CUDA/C++ kernels directly from the Hugging Face Hub at runtime.

  • Portable – kernels work from arbitrary paths outside PYTHONPATH.
  • Unique – load multiple versions of the same op side‑by‑side in one process.
  • Compatible – every kernel targets all recent PyTorch wheels (CUDA, ROCm, CPU) and C‑library ABIs.

🚀 Quick start (requires torch >= 2.5):

pip install kernels

import torch
from kernels import get_kernel

# Download optimised kernels from the Hugging Face Hub
activation = get_kernel("kernels-community/activation")

x = torch.randn(10, 10, dtype=torch.float16, device="cuda")
y = torch.empty_like(x)
activation.gelu_fast(y, x)
print(y)
        

Same Transformer code — now with a 3× faster GELU on A100s.

API Design Lessons

  • Make easy things obvious, and hard things merely possible.
  • Keep the paper‑to‑repository delta minimal for new models.
  • Hide sharding mechanics; expose developer intent.

We tune radios without learning RF theory — ML frameworks should feel as frictionless.

Model Growth by Modality

Takeaways & The Future

  • PyTorch and transformers have grown symbiotically for eight years—expect the spiral to continue.
  • Pythonicity plus pragmatism keeps the barrier to innovation low.
  • Open‑source models are shipping faster, larger, and more multimodal than ever.

hf.co/transformers/contribute