
ABDENNACER BADAOUI
Recent Activity

Welcome GPT OSS, the new open-source model family from OpenAI!

From the metrics I currently have, I looked at how much quantizing only the `mlp.down_proj` layers impacts end-to-end perplexity, compared to both the original float model and quantizing all linear layers.
I used asymmetric W4A16 quantization and evaluated on the same held-out dataset used for the sensitivity profiling.
Here’s a summary of the results:
| Model | Original PPL | `mlp.down_proj` quantized PPL | All linear layers quantized PPL (except `lm_head`) |
|---|---|---|---|
| Gemma-2B | 16.80 | 58.29 | 76.69 |
| LLaMA-3.2-3B | 13.73 | 15.11 | 17.32 |
| Qwen-3-4B | 20.10 | 22.73 | 29.70 |
Observations:
- Gemma-2B: Quantizing only the `mlp.down_proj` layers already causes a large perplexity increase (16.80 -> 58.29), showing they're a major bottleneck for precision loss.
- LLaMA-3.2-3B: The impact is smaller (13.73 -> 15.11), but still noticeable.
- Qwen-3-4B: The increase is modest (20.10 -> 22.73), and here the sensitivity is more evenly distributed: other linear layers are nearly as fragile as `mlp.down_proj`, so its relative contribution is less pronounced.
It would also be interesting to run evaluations of these models on downstream benchmarks to see how these perplexity changes translate into real-task performance differences.
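For anyone who wants to reproduce the spirit of this comparison, here is a minimal sketch, not the exact quantizer behind the numbers above: the `fake_quantize_asym` helper (per-output-channel, asymmetric, round-to-nearest), the placeholder checkpoint, and the `held_out_texts` list are illustrative assumptions. Only the weights are fake-quantized to 4 bits, which approximates a W4A16 setting.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def fake_quantize_asym(w: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    # Asymmetric round-to-nearest fake quantization, per output channel (weights only, W4A16-style).
    qmax = 2 ** n_bits - 1
    w_min = w.amin(dim=1, keepdim=True)
    w_max = w.amax(dim=1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / qmax
    zero_point = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(w / scale) + zero_point, 0, qmax)
    return (q - zero_point) * scale

@torch.no_grad()
def quantize_matching_linears(model: torch.nn.Module, pattern: str = "mlp.down_proj") -> None:
    # Fake-quantize only the Linear layers whose module name contains `pattern`.
    for name, module in model.named_modules():
        if pattern in name and isinstance(module, torch.nn.Linear):
            module.weight.copy_(fake_quantize_asym(module.weight))

@torch.no_grad()
def perplexity(model, tokenizer, texts, max_length=1024) -> float:
    # Token-weighted perplexity over a list of held-out texts.
    device = next(model.parameters()).device
    total_nll, total_tokens = 0.0, 0
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length).to(device)
        out = model(**enc, labels=enc["input_ids"])
        n_tokens = enc["input_ids"].numel()
        total_nll += out.loss.item() * n_tokens
        total_tokens += n_tokens
    return float(torch.exp(torch.tensor(total_nll / total_tokens)))

# Example usage (checkpoint and held_out_texts are placeholders):
# model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B", torch_dtype=torch.bfloat16, device_map="auto")
# tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")
# print("float PPL:", perplexity(model, tokenizer, held_out_texts))
# quantize_matching_linears(model, "mlp.down_proj")
# print("down_proj-quantized PPL:", perplexity(model, tokenizer, held_out_texts))
```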

For a detailed breakdown of the divergence-based method used to measure layer sensitivity, you can read my blog post here: https://huggingface.co/blog/badaoui/sensitivity-aware-mixed-precision-quantizer-v1

As part of my ongoing work on mixed-precision quantization, I've been exploring this question by measuring layer-by-layer sensitivity. The goal is to see whether there are universal rules for which layers can be quantized aggressively without impacting performance. The results are fascinating and reveal two key insights:
1️⃣ Sensitivity profiles are like architectural "fingerprints." Models from the same family share strikingly similar sensitivity patterns. As you can see in the charts below for the Gemma and SmolLM families, the ranking and relative sensitivity of the layers remain remarkably consistent. This suggests that the underlying architecture is a primary driver of a model's quantization behavior.
2️⃣ A "universal" mixed-precision quantization strategy is challenging. While models within a family are similar, these "fingerprints" change dramatically when comparing different architectures like LLaMA, Qwen, and StableLM. This highlights the difficulty in creating a generalized mixed-precision configuration that works optimally across all model families.
However, there is one near-universal truth we uncovered: the mlp.down_proj layer consistently emerges as one of the most sensitive components across all models studied.
This finding strongly resonates with the work in "The Super Weight in Large Language Models" (by Mengxia Yu et al.). The paper identifies that functionally critical parameters, or "super weights," are concentrated in these down_proj layers. Our empirical results provide clear validation for this theory, showing these layers are highly intolerant to precision loss.
In short, while every architecture has a unique sensitivity profile (a fingerprint shaped not only by its core design but also by its training data and optimization approach), some components remain universally critical!
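As a rough illustration of divergence-based sensitivity profiling (the full method is described in the blog post linked above; this is not the exact implementation), the sketch below fake-quantizes one group of linear layers at a time, scores it by the KL divergence between the quantized and original models' output distributions on a few calibration batches, then restores the float weights. The `fake_quantize_asym` helper, the `layer_suffixes` grouping, and `calib_batches` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def fake_quantize_asym(w: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    # Asymmetric round-to-nearest fake quantization, per output channel (weights only).
    qmax = 2 ** n_bits - 1
    w_min, w_max = w.amin(dim=1, keepdim=True), w.amax(dim=1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / qmax
    zp = torch.round(-w_min / scale)
    return (torch.clamp(torch.round(w / scale) + zp, 0, qmax) - zp) * scale

@torch.no_grad()
def layer_sensitivity(model, calib_batches,
                      layer_suffixes=("q_proj", "k_proj", "v_proj", "o_proj",
                                      "gate_proj", "up_proj", "down_proj")):
    # Reference logits from the unmodified float model.
    ref_logits = [model(**batch).logits for batch in calib_batches]
    scores = {}
    for suffix in layer_suffixes:
        targets = [m for name, m in model.named_modules()
                   if isinstance(m, torch.nn.Linear) and name.endswith(suffix)]
        backups = [m.weight.detach().clone() for m in targets]
        for m in targets:
            m.weight.copy_(fake_quantize_asym(m.weight))
        # Average KL divergence between the reference and quantized token distributions.
        kl = 0.0
        for batch, ref in zip(calib_batches, ref_logits):
            q_logits = model(**batch).logits
            kl += F.kl_div(F.log_softmax(q_logits, dim=-1),
                           F.log_softmax(ref, dim=-1),
                           log_target=True, reduction="batchmean").item()
        scores[suffix] = kl / len(calib_batches)
        # Restore the float weights before profiling the next layer group.
        for m, w in zip(targets, backups):
            m.weight.copy_(w)
    return scores
```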
What are your thoughts?

Sensitivity Aware Mixed Precision Quantization V1
Adding `safetensors` variant of this model


Optimum v1.27 marks the final major release in the v1 series. As we close this chapter, we're laying the groundwork for a more modular and community-driven future:
- Optimum v2: A lightweight core package for porting Transformers, Diffusers, or Sentence-Transformers to specialized AI hardware, software, and accelerators.
- Optimum‑ONNX: A dedicated package where the ONNX/ONNX Runtime ecosystem lives and evolves, faster-moving and decoupled from the Optimum core.
🎯 Why this matters:
- A clearer governance path for ONNX, fostering stronger community collaboration and an improved developer experience.
- Faster innovation in a more modular, open-source environment.
💡 What this means:
- More transparency, broader participation, and faster development driven by the community and key actors in the ONNX ecosystem (PyTorch, Microsoft, Joshua Lochner 👀, ...)
- A cleaner, more maintainable Optimum core, focused on extending HF libraries to specialized AI hardware/software/accelerator tooling, and used by our partners (Intel Corporation, Amazon Web Services (AWS), AMD, NVIDIA, FuriosaAI, ...)
🛠️ Major updates I worked on in this release:
✅ Added support for Transformers v4.53 and SmolLM3 in ONNX/ONNX Runtime.
✅ Fixed batched inference/generation for all supported decoder model architectures (LLMs); see the usage sketch below.
✨ Big shoutout to @echarlaix for leading the refactoring work that cleanly separated ONNX exporter logic and enabled the creation of Optimum‑ONNX.
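As a rough sketch of what the batched generation fix enables (the checkpoint and generation settings below are illustrative, not taken from the release notes):

```python
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

model_id = "HuggingFaceTB/SmolLM3-3B"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = "left"  # left-padding is needed for batched decoder-only generation
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# export=True converts the checkpoint to ONNX on the fly before loading it with ONNX Runtime.
model = ORTModelForCausalLM.from_pretrained(model_id, export=True)

prompts = ["The best thing about ONNX Runtime is", "Mixed-precision quantization works by"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```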
📝 Release Notes: https://lnkd.in/gXtE_qji
📦 Optimum: https://lnkd.in/ecAezNT6
🎁 Optimum-ONNX: https://lnkd.in/gzjyAjSi
#Optimum #ONNX #OpenSource #HuggingFace #Transformers #Diffusers

Your Own GPU-Powered Image Generator with HF Jobs
Update src/content.py
Add link to Neuron-optimized version

