ZeroGPU Explorers

Recent Activity

sarath-shekkizhar posted an update 7 days ago
Some interesting architectural choices made in Llama 4 models -- were these key to the 10M context? Possibly 🤔

🔍 Takeaways:
🧩 Interleaved attention without position encoding
- Llama 4 removes explicit positional encoding in some attention layers (NoPE) to boost performance on longer contexts.
- The principle may be similar to residual connections: it gives attention a path to early tokens that doesn't suffer positional decay (see the sketch after this list).
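
A minimal PyTorch sketch of the idea -- the `nope_period` interleaving rule and the RoPE helper are illustrative assumptions, not Llama 4's actual code:

```python
import torch
import torch.nn.functional as F

def apply_rope(x, pos):
    # Minimal rotary embedding: rotate channel pairs by a
    # position-dependent angle (standard RoPE, base 10000).
    d = x.shape[-1]
    inv_freq = 1.0 / (10000 ** (torch.arange(0, d, 2, dtype=x.dtype) / d))
    ang = pos[:, None].to(x.dtype) * inv_freq  # (seq, d/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def attention_layer(q, k, v, pos, layer_idx, nope_period=4):
    # Hypothetical interleaving rule: every `nope_period`-th layer is a
    # NoPE layer that skips positional encoding entirely, so its
    # attention to early tokens is not attenuated by positional decay.
    if layer_idx % nope_period != 0:
        q, k = apply_rope(q, pos), apply_rope(k, pos)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```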

⚖️ Scaled softmax to increase attention at inference time
- The max attention value (the softmax output) decreases as context size increases, flattening the distribution over more tokens.
- Llama 4 incorporates a context-size-dependent temperature in the softmax to steepen its slope, letting the model focus better on relevant tokens (a sketch follows this list).
- This is done only at inference time -- my guess is it was a choice made after some observation on eval datasets.
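
Since softmax(z / T) with T < 1 sharpens the distribution, scaling the queries by a factor that grows with position has the same effect and needs no retraining. A minimal sketch of one plausible form -- the log shape and the `floor_scale` / `attn_scale` constants are illustrative assumptions, not Llama 4's published values:

```python
import torch

def scale_queries(q, pos, floor_scale=8192.0, attn_scale=0.1):
    # q: (..., seq, head_dim); pos: (seq,) absolute token positions.
    # Scaling q by s multiplies the attention logits by s, which
    # steepens the softmax as the context grows instead of letting
    # the max attention value flatten out over more tokens.
    scale = torch.log(torch.floor((pos.float() + 1.0) / floor_scale) + 1.0) * attn_scale + 1.0
    return q * scale[:, None]  # (seq, 1) broadcasts over head_dim
```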

What did you think of these choices?
AtAndDev posted an update 10 days ago
Llama 4 is out...
lbourdois posted an update 27 days ago
We introduce FAT5 (Flash Attention T5) ⚡

An implementation of T5 in PyTorch with the UL2 objective, optimized for GPGPU for both training and inference thanks to 13 different optimizations.
The main one is a CUDA kernel we designed that extends @tridao's Flash Attention with RPE biases; it also supports other position encodings such as RoPE, ALiBi or FIRE.
The resulting kernel is 2x faster than an SDPA implementation (a baseline sketch follows).
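
For reference, the SDPA baseline is stock PyTorch attention with the relative-position bias passed as an additive float mask; a minimal sketch, where the random `rpe_bias` stands in for T5's learned bucketed bias:

```python
import torch
import torch.nn.functional as F

batch, heads, seq, dim = 2, 8, 512, 64
q = torch.randn(batch, heads, seq, dim)
k, v = torch.randn_like(q), torch.randn_like(q)

# Stand-in for T5's learned relative-position bias: one additive
# (seq, seq) logit matrix per head, normally bucketed by distance.
rpe_bias = torch.randn(1, heads, seq, seq)

# A float attn_mask is added to the attention logits before softmax;
# this is how an RPE bias plugs into the stock kernel. A fused kernel
# avoids materialising the full (seq, seq) bias in GPU memory.
out = F.scaled_dot_product_attention(q, k, v, attn_mask=rpe_bias)
```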
We also use Triton kernels to optimize other parts of the architecture, such as the cross-entropy and RMSNorm layers (see the sketch below).
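
As an illustration of the Triton part, a minimal sketch of what an RMSNorm kernel can look like (not FAT5's actual kernel):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def rmsnorm_kernel(x_ptr, w_ptr, out_ptr, n_cols, eps, BLOCK: tl.constexpr):
    # One program instance normalises one row of a (rows, n_cols) input.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * n_cols + cols, mask=mask, other=0.0).to(tl.float32)
    # RMSNorm: x / sqrt(mean(x^2) + eps), then a learned per-channel gain.
    rms = tl.sqrt(tl.sum(x * x, axis=0) / n_cols + eps)
    w = tl.load(w_ptr + cols, mask=mask, other=0.0).to(tl.float32)
    tl.store(out_ptr + row * n_cols + cols, x / rms * w, mask=mask)

def rmsnorm(x, weight, eps=1e-6):
    # x: (rows, n_cols) contiguous, on GPU; weight: (n_cols,).
    out = torch.empty_like(x)
    BLOCK = triton.next_power_of_2(x.shape[-1])
    rmsnorm_kernel[(x.shape[0],)](x, weight, out, x.shape[-1], eps, BLOCK=BLOCK)
    return out
```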

The various kernels have been carefully built to be compatible with BF16 and torch.compile to go even faster and achieve efficient pretraining.
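
A usage sketch of that combination, with a generic PyTorch layer standing in for FAT5:

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
layer = torch.compile(layer)  # custom kernels must avoid graph breaks to benefit

x = torch.randn(4, 256, 512)
# BF16 autocast: matmuls run in bfloat16 while precision-sensitive ops stay in fp32.
with torch.autocast(device_type="cuda" if torch.cuda.is_available() else "cpu",
                    dtype=torch.bfloat16):
    y = layer(x)
```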

All other optimizations are described in a 📝 subsequent blog post available on @huggingface 🤗: CATIE-AQ/FAT5-report.

This methodology enabled us, as a proof of concept, to efficiently pretrain a FAT5 with 147M parameters in French in a reasonable time (1,461 hours for 419B tokens), with limited resources (1 A100, i.e. a computational budget of ~€1,900) and a low carbon footprint (13.5 kg CO2 eq).

The model's weights are also available on Hugging Face: CATIE-AQ/FAT5-small.
It's not very useful in practice: it's a PoC and not an instruction-tuned model (that's planned for later).

All the code is available on GitHub if you want to pretrain your own model in your own language or for a specific domain: https://github.com/catie-aq/flashT5

To close: this was a joint project with @BorisAlbar at hf.co/CATIE-AQ.
AtAndDev posted an update about 1 month ago
There seem to be multiple paid apps shared here that are based on models from HF, but some people sell their wrappers as "products" and promote them here. For a long time, HF was the best and only platform for open-source model work, but with the recent AI website builders anyone can create a product (really crappy ones, btw) and try to sell it with no contribution back to open source. Please don't do this, or at least try fine-tuning the models you use...
Sorry for filling y'all's feed with this, but you know...
AtAndDev posted an update about 1 month ago
Gemma 3 seems to be really good on human preference. Just waiting for people to see it.