mrm8488 (Manuel Romero)

reacted to lbourdois's post with 🔥❤️ 9 months ago

Post

3736

We introduce FAT5 (Flash Attention T5) ⚡

An implementation of T5 in PyTorch with UL2 objective optimized for GPGPU for both training and inference thanks to 13 different optimizations.
The main one is that we have designed a CUDA kernel to expand the Flash Attention by @tridao with RPE biases and supports other PE such as RoPE, ALiBi or FIRE.
The result kernel is 2 times faster than a SPDA implementation.
We also use Triton kernels to optimize certain parts of the architecture, such as the cross-entropy and RMSNorm layer.

The various kernels have been carefully built to be compatible with BF16 and torch.compile to go even faster and achieve efficient pretraining.

All other optimizations are described in a 📝 subsequent blog post available on @huggingface 🤗: CATIE-AQ/FAT5-report.

This methodology enabled us to efficiently pretrain as a proof of concept a FAT5 with 147M parameters in French in a reasonable time (1,461H for 419B tokens), with limited resources (1 A100 i.e. a computational budget of ~ €1,900) and a low carbon footprint (13.5kg eq CO2).

The model's weights are also available on Hugging Face: CATIE-AQ/FAT5-small.
Not very useful in practice, it's a PoC and not an instructed model (it's planned for later).

All the code is available on GitHub if you want to pretrain your own model in your own language or for a specific domain: https://github.com/catie-aq/flashT5 ⭐

Ending by indicating that was a joint project with @BorisAlbar at hf.co/CATIE-AQ.

reacted to fdaudens's post with 🔥 almost 2 years ago

Post

2386

IBM & NASA just released open-source AI model for weather & climate on Hugging Face.

Prithvi WxC offers insights beyond forecasting, tackling challenges from local weather to global climate. Potential apps: targeted forecasts, severe weather detection & more.

Prithvi-WxC

This is impressive. Check out this comparison of the Ida hurricane between ground truth and the AI model's prediction.

reacted to their post with ❤️ about 2 years ago

Post

9068

🚨Exciting news for the Multilingual Synthetic Data Community!🚨

I’ve taken inspiration from the MAGPIE paper on Llama-3-8B-instruct and extended its capabilities. Here’s what’s new!

🗞 The MAGPIE paper showcased that if you use the instruction-tuned version (Llama-3-8B-instruct) to generate synthetic instructions and then fine-tune the base version (Llama-3-8B) on this dataset, you can improve even the it-tuned version

🤔 While reading a script by Sebastian Raschka, PhD, I wondered: Could these advancements be replicated in other languages? Specifically, could they benefit non-English datasets?

🎉 And the answer is YES! At least for Spanish. I've successfully adapted the techniques for Spanish, proving the model's flexibility and multilingual capabilities.

👩‍💻 To make this accessible, I created a basic script (heavily inspired by the Sebastian Raschka one) that allows you to generate similar datasets using ollama models (initially phi and llama3) automatically and upload it to the Hugging Face Hub!
[Script](https://gist.github.com/mrm8488/4650a5e3cc45523798a527a3446eb312)

🔍 Explore the datasets 📚 generated using our new script!

- [Llama-3-8B](https://huggingface.co/datasets/mrm8488/dataset_llama3_5000_samples_es_4231_filtered)
- [Phi-3-medium](https://huggingface.co/datasets/mrm8488/dataset_phi3-medium_5000_samples_es_3906_filtered)
- [Phi-3-mini](https://huggingface.co/datasets/mrm8488/dataset_phi3_5000_samples_es_3282_filtered)

Note: These datasets have basic filtering. Apply additional quality filters before using them to fine-tune large language models.

Inspiration and base script:
https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/05_dataset-generation/llama3-ollama.ipynb
https://www.linkedin.com/feed/update/urn:li:activity:7210982019751661568/

7 replies

·

replied to their post about 2 years ago

PRs are welcome! https://github.com/mrm8488/magpie-ollama-datagen

replied to their post about 2 years ago

Good point! I wil do!

replied to their post about 2 years ago

Thanks @osanseviero !!!

posted an update about 2 years ago

Post

9068

🚨Exciting news for the Multilingual Synthetic Data Community!🚨

I’ve taken inspiration from the MAGPIE paper on Llama-3-8B-instruct and extended its capabilities. Here’s what’s new!

🗞 The MAGPIE paper showcased that if you use the instruction-tuned version (Llama-3-8B-instruct) to generate synthetic instructions and then fine-tune the base version (Llama-3-8B) on this dataset, you can improve even the it-tuned version

🤔 While reading a script by Sebastian Raschka, PhD, I wondered: Could these advancements be replicated in other languages? Specifically, could they benefit non-English datasets?

🎉 And the answer is YES! At least for Spanish. I've successfully adapted the techniques for Spanish, proving the model's flexibility and multilingual capabilities.

👩‍💻 To make this accessible, I created a basic script (heavily inspired by the Sebastian Raschka one) that allows you to generate similar datasets using ollama models (initially phi and llama3) automatically and upload it to the Hugging Face Hub!
[Script](https://gist.github.com/mrm8488/4650a5e3cc45523798a527a3446eb312)

🔍 Explore the datasets 📚 generated using our new script!

- [Llama-3-8B](https://huggingface.co/datasets/mrm8488/dataset_llama3_5000_samples_es_4231_filtered)
- [Phi-3-medium](https://huggingface.co/datasets/mrm8488/dataset_phi3-medium_5000_samples_es_3906_filtered)
- [Phi-3-mini](https://huggingface.co/datasets/mrm8488/dataset_phi3_5000_samples_es_3282_filtered)

Note: These datasets have basic filtering. Apply additional quality filters before using them to fine-tune large language models.

Inspiration and base script:
https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/05_dataset-generation/llama3-ollama.ipynb
https://www.linkedin.com/feed/update/urn:li:activity:7210982019751661568/

7 replies

·

replied to their post about 2 years ago

Sure

reacted to their post with 🚀 about 2 years ago

Post

9699

Working on a concept GPT-2 (small) that uses KANs instead of MLPs.
The ckpt and training code will be soon on the hub.

6 replies

·

posted an update about 2 years ago

Post

9699

Working on a concept GPT-2 (small) that uses KANs instead of MLPs.
The ckpt and training code will be soon on the hub.

6 replies

·

reacted to akhaliq's post with 👍🤯 over 2 years ago

Post

GaLore

Memory-Efficient LLM Training by Gradient Low-Rank Projection

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection (2403.03507)

Training Large Language Models (LLMs) presents significant memory challenges, predominantly due to the growing size of weights and optimizer states. Common memory-reduction approaches, such as low-rank adaptation (LoRA), add a trainable low-rank matrix to the frozen pre-trained weight in each layer, reducing trainable parameters and optimizer states. However, such approaches typically underperform training with full-rank weights in both pre-training and fine-tuning stages since they limit the parameter search to a low-rank subspace and alter the training dynamics, and further, may require full-rank warm start. In this work, we propose Gradient Low-Rank Projection (GaLore), a training strategy that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods such as LoRA. Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training on LLaMA 1B and 7B architectures with C4 dataset with up to 19.7B tokens, and on fine-tuning RoBERTa on GLUE tasks. Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline. Notably, we demonstrate, for the first time, the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallel, checkpointing, or offloading strategies.

posted an update over 2 years ago

Post

Hello world! 🔥

reacted to osanseviero's post with 👍 over 2 years ago

Post

Diaries of Open Source. Part 2. Open Source is going brrrrr

🚀The European Space Agency releases MajorTOM, a dataset of earth observation covering half the earth. The dataset has 2.5 trillion pixels! Congrats @aliFrancis and @mikonvergence !
Dataset: Major-TOM/Core-S2L2A
Viewer: Major-TOM/MajorTOM-Core-Viewer

🍞Re-ranking models by MixedBreadAI, with very high quality, Apache 2 license, and easy to use!
Models: https://huggingface.co/models?other=reranker&sort=trending&search=mixedbread-ai
Blog: https://www.mixedbread.ai/blog/mxbai-rerank-v1

🧊StabilityAI and TripoAI release TripoSR, a super-fast MIT-licensed image-to-3D model!
Model: stabilityai/TripoSR
Demo: stabilityai/TripoSR

🤝Together AI and HazyResearch release Based
Models and datasets: hazyresearch/based-65d77fb76f9c813c8b94339c
GH repo: https://github.com/HazyResearch/based

🌊LaVague: an open-source pipeline to turn natural language into browser actions! It can run locally with HuggingFaceH4/zephyr-7b-gemma-v0.1
Read more about it at https://huggingface.co/posts/dhuynh95/717319217106504

🏆Berkeley Function-Calling Leaderboard
Read about it: https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html
Leaderboard: https://gorilla.cs.berkeley.edu/leaderboard.html

🐬Sailor-Chat: chat models built on top of OpenOrca and @sarahooker CohereForAI Aya project. They can be used for South-East Asia languages such as Indonesian, Thai, Vietnamese, Malay and Lao!
Models: sail/sailor-language-models-65e19a749f978976f1959825
Demo: https://huggingface.co/spaces/sail/Sailor-7B-Chat

🤗Arabic-OpenHermes-2.5: OpenHermes dataset translated to Arabic 2A2I/Arabic-OpenHermes-2.5

See the previous part here https://huggingface.co/posts/osanseviero/622788932781684

3 replies

·

reacted to macadeliccc's post with 👍🤗 over 2 years ago

Post

Reducing perplexity in LLM's through layer selective rank reduction

Layer-Selective Rank Reduction (LASER) is a denoising method that improves reasoning by the strategic removal of higher-order components from weight matrices in the multi-layer perceptron (MLP) layers without the need for additional parameters or training data. This process leverages singular value decomposition to identify and eliminate these components. This simple, yet effective, method has shown to improve question-answering performance by up to 27.4 percentage points.

LaserRMT implements this through a process by calculating signal to noise ratio (SNR) for each layer and selectively reducing the rank of these layers.The SNR method meticulously computes the SNR by leveraging singular value decomposition (SVD) to separate the signal (higher-order components) from the noise (lower-order components) within the weight matrices of the model's layers. The SNR calculation is what determines which layers would benefit from rank reduction without compromising the models integrity.

If a layer is identified that could benefit from rank reduction, then the layer will enter an incremental process where the weight matrices are reduced and reconstructed by retaining only the singular values that surpass the threshold. In the case of laserRMT, the threshold is calculated by Marchenko-Pastur Law.

@staticmethod
    def marchenko_pastur_threshold(sigma, n, m):
        beta = n / m if n < m else m / n
        threshold = sigma * np.sqrt((1 + np.sqrt(beta))**2)
        return thr

The two primary benefits of applying this method are reducing computational overhead of large language models and simultaneously improving output quality.

Credit to @ehartford @fernandofernandes @DavidGF for laserRMT

Resources:
☄️ AutoLaser: https://colab.research.google.com/drive/11j0e-w6BfvqeFN1gUrpOqdW0vcKqfVqP?usp=sharing
laserRMT: https://github.com/cognitivecomputations/laserRMT
The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction (2312.13558)

8 replies

·

reacted to osanseviero's post with 🤗❤️ over 2 years ago

Post

I finished my model merging experiment day.🤗I would love your thoughts on this.

What did I do? I merged Mistral Instruct 0.1 and 0.2 models using different merging techniques:
- SLERP: linear interpolation (most popular method)
- MoE: replace some forward layers with MoE layers; using a random gate for now
- Frankenmerge: also known as passthrough, but that isn't very cool. It concatenates some specified layers ending in different numbers of params. In my case, I went from 7B to 9B.

Note: merging is not building an ensemble of models. You can read more about merging techniques at https://huggingface.co/blog/mlabonne/merge-models

Results
I built the 3 models using mergekit (running in an HF Space) - took less than an hour to do the three) osanseviero/mistral-instruct-merges-659ebf35ca0781acdb86bb0a

I'm doing a quick check with the OpenLLM Leaderboard.
🚨The OpenLLM Leaderboard is more suitable for pre-trained models than instruct models, but I still thought it would be interesting to look at the insights🚨

You can look at the attached image. Some interesting things
- All three models performed somewhere between 0.1 and 0.2 - congrats to the 140 people who got it right in https://twitter.com/osanseviero/status/1745071548866736171
- Frankenmerge terribly sucked with GSM8K. It seems that adding some Mistral 0.1 layers actually degraded the performance a lot - this is worse than even 0.1!
- Otherwise, frankenmerge was decent across HellaSwag, MMLU, and specially TruthfulQA
- MoE is using random gating, so I expected something right in between 0.1 and 0.2, which was the case

What do I do with this?
Not sure tbh! I think doing proper MT bench evals would be nice. I also think all of us should give a nice GH star to mergekit because it's awesome. I would love to have the time to do end-to-end ablation studies, but cool new things are coming up. Let me know if you have any thoughts in the results

8 replies

·

Manuel Romero PRO

AI & ML interests

Recent Activity

Organizations

Manuel Romero PRO

AI & ML interests

Recent Activity

Organizations

mrm8488's activity