AI & ML interests

None defined yet.

Recent Activity

Biomedical-TeMU's activity

mrm8488 posted an update 6 months ago
🚨 Exciting news for the Multilingual Synthetic Data Community! 🚨

I’ve taken inspiration from the MAGPIE paper on Llama-3-8B-instruct and extended its capabilities. Here’s what’s new!

🗞 The MAGPIE paper showed that if you use the instruction-tuned version (Llama-3-8B-instruct) to generate synthetic instructions and then fine-tune the base version (Llama-3-8B) on that dataset, you can outperform even the instruction-tuned version.

🤔 While reading a script by Sebastian Raschka, PhD, I wondered: Could these advancements be replicated in other languages? Specifically, could they benefit non-English datasets?

🎉 And the answer is YES! At least for Spanish: I successfully adapted the technique, demonstrating its flexibility and multilingual potential.

πŸ‘©β€πŸ’» To make this accessible, I created a basic script (heavily inspired by the Sebastian Raschka one) that allows you to generate similar datasets using ollama models (initially phi and llama3) automatically and upload it to the Hugging Face Hub!
[Script](https://gist.github.com/mrm8488/4650a5e3cc45523798a527a3446eb312)
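
To give a feel for the approach, here is a minimal sketch of such a Magpie-style pipeline against a local Ollama server. The Spanish system prompt, the sample count, and the Hub repo id are my own illustrative assumptions, not taken from the linked script:

```python
# Minimal sketch of a Magpie-style pipeline against a local Ollama server.
# Assumptions (not from the linked script): the Spanish system prompt,
# the sample count, and the Hub repo id.
import requests
from datasets import Dataset

OLLAMA_URL = "http://localhost:11434/api/generate"

# Llama-3-instruct pre-query template. Sent with raw=True (so Ollama adds
# no template of its own), the instruction-tuned model completes the empty
# user turn with a plausible user query -- the core Magpie trick.
PRE_QUERY = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "Eres un asistente que responde en español.<|eot_id|>"  # hypothetical Spanish nudge
    "<|start_header_id|>user<|end_header_id|>\n\n"
)

def complete(prompt: str, model: str = "llama3") -> str:
    resp = requests.post(OLLAMA_URL, json={
        "model": model,
        "prompt": prompt,
        "raw": True,
        "stream": False,
        "options": {"stop": ["<|eot_id|>"], "num_predict": 512},
    })
    resp.raise_for_status()
    return resp.json()["response"].strip()

samples = []
for _ in range(100):  # scale up for a real dataset
    instruction = complete(PRE_QUERY)
    response = complete(
        PRE_QUERY + instruction +
        "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    )
    samples.append({"instruction": instruction, "output": response})

# Requires `huggingface-cli login`; the repo id below is a placeholder.
Dataset.from_list(samples).push_to_hub("your-username/magpie-style-es-demo")
```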


πŸ” Explore the datasets πŸ“š generated using our new script!

- [Llama-3-8B](https://huggingface.co/datasets/mrm8488/dataset_llama3_5000_samples_es_4231_filtered)
- [Phi-3-medium](https://huggingface.co/datasets/mrm8488/dataset_phi3-medium_5000_samples_es_3906_filtered)
- [Phi-3-mini](https://huggingface.co/datasets/mrm8488/dataset_phi3_5000_samples_es_3282_filtered)


Note: These datasets have basic filtering. Apply additional quality filters before using them to fine-tune large language models.

Inspiration and base script:
https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/05_dataset-generation/llama3-ollama.ipynb
https://www.linkedin.com/feed/update/urn:li:activity:7210982019751661568/
mrm8488 posted an update 8 months ago
Working on a concept GPT-2 (small) that uses KANs instead of MLPs.
The checkpoint and training code will be on the Hub soon.
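
Until then, here is a hedged sketch of the general idea: a GPT-2-style block whose feed-forward MLP is swapped for KAN layers. This uses a simplified Gaussian-RBF parameterization of the learned univariate edge functions (proper KANs typically use B-splines); all names and hyperparameters are illustrative assumptions, not the actual checkpoint's architecture.

```python
# Sketch: GPT-2-style block with the MLP replaced by KAN-style layers.
# Simplified Gaussian-RBF variant; real KANs usually use B-spline bases.
import torch
import torch.nn as nn

class RBFKANLayer(nn.Module):
    """y_j = sum_i phi_ij(x_i), each phi_ij a learned combination of
    num_basis fixed Gaussian bumps on a grid over [-2, 2]."""
    def __init__(self, in_dim, out_dim, num_basis=8):
        super().__init__()
        self.register_buffer("centers", torch.linspace(-2, 2, num_basis))
        self.inv_width = num_basis / 4.0  # roughly 1 / grid spacing
        self.coef = nn.Parameter(torch.randn(out_dim, in_dim, num_basis) * 0.02)

    def forward(self, x):  # x: (..., in_dim)
        basis = torch.exp(-((x.unsqueeze(-1) - self.centers) * self.inv_width) ** 2)
        return torch.einsum("...ik,oik->...o", basis, self.coef)

class KANBlock(nn.Module):
    """GPT-2 block with the usual 4x-expansion MLP replaced by two stacked
    KAN layers; no extra activation needed since KAN layers are nonlinear.
    (Causal attention mask omitted for brevity.)"""
    def __init__(self, d_model=768, n_head=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.kan = nn.Sequential(
            RBFKANLayer(d_model, 4 * d_model),
            RBFKANLayer(4 * d_model, d_model),
        )

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.kan(self.ln2(x))
```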
mrm8488 posted an update 10 months ago
Hello world! 🔥
mariagrandury posted an update 12 months ago
✅ Ever wondered how to measure transparency in model development?

My last open-source contribution for 2023 is a Space that allows you to self-assess the transparency of your model based on the 100 indicators of the Foundation Model Transparency Index (FMTI).

The original study evaluated the developers of 10 top LLMs. Curious about how yours measures up? 👀

mariagrandury/fmti-transparency-self-assessment
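
For intuition, the scoring behind such a self-assessment reduces to a checklist: the FMTI rates a developer on 100 binary indicators, so a self-assessed score is the count of indicators you satisfy. A tiny sketch with made-up indicator names (not the official FMTI wording):

```python
# Toy FMTI-style self-assessment: score = number of the 100 binary
# indicators you satisfy. Indicator names below are illustrative only.
indicators = {
    "training_data_sources_disclosed": True,
    "compute_and_energy_usage_reported": False,
    "model_evaluations_published": True,
    # ... the full index has 100 indicators
}
score = sum(indicators.values())
print(f"Transparency score: {score}/{len(indicators)} indicators satisfied")
```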

Let's commit to a 2024 with greater transparency in the AI ecosystem! 🚀
mariagrandury posted an update 12 months ago
Holiday talk about AI taking over? Let's shift the narrative!

🌟 There is no reason to believe that just because AI systems are intelligent they will want to dominate us. Yann LeCun reminds us that AI systems won't have the same motivations as humans; we'll design them not to.

🌍 Instead of getting distracted by future existential risks, we must address AI’s more pressing risks: emitting carbon, infringing copyrights, and spreading bias. Sasha Luccioni urges us to create tools and legislation that promote transparency and diversity.

💡 Dive deeper into these perspectives:
- Yann's ( @ylecun ) WIRED interview (12'): https://www.wired.com/story/artificial-intelligence-meta-yann-lecun-interview/
- Sasha's ( @sasha ) TED Talk (10'): https://www.ted.com/talks/sasha_luccioni_ai_is_dangerous_but_not_for_the_reasons_you_think

P.S.: Love this new "Posts" feature, big thanks to 🤗 for letting me try it!

What are your go-to citations for AI risks? 👇