m-ric
posted an update Oct 24
🌟🌎 Cohere releases Aya Expanse 8B & 32B: SOTA multilingual models for 23 languages!

How did they manage to beat top contenders while also adding 23 languages?

🔄 Train on synthetic data:
• Synthetic data has been said to cause model collapse after too much training
• Cohere has introduced "data arbitrage" to prevent this: strategically sample from a pool of several teacher models instead of a single teacher
• First train a pool of models, one for each group of languages, then use an internal Reward Model named "Arbiter" to evaluate their generations. Only the best generation is kept as the final completion for each prompt
➡️ This process is particularly effective in the multilingual setting, where no single teacher model performs well in all languages: here "Multilingual Arbitrage" single-handedly improves the win rate of the 8B model vs Gemma-2-9B by 10 points!
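
To make the arbitrage step concrete, here is a minimal sketch of best-of-pool selection. It is not Cohere's implementation: the names `arbitrage`, `teachers`, and `score` are hypothetical, the toy teachers are plain string functions, and the scorer only stands in for a reward model such as the "Arbiter".

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types: a teacher maps a prompt to a completion, and a
# scorer (standing in for the reward model) rates a (prompt, completion) pair.
Teacher = Callable[[str], str]
Scorer = Callable[[str, str], float]


def arbitrage(
    prompts: List[str],
    teachers: Dict[str, Teacher],
    score: Scorer,
) -> List[Tuple[str, str, str]]:
    """For each prompt, keep only the highest-scoring teacher completion.

    Returns a list of (prompt, best_completion, teacher_name) tuples.
    """
    dataset = []
    for prompt in prompts:
        # One candidate completion per teacher in the pool.
        candidates = [(name, teacher(prompt)) for name, teacher in teachers.items()]
        # The scorer arbitrates: only the best candidate is kept.
        best_name, best_completion = max(
            candidates, key=lambda cand: score(prompt, cand[1])
        )
        dataset.append((prompt, best_completion, best_name))
    return dataset


# Toy stand-ins so the sketch runs end to end (assumptions, not real models).
toy_teachers: Dict[str, Teacher] = {
    "teacher_a": lambda p: p.upper(),
    "teacher_b": lambda p: p + " " + p,
}


def toy_score(prompt: str, completion: str) -> float:
    # Placeholder reward: longer completions score higher.
    return float(len(completion))


synthetic_data = arbitrage(["hola", "bonjour"], toy_teachers, toy_score)
```

In a real pipeline the teachers would be LLM calls and the scorer a learned reward model, but the selection logic keeps the same shape: many candidates in, one completion out per prompt.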

🧩 Use model merging: Rather than struggling to find the right data mix for training a single multilingual model, just train language-specific models, then merge them!
• Maximize diversity between merged checkpoints by training each on different language families.
• They experimented with fancy techniques (SLERP, TIES, DARE-TIES) but found weighted averaging to be the most consistent!
➡️ Merging gave 3x larger gains at the 35B scale than at the 8B scale, consistent with literature findings that merging is more effective at scale
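
Weighted averaging is simple enough to sketch in a few lines. The example below is only an illustration under assumptions: checkpoints are represented as plain dicts of float lists rather than real tensors, and `merge_weighted` is a hypothetical helper, not the code used for Aya Expanse.

```python
from typing import Dict, List

# Hypothetical checkpoint format: parameter name -> flat list of floats.
StateDict = Dict[str, List[float]]


def merge_weighted(checkpoints: List[StateDict], weights: List[float]) -> StateDict:
    """Merge checkpoints by a weighted average of each parameter."""
    if not checkpoints or len(checkpoints) != len(weights):
        raise ValueError("need one weight per checkpoint")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalize so the weights sum to 1.
    norm = [w / total for w in weights]

    merged: StateDict = {}
    for name in checkpoints[0]:
        # Gather the same parameter from every checkpoint.
        params = [ckpt[name] for ckpt in checkpoints]
        # Average element by element across checkpoints.
        merged[name] = [
            sum(w * v for w, v in zip(norm, values)) for values in zip(*params)
        ]
    return merged


# Toy checkpoints, e.g. two models trained on different language families.
ckpt_a: StateDict = {"layer.weight": [1.0, 2.0]}
ckpt_b: StateDict = {"layer.weight": [3.0, 6.0]}

equal_merge = merge_weighted([ckpt_a, ckpt_b], [1.0, 1.0])
skewed_merge = merge_weighted([ckpt_a, ckpt_b], [3.0, 1.0])
```

With equal weights this reduces to a plain average; unequal weights let one language-family checkpoint count more in the merged model.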

⚡️ Great performance: Automatic evaluations on the Arena-Hard-Auto dataset:
➡️ Aya Expanse 8B beats models in its weight class such as Gemma 2 9B, Llama 3.1 8B, and the recent Ministral 8B, with win rates ranging from 60.4% to 70.6%
➡️ Aya Expanse 32B outperforms Gemma 2 27B, Mixtral 8x22B, and Llama 3.1 70B (2x its size)
• ⚠️ But this performance eval comes from only one benchmark! Let's wait for Open LLM Leaderboard evals.

🔒 CC BY-NC license

Blog post here: https://huggingface.co/blog/aya-expanse