We outperform Llama 70B with Llama 3B on hard math by scaling test-time compute 🔥
How? By combining step-wise reward models with tree search algorithms :)
We show that smol models can match or exceed the performance of their much larger siblings when given enough "time to think"
We're open sourcing the full recipe and sharing a detailed blog post.
In our blog post we cover:
📈 Compute-optimal scaling: How we implemented DeepMind's recipe to boost the mathematical capabilities of open models at test-time.
🎄 Diverse Verifier Tree Search (DVTS): An unpublished extension we developed to verifier-guided tree search. This simple yet effective method improves diversity and delivers better performance, particularly at large test-time compute budgets.
🧭 Search and Learn: A lightweight toolkit for implementing search strategies with LLMs, built for speed with vLLM (a minimal sketch of the core idea follows below).
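To make the verifier idea concrete, here's a minimal best-of-N sketch (not the actual Search and Learn code; the model id and the `score_with_prm` helper are illustrative placeholders):

```python
from vllm import LLM, SamplingParams

# Illustrative generator; the blog post scales a Llama 3B model with a
# process reward model (PRM) as the verifier.
llm = LLM(model="meta-llama/Llama-3.2-3B-Instruct")
params = SamplingParams(n=16, temperature=0.8, max_tokens=1024)

def score_with_prm(problem: str, solution: str) -> float:
    # Placeholder: a real step-wise reward model scores each intermediate
    # step of the solution and aggregates (e.g. product of step scores).
    return 0.0

problem = "What is the sum of the first 100 positive integers?"
candidates = llm.generate([problem], params)[0].outputs

# Best-of-N: keep the candidate the verifier scores highest. Tree search
# methods (beam search, DVTS) apply the same scoring at every step of the
# solution instead of only at the end.
best = max(candidates, key=lambda c: score_with_prm(problem, c.text))
print(best.text)
```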
Yesterday, Mistral released their latest base model (via magnet link of course 😅) and the community quickly converted it to transformers format and pushed it to the Hub: mistral-community/Mixtral-8x22B-v0.1
Early evals of this model looked extremely strong, so we teamed up with Argilla and KAIST AI to cook up a Zephyr recipe with a few new alignment techniques that came out recently:
🧑‍🍳 Align the base model with Odds Ratio Preference Optimisation (ORPO). This novel algorithm, developed by @JW17, @nlee-208, and @j6mes, does not require an SFT step to achieve high performance and is thus much more computationally efficient than methods like DPO and PPO.
🦫 Use a brand new dataset of 7k high-quality, multi-turn preferences that has been developed by our friends at Argilla. To create this dataset, they took the excellent Capybara SFT dataset from @LDJnr (LDJnr/Capybara) and converted it into a preference dataset by augmenting the final turn with responses from new LLMs that were then ranked by GPT-4.
What we find especially neat about this approach is that training on 7k samples only takes ~1.3h on 4 H100 nodes, yet produces a model that is very strong on chat benchmarks like IFEval and BBH.
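For those who want to try the recipe themselves, a minimal sketch with TRL's ORPOTrainer looks roughly like this (hyperparameters are illustrative, and the raw Argilla dataset stores chats as message lists, so in practice you need to flatten it into prompt/chosen/rejected text columns first):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model_id = "mistral-community/Mixtral-8x22B-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Preference pairs; ORPOTrainer expects "prompt"/"chosen"/"rejected" columns.
dataset = load_dataset("argilla/distilabel-capybara-dpo-7k-binarized", split="train")

config = ORPOConfig(
    output_dir="zephyr-orpo",
    beta=0.05,  # weight of the odds-ratio term (lambda in the ORPO paper)
    max_length=2048,
    per_device_train_batch_size=1,
    num_train_epochs=3,
)
trainer = ORPOTrainer(model=model, args=config, train_dataset=dataset, tokenizer=tokenizer)
trainer.train()
```

Because ORPO folds the preference signal into a single loss, there is no separate SFT stage and no reference model to keep in memory, which is where the compute savings over DPO come from.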
Misc models:
🦖 T-Rex2, a very powerful object detection model for many applications https://github.com/IDEA-Research/T-Rex
👀 CT-RATE: a 3D dataset paired with text reports ibrahimhamamci/CT-RATE
🐙 Octopus v2: a Gemma-based model trained for Android APIs - extremely fast, better than Llama+RAG, great results NexaAIDev/Octopus-v2
🌏 Models and datasets around the world
- Tess-70B, a MiQu-70B fine-tune with high-quality data migtissera/Tess-70B-v1.6
- UNI, a model trained on 100 million pathology images from 100k+ slides MahmoodLab/UNI
- CONCH, a VLM trained on 1.17 million pathology image-text pairs MahmoodLab/CONCH
Can we align code generation models to be good at chat without compromising their base capabilities 🤔?
This was the question the H4 team asked itself when BigCode released StarCoder2 a bit over a week ago. We knew that code models like deepseek-ai/deepseek-coder-6.7b-instruct and m-a-p/OpenCodeInterpreter-DS-33B get impressive scores on code benchmarks like HumanEval, but they tend to score poorly on chat benchmarks like MT Bench and IFEval. We also knew that the Zephyr recipe we applied to Mistral 7B produced a strong chat model, so we wondered -- could it be tweaked to produce a strong coding assistant?
It turns out the answer is yes and I'm happy to share StarChat2, a DPO fine-tune of StarCoder2 15B that scores highly on both HumanEval and MT Bench / IFEval 🌟!
The most interesting lesson for me was that you get better models by blending in more code/math data than chat during the SFT step - in terms of tokens, we found a 3:1 ratio of code/math to chat data worked best.
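If you want to build a similar blend, `datasets.interleave_datasets` gets you most of the way there (the dataset ids below are hypothetical):

```python
from datasets import interleave_datasets, load_dataset

code_math = load_dataset("your-org/code-math-sft", split="train")  # hypothetical id
chat = load_dataset("your-org/chat-sft", split="train")            # hypothetical id

# Draw ~75% of examples from code/math and ~25% from chat. Note this is a
# per-example ratio; matching the 3:1 *token* ratio exactly would mean
# weighting the probabilities by average sequence length.
mixed = interleave_datasets([code_math, chat], probabilities=[0.75, 0.25], seed=42)
```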
Anyway, here's a demo of the model, along with all the code and datasets we used to train it:
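For a quick local test, something like this should work (assuming the released checkpoint id):

```python
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/starchat2-15b-v0.1",  # assumed checkpoint id
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function that merges two sorted lists."},
]
# Format the conversation with the model's chat template, then generate.
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
out = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7)
print(out[0]["generated_text"])
```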
SpeechBrain 1.0: a toolkit with hundreds of recipes and pretrained models for audio-related tasks, such as speech recognition, diarization, and enhancement. New major release!
HF repos: https://huggingface.co/speechbrain
Website: https://speechbrain.github.io/
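Trying a pretrained recipe takes a few lines; note the inference interfaces moved from `speechbrain.pretrained` to `speechbrain.inference` in 1.0 (the model id below is one of the documented LibriSpeech recipes):

```python
from speechbrain.inference.ASR import EncoderDecoderASR

# Download a pretrained ASR pipeline from the Hub and transcribe a file.
asr = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-crdnn-rnnlm-librispeech",
    savedir="pretrained_models/asr-crdnn-rnnlm-librispeech",
)
print(asr.transcribe_file("path/to/audio.wav"))  # replace with your own file
```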
The community has struggled to do a good preference-tune of Gemma, so the amazing @lewtun and @philschmid built an open-source recipe and trained a model to help people get started.
Some interesting details:
- Fine-tuned on DEITA and DPOed with the Argilla DPO dataset
- Very strong MT Bench results (7.81), better than Zephyr Beta (Mistral-based) and Gemma Instruct
- Can run locally with tools such as llama.cpp on a Mac
- Not-so-good AGIEval results compared to Mistral-based tunes
- All training code is open-sourced
- Trained for 105 minutes on 8x H100
- No system message
Big kudos to the team! Super exciting to see a good fine-tune for Gemma
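If you want to poke at it yourself, a minimal chat sketch (the checkpoint id is an assumption, and note the recipe uses no system message):

```python
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-gemma-v0.1",  # assumed checkpoint id
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# No system message, matching the training setup described above.
messages = [{"role": "user", "content": "Explain DPO in two sentences."}]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
out = pipe(prompt, max_new_tokens=128, do_sample=True, temperature=0.7)
print(out[0]["generated_text"])
```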
The paper shows an adversarial attack strategy in which a user sends malicious queries that can affect the output of other users' queries in the same batch.
So if in the same batch we have:
- User A: benign query
- User B: malicious query
The response for A might be altered! 😱
How is this possible? One approach is to fill the token buffers with adversarial data, forcing the gating to route to non-ideal experts or to drop the benign tokens entirely (in the case of a finite buffer size).
This assumes the adversary can use the model as a black box, but can observe the logit outputs and ensure that their data is always grouped in the same batch as the victim's.
How to mitigate this?
- Randomize batch order (and even run twice if some queries are very sensitive)
- Use a large capacity slack
- Sample from gate weights instead of top-k (not great IMO, as that requires more memory for inference)
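A toy router makes the failure mode (and the mitigations) easy to see; this is a simplified top-1 gate with finite expert capacity, not the paper's code:

```python
import torch

def route_top1(gate_logits: torch.Tensor, capacity: int) -> torch.Tensor:
    """Assign each token to its top-1 expert, dropping tokens once an
    expert's buffer is full. Assignment happens in batch order, so an
    attacker who floods the buffers of the "ideal" experts early in the
    batch can force later (benign) tokens to be dropped or rerouted."""
    num_experts = gate_logits.shape[-1]
    choices = gate_logits.argmax(dim=-1)        # deterministic top-k (k=1)
    load = torch.zeros(num_experts, dtype=torch.long)
    assignment = torch.full_like(choices, -1)   # -1 marks a dropped token
    for i, e in enumerate(choices.tolist()):
        if load[e] < capacity:
            assignment[i] = e
            load[e] += 1
    return assignment

# The mitigations above, in code terms:
# - randomize batch order: permute the tokens before routing, so which
#   tokens overflow is no longer deterministic;
# - capacity slack: increase `capacity` so the buffers are hard to fill;
# - sample the gate: replace argmax with
#   torch.multinomial(gate_logits.softmax(-1), 1).squeeze(-1).
```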