RAG is evolving fast, keeping pace with cutting-edge AI trends. It is now becoming more agentic and smarter at navigating complex structures like hypergraphs.
Over the last couple of weeks, a large number of studies on inference-time scaling have emerged. And it's so cool, because each new paper adds a trick to the toolbox, making LLMs more capable without needing to scale the models' parameter count.
So here are 13 new methods + 3 comprehensive studies on test-time scaling:
3. Z1: Efficient Test-time Scaling with Code (2504.00810) Proposes to train LLMs on code-based reasoning paths to make test-time scaling more efficient, limiting unnecessary tokens with a special dataset and a Shifted Thinking Window
AI inference is the process by which a pre-trained AI model generates predictions, classifications, or decisions from input data. It encompasses a wide range of approaches that differ in their computational methods and deployment.
Firstly, here are 5 inference types, based on how the model reasons:
1. Probabilistic inference -> https://arxiv.org/pdf/2502.05244 Uses probability theory to reason under uncertainty. The system maintains degrees of belief over hypotheses and updates them as evidence comes in.
3. Logical inference -> https://arxiv.org/abs/2009.03393 Uses formal logic to draw conclusions that are guaranteed true if the premises are. It supports theorem proving, logic programming, and tasks needing correctness, like software verification.
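To make "updating degrees of belief as evidence comes in" concrete, here is a minimal sketch of a Bayesian update over a discrete set of hypotheses. The hypothesis names and all numbers are made up for illustration and are not taken from the linked paper.

```python
def bayes_update(prior, likelihood):
    """Return the posterior P(h | evidence) for every hypothesis h.

    prior:      dict mapping hypothesis -> P(h)
    likelihood: dict mapping hypothesis -> P(evidence | h)
    """
    # Bayes' rule: posterior is proportional to prior * likelihood.
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}


# Illustrative numbers only: two hypotheses, equally likely at first.
belief = {"rain": 0.5, "dry": 0.5}
# Hypothetical evidence (dark clouds) that is more likely under "rain".
clouds = {"rain": 0.8, "dry": 0.2}

belief = bayes_update(belief, clouds)   # belief in "rain" rises to 0.8
belief = bayes_update(belief, clouds)   # a second observation raises it further
```

Each call folds one more piece of evidence into the current belief, so the posterior from one step becomes the prior for the next.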
How can Chain-of-Thought (CoT) prompting unlock models' full potential across images, video, audio and more? The answer lies in specialized multimodal CoT techniques.
Here are 9 methods of Multimodal Chain-of-Thought (MCoT). Most of them are open-source:
Since Transformers are used everywhere, it's helpful to understand RoPE (Rotary Position Embedding). Because token order matters, RoPE encodes it by rotating token embeddings by an angle that depends on their position, so the model can tell which token comes first, second, and so on.
Here are 8 types of RoPE that can be implemented in different cases:
4. Multimodal RoPE (MRoPE) -> Qwen2.5-VL Technical Report (2502.13923) Decomposes positional embedding into 3 components: temporal, height and width, so that positional features are aligned across modalities: text, images and videos.
8. XPos (Extrapolatable Position Embedding) -> https://huggingface.co/papers/2212.10 Introduces an exponential decay factor into the rotation matrix, improving stability on long sequences.
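The basic rotation that these variants build on can be sketched in a few lines. This is a simplified illustration in plain Python, assuming the commonly used frequency base of 10000 and rotation of consecutive pairs of dimensions; real implementations apply the rotation to query and key tensors inside attention and differ in layout details.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by an angle that grows with position."""
    d = len(vec)
    assert d % 2 == 0, "the dimension must be even so it splits into pairs"
    out = []
    for i in range(0, d, 2):
        # Each pair gets its own frequency: early pairs rotate fast, later ones slowly.
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# The score depends only on the relative offset (2 in both cases),
# not on the absolute positions.
near = dot(rope(q, 5), rope(k, 3))
far = dot(rope(q, 12), rope(k, 10))
```

Because a rotation never changes a vector's length, position is injected without distorting the embedding's magnitude, and the dot product between two rotated vectors reflects how far apart the tokens are.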
Attention mechanisms allow models to dynamically focus on specific parts of their input when performing tasks. In our recent article, we discussed Multi-Head Latent Attention (MLA) in detail and now it's time to summarize other existing types of attention.
Here is a list of 15 types of attention mechanisms used in AI models:
3. Self-attention -> Attention Is All You Need (1706.03762) Each element in the sequence "looks" at other elements and "decides" how much to borrow from each of them for its new representation.
5. Multi-Head Attention (MHA) -> Attention Is All You Need (1706.03762) Multiple attention “heads” are run in parallel. The model computes several attention distributions (heads), each with its own set of learned projections of queries, keys, and values.
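The "look at other elements and decide how much to borrow" idea can be sketched as scaled dot-product attention in plain Python. This is a simplified illustration: the learned query, key and value projections that real models use are omitted, and the multi-head version below simply slices each vector into equal chunks, which is an assumption made to keep the example short.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention; returns (outputs, attention weights)."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # How strongly does this query match each key?
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # New representation: weighted mix of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights


def multi_head_self_attention(x, num_heads):
    """Run self-attention independently on each slice of the embedding, then concatenate."""
    d = len(x[0])
    assert d % num_heads == 0
    size = d // num_heads
    heads = []
    for h in range(num_heads):
        part = [row[h * size:(h + 1) * size] for row in x]
        out, _ = attention(part, part, part)  # self-attention: q, k, v share one source
        heads.append(out)
    return [sum((heads[h][t] for h in range(num_heads)), [])
            for t in range(len(x))]


tokens = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
mixed = multi_head_self_attention(tokens, num_heads=2)
```

Each row of attention weights sums to 1, so every output is a weighted average of the value vectors, and each head computes its own distribution over the sequence.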
Diffusion models are widely used for image and video generation but remain underexplored in text generation, where autoregressive models (ARMs) dominate. Unlike ARMs, which produce tokens sequentially, diffusion models iteratively refine noise through denoising steps, offering greater flexibility and speed. Recent advancements show a shift toward using diffusion models in place of, or alongside, ARMs. Researchers also combine strengths from both methods and integrate autoregressive concepts into diffusion.
Here are 5 new implementations of diffusion models:
1. Mercury family of diffusion LLMs (dLLMs) by Inception Labs -> https://www.inceptionlabs.ai/news It applies diffusion to text and code data, enabling sequence generation 10x faster than today's top LLMs. Mercury Coder, which is now available, can run at over 1,000 tokens/sec on NVIDIA H100s.
3. LLaDA -> Large Language Diffusion Models (2502.09992) Shows diffusion models' potential in replacing ARMs. Trained with pre-training and SFT, LLaDA masks tokens, predicts them via a Transformer, and optimizes a likelihood bound. LLaDA matches key LLM skills, and surpasses GPT-4o in reversal poetry.
5. Generalized Interpolating Discrete Diffusion (GIDD) -> Generalized Interpolating Discrete Diffusion (2503.04482) A flexible noising process with a novel diffusion ELBO enables combining masking and uniform noise, allowing diffusion models to correct their own mistakes, which is where ARMs struggle.
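The mask-then-predict idea behind masked text diffusion can be illustrated with a toy loop. This is only a sketch under strong assumptions: `predict` is a hypothetical stand-in for the Transformer, and revealing positions left to right in equal shares is a simplification chosen for brevity, not the schedule used by LLaDA or any of the methods above.

```python
import random

MASK = "[MASK]"


def forward_mask(tokens, t, rng):
    """Forward (noising) process: mask each token independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def reverse_unmask(tokens, predict, steps):
    """Reverse (denoising) process: fill in a share of the masked positions per step."""
    tokens = list(tokens)
    for remaining in range(steps, 0, -1):
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        if not masked:
            break
        # Reveal enough positions now so that nothing is left masked after the last step.
        k = -(-len(masked) // remaining)  # ceiling division
        for i in masked[:k]:
            tokens[i] = predict(tokens, i)
    return tokens


target = ["diffusion", "models", "refine", "text"]
rng = random.Random(0)
noisy = forward_mask(target, t=1.0, rng=rng)  # t=1.0 masks every token

# Hypothetical oracle "model" that always knows the right token.
restored = reverse_unmask(noisy, lambda toks, i: target[i], steps=2)
```

Unlike an autoregressive model, the loop is free to fill positions in any order and several at a time, which is where the flexibility and speed discussed above come from.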
Chain-of-Thought (CoT) prompting enhances reasoning in AI models by breaking down complex problems into step-by-step logical sequences. It continues proving its effectiveness, especially in top-performing reasoning models. However, there are other similar methods that extend CoT and can be used for different purposes. Here are 9 of them:
4. Chain-of-RAG -> https://huggingface.co/papers/2501.14342 Creates retrieval chains instead of retrieving all info at once. It can dynamically adjust its search process and parameters such as the number of steps.
9. Chain(s)-of-Knowledge -> https://www.turingpost.com/p/cok Enhances LLMs by dynamically pulling in external knowledge to improve accuracy and reduce errors
Agents seem to be everywhere and this collection is for a deep dive into the theory and practice:
1. "Agents" Google's whitepaper by Julia Wiesinger, Patrick Marlow and Vladimir Vuskovic -> https://www.kaggle.com/whitepaper-agents Covers agents, their functions, tool use and how they differ from models
3. "AI Engineer Summit 2025: Agent Engineering" 8-hour video -> https://www.youtube.com/watch?v=D7BzTxVVMuw Experts' talks that share insights on the freshest Agent Engineering advancements, such as Google Deep Research, scaling tips and more
5. "Artificial Intelligence: Foundations of Computational Agents", 3rd Edition, book by David L. Poole and Alan K. Mackworth -> https://artint.info/3e/html/ArtInt3e.html Agents' architectures, how they learn, reason, plan and act with certainty and uncertainty
7. The Turing Post articles "AI Agents and Agentic Workflows" on Hugging Face -> @Kseniase We explore agentic workflows in detail and agents' building blocks, such as memory and knowledge
We've noticed a huge interest in test-time scaling (TTS), so we decided to explore this concept further. Test-time compute (TTC) refers to the amount of computational power used by an AI model when generating a response. Many researchers are now focused on scaling TTC, as it enables slow, deep "thinking" and step-by-step reasoning, which improves models' overall performance.
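Two simple and widely used ways to spend extra test-time compute are to sample several answers and take a majority vote (self-consistency), or to sample several candidates and keep the one a scorer likes best (best-of-N). The sketch below shows both; `sample_fn` and `score_fn` are hypothetical stand-ins for a model call and a verifier or reward model.

```python
from collections import Counter


def self_consistency(sample_fn, n):
    """Draw n answers and return the most frequent one."""
    answers = [sample_fn() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]


def best_of_n(sample_fn, score_fn, n):
    """Draw n candidates and return the one with the highest score."""
    candidates = [sample_fn() for _ in range(n)]
    return max(candidates, key=score_fn)


# Stand-in "model" that replays a fixed list of sampled answers.
fake_samples = iter(["42", "41", "42", "40", "42"])
print(self_consistency(lambda: next(fake_samples), n=5))  # prints 42
```

In both cases `n` is the knob that trades compute for quality: a larger `n` means more model calls per question and, typically, a more reliable final answer.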
RAG techniques continuously evolve to enhance LLM response accuracy by retrieving relevant external data during generation. To keep up with current AI trends, new RAG types incorporate deep step-by-step reasoning, tree search, citations, multimodality and other effective techniques.
3. Chain-of-Retrieval Augmented Generation (CoRAG) -> Chain-of-Retrieval Augmented Generation (2501.14342) Retrieves information step-by-step and adjusts it, also deciding how much compute power to use at test time. If needed it reformulates queries.
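The step-by-step retrieval idea can be illustrated with a toy loop. This is a sketch, not the CoRAG implementation: the word-overlap retriever, the query reformulation rule and the tiny corpus are all hypothetical, and `max_steps` stands in for the test-time compute budget.

```python
def retrieve(query, corpus, seen):
    """Toy retriever: return the unseen document sharing the most words with the query."""
    query_words = set(query.lower().split())
    best, best_score = None, 0
    for doc in corpus:
        if doc in seen:
            continue
        score = len(query_words & set(doc.lower().split()))
        if score > best_score:
            best, best_score = doc, score
    return best


def chain_of_retrieval(question, corpus, max_steps):
    """Retrieve one document per step, reformulating the query after each step."""
    chain = []
    query = question
    for _ in range(max_steps):
        doc = retrieve(query, corpus, chain)
        if doc is None:
            break
        chain.append(doc)
        # Hypothetical reformulation: fold what was found so far into the next query.
        query = " ".join([question] + chain)
    return chain


corpus = [
    "Marie Curie was born in Warsaw",
    "Warsaw is the capital of Poland",
    "Paris is the capital of France",
]
question = "which country is Marie Curie from"

one_hop = chain_of_retrieval(question, corpus, max_steps=1)
two_hop = chain_of_retrieval(question, corpus, max_steps=2)
```

With a single step the retriever only finds the birthplace. The second step succeeds because the reformulated query now contains "Warsaw", which a one-shot retrieval on the original question would not have had.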
DeepSeek-R1's top reasoning capabilities showed us all the true power of reinforcement learning (RL). At its core, RL is a type of machine learning where a model/agent learns to make decisions by interacting with an environment to maximize a reward. RL learns through trial and error, receiving feedback in the form of rewards or penalties.
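The trial-and-error loop can be illustrated with one of the simplest RL settings, a multi-armed bandit with an epsilon-greedy agent. This is a minimal sketch: the reward probabilities, step count and epsilon value are arbitrary choices for illustration, and it is far simpler than the RL used to train reasoning models.

```python
import random


def run_bandit(arm_probs, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a bandit whose arms pay 1 with the given probabilities."""
    rng = random.Random(seed)
    counts = [0] * len(arm_probs)
    estimates = [0.0] * len(arm_probs)  # running average reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random action.
            arm = rng.randrange(len(arm_probs))
        else:
            # Exploit: pick the action that currently looks best.
            arm = max(range(len(arm_probs)), key=lambda a: estimates[a])
        # The environment responds with a reward (1) or nothing (0).
        reward = 1.0 if rng.random() < arm_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the average reward seen for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts


estimates, counts = run_bandit([0.1, 0.5, 0.9])
best_arm = max(range(len(estimates)), key=lambda a: estimates[a])
```

The agent is never told which arm is best. It discovers this from rewards alone, and the occasional random action keeps it from settling too early on a poor choice.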
Here's a list of free sources that will help you dive into RL and how to use it:
2. Hugging Face Deep Reinforcement Learning Course -> https://huggingface.co/learn/deep-rl-course/unit0/introduction You'll learn how to train agents in unique environments using the best libraries, share your results, compete in challenges, and earn a certificate.
4. "Reinforcement Learning and Optimal Control" books, video lectures and course material by Dimitri P. Bertsekas from ASU -> https://web.mit.edu/dimitrib/www/RLbook.html Explores approximate Dynamic Programming (DP) and RL with key concepts and methods like rollout, tree search, and neural network training for RL and more.
8. Concepts: RLHF, RLAIF, RLEF, RLCF -> https://www.turingpost.com/p/rl-f Our flashcards explain what these four RL approaches are and how their feedback sources differ
7 Open-source Methods to Improve Video Generation and Understanding
The AI community is making great strides toward achieving the full potential of multimodality in video generation and understanding. Last week's studies showed that working with videos is now one of the main focuses for improving AI models. Another highlight of the week is that open source, once again, proves its value. For those who were impressed by DeepSeek-R1, we’re with you!
Today, we’re combining these two key focuses and bringing you a list of open-source methods for better video generation and understanding:
Over the last few weeks, we have witnessed a surge in AI models' math reasoning capabilities. Top companies like Microsoft, NVIDIA, and Alibaba Qwen have already joined this race to make models "smarter" in mathematics. But why is this shift happening now?
Complex math calculations require advanced multi-step reasoning, making mathematics an ideal domain for demonstrating a model's strong "thinking" capabilities. Additionally, as AI continues to evolve and is applied in math-intensive fields such as machine learning and quantum computing (which is predicted to see significant growth in 2025), it must meet the demands of complex reasoning. Moreover, AI models can be integrated with external tools like symbolic solvers or computational engines to tackle large-scale math problems, which also requires high-quality math reasoning.
So here’s a list of 10 recent advancements in math reasoning of AI models:
Today, we spoke with Snowflake’s AI Research Team Leads, Yuxiong He and Samyam Rajbhandari (@samyam), who is also one of the researchers behind DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference (2401.08671) and other DeepSpeed papers. While collaborating with their co-authors to reduce inference costs for enterprise-specific tasks, they observed that inputs are often significantly larger than outputs. This is because enterprises tend to analyze enormous amounts of information to extract valuable insights, which are much shorter. To address this, they developed SwiftKV (SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation (2410.03960)), an optimization that reduces LLM inference costs by up to 75% for Meta Llama LLMs, enhancing efficiency and performance in enterprise AI tasks.
Today they are open-sourcing SwiftKV (Snowflake/Llama-3.1-SwiftKV-8B-Instruct) and the ArcticTraining Platform. In our new episode of "15 minutes with a Researcher", they explain how SwiftKV works, its applicability to other architectures, its limitations, and additional methods to further reduce computation costs in inference. Watch the full 15-minute interview here (https://youtu.be/9x1k7eXe-6Q?si=4_HQOyi1CPHgvlrx)
Almost every AI researcher has read or written a large number of AI research papers. So, it's quite logical that researchers are trying to create AI systems to help conduct research. Doing scientific research could be much easier and more varied if we use LLMs and AI assistants tailored for this purpose. Just imagine how interesting it would be to read high-quality research about AI made by an AI agent.
Today, we invite you to explore these 10 AI systems for scientific research: