Ksenia Se

Kseniase

AI & ML interests

None yet

Recent Activity

replied to their post 3 days ago
posted an update 3 days ago

Organizations

Turing Post, Journalists on Hugging Face, Social Post Explorers, Hugging Face Discord Community, Sandbox

Kseniase's activity

replied to their post 3 days ago
posted an update 3 days ago
Post
1638
9 Multimodal Chain-of-Thought methods

How can Chain-of-Thought (CoT) prompting unlock models' full potential across images, video, audio, and more? Specialized multimodal CoT techniques are the answer.

Here are 9 Multimodal Chain-of-Thought (MCoT) methods. Most of them are open-source:

1. KAM-CoT -> KAM-CoT: Knowledge Augmented Multimodal Chain-of-Thoughts Reasoning (2401.12863)
This lightweight framework combines CoT prompting with knowledge graphs (KGs) and achieves 93.87% accuracy on the ScienceQA benchmark

2. Multimodal Visualization-of-Thought (MVoT) -> Imagine while Reasoning in Space: Multimodal Visualization-of-Thought (2501.07542)
Lets models generate visual reasoning traces, using a token discrepancy loss to improve visual quality

3. Compositional CoT (CCoT) -> Compositional Chain-of-Thought Prompting for Large Multimodal Models (2311.17076)
Uses scene graph (SG) representations generated by the LMM itself to improve performance on compositional and general multimodal benchmarks

4. URSA -> URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics (2501.04686)
Brings System 2-style thinking to multimodal math reasoning, using a 3-module CoT data synthesis process with CoT distillation, trajectory-format rewriting and format unification

5. MM-Verify -> MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification (2502.13383)
Introduces a verification mechanism with two components, MM-Verifier and MM-Reasoner, built on synthesized high-quality CoT data for multimodal reasoning

6. Duty-Distinct CoT (DDCoT) -> DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models (2310.16436)
Divides reasoning responsibilities between language models and visual models, integrating visual recognition capabilities into the joint reasoning process

7. Multimodal-CoT from Amazon Web Services -> Multimodal Chain-of-Thought Reasoning in Language Models (2302.00923)
A two-stage framework that separates rationale generation from answer prediction, allowing the model to reason more effectively over multimodal inputs; a minimal sketch of this two-stage structure follows the list

8. Graph-of-Thought (GoT) -> Beyond Chain-of-Thought, Effective Graph-of-Thought Reasoning in Large Language Models (2305.16582)
This two-stage framework models reasoning as a graph of interconnected ideas, improving performance on text-only and multimodal tasks
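For intuition, here is a minimal, model-agnostic sketch of the rationale-then-answer pattern from point 7 (my own illustration, not code from any of the papers above; `vlm_generate` and `answer_with_mcot` are hypothetical names, and the stub must be replaced with a real vision-language model or API):

```python
# Minimal sketch of two-stage multimodal CoT: generate a rationale first,
# then predict the answer conditioned on that rationale.
# `vlm_generate` is a hypothetical stand-in for any vision-language model call.

def vlm_generate(prompt: str, image_path: str) -> str:
    """Hypothetical placeholder: replace with a real VLM or API call."""
    return f"[model output for {image_path} | prompt: {prompt[:40]}...]"

def answer_with_mcot(question: str, image_path: str) -> str:
    # Stage 1: ask only for the reasoning chain, grounded in the image.
    rationale = vlm_generate(
        f"Question: {question}\n"
        "Look at the image, describe the relevant visual evidence, and reason step by step. "
        "Do not give the final answer yet.",
        image_path,
    )
    # Stage 2: condition the final answer on the question, the image, and the rationale.
    answer = vlm_generate(
        f"Question: {question}\nRationale: {rationale}\n"
        "Using the rationale above, give the final answer only.",
        image_path,
    )
    return answer

print(answer_with_mcot("How many birds are on the wire?", "birds.jpg"))
```

Keeping the two calls separate is what lets the rationale act as an explicit intermediate step instead of being entangled with answer decoding.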

More in the comments👇
  • 1 reply
reacted to their post with 🚀❤️👀 8 days ago
replied to their post 10 days ago
posted an update 10 days ago
Post
4964
8 types of RoPE

Since we use Transformers all the time, it's helpful to understand RoPE (Rotary Position Embedding). Token order matters, and RoPE encodes it by rotating token embeddings according to their positions, so the model knows which token comes first, second, and so on.
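As a quick illustration of that rotation, here is a minimal NumPy sketch (my own simplification, not code from any of the papers below; real implementations differ in how they pair dimensions and typically apply this to per-head queries and keys):

```python
import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate token embeddings by position. x has shape (seq_len, dim), dim even."""
    seq_len, dim = x.shape
    half = dim // 2
    # One frequency per dimension pair, as in the RoFormer formulation.
    inv_freq = base ** (-np.arange(half) / half)
    angles = np.outer(np.arange(seq_len), inv_freq)  # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # 2D rotation of each (x1, x2) pair by its position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

q = rope(np.random.randn(16, 64))  # apply to queries (and keys) before attention
```

Because both queries and keys get rotated this way, their dot product depends on the distance between positions rather than on absolute positions, which is how self-attention picks up relative positional information.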

Here are 8 types of RoPE that can be implemented in different cases:

1. Original RoPE -> RoFormer: Enhanced Transformer with Rotary Position Embedding (2104.09864)
Encodes token positions by rotating token embeddings in the complex plane via a position-based rotation matrix, thereby providing the self-attention mechanism with relative positional info.

2. LongRoPE -> LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens (2402.13753)
Extends the context window of pre-trained LLMs to 2048k tokens, leveraging non-uniformities in positional interpolation with an efficient search.

3. LongRoPE2 -> LongRoPE2: Near-Lossless LLM Context Window Scaling (2502.20082)
Extends the effective context window of pre-trained LLMs to the target length, rescaling RoPE guided by “needle-driven” perplexity evaluation.

4. Multimodal RoPE (MRoPE) -> Qwen2.5-VL Technical Report (2502.13923)
Decomposes positional embedding into 3 components (temporal, height, and width), so that positional features are aligned across modalities: text, images, and videos. A toy layout of these position ids is sketched after the list.

5. Directional RoPE (DRoPE) -> DRoPE: Directional Rotary Position Embedding for Efficient Agent Interaction Modeling (2503.15029)
Adds an identity scalar, improving how angles are handled without extra complexity. It helps balance accuracy, speed, and memory usage.

6. VideoRoPE -> VideoRoPE: What Makes for Good Video Rotary Position Embedding? (2502.05173)
Adapts RoPE for video, featuring 3D structure, low-frequency temporal allocation, diagonal layout, and adjustable spacing.

7. VRoPE -> VRoPE: Rotary Position Embedding for Video Large Language Models (2502.11664)
Another RoPE variant for video, which restructures positional indices and balances encoding for uniform spatial focus.

8. XPos (Extrapolatable Position Embedding) -> https://huggingface.co/papers/2212.10
Introduces an exponential decay factor into the rotation matrix, improving stability on long sequences.
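To make point 4 more concrete, here is a toy sketch of how the three position-id components might be laid out for a [text, image, text] sequence (my own simplified illustration based on the report's description, with hypothetical function and argument names; not the official Qwen2.5-VL implementation):

```python
import numpy as np

def mrope_position_ids(n_text_before: int, grid_h: int, grid_w: int, n_text_after: int) -> np.ndarray:
    """Toy (3, seq_len) position ids; rows are the (temporal, height, width) components."""
    ids = []
    # Text tokens: all three components share the same running index.
    for p in range(n_text_before):
        ids.append((p, p, p))
    # Image patches: temporal stays fixed, height/width follow the patch grid.
    t = n_text_before
    for h in range(grid_h):
        for w in range(grid_w):
            ids.append((t, t + h, t + w))
    # Text after the image: resume from one past the largest id used so far.
    nxt = max(max(triple) for triple in ids) + 1
    for p in range(n_text_after):
        ids.append((nxt + p, nxt + p, nxt + p))
    return np.array(ids).T

print(mrope_position_ids(2, 2, 3, 2))  # 2 text tokens, a 2x3 patch grid, 2 text tokens
```

Each component is then rotated with ordinary RoPE, so plain text behaves exactly as in 1D RoPE while image and video tokens get spatially and temporally meaningful positions.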
  • 1 reply
upvoted an article 13 days ago
Article

What is Qwen-Agent framework? Inside the Qwen family

By Kseniase and 1 other
8
published an article 13 days ago
Article

What is Qwen-Agent framework? Inside the Qwen family

By Kseniase and 1 other
8
upvoted an article 15 days ago
Article

🌁#92: Fight for Developers and the Year of Orchestration

By Kseniase
5
published an article 15 days ago
Article

🌁#92: Fight for Developers and the Year of Orchestration

By Kseniase
5
upvoted an article 16 days ago
Article

🦸🏻#14: What Is MCP, and Why Is Everyone – Suddenly!– Talking About It?

By Kseniase
112