Reinforcement Learning (RL) isn't stuck in the same old PPO loop: in the last two months alone, researchers have introduced a new wave of techniques that are reshaping how we train and fine-tune LLMs, VLMs, and agents.
Here are 9 fresh policy optimization techniques worth knowing:
1. GSPO: Group Sequence Policy Optimization → Group Sequence Policy Optimization (2507.18071) Shifts importance weighting, clipping, and optimization from the token level to the sequence level, capturing the full response and improving stability compared to GRPO. The GSPO-token variant still allows token-level fine-tuning. (A minimal sketch of the sequence-level objective follows the list.)
3. HBPO: Hierarchical Budget Policy Optimization → Hierarchical Budget Policy Optimization for Adaptive Reasoning (2507.15844) This one trains models to adapt reasoning depth to problem complexity. It divides training samples into subgroups with different token budgets and uses budget-aware rewards to align reasoning effort with task difficulty (see the reward sketch after the list).
5. RePO: Replay-Enhanced Policy Optimization → RePO: Replay-Enhanced Policy Optimization (2506.09340) Introduces a replay buffer into on-policy RL for LLMs, retrieving diverse off-policy samples for each prompt to broaden what the model learns from every prompt (a buffer sketch follows below).
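
A minimal sketch of a sequence-level clipped objective in the spirit of GSPO; the function name, tensor shapes, and clip_eps value are assumptions for illustration, not taken from the paper:

```python
import torch

def gspo_loss(logp_new, logp_old, advantages, mask, clip_eps=3e-4):
    # logp_new, logp_old: (batch, seq_len) per-token log-probs under the
    # current and old policies; mask: (batch, seq_len), 1 for response tokens;
    # advantages: (batch,) group-normalized sequence-level advantages.
    lengths = mask.sum(dim=-1).clamp(min=1.0)
    # Sequence-level importance ratio: length-normalized sum of token
    # log-ratios (a geometric mean of token ratios), computed in log space.
    log_ratio = ((logp_new - logp_old) * mask).sum(dim=-1) / lengths
    ratio = log_ratio.exp()
    # PPO-style clipping, applied once per sequence rather than per token.
    unclipped = ratio * advantages
    clipped = ratio.clamp(1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()
```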
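An illustrative budget-aware reward in the spirit of HBPO; the budget tiers, penalty form, and scale below are assumptions, not the paper's exact formulation:

```python
def budget_aware_reward(correct: bool, num_tokens: int, budget: int,
                        penalty_scale: float = 0.001) -> float:
    # Correct answers earn the base reward; responses that overrun their
    # subgroup's token budget are penalized in proportion to the overflow,
    # nudging the model to match reasoning length to problem difficulty.
    base = 1.0 if correct else 0.0
    overflow = max(0, num_tokens - budget)
    return base - penalty_scale * overflow

# Hypothetical subgroup budgets: each training sample is assigned to one
# budget tier, so the policy sees both short- and long-budget incentives.
budgets = [512, 1024, 2048, 4096]
```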
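And a minimal per-prompt replay buffer in the spirit of RePO; the class name, capacity, and uniform-sampling retrieval are placeholders rather than the paper's actual strategy:

```python
import random
from collections import defaultdict

class PromptReplayBuffer:
    def __init__(self, max_per_prompt: int = 64):
        self.max_per_prompt = max_per_prompt
        self.store = defaultdict(list)  # prompt_id -> past rollouts

    def add(self, prompt_id: str, rollout: dict) -> None:
        entries = self.store[prompt_id]
        entries.append(rollout)
        if len(entries) > self.max_per_prompt:
            entries.pop(0)  # evict the oldest off-policy sample

    def sample(self, prompt_id: str, k: int) -> list:
        # Mix these replayed rollouts with fresh on-policy ones for the same
        # prompt before computing group-relative advantages.
        entries = self.store[prompt_id]
        return random.sample(entries, min(k, len(entries)))
```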