P.M.SALMAN KHAN (salmankhanpm)
3 followers · 91 following
https://salmankhanpm.co
salmankhanpm154 · SALMANKHANPM · salmankhanpm786
AI & ML interests: NLP, LLM, AI Safety
Recent Activity
Reacted to Kseniase's post with 🔥, 🤗, and 👍 (about 15 hours ago):

10 Latest Preference Optimization Techniques
Models need feedback on what makes outputs “good” or “bad.” Policy optimization (PO) turns preferences and rewards into actual training signals. The field is evolving quickly, moving far beyond classics like PPO and GRPO. Here is our overview of 10 of the newest PO methods:
1. Pref-GRPO → https://huggingface.co/papers/2508.20751
   Stabilizes text-to-image reinforcement learning (RL) with pairwise preference rewards and a unified UNIGENBENCH benchmark.
2. PVPO (Policy with Value Preference Optimization) → https://huggingface.co/papers/2508.21104
   A critic-free RL method that uses a pre-trained model as a reference anchor to reduce bias and guide learning, selecting high-value examples through data pre-sampling.
3. DCPO (Dynamic Clipping Policy Optimization) → https://huggingface.co/papers/2509.02333
   Uses dynamic clipping, which adjusts probability limits per token for better token exploration, and smooth reward standardization to balance rewards across training steps and prevent wasted updates.
4. ARPO (Agentic Reinforced Policy Optimization) → https://huggingface.co/papers/2507.19849
   Optimizes multi-turn LLM agents that use external tools. It uses an entropy-based adaptive rollout to explore after tool use and an advantage attribution method to better assign credit across steps, leading to more efficient tool use with fewer resources.
5. GRPO-RoC (Group Relative Policy Optimization with Resampling-on-Correct) → https://huggingface.co/papers/2508.20722
   Oversamples rollouts, then resamples them to keep diverse mistakes and only the highest-quality correct answers. This reduces noise and yields stronger reasoning in a code environment.
Read further below ⬇️
If you like this, also subscribe to the Turing Post: https://www.turingpost.com/subscribe
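Several of the techniques in the post (Pref-GRPO, GRPO-RoC) build on GRPO's central trick: scoring each sampled completion against the other completions for the same prompt rather than against a learned value critic. The snippet below is a minimal sketch of that group-relative advantage step only, assuming scalar rewards are already available; the function name and the toy reward values are illustrative and are not taken from any of the linked papers.

```python
# Minimal sketch of a group-relative advantage, the building block shared by
# GRPO-style methods. Rewards could come from a preference model, a verifier,
# or pairwise comparisons; here they are just toy numbers.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size), one scalar per sampled completion.

    Returns advantages of the same shape: each reward minus its group's mean,
    divided by the group's standard deviation.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

if __name__ == "__main__":
    # 2 prompts, 4 sampled completions each.
    rewards = torch.tensor([[0.1, 0.9, 0.4, 0.4],
                            [1.0, 1.0, 0.0, 0.5]])
    print(group_relative_advantages(rewards))
```

Completions that beat their group's average get positive advantages and are reinforced by the policy update; the rest are pushed down, which is what lets these methods drop the critic network entirely.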
salmankhanpm's models (45)
Sorted by: Recently updated
salmankhanpm/whisper-te-2e-u • Automatic Speech Recognition • Updated Jul 19 • 10
salmankhanpm/whisper-te-lora-2e • Updated Jul 19
salmankhanpm/whisper-te-16-2e • Automatic Speech Recognition • 0.0B • Updated Jul 19 • 13
salmankhanpm/whisper-te-2e • Updated Jul 19
salmankhanpm/w-lora_model-de-l • Updated Jul 18
salmankhanpm/w-lora_model-de-16 • Automatic Speech Recognition • 0.0B • Updated Jul 18 • 11
salmankhanpm/w-lora_model-de • Updated Jul 18
salmankhanpm/gemma-3-4b-it-Q4_K_M-GGUF • Image-Text-to-Text • 4B • Updated Jul 16 • 4
salmankhanpm/gemma-3n-E4B-it-Q4_K_M-GGUF • Image-Text-to-Text • 7B • Updated Jul 7 • 37
salmankhanpm/sarvam-1-2b-Instruct • Text Generation • 3B • Updated Jun 26 • 15
salmankhanpm/sarvam-1-instruct-dpo-test1 • Text Generation • 3B • Updated Jun 12 • 10
salmankhanpm/g3-un • Updated Jun 3 • 1
salmankhanpm/sarvam-m-mlx-4Bit • Text Generation • 4B • Updated May 25 • 16
salmankhanpm/sarvam-1-mlx-4Bit • Text Generation • 0.4B • Updated May 25 • 6
salmankhanpm/gemma-3-4b-it-ft • Image-Text-to-Text • 4B • Updated Apr 25 • 5
salmankhanpm/lora_gemma-3-4b-bt • 0.0B • Updated Apr 25
salmankhanpm/lora_gemma-3-4b-inference-test-finetune • Text Generation • Updated Apr 14 • 8
salmankhanpm/lora_gemma-3-4b-inference-test-v1 • Updated Apr 14
salmankhanpm/lora_gemma-3-4b-inference-test • Updated Apr 11
salmankhanpm/gemma-3-12b-finetune-google-repo • Text Generation • Updated Apr 8 • 6
salmankhanpm/gemma-3-12b-it-google-repo • Updated Apr 5
salmankhanpm/gemma-3-12b-finetune-google-repo-Q3_K_M-GGUF • 12B • Updated Apr 5 • 2
salmankhanpm/gemma-3-12b-finetune-google-repo-Q4_K_M-GGUF • 12B • Updated Apr 5 • 3
salmankhanpm/gemma-3-12b-finetune-google-repo-Q8_0-GGUF • 12B • Updated Apr 5
salmankhanpm/gemma-3-4b-it-google-repo • Updated Apr 5
salmankhanpm/gemma-3-4b-finetune-google-repo-Q8_0-GGUF • 4B • Updated Apr 5 • 2
salmankhanpm/gemma-3-4b-finetune-google-repo-Q4_K_M-GGUF • 4B • Updated Apr 5
salmankhanpm/gemma-3-4b-finetune-google-repo • Text Generation • Updated Apr 5 • 7
salmankhanpm/gemma-3-finetune-Q4_K_M-GGUF • 4B • Updated Apr 4
salmankhanpm/gemma-3-finetune • Text Generation • Updated Apr 4 • 7