P.M.SALMAN KHAN (salmankhanpm)
3 followers · 91 following
https://salmankhanpm.co
salmankhanpm154 · SALMANKHANPM · salmankhanpm786
AI & ML interests: NLP, LLM, AI Safety
Recent Activity
Reacted to Kseniase's post with 🔥, 🤗, and 👍 (about 15 hours ago):

10 Latest Preference Optimization Techniques
Models need feedback on what makes outputs “good” or “bad.” Policy optimization (PO) turns preferences and rewards into actual training signals. The field is evolving quickly, moving far beyond classics like PPO and GRPO. Here is our overview of 10 of the newest PO methods:
1. Pref-GRPO → https://huggingface.co/papers/2508.20751
   Stabilizes text-to-image reinforcement learning (RL) with pairwise preference rewards and a unified UNIGENBENCH benchmark.
2. PVPO (Policy with Value Preference Optimization) → https://huggingface.co/papers/2508.21104
   A critic-free RL method that uses a pre-trained model as a reference anchor to reduce bias and guide learning, selecting high-value examples through data pre-sampling.
3. DCPO (Dynamic Clipping Policy Optimization) → https://huggingface.co/papers/2509.02333
   Uses dynamic clipping, which adjusts probability limits per token for better token exploration, and smooth reward standardization to balance rewards across training steps and prevent wasted updates.
4. ARPO (Agentic Reinforced Policy Optimization) → https://huggingface.co/papers/2507.19849
   Optimizes multi-turn LLM agents that use external tools. It uses an entropy-based adaptive rollout to explore after tool use and an advantage attribution method to better assign credit across steps, leading to more efficient tool use with fewer resources.
5. GRPO-RoC (Group Relative Policy Optimization with Resampling-on-Correct) → https://huggingface.co/papers/2508.20722
   Oversamples rollouts, then resamples them to keep diverse mistakes and only the highest-quality correct answers. This reduces noise and yields stronger reasoning in a code environment.
Read further below ⬇️
If you like this, also subscribe to the Turing Post: https://www.turingpost.com/subscribe
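Several of the techniques in the post (Pref-GRPO, GRPO-RoC) build on GRPO's central trick: scoring each sampled completion against the other completions for the same prompt rather than against a learned value critic. The snippet below is a minimal sketch of that group-relative advantage step only, assuming scalar rewards are already available; the function name and the toy reward values are illustrative and are not taken from any of the linked papers.

```python
# Minimal sketch of a group-relative advantage, the building block shared by
# GRPO-style methods. Rewards could come from a preference model, a verifier,
# or pairwise comparisons; here they are just toy numbers.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size), one scalar per sampled completion.

    Returns advantages of the same shape: each reward minus its group's mean,
    divided by the group's standard deviation.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

if __name__ == "__main__":
    # 2 prompts, 4 sampled completions each.
    rewards = torch.tensor([[0.1, 0.9, 0.4, 0.4],
                            [1.0, 1.0, 0.0, 0.5]])
    print(group_relative_advantages(rewards))
```

Completions that beat their group's average get positive advantages and are reinforced by the policy update; the rest are pushed down, which is what lets these methods drop the critic network entirely.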
salmankhanpm's models (45)
Sorted by: Recently updated
salmankhanpm/whisper-te-2e-u • Automatic Speech Recognition • Updated Jul 19 • 10
salmankhanpm/whisper-te-lora-2e • Updated Jul 19
salmankhanpm/whisper-te-16-2e • Automatic Speech Recognition • 0.0B • Updated Jul 19 • 13
salmankhanpm/whisper-te-2e • Updated Jul 19
salmankhanpm/w-lora_model-de-l • Updated Jul 18
salmankhanpm/w-lora_model-de-16 • Automatic Speech Recognition • 0.0B • Updated Jul 18 • 11
salmankhanpm/w-lora_model-de • Updated Jul 18
salmankhanpm/gemma-3-4b-it-Q4_K_M-GGUF • Image-Text-to-Text • 4B • Updated Jul 16 • 4
salmankhanpm/gemma-3n-E4B-it-Q4_K_M-GGUF • Image-Text-to-Text • 7B • Updated Jul 7 • 37
salmankhanpm/sarvam-1-2b-Instruct • Text Generation • 3B • Updated Jun 26 • 15
salmankhanpm/sarvam-1-instruct-dpo-test1 • Text Generation • 3B • Updated Jun 12 • 10
salmankhanpm/g3-un • Updated Jun 3 • 1
salmankhanpm/sarvam-m-mlx-4Bit • Text Generation • 4B • Updated May 25 • 16
salmankhanpm/sarvam-1-mlx-4Bit • Text Generation • 0.4B • Updated May 25 • 6
salmankhanpm/gemma-3-4b-it-ft • Image-Text-to-Text • 4B • Updated Apr 25 • 5
salmankhanpm/lora_gemma-3-4b-bt • 0.0B • Updated Apr 25
salmankhanpm/lora_gemma-3-4b-inference-test-finetune • Text Generation • Updated Apr 14 • 8
salmankhanpm/lora_gemma-3-4b-inference-test-v1 • Updated Apr 14
salmankhanpm/lora_gemma-3-4b-inference-test • Updated Apr 11
salmankhanpm/gemma-3-12b-finetune-google-repo • Text Generation • Updated Apr 8 • 6
salmankhanpm/gemma-3-12b-it-google-repo • Updated Apr 5
salmankhanpm/gemma-3-12b-finetune-google-repo-Q3_K_M-GGUF • 12B • Updated Apr 5 • 2
salmankhanpm/gemma-3-12b-finetune-google-repo-Q4_K_M-GGUF • 12B • Updated Apr 5 • 3
salmankhanpm/gemma-3-12b-finetune-google-repo-Q8_0-GGUF • 12B • Updated Apr 5
salmankhanpm/gemma-3-4b-it-google-repo • Updated Apr 5
salmankhanpm/gemma-3-4b-finetune-google-repo-Q8_0-GGUF • 4B • Updated Apr 5 • 2
salmankhanpm/gemma-3-4b-finetune-google-repo-Q4_K_M-GGUF • 4B • Updated Apr 5
salmankhanpm/gemma-3-4b-finetune-google-repo • Text Generation • Updated Apr 5 • 7
salmankhanpm/gemma-3-finetune-Q4_K_M-GGUF • 4B • Updated Apr 4
salmankhanpm/gemma-3-finetune • Text Generation • Updated Apr 4 • 7