Patrick

patf82

AI & ML interests

None yet

Organizations

None yet

patf82's activity

reacted to m-ric's post with ❤️ 7 months ago
๐๐ž๐ฐ ๐๐ž๐œ๐จ๐๐ข๐ง๐  ๐ญ๐ž๐œ๐ก๐ง๐ข๐ช๐ฎ๐ž ๐ข๐ง ๐ญ๐ซ๐š๐ง๐ฌ๐Ÿ๐จ๐ซ๐ฆ๐ž๐ซ๐ฌ ๐ฌ๐ข๐ ๐ง๐ข๐Ÿ๐ข๐œ๐š๐ง๐ญ๐ฅ๐ฒ ๐ซ๐ž๐๐ฎ๐œ๐ž๐ฌ ๐ก๐š๐ฅ๐ฅ๐ฎ๐œ๐ข๐ง๐š๐ญ๐ข๐จ๐ง๐ฌ ๐Ÿ‘

DoLa decoding, published as a conference paper at ICLR '24, has just been merged into Transformers by @joaogante and Yung-Sung Chuang.
This new decoding method is simple yet extremely impressive!

Reminder: decoder LLMs (the GPT kind of LLM, the most common one) generate their outputs one token at a time: at each step, given the current text, they compute a logit for each token in their vocabulary, representing the probability of that token coming next.

Then they either pick the token with the highest logit (greedy decoding) or sample one according to the probabilities defined by the logits (sampling).
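
As a concrete illustration, here is a minimal sketch of a single decoding step with the Transformers library, contrasting greedy decoding with sampling (the model name and prompt are just placeholders):

```python
# A minimal sketch of one decoding step, assuming a standard Hugging Face
# causal LM ("gpt2" is used purely as a small example model).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[:, -1, :]  # logits for the next token only

# Greedy decoding: take the single highest-logit token.
greedy_token = logits.argmax(dim=-1)

# Sampling: draw a token from the softmax distribution over the logits.
probs = torch.softmax(logits, dim=-1)
sampled_token = torch.multinomial(probs, num_samples=1)[0]

print(tokenizer.decode(greedy_token), "|", tokenizer.decode(sampled_token))
```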

The authors of DoLa wanted to improve that simple method.

They started from the established fact that transformer LMs encode low-level information (like basic syntax) in their early layers and higher-level information, such as factual knowledge, in their later layers.

💡 This gave them their key idea: during decoding, rather than picking the token with the highest logit, why not pick the token with the most impressive increase in logit across layers?
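
A heavily simplified sketch of that layer-contrast intuition is below. It is not the full DoLa algorithm (which selects the contrasting "premature" layer dynamically and restricts the candidate token set); the layer index and model are only illustrative assumptions.

```python
# Simplified layer-contrast sketch (NOT the full DoLa algorithm):
# score tokens by how much their log-probability grows between an early
# layer and the final layer, and pick the biggest riser.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)
    lm_head = model.get_output_embeddings()
    # Read token scores off an early layer (a rough "logit lens" view)
    # and off the final layer, at the last position only.
    early_logits = lm_head(out.hidden_states[2][:, -1, :])
    final_logits = lm_head(out.hidden_states[-1][:, -1, :])

# Contrast: prefer the token whose log-probability increased the most across depth.
contrast = torch.log_softmax(final_logits, dim=-1) - torch.log_softmax(early_logits, dim=-1)
print(tokenizer.decode(contrast.argmax(dim=-1)))
```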

This gives impressive results:
๐Ÿš€ ๐Ÿฑ% - ๐Ÿฎ๐Ÿฌ% ๐—ฏ๐—ฎ๐˜€๐—ฒ ๐—ฝ๐—ผ๐—ถ๐—ป๐˜๐˜€ ๐—ถ๐—ป๐—ฐ๐—ฟ๐—ฒ๐—ฎ๐˜€๐—ฒ ๐—ฎ๐—ฐ๐—ฟ๐—ผ๐˜€๐˜€ ๐˜๐—ต๐—ฒ ๐—ฏ๐—ฒ๐—ป๐—ฐ๐—ต๐—บ๐—ฎ๐—ฟ๐—ธ๐˜€
๐Ÿš€ For instance on TruthfulQA / Open-ended, across all model sizes the increase in truthfulness is 14 base points, which is ๐—ฎ๐—ฟ๐—ผ๐˜‚๐—ป๐—ฑ ๐Ÿฐ๐Ÿฌ% ๐—ถ๐—บ๐—ฝ๐—ฟ๐—ผ๐˜ƒ๐—ฒ๐—บ๐—ฒ๐—ป๐˜ ๐—ฐ๐—ผ๐—บ๐—ฝ๐—ฎ๐—ฟ๐—ฒ๐—ฑ ๐˜๐—ผ ๐˜€๐˜๐—ฎ๐—ป๐—ฑ๐—ฎ๐—ฟ๐—ฑ ๐—ฑ๐—ฒ๐—ฐ๐—ผ๐—ฑ๐—ถ๐—ป๐—ด!

🤔 Wouldn't decoding take longer because of this added contrasting step? 👉 The runtime increase is negligible, 1 to 8% only.
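
For anyone who wants to try it, here is a sketch of how the merged support can be called through generate, assuming a Transformers version recent enough to expose the dola_layers argument (the model and prompt are placeholders):

```python
# Sketch of DoLa decoding via generate(), assuming a recent transformers
# release that exposes the `dola_layers` generation argument.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")

out = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=False,
    dola_layers="high",      # contrast the final layer against the upper ("high") layers
    repetition_penalty=1.2,  # a mild repetition penalty is commonly suggested with DoLa
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```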

Paper added to my collection 👉 m-ric/optimization-mechanics-661d543a5fc6ca1dc84284a0
  • 2 replies
New activity in mradermacher/model_requests 10 months ago

Medical LLama-3-70B models

#36 opened 10 months ago by patf82