Ksenia Se

Kseniase

AI & ML interests

None yet

Recent Activity

replied to their post 3 days ago
posted an update 3 days ago

Organizations

Turing Post, Journalists on Hugging Face, Social Post Explorers, Hugging Face Discord Community, Sandbox

Kseniase's activity

replied to their post 3 days ago
posted an update 3 days ago
Post
1638
9 Multimodal Chain-of-Thought methods

How can Chain-of-Thought (CoT) prompting unlock models' full potential across images, video, audio, and more? Specialized multimodal CoT techniques are the answer.

Here are 9 Multimodal Chain-of-Thought (MCoT) methods. Most of them are open-source:

1. KAM-CoT -> KAM-CoT: Knowledge Augmented Multimodal Chain-of-Thoughts Reasoning (2401.12863)
This lightweight framework combines CoT prompting with knowledge graphs (KGs) and achieves 93.87% accuracy on the ScienceQA benchmark

2. Multimodal Visualization-of-Thought (MVoT) -> Imagine while Reasoning in Space: Multimodal Visualization-of-Thought (2501.07542)
Lets models generate visual reasoning traces, using a token discrepancy loss to improve visual quality

3. Compositional CoT (CCoT) -> Compositional Chain-of-Thought Prompting for Large Multimodal Models (2311.17076)
Uses scene graph (SG) representations generated by the LMM itself to improve performance on compositional and general multimodal benchmarks

4. URSA -> URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics (2501.04686)
Brings System 2-style thinking to multimodal math reasoning, using a 3-module CoT data synthesis process with CoT distillation, trajectory-format rewriting and format unification

5. MM-Verify -> MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification (2502.13383)
Introduces a verification mechanism with two components, MM-Verifier and MM-Reasoner, built on synthesized high-quality CoT data for multimodal reasoning

6. Duty-Distinct CoT (DDCoT) -> DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models (2310.16436)
Divides reasoning responsibilities between language models and visual models, integrating visual recognition capabilities into the joint reasoning process

7. Multimodal-CoT from Amazon Web Services -> Multimodal Chain-of-Thought Reasoning in Language Models (2302.00923)
A two-stage framework that separates rationale generation from answer prediction, allowing the model to reason more effectively over multimodal inputs; a minimal sketch of this two-stage structure follows the list

8. Graph-of-Thought (GoT) -> Beyond Chain-of-Thought, Effective Graph-of-Thought Reasoning in Large Language Models (2305.16582)
This two-stage framework models reasoning as a graph of interconnected ideas, improving performance on text-only and multimodal tasks
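For intuition, here is a minimal, model-agnostic sketch of the rationale-then-answer pattern from point 7 (my own illustration, not code from any of the papers above; `vlm_generate` and `answer_with_mcot` are hypothetical names, and the stub must be replaced with a real vision-language model or API):

```python
# Minimal sketch of two-stage multimodal CoT: generate a rationale first,
# then predict the answer conditioned on that rationale.
# `vlm_generate` is a hypothetical stand-in for any vision-language model call.

def vlm_generate(prompt: str, image_path: str) -> str:
    """Hypothetical placeholder: replace with a real VLM or API call."""
    return f"[model output for {image_path} | prompt: {prompt[:40]}...]"

def answer_with_mcot(question: str, image_path: str) -> str:
    # Stage 1: ask only for the reasoning chain, grounded in the image.
    rationale = vlm_generate(
        f"Question: {question}\n"
        "Look at the image, describe the relevant visual evidence, and reason step by step. "
        "Do not give the final answer yet.",
        image_path,
    )
    # Stage 2: condition the final answer on the question, the image, and the rationale.
    answer = vlm_generate(
        f"Question: {question}\nRationale: {rationale}\n"
        "Using the rationale above, give the final answer only.",
        image_path,
    )
    return answer

print(answer_with_mcot("How many birds are on the wire?", "birds.jpg"))
```

Keeping the two calls separate is what lets the rationale act as an explicit intermediate step instead of being entangled with answer decoding.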

More in the comments👇
  • 1 reply
reacted to their post with 🚀❤️👀 8 days ago
replied to their post 10 days ago
posted an update 10 days ago
Post
4964
8 types of RoPE

Since we use Transformers all the time, it's helpful to understand RoPE (Rotary Position Embedding). Token order matters, and RoPE encodes it by rotating token embeddings according to their positions, so the model knows which token comes first, second, and so on.
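As a quick illustration of that rotation, here is a minimal NumPy sketch (my own simplification, not code from any of the papers below; real implementations differ in how they pair dimensions and typically apply this to per-head queries and keys):

```python
import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate token embeddings by position. x has shape (seq_len, dim), dim even."""
    seq_len, dim = x.shape
    half = dim // 2
    # One frequency per dimension pair, as in the RoFormer formulation.
    inv_freq = base ** (-np.arange(half) / half)
    angles = np.outer(np.arange(seq_len), inv_freq)  # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # 2D rotation of each (x1, x2) pair by its position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

q = rope(np.random.randn(16, 64))  # apply to queries (and keys) before attention
```

Because both queries and keys get rotated this way, their dot product depends on the distance between positions rather than on absolute positions, which is how self-attention picks up relative positional information.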

Here are 8 types of RoPE that can be implemented in different cases:

1. Original RoPE -> RoFormer: Enhanced Transformer with Rotary Position Embedding (2104.09864)
Encodes token positions by rotating token embeddings in the complex plane via a position-based rotation matrix, thereby providing the self-attention mechanism with relative positional info.

2. LongRoPE -> LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens (2402.13753)
Extends the context window of pre-trained LLMs to 2048k tokens, leveraging non-uniformities in positional interpolation with an efficient search.

3. LongRoPE2 -> LongRoPE2: Near-Lossless LLM Context Window Scaling (2502.20082)
Extends the effective context window of pre-trained LLMs to the target length, rescaling RoPE guided by “needle-driven” perplexity evaluation.

4. Multimodal RoPE (MRoPE) -> Qwen2.5-VL Technical Report (2502.13923)
Decomposes positional embedding into 3 components (temporal, height, and width), so that positional features are aligned across modalities: text, images, and videos. A toy layout of these position ids is sketched after the list.

5. Directional RoPE (DRoPE) -> DRoPE: Directional Rotary Position Embedding for Efficient Agent Interaction Modeling (2503.15029)
Adds an identity scalar, improving how angles are handled without extra complexity. It helps balance accuracy, speed, and memory usage.

6. VideoRoPE -> VideoRoPE: What Makes for Good Video Rotary Position Embedding? (2502.05173)
Adapts RoPE for video, featuring 3D structure, low-frequency temporal allocation, diagonal layout, and adjustable spacing.

7. VRoPE -> VRoPE: Rotary Position Embedding for Video Large Language Models (2502.11664)
Another RoPE variant for video, which restructures positional indices and balances encoding for uniform spatial focus.

8. XPos (Extrapolatable Position Embedding) -> https://huggingface.co/papers/2212.10
Introduces an exponential decay factor into the rotation matrix, improving stability on long sequences.
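To make point 4 more concrete, here is a toy sketch of how the three position-id components might be laid out for a [text, image, text] sequence (my own simplified illustration based on the report's description, with hypothetical function and argument names; not the official Qwen2.5-VL implementation):

```python
import numpy as np

def mrope_position_ids(n_text_before: int, grid_h: int, grid_w: int, n_text_after: int) -> np.ndarray:
    """Toy (3, seq_len) position ids; rows are the (temporal, height, width) components."""
    ids = []
    # Text tokens: all three components share the same running index.
    for p in range(n_text_before):
        ids.append((p, p, p))
    # Image patches: temporal stays fixed, height/width follow the patch grid.
    t = n_text_before
    for h in range(grid_h):
        for w in range(grid_w):
            ids.append((t, t + h, t + w))
    # Text after the image: resume from one past the largest id used so far.
    nxt = max(max(triple) for triple in ids) + 1
    for p in range(n_text_after):
        ids.append((nxt + p, nxt + p, nxt + p))
    return np.array(ids).T

print(mrope_position_ids(2, 2, 3, 2))  # 2 text tokens, a 2x3 patch grid, 2 text tokens
```

Each component is then rotated with ordinary RoPE, so plain text behaves exactly as in 1D RoPE while image and video tokens get spatially and temporally meaningful positions.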
  • 1 reply
upvoted an article 13 days ago
Article

What is Qwen-Agent framework? Inside the Qwen family

By Kseniase and 1 other
8
published an article 13 days ago
Article

What is Qwen-Agent framework? Inside the Qwen family

By Kseniase and 1 other
8
upvoted an article 15 days ago
Article

🌁#92: Fight for Developers and the Year of Orchestration

By Kseniase
5
published an article 15 days ago
Article

🌁#92: Fight for Developers and the Year of Orchestration

By Kseniase
5
upvoted an article 16 days ago
Article

🦸🏻#14: What Is MCP, and Why Is Everyone – Suddenly!– Talking About It?

By Kseniase
112