
Mariusz Kurman PRO

mkurman

AI & ML interests

AI Tech Lead | MD

Recent Activity

reacted to singhsidhukuldeep's post with 🚀 about 4 hours ago
liked a model about 7 hours ago
deepseek-ai/Janus-Pro-7B

Organizations

MedIT Solutions · BigScience Biomedical Datasets · SOWA Project

mkurman's activity

reacted to singhsidhukuldeep's post with 🚀 about 4 hours ago
While everyone is buzzing about DeepSeek AI R1's groundbreaking open-source release, ByteDance has quietly launched something remarkable: Trae, an adaptive AI IDE that's redefining the development experience. Unlike competitors such as Cursor, it's completely FREE!

Trae is a sophisticated development environment built on Microsoft's VSCode foundation (with a nice skin on top), offering unlimited free access to both OpenAI's GPT-4o and Anthropic's Claude-3.5-Sonnet models.

Technical Highlights:
- Real-time AI pair programming with comprehensive codebase understanding
- Natural language commands for code generation and project-level development
- Intelligent task decomposition for automated planning and execution
- Seamless VS Code and Cursor configuration compatibility
- Multi-language support with specialized optimization for English and Chinese interfaces

Currently available for macOS (Windows version in development), Trae is distributed through ByteDance's Singapore subsidiary, Spring (SG) Pte. What sets it apart is its ability to handle mixed-language workflows and enhanced localization features that address common pain points in existing IDEs.

The AI assistant can generate code snippets, optimize logic, and even create entire projects from scratch through natural language prompts. It also features an innovative AI Chat system accessible via keyboard shortcuts for real-time coding assistance.

For developers looking to enhance their productivity without breaking the bank, Trae offers enterprise-grade AI capabilities completely free during its initial release. This move by ByteDance signals a significant shift in the AI IDE landscape, challenging established players with a robust, accessible alternative.

Try it at trae.ai
reacted to sagar007's post with ❤️ about 12 hours ago
🚀 Just built a Perplexity-inspired AI search assistant using Gradio, DeepSeek, and DuckDuckGo!
Ask it anything, and it'll:

Scour the web for answers 📚

Cite sources like a pro 🔗

Even talk back with TTS (thanks, Kokoro!) 🎙️

Check it out → sagar007/DeepSeekR1_Search
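
The core retrieve-then-answer loop behind this kind of assistant is compact. Here is a minimal sketch assuming the `duckduckgo_search` package; `ask_llm` is a hypothetical stand-in for whatever chat-completion client you use, not part of the Space:

```python
# Sketch of a search assistant's retrieve-then-answer loop.
# Assumes the `duckduckgo_search` package; `ask_llm` is a hypothetical
# stand-in for a chat-completion client.
from duckduckgo_search import DDGS

def search_and_answer(question: str, ask_llm, max_results: int = 5) -> str:
    # 1. Scour the web for candidate sources.
    with DDGS() as ddgs:
        hits = list(ddgs.text(question, max_results=max_results))

    # 2. Build a numbered context block so the model can cite sources as [n].
    context = "\n".join(
        f"[{i + 1}] {h['title']}: {h['body']} ({h['href']})"
        for i, h in enumerate(hits)
    )

    # 3. Ask the model to answer using only the retrieved context.
    prompt = (
        "Answer the question using the sources below and cite them as [n].\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    return ask_llm(prompt)
```
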
posted an update 1 day ago
I've simplified things for the AI OS community!

Check out Qwen-2.5-14B-DeepSeek-R1-1M! This one's a cool blend of the latest Qwen 2.5, with 14 billion parameters and a massive 1-million-token context window. It also comes with the DeepSeek R1 version of the Qwen 2.5 14B base model.

Enjoy! 🚀

mkurman/Qwen2.5-14B-DeepSeek-R1-1M
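
A minimal way to try the model with the standard transformers API (the prompt and generation settings below are just illustrative defaults, not the author's recommendation):

```python
# Load and chat with the merged model via transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mkurman/Qwen2.5-14B-DeepSeek-R1-1M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto"
)

messages = [{"role": "user", "content": "Explain KV caching in one paragraph."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
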
reacted to kadirnar's post with 🔥 8 days ago
I created my own AI image and video from scratch using the fal.ai platform 💫

Workflow: Flux LoRA Training + Upscale + Kling AI (1.6)
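
A loose sketch of chaining fal.ai endpoints into this kind of workflow, using the `fal-client` package. The endpoint IDs, argument names, and result shapes below are illustrative assumptions, not the exact ones used in the post:

```python
# Illustrative fal.ai pipeline: Flux LoRA image -> upscale -> image-to-video.
# Endpoint IDs and payload shapes are assumptions; check fal.ai docs.
import fal_client

# 1. Generate an image with a Flux LoRA (trained separately on fal.ai).
image = fal_client.subscribe(
    "fal-ai/flux-lora",
    arguments={
        "prompt": "portrait photo, studio lighting",
        "loras": [{"path": "<your-lora-url>"}],  # placeholder
    },
)

# 2. Upscale the result.
upscaled = fal_client.subscribe(
    "fal-ai/esrgan",
    arguments={"image_url": image["images"][0]["url"]},
)

# 3. Animate it with Kling 1.6 image-to-video.
video = fal_client.subscribe(
    "fal-ai/kling-video/v1.6/standard/image-to-video",
    arguments={"image_url": upscaled["image"]["url"], "prompt": "subtle camera pan"},
)
```
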
replied to their post 10 days ago
posted an update 10 days ago
reacted to Jaward's post with 🚀🔥 17 days ago
reacted to prithivMLmods's post with 🚀🔥 22 days ago
Reasoning SmolLM2 🚀

🎯 Fine-tuning SmolLM2 on a lightweight synthetic reasoning dataset for reasoning-specific tasks. Future updates will focus on lightweight, blazing-fast reasoning models. Until then, check out the blog for fine-tuning details.

🔥 Blog : https://huggingface.co/blog/prithivMLmods/smollm2-ft

🔼 Models :
+ SmolLM2-CoT-360M : prithivMLmods/SmolLM2-CoT-360M
+ Reasoning-SmolLM2-135M : prithivMLmods/Reasoning-SmolLM2-135M
+ SmolLM2-CoT-360M-GGUF : prithivMLmods/SmolLM2-CoT-360M-GGUF

🤠 Other Details :
+ Demo : prithivMLmods/SmolLM2-CoT-360M
+ Fine-tuning notebook : prithivMLmods/SmolLM2-CoT-360M
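
A bare-bones sketch of the kind of SFT run described above, using TRL's SFTTrainer. The dataset name and hyperparameters here are placeholders, not the author's recipe; see the linked blog for the actual details:

```python
# Minimal supervised fine-tuning sketch with TRL.
# Dataset name and hyperparameters are placeholders.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("your-org/synthetic-reasoning-dataset", split="train")  # placeholder

trainer = SFTTrainer(
    model="HuggingFaceTB/SmolLM2-360M-Instruct",
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="smollm2-cot-360m",
        max_seq_length=2048,
        per_device_train_batch_size=4,
        num_train_epochs=1,
    ),
)
trainer.train()
```
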




reacted to openfree's post with 🔥 22 days ago
# 🧬 Protein Genesis AI: Design Proteins with Just a Prompt

## 🤔 Current Challenges in Protein Design

Traditional protein design faces critical barriers:
- 💰 High costs ($1M - $10M+) & long development cycles (2-3 years)
- 🔬 Complex equipment and expert knowledge required
- 📉 Low success rates (<10%)
- ⏰ Time-consuming experimental validation

## ✨ Our Solution: Protein Genesis AI

Transform protein design through simple natural language input:
"Design a protein that targets cancer cells"
"Create an enzyme that breaks down plastic"


### Key Features
- 🤖 AI-powered automated design
- 📊 Real-time analysis & optimization
- 🔬 Instant 3D visualization
- 💾 Immediate PDB file generation

## 🎯 Applications

### Medical & Industrial
- 🏥 Drug development
- 💉 Antibody design
- 🏭 Industrial enzymes
- ♻️ Environmental solutions

### Research & Education
- 🔬 Basic research
- 📚 Educational tools
- 🧫 Experimental design
- 📈 Data analysis

## 💫 Key Advantages

- 👨‍💻 No coding or technical expertise needed
- ⚡ Results in minutes (vs. years)
- 💰 90% cost reduction
- 🌍 Accessible anywhere

## 🎓 Who Needs This?
- 🏢 Biotech companies
- 🏥 Pharmaceutical research
- 🎓 Academic institutions
- 🧪 Research laboratories

## 🌟 Why It Matters
Protein Genesis AI democratizes protein design by transforming complex processes into simple text prompts. This breakthrough accelerates scientific discovery, potentially leading to faster drug development and innovative biotechnology solutions. The future of protein design starts with a simple prompt! 🚀

openfree/ProteinGenesis
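
If you'd rather script the Space than use its web UI, it can usually be driven through gradio_client. The `api_name` and argument below are hypothetical; inspect the Space's real signature first:

```python
# Calling a Hugging Face Space programmatically via gradio_client.
# The api_name and argument here are hypothetical; run view_api() to check.
from gradio_client import Client

client = Client("openfree/ProteinGenesis")
print(client.view_api())  # discover the real endpoint names and parameters

result = client.predict(
    "Design a protein that targets cancer cells",
    api_name="/predict",  # hypothetical endpoint name
)
print(result)  # e.g. analysis text and/or a generated PDB file path
```
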
reacted to singhsidhukuldeep's post with 👀 22 days ago
Exciting breakthrough in e-commerce recommendation systems!
Walmart Global Tech researchers have developed a novel Triple Modality Fusion (TMF) framework that revolutionizes how we make product recommendations.

>> Key Innovation
The framework ingeniously combines three distinct data types:
- Visual data to capture product aesthetics and context
- Textual information for detailed product features
- Graph data to understand complex user-item relationships

>> Technical Architecture
The system leverages a Large Language Model (Llama2-7B) as its backbone and introduces several sophisticated components:

Modality Fusion Module
- All-Modality Self-Attention (AMSA) for unified representation
- Cross-Modality Attention (CMA) mechanism for deep feature integration
- Custom FFN adapters to align different modality embeddings

Advanced Training Strategy
- Curriculum learning approach with three complexity levels
- Parameter-Efficient Fine-Tuning using LoRA
- Special token system for behavior and item representation

>> Real-World Impact
The results are remarkable:
- 38.25% improvement in Electronics recommendations
- 43.09% boost in Sports category accuracy
- Significantly higher human evaluation scores compared to traditional methods

Currently deployed in Walmart's production environment, this research demonstrates how combining multiple data modalities with advanced LLM architectures can dramatically improve recommendation accuracy and user satisfaction.
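
The cross-modality attention (CMA) idea is easy to picture in code: one modality's tokens attend over another's, then an FFN adapter projects the fused features back. Below is a toy PyTorch sketch with illustrative dimensions and wiring; the paper has the real architecture:

```python
# Toy cross-modality attention block: query modality attends to another
# modality, with a residual FFN adapter. Dimensions are illustrative.
import torch
import torch.nn as nn

class CrossModalityAttention(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.adapter = nn.Sequential(
            nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim)
        )

    def forward(self, query_mod: torch.Tensor, other_mod: torch.Tensor) -> torch.Tensor:
        # query_mod attends to other_mod (e.g. text tokens over image patches).
        fused, _ = self.attn(query_mod, other_mod, other_mod)
        return query_mod + self.adapter(fused)  # residual + FFN adapter

text = torch.randn(2, 16, 512)   # (batch, tokens, dim)
image = torch.randn(2, 49, 512)  # (batch, patches, dim)
fused_text = CrossModalityAttention()(text, image)
print(fused_text.shape)  # torch.Size([2, 16, 512])
```
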
reacted to Sri-Vigneshwar-DJ's post with 🔥 23 days ago
Combining smolagents with Anthropic's best practices simplifies building powerful AI agents:

1. Code-Based Agents: Write actions as Python code, reducing steps by 30%.
2. Prompt Chaining: Break tasks into sequential subtasks with validation gates.
3. Routing: Classify inputs and direct them to specialized handlers.
4. Fallback: Handle tasks even if classification fails.

https://huggingface.co/blog/Sri-Vigneshwar-DJ/building-effective-agents-with-anthropics-best-pra
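
A minimal starting point for the "code-based agents" pattern in item 1, using the standard smolagents API (the task string is just an example):

```python
# A CodeAgent writes its actions as Python code rather than JSON tool calls.
from smolagents import CodeAgent, DuckDuckGoSearchTool, HfApiModel

agent = CodeAgent(tools=[DuckDuckGoSearchTool()], model=HfApiModel())
agent.run("What are the most recent open LLM releases on Hugging Face?")
```
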
reacted to ezgikorkmaz's post with 🔥 23 days ago
posted an update 23 days ago
I kindly invite you to try my experimental Llama 3.2 3B with o1-like thinking.

It utilizes Thoughts only when needed, so don't be surprised when it doesn't. It also has a minor bug that requires further fine-tuning (sometimes it starts with <|python_tag|> instead of <Thought>).

Enjoy!

Give some likes and whatever to make me feel better and motivated to keep going 😂

mkurman/llama-3.2-MEDIT-3B-o1
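
A quick way to try it with the transformers pipeline API. Since the post notes the model occasionally opens with <|python_tag|> instead of <Thought>, a simple post-hoc cleanup is sketched as a workaround (my workaround, not the author's):

```python
# Chat with the model and patch the stray opening tag the post mentions.
from transformers import pipeline

pipe = pipeline("text-generation", model="mkurman/llama-3.2-MEDIT-3B-o1", device_map="auto")
messages = [{"role": "user", "content": "If I have 3 apples and eat one, how many are left?"}]
out = pipe(messages, max_new_tokens=512)[0]["generated_text"][-1]["content"]

# Workaround for the known bug: swap a leading <|python_tag|> for <Thought>.
if out.startswith("<|python_tag|>"):
    out = out.replace("<|python_tag|>", "<Thought>", 1)
print(out)
```
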
reacted to reddgr's post with 👀 about 2 months ago
Thought it would only make sense to share this here. Lately, one of my favorite activities has been annotating prompts and putting them into datasets (reddgr/tl-test-learn-prompts, reddgr/rq-request-question-prompts, reddgr/nli-chatbot-prompt-categorization), which I then use to classify and select chatbot conversations for my website. It's quite fun to use this widget on lmsys/lmsys-chat-1m, but I also use it on my 2 years of talking to chatbots (soon to be a dataset, but still a lot of web scraping and ETL work left)... This one in the picture was probably one of the first prompts I wrote to an LLM:
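
Pulling one of these prompt-annotation datasets down to experiment with takes two lines with the standard datasets API (the split name "train" is an assumption; check the dataset card):

```python
# Load an annotated-prompts dataset and inspect one row.
from datasets import load_dataset

ds = load_dataset("reddgr/nli-chatbot-prompt-categorization", split="train")  # split assumed
print(ds[0])
```
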
posted an update about 2 months ago
How Do I Contribute (HDIC)

Exciting times to come? We are working on a layer self-esteem technique that scores each layer's contribution to the final prediction. For now, it unlocks a lot of knowledge already stored in the weights that we couldn't force the model to extract by further fine-tuning!
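
The post doesn't spell the technique out, but as a rough illustration of what "scoring a layer's contribution to the final prediction" can look like, here is a logit-lens-style probe: project each layer's hidden state through the LM head and compare against the final distribution. This is purely an illustrative sketch, not the HDIC method itself:

```python
# Crude per-layer contribution probe (logit-lens style), NOT the HDIC method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

final = out.logits[0, -1].softmax(-1)
for i, h in enumerate(out.hidden_states):
    # Project each layer's last-token state through the final norm + LM head.
    layer_probs = model.lm_head(model.transformer.ln_f(h[0, -1])).softmax(-1)
    agreement = (layer_probs * final).sum()  # crude "contribution" score
    print(f"layer {i:2d}: agreement with final prediction = {agreement:.4f}")
```
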
reacted to AdinaY's post with 🔥 about 2 months ago
HunyuanVideo 📹 The new open video generation model by Tencent!
👉 tencent/HunyuanVideo
zh-ai-community/video-models-666afd86cfa4e4dd1473b64c
✨ 13B parameters: probably the largest open video model to date
✨ Unified architecture for image & video generation
✨ Powered by advanced features: MLLM Text Encoder, 3D VAE, and Prompt Rewrite
✨ Delivers stunning visuals, diverse motion, and unparalleled stability
🔓 Fully open with code & weights
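
A sketch of running it through the diffusers integration, assuming the community diffusers-format weights; at 13B parameters this needs serious VRAM even with offloading:

```python
# Generate a short clip with HunyuanVideo via diffusers (community weights assumed).
import torch
from diffusers import HunyuanVideoPipeline
from diffusers.utils import export_to_video

pipe = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # helps fit on a single GPU

video = pipe(prompt="a cat walks on the grass, realistic style", num_frames=61).frames[0]
export_to_video(video, "output.mp4", fps=15)
```
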
reacted to singhsidhukuldeep's post with 🤗 about 2 months ago
Exciting breakthrough in Document AI! Researchers from UNC Chapel Hill and Bloomberg have developed M3DocRAG, a revolutionary framework for multi-modal document understanding.

The innovation lies in its ability to handle complex document scenarios that traditional systems struggle with:
- Process 40,000+ pages across 3,000+ documents
- Answer questions requiring information from multiple pages
- Understand visual elements like charts, tables, and figures
- Support both closed-domain (single document) and open-domain (multiple documents) queries

Under the hood, M3DocRAG operates through three sophisticated stages:

>> Document Embedding:
- Converts PDF pages to RGB images
- Uses ColPali to project both text queries and page images into a shared embedding space
- Creates dense visual embeddings for each page while maintaining visual information integrity

>> Page Retrieval:
- Employs MaxSim scoring to compute relevance between queries and pages
- Implements inverted file indexing (IVFFlat) for efficient search
- Reduces retrieval latency from 20s to under 2s when searching 40K+ pages
- Supports approximate nearest neighbor search via Faiss

>> Question Answering:
- Leverages Qwen2-VL 7B as the multi-modal language model
- Processes retrieved pages through a visual encoder
- Generates answers considering both textual and visual context

The results are impressive:
- State-of-the-art performance on MP-DocVQA benchmark
- Superior handling of non-text evidence compared to text-only systems
- Significantly better performance on multi-hop reasoning tasks

This is a game-changer for industries dealing with large document volumes: finance, healthcare, and legal sectors can now process documents more efficiently while preserving crucial visual context.
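
The MaxSim late-interaction score at the heart of the retrieval stage fits in a few lines of PyTorch: each query token takes its best match over a page's patch embeddings, and the per-token maxima are summed. The shapes below loosely follow ColPali's roughly 1,030 patch vectors of dimension 128 per page; the Faiss IVFFlat index is omitted:

```python
# MaxSim late-interaction scoring over multi-vector page embeddings.
import torch
import torch.nn.functional as F

def maxsim(query_emb: torch.Tensor, page_emb: torch.Tensor) -> torch.Tensor:
    """query_emb: (q_tokens, dim); page_emb: (n_pages, patches, dim); all L2-normalized."""
    sim = torch.einsum("qd,npd->nqp", query_emb, page_emb)  # cosine similarities
    return sim.max(dim=-1).values.sum(dim=-1)               # max over patches, sum over tokens

query = F.normalize(torch.randn(12, 128), dim=-1)
pages = F.normalize(torch.randn(100, 1030, 128), dim=-1)
scores = maxsim(query, pages)       # (100,) relevance scores
top_pages = scores.topk(4).indices  # retrieve the best pages for the VLM stage
```
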
reacted to cfahlgren1's post with 🔥 about 2 months ago
You can just ask things 🗣️

"show me messages in the coding category that are in the top 10% of reward model scores"

Download really high quality instructions from the Llama3.1 405B synthetic dataset 🔥

argilla/magpie-ultra-v1.0
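
One way to run that kind of query yourself: DuckDB can read Hugging Face datasets straight from hf:// paths. The column names below (category, score) are guesses for illustration; check the dataset viewer for the real schema:

```python
# Query a Hugging Face dataset with DuckDB; column names are assumptions.
import duckdb

df = duckdb.sql("""
    WITH t AS (
        SELECT * FROM 'hf://datasets/argilla/magpie-ultra-v1.0/**/*.parquet'
    )
    SELECT instruction, score
    FROM t
    WHERE category = 'coding'
      AND score >= (SELECT quantile_cont(score, 0.9) FROM t)  -- top 10%
""").df()
print(df.head())
```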