5 6 4

Vatolin Alexey

vatolinalex

AI & ML interests

None yet

Recent Activity

reacted to nyuuzyou's post with 👍 12 days ago

🇷🇺 Russian Forum Messages Dataset - https://huggingface.co/datasets/nyuuzyou/ruforum Collection of approximately 58 million Russian forum messages featuring: - Complete message content from Russian online forums spanning 2010-2025 - Comprehensive metadata including unique message IDs and timestamps - Full text content preserving original user discussions and interactions - Monolingual dataset focused exclusively on Russian language content This dataset offers a unique textual archive of Russian online conversations suitable for text generation, sentiment analysis, and language modeling research. Released to the public domain under CC0 1.0 license.

updated a dataset 12 days ago

vatolinalex/ru_sci_bench_translation_search

published a dataset 12 days ago

vatolinalex/ru_sci_bench_translation_search

View all activity

Organizations

vatolinalex's activity

reacted to nyuuzyou's post with 👍 12 days ago

Post

5557

🇷🇺 Russian Forum Messages Dataset - nyuuzyou/ruforum

Collection of approximately 58 million Russian forum messages featuring:

- Complete message content from Russian online forums spanning 2010-2025
- Comprehensive metadata including unique message IDs and timestamps
- Full text content preserving original user discussions and interactions
- Monolingual dataset focused exclusively on Russian language content

This dataset offers a unique textual archive of Russian online conversations suitable for text generation, sentiment analysis, and language modeling research. Released to the public domain under CC0 1.0 license.

updated a dataset 12 days ago

vatolinalex/ru_sci_bench_translation_search

Viewer • Updated 12 days ago • 19.9k • 36

published a dataset 12 days ago

vatolinalex/ru_sci_bench_translation_search

Viewer • Updated 12 days ago • 19.9k • 36

upvoted a paper about 2 months ago

Training Sparse Mixture Of Experts Text Embedding Models

Paper • 2502.07972 • Published Feb 11 • 6

liked a model about 2 months ago

EuroBERT/EuroBERT-210m

Fill-Mask • Updated 9 days ago • 12.3k • 66

reacted to tomaarsen's post with ❤️ about 2 months ago

Post

6676

An assembly of 18 European companies, labs, and universities have banded together to launch 🇪🇺 EuroBERT! It's a state-of-the-art multilingual encoder for 15 European languages, designed to be finetuned for retrieval, classification, etc.

🇪🇺 15 Languages: English, French, German, Spanish, Chinese, Italian, Russian, Polish, Portuguese, Japanese, Vietnamese, Dutch, Arabic, Turkish, Hindi
3️⃣ 3 model sizes: 210M, 610M, and 2.1B parameters - very very useful sizes in my opinion
➡️ Sequence length of 8192 tokens! Nice to see these higher sequence lengths for encoders becoming more common.
⚙️ Architecture based on Llama, but with bi-directional (non-causal) attention to turn it into an encoder. Flash Attention 2 is supported.
🔥 A new Pareto frontier (stronger *and* smaller) for multilingual encoder models
📊 Evaluated against mDeBERTa, mGTE, XLM-RoBERTa for Retrieval, Classification, and Regression (after finetuning for each task separately): EuroBERT punches way above its weight.
📝 Detailed paper with all details, incl. data: FineWeb for English and CulturaX for multilingual data, The Stack v2 and Proof-Pile-2 for code.

Check out the release blogpost here: https://huggingface.co/blog/EuroBERT/release
* EuroBERT/EuroBERT-210m
* EuroBERT/EuroBERT-610m
* EuroBERT/EuroBERT-2.1B

The next step is for researchers to build upon the 3 EuroBERT base models and publish strong retrieval, zero-shot classification, etc. models for all to use. I'm very much looking forward to it!