its5Q (its5Q)

reacted to nyuuzyou's post with 👍 3 months ago

Post

756

🎰 Casino Benchmark: Dataset + Space
nyuuzyou/casino-benchmark
nyuuzyou/casino-benchmark

14 models faced 1,400 simulations of heads-up Blackjack and European Roulette. Shared seeds locked identical cards and spins for each.
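A minimal sketch of how shared seeds can lock identical cards and spins for every model. It assumes a Python harness; `deal_round` and its parameters are hypothetical names for illustration and are not taken from the benchmark's code.

```python
import random

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["S", "H", "D", "C"]


def deal_round(seed, n_spins=3):
    """Return a shuffled deck and roulette spins that depend only on the seed."""
    rng = random.Random(seed)  # private generator, unaffected by global random state
    deck = [rank + suit for suit in SUITS for rank in RANKS]
    rng.shuffle(deck)
    # A European wheel has 37 pockets, numbered 0 through 36.
    spins = [rng.randint(0, 36) for _ in range(n_spins)]
    return deck, spins


# Two hypothetical models replaying the same seed face the same cards and spins.
deck_a, spins_a = deal_round(42)
deck_b, spins_b = deal_round(42)
print(deck_a == deck_b, spins_a == spins_b)
```

Giving each simulation its own `random.Random` instance keeps runs reproducible even if other code touches the global generator.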

Key Stats:

- 14 models benchmarked
- 59,483 rows
- 35 MB compressed Parquet
- 35,000 scored decisions
- Full prompts, JSON responses, reasoning traces, latency
- Bankroll tracking from $1,000 start per run

Live leaderboard tracks bets, hits, stands, and risk management.
Gemini 3 Flash leads at +$3,396. Claude 4.5 Haiku at -$7,788.
Traces in the dataset. Leaderboard in the space.

reacted to nyuuzyou's post with 👍 about 1 year ago

Post

3759

🖼️ SVGRepo Icons Dataset - nyuuzyou/svgrepo

Collection of 217,510 Scalable Vector Graphics (SVG) icons featuring:

- Sourced from SVGRepo.com across diverse categories & styles
- Includes metadata: title, tags, source collection, and specific license
- Contains minified SVG markup for direct use or processing
- Organized into splits based on individual icon license (e.g., MIT, CC0, Apache)

reacted to nyuuzyou's post with 👍 about 1 year ago

Post

3634

🦅 SmolLM2-Eagle Collection - nyuuzyou/smollm2-eagle-680263bf97f0c7e6bbe4936b

Collection of fine-tuned bilingual language models featuring:
- Models in three parameter sizes: 135M, 360M, and 1.7B based on HuggingFaceTB's SmolLM2 models
- Both standard and GGUF formats for flexible deployment in llama.cpp and Ollama
- Fine-tuned on nyuuzyou/EagleSFT dataset (536,231 Russian-English QA pairs derived from 739k+ real user queries)
- Experimental Russian language capabilities while maintaining English performance
- Limited Russian capabilities due to SFT-only approach without Russian pre-training
- Environmental impact: ~19.75 kg CO2eq

This collection provides compact models for research on bilingual language capabilities, resource-constrained environments, and educational applications. Not recommended for production use due to experimental nature and inherent limitations. Available under Apache 2.0 license.

reacted to nyuuzyou's post with 👍 about 1 year ago

Post

2984

🦅 EagleSFT Dataset - nyuuzyou/EagleSFT

Collection of 536,231 question-answer pairs featuring:

- Human-posed questions and machine-generated responses for SFT
- Bilingual content in Russian and English with linked IDs
- Derived from 739k+ real user queries, primarily educational topics
- Includes unique IDs and machine-generated category labels

This dataset provides a resource for supervised fine-tuning (SFT) of large language models, cross-lingual research, and understanding model responses to diverse user prompts. Released to the public domain under CC0 1.0 license.

posted an update over 1 year ago

Post

3490

Am I missing something, or is there still no way to filter by model size while searching for models? It has been a requested feature since 2022, but I haven't seen any updates since! With the number of different models coming out, I think a size filter would be a great extension of the search functionality, especially when looking for smaller models, which are a lot less prevalent.

posted an update over 1 year ago

Post

1874

Continuing my streak by releasing the Wikireading dataset: a large collection of scraped non-fiction books, predominantly in Russian.
its5Q/wikireading

Here are the highlights:
- ~7B tokens, or ~28B characters, making it a great candidate for use in pretraining
- Contains non-fiction works from many knowledge domains
- Includes both the original HTML and extracted text of book chapters
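
As a rough check on the size figures above, the two estimates imply about 4 characters per token. This is only a sketch that treats both numbers as round approximations; the variable names are illustrative.

```python
# Approximate sizes quoted in the post (both are rounded estimates).
approx_tokens = 7_000_000_000
approx_chars = 28_000_000_000

# Average characters per token implied by the two figures.
chars_per_token = approx_chars / approx_tokens
print(f"~{chars_per_token:.1f} characters per token")
```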

reacted to clem's post with 🔥 over 1 year ago

Post

4154

Just crossed 200,000 free public AI datasets shared by the community on Hugging Face! Text, image, video, audio, time-series & many more... Thanks everyone!

http://hf.co/datasets

posted an update over 1 year ago

Post

1196

I've made public a dataset of scraped Teletype articles.

Here's the overview:
- 3.3 million articles, predominantly in Russian and English
- Includes original HTML, extracted text and metadata
- All articles were run through language identification
- Includes all public articles up until April 2024

its5Q/teletype

its5Q

AI & ML interests

Organizations

its5Q's activity