Seeking Guidance from the SmolLM Team: A Blueprint for "Crane-01," a Sovereign African Language Model
Hello Hugging Face community, and especially to the brilliant minds behind the SmolLM project,
My name is Kato Steven Mubiru, and I lead the Crane Models team in Uganda, a grassroots initiative focused on building sovereign AI for Africa.
We have been incredibly inspired by the SmolLM paper. Your work proves that with meticulous data curation and efficient architectures, it's possible to create small, powerful models that are perfect for resource-constrained environments like ours. This is the exact philosophy we want to bring to our next major project: Crane-01.
Our Vision for Crane-01:
Our goal is to move beyond fine-tuning existing models and to pre-train Crane-01, a new family of models (starting around 1.5B parameters) from scratch. The core objective is to design an architecture and a pre-training corpus that are fundamentally grounded in the unique linguistic structures and cultural contexts of Uganda's 40+ languages. We don't just want a model that speaks Luganda; we want a model that thinks in a way that reflects our unique understanding of the world.
We have already laid the groundwork by creating the Ugandan Cultural Context Benchmark (UCCB), a comprehensive suite to measure an AI's cultural relevance and safety. Now, we are embarking on the pre-training journey.
Our Ask for Guidance:
As we begin to architect Crane-01, we would be immensely grateful for any high-level advice or guidance from the SmolLM team, whose expertise in this domain is unparalleled. Specifically, we are grappling with questions like:
Data Curation for Pre-training: Based on your experience with SmolLM-Corpus, what were your biggest lessons in balancing synthetic vs. web data? Which strategies were most effective for prioritizing data quality over sheer quantity in low-resource contexts?
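For context on where we stand with this question ourselves, below is a minimal sketch of the quality-first filtering pass we are prototyping for our web text: exact deduplication plus two crude heuristics. Every threshold in it is a placeholder we expect to revise, not a settled choice.

```python
# Rough sketch of the quality-first filtering pass we are prototyping for
# Luganda/English web text. The thresholds below are placeholders we are
# still tuning, not settled choices.
import hashlib
import re

MIN_WORDS = 20          # drop very short fragments (placeholder threshold)
MAX_SYMBOL_RATIO = 0.3  # drop pages dominated by markup/boilerplate symbols

_seen_hashes = set()    # exact-duplicate tracking across a corpus shard


def keep_document(text: str) -> bool:
    """Return True if a raw document passes our basic quality and dedup checks."""
    words = text.split()
    if len(words) < MIN_WORDS:
        return False

    # Ratio of non-alphabetic, non-space characters as a crude boilerplate signal.
    symbols = sum(1 for ch in text if not ch.isalpha() and not ch.isspace())
    if symbols / max(len(text), 1) > MAX_SYMBOL_RATIO:
        return False

    # Exact deduplication via a normalized content hash.
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    if digest in _seen_hashes:
        return False
    _seen_hashes.add(digest)
    return True


if __name__ == "__main__":
    docs = ["Oli otya! " * 30, "Oli otya! " * 30, "short snippet"]
    print([keep_document(d) for d in docs])  # [True, False, False]
```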
Architecture for Small Models: Your paper mentions that "architecture matters more than scale for small models." For a model that needs to capture unique linguistic features, are there specific architectural choices (beyond deep-and-thin) that you found particularly impactful?
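To make the trade-off concrete, this is the kind of back-of-the-envelope comparison we run when weighing depth against width at a fixed ~1.5B budget. Both configurations below are hypothetical candidates, and the count assumes a standard decoder-only transformer with a 4x MLP expansion and tied embeddings while ignoring norms and biases, so the numbers are rough estimates only.

```python
# Back-of-the-envelope parameter counts for the ~1.5B design space we are
# exploring. Assumes a decoder-only transformer with 4x MLP expansion and
# tied input/output embeddings; norms and biases are ignored.

def approx_params(n_layers: int, d_model: int, vocab_size: int = 64_000) -> float:
    """Approximate parameter count in billions."""
    attention = 4 * d_model * d_model   # Q, K, V, and output projections
    mlp = 8 * d_model * d_model         # up + down projections at 4x width
    embeddings = vocab_size * d_model   # tied with the LM head
    return (n_layers * (attention + mlp) + embeddings) / 1e9


# A "deep-and-thin" candidate versus a "wide-and-shallow" one of similar size.
print(f"deep-and-thin    (32 x 1792): {approx_params(32, 1792):.2f}B")
print(f"wide-and-shallow (16 x 2560): {approx_params(16, 2560):.2f}B")
```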
Training Stability: What were the biggest challenges in ensuring stable training for a small model on a diverse, custom dataset? Are there any key hyperparameters or training dynamics we should be particularly mindful of?
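As one concrete example of where we are uncertain, here is the linear-warmup plus cosine-decay schedule we plan to start from on our mixed-language corpus. The peak learning rate, warmup length, and decay floor are our own placeholder guesses rather than anything derived from your work, and we would value a sanity check on whether this is a sensible starting point at this scale.

```python
# Sketch of the linear-warmup + cosine-decay learning-rate schedule we plan
# to start from while we study stability on our mixed-language corpus.
# Peak LR, warmup length, and floor are placeholder assumptions to be tuned.
import math

PEAK_LR = 3e-4       # placeholder peak learning rate
MIN_LR = 3e-5        # placeholder floor at the end of decay
WARMUP_STEPS = 2_000
TOTAL_STEPS = 200_000


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


for s in (0, 1_000, 2_000, 100_000, 200_000):
    print(f"step {s:>7}: lr = {lr_at(s):.2e}")
```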
We are not asking for code or hands-on help, but for the invaluable "lessons learned" from your pioneering work. Any pointers or best practices you could share would be a massive accelerator for our mission to build a truly sovereign and culturally intelligent AI for Africa.
Thank you for your incredible contributions to open science.
Best regards,
Kato Steven Mubiru
Lead, Crane Models Team