Open Source AI Model
Welcome to our open-source AI initiative! We are a community-driven project committed to transparency at every step—from datasets to model architectures. Join us in building the future of AI, where the power of collective knowledge shapes the next generation of language models.
Project Overview
Our mission is to develop high-quality open-source datasets and build a foundational language model that can be fine-tuned on consumer hardware. By leveraging crowd-sourced computing, we aim to create a base Large Language Model (LLM) that is truly "by the community, for the community." All model checkpoints, as well as the final pre-trained model, will be made publicly available on Hugging Face for everyone to access and use. Our goal is to offer three main sizes: one for current-generation phones, one for current-generation graphics cards, and one in between.
Core Objectives
- High-Quality Datasets: Focus on creating well-curated, high-quality datasets to pre-train the LLM.
- Efficient Model Architecture: Develop a compact base model with a large-vocabulary tokenizer that can be fine-tuned quickly on consumer-grade hardware.
- Crowd-Sourced Compute: Utilize crowd-sourced compute resources to train the model, making the process accessible to the global community.
Ways to Contribute
There are multiple ways you can contribute to the project:
1. Source Code Contributions
Help us improve the project by submitting bug fixes, enhancements, or new features to our GitHub repository.
2. Dataset Discovery
Assist in finding high-quality datasets that can be used to train and refine our models.
3. Dataset Optimization
Help us clean and optimize datasets for training, ensuring the highest data quality and relevance.
4. GPU Contributions
If you have a GPU with at least 15 GB of VRAM, you can participate in crowd-sourced training. Much like a mining pool, contributors pool their computational power to train the model.
5. CPU Contributions
Help with tokenizing datasets and converting them into .npy files to support the training pipeline (see the tokenization sketch after this list).
6. Writer / Content Creator
Assist with creating documentation, blog posts, and social media content to raise awareness and share knowledge about the project.
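As a rough illustration of the CPU tokenization task, here is a minimal sketch that turns a plain-text file into a .npy file of token IDs. It assumes the tokenizer is pulled from the AIGym/base-v1 repository used in the inference example below; the file names are placeholders, and the actual pipeline may differ.

import numpy as np
from transformers import AutoTokenizer

# Assumption: the project tokenizer ships with the checkpoint used in the inference example.
tokenizer = AutoTokenizer.from_pretrained("AIGym/base-v1")

# Placeholder input/output paths.
input_path = "dataset.txt"
output_path = "dataset_tokens.npy"

token_ids = []
with open(input_path, "r", encoding="utf-8") as f:
    for line in f:
        # Encode each line without special tokens; the training pipeline
        # can insert BOS/EOS markers wherever it needs them.
        token_ids.extend(tokenizer.encode(line, add_special_tokens=False))

# A 262K vocabulary does not fit in uint16, so store the IDs as uint32.
np.save(output_path, np.asarray(token_ids, dtype=np.uint32))
print(f"Wrote {len(token_ids)} tokens to {output_path}")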
Why Choose Us?
- Completely Open Source: We believe in full transparency. From the datasets to the model itself, everything is open-source and freely accessible.
- Community-Driven: The project evolves with the contributions of developers and AI enthusiasts from all over the world. We rely on the collective strength of the community to build something truly impactful.
Funding
At this stage, the project is entirely community-funded, relying on contributions in the form of time, computational power, and knowledge. Every bit helps!
Tokenizer
- Tokenizer Type: Based on the Gemma 3 SentencePiece Tokenizer
- Vocabulary Size: 262K
- Tokenizer Features:
- Split digits for better handling of numerical data.
- Preserve whitespace for accurate text segmentation.
- Byte fallback for handling rare words or tokens.
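A quick way to inspect these features is to load the tokenizer and look at how it splits a mixed string. This is a small sketch, assuming the tokenizer is published alongside the AIGym/base-v1 checkpoint referenced in the inference example below.

from transformers import AutoTokenizer

# Assumption: the tokenizer is available from the same repo as the base checkpoint.
tokenizer = AutoTokenizer.from_pretrained("AIGym/base-v1")

sample = "Price: 12345 yen\n    indented line"
print(tokenizer.tokenize(sample))
# With digit splitting, "12345" should appear as one token per digit;
# preserved whitespace keeps the leading spaces of the indented line;
# byte fallback ensures characters outside the vocabulary still map to byte-level tokens.
print("Vocabulary size:", tokenizer.vocab_size)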
Model Details
Our model is based on the Gemma Architecture, with the following characteristics:
- Context Length: 8K (planned upgrade to 32K)
- Attention Mechanism: Replaced softcapping with QK norm
- Attention Layers: 5 sliding-window attention layers for every 1 global attention layer
- RoPE Scaling: factor of 8, allowing the positional encodings to cover longer contexts
- Mixed-precision training for reduced memory consumption and faster training
- AdamW Optimizer for stable training
- Sliding-window attention of 512 tokens, with plans to extend it to 1,024
- Chat template enforces a BOS (Beginning of Sequence) token to maintain consistency in responses
- Vision Encoder:
- Uses the Pan & Scan algorithm to extract relevant features
- Operates at a fixed resolution of 896 × 896
- Supports windowing during inference for variable input sizes
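To make the planned hyperparameters easier to scan, here is a small sketch that collects them in one place and shows the 5:1 interleaving of sliding-window and global attention layers. The field names are illustrative only, not actual configuration keys.

from dataclasses import dataclass

@dataclass
class LocksleyPlan:
    """Illustrative summary of the planned setup (names are not real config keys)."""
    vocab_size: int = 262_144        # Gemma 3 SentencePiece tokenizer, ~262K entries
    context_length: int = 8_192      # planned upgrade to 32K
    sliding_window: int = 512        # planned upgrade to 1,024
    local_to_global_ratio: int = 5   # 5 sliding-window layers per global layer
    rope_scaling_factor: int = 8
    use_qk_norm: bool = True         # QK norm in place of attention softcapping
    optimizer: str = "AdamW"
    mixed_precision: bool = True
    vision_resolution: int = 896     # fixed 896 x 896 input for the vision encoder

def is_global_layer(layer_idx: int, ratio: int = 5) -> bool:
    """Every (ratio + 1)-th layer uses global attention; the rest use sliding windows."""
    return (layer_idx + 1) % (ratio + 1) == 0

plan = LocksleyPlan()
pattern = ["global" if is_global_layer(i, plan.local_to_global_ratio) else "sliding"
           for i in range(12)]
print(pattern)  # five 'sliding' entries followed by one 'global', repeated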
Dataset Guidelines
We strive to use datasets that are high-quality, trustworthy, and ethically sourced. Below are the guidelines for dataset selection and generation:
- Books: Must be educational and provide verifiable knowledge.
- Code: Should consist of working code that follows best practices and is useful for training.
- Math: Only datasets with provable, verifiable mathematical concepts will be used.
- Knowledge-based: All information must be accurate and supported by reliable sources.
- Multilingual: Must feature high-quality, human-generated translations (no machine-generated translations allowed).
- News: Should come from multiple sources and provide a balanced, neutral perspective.
All datasets used will be open-sourced on Hugging Face, and the code used to prepare them will be publicly available on GitHub.
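As a very small example of the kind of cleaning involved, the sketch below performs exact deduplication plus a simple length check. The thresholds and the record format are assumptions for illustration, not project policy.

import hashlib

def clean_records(records, min_chars=200):
    """Drop exact duplicates and very short records (thresholds are illustrative)."""
    seen = set()
    kept = []
    for text in records:
        normalized = " ".join(text.split())  # collapse whitespace before hashing
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen or len(normalized) < min_chars:
            continue
        seen.add(digest)
        kept.append(text)
    return kept

sample = ["A short note.", "A longer, educational passage. " * 20, "A short note."]
print(len(clean_records(sample)))  # duplicates and short snippets are removed -> 1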
Proof of Concept (POC) Stage
In the POC stage, we are training on a small, diverse selection of high-quality datasets. This will serve as a foundation for future model development. The model size at this stage is set at 4 billion parameters, with plans to scale up in later stages.
For the POC stage, we are using the working name Locksley, inspired by the legendary Robin Hood—because we are building upon the knowledge and work of giants in the field, and because we needed a name to start. This name is provisional, and we are open to suggestions for something better!
Inference Code
import os
from transformers import AutoTokenizer, AutoModelForCausalLM

# Read the Hugging Face token from the environment instead of hard-coding it.
HF_TOKEN = os.environ.get("HF_TOKEN")

our_repo = "AIGym/base-v1"

print("Testing the saved model and tokenizer...")
loaded_tokenizer = AutoTokenizer.from_pretrained(our_repo, token=HF_TOKEN)
loaded_model = AutoModelForCausalLM.from_pretrained(our_repo, token=HF_TOKEN)

# Generate a short continuation of a test prompt.
test_text = "Hello, my name is"
inputs = loaded_tokenizer(test_text, return_tensors="pt")
outputs = loaded_model.generate(**inputs, max_new_tokens=50)
generated_text = loaded_tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"Test input: {test_text}")
print(f"Generated output: {generated_text}")
print("Test complete!")
Join Us in Shaping the Future of Truly Open-Source AI! 🚀
We are excited to have you be a part of this initiative. Whether you're contributing computational power, improving code, sharing datasets, or spreading the word, every contribution counts.
Let’s build a more open, more accessible, and more transparent AI future—together!