Extended Gemma 3 1B IT (Simple Initialization)
An extended version of Google's Gemma 3 1B instruction-tuned model with an expanded vocabulary and statistically initialized embeddings for multilingual support.
Model Details
- Base Model: google/gemma-3-1b-it
- Model Type: Causal Language Model with Extended Vocabulary
- Initialization Method: Statistical initialization using mean and covariance of existing embeddings
- Extended Vocabulary: Additional tokens for multilingual support
- Model Name: pavan-naik/gemma-3-1b-it-exp
Description
This model extends the original Gemma 3 1B IT model with:
- Extended tokenizer vocabulary for additional language support
- Statistical embedding initialization where new tokens are initialized from a multivariate normal distribution based on existing embeddings' mean and covariance
- Preserved model architecture and instruction-tuning capabilities
⚠️ Important Note
This model has NOT been further pretrained after the vocabulary extension and embedding initialization. It is a base model with extended tokens and statistically initialized embeddings only. The new language tokens require additional pretraining/fine-tuning to reach useful performance, so this model serves as a starting point for multilingual adaptation rather than a ready-to-use multilingual model.
Initialization Method
The new embeddings are initialized from a multivariate normal distribution that has the old embeddings' mean and covariance. This approach provides a statistically principled way to initialize new token embeddings based on the existing vocabulary's embedding distribution.
The approach follows the method described in John Hewitt's article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html
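For illustration, here is a minimal sketch of that sampling step. The function and variable names are placeholders, not the exact script used to build this model; the small ridge term is an assumption added only to keep the sampler numerically stable.

```python
import torch

def init_new_embeddings(old_embeddings: torch.Tensor, num_new_tokens: int) -> torch.Tensor:
    """Sample embeddings for new tokens from N(mean, cov) of the existing embeddings."""
    old = old_embeddings.float()                  # work in fp32 for numerical stability
    mean = old.mean(dim=0)                        # (hidden_dim,)
    centered = old - mean
    cov = centered.T @ centered / old.shape[0]    # (hidden_dim, hidden_dim)

    # A small ridge term keeps the covariance positive definite for the sampler
    dist = torch.distributions.MultivariateNormal(
        mean, covariance_matrix=cov + 1e-5 * torch.eye(cov.shape[0])
    )
    return dist.sample((num_new_tokens,))         # (num_new_tokens, hidden_dim)
```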
To disable this statistical initialization, pass `mean_resizing=False` when calling `resize_token_embeddings()`.
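In recent versions of Transformers, this kind of extension can be reproduced roughly as follows. The added tokens below are placeholders, and `mean_resizing` requires a library version that supports it; this is a sketch of the general procedure, not the exact recipe used here.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("google/gemma-3-1b-it")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")

# Placeholder tokens; the actual extension adds tokens for the target languages
new_tokens = ["<placeholder_token_1>", "<placeholder_token_2>"]
tokenizer.add_tokens(new_tokens)

# mean_resizing=True (the default) initializes the new rows from a multivariate normal
# fitted to the existing embeddings; mean_resizing=False falls back to the model's
# default random initialization
model.resize_token_embeddings(len(tokenizer), mean_resizing=True)
```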
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("pavan-naik/gemma-3-1b-it-exp")
tokenizer = AutoTokenizer.from_pretrained("pavan-naik/gemma-3-1b-it-exp")

# Use like any other Gemma model
inputs = tokenizer("Your multilingual text here", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
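Because the base model is instruction-tuned, the chat template can also be used. Continuing from the snippet above, and assuming the extended tokenizer keeps the base model's chat template (the prompt text is a placeholder):

```python
messages = [
    {"role": "user", "content": "Your multilingual prompt here"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
)
outputs = model.generate(inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```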
Technical Details
- Initialization Strategy: New tokens initialized using multivariate normal distribution based on existing embeddings' statistics
- Preserved Components: Original model weights, architecture, and instruction-following capabilities
- Extended Components: Input embeddings and output projection layer (LM head); see the shape check after this list
- Statistical Basis: Mean and covariance computed from original vocabulary embeddings
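A quick way to confirm that both the input embeddings and the LM head were extended is to compare their first dimension with the tokenizer's vocabulary size (the embedding matrix can be slightly larger than the tokenizer if it was padded to a fixed multiple):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("pavan-naik/gemma-3-1b-it-exp")
tokenizer = AutoTokenizer.from_pretrained("pavan-naik/gemma-3-1b-it-exp")

print(len(tokenizer))                             # extended vocabulary size
print(model.get_input_embeddings().weight.shape)  # (extended_vocab_size, hidden_dim)
print(model.get_output_embeddings().weight.shape) # LM head rows should match the new vocabulary
```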
Intended Use
This model serves as a starting point for multilingual model development. It is designed for:
- Further pretraining on multilingual corpora
- Fine-tuning for specific multilingual tasks
- Research into vocabulary expansion and embedding initialization
This model requires additional training before production use: the extended tokens have only been initialized, not trained on actual multilingual data. A minimal warm-up sketch follows below.
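As one illustration of the adaptation step, the sketch below updates only the newly added embedding rows during an initial warm-up pass by masking gradients for the original rows. The boundary index (taken from the base tokenizer), the hook function name, and the training setup are assumptions for illustration, not part of this model's release.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("pavan-naik/gemma-3-1b-it-exp")
base_tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")
num_original_tokens = len(base_tokenizer)  # rows below this index hold the original, trained embeddings

# Freeze everything, then allow updates only to the embedding matrices
for param in model.parameters():
    param.requires_grad = False

embedding_weight = model.get_input_embeddings().weight
embedding_weight.requires_grad = True

def zero_grads_for_old_rows(grad):
    # Keep the original embedding rows frozen; only the new rows receive updates
    grad = grad.clone()
    grad[:num_original_tokens] = 0
    return grad

embedding_weight.register_hook(zero_grads_for_old_rows)

# If the LM head is not tied to the input embeddings, handle it the same way
lm_head_weight = model.get_output_embeddings().weight
if lm_head_weight is not embedding_weight:
    lm_head_weight.requires_grad = True
    lm_head_weight.register_hook(zero_grads_for_old_rows)

# ... plug the model into a standard language-modeling training loop on multilingual data ...
```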
Limitations
- Requires additional training: New language tokens are only initialized, not trained on multilingual data
- Not production-ready: This is a base model for further development, not a finished multilingual model
- Performance: Extended tokens will have limited performance without additional pretraining/fine-tuning
- Statistical initialization: While principled, this method may not capture the semantic relationships of specific language tokens
- Random initialization: New embeddings are sampled from a distribution, so behavior is somewhat unpredictable until trained