ZamAI Bloom Pashto - Base Model

This model card is for the base Pashto language model, trained from bigscience/bloom-560m on a general Pashto corpus. This model is intended to be a foundational model for Pashto, which can be further fine-tuned for specific tasks.

Model Description

This model is a version of bigscience/bloom-560m that has been continually trained on a Pashto text corpus. The goal is to create a language model proficient in understanding and generating general Pashto text.

  • Original Base Model: bigscience/bloom-560m
  • Current Model (this one): tasal9/pashto-bloom-base (or your specific Hub ID)

Intended Uses & Limitations

Intended Uses

This model is intended for:

  • Serving as a base for fine-tuning on specific Pashto NLP tasks.
  • Generating general Pashto text.
  • Research in Pashto NLP.
  • Educational purposes for Pashto language learning.

Limitations and Bias

  • The model's performance is dependent on the quality and diversity of the training data. It may generate text that reflects biases present in the data.
  • It might produce factually incorrect or nonsensical text, especially for complex topics or out-of-domain prompts.
  • The model may not be suitable for critical applications without further evaluation and mitigation of potential harms.
  • Performance on specific Pashto dialects might vary depending on their representation in the training data.
  • As a base model, its performance on specialized tasks without fine-tuning will be limited.

How to use

You can use this model with the Hugging Face transformers library for text generation or as a starting point for fine-tuning.

First, install the library:

pip install transformers torch

Then, you can use the model in Python:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tasal9/pashto-bloom-base" # Replace with your Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "په پښتو ژبه کې یو شعر ولیکئ د پسرلې په اړه" # Example prompt: "Write a poem in Pashto about spring"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate text
# Adjust generation parameters as needed (max_length, num_beams, do_sample, top_k, top_p, etc.)
outputs = model.generate(**inputs, max_length=100, num_beams=5, early_stopping=True)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generated_text)
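
Because this is a base model, a common next step is fine-tuning it on a task-specific Pashto dataset. The sketch below shows one minimal way to do that with the Trainer API; the dataset file, column name, and hyperparameter values are illustrative assumptions, not part of this project.

from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)
from datasets import load_dataset

model_name = "tasal9/pashto-bloom-base"  # Replace with your Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical task-specific corpus, one example per line.
dataset = load_dataset("text", data_files={"train": "my_pashto_task.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Causal LM objective: the collator copies input_ids into labels (mlm=False).
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="pashto-bloom-finetuned",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    learning_rate=2e-5,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized["train"], data_collator=collator)
trainer.train()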

Training Data

This model was trained on a general Pashto corpus.

  • Source: [Describe the source of your Pashto text, e.g., ps(1).txt from a specific collection, web-scraped data, etc.]
  • Size: [e.g., Number of documents, lines, tokens, GBs after processing ps(1).txt]
  • Preprocessing: Texts were tokenized using the AutoTokenizer for bigscience/bloom-560m, and sequences shorter than a minimum length were filtered out. [Mention any other significant cleaning or filtering steps from prepare_base_dataset.py]
  • Dataset Script: scripts/prepare_base_dataset.py was used to process the raw text into a Hugging Face dataset.
  • Processed Dataset Location (example): datasets/base_ps_prepared/

Training Procedure

Preprocessing

The raw Pashto texts were processed using scripts/prepare_base_dataset.py. This involved tokenization with the bigscience/bloom-560m tokenizer and filtering of short sequences.
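
As a rough illustration of what this preparation step might look like, here is a minimal sketch; the input file name, maximum sequence length, and filtering threshold are assumptions, not the exact contents of scripts/prepare_base_dataset.py.

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")

# Load raw Pashto text, one example per line (file name is hypothetical).
raw = load_dataset("text", data_files={"train": "ps_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Drop very short sequences (threshold is an assumption).
MIN_TOKENS = 16
tokenized = tokenized.filter(lambda ex: len(ex["input_ids"]) >= MIN_TOKENS)

# Split and save in the layout referenced above.
splits = tokenized["train"].train_test_split(test_size=0.05, seed=42)
splits.save_to_disk("datasets/base_ps_prepared")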

Training

The model was trained (continued pre-training) using the Hugging Face transformers library with PyTorch, starting from bigscience/bloom-560m.

  • Training script: scripts/train_base_model.py
  • Hyperparameters (Update with your actual training parameters):
    • Learning rate: [e.g., 5e-5 or 2e-5]
    • Batch size (per device): [e.g., 4, 8]
    • Number of epochs: [e.g., 1, 3]
    • Optimizer: AdamW
    • Weight decay: 0.01
    • Warmup steps: [e.g., 500]
    • Gradient accumulation steps: [e.g., 1]
    • Seed: 42 (or your chosen seed)
  • Infrastructure:
    • Hardware: [e.g., 1x NVIDIA A100 40GB, Google Colab T4/V100]
    • Training time: [e.g., X hours]

This model represents a fresh training run starting from the original bigscience/bloom-560m weights, not from previous incorrect checkpoints.
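
As a rough guide to how the hyperparameters listed above map onto the transformers Trainer API, the following minimal sketch uses the example values from that list; it is illustrative and is not the actual scripts/train_base_model.py.

from transformers import TrainingArguments

# Example values only; replace with the actual hyperparameters above.
training_args = TrainingArguments(
    output_dir="pashto-bloom-base",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    num_train_epochs=1,
    weight_decay=0.01,
    warmup_steps=500,
    gradient_accumulation_steps=1,
    seed=42,
)

The Trainer uses AdamW by default, which matches the optimizer listed above.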

Evaluation Results

Evaluation for a base model typically involves measuring perplexity on a held-out Pashto test set.

  • Test set: [Describe your test set, e.g., the test split from datasets/base_ps_prepared/test]
  • Metrics: Perplexity
  • Results:
    • Perplexity: [To be filled after training and evaluation]

Qualitative observations on text generation quality can also be included.
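
A minimal sketch of how perplexity could be computed on the held-out split is shown below; it assumes the processed dataset layout mentioned above and averages per-example losses rather than weighting by token count, so treat it as illustrative rather than the project's evaluation script.

import math
import torch
from datasets import load_from_disk
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tasal9/pashto-bloom-base"  # Replace with your Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Assumes the prepared dataset saved under datasets/base_ps_prepared.
test_set = load_from_disk("datasets/base_ps_prepared")["test"]

losses = []
with torch.no_grad():
    for example in test_set:
        input_ids = torch.tensor([example["input_ids"]])
        # For causal LM evaluation, labels are the inputs themselves.
        loss = model(input_ids=input_ids, labels=input_ids).loss
        losses.append(loss.item())

perplexity = math.exp(sum(losses) / len(losses))
print(f"Perplexity: {perplexity:.2f}")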

Model Card Contact

  • Author: Yaqoob Tasal
  • Username: tasal9
  • Organization: ZamAI
  • GitHub: https://github.com/tasal9

Citation

If you use this model, please consider citing:

@misc{zamai_bloom_pashto_base_2025,
  author    = {Yaqoob Tasal},
  title     = {ZamAI Bloom Pashto - Base Language Model},
  year      = {2025},
  publisher = {Hugging Face},
  journal   = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/tasal9/pashto-bloom-base}} % Update with your Hub ID
}

And the original Bloom model:

@article{scao2022bloom,
  title={BLOOM: A 176B-Parameter Open-Access Multilingual Language Model},
  author={Scao, Teven Le and Fan, Angela and Akiki, Christopher and Baran, Efrat and Ben Cheikh, Rim and Coavoux, Maxime and Davison, Thomas and de Vargas, Niklas Deckers and Delangue, C{\'e}line and Demeusy, Thibault and others},
  journal={arXiv preprint arXiv:2211.05100},
  year={2022}
}

Remember to replace placeholders like dataset details, hyperparameters, and evaluation results with your actual project details once the new training is complete. Save this as README.md (or ModelCard.md) in your model repository on the Hugging Face Hub.
