ZamAI Bloom Pashto - Base Model
This model card is for the base Pashto language model, trained from bigscience/bloom-560m
on a general Pashto corpus. This model is intended to be a foundational model for Pashto, which can be further fine-tuned for specific tasks.
Model Description
This model is a version of bigscience/bloom-560m that has been continually trained on a Pashto text corpus. The goal is to create a language model proficient in understanding and generating general Pashto text.
Original Base Model: bigscience/bloom-560m
Current Model (this one): tasal9/pashto-bloom-base (or your specific Hub ID)
Intended Uses & Limitations
Intended Uses
This model is intended for:
- Serving as a base for fine-tuning on specific Pashto NLP tasks.
- Generating general Pashto text.
- Research in Pashto NLP.
- Educational purposes for Pashto language learning.
Limitations and Bias
- The model's performance is dependent on the quality and diversity of the training data. It may generate text that reflects biases present in the data.
- It might produce factually incorrect or nonsensical text, especially for complex topics or out-of-domain prompts.
- The model may not be suitable for critical applications without further evaluation and mitigation of potential harms.
- Performance on specific Pashto dialects might vary depending on their representation in the training data.
- As a base model, its performance on specialized tasks without fine-tuning will be limited.
How to use
You can use this model with the Hugging Face transformers library for text generation or as a starting point for fine-tuning.
First, install the library:
pip install transformers torch
Then, you can use the model in Python:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "tasal9/pashto-bloom-base" # Replace with your Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
prompt = "په پښتو ژبه کې یو شعر ولیکئ د پسرلې په اړه" # Example prompt: "Write a poem in Pashto about spring"
inputs = tokenizer(prompt, return_tensors="pt")
# Generate text
# Adjust generation parameters as needed (max_length, num_beams, do_sample, top_k, top_p, etc.)
outputs = model.generate(**inputs, max_length=100, num_beams=5, early_stopping=True)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
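Beam search (as above) tends toward conservative output; for more varied Pashto text, sampling-based generation is a common alternative. The parameter values below are illustrative defaults, not tuned recommendations for this model:

```python
# Sampling-based generation; parameter values are illustrative, not tuned for this model
outputs = model.generate(
    **inputs,
    max_length=100,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.8,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```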
Training Data
This model was trained on a general Pashto corpus.
- Source: Describe the source of your Pashto text (e.g., ps(1).txt from a specific collection, web scraped data, etc.).
- Size: [e.g., number of documents, lines, tokens, or GBs after processing ps(1).txt]
- Preprocessing: Texts were tokenized using the AutoTokenizer for bigscience/bloom-560m. Sequences were filtered to a minimum length. [Mention any other significant cleaning or filtering steps from prepare_base_dataset.py.]
- Dataset Script: scripts/prepare_base_dataset.py was used to process the raw text into a Hugging Face dataset.
- Processed Dataset Location (example): datasets/base_ps_prepared/ (see the loading sketch below)
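As a quick sanity check, the processed dataset at that location can be loaded back with the datasets library; this is a minimal sketch and assumes the directory was written with save_to_disk and contains train/test splits:

```python
from datasets import load_from_disk

# Load the prepared, tokenized dataset from the example location above
dataset = load_from_disk("datasets/base_ps_prepared/")
print(dataset)  # expected: a DatasetDict with "train" and "test" splits
```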
Training Procedure
Preprocessing
The raw Pashto texts were processed using scripts/prepare_base_dataset.py. This involved tokenization with the bigscience/bloom-560m tokenizer and filtering of short sequences.
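The preparation script itself is not reproduced in this card, but the steps it describes (tokenization with the bloom-560m tokenizer and a minimum-length filter) might look roughly like the sketch below. The input file name, max_length, and min_length threshold are placeholders, not values taken from prepare_base_dataset.py:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Tokenizer of the original base model
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")

# Raw Pashto corpus, one example per line (file name is a placeholder)
raw = load_dataset("text", data_files={"train": "pashto_corpus.txt"})

def tokenize(batch):
    # Truncation keeps sequences within a fixed context window (length is a placeholder)
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Filter out very short sequences (threshold is a placeholder)
tokenized = tokenized.filter(lambda ex: len(ex["input_ids"]) >= 8)

# Split off a small test set and save to the location used elsewhere in this card
tokenized = tokenized["train"].train_test_split(test_size=0.01, seed=42)
tokenized.save_to_disk("datasets/base_ps_prepared/")
```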
Training
The model was trained (continued pre-training) using the Hugging Face transformers library with PyTorch, starting from bigscience/bloom-560m.
- Training script: scripts/train_base_model.py
- Hyperparameters (Update with your actual training parameters):
- Learning rate: [e.g., 5e-5 or 2e-5]
- Batch size (per device): [e.g., 4, 8]
- Number of epochs: [e.g., 1, 3]
- Optimizer: AdamW
- Weight decay: 0.01
- Warmup steps: [e.g., 500]
- Gradient accumulation steps: [e.g., 1]
- Seed: 42 (or your chosen seed)
- Infrastructure:
- Hardware: [e.g., 1x NVIDIA A100 40GB, Google Colab T4/V100]
- Training time: [e.g., X hours]
This model represents a fresh training run starting from the original bigscience/bloom-560m weights, not from previous incorrect checkpoints.
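The actual scripts/train_base_model.py is not shown here; the sketch below illustrates how such a continued pre-training run could be set up with the Trainer API. The hyperparameter values mirror the placeholders above and are assumptions to be replaced with the values actually used:

```python
from datasets import load_from_disk
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")
dataset = load_from_disk("datasets/base_ps_prepared/")

# Causal LM objective: the collator copies input_ids to labels (mlm=False)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Placeholder hyperparameters; replace with the values actually used
args = TrainingArguments(
    output_dir="pashto-bloom-base",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    num_train_epochs=1,
    weight_decay=0.01,
    warmup_steps=500,
    gradient_accumulation_steps=1,
    seed=42,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    data_collator=collator,
)
trainer.train()
trainer.save_model("pashto-bloom-base")
```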
Evaluation Results
Evaluation for a base model typically involves measuring perplexity on a held-out Pashto test set.
- Test set: [Describe your test set, e.g., the test split from datasets/base_ps_prepared/test]
- Metrics: Perplexity
- Results:
- Perplexity: [To be filled after training and evaluation]
Qualitative observations on text generation quality can also be included.
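Perplexity is typically computed as the exponential of the average cross-entropy loss on the held-out split. A minimal sketch, reusing the Trainer and dataset from the training sketch above:

```python
import math

# Average cross-entropy loss on the test split, exponentiated to give perplexity
metrics = trainer.evaluate(eval_dataset=dataset["test"])
perplexity = math.exp(metrics["eval_loss"])
print(f"Perplexity: {perplexity:.2f}")
```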
Model Card Contact
Author: Yaqoob Tasal
Username: tasal9
Organization: ZamAI
GitHub: https://github.com/tasal9
Citation
If you use this model, please consider citing:
@misc{zamai_bloom_pashto_base_2025,
author = {Yaqoob Tasal},
title = {ZamAI Bloom Pashto - Base Language Model},
year = {2025},
publisher = {Hugging Face},
journal = {Hugging Face Model Hub},
howpublished = {\url{https://huggingface.co/tasal9/pashto-bloom-base}} # Update with your Hub ID
}
And the original Bloom model:
@article{scao2022bloom,
title={BLOOM: A 176B-Parameter Open-Access Multilingual Language Model},
author={Scao, Teven Le and Fan, Angela and Akiki, Christopher and others},
journal={arXiv preprint arXiv:2211.05100},
year={2022}
}
Remember to replace placeholders like dataset details, hyperparameters, and evaluation results with your actual project details once the new training is complete. Save this as README.md (or ModelCard.md) in your model repository on the Hugging Face Hub.