---
library_name: transformers
datasets:
  - gaokerena/medical_corpus
  - gaokerena/MF3QA
language:
  - fa
base_model:
  - CohereForAI/aya-expanse-8b
pipeline_tag: text-generation
co2_eq_emissions:
  emissions: 2660
  source: >-
    Quantifying the Carbon Emissions of Machine Learning.
    https://arxiv.org/abs/1910.09700
  training_type: fine-tuning
  hardware_used: 1 A100 PCIe 40/80GB GPU
  geographical_location: asia-east1
---

# Gaokerena

Gaokerena is a Persian-language medical assistant fine-tuned to provide accurate and reliable responses to medical queries. Built upon Aya-Expanse-8B, a multilingual model developed by Cohere For AI, it is specifically tailored to address questions in Persian, offering users a helpful resource for general medical information. Gaokerena is designed to assist users by delivering clear, concise, and relevant medical insights, making it a useful tool for understanding medical topics and concepts.

Visit our GitHub repository for further information.

## Model Description

## Model Sources

## Intended Use

Gaokerena is designed to:

- Provide health-promoting information in Persian.
- Assist with general medical queries, offering reliable and understandable explanations.
- Support healthcare professionals and medical students by simplifying complex medical concepts into accessible language.

## Risks and Limitations

While Gaokerena aims to provide accurate information, it is not a substitute for professional medical advice. The model may have limitations in:

- Handling medical emergencies.
- Addressing highly specialized or rare medical conditions.
- Offering region-specific guidance, as the training data does not include localized Persian medical practices.

## How to Get Started with the Model

Since the model is built upon Aya, you can use it in either a text-only (single-modal) or a multimodal configuration.

### Single-modal inference

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from peft.peft_model import PeftModel

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16

# Load the Aya-Expanse-8B base model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "CohereForAI/aya-expanse-8b",
    torch_dtype=dtype,
    device_map=device,
)
tokenizer = AutoTokenizer.from_pretrained("CohereForAI/aya-expanse-8b")

# Attach the Gaokerena PEFT adapter and merge it into the base weights
model = PeftModel.from_pretrained(model=model, model_id="gaokerena/gaokerena-v1.0")
model = model.merge_and_unload()

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Example query: "How can stress cause canker sores (mouth ulcers)?"
pipe_output = pipe(
    [{"role": "user", "content": "چگونه استرس می‌تواند باعث ایجاد آفت دهان شود؟"}],
    max_new_tokens=1024,
    eos_token_id=[tokenizer.eos_token_id],
    do_sample=False,
)

output = pipe_output[0]["generated_text"][-1]["content"]
print(output)
```
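Merging the adapter into the base model happens at load time; if you want to skip that step on later runs, you can optionally save the merged weights to a local directory. This is a minimal sketch, and `./gaokerena-merged` is just a placeholder path, not an official artifact:

```python
# Optional: persist the merged model locally so later loads can skip the merge step.
# "./gaokerena-merged" is an arbitrary local path, not an official artifact.
model.save_pretrained("./gaokerena-merged")
tokenizer.save_pretrained("./gaokerena-merged")
```

The saved directory can then be passed directly to `AutoModelForCausalLM.from_pretrained` without the PEFT step.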

### Multimodal inference

```python
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch
from peft.peft_model import PeftModel

model_id = "CohereForAI/aya-vision-8b"

# Load the Aya-Vision-8B base model and its processor
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.float16
)
# Attach the Gaokerena adapter and merge it into the base weights
model = PeftModel.from_pretrained(model=model, model_id="gaokerena/test3")
model = model.merge_and_unload()

# Example request with an image: "Explain this image"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "./chest-pic.jpeg"},
            {"type": "text", "text": "در مورد این تصویر توضیح بده"},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)

gen_tokens = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.3,
)

# Decode only the newly generated tokens, skipping the prompt
print(processor.tokenizer.decode(gen_tokens[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```

## Training Details

Gaokerena was further pretrained on 60,000 medical articles from the MedicalCorpus dataset, collected from various Persian medical web services. It was then instruction-tuned on 20,000 question-answer pairs from the MF3QA dataset.
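Both training datasets are listed in this card's metadata under the gaokerena organization on the Hugging Face Hub. Below is a minimal sketch for loading them with the `datasets` library, assuming both repositories are publicly accessible:

```python
from datasets import load_dataset

# Persian medical corpus used for the continued-pretraining stage
medical_corpus = load_dataset("gaokerena/medical_corpus")

# Persian medical question-answer pairs used for instruction tuning
mf3qa = load_dataset("gaokerena/MF3QA")

print(medical_corpus)
print(mf3qa)
```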

## Environmental Impact

- Hardware Type: A100 PCIe 40/80GB
- Hours used: 19
- Cloud Provider: Google Cloud Platform
- Compute Region: asia-east1
- Carbon Emitted: 2.66 kg CO2 eq.
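The carbon estimate comes from the Machine Learning Impact methodology cited in the metadata (arXiv:1910.09700), which roughly boils down to power draw × time × grid carbon intensity. As a back-of-the-envelope check, the sketch below reproduces the arithmetic; the power-draw and carbon-intensity constants are illustrative assumptions chosen to show the formula, not measured values:

```python
# Back-of-the-envelope check of the reported estimate (arXiv:1910.09700 methodology).
# Both constants below are assumed, illustrative values, not measurements.
gpu_power_kw = 0.28       # assumed average draw of one A100 PCIe, in kW
hours_used = 19           # training time reported above
grid_intensity = 0.50     # assumed kg CO2 eq. per kWh for the compute region

emissions_kg = gpu_power_kw * hours_used * grid_intensity
print(f"~{emissions_kg:.2f} kg CO2 eq.")  # ~2.66 kg, in line with the figure above
```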

## BibTeX

If you found our model useful, feel free to cite us!

```bibtex
@misc{Gaokerena-v1.0,
  title={Leveraging Online Data to Enhance Medical Knowledge in a Small Persian Language Model},
  author={Ghassabi, Mehrdad and Rostami, Pedram and Baradaran Kashani, Hamidreza and Poursina, Amirhossein and Kazemi, Zahra and Tavakoli, Milad},
  year={2025},
  eprint={2505.16000},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```