Model Card for Llama-electronic-radiology-TR

Model Details

Model Summary

This model is a domain-adapted version of Llama-3.2-1B, produced by continued pretraining on Turkish-language electronic radiology PhD theses. Training used an autoregressive (causal language modeling) objective on the hazal/electronic-radiology-phd-thesis-trR dataset. Unlike instruction-tuned models, this version focuses on improving fluency, vocabulary coverage, and semantic consistency in highly technical medical and radiological contexts. It is intended for downstream applications such as domain-specific generation, summarization, and further fine-tuning for clinical tasks in Turkish.
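
For orientation, the sketch below shows what a continued-pretraining (causal LM) setup of this kind can look like with transformers and PEFT. It is illustrative only: the corpus, LoRA rank, sequence length, and training arguments are placeholders, not the recipe used for this model.

from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("unsloth/Llama-3.2-1B")
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained("unsloth/Llama-3.2-1B")

# Wrap the base model with a LoRA adapter (rank/alpha here are placeholders).
model = get_peft_model(model, LoraConfig(task_type="CAUSAL_LM", r=16, lora_alpha=32))

# Placeholder corpus; in practice this is the thesis dataset.
ds = Dataset.from_dict({"text": ["Radyolojik görüntüleme yöntemleri ..."]})
ds = ds.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# mlm=False produces next-token (causal LM) labels from the input_ids.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="clm-out", per_device_train_batch_size=1),
    train_dataset=ds,
    data_collator=collator,
)
trainer.train()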

Model Description

  • Language(s) (NLP): Turkish
  • License: Llama 3.2 Community License
  • Finetuned from model: Llama-3.2-1B

Uses

Direct Use

The primary intended uses include:

  • Domain-specific generation: Generating fluent, semantically rich Turkish text in radiological contexts, e.g., imaging protocols, research summaries, or academic abstracts.
  • Medical document summarization: Summarizing long Turkish-language radiological texts, such as reports or thesis chapters.
  • Language modeling for downstream tasks: Serving as a base model for fine-tuning into instruction-tuned clinical models or QA systems in radiology.
  • Research applications: Assisting in the development of Turkish-language models for clinical NLP, especially in low-resource and domain-specific contexts.

This model is not instruction-tuned and does not perform well in prompt-based Q&A or dialogue setups without additional supervised fine-tuning.
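
If you want prompt-based behavior, a supervised fine-tuning pass is required first. A minimal sketch, assuming the TRL library (argument names vary across TRL versions) and a hypothetical Turkish instruction format; the training examples below are placeholders:

from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Hypothetical instruction-style examples; substitute a real Turkish
# clinical instruction dataset in practice.
train_ds = Dataset.from_list([
    {"text": "### Soru: Toraks BT endikasyonları nelerdir?\n### Cevap: ..."},
])

trainer = SFTTrainer(
    model="Rustamshry/Llama-electronic-radiology-TR",  # base + adapter (requires peft installed)
    train_dataset=train_ds,
    args=SFTConfig(output_dir="llama-radiology-sft"),
)
trainer.train()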

Bias, Risks, and Limitations

🔬 Domain Bias

The model has been trained exclusively on Turkish PhD-level academic texts in radiology. As such, its knowledge and language patterns are narrowly focused on:

  • Formal, academic Turkish
  • Medical terminology in radiology and imaging
  • Structured dissertation-like content

It may underperform or produce awkward completions when applied to:

  • Conversational Turkish
  • Non-medical or non-radiological topics
  • Informal writing styles or dialectal Turkish

❌ Medical Safety

This model should not be used for clinical decision-making, diagnosis, or treatment recommendations. Despite being trained on medical content, it lacks factual grounding, context awareness, and real-time clinical judgment. Any outputs generated by this model must be verified by licensed medical professionals.

🧠 Memorization Risk

Continued pretraining on a small or repetitive corpus can lead to memorization of phrases, potentially exposing:

  • Patient case formats
  • Study identifiers
  • Sections of dissertations

Although the training data appears to be anonymized academic text, the model should not be used for data anonymization, patient privacy protection, or regulatory compliance tasks.
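
If memorization is a concern for your use case, a rough probe is to prompt the model with a prefix taken from the training corpus and measure how much of the true continuation it reproduces verbatim. A minimal sketch, assuming model and tokenizer are loaded as in the getting-started section below; the prefix and reference strings are placeholders:

import os

# Placeholders: substitute a real prefix and its true continuation
# taken from the training corpus.
prefix = "Gereç ve Yöntem: ..."
reference = "..."

inputs = tokenizer(prefix, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=100, do_sample=False)  # greedy decoding
continuation = tokenizer.decode(
    out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

# Character-level verbatim overlap between the generation and the true text.
overlap = len(os.path.commonprefix([continuation, reference]))
print(f"Verbatim prefix overlap: {overlap} characters")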

🧪 Limitations

  • The model does not have grounding in real-world imaging data or structured knowledge bases.
  • Outputs may hallucinate plausible-sounding but incorrect medical facts.
  • Limited to Turkish; does not generalize to multilingual or English medical contexts.
  • Repetition or looping may still occur in long generations if decoding is not configured properly (e.g., repetition_penalty, eos_token_id; see the generation example below).

How to Get Started with the Model

Use the code below to get started with the model.

from huggingface_hub import login
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
from peft import PeftModel

login(token="")  # paste your Hugging Face access token here

tokenizer = AutoTokenizer.from_pretrained("unsloth/Llama-3.2-1B")
base_model = AutoModelForCausalLM.from_pretrained(
    "unsloth/Llama-3.2-1B",
    device_map={"": 0},  # place the model on GPU 0
    token="",            # same access token as above
)

# Load the LoRA adapter on top of the base model
model = PeftModel.from_pretrained(base_model, "Rustamshry/Llama-electronic-radiology-TR")

input_text = "Bulgular: Gruplar arası yaş ve cinsiyet dağılımı açısından istatistiksel olarak anlamlı farklılık saptanmadı."
# ("Findings: No statistically significant difference was found between the groups in terms of age and sex distribution.")

inputs = tokenizer(input_text, return_tensors="pt", add_special_tokens=True).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=1.0,
    top_p=0.95,
    repetition_penalty=1.2,               # discourages looping in long outputs
    eos_token_id=tokenizer.eos_token_id,  # stop cleanly at end-of-sequence
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
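
If you prefer to serve the model without PEFT at inference time, the LoRA weights can be folded into the base model with PEFT's merge_and_unload; the output directory below is arbitrary:

# Merge the LoRA weights into the base model and save a standalone checkpoint.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("llama-radiology-merged")
tokenizer.save_pretrained("llama-radiology-merged")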

Training Details

  • Hours used: 10 hours

Training Data

Dataset: hazal/electronic-radiology-phd-thesis-trR

This dataset contains Turkish-language PhD theses focused on electronic and diagnostic radiology. It was curated for the purpose of training language models in the Turkish medical domain. The documents are academic in tone, rich in domain-specific vocabulary, and structured into medical sections (e.g., materials & methods, results, discussion).

  • Language: Turkish
  • Domain: Electronic Radiology
  • Type: Academic dissertations
  • Preprocessing: The dataset was tokenized and truncated to a maximum sequence length suitable for LLM training; no instruction-style formatting was applied (see the sketch below).

Dataset link: https://huggingface.co/datasets/hazal/electronic-radiology-phd-thesis-trR
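
A minimal sketch of the tokenize-and-truncate preprocessing described above, assuming the dataset exposes a "text" column and using 2048 as a stand-in for the (unpublished) maximum sequence length:

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("unsloth/Llama-3.2-1B")
ds = load_dataset("hazal/electronic-radiology-phd-thesis-trR", split="train")

# Tokenize and truncate each document; drop the raw text columns.
tokenized = ds.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True,
    remove_columns=ds.column_names,
)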

Framework versions

  • PEFT 0.14.0