Model Card for Llama-electronic-radiology-TR

Model Details

Model Summary

This model is a domain-adapted version of Llama-3.2-1B, produced by continued pretraining on Turkish-language electronic radiology PhD theses. Training used an autoregressive (causal language modeling) objective on the hazal/electronic-radiology-phd-thesis-trR dataset. Unlike instruction-tuned models, this version focuses on improving fluency, vocabulary coverage, and semantic consistency in highly technical medical and radiological contexts. It is intended for downstream applications such as domain-specific generation, summarization, and further fine-tuning for clinical tasks in Turkish.
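
For orientation, the sketch below shows what a continued-pretraining (causal LM) setup of this kind can look like with transformers and PEFT. It is illustrative only: the corpus, LoRA rank, sequence length, and training arguments are placeholders, not the recipe used for this model.

from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("unsloth/Llama-3.2-1B")
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained("unsloth/Llama-3.2-1B")

# Wrap the base model with a LoRA adapter (rank/alpha here are placeholders).
model = get_peft_model(model, LoraConfig(task_type="CAUSAL_LM", r=16, lora_alpha=32))

# Placeholder corpus; in practice this is the thesis dataset.
ds = Dataset.from_dict({"text": ["Radyolojik görüntüleme yöntemleri ..."]})
ds = ds.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# mlm=False produces next-token (causal LM) labels from the input_ids.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="clm-out", per_device_train_batch_size=1),
    train_dataset=ds,
    data_collator=collator,
)
trainer.train()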

Model Description

  • Language(s) (NLP): Turkish
  • License: Llama 3.2 Community License
  • Finetuned from model: Llama-3.2-1B

Uses

Direct Use

The primary intended uses include:

  • Domain-specific generation: Generating fluent, semantically rich Turkish text in radiological contexts, e.g., imaging protocols, research summaries, or academic abstracts.
  • Medical document summarization: Summarizing long Turkish-language radiological texts, such as reports or thesis chapters.
  • Language modeling for downstream tasks: Serving as a base model for fine-tuning into instruction-tuned clinical models or QA systems in radiology.
  • Research applications: Assisting in the development of Turkish-language models for clinical NLP, especially in low-resource and domain-specific contexts.

This model is not instruction-tuned and does not perform well in prompt-based Q&A or dialogue setups without additional supervised fine-tuning.
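
If you want prompt-based behavior, a supervised fine-tuning pass is required first. A minimal sketch, assuming the TRL library (argument names vary across TRL versions) and a hypothetical Turkish instruction format; the training examples below are placeholders:

from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Hypothetical instruction-style examples; substitute a real Turkish
# clinical instruction dataset in practice.
train_ds = Dataset.from_list([
    {"text": "### Soru: Toraks BT endikasyonları nelerdir?\n### Cevap: ..."},
])

trainer = SFTTrainer(
    model="Rustamshry/Llama-electronic-radiology-TR",  # base + adapter (requires peft installed)
    train_dataset=train_ds,
    args=SFTConfig(output_dir="llama-radiology-sft"),
)
trainer.train()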

Bias, Risks, and Limitations

🔬 Domain Bias

The model has been trained exclusively on Turkish PhD-level academic texts in radiology. As such, its knowledge and language patterns are narrowly focused on:

  • Formal, academic Turkish
  • Medical terminology in radiology and imaging
  • Structured dissertation-like content

It may underperform or produce awkward completions when applied to:

  • Conversational Turkish
  • Non-medical or non-radiological topics
  • Informal writing styles or dialectal Turkish

❌ Medical Safety

This model should not be used for clinical decision-making, diagnosis, or treatment recommendations. Despite being trained on medical content, it lacks factual grounding, context awareness, and real-time clinical judgment. Any outputs generated by this model must be verified by licensed medical professionals.

🧠 Memorization Risk

Continued pretraining on a small or repetitive corpus can lead to memorization of phrases, potentially exposing:

  • Patient case formats
  • Study identifiers
  • Sections of dissertations

Although the training data appears to be anonymized academic text, the model should not be used for data anonymization, patient privacy protection, or regulatory compliance tasks.
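
If memorization is a concern for your use case, a rough probe is to prompt the model with a prefix taken from the training corpus and measure how much of the true continuation it reproduces verbatim. A minimal sketch, assuming model and tokenizer are loaded as in the getting-started section below; the prefix and reference strings are placeholders:

import os

# Placeholders: substitute a real prefix and its true continuation
# taken from the training corpus.
prefix = "Gereç ve Yöntem: ..."
reference = "..."

inputs = tokenizer(prefix, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=100, do_sample=False)  # greedy decoding
continuation = tokenizer.decode(
    out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

# Character-level verbatim overlap between the generation and the true text.
overlap = len(os.path.commonprefix([continuation, reference]))
print(f"Verbatim prefix overlap: {overlap} characters")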

🧪 Limitations

  • The model does not have grounding in real-world imaging data or structured knowledge bases.
  • Outputs may hallucinate plausible-sounding but incorrect medical facts.
  • Limited to Turkish; does not generalize to multilingual or English medical contexts.
  • Repetition or looping may still occur in long generations if decoding is not configured properly (e.g., repetition_penalty, eos_token_id; see the generation example below).

How to Get Started with the Model

Use the code below to get started with the model.

from huggingface_hub import login
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
from peft import PeftModel

login(token="")  # paste your Hugging Face access token here

tokenizer = AutoTokenizer.from_pretrained("unsloth/Llama-3.2-1B")
base_model = AutoModelForCausalLM.from_pretrained(
    "unsloth/Llama-3.2-1B",
    device_map={"": 0},  # place the model on GPU 0
    token="",            # same access token as above
)

# Load the LoRA adapter on top of the base model
model = PeftModel.from_pretrained(base_model, "Rustamshry/Llama-electronic-radiology-TR")

input_text = "Bulgular: Gruplar arası yaş ve cinsiyet dağılımı açısından istatistiksel olarak anlamlı farklılık saptanmadı."
# ("Findings: No statistically significant difference was found between the groups in terms of age and sex distribution.")

inputs = tokenizer(input_text, return_tensors="pt", add_special_tokens=True).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=1.0,
    top_p=0.95,
    repetition_penalty=1.2,               # discourages looping in long outputs
    eos_token_id=tokenizer.eos_token_id,  # stop cleanly at end-of-sequence
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
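
If you prefer to serve the model without PEFT at inference time, the LoRA weights can be folded into the base model with PEFT's merge_and_unload; the output directory below is arbitrary:

# Merge the LoRA weights into the base model and save a standalone checkpoint.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("llama-radiology-merged")
tokenizer.save_pretrained("llama-radiology-merged")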

Training Details

  • Hours used: 10 hours

Training Data

Dataset: hazal/electronic-radiology-phd-thesis-trR

This dataset contains Turkish-language PhD theses focused on electronic and diagnostic radiology. It was curated for the purpose of training language models in the Turkish medical domain. The documents are academic in tone, rich in domain-specific vocabulary, and structured into medical sections (e.g., materials & methods, results, discussion).

  • Language: Turkish
  • Domain: Electronic Radiology
  • Type: Academic dissertations
  • Preprocessing: The dataset was tokenized and truncated to a maximum sequence length suitable for LLM training; no instruction-style formatting was applied (see the sketch below).

Dataset link: https://huggingface.co/datasets/hazal/electronic-radiology-phd-thesis-trR
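
A minimal sketch of the tokenize-and-truncate preprocessing described above, assuming the dataset exposes a "text" column and using 2048 as a stand-in for the (unpublished) maximum sequence length:

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("unsloth/Llama-3.2-1B")
ds = load_dataset("hazal/electronic-radiology-phd-thesis-trR", split="train")

# Tokenize and truncate each document; drop the raw text columns.
tokenized = ds.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True,
    remove_columns=ds.column_names,
)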

Framework versions

  • PEFT 0.14.0