Model Card for tsilva/clinical-field-mapper-causal_lm

This model is a fine-tuned version of distilbert/distilgpt2 on the tsilva/clinical-field-mappings dataset. Its purpose is to normalize healthcare database column names to a standardized set of target column names.

Task

This is a causal language model designed to map free-text field names to standardized schema terms.

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("tsilva/clinical-field-mapper-causal_lm")
model = AutoModelForCausalLM.from_pretrained("tsilva/clinical-field-mapper-causal_lm")

def predict(input_text):
    # Append the "|" separator so the model completes the standardized field name.
    inputs = tokenizer(input_text + "|", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=50)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

predict('cardi@')
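
The decoded output includes the prompt itself. If you only need the standardized field name, one option is to split on the "|" separator, as in this minimal sketch (predict_field is a hypothetical helper, not part of the released code):

def predict_field(input_text):
    inputs = tokenizer(input_text + "|", return_tensors="pt")
    # pad_token_id is set explicitly because GPT-2-style tokenizers have no pad token by default.
    outputs = model.generate(**inputs, max_new_tokens=50, pad_token_id=tokenizer.eos_token_id)
    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Keep only the text generated after the separator; assumes a single "|" in the prompt.
    return decoded.split("|", 1)[1].strip()

print(predict_field("cardi@"))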

Evaluation Results

  • train accuracy: 98.24%
  • validation accuracy: 89.84%
  • test accuracy: 89.35%
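
The card does not state how accuracy is computed; a plausible reading is exact match between the generated name and the reference target. The sketch below shows how such a check could be run, reusing the predict_field helper from the Usage section; the split and column names ("test", "input", "target") are assumptions, so check the dataset card before using it.

from datasets import load_dataset

# Split and column names are assumptions; adjust to the actual dataset schema.
dataset = load_dataset("tsilva/clinical-field-mappings", split="test")

correct = sum(
    predict_field(row["input"]) == row["target"]  # predict_field defined in the Usage section
    for row in dataset
)
print(f"exact-match accuracy: {correct / len(dataset):.4f}")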

Training Details

  • Seed: 42
  • Epochs scheduled: 50
  • Epochs completed: 14
  • Early stopping triggered: Yes
  • Final training loss: 1.3344
  • Final evaluation loss: 1.1981
  • Optimizer: adamw_bnb_8bit
  • Learning rate: 0.0005
  • Batch size: 512
  • Precision: fp16
  • DeepSpeed enabled: True
  • Gradient accumulation steps: 1
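
The training script is not included in this card; the sketch below only shows how the hyperparameters listed above might map onto transformers TrainingArguments. The output directory, DeepSpeed config path, and evaluation/saving strategies are assumptions.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="clinical-field-mapper-causal_lm",  # hypothetical output path
    seed=42,
    num_train_epochs=50,                 # scheduled; early stopping ended training after 14 epochs
    learning_rate=5e-4,
    per_device_train_batch_size=512,
    gradient_accumulation_steps=1,
    fp16=True,
    optim="adamw_bnb_8bit",
    deepspeed="ds_config.json",          # hypothetical path; the actual DeepSpeed config is not published
    eval_strategy="epoch",               # assumed; older transformers versions call this evaluation_strategy
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",   # assumed criterion for early stopping
)

These arguments would typically be passed to a Trainer together with an EarlyStoppingCallback; the patience value used for early stopping is not reported.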

License

Specify your license here (e.g., Apache 2.0, MIT, etc.)

Limitations and Bias

  • The model was trained only on the tsilva/clinical-field-mappings dataset, so it reflects that dataset's naming conventions.
  • Performance may degrade on out-of-distribution column names.
  • Validate model outputs before relying on them in production environments.
