GMY-60m
Introducing GMY-60m, a model capable of handling a diverse range of Linear B translation, transliteration, and correction tasks.
1. Model description
This is an instruct model, meaning it can handle multiple tasks. It is intended primarily for translation and transliteration, but it can also be used for reverse translation.
Translation Instructions:
Note that the prompt phrasing is deliberately awkward (e.g. "Linear B cuneiform") to keep it consistent with related models with which it can be integrated.
- "Translate Linear B cuneiform to English" + Linear B signs → English
- "Translate complex Linear B transliteration to English" + complex transliteration → English
- "Translate Linear B simple transliteration to English" + simple transliteration → English
- "Translate Linear B grouped transliteration to English" + transliteration with special symbols → English
- "Translate English to Linear B cuneiform" + English → Linear B signs
- "Translate English to simple Linear B transliteration" + English → Linear B simple transliteration with no special symbols
- "Translate English to grouped Linear B transliteration" + English → Linear B transliteration grouped into words with special symbols
Transliteration Instructions:
- "Transliterate Linear B cuneiform to simple Latin Characters" + Linear B signs → transliteration with no special symbols
- "Transliterate Linear B cuneiform to grouped Latin characters" + Linear B signs → transliteration with special symbols/subscripts
- "Group Linear B transliteration into likely words" + simple transliteration → transliteration with special symbols/subscripts
Missing Sign Instructions:
- "Identify the missing signs: " + a string of Linear B signs or a transliteration containing missing signs
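For convenience, the prompt prefixes above can be collected into a small mapping. The sketch below is illustrative only; the dictionary and helper name are not part of the model's API, and the prefix strings simply restate the list above (including the trailing colon and space).

# Illustrative mapping of task names to the prompt prefixes listed above.
PROMPTS = {
    "lb_to_en": "Translate Linear B cuneiform to English: ",
    "translit_simple_to_en": "Translate Linear B simple transliteration to English: ",
    "en_to_lb": "Translate English to Linear B cuneiform: ",
    "lb_to_translit_simple": "Transliterate Linear B cuneiform to simple Latin Characters: ",
    "group_translit": "Group Linear B transliteration into likely words: ",
    "missing_signs": "Identify the missing signs: ",
}

def build_prompt(task, text):
    """Prepend the task-specific prefix to the input text."""
    return PROMPTS[task] + text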
Base model
This is a fine-tuned version of Google's t5-small.
2. Usage (code snippet)
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
model_path = "thalesian/GMY-60m"
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)
# 1) Prepare your Linear B input
prompt = "Translate Linear B cuneiform to English: "
input_text = "𐀪 𐀍 𐀜 𐂆 𐂝 * * 𐀜 𐀃 𐂆 𐄒 𐀃 𐂝 𐄈 *"
# 2) Tokenize & get model outputs
inputs = tokenizer(prompt + input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=64)
# 3) Decode prediction
prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Reference:", "ewe wool * * in the month of lapatos o ewe o wool *")
print("Prediction:", prediction)
3. Training and evaluation data
Training data comes from the linearb.xyz corpus project. More information on the training data, as well as the test and validation splits, can be found in both the GitHub repository and the published methodology.
Training procedure
The model was trained in 5 tranches with different datasets and collators:
- a pretraining and training dataset of Linear B and transliterated data (5,537 texts)
And 3 different collation methods (a sketch follows this list):
- pretraining collation, which introduces an asterisk to represent missing signs
- missing-sign collation, which randomly introduces asterisks to represent missing signs
- translation-error collation, which randomly introduces an incorrect sign into the input data to simulate transliteration or glyph errors
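As a rough illustration of how the missing-sign collation could work, the sketch below randomly masks whitespace-separated signs with an asterisk before tokenization. This is an assumption-laden sketch, not the training code used for GMY-60m; the actual collator and mask rate may differ.

import random

def mask_missing_signs(signs, mask_rate=0.15, seed=None):
    """Randomly replace whitespace-separated signs with '*' to simulate missing signs.
    Illustrative only; not the project's actual collator."""
    rng = random.Random(seed)
    tokens = signs.split()
    masked = [t if rng.random() > mask_rate else "*" for t in tokens]
    return " ".join(masked)

# Example: corrupt an input line, then ask the model to restore the missing signs
corrupted = mask_missing_signs("𐀪 𐀍 𐀜 𐂆 𐂝", mask_rate=0.2, seed=0)
restore_prompt = "Identify the missing signs: " + corrupted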
Final stage training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 128
- eval_batch_size: 128
- seed: 42
- distributed_type: multi-GPU
- num_devices: 2
- total_train_batch_size: 256
- total_eval_batch_size: 256
- optimizer: ADAMW_APEX_FUSED with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 5000
- num_epochs: 200
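A Seq2SeqTrainingArguments configuration reflecting these values might look like the sketch below. This is an assumption, not the exact training script: the output path is a placeholder, the batch sizes are per device (two devices give the total of 256), and the fused APEX optimizer requires NVIDIA Apex to be installed.

from transformers import Seq2SeqTrainingArguments

# Sketch of the final-stage settings listed above (placeholder output_dir).
training_args = Seq2SeqTrainingArguments(
    output_dir="gmy-60m-final",           # placeholder path
    learning_rate=5e-5,
    per_device_train_batch_size=128,      # x2 devices = 256 total
    per_device_eval_batch_size=128,
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=5000,
    num_train_epochs=200,
    optim="adamw_apex_fused",             # requires NVIDIA Apex
)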
Framework versions
- Transformers 4.50.3
- PyTorch 2.6.0+cu126
- Datasets 3.3.0
- Tokenizers 0.21.1
4. Metrics
From Language | From Script | To Language | To Script | BLEU |
---|---|---|---|---|
Mycenaean Greek | Linear B | English | Latin | 55.55 |
Mycenaean Greek | Transliteration | English | Latin | 54.63 |
Mycenaean Greek | Linear B | Mycenaean Greek | Transliteration | 82.52 |
English | Latin | Mycenaean Greek | Transliteration | 29.20 |
English | Latin | Mycenaean Greek | Linear B | 30.79 |
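The BLEU values above are self-reported. A comparable score for a set of predictions and references can be computed with, for example, the sacreBLEU metric from the evaluate library; the strings below are illustrative placeholders, not the actual test set.

import evaluate

# Illustrative BLEU computation; substitute the full test-set predictions/references.
sacrebleu = evaluate.load("sacrebleu")
predictions = ["ewe wool * * in the month of lapatos o ewe o wool *"]
references = [["ewe wool * * in the month of lapatos o ewe o wool *"]]
score = sacrebleu.compute(predictions=predictions, references=references)
print(score["score"])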
5. Intended uses
- Short Linear B lines, transliteration pipelines, and reverse-lookup experiments.
6. Limitations
- The context window is only 64 tokens; the model is untested on long passages.
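Given the 64-token context window, a quick length check before generation can catch inputs that would otherwise be silently truncated. This is a hedged sketch reusing the tokenizer, prompt, and input_text from the usage snippet above, not part of the model's official usage.

# Warn when an input would exceed the 64-token context window.
encoded = tokenizer(prompt + input_text, return_tensors="pt")
if encoded["input_ids"].shape[1] > 64:
    print("Warning: input exceeds 64 tokens and will likely be truncated; "
          "consider splitting the text into shorter lines.")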
7. How to Cite
@misc{drake2025gmy60m,
title = {{GMY-60m}: A T5-Small for Linear B⇄English},
author = {Drake, B. Lee},
year = {2025},
howpublished = {\url{https://huggingface.co/thalesian/GMY-60m}}
}