How to do lemmatization?

#1
by felipemv - opened

Hi, I've seen your models at https://huggingface.co/bowphs .

However, I'm at a loss as to how to do lemmatization. For example, consider this text:

ποικιλόθρον’ ἀθανάτ’ Ἀφρόδιτα,
παῖ Δίος δολόπλοκε, λίσσομαί σε,
μή μ’ ἄσαισι μηδ’ ὀνίαισι δάμνα,
πότνια, θῦμον,

I'm a student of Ancient Greek and also a programmer (for context).

I have cross-posted this here: https://github.com/bowphs/SIGTYP-2024-hierarchical-transformers/issues/1

PS: thanks for your work :bow:

To make it super clear:

# https://github.com/bowphs/SIGTYP-2024-hierarchical-transformers/blob/main/lemmatization.py
from transformers import AutoTokenizer, T5ForConditionalGeneration

model = "bowphs/GreTa"
tokenizer = AutoTokenizer.from_pretrained(model)
model = T5ForConditionalGeneration.from_pretrained(model)

# inputs = ["lemmatize: ἀγαθή"]

inputs = ["lemmatize: <t_tok_beg> κατέβην <t_tok_end>"]

print("Tokenizing inputs...")
# Tokenize
encodings = tokenizer(inputs, return_tensors="pt", padding=False)

# Remove token_type_ids if present (T5 doesn't use them)
if "token_type_ids" in encodings:
    del encodings["token_type_ids"]

print("Generating outputs...")
# Generate
outputs = model.generate(**encodings)

print("Decoding outputs...")
# Decode
lemmas = tokenizer.batch_decode(outputs, skip_special_tokens=True)

print("Results:")
print(lemmas)

Results in:

Tokenizing inputs...
Generating outputs...
Decoding outputs...
Results:
[':  :  : > κατέβην <t']
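
For what it's worth, here is a quick check I'd run (just my guess at the cause, not something confirmed anywhere) to see whether the tokenizer treats the markers as special tokens or splits them into sub-word pieces, which would explain the stray "<t" fragment in the output:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bowphs/GreTa")

# How are the markers actually segmented?
print(tokenizer.tokenize("<t_tok_beg> κατέβην <t_tok_end>"))

# Which extra special tokens does the tokenizer declare, if any?
print(tokenizer.additional_special_tokens)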

Thanks for your question! I've provided a response to this same issue on GitHub at https://github.com/bowphs/SIGTYP-2024-hierarchical-transformers/issues/1, including a minimal code example for lemmatization.

To keep discussions organized, I'm closing this issue in favor of the GitHub thread. Please feel free to continue the conversation there if you have any follow-up questions!
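
In the meantime, here is a rough sketch of the kind of generation call involved, using the plain "lemmatize:" prefix from the commented-out line in your snippet. Treat it as a placeholder rather than the reference usage; the exact prompt format (prefix and any token markers) is what the GitHub example pins down:

from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "bowphs/GreTa"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Placeholder prompt format: one word per input, prefixed with "lemmatize:"
words = ["ἀθανάτ’", "δολόπλοκε", "λίσσομαί", "κατέβην"]
inputs = [f"lemmatize: {w}" for w in words]

encodings = tokenizer(inputs, return_tensors="pt", padding=True)
encodings.pop("token_type_ids", None)  # T5 does not use token type ids

outputs = model.generate(**encodings, max_new_tokens=16)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))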

bowphs changed discussion status to closed
