How to do lemmatization?

#1
by felipemv - opened

Hi, I've seen your models at https://huggingface.co/bowphs .

However, I'm at a loss as to how to do lemmatization. For example, consider this text:

ποικιλόθρον’ ἀθανάτ’ Ἀφρόδιτα,
παῖ Δίος δολόπλοκε, λίσσομαί σε,
μή μ’ ἄσαισι μηδ’ ὀνίαισι δάμνα,
πότνια, θῦμον,

I'm a student of Ancient Greek and also a programmer (for context).

I have cross-posted this here: https://github.com/bowphs/SIGTYP-2024-hierarchical-transformers/issues/1

PS: thanks for your work :bow:

To make it super clear:

# https://github.com/bowphs/SIGTYP-2024-hierarchical-transformers/blob/main/lemmatization.py
from transformers import AutoTokenizer, T5ForConditionalGeneration

model = "bowphs/GreTa"
tokenizer = AutoTokenizer.from_pretrained(model)
model = T5ForConditionalGeneration.from_pretrained(model)

# inputs = ["lemmatize: ἀγαθή"]

inputs = ["lemmatize: <t_tok_beg> κατέβην <t_tok_end>"]

print("Tokenizing inputs...")
# Tokenize
encodings = tokenizer(inputs, return_tensors="pt", padding=False)

# Remove token_type_ids if present (T5 doesn't use them)
if "token_type_ids" in encodings:
    del encodings["token_type_ids"]

print("Generating outputs...")
# Generate
outputs = model.generate(**encodings)

print("Decoding outputs...")
# Decode
lemmas = tokenizer.batch_decode(outputs, skip_special_tokens=True)

print("Results:")
print(lemmas)

Results in:

Tokenizing inputs...
Generating outputs...
Decoding outputs...
Results:
[':  :  : > κατέβην <t']
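
For what it's worth, here is a quick check I'd run (just my guess at the cause, not something confirmed anywhere) to see whether the tokenizer treats the markers as special tokens or splits them into sub-word pieces, which would explain the stray "<t" fragment in the output:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bowphs/GreTa")

# How are the markers actually segmented?
print(tokenizer.tokenize("<t_tok_beg> κατέβην <t_tok_end>"))

# Which extra special tokens does the tokenizer declare, if any?
print(tokenizer.additional_special_tokens)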

Thanks for your question! I've provided a response to this same issue on GitHub at https://github.com/bowphs/SIGTYP-2024-hierarchical-transformers/issues/1, including a minimal code example for lemmatization.

To keep discussions organized, I'm closing this issue in favor of the GitHub thread. Please feel free to continue the conversation there if you have any follow-up questions!
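
In the meantime, here is a rough sketch of the kind of generation call involved, using the plain "lemmatize:" prefix from the commented-out line in your snippet. Treat it as a placeholder rather than the reference usage; the exact prompt format (prefix and any token markers) is what the GitHub example pins down:

from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "bowphs/GreTa"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Placeholder prompt format: one word per input, prefixed with "lemmatize:"
words = ["ἀθανάτ’", "δολόπλοκε", "λίσσομαί", "κατέβην"]
inputs = [f"lemmatize: {w}" for w in words]

encodings = tokenizer(inputs, return_tensors="pt", padding=True)
encodings.pop("token_type_ids", None)  # T5 does not use token type ids

outputs = model.generate(**encodings, max_new_tokens=16)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))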

bowphs changed discussion status to closed
