G2P Multilingual ByT5 Tiny (8 layers) - IPA CHILDES
This model is a sequence-to-sequence model based on Google's ByT5, fine-tuned on the IPA CHILDES (split) dataset to convert graphemes to phonemes over a context of up to 512 tokens for 31 languages.
ByT5 is a tokenizer-free version of Google's T5 and generally follows the architecture of MT5.
ByT5 was pre-trained only on mC4, without any supervised training, using an average span mask of 20 UTF-8 characters; the base model therefore has to be fine-tuned before it is usable on a downstream task.
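As a quick sanity check, the checkpoint can be loaded and its configuration inspected with transformers (a minimal sketch; the checkpoint name is taken from the examples below):
from transformers import T5ForConditionalGeneration
model = T5ForConditionalGeneration.from_pretrained('fdemelo/g2p-multilingual-byt5-tiny-8l-ipa-childes')
print(model.config.num_layers, model.config.num_decoder_layers)  # encoder and decoder depth of this checkpoint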
Language tags
The following language tags can be used for prefixing the model input:
Tag | Language |
---|---|
ca | Catalan |
cy | Welsh |
da | Danish |
de | German |
en-na | English (North America) |
en-uk | English (United Kingdom) |
es | Spanish |
et | Estonian |
eu | Basque |
fa | Persian |
fr | French |
ga | Irish |
hr | Croatian |
hu | Hungarian |
id | Indonesian |
is | Icelandic |
it | Italian |
ja | Japanese |
ko | Korean |
nl | Dutch |
no | Norwegian |
pl | Polish |
pt | Portuguese |
pt-br | Portuguese (Brazil) |
qu | Quechua |
ro | Romanian |
sr | Serbian |
sv | Swedish |
tr | Turkish |
zh | Chinese |
zh-yue | Cantonese |
The tag must be prepended to the prompt as a prefix using the format `<{tag}>:` (e.g., `<pt-br>:`). Note: a space between the prefix colon (`:`) and the beginning of the text is mandatory.
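As a minimal illustration of the prefix format (the make_g2p_prompt helper is only for this sketch and is not part of the model):
def make_g2p_prompt(tag: str, text: str) -> str:
    # "<{tag}>:" followed by the mandatory space, then the text to transcribe
    return f"<{tag}>: {text}"

print(make_g2p_prompt("pt-br", "A vida é como uma caixa de chocolates."))
# '<pt-br>: A vida é como uma caixa de chocolates.'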
Example 1: inference with tokenizer
For batched inference and training, it is recommended to use a tokenizer class to handle padding, truncation, and added tokens:
from transformers import T5ForConditionalGeneration, AutoTokenizer
model = T5ForConditionalGeneration.from_pretrained('fdemelo/g2p-multilingual-byt5-tiny-8l-ipa-childes')
tokenizer = AutoTokenizer.from_pretrained('fdemelo/g2p-multilingual-byt5-tiny-8l-ipa-childes')
model_inputs = tokenizer(["<en-na>: Life is like a box of chocolates."], max_length=512, padding=True, truncation=True, add_special_tokens=False, return_tensors="pt")
preds = model.generate(**model_inputs, num_beams=1, max_length=512) # We do not find beam search helpful. Greedy decoding is enough.
phones = tokenizer.batch_decode(preds.tolist(), skip_special_tokens=True)
print(phones)
# ['laɪf ɪz laɪk ʌ bɑks ʌv t̠ʃɑkləts']
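The same tokenizer-based setup extends naturally to a batch mixing several languages, with the tokenizer padding the shorter inputs (an illustrative sketch reusing the model and tokenizer loaded above; the German sentence and its transcription are placeholders, not verified outputs):
batch = [
    "<en-na>: Life is like a box of chocolates.",
    "<de>: Das Leben ist wie eine Schachtel Pralinen.",
]
model_inputs = tokenizer(batch, max_length=512, padding=True, truncation=True, add_special_tokens=False, return_tensors="pt")
preds = model.generate(**model_inputs, num_beams=1, max_length=512)
print(tokenizer.batch_decode(preds.tolist(), skip_special_tokens=True))
# one IPA string per input, e.g. ['laɪf ɪz laɪk ʌ bɑks ʌv t̠ʃɑkləts', '…']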
Example 2: inference without tokenizer
For standalone inference, encoding and decoding without the tokenizer can be done as follows:
import torch
import json
from transformers import T5ForConditionalGeneration
model = T5ForConditionalGeneration.from_pretrained('fdemelo/g2p-multilingual-byt5-tiny-8l-ipa-childes')
input_ids = torch.tensor([list("<en-na>: Life is like a box of chocolates.".encode("utf-8"))]) + 3 # add shift to account for special tokens <pad>, </s>, <unk>
preds = model.generate(input_ids=input_ids, num_beams=1, max_length=512)
# Simplified version of the decoding process (discarding special/added tokens)
with open("tokenizer_config.json", "r") as f:
added_tokens = json.load(f).get("added_tokens_decoder", {})
phone_bytes = [
bytes([token - 3]) for token in preds[0].tolist() if str(token) not in added_tokens
]
phones = b''.join(phone_bytes).decode("utf-8", errors="ignore")
print(phones)
# 'laɪf ɪz laɪk ʌ bɑks ʌv t̠ʃɑkləts'
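The +3 shift mirrors ByT5's vocabulary layout: ids 0, 1 and 2 are reserved for <pad>, </s> and <unk>, so raw UTF-8 byte values 0-255 map to ids 3-258, while the language tags and any other added tokens are filtered out via the added_tokens_decoder lookup above.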