G2P Multilingual ByT5 Tiny (8 layers) - IPA CHILDES

This model is a sequence-to-sequence model based on Google's ByT5, fine-tuned on the IPA CHILDES (split) dataset to convert graphemes to phonemes over a context of 512 tokens for 31 languages.

ByT5 is a tokenizer-free version of Google's T5 and generally follows the architecture of MT5.

ByT5 was pre-trained only on mC4, with no supervised training, using an average span mask of 20 UTF-8 characters. The base model therefore has to be fine-tuned before it is usable on a downstream task.

Language tags

The following language tags can be used for prefixing the model input:

| Tag | Language |
|-----|----------|
| ca | Catalan |
| cy | Welsh |
| da | Danish |
| de | German |
| en-na | English (North America) |
| en-uk | English (United Kingdom) |
| es | Spanish |
| et | Estonian |
| eu | Basque |
| fa | Persian |
| fr | French |
| ga | Irish |
| hr | Croatian |
| hu | Hungarian |
| id | Indonesian |
| is | Icelandic |
| it | Italian |
| ja | Japanese |
| ko | Korean |
| nl | Dutch |
| no | Norwegian |
| pl | Polish |
| pt | Portuguese |
| pt-br | Portuguese (Brazil) |
| qu | Quechua |
| ro | Romanian |
| sr | Serbian |
| sv | Swedish |
| tr | Turkish |
| zh | Chinese |
| zh-yue | Cantonese |

The tag must be prepended to the prompt as a prefix in the format <{tag}>: (e.g., <pt-br>: ). Note: the space between the prefix colon (:) and the beginning of the text is mandatory.
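The prefix format above can be built with a small helper; note that g2p_prompt is a hypothetical name used here for illustration, not part of the model's API:

```python
def g2p_prompt(tag: str, text: str) -> str:
    """Prepend the language tag in the required '<{tag}>: ' format.
    The space after the colon is mandatory."""
    return f"<{tag}>: {text}"

print(g2p_prompt("pt-br", "Bom dia"))  # -> '<pt-br>: Bom dia'
```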

Example 1: inference with tokenizer

For batched inference and training, it is recommended to use a tokenizer class to handle padding, truncation, and additional tokens:

from transformers import T5ForConditionalGeneration, AutoTokenizer

model = T5ForConditionalGeneration.from_pretrained('fdemelo/g2p-multilingual-byt5-tiny-8l-ipa-childes')
tokenizer = AutoTokenizer.from_pretrained('fdemelo/g2p-multilingual-byt5-tiny-8l-ipa-childes')

model_inputs = tokenizer(["<en-na>: Life is like a box of chocolates."], max_length=512, padding=True, truncation=True, add_special_tokens=False, return_tensors="pt")
preds = model.generate(**model_inputs, num_beams=1, max_length=512)  # We do not find beam search helpful; greedy decoding is enough.
phones = tokenizer.batch_decode(preds.tolist(), skip_special_tokens=True)
print(phones)
# ['laɪf ɪz laɪk ʌ bɑks ʌv t̠ʃɑkləts']

Example 2: inference without tokenizer

For standalone inference, encoding and decoding without the tokenizer read as follows:

import torch
import json
from transformers import T5ForConditionalGeneration
model = T5ForConditionalGeneration.from_pretrained('fdemelo/g2p-multilingual-byt5-tiny-8l-ipa-childes')
input_ids = torch.tensor([list("<en-na>: Life is like a box of chocolates.".encode("utf-8"))]) + 3  # add shift to account for special tokens <pad>, </s>, <unk>
preds = model.generate(input_ids=input_ids, num_beams=1, max_length=512)
# Simplified version of the decoding process (discarding special/added tokens)
with open("tokenizer_config.json", "r") as f:
    added_tokens = json.load(f).get("added_tokens_decoder", {})
phone_bytes = [
    bytes([token - 3]) for token in preds[0].tolist() if str(token) not in added_tokens
]
phones = b''.join(phone_bytes).decode("utf-8", errors="ignore")
print(phones)
# 'laɪf ɪz laɪk ʌ bɑks ʌv t̠ʃɑkləts'
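The +3 shift used above reflects ByT5's byte-level encoding convention: each token id is the UTF-8 byte value plus 3, with ids 0-2 reserved for the special tokens <pad>, </s>, and <unk>. A minimal sketch of the round trip, independent of the model:

```python
# Token id = UTF-8 byte value + 3 (ids 0-2 are reserved for <pad>, </s>, <unk>).
text = "<en-na>: hi"
ids = [b + 3 for b in text.encode("utf-8")]

# Round trip: subtract the shift and decode the bytes back to the string.
decoded = bytes(i - 3 for i in ids).decode("utf-8")
assert decoded == text

print(ids[:4])  # -> [63, 104, 113, 48]
```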
