G2P Multilingual ByT5 Tiny (8 layers) - IPA CHILDES
This model is a sequence-to-sequence model based on Google's ByT5, fine-tuned on the IPA CHILDES (split) dataset to convert graphemes to phonemes over a context of up to 512 tokens for 31 languages.
ByT5 is a tokenizer-free version of Google's T5 and generally follows the architecture of MT5.
ByT5 was pre-trained only on mC4, without any supervised training, using an average span mask of 20 UTF-8 characters; the base model therefore has to be fine-tuned before it is usable on a downstream task.
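As a quick sanity check, the checkpoint can be loaded and its configuration inspected with transformers (a minimal sketch; the checkpoint name is taken from the examples below):
from transformers import T5ForConditionalGeneration
model = T5ForConditionalGeneration.from_pretrained('fdemelo/g2p-multilingual-byt5-tiny-8l-ipa-childes')
print(model.config.num_layers, model.config.num_decoder_layers)  # encoder and decoder depth of this checkpoint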
Language tags
The following language tags can be used for prefixing the model input:
Tag | Language |
---|---|
ca | Catalan |
cy | Welsh |
da | Danish |
de | German |
en-na | English (North America) |
en-uk | English (United Kingdom) |
es | Spanish |
et | Estonian |
eu | Basque |
fa | Persian |
fr | French |
ga | Irish |
hr | Croatian |
hu | Hungarian |
id | Indonesian |
is | Icelandic |
it | Italian |
ja | Japanese |
ko | Korean |
nl | Dutch |
no | Norwegian |
pl | Polish |
pt | Portuguese |
pt-br | Portuguese (Brazil) |
qu | Quechua |
ro | Romanian |
sr | Serbian |
sv | Swedish |
tr | Turkish |
zh | Chinese |
zh-yue | Cantonese |
The tag must be prepended to the prompt as a prefix using the format `<{tag}>:` (e.g., `<pt-br>:`). Note: a space between the prefix colon (`:`) and the beginning of the text is mandatory.
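As a minimal illustration of the prefix format (the make_g2p_prompt helper is only for this sketch and is not part of the model):
def make_g2p_prompt(tag: str, text: str) -> str:
    # "<{tag}>:" followed by the mandatory space, then the text to transcribe
    return f"<{tag}>: {text}"

print(make_g2p_prompt("pt-br", "A vida é como uma caixa de chocolates."))
# '<pt-br>: A vida é como uma caixa de chocolates.'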
Example 1: inference with tokenizer
For batched inference and training, it is recommended to use a tokenizer class to handle padding, truncation, and added tokens:
from transformers import T5ForConditionalGeneration, AutoTokenizer
model = T5ForConditionalGeneration.from_pretrained('fdemelo/g2p-multilingual-byt5-tiny-8l-ipa-childes')
tokenizer = AutoTokenizer.from_pretrained('fdemelo/g2p-multilingual-byt5-tiny-8l-ipa-childes')
model_inputs = tokenizer(["<en-na>: Life is like a box of chocolates."], max_length=512, padding=True, truncation=True, add_special_tokens=False, return_tensors="pt")
preds = model.generate(**model_inputs, num_beams=1, max_length=512) # We do not find beam search helpful. Greedy decoding is enough.
phones = tokenizer.batch_decode(preds.tolist(), skip_special_tokens=True)
print(phones)
# ['laɪf ɪz laɪk ʌ bɑks ʌv t̠ʃɑkləts']
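The same tokenizer-based setup extends naturally to a batch mixing several languages, with the tokenizer padding the shorter inputs (an illustrative sketch reusing the model and tokenizer loaded above; the German sentence and its transcription are placeholders, not verified outputs):
batch = [
    "<en-na>: Life is like a box of chocolates.",
    "<de>: Das Leben ist wie eine Schachtel Pralinen.",
]
model_inputs = tokenizer(batch, max_length=512, padding=True, truncation=True, add_special_tokens=False, return_tensors="pt")
preds = model.generate(**model_inputs, num_beams=1, max_length=512)
print(tokenizer.batch_decode(preds.tolist(), skip_special_tokens=True))
# one IPA string per input, e.g. ['laɪf ɪz laɪk ʌ bɑks ʌv t̠ʃɑkləts', '…']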
Example 2: inference without tokenizer
For standalone inference, encoding and decoding without the tokenizer can be done as follows:
import torch
import json
from transformers import T5ForConditionalGeneration
model = T5ForConditionalGeneration.from_pretrained('fdemelo/g2p-multilingual-byt5-tiny-8l-ipa-childes')
input_ids = torch.tensor([list("<en-na>: Life is like a box of chocolates.".encode("utf-8"))]) + 3 # add shift to account for special tokens <pad>, </s>, <unk>
preds = model.generate(input_ids=input_ids, num_beams=1, max_length=512)
# Simplified version of the decoding process (discarding special/added tokens)
with open("tokenizer_config.json", "r") as f:
added_tokens = json.load(f).get("added_tokens_decoder", {})
phone_bytes = [
bytes([token - 3]) for token in preds[0].tolist() if str(token) not in added_tokens
]
phones = b''.join(phone_bytes).decode("utf-8", errors="ignore")
print(phones)
# 'laɪf ɪz laɪk ʌ bɑks ʌv t̠ʃɑkləts'
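The +3 shift mirrors ByT5's vocabulary layout: ids 0, 1 and 2 are reserved for <pad>, </s> and <unk>, so raw UTF-8 byte values 0-255 map to ids 3-258, while the language tags and any other added tokens are filtered out via the added_tokens_decoder lookup above.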