Extracting timestamps
How can we get timestamps when running inference with the ONNX model?
Quick gist showing how this works. I am not sure if I am loading the voice correctly (it's a quick hack), so double-check the audio samples. https://gist.github.com/fagenorn/d4aa16704541370d9b9d5f91f1f07b34
@fagenorn
Thanks! I'm wondering if it's possible to do it without token objects, e.g. just with a phoneme string with whitespace, or with token IDs.
I would like to use it in the kokoro-onnx package, which uses espeak-ng most of the time (not misaki). It would also be useful for running Kokoro in Rust etc., since misaki works only with Python.
It is possible; I am doing this in C#, since misaki doesn't exist there either. You would have to split your sentence into tokens, which can be done with various NLP libraries. Misaki uses spaCy for this, I use OpenNLP, etc.:
nlp = spacy.load(model_name)
doc = nlp(text)
tokens = [MyTokens(text=t.text, tag=t.tag_, whitespace=t.whitespace_) for t in doc]
tag isn't needed, but it can be useful if you want to phonemize correctly based on POS (homographs). Otherwise you can just phonemize the individual words one by one via espeak, or maybe even try phonemizing the whole sentence directly, though I'm not sure the timestamps will always still align. When sending them to Kokoro, merge them with whitespace if done one-by-one: result = ''.join(t.phonemes + t.whitespace for t in tokens)
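Without spaCy, the per-word merge can be sketched like this (a minimal standalone example; the phoneme strings below are made up for illustration, in practice they would come from espeak-ng or misaki, one word at a time):

```python
from dataclasses import dataclass

@dataclass
class Token:
    text: str        # original word
    phonemes: str    # phonemized form (placeholder values below)
    whitespace: str  # trailing whitespace captured by the tokenizer

# Hypothetical per-word phonemizations (not real espeak-ng output).
tokens = [
    Token("tell", "tˈɛl", " "),
    Token("me",   "mˌi",  " "),
    Token("a",    "ɐ",    " "),
    Token("joke", "dʒˈoʊk", ""),
]

# Merge per-word phonemes back into one string, preserving whitespace,
# so phoneme positions line up with the model's duration outputs.
result = "".join(t.phonemes + t.whitespace for t in tokens)
print(result)  # → "tˈɛl mˌi ɐ dʒˈoʊk"
```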
@fagenorn
I'm trying to port the code to C++; here is the code and its output:
struct TokenTiming {
    std::string text;
    float start;
    float end;
};

std::vector<TokenTiming> joinTimestamps(const std::string &sentence, const std::vector<float> &durations) {
    const float MAGIC_DIVISOR = 80.0f;
    std::vector<TokenTiming> mapping;
    std::istringstream iss(sentence);
    std::vector<std::string> tokens;
    std::string token;
    while (iss >> token) {
        tokens.push_back(token);
    }
    if (durations.size() < 3 || tokens.empty() || durations.size() < tokens.size() + 2) {
        std::cerr << "[joinTimestamps] Not enough duration values to map tokens." << std::endl;
        return mapping;
    }
    float left = 2 * std::max(0.0f, durations[0] - 3.0f);
    float right = left;
    size_t i = 1;
    for (const auto &t : tokens) {
        if (i >= durations.size() - 1) break;
        float token_dur = durations[i];  // one duration per word
        float start_ts = left / MAGIC_DIVISOR;
        right = left + 2 * token_dur;
        float end_ts = right / MAGIC_DIVISOR;
        mapping.push_back({t, start_ts, end_ts});
        left = right;
        i++;
    }
    return mapping;
}
output:
[0.238062 - 0.302692] tell
[0.302692 - 0.357112] me
[0.357112 - 0.407653] a
[0.407653 - 0.457869] joke
and the reference pocketsphinx alignment:
[0.230 - 0.510] tell
[0.510 - 0.690] me
[0.690 - 0.800] a
[0.800 - 1.360] joke
There seems to be a lot of deviation; could you please help me with it?
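Putting the two alignments side by side (values copied from the outputs above), the end-time drift grows over the utterance:

```python
# End times from my C++ port vs. the pocketsphinx reference alignment.
ours = {"tell": (0.238062, 0.302692), "me": (0.302692, 0.357112),
        "a": (0.357112, 0.407653), "joke": (0.407653, 0.457869)}
ref  = {"tell": (0.230, 0.510), "me": (0.510, 0.690),
        "a": (0.690, 0.800), "joke": (0.800, 1.360)}

for word in ours:
    drift = ref[word][1] - ours[word][1]
    print(f"{word:>4}: end-time drift {drift:+.3f} s")
# drift grows from roughly +0.21 s on "tell" to +0.90 s on "joke"
```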
@CapMcTavish
Main issues in your C++ port:
- The Python version uses spaCy for tokenization (instead of assuming each whitespace-separated word is one token).
- Timing calculation: the original uses a more complex algorithm, with whitespace handling and different accumulation logic (it sums one duration per phoneme, not one per word).
Try something like this (not tested):
struct TokenTiming {
    std::string text;
    float start;
    float end;
};

std::vector<TokenTiming> joinTimestamps(const std::string& sentence, const std::vector<float>& durations) {
    std::vector<TokenTiming> mapping;
    std::istringstream iss(sentence);
    std::vector<std::string> tokens;
    std::string token;
    while (iss >> token) {
        tokens.push_back(token);
    }
    if (durations.size() < 3 || tokens.empty()) {
        std::cerr << "[joinTimestamps] Not enough duration values." << std::endl;
        return mapping;
    }
    const float TIME_DIVISOR = 80.0f;
    // Boundary timing from the leading padding token
    float leftTime = 2 * std::max(0.0f, durations[0] - 3.0f);
    float rightTime = leftTime;
    // Start processing after the boundary token
    size_t timingIndex = 1;
    for (const auto& t : tokens) {
        // One duration per phoneme/character of this token
        size_t endIndex = timingIndex + t.length();
        if (endIndex >= durations.size()) {
            continue;
        }
        float startTime = rightTime / TIME_DIVISOR;
        // Sum durations for all phonemes in this token
        float tokenDuration = 0.0f;
        for (size_t i = timingIndex; i < endIndex; i++) {
            tokenDuration += durations[i];
        }
        // Handle whitespace after the token (assuming a space between tokens).
        // In the original, spaCy handles tokenization, where each token
        // doesn't necessarily have a space after it.
        float spaceDuration = durations[endIndex];
        leftTime = rightTime + 2 * tokenDuration + spaceDuration;
        float endTime = leftTime / TIME_DIVISOR;
        rightTime = leftTime + spaceDuration;
        mapping.push_back({t, startTime, endTime});
        // Advance to the next token's timing; +1 for the whitespace
        timingIndex = endIndex + 1;
    }
    return mapping;
}
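To experiment with the accumulation outside C++, the same logic can be sketched in Python (a rough translation, not tested against the real model; the duration values in the usage example are made up):

```python
TIME_DIVISOR = 80.0

def join_timestamps(words, durations):
    """Map each word to a (word, start, end) tuple in seconds.

    Assumes one duration per phoneme/character, plus a boundary
    duration at the start and a whitespace duration between words.
    """
    mapping = []
    # Boundary timing from the first (padding) duration.
    left = right = 2 * max(0.0, durations[0] - 3.0)
    idx = 1  # skip the leading boundary token
    for word in words:
        end = idx + len(word)
        if end >= len(durations):
            break
        start_ts = right / TIME_DIVISOR
        token_dur = sum(durations[idx:end])  # phonemes of this word
        space_dur = durations[end]           # whitespace after the word
        left = right + 2 * token_dur + space_dur
        mapping.append((word, start_ts, left / TIME_DIVISOR))
        right = left + space_dur
        idx = end + 1  # step past the whitespace duration
    return mapping
```

Calling it with made-up frame counts, e.g. `join_timestamps(["hi"], [3.0, 4.0, 5.0, 2.0])`, yields `[("hi", 0.0, 0.25)]`.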
@shannu122
Yes, this is possible. Check out the gist I provided: https://gist.github.com/fagenorn/d4aa16704541370d9b9d5f91f1f07b34
It returns a list of tokens with their individual phonemes and timings.
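Roughly, such a result can be consumed like this (the field names and values here are assumptions for illustration; check the gist for the actual structure):

```python
from dataclasses import dataclass

@dataclass
class TimedToken:
    text: str      # original word (assumed field name)
    phonemes: str  # phonemes for this word (assumed)
    start: float   # start time in seconds (assumed)
    end: float     # end time in seconds (assumed)

# Made-up values in the spirit of the outputs earlier in the thread.
result = [
    TimedToken("tell", "tˈɛl", 0.24, 0.51),
    TimedToken("me", "mˌi", 0.51, 0.69),
]

for tok in result:
    print(f"[{tok.start:.2f} - {tok.end:.2f}] {tok.text}")
```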
Is it possible to insert pauses into the text, to create pauses at specific times?