Extracting timestamps
How can we get timestamps when running inference with the ONNX model?
Quick gist showing how this works. I am not sure if I am loading the voice correctly (it's a quick hack), so double-check the audio samples. https://gist.github.com/fagenorn/d4aa16704541370d9b9d5f91f1f07b34
@fagenorn
Thanks! I'm wondering if it's possible to do it without token objects, e.g. just with a phoneme string with whitespace, or with token IDs.
I would like to use it in the kokoro-onnx package, which uses espeak-ng most of the time (not misaki). It would also be useful for running Kokoro in Rust etc., since misaki works only with Python.
It is possible; I am doing this in C#, since misaki doesn't exist there either. You would have to split your sentence into tokens, which can be done with various NLP libraries. Misaki uses spaCy for this, I use OpenNLP, etc.:
nlp = spacy.load(model_name)
doc = nlp(text)
tokens = [MyTokens(text=t.text, tag=t.tag_, whitespace=t.whitespace_) for t in doc]
tag isn't needed, but it can be useful if you want to phonemize correctly based on POS (homographs). Otherwise you can just phonemize the individual words one by one via espeak, or maybe even try phonemizing the whole sentence directly, though I'm not sure the timestamps will always still align. When sending them to Kokoro, merge them with whitespace if done one-by-one: result = ''.join(t.phonemes + t.whitespace for t in tokens)
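Without spaCy, the per-word merge can be sketched like this (a minimal standalone example; the phoneme strings below are made up for illustration, in practice they would come from espeak-ng or misaki, one word at a time):

```python
from dataclasses import dataclass

@dataclass
class Token:
    text: str        # original word
    phonemes: str    # phonemized form (placeholder values below)
    whitespace: str  # trailing whitespace captured by the tokenizer

# Hypothetical per-word phonemizations (not real espeak-ng output).
tokens = [
    Token("tell", "tˈɛl", " "),
    Token("me",   "mˌi",  " "),
    Token("a",    "ɐ",    " "),
    Token("joke", "dʒˈoʊk", ""),
]

# Merge per-word phonemes back into one string, preserving whitespace,
# so phoneme positions line up with the model's duration outputs.
result = "".join(t.phonemes + t.whitespace for t in tokens)
print(result)  # → "tˈɛl mˌi ɐ dʒˈoʊk"
```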
@fagenorn
I'm trying to port the code to C++; here is the code and its output:
struct TokenTiming {
    std::string text;
    float start;
    float end;
};

std::vector<TokenTiming> joinTimestamps(const std::string &sentence, const std::vector<float> &durations) {
    const float MAGIC_DIVISOR = 80.0f;
    std::vector<TokenTiming> mapping;
    std::istringstream iss(sentence);
    std::vector<std::string> tokens;
    std::string token;
    while (iss >> token) {
        tokens.push_back(token);
    }
    if (durations.size() < 3 || tokens.empty() || durations.size() < tokens.size() + 2) {
        std::cerr << "[joinTimestamps] Not enough duration values to map tokens." << std::endl;
        return mapping;
    }
    float left = 2 * std::max(0.0f, durations[0] - 3.0f);
    float right = left;
    size_t i = 1;
    for (const auto &t : tokens) {
        if (i >= durations.size() - 1) break;
        float token_dur = durations[i];  // one duration per word
        float start_ts = left / MAGIC_DIVISOR;
        right = left + 2 * token_dur;
        float end_ts = right / MAGIC_DIVISOR;
        mapping.push_back({t, start_ts, end_ts});
        left = right;
        i++;
    }
    return mapping;
}
output:
[0.238062 - 0.302692] tell
[0.302692 - 0.357112] me
[0.357112 - 0.407653] a
[0.407653 - 0.457869] joke
and the reference pocketsphinx alignment:
[0.230 - 0.510] tell
[0.510 - 0.690] me
[0.690 - 0.800] a
[0.800 - 1.360] joke
There seems to be a lot of deviation; could you please help me with it?
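Putting the two alignments side by side (values copied from the outputs above), the end-time drift grows over the utterance:

```python
# End times from my C++ port vs. the pocketsphinx reference alignment.
ours = {"tell": (0.238062, 0.302692), "me": (0.302692, 0.357112),
        "a": (0.357112, 0.407653), "joke": (0.407653, 0.457869)}
ref  = {"tell": (0.230, 0.510), "me": (0.510, 0.690),
        "a": (0.690, 0.800), "joke": (0.800, 1.360)}

for word in ours:
    drift = ref[word][1] - ours[word][1]
    print(f"{word:>4}: end-time drift {drift:+.3f} s")
# drift grows from roughly +0.21 s on "tell" to +0.90 s on "joke"
```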
@CapMcTavish
Main issues in your C++ port:
- The Python version uses spaCy for tokenization (instead of assuming each whitespace-separated word is one token).
- Timing calculation: the original uses a more complex algorithm, with whitespace handling and different accumulation logic (it sums one duration per phoneme, not one per word).
Try something like this (not tested):
struct TokenTiming {
    std::string text;
    float start;
    float end;
};

std::vector<TokenTiming> joinTimestamps(const std::string& sentence, const std::vector<float>& durations) {
    std::vector<TokenTiming> mapping;
    std::istringstream iss(sentence);
    std::vector<std::string> tokens;
    std::string token;
    while (iss >> token) {
        tokens.push_back(token);
    }
    if (durations.size() < 3 || tokens.empty()) {
        std::cerr << "[joinTimestamps] Not enough duration values." << std::endl;
        return mapping;
    }
    const float TIME_DIVISOR = 80.0f;
    // Boundary timing from the leading padding token
    float leftTime = 2 * std::max(0.0f, durations[0] - 3.0f);
    float rightTime = leftTime;
    // Start processing after the boundary token
    size_t timingIndex = 1;
    for (const auto& t : tokens) {
        // One duration per phoneme/character of this token
        size_t endIndex = timingIndex + t.length();
        if (endIndex >= durations.size()) {
            continue;
        }
        float startTime = rightTime / TIME_DIVISOR;
        // Sum durations for all phonemes in this token
        float tokenDuration = 0.0f;
        for (size_t i = timingIndex; i < endIndex; i++) {
            tokenDuration += durations[i];
        }
        // Handle whitespace after the token (assuming a space between tokens).
        // In the original, spaCy handles tokenization, where each token
        // doesn't necessarily have a space after it.
        float spaceDuration = durations[endIndex];
        leftTime = rightTime + 2 * tokenDuration + spaceDuration;
        float endTime = leftTime / TIME_DIVISOR;
        rightTime = leftTime + spaceDuration;
        mapping.push_back({t, startTime, endTime});
        // Advance to the next token's timing; +1 for the whitespace
        timingIndex = endIndex + 1;
    }
    return mapping;
}
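To experiment with the accumulation outside C++, the same logic can be sketched in Python (a rough translation, not tested against the real model; the duration values in the usage example are made up):

```python
TIME_DIVISOR = 80.0

def join_timestamps(words, durations):
    """Map each word to a (word, start, end) tuple in seconds.

    Assumes one duration per phoneme/character, plus a boundary
    duration at the start and a whitespace duration between words.
    """
    mapping = []
    # Boundary timing from the first (padding) duration.
    left = right = 2 * max(0.0, durations[0] - 3.0)
    idx = 1  # skip the leading boundary token
    for word in words:
        end = idx + len(word)
        if end >= len(durations):
            break
        start_ts = right / TIME_DIVISOR
        token_dur = sum(durations[idx:end])  # phonemes of this word
        space_dur = durations[end]           # whitespace after the word
        left = right + 2 * token_dur + space_dur
        mapping.append((word, start_ts, left / TIME_DIVISOR))
        right = left + space_dur
        idx = end + 1  # step past the whitespace duration
    return mapping
```

Calling it with made-up frame counts, e.g. `join_timestamps(["hi"], [3.0, 4.0, 5.0, 2.0])`, yields `[("hi", 0.0, 0.25)]`.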
@shannu122
Yes, this is possible. Check out the gist I provided: https://gist.github.com/fagenorn/d4aa16704541370d9b9d5f91f1f07b34
It returns a list of tokens with their individual phonemes and timings.
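Roughly, such a result can be consumed like this (the field names and values here are assumptions for illustration; check the gist for the actual structure):

```python
from dataclasses import dataclass

@dataclass
class TimedToken:
    text: str      # original word (assumed field name)
    phonemes: str  # phonemes for this word (assumed)
    start: float   # start time in seconds (assumed)
    end: float     # end time in seconds (assumed)

# Made-up values in the spirit of the outputs earlier in the thread.
result = [
    TimedToken("tell", "tˈɛl", 0.24, 0.51),
    TimedToken("me", "mˌi", 0.51, 0.69),
]

for tok in result:
    print(f"[{tok.start:.2f} - {tok.end:.2f}] {tok.text}")
```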
Is it possible to insert pauses into the text, to create pauses at specific times?