Has anyone successfully gotten the inference code for this to run?
I tried to write up the inference code based on the base model it is fine-tuned on, but no success :(
Kind of...
import torch
import numpy as np
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification

# Note: from_pretrained does not take a device argument;
# the model and feature extractor load on CPU by default.
modelw = Wav2Vec2ForSequenceClassification.from_pretrained('alefiury/wav2vec2-large-xlsr-53-gender-recognition-librispeech')
processor = Wav2Vec2FeatureExtractor.from_pretrained('alefiury/wav2vec2-large-xlsr-53-gender-recognition-librispeech')

sound_array = np.array(waveform)  # waveform: 16 kHz mono audio samples
input_values = processor(sound_array, sampling_rate=16000, padding='longest', return_tensors='pt').input_values
with torch.no_grad():
    result = modelw(input_values).logits
probs = list(result.numpy()[0])
This works and produces reasonable results.
However, for certain files/segments there seem to be memory leaks.
I am not sure.
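A workaround I have been considering is slicing long waveforms into fixed-length chunks before feeding them to the processor, so each forward pass stays bounded. A minimal sketch, reusing modelw and processor from above (the 30-second chunk length is an arbitrary value of mine, not something from the model card):
# Sketch: run the model over fixed 30 s chunks so each forward pass
# stays bounded in memory (the 30 s value is an arbitrary choice).
chunk_len = 30 * 16000  # samples at 16 kHz
chunks = [sound_array[i:i + chunk_len] for i in range(0, len(sound_array), chunk_len)]
chunks = [c for c in chunks if len(c) >= 16000]  # skip fragments under 1 s
chunk_logits = []
for chunk in chunks:
    inputs = processor(chunk, sampling_rate=16000, return_tensors='pt').input_values
    with torch.no_grad():
        chunk_logits.append(modelw(inputs).logits[0])
# Average the per-chunk logits into a single prediction.
avg_logits = torch.stack(chunk_logits).mean(dim=0)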
I tried your code with something like this:
import torch
import numpy as np
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification
from scipy.io.wavfile import read

modelw = Wav2Vec2ForSequenceClassification.from_pretrained('alefiury/wav2vec2-large-xlsr-53-gender-recognition-librispeech')
processor = Wav2Vec2FeatureExtractor.from_pretrained('alefiury/wav2vec2-large-xlsr-53-gender-recognition-librispeech')

audio_pth = 'your_test_audio_path.wav'
rate, data = read(audio_pth)  # the file should already be 16 kHz mono
waveform = np.array(data, dtype=float)
sound_array = np.array(waveform)
input_values = processor(sound_array, sampling_rate=16000, padding='longest', return_tensors='pt').input_values
with torch.no_grad():
    logits = modelw(input_values).logits
prob = torch.nn.functional.softmax(logits, dim=1)
Note that I used prob = torch.nn.functional.softmax(logits, dim=1)
on the last line to get the probabilities, which seems to make sense based on the few test audios I tried.
Indeed, the following for instance:
with torch.no_grad():
    logits = modelw(input_values).logits
prob = torch.nn.functional.softmax(logits, dim=1).tolist()[0][0]
Seems to return the probability of the speaker being female.
Whereas:
prob = torch.nn.functional.softmax(logits, dim=1).tolist()[0][1]
Returns the probability of the speaker being male.
As this is a binary classifier, the two probabilities should sum to 1 (softmax normalises them), up to floating-point error.
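If the label order is ever in doubt, the checkpoint's config may expose it via id2label (a standard transformers config field; whether this particular checkpoint fills it with meaningful names rather than generic LABEL_0/LABEL_1 is an assumption on my part):
probs = torch.nn.functional.softmax(logits, dim=1)[0]
# id2label maps class index -> label name; it may only contain generic
# LABEL_0/LABEL_1 entries if the author did not set it.
for idx, p in enumerate(probs.tolist()):
    print(modelw.config.id2label[idx], round(p, 4))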
Sorry, I just found that on some audios it makes wrong predictions while the Inference API does not:
This one, for example, is predicted as male, while on the Inference API it is female (correct).
For me it works fine so far with around 40 speakers.
After diarisation, I have concatenated the longest non-overlapping segments corresponding to each speaker into separate WAV files (the preprocessing involves voice isolation with demucs, normalisation with pydub, and conversion to 16 kHz mono WAV with pysox). I am also removing silences with pysox, as that helps for other tasks, but so far it does not seem to have a noticeable effect on gender attribution.
With this preprocessing, the model seems to work well.
Previously I had also experienced some issues, including the inability to process certain segments (huge memory usage and never-ending processing) as well as gender misassignment.
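For reference, a condensed sketch of the normalisation and conversion steps (the demucs voice isolation runs earlier, see the repo linked below; function and file names here are illustrative, not a verbatim copy of my pipeline):
import sox
from pydub import AudioSegment
from pydub.effects import normalize

def preprocess(in_wav, out_wav):
    # Peak-normalise with pydub and write an intermediate file.
    normalize(AudioSegment.from_wav(in_wav)).export('normalised.wav', format='wav')
    # Convert to 16 kHz mono and strip silences with pysox.
    tfm = sox.Transformer()
    tfm.convert(samplerate=16000, n_channels=1)
    tfm.silence(location=0)  # remove silence throughout the file
    tfm.build('normalised.wav', out_wav)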
It starts to feel like I am you from another universe, because we have been doing exactly the same preprocessing steps! Wow!
Glad you got it working fine, I will look a bit closer at my pipeline.
Also, an off-topic question: the official demucs interface is really hard to use; for example, it always writes the results to files in a predefined folder, etc. It also has memory issues processing large audio snippets (which might be due to inefficient segmentation and batching).
Have you had any luck getting it to work reliably? :D
You can check the initial preprocessing there:
https://github.com/mirix/approaches-to-diarisation/tree/main
Namely:
import shlex
import demucs.separate

demucs.separate.main(shlex.split('--two-stems vocals -n mdx_extra ' + 'samples/' + name + ' -o tmp'))
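As far as I remember, demucs writes the stems under the folder passed with -o, organised by model and track name, so with --two-stems vocals the isolated voice ends up somewhere like this (path layout from memory, worth double-checking):
import os

# demucs output layout (from memory): <out>/<model>/<track-stem>/<stem>.wav
track = os.path.splitext(name)[0]
vocals_path = os.path.join('tmp', 'mdx_extra', track, 'vocals.wav')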
Thanks man! Appreciate that!!