Has anyone successfully gotten the inference code for this to run?
I tried to write up the inference code based on the base model it is fine-tuned on, but no success :(
Kind of...
import torch
import numpy as np
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification

# Note: from_pretrained does not take a device argument;
# the model and feature extractor load on CPU by default.
modelw = Wav2Vec2ForSequenceClassification.from_pretrained('alefiury/wav2vec2-large-xlsr-53-gender-recognition-librispeech')
processor = Wav2Vec2FeatureExtractor.from_pretrained('alefiury/wav2vec2-large-xlsr-53-gender-recognition-librispeech')

sound_array = np.array(waveform)  # waveform: 16 kHz mono audio samples
input_values = processor(sound_array, sampling_rate=16000, padding='longest', return_tensors='pt').input_values
with torch.no_grad():
    result = modelw(input_values).logits
probs = list(result.numpy()[0])
This works and produces reasonable results.
However, for certain files/segments there seem to be memory leaks.
I am not sure.
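A workaround I have been considering is slicing long waveforms into fixed-length chunks before feeding them to the processor, so each forward pass stays bounded. A minimal sketch, reusing modelw and processor from above (the 30-second chunk length is an arbitrary value of mine, not something from the model card):
# Sketch: run the model over fixed 30 s chunks so each forward pass
# stays bounded in memory (the 30 s value is an arbitrary choice).
chunk_len = 30 * 16000  # samples at 16 kHz
chunks = [sound_array[i:i + chunk_len] for i in range(0, len(sound_array), chunk_len)]
chunks = [c for c in chunks if len(c) >= 16000]  # skip fragments under 1 s
chunk_logits = []
for chunk in chunks:
    inputs = processor(chunk, sampling_rate=16000, return_tensors='pt').input_values
    with torch.no_grad():
        chunk_logits.append(modelw(inputs).logits[0])
# Average the per-chunk logits into a single prediction.
avg_logits = torch.stack(chunk_logits).mean(dim=0)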
I tried your code with something like this:
import torch
import numpy as np
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification
from scipy.io.wavfile import read

modelw = Wav2Vec2ForSequenceClassification.from_pretrained('alefiury/wav2vec2-large-xlsr-53-gender-recognition-librispeech')
processor = Wav2Vec2FeatureExtractor.from_pretrained('alefiury/wav2vec2-large-xlsr-53-gender-recognition-librispeech')

audio_pth = 'your_test_audio_path.wav'
rate, data = read(audio_pth)  # the file should already be 16 kHz mono
waveform = np.array(data, dtype=float)
sound_array = np.array(waveform)
input_values = processor(sound_array, sampling_rate=16000, padding='longest', return_tensors='pt').input_values
with torch.no_grad():
    logits = modelw(input_values).logits
prob = torch.nn.functional.softmax(logits, dim=1)
Note that I used prob = torch.nn.functional.softmax(logits, dim=1)
on the last line to get the probabilities, which seems to make sense based on the few test audios I tried.
Indeed, the following for instance:
with torch.no_grad():
    logits = modelw(input_values).logits
prob = torch.nn.functional.softmax(logits, dim=1).tolist()[0][0]
Seems to return the probability of the speaker being female.
Whereas:
prob = torch.nn.functional.softmax(logits, dim=1).tolist()[0][1]
Returns the probability of the speaker being male.
As this is a binary classifier, the two probabilities should sum to 1 (softmax normalises them), up to floating-point error.
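If the label order is ever in doubt, the checkpoint's config may expose it via id2label (a standard transformers config field; whether this particular checkpoint fills it with meaningful names rather than generic LABEL_0/LABEL_1 is an assumption on my part):
probs = torch.nn.functional.softmax(logits, dim=1)[0]
# id2label maps class index -> label name; it may only contain generic
# LABEL_0/LABEL_1 entries if the author did not set it.
for idx, p in enumerate(probs.tolist()):
    print(modelw.config.id2label[idx], round(p, 4))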
Sorry, I just found that on some audios it makes wrong predictions while the Inference API does not:
This one, for example, is predicted as male, while on the Inference API it is female (correct).
For me it works fine so far with around 40 speakers.
After diarisation, I have concatenated the longest non-overlapping segments corresponding to each speaker into separate WAV files (the preprocessing involves voice isolation with demucs, normalisation with pydub, and conversion to 16 kHz mono WAV with pysox). I am also removing silences with pysox, as that helps for other tasks, but so far it does not seem to have a noticeable effect on gender attribution.
With this preprocessing, the model seems to work well.
Previously I had also experienced some issues, including the inability to process certain segments (huge memory usage and never-ending processing) as well as gender misassignment.
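For reference, a condensed sketch of the normalisation and conversion steps (the demucs voice isolation runs earlier, see the repo linked below; function and file names here are illustrative, not a verbatim copy of my pipeline):
import sox
from pydub import AudioSegment
from pydub.effects import normalize

def preprocess(in_wav, out_wav):
    # Peak-normalise with pydub and write an intermediate file.
    normalize(AudioSegment.from_wav(in_wav)).export('normalised.wav', format='wav')
    # Convert to 16 kHz mono and strip silences with pysox.
    tfm = sox.Transformer()
    tfm.convert(samplerate=16000, n_channels=1)
    tfm.silence(location=0)  # remove silence throughout the file
    tfm.build('normalised.wav', out_wav)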
It starts to feel like I am you from another universe, because we have been doing exactly the same preprocessing steps! Wow!
Glad you got it working fine, I will look a bit closer at my pipeline.
Also, an off-topic question: the official demucs interface is really hard to use; for example, it always writes the results to files in a predefined folder, etc. It also has memory issues processing large audio snippets (which might be due to inefficient segmentation and batching).
Have you had any luck getting it to work reliably? :D
You can check the initial preprocessing there:
https://github.com/mirix/approaches-to-diarisation/tree/main
Namely:
import shlex
import demucs.separate

demucs.separate.main(shlex.split('--two-stems vocals -n mdx_extra ' + 'samples/' + name + ' -o tmp'))
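As far as I remember, demucs writes the stems under the folder passed with -o, organised by model and track name, so with --two-stems vocals the isolated voice ends up somewhere like this (path layout from memory, worth double-checking):
import os

# demucs output layout (from memory): <out>/<model>/<track-stem>/<stem>.wav
track = os.path.splitext(name)[0]
vocals_path = os.path.join('tmp', 'mdx_extra', track, 'vocals.wav')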
Thanks man! Appreciate that!!