doc updt
README.md CHANGED
@@ -31,8 +31,8 @@ Its weights are then downloaded from this repository.
 from spk_embeddings import EmbeddingsModel, compute_embedding
 import torch
 
-
-
+model = EmbeddingsModel.from_pretrained("Orange/Speaker-wavLM-tbr")
+model.eval()
 ```
 
 The model produces normalized vectors as embeddings.
@@ -48,8 +48,8 @@ finally, we can compute two embeddings from two different files and compare them
 wav1 = "/voxceleb1_2019/test/wav/id10270/x6uYqmx31kE/00001.wav"
 wav2 = "/voxceleb1_2019/test/wav/id10270/8jEAjG6SegY/00008.wav"
 
-e1 = compute_embedding(wav1,
-e2 = compute_embedding(wav2,
+e1 = compute_embedding(wav1, model)
+e2 = compute_embedding(wav2, model)
 sim = float(torch.matmul(e1,e2.t()))
 
 print(sim) # 0.7743815779685974
@@ -58,8 +58,8 @@ print(sim) # 0.7743815779685974
 # Evaluations
 Although it is not directly designed for this use case, evaluation on a standard ASV task can be performed with this model. Applied to
 the [VoxCeleb1-clean test set](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/meta/veri_test2.txt), it leads to an equal error rate
-(EER, lower value denotes a better identification, random prediction leads to a value of 50%) of **
-(with a decision threshold of **0.
+(EER, lower value denotes a better identification, random prediction leads to a value of 50%) of **1.685%**
+(with a decision threshold of **0.472**).
 This value can be interpreted as the ability to identify speakers only with timbral cues. A discussion about this interpretation can be
 found in the paper mentioned hereafter, as well as other experiments showing correlations between these embeddings and timbral voice attributes.
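Context for the comparison step in the diff: the README notes that the model produces normalized vectors, so `torch.matmul(e1, e2.t())` on two such embeddings is simply their cosine similarity. A minimal pure-Python sketch of that idea, with toy vectors standing in for real model outputs:

```python
import math

def cosine_similarity(a, b):
    """Dot product of the two vectors after unit-normalization."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

# Toy stand-ins for two speaker embeddings (not real model outputs).
e1 = [0.6, 0.8]
e2 = [0.8, 0.6]
print(round(cosine_similarity(e1, e2), 2))  # 0.96
```

When the inputs are already unit-norm, as the model's embeddings are, the normalization is a no-op and the similarity reduces to the plain dot product used in the snippet.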
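The EER figure filled in by the last hunk is the operating point at which the false-acceptance rate equals the false-rejection rate, and the quoted decision threshold is the similarity value at that point. A rough sketch of how such a point can be located by sweeping a threshold over trial scores; the scores below are invented for illustration (not VoxCeleb results), and real ASV toolkits interpolate between points rather than picking the smallest gap as done here:

```python
def eer(genuine, impostor):
    """Return (eer, threshold) by sweeping candidate thresholds and
    keeping the one where FAR and FRR are closest."""
    best = None
    for t in sorted(genuine + impostor):
        far = sum(s >= t for s in impostor) / len(impostor)  # impostors accepted
        frr = sum(s < t for s in genuine) / len(genuine)     # genuines rejected
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]

# Made-up similarity scores for same-speaker and different-speaker trials.
genuine = [0.77, 0.81, 0.66, 0.59, 0.92]
impostor = [0.21, 0.35, 0.48, 0.62, 0.12]
rate, thr = eer(genuine, impostor)
print(rate, thr)  # 0.2 0.62
```

A random scorer accepts impostors as often as it rejects genuine trials at every threshold, which is why the README cites 50% as the chance-level EER.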