doc updt
README.md CHANGED
@@ -31,8 +31,8 @@ Its weights are then downloaded from this repository.
 from spk_embeddings import EmbeddingsModel, compute_embedding
 import torch
 
-
-
+model = EmbeddingsModel.from_pretrained("Orange/Speaker-wavLM-tbr")
+model.eval()
 ```
 
 The model produces normalized vectors as embeddings.
@@ -48,8 +48,8 @@ finally, we can compute two embeddings from two different files and compare them
 wav1 = "/voxceleb1_2019/test/wav/id10270/x6uYqmx31kE/00001.wav"
 wav2 = "/voxceleb1_2019/test/wav/id10270/8jEAjG6SegY/00008.wav"
 
-e1 = compute_embedding(wav1,
-e2 = compute_embedding(wav2,
+e1 = compute_embedding(wav1, model)
+e2 = compute_embedding(wav2, model)
 sim = float(torch.matmul(e1,e2.t()))
 
 print(sim) # 0.7743815779685974
@@ -58,8 +58,8 @@ print(sim) # 0.7743815779685974
 # Evaluations
 Although it is not directly designed for this use case, evaluation on a standard ASV task can be performed with this model. Applied to
 the [VoxCeleb1-clean test set](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/meta/veri_test2.txt), it leads to an equal error rate
-(EER, lower value denotes a better identification, random prediction leads to a value of 50%) of **
-(with a decision threshold of **0.
+(EER, lower value denotes a better identification, random prediction leads to a value of 50%) of **1.685%**
+(with a decision threshold of **0.472**).
 This value can be interpreted as the ability to identify speakers only with timbral cues. A discussion about this interpretation can be
 found in the paper mentioned hereafter, as well as other experiments showing correlations between these embeddings and timbral voice attributes.
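Context for the comparison step in the diff: the README notes that the model produces normalized vectors, so `torch.matmul(e1, e2.t())` on two such embeddings is simply their cosine similarity. A minimal pure-Python sketch of that idea, with toy vectors standing in for real model outputs:

```python
import math

def cosine_similarity(a, b):
    """Dot product of the two vectors after unit-normalization."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

# Toy stand-ins for two speaker embeddings (not real model outputs).
e1 = [0.6, 0.8]
e2 = [0.8, 0.6]
print(round(cosine_similarity(e1, e2), 2))  # 0.96
```

When the inputs are already unit-norm, as the model's embeddings are, the normalization is a no-op and the similarity reduces to the plain dot product used in the snippet.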
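The EER figure filled in by the last hunk is the operating point at which the false-acceptance rate equals the false-rejection rate, and the quoted decision threshold is the similarity value at that point. A rough sketch of how such a point can be located by sweeping a threshold over trial scores; the scores below are invented for illustration (not VoxCeleb results), and real ASV toolkits interpolate between points rather than picking the smallest gap as done here:

```python
def eer(genuine, impostor):
    """Return (eer, threshold) by sweeping candidate thresholds and
    keeping the one where FAR and FRR are closest."""
    best = None
    for t in sorted(genuine + impostor):
        far = sum(s >= t for s in impostor) / len(impostor)  # impostors accepted
        frr = sum(s < t for s in genuine) / len(genuine)     # genuines rejected
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]

# Made-up similarity scores for same-speaker and different-speaker trials.
genuine = [0.77, 0.81, 0.66, 0.59, 0.92]
impostor = [0.21, 0.35, 0.48, 0.62, 0.12]
rate, thr = eer(genuine, impostor)
print(rate, thr)  # 0.2 0.62
```

A random scorer accepts impostors as often as it rejects genuine trials at every threshold, which is why the README cites 50% as the chance-level EER.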