yangwang825 committed on
Commit 770dac0 · verified · 1 Parent(s): 15387ce

Model save

Files changed (3)
  1. README.md +18 -32
  2. model.safetensors +1 -1
  3. tdnn_attention.py +12 -7
README.md CHANGED
@@ -1,28 +1,14 @@
 ---
+library_name: transformers
+tags:
+- generated_from_trainer
 datasets:
 - voxceleb
-library_name: transformers
 metrics:
 - accuracy
-tags:
-- audio-classification
-- generated_from_trainer
 model-index:
 - name: ecapa-tdnn-voxceleb1-c512-aam
-  results:
-  - task:
-      type: audio-classification
-      name: Audio Classification
-    dataset:
-      name: confit/voxceleb
-      type: voxceleb
-      config: verification
-      split: train
-      args: verification
-    metrics:
-    - type: accuracy
-      value: 0.8030272452068618
-      name: Accuracy
+  results: []
 ---
 
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
@@ -30,10 +16,10 @@ should probably proofread and complete it, then remove this comment. -->
 
 # ecapa-tdnn-voxceleb1-c512-aam
 
-This model is a fine-tuned version of [](https://huggingface.co/) on the confit/voxceleb dataset.
+This model is a fine-tuned version of [](https://huggingface.co/) on the voxceleb dataset.
 It achieves the following results on the evaluation set:
-- Loss: 4.7003
-- Accuracy: 0.8030
+- Loss: nan
+- Accuracy: 0.0007
 
 ## Model description
 
@@ -52,7 +38,7 @@ More information needed
 ### Training hyperparameters
 
 The following hyperparameters were used during training:
-- learning_rate: 0.0001
+- learning_rate: 0.0005
 - train_batch_size: 256
 - eval_batch_size: 1
 - seed: 914
@@ -66,16 +52,16 @@ The following hyperparameters were used during training:
 
 | Training Loss | Epoch | Step | Validation Loss | Accuracy |
 |:-------------:|:-----:|:----:|:---------------:|:--------:|
-| 11.3851       | 1.0   | 523  | 11.0293         | 0.1806   |
-| 9.7596        | 2.0   | 1046 | 9.1401          | 0.3850   |
-| 8.7136        | 3.0   | 1569 | 7.8821          | 0.5242   |
-| 7.848         | 4.0   | 2092 | 6.9451          | 0.6144   |
-| 7.1912        | 5.0   | 2615 | 6.2630          | 0.6821   |
-| 6.6763        | 6.0   | 3138 | 5.7182          | 0.7292   |
-| 6.3112        | 7.0   | 3661 | 5.2653          | 0.7632   |
-| 6.0255        | 8.0   | 4184 | 4.9663          | 0.7826   |
-| 5.8091        | 9.0   | 4707 | 4.7787          | 0.7957   |
-| 5.7269        | 10.0  | 5230 | 4.7003          | 0.8030   |
+| 9.047         | 1.0   | 575  | 8.3662          | 0.4304   |
+| 5.3508        | 2.0   | 1150 | 4.0252          | 0.8191   |
+| 3.3124        | 3.0   | 1725 | 2.1083          | 0.9260   |
+| 2.3212        | 4.0   | 2300 | 1.2224          | 0.9435   |
+| 1.6276        | 5.0   | 2875 | 0.8229          | 0.9677   |
+| 1.1418        | 6.0   | 3450 | 0.5840          | 0.9758   |
+| 1.0484        | 7.0   | 4025 | 0.5781          | 0.9738   |
+| 0.0           | 8.0   | 4600 | nan             | 0.0007   |
+| 0.0           | 9.0   | 5175 | nan             | 0.0007   |
+| 0.0           | 10.0  | 5750 | nan             | 0.0007   |
 
 
 ### Framework versions
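
Note that the updated training log shows the loss collapsing to 0.0 and the validation loss becoming nan from epoch 8 onward, a typical sign of numerical divergence (possibly aggravated by the learning-rate increase from 1e-4 to 5e-4). A minimal, hypothetical guard against this failure mode, not taken from this repo, is to skip optimizer steps whenever the loss is non-finite; the `safe_step` name and its interface are illustrative assumptions:

```python
import torch

def safe_step(loss: torch.Tensor, optimizer: torch.optim.Optimizer) -> bool:
    """Apply an optimizer step only when the loss is finite.

    Returns True if the step was taken, False if the batch was skipped.
    """
    if not torch.isfinite(loss):
        # NaN/inf loss: discard this batch instead of poisoning the weights
        optimizer.zero_grad(set_to_none=True)
        return False
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return True
```

A guard like this keeps one divergent batch from wiping out seven epochs of progress, as happened in the table above.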
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:9d5d7984fc542731fbba8c59c4d9af1ce1d3673bc91f68c6f88044a29373d6b6
+oid sha256:01ab04c49d211b9cb78118ce27d28f1f04761dd870ee7e94728be21790d620cd
 size 26039912
tdnn_attention.py CHANGED
@@ -214,16 +214,21 @@ class MaskedSEModule(nn.Module):
             nn.Sigmoid(),
         )
 
-    def forward(self, input, length=None):
+    def forward(self, inputs, length=None):
+        """
+        inputs: tensor shape of (B, D, T)
+        outputs: tensor shape of (B, D, 1)
+        """
         if length is None:
-            x = torch.mean(input, dim=2, keep_dim=True)
+            x = torch.mean(inputs, dim=2, keep_dim=True)
         else:
-            max_len = input.size(2)
-            mask, num_values = lens_to_mask(length, max_len=max_len, device=input.device)
-            x = torch.sum((input * mask), dim=2, keepdim=True) / (num_values)
-
+            max_len = inputs.size(2)
+            # shape of `mask` is (B, 1, T) and shape of `num_values` is (B, 1, 1)
+            mask, num_values = lens_to_mask(length, max_len=max_len, device=inputs.device)
+            # shape of `x` is (B, D, 1)
+            x = torch.sum((inputs * mask), dim=2, keepdim=True) / (num_values)
         out = self.se_layer(x)
-        return out * input
+        return out * inputs
 
 
 class TdnnSeModule(nn.Module):
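
One remaining bug survives this commit: `torch.mean`'s keyword argument is `keepdim`, not `keep_dim`, so the `length is None` branch would still raise a `TypeError`. A corrected standalone sketch of the masked mean pooling follows; the `lens_to_mask` helper here is an assumed re-implementation matching the shapes documented in the diff's comments, not the repo's actual code:

```python
import torch

def lens_to_mask(lengths: torch.Tensor, max_len: int, device=None):
    """Assumed helper: (B,) lengths -> (B, 1, T) float mask, (B, 1, 1) counts."""
    idx = torch.arange(max_len, device=device).unsqueeze(0)   # (1, T)
    mask = (idx < lengths.unsqueeze(1)).unsqueeze(1).float()  # (B, 1, T)
    num_values = mask.sum(dim=2, keepdim=True)                # (B, 1, 1)
    return mask, num_values

def masked_mean(inputs: torch.Tensor, length=None) -> torch.Tensor:
    """Mean of a (B, D, T) tensor over time, ignoring padded frames."""
    if length is None:
        # the keyword is `keepdim`; `keep_dim` raises a TypeError
        return torch.mean(inputs, dim=2, keepdim=True)
    max_len = inputs.size(2)
    mask, num_values = lens_to_mask(length, max_len=max_len, device=inputs.device)
    return torch.sum(inputs * mask, dim=2, keepdim=True) / num_values
```

Because batched eval here uses `eval_batch_size: 1` (no padding), the buggy unmasked branch is the one most likely to be hit, which may explain why the error went unnoticed.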