This model is a Hebrew finetune (continued training) of the OpenAI Whisper Large v3 Turbo model.

- **Language(s) (NLP):** Hebrew
- **License:** Apache-2.0
- **Finetuned from model:** openai/whisper-large-v3-turbo
- **Training Date:** Apr 2025

## Bias, Risks, and Limitations

Additionally, the translation task was not trained and has also degraded.

## How to Get Started with the Model

Please follow the original [model card](https://huggingface.co/openai/whisper-large-v3-turbo#usage) for usage details, replacing the model name with this one.
You can also find other weight formats and quantizations on the [ivrit.ai](https://huggingface.co/ivrit-ai) HF page.
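
For a quick start with Hugging Face Transformers, a minimal transcription sketch is shown below. It follows the usage pattern from the original Whisper large-v3-turbo card; the model id (`ivrit-ai/whisper-large-v3-turbo`) and the audio path are placeholders, so substitute this repo's actual name and your own file.

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Assumed repo id for this finetune - replace with the actual model name if different
model_id = "ivrit-ai/whisper-large-v3-turbo"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
).to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# Transcribe a local Hebrew audio file (path is a placeholder)
result = pipe("audio.mp3", generate_kwargs={"language": "hebrew"})
print(result["text"])
```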

We created some simple example scripts showing how to use this model and its weights with other inference runtimes.
Find those in the ["examples"](https://github.com/ivrit-ai/asr-training/tree/master/examples) folder within the training GitHub repo.

## Training Details

This model was trained on the following datasets:

- [ivrit-ai/crowd-transcribe-v5](https://huggingface.co/datasets/ivrit-ai/crowd-transcribe-v5) - Publicly accessible audio sources that have been crowd-transcribed segment-by-segment - ~300h
- [ivrit-ai/crowd-recital-whisper-training](https://huggingface.co/datasets/ivrit-ai/crowd-recital-whisper-training) - Crowd-sourced recordings of Wikipedia article snippets - ~50h
- [ivrit-ai/knesset-plenums-whisper-training](https://huggingface.co/datasets/ivrit-ai/knesset-plenums-whisper-training) - A subset of Knesset (the Israeli parliament) plenum protocols - ~4700h
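
As a quick sanity check, the sketch below streams a few examples from one of these datasets with the `datasets` library; the split name is an assumption, so consult the dataset cards for the actual schema.

```python
from datasets import load_dataset

# Stream the dataset to avoid downloading hundreds of hours of audio up front;
# the "train" split name is assumed - check the dataset card for the real splits
ds = load_dataset("ivrit-ai/crowd-transcribe-v5", split="train", streaming=True)

# Inspect a single example and its columns rather than assuming field names
sample = next(iter(ds))
print(sample.keys())
```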

### Training Procedure

This model was trained in two main phases:

- Knesset-based pre-training over all ~4700h of data - 3 epochs, ~48h run
- Mixed post-training over all of crowd-transcribe-v5 (~300h), crowd-recital-whisper-training (~50h), and the highest-quality filtered Knesset data (~150h) - 2 epochs
  - Datasets were interleaved with sampling probabilities of (0.9, 0.025, 0.075) respectively (see the sketch below)
  - Note that crowd-transcribe-v5 has roughly 5x shorter samples on average, hence the over-sampling
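
A minimal sketch of how such a probability-weighted mix can be built with `datasets.interleave_datasets` is shown below. The streaming setup and the stand-in for the filtered Knesset subset are assumptions rather than the exact training configuration; see the training repo for the real implementation.

```python
from datasets import load_dataset, interleave_datasets

# Load the three post-training sources (streaming keeps memory and disk usage low)
crowd_transcribe = load_dataset("ivrit-ai/crowd-transcribe-v5", split="train", streaming=True)
crowd_recital = load_dataset("ivrit-ai/crowd-recital-whisper-training", split="train", streaming=True)
# The ~150h "highest-quality" Knesset subset comes from a filter that is not described
# here, so the full dataset is used as a stand-in in this sketch
knesset = load_dataset("ivrit-ai/knesset-plenums-whisper-training", split="train", streaming=True)

# Sample with the stated probabilities:
# crowd-transcribe 0.9, crowd-recital 0.025, knesset 0.075
mixed = interleave_datasets(
    [crowd_transcribe, crowd_recital, knesset],
    probabilities=[0.9, 0.025, 0.075],
    seed=42,
    stopping_strategy="all_exhausted",
)
```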

This model is a weighted average of the two lowest-eval-loss checkpoints (from around the end of epoch 2) from two separate runs with the same setup.
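
For illustration, checkpoint averaging of this kind can be done by averaging state dicts. The sketch below uses equal 0.5/0.5 weights and hypothetical checkpoint paths, since the actual weighting and checkpoint names are not listed here.

```python
import torch
from transformers import WhisperForConditionalGeneration

# Hypothetical local checkpoint paths from the two runs (not the actual ones used)
ckpt_a = "run1/checkpoint-best"
ckpt_b = "run2/checkpoint-best"

model_a = WhisperForConditionalGeneration.from_pretrained(ckpt_a)
model_b = WhisperForConditionalGeneration.from_pretrained(ckpt_b)

# Average the parameters; the 0.5/0.5 weighting is an assumption -
# the card does not state the weights actually used
state_b = model_b.state_dict()
averaged = {
    name: 0.5 * param + 0.5 * state_b[name]
    for name, param in model_a.state_dict().items()
}
model_a.load_state_dict(averaged)
model_a.save_pretrained("whisper-large-v3-turbo-he-averaged")
```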

Training code can be found on the ivrit-ai GitHub [here](https://github.com/ivrit-ai/asr-training).

#### Preprocessing

#### Training Hyperparameters

- **Learning Rate:** 1e-5, linear decay, 800 warmup steps, 3 epochs
- **Batch Size:** 32
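
As a rough illustration, these settings map onto `Seq2SeqTrainingArguments` as sketched below; the per-device split of the global batch of 32 and the precision/output settings are assumptions, not values taken from the actual training code (linked above).

```python
from transformers import Seq2SeqTrainingArguments

# Rough mapping of the listed hyperparameters; anything not listed above is an assumption
training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-large-v3-turbo-he",  # hypothetical output path
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    warmup_steps=800,
    num_train_epochs=3,
    per_device_train_batch_size=4,  # 4 x 8 GPUs = global batch of 32 (assumed split)
    fp16=True,                      # assumption
    predict_with_generate=True,
)
```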

#### Training Hardware / Duration

- **GPU Type:** 8 x Nvidia A40 (single machine)
- **Duration:** ~55h across both phases

## Evaluation