Automatic Speech Recognition · Transformers · Safetensors · Hebrew · whisper
yoad committed · verified · Commit d01d001 · 1 Parent(s): d171567

Update README.md

Files changed (1): README.md (+14 -7)
README.md CHANGED
@@ -27,6 +27,7 @@ This model is a Hebrew finetune (continued training) of the OpenAI Whisper Large
  - **Language(s) (NLP):** Hebrew
  - **License:** Apache-2.0
  - **Finetuned from model** openai/whisper-large-v3-turbo
+ - **Training Date** Apr 2025

  ## Bias, Risks, and Limitations

@@ -40,7 +41,7 @@ Additionally, the translation task was not trained and also degraded. This model
  Please follow the original [model card](https://huggingface.co/openai/whisper-large-v3-turbo#usage) for usage details - replacing the model name with this one.
  You can also find other weight formats and quantizations on the [ivrit ai](https://huggingface.co/ivrit-ai) HF page.

- We created some simple example scripts using this model and weights for other indference runtimes.
+ We created some simple example scripts using this model and weights for other inference runtimes.
  Find those in the ["examples"](https://github.com/ivrit-ai/asr-training/tree/master/examples) folder within the training GitHub repo.

  ## Training Details
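
For quick reference, a minimal usage sketch following the upstream Whisper pipeline pattern the model card points to. It is not taken from the repo's examples folder, and the repo id `ivrit-ai/whisper-large-v3-turbo` is an assumption - substitute this model's actual Hugging Face id.

```python
# Minimal ASR sketch following the upstream whisper-large-v3-turbo usage pattern.
# The model id below is an assumption - replace it with this model's actual repo id.
import torch
from transformers import pipeline

model_id = "ivrit-ai/whisper-large-v3-turbo"  # assumed repo id for this finetune

asr = pipeline(
    "automatic-speech-recognition",
    model=model_id,
    torch_dtype=torch.float16,
    device="cuda:0",  # use "cpu" if no GPU is available
)

# Transcribe a local Hebrew audio file; long inputs are processed in chunks.
result = asr(
    "sample_hebrew.wav",
    chunk_length_s=30,
    return_timestamps=True,
    generate_kwargs={"language": "he"},
)
print(result["text"])
```
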
@@ -49,13 +50,19 @@ Find those in the ["examples"](https://github.com/ivrit-ai/asr-training/tree/master/examples) folder within the training GitHub repo.

  This model was trained on the following datasets:

- - [ivrit-ai/crowd-transcribe-v5](https://huggingface.co/datasets/ivrit-ai/crowd-transcribe-v5) - Publicly accessible audio sources have beem crowd-transcribed segment-by-segment - ~300h
- - [ivrit-ai/crowd-recital-whisper-training](https://huggingface.co/datasets/ivrit-ai/crowd-recital-whisper-training) - Crowd-sourced recording of Wikipedia atricle snippets. ~50h
- - [ivrit-ai/knesset-plenums-whisper-training](https://huggingface.co/datasets/ivrit-ai/knesset-plenums-whisper-training) - A subset of a Knesset (Israeli house of representitives) plenum protocols. ~325h
+ - [ivrit-ai/crowd-transcribe-v5](https://huggingface.co/datasets/ivrit-ai/crowd-transcribe-v5) - Publicly accessible audio sources have been crowd-transcribed segment-by-segment - ~300h
+ - [ivrit-ai/crowd-recital-whisper-training](https://huggingface.co/datasets/ivrit-ai/crowd-recital-whisper-training) - Crowd-sourced recordings of Wikipedia article snippets. ~50h
+ - [ivrit-ai/knesset-plenums-whisper-training](https://huggingface.co/datasets/ivrit-ai/knesset-plenums-whisper-training) - A subset of Knesset (Israeli house of representatives) plenum protocols. ~4700h

  ### Training Procedure

- This model is a weighted-average of the lowest eval loss checkpoints (From around the end of epoch 2) from two seprate runs with the same setup.
+ This model was trained in two main phases:
+ - Knesset-based pre-training - over all ~4700h of data - 3 epochs, ~48h run
+ - Mixed post-training over all of crowd-transcribe-v5 (~300h), crowd-recital-whisper-training (~50h) and a highest-quality filtered subset of the Knesset data (~150h) - 2 epochs
+   - Interleaving of datasets with sampling probs: (0.9, 0.025, 0.075) respectively
+   - Note that crowd-transcribe-v5 has about 5x shorter samples on average, hence the over-sampling.
+
+ This model is a weighted average of the 2 lowest eval-loss checkpoints (from around the end of epoch 2) from two separate runs with the same setup.
  Training code can be found on the ivrit-ai GitHub [here](https://github.com/ivrit-ai/asr-training)

  #### Preprocessing
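
As a sketch of the interleaving step described in the mixed post-training phase above: the Hugging Face `datasets` library supports probabilistic interleaving directly. The split names, streaming mode, and the mapping of probabilities to datasets are assumptions here, not taken from the training code.

```python
# Sketch of probabilistic dataset interleaving for the mixed post-training phase.
# Split names, streaming mode, and the probability-to-dataset mapping are assumptions.
from datasets import load_dataset, interleave_datasets

crowd_transcribe = load_dataset("ivrit-ai/crowd-transcribe-v5", split="train", streaming=True)
crowd_recital = load_dataset("ivrit-ai/crowd-recital-whisper-training", split="train", streaming=True)
knesset = load_dataset("ivrit-ai/knesset-plenums-whisper-training", split="train", streaming=True)

mixed = interleave_datasets(
    [crowd_transcribe, crowd_recital, knesset],
    probabilities=[0.9, 0.025, 0.075],  # over-samples the shorter crowd-transcribe segments
    seed=42,
    stopping_strategy="all_exhausted",
)

# Peek at a few interleaved examples.
for example in mixed.take(3):
    print(example.keys())
```
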
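
The final weights are described above as a weighted average of the two lowest eval-loss checkpoints from two runs. A generic sketch of that averaging step follows; the checkpoint paths, the equal 50/50 weighting, and the output directory are illustrative assumptions.

```python
# Generic sketch of averaging two Whisper checkpoints into a single model.
# Paths, the 50/50 weighting, and the output directory are assumptions.
import torch
from transformers import WhisperForConditionalGeneration

ckpt_a = WhisperForConditionalGeneration.from_pretrained("run1/best-checkpoint")
ckpt_b = WhisperForConditionalGeneration.from_pretrained("run2/best-checkpoint")

state_a, state_b = ckpt_a.state_dict(), ckpt_b.state_dict()
averaged = {
    name: 0.5 * state_a[name].float() + 0.5 * state_b[name].float()
    for name in state_a
}

ckpt_a.load_state_dict(averaged)
ckpt_a.save_pretrained("whisper-he-averaged")
```
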
@@ -75,10 +82,10 @@ Datasets were interleaved with 0.15:0.8:0.05 ratio (knesset:crowd-transcribe:cro
  - **Learning Rate:** 1e-5, Linear decay, 800 steps warmup for 3 epochs
  - **Batch Size:** 32

- #### Training Hardward / Duration
+ #### Training Hardware / Duration

  - **GPU Type:** 8 x Nvidia A40 machine
- - **Duration:** ~9h run, stopped at 3 epochs
+ - **Duration:** ~55h run across both phases

  ## Evaluation

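
The hyperparameters above map directly onto Hugging Face `Seq2SeqTrainingArguments`. A minimal sketch follows; how the global batch size of 32 is split across the 8 GPUs, and the output directory name, are assumptions rather than values from the training code.

```python
# Sketch of training arguments matching the listed hyperparameters.
# The per-device batch split and output directory are assumptions.
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="whisper-large-v3-turbo-he",   # assumed name
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    warmup_steps=800,
    num_train_epochs=3,
    per_device_train_batch_size=4,  # 8 GPUs x 4 = 32 global batch (assumed split)
)
```
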