mrprimenotes committed on
Commit
cf911a2
·
verified ·
1 Parent(s): 94c5026

Upload README.md

  1. README.md +42 -85
README.md CHANGED
@@ -3,9 +3,8 @@ license: apache-2.0
3
  language:
4
  - de
5
  library_name: transformers
6
- pipeline_tag: automatic-speech-recognition
7
  model-index:
8
- - name: whisper-large-v3-turbo-german by Florian Zimmermeister @primeLine
9
  results:
10
  - task:
11
  type: automatic-speech-recognition
@@ -15,123 +14,81 @@ model-index:
15
  type: flozi00/asr-german-mixed
16
  metrics:
17
  - type: wer
18
- value: 2.628 %
19
- name: Test WER
20
  datasets:
21
  - flozi00/asr-german-mixed
22
- - flozi00/asr-german-mixed-evals
23
  base_model:
24
  - primeline/whisper-large-v3-german
25
  ---
26
 
27
  ### Summary
28
- This model map provides information about a model based on Whisper Large v3 that has been fine-tuned for speech recognition in German. Whisper is a powerful speech recognition platform developed by OpenAI. This model has been specially optimized for processing and recognizing German speech.
29
 
30
 
31
 
32
  ### Applications
33
- This model can be used in various application areas, including
34
-
35
- - Transcription of spoken German language
36
- - Voice commands and voice control
37
- - Automatic subtitling for German videos
38
- - Voice-based search queries in German
39
- - Dictation functions in word processing programs
40
-
41
-
42
- ## Model family
43
-
44
- | Model | Parameters | link |
45
- |----------------------------------|------------|--------------------------------------------------------------|
46
- | Whisper large v3 german | 1.54B | [link](https://huggingface.co/primeline/whisper-large-v3-german) |
47
- | Whisper large v3 turbo german | 809M | [link](https://huggingface.co/primeline/whisper-large-v3-turbo-german)
48
- | Distil-whisper large v3 german | 756M | [link](https://huggingface.co/primeline/distil-whisper-large-v3-german) |
49
- | tiny whisper | 37.8M | [link](https://huggingface.co/primeline/whisper-tiny-german) |
50
 
 
51
 
52
  ## Evaluations - Word error rate
53
-
54
- | Dataset | openai-whisper-large-v3-turbo | openai-whisper-large-v3 | primeline-whisper-large-v3-german | nyrahealth-CrisperWhisper (large)| primeline-whisper-large-v3-turbo-german |
55
- |-------------------------------------|-------------------------------|-------------------------|-----------------------------------|---------------------------|-----------------------------------------|
56
- | Tuda-De | 8.300 | 7.884 | 7.711 | **5.148** | 6.441 |
57
- | common_voice_19_0 | 3.849 | 3.484 | 3.215 | **1.927** | 3.200 |
58
- | multilingual librispeech | 3.203 | 2.832 | 2.129 | 2.815 | **2.070** |
59
- | All | 3.649 | 3.279 | 2.734 | 2.662 | **2.628** |
60
-
61
- The data and code for evaluations are available [here](https://huggingface.co/datasets/flozi00/asr-german-mixed-evals)
62
 
63
  ### Training data
64
- The training data for this model includes a large amount of spoken German from various sources. The data was carefully selected and processed to optimize recognition performance.
65
-
66
 
67
  ### Training process
68
- The training of the model was performed with the following hyperparameters
69
-
70
- - Batch size: 12288
71
- - Epochs: 3
72
- - Learning rate: 1e-6
73
- - Data augmentation: No
74
- - Optimizer: [Ademamix](https://arxiv.org/abs/2409.03137)
75
-
76
 
77
  ### How to use
78
-
79
  ```python
80
  import torch
81
- from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
82
  from datasets import load_dataset
 
83
  device = "cuda:0" if torch.cuda.is_available() else "cpu"
84
  torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
85
- model_id = "primeline/whisper-large-v3-turbo-german"
86
- model = AutoModelForSpeechSeq2Seq.from_pretrained(
87
- model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
88
- )
89
- model.to(device)
90
- processor = AutoProcessor.from_pretrained(model_id)
91
- pipe = pipeline(
92
- "automatic-speech-recognition",
93
- model=model,
94
- tokenizer=processor.tokenizer,
95
- feature_extractor=processor.feature_extractor,
96
- max_new_tokens=128,
97
- chunk_length_s=30,
98
- batch_size=16,
99
- return_timestamps=True,
100
- torch_dtype=torch_dtype,
101
- device=device,
102
- )
103
- dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
104
- sample = dataset[0]["audio"]
105
- result = pipe(sample)
106
- print(result["text"])
107
- ```
108
-
109
-
110
- ## [About us](https://primeline-ai.com/en/)
111
-
112
- [![primeline AI](https://primeline-ai.com/wp-content/uploads/2024/02/pl_ai_bildwortmarke_original.svg)](https://primeline-ai.com/en/)
113
 
114
 
115
- Your partner for AI infrastructure in Germany
 
116
 
117
- Experience the powerful AI infrastructure that drives your ambitions in Deep Learning, Machine Learning & High-Performance Computing.
 
 
118
 
119
- Optimized for AI training and inference.
 
 
120
 
 
 
121
 
 
 
122
 
123
- Model author: [Florian Zimmermeister](https://huggingface.co/flozi00)
 
 
124
 
125
- **Disclaimer**
 
 
 
 
 
 
126
 
 
127
  ```
128
- This model is not a product of the primeLine Group.
129
-
130
- It represents research conducted by [Florian Zimmermeister](https://huggingface.co/flozi00), with computing power sponsored by primeLine.
131
-
132
- The model is published under this account by primeLine, but it is not a commercial product of primeLine Solutions GmbH.
133
 
134
- Please be aware that while we have tested and developed this model to the best of our abilities, errors may still occur.
135
 
136
- Use of this model is at your own risk. We do not accept liability for any incorrect outputs generated by this model.
137
- ```
 
3
  language:
4
  - de
5
  library_name: transformers
 
6
  model-index:
7
+ - name: whisper-large-v3-turbo-german
8
  results:
9
  - task:
10
  type: automatic-speech-recognition
 
14
  type: flozi00/asr-german-mixed
15
  metrics:
16
  - type: wer
17
+ value: TBD
 
18
  datasets:
19
  - flozi00/asr-german-mixed
 
20
  base_model:
21
  - primeline/whisper-large-v3-german
22
  ---
23
 
24
  ### Summary
25
+ Whisper is a speech recognition model developed by OpenAI. This checkpoint has been adapted to convert sign language input features into German text.
26
 
27
 
28
 
29
  ### Applications
30
+ The model is based on 'primeline/whisper-large-v3-german' and is used, in combination with Google MediaPipe, to translate videos of German sign language into text. It decodes a sequence of input features into text, where each input feature represents keypoints extracted from the video (hands, upper body, and face).
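
The keypoint-to-feature step described above could look roughly like the following sketch. It is an illustration only: the feature layout actually used by this model is not documented here, so the part names, their order, the zero-filling of undetected parts, and the helper names `frame_to_feature` and `video_to_features` are all assumptions. The landmark counts are those published for MediaPipe Holistic (33 pose, 468 face, 21 per hand).

```python
from typing import Dict, List, Optional, Sequence, Tuple

Landmark = Tuple[float, float, float]  # (x, y, z) as returned per landmark

# Landmark counts for MediaPipe Holistic; the order below is an assumption.
PART_SIZES: Dict[str, int] = {
    "pose": 33,
    "face": 468,
    "left_hand": 21,
    "right_hand": 21,
}
FEATURE_DIM = 3 * sum(PART_SIZES.values())  # 543 landmarks * 3 coordinates


def frame_to_feature(frame: Dict[str, Optional[Sequence[Landmark]]]) -> List[float]:
    """Flatten one frame's landmarks into a fixed-length vector.

    Parts that were not detected (missing or None) are zero-filled so that
    every frame has the same dimensionality.
    """
    feature: List[float] = []
    for part, size in PART_SIZES.items():
        landmarks = frame.get(part)
        if landmarks is None:
            feature.extend([0.0] * (size * 3))
            continue
        if len(landmarks) != size:
            raise ValueError(f"{part}: expected {size} landmarks, got {len(landmarks)}")
        for x, y, z in landmarks:
            feature.extend((x, y, z))
    return feature


def video_to_features(
    frames: Sequence[Dict[str, Optional[Sequence[Landmark]]]]
) -> List[List[float]]:
    """Convert a sequence of per-frame landmark dicts into a feature sequence."""
    return [frame_to_feature(frame) for frame in frames]


# Hypothetical two-frame clip: one frame with only the right hand, one empty.
clip = [{"right_hand": [(0.1, 0.2, 0.3)] * 21}, {}]
features = video_to_features(clip)
```

The resulting list has one vector per frame and would still need to be converted to a tensor and brought into whatever shape the model's preprocessing layers expect.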
 
31
 
32
+ We keep the decoder frozen while training the encoder.
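
Freezing one half of an encoder-decoder model in PyTorch generally means disabling gradients on its parameters and handing only the remaining parameters to the optimizer. The sketch below shows that pattern on a toy module; `TinySeq2Seq`, `freeze_module`, and the learning rate are hypothetical and not taken from this repository's training code. For a `WhisperForConditionalGeneration`, the decoder would be reached via `model.model.decoder`; whether the output projection is also frozen is not stated in the source.

```python
import torch
from torch import nn


def freeze_module(module: nn.Module) -> None:
    """Disable gradient updates for every parameter in `module`."""
    for param in module.parameters():
        param.requires_grad = False


class TinySeq2Seq(nn.Module):
    """Hypothetical stand-in with the same encoder/decoder split as Whisper."""

    def __init__(self) -> None:
        super().__init__()
        self.encoder = nn.Linear(8, 4)
        self.decoder = nn.Linear(4, 2)


toy = TinySeq2Seq()
freeze_module(toy.decoder)

# Only parameters that still require gradients are passed to the optimizer.
trainable = [p for p in toy.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)  # lr is an arbitrary placeholder
```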
33
 
34
  ## Evaluations - Word error rate
35
+ TBD
 
 
 
 
 
 
 
 
36
 
37
  ### Training data
38
+ TBD
 
39
 
40
  ### Training process
41
+ TBD
 
 
 
 
 
 
 
42
 
43
  ### How to use
 
44
  ```python
45
  import torch
46
+ from transformers import WhisperForConditionalGeneration, AutoProcessor, AutoTokenizer, TextStreamer
47
  from datasets import load_dataset
48
+
49
  device = "cuda:0" if torch.cuda.is_available() else "cpu"
50
  torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
51
 
52
+ # Load the model
53
+ model = WhisperForConditionalGeneration.from_pretrained(
54
+ "primeline/whisper-large-v3-turbo-german",
55
+ torch_dtype=torch_dtype,
56
+ low_cpu_mem_usage=True,
57
+ use_safetensors=True
58
+ ).to(device)
59
 
60
+ # Load the tokenizer for the model (for decoding)
61
+ tokenizer = AutoTokenizer.from_pretrained("primeline/whisper-large-v3-turbo-german")
62
 
63
+ # input preprocessing / feature extraction (TBD)
64
+ # input_features = ...
65
+ ```
66
 
67
+ #### Use raw model for inference
68
+ ```python
69
+ output = model(input_features, labels=generated_ids)
70
 
71
+ # e.g. output.loss
72
+ # output.logits.shape --> b x seq x vocab
73
 
74
+ tokenizer.batch_decode(generated_ids, skip_special_tokens=False)
75
+ ```
76
 
77
+ #### Use model with generate (work in progress)
78
+ ```python
79
+ streamer = TextStreamer(tokenizer, skip_special_tokens=False)  # only needed for streaming
80
 
81
+ # Generate
82
+ generated_ids = model.generate(
83
+ input_features,
84
+ max_new_tokens=128,
85
+ return_timestamps=False,  # timestamps are not supported
86
+ streamer=streamer  # only needed for streaming
87
+ )
88
 
89
+ tokenizer.batch_decode(generated_ids, skip_special_tokens=False)
90
  ```
 
 
 
 
 
91
 
92
+ ### Training
93
 
94
+ When changing the configuration of the preprocessing convolution layers, make sure the final output has the shape b x 1280 x seq (batch, channels, sequence length); 1280 is the hidden size of the Whisper large encoder. See the custom config in model.py for configuration options.
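
One way to check this constraint before building the model is to compute the output shape of a proposed layer stack by hand. The sketch below applies the standard `torch.nn.Conv1d` output-length formula to a list of layer descriptions. `ConvSpec` and `frontend_output_shape` are hypothetical helpers and do not reflect the actual config schema in model.py; the example values mirror the stock Whisper front end (two convolutions with kernel size 3, padding 1, strides 1 and 2, on 3000 input frames).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ConvSpec:
    """Hypothetical description of one Conv1d layer (not the model.py schema)."""

    out_channels: int
    kernel_size: int
    stride: int = 1
    padding: int = 0
    dilation: int = 1


def conv1d_out_len(length: int, spec: ConvSpec) -> int:
    """Output length of a Conv1d layer, following the torch.nn.Conv1d formula."""
    span = spec.dilation * (spec.kernel_size - 1) + 1
    return (length + 2 * spec.padding - span) // spec.stride + 1


def frontend_output_shape(
    batch: int, length: int, layers: List[ConvSpec]
) -> Tuple[int, int, int]:
    """Return (batch, channels, seq) after applying all layers in order."""
    if not layers:
        raise ValueError("at least one convolution layer is required")
    for spec in layers:
        length = conv1d_out_len(length, spec)
    return batch, layers[-1].out_channels, length


# Layer stack mirroring the stock Whisper front end.
whisper_like = [
    ConvSpec(out_channels=1280, kernel_size=3, stride=1, padding=1),
    ConvSpec(out_channels=1280, kernel_size=3, stride=2, padding=1),
]
shape = frontend_output_shape(batch=2, length=3000, layers=whisper_like)
assert shape[1] == 1280, "last conv layer must output 1280 channels"
print(shape)  # (2, 1280, 1500)
```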