---
license: apache-2.0
language:
- de
tags:
- sign-language
- whisper
- german
- safetensors
library_name: transformers
model-index:
- name: whisper-large-v3-turbo-german
  results:
  - task:
      type: automatic-speech-recognition
      name: Speech Recognition
    dataset:
      name: German ASR Data-Mix
      type: flozi00/asr-german-mixed
    metrics:
    - type: wer
      value: TBD
datasets:
- flozi00/asr-german-mixed
base_model:
- primeline/whisper-large-v3-german
---

### Summary

Whisper is a speech recognition model developed by OpenAI. This model has been specifically optimized for converting sign language input features into German text.

### Applications

The model is based on `primeline/whisper-large-v3-german` and is used, in combination with Google MediaPipe, to translate videos of German Sign Language into text. It decodes a sequence of input features into text, where each input feature represents keypoints (hands, upper body, and face) extracted from the video.

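The feature extraction for this model is still marked TBD, so the following is only a sketch of one plausible layout. It assumes one feature vector per video frame, built from MediaPipe Holistic landmarks (33 pose, 21 per hand, and 468 face landmarks, each with x, y, z). The helper names `frame_to_feature` and `video_to_features`, the landmark selection, and the zero-filling of missing detections are assumptions for illustration, not the author's pipeline.

```python
import numpy as np

# Landmark counts of MediaPipe Holistic (face mesh without iris refinement)
N_POSE, N_HAND, N_FACE = 33, 21, 468


def frame_to_feature(pose, left_hand, right_hand, face):
    """Flatten the (n, 3) landmark arrays of one frame into a single vector."""
    parts = []
    for landmarks, count in (
        (pose, N_POSE), (left_hand, N_HAND), (right_hand, N_HAND), (face, N_FACE)
    ):
        if landmarks is None:  # assumption: zero-fill parts that were not detected
            landmarks = np.zeros((count, 3), dtype=np.float32)
        parts.append(np.asarray(landmarks, dtype=np.float32).reshape(-1))
    return np.concatenate(parts)


def video_to_features(frames):
    """Stack per-frame vectors into an array of shape (1, features, seq)."""
    feats = np.stack([frame_to_feature(*frame) for frame in frames], axis=-1)
    return feats[None, ...]


# Random stand-in for the landmarks of a 40-frame clip
rng = np.random.default_rng(0)
frames = [
    (
        rng.random((N_POSE, 3)),
        rng.random((N_HAND, 3)),
        None,  # e.g. right hand not visible in this frame
        rng.random((N_FACE, 3)),
    )
    for _ in range(40)
]
input_features = video_to_features(frames)
print(input_features.shape)  # (1, 1629, 40)
```

The resulting layout is batch x features x frames, which matches the channel-first input that 1D convolution layers expect.
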
During training, the decoder is kept frozen while the encoder is trained.

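Freezing the decoder could look roughly like the sketch below. It uses a tiny, randomly initialised stock Whisper model so that it runs without downloading weights; the config sizes and the learning rate are hypothetical. The attribute path `model.model.decoder` is the one used by the stock `transformers` Whisper implementation, and the custom code in `model.py` may organise its modules differently.

```python
import torch
from transformers import WhisperConfig, WhisperForConditionalGeneration

# Tiny stand-in model (hypothetical sizes), not the released checkpoint
config = WhisperConfig(
    vocab_size=100, d_model=16, num_mel_bins=8,
    encoder_layers=1, decoder_layers=1,
    encoder_attention_heads=2, decoder_attention_heads=2,
    encoder_ffn_dim=32, decoder_ffn_dim=32,
    max_source_positions=10, max_target_positions=10,
    pad_token_id=0, bos_token_id=1, eos_token_id=2, decoder_start_token_id=1,
)
model = WhisperForConditionalGeneration(config)

# Freeze every decoder parameter; the encoder is left untouched
for param in model.model.decoder.parameters():
    param.requires_grad = False

# Hand only the parameters that still require gradients to the optimizer
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=1e-4)  # lr is an assumption

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(f"{len(trainable)} trainable tensors, all outside the decoder")
```

Filtering on `requires_grad` before building the optimizer keeps frozen weights out of the optimizer state.
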
### Evaluations - Word error rate

TBD

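Until results are published, the metric itself can be illustrated. Word error rate is commonly defined as (substitutions + deletions + insertions) divided by the number of reference words. The function below is a minimal pure-Python sketch of that definition; the name `word_error_rate` and the example sentences are made up for illustration and are not evaluation results for this model.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One deleted word ("heute") and one substituted word ("hause" -> "haus")
print(word_error_rate("ich gehe heute nach hause", "ich gehe nach haus"))  # 0.4
```
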
### Training data

TBD

### Training process

TBD

### How to use

```python
import torch
from transformers import WhisperForConditionalGeneration, AutoTokenizer, TextStreamer

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the model
model = WhisperForConditionalGeneration.from_pretrained(
    "mrprimenotes/sign-whisper-german",
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
).to(device)

# Load the tokenizer for the model (for decoding)
tokenizer = AutoTokenizer.from_pretrained("mrprimenotes/sign-whisper-german")

# Input preprocessing / feature extraction (TBD)
# input_features = ...  # must be on `device` with dtype `torch_dtype`
```

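#### Use raw model for inference

```python
# Teacher-forced forward pass; `labels` holds the token IDs of the
# reference text (shape: b x seq) and must be provided by the caller
output = model(input_features, labels=labels)

# e.g. output.loss
# output.logits.shape --> b x seq x vocab_size

# Most likely token at each position under teacher forcing
predicted_ids = output.logits.argmax(dim=-1)
tokenizer.batch_decode(predicted_ids, skip_special_tokens=False)
```
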
#### Use model with generate (work in progress)

```python
streamer = TextStreamer(tokenizer, skip_special_tokens=False)  # only needed for streaming

# Generate
generated_ids = model.generate(
    input_features,
    max_new_tokens=128,
    return_timestamps=False,  # timestamps are not supported
    streamer=streamer,  # only needed for streaming
)

tokenizer.batch_decode(generated_ids, skip_special_tokens=False)
```

### Training

When changing the configuration of the preprocessing convolution layers, make sure the final output has the shape b x 1280 x seq, where 1280 is the hidden size of the Whisper large encoder. See the custom config in `model.py` for the configuration options.

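A quick way to verify this constraint is to push a dummy batch through the convolution stack and inspect the output shape. The sketch below does not use the real config from `model.py`: the input channel count (1629, as if 543 landmarks x 3 coordinates) and the two-layer stack, which mirrors the kernel sizes and strides of stock Whisper's own `conv1`/`conv2`, are assumptions for illustration.

```python
import torch
from torch import nn

in_channels = 1629  # hypothetical: 543 landmarks x 3 coordinates
d_model = 1280      # hidden size expected by the encoder layers

# Hypothetical preprocessing stack
conv_stack = nn.Sequential(
    nn.Conv1d(in_channels, d_model, kernel_size=3, padding=1),
    nn.GELU(),
    nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1),
    nn.GELU(),
)

dummy = torch.randn(2, in_channels, 100)  # b x features x frames
with torch.no_grad():
    out = conv_stack(dummy)

# The batch size must be preserved and the channel dimension must be d_model
assert out.shape[0] == dummy.shape[0] and out.shape[1] == d_model, out.shape
print(tuple(out.shape))  # (2, 1280, 50)
```

With a stride of 2 in the second layer, the sequence length is halved (100 frames become 50 positions), so the stride and padding choices also determine how many positions reach the encoder.
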