---
license: apache-2.0
language:
- de
tags:
- sign-language
- whisper
- german
- safetensors
library_name: transformers
model-index:
- name: whisper-large-v3-turbo-german
  results:
  - task:
      type: automatic-speech-recognition
      name: Speech Recognition
    dataset:
      name: German ASR Data-Mix
      type: flozi00/asr-german-mixed
    metrics:
    - type: wer
      value: TBD
datasets:
- flozi00/asr-german-mixed
base_model:
- primeline/whisper-large-v3-german
---

### Summary
Whisper is a speech recognition model developed by OpenAI. This variant has been adapted to convert sign language input features into German text.


### Applications
The model is based on `primeline/whisper-large-v3-german` and is used, in combination with Google MediaPipe, to translate videos of German Sign Language into text. It decodes a sequence of input features into text, where each input feature represents keypoints extracted from the video (hands, upper body and face).

The decoder is kept frozen while the encoder is trained.
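
A minimal sketch of one common way to freeze the decoder before training. This is not necessarily how the model was trained: `freeze_decoder` is a hypothetical helper name, and it only assumes that the model exposes `get_decoder()`, as `WhisperForConditionalGeneration` does.

```python
def freeze_decoder(model):
    """Disable gradients for every decoder parameter.

    Returns (trainable, total) parameter counts so the effect can be checked.
    Hypothetical helper, not part of this repository.
    """
    # Parameters with requires_grad=False are skipped by autograd
    for param in model.get_decoder().parameters():
        param.requires_grad = False

    params = list(model.parameters())
    trainable = sum(p.numel() for p in params if p.requires_grad)
    total = sum(p.numel() for p in params)
    return trainable, total


# Usage, with `model` loaded as in the "How to use" section:
# trainable, total = freeze_decoder(model)
# print(f"trainable parameters: {trainable} / {total}")
```

Only parameters that still require gradients should then be passed to the optimizer, for example `[p for p in model.parameters() if p.requires_grad]`.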

### Evaluations - Word error rate
TBD

### Training data
TBD

### Training process
TBD

### How to use
```python
import torch
from transformers import WhisperForConditionalGeneration, AutoProcessor, AutoTokenizer, TextStreamer
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load model
model = WhisperForConditionalGeneration.from_pretrained(
    "mrprimenotes/sign-whisper-german",
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True
).to(device)

# Load the tokenizer for the model (for decoding)
tokenizer = AutoTokenizer.from_pretrained("mrprimenotes/sign-whisper-german")

# input preprocessing / feature extraction (TBD)
# input_features = ...
```

#### Use raw model for inference
```python
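# Token ids of the reference text (placeholder example); only needed to compute the loss
labels = tokenizer(["Hallo Welt"], return_tensors="pt").input_ids.to(device)

output = model(input_features, labels=labels)

# e.g. output.loss
# output.logits.shape --> b x seq x vocab_size

# Most likely token at each position (teacher-forced), shape b x seq
predicted_ids = output.logits.argmax(dim=-1)
tokenizer.batch_decode(predicted_ids, skip_special_tokens=False)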
```

#### Use model with generate (work in progress)
```python
streamer = TextStreamer(tokenizer, skip_special_tokens=False)  # only needed for streaming

# Generate
generated_ids = model.generate(
    input_features,
    max_new_tokens=128,
    return_timestamps=False,  # timestamps are not supported
    streamer=streamer  # only needed for streaming
)

tokenizer.batch_decode(generated_ids, skip_special_tokens=False)
```

### Training

When changing the configuration of the preprocessing convolution layers, make sure the output of the last layer has the shape b x 1280 x seq (batch x hidden size x sequence length). See the custom config in model.py for configuration options.
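
The shape constraint can be checked without building the model. The sketch below applies the standard 1-D convolution output-length formula (the same one `torch.nn.Conv1d` uses) to a list of layer specs. The dictionary keys and the helper names `conv1d_out_len` and `check_conv_stack` are assumptions for illustration and do not reflect the actual config format in model.py. The example stack mirrors the front end of the stock Whisper encoder, which maps 3000 input frames to 1500 positions.

```python
def conv1d_out_len(length, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1-D convolution (same formula as torch.nn.Conv1d)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def check_conv_stack(layers, seq_len, d_model=1280):
    """Return the output sequence length of a conv stack.

    Raises ValueError if the last layer does not produce d_model channels.
    """
    if layers[-1]["out_channels"] != d_model:
        raise ValueError(
            f"last layer has {layers[-1]['out_channels']} channels, expected {d_model}"
        )
    for layer in layers:
        seq_len = conv1d_out_len(
            seq_len,
            layer["kernel_size"],
            layer.get("stride", 1),
            layer.get("padding", 0),
        )
    return seq_len


# Hypothetical layer specs, modelled on the stock Whisper encoder front end
layers = [
    {"out_channels": 1280, "kernel_size": 3, "stride": 1, "padding": 1},
    {"out_channels": 1280, "kernel_size": 3, "stride": 2, "padding": 1},
]
out_len = check_conv_stack(layers, seq_len=3000)
print(out_len)  # 1500, i.e. the output shape is b x 1280 x 1500
```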