---
license: apache-2.0
language:
- de
tags:
- sign-language
- whisper
- german
- safetensors
library_name: transformers
model-index:
- name: whisper-large-v3-turbo-german
results:
- task:
type: automatic-speech-recognition
name: Speech Recognition
dataset:
name: German ASR Data-Mix
type: flozi00/asr-german-mixed
metrics:
- type: wer
value: TBD
datasets:
- flozi00/asr-german-mixed
base_model:
- primeline/whisper-large-v3-german
---
### Summary
Whisper is a powerful speech recognition model family developed by OpenAI. This model has been specially adapted to convert sign-language input features into German text.
### Applications
The model is based on 'primeline/whisper-large-v3-german' and is used (in combination with Google MediaPipe) to translate videos of German sign language into text. It decodes a sequence of input features, where each input feature represents keypoints extracted from a video frame (hands, upper body and face), into text.
We keep the decoder frozen while training the encoder.
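This encoder-only training setup can be sketched by disabling gradients for the decoder's parameters. The snippet below is illustrative only (the actual training code is not published); it uses a tiny toy Whisper config so it runs without downloading weights:

```python
import torch
from transformers import WhisperConfig, WhisperForConditionalGeneration

# Toy config so the sketch runs standalone; the real model uses the
# whisper-large-v3 architecture (d_model=1280).
cfg = WhisperConfig(
    d_model=64, encoder_layers=1, decoder_layers=1,
    encoder_attention_heads=2, decoder_attention_heads=2,
    encoder_ffn_dim=128, decoder_ffn_dim=128,
)
model = WhisperForConditionalGeneration(cfg)

# Freeze the decoder; only the encoder weights stay trainable.
for param in model.model.decoder.parameters():
    param.requires_grad = False

trainable = {n for n, p in model.named_parameters() if p.requires_grad}
assert not any(n.startswith("model.decoder") for n in trainable)
```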
### Evaluations - Word error rate
TBD
### Training data
TBD
### Training process
TBD
### How to use
```python
import torch
from transformers import WhisperForConditionalGeneration, AutoTokenizer, TextStreamer

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load model
model = WhisperForConditionalGeneration.from_pretrained(
    "mrprimenotes/sign-whisper-german",
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True
).to(device)

# Load the tokenizer for the model (for decoding)
tokenizer = AutoTokenizer.from_pretrained("mrprimenotes/sign-whisper-german")

# input preprocessing / feature extraction (TBD)
# input_features = ...
```
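The official feature-extraction code is still TBD. As a rough illustration only, per-frame MediaPipe Holistic keypoints (33 pose + 468 face + 2 × 21 hand landmarks = 543 points) could be flattened and stacked into a feature sequence as below; the function name and shapes are assumptions, not the actual pipeline:

```python
import numpy as np

# Hypothetical shapes: 543 landmarks per frame (MediaPipe Holistic),
# each with (x, y, z) coordinates.
N_LANDMARKS = 543

def frames_to_features(frames: list) -> np.ndarray:
    """Stack per-frame (543, 3) keypoint arrays into a (seq, 1629) matrix.

    Missing detections are assumed to be zero-filled upstream.
    """
    feats = [f.reshape(-1) for f in frames]  # each frame -> (1629,)
    return np.stack(feats, axis=0)           # (seq, 1629)

# Example with random stand-in keypoints for a 100-frame clip
frames = [np.random.rand(N_LANDMARKS, 3).astype(np.float32) for _ in range(100)]
features = frames_to_features(frames)
assert features.shape == (100, 1629)
```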
#### Use raw model for inference
```python
# labels: tokenized target transcriptions (shape b x seq),
# e.g. tokenizer(text, return_tensors="pt").input_ids
output = model(input_features, labels=labels)
# e.g. output.loss
# output.logits.shape --> b x seq x vocab
predicted_ids = output.logits.argmax(dim=-1)
tokenizer.batch_decode(predicted_ids, skip_special_tokens=False)
```
#### Use model with generate (work in progress...)
```python
streamer = TextStreamer(tokenizer, skip_special_tokens=False)  # only needed for streaming

# Generate
generated_ids = model.generate(
    input_features,
    max_new_tokens=128,
    return_timestamps=False,  # timestamps are not supported
    streamer=streamer  # only needed for streaming
)
tokenizer.batch_decode(generated_ids, skip_special_tokens=False)
```
### Training
When changing the configuration of the preprocessing convolution layers, make sure the last layer's output has shape b x 1280 x seq (1280 is the feature dimension the Whisper large encoder expects). See the custom config in model.py for the available configuration options.
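A quick sanity check for such a change can assert the output shape with a stand-in convolution stack. The layer sizes and the input feature dimension below are illustrative, not the actual model.py configuration:

```python
import torch
import torch.nn as nn

# Illustrative preprocessing stack -- the real layer configuration lives in
# model.py; only the final channel count (1280) matters for compatibility.
IN_FEATURES = 1629  # assumed flattened keypoint dimension (hypothetical)
preprocess = nn.Sequential(
    nn.Conv1d(IN_FEATURES, 640, kernel_size=3, padding=1),
    nn.GELU(),
    nn.Conv1d(640, 1280, kernel_size=3, padding=1),
)

x = torch.randn(2, IN_FEATURES, 100)  # b x features x seq
out = preprocess(x)
# Last output must be b x 1280 x seq for the Whisper encoder.
assert out.shape == (2, 1280, 100), out.shape
```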