---
license: apache-2.0
language:
- de
tags:
- sign-language
- whisper
- german
- safetensors
library_name: transformers
model-index:
- name: whisper-large-v3-turbo-german
  results:
  - task:
      type: automatic-speech-recognition
      name: Speech Recognition
    dataset:
      name: German ASR Data-Mix
      type: flozi00/asr-german-mixed
    metrics:
    - type: wer
      value: TBD
datasets:
- flozi00/asr-german-mixed
base_model:
- primeline/whisper-large-v3-german
---

### Summary

Whisper is a speech recognition model developed by OpenAI. This model has been adapted to convert sign language input features into German text.

### Applications

The model is based on `primeline/whisper-large-v3-german` and is used, in combination with Google MediaPipe, to translate a video of German sign language into text.

It decodes a sequence of input features into text, where each input feature represents keypoints extracted from a video frame (hands, upper body and face).

During training, the decoder is kept frozen and only the encoder is trained.

## Evaluations - Word error rate

TBD

### Training data

TBD

### Training process

TBD

### How to use

```python
import torch
from transformers import WhisperForConditionalGeneration, AutoProcessor, AutoTokenizer, TextStreamer
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the model
model = WhisperForConditionalGeneration.from_pretrained(
    "mrprimenotes/sign-whisper-german",
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True
).to(device)

# Load the tokenizer for the model (for decoding)
tokenizer = AutoTokenizer.from_pretrained("mrprimenotes/sign-whisper-german")

# input preprocessing / feature extraction (TBD)
# input_features = ...
```

#### Use raw model for inference

```python
# generated_ids holds the label token ids here
output = model(input_features, labels=generated_ids)

# e.g.
output.loss
# output.shape --> b x sq

tokenizer.batch_decode(generated_ids, skip_special_tokens=False)
```

### Use model with generate (work in progress...)

```python
streamer = TextStreamer(tokenizer, skip_special_tokens=False)  # only needed for streaming

# Generate
generated_ids = model.generate(
    input_features,
    max_new_tokens=128,
    return_timestamps=False,  # timestamps are not supported
    streamer=streamer  # only needed for streaming
)

tokenizer.batch_decode(generated_ids, skip_special_tokens=False)
```

### Training

When changing the configuration of the preprocessing convolution layers, make sure the last output has the shape b x 1280 x seq. See the custom config in model.py for configuration options.
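
The shape requirement above can be checked before training with a small helper. The following is a minimal sketch that does not use the actual config schema from model.py: `ConvSpec`, `conv_out_len` and `check_stack` are hypothetical names, and the 258 input channels and 3000 input frames in the example are illustrative assumptions. It only applies the standard output-length formula for 1-D convolutions and verifies that the final channel count is 1280.

```python
from dataclasses import dataclass


@dataclass
class ConvSpec:
    """Hypothetical description of one 1-D convolution layer.

    This is NOT the config schema from model.py; it only carries the
    parameters needed to compute output shapes.
    """
    in_channels: int
    out_channels: int
    kernel_size: int
    stride: int = 1
    padding: int = 0
    dilation: int = 1


def conv_out_len(length: int, spec: ConvSpec) -> int:
    """Output length of a 1-D convolution (standard floor formula)."""
    span = spec.dilation * (spec.kernel_size - 1) + 1
    return (length + 2 * spec.padding - span) // spec.stride + 1


def check_stack(specs, in_channels, seq_len, expected_channels=1280):
    """Walk through the layers and return (channels, seq) of the last output.

    Raises ValueError if consecutive layers do not fit together, if the
    sequence collapses to zero length, or if the final channel count
    differs from expected_channels.
    """
    channels, length = in_channels, seq_len
    for i, spec in enumerate(specs):
        if spec.in_channels != channels:
            raise ValueError(
                f"layer {i}: expects {spec.in_channels} channels, got {channels}"
            )
        length = conv_out_len(length, spec)
        if length < 1:
            raise ValueError(f"layer {i}: sequence too short for this layer")
        channels = spec.out_channels
    if channels != expected_channels:
        raise ValueError(
            f"last output has {channels} channels, expected {expected_channels}"
        )
    return channels, length


# Illustrative stack: 258 keypoint values per frame is an assumption.
specs = [
    ConvSpec(in_channels=258, out_channels=1280, kernel_size=3, padding=1),
    ConvSpec(in_channels=1280, out_channels=1280, kernel_size=3, stride=2, padding=1),
]
print(check_stack(specs, in_channels=258, seq_len=3000))  # (1280, 1500)
```

The batch dimension b is left out because convolutions do not change it; only the channel and sequence dimensions need checking.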