opencampus/sign-whisper-german

	---
	license: apache-2.0
	language:
	- de
	tags:
	- sign-language
	- whisper
	- german
	- safetensors
	library_name: transformers
	model-index:
	- name: whisper-large-v3-turbo-german
	  results:
	  - task:
	      type: automatic-speech-recognition
	      name: Speech Recognition
	    dataset:
	      name: German ASR Data-Mix
	      type: flozi00/asr-german-mixed
	    metrics:
	    - type: wer
	      value: TBD
	datasets:
	- flozi00/asr-german-mixed
	base_model:
	- primeline/whisper-large-v3-german
	---

	### Summary
	Whisper is a speech recognition model developed by OpenAI. This model is a fine-tuned variant that converts sign language input features into German text.



	### Applications
	The model is based on 'primeline/whisper-large-v3-german' and is used, in combination with Google MediaPipe, to translate a video of German sign language into text. It decodes a sequence of input features into text, where each input feature represents the keypoints extracted from one video frame (hands, upper body and face).
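The feature extraction for this model is not documented yet, so the following is only a minimal sketch of how per-frame keypoints could be flattened into a feature sequence. The landmark counts (33 pose, 21 per hand, 468 face, each with x, y, z) follow MediaPipe Holistic; the helper names `frame_to_feature` and `video_to_features`, the zero-filling of missing parts, and the `(feature_dim, seq)` layout are assumptions made for illustration, not the layout this model was trained on.

```python
import numpy as np

# Landmark counts per frame as produced by MediaPipe Holistic.
# Which of them this model actually uses is an assumption.
N_POSE, N_HAND, N_FACE = 33, 21, 468


def frame_to_feature(pose, left_hand, right_hand, face):
    """Flatten one frame's landmarks into a single 1-D feature vector.

    Each argument is an array of shape (n_landmarks, 3), or None when the
    detector found nothing; missing parts are filled with zeros.
    """
    parts = []
    for arr, n in ((pose, N_POSE), (left_hand, N_HAND),
                   (right_hand, N_HAND), (face, N_FACE)):
        if arr is None:
            arr = np.zeros((n, 3), dtype=np.float32)
        parts.append(np.asarray(arr, dtype=np.float32).reshape(-1))
    return np.concatenate(parts)


def video_to_features(frames):
    """Stack per-frame vectors into an array of shape (feature_dim, seq)."""
    return np.stack([frame_to_feature(*f) for f in frames], axis=1)


# Random stand-in landmarks for a 10-frame clip with no left hand detected
rng = np.random.default_rng(0)
frames = [
    (rng.random((N_POSE, 3)), None, rng.random((N_HAND, 3)), rng.random((N_FACE, 3)))
    for _ in range(10)
]
features = video_to_features(frames)
print(features.shape)  # (1629, 10)
```

Adding a leading batch dimension to such an array gives a `b x feature_dim x seq` tensor, which is the kind of input a 1-D convolutional front end consumes.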

	The decoder is kept frozen during training; only the encoder is trained.
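As a rough illustration of that setup, the sketch below freezes the decoder of a toy encoder/decoder module and trains only the encoder. `TinySeq2Seq` is a hypothetical stand-in, not this model's architecture. In a standard `WhisperForConditionalGeneration` the corresponding submodules would be `model.model.encoder` and `model.model.decoder`; whether the custom code in model.py keeps that layout is an assumption.

```python
import torch
from torch import nn


class TinySeq2Seq(nn.Module):
    """Toy stand-in with an encoder/decoder split (hypothetical)."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(8, 8)
        self.decoder = nn.Linear(8, 8)

    def forward(self, x):
        return self.decoder(self.encoder(x))


toy = TinySeq2Seq()

# Freeze the decoder: its parameters no longer receive gradients
for p in toy.decoder.parameters():
    p.requires_grad = False

# Hand only the trainable (encoder) parameters to the optimizer
trainable = [p for p in toy.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

encoder_before = toy.encoder.weight.detach().clone()
decoder_before = toy.decoder.weight.detach().clone()

# One training step: gradients still flow through the frozen decoder
# back into the encoder
loss = toy(torch.randn(4, 8)).pow(2).mean()
loss.backward()
optimizer.step()

encoder_changed = not torch.equal(toy.encoder.weight, encoder_before)
decoder_unchanged = torch.equal(toy.decoder.weight, decoder_before)
```

Filtering on `requires_grad` before building the optimizer keeps the frozen weights out of the optimizer state entirely, which also saves memory on a large decoder.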

	### Evaluations - Word error rate
	TBD

	### Training data
	TBD

	### Training process
	TBD

	### How to use
	```python
	import torch
	from transformers import WhisperForConditionalGeneration, AutoProcessor, AutoTokenizer, TextStreamer
	from datasets import load_dataset

	device = "cuda:0" if torch.cuda.is_available() else "cpu"
	torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

	# Load model
	model = WhisperForConditionalGeneration.from_pretrained(
	    "mrprimenotes/sign-whisper-german",
	    torch_dtype=torch_dtype,
	    low_cpu_mem_usage=True,
	    use_safetensors=True
	).to(device)

	# Load the tokenizer for the model (for decoding)
	tokenizer = AutoTokenizer.from_pretrained("mrprimenotes/sign-whisper-german")

	# input preprocessing / feature extraction (TBD)
	# input_features = ...
	```

	#### Use raw model for inference
	```python
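	# labels: token ids of the reference text, shape b x seq
	output = model(input_features, labels=labels)

	# e.g. output.loss
	# output.logits.shape --> b x seq x vocab_size

	# Greedy per-position prediction from the logits
	predicted_ids = output.logits.argmax(dim=-1)
	tokenizer.batch_decode(predicted_ids, skip_special_tokens=False)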
	```

	#### Use model with generate (work in progress...)
	```python
	streamer = TextStreamer(tokenizer, skip_special_tokens=False)  # only needed for streaming

	# Generate
	generated_ids = model.generate(
	    input_features,
	    max_new_tokens=128,
	    return_timestamps=False,  # timestamps are not supported
	    streamer=streamer  # only needed for streaming
	)

	tokenizer.batch_decode(generated_ids, skip_special_tokens=False)
	```

	### Training

	When changing the configuration of the preprocessing convolution layers, make sure their final output has the shape b x 1280 x seq, where 1280 is the hidden size of the Whisper large-v3 encoder. See the custom config in model.py for configuration options.
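A quick way to verify that constraint is to push a dummy batch through the convolution stack and check the output shape. The sketch below is an assumption-laden example, not the configuration from model.py: the two-layer layout mirrors the kernel sizes, stride and padding of the stock Whisper encoder front end, and `FEATURE_DIM` is a hypothetical per-frame keypoint feature size.

```python
import torch
from torch import nn

D_MODEL = 1280       # hidden size the encoder layers expect
FEATURE_DIM = 1629   # hypothetical keypoint feature size per frame

# Hypothetical front end: kernel 3 / padding 1, then kernel 3 / stride 2 / padding 1
preprocess = nn.Sequential(
    nn.Conv1d(FEATURE_DIM, D_MODEL, kernel_size=3, padding=1),
    nn.GELU(),
    nn.Conv1d(D_MODEL, D_MODEL, kernel_size=3, stride=2, padding=1),
    nn.GELU(),
)

x = torch.randn(2, FEATURE_DIM, 100)  # b x feature_dim x frames
with torch.no_grad():
    out = preprocess(x)

# The channel dimension must match the encoder's hidden size
assert out.shape[0] == x.shape[0]
assert out.shape[1] == D_MODEL
print(tuple(out.shape))  # (2, 1280, 50)
```

The stride-2 layer halves the sequence length here, so any change to strides or kernel sizes also changes `seq` and should be checked against the sequence length the encoder expects.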