There's an issue with the to() call in PyTorch: it is being passed the dtype as a string ('float32'), which to() then tries to parse as a device (e.g., 'cpu' or 'cuda').
File "C:\Users\svena.cache\huggingface\modules\transformers_modules\fixie-ai\ultravox-v0_5-llama-3_1-8b\779bcda5ad4b7ed18fd0a37f065a564ca18efa31\ultravox_model.py", line 313, in _create_multi_modal_projector
projector.to(config.torch_dtype)
File "C:\Users\svena\VSCodePython\Ultravox\TServer.venv\Lib\site-packages\torch\nn\modules\module.py", line 1302, in to
device, dtype, non_blocking, convert_to_format = torch._C._nn._parse_to(
^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Invalid device string: 'float32'
What's the code you're using to invoke the model? torch_dtype shouldn't be a string. It should be torch.float32.
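For anyone hitting the same error, here is a minimal sketch of the failure and one possible workaround. It assumes only that torch is installed; to_torch_dtype is a hypothetical helper written for this illustration, not part of PyTorch, transformers, or the Ultravox code.

```python
import torch


def to_torch_dtype(value):
    """Hypothetical helper: accept a torch.dtype or a name like 'float32'."""
    if isinstance(value, torch.dtype):
        return value
    if isinstance(value, str):
        # Accept both 'float32' and 'torch.float32'.
        candidate = getattr(torch, value.split(".")[-1], None)
        if isinstance(candidate, torch.dtype):
            return candidate
    raise ValueError(f"not a torch dtype: {value!r}")


layer = torch.nn.Linear(4, 2)

# A plain string is parsed by Module.to() as a device name, so this fails.
try:
    layer.to("float32")
except RuntimeError as err:
    print("to() rejected the string:", err)

# Converting the name to a real torch.dtype first works as expected.
layer.to(to_torch_dtype("bfloat16"))
print(layer.weight.dtype)
```

Whether the model code should do a conversion like this itself, or the config should always carry a real torch.dtype, is a design choice for the maintainers; the sketch only shows why the string form fails.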
I just followed the instructions on Windows (Python 3.10, CUDA-enabled PyTorch installed with pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124):
import transformers
import numpy as np
import librosa
pipe = transformers.pipeline(model='fixie-ai/ultravox-v0_5-llama-3_1-8b', trust_remote_code=True)
path = "" # TODO: pass the audio here
audio, sr = librosa.load(path, sr=16000)
turns = [
    {
        "role": "system",
        "content": "You are a friendly and helpful character. You love to answer questions for people."
    },
]
pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=30)
OK, after adding import torch and passing the dtype explicitly, it works:
pipe = transformers.pipeline(
    model='fixie-ai/ultravox-v0_5-llama-3_2-1b',
    torch_dtype=torch.float32,  # explicit dtype specification
    device=0 if torch.cuda.is_available() else -1,  # 0 = first GPU, -1 = CPU
    trust_remote_code=True,
)
Hmm, the pipeline should work out of the box without specifying dtype. You're right, that's a bug. I'll take a look.
By the way, bfloat16 is recommended if your hardware supports it, since it takes less memory without loss of performance (all of our training and benchmarks are already in bfloat16).
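If it helps, here is a rough sketch of how one might pick the dtype at runtime before building the pipeline. pick_dtype is a hypothetical helper name, and falling back to float32 when bfloat16 is unavailable is an assumption for this example, not an official recommendation.

```python
import torch


def pick_dtype():
    """Hypothetical helper: prefer bfloat16 on GPUs that support it."""
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
        return torch.bfloat16
    # Assumed fallback for CPUs and older GPUs.
    return torch.float32


dtype = pick_dtype()
print(dtype)
```

The result can then be passed as torch_dtype=dtype in the transformers.pipeline(...) call shown earlier in the thread.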