CUDA error: misaligned address

#24
by msi-sbraun-11 - opened

Hi there,

I am trying to use the Gemma 3 12B IT model to generate QA pairs. The pipeline is defined as follows:

model_id = "google/gemma-3-12b-it" # google/gemma-3-12b-it

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype = torch.float32,
        device_map="cuda",
        quantization_config=bnb_config
        )
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    if tokenizer.pad_token is None:
        eos_token_id = model.config.eos_token_id
        eos_token = tokenizer.decode(eos_token_id)
        tokenizer.pad_token = eos_token  # this is a string, which is expected

    text_gen_pipeline = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=512,
        torch_dtype=torch.float32, 
        top_p = 0.95,
        top_k = 70,
        temperature = 1.25,
        do_sample=True,
        repetition_penalty=1.3,
    )

    llm = HuggingFacePipeline(pipeline=text_gen_pipeline)

    model = ChatHuggingFace(llm=llm)

When I call this model via the invoke function, at some point it throws the following error:

  File "/home/nokia-proj/miniconda3/envs/vrag/lib/python3.10/site-packages/transformers/integrations/sdpa_attent
ion.py", line 54, in sdpa_attention_forward
    attn_output = torch.nn.functional.scaled_dot_product_attention(
RuntimeError: CUDA error: misaligned address

Any ideas why this error was encountered and how to resolve this?

Thank you!

Hi @msi-sbraun-11 ,

Welcome to the Google Gemma family of open models. As I can see in your code, you are passing a HuggingFacePipeline to ChatHuggingFace, which is not supported if you imported ChatHuggingFace from the following import statements:

from langchain.llms import HuggingFacePipeline
from langchain_community.chat_models import ChatHuggingFace
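
One option, assuming the langchain-huggingface partner package is installed in your environment (this is an assumption, not something visible in your snippet), would be to import both classes from that package instead, since its ChatHuggingFace is built to wrap a HuggingFacePipeline:

# pip install langchain-huggingface  (assumed to be available; adjust to your setup)
from langchain_huggingface import HuggingFacePipeline, ChatHuggingFace

llm = HuggingFacePipeline(pipeline=text_gen_pipeline)
chat_model = ChatHuggingFace(llm=llm)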

The above issue might also occur due to a data-type conflict: BitsAndBytesConfig is designed to work best with lower-precision data types like torch.bfloat16 or torch.float16, and mixing float32 with 4-bit quantization can lead to alignment problems during computation. If you would like to run the model with full 32-bit precision, it is recommended not to use quantization; if you do want quantization, torch.bfloat16 works best.
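
For example, a minimal sketch of loading the model with 4-bit quantization and bfloat16 throughout might look like this (the bnb_4bit_compute_dtype and bnb_4bit_quant_type values are illustrative defaults, not taken from your code):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # keep compute dtype consistent with the model dtype
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-12b-it",
    torch_dtype=torch.bfloat16,  # avoid mixing float32 with 4-bit weights
    device_map="cuda",
    quantization_config=bnb_config,
)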

If you are facing an issue with the SDPA attention mechanism, you can try disabling it with the following code.

import os
os.environ['TORCH_DISABLE_SDPA'] = '1'
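
Alternatively, transformers lets you opt out of SDPA on a per-model basis by requesting the eager attention implementation when loading the model. A minimal sketch, reusing the model_id and bnb_config from your snippet:

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    quantization_config=bnb_config,
    attn_implementation="eager",  # use the eager attention path instead of SDPA
)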

Please let me know if you require any further assistance.

Thanks.

Hi @BalakrishnaCh ,
Thank you for your response.
Could you provide the code with the suggested fixes so that it is easier for me to run and analyse?
Thank you.

@msi-sbraun-11 , to help you out further, could you please provide the missing parts of your code (where you are importing the above-mentioned imports from) so that it is executable, along with the model.generate() call and the prompt you are using? That way I can better assist you with the issue.
