Compatibility with transformers in Python

#6
by Fractalapps - opened

Are the ONNX files compatible with transformers in Python/PyTorch?

I've been trying to load the model locally using PyTorch and the config parameters; here is the code:

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Load the ONNX model
model_path = "./DeepSeek-R1-Distill-Qwen-1.5B-ONNX/onnx/model_q4f16.onnx"
session = ort.InferenceSession(model_path)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("./DeepSeek-R1-Distill-Qwen-1.5B-ONNX")

# Tokenize the input prompt
prompt = "give a solution to 56*986-32"
inputs = tokenizer(prompt, return_tensors="pt")

# Convert inputs to numpy arrays with correct data types
input_ids = inputs["input_ids"].numpy().astype(np.int64)            # should be int64
attention_mask = inputs["attention_mask"].numpy().astype(np.int64)  # should be int64

# Generate position_ids (if required)
seq_length = input_ids.shape[1]
position_ids = np.arange(seq_length, dtype=np.int64).reshape(1, -1)  # should be int64

# Initialize past_key_values based on the model's configuration
num_layers = 28                  # number of transformer layers
batch_size = input_ids.shape[0]  # batch size (usually 1 for single prompts)
num_attention_heads = 12         # number of attention heads
num_key_value_heads = 2          # number of key/value heads
head_dim = 128                   # dimension of each attention head (hidden_size / num_attention_heads)
past_seq_length = 0              # initial sequence length for past_key_values

# Shape for past_key_values: (2, batch_size, num_key_value_heads, past_seq_length, head_dim)
past_shape = (2, batch_size, num_key_value_heads, past_seq_length, head_dim)
past_key_values = [np.zeros(past_shape, dtype=np.float16) for _ in range(num_layers)]

# Prepare the input dictionary
input_feed = {
    "input_ids": input_ids,
    "attention_mask": attention_mask,
    "position_ids": position_ids,
}

# Add past_key_values to the input feed
for i in range(num_layers):
    input_feed[f"past_key_values.{i}.key"] = past_key_values[i][0]    # key tensor
    input_feed[f"past_key_values.{i}.value"] = past_key_values[i][1]  # value tensor

# Run the model with the corrected data types
outputs = session.run(None, input_feed)

# Extract logits and updated past_key_values
logits = outputs[0]                    # shape: (batch_size, seq_length, vocab_size)
updated_past_key_values = outputs[1:]  # updated past_key_values for the next step

# Decode the output tokens back to text
output_ids = logits.argmax(axis=-1)  # get the most likely token IDs
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

print("Model Output:", output_text)
```

But I get:
Model Output:
an for the give0718*.

16*

which looks like the output tokens are being decoded incorrectly.
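
If I understand correctly, `logits.argmax(axis=-1)` picks a most-likely *next* token for every position of the prompt, so decoding all of those predictions gives a scrambled-looking string rather than a generated continuation. I assume real generation needs an autoregressive loop; below is a minimal sketch of what I think that looks like (it re-feeds the whole sequence each step instead of reusing the returned key/value cache, and it assumes the first output of the graph is the logits):

```python
# Minimal greedy-decoding sketch (my assumption of how this graph should be driven).
generated = input_ids
for _ in range(64):  # max new tokens
    outputs = session.run(None, input_feed)
    logits = outputs[0]                            # (batch, seq_len, vocab)
    next_token = logits[:, -1, :].argmax(axis=-1)  # most likely token after the last position
    next_token = next_token.reshape(1, 1).astype(np.int64)
    generated = np.concatenate([generated, next_token], axis=1)
    if next_token[0, 0] == tokenizer.eos_token_id:
        break
    # Re-feed the full sequence each step; past_key_values stay zero-length.
    input_feed["input_ids"] = generated
    input_feed["attention_mask"] = np.ones_like(generated)
    input_feed["position_ids"] = np.arange(generated.shape[1], dtype=np.int64).reshape(1, -1)

print(tokenizer.decode(generated[0], skip_special_tokens=True))
```

A proper incremental version would feed only the new token plus the returned key/value outputs back in as `past_key_values.{i}.key` / `.value`, but the full re-feed should be enough to check whether the model itself behaves correctly.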
