Add onnx files #8
by jbochi - opened
Files were created with this command:
$ optimum-cli export onnx \
--model grammarly/coedit-large \
--task text2text-generation-with-past \
--optimize O3 \
coedit-large-onnx/
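For reference, roughly the same export can also be done from Python with optimum's ORTModel classes. This is only a sketch and, unlike the CLI call above, it skips the --optimize O3 graph optimizations:
# Rough Python-API equivalent of the CLI export (sketch only; no O3 optimization pass).
from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoTokenizer

model = ORTModelForSeq2SeqLM.from_pretrained("grammarly/coedit-large", export=True)
tokenizer = AutoTokenizer.from_pretrained("grammarly/coedit-large")
model.save_pretrained("coedit-large-onnx/")
tokenizer.save_pretrained("coedit-large-onnx/")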
There were a few warnings, but the diffs seem small enough:
Validation for the model coedit-large-onnx/decoder_model_merged.onnx raised: The maximum absolute difference between the output of the reference model and the ONNX exported model is not within the set tolerance 1e-05:
- present.0.encoder.key: max diff = 3.0517578125e-05
- present.2.decoder.key: max diff = 1.1920928955078125e-05
- present.2.decoder.value: max diff = 1.2740492820739746e-05
...
The ONNX export succeeded with the warning: The maximum absolute difference between the output of the reference model and the ONNX exported model is not within the set tolerance 1e-05:
- logits: max diff = 0.0004119873046875
- present.23.decoder.value: max diff = 6.103515625e-05.
The exported model was saved at: coedit-large-onnx
...
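For anyone who wants to reproduce the logits diff outside the exporter, here is a rough sketch of a single-forward-pass comparison (the prompt is arbitrary):
# Sketch: compare decoder logits between the PyTorch and ONNX models on one forward pass.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration
from optimum.onnxruntime import ORTModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("grammarly/coedit-large")
torch_model = T5ForConditionalGeneration.from_pretrained("grammarly/coedit-large")
onnx_model = ORTModelForSeq2SeqLM.from_pretrained("coedit-large-onnx/")

enc = tokenizer("Fix the grammar: She no went to the market.", return_tensors="pt")
decoder_input_ids = torch.tensor([[torch_model.config.decoder_start_token_id]])

with torch.no_grad():
    ref = torch_model(**enc, decoder_input_ids=decoder_input_ids).logits
out = onnx_model(**enc, decoder_input_ids=decoder_input_ids).logits

# Expect something on the order of the ~4e-4 max diff reported by the exporter.
print((ref - out).abs().max())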
I tested it with the code below. Note that ONNX is about 1.8x faster on CPU than the transformers implementation.
In [1]: from transformers import AutoTokenizer, T5ForConditionalGeneration
In [2]: from optimum.onnxruntime import ORTModelForSeq2SeqLM
In [4]: model = ORTModelForSeq2SeqLM.from_pretrained('./onnx', device="auto")
In [6]: torch_model = T5ForConditionalGeneration.from_pretrained("Grammarly/coedit-large")
In [7]: text = "Rewrite to make this easier to understand: A storm surge is what forecasters consider a hurricane's most treacherous aspect."
In [9]: tokenizer = AutoTokenizer.from_pretrained("grammarly/coedit-large")
In [10]: input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
In [11]: %time outputs = model.generate(input_ids=input_ids)
/opt/homebrew/Caskroom/miniconda/base/lib/python3.11/site-packages/transformers/generation/utils.py:1273: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
warnings.warn(
CPU times: user 2.2 s, sys: 178 ms, total: 2.38 s
Wall time: 399 ms
In [12]: %time torch_outputs = torch_model.generate(input_ids=input_ids)
/opt/homebrew/Caskroom/miniconda/base/lib/python3.11/site-packages/transformers/generation/utils.py:1273: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
warnings.warn(
CPU times: user 721 ms, sys: 28.7 ms, total: 750 ms
Wall time: 723 ms
In [13]: torch_outputs == outputs
Out[13]:
tensor([[True, True, True, True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True]])
In [14]: tokenizer.decode(outputs[0])
Out[14]: "<pad> It is what they consider to be a hurricane's most dangerous aspect.</s>"
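As a small follow-up to the UserWarning above, passing max_new_tokens explicitly silences it; continuing the session (64 is an arbitrary cap):
# Continuing from the session above; 64 is an arbitrary generation cap.
outputs = model.generate(input_ids=input_ids, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))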
jbochi changed pull request status to open