# GPT2
This repository contains GPT2 ONNX models compatible with TensorRT:
- `gpt2-xl.onnx` - GPT2-XL ONNX model for building FP32 or FP16 engines
- `gpt2-xl-i8.onnx` - GPT2-XL ONNX model for building INT8+FP32 engines

The models were quantized with the ENOT-AutoDL framework. Code for building TensorRT engines, together with usage examples, is published on GitHub.
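As a rough illustration of the engine-building step, below is a minimal sketch using the TensorRT 8.5 Python API. The input tensor name `input_ids` and the shape ranges in the optimization profile are assumptions for illustration, not values taken from these models (the actual ONNX graphs may expose additional inputs such as an attention mask):

```python
import tensorrt as trt

# Minimal sketch: build an FP16 engine from gpt2-xl.onnx.
# Assumes TensorRT 8.5 Python bindings; input name and shapes are illustrative.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("gpt2-xl.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("failed to parse gpt2-xl.onnx")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # omit this line for a pure FP32 engine

# Dynamic sequence length requires an optimization profile.
# Arguments: input name, min, opt, max shapes (all assumed here).
profile = builder.create_optimization_profile()
profile.set_shape("input_ids", (1, 1), (1, 64), (1, 320))
config.add_optimization_profile(profile)

engine = builder.build_serialized_network(network, config)
with open("gpt2-xl-fp16.plan", "wb") as f:
    f.write(engine)
```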
## Metrics

### GPT2-XL

|             | TensorRT INT8+FP32 | torch FP16 |
|-------------|--------------------|------------|
| Lambada Acc | 72.11%             | 71.43%     |
### Test environment

- GPU: RTX 4090
- CPU: 11th Gen Intel(R) Core(TM) i7-11700K
- TensorRT 8.5.3.1
- PyTorch 1.13.1+cu116
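For reference, the Lambada benchmark checks whether the model predicts the final word of each passage. The sketch below shows a greatly simplified last-token variant of that check against the Hugging Face torch baseline; the real protocol scores whole final words, so treat this only as an illustration of the idea:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Greatly simplified: checks only the single final token, not the full final word.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained(
    "gpt2-xl", torch_dtype=torch.float16
).cuda().eval()

def last_token_correct(passage: str) -> bool:
    ids = tokenizer(passage, return_tensors="pt").input_ids.cuda()
    with torch.no_grad():
        logits = model(ids[:, :-1]).logits     # next-token logits for each prefix
    predicted = logits[0, -1].argmax().item()  # prediction after the full prefix
    return predicted == ids[0, -1].item()      # compare with the true final token
```

Accuracy is then the fraction of test passages for which such a check succeeds.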
## Latency

### GPT2-XL

| Input sequence length | Number of generated tokens | TensorRT INT8+FP32, ms | torch FP16, ms | Speedup |
|-----------------------|----------------------------|------------------------|----------------|---------|
| 64                    | 64                         | 462                    | 1190           | 2.58    |
| 64                    | 128                        | 920                    | 2360           | 2.54    |
| 64                    | 256                        | 1890                   | 4710           | 2.54    |
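For context on how torch FP16 baseline numbers of this kind are usually obtained, timing wraps `generate` between CUDA synchronizations. A minimal sketch, assuming the Hugging Face `transformers` API and greedy decoding, with warm-up runs omitted for brevity:

```python
import time
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# One measurement point: 64 input tokens, 64 generated tokens.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained(
    "gpt2-xl", torch_dtype=torch.float16
).cuda().eval()

input_ids = torch.randint(0, tokenizer.vocab_size, (1, 64)).cuda()

torch.cuda.synchronize()
start = time.perf_counter()
model.generate(input_ids, max_new_tokens=64, do_sample=False)
torch.cuda.synchronize()
print(f"latency: {(time.perf_counter() - start) * 1000:.0f} ms")
```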
## How to use

An inference example and an accuracy test are published on GitHub:

```bash
git clone https://github.com/ENOT-AutoDL/ENOT-transformers
```
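Independently of the repository's own example scripts, a serialized engine produced by the build sketch above can be loaded with the standard TensorRT runtime (the engine path is illustrative):

```python
import tensorrt as trt

# Sketch: deserialize a previously built engine and create an execution context.
logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open("gpt2-xl-i8.plan", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()
```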