# GPT2
This repository contains GPT2 ONNX models compatible with TensorRT:
- `gpt2-xl.onnx` - GPT2-XL ONNX model for building FP32 or FP16 engines
- `gpt2-xl-i8.onnx` - GPT2-XL ONNX model for building INT8+FP32 engines

The models were quantized with the ENOT-AutoDL framework. Code for building TensorRT engines, together with usage examples, is published on GitHub.
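As a rough illustration of the engine-building step, below is a minimal sketch using the TensorRT 8.5 Python API. The input tensor name `input_ids` and the shape ranges in the optimization profile are assumptions for illustration, not values taken from these models (the actual ONNX graphs may expose additional inputs such as an attention mask):

```python
import tensorrt as trt

# Minimal sketch: build an FP16 engine from gpt2-xl.onnx.
# Assumes TensorRT 8.5 Python bindings; input name and shapes are illustrative.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("gpt2-xl.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("failed to parse gpt2-xl.onnx")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # omit this line for a pure FP32 engine

# Dynamic sequence length requires an optimization profile.
# Arguments: input name, min, opt, max shapes (all assumed here).
profile = builder.create_optimization_profile()
profile.set_shape("input_ids", (1, 1), (1, 64), (1, 320))
config.add_optimization_profile(profile)

engine = builder.build_serialized_network(network, config)
with open("gpt2-xl-fp16.plan", "wb") as f:
    f.write(engine)
```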
## Metrics

### GPT2-XL

|             | TensorRT INT8+FP32 | torch FP16 |
|-------------|--------------------|------------|
| Lambada Acc | 72.11%             | 71.43%     |
### Test environment

- GPU: RTX 4090
- CPU: 11th Gen Intel(R) Core(TM) i7-11700K
- TensorRT 8.5.3.1
- PyTorch 1.13.1+cu116
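For reference, the Lambada benchmark checks whether the model predicts the final word of each passage. The sketch below shows a greatly simplified last-token variant of that check against the Hugging Face torch baseline; the real protocol scores whole final words, so treat this only as an illustration of the idea:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Greatly simplified: checks only the single final token, not the full final word.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained(
    "gpt2-xl", torch_dtype=torch.float16
).cuda().eval()

def last_token_correct(passage: str) -> bool:
    ids = tokenizer(passage, return_tensors="pt").input_ids.cuda()
    with torch.no_grad():
        logits = model(ids[:, :-1]).logits     # next-token logits for each prefix
    predicted = logits[0, -1].argmax().item()  # prediction after the full prefix
    return predicted == ids[0, -1].item()      # compare with the true final token
```

Accuracy is then the fraction of test passages for which such a check succeeds.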
## Latency

### GPT2-XL

| Input sequence length | Number of generated tokens | TensorRT INT8+FP32, ms | torch FP16, ms | Speedup |
|-----------------------|----------------------------|------------------------|----------------|---------|
| 64                    | 64                         | 462                    | 1190           | 2.58    |
| 64                    | 128                        | 920                    | 2360           | 2.54    |
| 64                    | 256                        | 1890                   | 4710           | 2.54    |
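For context on how torch FP16 baseline numbers of this kind are usually obtained, timing wraps `generate` between CUDA synchronizations. A minimal sketch, assuming the Hugging Face `transformers` API and greedy decoding, with warm-up runs omitted for brevity:

```python
import time
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# One measurement point: 64 input tokens, 64 generated tokens.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained(
    "gpt2-xl", torch_dtype=torch.float16
).cuda().eval()

input_ids = torch.randint(0, tokenizer.vocab_size, (1, 64)).cuda()

torch.cuda.synchronize()
start = time.perf_counter()
model.generate(input_ids, max_new_tokens=64, do_sample=False)
torch.cuda.synchronize()
print(f"latency: {(time.perf_counter() - start) * 1000:.0f} ms")
```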
## How to use

An inference example and an accuracy test are published on GitHub:

```bash
git clone https://github.com/ENOT-AutoDL/ENOT-transformers
```
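Independently of the repository's own example scripts, a serialized engine produced by the build sketch above can be loaded with the standard TensorRT runtime (the engine path is illustrative):

```python
import tensorrt as trt

# Sketch: deserialize a previously built engine and create an execution context.
logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open("gpt2-xl-i8.plan", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()
```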