The intent of this repo is to compare the performance of dense quantized MPT-7B against 70% sparse-quantized MPT-7B on OpenVINO. Quantization is 8-bit for both weights and activations (W8A8) in both models. The benchmark metric is decoding (next-token) latency at a context length of 512.

Target HW: Intel 4th gen Xeon (Sapphire Rapids)

SW

git clone https://huggingface.co/vuiseng9/ov-mpt-7b-gsm8k-sparse70
pip install openvino==2024.2.0
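
As an optional sanity check (not part of the original steps), the one-liner below verifies that the installed OpenVINO runtime is importable and that the CPU device is visible:

# Optional: confirm the OpenVINO runtime imports and lists CPU among available devices
python -c "import openvino as ov; print(ov.Core().available_devices)"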

Benchmarking with OpenVINO

  1. ./benchmarkapp_w8a8.bash (dense W8A8 baseline)
  2. ./benchmarkapp_w8a8_sparse70.bash (70% sparse W8A8)

Note: remove the numactl prefix from the scripts if your node does not support it. See the sketch below for the kind of command these scripts wrap.
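
For reference, the scripts invoke OpenVINO's benchmark_app. The sketch below is a minimal, hypothetical illustration of that kind of invocation, not the exact contents of the scripts: the model path, the numactl binding, and the 0.7 threshold are placeholders, and CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE is the CPU-plugin property used in the OpenVINO sparse-quantized BERT blog to turn on sparse weight decompression. The actual scripts in this repo remain the source of truth.

# Hypothetical latency-mode run with sparse weight decompression enabled.
# The config asks the CPU plugin to decompress weights of layers whose sparsity
# is at or above the given rate; omit -load_config for the dense W8A8 run.
cat > sparse_config.json <<'EOF'
{"CPU": {"CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE": 0.7}}
EOF

numactl --cpunodebind=0 --membind=0 \
  benchmark_app -m ov-mpt-7b-gsm8k-sparse70/openvino_model.xml \
                -d CPU -hint latency -t 60 \
                -load_config sparse_config.json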

Implementation of Sparse Weight Decompression in OpenVINO

  • The initial implementation of Sparse Weight Decompression landed in OpenVINO’s fork of oneDNN via this pull request: https://github.com/openvinotoolkit/oneDNN/pull/158/files

  • You can browse the changed files via the left pane of the pull request's "Files changed" view.

  • Initialization: src/cpu/reorder/simple_sparse_reorder.hpp (line 113)

  • Decompression: src/cpu/x64/jit_brgemm_decompress_kernel.cpp (line 41)

  • If you'd like to build the OpenVINO runtime from source for debugging, see the wiki page; benchmark_app is compiled as part of that build.

Related materials:

OpenVINO blog on Sparse-Quantized BERT (corresponding notebook)
