This repo contains model files for [llama2.c 110M tinystories](https://huggingface.co/Xenova/llama2.c-stories110M) optimized for [NM-vLLM](https://github.com/neuralmagic/nm-vllm), a high-throughput serving engine for compressed LLMs.

This model was pruned with [SparseGPT](https://arxiv.org/abs/2301.00774), using [SparseML](https://github.com/neuralmagic/sparseml).

The weights for this model were saved using the [compressed-tensors](https://github.com/neuralmagic/compressed-tensors/pull/30) library, in the bitmask compression format.
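
Bitmask compression stores each weight tensor as a bitmask flagging its nonzero positions plus a packed vector of the nonzero values, so pruned tensors take far less space on disk. A rough sketch of the idea in plain PyTorch (not the library's actual API; the real format packs the mask into single bits rather than a bool tensor):

```python
import torch

def bitmask_compress(dense: torch.Tensor):
    # The mask flags nonzero positions; values holds the surviving
    # weights in row-major order.
    mask = dense != 0
    values = dense[mask]
    return mask, values

def bitmask_decompress(mask: torch.Tensor, values: torch.Tensor) -> torch.Tensor:
    # Scatter the packed values back into the positions the mask flags.
    dense = torch.zeros(mask.shape, dtype=values.dtype)
    dense[mask] = values
    return dense

w = torch.tensor([[0.0, 1.5, 0.0],
                  [2.0, 0.0, 3.0]])
mask, vals = bitmask_compress(w)
assert torch.equal(bitmask_decompress(mask, vals), w)
```
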
## Inference
Install [NM-vLLM](https://github.com/neuralmagic/nm-vllm) for fast inference and low memory usage:
```bash
# the [sparse] extra pulls in the sparse inference kernels
pip install nm-vllm[sparse]
```
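
Once installed, the model can be run with the standard vLLM Python API; a minimal sketch, where the model id is a placeholder for this repo and the `sparsity` argument (an nm-vllm extension) tells the engine to use its sparse kernels:

```python
from vllm import LLM, SamplingParams

prompts = ["Once upon a time, "]
sampling_params = SamplingParams(max_tokens=100, temperature=0.8)

# Placeholder id for illustration -- substitute this repo's model id.
llm = LLM(model="neuralmagic/llama2.c-stories110M-pruned", sparsity="sparse_w16a16")

outputs = llm.generate(prompts, sampling_params)
print(outputs[0].outputs[0].text)
```
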