> FlexGen is a high-throughput generation engine for running large language models with limited GPU memory (e.g., a 16GB T4 GPU or a 24GB RTX3090 gaming card!).

https://github.com/FMInference/FlexGen
## Installation

No additional installation steps are necessary, as FlexGen is already included in this project's `requirements.txt`.
## Converting a model

FlexGen only works with OPT models, and they need to be converted to NumPy format before starting the web UI:

```
python convert-to-flexgen.py models/opt-1.3b/
```

The output will be saved to `models/opt-1.3b-np/`.
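If the model is not yet present in the `models/` folder, it has to be downloaded first. As a minimal sketch, assuming this project's `download-model.py` script is used to fetch it from Hugging Face (the exact output folder name may vary):

```
# Download facebook/opt-1.3b into the models/ folder (run before the conversion step)
python download-model.py facebook/opt-1.3b
```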
## Usage

The basic command is the following:

```
python server.py --model opt-1.3b --loader flexgen
```
For large models, RAM usage may be too high and your computer may freeze. If that happens, you can try the `--compress-weight` flag:

```
python server.py --model opt-1.3b --loader flexgen --compress-weight
```

With this second command, I was able to run both OPT-6.7B and OPT-13B with **2GB VRAM**, and the speed was good in both cases.
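The same flag applies to any converted model. As a hypothetical example, assuming `models/opt-13b-np/` has already been created with the conversion script above:

```
# Hypothetical: run the larger OPT-13B model with weight compression
python server.py --model opt-13b --loader flexgen --compress-weight
```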
You can also manually set the offload strategy with:

```
python server.py --model opt-1.3b --loader flexgen --percent 0 100 100 0 100 0
```

where the six numbers after `--percent` are:

```
the percentage of weight on GPU
the percentage of weight on CPU
the percentage of attention cache on GPU
the percentage of attention cache on CPU
the percentage of activations on GPU
the percentage of activations on CPU
```

You should typically only change the first two numbers. If their sum is less than 100, the remaining layers will be offloaded to the disk, by default into the `text-generation-webui/cache` folder.
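For instance, this hypothetical split keeps 30% of the weights on the GPU and 50% on the CPU, leaving the remaining 20% to be offloaded to disk:

```
# Hypothetical split: 30% of weights on GPU, 50% on CPU, remaining 20% on disk
python server.py --model opt-1.3b --loader flexgen --percent 30 50 100 0 100 0
```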
## Performance

In my experiments with OPT-30B using an RTX 3090 on Linux, I obtained these results:

* `--loader flexgen --compress-weight --percent 0 100 100 0 100 0`: 0.99 seconds per token.
* `--loader flexgen --compress-weight --percent 100 0 100 0 100 0`: 0.765 seconds per token.
## Limitations

* Only works with OPT models.
* Only two generation parameters are available: `temperature` and `do_sample`.