Running into issues when trying to run a GPTQ model with TGI
Llama variants seem to hit this issue frequently. Relevant discussions:
Issue 2 in https://github.com/huggingface/text-generation-inference/issues/769
https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ/discussions/5
My input and output:
docker run --gpus all -e GPTQ_BITS=4 -e GPTQ_GROUPSIZE=32 -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.0.3 --model-id $model --revision $revision --quantize gptq
2023-09-08T02:41:23.437900Z INFO text_generation_launcher: Args { model_id: "TheBloke/CodeLlama-34B-Instruct-GPTQ", revision: Some("gptq-4bit-32g-actorder_True"), validation_workers: 2, sharded: None, num_shard: None, quantize: Some(Gptq), dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "7c73057b8d56", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
2023-09-08T02:41:23.438009Z INFO download: text_generation_launcher: Starting download process.
2023-09-08T02:41:25.850890Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2023-09-08T02:41:26.240869Z INFO download: text_generation_launcher: Successfully downloaded weights.
2023-09-08T02:41:26.241382Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2023-09-08T02:41:30.648305Z INFO text_generation_launcher: Using exllama kernels
2023-09-08T02:41:30.654582Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 81, in serve
server.serve(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 195, in serve
asyncio.run(
File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
self._run_once()
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
handle._run()
File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 147, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 187, in get_model
return FlashLlama(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_llama.py", line 68, in __init__
model = FlashLlamaForCausalLM(config, weights)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 461, in __init__
self.model = FlashLlamaModel(config, weights)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 399, in __init__
[
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 400, in <listcomp>
FlashLlamaLayer(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 336, in __init__
self.self_attn = FlashLlamaAttention(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 218, in __init__
self.o_proj = TensorParallelRowLinear.load(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 361, in load
get_linear(weight, bias, config.quantize),
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 233, in get_linear
linear = Ex4bitLinear(qweight, qzeros, scales, g_idx, bias, bits, groupsize)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/gptq/exllama.py", line 114, in __init__
assert groupsize == self.groupsize
AssertionError
2023-09-08T02:41:31.147356Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=True`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
You are using a model of type llama to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 81, in serve
server.serve(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 195, in serve
asyncio.run(
File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
return future.result()
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 147, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 187, in get_model
return FlashLlama(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_llama.py", line 68, in __init__
model = FlashLlamaForCausalLM(config, weights)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 461, in __init__
self.model = FlashLlamaModel(config, weights)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 399, in __init__
[
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 400, in <listcomp>
FlashLlamaLayer(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 336, in __init__
self.self_attn = FlashLlamaAttention(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 218, in __init__
self.o_proj = TensorParallelRowLinear.load(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 361, in load
get_linear(weight, bias, config.quantize),
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 233, in get_linear
linear = Ex4bitLinear(qweight, qzeros, scales, g_idx, bias, bits, groupsize)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/gptq/exllama.py", line 114, in __init__
assert groupsize == self.groupsize
AssertionError
rank=0
2023-09-08T02:41:31.245302Z ERROR text_generation_launcher: Shard 0 failed to start
2023-09-08T02:41:31.245320Z INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart
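For context, the failing assertion is comparing group sizes, and the revision name suggests this branch was quantized with group size 32. A quick way to double-check what the repo itself declares is to read its quantize_config.json; a minimal sketch, assuming the revision ships that file with the usual bits/group_size/desc_act keys (as TheBloke's GPTQ repos typically do):

# Minimal sketch: print the quantization parameters declared by the model repo,
# to compare against the GPTQ_BITS / GPTQ_GROUPSIZE values passed to TGI.
# Assumes the revision ships a quantize_config.json with the usual keys.
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="TheBloke/CodeLlama-34B-Instruct-GPTQ",
    filename="quantize_config.json",
    revision="gptq-4bit-32g-actorder_True",
)
with open(path) as f:
    cfg = json.load(f)
print(cfg.get("bits"), cfg.get("group_size"), cfg.get("desc_act"))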
It looks like a change in this file would fix the issue, according to this comment, but I am not sure exactly what would need to change. If someone can point me in the right direction, I can raise a PR.
I am a noob at loading and running LLMs, but I am an SWE with 5 years of experience, FWIW. Please let me know what I should do next. Thanks!
Also, the number of shards seems to make a difference in some cases. Is that expected here?
After a bunch of trial and error, this command worked:
docker run --gpus all -p 8080:80 -e DISABLE_EXLLAMA=True -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model --revision $revision --quantize gptq --max-batch-prefill-tokens=1024
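To confirm the server actually responds, hitting the REST endpoint directly works; a minimal sketch against TGI's /generate API (the prompt and max_new_tokens value are arbitrary):

# Minimal sketch: send one request to the running TGI container and time it.
# The payload shape follows TGI's /generate REST API.
import time
import requests

payload = {
    "inputs": "def fibonacci(n):",
    "parameters": {"max_new_tokens": 64},
}
start = time.time()
resp = requests.post("http://127.0.0.1:8080/generate", json=payload, timeout=600)
resp.raise_for_status()
print(resp.json()["generated_text"])
print(f"elapsed: {time.time() - start:.1f}s")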
But the inference speed is quite slow.