"Model is overloaded, please wait for a bit"

#70
by jmarxza - opened

Any way to stop this message from popping up?

I was facing this issue a few hours ago, but now it is working on Hugging Face's Accelerated Inference. It normally takes more than 90 s to generate 64 tokens with "use_gpu": True, but it runs.
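
For reference, a minimal sketch of the kind of request described above. The prompt and token are placeholders, and the option/parameter names follow the Inference API documentation as I understand it:

```python
import requests

API_URL = "https://api-inference.huggingface.co/models/bigscience/bloom"
headers = {"Authorization": "Bearer hf_xxx"}  # placeholder token

payload = {
    "inputs": "The BLOOM model is",        # placeholder prompt
    "parameters": {"max_new_tokens": 64},  # generate 64 tokens, as above
    "options": {"use_gpu": True},          # Accelerated Inference option mentioned in the post
}

# Single blocking request; as noted above, this can take well over 90 s.
response = requests.post(API_URL, headers=headers, json=payload)
print(response.json())
```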

I have the same problem here. I can only get generated_text from bloom-3b and have never succeeded with bloom. Any solutions?

BigScience Workshop org

Are you still having the issue? We've recently been moving to AzureML, so the service might have been disrupted at some point. But it should be a lot more stable now.

Just to be clear, we're talking about the Inference API?

Now it's working (yes, it is the Inference API). But now I have another problem. It seems that even if I specify num_return_sequences to be more than 1, I can only get 1 generated_text from bloom. I can get the right number of generated_text with bloom-3b. Is it because bloom is too big so that it can only do greedy decoding?

BigScience Workshop org
edited Oct 21, 2022

We have a custom deployment setup for BLOOM right now (in order to improve inference speed and such), which doesn't support all the options yet. We'll try to support new options as requests come in, I guess.

Is it because bloom is too big so that it can only do greedy decoding?

Actually it does more than greedy decoding; you can add top_k and top_p options.
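
For example, a rough sketch of what passing those sampling options to the Inference API might look like (the values are arbitrary and the token is a placeholder):

```python
import requests

API_URL = "https://api-inference.huggingface.co/models/bigscience/bloom"
headers = {"Authorization": "Bearer hf_xxx"}  # placeholder token

payload = {
    "inputs": "Once upon a time",
    "parameters": {
        "max_new_tokens": 64,
        "do_sample": True,  # sample instead of decoding greedily
        "top_k": 50,        # restrict sampling to the 50 most likely tokens
        "top_p": 0.9,       # nucleus sampling over 90% cumulative probability
    },
}

response = requests.post(API_URL, headers=headers, json=payload)
print(response.json()[0]["generated_text"])
```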

Hi @TimeRobber, the max number of tokens for the API seems to be 1024, although I believe the BLOOM model can take longer sequences. Anything greater than 1024 and I get the message "Model is overloaded, please wait for a bit". Is this max length fixed for all users, or can paid plans increase it?

BigScience Workshop org

I think we hard-limit incoming requests that are beyond a specific length so that people don't spam our service. In theory, if you host the model yourself, you can go to arbitrarily long sequences, as it uses a relative positional embedding scheme that can extrapolate to any length regardless of the sequence length used during training. More details can be found here: https://arxiv.org/abs/2108.12409
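
As a rough illustration of why that extrapolation works (a sketch of the ALiBi idea from the linked paper, not the exact BLOOM code): each attention head adds a fixed linear penalty proportional to how far back a key is from the query, so nothing in the model is tied to a maximum trained length.

```python
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Sketch of the ALiBi attention bias (https://arxiv.org/abs/2108.12409).

    Each head h gets a slope m_h, and the attention score between query i and
    key j is penalised by m_h * (i - j), i.e. proportionally to how far back
    token j is from token i. Because the bias is a fixed function of distance,
    it extends to sequence lengths never seen during training.
    """
    # Geometric sequence of slopes, one per head (simplified: assumes
    # num_heads is a power of two).
    slopes = torch.tensor([2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)])
    # Distance of each key position j from each query position i (0 for j >= i).
    positions = torch.arange(seq_len)
    distances = (positions[:, None] - positions[None, :]).clamp(min=0)
    # Shape: (num_heads, seq_len, seq_len); added to attention logits before softmax.
    return -slopes[:, None, None] * distances

bias = alibi_bias(num_heads=4, seq_len=8)
print(bias.shape)  # torch.Size([4, 8, 8])
```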

Max length should be fixed for all users. At least from this API endpoint. cc @olivierdehaene

Concerning whether there's a paid plan, you'd have to ask @Narsil to confirm, but I don't think there is one.

Thanks. Re: hosting this ourselves, can I confirm that when I call API_URL = "https://api-inference.huggingface.co/models/bigscience/bloom" I am hitting the large bigscience/bloom version and not one of the smaller versions, like bigscience/bloom7b1? And can I also confirm whether this is running on CPU?
It seems very fast for a CPU model on large BLOOM; I get 10 seconds. If it were feasible to get this speed on ONNX-accelerated large BLOOM, I could try hosting it myself.

P.S. I saw in the docs that you can check x-compute-type in the response headers to see whether it is CPU or GPU, but I could not see that value.
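
For what it's worth, a minimal sketch of what the poster describes, inspecting the response headers for x-compute-type (the token is a placeholder, and the header may simply be absent on custom deployments like this one):

```python
import requests

API_URL = "https://api-inference.huggingface.co/models/bigscience/bloom"
headers = {"Authorization": "Bearer hf_xxx"}  # placeholder token

response = requests.post(API_URL, headers=headers, json={"inputs": "Hello"})

# The docs mention an x-compute-type response header indicating cpu/gpu;
# it may not be set on every deployment.
print(response.headers.get("x-compute-type", "header not present"))
print(response.elapsed.total_seconds(), "seconds")
```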

BigScience Workshop org

when I call API_URL = "https://api-inference.huggingface.co/models/bigscience/bloom" I am hitting large bigscience/bloom version and not one of the smaller versions, like bigscience/bloom7b1

Yes, you're running the big model.

can I confirm if this is running on CPU

No, it runs on GPUs in a parallel fashion. You can find more details at https://huggingface.co/blog/bloom-inference-optimization

BLOOM is a special deployment, currently powered by AzureML. There won't be CPU inference for BLOOM.

Thanks a lot @TimeRobber this is very helpful

BigScience Workshop org

If you are interested in the code behind our BLOOM deployment, you can find the new version currently running here: https://github.com/huggingface/text-generation-inference.
The original code described in the blog post can also be found here: https://github.com/huggingface/transformers_bloom_parallel/.
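
For anyone who does want to self-host, a minimal sketch of querying a running text-generation-inference server (assuming the default local port and the /generate route from that repo's README; details may differ between versions):

```python
import requests

# Assumes a text-generation-inference server is already running locally,
# e.g. started via the Docker image from the repo above.
TGI_URL = "http://localhost:8080/generate"

payload = {
    "inputs": "The capital of France is",
    "parameters": {"max_new_tokens": 20},
}

response = requests.post(TGI_URL, json=payload)
print(response.json()["generated_text"])
```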

I have had the same problem all night. I just wanted to try it out and see if I can get any kind of response. Maybe I have to wait, or maybe calling the API myself would work better. Does anyone have a solution?

BigScience Workshop org

Hi! Bloom hosting is currently undergoing maintenance by the AzureML team and will be back up as soon as this has been completed. We'll try to get it back up ASAP.

BigScience Workshop org

Model is back up.

christopher changed discussion status to closed
