Inference Endpoints Changelog
Week 23, June 2 - June 8
The big update from this week is here: we're excited to introduce audit logs and the vLLM container type!
With audit logs, you now get detailed tracking of all your endpoint activity. Perfect for security and compliance, giving you full visibility into who did what and when. Note that audit logs are an Enterprise feature; you can sign up here.
On the container side, we rolled out the vLLM container type, which further widens the selection of models that can be run with high-performance inference engines. Try it out and let us know your feedback!
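If you prefer scripting deployments, here's a minimal sketch using `huggingface_hub` (the model, hardware, region, and image values below are illustrative assumptions, not official defaults; check the endpoint creation UI for the exact vLLM container settings):

```python
# Minimal sketch: creating an endpoint with a vLLM container via huggingface_hub.
# The repository, hardware, region, and image values are illustrative assumptions.
from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "my-vllm-endpoint",
    repository="meta-llama/Llama-3.1-8B-Instruct",  # example model
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    instance_size="x1",
    instance_type="nvidia-a10g",
    custom_image={
        "url": "vllm/vllm-openai:latest",  # assumed vLLM serving image
        "port": 8000,
        "health_route": "/health",
    },
)
endpoint.wait()  # block until the endpoint is up
print(endpoint.url)
```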
Week 22, May 26 - June 1
This week we have no visible updates for users; the work happened in the backend and behind the scenes. I promise something good is coming.
Week 21, May 19 - May 25
This week we had two solid updates for you:
Global navigation update. The aim here is to make things clearer for users who are part of several organisations. Which organisation am I currently in? How can I easily make sure that my personal organisation isn't mixed up with the business account? With this update we split the views, and hopefully things are clearer and easier to navigate.
Upgraded the default TGI version to 3.3.1. The newest TGI version is out, and you'll get all the benefits by default when deploying a new endpoint.
Week 20, May 12 - May 18
The huge work by ngxson to add image-text-to-text models to llama.cpp has landed in Inference Endpoints as well!
You can deploy, for example, Qwen2.5-VL-7B-Instruct-Q8_0.gguf from the catalog, and we will automatically select the multimodal projector file for you.
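Once such an endpoint is running, you can query it through the OpenAI-compatible chat API that llama.cpp's server exposes. A minimal sketch (the endpoint URL, token, and image URL are placeholders):

```python
# Minimal sketch: sending an image to a deployed GGUF vision model through
# the OpenAI-compatible chat API. URL, token, and image are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://YOUR-ENDPOINT.endpoints.huggingface.cloud/v1",
    api_key="hf_xxx",  # your Hugging Face token
)

response = client.chat.completions.create(
    model="endpoint",  # single-model servers typically ignore this field
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```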
We also made some quality-of-life improvements to the notifications. There should only be one prompt asking if you'd like browser notifications, and the preference will be saved in your browser.
Week 19, May 5 - May 11
A slightly quieter week on our part, but we made a small change based on user feedback: the default scale-to-zero timeout is now 1 hour instead of 15 minutes. Users simply felt that 15 minutes was too short a default, so we bumped it up.
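If you manage endpoints from code, the timeout can also be set there. A minimal sketch with `huggingface_hub` (the `scale_to_zero_timeout` parameter and its minutes unit are assumptions; verify against your client version):

```python
# Minimal sketch: raising the scale-to-zero timeout on an existing endpoint.
# The scale_to_zero_timeout parameter (assumed to be in minutes) should be
# verified against your huggingface_hub version.
from huggingface_hub import update_inference_endpoint

update_inference_endpoint(
    "my-endpoint",             # placeholder endpoint name
    min_replica=0,             # scale-to-zero needs a floor of zero replicas
    scale_to_zero_timeout=60,  # idle time before the endpoint scales down
)
```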
Week 18, April 28 - May 4
This week was big on the model side:
- Llama 4 was released, and you can find it on the Hugging Face Hub ([Llama-4-Scout-17B-16E-Instruct](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct)) and deploy it on H200s or A100s
- we also had a ton of behind-the-scenes bugfixes, which should make the experience of using Inference Endpoints smoother
Week 17, April 21 - April 27
We're super excited to show you this week's updates!
- H200s are now available in Inference Endpoints. In general, the H200 is very good from a price/performance standpoint and the best choice when you want top performance and need to run larger models.
- You can also now tag endpoints. This especially helps in organisations with many different endpoints, perhaps for different purposes: now you can clearly mark the ones that belong to a specific group, and filter based on tags in your dashboard.
- Lastly, we added a custom deployment of OpenAI's Whisper to the catalog. It's a one-click deploy and gives you up to 8x performance improvements compared to the previous version. You can read more about the technical details in the [blog](https://huggingface.co/blog/fast-whisper-endpoints).
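Calling the deployed endpoint looks like calling any other audio endpoint. A minimal sketch (the URL, token, and audio file are placeholders):

```python
# Minimal sketch: transcribing audio with a deployed Whisper endpoint.
# The endpoint URL, token, and audio file are placeholders.
import requests

ENDPOINT_URL = "https://YOUR-WHISPER-ENDPOINT.endpoints.huggingface.cloud"

with open("sample.flac", "rb") as f:
    response = requests.post(
        ENDPOINT_URL,
        headers={
            "Authorization": "Bearer hf_xxx",  # your Hugging Face token
            "Content-Type": "audio/flac",
        },
        data=f.read(),
    )

print(response.json())  # e.g. {"text": "..."}
```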
Week 16, April 14 - April 20
Hope you all are doing well! This week we:
- Added a more advanced feature for configuring your containers. You can override the default command and add arguments when running the container. Probably in 95% of cases you won't need to adjust this, but if you ever need to (e.g. for debugging purposes), you'll find it in the container configuration.
- We also upgraded the default TGI version to 3.2.3 and the TEI version to 1.7.0, so you get the latest and greatest performance by default.
Two items were also deprecated this week:
- The TPU hardware type is no longer available
- The API section was removed; after talking with users, we found it caused more confusion than it solved
Week 15, April 7 - April 13
This week we have no visible updates to post about. A lot of behind-the-scenes fixes and such. Enjoy the weekend everyone!
Week 14, March 31 - April 6
This week we homed in a bit more on the new analytics page, which we released a few weeks ago!
The latency distribution graph got a remake and a few bug fixes.
We also added a grouped view of the replica statuses and, in general, made it easier to view data per replica versus all replicas grouped together.
We also fixed about five small UI snags. One of them: navigating back and forth in the browser now retains the time selection in the analytics. Gnarly, but 100% worth it.
Week 13, March 24 - March 30
This week we got:
- Tessa T1 added to the catalog
- improvements to runtime logs. It had been a long time since we touched these, so they were ripe for an update
- updated the default TGI version to 3.2.1
Week 12, March 17 - March 23
Most of this week went into improving the experience of creating a new endpoint and putting emphasis on the model catalog. We updated the UI and made the deployment cards more informative. If your model isn't in the catalog, you can still deploy it by importing it from the Hugging Face Hub and configuring the deployment yourself.
Week 11, March 10 - March 16
The big update from this week was an analytics update! Not only did we refresh the visuals, but we also improved the backend: you should now see data arriving in real time.
- besides that, we also had day-0 support when Google released Gemma 3, which you can try out here.
- and you can now set the security level to a "semi-public" endpoint, accessible only to people with an authenticated Hugging Face account
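In practice, requests to such an endpoint just need a valid Hugging Face token in the authorization header. A minimal sketch (URL and payload are placeholders):

```python
# Minimal sketch: calling a "semi-public" endpoint, which accepts requests
# from any authenticated Hugging Face account. URL and payload are placeholders.
import requests

response = requests.post(
    "https://YOUR-ENDPOINT.endpoints.huggingface.cloud",
    headers={"Authorization": "Bearer hf_xxx"},  # any valid HF user token
    json={"inputs": "Hello!"},
)
print(response.json())
```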
Week 10, March 03 - March 09
For this week's update, we have some exciting additions and improvements:
- we've added QwQ and Qwen2.5-VL to the inference catalog. Try them out; you can now easily deploy them in one click.
- we've also upgraded TGI to version 3.1.1, bringing in the latest improvements. If you deploy a TGI-based endpoint, you'll notice the benefits right away.
- continued work on the updates for the Analytics section; we're super eager to get them out to everyone
That's it for this week, but stay tuned; more updates are coming soon!
Week 09, Feb 24 - March 02
Last week we added over 10 new models to the catalog. To mention a few, we have PaliGemma 2, TinyR1-32B, phi-4, and a bunch of Qwen models.
Also, Alvaro and I are working on an improved version of the default container, also known as the Inference Toolkit. It's still a work in progress, but the idea is to support all transformers, sentence-transformers, and diffusers models out of the box as soon as a model is implemented in the respective library.
TGI, TEI, and llama.cpp are specialised, more performant servers; the idea of the toolkit is to be as general as possible while staying reasonably performant.
We also have some super nice UI updates coming for the Analytics section that we've been working on. Stay tuned!
Week 08, Feb 17 - Feb 23
Bug fixes, model releases, and llama.cpp improvements: that's what we did this week!
- The recently released PaliGemma 2 is now in the catalog
- TGI released 3.1.0 and we're now defaulting to that version
- A lot of updates to llama.cpp support:
- now you can deploy reranking models out of the box with llama.cpp, e.g. we added jina-reranker-v1-turbo-en-GGUF to the catalog
- this also means that, optionally and when applicable, you can change the pooling method for the model in llama.cpp
- and we pinned the llama.cpp container to a specific version instead of always defaulting to the latest. The aim is to give a more stable experience, and we'll manually update it as often as possible
- The "out of quota" warning sometimes caused confusion, especially when people increased their max replicas a lot. We hope the improved error message helps
- This week we also had a potpourri of bugfixes. Nice to have those behind us!
Week 07, Feb 10 - Feb 16
This week we got some updates to the catalog and UI improvements, and behind the scenes we did a lot of bug squashing!
- simplescaling/s1-32B added to the catalog, link to one-click deploy
- Better message on the pause modal; it wasn't clear before that the underlying hardware isn't reserved while the endpoint is paused.
- Added a dropdown for selecting the quantization for GGUF models
- A wake-up button for scaled-to-zero endpoints
- And the playground now supports the OpenAI API for embedding models (see the sketch below)
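The same OpenAI-compatible route works outside the playground too. A minimal sketch against an embedding endpoint (URL and token are placeholders):

```python
# Minimal sketch: requesting embeddings through the OpenAI-compatible API.
# The base_url and token are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://YOUR-EMBEDDING-ENDPOINT.endpoints.huggingface.cloud/v1",
    api_key="hf_xxx",  # your Hugging Face token
)

result = client.embeddings.create(
    model="endpoint",  # single-model servers typically ignore this field
    input=["Inference Endpoints changelog", "weekly updates"],
)
print(len(result.data[0].embedding))  # embedding dimension
```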
Week 06, Feb 03 - Feb 09
Even more catalog updates this week. I think it's the third week in a row!
- The first-ever catalog item from ByteDance, namely UI-TARS, which gives agents a native GUI to interact with. Go try it out!
- You can now do a bulk selection on endpoints to delete, pause, or resume them
- This week we also deprecated TensorFlow as a backend option when deploying transformers models.
Week 05, Jan 27 - Feb 02
This week we have more nice catalog updates and UI enhancements:
- We added Mistral-Small-24B-Instruct-2501 to the catalog as well, click to deploy
- And added another GGUF quant of the DeepSeek-R1 model, now with 32,768 max tokens
- And a quality-of-life improvement: you can copy the URL of the endpoint directly from the table
Week 04, Jan 20 - Jan 26
This week the whale got introduced to our inference catalog!
- Both deepseek-r1-distill-llama-70b and deepseek-r1-distill-qwen-32b are available as one-click deploys with optimized configurations
- You can also now share the catalog models and get smooth previews on social media platforms.
- We also made it clearer in the UI when you're about to deploy a model that has an optimized configuration, a.k.a. a catalog model.
- And finally, we've started remaking the settings page; let us know what you think!
- And of course we fixed some small bugs
Week 03, Jan 13 - Jan 19
This update comes with super nice features for the catalog and for managing endpoints!
- managing your endpoints got an upgrade; you can now:
  - filter
  - sort by different columns
  - bulk delete
- the `/new` page can be accessed without being logged in; try the link in incognito mode
- we also squashed bugs that were causing issues with deploying text embedding models and tweaked the instance recommendation for Gemma models!
Week 02, Jan 06 - Jan 12
We're getting back to good momentum after Christmas and New Year's with some cool updates!
Download patterns: the core idea is that you should only download the files that are relevant for your endpoint. Using only `.safetensors` files? No need to download the `.pt` files, especially if you want to optimize the startup time for your endpoint. We now have an explicit setting to tune this (a small sketch of the idea follows below).

When creating an endpoint, you don't need to open the full modal to see what you've chosen; previews are now visible inline.
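To illustrate the download-pattern idea, here's how the same filtering looks with `huggingface_hub` directly (the repo and patterns are examples):

```python
# Sketch of the download-pattern idea: only fetch the files your endpoint
# actually needs. The repo and patterns below are examples.
from huggingface_hub import snapshot_download

snapshot_download(
    "meta-llama/Llama-3.1-8B-Instruct",
    allow_patterns=["*.safetensors", "*.json", "tokenizer.*"],  # skip .pt/.bin weights
)
```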
We made the dashboard more straight to the point! By default, you'll now arrive on the page for managing your endpoints, with no intermediary views.
Weeks of Christmas and New Year's, Dec 15 - Jan 05
I'll do one larger update combining several weeks, given the holidays. Since last time we have some nice improvements:
- added llama.cpp-supported models to the catalog, check them out here
- fixed bugs on the `/new` page
- bug fix related to updating passwords in the container
- a lot of nitty gritty work in the background
- and recharged for the new year
Week 50, Dec 09 - Dec 15
The big update from this week is getting TGI v3 out! You can read all about the update here, but the short tl;dr is:
- zero configuration
- increased performance
We also:
- improved the messaging in the UI when you reach your quota
- did minor bug fixes
Week 49, Dec 02 - Dec 08
This week we have a lot of nice updates:
- New and improved UI for the `/new` page. Our aim was to make the configuration cleaner and remove outdated fields; there are more updates coming, but we think this is already a nice improvement.
- You can now configure the hardware utilization threshold for autoscaling.
- A bunch of models are now supported on the inf2 accelerator.
- Mixtral-8x7B is now supported on TPUs.
Week 48, Nov 25 - Dec 01
This week we finally got back to shipping after the off-site and flu!
Updates:
- If you autoscale based on pending requests, you can manually set the threshold to meet your specific requirements
- You can now view logs further back in history, up to the last 50 replicas for a particular deployment.
- New models added to the catalogue, like Qwen2-VL-7B-Instruct and Qwen2.5-Coder-32B-Instruct.
- Updated default TGI version to 2.4.1
- Added CPU as an alternative for the llama.cpp container type (shoutout to @ngxson)
- Fixed an issue with the revision link and default hardware configurations for catalog models.
- The default scale-to-zero timeout is now 15 minutes. Previously, endpoints never scaled to zero by default.
Week 47, Nov 18 - Nov 24
Unfortunately, a wave of flu hit our team, and we needed some time to recover. No updates this week, but stay tuned for next week; we have a lot of exciting things coming up!
Week 46, Nov 11 - Nov 17
No changes this week as the team was on an off-site in Martinique! But a lot of ideas and energy got cooked up for the coming weeks.
Week 45, Nov 04 - Nov 10
This week, we have some awesome updates that are finally out!
- Scaling replicas based on pending requests is now in beta. Since it's in beta, things might change, but you can try it out and read more about it here
- Improved analytics with a graph of the replica history
- Updates to the widgets
- Fixed bug in streaming
- Conversations can now be cleared
- Submit message with cmd+enter
Week 44, Oct 28 - Nov 03
Probably the biggest update this week was a revamp of the Inference Catalogue! You can now find a one-click-deploy model based on:
- license
- price range
- inference server
- accelerator
- and the previously existing task and search filters
Additionally:
- we fixed the config for `MoritzLaurer/deberta-v3-large-zeroshot-v2.0` so that you can run it on CPU as well
- and also thanks to @ngxson for fixing a bug in the llama.cpp snippet
Week 43, Oct 21-27
This week you'll get a sneak peek of the upcoming autoscaling, in the form of analytics!
We have:
- Added pending HTTP requests to the analytics
- Support for image-text-to-text, a.k.a. vision language models (Llama vision has some good jokes)
- Improved the log pagination and added some nice visual touches
- Fixed a bug related to total request count in the analytics
Week 42, Oct 14-20
This week was unfortunately slower on the user-facing updates.
Behind the scenes, we:
- fixed several recommendation values for LLaMA and Qwen 2,
- improved our internal analytics,
- debugged issues related to weights downloading and getting 429s,
- and hopefully squashed the last bugs so we can soon release the new autoscaling
Week 41, Oct 7-13
This week we had a lot of nice UI/UX improvements.
Additionally:
- deprecated the "text2text-generation" task; it's been deprecated on the Hub and in the Inference API as well
- you can now pass the "seed" parameter in the widget for diffusers models
- small bug fixes on llama.cpp containers
- you can play directly in the widget with OpenAI API parameters (see the sketch after this list)
- Shoutout to Alvaro for making the NVLM-D-72B model compatible with endpoints
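The parameters you can tweak in the widget mirror the standard OpenAI chat API. A minimal sketch against an endpoint (the URL and token are placeholders):

```python
# Minimal sketch: playing with OpenAI API parameters against a deployed
# endpoint. The base_url and token are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://YOUR-ENDPOINT.endpoints.huggingface.cloud/v1",
    api_key="hf_xxx",  # your Hugging Face token
)

stream = client.chat.completions.create(
    model="endpoint",  # single-model servers typically ignore this field
    messages=[{"role": "user", "content": "Tell me a short joke."}],
    temperature=0.7,   # standard OpenAI sampling parameters
    max_tokens=128,
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
```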
On the backend we're also making improvements to the autoscaling. This might not have an immediately noticeable impact for users, but soon it'll ripple through to the frontend as well. Stay tuned!