Inference for generative AI models looks like a minefield, but there's a simple protocol for picking the best inference setup:
95% of users >> If you're using open (large) models and need fast online inference, use Inference Providers in auto mode and let it choose the best provider for the model (first sketch after the list). https://huggingface.co/docs/inference-providers/index
Fine-tuners/bespoke >> If you've got a custom setup, use Inference Endpoints to define a configuration on AWS, Azure, or GCP (second sketch after the list). https://endpoints.huggingface.co/
Locals >> If you're trying to squeeze everything you can out of a server or local machine, use llama.cpp, Jan, LM Studio, or vLLM (third sketch after the list). https://huggingface.co/settings/local-apps#local-apps
Browsers >> If you need open models running right in the browser, use transformers.js. https://github.com/huggingface/transformers.js
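For the auto-mode route, here's a minimal sketch using huggingface_hub's InferenceClient. The model ID and the HF_TOKEN environment variable are illustrative assumptions, not something prescribed by the docs:

```python
# Sketch: Inference Providers in auto mode via huggingface_hub.
# Assumes `pip install huggingface_hub` and a Hugging Face token in HF_TOKEN;
# the model ID below is just an example of a supported open model.
import os
from huggingface_hub import InferenceClient

client = InferenceClient(
    provider="auto",                   # let Hugging Face pick the best provider
    api_key=os.environ["HF_TOKEN"],
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model ID
    messages=[{"role": "user", "content": "Summarise why auto mode is handy."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```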
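For the bespoke route, a hedged sketch of defining an Inference Endpoint from code with huggingface_hub's create_inference_endpoint helper. The endpoint name, repository, vendor, region, and instance values are placeholders you'd swap for your own configuration:

```python
# Sketch: create an Inference Endpoint programmatically (AWS in this example).
# Assumes `pip install huggingface_hub` and a token with endpoint permissions;
# the instance/region values are illustrative, tune them to your setup.
from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "my-custom-endpoint",              # endpoint name (placeholder)
    repository="gpt2",                 # any model repo, e.g. your fine-tune
    framework="pytorch",
    task="text-generation",
    vendor="aws",                      # or "azure" / "gcp"
    region="us-east-1",
    accelerator="cpu",
    instance_size="x2",
    instance_type="intel-icl",
    type="protected",
)
endpoint.wait()                        # block until the endpoint is running
print(endpoint.url)
```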
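And for the local route, one way to talk to whatever you're running is the OpenAI-compatible API that vLLM, llama.cpp's llama-server, LM Studio, and Jan all expose. This sketch assumes a server is already running locally; port 8000 is vLLM's default, other tools commonly use 8080 or 1234:

```python
# Sketch: query a local OpenAI-compatible server from Python.
# Assumes `pip install openai` and a local server already serving a model.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed-locally",      # local servers usually ignore the key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # whatever model the server loaded
    messages=[{"role": "user", "content": "Hello from my own machine."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```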
Let me know what you're using, and if you think it's more complex than this.