Three Mighty Alerts Supporting Hugging Face’s Production Infrastructure

Published July 8, 2025

The Infrastructure team at Hugging Face is excited to share a behind-the-scenes look at the production infrastructure we have had the privilege of helping to build and maintain. Our dedication to designing and implementing a robust monitoring and alerting system has been instrumental in keeping our platforms stable and scalable, and we are constantly reminded of how much our alerts help us identify and respond to potential issues before they become major incidents.

In this blog post, we’ll dive into the details of three mighty alerts that each play a unique role in supporting our production infrastructure, and explore how they've helped us maintain the level of performance and uptime that our community relies on.

High NAT Gateway Throughput

In cloud computing architectures where data flows between private and public networks, implementing a NAT (Network Address Translation) gateway is a well-established best practice. The gateway acts as a gatekeeper, monitoring and facilitating all outbound traffic towards the public Internet. By centralizing egress traffic, it also provides a single vantage point for comprehensive visibility: our team can easily query and analyze this traffic, making it an invaluable asset when working through security, cost optimization, or other investigative scenarios.

Cost optimization is a critical aspect of cloud infrastructure management, and understanding the pricing dynamics is key. Data center pricing structures often differentiate between east-west traffic (typically communication within the same rack or building) and north-south traffic (communication between more distant private networks or out to the internet). By monitoring network traffic volume, Hugging Face gains valuable insight into these traffic patterns, allowing us to make informed decisions about infrastructure configuration and architecture and to avoid incurring needless costs.

One of our key alerts is designed to notify us when our network traffic volume surpasses a predefined threshold. This alert serves multiple purposes. Firstly, it acts as an early warning system, alerting us to any unusual spikes in traffic that might indicate potential issues or unexpected behavior. Secondly, it prompts us to regularly review our traffic trends, ensuring we stay on top of our infrastructure's growth and evolving needs. This alert is set at a static threshold, which we have fine-tuned over time, ensuring it remains relevant and effective. When triggered, it often coincides with periods of refactoring our infrastructure.
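
To give a concrete sense of the shape of this rule, here is a minimal sketch in PromQL. It is not our production rule: nat_gateway_bytes_out_total is a hypothetical counter of bytes egressing through the NAT gateway (for example, derived from CloudWatch metrics by an exporter), and the threshold is illustrative.

# Minimal sketch, not the production rule. nat_gateway_bytes_out_total is a
# hypothetical counter of bytes leaving through the NAT gateway; the
# threshold (~100 MB/s sustained over 15 minutes) is illustrative.
sum by (nat_gateway_id) (
  rate(nat_gateway_bytes_out_total[15m])
) > 100 * 1024 * 1024

The appeal of a static threshold like this is that it is trivial to reason about and cheap to revisit as traffic grows.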

For instance, when integrating third-party security and autoscaling tools, we've observed increased telemetry data egress from our nodes, triggering the alert and prompting us to optimize our configurations.

On another occasion, adjustments to our infrastructure led to traffic mistakenly bypassing a private, low-cost path between product-specific infrastructure (e.g. traffic destined to the Hub from a Space to interact with repository data). To elaborate further, the most impactful workloads in terms of cost savings we’ve found are those that access object storage. Fetching objects directly is cheaper than going through CDN-hosted assets for our LFS repository storage, and it does not require the same security measures that our WAF provides for public requests arriving at our front door. Leveraging DNS overrides to switch traffic between private and public network paths has become a valuable technique for us, driven by the CDKTF AWS provider:

// Excerpt from a loop over our DNS overrides; Route53ResolverFirewallRule
// comes from the CDKTF AWS provider. Each rule intercepts a public DNS name
// and rewrites the answer to a CNAME pointing at the private path.
new Route53ResolverFirewallRule(
  stack,
  `dns-override-rule-${key}-${j}`,
  {
    provider: group?.provider!,
    name: `dns-override-${dnsOverride.name}-${rule[0]}`,
    // BLOCK with an OVERRIDE response replaces the DNS answer instead of
    // simply refusing the query.
    action: 'BLOCK',
    blockOverrideDnsType: 'CNAME',
    blockOverrideDomain: `${rule[1]}.`,
    blockOverrideTtl: dnsOverride.ttl,
    blockResponse: 'OVERRIDE',
    firewallDomainListId: list.id,
    firewallRuleGroupId: group!.id,
    priority: 100 + j,
  },
);

As a final note, while we have configuration-as-code ensuring the desired state is always in effect, having an additional layer of alerting helps catch cases where mistakes are made when expressing that desired state through code.

Hub Request Logs Archival Success Rate

The logging infrastructure at Hugging Face is a sophisticated system designed to collect, process, and store vast amounts of log data generated by our applications and services. At the heart of this system is the Hub application logging pipeline, a well-architected solution that ensures Hub model usage data is efficiently captured, enriched, and stored for reporting and archival purposes. The pipeline begins with Filebeat, a lightweight log shipper that runs as a daemonset alongside our application pods in each Kubernetes cluster. Filebeat's role is to collect logs from various sources, including application containers, and forward them to the next stage of the pipeline.

Once logs are collected by Filebeat, they are sent to Logstash, a powerful log processing tool. Logstash acts as the data processing workhorse, applying a series of mutations and transformations to the incoming logs. This includes enriching logs with GeoIP data for geolocation insights, routing logs to specific Elasticsearch indexes based on predefined criteria, and manipulating log fields by adding, removing, or reformatting them to ensure consistency and ease of analysis. After Logstash has processed the logs, they are forwarded to an Elasticsearch cluster.

Elasticsearch, a distributed search and analytics engine, forms the core of our log storage and analysis platform. It receives the logs from Logstash and applies its own set of processing rules through Elasticsearch pipelines. These pipelines perform minimal processing tasks, such as adding timestamp fields to indicate the time of processing, which is crucial for log analysis and correlation. Elasticsearch provides a scalable and flexible storage solution, allowing us to buffer logs for operational use and real-time analysis.

To manage the lifecycle of logs within Elasticsearch, we employ a robust storage and lifecycle management strategy. This ensures that logs are retained in Elasticsearch for a defined period, providing quick access for operational and troubleshooting purposes. After this retention period, logs are offloaded to long-term archival storage. The archival process involves an automated tool that reads logs from Elasticsearch indexes, formats them as Parquet files—an efficient columnar storage format—and writes them to our object storage system.

The final stage of our logging pipeline leverages AWS data warehousing services. Here, AWS Glue crawlers are utilized to discover and classify data in our object storage, automatically generating a Glue Data Catalog, which provides a unified metadata repository. The Glue table schema is periodically refreshed to ensure it remains up-to-date with the evolving structure of our log data. This integration with AWS Glue enables us to query the archived logs using Amazon Athena, a serverless interactive query service. Athena allows us to run SQL queries directly against the data in object storage, providing a cost-effective and scalable solution for log analysis and historical data exploration.


The logging pipeline, while meticulously designed, is not without its challenges and potential points of failure. One of the most critical vulnerabilities lies in the elasticity of the system, particularly in the Elasticsearch cluster. Elasticsearch, being a distributed system, can experience backpressure in various scenarios, such as high ingress traffic, intensive querying, or internal operations like shard relocation. When backpressure occurs, it can lead to a cascade of issues throughout the pipeline. For instance, if the Elasticsearch cluster becomes overwhelmed, it may start rejecting or delaying log ingestion, causing backlogs in Logstash or even Filebeat, which can result in log loss or delayed processing.

Another point of fragility is the auto-schema detection mechanism in Elasticsearch. While it is designed to adapt to changing log structures, it can fail when application logs undergo significant field type changes. If the schema detection fails to recognize the new field types, it may lead to failed writes from Logstash to Elasticsearch, causing log processing bottlenecks and potential data inconsistencies. This issue highlights the importance of proactive log schema management and the need for robust monitoring to detect and address such issues promptly.

Memory management is also a critical aspect of the pipeline's stability. The log processing tier, including Logstash and Filebeat, operates with limited memory resources to control costs. When backpressure occurs, these components can experience Out-of-Memory (OOM) issues, especially during system slowdowns. As logs accumulate and backpressure increases, the memory footprint of these processes grows, pushing them closer to their limits. If not addressed promptly, this can lead to process crashes or further exacerbation of the backpressure problem.

Archival jobs, responsible for transferring logs from Elasticsearch to object storage, have also encountered challenges. On occasion, these jobs can be resource-intensive, with their performance becoming sensitive to node size and memory availability. In cases where junk data or unusually large log entries pass through the pipeline, they can strain the archival process, leading to failures due to memory exhaustion or node capacity limits. This underscores the importance of data validation and filtering earlier in the pipeline to prevent such issues from reaching the archival stage.

To mitigate these potential failures, we've implemented a powerful alert system with a unique motivation: validating end-to-end log flow. The alert is designed to compare the number of requests received by our Application Load Balancer (ALB) with the number of logs successfully archived, providing a comprehensive view of log data flow throughout the entire pipeline. This approach allows us to quickly identify any discrepancies that might indicate potential log loss or processing issues.

The alert mechanism is based on a simple yet effective comparison: the number of requests hitting our ALB, which represents the total log volume entering the system, versus the number of logs successfully archived in our long-term storage. By monitoring this ratio, we can ensure that what goes in must come out, providing a robust validation of our logging infrastructure's health. When the alert is triggered, it indicates a potential mismatch, prompting immediate investigation and remediation.
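
To make the comparison concrete, it can be sketched as a PromQL ratio. The metric names below are hypothetical placeholders, one counter for requests observed at the load balancer and one for records the archival job reports as written; the one-day window and 1% tolerance are illustrative, not our production values.

# Hypothetical sketch of the "what goes in must come out" check.
# alb_requests_total and hub_request_logs_archived_total are placeholder
# counters; the window and tolerance are illustrative.
  sum(increase(hub_request_logs_archived_total[1d]))
/
  sum(increase(alb_requests_total[1d]))
< 0.99

In practice, the window and tolerance also need to absorb the delay between a request being served and its log being offloaded from Elasticsearch to archival storage.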

In practice, this alert has proven to be a valuable tool, especially during periods of infrastructure refactoring. For instance, when we migrated our ALB to a VPC origin, the alert was instrumental in identifying and addressing the resulting log flow discrepancies. However, it has also saved us in less obvious scenarios. For example, when archive jobs failed to run due to unforeseen issues, the alert flagged the missing archived logs, allowing us to promptly investigate and resolve the problem before it impacted our log analysis and retention processes.

While this alert is a powerful tool, it is just one part of our comprehensive monitoring strategy. We continuously refine and adapt our logging infrastructure to handle the ever-increasing volume and complexity of log data. By combining proactive monitoring, efficient resource management, and a deep understanding of our system's behavior, Hugging Face ensures that our logging pipeline remains resilient, reliable, and capable of supporting our platform's growth and evolving needs. This alert is a testament to our commitment to maintaining a robust and transparent logging system, providing our teams with the insights they need to keep Hugging Face running smoothly.

Kubernetes API Request Errors and Rate Limiting

When operating cloud-native applications and Kubernetes-based infrastructures, even seemingly minor issues can escalate into significant downtime if left unchecked. This is particularly true for the Kubernetes API, which serves as the central nervous system of a Kubernetes cluster, orchestrating the creation, management, and networking of containers. At Hugging Face, we've learned through experience that monitoring the Kubernetes API error rate and rate limiting metrics is a crucial practice, one that can prevent potential disasters.

Hugging Face's infrastructure is deeply integrated with Kubernetes, and the kube-rs library has been instrumental in building and managing this ecosystem efficiently. kube-rs offers a Rust-centric approach to Kubernetes application development, providing developers with a familiar and powerful toolkit. At its core, kube-rs introduces three key concepts: reflectors, controllers, and custom resource interfaces. Reflectors ensure real-time synchronization of Kubernetes resources, enabling applications to react swiftly to changes. Controllers, the decision-makers, continuously reconcile the desired and actual states of resources, making Kubernetes self-healing. Custom resource interfaces extend Kubernetes, allowing developers to define application-specific resources for better abstraction.

Additionally, kube-rs introduces watchers and finalizers. Watchers monitor specific resources for changes, triggering actions in response to events. Finalizers, on the other hand, ensure proper cleanup and resource termination by defining custom logic. By providing Rust-based abstractions for these Kubernetes concepts, kube-rs allows developers to build robust, efficient applications, leveraging the Kubernetes platform's power and flexibility while maintaining a Rust-centric development approach. This integration streamlines the process of building and managing complex Kubernetes applications, making it a valuable tool in Hugging Face's infrastructure.

Hugging Face's integration with Kubernetes is a cornerstone of our infrastructure, and the kube-rs library plays a pivotal role in managing this ecosystem. The kube::api:: module is instrumental in automating various tasks, such as managing HTTPS certificates for custom domains supporting our Spaces product. By programmatically handling certificate lifecycles, we ensure the security and accessibility of our services, providing users with a seamless experience. Additionally, we have used this module outside of user-facing features during routine maintenance to facilitate node draining and termination, providing cluster stability during infrastructure updates.

The kube::runtime:: module has been equally crucial for us, enabling the development and deployment of custom controllers that enhance our infrastructure's automation and resilience. For instance, we've implemented controllers for billing management in our managed services, where watchers and finalizers on customer pods ensure accurate resource tracking and billing. This level of customization allows us to adapt Kubernetes to our specific needs.

Through kube-rs, Hugging Face has achieved a high level of efficiency, reliability, and control over our cloud-native applications. The library's Rust-centric design aligns with our engineering philosophy, allowing us to leverage Rust's strengths in managing Kubernetes resources. By automating critical tasks and building custom controllers, we've created a scalable, self-healing infrastructure that meets the diverse and evolving needs of our users and enterprise customers. This integration demonstrates our commitment to harnessing the full potential of Kubernetes while maintaining a development approach tailored to our unique requirements.

While our infrastructure rarely encounters issues related to the Kubernetes API, we remain vigilant, especially during and after deployments. The Kubernetes API is a critical component in our use of kube::runtime:: for managing customer pods and cloud networking resources. Any disruptions or inefficiencies in API communication can have cascading effects on our services, potentially leading to downtime or degraded performance.

The importance of monitoring these API metrics is underscored by the experiences of other users of Kubernetes. OpenAI, for instance, shared a status update detailing how DNS availability issues resulted in significant downtime. While not directly related to the Kubernetes API, their experience highlights the interconnectedness of various infrastructure components and the potential for cascading failures. Just as DNS availability is vital for application accessibility, a healthy and responsive Kubernetes API is essential for managing and orchestrating our containerized workloads.

As a best practice, we've integrated these API metrics into our monitoring and alerting systems, ensuring that any anomalies or trends are promptly brought to our attention. This allows us to take a proactive approach, investigating and addressing issues before they impact our customers. For instance, on one occasion a single cluster started rate limiting requests to the Kubernetes API. We traced this back to one of our third-party tools hitting a bug and repeatedly requesting that a node be drained even though the drain had already completed. In response, we were able to flush the malfunctioning job from the system before any noticeable degradation reached our users. This is a great example that alerting scenarios do not only happen directly after deploying new versions of our custom controllers; bugs can take time to manifest as production issues.
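
As a sketch of what such an alert can look like, the expression below computes, per cluster, the share of Kubernetes API requests that are rate limited (HTTP 429) or fail with a server error. It assumes kube-apiserver's apiserver_request_total metric is collected into the same metric store; the threshold is illustrative rather than our production value.

# Minimal sketch, assuming apiserver_request_total is scraped per cluster.
# Fires when more than 5% of API requests over five minutes are rate
# limited (429) or fail with a 5xx error.
  sum by (cluster) (rate(apiserver_request_total{code=~"429|5.."}[5m]))
/
  sum by (cluster) (rate(apiserver_request_total[5m]))
> 0.05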

In conclusion, while our infrastructure is robust and well-architected, we recognize that vigilance and proactive monitoring are essential to maintaining its health and stability. By keeping a close eye on the Kubernetes API error rate and rate limiting metrics, we safeguard our managed services, ensure smooth customer experiences, and uphold our commitment to reliability and performance. This is a testament to our belief that in the world of cloud-native technologies, every component, no matter how small, plays a significant role in the overall resilience and success of our platform.

Bonus Alert: New Cluster Sending Zero Metrics

And a final bonus alert for reading this far into the post!

At Hugging Face, our experiments are constantly in flux, often with purpose-fit clusters spinning up and down as we iterate on features and products. To add to the entropy, our growth is also a significant factor, with clusters expanding to their limits and triggering meiosis-like splits to maintain balance. To navigate this dynamic environment without resorting to hardcoding or introducing an additional cluster discovery layer, we've devised a clever alert that adapts to these changes.

(
  (sum by (cluster) (rate(container_network_transmit_packets_total{pod="prometheus"}[1h])) > 0)
or
  (-1 * (sum by (cluster) (rate(container_network_transmit_packets_total{pod="prometheus"}[1h] offset 48h)) > 0))
) < 0

The metric used in this query is container_network_transmit_packets_total, which represents the total number of packets transmitted by a container. The query is filtering for metrics from a cluster’s local Prometheus instance, which is tasked with metric collection as well as remote writing to our central metric store — Grafana Mimir. Transmission of packets approximates healthy remote writes, which is what we want to ensure across all active clusters.

The first part of the query computes each cluster's current packet transmission rate over the last hour and keeps only the clusters where that rate is greater than 0. The second part performs the same calculation with an offset 48h clause, looking at the rate 48 hours ago, and the -1 * multiplication inverts the result so that clusters which were transmitting back then carry a negative value.

The or operator combines the two parts: it returns every cluster from the first part, plus the negated historical values for clusters that are absent from the first part. The combined result therefore contains:

  • A positive value for each cluster that is currently transmitting packets.
  • A negative value for each cluster that was transmitting 48 hours ago but is not transmitting now (its current rate is 0 or its series has disappeared).

The outer < 0 condition keeps only the negative results, so the alert fires in exactly one case: a cluster that was remote writing metrics two days ago has gone silent, whether because its local Prometheus crashed, its remote write path broke, or the cluster was torn down.

This is what lets the alert adapt to a changing fleet without a hardcoded cluster list: a new cluster is covered automatically once it has been reporting for 48 hours, and a decommissioned cluster stops firing once its historical data ages out of the offset window.

This simple yet effective solution fires in scenarios where our metrics infrastructure crashes, during cluster setup, and when clusters are torn down, providing us with timely insights into our infrastructure's health. While it may not be the most critical alert in our arsenal, it holds a special place since it was born out of collaboration. It is a testament to the power of teamwork through rigorous code review, made possible by the expertise and willingness to help of fellow colleagues on the Hugging Face infrastructure team 🤗

Wrapping Up

In this post we shared some of our favourite alerts supporting infrastructure at Hugging Face. We'd love to hear your team's favourites as well!

How are you monitoring your ML infrastructure? Which alerts keep your team coming back for fixes? What breaks often in your infrastructure or conversely what have you never monitored and just works?

Share with us in the comments below!
