WebhookLatencyHigh
Alert Description
This alert fires when the 90th percentile latency of a Greenhouse webhook exceeds 200ms for 15 minutes.
What does this alert mean?
Webhooks are admission controllers that validate or mutate resources before they are persisted to etcd. High webhook latency can slow down all API requests for the resources the webhook handles, affecting user operations and controller reconciliations.
This could be due to:
- Complex validation or mutation logic
- External API calls from the webhook (e.g., checking clusters, teams)
- Resource constraints on the webhook pod
- High rate of requests to the webhook
- Network latency within the cluster
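Before diving into metrics, it can help to see which admission webhooks are registered and how they are configured (rules, backing service, timeouts). The following is a generic sketch; the exact configuration names depend on your Greenhouse installation:
# List mutating and validating webhook configurations related to Greenhouse
kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations | grep -i greenhouse
# Show each webhook's name and configured timeoutSeconds
# (replace <name> with one of the configurations listed above)
kubectl get mutatingwebhookconfiguration <name> -o jsonpath='{range .webhooks[*]}{.name}{"\t"}{.timeoutSeconds}{"\n"}{end}'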
Diagnosis
Identify the Affected Webhook and Resource
The alert label webhook identifies which webhook has high latency. The webhook path indicates the resource type:
- /mutate-greenhouse-sap-v1alpha1-plugin → Plugin resource
- /validate-greenhouse-sap-v1alpha1-plugin → Plugin resource
- /mutate-greenhouse-sap-v1alpha1-cluster → Cluster resource
- /validate-greenhouse-sap-v1alpha1-cluster → Cluster resource
- And similar patterns for other resources
Extract the resource type from the webhook path (e.g., Plugin, Cluster, Organization) to use in log filtering.
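As a quick shell sketch (assuming the path format listed above), the resource kind is simply the last dash-separated segment of the webhook path:
# Take the webhook path from the alert label (example value shown here)
WEBHOOK_PATH="/mutate-greenhouse-sap-v1alpha1-plugin"
# Print the last dash-separated segment, e.g. "plugin"; capitalize it for log filtering
echo "${WEBHOOK_PATH##*-}"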
Check Webhook Metrics
Access the Prometheus instance monitoring your Greenhouse cluster and query the webhook latency metrics using the following PromQL queries:
# Webhook latency distribution
controller_runtime_webhook_latency_seconds_bucket{webhook="<webhook-path>"}
# 90th percentile latency
histogram_quantile(0.90, rate(controller_runtime_webhook_latency_seconds_bucket{webhook="<webhook-path>"}[5m]))
Replace <webhook-path> with the actual webhook path from the alert.
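If you prefer the command line over the Prometheus UI, the same query can be issued against the Prometheus HTTP API. This is only a sketch: the namespace, service name, and port of the Prometheus instance are assumptions and will differ per installation.
# Forward the Prometheus API to localhost (service name and namespace are placeholders)
kubectl port-forward -n monitoring svc/prometheus 9090:9090 &
# 90th percentile latency for the affected webhook over the last 5 minutes
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.90, rate(controller_runtime_webhook_latency_seconds_bucket{webhook="<webhook-path>"}[5m]))'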
Check Webhook Request Rate
High request rates can contribute to latency. Query Prometheus:
# Request rate
rate(controller_runtime_webhook_requests_total{webhook="<webhook-path>"}[5m])
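With the same port-forward in place (see the sketch above), the request rate can also be queried from the command line:
# Per-second webhook request rate over the last 5 minutes
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(controller_runtime_webhook_requests_total{webhook="<webhook-path>"}[5m])'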
Check Webhook Logs
Review webhook logs for slow operations or errors. Use the resource type extracted from the webhook path:
kubectl logs -n greenhouse -l app=greenhouse,app.kubernetes.io/component=webhook --tail=500 | grep '"kind":"<Resource>"'
For example, for the plugin webhook:
kubectl logs -n greenhouse -l app=greenhouse,app.kubernetes.io/component=webhook --tail=500 | grep '"kind":"Plugin"'
Look for:
- Long-running validation or mutation operations
- External API call timeouts
- Error messages
- Repeated webhook calls for the same resources
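Because the webhook logs are structured JSON (as the grep on "kind" above relies on), error-level entries can be isolated with an additional filter. The level field name is an assumption based on typical controller-runtime (zap) log output:
# Show only error-level log lines for the affected resource kind
kubectl logs -n greenhouse -l app=greenhouse,app.kubernetes.io/component=webhook --tail=500 \
  | grep '"kind":"<Resource>"' | grep -i '"level":"error"'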
Check Webhook Pod Resource Usage
Verify the webhook pod has sufficient resources:
kubectl top pod -n greenhouse -l app=greenhouse,app.kubernetes.io/component=webhook
kubectl describe pod -n greenhouse -l app=greenhouse,app.kubernetes.io/component=webhook
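To compare the observed usage against what the pod is allowed to consume, print the configured requests and limits (a sketch; the container layout may differ in your installation):
# Print configured resource requests and limits of the webhook container(s)
kubectl get pod -n greenhouse -l app=greenhouse,app.kubernetes.io/component=webhook \
  -o jsonpath='{.items[*].spec.containers[*].resources}'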
Check for Resource Contention
Check if the webhook pod is being throttled:
kubectl describe pod -n greenhouse -l app=greenhouse,app.kubernetes.io/component=webhook | grep -i throttl
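If the describe output shows nothing, CPU throttling is usually easier to see in cAdvisor metrics, provided Prometheus scrapes the kubelet. A sketch, reusing the port-forward from the metrics step above:
# Rate of throttled CPU periods for pods in the greenhouse namespace
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(container_cpu_cfs_throttled_periods_total{namespace="greenhouse"}[5m])'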