OperatorWorkqueueNotDrained
Alert Description
This alert fires when a controller's workqueue backlog has not been drained for 15 minutes.
What does this alert mean?
Each controller uses a workqueue to process reconciliation requests. When the workqueue depth continues to grow rather than being drained, it indicates that the controller cannot keep up with the incoming reconciliation requests.
This could be due to:
- High rate of resource changes overwhelming the controller
- Slow reconciliation operations (see also OperatorReconcileDurationHigher10Min)
- Controller pod being resource-constrained
- Deadlocks or stuck reconciliation loops
- External systems being slow or unavailable
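To confirm that the backlog is actually growing rather than just large, compare how fast items are added with how fast they are processed. The queries below are a sketch based on the standard controller-runtime workqueue metrics; depending on the controller-runtime version, the workqueue metrics may carry a name label instead of controller:
# Positive values mean items are added faster than they are completed
rate(workqueue_adds_total{controller="<controller-name>"}[15m])
  - rate(workqueue_work_duration_seconds_count{controller="<controller-name>"}[15m])
# Trend of the queue depth over the alert window
deriv(workqueue_depth{controller="<controller-name>"}[15m])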
Diagnosis
Identify the Affected Controller
The controller label on the alert identifies the controller whose workqueue is not draining.
Check Workqueue Metrics
Access the Prometheus instance monitoring your Greenhouse cluster and run the following PromQL queries against the workqueue metrics:
# Current workqueue depth
workqueue_depth{controller="<controller-name>"}
# Rate of items being added to the queue
rate(workqueue_adds_total{controller="<controller-name>"}[5m])
# Work duration per item (p99); the metric is a histogram, so query the _bucket series
histogram_quantile(0.99, rate(workqueue_work_duration_seconds_bucket{controller="<controller-name>"}[5m]))
Replace <controller-name> with the actual controller name from the alert.
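Depending on the controller-runtime version, additional workqueue metrics from the standard client-go instrumentation may also be exposed; they help distinguish a large backlog from stuck workers:
# Time items spend waiting in the queue before being picked up (p99)
histogram_quantile(0.99, rate(workqueue_queue_duration_seconds_bucket{controller="<controller-name>"}[5m]))
# Rate of items being re-queued after failed processing
rate(workqueue_retries_total{controller="<controller-name>"}[5m])
# How long the longest-running worker has been busy with a single item
workqueue_longest_running_processor_seconds{controller="<controller-name>"}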
Check Controller Logs
Review the controller logs to see whether reconciliations are being processed:
kubectl logs -n greenhouse -l app=greenhouse --tail=500 | grep "<controller-name>"
Look for:
- Repeated reconciliation of the same resources
- Error messages indicating stuck operations
- Long pauses between log entries
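If the logs suggest repeated failures, the standard controller-runtime reconcile metrics can quantify them:
# Reconciliations per second by outcome (success, error, requeue, requeue_after)
sum by (result) (rate(controller_runtime_reconcile_total{controller="<controller-name>"}[5m]))
# Reconciliation error rate
rate(controller_runtime_reconcile_errors_total{controller="<controller-name>"}[5m])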
Check Reconciliation Duration
If reconciliations are slow, this may prevent the queue from draining. Query Prometheus:
# Reconcile duration per request (p99); the metric is a histogram, so query the _bucket series
histogram_quantile(0.99, rate(controller_runtime_reconcile_time_seconds_bucket{controller="<controller-name>"}[5m]))
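Slow reconciliations back up the queue faster when the controller runs only a few workers. If your controller-runtime version exposes the following metrics, comparing busy workers to the configured maximum shows whether the controller is already saturated:
# Configured worker limit vs. workers currently busy
controller_runtime_max_concurrent_reconciles{controller="<controller-name>"}
controller_runtime_active_workers{controller="<controller-name>"}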
Check Controller Resource Usage
Verify the controller has sufficient resources:
kubectl top pod -n greenhouse -l app=greenhouse
kubectl describe pod -n greenhouse -l app=greenhouse
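Restarts and CPU throttling are common signs of resource pressure beyond a point-in-time snapshot. The PromQL below is a sketch that assumes cAdvisor metrics are scraped by your Prometheus; adjust the namespace or pod matchers to your deployment:
kubectl get pods -n greenhouse -l app=greenhouse
# Fraction of CPU periods in which the operator containers were throttled
rate(container_cpu_cfs_throttled_periods_total{namespace="greenhouse"}[5m])
  / rate(container_cpu_cfs_periods_total{namespace="greenhouse"}[5m])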
Check Number of Resources
A high number of resources may be causing excessive reconciliation load:
kubectl get <resource-type> --all-namespaces --no-headers | wc -l
Replace <resource-type> with the appropriate resource the controller is managing.
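If your Prometheus also scrapes the kube-apiserver, per-resource object counts are available without listing objects via kubectl (the metric exists in Kubernetes v1.22+; the label matcher below assumes the Greenhouse API group and is only an example):
# Number of stored objects per resource type
apiserver_storage_objects{resource=~".*greenhouse.*"}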
Check for External System Issues
If the controller depends on external systems, verify they are responsive:
# Check cluster connectivity
kubectl get clusters --all-namespaces -o json | jq -r '.items[] | select(.status.statusConditions.conditions[]? | select(.type=="Ready" and .status!="True")) | "\(.metadata.namespace)/\(.metadata.name)"'
# Check organization SCIM connectivity
kubectl get organizations -o json | jq -r '.items[] | select(.status.statusConditions.conditions[]? | select(.type=="SCIMAPIAvailable" and .status!="True")) | .metadata.name'