OperatorReconcileErrorsHigh
Alert Description
This alert fires when more than 10% of a controller's reconciliation operations fail continuously for 15 minutes.
What does this alert mean?
The Greenhouse operator uses controllers to manage various resources. When a controller’s reconciliation error rate exceeds 10%, it indicates systemic issues preventing the controller from properly managing its resources.
This could be due to:
- API server connectivity issues
- Resource conflicts or invalid resource states
- Missing dependencies or referenced resources
- Permission issues preventing controller operations
- Resource exhaustion (memory, CPU) affecting controller performance
- Bugs in the controller logic
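All of these causes surface through the same metric ratio that the alert watches. A plausible shape for the alerting rule, shown here only as an assumption for illustration (check the deployed PrometheusRule for the exact expression, thresholds, and labels):

```yaml
# Hypothetical rule; the actual deployed PrometheusRule may differ.
- alert: OperatorReconcileErrorsHigh
  expr: |
    sum(rate(controller_runtime_reconcile_errors_total[5m])) by (controller)
      /
    sum(rate(controller_runtime_reconcile_total[5m])) by (controller)
      > 0.10
  for: 15m
  labels:
    severity: warning
```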
Diagnosis
Identify the Affected Controller
The alert label controller identifies which controller is failing.
Check Controller Metrics
Access the Prometheus instance monitoring your Greenhouse cluster and query the controller error metrics using the following PromQL queries:
# Total reconciliation errors
controller_runtime_reconcile_errors_total{controller="<controller-name>"}
# Total reconciliations
controller_runtime_reconcile_total{controller="<controller-name>"}
# Error rate
rate(controller_runtime_reconcile_errors_total{controller="<controller-name>"}[5m]) / rate(controller_runtime_reconcile_total{controller="<controller-name>"}[5m])
Replace <controller-name> with the actual controller name from the alert.
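If you only have the two raw counters (for example, deltas between two scrapes a few minutes apart), the error rate can be sanity-checked by hand. A small sketch with made-up sample values:

```shell
#!/bin/sh
# Hypothetical counter deltas over a 5m window; replace with real deltas
# taken from the two counter queries above.
errors_delta=12
total_delta=90

# Error rate = errors / total reconciliations over the same window.
awk -v e="$errors_delta" -v t="$total_delta" \
  'BEGIN { printf "error rate: %.2f%%\n", (e / t) * 100 }'
```

A rate above 10% sustained for 15 minutes is what trips the alert.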
Check Controller Logs
Review the controller logs for specific error messages:
kubectl logs -n greenhouse -l app=greenhouse --tail=500 | grep "controller=\"<controller-name>\"" | grep "error"
Look for patterns in the errors to identify the root cause.
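Tallying distinct error strings makes recurring failure modes stand out. A sketch of that pipeline; the `error="..."` log format is an assumption (controller-runtime log output varies by logger configuration), so adapt the grep pattern to what your logs actually contain:

```shell
#!/bin/sh
# Tally distinct error strings, most frequent first. Pipe real logs in
# place of the fabricated sample lines below, e.g.:
#   kubectl logs -n greenhouse -l app=greenhouse --tail=500 | grep -o 'error="[^"]*"' | sort | uniq -c | sort -rn
printf '%s\n' \
  'reconciler error="Operation cannot be fulfilled: conflict"' \
  'reconciler error="Operation cannot be fulfilled: conflict"' \
  'reconciler error="context deadline exceeded"' \
| grep -o 'error="[^"]*"' \
| sort | uniq -c | sort -rn
```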
Check Affected Resources
List resources managed by the failing controller that are not ready:
kubectl get <resource-type> --all-namespaces -o json | jq -r '.items[] | select(.status.statusConditions.conditions[]? | select(.type=="Ready" and .status!="True")) | "\(.metadata.namespace)/\(.metadata.name)"'
Replace <resource-type> with the appropriate resource the controller is managing (e.g., clusters, plugins, organizations, teams, teamrolebindings).
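The jq filter above can be exercised against a sample document to confirm what it selects. The resource list below is fabricated for illustration; only items whose Ready condition is not "True" should be printed:

```shell
#!/bin/sh
# Feed a fabricated resource list through the same jq filter used above.
cat <<'EOF' | jq -r '.items[] | select(.status.statusConditions.conditions[]? | select(.type=="Ready" and .status!="True")) | "\(.metadata.namespace)/\(.metadata.name)"'
{
  "items": [
    {
      "metadata": {"namespace": "demo", "name": "cluster-a"},
      "status": {"statusConditions": {"conditions": [{"type": "Ready", "status": "True"}]}}
    },
    {
      "metadata": {"namespace": "demo", "name": "cluster-b"},
      "status": {"statusConditions": {"conditions": [{"type": "Ready", "status": "False"}]}}
    }
  ]
}
EOF
```

Here only `demo/cluster-b` is printed, since `cluster-a` is Ready.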
Check Controller Resource Usage
Verify the controller pod is not resource-constrained:
kubectl top pod -n greenhouse -l app=greenhouse
kubectl describe pod -n greenhouse -l app=greenhouse
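Memory pressure often shows up as OOM-killed restarts rather than explicit errors in the logs. A sketch that pulls the last termination reason out of pod JSON, shown here against a fabricated status (pipe live pod JSON from `kubectl get pod -n greenhouse -l app=greenhouse -o json` instead):

```shell
#!/bin/sh
# Flag containers whose last restart was an OOM kill.
# The JSON below is a fabricated sample for illustration.
echo '{"items":[{"status":{"containerStatuses":[
  {"name":"greenhouse","lastState":{"terminated":{"reason":"OOMKilled"}}},
  {"name":"sidecar","lastState":{}}]}}]}' \
| jq -r '.items[].status.containerStatuses[]?
         | select(.lastState.terminated.reason? == "OOMKilled")
         | .name'
```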
Check API Server Connectivity
Check that the API server is healthy. Note that these commands run from your local kubeconfig context, so they verify the API server itself, not the controller's network path to it; if they succeed but the controller still logs connection errors, suspect network policy or DNS issues between the controller pod and the API server:
kubectl get --raw /healthz
kubectl get --raw /readyz