OperatorReconcileDurationHigher10Min

Playbook for the OperatorReconcileDurationHigher10Min Alert

Alert Description

This alert fires when the average reconciliation duration exceeds 10 minutes for a controller for 15 minutes.

What does this alert mean?

Controllers should reconcile resources quickly. When reconciliation takes longer than 10 minutes on average, it indicates performance issues that can lead to delays in applying configuration changes and resource state updates.

This could be due to:

High number of resources being managed
Slow external API calls (e.g., to remote clusters, SCIM APIs)
Resource contention or controller pod being throttled
Inefficient reconciliation logic
Large resource objects or complex computations

Diagnosis

Identify the Affected Controller

The alert label controller identifies which controller has slow reconciliations.

Check Controller Metrics

Access the Prometheus instance monitoring your Greenhouse cluster and query the reconciliation duration metrics using the following PromQL query:

controller_runtime_reconcile_time_seconds{controller="<controller-name>"}

Replace <controller-name> with the actual controller name from the alert.

Check Controller Logs for Slow Operations

Review the controller logs for slow operations:

kubectl logs -n greenhouse -l app=greenhouse --tail=1000 | grep "controller=\"<controller-name>\""

Look for:

Long-running operations
Timeouts or retries
External API call latencies
Large number of resources being processed

Check Number of Managed Resources

Count how many resources the controller is managing:

kubectl get <resource-type> --all-namespaces --no-headers | wc -l

Replace <resource-type> with the appropriate resource the controller is managing.

Check Controller Resource Usage

Verify the controller pod has sufficient resources:

kubectl top pod -n greenhouse -l app=greenhouse

kubectl describe pod -n greenhouse -l app=greenhouse | grep -A 5 "Limits:\|Requests:"

Check for Resource Throttling

Check if the controller pod is being CPU throttled:

kubectl describe pod -n greenhouse -l app=greenhouse | grep -i throttl

Check External System Latency

If the controller interacts with external systems (remote clusters, SCIM, etc.), verify their responsiveness:

# For cluster controller - check if remote clusters are accessible
kubectl get clusters --all-namespaces -o json | jq -r '.items[] | select(.status.statusConditions.conditions[]? | select(.type=="Ready" and .status!="True")) | "\(.metadata.namespace)/\(.metadata.name)"'

# For organization controller - check SCIM connectivity
kubectl get organizations -o json | jq -r '.items[] | select(.status.statusConditions.conditions[]? | select(.type=="SCIMAPIAvailable" and .status!="True")) | .metadata.name'