1 - Playbooks

This section provides the playbooks needed to act on alerts generated by Greenhouse controller metrics.

1.1 - ClusterNotReady

Playbook for the ClusterNotReady Alert

Alert Description

This alert fires when a Greenhouse-managed cluster has not been ready for more than 15 minutes.

What does this alert mean?

The Greenhouse controller monitors the health of all registered clusters. When a cluster is not ready, it indicates that the Greenhouse operator cannot properly communicate with or manage resources on that cluster. This could be due to:

  • Network connectivity issues between Greenhouse and the cluster
  • Invalid or expired kubeconfig credentials
  • The cluster API server being unavailable
  • Insufficient permissions for Greenhouse to access the cluster
  • Node issues preventing the cluster from being operational

Diagnosis

Get the Cluster Resource

Retrieve the cluster resource to view its current status:

kubectl get cluster <cluster-name> -n <namespace> -o yaml

Or use kubectl describe for a more readable output:

kubectl describe cluster <cluster-name> -n <namespace>

Check the Status Conditions

Look at the status.statusConditions section in the cluster resource. Pay special attention to:

  • Ready: The main indicator of cluster health
  • KubeConfigValid: Indicates if credentials are valid
  • AllNodesReady: Shows if all nodes in the cluster are ready
  • PermissionsVerified: Confirms Greenhouse has required permissions
  • ManagedResourcesDeployed: Indicates if Greenhouse resources were deployed
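
For a quick overview, you can print just the conditions with their status and message (a convenience command; it assumes the conditions are listed under .status.statusConditions.conditions, as described above):

kubectl get cluster <cluster-name> -n <namespace> -o jsonpath='{range .status.statusConditions.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'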

Check Controller Logs

Review the Greenhouse controller and webhook logs for more detailed error messages:

kubectl logs -n greenhouse -l app=greenhouse --tail=100 | grep "<cluster-name>" # requires permissions on the greenhouse namespace

Or access your logs sink for Greenhouse logs.

Additional Resources

1.2 - OperatorReconcileErrorsHigh

Playbook for the OperatorReconcileErrorsHigh Alert

Alert Description

This alert fires when more than 10% of reconciliation operations fail for a controller for 15 minutes.

What does this alert mean?

The Greenhouse operator uses controllers to manage various resources. When a controller’s reconciliation error rate exceeds 10%, it indicates systemic issues preventing the controller from properly managing its resources.

This could be due to:

  • API server connectivity issues
  • Resource conflicts or invalid resource states
  • Missing dependencies or referenced resources
  • Permission issues preventing controller operations
  • Resource exhaustion (memory, CPU) affecting controller performance
  • Bugs in the controller logic

Diagnosis

Identify the Affected Controller

The alert label controller identifies which controller is failing.

Check Controller Metrics

Access the Prometheus instance monitoring your Greenhouse cluster and query the controller error metrics using the following PromQL queries:

# Total reconciliation errors
controller_runtime_reconcile_errors_total{controller="<controller-name>"}

# Total reconciliations
controller_runtime_reconcile_total{controller="<controller-name>"}

# Error rate
rate(controller_runtime_reconcile_errors_total{controller="<controller-name>"}[5m]) / rate(controller_runtime_reconcile_total{controller="<controller-name>"}[5m])

Replace <controller-name> with the actual controller name from the alert.
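
The reconcile total metric also carries a result label, so a per-result breakdown can show whether failures are hard errors or requeues (a sketch assuming the standard controller-runtime metric labels):

# Reconciliations by result (success, error, requeue, ...)
sum by (result) (rate(controller_runtime_reconcile_total{controller="<controller-name>"}[5m]))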

Check Controller Logs

Review the controller logs for specific error messages:

kubectl logs -n greenhouse -l app=greenhouse --tail=500 | grep "controller=\"<controller-name>\"" | grep "error"

Look for patterns in the errors to identify the root cause.

Check Affected Resources

List resources managed by the failing controller that are not ready:

kubectl get <resource-type> --all-namespaces -o json | jq -r '.items[] | select(.status.statusConditions.conditions[]? | select(.type=="Ready" and .status!="True")) | "\(.metadata.namespace)/\(.metadata.name)"'

Replace <resource-type> with the appropriate resource the controller is managing (e.g., clusters, plugins, organizations, teams, teamrolebindings).

Check Controller Resource Usage

Verify the controller pod is not resource-constrained:

kubectl top pod -n greenhouse -l app=greenhouse

kubectl describe pod -n greenhouse -l app=greenhouse

Check API Server Connectivity

Test if the controller can reach the Kubernetes API server:

kubectl get --raw /healthz
kubectl get --raw /readyz

Additional Resources

1.3 - OrganizationNotReady

Playbook for the OrganizationNotReady Alert

Alert Description

This alert fires when a Greenhouse Organization has not been ready for more than 15 minutes.

What does this alert mean?

An Organization in Greenhouse represents a tenant and serves as the primary namespace for all resources belonging to that organization. When an Organization is not ready, it indicates that Greenhouse cannot properly initialize or manage the organization’s resources.

This could be due to:

  • Issues with the organization’s namespace creation or configuration
  • RBAC setup failures
  • IdP (Identity Provider) configuration problems
  • Service proxy provisioning issues
  • Default team role configuration problems

Diagnosis

Get the Organization Resource

Retrieve the organization resource to view its current status:

kubectl get organization <organization-name> -o yaml

Or use kubectl describe for a more readable output:

kubectl describe organization <organization-name>

Check the Status Conditions

Look at the status.statusConditions section in the organization resource. Pay special attention to:

  • Ready: The main indicator of organization health
  • NamespaceCreated: Indicates if the organization namespace was successfully created
  • OrganizationRBACConfigured: Shows if RBAC for the organization is properly configured
  • OrganizationDefaultTeamRolesConfigured: Indicates if default team roles are configured
  • ServiceProxyProvisioned: Shows if the service proxy is provisioned
  • OrganizationOICDConfigured: Indicates if OIDC is configured correctly
  • OrganizationAdminTeamConfigured: Shows if the admin team is configured for the organization
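
To list only the conditions that are currently failing, you can filter the same .status.statusConditions.conditions layout referenced above:

kubectl get organization <organization-name> -o json | jq '.status.statusConditions.conditions[] | select(.status != "True")'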

Check Controller Logs

Review the Greenhouse controller logs for more detailed error messages:

kubectl logs -n greenhouse -l app=greenhouse --tail=100 | grep "<organization-name>" | grep "error" # requires permissions on the greenhouse namespace

Or access your logs sink for Greenhouse logs.

Additional Resources

1.4 - PluginNotReady

Playbook for the PluginNotReady Alert

Alert Description

This alert fires when a Plugin has not been ready for more than 15 minutes.

What does this alert mean?

A Plugin in Greenhouse represents an application or service deployed to a target cluster via Helm. When a Plugin is not ready, it indicates that the deployment or the workload resources are not functioning correctly.

This could be due to:

  • Helm chart deployment failures
  • Missing or invalid PluginDefinition
  • Cluster access issues
  • Invalid plugin option values
  • Workload resources not becoming ready (pods failing, etc.)
  • Dependencies not being met (via waitFor)

Diagnosis

Get the Plugin Resource

Retrieve the plugin resource to view its current status:

kubectl get plugin <plugin-name> -n <namespace> -o yaml

Or use kubectl describe for a more readable output:

kubectl describe plugin <plugin-name> -n <namespace>

Check the Status Conditions

Look at the status.statusConditions section in the plugin resource. Pay special attention to:

  • Ready: The main indicator of plugin health
  • ClusterAccessReady: Indicates if Greenhouse can access the target cluster. If false, check the target Cluster status.
  • HelmReconcileFailed: Shows if Helm reconciliation failed
  • HelmDriftDetected: Indicates drift between desired and actual state
  • HelmChartTestSucceeded: Shows if Helm chart tests passed
  • WaitingForDependencies: Indicates if waiting for other plugins
  • RetriesExhausted: Shows if all retry attempts have been exhausted

Check Underlying Flux Resources

Since Greenhouse uses Flux as the default deployment mechanism, you can inspect the Flux HelmRelease resource belonging to a Plugin:

# Get the HelmRelease in the organization namespace
kubectl get helmrelease <plugin-name> -n <namespace> -o yaml

# Describe the HelmRelease for detailed status
kubectl describe helmrelease <plugin-name> -n <namespace>
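
Events recorded for the HelmRelease often contain the underlying Helm error. A quick way to list them with plain kubectl (no Flux CLI required):

kubectl get events -n <namespace> --field-selector involvedObject.kind=HelmRelease,involvedObject.name=<plugin-name> --sort-by=.lastTimestamp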

Additional Resources

1.5 - ProxyRequestErrorsHigh

Playbook for the ProxyRequestErrorsHigh Alert

Alert Description

This alert fires when more than 10% of HTTP requests result in 4xx (excluding 401/403) or 5xx errors for a proxy service for 15 minutes.

What does this alert mean?

Greenhouse proxy services (like service-proxy, cors-proxy, idproxy) handle HTTP traffic for various purposes. High error rates indicate that requests are failing, which affects user experience and functionality.

This could be due to:

  • Backend services being unavailable or unhealthy
  • Misconfigured routing or proxy rules
  • Authentication/authorization issues (if 401/403 are included)
  • Network connectivity problems to backend services
  • Resource exhaustion in the proxy pod
  • Invalid requests from clients

Diagnosis

Identify the Affected Proxy Service

The alert label proxy identifies which proxy service has high error rates:

  • greenhouse-service-proxy - Proxies requests to services in remote clusters. It is deployed to the <org-name> namespace, not greenhouse!
  • greenhouse-cors-proxy - Handles CORS for frontend applications
  • greenhouse-idproxy - Handles authentication and identity proxying

From here on, the placeholder <proxy-name> refers to one of the names above without the greenhouse- prefix, e.g. idproxy.

Check Proxy Metrics

Access the Prometheus instance monitoring your Greenhouse cluster and query the proxy request metrics using the following PromQL queries:

# Total HTTP requests by status code
http_requests_total{service="<proxy-name>"}

# Successful requests (2xx)
http_requests_total{service="<proxy-name>",status=~"2.."}

# Client errors (4xx, excluding 401/403)
http_requests_total{service="<proxy-name>",status=~"4..",status!~"40[13]"}

# Server errors (5xx)
http_requests_total{service="<proxy-name>",status=~"5.."}

# Error rate
(rate(http_requests_total{service="<proxy-name>",status=~"4..",status!~"40[13]"}[5m]) + rate(http_requests_total{service="<proxy-name>",status=~"5.."}[5m])) / rate(http_requests_total{service="<proxy-name>"}[5m])

Replace <proxy-name> with the actual proxy service name from the alert (e.g., greenhouse-service-proxy, greenhouse-cors-proxy, greenhouse-idproxy).
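
To see which status codes dominate, a per-code breakdown of the error rate can narrow things down (a sketch using the same http_requests_total metric and labels as above):

sum by (status) (rate(http_requests_total{service="<proxy-name>",status=~"[45]..",status!~"40[13]"}[5m]))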

Check Proxy Logs

Important! The service-proxy is deployed to the <org-name> namespace, not greenhouse!

Review proxy logs for detailed error messages:

kubectl logs -n greenhouse -l app.kubernetes.io/name=<proxy-name> --tail=500 | grep -i error

For service-proxy specifically:

kubectl logs -n <org-name> -l app.kubernetes.io/name=service-proxy --tail=500 | grep -E "error|status.*[45][0-9]{2}"

Look for:

  • Backend connection failures
  • Timeout errors
  • Authentication/authorization failures
  • Invalid routing or target service issues

Check Backend Service Health

If the proxy is routing to backend services, verify they are healthy. For service-proxy, check plugins with exposed services:

kubectl get plugins --all-namespaces -l greenhouse.sap/plugin-exposed-services=true

# Check if any plugins are not ready
kubectl get plugins --all-namespaces -l greenhouse.sap/plugin-exposed-services=true -o json | jq -r '.items[] | select(.status.statusConditions.conditions[]? | select(.type=="Ready" and .status!="True")) | "\(.metadata.namespace)/\(.metadata.name)"'

Check Proxy Pod Resource Usage

Verify the proxy pod has sufficient resources:

kubectl top pod -n greenhouse -l app.kubernetes.io/name=<proxy-name>

kubectl describe pod -n greenhouse -l app.kubernetes.io/name=<proxy-name>

Additional Resources

1.6 - ResourceOwnedByLabelMissing

Playbook for the ResourceOwnedByLabelMissing Alert

Alert Description

This alert fires when resources exist without the required greenhouse.sap/owned-by label for 15 minutes.

What does this alert mean?

The greenhouse.sap/owned-by label is used to track resource ownership by Teams. This label should reference a Team with the greenhouse.sap/support-group=true label. Missing ownership labels make it difficult to:

  • Track responsibility for resources
  • Audit resource ownership
  • Contact support teams for issues
  • Enforce access control policies

Diagnosis

Identify the Affected Resource

The alert provides:

  • resource: The type of resource (e.g., Plugin, Cluster, TeamRoleBinding)
  • namespace: The namespace where the resource exists
  • name: The name of the resource (in alert labels)

Check the Resource

Retrieve the resource to inspect its labels:

kubectl get <resource-type> <resource-name> -n <namespace> -o yaml

Check the metadata.labels section for the greenhouse.sap/owned-by label.

List All Resources Missing the Label

Find all resources of the same type missing the ownership label:

kubectl get <resource-type> --all-namespaces -o json | jq -r '.items[] | select(.metadata.labels["greenhouse.sap/owned-by"] == null) | "\(.metadata.namespace)/\(.metadata.name)"'

Identify the Appropriate Owner Team

List support group teams in the namespace:

kubectl get teams -n <namespace> -l greenhouse.sap/support-group=true

Add the Missing Label

Once you’ve identified the appropriate owner team, add the label:

kubectl label <resource-type> <resource-name> -n <namespace> greenhouse.sap/owned-by=<team-name>

For example:

kubectl label plugin my-plugin -n my-org greenhouse.sap/owned-by=platform-team
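
If many resources are missing the label and they all belong to the same team, the listing from above can be combined with a loop to label them in one pass (a sketch; review the list first and adjust the team per resource where needed):

kubectl get <resource-type> --all-namespaces -o json \
  | jq -r '.items[] | select(.metadata.labels["greenhouse.sap/owned-by"] == null) | "\(.metadata.namespace) \(.metadata.name)"' \
  | while read -r ns name; do
      kubectl label <resource-type> "$name" -n "$ns" greenhouse.sap/owned-by=<team-name>
    done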

Verify Webhooks are Working

Check if webhook validation is functioning properly:

# Check webhook pod status
kubectl get pods -n greenhouse -l app.kubernetes.io/component=webhook # requires permissions on the greenhouse namespace

# Check webhook logs for errors
kubectl logs -n greenhouse -l app.kubernetes.io/component=webhook --tail=100 | grep -i "owned-by" # requires permissions on the greenhouse namespace

Or access your logs sink for Greenhouse logs.

Prevent Future Occurrences

Ensure that:

  • Webhooks are enabled and functioning
  • Users are aware of the label requirement
  • Resource creation processes include the ownership label

Additional Resources

1.7 - TeamMembershipCountDrop

Playbook for the TeamMembershipCountDrop Alert

Alert Description

This alert fires when the number of members for a team has dropped by more than 5 in the last 5 minutes.

What does this alert mean?

This alert detects sudden drops in team membership that could indicate:

  • Accidental bulk removal of team members in the IdP
  • SCIM synchronization issues causing member data loss
  • IdP group configuration changes
  • Potential security incidents (unauthorized access removal)

A drop of more than 5 members in 5 minutes is unusual and warrants investigation.

Diagnosis

Get the Team Resource

Retrieve the team resource to view current membership:

kubectl get team <team-name> -n <namespace> -o yaml

Check the current Team members in .status.members.

Check .status.statusConditions:

  • SCIMAccessReady: Indicates if there is a connection to SCIM
  • SCIMAllMembersValid: Shows if all members are valid (no invalid or inactive members)
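
To quantify the drop, count the members currently reported in the Team status (assuming .status.members is a list, as referenced above):

kubectl get team <team-name> -n <namespace> -o json | jq '.status.members | length'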

Check Organization SCIM Status

Verify that the organization’s SCIM connection is working:

kubectl get organization <namespace> -o jsonpath='{.status.statusConditions.conditions[?(@.type=="SCIMAPIAvailable")]}'

Check IdP Group Membership

Verify the current membership in the IdP group directly to confirm if the drop is legitimate or a sync issue:

  1. Access your IdP (Identity Provider) console
  2. Navigate to the group specified in spec.mappedIdPGroup
  3. Compare the member list with what’s shown in Greenhouse

Check Controller Logs

Review the Greenhouse controller logs for SCIM synchronization errors:

kubectl logs -n greenhouse -l app=greenhouse --tail=200 | grep "<team-name>" | grep -E "scim|member|error" # requires permissions on the greenhouse namespace

Or access your logs sink for Greenhouse logs.

Additional Resources

1.8 - ClusterTokenExpiry

Playbook for the ClusterTokenExpiry Alert

Alert Description

This alert fires when the kubeconfig token for a cluster will expire in less than 20 hours.

What does this alert mean?

Greenhouse has two ways of authenticating to Clusters.

This alert only fires when a kubeconfig was initially provided. In that case, Greenhouse creates a ServiceAccount on the target Cluster and keeps a kubeconfig with a token scoped to this ServiceAccount on the Greenhouse Cluster. These tokens have a limited validity period and are auto-rotated by the Greenhouse controller. When a token is about to expire, this alert fires. Since Greenhouse auto-rotates these tokens, this alert firing indicates that Greenhouse cannot (or could not) interact properly with the Cluster.

If the token expires without being refreshed:

  • Greenhouse will lose access to the cluster
  • The cluster will become NotReady
  • Plugin deployments and updates will fail
  • RBAC synchronization will stop working

Quick fix

As a first attempt at a fix, delete the .data.greenhousekubeconfig entry of the Secret holding the authentication credentials for the Cluster. This will trigger reconciliation of the Cluster by the Greenhouse controller.

Important! Please make sure you have a valid .data.kubeconfig entry (base64 encoded) in the Secret.

Step 1: Verify the Secret exists

kubectl get secret <cluster-name> -n <namespace>
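
In addition to checking that the Secret exists, you can verify that its kubeconfig entry decodes to something plausible (a quick sanity check that prints only the first few lines):

kubectl get secret <cluster-name> -n <namespace> -o jsonpath='{.data.kubeconfig}' | base64 -d | head -n 5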

Step 2: Update the kubeconfig (if needed)

If you need to replace the kubeconfig with a new one:

# Base64 encode your kubeconfig (strip newlines so the patch value is a single line)
KUBECONFIG_BASE64=$(cat <path-to-new-kubeconfig> | base64 | tr -d '\n')

# Patch the secret to update the kubeconfig
kubectl patch secret <cluster-name> -n <namespace> \
  --type='json' \
  -p='[{"op": "replace", "path": "/data/kubeconfig", "value":"'$KUBECONFIG_BASE64'"}]'

Step 3: Remove the greenhousekubeconfig to trigger token refresh

kubectl patch secret <cluster-name> -n <namespace> \
  --type='json' \
  -p='[{"op": "remove", "path": "/data/greenhousekubeconfig"}]'

This will trigger the Greenhouse controller to:

  1. Detect the missing greenhousekubeconfig
  2. Use the kubeconfig to authenticate to the remote cluster
  3. Create/verify the ServiceAccount on the remote cluster
  4. Generate a new token and update the greenhousekubeconfig entry

Further Diagnosis

You might want to find out why Greenhouse could not auto-rotate the token in the first place:

Get the Cluster Resource

Retrieve the cluster resource to view its current status:

kubectl get cluster <cluster-name> -n <namespace> -o yaml

Or use kubectl describe for a more readable output:

kubectl describe cluster <cluster-name> -n <namespace>

Check the Status Conditions

Look at the status.statusConditions section in the cluster resource. Pay special attention to:

  • Ready: The main indicator of cluster health
  • KubeConfigValid: Indicates if credentials are valid
  • AllNodesReady: Shows if all nodes in the cluster are ready
  • PermissionsVerified: Confirms Greenhouse has required permissions
  • ManagedResourcesDeployed: Indicates if Greenhouse resources were deployed

Check Controller Logs

Review the Greenhouse controller and webhook logs for more detailed error messages:

kubectl logs -n greenhouse -l app=greenhouse --tail=100 | grep "<cluster-name>" # requires permissions on the greenhouse namespace

Or access your logs sink for Greenhouse logs.

Look for messages about token refresh operations or authentication issues.

Additional Resources

1.9 - OperatorReconcileDurationHigher10Min

Playbook for the OperatorReconcileDurationHigher10Min Alert

Alert Description

This alert fires when the average reconciliation duration exceeds 10 minutes for a controller for 15 minutes.

What does this alert mean?

Controllers should reconcile resources quickly. When reconciliation takes longer than 10 minutes on average, it indicates performance issues that can lead to delays in applying configuration changes and resource state updates.

This could be due to:

  • High number of resources being managed
  • Slow external API calls (e.g., to remote clusters, SCIM APIs)
  • Resource contention or controller pod being throttled
  • Inefficient reconciliation logic
  • Large resource objects or complex computations

Diagnosis

Identify the Affected Controller

The alert label controller identifies which controller has slow reconciliations.

Check Controller Metrics

Access the Prometheus instance monitoring your Greenhouse cluster and query the reconciliation duration metrics using the following PromQL query:

controller_runtime_reconcile_time_seconds{controller="<controller-name>"}

Replace <controller-name> with the actual controller name from the alert.
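
For a clearer picture than the raw histogram series, the 90th percentile reconcile duration can be computed from the bucket series (assuming the standard controller-runtime histogram):

histogram_quantile(0.90, rate(controller_runtime_reconcile_time_seconds_bucket{controller="<controller-name>"}[5m]))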

Check Controller Logs for Slow Operations

Review the controller logs for slow operations:

kubectl logs -n greenhouse -l app=greenhouse --tail=1000 | grep "controller=\"<controller-name>\""

Look for:

  • Long-running operations
  • Timeouts or retries
  • External API call latencies
  • Large number of resources being processed

Check Number of Managed Resources

Count how many resources the controller is managing:

kubectl get <resource-type> --all-namespaces --no-headers | wc -l

Replace <resource-type> with the appropriate resource the controller is managing.

Check Controller Resource Usage

Verify the controller pod has sufficient resources:

kubectl top pod -n greenhouse -l app=greenhouse

kubectl describe pod -n greenhouse -l app=greenhouse | grep -A 5 "Limits:\|Requests:"

Check for Resource Throttling

Check if the controller pod is being CPU throttled:

kubectl describe pod -n greenhouse -l app=greenhouse | grep -i throttl

Check External System Latency

If the controller interacts with external systems (remote clusters, SCIM, etc.), verify their responsiveness:

# For cluster controller - check if remote clusters are accessible
kubectl get clusters --all-namespaces -o json | jq -r '.items[] | select(.status.statusConditions.conditions[]? | select(.type=="Ready" and .status!="True")) | "\(.metadata.namespace)/\(.metadata.name)"'

# For organization controller - check SCIM connectivity
kubectl get organizations -o json | jq -r '.items[] | select(.status.statusConditions.conditions[]? | select(.type=="SCIMAPIAvailable" and .status!="True")) | .metadata.name'

Additional Resources

1.10 - PluginConstantlyFailing

Playbook for the PluginConstantlyFailing Alert

Alert Description

This alert fires when a Plugin reconciliation is constantly failing for 15 minutes.

What does this alert mean?

This alert indicates that the Greenhouse controller is repeatedly failing to reconcile the Plugin resource. Unlike a one-time failure, this suggests a persistent issue that prevents the Plugin from being properly managed.

Common causes include:

  • Invalid plugin option values that cannot be resolved
  • Missing PluginDefinition reference
  • Persistent Helm chart rendering or installation errors
  • Invalid or missing secrets referenced in option values
  • Cluster access issues that don’t resolve
  • Configuration conflicts

Diagnosis

Get the Plugin Resource

Retrieve the plugin resource to view its current status:

kubectl get plugin <plugin-name> -n <namespace> -o yaml

Or use kubectl describe for a more readable output:

kubectl describe plugin <plugin-name> -n <namespace>

Check the Status Conditions and Reasons

Look at the status.statusConditions section in the plugin resource. Pay special attention to:

  • Ready: The main indicator of plugin health
  • ClusterAccessReady: Indicates if Greenhouse can access the target cluster. If false, check the target Cluster status.
  • HelmReconcileFailed: Shows if Helm reconciliation failed
  • HelmDriftDetected: Indicates drift between desired and actual state
  • HelmChartTestSucceeded: Shows if Helm chart tests passed
  • WaitingForDependencies: Indicates if waiting for other plugins
  • RetriesExhausted: Shows if all retry attempts have been exhausted

Common failure reasons to look for:

  • PluginDefinitionNotFound: The referenced PluginDefinition does not exist
  • OptionValueResolutionFailed: Option values could not be resolved
  • PluginOptionValueInvalid: Option values could not be converted to Helm values
  • HelmUninstallFailed: The Helm release could not be uninstalled
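
The exact reason and message reported by Greenhouse can be extracted directly from the Ready condition (same statusConditions layout as used elsewhere in these playbooks):

kubectl get plugin <plugin-name> -n <namespace> -o json | jq '.status.statusConditions.conditions[] | select(.type=="Ready") | {reason, message}'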

Check for Specific Issues

PluginDefinitionNotFound

# Check if the PluginDefinition exists
kubectl get plugindefinition <plugin-definition-name> -n <namespace>

# Or check ClusterPluginDefinition
kubectl get clusterplugindefinition <plugin-definition-name> -n greenhouse # requires permissions on the greenhouse namespace

OptionValueResolutionFailed

# Check if referenced secrets exist (ValueFrom.Secret)
kubectl get secrets -n <namespace>

# Verify option values in the plugin spec
kubectl get plugin <plugin-name> -n <namespace> -o jsonpath='{.spec.optionValues}'

Check Controller Logs

Review the Greenhouse controller logs for detailed reconciliation errors:

kubectl logs -n greenhouse -l app=greenhouse --tail=200 | grep "<plugin-name>" | grep "error"

Check Underlying Flux Resources

Check the Flux HelmRelease for additional error details:

kubectl get helmrelease <plugin-name> -n <namespace> -o yaml

kubectl describe helmrelease <plugin-name> -n <namespace>

Additional Resources

1.11 - ProxyRequestDurationHigh

Playbook for the ProxyRequestDurationHigh Alert

Alert Description

This alert fires when the 90th percentile latency of a proxy service exceeds 500ms for 15 minutes.

What does this alert mean?

High latency in proxy services degrades user experience and can cause timeouts. When response times consistently exceed 500ms, it indicates performance issues that need investigation.

This could be due to:

  • Slow backend services
  • Network latency to remote clusters or services
  • Resource constraints on the proxy pod
  • High traffic volume overwhelming the proxy
  • Inefficient routing or processing logic
  • DNS resolution delays

Diagnosis

Identify the Affected Proxy Service

The alert label proxy identifies which proxy service has high latency:

  • greenhouse-service-proxy - Proxies requests to services in remote clusters. It is deployed to the <org-name> namespace, not greenhouse!
  • greenhouse-cors-proxy - Handles CORS for frontend applications
  • greenhouse-idproxy - Handles authentication and identity proxying

The placeholder <proxy-name> from here on is the above without the greenhouse- prefix. E.g. idproxy.

Check Proxy Metrics

Access the Prometheus instance monitoring your Greenhouse cluster and query the proxy request duration metrics using the following PromQL queries:

# Request duration distribution
request_duration_seconds{service="<proxy-name>"}

# 90th percentile latency
histogram_quantile(0.90, rate(request_duration_seconds_bucket{service="<proxy-name>"}[5m]))

# 99th percentile latency
histogram_quantile(0.99, rate(request_duration_seconds_bucket{service="<proxy-name>"}[5m]))

Replace <proxy-name> with the actual proxy service name from the alert (e.g., greenhouse-service-proxy, greenhouse-cors-proxy, greenhouse-idproxy).
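
Besides the percentiles, the average latency shows whether slowness is broad or limited to the tail (assuming request_duration_seconds is a histogram, as the _bucket queries above imply):

rate(request_duration_seconds_sum{service="<proxy-name>"}[5m]) / rate(request_duration_seconds_count{service="<proxy-name>"}[5m])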

Check Proxy Logs

Important! The service-proxy is deployed to the <org-name> namespace, not greenhouse!

Review proxy logs for slow requests:

kubectl logs -n greenhouse -l app.kubernetes.io/name=<proxy-name> --tail=500

Look for patterns indicating slow responses or timeout warnings.

Check Backend Service Response Times

For service-proxy, verify that backend services in remote clusters are responding quickly:

# List plugins with exposed services
kubectl get plugins --all-namespaces -l greenhouse.sap/plugin-exposed-services=true

# Check if any plugins are not ready
kubectl get plugins --all-namespaces -l greenhouse.sap/plugin-exposed-services=true -o json | jq -r '.items[] | select(.status.statusConditions.conditions[]? | select(.type=="Ready" and .status!="True")) | "\(.metadata.namespace)/\(.metadata.name)"'

Check Network Latency

Test network latency to remote clusters:

# For each cluster, check connectivity
kubectl get clusters --all-namespaces -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'

Check Proxy Pod Resource Usage

Verify the proxy pod has sufficient resources and is not throttled:

kubectl top pod -n greenhouse -l app.kubernetes.io/name=<proxy-name>

kubectl describe pod -n greenhouse -l app.kubernetes.io/name=<proxy-name>

Additional Resources

1.12 - SCIMAccessNotReady

Playbook for the SCIMAccessNotReady Alert

Alert Description

This alert fires when the SCIM access for an organization is not ready for more than 15 minutes.

What does this alert mean?

SCIM (System for Cross-domain Identity Management) is used by Greenhouse to synchronize team members from external identity providers. When SCIM access is not ready, it indicates that Greenhouse cannot properly communicate with the SCIM API to fetch and synchronize user and group information.

This could be due to:

  • Invalid or missing SCIM credentials in the referenced secret
  • Network connectivity issues to the SCIM API endpoint
  • SCIM API authentication failures
  • Incorrect SCIM configuration in the Organization spec

While SCIM access is not crucial for core Greenhouse operation, without it:

  • Team member synchronization will not work
  • New members added in the IdP will not appear in Greenhouse teams
  • Member removals in the IdP will not be reflected in Greenhouse

Diagnosis

Get the Organization Resource

Retrieve the organization resource to check SCIM configuration:

kubectl get organization <organization-name> -o yaml

Look for the spec.authentication.scim section to see the SCIM configuration.
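
To inspect only the SCIM configuration rather than the whole resource (assuming it lives under .spec.authentication.scim, as noted above):

kubectl get organization <organization-name> -o json | jq '.spec.authentication.scim'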

Check the Status Conditions

Look at the status.statusConditions section in the organization resource. Pay special attention to:

  • Ready: The main indicator of organization health
  • SCIMAPIAvailable: Indicates if there is a connection to the SCIM API

Check for specific reasons:

  • SecretNotFound: The secret with SCIM credentials is not found
  • SCIMRequestFailed: A request to SCIM API failed
  • SCIMConfigErrorReason: SCIM config is missing or invalid

Verify the SCIM Secret

Check if the referenced secret exists and contains the correct credentials:

# Check if the secret exists
kubectl get secret <scim-secret-name> -n <organization-name>

# View the secret keys (not the values)
kubectl get secret <scim-secret-name> -n <organization-name> -o jsonpath='{.data}' | jq 'keys'

Check Controller Logs

Review the Greenhouse controller logs for SCIM-related errors:

kubectl logs -n greenhouse -l app=greenhouse --tail=100 | grep "<organization-name>" | grep -i "scim\|error" # requires permissions on the greenhouse namespace

Or access your logs sink for Greenhouse logs.

Additional Resources

1.13 - TeamRoleBindingNotReady

Playbook for the TeamRoleBindingNotReady Alert

Alert Description

This alert fires when a TeamRoleBinding has not been ready for more than 15 minutes.

What does this alert mean?

A TeamRoleBinding in Greenhouse maps a Team to a TeamRole on one or more clusters. When a TeamRoleBinding is not ready, it means that the RBAC resources (RoleBindings or ClusterRoleBindings) could not be properly created on the target clusters, preventing team members from accessing the clusters with the intended permissions.

This could be due to:

  • Cluster access issues (cluster not ready or inaccessible)
  • Permission issues on the target Cluster
  • Referenced Team or TeamRole does not exist
  • Cluster selector not matching any clusters

Diagnosis

Get the TeamRoleBinding Resource

Retrieve the TeamRoleBinding resource to view its current status:

kubectl get teamrolebinding <trb-name> -n <namespace> -o yaml

Or use the shortname:

kubectl get trb <trb-name> -n <namespace> -o yaml

Check the Status Conditions

Look at the status.statusConditions section. Pay special attention to:

  • Ready: The main indicator of TeamRoleBinding health
  • RBACReady: Indicates if the RBAC resources are ready on the clusters

Common failure reasons:

  • RBACReconcileFailed: Not all RBAC resources have been successfully reconciled
  • EmptyClusterList: The clusterSelector and clusterName do not match any existing clusters
  • TeamNotFound: The referenced Team does not exist
  • ClusterConnectionFailed: Cannot connect to the target cluster
  • ClusterRoleFailed: ClusterRole could not be created on the remote cluster
  • RoleBindingFailed: RoleBinding could not be created on the remote cluster
  • CreateNamespacesFailed: Namespaces could not be created (when createNamespaces is enabled)

Check Propagation Status

The status.clusters field shows the propagation status per cluster:

kubectl get trb <trb-name> -n <namespace> -o jsonpath='{.status.clusters}' | jq

This will show which specific clusters are failing and why.
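
If you have direct access to a failing target cluster, you can also check whether the expected RBAC objects were created there. This is a rough sketch; the exact object names are managed by Greenhouse, so list and filter rather than guessing a name:

kubectl --kubeconfig=<target-cluster-kubeconfig> get rolebindings -A | grep -i greenhouse
kubectl --kubeconfig=<target-cluster-kubeconfig> get clusterrolebindings | grep -i greenhouse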

Verify Referenced Resources

Check if the referenced Team and TeamRole exist:

# Check Team
kubectl get team <team-name> -n <namespace>

# Check TeamRole
kubectl get teamrole <teamrole-name> -n <namespace>

Check Cluster Availability

If the issue is cluster connectivity, check the target cluster status:

kubectl get cluster <cluster-name> -n <namespace>

See the ClusterNotReady playbook for cluster troubleshooting.

Check Controller Logs

Review the Greenhouse controller logs for detailed error messages:

kubectl logs -n greenhouse -l app=greenhouse --tail=200 | grep "<trb-name>" | grep "error" # requires permissions on the greenhouse namespace

Or access your logs sink for Greenhouse logs.

Additional Resources

1.14 - ClusterKubernetesVersionOutOfMaintenance

Playbook for the ClusterKubernetesVersionOutOfMaintenance Alert

Alert Description

This alert fires when a cluster is running a Kubernetes version that is out of maintenance.

What does this alert mean?

Kubernetes versions have a limited support lifecycle. When a version goes out of maintenance, it no longer receives security patches or bug fixes. Running clusters on unsupported versions poses security risks and may lead to compatibility issues with newer features and tools.

This alert fires when a cluster is detected running a Kubernetes version that is out of the official Kubernetes maintenance window.

Fix

Update the Kubernetes version of the target Cluster.

Diagnosis

Get the Cluster Resource

Check the detected Kubernetes version:

kubectl get cluster <cluster-name> -n <namespace> -o yaml

Look for the status.kubernetesVersion field to see the current version.
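
To print just the reported version (the status.kubernetesVersion field mentioned above):

kubectl get cluster <cluster-name> -n <namespace> -o jsonpath='{.status.kubernetesVersion}'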

Verify the Version

Check the version directly on the target cluster:

kubectl --kubeconfig=<target-cluster-kubeconfig> version

Additional Resources

1.15 - IDProxyErrorsHigh

Playbook for the IDProxyErrorsHigh Alert

Alert Description

This alert fires when more than 10% of IDProxy operations result in errors for 15 minutes.

What does this alert mean?

The IDProxy handles authentication and identity proxying for Greenhouse. High error rates indicate authentication or identity management issues that prevent users from accessing resources.

This could be due to:

  • Issues with the identity provider (IdP) integration
  • OIDC/OAuth configuration problems
  • Network connectivity to the IdP
  • Invalid or expired tokens
  • Misconfigured callback URLs or client credentials
  • Resource constraints on the IDProxy pod

Diagnosis

Check IDProxy Metrics

Access the Prometheus instance monitoring your Greenhouse cluster and query the IDProxy request metrics using the following PromQL queries:

# Total HTTP requests by status code
http_requests_total{service="greenhouse-idproxy"}

# Successful requests (2xx)
http_requests_total{service="greenhouse-idproxy",status=~"2.."}

# Error requests (4xx and 5xx)
http_requests_total{service="greenhouse-idproxy",status=~"[45].."}

# Error rate
rate(http_requests_total{service="greenhouse-idproxy",status=~"[45].."}[5m]) / rate(http_requests_total{service="greenhouse-idproxy"}[5m])

Analyze the distribution of HTTP status codes to understand what types of errors are occurring.

Check IDProxy Logs

Review IDProxy logs for detailed error messages:

kubectl logs -n greenhouse -l app.kubernetes.io/name=idproxy --tail=500 | grep -i error

Look for:

  • Authentication failures
  • Token validation errors
  • IdP connection issues
  • OIDC/OAuth errors
  • Callback URL mismatches

Check Identity Provider Status

Verify the identity provider is accessible and responding:

# Check Organization configuration
kubectl get organization <org-name> -o jsonpath='{.spec.authentication}'

Test connectivity to the IdP endpoints if accessible.
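
If the OIDC issuer URL is reachable from your machine, fetching its discovery document is a quick availability check (a generic OIDC check; replace <issuer-url> with the issuer configured for the Organization):

curl -sSf <issuer-url>/.well-known/openid-configuration | jq '.issuer, .authorization_endpoint, .token_endpoint'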

Check IDProxy Configuration

Verify the IDProxy configuration in the Organization resource:

kubectl get organization <org-name> -o yaml

Check:

  • OIDC issuer URL is correct
  • Client ID and client secret are configured
  • Redirect URIs are properly set

Check IDProxy Pod Resource Usage

Verify the IDProxy pod has sufficient resources:

kubectl top pod -n greenhouse -l app.kubernetes.io/name=idproxy

kubectl describe pod -n greenhouse -l app.kubernetes.io/name=idproxy

Check for Certificate Issues

If using HTTPS for IdP communication, verify certificates are valid:

kubectl logs -n greenhouse -l app.kubernetes.io/name=idproxy --tail=500 | grep -i "certificate\|tls\|x509"

Additional Resources

1.16 - OperatorWorkqueueNotDrained

Playbook for the OperatorWorkqueueNotDrained Alert

Alert Description

This alert fires when a controller’s workqueue backlog is not getting drained for 15 minutes.

What does this alert mean?

Each controller uses a workqueue to process reconciliation requests. When the workqueue depth continues to grow rather than being drained, it indicates that the controller cannot keep up with the incoming reconciliation requests.

This could be due to:

  • High rate of resource changes overwhelming the controller
  • Slow reconciliation operations (see also OperatorReconcileDurationHigher10Min)
  • Controller pod being resource-constrained
  • Deadlocks or stuck reconciliation loops
  • External systems being slow or unavailable

Diagnosis

Identify the Affected Controller

The alert label controller identifies the controller workqueue that is not draining.

Check Workqueue Metrics

Access the Prometheus instance monitoring your Greenhouse cluster and query the workqueue metrics using the following PromQL queries:

# Current workqueue depth
workqueue_depth{controller="<controller-name>"}

# Rate of items being added to the queue
rate(workqueue_adds_total{controller="<controller-name>"}[5m])

# Work duration
workqueue_work_duration_seconds{controller="<controller-name>"}

Replace <controller-name> with the actual controller name from the alert.
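
How long items wait in the queue before being picked up is also telling; the queue duration histogram covers this (following the same label convention as the workqueue queries above):

histogram_quantile(0.90, rate(workqueue_queue_duration_seconds_bucket{controller="<controller-name>"}[5m]))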

Check Controller Logs

Review controller logs to see if reconciliations are processing:

kubectl logs -n greenhouse -l app=greenhouse --tail=500 | grep "<controller-name>"

Look for:

  • Repeated reconciliation of the same resources
  • Error messages indicating stuck operations
  • Long pauses between log entries

Check Reconciliation Duration

If reconciliations are slow, this may prevent the queue from draining. Query Prometheus:

controller_runtime_reconcile_time_seconds{controller="<controller-name>"}

Check Controller Resource Usage

Verify the controller has sufficient resources:

kubectl top pod -n greenhouse -l app=greenhouse

kubectl describe pod -n greenhouse -l app=greenhouse

Check Number of Resources

A high number of resources may be causing excessive reconciliation load:

kubectl get <resource-type> --all-namespaces --no-headers | wc -l

Replace <resource-type> with the appropriate resource the controller is managing.

Check for External System Issues

If the controller depends on external systems, verify they are responsive:

# Check cluster connectivity
kubectl get clusters --all-namespaces -o json | jq -r '.items[] | select(.status.statusConditions.conditions[]? | select(.type=="Ready" and .status!="True")) | "\(.metadata.namespace)/\(.metadata.name)"'

# Check organization SCIM connectivity
kubectl get organizations -o json | jq -r '.items[] | select(.status.statusConditions.conditions[]? | select(.type=="SCIMAPIAvailable" and .status!="True")) | .metadata.name'

Additional Resources

1.17 - WebhookLatencyHigh

Playbook for the WebhookLatencyHigh Alert

Alert Description

This alert fires when the 90th percentile latency of a Greenhouse webhook exceeds 200ms for 15 minutes.

What does this alert mean?

Webhooks are admission controllers that validate or mutate resources before they are persisted to etcd. High webhook latency can slow down all API requests for the resources the webhook handles, affecting user operations and controller reconciliations.

This could be due to:

  • Complex validation or mutation logic
  • External API calls from the webhook (e.g., checking clusters, teams)
  • Resource constraints on the webhook pod
  • High rate of requests to the webhook
  • Network latency within the cluster

Diagnosis

Identify the Affected Webhook and Resource

The alert label webhook identifies which webhook has high latency. The webhook path indicates the resource type:

  • /mutate-greenhouse-sap-v1alpha1-plugin → Plugin resource
  • /validate-greenhouse-sap-v1alpha1-plugin → Plugin resource
  • /mutate-greenhouse-sap-v1alpha1-cluster → Cluster resource
  • /validate-greenhouse-sap-v1alpha1-cluster → Cluster resource
  • And similar patterns for other resources

Extract the resource type from the webhook path (e.g., Plugin, Cluster, Organization) to use in log filtering.

Check Webhook Metrics

Access the Prometheus instance monitoring your Greenhouse cluster and query the webhook latency metrics using the following PromQL queries:

# Webhook latency distribution
controller_runtime_webhook_latency_seconds{webhook="<webhook-path>"}

# 90th percentile latency
histogram_quantile(0.90, rate(controller_runtime_webhook_latency_seconds_bucket{webhook="<webhook-path>"}[5m]))

Replace <webhook-path> with the actual webhook path from the alert.

Check Webhook Request Rate

High request rates can contribute to latency. Query Prometheus:

# Request rate
rate(controller_runtime_webhook_requests_total{webhook="<webhook-path>"}[5m])

Check Webhook Logs

Review webhook logs for slow operations or errors. Use the resource type extracted from the webhook path:

kubectl logs -n greenhouse -l app=greenhouse,app.kubernetes.io/component=webhook --tail=500 | grep '"kind":"<Resource>"'

For example, for the plugin webhook:

kubectl logs -n greenhouse -l app=greenhouse,app.kubernetes.io/component=webhook --tail=500 | grep '"kind":"Plugin"'

Look for:

  • Long-running validation or mutation operations
  • External API call timeouts
  • Error messages
  • Repeated webhook calls for the same resources

Check Webhook Pod Resource Usage

Verify the webhook pod has sufficient resources:

kubectl top pod -n greenhouse -l app=greenhouse,app.kubernetes.io/component=webhook

kubectl describe pod -n greenhouse -l app=greenhouse,app.kubernetes.io/component=webhook

Check for Resource Contention

Check if the webhook pod is being throttled:

kubectl describe pod -n greenhouse -l app=greenhouse,app.kubernetes.io/component=webhook | grep -i throttl

Additional Resources

1.18 - WebhookErrorsHigh

Playbook for the WebhookErrorsHigh Alert

Alert Description

This alert fires when more than 10% of webhook operations fail for a webhook for 15 minutes.

What does this alert mean?

Webhooks validate or mutate resources before they are persisted. When a webhook’s error rate exceeds 10%, it indicates that many API requests for the affected resources are being rejected or failing.

This could be due to:

  • Invalid resource configurations being submitted
  • External dependencies being unavailable (e.g., clusters, teams, secrets)
  • Permission issues in webhook operations
  • Bugs in the webhook logic
  • Network issues preventing webhook from accessing required resources

Diagnosis

Identify the Affected Webhook and Resource

The alert label webhook identifies which webhook has high error rates. Extract the resource type from the webhook path (e.g., Plugin, Cluster, Organization) to use in log filtering.

Common webhook paths:

  • /mutate-greenhouse-sap-v1alpha1-plugin → Plugin resource
  • /validate-greenhouse-sap-v1alpha1-plugin → Plugin resource
  • /mutate-greenhouse-sap-v1alpha1-cluster → Cluster resource
  • /validate-greenhouse-sap-v1alpha1-cluster → Cluster resource

Check Webhook Metrics

Access the Prometheus instance monitoring your Greenhouse cluster and query the webhook request metrics using the following PromQL queries:

# Total webhook requests by status code
controller_runtime_webhook_requests_total{webhook="<webhook-path>"}

# Successful requests (200)
controller_runtime_webhook_requests_total{webhook="<webhook-path>",code="200"}

# Failed requests (non-200)
controller_runtime_webhook_requests_total{webhook="<webhook-path>",code!="200"}

# Error rate
rate(controller_runtime_webhook_requests_total{webhook="<webhook-path>",code!="200"}[5m]) / rate(controller_runtime_webhook_requests_total{webhook="<webhook-path>"}[5m])

Replace <webhook-path> with the actual webhook path from the alert.
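
A breakdown by response code helps separate validation rejections (4xx) from internal webhook failures (5xx), using the same metric and code label as above:

sum by (code) (rate(controller_runtime_webhook_requests_total{webhook="<webhook-path>"}[5m]))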

Check Webhook Logs

Review webhook logs for error messages using the resource type:

kubectl logs -n greenhouse -l app=greenhouse,app.kubernetes.io/component=webhook --tail=500 | grep '"kind":"<Resource>"' | grep -i error

For example, for the plugin webhook:

kubectl logs -n greenhouse -l app=greenhouse,app.kubernetes.io/component=webhook --tail=500 | grep '"kind":"Plugin"' | grep -i error

Look for:

  • Validation errors indicating why resources are being rejected
  • Missing referenced resources (Teams, Secrets, PluginDefinitions, Clusters)
  • Permission errors
  • Network errors when accessing external systems

Check Recent Resource Submissions

List recent resources of the affected type to see if there are patterns:

kubectl get <resource-type> --all-namespaces --sort-by=.metadata.creationTimestamp

Check if recently created or updated resources have issues:

kubectl get <resource-type> --all-namespaces -o json | jq -r '.items[] | select(.status.statusConditions.conditions[]? | select(.type=="Ready" and .status!="True")) | "\(.metadata.namespace)/\(.metadata.name)"'

Check Webhook Pod Resource Usage

Verify the webhook pod has sufficient resources:

kubectl top pod -n greenhouse -l app=greenhouse,app.kubernetes.io/component=webhook

kubectl describe pod -n greenhouse -l app=greenhouse,app.kubernetes.io/component=webhook

Additional Resources