IncidentRule

The IncidentRule CRD allows you to define rules for automatically creating, updating, and managing incidents based on events and conditions in your infrastructure.

Definition

apiVersion: mission-control.flanksource.com/v1
kind: IncidentRule
metadata:
  name: example-incident-rule
spec:
  # Source of events to process
  source:
    type: canary
    selector:
      matchLabels:
        app: frontend

  # Conditions that trigger the rule
  condition:
    status: unhealthy
    duration: 10m

  # Incident creation settings
  incident:
    title: "Frontend Availability Issue"
    severity: high
    owner: platform-team
    labels:
      service: frontend
      type: availability

Schema

The IncidentRule resource supports the following fields:

| Field | Description |
| --- | --- |
| spec.source | Source configuration for events |
| spec.source.type | Type of event source (canary, component, alert, etc.) |
| spec.source.selector | Kubernetes label selector for matching sources |
| spec.condition | Conditions that trigger the rule |
| spec.condition.status | Required status of the source (e.g., unhealthy) |
| spec.condition.duration | How long the condition must hold before the rule triggers |
| spec.condition.count | Number of occurrences required to trigger |
| spec.condition.message | Message pattern to match |
| spec.condition.labels | Labels that must be present on the source |
| spec.condition.expression | CEL expression for complex conditions |
| spec.incident | Incident configuration |
| spec.incident.title | Title template for the incident |
| spec.incident.description | Description template for the incident |
| spec.incident.severity | Severity level (critical, high, medium, low) |
| spec.incident.type | Type classification for the incident |
| spec.incident.owner | Default owner for the incident |
| spec.incident.labels | Labels to apply to the incident |
| spec.incident.components | Components to associate with the incident |
| spec.incident.playbooks | Playbooks to trigger when the incident is created |
| spec.incident.responders | Initial responders to assign |
| spec.jira | Jira integration settings |
| spec.pagerduty | PagerDuty integration settings |
| spec.teams | Microsoft Teams integration settings |
| spec.slack | Slack integration settings |
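
The schema lists several condition fields that the definition above does not use. As a rough sketch only, a rule that triggers on repeated failures with a particular error message might look like the following. The rule name, the label values, and the exact value formats (an integer for `count`, a plain pattern string for `message`) are assumptions for illustration and are not confirmed by this page.

```yaml
apiVersion: mission-control.flanksource.com/v1
kind: IncidentRule
metadata:
  name: checkout-timeouts        # hypothetical name
spec:
  source:
    type: canary
    selector:
      matchLabels:
        app: checkout            # hypothetical label
  condition:
    status: unhealthy
    count: 3                     # assumed: integer number of occurrences
    message: "timeout"           # assumed: pattern matched against the source's message
    labels:
      tier: production           # assumed: label that must be present on the source
  incident:
    title: "Checkout Timeouts"
    severity: medium
    type: availability
```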

Examples

Basic Canary Failure Rule

apiVersion: mission-control.flanksource.com/v1
kind: IncidentRule
metadata:
  name: api-availability
spec:
  source:
    type: canary
    selector:
      matchLabels:
        check: api-health
  condition:
    status: unhealthy
    duration: 5m
  incident:
    title: "API Availability Issue"
    severity: high
    owner: api-team
    labels:
      service: api
      type: availability

Component Health Rule

apiVersion: mission-control.flanksource.com/v1
kind: IncidentRule
metadata:
  name: database-health
spec:
  source:
    type: component
    selector:
      matchLabels:
        type: database
        tier: production
  condition:
    status: unhealthy
    duration: 2m
  incident:
    title: "Database Health Issue - {{.component.name}}"
    description: "The database component {{.component.name}} is reporting unhealthy status.\n\nLast error: {{.component.status.message}}"
    severity: critical
    components:
      - "{{.component.id}}"
    playbooks:
      - database-recovery
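
To make the templating concrete, here is a sketch of the incident fields this rule could produce for one matching component. The component values (`postgres-primary`, the id, and the error message) are hypothetical, and the output shape is an assumption about how the template placeholders are substituted, not captured output.

```yaml
# Hypothetical component that matched the rule:
#   name: postgres-primary
#   id: 018f3c2a-0000-0000-0000-000000000001
#   status.message: "connection refused"
incident:
  title: "Database Health Issue - postgres-primary"
  description: |-
    The database component postgres-primary is reporting unhealthy status.

    Last error: connection refused
  severity: critical
  components:
    - "018f3c2a-0000-0000-0000-000000000001"
  playbooks:
    - database-recovery
```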

Alert Manager Integration

apiVersion: mission-control.flanksource.com/v1
kind: IncidentRule
metadata:
  name: prometheus-alerts
spec:
  source:
    type: alertmanager
    selector:
      matchLabels:
        severity: critical
  condition:
    status: firing
    duration: 1m
  incident:
    title: "{{.alert.labels.alertname}}"
    description: "{{.alert.annotations.description}}"
    severity: "{{.alert.labels.severity}}"
    labels:
      source: prometheus
  pagerduty:
    integration: primary-pd-service
    severity: critical
  slack:
    channel: "#incidents"
    message: "Critical alert triggered: {{.alert.labels.alertname}}"
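
The templates in this rule read fields from the incoming alert. As a sketch, an alert shaped like the one below would match the selector and condition above. The `labels` and `annotations` maps follow the usual Alertmanager alert layout; the specific values are made up, and how Mission Control exposes the alert as `.alert` is assumed from the template paths used in the rule.

```yaml
# Hypothetical alert, values for illustration only
status: firing
labels:
  alertname: HighErrorRate       # read by the incident title template
  severity: critical             # matched by the selector, reused as incident severity
annotations:
  description: "5xx rate above 5% for 10 minutes"   # read by the description template
```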

Complex Condition with Expression

apiVersion: mission-control.flanksource.com/v1
kind: IncidentRule
metadata:
  name: advanced-rule
spec:
  source:
    type: component
  condition:
    expression: |
      source.status == "unhealthy" &&
      (source.labels.tier == "production" || source.labels.criticality == "high") &&
      duration("10m")
  incident:
    title: "Service Disruption - {{.component.name}}"
    severity: high
    type: availability
    components:
      - "{{.component.id}}"
      - "{{range .component.dependencies}}{{.id}}{{end}}"
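
One detail worth checking in the `components` list: assuming Go template semantics, a `range` block with no separator writes each id back-to-back into a single string. A sketch of how the list could render for a component with two dependencies (all ids hypothetical):

```yaml
# Hypothetical input: component id "c-1", dependency ids "d-1" and "d-2"
components:
  - "c-1"
  - "d-1d-2"   # both dependency ids concatenated into one entry
```

If one entry per dependency is intended, verify how Mission Control expands templated list items before relying on this form.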

See Also