Incident Runbook Templates

Production-ready incident response runbooks that save precious minutes

Verified

Tested and verified by our team

25450 Stars

Create structured incident response runbooks with step-by-step procedures, escalation paths, and recovery actions. Use when building runbooks, responding to incidents, or establishing incident response procedures.

incident-response runbooks devops sre on-call troubleshooting production escalation

See It In Action

User Prompt

Create an incident runbook for our payment processing service that handles outages, including detection, triage, and recovery steps

Agent Response

A structured runbook with severity classification, quick triage steps, mitigation procedures, verification steps, and escalation paths

Quick Start (3 Steps)

Get up and running in minutes

Install

claude-code skill install incident-runbook-templates

Config

First Trigger

@incident-runbook-templates help

Commands

| Command | Description | Required Args |
|---------|-------------|---------------|
| `@incident-runbook-templates service-outage-response` | Generate comprehensive runbooks for handling complete service outages with step-by-step recovery procedures | None |
| `@incident-runbook-templates database-emergency-procedures` | Build runbooks for critical database incidents including connection issues, replication lag, and disk space problems | None |
| `@incident-runbook-templates on-call-engineer-onboarding` | Create standardized incident response procedures for new team members joining on-call rotation | None |

Typical Use Cases

Service Outage Response

Generate comprehensive runbooks for handling complete service outages with step-by-step recovery procedures

Database Emergency Procedures

Build runbooks for critical database incidents including connection issues, replication lag, and disk space problems

On-Call Engineer Onboarding

Create standardized incident response procedures for new team members joining on-call rotation

Overview

Incident Runbook Templates

Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.

When to Use This Skill

- Creating incident response procedures
- Building service-specific runbooks
- Establishing escalation paths
- Documenting recovery procedures
- Responding to active incidents
- Onboarding on-call engineers

Core Concepts

1. Incident Severity Levels

| Severity | Impact | Response Time | Example |
|----------|--------|---------------|---------|
| SEV1 | Complete outage, data loss | 15 min | Production down |
| SEV2 | Major degradation | 30 min | Critical feature broken |
| SEV3 | Minor impact | 2 hours | Non-critical bug |
| SEV4 | Minimal impact | Next business day | Cosmetic issue |
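
The severity table above can be encoded as a small lookup so that paging scripts and humans agree on the same targets. This is a minimal sketch: the function name `sev_response_target` is hypothetical, and returning the string `next-business-day` for SEV4 is an assumption made because that target has no fixed minute count.

```shell
#!/usr/bin/env bash
# Sketch: map a severity label to its target response time.
# Values come from the severity table; the function name is hypothetical.
sev_response_target() {
  case "$1" in
    SEV1) echo "15" ;;                 # minutes
    SEV2) echo "30" ;;                 # minutes
    SEV3) echo "120" ;;                # 2 hours, expressed in minutes
    SEV4) echo "next-business-day" ;;  # no fixed minute count
    *)
      echo "unknown severity: $1" >&2
      return 1
      ;;
  esac
}

sev_response_target SEV2   # prints 30
```

An unknown label returns a non-zero status, so a calling script can refuse to continue instead of silently picking a default.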

2. Runbook Structure

1. Overview & Impact
2. Detection & Alerts
3. Initial Triage
4. Mitigation Steps
5. Root Cause Investigation
6. Resolution Procedures
7. Verification & Rollback
8. Communication Templates
9. Escalation Matrix
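
One way to keep new runbooks consistent with the nine-part structure above is to scaffold the headings from a script. The sketch below assumes a hypothetical helper named `runbook_skeleton`; the section names are taken from the list, while the output layout is only an illustration.

```shell
#!/usr/bin/env bash
# Sketch: print an empty runbook with the nine standard sections.
# runbook_skeleton is a hypothetical helper, not part of the skill itself.
runbook_skeleton() {
  local service="$1"
  local sections=(
    "Overview & Impact"
    "Detection & Alerts"
    "Initial Triage"
    "Mitigation Steps"
    "Root Cause Investigation"
    "Resolution Procedures"
    "Verification & Rollback"
    "Communication Templates"
    "Escalation Matrix"
  )
  printf '# %s Runbook\n\n' "$service"
  local i
  for i in "${!sections[@]}"; do
    # Array indices start at 0, section numbers at 1.
    printf '## %d. %s\n\n' "$((i + 1))" "${sections[$i]}"
  done
}

runbook_skeleton "Payment Processing Service"
```

Redirecting the output to a file gives a starting point that already lists every section a reviewer would expect.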

Runbook Templates

Template 1: Service Outage Runbook

# [Service Name] Outage Runbook

## Overview

**Service**: Payment Processing Service
**Owner**: Platform Team
**Slack**: #payments-incidents
**PagerDuty**: payments-oncall

## Impact Assessment

- [ ] Which customers are affected?
- [ ] What percentage of traffic is impacted?
- [ ] Are there financial implications?
- [ ] What's the blast radius?

## Detection

### Alerts

- `payment_error_rate > 5%` (PagerDuty)
- `payment_latency_p99 > 2s` (Slack)
- `payment_success_rate < 95%` (PagerDuty)

### Dashboards

- [Payment Service Dashboard](https://grafana/d/payments)
- [Error Tracking](https://sentry.io/payments)
- [Dependency Status](https://status.stripe.com)

## Initial Triage (First 5 Minutes)

### 1. Assess Scope

```bash
# Check service health
kubectl get pods -n payments -l app=payment-service

# Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Check error rates
# -G with --data-urlencode keeps curl from treating {} and [] in the query as URL globs
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(http_requests_total{status=~'5..'}[5m]))"
```

2. Quick Health Checks

- [ ] Can you reach the service? `curl -I https://api.company.com/payments/health`
- [ ] Database connectivity? Check connection pool metrics
- [ ] External dependencies? Check Stripe, bank API status
- [ ] Recent changes? Check deploy history

3. Initial Classification

| Symptom | Likely Cause | Go To Section |
|---------|--------------|---------------|
| All requests failing | Service down | Section 4.1 |
| High latency | Database/dependency | Section 4.2 |
| Partial failures | Code bug | Section 4.3 |
| Spike in errors | Traffic surge | Section 4.4 |
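
The classification table can also be expressed as a lookup, which is handy when a chat bot or paging script needs to point the responder at the right mitigation section. A minimal sketch, assuming a hypothetical function `triage_section` and exact symptom phrases matching the table (compared case-insensitively):

```shell
#!/usr/bin/env bash
# Sketch: map an observed symptom to the mitigation section to follow.
# triage_section is a hypothetical helper; phrases mirror the table above.
triage_section() {
  local symptom
  # Normalise case so "High Latency" and "high latency" both match.
  symptom="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  case "$symptom" in
    "all requests failing") echo "4.1" ;;
    "high latency")         echo "4.2" ;;
    "partial failures")     echo "4.3" ;;
    "spike in errors")      echo "4.4" ;;
    *)
      echo "no matching section for: $1" >&2
      return 1
      ;;
  esac
}

triage_section "High latency"   # prints 4.2
```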

Mitigation Procedures

4.1 Service Completely Down

# Step 1: Check pod status
kubectl get pods -n payments

# Step 2: If pods are crash-looping, check logs
kubectl logs -n payments -l app=payment-service --tail=100

# Step 3: Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Step 4: ROLLBACK if recent deploy is suspect
kubectl rollout undo deployment/payment-service -n payments

# Step 5: Scale up if resource constrained
kubectl scale deployment/payment-service -n payments --replicas=10

# Step 6: Verify recovery
kubectl rollout status deployment/payment-service -n payments

4.2 High Latency

# Step 1: Check database connections
kubectl exec -n payments deploy/payment-service -- \
  curl -s localhost:8080/metrics | grep db_pool

# Step 2: Check slow queries (if DB issue)
# A column alias cannot be referenced in WHERE, so the duration expression is repeated there.
psql -h "$DB_HOST" -U "$DB_USER" -c "
  SELECT pid, now() - query_start AS duration, query
  FROM pg_stat_activity
  WHERE state = 'active' AND now() - query_start > interval '5 seconds'
  ORDER BY duration DESC;"

# Step 3: Kill long-running queries if needed
# Set PID to a pid returned by Step 2 before running this.
psql -h "$DB_HOST" -U "$DB_USER" -c "SELECT pg_terminate_backend($PID);"

# Step 4: Check external dependency latency
# curl-format.txt is a local file of --write-out variables (for example time_total)
curl -w "@curl-format.txt" -o /dev/null -s https://api.stripe.com/v1/health

# Step 5: Enable circuit breaker if dependency is slow
kubectl set env deployment/payment-service \
  STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments

4.3 Partial Failures (Specific Errors)

# Step 1: Identify error pattern
kubectl logs -n payments -l app=payment-service --tail=500 | \
  grep -i error | sort | uniq -c | sort -rn | head -20

# Step 2: Check error tracking
# Go to Sentry: https://sentry.io/payments

# Step 3: If specific endpoint, enable feature flag to disable
curl -X POST https://api.company.com/internal/feature-flags \
  -H 'Content-Type: application/json' \
  -d '{"flag": "DISABLE_PROBLEMATIC_FEATURE", "enabled": true}'

# Step 4: If data issue, check recent data changes
psql -h "$DB_HOST" -c "
  SELECT * FROM audit_log
  WHERE table_name = 'payment_methods'
  AND created_at > now() - interval '1 hour';"

4.4 Traffic Surge

# Step 1: Check current load (kubectl top shows CPU/memory per pod, not request rate)
kubectl top pods -n payments

# Step 2: Scale horizontally
kubectl scale deployment/payment-service -n payments --replicas=20

# Step 3: Enable rate limiting
kubectl set env deployment/payment-service \
  RATE_LIMIT_ENABLED=true \
  RATE_LIMIT_RPS=1000 -n payments

# Step 4: If attack, block suspicious IPs
# This policy allows ingress from every address except the listed range.
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-suspicious
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  ingress:
  - from:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 192.168.1.0/24  # Suspicious range
EOF

Verification Steps

# Verify service is healthy
curl -s https://api.company.com/payments/health | jq

# Verify error rate is back to normal
# -G with --data-urlencode keeps curl from treating {} and [] in the query as URL globs
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(http_requests_total{status=~'5..'}[5m]))" | jq '.data.result[0].value[1]'

# Verify latency is acceptable
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=histogram_quantile(0.99,sum(rate(http_request_duration_seconds_bucket[5m]))by(le))" | jq

# Smoke test critical flows
./scripts/smoke-test-payments.sh
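
The verification commands above print numbers, but someone still has to decide whether a number is acceptable. The sketch below shows one way to turn that decision into an exit status. The helper name `within_threshold` and the threshold used in the example are assumptions; pick limits that match your own alert definitions.

```shell
#!/usr/bin/env bash
# Sketch: succeed when VALUE <= MAX, comparing as floating point.
# within_threshold is a hypothetical helper, not an existing tool.
within_threshold() {
  local value="$1" max="$2"
  # Reject anything that is not a plain non-negative number (for example "NaN"
  # or an empty string from a query with no result) instead of treating it as 0.
  [[ "$value" =~ ^[0-9]+([.][0-9]+)?([eE][-+]?[0-9]+)?$ ]] || return 2
  awk -v v="$value" -v max="$max" 'BEGIN { exit !(v + 0 <= max + 0) }'
}

# Example with a literal value; in practice the value would come from the
# Prometheus query piped through jq -r.
if within_threshold "0.004" "0.01"; then
  echo "error rate within threshold"
else
  echo "error rate still elevated"
fi
```

Returning a distinct status for non-numeric input keeps an empty query result from being mistaken for a healthy reading.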

Rollback Procedures

# Rollback Kubernetes deployment
kubectl rollout undo deployment/payment-service -n payments

# Rollback database migration (if applicable)
./scripts/db-rollback.sh $MIGRATION_VERSION

# Rollback feature flag
curl -X POST https://api.company.com/internal/feature-flags \
  -H 'Content-Type: application/json' \
  -d '{"flag": "NEW_PAYMENT_FLOW", "enabled": false}'

Escalation Matrix

| Condition | Escalate To | Contact |
|-----------|-------------|---------|
| > 15 min unresolved SEV1 | Engineering Manager | @manager (Slack) |
| Data breach suspected | Security Team | #security-incidents |
| Financial impact > $10k | Finance + Legal | @finance-oncall |
| Customer communication needed | Support Lead | @support-lead |
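
The escalation conditions are simple enough to check mechanically, which helps when the responder is too busy to re-read the matrix. The following is a sketch only: the function name `escalation_targets` and its argument order are hypothetical, and the thresholds are copied from the matrix above.

```shell
#!/usr/bin/env bash
# Sketch: list who to escalate to, one target per line.
# Arguments (hypothetical order):
#   1: severity (SEV1..SEV4)      2: minutes unresolved (integer)
#   3: breach suspected (yes/no)  4: financial impact in USD (integer)
#   5: customer communication needed (yes/no)
escalation_targets() {
  local sev="$1" minutes="$2" breach="$3" impact_usd="$4" comms="$5"
  if [ "$sev" = "SEV1" ] && [ "$minutes" -gt 15 ]; then
    echo "Engineering Manager"
  fi
  if [ "$breach" = "yes" ]; then
    echo "Security Team"
  fi
  if [ "$impact_usd" -gt 10000 ]; then
    echo "Finance + Legal"
  fi
  if [ "$comms" = "yes" ]; then
    echo "Support Lead"
  fi
  return 0
}

escalation_targets SEV1 20 no 0 no   # prints Engineering Manager
```

Several conditions can hold at once, so the function prints every matching target rather than stopping at the first.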

Communication Templates

Initial Notification (Internal)

🚨 INCIDENT: Payment Service Degradation

Severity: SEV2
Status: Investigating
Impact: ~20% of payment requests failing
Start Time: [TIME]
Incident Commander: [NAME]

Current Actions:
- Investigating root cause
- Scaling up service
- Monitoring dashboards

Updates in #payments-incidents
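
Filling in the initial notification by hand is error-prone under pressure, so it can be generated from a few arguments. This sketch assumes a hypothetical function `initial_notification`; the field names follow the template above, and posting the result to Slack is left out.

```shell
#!/usr/bin/env bash
# Sketch: render the initial internal notification from arguments.
# initial_notification is a hypothetical helper; it only prints text.
initial_notification() {
  local title="$1" severity="$2" impact="$3" start_time="$4" commander="$5" channel="$6"
  cat <<EOF
🚨 INCIDENT: ${title}

Severity: ${severity}
Status: Investigating
Impact: ${impact}
Start Time: ${start_time}
Incident Commander: ${commander}

Updates in ${channel}
EOF
}

initial_notification "Payment Service Degradation" "SEV2" \
  "~20% of payment requests failing" "[TIME]" "[NAME]" "#payments-incidents"
```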

Status Update

📊 UPDATE: Payment Service Incident

Status: Mitigating
Impact: Reduced to ~5% failure rate
Duration: 25 minutes

Actions Taken:
- Rolled back deployment v2.3.4 → v2.3.3
- Scaled service from 5 → 10 replicas

Next Steps:
- Continuing to monitor
- Root cause analysis in progress

ETA to Resolution: ~15 minutes

Resolution Notification

✅ RESOLVED: Payment Service Incident

Duration: 45 minutes
Impact: ~5,000 affected transactions
Root Cause: Memory leak in v2.3.4

Resolution:
- Rolled back to v2.3.3
- Transactions auto-retried successfully

Follow-up:
- Postmortem scheduled for [DATE]
- Bug fix in progress


### Template 2: Database Incident Runbook

# Database Incident Runbook

## Quick Reference
| Issue | Command |
|-------|---------|
| Check connections | `SELECT count(*) FROM pg_stat_activity;` |
| Kill query | `SELECT pg_terminate_backend(<pid>);` |
| Check replication lag | `SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp()));` |
| Check locks | `SELECT * FROM pg_locks WHERE NOT granted;` |

## Connection Pool Exhaustion
```sql
-- Check current connections
SELECT datname, usename, state, count(*)
FROM pg_stat_activity
GROUP BY datname, usename, state
ORDER BY count(*) DESC;

-- Identify long-running connections
SELECT pid, usename, datname, state, query_start, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;

-- Terminate connections idle for more than 10 minutes
-- (state_change records when the connection went idle; skip our own session)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND state_change < now() - interval '10 minutes'
AND pid <> pg_backend_pid();
```

Replication Lag

-- Check lag on replica
SELECT
  CASE
    WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
    ELSE extract(epoch from now() - pg_last_xact_replay_timestamp())
  END AS lag_seconds;

-- If lag > 60s, consider:
-- 1. Check network between primary/replica
-- 2. Check replica disk I/O
-- 3. Consider failover if unrecoverable
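
The 60-second rule above can be wrapped in a small check so a monitoring script reports the same verdict a human would. A minimal sketch, assuming a hypothetical function `lag_action` that receives the `lag_seconds` value from the query:

```shell
#!/usr/bin/env bash
# Sketch: classify replication lag against the 60 second guideline.
# lag_action is a hypothetical helper; the threshold comes from the notes above.
lag_action() {
  local lag_seconds="$1"
  # extract(epoch ...) returns fractional seconds, so compare as floats in awk.
  if awk -v lag="$lag_seconds" 'BEGIN { exit !(lag + 0 > 60) }'; then
    echo "investigate"
  else
    echo "ok"
  fi
}

lag_action 75.5   # prints investigate
```

When the result is `investigate`, follow the checklist in order: network, replica disk I/O, then failover.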

Disk Space Critical

# Check disk usage
df -h /var/lib/postgresql/data

# Find large tables
psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;"

# VACUUM to reclaim space
# Caution: VACUUM FULL rewrites the table. It takes an ACCESS EXCLUSIVE lock and
# needs free space for the new copy, so it can fail or block traffic on a nearly full disk.
psql -c "VACUUM FULL large_table;"

# If emergency, delete old data or expand disk
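
To decide quickly whether the disk situation is critical, the `df` output can be reduced to a single percentage and compared with a limit. This is a sketch under stated assumptions: the helper names `df_capacity_pct` and `disk_is_critical` are hypothetical, and the 90% default is an illustrative threshold, not a value from this runbook.

```shell
#!/usr/bin/env bash
# Sketch: extract the usage percentage from `df -P` output read on stdin.
# With -P, df prints one header line and one data line per filesystem;
# column 5 is the capacity, for example "92%".
df_capacity_pct() {
  awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# Succeeds when the percentage is at or above the threshold (default 90, an assumption).
disk_is_critical() {
  local pct="$1" threshold="${2:-90}"
  [ "$pct" -ge "$threshold" ]
}

# Usage on a database host:
#   pct="$(df -P /var/lib/postgresql/data | df_capacity_pct)"
#   disk_is_critical "$pct" && echo "disk critical: ${pct}%"
```

Reading from stdin keeps the parser independent of the host, so the same function works on output pasted from a remote session.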


## Best Practices

### Do's
- **Keep runbooks updated** - Review after every incident
- **Test runbooks regularly** - Game days, chaos engineering
- **Include rollback steps** - Always have an escape hatch
- **Document assumptions** - What must be true for steps to work
- **Link to dashboards** - Quick access during stress

### Don'ts
- **Don't assume knowledge** - Write for 3 AM brain
- **Don't skip verification** - Confirm each step worked
- **Don't forget communication** - Keep stakeholders informed
- **Don't work alone** - Escalate early
- **Don't skip postmortems** - Learn from every incident
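
The advice to confirm each step worked can be built into the commands themselves. The sketch below wraps any command so that it is logged with a timestamp and its failure is reported with the original exit status. The name `run_step` is hypothetical, and the log format is only an illustration.

```shell
#!/usr/bin/env bash
# Sketch: run one runbook step, log it, and report whether it succeeded.
# run_step is a hypothetical wrapper; logs go to stderr so stdout stays clean.
run_step() {
  local description="$1"
  shift
  printf '[%s] START: %s\n' "$(date -u +%H:%M:%SZ)" "$description" >&2
  if "$@"; then
    printf '[%s] OK:    %s\n' "$(date -u +%H:%M:%SZ)" "$description" >&2
  else
    # Inside the else branch, $? still holds the failed command's status.
    local status=$?
    printf '[%s] FAIL:  %s (exit %d)\n' "$(date -u +%H:%M:%SZ)" "$description" "$status" >&2
    return "$status"
  fi
}

# Example with harmless commands; in a real runbook the command would be
# something like: kubectl rollout status deployment/payment-service -n payments
run_step "placeholder check that succeeds" true
run_step "placeholder check that fails" false || echo "stopping: previous step failed"
```

The timestamps double as raw material for the postmortem timeline.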

## Resources

- [Google SRE Book - Incident Management](https://sre.google/sre-book/managing-incidents/)
- [PagerDuty Incident Response](https://response.pagerduty.com/)
- [Atlassian Incident Management](https://www.atlassian.com/incident-management)

Environment Matrix

Dependencies

Kubernetes cluster access

Database access (PostgreSQL)

Monitoring tools (Prometheus/Grafana)

Alert management (PagerDuty)

Framework Support

Kubernetes ✓ (recommended) PostgreSQL ✓ Prometheus ✓ Grafana ✓ PagerDuty ✓
