Incident Runbook Templates

Production-ready incident response runbooks that save precious minutes


Create structured incident response runbooks with step-by-step procedures, escalation paths, and recovery actions. Use when building runbooks, responding to incidents, or establishing incident response procedures.

Tags: incident-response, runbooks, devops, sre, on-call, troubleshooting, production, escalation

Example

User prompt:

Create an incident runbook for our payment processing service that handles outages, including detection, triage, and recovery steps.

Agent response:

A structured runbook with severity classification, quick triage steps, mitigation procedures, verification steps, and escalation paths.

Quick Start (3 Steps)

Get up and running in minutes

1. Install

claude-code skill install incident-runbook-templates

2. Config

3. First Trigger

@incident-runbook-templates help

Commands

| Command | Description | Required Args |
|---------|-------------|---------------|
| @incident-runbook-templates service-outage-response | Generate comprehensive runbooks for handling complete service outages with step-by-step recovery procedures | None |
| @incident-runbook-templates database-emergency-procedures | Build runbooks for critical database incidents including connection issues, replication lag, and disk space problems | None |
| @incident-runbook-templates on-call-engineer-onboarding | Create standardized incident response procedures for new team members joining on-call rotation | None |

Typical Use Cases

Service Outage Response

Generate comprehensive runbooks for handling complete service outages with step-by-step recovery procedures

Database Emergency Procedures

Build runbooks for critical database incidents including connection issues, replication lag, and disk space problems

On-Call Engineer Onboarding

Create standardized incident response procedures for new team members joining on-call rotation

Overview

Incident Runbook Templates

Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.

When to Use This Skill

  • Creating incident response procedures
  • Building service-specific runbooks
  • Establishing escalation paths
  • Documenting recovery procedures
  • Responding to active incidents
  • Onboarding on-call engineers

Core Concepts

1. Incident Severity Levels

| Severity | Impact | Response Time | Example |
|----------|--------|---------------|---------|
| SEV1 | Complete outage, data loss | 15 min | Production down |
| SEV2 | Major degradation | 30 min | Critical feature broken |
| SEV3 | Minor impact | 2 hours | Non-critical bug |
| SEV4 | Minimal impact | Next business day | Cosmetic issue |
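The severity table can be encoded as a small lookup so that paging scripts and humans read the same targets. The following is a minimal sketch; the function name `sev_response_time` is hypothetical and not part of the skill.

```bash
# Hypothetical helper: print the response-time target for a severity level,
# using the values from the severity table. Returns 1 for an unknown level.
sev_response_time() {
  case "$1" in
    SEV1) echo "15 min" ;;
    SEV2) echo "30 min" ;;
    SEV3) echo "2 hours" ;;
    SEV4) echo "Next business day" ;;
    *)    echo "unknown severity: $1" >&2
          return 1 ;;
  esac
}
```

For example, `sev_response_time SEV2` prints the SEV2 target, and an unrecognized level fails loudly instead of silently defaulting.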

2. Runbook Structure

1. Overview & Impact
2. Detection & Alerts
3. Initial Triage
4. Mitigation Steps
5. Root Cause Investigation
6. Resolution Procedures
7. Verification & Rollback
8. Communication Templates
9. Escalation Matrix
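One way to start a new runbook is to generate an empty skeleton with these nine sections and fill it in per service. This is a sketch only; `runbook_skeleton` and the `TODO` placeholders are assumptions made for illustration.

```bash
# Hypothetical helper: print an empty markdown runbook containing the nine
# sections of the structure listed above. The body runs in a subshell so the
# loop variables do not leak into the caller.
runbook_skeleton() (
  printf '# %s Runbook\n' "$1"
  n=1
  for section in \
    "Overview & Impact" \
    "Detection & Alerts" \
    "Initial Triage" \
    "Mitigation Steps" \
    "Root Cause Investigation" \
    "Resolution Procedures" \
    "Verification & Rollback" \
    "Communication Templates" \
    "Escalation Matrix"
  do
    printf '\n## %d. %s\n\nTODO\n' "$n" "$section"
    n=$((n + 1))
  done
)
```

Running `runbook_skeleton "Payment Service" > payment-service-runbook.md` would write the skeleton to a file ready for editing.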

Runbook Templates

Template 1: Service Outage Runbook

# [Service Name] Outage Runbook

## Overview

**Service**: Payment Processing Service
**Owner**: Platform Team
**Slack**: #payments-incidents
**PagerDuty**: payments-oncall

## Impact Assessment

- [ ] Which customers are affected?
- [ ] What percentage of traffic is impacted?
- [ ] Are there financial implications?
- [ ] What's the blast radius?

## Detection

### Alerts

- `payment_error_rate > 5%` (PagerDuty)
- `payment_latency_p99 > 2s` (Slack)
- `payment_success_rate < 95%` (PagerDuty)

### Dashboards

- [Payment Service Dashboard](https://grafana/d/payments)
- [Error Tracking](https://sentry.io/payments)
- [Dependency Status](https://status.stripe.com)

## Initial Triage (First 5 Minutes)

### 1. Assess Scope

```bash
# Check service health
kubectl get pods -n payments -l app=payment-service

# Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Check the 5xx error rate. -G with --data-urlencode keeps curl from
# treating the braces and brackets in the PromQL as URL globs.
curl -s -G "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(http_requests_total{status=~'5..'}[5m]))"
```

2. Quick Health Checks

  • Can you reach the service? curl -I https://api.company.com/payments/health
  • Database connectivity? Check connection pool metrics
  • External dependencies? Check Stripe, bank API status
  • Recent changes? Check deploy history

3. Initial Classification

| Symptom | Likely Cause | Go To Section |
|---------|--------------|---------------|
| All requests failing | Service down | Section 4.1 |
| High latency | Database/dependency | Section 4.2 |
| Partial failures | Code bug | Section 4.3 |
| Spike in errors | Traffic surge | Section 4.4 |
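The classification table can also be expressed as a lookup, which is handy when a chat bot or alert annotation should point the responder at the right mitigation section. This is an illustrative sketch; the symptom keywords and the function name `triage_section` are hypothetical.

```bash
# Hypothetical helper: map a symptom keyword to the mitigation section
# suggested by the classification table. The keywords are illustrative.
triage_section() {
  case "$1" in
    all-failing)      echo "4.1 Service Completely Down" ;;
    high-latency)     echo "4.2 High Latency" ;;
    partial-failures) echo "4.3 Partial Failures (Specific Errors)" ;;
    error-spike)      echo "4.4 Traffic Surge" ;;
    *) echo "no matching symptom: escalate and investigate manually" >&2
       return 1 ;;
  esac
}
```

A symptom that does not match any row returns a non-zero status, so callers can fall back to escalation instead of guessing a section.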

Mitigation Procedures

4.1 Service Completely Down

# Step 1: Check pod status
kubectl get pods -n payments

# Step 2: If pods are crash-looping, check logs
kubectl logs -n payments -l app=payment-service --tail=100

# Step 3: Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Step 4: ROLLBACK if recent deploy is suspect
kubectl rollout undo deployment/payment-service -n payments

# Step 5: Scale up if resource constrained
kubectl scale deployment/payment-service -n payments --replicas=10

# Step 6: Verify recovery
kubectl rollout status deployment/payment-service -n payments

4.2 High Latency

# Step 1: Check database connections
kubectl exec -n payments deploy/payment-service -- \
  curl -s localhost:8080/metrics | grep db_pool

# Step 2: Check slow queries (if DB issue)
# The WHERE clause repeats the expression because PostgreSQL does not
# allow a SELECT alias (duration) to be referenced there.
psql -h $DB_HOST -U $DB_USER -c "
  SELECT pid, now() - query_start AS duration, query
  FROM pg_stat_activity
  WHERE state = 'active' AND now() - query_start > interval '5 seconds'
  ORDER BY duration DESC;"

# Step 3: Kill long-running queries if needed
# Set PID to a pid value returned by Step 2 before running this.
psql -h $DB_HOST -U $DB_USER -c "SELECT pg_terminate_backend($PID);"

# Step 4: Check external dependency latency
# (curl-format.txt is a local file containing the -w output format)
curl -w "@curl-format.txt" -o /dev/null -s https://api.stripe.com/v1/health

# Step 5: Enable circuit breaker if dependency is slow
kubectl set env deployment/payment-service \
  STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments

4.3 Partial Failures (Specific Errors)

# Step 1: Identify error pattern
kubectl logs -n payments -l app=payment-service --tail=500 | \
  grep -i error | sort | uniq -c | sort -rn | head -20

# Step 2: Check error tracking
# Go to Sentry: https://sentry.io/payments

# Step 3: If specific endpoint, enable feature flag to disable
# (the JSON body needs an explicit Content-Type; curl -d defaults to form encoding)
curl -X POST https://api.company.com/internal/feature-flags \
  -H 'Content-Type: application/json' \
  -d '{"flag": "DISABLE_PROBLEMATIC_FEATURE", "enabled": true}'

# Step 4: If data issue, check recent data changes
psql -h $DB_HOST -c "
  SELECT * FROM audit_log
  WHERE table_name = 'payment_methods'
  AND created_at > now() - interval '1 hour';"

4.4 Traffic Surge

# Step 1: Check current load per pod
# (kubectl top reports CPU and memory usage, not request rate)
kubectl top pods -n payments

# Step 2: Scale horizontally
kubectl scale deployment/payment-service -n payments --replicas=20

# Step 3: Enable rate limiting
kubectl set env deployment/payment-service \
  RATE_LIMIT_ENABLED=true \
  RATE_LIMIT_RPS=1000 -n payments

# Step 4: If attack, block suspicious IPs
# (allows ingress from everywhere except the listed range)
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-suspicious
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  ingress:
  - from:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 192.168.1.0/24  # Suspicious range
EOF

Verification Steps

# Verify service is healthy
curl -s https://api.company.com/payments/health | jq .

# Verify error rate is back to normal
# (-G with --data-urlencode avoids curl globbing on the PromQL braces/brackets)
curl -s -G "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(http_requests_total{status=~'5..'}[5m]))" \
  | jq '.data.result[0].value[1]'

# Verify latency is acceptable
curl -s -G "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=histogram_quantile(0.99,sum(rate(http_request_duration_seconds_bucket[5m]))by(le))" \
  | jq .

# Smoke test critical flows
./scripts/smoke-test-payments.sh
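Verification is easier to automate when "back to normal" is an explicit comparison rather than a judgment call. The sketch below assumes you have already obtained a numeric error ratio (for example 0.012 for 1.2%) and compares it with the 5% alert threshold from the Detection section. The function name `below_threshold` is hypothetical.

```bash
# Hypothetical helper: succeed (exit 0) when a measured value is strictly
# below a threshold. awk does the comparison because POSIX shell arithmetic
# is integer-only.
below_threshold() {
  awk -v v="$1" -v t="$2" 'BEGIN { exit !(v + 0 < t + 0) }'
}

# Example: gate the "resolved" announcement on the error ratio.
if below_threshold "0.012" "0.05"; then
  echo "error ratio is below the 5% alert threshold"
fi
```

The same helper works for latency, for instance `below_threshold "$p99_seconds" 2` against the 2s alert, where `p99_seconds` is a variable you would populate from your own query.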

Rollback Procedures

# Rollback Kubernetes deployment
kubectl rollout undo deployment/payment-service -n payments

# Rollback database migration (if applicable)
./scripts/db-rollback.sh $MIGRATION_VERSION

# Rollback feature flag
curl -X POST https://api.company.com/internal/feature-flags \
  -H 'Content-Type: application/json' \
  -d '{"flag": "NEW_PAYMENT_FLOW", "enabled": false}'

Escalation Matrix

| Condition | Escalate To | Contact |
|-----------|-------------|---------|
| > 15 min unresolved SEV1 | Engineering Manager | @manager (Slack) |
| Data breach suspected | Security Team | #security-incidents |
| Financial impact > $10k | Finance + Legal | @finance-oncall |
| Customer communication needed | Support Lead | @support-lead |
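The two numeric rows of the escalation matrix can be checked mechanically, for example from a timer that fires during an open incident. The sketch below covers only those rows; suspected data breaches and customer communication remain human judgment calls. The function name and argument order are assumptions.

```bash
# Hypothetical helper: print who to escalate to, based on the numeric rows
# of the escalation matrix.
# Usage: escalation_targets <severity> <minutes_unresolved> <financial_impact_usd>
escalation_targets() (
  severity=$1 minutes=$2 impact_usd=$3
  if [ "$severity" = "SEV1" ] && [ "$minutes" -gt 15 ]; then
    echo "Engineering Manager: @manager (Slack)"
  fi
  if [ "$impact_usd" -gt 10000 ]; then
    echo "Finance + Legal: @finance-oncall"
  fi
)
```

Calling `escalation_targets SEV1 20 0` prints the Engineering Manager line, and an incident that meets neither condition prints nothing.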

Communication Templates

Initial Notification (Internal)

🚨 INCIDENT: Payment Service Degradation

Severity: SEV2
Status: Investigating
Impact: ~20% of payment requests failing
Start Time: [TIME]
Incident Commander: [NAME]

Current Actions:
- Investigating root cause
- Scaling up service
- Monitoring dashboards

Updates in #payments-incidents

Status Update

📊 UPDATE: Payment Service Incident

Status: Mitigating
Impact: Reduced to ~5% failure rate
Duration: 25 minutes

Actions Taken:
- Rolled back deployment v2.3.4 → v2.3.3
- Scaled service from 5 → 10 replicas

Next Steps:
- Continuing to monitor
- Root cause analysis in progress

ETA to Resolution: ~15 minutes

Resolution Notification

✅ RESOLVED: Payment Service Incident

Duration: 45 minutes
Impact: ~5,000 affected transactions
Root Cause: Memory leak in v2.3.4

Resolution:
- Rolled back to v2.3.3
- Transactions auto-retried successfully

Follow-up:
- Postmortem scheduled for [DATE]
- Bug fix in progress
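Filling in a template by hand under pressure invites typos, so the initial notification can be rendered from a few arguments instead. This is a minimal sketch that covers only the header fields of the Initial Notification template; the function name and argument order are assumptions, and the "Current Actions" list is left for the responder to add.

```bash
# Hypothetical helper: render the header of the initial internal notification.
# Usage: initial_notification <title> <severity> <impact> <start_time> <commander>
initial_notification() {
  cat <<EOF
🚨 INCIDENT: $1

Severity: $2
Status: Investigating
Impact: $3
Start Time: $4
Incident Commander: $5

Updates in #payments-incidents
EOF
}
```

The output can be pasted into the incident channel or piped to whatever chat tooling your team already uses.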

### Template 2: Database Incident Runbook

# Database Incident Runbook

## Quick Reference
| Issue | Command |
|-------|---------|
| Check connections | `SELECT count(*) FROM pg_stat_activity;` |
| Kill query | `SELECT pg_terminate_backend(<pid>);` |
| Check replication lag | `SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp()));` |
| Check locks | `SELECT * FROM pg_locks WHERE NOT granted;` |

## Connection Pool Exhaustion
```sql
-- Check current connections
SELECT datname, usename, state, count(*)
FROM pg_stat_activity
GROUP BY datname, usename, state
ORDER BY count(*) DESC;

-- Identify long-running connections
SELECT pid, usename, datname, state, query_start, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;

-- Terminate connections that have been idle for more than 10 minutes
-- (state_change records when the session entered its current state)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND state_change < now() - interval '10 minutes';
```

Replication Lag

-- Check lag on replica
SELECT
  CASE
    WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
    ELSE extract(epoch from now() - pg_last_xact_replay_timestamp())
  END AS lag_seconds;

-- If lag > 60s, consider:
-- 1. Check network between primary/replica
-- 2. Check replica disk I/O
-- 3. Consider failover if unrecoverable
Disk Space Critical

# Check disk usage
df -h /var/lib/postgresql/data

# Find large tables
psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;"

# VACUUM to reclaim space
# Caution: VACUUM FULL rewrites the table, so it needs free disk space for
# the new copy and holds an ACCESS EXCLUSIVE lock while it runs.
psql -c "VACUUM FULL large_table;"

# If emergency, delete old data or expand disk
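The `df` check above can be turned into a yes/no test for scripts and cron jobs. The sketch below parses `df -P` (POSIX output format) for the filesystem that holds a given path. The 90% default threshold and both function names are assumptions, not values from the runbook.

```bash
# Hypothetical helper: print the used percentage (digits only) of the
# filesystem that holds the given path.
disk_used_percent() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: succeed when usage is at or above the threshold
# (second argument, defaulting to an assumed 90 percent).
disk_is_critical() {
  [ "$(disk_used_percent "$1")" -ge "${2:-90}" ]
}
```

For example, `disk_is_critical /var/lib/postgresql/data 85 && echo "page the DBA"` would alert once usage reaches 85%.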

## Best Practices

### Do's
- **Keep runbooks updated** - Review after every incident
- **Test runbooks regularly** - Game days, chaos engineering
- **Include rollback steps** - Always have an escape hatch
- **Document assumptions** - What must be true for steps to work
- **Link to dashboards** - Quick access during stress

### Don'ts
- **Don't assume knowledge** - Write for 3 AM brain
- **Don't skip verification** - Confirm each step worked
- **Don't forget communication** - Keep stakeholders informed
- **Don't work alone** - Escalate early
- **Don't skip postmortems** - Learn from every incident

## Resources

- [Google SRE Book - Incident Management](https://sre.google/sre-book/managing-incidents/)
- [PagerDuty Incident Response](https://response.pagerduty.com/)
- [Atlassian Incident Management](https://www.atlassian.com/incident-management)


Environment Matrix

Dependencies

Kubernetes cluster access
Database access (PostgreSQL)
Monitoring tools (Prometheus/Grafana)
Alert management (PagerDuty)

Framework Support

Kubernetes ✓ (recommended) PostgreSQL ✓ Prometheus ✓ Grafana ✓ PagerDuty ✓

Context Window

Token Usage ~3K-8K tokens for comprehensive runbook generation

Information

Author: wshobson
Updated: 2026-01-30
Category: productivity-tools