Incident Runbook Templates
Production-ready incident response runbooks that save precious minutes
Create structured incident response runbooks with step-by-step procedures, escalation paths, and recovery actions. Use when building runbooks, responding to incidents, or establishing incident response procedures.
User Prompt
Create an incident runbook for our payment processing service that handles outages, including detection, triage, and recovery steps
Agent Response
A structured runbook with severity classification, quick triage steps, mitigation procedures, verification steps, and escalation paths
Quick Start (3 Steps)
Get up and running in minutes
Install
claude-code skill install incident-runbook-templates
Config
First Trigger
@incident-runbook-templates help
Commands
| Command | Description | Required Args |
|---|---|---|
| @incident-runbook-templates service-outage-response | Generate comprehensive runbooks for handling complete service outages with step-by-step recovery procedures | None |
| @incident-runbook-templates database-emergency-procedures | Build runbooks for critical database incidents including connection issues, replication lag, and disk space problems | None |
| @incident-runbook-templates on-call-engineer-onboarding | Create standardized incident response procedures for new team members joining on-call rotation | None |
Typical Use Cases
Service Outage Response
Generate comprehensive runbooks for handling complete service outages with step-by-step recovery procedures
Database Emergency Procedures
Build runbooks for critical database incidents including connection issues, replication lag, and disk space problems
On-Call Engineer Onboarding
Create standardized incident response procedures for new team members joining on-call rotation
Overview
Incident Runbook Templates
Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.
When to Use This Skill
- Creating incident response procedures
- Building service-specific runbooks
- Establishing escalation paths
- Documenting recovery procedures
- Responding to active incidents
- Onboarding on-call engineers
Core Concepts
1. Incident Severity Levels
| Severity | Impact | Response Time | Example |
|---|---|---|---|
| SEV1 | Complete outage, data loss | 15 min | Production down |
| SEV2 | Major degradation | 30 min | Critical feature broken |
| SEV3 | Minor impact | 2 hours | Non-critical bug |
| SEV4 | Minimal impact | Next business day | Cosmetic issue |
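
The severity table can be encoded as a small lookup so that scripts and people read the same response targets. This is a minimal sketch; the function name `sev_response_target` is hypothetical, and the values are copied from the table above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a severity label to the response target
# listed in the severity table.
sev_response_target() {
  case "$1" in
    SEV1) echo "15 min" ;;
    SEV2) echo "30 min" ;;
    SEV3) echo "2 hours" ;;
    SEV4) echo "Next business day" ;;
    *)
      # Unknown labels are an error rather than a silent default.
      echo "unknown severity: $1" >&2
      return 1
      ;;
  esac
}
```

For example, `sev_response_target SEV2` prints `30 min`, and an unrecognized label returns a non-zero exit status.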
2. Runbook Structure
1. Overview & Impact
2. Detection & Alerts
3. Initial Triage
4. Mitigation Steps
5. Root Cause Investigation
6. Resolution Procedures
7. Verification & Rollback
8. Communication Templates
9. Escalation Matrix
Runbook Templates
Template 1: Service Outage Runbook
# [Service Name] Outage Runbook

## Overview

**Service**: Payment Processing Service
**Owner**: Platform Team
**Slack**: #payments-incidents
**PagerDuty**: payments-oncall

## Impact Assessment

- [ ] Which customers are affected?
- [ ] What percentage of traffic is impacted?
- [ ] Are there financial implications?
- [ ] What's the blast radius?

## Detection

### Alerts

- `payment_error_rate > 5%` (PagerDuty)
- `payment_latency_p99 > 2s` (Slack)
- `payment_success_rate < 95%` (PagerDuty)

### Dashboards

- [Payment Service Dashboard](https://grafana/d/payments)
- [Error Tracking](https://sentry.io/payments)
- [Dependency Status](https://status.stripe.com)

## Initial Triage (First 5 Minutes)

### 1. Assess Scope

```bash
# Check service health
kubectl get pods -n payments -l app=payment-service

# Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Check error rates
# -G with --data-urlencode stops curl from treating {} and [] as URL globs
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(http_requests_total{status=~'5..'}[5m]))"
```
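
Once the scope checks return numbers, they can be compared against the alert thresholds listed under Detection. The sketch below assumes the readings have already been converted to the units the alerts use (percentages and seconds); the function name `firing_alerts` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the name of each Detection alert that would
# fire for the given readings.
# Arguments: error rate (%), p99 latency (seconds), success rate (%).
firing_alerts() {
  # awk handles the decimal comparisons that plain shell arithmetic cannot.
  awk -v err="$1" -v p99="$2" -v ok="$3" 'BEGIN {
    if (err + 0 > 5)  print "payment_error_rate"
    if (p99 + 0 > 2)  print "payment_latency_p99"
    if (ok + 0 < 95)  print "payment_success_rate"
  }'
}
```

With readings of 7.5% errors, 1.2 s p99, and 92% success, `firing_alerts 7.5 1.2 92` prints `payment_error_rate` and `payment_success_rate`. Empty output means no threshold is crossed.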
### 2. Quick Health Checks

- Can you reach the service? `curl -I https://api.company.com/payments/health`
- Database connectivity? Check connection pool metrics
- External dependencies? Check Stripe, bank API status
- Recent changes? Check deploy history

### 3. Initial Classification

| Symptom | Likely Cause | Go To Section |
|---|---|---|
| All requests failing | Service down | Section 4.1 |
| High latency | Database/dependency | Section 4.2 |
| Partial failures | Code bug | Section 4.3 |
| Spike in errors | Traffic surge | Section 4.4 |

## Mitigation Procedures

### 4.1 Service Completely Down

```bash
# Step 1: Check pod status
kubectl get pods -n payments

# Step 2: If pods are crash-looping, check logs
kubectl logs -n payments -l app=payment-service --tail=100

# Step 3: Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Step 4: ROLLBACK if recent deploy is suspect
kubectl rollout undo deployment/payment-service -n payments

# Step 5: Scale up if resource constrained
kubectl scale deployment/payment-service -n payments --replicas=10

# Step 6: Verify recovery
kubectl rollout status deployment/payment-service -n payments
```

### 4.2 High Latency

```bash
# Step 1: Check database connection pool metrics
kubectl exec -n payments deploy/payment-service -- \
  curl -s localhost:8080/metrics | grep db_pool

# Step 2: Check slow queries (if DB issue)
# A column alias cannot be used in WHERE, so the duration expression is repeated there.
psql -h "$DB_HOST" -U "$DB_USER" -c "
  SELECT pid, now() - query_start AS duration, query
  FROM pg_stat_activity
  WHERE state = 'active' AND now() - query_start > interval '5 seconds'
  ORDER BY duration DESC;"

# Step 3: Kill long-running queries if needed
# Set PID to a pid returned by Step 2 before running this.
psql -h "$DB_HOST" -U "$DB_USER" -c "SELECT pg_terminate_backend($PID);"

# Step 4: Check external dependency latency
# curl-format.txt is a local file holding the -w output format (timing variables)
curl -w "@curl-format.txt" -o /dev/null -s https://api.stripe.com/v1/health

# Step 5: Enable circuit breaker if dependency is slow
kubectl set env deployment/payment-service \
  STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments
```

### 4.3 Partial Failures (Specific Errors)

```bash
# Step 1: Identify error pattern
kubectl logs -n payments -l app=payment-service --tail=500 | \
  grep -i error | sort | uniq -c | sort -rn | head -20

# Step 2: Check error tracking
# Go to Sentry: https://sentry.io/payments

# Step 3: If a specific endpoint is failing, set the feature flag that disables it
curl -X POST https://api.company.com/internal/feature-flags \
  -H "Content-Type: application/json" \
  -d '{"flag": "DISABLE_PROBLEMATIC_FEATURE", "enabled": true}'

# Step 4: If data issue, check recent data changes
psql -h "$DB_HOST" -U "$DB_USER" -c "
  SELECT * FROM audit_log
  WHERE table_name = 'payment_methods'
  AND created_at > now() - interval '1 hour';"
```

### 4.4 Traffic Surge

```bash
# Step 1: Check current CPU and memory usage per pod
# (kubectl top reports resource usage, not request rate)
kubectl top pods -n payments

# Step 2: Scale horizontally
kubectl scale deployment/payment-service -n payments --replicas=20

# Step 3: Enable rate limiting
kubectl set env deployment/payment-service \
  RATE_LIMIT_ENABLED=true \
  RATE_LIMIT_RPS=1000 -n payments

# Step 4: If attack, block suspicious IPs
# Allows ingress from everywhere except the listed range.
# Only takes effect if the cluster's network plugin enforces NetworkPolicy.
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-suspicious
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  ingress:
    - from:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 192.168.1.0/24 # Suspicious range
EOF
```

## Verification Steps

```bash
# Verify service is healthy
curl -s https://api.company.com/payments/health | jq

# Verify error rate is back to normal
# -G with --data-urlencode stops curl from treating {} and [] as URL globs
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(http_requests_total{status=~'5..'}[5m]))" \
  | jq '.data.result[0].value[1]'

# Verify latency is acceptable
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=histogram_quantile(0.99,sum(rate(http_request_duration_seconds_bucket[5m]))by(le))" \
  | jq

# Smoke test critical flows
./scripts/smoke-test-payments.sh
```
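
Recovery is rarely instant, so a verification check often needs to be repeated for a short while before it passes. The sketch below shows one way to wrap any of the checks above in a bounded retry loop; `wait_until` and its argument order are hypothetical and not part of any tool named in this runbook.

```bash
#!/usr/bin/env bash
# Hypothetical helper: retry a check until it succeeds or attempts run out.
# Usage: wait_until <attempts> <delay_seconds> <command> [args...]
wait_until() {
  local attempts
  local delay
  local i
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      echo "check passed on attempt $i"
      return 0
    fi
    # Sleep between attempts, but not after the final one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "check still failing after $attempts attempts" >&2
  return 1
}
```

For instance, `wait_until 10 15 curl -sf https://api.company.com/payments/health` would retry the health check up to ten times, 15 seconds apart, and return non-zero if it never succeeds.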

## Rollback Procedures

```bash
# Rollback Kubernetes deployment
kubectl rollout undo deployment/payment-service -n payments

# Rollback database migration (if applicable)
./scripts/db-rollback.sh "$MIGRATION_VERSION"

# Rollback feature flag
curl -X POST https://api.company.com/internal/feature-flags \
  -H "Content-Type: application/json" \
  -d '{"flag": "NEW_PAYMENT_FLOW", "enabled": false}'
```

## Escalation Matrix

| Condition | Escalate To | Contact |
|---|---|---|
| > 15 min unresolved SEV1 | Engineering Manager | @manager (Slack) |
| Data breach suspected | Security Team | #security-incidents |
| Financial impact > $10k | Finance + Legal | @finance-oncall |
| Customer communication needed | Support Lead | @support-lead |
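
The matrix above is a set of independent conditions, so more than one escalation can apply at once. A minimal sketch of that logic follows; the function name `escalation_targets` and its argument order are hypothetical, while the thresholds and contacts are taken from the table.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list every escalation that applies, per the matrix above.
# Arguments: severity, minutes unresolved, estimated financial impact (whole USD),
#            breach suspected (yes/no), customer communication needed (yes/no).
escalation_targets() {
  local sev
  local minutes
  local impact_usd
  local breach
  local customer_comms
  sev="$1"
  minutes="$2"
  impact_usd="$3"
  breach="$4"
  customer_comms="$5"

  if [ "$sev" = "SEV1" ] && [ "$minutes" -gt 15 ]; then
    echo "Engineering Manager (@manager)"
  fi
  if [ "$breach" = "yes" ]; then
    echo "Security Team (#security-incidents)"
  fi
  if [ "$impact_usd" -gt 10000 ]; then
    echo "Finance + Legal (@finance-oncall)"
  fi
  if [ "$customer_comms" = "yes" ]; then
    echo "Support Lead (@support-lead)"
  fi
  return 0
}
```

Calling `escalation_targets SEV1 20 0 no no` prints only the Engineering Manager line. Empty output means no escalation condition is met yet.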

## Communication Templates

### Initial Notification (Internal)

🚨 INCIDENT: Payment Service Degradation
Severity: SEV2
Status: Investigating
Impact: ~20% of payment requests failing
Start Time: [TIME]
Incident Commander: [NAME]
Current Actions:
- Investigating root cause
- Scaling up service
- Monitoring dashboards
Updates in #payments-incidents
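
Filling in a template by hand is error-prone under pressure, so the notification can be generated from a few arguments instead. This is a sketch only: `render_initial_notification` is a hypothetical name, and it covers the header fields of the template above, leaving the "Current Actions" list to be written by the responder.

```bash
#!/usr/bin/env bash
# Hypothetical helper: render the header of the initial notification.
# Arguments: title, severity, impact, start time, incident commander, channel.
render_initial_notification() {
  # Unquoted heredoc delimiter so the positional parameters are expanded.
  cat <<EOF
🚨 INCIDENT: $1
Severity: $2
Status: Investigating
Impact: $3
Start Time: $4
Incident Commander: $5
Updates in $6
EOF
}
```

The output can be pasted into Slack or piped to whatever posting tool the team already uses.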

### Status Update

📊 UPDATE: Payment Service Incident
Status: Mitigating
Impact: Reduced to ~5% failure rate
Duration: 25 minutes
Actions Taken:
- Rolled back deployment v2.3.4 → v2.3.3
- Scaled service from 5 → 10 replicas
Next Steps:
- Continuing to monitor
- Root cause analysis in progress
ETA to Resolution: ~15 minutes

### Resolution Notification

✅ RESOLVED: Payment Service Incident
Duration: 45 minutes
Impact: ~5,000 affected transactions
Root Cause: Memory leak in v2.3.4
Resolution:
- Rolled back to v2.3.3
- Transactions auto-retried successfully
Follow-up:
- Postmortem scheduled for [DATE]
- Bug fix in progress

### Template 2: Database Incident Runbook

# Database Incident Runbook

## Quick Reference

| Issue | Command |
|-------|---------|
| Check connections | `SELECT count(*) FROM pg_stat_activity;` |
| Kill query | `SELECT pg_terminate_backend(<pid>);` |
| Check replication lag | `SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp()));` |
| Check locks | `SELECT * FROM pg_locks WHERE NOT granted;` |

## Connection Pool Exhaustion

```sql
-- Check current connections
SELECT datname, usename, state, count(*)
FROM pg_stat_activity
GROUP BY datname, usename, state
ORDER BY count(*) DESC;

-- Identify long-running connections
SELECT pid, usename, datname, state, query_start, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;

-- Terminate connections that have been idle for more than 10 minutes
-- state_change records when the connection entered its current state;
-- the current session is excluded so the command does not kill itself.
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND state_change < now() - interval '10 minutes'
  AND pid <> pg_backend_pid();
```

## Replication Lag

```sql
-- Check lag on replica
SELECT
  CASE
    WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
    ELSE extract(epoch from now() - pg_last_xact_replay_timestamp())
  END AS lag_seconds;

-- If lag > 60s, consider:
-- 1. Check network between primary/replica
-- 2. Check replica disk I/O
-- 3. Consider failover if unrecoverable
```

## Disk Space Critical

```bash
# Check disk usage
df -h /var/lib/postgresql/data

# Find large tables
psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;"

# VACUUM to reclaim space
# Caution: VACUUM FULL takes an exclusive lock on the table and writes a full
# new copy of it, so it needs free disk space roughly equal to the table's size.
psql -c "VACUUM FULL large_table;"

# If emergency, delete old data or expand disk
```
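
The disk check can also be turned into a yes/no signal for scripts. The sketch below parses the POSIX-format output of `df -P`; the function names `df_usage_pct` and `disk_is_critical` are hypothetical, and the default threshold of 90% is an assumed value, not one stated in this runbook.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `df -P` output on stdin and print the usage
# percentage (without the % sign) of the first filesystem listed.
df_usage_pct() {
  # Line 1 is the header; field 5 of line 2 is the Capacity column.
  awk 'NR == 2 { sub(/%$/, "", $5); print $5 }'
}

# Hypothetical helper: succeed when usage is at or above the threshold.
# The default of 90 is an assumption; pass a second argument to override it.
disk_is_critical() {
  local pct
  local threshold
  pct="$1"
  threshold="${2:-90}"
  [ "$pct" -ge "$threshold" ]
}
```

A typical call would be `disk_is_critical "$(df -P /var/lib/postgresql/data | df_usage_pct)"`, which can gate an alert or a cleanup step.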
## Best Practices
### Do's
- **Keep runbooks updated** - Review after every incident
- **Test runbooks regularly** - Game days, chaos engineering
- **Include rollback steps** - Always have an escape hatch
- **Document assumptions** - What must be true for steps to work
- **Link to dashboards** - Quick access during stress
### Don'ts
- **Don't assume knowledge** - Write for 3 AM brain
- **Don't skip verification** - Confirm each step worked
- **Don't forget communication** - Keep stakeholders informed
- **Don't work alone** - Escalate early
- **Don't skip postmortems** - Learn from every incident
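
The "Don't skip verification" rule can be built into the commands themselves. The sketch below wraps a runbook step so its outcome is always printed and a failure stops the sequence; `run_step` is a hypothetical helper, not part of any tool referenced here.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run one runbook step, report the outcome,
# and return the step's exit status so callers can stop on failure.
# Usage: run_step "<description>" <command> [args...]
run_step() {
  local description
  local rc
  description="$1"
  shift
  echo "==> $description"
  if "$@"; then
    echo "OK: $description"
  else
    # In the else branch, $? still holds the exit status of the failed command.
    rc=$?
    echo "FAILED (exit $rc): $description" >&2
    return "$rc"
  fi
}
```

Steps can then be chained with `&&`, for example `run_step "Rollback" kubectl rollout undo deployment/payment-service -n payments && run_step "Wait for rollout" kubectl rollout status deployment/payment-service -n payments`, so a failed rollback is never followed by a misleading success message.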
## Resources
- [Google SRE Book - Incident Management](https://sre.google/sre-book/managing-incidents/)
- [PagerDuty Incident Response](https://response.pagerduty.com/)
- [Atlassian Incident Management](https://www.atlassian.com/incident-management)
Information
- Author: wshobson
- Updated: 2026-01-30
- Category: productivity-tools