On Call Handoff Patterns

Master seamless on-call transitions with expert handoff patterns


Master on-call shift handoffs with context transfer, escalation procedures, and documentation. Use when transitioning on-call responsibilities, documenting shift summaries, or improving on-call processes.

Tags: on-call, incident-response, documentation, handoff, sre, devops, shift-management, operations

Example

User prompt:

> Generate a shift handoff document for the platform team. We have one ongoing API timeout investigation and a major release scheduled for tomorrow.

Agent response:

> A complete handoff document covering active incidents, ongoing investigations, recent changes, known issues, and upcoming events.

Quick Start (3 Steps)

Get up and running in minutes

1. Install

   claude-code skill install on-call-handoff-patterns

2. Config

3. First Trigger

   @on-call-handoff-patterns help

Commands

| Command | Description | Required Args |
| --- | --- | --- |
| `@on-call-handoff-patterns standard-shift-handoff` | Create comprehensive handoff documentation for routine shift transitions | None |
| `@on-call-handoff-patterns mid-incident-handoff` | Transfer incident ownership during active emergencies | None |
| `@on-call-handoff-patterns quick-async-handoff` | Document essential information for time-sensitive handoffs | None |

Typical Use Cases

Standard Shift Handoff

Create comprehensive handoff documentation for routine shift transitions

Mid-Incident Handoff

Transfer incident ownership during active emergencies

Quick Async Handoff

Document essential information for time-sensitive handoffs

Overview

On-Call Handoff Patterns

Effective patterns for on-call shift transitions, ensuring continuity, context transfer, and reliable incident response across shifts.

When to Use This Skill

  • Transitioning on-call responsibilities
  • Writing shift handoff summaries
  • Documenting ongoing investigations
  • Establishing on-call rotation procedures
  • Improving handoff quality
  • Onboarding new on-call engineers

Core Concepts

1. Handoff Components

| Component | Purpose |
| --- | --- |
| Active Incidents | What’s currently broken |
| Ongoing Investigations | Issues being debugged |
| Recent Changes | Deployments, configs |
| Known Issues | Workarounds in place |
| Upcoming Events | Maintenance, releases |

2. Handoff Timing

Recommended: 30 min overlap between shifts

Outgoing:
├── 15 min: Write handoff document
└── 15 min: Sync call with incoming

Incoming:
├── 15 min: Review handoff document
├── 15 min: Sync call with outgoing
└── 5 min: Verify alerting setup
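The overlap above can be sketched as a small shell helper that prints the 15-minute checkpoints for a given handoff time. This is an illustrative sketch: the `handoff_schedule` function name and its output format are assumptions, not part of the skill.

```shell
# Illustrative helper: print 15-minute checkpoints for a handoff window.
# handoff_schedule and its output format are assumptions, not skill commands.
handoff_schedule() {
  # $1: handoff start in HH:MM (UTC), e.g. "09:00"
  h=${1%:*}; m=${1#*:}
  start=$(( ${h#0} * 60 + ${m#0} ))          # minutes since midnight
  for offset in 0 15 30; do
    t=$(( start + offset ))
    printf '%02d:%02d ' $(( t / 60 % 24 )) $(( t % 60 ))
  done
  echo '(write doc / sync call / shift begins)'
}
```

For example, `handoff_schedule 09:00` prints `09:00 09:15 09:30 (write doc / sync call / shift begins)`, matching the outgoing engineer's split above.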

Templates

Template 1: Shift Handoff Document

# On-Call Handoff: Platform Team

**Outgoing**: @alice (2024-01-15 to 2024-01-22)
**Incoming**: @bob (2024-01-22 to 2024-01-29)
**Handoff Time**: 2024-01-22 09:00 UTC

---

## 🔴 Active Incidents

### None currently active

No active incidents at handoff time.

---

## 🟡 Ongoing Investigations

### 1. Intermittent API Timeouts (ENG-1234)

**Status**: Investigating
**Started**: 2024-01-20
**Impact**: ~0.1% of requests timing out

**Context**:

- Timeouts correlate with database backup window (02:00-03:00 UTC)
- Suspect backup process causing lock contention
- Added extra logging in PR #567 (deployed 01/21)

**Next Steps**:

- [ ] Review new logs after tonight's backup
- [ ] Consider moving backup window if confirmed

**Resources**:

- Dashboard: [API Latency](https://grafana/d/api-latency)
- Thread: #platform-eng (01/20, 14:32)

---

### 2. Memory Growth in Auth Service (ENG-1235)

**Status**: Monitoring
**Started**: 2024-01-18
**Impact**: None yet (proactive)

**Context**:

- Memory usage growing ~5% per day
- No memory leak found in profiling
- Suspect connection pool not releasing properly

**Next Steps**:

- [ ] Review heap dump from 01/21
- [ ] Consider restart if usage > 80%

**Resources**:

- Dashboard: [Auth Service Memory](https://grafana/d/auth-memory)
- Analysis doc: [Memory Investigation](https://docs/eng-1235)

---

## 🟢 Resolved This Shift

### Payment Service Outage (2024-01-19)

- **Duration**: 23 minutes
- **Root Cause**: Database connection exhaustion
- **Resolution**: Rolled back v2.3.4, increased pool size
- **Postmortem**: [POSTMORTEM-89](https://docs/postmortem-89)
- **Follow-up tickets**: ENG-1230, ENG-1231

---

## 📋 Recent Changes

### Deployments

| Service      | Version | Time        | Notes                      |
| ------------ | ------- | ----------- | -------------------------- |
| api-gateway  | v3.2.1  | 01/21 14:00 | Bug fix for header parsing |
| user-service | v2.8.0  | 01/20 10:00 | New profile features       |
| auth-service | v4.1.2  | 01/19 16:00 | Security patch             |

### Configuration Changes

- 01/21: Increased API rate limit from 1000 to 1500 RPS
- 01/20: Updated database connection pool max from 50 to 75

### Infrastructure

- 01/20: Added 2 nodes to Kubernetes cluster
- 01/19: Upgraded Redis from 6.2 to 7.0

---

## ⚠️ Known Issues & Workarounds

### 1. Slow Dashboard Loading

**Issue**: Grafana dashboards slow on Monday mornings
**Workaround**: Wait 5 min after 08:00 UTC for cache warm-up
**Ticket**: OPS-456 (P3)

### 2. Flaky Integration Test

**Issue**: `test_payment_flow` fails intermittently in CI
**Workaround**: Re-run failed job (usually passes on retry)
**Ticket**: ENG-1200 (P2)

---

## 📅 Upcoming Events

| Date        | Event                | Impact              | Contact       |
| ----------- | -------------------- | ------------------- | ------------- |
| 01/23 02:00 | Database maintenance | 5 min read-only     | @dba-team     |
| 01/24 14:00 | Major release v5.0   | Monitor closely     | @release-team |
| 01/25       | Marketing campaign   | 2x traffic expected | @platform     |

---

## 📞 Escalation Reminders

| Issue Type      | First Escalation     | Second Escalation |
| --------------- | -------------------- | ----------------- |
| Payment issues  | @payments-oncall     | @payments-manager |
| Auth issues     | @auth-oncall         | @security-team    |
| Database issues | @dba-team            | @infra-manager    |
| Unknown/severe  | @engineering-manager | @vp-engineering   |

---

## 🔧 Quick Reference

### Common Commands

```bash
# Check service health
kubectl get pods -A | grep -v Running

# Recent deployments
kubectl get events --sort-by='.lastTimestamp' | tail -20

# Database connections
psql -c "SELECT count(*) FROM pg_stat_activity;"

# Clear cache (emergency only)
redis-cli FLUSHDB
```
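If you keep handoffs in per-team markdown files, a small script can scaffold the skeleton of Template 1 with the current date. This is a sketch under that assumption: the filename convention, `TEAM` default, and section subset are illustrative, not part of the skill.

```shell
#!/bin/sh
# Illustrative scaffold for a handoff doc; filename and team are assumptions.
TEAM="${1:-platform}"
DATE=$(date -u +%Y-%m-%d)
FILE="handoff-$TEAM-$DATE.md"
cat > "$FILE" <<EOF
# On-Call Handoff: $TEAM

**Handoff Time**: $DATE $(date -u +%H:%M) UTC

## 🔴 Active Incidents
## 🟡 Ongoing Investigations
## 🟢 Resolved This Shift
## 📋 Recent Changes
## ⚠️ Known Issues & Workarounds
## 📅 Upcoming Events
EOF
echo "created $FILE"
```

Running it at the start of the outgoing engineer's 15-minute writing window leaves only the content to fill in.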

Handoff Checklist

Outgoing Engineer

  • Document active incidents
  • Document ongoing investigations
  • List recent changes
  • Note known issues
  • Add upcoming events
  • Sync with incoming engineer

Incoming Engineer

  • Read this document
  • Join sync call
  • Verify PagerDuty is routing to you
  • Verify Slack notifications working
  • Check VPN/access working
  • Review critical dashboards
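The "verify PagerDuty is routing to you" step can be made scriptable. The sketch below, assuming a REST API token and your PagerDuty user ID are available in the environment, checks whether a user ID appears in the JSON returned by PagerDuty's `GET /oncalls` endpoint; the `is_on_call` helper and its messages are illustrative.

```shell
# Sketch: check whether a user ID appears in a PagerDuty /oncalls response.
# is_on_call, MY_USER_ID, and the fetch step are assumptions; adapt to your setup.
is_on_call() {
  # $1: PagerDuty user ID, e.g. "PABC123"; reads the JSON response on stdin
  if grep -q "\"id\":[[:space:]]*\"$1\""; then
    echo "routing OK: $1 is on call"
  else
    echo "WARNING: $1 not found in current on-calls" >&2
    return 1
  fi
}
# Usage (not run here):
#   curl -s -H "Authorization: Token token=$PAGERDUTY_TOKEN" \
#        "https://api.pagerduty.com/oncalls" | is_on_call "$MY_USER_ID"
```

A plain `grep` keeps the check dependency-free; a stricter version could parse the JSON with `jq` if it is installed.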

### Template 2: Quick Handoff (Async)

```markdown
# Quick Handoff: @alice → @bob

## TL;DR
- No active incidents
- 1 investigation ongoing (API timeouts, see ENG-1234)
- Major release tomorrow (01/24) - be ready for issues

## Watch List
1. API latency around 02:00-03:00 UTC (backup window)
2. Auth service memory (restart if > 80%)

## Recent
- Deployed api-gateway v3.2.1 yesterday (stable)
- Increased rate limits to 1500 RPS

## Coming Up
- 01/23 02:00 - DB maintenance (5 min read-only)
- 01/24 14:00 - v5.0 release

## Questions?
I'll be available on Slack until 17:00 today.

```

### Template 3: Incident Handoff (Mid-Incident)

# INCIDENT HANDOFF: Payment Service Degradation

**Incident Start**: 2024-01-22 08:15 UTC
**Current Status**: Mitigating
**Severity**: SEV2

---

## Current State

- Error rate: 15% (down from 40%)
- Mitigation in progress: scaling up pods
- ETA to resolution: ~30 min

## What We Know

1. Root cause: Memory pressure on payment-service pods
2. Triggered by: Unusual traffic spike (3x normal)
3. Contributing: Inefficient query in checkout flow

## What We've Done

- Scaled payment-service from 5 → 15 pods
- Enabled rate limiting on checkout endpoint
- Disabled non-critical features

## What Needs to Happen

1. Monitor error rate - should reach <1% in ~15 min
2. If not improving, escalate to @payments-manager
3. Once stable, begin root cause investigation

## Key People

- Incident Commander: @alice (handing off)
- Comms Lead: @charlie
- Technical Lead: @bob (incoming)

## Communication

- Status page: Updated at 08:45
- Customer support: Notified
- Exec team: Aware

## Resources

- Incident channel: #inc-20240122-payment
- Dashboard: [Payment Service](https://grafana/d/payments)
- Runbook: [Payment Degradation](https://wiki/runbooks/payments)

---

**Incoming on-call (@bob) - Please confirm you have:**

- [ ] Joined #inc-20240122-payment
- [ ] Access to dashboards
- [ ] Understand current state
- [ ] Know escalation path

Handoff Sync Meeting

Agenda (15 minutes)

## Handoff Sync: @alice → @bob

1. **Active Issues** (5 min)
   - Walk through any ongoing incidents
   - Discuss investigation status
   - Transfer context and theories

2. **Recent Changes** (3 min)
   - Deployments to watch
   - Config changes
   - Known regressions

3. **Upcoming Events** (3 min)
   - Maintenance windows
   - Expected traffic changes
   - Releases planned

4. **Questions** (4 min)
   - Clarify anything unclear
   - Confirm access and alerting
   - Exchange contact info

On-Call Best Practices

Before Your Shift

## Pre-Shift Checklist

### Access Verification

- [ ] VPN working
- [ ] kubectl access to all clusters
- [ ] Database read access
- [ ] Log aggregator access (Splunk/Datadog)
- [ ] PagerDuty app installed and logged in

### Alerting Setup

- [ ] PagerDuty schedule shows you as primary
- [ ] Phone notifications enabled
- [ ] Slack notifications for incident channels
- [ ] Test alert received and acknowledged

### Knowledge Refresh

- [ ] Review recent incidents (past 2 weeks)
- [ ] Check service changelog
- [ ] Skim critical runbooks
- [ ] Know escalation contacts

### Environment Ready

- [ ] Laptop charged and accessible
- [ ] Phone charged
- [ ] Quiet space available for calls
- [ ] Secondary contact identified (if traveling)

During Your Shift

## Daily On-Call Routine

### Morning (start of day)

- [ ] Check overnight alerts
- [ ] Review dashboards for anomalies
- [ ] Check for any P0/P1 tickets created
- [ ] Skim incident channels for context

### Throughout Day

- [ ] Respond to alerts within SLA
- [ ] Document investigation progress
- [ ] Update team on significant issues
- [ ] Triage incoming pages

### End of Day

- [ ] Hand off any active issues
- [ ] Update investigation docs
- [ ] Note anything for next shift
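Routine dashboard checks like the "restart if > 80%" watch item in the templates can be semi-automated. Here is a sketch that flags pods above a memory threshold from `kubectl top pods` output; the `flag_high_mem` name, namespace, and threshold are assumptions, and the pipeline requires metrics-server.

```shell
# Sketch: flag pods whose memory column exceeds a MiB threshold.
# Expects "NAME CPU MEMORY" lines, as printed by: kubectl top pods --no-headers
flag_high_mem() {
  # $1: threshold in MiB; awk's $3 + 0 coerces "900Mi" to the number 900
  awk -v t="$1" '$3 + 0 > t { print "HIGH MEM:", $1, $3 }'
}
# Usage (not run here):
#   kubectl top pods -n auth --no-headers | flag_high_mem 800
```

An empty result means nothing crossed the threshold; any output is a candidate for the handoff document's watch list.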

After Your Shift

## Post-Shift Checklist

- [ ] Complete handoff document
- [ ] Sync with incoming on-call
- [ ] Verify PagerDuty routing changed
- [ ] Close/update investigation tickets
- [ ] File postmortems for any incidents
- [ ] Take time off if shift was stressful

Escalation Guidelines

When to Escalate

## Escalation Triggers

### Immediate Escalation

- SEV1 incident declared
- Data breach suspected
- Unable to diagnose within 30 min
- Customer or legal escalation received

### Consider Escalation

- Issue spans multiple teams
- Requires expertise you don't have
- Business impact exceeds threshold
- You're uncertain about next steps

### How to Escalate

1. Page the appropriate escalation path
2. Provide brief context in Slack
3. Stay engaged until the escalation target acknowledges
4. Hand off cleanly; don't just disappear

Best Practices

Do’s

  • Document everything - Future you will thank you
  • Escalate early - Better safe than sorry
  • Take breaks - Alert fatigue is real
  • Keep handoffs synchronous - Async loses context
  • Test your setup - Before incidents, not during

Don’ts

  • Don’t skip handoffs - Context loss causes incidents
  • Don’t hero - Escalate when needed
  • Don’t ignore alerts - Even if they seem minor
  • Don’t work sick - Swap shifts instead
  • Don’t disappear - Stay reachable during shift



Dependencies

No specific dependencies required

Context Window

Token usage: ~3K-8K tokens for a complete handoff document


Information

Author: wshobson
Updated: 2026-01-30
Category: productivity-tools