Responding to Alerts - SRE Guide | Nife Deploy
When an alert fires, you need to respond quickly. This guide shows you how to handle active alerts.
Getting an Alert
What You'll See
When an alert fires, you'll be notified through your configured channels:
If Email:
Subject: CRITICAL: Production API Down
Body: Your production API is not responding
Severity: Critical
Time: 2024-01-15 14:32:00 UTC
If Slack:
🚨 CRITICAL: Production API Down
Status: Firing
Resource: api.example.com
Click here to view details
If PagerDuty:
- You'll get paged immediately
- Incident automatically created
- Escalates if not acknowledged
Quick Response Workflow
1. Get Notification
↓
2. Open SRE Alerts page
↓
3. Click "Acknowledge"
↓
4. Investigate the issue
↓
5. Fix the problem
↓
6. Click "Resolve"
↓
7. Confirm everything is working
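The workflow above is essentially a small state machine. Here is a minimal Python sketch; the status names mirror the UI badges in this guide, but the transition rules are an assumption based on the workflow, not Nife Deploy's actual logic:

```python
# Alert lifecycle sketch. Status names mirror the UI badges in this guide;
# the allowed transitions are an assumption based on the workflow above.
VALID_TRANSITIONS = {
    "Firing": {"Acknowledged"},
    "Acknowledged": {"Resolved"},
    "Resolved": set(),
}

def transition(current: str, target: str) -> str:
    """Move an alert to a new status, rejecting skipped steps."""
    if target not in VALID_TRANSITIONS.get(current, set()):
        raise ValueError(f"cannot go from {current} to {target}")
    return target
```

For example, `transition("Firing", "Acknowledged")` succeeds, while jumping straight from Firing to Resolved raises an error, matching the acknowledge-then-resolve order above.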
Step 1: Access the Alert
Via Notification Link
Most notifications include a direct link:
- Click the link
- It takes you straight to the alert details
Via Dashboard
- Go to SRE → Alerts
- Look for the alert with Firing status (red badge)
- Usually at the top of the list
Find Specific Alert
Use filters to find your alert quickly:
Status: Firing (most urgent)
Severity: Critical (highest priority)
Step 2: Review Alert Details
When you open the alert, you'll see:
Alert Information:
- 🔔 Alert status (Firing/Acknowledged/Resolved)
- Alert title and description
- Resource affected
- Severity level
- When it fired (e.g., "5 minutes ago")
Example Alert:
Status: Firing 🔔
Title: High CPU Usage on Production API
Severity: Critical
Resource: prod-api-server-01
Fired: 3 minutes ago
Description: CPU usage exceeded 80% threshold
Current value: 87%
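If you pull alerts programmatically, a record like the example above can be triaged with a one-line priority check. The field names here are illustrative, not a documented Nife Deploy schema:

```python
# The example alert above as a record. Field names are illustrative,
# not a documented Nife Deploy schema.
alert = {
    "status": "Firing",
    "title": "High CPU Usage on Production API",
    "severity": "Critical",
    "resource": "prod-api-server-01",
    "threshold_pct": 80,
    "current_pct": 87,
}

def needs_immediate_action(a: dict) -> bool:
    """Critical alerts that are still firing go to the top of the queue."""
    return a["status"] == "Firing" and a["severity"] == "Critical"
```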
Step 3: Acknowledge the Alert
Why Acknowledge?
Tells your team:
- You've seen the alert
- You're investigating it
- Others don't need to respond as well
How to Acknowledge
- Click the Acknowledge button
- Confirm in the popup dialog
- Alert status changes from "Firing" to "Acknowledged"
- Your name appears as the investigator
What Happens:
Before: Status = Firing (everyone should look at it)
↓
You click Acknowledge
↓
After: Status = Acknowledged (I'm looking at it)
When to Acknowledge:
- ✅ Immediately when you start investigating
- ✅ Even if you can't fix it right away
- ✅ So the team knows you're on it
When NOT to Acknowledge:
- ❌ If you don't actually know what's happening
- ❌ If someone else should handle it
- ❌ If you can't take action
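If Nife Deploy exposes an HTTP API for alerts, acknowledging could look roughly like the sketch below. The endpoint path, HTTP method, and field names are assumptions for illustration only; check the actual API reference before using anything like this:

```python
import json
from urllib import request

def ack_payload(user: str) -> bytes:
    # JSON body recording who is investigating (hypothetical field names)
    return json.dumps({"status": "Acknowledged", "acknowledged_by": user}).encode()

def acknowledge(base_url: str, alert_id: str, user: str) -> int:
    # PATCH the alert resource; the endpoint path is an assumption
    req = request.Request(
        f"{base_url}/alerts/{alert_id}",
        data=ack_payload(user),
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )
    with request.urlopen(req) as resp:
        return resp.status
```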
Step 4: Investigate the Issue
What to Do
1. Understand the Alert
- What is it monitoring?
- What threshold triggered it?
- What's the current value?
2. Check the Resource
- Log in to the system
- View metrics/logs for that resource
- Check for errors or anomalies
3. Identify the Problem
- What's causing the issue?
- When did it start?
- Is it affecting users?
4. Document Your Findings
- Write down what you found
- Note what you're trying
- Keep the team informed in Slack if needed
Investigation Tips
If CPU is High:
- Check what process is using CPU
- Look for runaway queries or loops
- Check if traffic spike occurred
If API is Slow:
- Check database performance
- Review error logs
- Check if upstream service is down
If Memory is High:
- Look for memory leaks
- Check if cache is bloated
- Verify application version
If Service is Down:
- Check if it's running
- Look at recent deployments
- Check network connectivity
- Review error logs
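The tips above condense into a symptom-keyed checklist you could keep in a runbook script. The symptom keys and wording here are just this guide's categories, not anything built into the platform:

```python
# Symptom-keyed triage checklist condensed from the tips above.
TRIAGE = {
    "high_cpu": ["find the top CPU process", "look for runaway queries or loops",
                 "check for a traffic spike"],
    "slow_api": ["check database performance", "review error logs",
                 "check whether an upstream service is down"],
    "high_memory": ["look for memory leaks", "check cache size",
                    "verify the application version"],
    "service_down": ["check the process is running", "review recent deployments",
                     "check network connectivity", "review error logs"],
}

def checklist(symptom: str) -> list[str]:
    """Fall back to a generic first step for unrecognized symptoms."""
    return TRIAGE.get(symptom, ["review metrics and logs"])
```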
Step 5: Fix the Problem
Take Action
Based on your investigation, take appropriate action:
Common Fixes:
- Restart the service
- Scale up the application
- Clear cache
- Kill runaway process
- Deploy a fix
- Adjust configuration
- Route traffic elsewhere
Verify the Fix
After taking action, verify it worked:
- Check the metric that triggered the alert
- Confirm it's back to normal
- Test the functionality
- Ask users to confirm it's working
Before Resolving: Make absolutely sure the issue is fixed. Resolving prematurely usually means the alert fires again five minutes later.
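The verify-before-resolving rule can be automated as a small polling guard: only report success once the metric stays healthy for several consecutive reads. `read_metric` is any callable you supply; the defaults are placeholders:

```python
import time

def wait_until_recovered(read_metric, threshold: float, checks: int = 3,
                         attempts: int = 30, interval_s: float = 1.0) -> bool:
    """Return True only once the metric stays below `threshold` for
    `checks` consecutive reads; one bad reading resets the streak."""
    streak = 0
    for _ in range(attempts):
        streak = streak + 1 if read_metric() < threshold else 0
        if streak >= checks:
            return True
        time.sleep(interval_s)
    return False
```

For the 80% CPU example above, something like `wait_until_recovered(get_cpu_pct, 80.0)` would hold off resolution until CPU has been under the threshold for three reads in a row, where `get_cpu_pct` is whatever function reads your metric.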
Step 6: Resolve the Alert
How to Resolve
- Click the Resolve button
- Confirm in the popup dialog
- Alert status changes to "Resolved" (✓)
- Your name appears as who resolved it
What This Means:
Status: Resolved ✓ = Issue is fixed, no more action needed
When to Resolve:
- ✅ After you've fixed the underlying issue
- ✅ After you've verified the fix works
- ✅ When the metric is back to normal
Don't Resolve If:
- ❌ The issue isn't completely fixed
- ❌ You haven't verified the fix works
- ❌ The metric hasn't returned to an acceptable level
Step 7: Verify and Document
Final Verification
Check that:
- The alert status changed to "Resolved"
- The metric is back to normal
- No related alerts are firing
- Users aren't reporting issues
- Team is aware it's resolved
Document for the Team
Post an update in Slack:
✅ RESOLVED: Production API High CPU
Issue: Memory leak in v2.1.0
Fix: Rolled back to v2.0.5
Status: All systems normal, no user impact
ETA for permanent fix: Tuesday
Alert Statuses Reference
Firing Status 🔔 (Red)
What it means: Alert condition is currently true. Action is needed.
What you should do:
- Acknowledge it
- Investigate
- Fix it
- Resolve it
Duration: Until you resolve it