Responding to Alerts - SRE Guide | Nife Deploy
When an alert fires, you need to respond quickly. This guide shows you how to handle active alerts.
Getting an Alert
What You'll See
When an alert fires, you'll be notified through your configured channels:
If Email:
Subject: CRITICAL: Production API Down
Body: Your production API is not responding
Severity: Critical
Time: 2024-01-15 14:32:00 UTC
If Slack:
🚨 CRITICAL: Production API Down
Status: Firing
Resource: api.example.com
Click here to view details
If PagerDuty:
- You'll get paged immediately
- Incident automatically created
- Escalates if not acknowledged
Quick Response Workflow
1. Get Notification
↓
2. Open SRE Alerts page
↓
3. Click "Acknowledge"
↓
4. Investigate the issue
↓
5. Fix the problem
↓
6. Click "Resolve"
↓
7. Confirm everything is working
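The workflow above is essentially a small state machine. Here is a minimal Python sketch; the status names mirror the UI badges in this guide, but the transition rules are an assumption based on the workflow, not Nife Deploy's actual logic:

```python
# Alert lifecycle sketch. Status names mirror the UI badges in this guide;
# the allowed transitions are an assumption based on the workflow above.
VALID_TRANSITIONS = {
    "Firing": {"Acknowledged"},
    "Acknowledged": {"Resolved"},
    "Resolved": set(),
}

def transition(current: str, target: str) -> str:
    """Move an alert to a new status, rejecting skipped steps."""
    if target not in VALID_TRANSITIONS.get(current, set()):
        raise ValueError(f"cannot go from {current} to {target}")
    return target
```

For example, `transition("Firing", "Acknowledged")` succeeds, while jumping straight from Firing to Resolved raises an error, matching the acknowledge-then-resolve order above.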
Step 1: Access the Alert
Via Notification Link
Most notifications include a direct link:
- Click the link
- It takes you straight to the alert details
Via Dashboard
- Go to SRE → Alerts
- Look for the alert with Firing status (red badge)
- Usually at the top of the list
Find Specific Alert
Use filters to find your alert quickly:
Status: Firing (most urgent)
Severity: Critical (highest priority)
Step 2: Review Alert Details
When you open the alert, you'll see:
Alert Information:
- 🔔 Alert status (Firing/Acknowledged/Resolved)
- Alert title and description
- Resource affected
- Severity level
- When it fired (e.g., "5 minutes ago")
Example Alert:
Status: Firing 🔔
Title: High CPU Usage on Production API
Severity: Critical
Resource: prod-api-server-01
Fired: 3 minutes ago
Description: CPU usage exceeded 80% threshold
Current value: 87%
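If you pull alerts programmatically, a record like the example above can be triaged with a one-line priority check. The field names here are illustrative, not a documented Nife Deploy schema:

```python
# The example alert above as a record. Field names are illustrative,
# not a documented Nife Deploy schema.
alert = {
    "status": "Firing",
    "title": "High CPU Usage on Production API",
    "severity": "Critical",
    "resource": "prod-api-server-01",
    "threshold_pct": 80,
    "current_pct": 87,
}

def needs_immediate_action(a: dict) -> bool:
    """Critical alerts that are still firing go to the top of the queue."""
    return a["status"] == "Firing" and a["severity"] == "Critical"
```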
Step 3: Acknowledge the Alert
Why Acknowledge?
Tells your team:
- You've seen the alert
- You're investigating it
- Others don't need to respond as well
How to Acknowledge
- Click the Acknowledge button
- Confirm in the popup dialog
- Alert status changes from "Firing" to "Acknowledged"
- Your name appears as the investigator
What Happens:
Before: Status = Firing (everyone should look at it)
↓
You click Acknowledge
↓
After: Status = Acknowledged (I'm looking at it)
When to Acknowledge:
- ✅ Immediately when you start investigating
- ✅ Even if you can't fix it right away
- ✅ So the team knows you're on it
When NOT to Acknowledge:
- ❌ If you don't actually know what's happening
- ❌ If someone else should handle it
- ❌ If you can't take action
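If Nife Deploy exposes an HTTP API for alerts, acknowledging could look roughly like the sketch below. The endpoint path, HTTP method, and field names are assumptions for illustration only; check the actual API reference before using anything like this:

```python
import json
from urllib import request

def ack_payload(user: str) -> bytes:
    # JSON body recording who is investigating (hypothetical field names)
    return json.dumps({"status": "Acknowledged", "acknowledged_by": user}).encode()

def acknowledge(base_url: str, alert_id: str, user: str) -> int:
    # PATCH the alert resource; the endpoint path is an assumption
    req = request.Request(
        f"{base_url}/alerts/{alert_id}",
        data=ack_payload(user),
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )
    with request.urlopen(req) as resp:
        return resp.status
```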
Step 4: Investigate the Issue
What to Do
1. Understand the Alert
- What is it monitoring?
- What threshold triggered it?
- What's the current value?
2. Check the Resource
- Log in to the system
- View metrics/logs for that resource
- Check for errors or anomalies
3. Identify the Problem
- What's causing the issue?
- When did it start?
- Is it affecting users?
4. Document Your Findings
- Write down what you found
- Note what you're trying
- Keep the team informed in Slack if needed
Investigation Tips
If CPU is High:
- Check what process is using CPU
- Look for runaway queries or loops
- Check if traffic spike occurred
If API is Slow:
- Check database performance
- Review error logs
- Check if upstream service is down
If Memory is High:
- Look for memory leaks
- Check if cache is bloated
- Verify application version
If Service is Down:
- Check if it's running
- Look at recent deployments
- Check network connectivity
- Review error logs
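The tips above condense into a symptom-keyed checklist you could keep in a runbook script. The symptom keys and wording here are just this guide's categories, not anything built into the platform:

```python
# Symptom-keyed triage checklist condensed from the tips above.
TRIAGE = {
    "high_cpu": ["find the top CPU process", "look for runaway queries or loops",
                 "check for a traffic spike"],
    "slow_api": ["check database performance", "review error logs",
                 "check whether an upstream service is down"],
    "high_memory": ["look for memory leaks", "check cache size",
                    "verify the application version"],
    "service_down": ["check the process is running", "review recent deployments",
                     "check network connectivity", "review error logs"],
}

def checklist(symptom: str) -> list[str]:
    """Fall back to a generic first step for unrecognized symptoms."""
    return TRIAGE.get(symptom, ["review metrics and logs"])
```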
Step 5: Fix the Problem
Take Action
Based on your investigation, take appropriate action:
Common Fixes:
- Restart the service
- Scale up the application
- Clear cache
- Kill runaway process
- Deploy a fix
- Adjust configuration
- Route traffic elsewhere
Verify the Fix
After taking action, verify it worked:
- Check the metric that triggered the alert
- Confirm it's back to normal
- Test the functionality
- Ask users to confirm it's working
Before Resolving: Make absolutely sure the issue is fixed. Resolving prematurely usually means the alert fires again five minutes later.
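The verify-before-resolving rule can be automated as a small polling guard: only report success once the metric stays healthy for several consecutive reads. `read_metric` is any callable you supply; the defaults are placeholders:

```python
import time

def wait_until_recovered(read_metric, threshold: float, checks: int = 3,
                         attempts: int = 30, interval_s: float = 1.0) -> bool:
    """Return True only once the metric stays below `threshold` for
    `checks` consecutive reads; one bad reading resets the streak."""
    streak = 0
    for _ in range(attempts):
        streak = streak + 1 if read_metric() < threshold else 0
        if streak >= checks:
            return True
        time.sleep(interval_s)
    return False
```

For the 80% CPU example above, something like `wait_until_recovered(get_cpu_pct, 80.0)` would hold off resolution until CPU has been under the threshold for three reads in a row, where `get_cpu_pct` is whatever function reads your metric.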
Step 6: Resolve the Alert
How to Resolve
- Click the Resolve button
- Confirm in the popup dialog
- Alert status changes to "Resolved" (✓)
- Your name appears as who resolved it
What This Means:
Status: Resolved ✓ = Issue is fixed, no more action needed
When to Resolve:
- ✅ After you've fixed the underlying issue
- ✅ After you've verified the fix works
- ✅ When the metric is back to normal
Don't Resolve If:
- ❌ The issue isn't completely fixed
- ❌ You haven't verified the fix works
- ❌ The metric hasn't returned to an acceptable level
Step 7: Verify and Document
Final Verification
Check that:
- The alert status changed to "Resolved"
- The metric is back to normal
- No related alerts are firing
- Users aren't reporting issues
- Team is aware it's resolved
Document for the Team
Post an update in Slack:
✅ RESOLVED: Production API High CPU
Issue: Memory leak in v2.1.0
Fix: Rolled back to v2.0.5
Status: All systems normal, no user impact
ETA for permanent fix: Tuesday
Alert Statuses Reference
Firing Status 🔔 (Red)
What it means: Alert condition is currently true. Action is needed.
What you should do:
- Acknowledge it
- Investigate
- Fix it
- Resolve it
Duration: Until you resolve it