Incident Lifecycle

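Every incident follows the same core path: Triggered → Acknowledged → Resolved. Escalated and Snoozed are side branches for incidents that go unacknowledged or are deferred.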


States

Triggered

The incident is new. Notifications are being sent according to the escalation policy. The clock is ticking.

Acknowledged

Someone has taken ownership. Escalation stops. The team knows who's handling it.

Resolved

The issue is fixed. The incident is closed. Metrics are recorded.

Triggered ──→ Acknowledged ──→ Resolved
    │              │
    │              ├──→ Snoozed (auto-retriggers)
    │              │
    └──→ Escalated (no ack within timeout)
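One way to read the diagram is as a transition function. The following is a minimal sketch in POSIX sh, not NotifyHero's implementation: the state names come from the diagram, the event names (`acknowledge`, `ack_timeout`, `snooze`, `snooze_expired`, `resolve`) are hypothetical, and the Escalated → Acknowledged transition is an assumption the diagram does not show.

```shell
# Sketch of the lifecycle as a transition function (POSIX sh).
# State names follow the diagram; event names are hypothetical.
next_state() {
  # $1 = current state, $2 = event
  case "$1:$2" in
    triggered:acknowledge)   echo acknowledged ;;
    triggered:ack_timeout)   echo escalated ;;     # no ack within timeout
    escalated:acknowledge)   echo acknowledged ;;  # assumption: not in the diagram
    acknowledged:resolve)    echo resolved ;;
    acknowledged:snooze)     echo snoozed ;;
    snoozed:snooze_expired)  echo triggered ;;     # "auto-retriggers"
    *) echo "invalid transition: $1 + $2" >&2; return 1 ;;
  esac
}

# Walk the core path: Triggered -> Acknowledged -> Resolved.
state=triggered
state=$(next_state "$state" acknowledge)
state=$(next_state "$state" resolve)
echo "$state"   # prints "resolved"
```

Rejecting unknown pairs with a non-zero exit status keeps an impossible jump, such as Resolved back to Acknowledged, from passing silently.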

Triggering Incidents

Incidents are triggered by:

  • Inbound events — from monitoring tool webhooks
  • Manual creation — from the Dashboard or API
  • AI grouping — multiple alerts grouped into one incident

Each incident gets:

  • Unique incident number (e.g., #INC-1042)
  • Severity level (critical, error, warning, info)
  • Assigned service and escalation policy
  • Full timeline from trigger to resolution

Acknowledging

Acknowledge from any surface:

  • Dashboard — click Acknowledge
  • Mobile app — swipe to ack
  • Slack/Teams — click the Ack button on the notification
  • Phone call — press 1
  • API — PUT /v1/incidents/{id}/acknowledge

Acknowledging stops escalation and tells the team: "I'm on it."
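As an illustration of the API route, here is a hedged sketch of a small shell wrapper. It assumes the acknowledge endpoint lives on the same host and uses the same Bearer authentication as the resolve example on this page. `NH_TOKEN` and `DRY_RUN` are hypothetical names introduced only for this sketch, and whether the endpoint accepts a request body is not specified here.

```shell
# Sketch: acknowledge an incident over the API.
# Assumes the host and auth scheme of the resolve example;
# NH_TOKEN and DRY_RUN are hypothetical variables for this illustration.
ack_url() {
  printf 'https://api.notifyhero.com/v1/incidents/%s/acknowledge\n' "$1"
}

ack_incident() {
  # $1 = incident ID, e.g. INC-1042
  url=$(ack_url "$1")
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the request instead of sending it.
    printf 'PUT %s\n' "$url"
  else
    curl -X PUT "$url" -H "Authorization: Bearer ${NH_TOKEN}"
  fi
}
```

Calling `ack_incident INC-1042` with `DRY_RUN=1` set prints the request line, which is a cheap way to check the URL before wiring the call into a script.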

Auto-acknowledge: Configure services to auto-ack when someone joins the war room or posts in the incident Slack channel.


Resolving

Resolve when the issue is fixed:

  • Manual — from Dashboard, mobile, Slack, or API
  • Auto-resolve — when your monitoring tool sends a recovery event
  • Timeout — auto-resolve after N hours if no further events (configurable per service)

# Resolve via API
curl -X PUT https://api.notifyhero.com/v1/incidents/INC-1042/resolve \
  -H "Authorization: Bearer nh_live_abc123" \
  -H "Content-Type: application/json" \
  -d '{"resolution_note": "Scaled up API pods to handle traffic spike"}'

Priority Levels

Assign priority to control response urgency:

| Priority | Label | Expected Response |
|----------|-------|-------------------|
| P1 | Critical | Immediate — all hands on deck |
| P2 | High | Within 15 minutes |
| P3 | Medium | Within 1 hour |
| P4 | Low | Next business day |
| P5 | Informational | No response required |

Priority can be set manually or auto-assigned based on severity and service tier.
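The exact auto-assignment rules are not spelled out here, so the following is purely a hypothetical sketch of what a rule combining severity and service tier could look like. The mapping, the numeric tier argument, and the `auto_priority` name are all assumptions made for illustration.

```shell
# Hypothetical auto-assignment rule (not documented behavior):
# severity picks a base priority, and a tier-1 service is bumped up one level.
auto_priority() {
  # $1 = severity (critical|error|warning|info)
  # $2 = service tier, where 1 is assumed to be the most important
  case "$1" in
    critical) base=2 ;;
    error)    base=3 ;;
    warning)  base=4 ;;
    info)     base=5 ;;
    *) echo "unknown severity: $1" >&2; return 1 ;;
  esac
  # Informational incidents stay at P5; everything else is bumped on tier 1.
  if [ "$2" = "1" ] && [ "$base" -lt 5 ]; then
    base=$((base - 1))
  fi
  echo "P$base"
}

auto_priority critical 1   # prints "P1"
auto_priority error 2      # prints "P3"
```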


Merging Incidents

When multiple incidents are related (e.g., same root cause), merge them:

  1. Select incidents from the Dashboard
  2. Click Merge
  3. Choose the primary incident
  4. All alerts, timeline entries, and responders consolidate

Merged incidents share a single timeline and resolution.


Snoozing

Temporarily dismiss an incident that can't be fixed right now:

  • Snooze for 30 min / 1 hour / 4 hours / custom
  • The incident moves to "Snoozed" state
  • After the snooze expires, it re-triggers and restarts escalation

Use snooze for known issues with a scheduled fix window. Don't use it to avoid alerts — that's what suppression rules are for.
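The snooze timing reduces to simple arithmetic: the re-trigger time is the snooze start plus the chosen duration. A minimal sketch, assuming times are tracked as epoch seconds; the `30m`/`1h`/`4h` labels and both function names are hypothetical shorthand for the options listed above.

```shell
# Sketch: translate a snooze choice into seconds and compute the re-trigger time.
# Labels and function names are hypothetical.
snooze_seconds() {
  case "$1" in
    30m) echo 1800 ;;
    1h)  echo 3600 ;;
    4h)  echo 14400 ;;
    *m)  echo $(( ${1%m} * 60 )) ;;   # custom duration in minutes, e.g. 45m
    *) echo "unsupported snooze: $1" >&2; return 1 ;;
  esac
}

# Epoch time at which the incident re-triggers and escalation restarts.
retrigger_at() {
  # $1 = epoch seconds when the snooze started, $2 = snooze choice
  secs=$(snooze_seconds "$2") || return 1
  echo $(( $1 + secs ))
}

retrigger_at 1000 30m   # prints "2800"
```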


Reassignment

Transfer ownership of an incident:

  1. Open the incident
  2. Click Reassign
  3. Select a person or team
  4. They receive an immediate notification

The original responder is released. Full reassignment history is logged in the timeline.


Timeline

Every incident has a complete, immutable timeline:

09:00:00  Triggered by Datadog webhook
09:00:01  Notification sent to Alice (push)
09:00:02  Notification sent to Alice (Slack)
09:01:01  Notification sent to Alice (phone)
09:01:45  Acknowledged by Alice
09:02:00  War room created: #inc-1042
09:15:00  Note added: "Identified — bad deploy at 08:55"
09:22:00  Resolved by Alice
09:22:01  Resolution note: "Rolled back deploy v2.4.1"

Every action, every notification, every escalation — logged with timestamps.
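Because each entry carries a timestamp, response metrics can be derived directly from a timeline. The sketch below assumes the plain `HH:MM:SS  event` layout of the example above, with all entries on the same day. It is an illustration of the arithmetic, not an export format the product is known to provide, and `timeline_metrics` is a hypothetical name.

```shell
# Sketch: derive time-to-acknowledge and time-to-resolve from a timeline.
# Assumes "HH:MM:SS  event" lines, all within one day.
timeline_metrics() {
  awk '
    # Convert HH:MM:SS to seconds since midnight.
    function secs(t,  p) { split(t, p, ":"); return p[1] * 3600 + p[2] * 60 + p[3] }
    $2 == "Triggered"    { start = secs($1) }
    $2 == "Acknowledged" { print "time_to_ack_s=" (secs($1) - start) }
    $2 == "Resolved"     { print "time_to_resolve_s=" (secs($1) - start) }
  '
}

timeline_metrics <<'EOF'
09:00:00  Triggered by Datadog webhook
09:01:45  Acknowledged by Alice
09:22:00  Resolved by Alice
EOF
# prints:
# time_to_ack_s=105
# time_to_resolve_s=1320
```

For the example incident this gives 105 seconds to acknowledge and 1,320 seconds (22 minutes) to resolve.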