Core Concepts

Understand the building blocks of NotifyHero's incident management platform.


Events, Alerts, and Incidents

These three levels form the core data model:

Events

Raw signals from your monitoring tools. A single Datadog monitor firing sends one event to NotifyHero via webhook.

{
  "event_type": "trigger",
  "severity": "critical",
  "title": "Disk usage > 90%",
  "source": "db-primary-01",
  "dedup_key": "disk-db-primary-01"
}
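The payload above can be assembled and validated before posting it to the webhook. A minimal sketch in Python — the field names mirror the event shown here, but the validation rules and `make_event` helper are illustrative, not NotifyHero's actual schema:

```python
import json

# Severities from the table in this page; used as an illustrative check,
# not NotifyHero's real validation logic.
VALID_SEVERITIES = {"critical", "error", "warning", "info"}

def make_event(event_type, severity, title, source, dedup_key):
    """Assemble an event payload, rejecting unknown severities."""
    if severity not in VALID_SEVERITIES:
        raise ValueError(f"unknown severity: {severity}")
    return {
        "event_type": event_type,
        "severity": severity,
        "title": title,
        "source": source,
        "dedup_key": dedup_key,
    }

event = make_event("trigger", "critical", "Disk usage > 90%",
                   "db-primary-01", "disk-db-primary-01")
payload = json.dumps(event)  # the JSON body you would POST to the webhook
```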

Alerts

Events are deduplicated and enriched into alerts. Multiple events with the same dedup_key collapse into a single alert. NotifyHero adds context: runbooks, past incidents, service ownership.
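Deduplication can be sketched as a fold over incoming events: the first event with a given `dedup_key` opens an alert, and later events with the same key increment it. A simplified illustration (NotifyHero's enrichment step is omitted):

```python
def deduplicate(events):
    """Collapse events sharing a dedup_key into one alert per key.

    Illustrative sketch: each alert keeps the first event's title and
    counts how many raw events it absorbed.
    """
    alerts = {}
    for ev in events:
        key = ev["dedup_key"]
        if key not in alerts:
            alerts[key] = {"dedup_key": key, "title": ev["title"], "count": 0}
        alerts[key]["count"] += 1
    return list(alerts.values())

events = [
    {"dedup_key": "disk-db-primary-01", "title": "Disk usage > 90%"},
    {"dedup_key": "disk-db-primary-01", "title": "Disk usage > 90%"},
    {"dedup_key": "cpu-api-02", "title": "CPU at 95%"},
]
alerts = deduplicate(events)  # two alerts: the disk alert absorbs both events
```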

Incidents

Related alerts are grouped into incidents — the unit your team actually responds to. NotifyHero's AI groups alerts automatically (e.g., 47 server alerts from one bad deploy become 1 incident).
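NotifyHero's grouping is AI-driven; to make the Alert → Incident step concrete, here is a deliberately naive stand-in that groups alerts on the same service within a time window:

```python
def group_into_incidents(alerts, window_seconds=300):
    """Naive stand-in for AI grouping: alerts on the same service within
    a rolling time window join the same incident. Field names and the
    windowing rule are illustrative only."""
    incidents = []
    for alert in sorted(alerts, key=lambda a: a["timestamp"]):
        for inc in incidents:
            if (inc["service"] == alert["service"]
                    and alert["timestamp"] - inc["last_seen"] <= window_seconds):
                inc["alerts"].append(alert)
                inc["last_seen"] = alert["timestamp"]
                break
        else:
            incidents.append({"service": alert["service"],
                              "alerts": [alert],
                              "last_seen": alert["timestamp"]})
    return incidents

sample = [
    {"service": "api-gateway", "timestamp": 0},
    {"service": "api-gateway", "timestamp": 60},
    {"service": "checkout-service", "timestamp": 90},
    {"service": "api-gateway", "timestamp": 120},
]
incidents = group_into_incidents(sample)  # 3 gateway alerts fold into 1 incident
```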

The flow: Event → Alert → Incident → Notification → Response → Resolution


Severity Levels

Every event has a severity that determines routing and urgency:

| Severity | Behavior | Example |
|----------|----------|---------|
| Critical | Immediate notification via high-urgency channels (phone, SMS) | Database down |
| Error | High-urgency notification | API error rate > 5% |
| Warning | Low-urgency notification (email, Slack) | CPU at 80% |
| Info | Logged, no notification by default | Deployment completed |

You can customize severity-to-urgency mapping per service.
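One way to picture the default mapping plus a per-service override — the lookup structure and the `checkout-service` override are hypothetical, not NotifyHero's configuration format:

```python
# Default severity -> urgency mapping from the table above.
DEFAULT_URGENCY = {
    "critical": "high",
    "error": "high",
    "warning": "low",
    "info": None,  # logged, no notification by default
}

# Hypothetical per-service override: treat warnings on the checkout
# service as high urgency.
SERVICE_OVERRIDES = {
    "checkout-service": {"warning": "high"},
}

def urgency_for(service, severity):
    """Resolve urgency for a severity, with per-service overrides winning."""
    overrides = SERVICE_OVERRIDES.get(service, {})
    return overrides.get(severity, DEFAULT_URGENCY[severity])
```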


Services

A service represents a system, application, or component your team owns. Every alert routes through a service.

Examples:

  • checkout-service — owned by the Payments team
  • api-gateway — owned by the Platform team
  • cdn-edge — owned by the Infrastructure team

Each service has:

  • An escalation policy (who gets notified and when)
  • Integrations (which monitoring tools send events)
  • Runbooks (how to respond)
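The three pieces above can be pictured as a single service record. The field names and values here are an illustrative shape, not NotifyHero's real API:

```python
# Hypothetical service definition tying together the pieces listed above.
checkout_service = {
    "name": "checkout-service",
    "team": "Payments",
    "escalation_policy": "payments-escalation",   # who gets notified and when
    "integrations": ["datadog", "sentry"],        # which tools send events
    "runbooks": ["https://wiki.example.com/runbooks/checkout"],  # how to respond
}
```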

Escalation Policies

Escalation policies define who gets notified, when, and what happens if they don't respond.

Level 1: On-call engineer (immediate)
Level 2: Team lead (after 5 minutes)
Level 3: Engineering manager (after 15 minutes)
Level 4: VP Engineering (after 30 minutes)

If Level 1 acknowledges the incident, escalation stops. If not, it moves up. See Escalation Policies for full details.
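The timing logic above can be sketched as a function of minutes elapsed and whether anyone has acknowledged. The role names and delays come from the example policy; the function itself is an illustration, not NotifyHero's implementation:

```python
# (role, minutes after trigger at which that level is notified)
POLICY = [
    ("on-call engineer", 0),
    ("team lead", 5),
    ("engineering manager", 15),
    ("VP Engineering", 30),
]

def notified_levels(minutes_elapsed, acknowledged_at=None):
    """Return the roles notified so far.

    Escalation stops at acknowledgement: levels whose delay falls after
    the acknowledgement time are never paged.
    """
    cutoff = minutes_elapsed
    if acknowledged_at is not None:
        cutoff = min(minutes_elapsed, acknowledged_at)
    return [role for role, delay in POLICY if delay <= cutoff]
```

For example, 20 minutes in with no acknowledgement, three levels have been paged; if the on-call engineer acknowledged at minute 3, escalation stopped at Level 1.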


Teams

Teams group people and services together. A team owns services and has its own schedules and escalation policies.

Platform Team
├── Members: Alice, Bob, Carol
├── Services: api-gateway, auth-service, rate-limiter
├── Schedule: Platform On-Call (weekly rotation)
└── Escalation: Platform Escalation Policy

Teams can be nested. A "Backend" team can contain "Platform" and "Payments" sub-teams.


On-Call Schedules

Schedules define who is on call and when. They support:

  • Rotations — weekly, daily, or custom intervals
  • Layers — primary, secondary, and override layers
  • Handoff times — when shifts change
  • Restrictions — weekday-only or weekend-only schedules

The person on call when an incident triggers is notified first. See Schedules.
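A single-layer weekly rotation can be sketched as simple date arithmetic. The member names, start date, and handoff rule are illustrative; real schedules layer secondaries, overrides, and restrictions on top of this:

```python
from datetime import date

# Hypothetical roster taking week-long shifts in order, handing off at
# the start of each week.
MEMBERS = ["Alice", "Bob", "Carol"]
ROTATION_START = date(2024, 1, 1)  # a Monday

def on_call(day):
    """Return who is on call on a given date under a weekly rotation."""
    weeks_elapsed = (day - ROTATION_START).days // 7
    return MEMBERS[weeks_elapsed % len(MEMBERS)]
```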


How It All Fits Together

Monitoring Tool → Event → Service → Alert → Incident
                                      ↓
                              Escalation Policy
                                      ↓
                              On-Call Schedule
                                      ↓
                              Notification (phone/SMS/Slack/etc.)
                                      ↓
                              Acknowledge → Investigate → Resolve
                                      ↓
                              Postmortem → Action Items

Every step is logged. Every action has a timestamp. Full audit trail from trigger to resolution.