Core Concepts

Understand the building blocks of NotifyHero's incident management platform.


Events, Alerts, and Incidents

These three levels form the core data model:

Events

Raw signals from your monitoring tools. A single Datadog monitor firing sends one event to NotifyHero via webhook.

{
  "event_type": "trigger",
  "severity": "critical",
  "title": "Disk usage > 90%",
  "source": "db-primary-01",
  "dedup_key": "disk-db-primary-01"
}
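The payload above can be assembled and validated before posting it to the webhook. A minimal sketch in Python — the field names mirror the event shown here, but the validation rules and `make_event` helper are illustrative, not NotifyHero's actual schema:

```python
import json

# Severities from the table in this page; used as an illustrative check,
# not NotifyHero's real validation logic.
VALID_SEVERITIES = {"critical", "error", "warning", "info"}

def make_event(event_type, severity, title, source, dedup_key):
    """Assemble an event payload, rejecting unknown severities."""
    if severity not in VALID_SEVERITIES:
        raise ValueError(f"unknown severity: {severity}")
    return {
        "event_type": event_type,
        "severity": severity,
        "title": title,
        "source": source,
        "dedup_key": dedup_key,
    }

event = make_event("trigger", "critical", "Disk usage > 90%",
                   "db-primary-01", "disk-db-primary-01")
payload = json.dumps(event)  # the JSON body you would POST to the webhook
```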

Alerts

Events are deduplicated and enriched into alerts. Multiple events with the same dedup_key collapse into a single alert. NotifyHero adds context: runbooks, past incidents, service ownership.
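Deduplication can be sketched as a fold over incoming events: the first event with a given `dedup_key` opens an alert, and later events with the same key increment it. A simplified illustration (NotifyHero's enrichment step is omitted):

```python
def deduplicate(events):
    """Collapse events sharing a dedup_key into one alert per key.

    Illustrative sketch: each alert keeps the first event's title and
    counts how many raw events it absorbed.
    """
    alerts = {}
    for ev in events:
        key = ev["dedup_key"]
        if key not in alerts:
            alerts[key] = {"dedup_key": key, "title": ev["title"], "count": 0}
        alerts[key]["count"] += 1
    return list(alerts.values())

events = [
    {"dedup_key": "disk-db-primary-01", "title": "Disk usage > 90%"},
    {"dedup_key": "disk-db-primary-01", "title": "Disk usage > 90%"},
    {"dedup_key": "cpu-api-02", "title": "CPU at 95%"},
]
alerts = deduplicate(events)  # two alerts: the disk alert absorbs both events
```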

Incidents

Related alerts are grouped into incidents — the unit your team actually responds to. NotifyHero's AI groups alerts automatically (e.g., 47 server alerts from one bad deploy become 1 incident).
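NotifyHero's grouping is AI-driven; to make the Alert → Incident step concrete, here is a deliberately naive stand-in that groups alerts on the same service within a time window:

```python
def group_into_incidents(alerts, window_seconds=300):
    """Naive stand-in for AI grouping: alerts on the same service within
    a rolling time window join the same incident. Field names and the
    windowing rule are illustrative only."""
    incidents = []
    for alert in sorted(alerts, key=lambda a: a["timestamp"]):
        for inc in incidents:
            if (inc["service"] == alert["service"]
                    and alert["timestamp"] - inc["last_seen"] <= window_seconds):
                inc["alerts"].append(alert)
                inc["last_seen"] = alert["timestamp"]
                break
        else:
            incidents.append({"service": alert["service"],
                              "alerts": [alert],
                              "last_seen": alert["timestamp"]})
    return incidents

sample = [
    {"service": "api-gateway", "timestamp": 0},
    {"service": "api-gateway", "timestamp": 60},
    {"service": "checkout-service", "timestamp": 90},
    {"service": "api-gateway", "timestamp": 120},
]
incidents = group_into_incidents(sample)  # 3 gateway alerts fold into 1 incident
```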

The flow: Event → Alert → Incident → Notification → Response → Resolution


Severity Levels

Every event has a severity that determines routing and urgency:

| Severity | Behavior | Example |
|----------|----------|---------|
| Critical | Immediate notification via high-urgency channels (phone, SMS) | Database down |
| Error | High-urgency notification | API error rate > 5% |
| Warning | Low-urgency notification (email, Slack) | CPU at 80% |
| Info | Logged, no notification by default | Deployment completed |

You can customize severity-to-urgency mapping per service.
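One way to picture the default mapping plus a per-service override — the lookup structure and the `checkout-service` override are hypothetical, not NotifyHero's configuration format:

```python
# Default severity -> urgency mapping from the table above.
DEFAULT_URGENCY = {
    "critical": "high",
    "error": "high",
    "warning": "low",
    "info": None,  # logged, no notification by default
}

# Hypothetical per-service override: treat warnings on the checkout
# service as high urgency.
SERVICE_OVERRIDES = {
    "checkout-service": {"warning": "high"},
}

def urgency_for(service, severity):
    """Resolve urgency for a severity, with per-service overrides winning."""
    overrides = SERVICE_OVERRIDES.get(service, {})
    return overrides.get(severity, DEFAULT_URGENCY[severity])
```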


Services

A service represents a system, application, or component your team owns. Every alert routes through a service.

Examples:

  • checkout-service — owned by the Payments team
  • api-gateway — owned by the Platform team
  • cdn-edge — owned by the Infrastructure team

Each service has:

  • An escalation policy (who gets notified and when)
  • Integrations (which monitoring tools send events)
  • Runbooks (how to respond)
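The three pieces above can be pictured as a single service record. The field names and values here are an illustrative shape, not NotifyHero's real API:

```python
# Hypothetical service definition tying together the pieces listed above.
checkout_service = {
    "name": "checkout-service",
    "team": "Payments",
    "escalation_policy": "payments-escalation",   # who gets notified and when
    "integrations": ["datadog", "sentry"],        # which tools send events
    "runbooks": ["https://wiki.example.com/runbooks/checkout"],  # how to respond
}
```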

Escalation Policies

Escalation policies define who gets notified, when, and what happens if they don't respond.

Level 1: On-call engineer (immediate)
Level 2: Team lead (after 5 minutes)
Level 3: Engineering manager (after 15 minutes)
Level 4: VP Engineering (after 30 minutes)

If Level 1 acknowledges the incident, escalation stops. If not, it moves up. See Escalation Policies for full details.
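The timing logic above can be sketched as a function of minutes elapsed and whether anyone has acknowledged. The role names and delays come from the example policy; the function itself is an illustration, not NotifyHero's implementation:

```python
# (role, minutes after trigger at which that level is notified)
POLICY = [
    ("on-call engineer", 0),
    ("team lead", 5),
    ("engineering manager", 15),
    ("VP Engineering", 30),
]

def notified_levels(minutes_elapsed, acknowledged_at=None):
    """Return the roles notified so far.

    Escalation stops at acknowledgement: levels whose delay falls after
    the acknowledgement time are never paged.
    """
    cutoff = minutes_elapsed
    if acknowledged_at is not None:
        cutoff = min(minutes_elapsed, acknowledged_at)
    return [role for role, delay in POLICY if delay <= cutoff]
```

For example, 20 minutes in with no acknowledgement, three levels have been paged; if the on-call engineer acknowledged at minute 3, escalation stopped at Level 1.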


Teams

Teams group people and services together. A team owns services and has its own schedules and escalation policies.

Platform Team
├── Members: Alice, Bob, Carol
├── Services: api-gateway, auth-service, rate-limiter
├── Schedule: Platform On-Call (weekly rotation)
└── Escalation: Platform Escalation Policy

Teams can be nested. A "Backend" team can contain "Platform" and "Payments" sub-teams.


On-Call Schedules

Schedules define who is on call and when. They support:

  • Rotations — weekly, daily, or custom intervals
  • Layers — primary, secondary, and override layers
  • Handoff times — when shifts change
  • Restrictions — weekday-only or weekend-only schedules

The person on call when an incident triggers is notified first. See Schedules.
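A single-layer weekly rotation can be sketched as simple date arithmetic. The member names, start date, and handoff rule are illustrative; real schedules layer secondaries, overrides, and restrictions on top of this:

```python
from datetime import date

# Hypothetical roster taking week-long shifts in order, handing off at
# the start of each week.
MEMBERS = ["Alice", "Bob", "Carol"]
ROTATION_START = date(2024, 1, 1)  # a Monday

def on_call(day):
    """Return who is on call on a given date under a weekly rotation."""
    weeks_elapsed = (day - ROTATION_START).days // 7
    return MEMBERS[weeks_elapsed % len(MEMBERS)]
```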


How It All Fits Together

Monitoring Tool → Event → Service → Alert → Incident
                                      ↓
                              Escalation Policy
                                      ↓
                              On-Call Schedule
                                      ↓
                              Notification (phone/SMS/Slack/etc.)
                                      ↓
                              Acknowledge → Investigate → Resolve
                                      ↓
                              Postmortem → Action Items

Every step is logged. Every action has a timestamp. Full audit trail from trigger to resolution.