Monitoring & Alerting
Health checks, Slack/webhook/email alerts, dead letter queue, and audit logs — operational visibility out of the box
Hogsend includes built-in monitoring, alerting, and failure recovery — no external tools required. This page covers health checks, alert rules, the dead letter queue, and audit logs for incident investigation.
Health Check
The health endpoint reports the status of each infrastructure component:
curl http://localhost:3002/v1/health
{
"status": "healthy",
"uptime": 86400.123,
"timestamp": "2026-05-25T10:30:00.000Z",
"version": "0.0.1",
"components": {
"database": { "status": "up", "latencyMs": 2 },
"redis": { "status": "up" }
},
"schema": {
"engine": { "required": "0012", "applied": "0012", "inSync": true, "pending": [] },
"client": { "required": "0003", "applied": "0003", "inSync": true, "pending": [] }
}
}
No authentication required -- this endpoint is public so infrastructure tools can call it.
Status Values
| Status | Meaning |
|---|---|
| healthy | All components up and both migration tracks in sync |
| degraded | One or more components are down, but the API is still serving requests |
| migration_pending | Either migration track is behind the code (schema.engine.inSync or schema.client.inSync is false) |
Each component reports up or down (the database also reports latencyMs). If any component is down, the overall status becomes degraded but the API continues to serve requests that do not depend on the failed component.
The schema block (two migration tracks)
/v1/health reports the migration state of both tracks — the engine track (bundled in @hogsend/db, gates boot) and the client track (your repo's migrations/, surfaced non-fatally). Each track reports:
| Field | Meaning |
|---|---|
| required | Latest migration the running build needs (or null for an empty client track) |
| applied | Latest migration applied to the database (or null) |
| inSync | applied is at least required (a DB ahead of the build is true by design) |
| pending | Migrations the code needs but the DB lacks |
Overall status is migration_pending when either track is behind. A client repo with no migrations reports an empty client track (required: null, applied: null, inSync: true), so it never flips the status to migration_pending. The engine track additionally asserts at boot: if it is behind, the API logs the pending migrations and exits with code 1 rather than serving against a schema it does not understand. See Deployment and Upgrading & Customizing for the full migration model.
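To make the schema block concrete, here is a minimal Python sketch that takes an already-parsed /v1/health response and lists the tracks that are behind. The function name pending_migrations is hypothetical and not part of Hogsend; the field names are the ones shown in the sample response.

```python
def pending_migrations(health: dict) -> dict:
    """Map each out-of-sync track name to its list of pending migrations.

    Illustrative helper only: it reads the "schema" block of a parsed
    /v1/health response and does not call the API itself.
    """
    behind = {}
    for track, state in health.get("schema", {}).items():
        # A track with inSync == true (including an empty client track) is skipped.
        if not state.get("inSync", True):
            behind[track] = list(state.get("pending", []))
    return behind
```

An empty result means both tracks are in sync. Any other result names the track to migrate and the migrations it lacks.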
What to Monitor
Set up an external uptime monitor (Pingdom, Better Uptime, etc.) pointed at your health endpoint. Watch for:
- Status flip to degraded -- investigate which component is down
- Database latency above 50ms -- may indicate connection pool exhaustion or query performance issues
- Redis going down -- rate limiting falls back to in-memory (per-instance only), PostHog property caching stops working. Email delivery and journeys continue to function.
The health endpoint is configured as Railway's health check in railway.toml, so Railway will restart the service automatically if it becomes unresponsive.
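If you would rather script these checks than rely on a hosted monitor, the following Python sketch shows one way to turn a health response into a list of findings. The function names, the 5-second timeout, and the wording of the findings are assumptions made for illustration. The 50ms latency figure comes from the list above.

```python
import json
import urllib.request

LATENCY_WARN_MS = 50  # latency figure taken from the monitoring list


def fetch_health(base_url: str, timeout: float = 5.0) -> dict:
    """Fetch and parse /v1/health (no authentication is needed for this endpoint)."""
    with urllib.request.urlopen(f"{base_url.rstrip('/')}/v1/health", timeout=timeout) as resp:
        return json.load(resp)


def health_findings(health: dict, latency_warn_ms: float = LATENCY_WARN_MS) -> list:
    """Return human-readable findings; an empty list means nothing to report."""
    findings = []
    if health.get("status") != "healthy":
        findings.append(f"status is {health.get('status')}")
    for name, comp in health.get("components", {}).items():
        if comp.get("status") != "up":
            findings.append(f"{name} is {comp.get('status')}")
        latency = comp.get("latencyMs")  # only the database reports this
        if latency is not None and latency > latency_warn_ms:
            findings.append(f"{name} latency {latency}ms exceeds {latency_warn_ms}ms")
    return findings
```

Calling health_findings(fetch_health("http://localhost:3002")) from a cron job and forwarding any non-empty result to your on-call channel would cover the three conditions listed above.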
System Metrics
The overview endpoint gives you a high-level snapshot:
curl -H "Authorization: Bearer your-api-key" \
http://localhost:3002/v1/admin/metrics/overview
{
"totalContacts": 1250,
"activeJourneys": 8,
"emailsSent24h": 340,
"emailsSent7d": 2100,
"emailsSent30d": 8500,
"bounceRate30d": 0.012,
"unsubscribeRate": 0.034
}
Check this daily to spot trends:
| Metric | Normal | Investigate |
|---|---|---|
| bounceRate30d | <0.02 (2%) | >0.03 (3%) |
| unsubscribeRate | <0.05 (5%) | >0.10 (10%) |
| emailsSent24h | Consistent day-to-day | Sudden spikes or drops |
| activeJourneys | Stable or growing | Sudden drop (journeys disabled?) |
For deeper metrics on journeys, emails, and events, see Metrics & Analytics.
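The two rate rows in the table can be checked automatically. The Python sketch below classifies a parsed overview response against those bands. The function name and the "watch" label for values between the Normal and Investigate columns are assumptions; the table itself does not name that middle band.

```python
# (normal_below, investigate_above) pairs copied from the table above.
RATE_BANDS = {
    "bounceRate30d": (0.02, 0.03),
    "unsubscribeRate": (0.05, 0.10),
}


def classify_rates(overview: dict) -> dict:
    """Label each known rate as "normal", "watch", or "investigate"."""
    result = {}
    for metric, (normal_below, investigate_above) in RATE_BANDS.items():
        value = overview.get(metric)
        if value is None:
            continue  # metric missing from the response; nothing to classify
        if value > investigate_above:
            result[metric] = "investigate"
        elif value < normal_below:
            result[metric] = "normal"
        else:
            result[metric] = "watch"  # hypothetical label for the gap between bands
    return result
```

The count metrics (emailsSent24h, activeJourneys) are left out on purpose, since judging them needs a day-over-day baseline rather than a fixed threshold.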
Event Volume
Track event inflow to verify your pipeline is working and spot anomalies:
curl -H "Authorization: Bearer your-api-key" \
"http://localhost:3002/v1/admin/metrics/events?granularity=hour&from=2026-05-25T00:00:00Z"
{
"events": [
{ "event": "user:signed_up", "date": "2026-05-25T08:00:00Z", "count": 15 },
{ "event": "user:signed_up", "date": "2026-05-25T09:00:00Z", "count": 22 },
{ "event": "user:activated", "date": "2026-05-25T08:00:00Z", "count": 8 }
]
}
Useful patterns:
- Zero events for an expected type -- your webhook source or ingest integration may be broken
- Event volume spike -- could indicate a bulk import, a marketing campaign launch, or a bug causing duplicate events
- Events arriving but no journey enrollments -- check if journeys are enabled and trigger conditions match
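The first two patterns can be detected from the events array directly. The Python sketch below is one possible approach. The function name, the expected list, and the spike rule (latest bucket more than three times the average of earlier buckets) are illustrative assumptions, not Hogsend behavior.

```python
from collections import defaultdict


def event_anomalies(events: list, expected: list, spike_factor: float = 3.0) -> dict:
    """Report expected event types with no volume and types whose latest bucket spiked."""
    counts = defaultdict(list)
    # ISO 8601 timestamps in one format sort correctly as plain strings.
    for row in sorted(events, key=lambda r: r["date"]):
        counts[row["event"]].append(row["count"])

    missing = [name for name in expected if sum(counts.get(name, [])) == 0]

    spikes = []
    for name, series in counts.items():
        if len(series) < 2:
            continue  # no baseline to compare against
        *earlier, latest = series
        baseline = sum(earlier) / len(earlier)
        if baseline > 0 and latest > spike_factor * baseline:
            spikes.append(name)
    return {"missing": missing, "spikes": sorted(spikes)}
```

Pass the "events" array from the response together with the event names your integration is supposed to emit. The third pattern (events without enrollments) needs journey data and is not covered here.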
Alert Rules
Alert rules define conditions that trigger notifications. Each rule monitors a specific metric, fires when a threshold is crossed, and sends a notification through your chosen channel.
Creating Alert Rules
# Alert when bounce rate exceeds 3%
curl -X POST http://localhost:3002/v1/admin/alerts/rules \
-H "Authorization: Bearer your-api-key" \
-H "Content-Type: application/json" \
-d '{
"name": "Bounce rate warning",
"type": "bounce_rate_exceeded",
"threshold": 0.03,
"channel": "slack",
"channelConfig": { "webhookUrl": "https://hooks.slack.com/services/T.../B.../xxx" },
"cooldownMinutes": 120
}'
# Alert when delivery rate drops below 95%
curl -X POST http://localhost:3002/v1/admin/alerts/rules \
-H "Authorization: Bearer your-api-key" \
-H "Content-Type: application/json" \
-d '{
"name": "Low delivery rate",
"type": "delivery_issue",
"threshold": 0.95,
"channel": "webhook",
"channelConfig": { "url": "https://your-app.com/webhooks/alerts" },
"cooldownMinutes": 60
}'
# Alert on journey failure spikes (>10 failures per hour)
curl -X POST http://localhost:3002/v1/admin/alerts/rules \
-H "Authorization: Bearer your-api-key" \
-H "Content-Type: application/json" \
-d '{
"name": "Journey failures spiking",
"type": "journey_failure_spike",
"threshold": 10,
"channel": "email",
"channelConfig": { "to": "ops@yourcompany.com" },
"cooldownMinutes": 30
}'
Alert Types
| Type | What it monitors | Threshold meaning |
|---|---|---|
| bounce_rate_exceeded | 30-day bounce rate | Fires when rate exceeds this value (e.g., 0.05 = 5%) |
| journey_failure_spike | Journey failures per hour | Fires when hourly count exceeds this number |
| delivery_issue | Email delivery rate | Fires when rate drops below this value (e.g., 0.95 = 95%) |
| high_complaint_rate | Spam complaint rate | Fires when rate exceeds this value |
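Note that delivery_issue is the only type whose threshold is a floor rather than a ceiling. The Python sketch below restates the comparison direction from the table. It is a reading aid, not Hogsend's implementation, and treating a value exactly equal to the threshold as not firing is an assumption.

```python
# Direction of comparison per alert type, as described in the table above.
FIRES_WHEN_BELOW = {"delivery_issue"}
FIRES_WHEN_ABOVE = {"bounce_rate_exceeded", "journey_failure_spike", "high_complaint_rate"}


def threshold_crossed(alert_type: str, value: float, threshold: float) -> bool:
    """Return True when the value is on the firing side of the threshold."""
    if alert_type in FIRES_WHEN_BELOW:
        return value < threshold
    if alert_type in FIRES_WHEN_ABOVE:
        return value > threshold
    raise ValueError(f"unknown alert type: {alert_type}")
```

This is handy for sanity-checking a threshold before creating a rule: a delivery_issue threshold of 0.05, for instance, would almost never fire.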
Notification Channels
Slack -- send to a channel via incoming webhook:
{
"channel": "slack",
"channelConfig": {
"webhookUrl": "https://hooks.slack.com/services/T.../B.../xxx",
"channel": "#ops-alerts"
}
}
The channel field in config is optional -- if omitted, the message goes to the webhook's default channel.
Webhook -- POST the alert payload to any URL:
{
"channel": "webhook",
"channelConfig": { "url": "https://your-app.com/webhooks/alerts" }
}
Email -- send through your configured email provider (Resend by default) using your sender address:
{
"channel": "email",
"channelConfig": { "to": "ops@yourcompany.com" }
}
Cooldown and Deduplication
The cooldownMinutes setting prevents alert fatigue. After a rule fires, it will not fire again until the cooldown period elapses. Set this based on how quickly you can respond:
| Scenario | Recommended cooldown |
|---|---|
| Critical alerts (delivery failures) | 30 minutes |
| Warning alerts (bounce rate trending up) | 2 hours |
| Informational alerts (high event volume) | 4-6 hours |
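The cooldown rule amounts to a single time comparison. Here is a minimal Python sketch of that logic, assuming a rule may fire again once exactly cooldownMinutes have passed since it last fired. The function name and the inclusive boundary are assumptions, not Hogsend's documented behavior.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def cooldown_elapsed(last_fired_at: Optional[datetime], now: datetime, cooldown_minutes: int) -> bool:
    """Return True if a rule that last fired at last_fired_at may fire again at now."""
    if last_fired_at is None:
        return True  # the rule has never fired
    return now - last_fired_at >= timedelta(minutes=cooldown_minutes)


# Example: a rule with a 120-minute cooldown that fired at 08:00 UTC.
fired = datetime(2026, 5, 25, 8, 0, tzinfo=timezone.utc)
print(cooldown_elapsed(fired, fired + timedelta(minutes=90), 120))   # still suppressed
print(cooldown_elapsed(fired, fired + timedelta(minutes=120), 120))  # may fire again
```

With a 2-hour cooldown, a bounce rate that stays above threshold all day produces at most one notification every two hours.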
Managing Rules
# List all rules
curl -H "Authorization: Bearer your-api-key" \
http://localhost:3002/v1/admin/alerts/rules
# Update a rule (change threshold and cooldown)
curl -X PATCH http://localhost:3002/v1/admin/alerts/rules/rule-uuid \
-H "Authorization: Bearer your-api-key" \
-H "Content-Type: application/json" \
-d '{ "threshold": 0.02, "cooldownMinutes": 60 }'
# Delete a rule
curl -X DELETE http://localhost:3002/v1/admin/alerts/rules/rule-uuid \
-H "Authorization: Bearer your-api-key"
Alert History
Review past alert triggers to verify notifications are working and thresholds are tuned:
curl -H "Authorization: Bearer your-api-key" \
"http://localhost:3002/v1/admin/alerts/history?limit=20"
{
"alerts": [
{
"id": "alert-uuid",
"ruleId": "rule-uuid",
"ruleName": "Bounce rate warning",
"type": "bounce_rate_exceeded",
"currentValue": 0.042,
"threshold": 0.03,
"channel": "slack",
"delivered": true,
"triggeredAt": "2026-05-25T08:00:00.000Z"
}
],
"total": 5,
"limit": 20,
"offset": 0
}
| Field | Meaning |
|---|---|
| currentValue | The metric value when the alert fired |
| threshold | The configured threshold |
| delivered | Whether the notification was successfully sent |
If delivered is false, the notification could not be sent, most often because the channel is misconfigured. Check that the webhook URL is reachable, the Slack webhook is valid, or the email address is correct.
Filter by rule to see how often a specific rule is firing:
curl -H "Authorization: Bearer your-api-key" \
"http://localhost:3002/v1/admin/alerts/history?ruleId=rule-uuid"
If a rule fires constantly, either the threshold is too sensitive or you have a real problem that needs attention.
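To see at a glance which rules fire most and which notifications failed, you could summarize the alerts array from the history response. The Python sketch below is one way to do that. The function name and the output keys are hypothetical; the input fields are the ones shown in the sample response.

```python
from collections import Counter


def summarize_history(alerts: list) -> dict:
    """Count alerts per rule name and collect the ids of undelivered alerts."""
    fired = Counter(alert["ruleName"] for alert in alerts)
    undelivered = [alert["id"] for alert in alerts if not alert.get("delivered", False)]
    return {"firedByRule": dict(fired), "undelivered": undelivered}
```

A rule with a high count is a candidate for a looser threshold or a longer cooldown. Any id in undelivered points at a channel configuration worth checking.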
Dead Letter Queue
When a task fails after all retry attempts, it is moved to the dead letter queue (DLQ) instead of being silently dropped. The DLQ is your last line of defense against data loss.
What Goes in the DLQ
| Source | Common causes |
|---|---|
| email | Provider API errors, template rendering failures, rate limits |
| journey | Journey code errors that exhausted Hatchet retries |
| webhook | Outbound alert webhook delivery failures |
Inspecting the DLQ
# All pending entries
curl -H "Authorization: Bearer your-api-key" \
"http://localhost:3002/v1/admin/dlq?status=pending"
# Only failed emails
curl -H "Authorization: Bearer your-api-key" \
"http://localhost:3002/v1/admin/dlq?source=email&status=pending"
# Only failed journeys
curl -H "Authorization: Bearer your-api-key" \
"http://localhost:3002/v1/admin/dlq?source=journey&status=pending"
{
"entries": [
{
"id": "dlq-uuid",
"source": "email",
"sourceId": "email-uuid",
"payload": {
"templateKey": "activation/welcome",
"toEmail": "user@acme.com"
},
"error": "Resend API timeout after 3 retries",
"retryCount": 3,
"status": "pending",
"retriedAt": null,
"createdAt": "2026-05-25T10:30:00.000Z"
}
],
"total": 1,
"limit": 50,
"offset": 0
}
Retrying a Failed Task
If the underlying issue is resolved (Resend is back up, a bug was fixed), retry the task:
curl -X POST http://localhost:3002/v1/admin/dlq/dlq-uuid/retry \
-H "Authorization: Bearer your-api-key"
{
"id": "dlq-uuid",
"status": "retried",
"retriedAt": "2026-05-25T11:00:00.000Z"
}
The task is re-queued through its original pipeline. If it fails again, it returns to the DLQ with an incremented retryCount.
Discarding an Entry
If a failure is not worth retrying (recipient unsubscribed, event is no longer relevant):
curl -X DELETE http://localhost:3002/v1/admin/dlq/dlq-uuid \
-H "Authorization: Bearer your-api-key"
Discarded entries remain in the DLQ with status: "discarded" for audit purposes.
DLQ Best Practices
- Review the DLQ weekly -- look for recurring patterns that indicate systemic issues
- Retry in batches after outages -- if Resend was down for an hour, retry all pending email entries once it recovers
- Discard stale entries -- an email from 2 weeks ago for a time-sensitive offer is not worth retrying
- Alert on DLQ growth -- if the pending count is growing, something is broken upstream
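The retry and discard practices can be combined into a simple triage pass over the entries array. The Python sketch below sorts pending entries into two buckets by age. The function name and the 14-day default cutoff (taken from the "2 weeks" example above) are assumptions, and the sketch only builds a plan; it does not call the retry or delete endpoints.

```python
from datetime import datetime, timedelta, timezone


def triage_dlq(entries: list, now: datetime, max_age: timedelta = timedelta(days=14)) -> dict:
    """Split pending DLQ entries into ids to retry and ids to discard as stale."""
    plan = {"retry": [], "discard": []}
    for entry in entries:
        if entry.get("status") != "pending":
            continue  # already retried or discarded
        # Replace the trailing "Z" so fromisoformat also works before Python 3.11.
        created = datetime.fromisoformat(entry["createdAt"].replace("Z", "+00:00"))
        bucket = "discard" if now - created > max_age else "retry"
        plan[bucket].append(entry["id"])
    return plan
```

Each id under retry would then go to POST /v1/admin/dlq/{id}/retry and each id under discard to DELETE /v1/admin/dlq/{id}, as shown in the sections above. Review the plan before acting on it, since a stale journey entry may still be worth retrying.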
Audit Logs
Every admin mutation (POST, PUT, PATCH, DELETE) is automatically recorded. No configuration needed.
What Gets Logged
| Field | Description |
|---|---|
| actor | The API key name, or "legacy" for the env-var key |
| actorKeyId | API key UUID (null for legacy key) |
| action | create, update, delete, revoke, enroll, cancel, import, export, replay, resend |
| resource | contact, journey, api-key, alert-rule, email, event, dlq |
| resourceId | The target resource's identifier |
| detail | Additional context (e.g., the externalId of a created contact) |
| ipAddress | Client IP address |
Searching Audit Logs
# All mutations in the last 24 hours
curl -H "Authorization: Bearer your-api-key" \
"http://localhost:3002/v1/admin/audit-logs?from=2026-05-24T10:30:00Z"
# Who deleted contacts recently?
curl -H "Authorization: Bearer your-api-key" \
"http://localhost:3002/v1/admin/audit-logs?resource=contact&action=delete"
# What did the CI Pipeline key do?
curl -H "Authorization: Bearer your-api-key" \
"http://localhost:3002/v1/admin/audit-logs?actor=CI%20Pipeline"
# All key management actions
curl -H "Authorization: Bearer your-api-key" \
"http://localhost:3002/v1/admin/audit-logs?resource=api-key"
{
"logs": [
{
"id": "log-uuid",
"actor": "CI Pipeline",
"actorKeyId": "key-uuid",
"action": "create",
"resource": "contact",
"resourceId": "contact-uuid",
"detail": { "externalId": "user_abc123" },
"ipAddress": "192.168.1.1",
"createdAt": "2026-05-25T10:30:00.000Z"
}
],
"total": 1,
"limit": 50,
"offset": 0
}
Using Audit Logs for Incident Response
When investigating an issue, the audit log answers "who did what, when":
- A journey was unexpectedly disabled -- search for resource=journey&action=update to find who toggled it
- Contacts were deleted -- search for resource=contact&action=delete with a time range
- An API key was compromised -- search for the key's actor name across all actions to see what it was used for, then revoke it
- A bulk import went wrong -- search for resource=contact&action=import to find the import job details
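When you run these searches from a script, building the query string by hand is error-prone (actor names with spaces, for example). The Python sketch below assembles an audit-log URL from keyword filters. The helper name is hypothetical, and it only uses the query parameters that appear in the examples above (resource, action, actor, from).

```python
from urllib.parse import quote, urlencode


def audit_logs_url(base_url: str, **filters) -> str:
    """Build a /v1/admin/audit-logs URL from keyword filters.

    "from" is a Python keyword, so pass it as from_; the trailing
    underscore is stripped before the query string is built.
    """
    params = {}
    for key, value in filters.items():
        if value is None:
            continue  # skip filters that were not set
        params[key.rstrip("_")] = value
    # quote_via=quote encodes spaces as %20; ":" is kept readable in timestamps.
    query = urlencode(params, quote_via=quote, safe=":")
    path = f"{base_url.rstrip('/')}/v1/admin/audit-logs"
    return f"{path}?{query}" if query else path
```

For example, audit_logs_url("http://localhost:3002", resource="contact", action="delete", from_="2026-05-24T10:30:00Z") covers the "contacts were deleted" case with a time range. The request still needs the Authorization header shown in the curl examples.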
Recommended Production Setup
A solid monitoring setup for a typical Hogsend deployment:
1. External Uptime Monitor
Point an external uptime service at your deployment's public health endpoint (for example, https://api.hogsend.com/v1/health). Check every 60 seconds. Alert your on-call channel if it goes down.
2. Core Alert Rules
Create these four alert rules as a baseline:
# Bounce rate > 3%
curl -X POST http://localhost:3002/v1/admin/alerts/rules \
-H "Authorization: Bearer your-api-key" \
-H "Content-Type: application/json" \
-d '{
"name": "Bounce rate warning",
"type": "bounce_rate_exceeded",
"threshold": 0.03,
"channel": "slack",
"channelConfig": { "webhookUrl": "https://hooks.slack.com/services/..." },
"cooldownMinutes": 120
}'
# Delivery rate < 95%
curl -X POST http://localhost:3002/v1/admin/alerts/rules \
-H "Authorization: Bearer your-api-key" \
-H "Content-Type: application/json" \
-d '{
"name": "Low delivery rate",
"type": "delivery_issue",
"threshold": 0.95,
"channel": "slack",
"channelConfig": { "webhookUrl": "https://hooks.slack.com/services/..." },
"cooldownMinutes": 60
}'
# Journey failures > 10/hour
curl -X POST http://localhost:3002/v1/admin/alerts/rules \
-H "Authorization: Bearer your-api-key" \
-H "Content-Type: application/json" \
-d '{
"name": "Journey failure spike",
"type": "journey_failure_spike",
"threshold": 10,
"channel": "slack",
"channelConfig": { "webhookUrl": "https://hooks.slack.com/services/..." },
"cooldownMinutes": 30
}'
# Complaint rate > 0.1%
curl -X POST http://localhost:3002/v1/admin/alerts/rules \
-H "Authorization: Bearer your-api-key" \
-H "Content-Type: application/json" \
-d '{
"name": "High complaint rate",
"type": "high_complaint_rate",
"threshold": 0.001,
"channel": "slack",
"channelConfig": { "webhookUrl": "https://hooks.slack.com/services/..." },
"cooldownMinutes": 240
}'
3. Weekly Checks
Build these into your weekly ops routine:
- Review the DLQ -- retry or discard pending entries
- Check alert history -- verify alerts are firing and being delivered
- Review audit logs -- look for unexpected mutations
- Check deliverability trends -- catch gradual degradation before it becomes a problem
- Review API key usage -- revoke stale keys that have not been used
4. Incident Response Checklist
When something goes wrong:
- Check health -- GET /v1/health -- is the database or Redis down? Is status migration_pending (a track behind)? Check schema.engine.inSync and schema.client.inSync.
- Check metrics overview -- GET /v1/admin/metrics/overview -- are the numbers off?
- Check the DLQ -- GET /v1/admin/dlq?status=pending -- are tasks piling up?
- Check alert history -- GET /v1/admin/alerts/history -- when did the problem start?
- Check audit logs -- GET /v1/admin/audit-logs -- did someone change something?
- Check Hatchet dashboard -- localhost:8888 -- are worker processes running?
For the full endpoint specification, see the API Reference.