Alerting & runbooks

Monitoring tells you what is happening; alerting tells you when to act. This page assumes you completed Monitoring: Prometheus is scraping the validator, sentries, signer, and host metrics.

Severity levels

| Severity | Response time | Channel | Examples |
| --- | --- | --- | --- |
| P1: page | < 5 min, 24/7 | PagerDuty / OpsGenie / phone | Validator stopped signing; height stalled; double-sign risk |
| P2: wake | < 30 min, business hours | Discord/Slack with audible ping | Missed blocks > 50/hour; peers < 3; disk < 15 % |
| P3: ticket | next business day | email/issue tracker | Disk < 30 %; clock drift > 100 ms; sentry behind |

A validator with only P1 alerts will get jailed; a validator with all P3 alerts will burn out the operator. Tune both.

1 · alerts.yml: battle-tested rules

Drop this file at ./alerts.yml next to your prometheus.yml. Reload Prometheus with curl -X POST http://localhost:9090/-/reload.
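
Before reloading, it is worth a quick validation pass; a minimal sketch, assuming promtool (shipped in the Prometheus release tarball) is on the box, that prometheus.yml lists the file under rule_files, and that Prometheus was started with --web.enable-lifecycle (required for /-/reload):

# Catch YAML and PromQL mistakes before Prometheus loads them
promtool check rules ./alerts.yml

# prometheus.yml must reference the file, e.g.
#   rule_files:
#     - "alerts.yml"

# Hot-reload only works with --web.enable-lifecycle
curl -X POST http://localhost:9090/-/reload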

groups:
  - name: safrochain.validator
    rules:
      - alert: ValidatorJailed
        expr: cometbft_consensus_validator_power{role="validator"} == 0
        for: 2m
        labels: { severity: critical, page: "true" }
        annotations:
          summary: "Validator has 0 voting power (jailed or unbonded)"
          runbook: "https://draft-docs.safrochain.com/validators/alerting#runbook-validatorjailed"

      - alert: ValidatorMissingBlocks
        expr: increase(cometbft_consensus_validator_missed_blocks[10m]) > 5
        for: 0m
        labels: { severity: critical, page: "true" }
        annotations:
          summary: "Validator missed > 5 blocks in 10 min: pre-jail warning"
          runbook: "https://draft-docs.safrochain.com/validators/alerting#runbook-validatormissingblocks"

      - alert: ChainHeightStalled
        expr: rate(cometbft_consensus_height{role="validator"}[5m]) == 0
        for: 2m
        labels: { severity: critical, page: "true" }
        annotations:
          summary: "Validator block height has not advanced in 2 min"
          runbook: "https://draft-docs.safrochain.com/validators/alerting#runbook-chainheightstalled"

      - alert: ValidatorBehindNetwork
        expr: |
          (max(cometbft_consensus_height) by (chain))
          - on(chain) cometbft_consensus_height{role="validator"} > 5
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "Validator is more than 5 blocks behind the network tip"

      - alert: LowPeerCount
        expr: cometbft_p2p_peers{role="validator"} < 3
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "Validator has fewer than 3 P2P peers"

      - alert: SentryDown
        expr: up{job="cometbft", role="sentry"} == 0
        for: 2m
        labels: { severity: warning }
        annotations:
          summary: "Sentry {{ $labels.instance }} is unreachable"

      - alert: SignerSilent
        expr: rate(horcrux_consensus_signing_attempts_total[5m]) == 0
        for: 2m
        labels: { severity: critical, page: "true" }
        annotations:
          summary: "Remote signer has not attempted to sign in 2 min"
          runbook: "https://draft-docs.safrochain.com/validators/alerting#runbook-signersilent"

      - alert: ClockDrift
        expr: abs(node_ntp_offset_seconds) > 0.1
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "Clock on {{ $labels.instance }} drifted > 100 ms (signing risk above 250 ms)"

      - alert: DiskFillingFast
        expr: |
          predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24*3600) < 0
        for: 30m
        labels: { severity: warning }
        annotations:
          summary: "{{ $labels.instance }} disk projected to fill in 24h"

      - alert: DiskCritical
        expr: 100 * node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 10
        for: 5m
        labels: { severity: critical, page: "true" }
        annotations:
          summary: "{{ $labels.instance }} disk free < 10 %"

      - alert: NodeProcessRestarted
        expr: changes(process_start_time_seconds{job="cometbft"}[10m]) > 0
        for: 0m
        labels: { severity: warning }
        annotations:
          summary: "{{ $labels.instance }} safrochaind process restarted"

      - alert: HighBlockProcessingTime
        expr: histogram_quantile(0.95, rate(cometbft_state_block_processing_time_bucket[10m])) > 2
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "Block processing p95 > 2s: node may be CPU/disk bound"

2 · alertmanager.yml: routing & paging

global:
  resolve_timeout: 5m

route:
  receiver: discord-warnings
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  routes:
    - matchers: [ severity="critical", page="true" ]
      receiver: pagerduty
      continue: true
    - matchers: [ severity="critical" ]
      receiver: discord-critical

receivers:
  - name: pagerduty
    pagerduty_configs:
      - service_key: ${PAGERDUTY_SERVICE_KEY}
        description: "{{ .CommonAnnotations.summary }}"

  - name: discord-critical
    webhook_configs:
      - url: ${DISCORD_WEBHOOK_CRITICAL}
        send_resolved: true

  - name: discord-warnings
    webhook_configs:
      - url: ${DISCORD_WEBHOOK_WARNINGS}
        send_resolved: true

inhibit_rules:
  - source_matchers: [ alertname="ChainHeightStalled" ]
    target_matchers: [ alertname="ValidatorMissingBlocks" ]
    equal: [ instance ]
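
As with the Prometheus rules, validate the routing tree before reloading; a sketch using amtool, which is bundled with Alertmanager:

# Syntax and schema check
amtool check-config alertmanager.yml

# Dry-run the routing tree: which receiver would this label set hit?
amtool config routes test --config.file=alertmanager.yml severity=critical page=true

# Hot-reload a running Alertmanager
curl -X POST http://localhost:9093/-/reload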

Paging integrations: quick wiring

# Discord: Server settings → Integrations → Webhooks → New Webhook
# Paste the webhook URL into the ${DISCORD_WEBHOOK_*} placeholders in alertmanager.yml above.
# Note: Discord expects its own payload format, so pair webhook_configs with a small relay
# (e.g. alertmanager-discord) or use the native Discord receiver available in Alertmanager ≥ 0.25.

3 · Runbooks

A page without a runbook is a page that prolongs the incident. Each critical alert above must link to a runbook. The minimum set is below; keep each runbook short, action-first, and version-controlled.

Runbook: ValidatorJailed

  1. Confirm the alert: safrochaind query staking validator $VALADDR | yq '.jailed'.
  2. Check why (see the signing-info sketch after this list):
    • missed blocks → see runbook ValidatorMissingBlocks first
    • tombstoned (tombstoned: true) → stop; this is a double-sign, follow Disaster recovery
  3. If only jailed (not tombstoned), root-cause then unjail:
    safrochaind tx slashing unjail \
    --from validator --keyring-backend file \
    --chain-id safro-mainnet-1 \
    --gas auto --gas-adjustment 1.3 --fees 5000usaf \
    --node https://rpc.safrochain.network:443 --yes
  4. Post-incident: write a short timeline in your team channel, file a ticket if the cause was a software/config issue.
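
For step 2, the slashing module's signing-info answers both questions at once; a minimal sketch, assuming safrochaind exposes the standard Cosmos SDK tendermint and slashing subcommands:

# Consensus (valcons) address of the key this node signs with
VALCONS=$(safrochaind tendermint show-address)

# tombstoned=true means a double-sign: do NOT unjail, go straight to Disaster recovery
safrochaind query slashing signing-info "$VALCONS" -o json \
  | jq '{tombstoned, jailed_until, missed_blocks_counter}'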

Runbook: ValidatorMissingBlocks

  1. Is the node reachable? curl -fsS http://localhost:26657/status > /dev/null && echo OK.
  2. Is the signer alive? journalctl -u tmkms -f or journalctl -u horcrux -f.
  3. Is catching_up true? If so, the node restarted recently; let it catch up (the triage sketch after this list condenses these checks).
  4. Is peer count low? Check sentries: curl localhost:26657/net_info | jq '.result.n_peers'.
  5. Disk full? df -h: most common silent killer.
  6. CPU at 100 %? top -bn1 | head: could be heavy mempool.
  7. Restart the validator only as a last resort, and only after step 2 confirms the signer is alive and not mid-sign: sudo systemctl restart safrochaind.
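
The status, peer, and resource checks above condense into a few lines; a hedged sketch, assuming the local RPC on 26657 and jq installed:

# Sync state, current height, and peer count in one pass
curl -s localhost:26657/status \
  | jq '{catching_up: .result.sync_info.catching_up, height: .result.sync_info.latest_block_height}'
curl -s localhost:26657/net_info | jq '.result.n_peers'

# The usual silent killers
df -h /
top -bn1 | head -5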

Runbook: ChainHeightStalled

  1. Is the chain itself live? curl -fsS https://rpc.safrochain.network/status | jq '.result.sync_info'.
    • If the chain has stalled (rare): wait, check announcements.
    • If the chain is fine and only this node is stalled, continue.
  2. Is safrochaind running? systemctl status safrochaind.
  3. Disk free? df -h /home/safro/.safrochain/data.
  4. P2P peers > 0? If 0, validator is partitioned; check sentry firewall.
  5. If everything looks fine but height is stuck, restart safrochaind.
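
To quantify how far this node has fallen behind, compare its height to the public RPC used in step 1; a sketch assuming jq and the hostname shown above:

LOCAL=$(curl -s localhost:26657/status | jq -r '.result.sync_info.latest_block_height')
REMOTE=$(curl -s https://rpc.safrochain.network/status | jq -r '.result.sync_info.latest_block_height')
echo "local=$LOCAL network=$REMOTE lag=$((REMOTE - LOCAL)) blocks"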

Runbook: SignerSilent

  1. Identify which signer is in use: grep priv_validator_laddr ~/.safrochain/config/config.toml.
  2. TMKMS: systemctl status tmkms → if down, start it; if up, check journalctl -u tmkms for HSM/network errors.
  3. Horcrux: check all three cosigners; with a 2-of-3 threshold the cluster keeps signing as long as two are alive. If two or more are down, signing has stopped: restart the dead cosigners.
  4. If you cannot bring the signer back within 5 minutes, stop the validator to prevent any window where a misconfigured fallback could sign with a different key. A jailed validator can be unjailed; a tombstoned one cannot.
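
To see whether the node still holds a live connection to whatever priv_validator_laddr points at, check the socket from the validator host; a hedged sketch (the port below is a placeholder, substitute the one from your own config.toml):

# Port comes from priv_validator_laddr, e.g. tcp://0.0.0.0:1234 → 1234 (placeholder)
SIGNER_PORT=1234
ss -tnp | grep ":$SIGNER_PORT" || echo "no established signer connection"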

Runbook: DiskCritical

  1. du -sh ~/.safrochain/data/* | sort -h: find the biggest consumer.
  2. If application.db is huge, run pruning (see Operations → State pruning).
  3. If the WAL/snapshot directory is full, you can safely remove old snapshots in data/snapshots/.
  4. Never remove priv_validator_state.json: it is the double-sign safety latch.
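
A quick pass that covers steps 1 and 4 together; paths follow the layout used elsewhere on this page:

# Biggest consumers under the data dir
du -sh ~/.safrochain/data/* | sort -h | tail -5

# Sanity check: the signing-state file must stay put and non-empty
test -s ~/.safrochain/data/priv_validator_state.json && echo "priv_validator_state.json present"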

4 · Test the alerts (don't trust untested alerts)

# Force a synthetic alert
curl -XPOST http://localhost:9093/api/v2/alerts -H 'Content-Type: application/json' -d '[{
  "labels": {"alertname":"SyntheticTest","severity":"critical","page":"true"},
  "annotations": {"summary":"this is a test"}
}]'
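
To confirm Alertmanager actually ingested the synthetic alert before waiting on the downstream channels, query it back over the v2 API (same port 9093 as above, jq assumed):

# The synthetic alert should appear here within a few seconds
curl -s http://localhost:9093/api/v2/alerts | jq '.[].labels.alertname'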

Confirm the page lands in every channel: PagerDuty, Discord, Telegram, email. Repeat once a quarter. The week your real alert misfires is not the week to discover the webhook expired.

5 · Status pages & public dashboards

Operators with public delegators usually publish a small status page so delegators can verify uptime themselves. Two cheap options:

| Tool | Cost | Notes |
| --- | --- | --- |
| Better Uptime / Statuspage | low monthly fee | scrapes one of your sentries' RPC |
| Self-hosted Uptime Kuma | free | Docker container that pings /status |
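
If you go the self-hosted route, Uptime Kuma runs as a single container; a minimal sketch (image name from the upstream project, port and volume names are illustrative):

docker run -d --name uptime-kuma \
  -p 3001:3001 \
  -v uptime-kuma:/app/data \
  louislam/uptime-kuma:1
# Then add an HTTP monitor pointing at a sentry's /status endpoint, never the validator's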

Make the status page show: chain ID, voting power, last signed height, missed blocks (24h), and incident history.