
Monitoring (Prometheus + Grafana)

A validator without monitoring is a validator that gets jailed at 3 a.m. This page sets up a complete observability stack: CometBFT exposes metrics → Prometheus scrapes them → Grafana renders them. The next page, Alerting & runbooks, wires the alerting layer on top.

What you should be measuring

| Layer | Signal | Why |
|---|---|---|
| Consensus | `cometbft_consensus_height` | If it stops, the validator is dead |
| Consensus | `cometbft_consensus_validator_missed_blocks` | Pre-jail warning |
| Consensus | `cometbft_consensus_validator_power` | Drops to 0 → unbonded |
| Consensus | `cometbft_consensus_validators` | Active-set membership |
| Mempool | `cometbft_mempool_size`, `cometbft_mempool_tx_size_bytes` | Backlog & DoS detection |
| P2P | `cometbft_p2p_peers` | < 3 peers = network-partition risk |
| State | `cometbft_state_block_processing_time` | App slowdown / disk pressure |
| Host | `node_filesystem_avail_bytes` | Disk full = node crash |
| Host | `node_load1`, CPU, memory | Capacity planning |
| Host | `node_ntp_offset_seconds` | Clock drift > 250 ms breaks signing |
| Process | `process_resident_memory_bytes` | Memory leaks, GC pressure |

1 · Enable CometBFT metrics

Edit ~/.safrochain/config/config.toml on the validator and on each sentry:

[instrumentation]
prometheus = true
prometheus_listen_addr = "0.0.0.0:26660"
namespace = "cometbft"
max_open_connections = 3

Restart the node:

sudo systemctl restart safrochaind

Confirm the endpoint is up:

curl -s http://localhost:26660/metrics | head -20
# # HELP cometbft_consensus_height Height of the chain
# # TYPE cometbft_consensus_height gauge
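If you script health checks, the text exposition format is easy to parse directly. A minimal Python sketch (the helper names are illustrative, not part of CometBFT; the URL matches the listener configured above) that pulls the current height out of the `/metrics` payload:

```python
import re
import urllib.request

def parse_metric(text: str, name: str):
    """Return the first sample value for `name` from Prometheus text
    exposition, or None if the metric is absent."""
    # Matches e.g.: cometbft_consensus_height{chain_id="safro-mainnet-1"} 123456
    pattern = re.compile(rf"^{re.escape(name)}(?:{{[^}}]*}})?\s+([0-9eE.+-]+)\s*$", re.M)
    m = pattern.search(text)
    return float(m.group(1)) if m else None

def current_height(url: str = "http://localhost:26660/metrics"):
    """Scrape the CometBFT metrics endpoint and return the consensus height."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return parse_metric(resp.read().decode(), "cometbft_consensus_height")
```

Because `parse_metric` works on any sample in the text format, the same helper covers node_exporter output too.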

2 · Install node_exporter (host metrics)

sudo useradd -rs /bin/false node_exporter
NODE_EXPORTER_VERSION=1.8.2

curl -sLO "https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXPORTER_VERSION}/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz"
tar -xzf node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
sudo install -m 0755 node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64/node_exporter /usr/local/bin/

Systemd unit at /etc/systemd/system/node_exporter.service:

[Unit]
Description=Prometheus Node Exporter
After=network-online.target
Wants=network-online.target

[Service]
User=node_exporter
Group=node_exporter
ExecStart=/usr/local/bin/node_exporter \
--collector.systemd \
--collector.processes \
--collector.ntp \
--collector.netdev.device-include="^(eth|ens|enp|wg)" \
--web.listen-address=:9100
Restart=on-failure

[Install]
WantedBy=multi-user.target
Enable and start it, then confirm the endpoint:

sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
curl -s http://localhost:9100/metrics | head -10
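Two of the host signals from the table at the top have hard thresholds worth encoding in any probe you write. A small sketch (the disk threshold is an illustrative choice; the 250 ms figure matches the clock-drift warning above):

```python
NTP_OFFSET_LIMIT_S = 0.25   # ~250 ms of clock drift endangers block signing
MIN_DISK_FREE_PCT = 20.0    # illustrative early-warning threshold

def host_warnings(samples: dict) -> list:
    """Evaluate node_exporter samples (metric name -> value) against thresholds."""
    warnings = []
    free_pct = 100.0 * samples["node_filesystem_avail_bytes"] / samples["node_filesystem_size_bytes"]
    if free_pct < MIN_DISK_FREE_PCT:
        warnings.append(f"only {free_pct:.1f}% disk free")
    if abs(samples["node_ntp_offset_seconds"]) > NTP_OFFSET_LIMIT_S:
        warnings.append("clock drift exceeds signing tolerance")
    return warnings
```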

3 · Stand up Prometheus

Prometheus runs on a separate observability host, never on the validator. Below is a minimal Docker Compose stack on that host.

docker-compose.yml:

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./alerts.yml:/etc/prometheus/alerts.yml:ro
      - prom_data:/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.time=30d
      - --web.enable-lifecycle
    ports:
      - "9090:9090"

  alertmanager:
    image: prom/alertmanager:latest
    restart: unless-stopped
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    ports:
      - "9093:9093"

  grafana:
    image: grafana/grafana:latest
    restart: unless-stopped
    environment:
      GF_SECURITY_ADMIN_PASSWORD: change_me
      GF_USERS_ALLOW_SIGN_UP: "false"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - ./grafana/dashboards:/var/lib/grafana/dashboards:ro
    ports:
      - "3000:3000"

volumes:
  prom_data:
  grafana_data:

prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 30s
  external_labels:
    chain: safro-mainnet-1

rule_files:
  - /etc/prometheus/alerts.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

scrape_configs:
  - job_name: cometbft
    metrics_path: /metrics
    static_configs:
      - targets:
          - validator.internal:26660
        labels: { role: validator }
      - targets:
          - sentry-1.internal:26660
          - sentry-2.internal:26660
        labels: { role: sentry }

  - job_name: node
    static_configs:
      - targets:
          - validator.internal:9100
        labels: { role: validator }
      - targets:
          - sentry-1.internal:9100
          - sentry-2.internal:9100
        labels: { role: sentry }

  - job_name: signer
    metrics_path: /metrics
    static_configs:
      - targets:
          - signer-1.internal:26660 # tmkms or horcrux
        labels: { role: signer }

Bring it up:

docker compose up -d
docker compose logs -f prometheus | head

Confirm Prometheus can reach every node: open http://<observability-host>:9090/targets and check that each target shows UP.
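The same check is scriptable through Prometheus's HTTP API. A sketch (the base URL is an assumption for this stack) that lists any scrape target whose last scrape failed:

```python
import json
import urllib.request

def down_targets(targets_response: dict) -> list:
    """Given the parsed JSON body of GET /api/v1/targets, return the scrape
    URLs of every active target whose last scrape did not succeed."""
    return [
        t["scrapeUrl"]
        for t in targets_response["data"]["activeTargets"]
        if t["health"] != "up"
    ]

def check_prometheus(base_url: str = "http://localhost:9090") -> list:
    """Fetch the live target list and return the unhealthy ones."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return down_targets(json.load(resp))
```

An empty list from `check_prometheus()` is the scripted equivalent of "every target shows UP".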

4 · Grafana: datasource & dashboards

Provision the datasource (grafana/provisioning/datasources/prom.yml):

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true

Auto-load dashboards (grafana/provisioning/dashboards/dashboards.yml):

apiVersion: 1
providers:
  - name: safrochain
    folder: Safrochain
    type: file
    options:
      path: /var/lib/grafana/dashboards

| Dashboard | Source | What it covers |
|---|---|---|
| CometBFT validator | Grafana.com: search "CometBFT" or "Tendermint validator" | height, missed blocks, peers, mempool |
| Node Exporter Full | Grafana ID 1860 | host CPU/memory/disk/network |
| Cosmos SDK / validator overview (Safrochain custom) | snippet below | KPIs in one screen |

Drop a safro-validator.json file into grafana/dashboards/. A starting JSON skeleton with the right queries:

{
  "title": "Safrochain validator overview",
  "uid": "safro-validator",
  "schemaVersion": 39,
  "panels": [
    {
      "type": "stat",
      "title": "Latest block height",
      "targets": [{ "expr": "cometbft_consensus_height{role=\"validator\"}" }]
    },
    {
      "type": "stat",
      "title": "Validator power",
      "targets": [{ "expr": "cometbft_consensus_validator_power" }]
    },
    {
      "type": "timeseries",
      "title": "Missed blocks (1h rate)",
      "targets": [{ "expr": "rate(cometbft_consensus_validator_missed_blocks[1h])" }]
    },
    {
      "type": "stat",
      "title": "P2P peers",
      "targets": [{ "expr": "cometbft_p2p_peers" }]
    },
    {
      "type": "timeseries",
      "title": "Block processing time (p95)",
      "targets": [{ "expr": "histogram_quantile(0.95, rate(cometbft_state_block_processing_time_bucket[5m]))" }]
    },
    {
      "type": "timeseries",
      "title": "Mempool size",
      "targets": [{ "expr": "cometbft_mempool_size" }]
    },
    {
      "type": "stat",
      "title": "NTP offset (ms)",
      "targets": [{ "expr": "node_ntp_offset_seconds * 1000" }]
    },
    {
      "type": "timeseries",
      "title": "Disk free %",
      "targets": [{ "expr": "100 * node_filesystem_avail_bytes{mountpoint=\"/\"} / node_filesystem_size_bytes{mountpoint=\"/\"}" }]
    }
  ]
}
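The p95 panel leans on histogram_quantile, which estimates a quantile by linear interpolation across cumulative buckets. A simplified reimplementation to make the math concrete (it assumes non-negative observations and skips Prometheus edge cases such as NaN handling):

```python
def histogram_quantile(q: float, buckets: list) -> float:
    """`buckets`: sorted (upper_bound, cumulative_count) pairs ending with
    (float('inf'), total), mirroring *_bucket{le=...} series. Linearly
    interpolates inside the bucket that contains the q-th observation."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # fall back to the highest finite bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound

# e.g. 100 block-processing samples: 50 under 0.1 s, 90 under 0.5 s, all under 1 s
# histogram_quantile(0.95, [(0.1, 50), (0.5, 90), (1.0, 100), (float("inf"), 100)])
# -> 0.75 (interpolated between the 0.5 s and 1.0 s bounds)
```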

Open Grafana at http://<observability-host>:3000 (admin / change_me), and you should see Safrochain → Safrochain validator overview with live data.

5 · The seven panels every validator owner stares at

| Panel | PromQL |
|---|---|
| Latest block height (validator vs chain) | `cometbft_consensus_height{role="validator"} - on() (max(cometbft_consensus_height) by (chain))` |
| Missed blocks (last hour) | `increase(cometbft_consensus_validator_missed_blocks[1h])` |
| Validator power | `cometbft_consensus_validator_power` |
| P2P peer count | `cometbft_p2p_peers` |
| Mempool depth | `cometbft_mempool_size` |
| Block processing p95 | `histogram_quantile(0.95, rate(cometbft_state_block_processing_time_bucket[5m]))` |
| Disk free | `100 * node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}` |

Pin those seven on a wall TV in the operations area and 90 % of incidents will be visible at a glance.
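The first query in that table, the validator's height against the chain-wide maximum, is the single most important signal. The same logic in a probe-friendly form (the lag threshold is an illustrative choice, not a protocol constant):

```python
MAX_HEIGHT_LAG = 5  # blocks behind the observed tip before we call the node stalled

def is_stalled(validator_height: int, chain_max_height: int,
               max_lag: int = MAX_HEIGHT_LAG) -> bool:
    """Mirrors the panel query: the validator's own cometbft_consensus_height
    compared against max(cometbft_consensus_height) across all scraped nodes."""
    return chain_max_height - validator_height > max_lag
```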

6 · Don't forget the signer

If you run TMKMS or Horcrux, scrape the signer too. TMKMS exposes a Prometheus endpoint when built with the prometheus feature; Horcrux exposes one at :26660. The single most useful query:

rate(horcrux_consensus_signing_attempts_total[5m])

If signing attempts go to zero while the chain keeps producing blocks, the signer cluster lost quorum: page the on-call immediately.
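Expressed as code, the paging condition is just two comparisons. A sketch (both inputs would come from the queries above; the names are illustrative):

```python
def signer_lost_quorum(signing_rate_5m: float, height_delta_5m: int) -> bool:
    """Page-immediately condition: the signer produced no signing attempts
    over the window while the chain kept advancing."""
    return signing_rate_5m == 0.0 and height_delta_5m > 0
```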

7 · Continue to alerting

Now that the metrics flow, define what is worth waking someone up for in Alerting & runbooks.