Grafana monitoring for ChatMail: dashboards, Prometheus, and the Ansible role

Grafana instance:  stats.eijo.im
ChatMail relay:    lca-01 (eijo.im)
Ansible role:      roles/grafana/
Prometheus:        localhost:9090 on lca-01
Database backend:  PostgreSQL 18 on buf-01 over WireGuard mesh


1. Architecture

The monitoring stack on the ChatMail relay (lca-01) runs entirely on the node itself, with one exception — Grafana’s database lives on the PostgreSQL primary (buf-01) over the WireGuard mesh.

┌─────────────────────────────────────────────────────┐
│  lca-01 (eijo.im)                                   │
│                                                     │
│  postfix/dovecot ──► mtail ──► :3903/metrics        │
│  node_exporter ──────────────► :9100/metrics        │
│  nginx_exporter ─────────────► :9113/metrics        │
│  postfix_exporter ───────────► :9154/metrics        │
│                                                     │
│  prometheus (localhost:9090)                        │
│    └─ scrapes all four exporters                    │
│                                                     │
│  grafana (127.0.0.1:3000)                           │
│    ├─ datasource: prometheus @ localhost:9090       │
│    ├─ database: postgresql @ fd53:105:1000::10:5432 │
│    └─ nginx reverse proxy → stats.eijo.im:443       │
└─────────────────────────────────────────────────────┘

Why PostgreSQL instead of SQLite?

Grafana defaults to SQLite, but Grafana 12’s unified storage layer produces a "No last resource version found, starting from scratch" log entry every 30 seconds with SQLite. Switching to PostgreSQL eliminates this entirely and gives proper transactional semantics. Since the mesh already runs a replicated PostgreSQL cluster, the grafana database is created on the primary and Grafana connects over the WireGuard mesh (fd53:105:1000::10).
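The resulting database configuration in grafana.ini looks roughly like this (a sketch using Grafana's documented [database] keys; ssl_mode = disable is an assumption, not confirmed from the role):

```ini
# [database] section of /etc/grafana/grafana.ini (sketch; values mirror the
# role defaults in section 5, password comes from the vault)
[database]
type = postgres
host = [fd53:105:1000::10]:5432
name = grafana
user = grafana
# ssl_mode = disable is an assumption: the WireGuard mesh already encrypts
# this hop, so TLS at the PostgreSQL layer may be redundant
ssl_mode = disable
```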


2. Prometheus Scrape Targets

Prometheus scrapes four jobs, all on localhost:

Job              Endpoint         What it collects
chatmail-mtail   127.0.0.1:3903   Mail delivery, DKIM, encryption, accounts, Postfix errors
node             localhost:9100   CPU, memory, disk, network, load
nginx            localhost:9113   Connections, request rate
postfix          localhost:9154   SMTP connects, queue size

All jobs attach an alias="eijo.im" label via relabeling, which the dashboards use for filtering.
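A scrape job with that relabeling might look like this (a sketch; the actual prometheus.yml on lca-01 may differ):

```yaml
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['localhost:9100']
    relabel_configs:
      # With no source_labels, the default regex (.*) always matches,
      # so every target in this job gets alias="eijo.im"
      - target_label: alias
        replacement: eijo.im
```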


3. Dashboards

3.1 ChatMail — eijo.im

UID: chatmail-eijo-im

Monitors mail flow using mtail metrics parsed from Postfix logs.

Panel                    Type        Query
Total Accounts Created   stat        created_accounts{prog="delivered_mail.mtail", alias="eijo.im"}
Total Delivered Mail     stat        delivered_mail{prog="delivered_mail.mtail", alias="eijo.im"}
Mail Delivery Rate       timeseries  rate(delivered_mail{...}[5m])
Encryption Rate          timeseries  outgoing encrypted, incoming encrypted, rejected unencrypted
DKIM Signing Rate        timeseries  rate(dkim_signed{...}[5m])
Quota Exceeded Rate      timeseries  rate(quota_exceeded{...}[5m])
Warning Rate             timeseries  rate(warning_count{...}[5m])
Postfix Error Rate       timeseries  timeouts + noqueue rates
Cumulative panels        timeseries  delivered mail, accounts created, DKIM signed
Account Creation Rate    timeseries  all, CI, non-CI accounts

All queries use hardcoded label selectors rather than Grafana template variables. Grafana 12 has a bug where null datasource references on template-variable-based panels cause “No data” even when the underlying Prometheus queries return results. Hardcoded selectors bypass this entirely.
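For example, the Mail Delivery Rate panel spells out its selectors in full instead of referencing a template variable:

```promql
rate(delivered_mail{prog="delivered_mail.mtail", alias="eijo.im"}[5m])
```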

3.2 Infrastructure — eijo.im

UID: infra-eijo-im

System health monitoring using node_exporter, nginx_exporter, and fail2ban metrics.

Panel                Type        Query
CPU Usage            stat        100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100
Memory Usage         stat        (1 - MemAvailable / MemTotal) * 100
Disk Usage           stat        (1 - avail / size) * 100 for /
Uptime (days)        stat        (time() - node_boot_time_seconds) / 86400
CPU Usage %          timeseries  same as stat, over time
Memory Usage         timeseries  total vs used
Disk I/O             timeseries  read/write bytes rate
Network Traffic      timeseries  receive/transmit bytes rate
Nginx Connections    timeseries  nginx_connections_active
Nginx Request Rate   timeseries  rate(nginx_http_requests_total[5m])
Fail2ban Banned IPs  timeseries  fail2ban_banned_total
Load Average         timeseries  1m, 5m, 15m

4. Systemd Hardening: The Node Exporter Pitfall

The chatmail_hardening role deploys systemd drop-in overrides that restrict services with seccomp filters and namespace isolation. This caused the node exporter to crash-loop:

Symptom: prometheus-node-exporter starts, runs for ~30 seconds, gets killed by SIGSYS (signal 31), restarts. Prometheus has zero node_* metrics.

Root cause: Two settings in the hardening drop-in:

  1. ProtectClock=true — prevents clock-related syscalls
  2. SystemCallFilter=@system-service — does not include adjtimex

The node exporter’s timex collector calls adjtimex to read clock synchronization status. The kernel’s seccomp filter kills the process when it attempts a blocked syscall.

Fix:

# /etc/systemd/system/prometheus-node-exporter.service.d/hardening.conf
ProtectClock=false
SystemCallFilter=@system-service adjtimex clock_adjtime clock_gettime

This allows the timex collector to function while keeping all other hardening in place. The systemd-analyze security exposure score remains under 2.0.


5. Ansible Role: grafana

The role handles everything from database creation to dashboard provisioning.

Role structure

roles/grafana/
├── defaults/main.yml              # all configurable variables
├── handlers/main.yml              # restart handler
├── tasks/main.yml                 # install, configure, provision, verify
├── templates/
│   ├── grafana.ini.j2             # main config with PG backend
│   ├── datasource.yml.j2          # Prometheus datasource provisioning
│   └── dashboard-provider.yml.j2  # file-based dashboard provisioning
└── files/dashboards/
    ├── chatmail-eijo-im.json      # ChatMail dashboard
    └── infra-eijo-im.json         # Infrastructure dashboard

What the role does

  1. Creates the grafana database on the PostgreSQL primary (delegated to groups['primary'][0])
  2. Installs Grafana from the official APT repository
  3. Deploys grafana.ini with PostgreSQL backend, anonymous viewer access, and the configured domain
  4. Provisions the Prometheus datasource via file (/etc/grafana/provisioning/datasources/prometheus.yml) — no API calls needed
  5. Provisions dashboards via file provider (/var/lib/grafana/dashboards/)
  6. Verifies Grafana is healthy, the datasource is reachable, and both dashboards are loaded
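Step 1 could be expressed roughly like this (a sketch, not the role's literal task; assumes the community.postgresql collection is available):

```yaml
- name: Create grafana database on the PostgreSQL primary
  community.postgresql.postgresql_db:
    name: "{{ grafana_db_name }}"
    owner: "{{ grafana_db_user }}"
    state: present
  delegate_to: "{{ groups['primary'][0] }}"
  become: true
  become_user: postgres
```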

Key defaults

grafana_domain: stats.eijo.im
grafana_db_type: postgres
grafana_db_host: "[fd53:105:1000::10]:5432"
grafana_db_name: grafana
grafana_db_user: grafana
grafana_db_password: "{{ vault_grafana_db_password }}"
grafana_prometheus_url: "http://localhost:9090"
grafana_anonymous_enabled: true
grafana_anonymous_role: Viewer
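
These defaults feed the file-based datasource provisioning (step 4 above). A minimal datasource.yml.j2 using Grafana's provisioning schema might be:

```yaml
# Sketch of templates/datasource.yml.j2; field names follow Grafana's
# documented provisioning format, the exact template may differ
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: "{{ grafana_prometheus_url }}"
    isDefault: true
```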

Deployment

# Full Grafana deployment
ansible-playbook site.yml --tags grafana --limit chatmail

# Verify only
ansible-playbook site.yml --tags verify --limit chatmail

Replicating to a new node

  1. Add the node to the chatmail group in inventory.ini
  2. Set grafana_domain and credential vault vars in host_vars/
  3. Ensure Prometheus is running with the expected scrape targets
  4. Run the role — it creates the database, installs Grafana, and provisions everything

6. Nginx Reverse Proxy

Grafana listens on 127.0.0.1:3000. Nginx terminates TLS and proxies:

server {
    listen 127.0.0.1:8443 ssl;
    server_name stats.eijo.im;

    ssl_certificate /etc/letsencrypt/live/stats.eijo.im/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/stats.eijo.im/privkey.pem;

    location / {
        proxy_pass http://127.0.0.1:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}

The Upgrade and Connection headers are required for Grafana’s WebSocket live features.
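One optional refinement: deriving the Connection header from the client's Upgrade header with nginx's standard map pattern, so plain HTTP requests keep keep-alive semantics instead of always being sent Connection: upgrade. This is not required for Grafana to work; it replaces the hardcoded "upgrade" value above:

```nginx
# In the http {} context:
map $http_upgrade $connection_upgrade {
    default upgrade;
    ''      close;
}

# Then, in the location / block:
#   proxy_set_header Connection $connection_upgrade;
```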


7. Downloads

Ansible Role

Dashboards

Hardening (updated)