Grafana instance: stats.eijo.im
ChatMail relay: lca-01 (eijo.im)
Ansible role: roles/grafana/
Prometheus: localhost:9090 on lca-01
Database backend: PostgreSQL 18 on buf-01 over WireGuard mesh
1. Architecture
The monitoring stack on the ChatMail relay (lca-01) runs entirely on the node itself, with one exception — Grafana’s database lives on the PostgreSQL primary (buf-01) over the WireGuard mesh.
┌─────────────────────────────────────────────────────┐
│ lca-01 (eijo.im) │
│ │
│ postfix/dovecot ──► mtail ──► :3903/metrics │
│ node_exporter ──────────────► :9100/metrics │
│ nginx_exporter ─────────────► :9113/metrics │
│ postfix_exporter ───────────► :9154/metrics │
│ │
│ prometheus (localhost:9090) │
│ └─ scrapes all four exporters │
│ │
│ grafana (127.0.0.1:3000) │
│ ├─ datasource: prometheus @ localhost:9090 │
│ ├─ database: postgresql @ fd53:105:1000::10:5432 │
│ └─ nginx reverse proxy → stats.eijo.im:443 │
└─────────────────────────────────────────────────────┘

Why PostgreSQL instead of SQLite?
Grafana defaults to SQLite, but Grafana 12’s unified storage layer produces a "No last resource version found, starting from scratch" log entry every 30 seconds with SQLite. Switching to PostgreSQL eliminates this entirely and gives proper transactional semantics. Since the mesh already runs a replicated PostgreSQL cluster, the grafana database is created on the primary and Grafana connects over the WireGuard mesh (fd53:105:1000::10).
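A minimal sketch of the `[database]` section that `grafana.ini.j2` presumably renders (the key names are standard Grafana configuration options; the values mirror the role defaults listed later in this document, and `ssl_mode = disable` is an assumption justified by WireGuard already encrypting mesh traffic):

```ini
[database]
type = postgres
host = [fd53:105:1000::10]:5432
name = grafana
user = grafana
password = "{{ vault_grafana_db_password }}"
# WireGuard already encrypts traffic on the mesh, so TLS between
# Grafana and PostgreSQL is arguably redundant here.
ssl_mode = disable
```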
2. Prometheus Scrape Targets
Prometheus scrapes four jobs, all on localhost:
| Job | Endpoint | What it collects |
|---|---|---|
| chatmail-mtail | 127.0.0.1:3903 | Mail delivery, DKIM, encryption, accounts, Postfix errors |
| node | localhost:9100 | CPU, memory, disk, network, load |
| nginx | localhost:9113 | Connections, request rate |
| postfix | localhost:9154 | SMTP connects, queue size |
All jobs attach an `alias: eijo.im` label via relabeling, which the dashboards use for filtering.
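A sketch of what one scrape job's relabeling might look like (the job name and target come from the table above; the exact `prometheus.yml` on lca-01 may differ):

```yaml
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ["localhost:9100"]
    relabel_configs:
      # Attach the alias label that every dashboard query filters on.
      # With no source_labels, the replacement is applied unconditionally.
      - target_label: alias
        replacement: eijo.im
```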
3. Dashboards
3.1 ChatMail — eijo.im
UID: chatmail-eijo-im
Monitors mail flow using mtail metrics parsed from Postfix logs.
| Panel | Type | Query |
|---|---|---|
| Total Accounts Created | stat | created_accounts{prog="delivered_mail.mtail", alias="eijo.im"} |
| Total Delivered Mail | stat | delivered_mail{prog="delivered_mail.mtail", alias="eijo.im"} |
| Mail Delivery Rate | timeseries | rate(delivered_mail{...}[5m]) |
| Encryption Rate | timeseries | outgoing encrypted, incoming encrypted, rejected unencrypted |
| DKIM Signing Rate | timeseries | rate(dkim_signed{...}[5m]) |
| Quota Exceeded Rate | timeseries | rate(quota_exceeded{...}[5m]) |
| Warning Rate | timeseries | rate(warning_count{...}[5m]) |
| Postfix Error Rate | timeseries | timeouts + noqueue rates |
| Cumulative panels | timeseries | delivered mail, accounts created, DKIM signed |
| Account Creation Rate | timeseries | all, CI, non-CI accounts |
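The `{...}` selectors in the rate panels abbreviate the same label set shown in the stat panels. Written out, the Mail Delivery Rate query would read:

```promql
rate(delivered_mail{prog="delivered_mail.mtail", alias="eijo.im"}[5m])
```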
All queries use hardcoded label selectors rather than Grafana template variables. Grafana 12 has a bug where null datasource references on template-variable-based panels cause “No data” even when the underlying Prometheus queries return results. Hardcoded selectors bypass this entirely.
3.2 Infrastructure — eijo.im
UID: infra-eijo-im
System health monitoring using node_exporter, nginx_exporter, and fail2ban metrics.
| Panel | Type | Query |
|---|---|---|
| CPU Usage | stat | 100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 |
| Memory Usage | stat | (1 - MemAvailable / MemTotal) * 100 |
| Disk Usage | stat | (1 - avail / size) * 100 for / |
| Uptime (days) | stat | (time() - node_boot_time_seconds) / 86400 |
| CPU Usage % | timeseries | same as stat, over time |
| Memory Usage | timeseries | total vs used |
| Disk I/O | timeseries | read/write bytes rate |
| Network Traffic | timeseries | receive/transmit bytes rate |
| Nginx Connections | timeseries | nginx_connections_active |
| Nginx Request Rate | timeseries | rate(nginx_http_requests_total[5m]) |
| Fail2ban Banned IPs | timeseries | fail2ban_banned_total |
| Load Average | timeseries | 1m, 5m, 15m |
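The stat-panel formulas above are plain arithmetic over node_exporter gauges. A small sketch of the same calculations in Python, with hypothetical sample values standing in for live metrics:

```python
# Hypothetical sample values standing in for node_exporter metrics.
idle_rate = 0.92           # avg rate(node_cpu_seconds_total{mode="idle"}[5m])
mem_available = 6.0e9      # node_memory_MemAvailable_bytes
mem_total = 8.0e9          # node_memory_MemTotal_bytes
fs_avail = 30.0e9          # node_filesystem_avail_bytes for /
fs_size = 50.0e9           # node_filesystem_size_bytes for /
boot_time = 1_700_000_000  # node_boot_time_seconds
now = 1_700_864_000        # time()

cpu_pct = 100 - idle_rate * 100                  # CPU Usage stat
mem_pct = (1 - mem_available / mem_total) * 100  # Memory Usage stat
disk_pct = (1 - fs_avail / fs_size) * 100        # Disk Usage stat
uptime_days = (now - boot_time) / 86400          # Uptime (days) stat

print(round(cpu_pct, 1), round(mem_pct, 1),
      round(disk_pct, 1), round(uptime_days, 1))  # → 8.0 25.0 40.0 10.0
```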
4. Systemd Hardening: The Node Exporter Pitfall
The chatmail_hardening role deploys systemd drop-in overrides that restrict services with seccomp filters and namespace isolation. This caused the node exporter to crash-loop:
Symptom: prometheus-node-exporter starts, runs for ~30 seconds, gets killed by SIGSYS (signal 31), restarts. Prometheus has zero node_* metrics.
Root cause: Two settings in the hardening drop-in:
- `ProtectClock=true` — prevents clock-related syscalls
- `SystemCallFilter=@system-service` — does not include `adjtimex`
The node exporter’s timex collector calls adjtimex to read clock synchronization status. The kernel’s seccomp filter kills the process when it attempts a blocked syscall.
Fix:
# /etc/systemd/system/prometheus-node-exporter.service.d/hardening.conf
[Service]
ProtectClock=false
SystemCallFilter=@system-service adjtimex clock_adjtime clock_gettime
This allows the timex collector to function while keeping all other hardening in place. The security exposure score remains under 2.0.
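Assuming standard systemd tooling, applying and checking the fix might look like this (`systemd-analyze security` is where the exposure score quoted above comes from):

```shell
# Reload unit files and restart the exporter after editing the drop-in
systemctl daemon-reload
systemctl restart prometheus-node-exporter

# Confirm the exporter is no longer being killed by SIGSYS
journalctl -u prometheus-node-exporter --since "-5 min" | grep -i sigsys \
  || echo "no SIGSYS kills"

# Recompute the exposure score (should stay under 2.0)
systemd-analyze security prometheus-node-exporter
```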
5. Ansible Role: grafana
The role handles everything from database creation to dashboard provisioning.
Role structure
roles/grafana/
├── defaults/main.yml # all configurable variables
├── handlers/main.yml # restart handler
├── tasks/main.yml # install, configure, provision, verify
├── templates/
│ ├── grafana.ini.j2 # main config with PG backend
│ ├── datasource.yml.j2 # Prometheus datasource provisioning
│ └── dashboard-provider.yml.j2 # file-based dashboard provisioning
└── files/dashboards/
├── chatmail-eijo-im.json # ChatMail dashboard
    └── infra-eijo-im.json    # Infrastructure dashboard

What the role does
- Creates the `grafana` database on the PostgreSQL primary (delegated to `groups['primary'][0]`)
- Installs Grafana from the official APT repository
- Deploys `grafana.ini` with the PostgreSQL backend, anonymous viewer access, and the configured domain
- Provisions the Prometheus datasource via file (`/etc/grafana/provisioning/datasources/prometheus.yml`) — no API calls needed
- Provisions dashboards via the file provider (`/var/lib/grafana/dashboards/`)
- Verifies Grafana is healthy, the datasource is reachable, and both dashboards are loaded
Key defaults
grafana_domain: stats.eijo.im
grafana_db_type: postgres
grafana_db_host: "[fd53:105:1000::10]:5432"
grafana_db_name: grafana
grafana_db_user: grafana
grafana_db_password: "{{ vault_grafana_db_password }}"
grafana_prometheus_url: "http://localhost:9090"
grafana_anonymous_enabled: true
grafana_anonymous_role: Viewer

Deployment
# Full Grafana deployment
ansible-playbook site.yml --tags grafana --limit chatmail
# Verify only
ansible-playbook site.yml --tags verify --limit chatmail

Replicating to a new node
- Add the node to the `chatmail` group in `inventory.ini`
- Set `grafana_domain` and credential vault vars in `host_vars/`
- Ensure Prometheus is running with the expected scrape targets
- Run the role — it creates the database, installs Grafana, and provisions everything
6. Nginx Reverse Proxy
Grafana listens on 127.0.0.1:3000. Nginx terminates TLS and proxies:
server {
listen 127.0.0.1:8443 ssl;
server_name stats.eijo.im;
ssl_certificate /etc/letsencrypt/live/stats.eijo.im/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/stats.eijo.im/privkey.pem;
location / {
proxy_pass http://127.0.0.1:3000;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
}
}
The Upgrade and Connection headers are required for Grafana’s WebSocket live features.
7. Downloads
Ansible Role
- grafana — defaults/main.yml
- grafana — tasks/main.yml
- grafana — handlers/main.yml
- grafana — grafana.ini.j2
- grafana — datasource.yml.j2
- grafana — dashboard-provider.yml.j2