Grafana instance: stats.eijo.im
ChatMail relay: lca-01 (eijo.im)
Ansible role: roles/grafana/
Prometheus: localhost:9090 on lca-01
Database backend: PostgreSQL 18 on buf-01 over WireGuard mesh
1. Architecture
The monitoring stack on the ChatMail relay (lca-01) runs entirely on the node itself, with one exception — Grafana’s database lives on the PostgreSQL primary (buf-01) over the WireGuard mesh.
┌─────────────────────────────────────────────────────┐
│ lca-01 (eijo.im) │
│ │
│ postfix/dovecot ──► mtail ──► :3903/metrics │
│ node_exporter ──────────────► :9100/metrics │
│ nginx_exporter ─────────────► :9113/metrics │
│ postfix_exporter ───────────► :9154/metrics │
│ │
│ prometheus (localhost:9090) │
│ └─ scrapes all four exporters │
│ │
│ grafana (127.0.0.1:3000) │
│ ├─ datasource: prometheus @ localhost:9090 │
│ ├─ database: postgresql @ fd53:105:1000::10:5432 │
│ └─ nginx reverse proxy → stats.eijo.im:443 │
└─────────────────────────────────────────────────────┘

Why PostgreSQL instead of SQLite?
Grafana defaults to SQLite, but Grafana 12’s unified storage layer produces a "No last resource version found, starting from scratch" log entry every 30 seconds with SQLite. Switching to PostgreSQL eliminates this entirely and gives proper transactional semantics. Since the mesh already runs a replicated PostgreSQL cluster, the grafana database is created on the primary and Grafana connects over the WireGuard mesh (fd53:105:1000::10).
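A minimal sketch of the `[database]` section that `grafana.ini.j2` presumably renders (the key names are standard Grafana configuration options; the values mirror the role defaults listed later in this document, and `ssl_mode = disable` is an assumption justified by WireGuard already encrypting mesh traffic):

```ini
[database]
type = postgres
host = [fd53:105:1000::10]:5432
name = grafana
user = grafana
password = "{{ vault_grafana_db_password }}"
# WireGuard already encrypts traffic on the mesh, so TLS between
# Grafana and PostgreSQL is arguably redundant here.
ssl_mode = disable
```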
2. Prometheus Scrape Targets
Prometheus scrapes four jobs, all on localhost:
| Job | Endpoint | What it collects |
|---|---|---|
| chatmail-mtail | 127.0.0.1:3903 | Mail delivery, DKIM, encryption, accounts, Postfix errors |
| node | localhost:9100 | CPU, memory, disk, network, load |
| nginx | localhost:9113 | Connections, request rate |
| postfix | localhost:9154 | SMTP connects, queue size |
All jobs attach an `alias: eijo.im` label via relabeling, which the dashboards use for filtering.
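A sketch of what one scrape job's relabeling might look like (the job name and target come from the table above; the exact `prometheus.yml` on lca-01 may differ):

```yaml
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ["localhost:9100"]
    relabel_configs:
      # Attach the alias label that every dashboard query filters on.
      # With no source_labels, the replacement is applied unconditionally.
      - target_label: alias
        replacement: eijo.im
```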
3. Dashboards
3.1 ChatMail — eijo.im
UID: chatmail-eijo-im
Monitors mail flow using mtail metrics parsed from Postfix logs.
| Panel | Type | Query |
|---|---|---|
| Total Accounts Created | stat | created_accounts{prog="delivered_mail.mtail", alias="eijo.im"} |
| Total Delivered Mail | stat | delivered_mail{prog="delivered_mail.mtail", alias="eijo.im"} |
| Mail Delivery Rate | timeseries | rate(delivered_mail{...}[5m]) |
| Encryption Rate | timeseries | outgoing encrypted, incoming encrypted, rejected unencrypted |
| DKIM Signing Rate | timeseries | rate(dkim_signed{...}[5m]) |
| Quota Exceeded Rate | timeseries | rate(quota_exceeded{...}[5m]) |
| Warning Rate | timeseries | rate(warning_count{...}[5m]) |
| Postfix Error Rate | timeseries | timeouts + noqueue rates |
| Cumulative panels | timeseries | delivered mail, accounts created, DKIM signed |
| Account Creation Rate | timeseries | all, CI, non-CI accounts |
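The `{...}` selectors in the rate panels abbreviate the same label set shown in the stat panels. Written out, the Mail Delivery Rate query would read:

```promql
rate(delivered_mail{prog="delivered_mail.mtail", alias="eijo.im"}[5m])
```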
All queries use hardcoded label selectors rather than Grafana template variables. Grafana 12 has a bug where null datasource references on template-variable-based panels cause “No data” even when the underlying Prometheus queries return results. Hardcoded selectors bypass this entirely.
3.2 Infrastructure — eijo.im
UID: infra-eijo-im
System health monitoring using node_exporter, nginx_exporter, and fail2ban metrics.
| Panel | Type | Query |
|---|---|---|
| CPU Usage | stat | 100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 |
| Memory Usage | stat | (1 - MemAvailable / MemTotal) * 100 |
| Disk Usage | stat | (1 - avail / size) * 100 for / |
| Uptime (days) | stat | (time() - node_boot_time_seconds) / 86400 |
| CPU Usage % | timeseries | same as stat, over time |
| Memory Usage | timeseries | total vs used |
| Disk I/O | timeseries | read/write bytes rate |
| Network Traffic | timeseries | receive/transmit bytes rate |
| Nginx Connections | timeseries | nginx_connections_active |
| Nginx Request Rate | timeseries | rate(nginx_http_requests_total[5m]) |
| Fail2ban Banned IPs | timeseries | fail2ban_banned_total |
| Load Average | timeseries | 1m, 5m, 15m |
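The stat-panel formulas above are plain arithmetic over node_exporter gauges. A small sketch of the same calculations in Python, with hypothetical sample values standing in for live metrics:

```python
# Hypothetical sample values standing in for node_exporter metrics.
idle_rate = 0.92           # avg rate(node_cpu_seconds_total{mode="idle"}[5m])
mem_available = 6.0e9      # node_memory_MemAvailable_bytes
mem_total = 8.0e9          # node_memory_MemTotal_bytes
fs_avail = 30.0e9          # node_filesystem_avail_bytes for /
fs_size = 50.0e9           # node_filesystem_size_bytes for /
boot_time = 1_700_000_000  # node_boot_time_seconds
now = 1_700_864_000        # time()

cpu_pct = 100 - idle_rate * 100                  # CPU Usage stat
mem_pct = (1 - mem_available / mem_total) * 100  # Memory Usage stat
disk_pct = (1 - fs_avail / fs_size) * 100        # Disk Usage stat
uptime_days = (now - boot_time) / 86400          # Uptime (days) stat

print(round(cpu_pct, 1), round(mem_pct, 1),
      round(disk_pct, 1), round(uptime_days, 1))  # → 8.0 25.0 40.0 10.0
```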
4. Systemd Hardening: The Node Exporter Pitfall
The chatmail_hardening role deploys systemd drop-in overrides that restrict services with seccomp filters and namespace isolation. This caused the node exporter to crash-loop:
Symptom: prometheus-node-exporter starts, runs for ~30 seconds, gets killed by SIGSYS (signal 31), restarts. Prometheus has zero node_* metrics.
Root cause: Two settings in the hardening drop-in:
- `ProtectClock=true` — prevents clock-related syscalls
- `SystemCallFilter=@system-service` — does not include `adjtimex`
The node exporter’s timex collector calls adjtimex to read clock synchronization status. The kernel’s seccomp filter kills the process when it attempts a blocked syscall.
Fix:
# /etc/systemd/system/prometheus-node-exporter.service.d/hardening.conf
[Service]
ProtectClock=false
SystemCallFilter=@system-service adjtimex clock_adjtime clock_gettime
This allows the timex collector to function while keeping all other hardening in place. The security exposure score remains under 2.0.
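Assuming standard systemd tooling, applying and checking the fix might look like this (`systemd-analyze security` is where the exposure score quoted above comes from):

```shell
# Reload unit files and restart the exporter after editing the drop-in
systemctl daemon-reload
systemctl restart prometheus-node-exporter

# Confirm the exporter is no longer being killed by SIGSYS
journalctl -u prometheus-node-exporter --since "-5 min" | grep -i sigsys \
  || echo "no SIGSYS kills"

# Recompute the exposure score (should stay under 2.0)
systemd-analyze security prometheus-node-exporter
```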
5. Ansible Role: grafana
The role handles everything from database creation to dashboard provisioning.
Role structure
roles/grafana/
├── defaults/main.yml # all configurable variables
├── handlers/main.yml # restart handler
├── tasks/main.yml # install, configure, provision, verify
├── templates/
│ ├── grafana.ini.j2 # main config with PG backend
│ ├── datasource.yml.j2 # Prometheus datasource provisioning
│ └── dashboard-provider.yml.j2 # file-based dashboard provisioning
└── files/dashboards/
├── chatmail-eijo-im.json # ChatMail dashboard
    └── infra-eijo-im.json    # Infrastructure dashboard

What the role does
- Creates the `grafana` database on the PostgreSQL primary (delegated to `groups['primary'][0]`)
- Installs Grafana from the official APT repository
- Deploys `grafana.ini` with the PostgreSQL backend, anonymous viewer access, and the configured domain
- Provisions the Prometheus datasource via file (`/etc/grafana/provisioning/datasources/prometheus.yml`) — no API calls needed
- Provisions dashboards via the file provider (`/var/lib/grafana/dashboards/`)
- Verifies Grafana is healthy, the datasource is reachable, and both dashboards are loaded
Key defaults
grafana_domain: stats.eijo.im
grafana_db_type: postgres
grafana_db_host: "[fd53:105:1000::10]:5432"
grafana_db_name: grafana
grafana_db_user: grafana
grafana_db_password: "{{ vault_grafana_db_password }}"
grafana_prometheus_url: "http://localhost:9090"
grafana_anonymous_enabled: true
grafana_anonymous_role: Viewer

Deployment
# Full Grafana deployment
ansible-playbook site.yml --tags grafana --limit chatmail
# Verify only
ansible-playbook site.yml --tags verify --limit chatmail

Replicating to a new node
- Add the node to the `chatmail` group in `inventory.ini`
- Set `grafana_domain` and credential vault vars in `host_vars/`
- Ensure Prometheus is running with the expected scrape targets
- Run the role — it creates the database, installs Grafana, and provisions everything
6. Nginx Reverse Proxy
Grafana listens on 127.0.0.1:3000. Nginx terminates TLS and proxies:
server {
listen 127.0.0.1:8443 ssl;
server_name stats.eijo.im;
ssl_certificate /etc/letsencrypt/live/stats.eijo.im/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/stats.eijo.im/privkey.pem;
location / {
proxy_pass http://127.0.0.1:3000;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
}
}
The Upgrade and Connection headers are required for Grafana’s WebSocket live features.
7. Downloads
Ansible Role
- grafana — defaults/main.yml
- grafana — tasks/main.yml
- grafana — handlers/main.yml
- grafana — grafana.ini.j2
- grafana — datasource.yml.j2
- grafana — dashboard-provider.yml.j2