Spec 0006: Grafana Dashboards via GitOps
- Save this spec to
.sdd/specification/spec-0006-grafana-dashboards.mdin the repo. - Implement all changes described below.
- After completing all work, create a provenance record at
.sdd/provenance/spec-0006-grafana-dashboards.provenance.md. See the Provenance Record section for the required format.
Prerequisites
Section titled “Prerequisites”- Spec 0005 deployed: Grafana, Loki, and Promtail running in the
observabilitynamespace on node2 - Read ADR-006 (
docs/adr/adr-006-observability-grafana-loki-promtail.md) — the architectural decision that established the observability stack
Context
Section titled “Context”Grafana is deployed but has no dashboards. The only datasource is Loki (logs). Prometheus is not yet installed — dashboards must use LogQL queries only.
The Grafana community Helm chart includes a sidecar container that watches for ConfigMaps with a specific label and auto-provisions them as dashboards. This is the standard GitOps mechanism for dashboard management: dashboard JSON is stored in ConfigMap manifests in Git, Flux applies them, and the sidecar loads them into Grafana without manual intervention.
Current state (read these files before making changes)
Section titled “Current state (read these files before making changes)”| File / Directory | What it does |
|---|---|
k8s/observability/helmrelease-grafana.yaml | Grafana HelmRelease — needs sidecar config added |
k8s/observability/helmrelease-loki.yaml | Loki HelmRelease — SingleBinary mode, 744h retention |
k8s/observability/helmrelease-promtail.yaml | Promtail DaemonSet on all nodes |
k8s/flux-system/observability-sync.yaml | Flux Kustomization for k8s/observability/ — already syncs this directory |
Key facts
Section titled “Key facts”- Datasource: Loki only (no Prometheus) — all panels must use LogQL
- Loki datasource name:
Loki(configured inhelmrelease-grafana.yamlas the default datasource) - Sidecar label:
grafana_dashboard: "1"(Grafana chart default) - Dashboard namespace:
observability(same as Grafana) - Promtail labels available:
namespace,pod,container,node_name,stream(stdout/stderr) - Workloads generating logs:
- 7 static sites (nginx access/error logs): kevinryan-io, brand-kevinryan-io, aiimmigrants-com, specmcp-ai, sddbook-com, distributedequity-org, docs-kevinryan-io
- Umami (Node.js application logs)
- Flux controllers (kustomize-controller, helm-controller, source-controller — structured JSON logs)
- External Secrets Operator
- K3s system pods (coredns, traefik, metrics-server, local-path-provisioner)
- Observability stack itself (Loki, Promtail, Grafana)
1. Update Grafana HelmRelease — enable dashboard sidecar
Section titled “1. Update Grafana HelmRelease — enable dashboard sidecar”Modify k8s/observability/helmrelease-grafana.yaml to explicitly enable the dashboard sidecar in the values section. Add the following block under values:
sidecar: dashboards: enabled: true label: grafana_dashboard labelValue: "1" searchNamespace: observability folderAnnotation: grafana_folderDesign notes:
- The chart enables the sidecar by default, but making it explicit documents the label convention and avoids surprises if chart defaults change.
searchNamespace: observabilityrestricts the sidecar to only watch ConfigMaps in the Grafana namespace. This is sufficient because all dashboard ConfigMaps live ink8s/observability/.folderAnnotation: grafana_folderallows future dashboards to specify a folder via annotation.
2. Create Platform Overview dashboard ConfigMap
Section titled “2. Create Platform Overview dashboard ConfigMap”Create k8s/observability/dashboard-platform-overview.yaml:
apiVersion: v1kind: ConfigMapmetadata: name: dashboard-platform-overview namespace: observability labels: grafana_dashboard: "1"data: platform-overview.json: | { "annotations": { "list": [] }, "editable": false, "fiscalYearStartMonth": 0, "graphTooltip": 1, "links": [], "panels": [ { "title": "Log Volume by Namespace", "type": "timeseries", "gridPos": { "h": 8, "w": 24, "x": 0, "y": 0 }, "datasource": { "type": "loki", "uid": "" }, "targets": [ { "expr": "sum by (namespace) (count_over_time({namespace=~\"$namespace\"} [$__interval]))", "legendFormat": "{{ namespace }}", "refId": "A" } ], "fieldConfig": { "defaults": { "custom": { "drawStyle": "bars", "fillOpacity": 80, "stacking": { "mode": "normal" } }, "unit": "short" }, "overrides": [] }, "options": { "tooltip": { "mode": "multi" }, "legend": { "displayMode": "table", "placement": "right", "calcs": ["sum"] } } }, { "title": "Error Rate", "type": "timeseries", "gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 }, "datasource": { "type": "loki", "uid": "" }, "targets": [ { "expr": "sum(count_over_time({namespace=~\"$namespace\"} |~ \"(?i)(error|fatal|panic)\" [$__interval]))", "legendFormat": "Errors", "refId": "A" } ], "fieldConfig": { "defaults": { "custom": { "drawStyle": "line", "fillOpacity": 20, "lineWidth": 2 }, "color": { "mode": "fixed", "fixedColor": "red" }, "unit": "short" }, "overrides": [] }, "options": { "tooltip": { "mode": "single" }, "legend": { "displayMode": "list", "placement": "bottom" } } }, { "title": "Error Rate by Namespace", "type": "timeseries", "gridPos": { "h": 8, "w": 12, "x": 12, "y": 8 }, "datasource": { "type": "loki", "uid": "" }, "targets": [ { "expr": "sum by (namespace) (count_over_time({namespace=~\"$namespace\"} |~ \"(?i)(error|fatal|panic)\" [$__interval]))", "legendFormat": "{{ namespace }}", "refId": "A" } ], "fieldConfig": { "defaults": { "custom": { "drawStyle": "line", "fillOpacity": 10, "lineWidth": 2 }, "unit": "short" }, "overrides": [] }, "options": { "tooltip": { "mode": "multi" }, "legend": { "displayMode": "table", "placement": "right", "calcs": ["sum"] } } }, { "title": "Recent Errors", "type": "logs", "gridPos": { "h": 10, "w": 24, "x": 0, "y": 16 }, "datasource": { "type": "loki", "uid": "" }, "targets": [ { "expr": "{namespace=~\"$namespace\"} |~ \"(?i)(error|fatal|panic)\"", "refId": "A" } ], "options": { "showTime": true, "showLabels": true, "showCommonLabels": false, "wrapLogMessage": true, "prettifyLogMessage": false, "enableLogDetails": true, "sortOrder": "Descending", "dedupStrategy": "none" } } ], "schemaVersion": 39, "tags": ["platform", "overview", "loki"], "templating": { "list": [ { "name": "namespace", "type": "query", "datasource": { "type": "loki", "uid": "" }, "query": { "label": "namespace", "stream": "", "type": 1 }, "current": { "text": "All", "value": "$__all" }, "includeAll": true, "allValue": ".+", "multi": true, "refresh": 2, "sort": 1 } ] }, "time": { "from": "now-6h", "to": "now" }, "title": "Platform Overview", "uid": "platform-overview" }Design notes:
"editable": false— dashboards are managed by Git, not by clicking in the UI. Manual edits would be overwritten on next Flux reconciliation."uid": "platform-overview"— stable UID so Grafana updates the dashboard in place rather than creating duplicates.- The
namespacetemplate variable uses a Loki label values query, includes “All” option with regex.+, and supports multi-select. - Datasource UID is left as
""(empty string) — Grafana resolves this to the default datasource, which is Loki. - Four panels: log volume (stacked bars), total error rate (red line), error rate by namespace (multi-line), and recent errors (logs table).
- Error detection regex
(?i)(error|fatal|panic)is case-insensitive and catches the most common severity indicators across nginx, Node.js, and Go log formats.
3. Create Flux CD dashboard ConfigMap
Section titled “3. Create Flux CD dashboard ConfigMap”Create k8s/observability/dashboard-flux-cd.yaml:
apiVersion: v1kind: ConfigMapmetadata: name: dashboard-flux-cd namespace: observability labels: grafana_dashboard: "1"data: flux-cd.json: | { "annotations": { "list": [] }, "editable": false, "fiscalYearStartMonth": 0, "graphTooltip": 1, "links": [], "panels": [ { "title": "Reconciliation Activity", "type": "timeseries", "gridPos": { "h": 6, "w": 24, "x": 0, "y": 0 }, "datasource": { "type": "loki", "uid": "" }, "targets": [ { "expr": "sum by (container) (count_over_time({namespace=\"flux-system\", container=~\"$controller\"} |~ \"(?i)reconcil\" [$__interval]))", "legendFormat": "{{ container }}", "refId": "A" } ], "fieldConfig": { "defaults": { "custom": { "drawStyle": "bars", "fillOpacity": 60, "stacking": { "mode": "normal" } }, "unit": "short" }, "overrides": [] }, "options": { "tooltip": { "mode": "multi" }, "legend": { "displayMode": "list", "placement": "bottom" } } }, { "title": "Flux Errors and Warnings", "type": "timeseries", "gridPos": { "h": 6, "w": 24, "x": 0, "y": 6 }, "datasource": { "type": "loki", "uid": "" }, "targets": [ { "expr": "sum by (container) (count_over_time({namespace=\"flux-system\", container=~\"$controller\"} |~ \"(?i)(error|warn|failed)\" [$__interval]))", "legendFormat": "{{ container }}", "refId": "A" } ], "fieldConfig": { "defaults": { "custom": { "drawStyle": "line", "fillOpacity": 20, "lineWidth": 2, "pointSize": 5 }, "color": { "mode": "palette-classic" }, "unit": "short" }, "overrides": [] }, "options": { "tooltip": { "mode": "multi" }, "legend": { "displayMode": "list", "placement": "bottom" } } }, { "title": "Reconciliation Events", "type": "logs", "gridPos": { "h": 10, "w": 24, "x": 0, "y": 12 }, "datasource": { "type": "loki", "uid": "" }, "targets": [ { "expr": "{namespace=\"flux-system\", container=~\"$controller\"} |~ \"(?i)(reconcil|applied|created|deleted|updated|drift|error|warn|failed)\"", "refId": "A" } ], "options": { "showTime": true, "showLabels": true, "showCommonLabels": false, "wrapLogMessage": true, "prettifyLogMessage": false, "enableLogDetails": true, "sortOrder": "Descending", "dedupStrategy": "none" } }, { "title": "All Flux Logs", "type": "logs", "gridPos": { "h": 10, "w": 24, "x": 0, "y": 22 }, "datasource": { "type": "loki", "uid": "" }, "targets": [ { "expr": "{namespace=\"flux-system\", container=~\"$controller\"}", "refId": "A" } ], "options": { "showTime": true, "showLabels": true, "showCommonLabels": false, "wrapLogMessage": true, "prettifyLogMessage": false, "enableLogDetails": true, "sortOrder": "Descending", "dedupStrategy": "none" } } ], "schemaVersion": 39, "tags": ["flux", "gitops", "loki"], "templating": { "list": [ { "name": "controller", "type": "custom", "current": { "text": "All", "value": "$__all" }, "includeAll": true, "allValue": "kustomize-controller|helm-controller|source-controller", "options": [ { "text": "All", "value": "$__all", "selected": true }, { "text": "kustomize-controller", "value": "kustomize-controller", "selected": false }, { "text": "helm-controller", "value": "helm-controller", "selected": false }, { "text": "source-controller", "value": "source-controller", "selected": false } ], "query": "kustomize-controller,helm-controller,source-controller", "multi": true } ] }, "time": { "from": "now-6h", "to": "now" }, "title": "Flux CD", "uid": "flux-cd" }Design notes:
controllertemplate variable is a static custom list (not a query) because the three Flux controllers are fixed and known.allValueuses pipe-separated regex so the LogQLcontainer=~"$controller"filter works correctly when “All” is selected.- Four panels: reconciliation activity (stacked bars), errors/warnings (line chart), filtered reconciliation events (logs), and unfiltered full logs (for debugging).
- Reconciliation event filter includes
reconcil|applied|created|deleted|updated|drift|error|warn|failedto surface the most operationally relevant log lines.
No Terraform changes
Section titled “No Terraform changes”DNS, secrets, and infrastructure are all already in place from Spec 0005. No changes to infra/.
No Flux sync changes
Section titled “No Flux sync changes”Dashboard ConfigMaps live in k8s/observability/ which is already synced by k8s/flux-system/observability-sync.yaml. No new sync resource needed.
Manual steps (not performed by the agent)
Section titled “Manual steps (not performed by the agent)”After merging to main, Flux will reconcile automatically (within 10 minutes) or trigger manually:
az vm run-command invoke \ --resource-group rg-kevinryan-io \ --name vm-kevinryan-node1 \ --command-id RunShellScript \ --scripts "export KUBECONFIG=/etc/rancher/k3s/k3s.yaml && flux reconcile kustomization flux-system --with-source"Wait 2-3 minutes for the Grafana pod to restart (sidecar config change triggers rollout), then verify:
az vm run-command invoke \ --resource-group rg-kevinryan-io \ --name vm-kevinryan-node1 \ --command-id RunShellScript \ --scripts "export KUBECONFIG=/etc/rancher/k3s/k3s.yaml && kubectl get configmap -n observability -l grafana_dashboard=1"Should show dashboard-platform-overview and dashboard-flux-cd.
Then visit https://monitoring.kevinryan.io and verify both dashboards appear in the dashboard list.
Provenance Record
Section titled “Provenance Record”After completing the work, create .sdd/provenance/spec-0006-grafana-dashboards.provenance.md using the provenance template at .sdd/provenance/template.md.
Validation steps
Section titled “Validation steps”After completing all work, confirm:
- This spec has been saved to
.sdd/specification/spec-0006-grafana-dashboards.md k8s/observability/helmrelease-grafana.yamlincludessidecar.dashboards.enabled: truewithlabel: grafana_dashboard,labelValue: "1", andsearchNamespace: observabilityk8s/observability/dashboard-platform-overview.yamlexists as a ConfigMap with labelgrafana_dashboard: "1"and contains valid dashboard JSON with"uid": "platform-overview"- The Platform Overview dashboard JSON includes: namespace template variable, log volume by namespace panel, error rate panel, error rate by namespace panel, and recent errors logs panel
k8s/observability/dashboard-flux-cd.yamlexists as a ConfigMap with labelgrafana_dashboard: "1"and contains valid dashboard JSON with"uid": "flux-cd"- The Flux CD dashboard JSON includes: controller template variable (kustomize-controller, helm-controller, source-controller), reconciliation activity panel, errors/warnings panel, reconciliation events logs panel, and all logs panel
- All dashboard JSON has
"editable": false - No Terraform files were modified
- No new Flux sync resources were created (existing
observability-sync.yamlcoversk8s/observability/) pnpm lintpasses (markdownlint on spec and provenance files)- The provenance record exists at
.sdd/provenance/spec-0006-grafana-dashboards.provenance.mdand contains all required sections - All files (spec, ConfigMaps, HelmRelease change, provenance) are committed together