# Spec 0005: Observability Stack (Grafana + Loki + Promtail)
- Save this spec to `.sdd/specification/spec-0005-observability-stack.md` in the repo (create the `.sdd/specification/` directory if it does not exist).
- Implement all Terraform and Kubernetes manifest changes described below.
- After completing all work, create a provenance record at `.sdd/provenance/spec-0005-observability-stack.provenance.md` (create the `.sdd/provenance/` directory if it does not exist). See the Provenance Record section for the required format.
## Prerequisites

- Spec 0002 deployed: PostgreSQL Flexible Server running with the `grafana_db` database created
- Spec 0003 deployed: ESO running, ClusterSecretStore `azure-keyvault` is `Valid` and `Ready`
- Read ADR-006 (`docs/adr/adr-006-observability-grafana-loki-promtail.md`) — the architectural decision this spec implements
## Context

ADR-006 mandates Grafana + Loki + Promtail as the observability stack. Grafana is the dashboard UI, Loki is the log store, and Promtail ships container logs to Loki. Grafana uses `grafana_db` on the shared PostgreSQL instance for state storage. All three components are deployed via Helm charts through Flux.
Node2 has taint `observability=true:NoSchedule` and label `role=observability` — Grafana and Loki should schedule there. Promtail runs as a DaemonSet on ALL nodes (it needs to collect logs from both node1 and node2).
## Current state (read these files before making changes)

| File / Directory | What it does |
|---|---|
| `k8s/flux-system/kustomization.yaml` | Lists all Flux sync resources |
| `k8s/flux-system/external-secrets-sync.yaml` | Pattern for HelmRelease-based Flux sync |
| `k8s/external-secrets/helmrelease.yaml` | Pattern for HelmRelease spec |
| `k8s/umami/externalsecret.yaml` | Pattern for ExternalSecret with templating |
| `infra/main.tf` | Root Terraform module |
| `infra/modules/cloudflare/main.tf` | Cloudflare module — subdomains get cache rules, standalone records don’t |
## Key facts

- Grafana Helm chart: `https://grafana-community.github.io/helm-charts` (migrated from grafana.github.io as of Jan 2026), chart name `grafana`
- Loki Helm chart: `https://grafana.github.io/helm-charts`, chart name `loki`
- Promtail Helm chart: `https://grafana.github.io/helm-charts`, chart name `promtail`
- Grafana DB: `grafana_db` on `psql-kevinryan-io.postgres.database.azure.com` (already created in Spec 0002)
- Key Vault secrets available: `pg-admin-password`, `pg-fqdn`, `pg-admin-username`
- Node2 taint: `observability=true:NoSchedule`
- Node2 label: `role=observability`
- Subdomain: `monitoring.kevinryan.io`
## 1. Terraform changes (small)

### Add DNS record for monitoring.kevinryan.io

Add to `infra/main.tf` (after the existing `cloudflare_record.analytics` block):

```hcl
resource "cloudflare_record" "monitoring" {
  zone_id = var.cloudflare_zone_id
  name    = "monitoring"
  content = module.network.public_ip_address
  type    = "A"
  proxied = true
  ttl     = 1
}
```

Why standalone: same reason as analytics — Grafana serves dynamic content that must not be cached by the Cloudflare module’s aggressive cache rule.
### Add Grafana admin password to Key Vault

Add to `infra/main.tf` (after the existing `azurerm_key_vault_secret.umami_app_secret` block):

```hcl
resource "random_password" "grafana_admin_password" {
  length  = 32
  special = false
}

resource "azurerm_key_vault_secret" "grafana_admin_password" {
  name         = "grafana-admin-password"
  value        = random_password.grafana_admin_password.result
  key_vault_id = module.keyvault.key_vault_id
}
```

Note: `special = false` — learned from Spec 0004 that special characters in passwords cause URL/config parsing issues.
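As a hedged illustration of what `special = false` yields, this sketch (not part of the spec, purely local) generates a comparable 32-character alphanumeric password:

```shell
# Illustration only: mirror random_password with special = false by drawing
# 32 characters from [A-Za-z0-9]. LC_ALL=C keeps tr byte-oriented.
pw=$(LC_ALL=C tr -dc 'A-Za-z0-9' < /dev/urandom | head -c 32)
echo "$pw"
echo "${#pw}"  # length is 32
```

Because the result contains no shell or URL metacharacters, it can be embedded in connection strings without escaping.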
### No other Terraform changes

No changes to variables, outputs, or modules. The `pgcrypto` extension is already allowlisted from Spec 0004.
## 2. Kubernetes manifests

Create `k8s/observability/` with the following files:
### namespace.yaml

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: observability
```

### helmrepository-grafana.yaml

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: grafana
  namespace: observability
spec:
  interval: 1h
  url: https://grafana.github.io/helm-charts
```

### helmrepository-grafana-community.yaml

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: grafana-community
  namespace: observability
spec:
  interval: 1h
  url: https://grafana-community.github.io/helm-charts
```

### helmrelease-loki.yaml

Loki deployed in SingleBinary mode (appropriate for a single-node log store):

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: loki
  namespace: observability
spec:
  interval: 1h
  chart:
    spec:
      chart: loki
      version: ">=6.0.0 <7.0.0"
      sourceRef:
        kind: HelmRepository
        name: grafana
        namespace: observability
      interval: 1h
  install:
    crds: CreateReplace
    remediation:
      retries: 5
  upgrade:
    crds: CreateReplace
    remediation:
      retries: 5
  values:
    deploymentMode: SingleBinary
    loki:
      auth_enabled: false
      commonConfig:
        replication_factor: 1
      schemaConfig:
        configs:
          - from: "2024-01-01"
            store: tsdb
            object_store: filesystem
            schema: v13
            index:
              prefix: loki_index_
              period: 24h
      storage:
        type: filesystem
      limits_config:
        retention_period: 744h
    singleBinary:
      replicas: 1
      nodeSelector:
        role: observability
      tolerations:
        - key: observability
          operator: Equal
          value: "true"
          effect: NoSchedule
      persistence:
        enabled: true
        size: 10Gi
    backend:
      replicas: 0
    read:
      replicas: 0
    write:
      replicas: 0
    ingester:
      replicas: 0
    querier:
      replicas: 0
    queryFrontend:
      replicas: 0
    queryScheduler:
      replicas: 0
    distributor:
      replicas: 0
    compactor:
      replicas: 0
    indexGateway:
      replicas: 0
    bloomCompactor:
      replicas: 0
    bloomGateway:
      replicas: 0
    gateway:
      enabled: false
    minio:
      enabled: false
    chunksCache:
      enabled: false
    resultsCache:
      enabled: false
    lokiCanary:
      enabled: false
    test:
      enabled: false
```

Design notes:

- `SingleBinary` mode runs all Loki components in one pod — appropriate for this cluster size.
- All other deployment modes zeroed out to avoid validation errors.
- `auth_enabled: false` — single-tenant mode, no need for multi-tenancy.
- `retention_period: 744h` (31 days) per ADR-006.
- `filesystem` storage — sufficient for a single-node setup. Can migrate to Azure Blob later if needed.
- `nodeSelector` and `tolerations` schedule Loki on node2.
- `gateway`, `minio`, caches, canary, and tests disabled — not needed for SingleBinary.
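The retention figure can be sanity-checked with quick arithmetic (illustration only):

```shell
# 744h of log retention expressed in days: 744 / 24 = 31,
# i.e. one full 31-day month, matching the ADR-006 requirement.
hours=744
days=$((hours / 24))
echo "$days"  # prints 31
```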
### helmrelease-promtail.yaml

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: promtail
  namespace: observability
spec:
  interval: 1h
  chart:
    spec:
      chart: promtail
      version: ">=6.0.0 <7.0.0"
      sourceRef:
        kind: HelmRepository
        name: grafana
        namespace: observability
      interval: 1h
  install:
    remediation:
      retries: 5
  upgrade:
    remediation:
      retries: 5
  values:
    config:
      clients:
        - url: http://loki.observability.svc.cluster.local:3100/loki/api/v1/push
    tolerations:
      - key: observability
        operator: Equal
        value: "true"
        effect: NoSchedule
```

Design notes:

- Promtail is a DaemonSet by default — it runs on ALL nodes (both node1 and node2) to collect logs from every pod.
- The `tolerations` allow Promtail to also schedule on node2 (which has the observability taint). It already schedules on node1 by default.
- Pushes logs to Loki’s in-cluster service endpoint.
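The push URL follows the standard Kubernetes in-cluster service DNS pattern `<service>.<namespace>.svc.cluster.local`; this sketch (values taken from this spec) shows how it is composed:

```shell
# Compose the Loki push endpoint from its service name, namespace, and port.
# 3100 is Loki's HTTP listen port; /loki/api/v1/push is the ingest path.
service=loki
namespace=observability
port=3100
url="http://${service}.${namespace}.svc.cluster.local:${port}/loki/api/v1/push"
echo "$url"
```

If the Loki release name or namespace ever changes, the Promtail `clients` URL must change in lockstep.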
### externalsecret.yaml

```yaml
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: grafana-db
  namespace: observability
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: azure-keyvault
  target:
    name: grafana-db
    creationPolicy: Owner
    template:
      engineVersion: v2
      data:
        GF_DATABASE_TYPE: "postgres"
        GF_DATABASE_HOST: "{{ .pg_fqdn }}:5432"
        GF_DATABASE_NAME: "grafana_db"
        GF_DATABASE_USER: "{{ .pg_admin_username }}"
        GF_DATABASE_PASSWORD: "{{ .pg_admin_password }}"
        GF_DATABASE_SSL_MODE: "require"
        GF_SECURITY_ADMIN_PASSWORD: "{{ .grafana_admin_password }}"
  data:
    - secretKey: pg_fqdn
      remoteRef:
        key: pg-fqdn
    - secretKey: pg_admin_username
      remoteRef:
        key: pg-admin-username
    - secretKey: pg_admin_password
      remoteRef:
        key: pg-admin-password
    - secretKey: grafana_admin_password
      remoteRef:
        key: grafana-admin-password
```

Design notes:

- Grafana reads `GF_DATABASE_*` environment variables for PostgreSQL backend configuration.
- `GF_SECURITY_ADMIN_PASSWORD` sets the admin password from Key Vault (not a default password).
- The resulting K8s Secret `grafana-db` is consumed by the Grafana HelmRelease via `envFromSecret`.
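To make the templating concrete, here is a sketch of how `GF_DATABASE_HOST` is composed, using the `pg-fqdn` value listed under Key facts (the real value at runtime comes from Key Vault):

```shell
# The ExternalSecret template renders "{{ .pg_fqdn }}:5432", i.e. it appends
# the PostgreSQL port to the FQDN secret so Grafana gets host:port in one var.
pg_fqdn="psql-kevinryan-io.postgres.database.azure.com"  # from Key Vault secret pg-fqdn
GF_DATABASE_HOST="${pg_fqdn}:5432"
echo "$GF_DATABASE_HOST"
```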
### helmrelease-grafana.yaml

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: grafana
  namespace: observability
spec:
  interval: 1h
  chart:
    spec:
      chart: grafana
      version: ">=11.0.0 <12.0.0"
      sourceRef:
        kind: HelmRepository
        name: grafana-community
        namespace: observability
      interval: 1h
  install:
    remediation:
      retries: 5
  upgrade:
    remediation:
      retries: 5
  values:
    replicas: 1
    nodeSelector:
      role: observability
    tolerations:
      - key: observability
        operator: Equal
        value: "true"
        effect: NoSchedule
    envFromSecret: grafana-db
    grafana.ini:
      server:
        root_url: "https://monitoring.kevinryan.io"
      auth:
        disable_login_form: false
      users:
        allow_sign_up: false
    datasources:
      datasources.yaml:
        apiVersion: 1
        datasources:
          - name: Loki
            type: loki
            access: proxy
            url: http://loki.observability.svc.cluster.local:3100
            isDefault: true
    service:
      type: ClusterIP
      port: 80
```

Design notes:

- `envFromSecret: grafana-db` injects all `GF_*` env vars from the ExternalSecret.
- Loki datasource is pre-configured — Grafana can query logs immediately after deploy.
- `root_url` set for correct URL generation behind Cloudflare proxy.
- Schedules on node2 via `nodeSelector` and `tolerations`.
- The Grafana chart defaults to port 3000 internally; the service exposes it on port 80.
### ingress.yaml

```yaml
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: grafana
  namespace: observability
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`monitoring.kevinryan.io`)
      kind: Rule
      services:
        - name: grafana
          port: 80
  tls: {}
```

## 3. Flux sync
Create `k8s/flux-system/observability-sync.yaml`:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: observability
  namespace: flux-system
spec:
  dependsOn:
    - name: external-secrets-store
  interval: 10m0s
  path: ./k8s/observability
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
```

Why `dependsOn: external-secrets-store`: the ExternalSecret references the ClusterSecretStore. Same pattern as the Umami sync.
## 4. Update k8s/flux-system/kustomization.yaml

Add `observability-sync.yaml` to the resources list (after `umami-sync.yaml`).
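As a sketch of the resulting file shape (assumed contents: only the sync entries this repo's specs have mentioned are shown; the real file may list more resources), built here in a temp path for illustration:

```shell
# Write an example of the updated kustomization.yaml, then confirm the new
# entry is present. The real file lives at k8s/flux-system/kustomization.yaml.
cat > /tmp/kustomization-example.yaml <<'EOF'
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - external-secrets-sync.yaml
  - umami-sync.yaml
  - observability-sync.yaml
EOF
grep -n "observability-sync.yaml" /tmp/kustomization-example.yaml
```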
## Manual steps (not performed by the agent)

### Terraform apply (before or after merge)

```shell
cd infra
terraform plan  # Expect: 1 random_password + 1 KV secret + 1 Cloudflare record = 3 new resources
terraform apply
```

Verify:

```shell
az keyvault secret list --vault-name kv-kevinryan-io --query "[].name" -o tsv
# Should include: grafana-admin-password

nslookup monitoring.kevinryan.io
# Should resolve via Cloudflare proxy
```

### After merge to main — Flux reconciliation

```shell
az vm run-command invoke \
  --resource-group rg-kevinryan-io \
  --name vm-kevinryan-node1 \
  --command-id RunShellScript \
  --scripts "export KUBECONFIG=/etc/rancher/k3s/k3s.yaml && flux reconcile kustomization flux-system --with-source"
```

Wait 3-5 minutes for all charts to install, then verify:

```shell
az vm run-command invoke \
  --resource-group rg-kevinryan-io \
  --name vm-kevinryan-node1 \
  --command-id RunShellScript \
  --scripts "export KUBECONFIG=/etc/rancher/k3s/k3s.yaml && echo '=== HelmReleases ===' && kubectl get helmrelease -n observability && echo '=== Pods ===' && kubectl get pods -n observability -o wide && echo '=== ExternalSecret ===' && kubectl get externalsecret -n observability && echo '=== Services ===' && kubectl get svc -n observability && echo '=== IngressRoute ===' && kubectl get ingressroute -n observability"
```

Final check:

```shell
curl -k https://monitoring.kevinryan.io/api/health
```

Should return `{"commit":"...","database":"ok","version":"..."}`.
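If `jq` is not installed on the box, the `database` field can be pulled out with `sed` alone; a sketch against a mocked response of the shape above (the commit and version values are placeholders, not real):

```shell
# Mocked health response matching the documented shape; extract "database".
response='{"commit":"abc123","database":"ok","version":"11.0.0"}'
db_status=$(echo "$response" | sed -n 's/.*"database":"\([^"]*\)".*/\1/p')
echo "$db_status"  # prints ok
```

In practice, pipe the `curl` output into the same `sed` expression and check for `ok`.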
Login: `admin` / the password from `az keyvault secret show --vault-name kv-kevinryan-io --name grafana-admin-password --query value -o tsv`.
## Provenance Record

After completing the work, create `.sdd/provenance/spec-0005-observability-stack.provenance.md` with the following structure:

```markdown
# Provenance: Spec 0005 — Observability Stack

**Spec:** `.sdd/specification/spec-0005-observability-stack.md`
**Executed:** <timestamp>
**Agent:** <agent identifier if available>

## Actions Taken

Chronological list of every action performed (files created, files modified, commands run).

## Decisions Made

Any decisions the agent made during execution that were not explicitly specified in the spec. For each:

| Decision | Options Considered | Chosen | Rationale |
|----------|-------------------|--------|-----------|
| ... | ... | ... | ... |

If no autonomous decisions were required, state: "No autonomous decisions were required — all actions were explicitly specified in the spec."

## Deviations from Spec

Any points where the agent deviated from the spec, and why. If none, state: "No deviations from spec."

## Artifacts Produced

| File | Status |
|------|--------|
| ... | Created / Modified |

## Validation Results

Results of each validation step from the spec (pass/fail with details).
```

## Validation steps
After completing all work, confirm:

- This spec has been saved to `.sdd/specification/spec-0005-observability-stack.md`
- `infra/main.tf` contains `random_password.grafana_admin_password`, `azurerm_key_vault_secret.grafana_admin_password`, and `cloudflare_record.monitoring`
- `k8s/observability/` exists with exactly 8 files: `namespace.yaml`, `helmrepository-grafana.yaml`, `helmrepository-grafana-community.yaml`, `helmrelease-loki.yaml`, `helmrelease-promtail.yaml`, `helmrelease-grafana.yaml`, `externalsecret.yaml`, `ingress.yaml`
- Loki HelmRelease uses `deploymentMode: SingleBinary` with `replication_factor: 1`, filesystem storage, 744h retention, and schedules on node2 (`nodeSelector` + `tolerations`)
- All non-SingleBinary components are zeroed out (backend, read, write, ingester, querier, etc. all `replicas: 0`)
- Promtail HelmRelease pushes to `http://loki.observability.svc.cluster.local:3100/loki/api/v1/push` and tolerates the observability taint (runs on ALL nodes)
- Grafana HelmRelease uses `envFromSecret: grafana-db`, has Loki pre-configured as a datasource, `root_url` set to `https://monitoring.kevinryan.io`, and schedules on node2
- The ExternalSecret constructs `GF_DATABASE_*` env vars from Key Vault secrets and includes `GF_SECURITY_ADMIN_PASSWORD`
- The IngressRoute matches ``Host(`monitoring.kevinryan.io`)`` with the `websecure` entryPoint and `tls: {}`
- `k8s/flux-system/observability-sync.yaml` exists with `dependsOn: [{name: external-secrets-store}]`
- `k8s/flux-system/kustomization.yaml` includes `observability-sync.yaml`
- `terraform fmt -check -recursive infra/` passes
- `pnpm lint` passes
- The provenance record exists at `.sdd/provenance/spec-0005-observability-stack.provenance.md` and contains all required sections
- All files (spec, Terraform changes, K8s manifests, provenance) are committed together