Observability

The platform runs a full observability stack for log aggregation and metrics collection, accessible at monitoring.kevinryan.io. All components are deployed as Flux CD HelmReleases in the observability namespace and, apart from the Promtail DaemonSet, are scheduled exclusively on the dedicated agent node (node2).

graph TD
    subgraph sites["All Pods (both nodes)"]
        app1["Site pods"]
        app2["Flux controllers"]
        app3["Umami"]
    end

    subgraph n2["node2 — Observability"]
        promtail["Promtail<br/>(DaemonSet)"]
        loki["Loki<br/>(log storage)"]
        vmagent["VMAgent<br/>(metric scraper)"]
        vmsingle["VMSingle<br/>(metric storage)"]
        grafana["Grafana<br/>(dashboards)"]
        nodexp["Node Exporter"]
        ksm["Kube State Metrics"]
    end

    promtail_n1["Promtail<br/>(node1 DaemonSet)"]

    pg["Azure PostgreSQL<br/>(grafana_db)"]
    kv["Azure Key Vault"]
    user["User"]

    app1 & app2 & app3 -.->|stdout/stderr| promtail_n1
    promtail_n1 -->|push| loki
    promtail -->|push| loki
    nodexp & ksm -->|metrics| vmagent
    vmagent -->|write| vmsingle
    loki --> grafana
    vmsingle --> grafana
    kv -.->|secrets| grafana
    grafana -->|state| pg
    user -->|HTTPS| grafana

| Component | Chart | Version Range | Purpose |
| --- | --- | --- | --- |
| Grafana | grafana (community) | >=11.0.0 <12.0.0 | Dashboard UI and query engine |
| Loki | loki (grafana) | >=6.0.0 <7.0.0 | Log aggregation and storage |
| Promtail | promtail (grafana) | >=6.0.0 <7.0.0 | Log collection agent (DaemonSet) |
| VictoriaMetrics | victoria-metrics-k8s-stack | >=0.70.0 <1.0.0 | Metrics collection, storage, and alerting rules |

All charts are pinned to semver ranges and reconciled hourly by Flux. Install and upgrade failures retry up to 5 times automatically.
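To make the version ranges concrete, here is a minimal Python sketch of how a range such as `>=11.0.0 <12.0.0` selects a chart version. It is an illustration only, not Flux's resolver: it handles just the `>=` and `<` bounds used above, ignores pre-release tags, and the list of published versions is hypothetical.

```python
# Simplified illustration: supports only '>=' and '<' bounds on plain
# MAJOR.MINOR.PATCH versions. Pre-release tags and other operators are
# out of scope for this sketch.

def parse_version(text):
    """Turn '11.4.0' into the comparable tuple (11, 4, 0)."""
    return tuple(int(part) for part in text.split("."))


def in_range(version, constraint):
    """Check a version against space-separated '>=' and '<' bounds."""
    candidate = parse_version(version)
    for bound in constraint.split():
        if bound.startswith(">="):
            if candidate < parse_version(bound[2:]):
                return False
        elif bound.startswith("<"):
            if candidate >= parse_version(bound[1:]):
                return False
        else:
            raise ValueError(f"unsupported bound: {bound}")
    return True


def pick_latest(available, constraint):
    """Return the highest version that satisfies the constraint, or None."""
    matching = [v for v in available if in_range(v, constraint)]
    return max(matching, key=parse_version) if matching else None


# Hypothetical chart versions, not taken from the real repository index.
published = ["10.4.8", "11.0.0", "11.6.2", "12.0.1"]
selected = pick_latest(published, ">=11.0.0 <12.0.0")
print(selected)
```

With these sample versions the range admits minor and patch upgrades within major version 11 but never crosses into 12.x, which is the point of bounding each chart at the next major version.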

Three HelmRepository resources provide chart sources:

| Repository | URL | Charts |
| --- | --- | --- |
| grafana | https://grafana.github.io/helm-charts | Loki, Promtail |
| grafana-community | https://grafana-community.github.io/helm-charts | Grafana |
| victoriametrics | https://victoriametrics.github.io/helm-charts/ | victoria-metrics-k8s-stack |

All observability workloads are isolated on node2 using Kubernetes taints and node selectors. Node2 is configured at K3s install time with:

--node-taint observability=true:NoSchedule --node-label role=observability

Every HelmRelease values block includes:

nodeSelector:
  role: observability
tolerations:
  - key: observability
    operator: Equal
    value: "true"
    effect: NoSchedule

This ensures site workloads on node1 and observability workloads on node2 never compete for resources. The only exception is Promtail, which runs as a DaemonSet on both nodes to collect logs from all pods.
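The interaction between the taint, the node selector, and the toleration can be sketched as a small Python model. This is a deliberately simplified approximation of the scheduler's rules (only `NoSchedule` taints and exact-match node selectors are modelled), and the pod definitions are hypothetical stand-ins for the real workloads.

```python
# Simplified model of the scheduling rules described above. Only
# NoSchedule taints and exact-match nodeSelector labels are considered.

def tolerates(toleration, taint):
    """Return True if a single toleration matches a single taint."""
    effect = toleration.get("effect")
    if effect and effect != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        key = toleration.get("key")
        return not key or key == taint["key"]
    return (toleration.get("key") == taint["key"]
            and toleration.get("value") == taint.get("value"))


def can_schedule(pod, node):
    """Check nodeSelector labels first, then every NoSchedule taint."""
    selector = pod.get("nodeSelector", {})
    if any(node["labels"].get(k) != v for k, v in selector.items()):
        return False
    tolerations = pod.get("tolerations", [])
    return all(
        any(tolerates(t, taint) for t in tolerations)
        for taint in node["taints"]
        if taint["effect"] == "NoSchedule"
    )


node1 = {"labels": {}, "taints": []}
node2 = {
    "labels": {"role": "observability"},
    "taints": [{"key": "observability", "value": "true", "effect": "NoSchedule"}],
}

toleration = {"key": "observability", "operator": "Equal",
              "value": "true", "effect": "NoSchedule"}

site_pod = {}  # no selector, no toleration
obs_pod = {"nodeSelector": {"role": "observability"}, "tolerations": [toleration]}
promtail_pod = {"tolerations": [toleration]}  # toleration only, no selector

for name, pod in [("site", site_pod), ("observability", obs_pod), ("promtail", promtail_pod)]:
    print(name, can_schedule(pod, node1), can_schedule(pod, node2))
```

In this model the taint keeps site pods off node2, the node selector keeps observability pods off node1, and a pod with the toleration but no selector (the Promtail case) is admitted on both nodes.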

Grafana provides the dashboard interface at monitoring.kevinryan.io.

| Setting | Value |
| --- | --- |
| Replicas | 1 |
| Root URL | https://monitoring.kevinryan.io |
| Login form | Enabled |
| Sign-up | Disabled |
| Service | ClusterIP on port 80 |

Grafana stores its state (dashboards, users, preferences) in the grafana_db PostgreSQL database on the Azure Flexible Server. Database credentials are sourced from Azure Key Vault via an ExternalSecret:

apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: grafana-db
  namespace: observability
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: azure-keyvault
  target:
    name: grafana-db
    template:
      data:
        GF_DATABASE_TYPE: "postgres"
        GF_DATABASE_HOST: "{{ .pg_fqdn }}:5432"
        GF_DATABASE_NAME: "grafana_db"
        GF_DATABASE_USER: "{{ .pg_admin_username }}"
        GF_DATABASE_PASSWORD: "{{ .pg_admin_password }}"
        GF_DATABASE_SSL_MODE: "require"
        GF_SECURITY_ADMIN_PASSWORD: "{{ .grafana_admin_password }}"
        admin-user: "admin"

The admin password is also stored in this secret, keeping all Grafana credentials in a single Kubernetes secret managed by the External Secrets Operator.

Grafana is pre-configured with two datasources:

| Datasource | Type | Internal URL | Default |
| --- | --- | --- | --- |
| Loki | loki | http://loki.observability.svc.cluster.local:3100 | Yes |
| VictoriaMetrics | prometheus | http://vmsingle-vm.observability.svc.cluster.local:8428 | No |

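VictoriaMetrics uses the prometheus datasource type because it implements the Prometheus querying API. Its query language, MetricsQL, is designed to be backwards-compatible with PromQL, so PromQL queries generally work unchanged.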

Grafana runs a sidecar container that watches for ConfigMaps with the label grafana_dashboard: "1" in the observability namespace. Any ConfigMap with this label is automatically loaded as a Grafana dashboard — no manual import required.
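As a rough illustration of the label convention, the following Python sketch builds a ConfigMap manifest carrying the `grafana_dashboard: "1"` label and shows the kind of label check the sidecar relies on. The ConfigMap name `dashboard-example` and the dashboard body are hypothetical, and the check is a simplification rather than the sidecar's actual code.

```python
import json

# Label the sidecar watches for, as described above.
DASHBOARD_LABEL = ("grafana_dashboard", "1")


def make_dashboard_configmap(name, dashboard):
    """Wrap a dashboard definition (a dict) in a labelled ConfigMap manifest."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": name,
            "namespace": "observability",
            "labels": {DASHBOARD_LABEL[0]: DASHBOARD_LABEL[1]},
        },
        # The dashboard JSON is stored as a string under a single data key.
        "data": {f"{name}.json": json.dumps(dashboard)},
    }


def is_dashboard(configmap):
    """Simplified version of the label match used to pick up dashboards."""
    labels = configmap.get("metadata", {}).get("labels", {})
    return labels.get(DASHBOARD_LABEL[0]) == DASHBOARD_LABEL[1]


# Hypothetical dashboard content; a real dashboard would define panels.
example = make_dashboard_configmap("dashboard-example", {"title": "Example", "panels": []})
print(json.dumps(example, indent=2))
```

Note that the label value is the string "1", not the integer 1; Kubernetes label values are always strings, which is why the manifests quote it.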

apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: grafana
  namespace: observability
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`monitoring.kevinryan.io`)
      kind: Rule
      services:
        - name: grafana
          port: 80
  tls: {}

Loki is the log aggregation engine. It receives logs from Promtail, indexes them, and serves queries to Grafana.

Loki runs in SingleBinary mode — all components (ingester, querier, compactor) run in a single process. This is the simplest deployment topology, appropriate for the platform’s log volume. All distributed-mode components are explicitly disabled:

deploymentMode: SingleBinary
singleBinary:
  replicas: 1
backend:
  replicas: 0
read:
  replicas: 0
write:
  replicas: 0
# ... all other components set to 0

| Setting | Value |
| --- | --- |
| Storage type | Filesystem (local-path PV) |
| Volume size | 10Gi |
| Schema | TSDB v13 |
| Index period | 24 hours |
| Retention | 744 hours (31 days) |

Loki stores both index and chunk data on a persistent volume provisioned by the K3s local-path storage provisioner. No object store (S3, MinIO) is required.
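As a quick check of the retention arithmetic, this short Python snippet confirms that 744 hours is exactly 31 days, and that the retention window therefore spans 31 of the 24-hour index periods.

```python
from datetime import timedelta

retention = timedelta(hours=744)      # Loki retention setting
index_period = timedelta(hours=24)    # Loki index period setting

# Dividing one timedelta by another with // yields a whole number.
retention_days = retention // timedelta(days=1)
index_periods_retained = retention // index_period

print(retention_days, index_periods_retained)
```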

To keep the deployment minimal, several optional components are turned off:

| Component | Why Disabled |
| --- | --- |
| Gateway | Not needed; Promtail pushes directly to Loki |
| MinIO | Filesystem storage is used instead |
| Chunks/Results cache | Single-binary mode handles caching internally |
| Loki Canary | Synthetic log testing not needed at this scale |
| Bloom filters | Advanced query optimisation not needed |

Promtail is the log collection agent. It runs as a DaemonSet on every node in the cluster, tailing container logs from the node’s filesystem and pushing them to Loki.

config:
  clients:
    - url: http://loki.observability.svc.cluster.local:3100/loki/api/v1/push
tolerations:
  - key: observability
    operator: Equal
    value: "true"
    effect: NoSchedule

The toleration is needed so Promtail can schedule on node2 (which has the observability taint). As a DaemonSet, it also runs on node1 without any additional configuration.

graph LR
    subgraph node1
        pods1["Pods"] -->|stdout/stderr| journal1["/var/log/pods/"]
        pt1["Promtail"] -->|tail| journal1
    end

    subgraph node2
        pods2["Pods"] -->|stdout/stderr| journal2["/var/log/pods/"]
        pt2["Promtail"] -->|tail| journal2
    end

    pt1 & pt2 -->|HTTP push| loki["Loki<br/>:3100"]
    loki --> pv["10Gi PV"]

Promtail automatically discovers pods, attaches Kubernetes labels (namespace, pod name, container name) to each log line, and pushes to Loki’s HTTP API. Logs from all seven sites, Flux controllers, Umami, and the observability stack itself are collected.

The nginx containers across all sites emit JSON-formatted access logs:

{
  "time": "2026-03-13T10:00:00+00:00",
  "remote_addr": "10.0.1.1",
  "request": "GET / HTTP/1.1",
  "status": 200,
  "body_bytes_sent": 4523,
  "request_time": "0.001",
  "http_user_agent": "Mozilla/5.0 ..."
}

This structured format enables Loki LogQL queries to extract fields for filtering and aggregation in Grafana dashboards.
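To show what the structured format makes possible, here is a small Python sketch that parses the sample log line above and extracts fields suitable for filtering and aggregation. It runs outside Loki and only approximates what a JSON parsing stage does; the function names are illustrative.

```python
import json
from collections import Counter

# The sample access log entry from above, as a single line.
SAMPLE = (
    '{"time": "2026-03-13T10:00:00+00:00", "remote_addr": "10.0.1.1", '
    '"request": "GET / HTTP/1.1", "status": 200, "body_bytes_sent": 4523, '
    '"request_time": "0.001", "http_user_agent": "Mozilla/5.0 ..."}'
)


def parse_access_log(line):
    """Extract typed fields from one JSON access log line."""
    entry = json.loads(line)
    method, path, _protocol = entry["request"].split(" ", 2)
    return {
        "method": method,
        "path": path,
        "status": int(entry["status"]),
        "bytes": int(entry["body_bytes_sent"]),
        # request_time is emitted as a string, so convert it explicitly.
        "seconds": float(entry["request_time"]),
    }


def status_classes(lines):
    """Count requests per status class (2xx, 4xx, 5xx, ...)."""
    counts = Counter()
    for line in lines:
        status = parse_access_log(line)["status"]
        counts[f"{status // 100}xx"] += 1
    return counts


parsed = parse_access_log(SAMPLE)
print(parsed)
print(status_classes([SAMPLE]))
```

One detail worth noticing is that `request_time` is a quoted string in the log while `status` and `body_bytes_sent` are numbers, so any numeric comparison on request latency needs a conversion step.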

VictoriaMetrics provides metrics collection and storage as a lightweight, Prometheus-compatible alternative to a standard Prometheus deployment. It is deployed via the victoria-metrics-k8s-stack chart, which bundles multiple components.

| Component | Role |
| --- | --- |
| VictoriaMetrics Operator | Manages VMSingle, VMAgent, and scrape configs |
| VMSingle | Single-node metrics storage (Prometheus-compatible TSDB) |
| VMAgent | Metrics scraper (replaces Prometheus server for scraping) |
| Node Exporter | Exposes host-level metrics (CPU, memory, disk, network) |
| Kube State Metrics | Exposes Kubernetes object metrics (pod status, deployment replicas, etc.) |
| Kubelet | Kubelet metrics collection |

| Component | Why Disabled |
| --- | --- |
| Grafana | Deployed separately with its own HelmRelease |
| Alertmanager | No alerting configured |
| VMAlert | No alerting rules active |
| VMAuth | No multi-tenant authentication needed |
| VMCluster | Single-node mode (VMSingle) is sufficient |
| kubeControllerManager, kubeScheduler, kubeEtcd, kubeProxy | Not accessible in K3s (embedded in the K3s binary) |

| Setting | Value |
| --- | --- |
| Retention | 31 days |
| Volume | 10Gi PersistentVolumeClaim (ReadWriteOnce) |
| Scrape interval | 30 seconds |

The VictoriaMetrics Operator’s Prometheus converter is enabled (disable_prometheus_converter: false). This means any ServiceMonitor or PodMonitor CRDs in the cluster are automatically converted to VictoriaMetrics scrape configs — maintaining compatibility with the Prometheus ecosystem.

graph LR
    nodexp["Node Exporter"] --> vmagent["VMAgent"]
    ksm["Kube State<br/>Metrics"] --> vmagent
    kubelet["Kubelet"] --> vmagent
    vmagent -->|remote write| vmsingle["VMSingle<br/>(10Gi PV, 31d retention)"]
    vmsingle -->|PromQL| grafana["Grafana"]

Two custom Grafana dashboards are deployed as ConfigMaps with the grafana_dashboard: "1" label, automatically loaded by the Grafana sidecar.

| Panel | Type | Data |
| --- | --- | --- |
| Reconciliation Activity | Time series (stacked bars) | Count of reconciliation events per controller |
| Flux Errors and Warnings | Time series (line) | Count of error/warn/failed events per controller |
| Reconciliation Events | Logs | Filtered log stream showing reconcile, apply, create, delete, drift, error events |
| All Flux Logs | Logs | Unfiltered log stream from Flux controllers |

Includes a $controller template variable to filter by kustomize-controller, helm-controller, or source-controller.

| Panel | Type | Data |
| --- | --- | --- |
| Log Volume by Namespace | Time series (stacked bars) | Count of log lines per namespace over time |
| Error Rate | Time series (line, red) | Total count of error/fatal/panic log lines |
| Error Rate by Namespace | Time series (line) | Error count broken down by namespace |
| Recent Errors | Logs | Filtered log stream showing only error/fatal/panic entries |

Includes a $namespace template variable (dynamically populated from Loki labels) to filter by any namespace in the cluster.
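The error panels boil down to counting log lines that mention error, fatal, or panic, either in total or per namespace. The Python sketch below mimics that aggregation on a few made-up log lines. The dashboards' actual queries are not shown in this document, so the case-insensitive pattern and the sample namespaces are assumptions.

```python
import re
from collections import Counter

# Assumed matching rule: any line containing one of these words,
# case-insensitively. The real dashboard queries may differ.
ERROR_PATTERN = re.compile(r"error|fatal|panic", re.IGNORECASE)


def error_counts(entries):
    """Count matching lines per namespace from (namespace, line) pairs."""
    counts = Counter()
    for namespace, line in entries:
        if ERROR_PATTERN.search(line):
            counts[namespace] += 1
    return counts


# Made-up log lines for illustration only.
sample = [
    ("flux-system", "reconciliation finished"),
    ("flux-system", "error: chart not found"),
    ("observability", "level=info msg=started"),
    ("sites", "PANIC: out of memory"),
]

by_namespace = error_counts(sample)   # "Error Rate by Namespace"
total = sum(by_namespace.values())    # "Error Rate"
print(by_namespace, total)
```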

Grafana credentials are managed via the External Secrets Operator. Four Key Vault secrets are composed into the grafana-db Kubernetes secret:

| Key Vault Secret | Kubernetes Key | Purpose |
| --- | --- | --- |
| pg-fqdn | GF_DATABASE_HOST | PostgreSQL server address |
| pg-admin-username | GF_DATABASE_USER | Database login |
| pg-admin-password | GF_DATABASE_PASSWORD | Database password |
| grafana-admin-password | GF_SECURITY_ADMIN_PASSWORD | Grafana admin UI password |

Secrets refresh hourly. Loki, Promtail, and VictoriaMetrics do not require external secrets — they have no credentials or external state.

The observability Kustomization depends on the External Secrets store being available:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: observability
  namespace: flux-system
spec:
  dependsOn:
    - name: external-secrets-store
  interval: 10m0s
  path: ./k8s/observability
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system

This ensures the ClusterSecretStore is ready before Grafana’s ExternalSecret is created.
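The ordering that dependsOn implies can be illustrated with a topological sort. This Python sketch uses the standard library's `graphlib` (Python 3.9+) and only the two Kustomization names mentioned above. It shows the resulting order, not how Flux implements readiness checks.

```python
from graphlib import TopologicalSorter

# Each key maps a Kustomization to the set of Kustomizations it depends on.
depends_on = {
    "observability": {"external-secrets-store"},
    "external-secrets-store": set(),
}

# static_order() yields every node after all of its dependencies.
order = list(TopologicalSorter(depends_on).static_order())
print(order)  # ['external-secrets-store', 'observability']
```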

The monitoring.kevinryan.io A record is managed by Terraform in the root module:

resource "cloudflare_record" "monitoring" {
  zone_id = var.cloudflare_zone_id
  name    = "monitoring"
  content = module.network.public_ip_address
  type    = "A"
  proxied = true
  ttl     = 1
}

Traffic is proxied through Cloudflare, providing CDN caching for static dashboard assets and DDoS protection for the Grafana API.

All 12 files in k8s/observability/:

| File | Resource Type | Purpose |
| --- | --- | --- |
| namespace.yaml | Namespace | observability namespace |
| helmrepository-grafana.yaml | HelmRepository | Loki + Promtail charts |
| helmrepository-grafana-community.yaml | HelmRepository | Grafana chart |
| helmrepository-victoriametrics.yaml | HelmRepository | VictoriaMetrics chart |
| helmrelease-grafana.yaml | HelmRelease | Grafana deployment |
| helmrelease-loki.yaml | HelmRelease | Loki deployment |
| helmrelease-promtail.yaml | HelmRelease | Promtail DaemonSet |
| helmrelease-victoria-metrics.yaml | HelmRelease | VictoriaMetrics stack |
| externalsecret.yaml | ExternalSecret | Grafana DB + admin credentials |
| ingress.yaml | IngressRoute | monitoring.kevinryan.io routing |
| dashboard-flux-cd.yaml | ConfigMap | Flux CD Grafana dashboard |
| dashboard-platform-overview.yaml | ConfigMap | Platform overview Grafana dashboard |