Observability
The platform runs a full observability stack for log aggregation and metrics collection, accessible at monitoring.kevinryan.io. All components are deployed as Flux CD HelmReleases in the observability namespace. With the exception of the Promtail DaemonSet, which runs on every node, they are scheduled exclusively on the dedicated agent node (node2).
Stack Overview
```mermaid
graph TD
    subgraph sites["All Pods (both nodes)"]
        app1["Site pods"]
        app2["Flux controllers"]
        app3["Umami"]
    end
    subgraph n2["node2: Observability"]
        promtail["Promtail<br/>(DaemonSet)"]
        loki["Loki<br/>(log storage)"]
        vmagent["VMAgent<br/>(metric scraper)"]
        vmsingle["VMSingle<br/>(metric storage)"]
        grafana["Grafana<br/>(dashboards)"]
        nodexp["Node Exporter"]
        ksm["Kube State Metrics"]
    end
    promtail_n1["Promtail<br/>(node1 DaemonSet)"]
    pg["Azure PostgreSQL<br/>(grafana_db)"]
    kv["Azure Key Vault"]
    user["User"]
    app1 & app2 & app3 -.->|stdout/stderr| promtail_n1
    promtail_n1 -->|push| loki
    promtail -->|push| loki
    nodexp & ksm -->|metrics| vmagent
    vmagent -->|write| vmsingle
    loki --> grafana
    vmsingle --> grafana
    kv -.->|secrets| grafana
    grafana -->|state| pg
    user -->|HTTPS| grafana
```
Components
| Component | Chart | Version Range | Purpose |
|---|---|---|---|
| Grafana | grafana (community) | >=11.0.0 <12.0.0 | Dashboard UI and query engine |
| Loki | loki (grafana) | >=6.0.0 <7.0.0 | Log aggregation and storage |
| Promtail | promtail (grafana) | >=6.0.0 <7.0.0 | Log collection agent (DaemonSet) |
| VictoriaMetrics | victoria-metrics-k8s-stack | >=0.70.0 <1.0.0 | Metrics collection and storage (alerting components are disabled) |
All charts are pinned to semver ranges and reconciled hourly by Flux. Install and upgrade failures retry up to 5 times automatically.
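
A HelmRelease that follows this pattern might look like the sketch below, shown for Loki. The `helm.toolkit.fluxcd.io/v2` API version, and the mapping of "reconciled hourly" to `spec.interval: 1h`, are assumptions; the repository's actual manifest may differ.

```yaml
# Sketch only: values not stated on this page are assumptions.
apiVersion: helm.toolkit.fluxcd.io/v2   # assumed; older Flux releases use v2beta2
kind: HelmRelease
metadata:
  name: loki
  namespace: observability
spec:
  interval: 1h                      # hourly reconciliation
  chart:
    spec:
      chart: loki
      version: ">=6.0.0 <7.0.0"     # semver range from the Components table
      sourceRef:
        kind: HelmRepository
        name: grafana
  install:
    remediation:
      retries: 5                    # retry a failed install up to 5 times
  upgrade:
    remediation:
      retries: 5                    # retry a failed upgrade up to 5 times
```

With a range rather than an exact version, Flux picks the newest chart release that satisfies the constraint each time it reconciles.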
Helm Repositories
Three HelmRepository resources provide chart sources:
| Repository | URL | Charts |
|---|---|---|
| grafana | https://grafana.github.io/helm-charts | Loki, Promtail |
| grafana-community | https://grafana-community.github.io/helm-charts | Grafana |
| victoriametrics | https://victoriametrics.github.io/helm-charts/ | victoria-metrics-k8s-stack |
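
As an illustration, one of these sources could be declared as follows. The `interval` value and the placement in the observability namespace are assumptions, not taken from the repository.

```yaml
# Sketch only: interval and namespace are assumed.
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: grafana-community
  namespace: observability          # assumed to live alongside the HelmReleases
spec:
  interval: 1h                      # assumed index refresh interval
  url: https://grafana-community.github.io/helm-charts
```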
Node Scheduling
All observability workloads are isolated on node2 using Kubernetes taints and node selectors. Node2 is configured at K3s install time with:

```
--node-taint observability=true:NoSchedule --node-label role=observability
```

Every HelmRelease values block includes:

```yaml
nodeSelector:
  role: observability
tolerations:
  - key: observability
    operator: Equal
    value: "true"
    effect: NoSchedule
```

This ensures site workloads on node1 and observability workloads on node2 never compete for resources. The only exception is Promtail, which runs as a DaemonSet on both nodes to collect logs from all pods.
Grafana
Grafana provides the dashboard interface at monitoring.kevinryan.io.
Configuration
| Setting | Value |
|---|---|
| Replicas | 1 |
| Root URL | https://monitoring.kevinryan.io |
| Login form | Enabled |
| Sign-up | Disabled |
| Service | ClusterIP on port 80 |
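
Expressed as Helm values for the Grafana chart, these settings would look roughly like the fragment below. This is a sketch of how the table maps onto the chart's `grafana.ini` and `service` keys, not a copy of the actual HelmRelease.

```yaml
# Sketch only: reconstructs the table above as chart values.
replicas: 1
grafana.ini:
  server:
    root_url: https://monitoring.kevinryan.io
  auth:
    disable_login_form: false       # login form enabled
  users:
    allow_sign_up: false            # self-service sign-up disabled
service:
  type: ClusterIP
  port: 80
```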
Database
Grafana stores its state (dashboards, users, preferences) in the grafana_db PostgreSQL database on the Azure Flexible Server. Database credentials are sourced from Azure Key Vault via an ExternalSecret:

```yaml
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: grafana-db
  namespace: observability
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: azure-keyvault
  target:
    name: grafana-db
    template:
      data:
        GF_DATABASE_TYPE: "postgres"
        GF_DATABASE_HOST: "{{ .pg_fqdn }}:5432"
        GF_DATABASE_NAME: "grafana_db"
        GF_DATABASE_USER: "{{ .pg_admin_username }}"
        GF_DATABASE_PASSWORD: "{{ .pg_admin_password }}"
        GF_DATABASE_SSL_MODE: "require"
        GF_SECURITY_ADMIN_PASSWORD: "{{ .grafana_admin_password }}"
        admin-user: "admin"
```

The admin password is also stored in this secret, keeping all Grafana credentials in a single Kubernetes secret managed by the External Secrets Operator.
Datasources
Grafana is pre-configured with two datasources:
| Datasource | Type | Internal URL | Default |
|---|---|---|---|
| Loki | loki | http://loki.observability.svc.cluster.local:3100 | Yes |
| VictoriaMetrics | prometheus | http://vmsingle-vm.observability.svc.cluster.local:8428 | No |
VictoriaMetrics uses the prometheus datasource type because it implements the Prometheus querying API and accepts PromQL queries.
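
The two datasources can be provisioned through the Grafana chart's `datasources` value. The fragment below is a sketch built from the table above; the `access: proxy` setting is an assumption.

```yaml
# Sketch only: access mode is assumed; names and URLs come from the table above.
datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: Loki
        type: loki
        access: proxy               # assumed: Grafana's backend queries the service
        url: http://loki.observability.svc.cluster.local:3100
        isDefault: true
      - name: VictoriaMetrics
        type: prometheus            # Prometheus-compatible query API
        access: proxy
        url: http://vmsingle-vm.observability.svc.cluster.local:8428
        isDefault: false
```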
Dashboard Sidecar
Grafana runs a sidecar container that watches for ConfigMaps with the label grafana_dashboard: "1" in the observability namespace. Any ConfigMap with this label is automatically loaded as a Grafana dashboard, so no manual import is required.
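
A ConfigMap that the sidecar would pick up might look like this. The ConfigMap name, file name, and dashboard JSON are hypothetical placeholders; only the label and namespace come from the description above.

```yaml
# Sketch only: name and dashboard body are hypothetical.
apiVersion: v1
kind: ConfigMap
metadata:
  name: dashboard-example           # hypothetical name
  namespace: observability
  labels:
    grafana_dashboard: "1"          # label the sidecar watches for
data:
  example.json: |
    {
      "title": "Example dashboard",
      "uid": "example",
      "panels": []
    }
```

Each key under `data` becomes one dashboard file, so a real dashboard is usually exported from Grafana as JSON and pasted in as the value.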
Ingress
```yaml
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: grafana
  namespace: observability
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`monitoring.kevinryan.io`)
      kind: Rule
      services:
        - name: grafana
          port: 80
  tls: {}
```

Loki
Loki is the log aggregation engine. It receives logs from Promtail, indexes them, and serves queries to Grafana.
Deployment Mode
Loki runs in SingleBinary mode: all components (ingester, querier, compactor) run in a single process. This is the simplest deployment topology, appropriate for the platform’s log volume. All distributed-mode components are explicitly disabled:

```yaml
deploymentMode: SingleBinary
singleBinary:
  replicas: 1
backend:
  replicas: 0
read:
  replicas: 0
write:
  replicas: 0
# ... all other components set to 0
```

Storage
| Setting | Value |
|---|---|
| Storage type | Filesystem (local-path PV) |
| Volume size | 10Gi |
| Schema | TSDB v13 |
| Index period | 24 hours |
| Retention | 744 hours (31 days) |
Loki stores both index and chunk data on a persistent volume provisioned by the K3s local-path storage provisioner. No object store (S3, MinIO) is required.
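
In the Loki chart's values, the settings in the table would be expressed roughly as below. The schema start date, index prefix, and compactor settings are assumptions added to make the fragment complete; they are not stated in the source.

```yaml
# Sketch only: lines marked "assumed" are not taken from the repository.
loki:
  commonConfig:
    replication_factor: 1           # assumed: single replica, nothing to replicate to
  storage:
    type: filesystem
  schemaConfig:
    configs:
      - from: "2024-01-01"          # assumed schema start date
        store: tsdb
        object_store: filesystem
        schema: v13
        index:
          prefix: index_            # assumed prefix
          period: 24h
  limits_config:
    retention_period: 744h          # 31 days
  compactor:
    retention_enabled: true         # assumed: the compactor enforces retention
    delete_request_store: filesystem
singleBinary:
  persistence:
    size: 10Gi
```

No storage class is set here on the assumption that local-path is the cluster default, as it is on a stock K3s install.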
Disabled Features
To keep the deployment minimal, several optional components are turned off:
| Component | Why Disabled |
|---|---|
| Gateway | Not needed — Promtail pushes directly to Loki |
| MinIO | Filesystem storage is used instead |
| Chunks/Results cache | Single-binary mode handles caching internally |
| Loki Canary | Synthetic log testing not needed at this scale |
| Bloom filters | Advanced query optimisation not needed |
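
A values fragment that switches these off could look like the following. The key names follow the Loki chart's layout as commonly documented and should be checked against the chart version in use. The bloom components are left out because their key names have changed between chart releases.

```yaml
# Sketch only: verify key names against the installed chart version.
gateway:
  enabled: false
minio:
  enabled: false
chunksCache:
  enabled: false
resultsCache:
  enabled: false
lokiCanary:
  enabled: false
test:
  enabled: false                    # assumed: the chart's Helm test relies on the canary
```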
Promtail
Promtail is the log collection agent. It runs as a DaemonSet on every node in the cluster, tailing container logs from the node’s filesystem and pushing them to Loki.
Configuration
```yaml
config:
  clients:
    - url: http://loki.observability.svc.cluster.local:3100/loki/api/v1/push
tolerations:
  - key: observability
    operator: Equal
    value: "true"
    effect: NoSchedule
```

The toleration is needed so Promtail can schedule on node2 (which has the observability taint). As a DaemonSet, it also runs on node1 without any additional configuration.
Log Flow
```mermaid
graph LR
    subgraph node1
        pods1["Pods"] -->|stdout/stderr| journal1["/var/log/pods/"]
        pt1["Promtail"] -->|tail| journal1
    end
    subgraph node2
        pods2["Pods"] -->|stdout/stderr| journal2["/var/log/pods/"]
        pt2["Promtail"] -->|tail| journal2
    end
    pt1 & pt2 -->|HTTP push| loki["Loki<br/>:3100"]
    loki --> pv["10Gi PV"]
```
Promtail automatically discovers pods, attaches Kubernetes labels (namespace, pod name, container name) to each log line, and pushes to Loki’s HTTP API. Logs from all seven sites, Flux controllers, Umami, and the observability stack itself are collected.
Log Format
The nginx containers across all sites emit JSON-formatted access logs:

```json
{
  "time": "2026-03-13T10:00:00+00:00",
  "remote_addr": "10.0.1.1",
  "request": "GET / HTTP/1.1",
  "status": 200,
  "body_bytes_sent": 4523,
  "request_time": "0.001",
  "http_user_agent": "Mozilla/5.0 ..."
}
```

This structured format enables Loki LogQL queries to extract fields for filtering and aggregation in Grafana dashboards.
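
To make the field extraction concrete, here are a few LogQL queries that rely on the JSON fields shown above. They are collected in a YAML list purely for readability, and nothing in the stack reads this file. The `container="nginx"` selector is an assumption about how the site containers are named, and none of these queries is taken from the deployed dashboards.

```yaml
# Hypothetical query notes; the container name "nginx" is assumed.
queries:
  - name: server_errors             # log lines with a 5xx status
    logql: |
      {container="nginx"} | json | status >= 500
  - name: requests_by_status        # request count per status code over 5 minutes
    logql: |
      sum by (status) (count_over_time({container="nginx"} | json [5m]))
  - name: p95_request_time          # 95th percentile request time per namespace
    logql: |
      quantile_over_time(0.95, {container="nginx"} | json | __error__="" | unwrap request_time [5m]) by (namespace)
```

The `| json` stage turns each top-level field into a label for the rest of the query. The `__error__=""` filter drops lines that failed to parse before `unwrap` converts `request_time` to a number.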
VictoriaMetrics
VictoriaMetrics provides metrics collection and storage as a lightweight, Prometheus-compatible alternative. It is deployed via the victoria-metrics-k8s-stack chart, which bundles multiple components.
Enabled Components
| Component | Role |
|---|---|
| VictoriaMetrics Operator | Manages VMSingle, VMAgent, and scrape configs |
| VMSingle | Single-node metrics storage (Prometheus-compatible TSDB) |
| VMAgent | Metrics scraper (replaces Prometheus server for scraping) |
| Node Exporter | Exposes host-level metrics (CPU, memory, disk, network) |
| Kube State Metrics | Exposes Kubernetes object metrics (pod status, deployment replicas, etc.) |
| Kubelet | Kubelet metrics collection |
Disabled Components
| Component | Why Disabled |
|---|---|
| Grafana | Deployed separately with its own HelmRelease |
| Alertmanager | No alerting configured |
| VMAlert | No alerting rules active |
| VMAuth | No multi-tenant authentication needed |
| VMCluster | Single-node mode (VMSingle) is sufficient |
| kubeControllerManager, kubeScheduler, kubeEtcd, kubeProxy | Not accessible in K3s (embedded in the K3s binary) |
Storage
| Setting | Value |
|---|---|
| Retention | 31 days |
| Volume | 10Gi PersistentVolumeClaim (ReadWriteOnce) |
| Scrape interval | 30 seconds |
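
Taken together, the enabled components, disabled components, and storage settings could be expressed in the chart's values along these lines. The key names reflect the victoria-metrics-k8s-stack values layout as commonly documented and should be verified against the chart version in use.

```yaml
# Sketch only: reconstructs the three tables above; not the actual HelmRelease values.
victoria-metrics-operator:
  enabled: true
vmsingle:
  enabled: true
  spec:
    retentionPeriod: "31d"
    storage:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
vmagent:
  enabled: true
  spec:
    scrapeInterval: 30s
prometheus-node-exporter:
  enabled: true
kube-state-metrics:
  enabled: true
kubelet:
  enabled: true
# Disabled: deployed separately or not needed at this scale
grafana:
  enabled: false
alertmanager:
  enabled: false
vmalert:
  enabled: false
vmauth:
  enabled: false
vmcluster:
  enabled: false
# Disabled: embedded in the K3s binary, so there is nothing to scrape
kubeControllerManager:
  enabled: false
kubeScheduler:
  enabled: false
kubeEtcd:
  enabled: false
kubeProxy:
  enabled: false
```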
Prometheus Converter
The VictoriaMetrics Operator’s Prometheus converter is enabled (disable_prometheus_converter: false). This means any ServiceMonitor or PodMonitor resources in the cluster are automatically converted to the equivalent VictoriaMetrics scrape objects, maintaining compatibility with the Prometheus ecosystem.
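
For example, a ServiceMonitor like the one below would be picked up by the converter and turned into a VMServiceScrape. The application name, label, and port name are hypothetical, and the example assumes the Prometheus Operator CRDs are installed in the cluster.

```yaml
# Sketch only: example-app, its label, and the port name are hypothetical.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  namespace: observability
spec:
  selector:
    matchLabels:
      app: example-app              # matches the target Service's labels
  endpoints:
    - port: metrics                 # named port on the Service
      path: /metrics
      interval: 30s
```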
Metrics Flow
```mermaid
graph LR
    nodexp["Node Exporter"] --> vmagent["VMAgent"]
    ksm["Kube State<br/>Metrics"] --> vmagent
    kubelet["Kubelet"] --> vmagent
    vmagent -->|remote write| vmsingle["VMSingle<br/>(10Gi PV, 31d retention)"]
    vmsingle -->|PromQL| grafana["Grafana"]
```
Custom Dashboards
Two custom Grafana dashboards are deployed as ConfigMaps with the grafana_dashboard: "1" label, automatically loaded by the Grafana sidecar.
Flux CD Dashboard
| Panel | Type | Data |
|---|---|---|
| Reconciliation Activity | Time series (stacked bars) | Count of reconciliation events per controller |
| Flux Errors and Warnings | Time series (line) | Count of error/warn/failed events per controller |
| Reconciliation Events | Logs | Filtered log stream showing reconcile, apply, create, delete, drift, error events |
| All Flux Logs | Logs | Unfiltered log stream from Flux controllers |
Includes a $controller template variable to filter by kustomize-controller, helm-controller, or source-controller.
Platform Overview Dashboard
| Panel | Type | Data |
|---|---|---|
| Log Volume by Namespace | Time series (stacked bars) | Count of log lines per namespace over time |
| Error Rate | Time series (line, red) | Total count of error/fatal/panic log lines |
| Error Rate by Namespace | Time series (line) | Error count broken down by namespace |
| Recent Errors | Logs | Filtered log stream showing only error/fatal/panic entries |
Includes a $namespace template variable (dynamically populated from Loki labels) to filter by any namespace in the cluster.
Secrets
Grafana credentials are managed via the External Secrets Operator. Four Key Vault secrets are composed into the grafana-db Kubernetes secret:
| Key Vault Secret | Kubernetes Key | Purpose |
|---|---|---|
| pg-fqdn | GF_DATABASE_HOST | PostgreSQL server address |
| pg-admin-username | GF_DATABASE_USER | Database login |
| pg-admin-password | GF_DATABASE_PASSWORD | Database password |
| grafana-admin-password | GF_SECURITY_ADMIN_PASSWORD | Grafana admin UI password |
Secrets refresh hourly. Loki, Promtail, and VictoriaMetrics do not require external secrets — they have no credentials or external state.
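
The mapping from Key Vault secret names to the template variables used in the ExternalSecret is typically declared under `spec.data`. The fragment below is a sketch of that section, inferred from the table above and the template variable names; it is not shown in the source.

```yaml
# Sketch only: inferred from the secret names and template variables on this page.
spec:
  data:
    - secretKey: pg_fqdn                  # available as {{ .pg_fqdn }} in the template
      remoteRef:
        key: pg-fqdn                      # Key Vault secret name
    - secretKey: pg_admin_username
      remoteRef:
        key: pg-admin-username
    - secretKey: pg_admin_password
      remoteRef:
        key: pg-admin-password
    - secretKey: grafana_admin_password
      remoteRef:
        key: grafana-admin-password
```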
Flux CD Integration
The observability Kustomization depends on the External Secrets store being available:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: observability
  namespace: flux-system
spec:
  dependsOn:
    - name: external-secrets-store
  interval: 10m0s
  path: ./k8s/observability
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
```

This ensures the ClusterSecretStore is ready before Grafana’s ExternalSecret is created.
The monitoring.kevinryan.io A record is managed by Terraform in the root module:
```hcl
resource "cloudflare_record" "monitoring" {
  zone_id = var.cloudflare_zone_id
  name    = "monitoring"
  content = module.network.public_ip_address
  type    = "A"
  proxied = true
  ttl     = 1
}
```

Traffic is proxied through Cloudflare, providing CDN caching for static dashboard assets and DDoS protection for the Grafana API.
Manifest Inventory
All 12 files in k8s/observability/:
| File | Resource Type | Purpose |
|---|---|---|
| namespace.yaml | Namespace | observability namespace |
| helmrepository-grafana.yaml | HelmRepository | Loki + Promtail charts |
| helmrepository-grafana-community.yaml | HelmRepository | Grafana chart |
| helmrepository-victoriametrics.yaml | HelmRepository | VictoriaMetrics chart |
| helmrelease-grafana.yaml | HelmRelease | Grafana deployment |
| helmrelease-loki.yaml | HelmRelease | Loki deployment |
| helmrelease-promtail.yaml | HelmRelease | Promtail DaemonSet |
| helmrelease-victoria-metrics.yaml | HelmRelease | VictoriaMetrics stack |
| externalsecret.yaml | ExternalSecret | Grafana DB + admin credentials |
| ingress.yaml | IngressRoute | monitoring.kevinryan.io routing |
| dashboard-flux-cd.yaml | ConfigMap | Flux CD Grafana dashboard |
| dashboard-platform-overview.yaml | ConfigMap | Platform overview Grafana dashboard |