ADR-016: Add Second K3s Node for Observability Workloads
Status: Accepted
Date: 2026-03-05
Decision Makers: Human + Agent
Prompted By: The current B2s (4 GB RAM) has ~1–2 GB headroom after running K3s, Traefik, Flux CD, and site containers — insufficient to co-host the observability stack (Grafana, Loki, Promtail) introduced in ADR-006.
Context
The observability stack defined in ADR-006 (Grafana, Loki, Promtail) requires meaningful memory to operate reliably. The single B2s node migrated to in ADR-014 runs K3s server, Traefik, Flux CD (source-controller + kustomize-controller), and the site containers. At steady state, only ~1–2 GB of the 4 GB remains free — not enough headroom to safely add Grafana (~300–500 MB), Loki (~200–400 MB), and Promtail (~50 MB) without risking OOM kills against site workloads.
Two broad approaches were considered: vertical scaling (upgrade the single node) or horizontal scaling (add a dedicated agent node for observability).
Decision Drivers
- Site availability: Observability tooling must not compete with site containers for memory on the same node.
- Failure isolation: An observability failure should not affect site uptime.
- Cost: Any solution must stay within the sub-£50/month budget.
- Portfolio value: The infrastructure should demonstrate patterns relevant to enterprise Kubernetes work.
- Operational simplicity: GitOps via Flux must remain the single source of truth for cluster state.
Options Considered
Option A: Upgrade to Standard_B2as_v2 (8 GB, single node)
Replace the B2s with a B2as_v2 (AMD, 2 vCPU, 8 GB RAM). All workloads remain on one node.
Doubles available RAM, resolves the headroom problem immediately, and requires minimal infrastructure change (a Terraform vm_size update and a destroy/recreate).
Trade-offs: vertical scaling only — the ceiling moves but the single-node constraint remains. A node failure takes down both sites and observability simultaneously. Does not advance multi-node cluster experience. Cost: ~£28–32/month.
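For scale, the Terraform change for this option would be a one-line SKU edit — a sketch only, with illustrative resource and attribute names rather than the actual repo's:

```hcl
# Option A sketch (not chosen): vertical scaling via the VM SKU.
# Resource name and surrounding configuration are hypothetical.
resource "azurerm_linux_virtual_machine" "node1" {
  # ...existing configuration unchanged...
  size = "Standard_B2as_v2" # was "Standard_B2s"
}
```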
Option B: Add a second Standard_B2s as a K3s agent node (chosen)
Provision a second B2s VM (node2) as a K3s agent joined to the existing server (node1). Site workloads remain on node1; observability workloads are scheduled exclusively on node2 via a NoSchedule taint on node2 and matching tolerations + nodeSelector on observability Helm releases.
Both nodes run at the same B2s spec. Total cost: ~£48/month (two B2s VMs + ACR Basic).
Failure isolation: if node2 fails, metrics and dashboards are unavailable but sites are completely unaffected. The inverse is also true — a node1 failure does not corrupt Loki data or Grafana state held on node2.
Option C: External managed observability (e.g. Grafana Cloud free tier)
Offload Grafana and Loki to Grafana Cloud’s free tier; retain Promtail on-cluster as a log shipper.
Eliminates the memory problem without any VM changes. Trade-offs: data leaves the cluster, introduces an external dependency, caps log retention and query capacity on the free tier, and removes the self-hosted observability story from the portfolio. Rejected on portfolio and data-locality grounds.
Decision
Add a second Standard_B2s VM as a K3s agent node. Sites stay on node1; all observability workloads move to node2.
Scheduling isolation is enforced by:
- A `NoSchedule` taint on `node2` (`observability=true:NoSchedule`) to prevent general workloads from landing there.
- Matching `tolerations` and `nodeSelector: { kubernetes.io/hostname: node2 }` on the Grafana, Loki, and Promtail Helm releases.
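As a sketch, the taint plus matching scheduling constraints might look like the following. Chart-specific value paths vary; this assumes a chart (such as Grafana’s) that passes `tolerations` and `nodeSelector` through to the pod spec, and the namespace and source names are illustrative:

```yaml
# Taint applied once to node2, e.g.:
#   kubectl taint nodes node2 observability=true:NoSchedule
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: grafana
  namespace: observability # hypothetical namespace
spec:
  chart:
    spec:
      chart: grafana
      sourceRef:
        kind: HelmRepository
        name: grafana
  values:
    # Pin the pods to node2 and allow them past its taint.
    nodeSelector:
      kubernetes.io/hostname: node2
    tolerations:
      - key: observability
        operator: Equal
        value: "true"
        effect: NoSchedule
```

The `nodeSelector` alone would not suffice — without the toleration, the taint would block the pods; without the selector, the toleration would merely permit (not require) scheduling on node2.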
Infrastructure changes:
- `infra/modules/compute/`: extract a reusable compute module; instantiate two VMs (`node1` as K3s server, `node2` as K3s agent joining via the server’s private IP).
- `infra/modules/network/`: associate a second NIC / public IP with `node2` (or use a private-only NIC if `node2` requires no direct inbound traffic).
- K3s agent join token passed from `node1` to `node2` via cloud-init; token stored as an Azure Key Vault secret to avoid embedding secrets in Terraform state in plaintext.
- Taint applied via a `kubectl` call in `node2`’s cloud-init post-join, or managed declaratively via a Flux `HelmRelease` post-hook.
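A minimal cloud-init sketch for `node2` under these assumptions — the Key Vault name, secret name, and server IP are placeholders, and the image is assumed to ship the Azure CLI with a managed identity granted secret-get access:

```yaml
#cloud-config
# Hypothetical cloud-init for node2: fetch the join token from Key Vault
# via the VM's managed identity, then join the K3s server.
runcmd:
  - az login --identity
  - >
    K3S_TOKEN=$(az keyvault secret show
    --vault-name example-kv --name k3s-join-token
    --query value -o tsv);
    curl -sfL https://get.k3s.io |
    K3S_URL=https://10.0.1.4:6443 K3S_TOKEN=$K3S_TOKEN
    INSTALL_K3S_EXEC="--node-taint observability=true:NoSchedule" sh -
```

Here the taint is applied at registration via K3s’s own `--node-taint` agent flag rather than a post-join `kubectl` call, which avoids needing a kubeconfig on the agent; either approach satisfies the decision above.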
Consequences
Positive
- Observability stack runs without memory pressure; ~4 GB dedicated to Grafana, Loki, and Promtail on `node2`
- A `node2` failure leaves sites fully operational — monitoring is degraded, not site availability
- Failure domains are separated at the VM level, which is the strongest isolation available at this budget tier
- Cost (~£48/month) remains below the sub-£50 budget ceiling
Negative
- Monthly cost increases from ~£24 to ~£48 (adds one B2s on-demand VM)
- Operational surface doubles: two VMs to patch, monitor, and rebuild after a destroy/recreate
- The Cloudflare Origin Certificate TLS secret (currently only needed on `node1`) must still be manually re-applied to the cluster after a `node1` rebuild — unchanged risk from ADR-014
- K3s agent join requires the server’s private IP and join token to be available at `node2` boot time; ordering dependency must be handled in Terraform (`depends_on`) and cloud-init
- Join token exposure: The K3s join token grants full cluster access. Mitigation: store in Azure Key Vault; cloud-init retrieves it via the VM’s managed identity at boot time.
- `node2` rebuild loses observability data: Loki stores log data on the node’s local disk by default. A destroy/recreate of `node2` loses historical logs. Mitigation: document the data-loss boundary; consider an Azure Disk persistent volume in a future ADR if log retention becomes a requirement.
- `node1` still a SPOF for the control plane: K3s HA requires an odd number of server nodes (3+). At this budget tier, `node1` remains the single control-plane node; a `node1` failure makes the API server unavailable (sites continue to serve from cached container state until pods are evicted, but no new scheduling occurs). Accepted trade-off at this scale.
Agent Decisions
| Decision | Rationale | Acceptable |
|---|---|---|
| No agent implementation decisions recorded | This ADR documents the human-directed architectural decision only; implementation details will be recorded in subsequent ADRs or PRs | Yes |