@shirhatti
Created January 6, 2026 07:00
How a Missing SNAT Flag Broke My Homelab's Cloud Database Connection

I run Coder (a development environment platform) in my homelab Kubernetes cluster, with the PostgreSQL database hosted in Azure. The networking setup uses Tailscale to connect my homelab to Azure resources through a subnet router VM. Everything was working fine until I migrated to a new Azure subnet router—then Coder went down hard.

The Symptom

Coder was crash-looping with 693 restarts over 22 days. The logs showed:

2026-01-06 06:15:03.939 [warn]  ping postgres: retrying
error="dial tcp: lookup coder-postgres.internal.azure.shirhatti.com
on 10.43.0.10:53: read udp 10.42.1.129:45479->10.43.0.10:53: i/o timeout"

DNS timeouts. Classic networking issue.
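
For the record, the restart count and the warning above come straight from kubectl; the namespace and deployment name here are from my setup and may differ in yours:

kubectl -n coder get pods                      # RESTARTS column
kubectl -n coder logs deploy/coder --tail=100  # the "ping postgres: retrying" warnings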

The Investigation

Following the DNS query path revealed an interesting chain:

  1. Coder pod queries CoreDNS (10.43.0.10:53) in the k3s cluster
  2. CoreDNS forwards to the host's /etc/resolv.conf
  3. Host uses Tailscale MagicDNS (100.100.100.100)
  4. Tailscale MagicDNS should forward *.internal.azure.shirhatti.com queries to Azure DNS (168.63.129.16)
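
Each hop can be spot-checked directly. A rough sketch of the commands involved, assuming a stock k3s CoreDNS ConfigMap and systemd-resolved on the host (names vary by distro and Tailscale version):

# 1-2. What CoreDNS forwards to (k3s ships a Corefile with "forward . /etc/resolv.conf")
kubectl -n kube-system get configmap coredns -o yaml | grep -A3 forward
cat /etc/resolv.conf

# 3-4. Confirm MagicDNS owns host DNS and see its split-DNS routes
resolvectl status tailscale0
tailscale dns status    # newer Tailscale releases; otherwise check the admin console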

But the Tailscale daemon logs on my homelab server showed:

dns udp query: waiting for response or error from [168.63.129.16]:
context deadline exceeded
health(warnable=dns-forward-failing): error: Tailscale can't reach
the configured DNS servers
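
(Those lines are from the tailscaled service logs on the homelab host; with a systemd install, something like this surfaces them:)

sudo journalctl -u tailscaled --since "1 hour ago" | grep -iE "dns|health"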

Wait, but routing worked fine! I could reach 168.63.129.16 from the homelab:

$ ip route get 168.63.129.16
168.63.129.16 dev tailscale0 table 52 src 100.78.37.20

$ echo "test" | nc -u -w2 168.63.129.16 53
# Port reachable!
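
But a DNS query aimed straight at Azure DNS told a different story (the dig flags just keep the hang short; this assumes dig is installed on the homelab server):

$ dig @168.63.129.16 coder-postgres.internal.azure.shirhatti.com +tries=1 +time=3
# ...hangs, then "connection timed out; no servers could be reached"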

So packets could get there, but DNS queries timed out. Why?

The Root Cause

The Azure subnet router was advertising the route to 168.63.129.16/32, but it was configured with --snat-subnet-routes=false.

Here's what that meant:

Without SNAT (false):

  • DNS query from homelab (100.78.37.20) → Azure DNS (168.63.129.16)
  • Azure DNS receives query from source IP 100.78.37.20
  • Azure DNS tries to respond to 100.78.37.20
  • Problem: Azure's network doesn't know how to route to Tailscale IPs
  • Response gets dropped, query times out

With SNAT (true):

  • DNS query from homelab (100.78.37.20) → Subnet router (10.4.2.4) → Azure DNS (168.63.129.16)
  • Subnet router performs source NAT
  • Azure DNS receives query from source IP 10.4.2.4 (the router's VNet IP)
  • Azure DNS responds to 10.4.2.4 (routable within the VNet)
  • Router performs NAT translation back
  • Response reaches homelab, success!
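
The difference is easy to see from the subnet router itself. A hedged sketch (the eth0 interface name and the 10.4.2.4 address are from my VNet; substitute your own):

# Watch DNS traffic leaving the router toward Azure DNS
sudo tcpdump -ni eth0 udp port 53 and host 168.63.129.16
# Without SNAT: queries leave with src 100.78.37.20, and no replies ever come back
# With SNAT:    queries leave with src 10.4.2.4, and replies flow back to the router

# On an iptables-based install, the SNAT rule itself shows up as a MASQUERADE entry
sudo iptables -t nat -S | grep -i masquerade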

Why It Used to Work

The old subnet router must have been configured with --snat-subnet-routes=true. When I created the new one, the cloud-init script had it set to false (probably copied from some documentation that prioritized preserving source IPs for logging).

To make matters worse, the cloud-init script also had an invalid Tailscale tag (tag:westus3 that didn't exist), which caused the initial tailscale up command to fail silently. The VM was running Tailscale, but not advertising any routes at all!
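
Two quick checks would have caught both problems much earlier; a rough sketch, using the stock cloud-init log locations (the tailscale debug subcommand and its output fields may vary by version):

# Did cloud-init actually finish cleanly, and what did tailscale up say?
cloud-init status --long
grep -i tailscale /var/log/cloud-init-output.log

# On the router: is it actually advertising the routes it's supposed to?
sudo tailscale debug prefs | grep -i advertise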

The Fix

Simple, once identified:

sudo tailscale up \
  --advertise-routes=10.4.0.0/16,168.63.129.16/32 \
  --advertise-exit-node \
  --advertise-tags=tag:azure \
  --accept-routes \
  --accept-dns=false \
  --hostname=azure-router-westus3 \
  --snat-subnet-routes=true   # the critical flag!

After restarting the Tailscale daemon on the homelab server, DNS resolution started working:

$ host coder-postgres.internal.azure.shirhatti.com
coder-postgres.internal.azure.shirhatti.com has address 10.4.1.4
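
(Restarting the daemon is just a systemd bounce on a typical install; and since Coder resolves names from inside the cluster, it's worth verifying there too. The pod name and image below are arbitrary.)

# On the homelab server
sudo systemctl restart tailscaled

# From inside the k3s cluster
kubectl run dns-test -it --rm --restart=Never --image=busybox:1.36 -- \
  nslookup coder-postgres.internal.azure.shirhatti.com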

Coder came right back up.

Lessons Learned

  1. SNAT isn't just about preserving source IPs—it's essential when your DNS server can't route responses back to the original client network

  2. Cloud-init failures can be silent—the VM booted fine, Tailscale was running, but the critical configuration never applied

  3. Routing != Reachability—just because ip route shows a path and nc can connect doesn't mean application-layer protocols will work

  4. Test DNS resolution, not just connectivity—I could ping and connect to the port, but DNS queries specifically failed

  5. Document your network assumptions—I updated my topology documentation to explicitly note why SNAT is required for Azure DNS to work

The updated creation script now has --snat-subnet-routes=true, and future subnet routers won't have this issue.
