I run Coder (a development environment platform) in my homelab Kubernetes cluster, with the PostgreSQL database hosted in Azure. The networking setup uses Tailscale to connect my homelab to Azure resources through a subnet router VM. Everything was working fine until I migrated to a new Azure subnet router—then Coder went down hard.
Coder was crash-looping with 693 restarts over 22 days. The logs showed:
2026-01-06 06:15:03.939 [warn] ping postgres: retrying
error="dial tcp: lookup coder-postgres.internal.azure.shirhatti.com
on 10.43.0.10:53: read udp 10.42.1.129:45479->10.43.0.10:53: i/o timeout"
DNS timeouts. Classic networking issue.
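A quick way to reproduce the symptom independently of Coder is to run a throwaway pod and issue the same lookup. This is a sketch; the busybox image and the default namespace are my assumptions, not details from the actual deployment:

# Reproduce the lookup failure from inside the cluster with a disposable pod.
# (image and namespace are assumptions; adjust for your deployment)
$ kubectl run dns-debug --rm -it --restart=Never --image=busybox:1.36 -- \
    nslookup coder-postgres.internal.azure.shirhatti.com
# With the broken router this times out, mirroring the Coder logs.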
Following the DNS query path revealed an interesting chain (each hop can be checked by hand; see the dig sketch after the list):

- Coder pod queries CoreDNS (10.43.0.10:53) in the k3s cluster
- CoreDNS forwards to the host's /etc/resolv.conf
- Host uses Tailscale MagicDNS (100.100.100.100)
- Tailscale MagicDNS should forward *.internal.azure.shirhatti.com queries to Azure DNS (168.63.129.16)
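Each hop can be exercised directly with dig from the homelab server (the k3s node doubles as the Tailscale client here); a rough sketch:

# 1. Ask CoreDNS via the cluster DNS service IP (run on the k3s node)
$ dig @10.43.0.10 coder-postgres.internal.azure.shirhatti.com

# 2. Confirm where CoreDNS forwards: the host's resolv.conf points at MagicDNS
$ cat /etc/resolv.conf

# 3. Ask MagicDNS directly
$ dig @100.100.100.100 coder-postgres.internal.azure.shirhatti.com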
But the Tailscale daemon logs on my homelab server showed:
dns udp query: waiting for response or error from [168.63.129.16]:
context deadline exceeded
health(warnable=dns-forward-failing): error: Tailscale can't reach
the configured DNS servers
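(For reference, those lines come from the tailscaled systemd unit on the homelab server; assuming a systemd-based install, something like this surfaces them:)

# Pull recent DNS-forwarding complaints out of the Tailscale daemon's journal
$ journalctl -u tailscaled --since "1 hour ago" | grep -i dns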
Wait, but routing worked fine! I could reach 168.63.129.16 from the homelab:
$ ip route get 168.63.129.16
168.63.129.16 dev tailscale0 table 52 src 100.78.37.20
$ echo "test" | nc -u -w2 168.63.129.16 53
# Port reachable!

So packets could get there, but DNS queries timed out. Why?
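The nc check is weaker than it looks: UDP is connectionless, so nc -u can report success even when nothing ever answers. The test that actually matters is a real DNS query against Azure DNS over the Tailscale route, and that's the one that hangs. A sketch:

# Send an actual DNS query to Azure DNS and demand an answer within 2 seconds.
$ dig @168.63.129.16 coder-postgres.internal.azure.shirhatti.com +time=2 +tries=1
# expected here: ";; connection timed out; no servers could be reached"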
The Azure subnet router was advertising the route to 168.63.129.16/32, but it was configured with --snat-subnet-routes=false.
Here's what that meant:
Without SNAT (false):

- DNS query from homelab (100.78.37.20) → Azure DNS (168.63.129.16)
- Azure DNS receives the query from source IP 100.78.37.20
- Azure DNS tries to respond to 100.78.37.20
- Problem: Azure's network doesn't know how to route to Tailscale IPs
- Response gets dropped, query times out
With SNAT (true):

- DNS query from homelab (100.78.37.20) → subnet router (10.4.2.4) → Azure DNS (168.63.129.16)
- Subnet router performs source NAT
- Azure DNS receives the query from source IP 10.4.2.4 (the router's VNet IP)
- Azure DNS responds to 10.4.2.4 (routable within the VNet)
- Router translates the NAT back
- Response reaches the homelab, success!
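Whether a router is actually applying SNAT is checkable on the router itself: with the iptables backend, tailscaled programs its masquerade rule into a ts-postrouting chain in the nat table. That's my understanding of the Linux netfilter setup, so treat this as a sketch:

# On the Azure subnet router: dump Tailscale's NAT chain.
# With --snat-subnet-routes=true you should see a MASQUERADE rule;
# with =false the chain is effectively empty.
$ sudo iptables -t nat -S ts-postrouting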
The old subnet router must have been configured with --snat-subnet-routes=true. When I created the new one, the cloud-init script had it set to false (probably copied from some documentation that prioritized preserving source IPs for logging).
To make matters worse, the cloud-init script also referenced a Tailscale tag that didn't exist (tag:westus3), which caused the initial tailscale up command to fail silently. The VM was running Tailscale, but it wasn't advertising any routes at all!
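In hindsight, both failures were detectable: cloud-init logs capture the failed tailscale up on the router, and the Tailscale admin console would have shown the new router advertising no subnet routes. Roughly (the log path is the standard cloud-init location; adjust for your distro):

# On the new subnet router VM: did cloud-init finish cleanly?
$ sudo cloud-init status --long

# The output of the failed `tailscale up` (including the tag:westus3 error)
# should be captured in cloud-init's output log:
$ sudo grep -iA3 tailscale /var/log/cloud-init-output.log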
The fix was simple, once identified:
sudo tailscale up \
--advertise-routes=10.4.0.0/16,168.63.129.16/32 \
--advertise-exit-node \
--advertise-tags=tag:azure \
--accept-routes \
--accept-dns=false \
--hostname=azure-router-westus3 \
--snat-subnet-routes=true # The critical flag!

After restarting the Tailscale daemon on the homelab server, DNS resolution started working:
$ host coder-postgres.internal.azure.shirhatti.com
coder-postgres.internal.azure.shirhatti.com has address 10.4.1.4

Coder came right back up.
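On the Kubernetes side, recovery just looks like the restart counter holding still instead of climbing. Something like this, where the namespace and pod name are illustrative placeholders:

# The RESTARTS column stops incrementing once Coder can reach Postgres again.
# (namespace and pod name are illustrative)
$ kubectl -n coder get pods
NAME                    READY   STATUS    RESTARTS       AGE
coder-<hash>            1/1     Running   693 (5m ago)   22d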
A few takeaways:

- SNAT isn't just about preserving source IPs: it's essential when your DNS server can't route responses back to the original client network.
- Cloud-init failures can be silent: the VM booted fine, Tailscale was running, but the critical configuration never applied.
- Routing != reachability: just because ip route shows a path and nc can connect doesn't mean application-layer protocols will work.
- Test DNS resolution, not just connectivity: I could ping and connect to the port, but DNS queries specifically failed.
- Document your network assumptions: I updated my topology documentation to explicitly note why SNAT is required for Azure DNS to work.
The updated creation script now has --snat-subnet-routes=true, and future subnet routers won't have this issue.