@shirhatti
Created January 6, 2026 07:00
How a Missing SNAT Flag Broke My Homelab's Cloud Database Connection

I run Coder (a development environment platform) in my homelab Kubernetes cluster, with the PostgreSQL database hosted in Azure. The networking setup uses Tailscale to connect my homelab to Azure resources through a subnet router VM. Everything was working fine until I migrated to a new Azure subnet router—then Coder went down hard.

The Symptom

Coder was crash-looping with 693 restarts over 22 days. The logs showed:

2026-01-06 06:15:03.939 [warn]  ping postgres: retrying
error="dial tcp: lookup coder-postgres.internal.azure.shirhatti.com
on 10.43.0.10:53: read udp 10.42.1.129:45479->10.43.0.10:53: i/o timeout"

DNS timeouts. Classic networking issue.
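
For the record, the restart count and the warning above come straight from kubectl; the namespace and deployment name here are from my setup and may differ in yours:

kubectl -n coder get pods                      # RESTARTS column
kubectl -n coder logs deploy/coder --tail=100  # the "ping postgres: retrying" warnings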

The Investigation

Following the DNS query path revealed an interesting chain:

  1. Coder pod queries CoreDNS (10.43.0.10:53) in the k3s cluster
  2. CoreDNS forwards to the host's /etc/resolv.conf
  3. Host uses Tailscale MagicDNS (100.100.100.100)
  4. Tailscale MagicDNS should forward *.internal.azure.shirhatti.com queries to Azure DNS (168.63.129.16)
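
Each hop can be spot-checked directly. A rough sketch of the commands involved, assuming a stock k3s CoreDNS ConfigMap and systemd-resolved on the host (names vary by distro and Tailscale version):

# 1-2. What CoreDNS forwards to (k3s ships a Corefile with "forward . /etc/resolv.conf")
kubectl -n kube-system get configmap coredns -o yaml | grep -A3 forward
cat /etc/resolv.conf

# 3-4. Confirm MagicDNS owns host DNS and see its split-DNS routes
resolvectl status tailscale0
tailscale dns status    # newer Tailscale releases; otherwise check the admin console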

But the Tailscale daemon logs on my homelab server showed:

dns udp query: waiting for response or error from [168.63.129.16]:
context deadline exceeded
health(warnable=dns-forward-failing): error: Tailscale can't reach
the configured DNS servers
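
(Those lines are from the tailscaled service logs on the homelab host; with a systemd install, something like this surfaces them:)

sudo journalctl -u tailscaled --since "1 hour ago" | grep -iE "dns|health"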

Wait, but routing worked fine! I could reach 168.63.129.16 from the homelab:

$ ip route get 168.63.129.16
168.63.129.16 dev tailscale0 table 52 src 100.78.37.20

$ echo "test" | nc -u -w2 168.63.129.16 53
# Port reachable!
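
But a DNS query aimed straight at Azure DNS told a different story (the dig flags just keep the hang short; this assumes dig is installed on the homelab server):

$ dig @168.63.129.16 coder-postgres.internal.azure.shirhatti.com +tries=1 +time=3
# ...hangs, then "connection timed out; no servers could be reached"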

So packets could get there, but DNS queries timed out. Why?

The Root Cause

The Azure subnet router was advertising the route to 168.63.129.16/32, but it was configured with --snat-subnet-routes=false.

Here's what that meant:

Without SNAT (false):

  • DNS query from homelab (100.78.37.20) → Azure DNS (168.63.129.16)
  • Azure DNS receives query from source IP 100.78.37.20
  • Azure DNS tries to respond to 100.78.37.20
  • Problem: Azure's network doesn't know how to route to Tailscale IPs
  • Response gets dropped, query times out

With SNAT (true):

  • DNS query from homelab (100.78.37.20) → Subnet router (10.4.2.4) → Azure DNS (168.63.129.16)
  • Subnet router performs source NAT
  • Azure DNS receives query from source IP 10.4.2.4 (the router's VNet IP)
  • Azure DNS responds to 10.4.2.4 (routable within the VNet)
  • Router performs NAT translation back
  • Response reaches homelab, success!
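
The difference is easy to see from the subnet router itself. A hedged sketch (the eth0 interface name and the 10.4.2.4 address are from my VNet; substitute your own):

# Watch DNS traffic leaving the router toward Azure DNS
sudo tcpdump -ni eth0 udp port 53 and host 168.63.129.16
# Without SNAT: queries leave with src 100.78.37.20, and no replies ever come back
# With SNAT:    queries leave with src 10.4.2.4, and replies flow back to the router

# On an iptables-based install, the SNAT rule itself shows up as a MASQUERADE entry
sudo iptables -t nat -S | grep -i masquerade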

Why It Used to Work

The old subnet router must have been configured with --snat-subnet-routes=true. When I created the new one, the cloud-init script had it set to false (probably copied from some documentation that prioritized preserving source IPs for logging).

To make matters worse, the cloud-init script also had an invalid Tailscale tag (tag:westus3 that didn't exist), which caused the initial tailscale up command to fail silently. The VM was running Tailscale, but not advertising any routes at all!
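
Two quick checks would have caught both problems much earlier; a rough sketch, using the stock cloud-init log locations (the tailscale debug subcommand and its output fields may vary by version):

# Did cloud-init actually finish cleanly, and what did tailscale up say?
cloud-init status --long
grep -i tailscale /var/log/cloud-init-output.log

# On the router: is it actually advertising the routes it's supposed to?
sudo tailscale debug prefs | grep -i advertise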

The Fix

Simple, once identified:

sudo tailscale up \
  --advertise-routes=10.4.0.0/16,168.63.129.16/32 \
  --advertise-exit-node \
  --advertise-tags=tag:azure \
  --accept-routes \
  --accept-dns=false \
  --hostname=azure-router-westus3 \
  --snat-subnet-routes=true   # the critical flag!

After restarting the Tailscale daemon on the homelab server, DNS resolution started working:

$ host coder-postgres.internal.azure.shirhatti.com
coder-postgres.internal.azure.shirhatti.com has address 10.4.1.4
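
(Restarting the daemon is just a systemd bounce on a typical install; and since Coder resolves names from inside the cluster, it's worth verifying there too. The pod name and image below are arbitrary.)

# On the homelab server
sudo systemctl restart tailscaled

# From inside the k3s cluster
kubectl run dns-test -it --rm --restart=Never --image=busybox:1.36 -- \
  nslookup coder-postgres.internal.azure.shirhatti.com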

Coder came right back up.

Lessons Learned

  1. SNAT isn't just about preserving source IPs—it's essential when your DNS server can't route responses back to the original client network

  2. Cloud-init failures can be silent—the VM booted fine, Tailscale was running, but the critical configuration never applied

  3. Routing != Reachability—just because ip route shows a path and nc can connect doesn't mean application-layer protocols will work

  4. Test DNS resolution, not just connectivity—I could ping and connect to the port, but DNS queries specifically failed

  5. Document your network assumptions—I updated my topology documentation to explicitly note why SNAT is required for Azure DNS to work

The updated creation script now has --snat-subnet-routes=true, and future subnet routers won't have this issue.
