@karubits
Last active October 7, 2025 14:19
Proxmox 9 and Thunderbolt Networking - Auto recover network on reboot
# /etc/udev/rules.d/99-thunderbolt-net.rules
# Fire the bring-up service only when a thunderbolt netdev appears.
# Avoid ACTION=="change" to prevent re-invoking the service during link flaps.
ACTION=="add", SUBSYSTEM=="net", KERNEL=="thunderbolt*", \
TAG+="systemd", ENV{SYSTEMD_WANTS}="tbnet-bringup@%k.service"
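
To make the rule's effect concrete: `%k` expands to the kernel name of the device, so a netdev named thunderbolt1 requests tbnet-bringup@thunderbolt1.service. The sketch below mirrors that mapping in shell. The function name `tbnet_wanted_unit` is hypothetical and only illustrates the glob match and the substitution; it is not part of udev or of the files in this gist.

```bash
# Hypothetical helper: mirrors KERNEL=="thunderbolt*" plus the %k substitution.
# Prints the unit the rule would request, or returns 1 if the name does not match.
tbnet_wanted_unit() {
    case "$1" in
        thunderbolt*) printf 'tbnet-bringup@%s.service\n' "$1" ;;
        *) return 1 ;;
    esac
}
```

For example, `systemctl status "$(tbnet_wanted_unit thunderbolt1)"` should show whether the instance for that interface has run.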

Proxmox 9 and Thunderbolt Networking - Auto recover network on reboot

Hardware used: 3× Minisforum MS-01

With the release of Proxmox 9 and its newer kernel, Thunderbolt interfaces on the MS-01 come up automatically out of the box, which makes setup much easier.

However, I found an issue in a full mesh topology, for example:

  • PVE1 connects to PVE2 and PVE3
  • PVE2 connects to PVE1 and PVE3
  • PVE3 connects to PVE1 and PVE2

When I restart PVE2, the Thunderbolt interfaces on that node do not come up automatically, and running ifup thunderbolt# locally (where # is the interface number) doesn’t restore them. The only way I found to bring the links back is to run ifup thunderbolt# on one of the remote hosts.

To work around this, I created a small script and a systemd service. They monitor for when the Thunderbolt device is reconnected and automatically bring up the interface once it’s UP and stable.
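
The script further down decides "UP and stable" by reading two sysfs attributes, operstate and carrier. For a quick manual check of the same condition, something like the following could be used. This is a minimal sketch: `link_is_up` and its optional sysfs-root argument are hypothetical names introduced here for illustration.

```bash
# Hypothetical helper: succeeds when the interface reports operstate=up and carrier=1.
# The optional second argument overrides the sysfs root (useful for testing).
link_is_up() {
    _dir="${2:-/sys/class/net}/$1"
    [ "$(cat "$_dir/operstate" 2>/dev/null)" = "up" ] &&
        [ "$(cat "$_dir/carrier" 2>/dev/null)" = "1" ]
}
```

Usage: `link_is_up thunderbolt0 && echo "thunderbolt0 is up"`. A missing interface, or one whose carrier cannot be read, counts as not up.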


Assumptions

  • All hypervisor nodes are running a fresh install of Proxmox 9.
  • The hypervisors are connected in a full mesh topology.
  • Thunderbolt connections are in place and working.
  • You’ve followed Full Mesh Network for Ceph Server – Using SDN Fabrics.

    I use this approach because it’s supported in the Proxmox UI, it’s simple to set up, and it provides redundancy with a full mesh.


Instructions

  1. Edit /etc/network/interfaces and configure the Thunderbolt interfaces with hotplug support and jumbo frames (recommended for Ceph performance):
    allow-hotplug thunderbolt0
    iface thunderbolt0 inet manual
        mtu 9000
    
    allow-hotplug thunderbolt1
    iface thunderbolt1 inet manual
        mtu 9000
  2. Save the systemd service file tbnet-bringup@.service to /etc/systemd/system/tbnet-bringup@.service
  3. Save the tbnet-bringup.sh script to /usr/local/sbin/tbnet-bringup.sh
  4. Save 99-thunderbolt-net.rules to /etc/udev/rules.d/99-thunderbolt-net.rules
  5. Make the script executable: `chmod +x /usr/local/sbin/tbnet-bringup.sh`
  6. Reload the udev rules and the systemd daemon:
     ```bash
     udevadm control --reload-rules
     udevadm trigger -s net
     systemctl daemon-reload
     ```
  7. Repeat for each host in the cluster.
  8. I suggest opening two SSH connections to a neighbouring host (say, PVE1) and running:
     • `watch -c -n 1 "ip -br -c a | grep thunder"` in one window to monitor when the interface is restored.
     • `journalctl -f` in a second window for log monitoring.
  9. Then reboot a neighbour node, e.g. PVE2.
  10. The script logs to the journal as soon as it sees the device connection restored:
15:02:59 pve1 kernel: thunderbolt 0-0:1.1: retimer disconnected
15:03:00 pve1 kernel: thunderbolt 0-0:1.1: new retimer found, vendor=0x8087 device=0x15ee
15:03:05 pve1 kernel: thunderbolt 0-1: new host found, vendor=0x8086 device=0x1
15:03:05 pve1 kernel: thunderbolt 0-1: Intel Corp. (none)
15:03:07 pve1 tbnet-bringup[127399]: [thunderbolt1] Attempt 1/10: state=down carrier=0; retry in 5s
15:03:12 pve1 tbnet-bringup[127429]: [thunderbolt1] Attempt 2/10: state=down carrier=0; retry in 5s
15:03:12 pve1 fabricd[1253]: [NBV6R-CM3PT] OpenFabric: Needed to resync LSPDB using CSNP!
15:03:13 pve1 fabricd[1253]: [NBV6R-CM3PT] OpenFabric: Needed to resync LSPDB using CSNP!
15:03:17 pve1 tbnet-bringup[127494]: [thunderbolt1] UP detected (attempt 3); verifying stable for 6s…
15:03:27 pve1 fabricd[1253]: [GNY7F-C4R79] ISIS-Adj (ceph): Rcvd P2P IIH from (thunderbolt1) with invalid pdu length 8997
15:03:27 pve1 tbnet-bringup[127586]: [thunderbolt1] Stable UP for 10s. Applying offload tweaks and ifup.
15:03:27 pve1 systemd[1]: tbnet-bringup@thunderbolt1.service: Deactivated successfully.
15:03:27 pve1 systemd[1]: Finished tbnet-bringup@thunderbolt1.service - Thunderbolt net bring-up for thunderbolt1 (with retries).
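
The log announces a 6s stability check but reports success after 10s. That follows from how the script polls: the link is only re-checked once per DELAY seconds, so the effective window is STABLE_SECS rounded up to the next multiple of DELAY. A minimal sketch of that arithmetic, assuming the defaults from the script (the function name `effective_stable_secs` is hypothetical):

```bash
# Hypothetical helper: effective stability window in seconds, i.e. STABLE_SECS
# rounded up to the next multiple of DELAY (the script only checks once per DELAY).
effective_stable_secs() {
    echo $(( ($1 + $2 - 1) / $2 * $2 ))   # $1 = STABLE_SECS, $2 = DELAY
}

effective_stable_secs 6 5   # prints 10, matching "Stable UP for 10s" in the log
```

If a shorter confirmation time matters, lowering DELAY has more effect than lowering STABLE_SECS.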

After you have tested and validated that the networks are restored, you're ready to proceed with the Ceph installation and use the Fabric subnet as your Ceph network.
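
As part of that validation, it may be worth confirming that the jumbo-frame MTU from step 1 was applied on every Thunderbolt interface after the reboot. The sketch below reads the MTU from sysfs. `check_tb_mtu` and its arguments are hypothetical names, and the expected value of 9000 is taken from the interfaces file above.

```bash
# Hypothetical helper: reports any thunderbolt* interface whose MTU differs from
# the expected value (default 9000). Returns non-zero if a mismatch is found.
# The optional second argument overrides the sysfs root (useful for testing).
check_tb_mtu() {
    _want="${1:-9000}"; _root="${2:-/sys/class/net}"; _rc=0
    for _dev in "$_root"/thunderbolt*; do
        [ -r "$_dev/mtu" ] || continue   # also skips the unexpanded glob
        _mtu="$(cat "$_dev/mtu")"
        if [ "$_mtu" != "$_want" ]; then
            echo "${_dev##*/}: mtu=$_mtu (expected $_want)"
            _rc=1
        fi
    done
    return "$_rc"
}
```

Run `check_tb_mtu` on each node. It prints nothing and returns 0 when all Thunderbolt interfaces match; note that it also returns 0 when no Thunderbolt interface exists at all.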

#!/bin/sh
# /usr/local/sbin/tbnet-bringup.sh
# ==============================================================================
# tbnet-bringup.sh (no-flap)
#
# Goal:
# Bring a Thunderbolt netdev up WITHOUT toggling it down. Set admin-UP once
# (if needed), then poll until the link is stably UP. Exit as soon as it's
# stable. If already UP at start, do nothing and exit 0.
#
# Strategy:
# * Triggered only on ACTION=add (device creation).
# * Never "ip link set down".
# * Consider success when carrier=1 and operstate=up
# and remains stable for STABLE_SECS (default 6s).
# * Apply ethtool offload tweaks and run ifup once on success.
# * Use a lockfile to prevent concurrent instances per iface.
#
# Tunables (env):
# ATTEMPTS=10 # total checks (default)
# DELAY=5 # seconds between checks
# SETTLE=1 # initial settle before first check
# STABLE_SECS=6 # seconds of continuous UP before declaring success
#
# Exit codes:
# 0 success (already up or became stable-up)
# 1 failed to reach stable-up in time
# ==============================================================================
set -eu

IFACE="${1:?need iface name}"
ATTEMPTS="${ATTEMPTS:-10}"
DELAY="${DELAY:-5}"
SETTLE="${SETTLE:-1}"
STABLE_SECS="${STABLE_SECS:-6}"
LOCKDIR="/run/tbnet-bringup"
LOCKFILE="$LOCKDIR/$IFACE.lock"

log() { /usr/bin/logger -t tbnet-bringup "[$IFACE] $*"; }

mkdir -p "$LOCKDIR"

# crude lock: if another instance is running, exit quietly
if ! ( set -o noclobber; echo $$ > "$LOCKFILE" ) 2>/dev/null; then
    log "Another bring-up instance is running; exiting"
    exit 0
fi
trap 'rm -f "$LOCKFILE"' EXIT

# Ensure iface exists
[ -e "/sys/class/net/$IFACE" ] || { log "Interface not present in sysfs"; exit 1; }

# Initial settle for enumeration
sleep "$SETTLE"

# Helpers to read link state
state()   { cat "/sys/class/net/$IFACE/operstate" 2>/dev/null || echo unknown; }
carrier() { cat "/sys/class/net/$IFACE/carrier" 2>/dev/null || echo 0; }

# If not admin-UP, set it UP once (no down)
# (We don't care about return if already up)
/sbin/ip link set dev "$IFACE" up || true

# Fast path: already solid UP?
if [ "$(state)" = "up" ] && [ "$(carrier)" = "1" ]; then
    log "Already UP at start; applying tweaks and exiting."
    /usr/sbin/ethtool -K "$IFACE" gro off gso off lro off || true
    /sbin/ifup "$IFACE" || true
    exit 0
fi

# Poll until stable UP
i=1
stable_start=""
while [ "$i" -le "$ATTEMPTS" ]; do
    S="$(state)"; C="$(carrier)"
    if [ "$S" = "up" ] && [ "$C" = "1" ]; then
        # start or continue stability window
        now=$(date +%s)
        if [ -z "$stable_start" ]; then
            stable_start="$now"
            log "UP detected (attempt $i); verifying stable for ${STABLE_SECS}s…"
        fi
        elapsed=$(( now - stable_start ))
        if [ "$elapsed" -ge "$STABLE_SECS" ]; then
            log "Stable UP for ${elapsed}s. Applying offload tweaks and ifup."
            /usr/sbin/ethtool -K "$IFACE" gro off gso off lro off || true
            /sbin/ifup "$IFACE" || true
            exit 0
        fi
    else
        # lost carrier during stability window; reset window
        if [ -n "$stable_start" ]; then
            log "Lost stability (state=$S carrier=$C); restarting stability timer."
            stable_start=""
        else
            log "Attempt $i/$ATTEMPTS: state=$S carrier=$C; retry in ${DELAY}s"
        fi
    fi
    sleep "$DELAY"
    i=$((i+1))
done

log "FAILED to reach stable UP within $ATTEMPTS attempts"
exit 1
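
The lock in the script is described as crude: it writes the PID into the lock file and removes the file from an EXIT trap. Depending on the shell, that trap may not run if the script is killed by a signal, which would leave the lock behind until /run is cleared at the next boot. A minimal sketch for spotting such a leftover follows; `lock_is_stale` is a hypothetical helper, not part of the script, and it assumes it is run as root (as the service is) so that `kill -0` can probe any PID.

```bash
# Hypothetical helper: succeeds if the lock file exists but the PID recorded in
# it no longer refers to a running process (i.e. the lock is stale).
lock_is_stale() {
    [ -f "$1" ] || return 1
    _pid="$(cat "$1" 2>/dev/null)"
    case "$_pid" in
        ''|*[!0-9]*) return 0 ;;   # empty or non-numeric content: treat as stale
    esac
    ! kill -0 "$_pid" 2>/dev/null
}
```

Usage: `lock_is_stale /run/tbnet-bringup/thunderbolt1.lock && rm -f /run/tbnet-bringup/thunderbolt1.lock`.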
# /etc/systemd/system/tbnet-bringup@.service
[Unit]
Description=Thunderbolt net bring-up for %i (no-flap, with stability check)
BindsTo=sys-subsystem-net-devices-%i.device
After=sys-subsystem-net-devices-%i.device network-pre.target
[Service]
Type=oneshot
Environment="PATH=/usr/sbin:/usr/bin:/sbin:/bin"
# Optional per-host tuning:
# Environment=ATTEMPTS=12 DELAY=5 SETTLE=1 STABLE_SECS=8
ExecStart=/usr/local/sbin/tbnet-bringup.sh %i
TimeoutStartSec=90
[Install]
WantedBy=sys-subsystem-net-devices-%i.device
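
When tuning ATTEMPTS, DELAY, or SETTLE, keep the script's worst-case runtime below TimeoutStartSec, otherwise systemd stops the service before the last attempt. The polling loop sleeps DELAY seconds after every attempt, so the budget is roughly SETTLE + ATTEMPTS × DELAY. A small sketch of that arithmetic (the function name `worst_case_secs` is hypothetical, and command overhead is ignored):

```bash
# Hypothetical helper: worst-case runtime of the polling loop in seconds
# (initial settle plus one DELAY sleep per attempt); ignores command overhead.
worst_case_secs() {
    echo $(( $1 + $2 * $3 ))   # SETTLE + ATTEMPTS * DELAY
}

worst_case_secs 1 10 5   # defaults: prints 51, below TimeoutStartSec=90
worst_case_secs 1 12 5   # optional tuning from the unit file: prints 61
```

Both the defaults and the commented-out tuning example fit within the 90-second timeout with room to spare.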