This builds upon excellent foundational work by @scyto.
- Original TB4 research from @scyto: https://gist.github.com/scyto/76e94832927a89d977ea989da157e9dc
- My Original PVE 9 Writeup: https://gist.github.com/taslabs-net/9f6e06ab32833864678a4acbb6dc9131
Key contributions from @scyto's work:
- TB4 hardware detection and kernel module strategies
- Systemd networking and udev automation techniques
- MTU optimization and performance tuning approaches
Changelog - Fork updates (January 2025) by Yearly1825. Improvements and fixes based on PVE 9.0.10 testing:
- Fixed script error handling: Replaced problematic || syntax with proper if/then/else statements in interface bringup scripts (pve-en05.sh and pve-en06.sh) for more reliable error handling and retry logic
- Optimized Ceph networking: Changed configuration to use Thunderbolt network (10.100.0.0/24) for both public and cluster networks instead of split networks, resolving performance degradation issues discovered in production (ref: Proxmox forum thread #170091)
- Simplified network topology: Removed confusing and unnecessary /30 point-to-point subnet configuration - OpenFabric handles mesh routing automatically without manual subnet assignments
- Improved /etc/network/interfaces instructions: Changed from appending to file to properly inserting configuration above the source directive to prevent conflicts
- Enhanced script reliability: Added proper bash conditionals for more predictable behavior during interface initialization
This guide provides a step-by-step, tested setup for building a high-performance Thunderbolt 4 + Ceph cluster on Proxmox VE 9.
Lab Results:
- TB4 Mesh Performance: Sub-millisecond latency, 65520 MTU, full mesh connectivity
- Ceph Performance: 1,300+ MB/s write, 1,760+ MB/s read with optimizations
- Reliability: 0% packet loss, automatic failover, persistent configuration
- Integration: Full Proxmox GUI visibility and management
Hardware Environment:
- Nodes: 3x systems with dual TB4 ports (tested on MS01 mini-PCs)
- Memory: 64GB RAM per node (optimal for high-performance Ceph)
- CPU: 13th Gen Intel (or equivalent high-performance processors)
- Storage: NVMe drives for Ceph OSDs
- Network: TB4 mesh (10.100.0.0/24) + management (10.11.12.0/24)
Software Stack:
- Proxmox VE: 9.0 with native SDN OpenFabric support
- Ceph: Reef with BlueStore, LZ4 compression, 2:1 replication
- OpenFabric: IPv4-only mesh routing for simplicity and performance
- 3 nodes minimum: Each with dual TB4 ports (tested with MS01 mini-PCs)
- TB4 cables: Quality TB4 cables for mesh connectivity
- Ring topology: Physical connections n2→n3→n4→n2 (or similar mesh pattern)
- Management network: Standard Ethernet for initial setup and management
- Proxmox VE 9.0
- Root access to all nodes
- Basic Linux networking knowledge
- Patience: TB4 mesh setup requires careful attention to detail!
- Management network: 10.11.12.0/24 (adjust to your environment)
- TB4 cluster network: 10.100.0.0/24 (for Ceph cluster traffic)
- Router IDs: 10.100.0.12 (n2), 10.100.0.13 (n3), 10.100.0.14 (n4)
Critical: Perform these steps on ALL mesh nodes (n2, n3, n4).
Load TB4 kernel modules:
# Execute on each node:
echo 'thunderbolt' >> /etc/modules
echo 'thunderbolt-net' >> /etc/modules
modprobe thunderbolt && modprobe thunderbolt-net

Verify modules loaded:

lsmod | grep thunderbolt

Expected output: Both thunderbolt and thunderbolt_net modules present.
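Optionally, the module list can be kept in its own drop-in instead of appending to /etc/modules; a minimal sketch (the drop-in filename is arbitrary):

# Equivalent persistent module loading via modules-load.d:
cat > /etc/modules-load.d/thunderbolt.conf << 'EOF'
thunderbolt
thunderbolt-net
EOF
systemctl restart systemd-modules-load.service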
Find TB4 controllers and interfaces:
lspci | grep -i thunderbolt
ip link show | grep -E '(en0[5-9]|thunderbolt)'

Expected: TB4 PCI controllers detected, TB4 network interfaces visible.
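If no TB4 network interfaces appear yet, it can help to confirm that the Thunderbolt controller and cable links enumerated at all; a quick sketch (boltctl is only present if the bolt package is installed):

# Kernel messages from the thunderbolt driver:
dmesg | grep -i thunderbolt
# Optional, requires the bolt package:
boltctl list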
Critical: Create interface renaming rules based on PCI paths for consistent naming.
# Create systemd link file for first TB4 interface:
cat > /etc/systemd/network/00-thunderbolt0.link << 'EOF'
[Match]
Path=pci-0000:00:0d.2
Driver=thunderbolt-net
[Link]
MACAddressPolicy=none
Name=en05
EOF
# Create systemd link file for second TB4 interface:
cat > /etc/systemd/network/00-thunderbolt1.link << 'EOF'
[Match]
Path=pci-0000:00:0d.3
Driver=thunderbolt-net
[Link]
MACAddressPolicy=none
Name=en06
EOF

Note: Adjust the PCI paths if they differ on your hardware (check with lspci | grep -i thunderbolt).
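To confirm which PCI path belongs to which interface on your hardware, udev can be queried directly; a sketch assuming the interfaces still carry their pre-rename names (typically thunderbolt0/thunderbolt1):

# Print the udev ID_PATH and driver for each TB4 network interface:
for nic in /sys/class/net/thunderbolt* /sys/class/net/en0[56]; do
  [ -e "$nic" ] || continue
  echo "== ${nic##*/} =="
  udevadm info -q property -p "$nic" | grep -E 'ID_PATH|ID_NET_DRIVER'
done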
Add TB4 interfaces to network configuration with optimal settings:
vim /etc/network/interfaces

Add the following above the line source /etc/network/interfaces.d/* :
auto en05
iface en05 inet manual
mtu 65520
auto en06
iface en06 inet manual
mtu 65520
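The new stanzas can be sanity-checked straight away; a sketch assuming ifupdown2 (the default on Proxmox VE), which parses the configuration even before the en05/en06 names exist:

# Show the parsed configuration for the TB4 interfaces:
ifquery en05
ifquery en06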
Required for systemd link files to work:
systemctl enable systemd-networkd
systemctl start systemd-networkd

Automation for reliable interface bringup on cable insertion:
Create udev rules:
cat > /etc/udev/rules.d/10-tb-en.rules << 'EOF'
ACTION=="add|move", SUBSYSTEM=="net", KERNEL=="en05", RUN+="/usr/local/bin/pve-en05.sh"
ACTION=="add|move", SUBSYSTEM=="net", KERNEL=="en06", RUN+="/usr/local/bin/pve-en06.sh"
EOF

Create en05 bringup script:
cat > /usr/local/bin/pve-en05.sh << 'EOF'
#!/bin/bash
LOGFILE="/tmp/udev-debug.log"
echo "$(date): en05 bringup triggered" >> "$LOGFILE"
for i in {1..5}; do
if ip link set en05 up mtu 65520; then
echo "$(date): en05 up successful on attempt $i" >> "$LOGFILE"
break
else
echo "$(date): Attempt $i failed, retrying in 3 seconds..." >> "$LOGFILE"
sleep 3
fi
done
EOF
chmod +x /usr/local/bin/pve-en05.sh

Create en06 bringup script:
cat > /usr/local/bin/pve-en06.sh << 'EOF'
#!/bin/bash
LOGFILE="/tmp/udev-debug.log"
echo "$(date): en06 bringup triggered" >> "$LOGFILE"
for i in {1..5}; do
if ip link set en06 up mtu 65520; then
echo "$(date): en06 up successful on attempt $i" >> "$LOGFILE"
break
else
echo "$(date): Attempt $i failed, retrying in 3 seconds..." >> "$LOGFILE"
sleep 3
fi
done
EOF
chmod +x /usr/local/bin/pve-en06.sh

Apply all TB4 configuration changes:
# Update initramfs:
update-initramfs -u -k all
# Reboot to apply changes:
reboot

After reboot, verify TB4 interfaces:

ip link show | grep -E '(en05|en06)'

Expected result: TB4 interfaces should be named en05 and en06 with proper MTU settings.
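If a rename or bringup did not take effect, the udev rules can be re-exercised without replugging cables, and the MTU checked directly; a minimal sketch:

# Reload udev rules and replay "add" events for network interfaces:
udevadm control --reload
udevadm trigger --subsystem-match=net --action=add
# Confirm the renamed interfaces carry the expected MTU:
ip -d link show en05 | grep -o 'mtu [0-9]*'
ip -d link show en06 | grep -o 'mtu [0-9]*'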
Essential: TB4 mesh requires IPv4 forwarding for OpenFabric routing.
echo 'net.ipv4.ip_forward=1' >> /etc/sysctl.conf
sysctl -p

Verify forwarding enabled:

sysctl net.ipv4.ip_forward

Expected: net.ipv4.ip_forward = 1
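Appending to /etc/sysctl.conf works; an equivalent, easier-to-track alternative is a drop-in under /etc/sysctl.d/ (a sketch, filename arbitrary):

cat > /etc/sysctl.d/99-tb4-forwarding.conf << 'EOF'
net.ipv4.ip_forward = 1
EOF
# Load all sysctl configuration files:
sysctl --system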
Ensure TB4 interfaces come up automatically on boot:
Create systemd service:
cat > /etc/systemd/system/thunderbolt-interfaces.service << 'EOF'
[Unit]
Description=Configure Thunderbolt Network Interfaces
After=network.target thunderbolt.service
Wants=network.target
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/local/bin/thunderbolt-startup.sh
[Install]
WantedBy=multi-user.target
EOF

Create startup script:
cat > /usr/local/bin/thunderbolt-startup.sh << 'EOF'
#!/bin/bash
# Thunderbolt interface startup script
LOGFILE="/var/log/thunderbolt-startup.log"
echo "$(date): Starting Thunderbolt interface configuration" >> "$LOGFILE"
# Wait up to 30 seconds for interfaces to appear
for i in {1..30}; do
if ip link show en05 &>/dev/null && ip link show en06 &>/dev/null; then
echo "$(date): Thunderbolt interfaces found" >> "$LOGFILE"
break
fi
echo "$(date): Waiting for Thunderbolt interfaces... ($i/30)" >> "$LOGFILE"
sleep 1
done
# Configure interfaces if they exist
if ip link show en05 &>/dev/null; then
/usr/local/bin/pve-en05.sh
echo "$(date): en05 configured" >> "$LOGFILE"
fi
if ip link show en06 &>/dev/null; then
/usr/local/bin/pve-en06.sh
echo "$(date): en06 configured" >> "$LOGFILE"
fi
echo "$(date): Thunderbolt configuration completed" >> "$LOGFILE"
EOF
chmod +x /usr/local/bin/thunderbolt-startup.sh

Enable the service:
systemctl daemon-reload
systemctl enable thunderbolt-interfaces.service

Location: Datacenter → SDN → Fabrics
- Click: "Add Fabric" → "OpenFabric"
- Configure in the dialog:
- Name:
tb4 - IPv4 Prefix:
10.100.0.0/24 - IPv6 Prefix: (leave empty for IPv4-only)
- Hello Interval:
3(default) - CSNP Interval:
10(default)
- Name:
- Click: "OK"
Expected result: You should see a fabric named tb4 with Protocol OpenFabric and IPv4 10.100.0.0/24
Still in: Datacenter → SDN → Fabrics → (select tb4 fabric)
- Click: "Add Node"
- Configure for n2:
- Node:
n2 - IPv4:
10.100.0.12 - IPv6: (leave empty)
- Interfaces: Select
en05anden06from the interface list
- Node:
- Click: "OK"
- Repeat for n3: IPv4:
10.100.0.13, interfaces:en05, en06 - Repeat for n4: IPv4:
10.100.0.14, interfaces:en05, en06
Note: This fork no longer requires manual /30 point-to-point addresses on en05/en06 - OpenFabric handles mesh routing automatically (see the changelog above). For reference, the original guide assigned:
- n2: en05 10.100.0.1/30, en06 10.100.0.5/30
- n3: en05 10.100.0.9/30, en06 10.100.0.13/30
- n4: en05 10.100.0.17/30, en06 10.100.0.21/30
Critical: This activates the mesh - nothing works until you apply!
In GUI: Datacenter → SDN → "Apply" (button in top toolbar)
Expected result: Status table shows all nodes with "OK" status
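The configuration generated by the apply can also be inspected from the CLI; a sketch (the SDN apply writes interface stanzas to /etc/network/interfaces.d/sdn, and the fabric's routing configuration is assumed here to land in FRR's standard config file):

# Interface stanzas generated by the SDN apply:
cat /etc/network/interfaces.d/sdn
# FRR configuration generated for the fabric:
cat /etc/frr/frr.conf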
Critical: OpenFabric routing requires FRR (Free Range Routing) to be running.
systemctl start frr
systemctl enable frr

Verify FRR is running:

systemctl status frr | grep Active

Expected output: Active: active (running)
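With FRR up, the OpenFabric adjacencies can be inspected through FRR's shell; a sketch assuming the fabricd daemon was enabled by the SDN apply:

# One neighbor per connected TB4 link is expected:
vtysh -c 'show openfabric neighbor'
# Learned mesh topology and the routes OpenFabric installed:
vtysh -c 'show openfabric topology'
vtysh -c 'show ip route openfabric'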
Check TB4 interfaces are up with correct settings:
ip addr show | grep -E '(en05|en06|10\.100\.0\.)'

Critical test: Verify full mesh communication works.
# Test router ID connectivity from current node:
ping -c 3 10.100.0.12
ping -c 3 10.100.0.13
ping -c 3 10.100.0.14

Expected: All pings succeed with sub-millisecond latency (~0.6ms)
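To confirm the 65520 MTU actually survives the mesh path (not just basic reachability), a do-not-fragment ping at near-full payload can be used; a sketch (65492 = 65520 minus 20 bytes IPv4 header and 8 bytes ICMP header):

ping -c 3 -M do -s 65492 10.100.0.13
ping -c 3 -M do -s 65492 10.100.0.14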
If connectivity fails: TB4 interfaces may need manual bring-up:
ip link set en05 up mtu 65520
ip link set en06 up mtu 65520
ifreload -a

Install Ceph packages:
pveceph install --repository no-subscription

Essential: Proper directory structure and ownership:
mkdir -p /var/lib/ceph && chown ceph:ceph /var/lib/ceph
mkdir -p /etc/ceph && chown ceph:ceph /etc/ceph

On the first node (n2) only:

pveceph mon create

Verify monitor creation:

ceph -s

Set public and cluster networks for optimal TB4 performance:
This fork uses the Thunderbolt network (10.100.0.0/24) for both the public and cluster networks:
ceph config set global public_network 10.100.0.0/24
ceph config set global cluster_network 10.100.0.0/24
ceph config set mon public_network 10.100.0.0/24
ceph config set mon cluster_network 10.100.0.0/24

The original guide used a split public/cluster network, which led to degraded performance (https://forum.proxmox.com/threads/low-ceph-performance-on-3-node-proxmox-9-cluster-with-sata-ssds.170091/). For reference, the original (slower) configuration was:

ceph config set global public_network 10.11.12.0/24
ceph config set global cluster_network 10.100.0.0/24
ceph config set mon public_network 10.11.12.0/24
ceph config set mon cluster_network 10.100.0.0/24
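The values actually in effect can be confirmed before adding the remaining monitors; a quick sketch:

# Both networks should now point at the TB4 mesh:
ceph config get mon public_network
ceph config get mon cluster_network
ceph config dump | grep -E 'public_network|cluster_network'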
On n3 and n4 nodes:
pveceph mon create

Verify 3-monitor quorum:

ceph quorum_status

Create OSDs on NVMe drives (adjust device names as needed):
# Create two OSDs per node:
pveceph osd create /dev/nvme0n1
pveceph osd create /dev/nvme1n1

Verify all OSDs are up:

ceph osd tree

# Set OSD memory target to 8GB per OSD:
ceph config set osd osd_memory_target 8589934592
# Set BlueStore cache sizes for NVMe performance:
ceph config set osd bluestore_cache_size_ssd 4294967296
# Set memory allocation optimizations:
ceph config set osd osd_memory_cache_min 1073741824
ceph config set osd osd_memory_cache_resize_interval 1

# Set CPU threading optimizations:
ceph config set osd osd_op_num_threads_per_shard 2
ceph config set osd osd_op_num_shards 8
# Set BlueStore threading for NVMe:
ceph config set osd bluestore_sync_submit_transaction false
ceph config set osd bluestore_throttle_bytes 268435456
ceph config set osd bluestore_throttle_deferred_bytes 134217728
# Set CPU-specific optimizations:
ceph config set osd osd_client_message_cap 1000
ceph config set osd osd_client_message_size_cap 1073741824

# Set network optimizations for TB4 mesh:
ceph config set global ms_tcp_nodelay true
ceph config set global ms_tcp_rcvbuf 134217728
ceph config set global ms_tcp_prefetch_max_size 65536
# Set cluster network optimizations:
ceph config set global ms_cluster_mode crc
ceph config set global ms_async_op_threads 8
ceph config set global ms_dispatch_throttle_bytes 1073741824
# Set heartbeat optimizations:
ceph config set osd osd_heartbeat_interval 6
ceph config set osd osd_heartbeat_grace 20

# Set BlueStore optimizations for NVMe drives:
ceph config set osd bluestore_compression_algorithm lz4
ceph config set osd bluestore_compression_mode aggressive
ceph config set osd bluestore_compression_required_ratio 0.7
# Set NVMe-specific optimizations:
ceph config set osd bluestore_cache_trim_interval 200
# Set WAL and DB optimizations:
ceph config set osd bluestore_block_db_size 5368709120
ceph config set osd bluestore_block_wal_size 1073741824

# Set scrubbing optimizations:
ceph config set osd osd_scrub_during_recovery false
ceph config set osd osd_scrub_begin_hour 2
ceph config set osd osd_scrub_end_hour 6
# Set deep scrub optimizations:
ceph config set osd osd_deep_scrub_interval 1209600
ceph config set osd osd_scrub_max_interval 1209600
ceph config set osd osd_scrub_min_interval 86400
# Set recovery optimizations:
ceph config set osd osd_recovery_max_active 8
ceph config set osd osd_max_backfills 4
ceph config set osd osd_recovery_op_priority 1

# Create pool with optimal PG count for 6 OSDs:
ceph osd pool create cephtb4 256 256
# Set 2:1 replication ratio:
ceph osd pool set cephtb4 size 2
ceph osd pool set cephtb4 min_size 1
# Enable RBD application:
ceph osd pool application enable cephtb4 rbd

Check overall cluster status:

ceph -s

Expected results:
- Health: HEALTH_OK
- OSDs: 6 osds: 6 up, 6 in
- PGs: All PGs active+clean
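Before benchmarking, the pool parameters can be double-checked; a quick sketch:

# Replication settings and placement group health for the new pool:
ceph osd pool get cephtb4 size
ceph osd pool get cephtb4 min_size
ceph pg stat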
# Test write performance:
rados -p cephtb4 bench 10 write --no-cleanup -b 4M -t 16
# Test read performance:
rados -p cephtb4 bench 10 rand -t 16
# Clean up test data:
rados -p cephtb4 cleanup

Expected Results:
- Write Performance: ~1,300 MB/s average, 2,000+ MB/s peak
- Read Performance: ~1,760 MB/s average, 2,400+ MB/s peak
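rados bench measures the raw object layer; for a number closer to what VMs see, an RBD-level benchmark against a temporary image is a reasonable cross-check. A sketch (the image name benchtest is arbitrary):

# Create a temporary image, benchmark 4M writes with 16 threads, then remove it:
rbd create cephtb4/benchtest --size 10G
rbd bench --io-type write --io-size 4M --io-threads 16 --io-total 10G cephtb4/benchtest
rbd rm cephtb4/benchtest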
# Network tuning:
echo 'net.core.rmem_max = 268435456' >> /etc/sysctl.conf
echo 'net.core.wmem_max = 268435456' >> /etc/sysctl.conf
echo 'net.core.netdev_max_backlog = 30000' >> /etc/sysctl.conf
# Memory tuning:
echo 'vm.swappiness = 1' >> /etc/sysctl.conf
echo 'vm.min_free_kbytes = 4194304' >> /etc/sysctl.conf
# Apply settings:
sysctl -p

Problem: TB4 interfaces not coming up after reboot
Quick Fix: Manually bring up interfaces:
ip link set en05 up mtu 65520
ip link set en06 up mtu 65520
ifreload -a

Permanent Fix: Check the systemd service:
systemctl status thunderbolt-interfaces.service
# Check if scripts are corrupted:
wc -l /usr/local/bin/pve-en*.sh
# Check for shebang errors:
head -1 /usr/local/bin/*.sh | grep -E 'thunderbolt|pve-en'
# Fix shebang if corrupted:
sed -i '1s/#\\!/#!/' /usr/local/bin/thunderbolt-startup.sh
sed -i '1s/#\\!/#!/' /usr/local/bin/pve-en05.sh
sed -i '1s/#\\!/#!/' /usr/local/bin/pve-en06.sh

Problem: Mesh connectivity fails between nodes
# Check interface status:
ip addr show | grep -E '(en05|en06|10\.100\.0\.)'
# Verify FRR routing service:
systemctl status frr

Problem: OSDs going down after creation
# Restart OSD services after fixing mesh:
systemctl restart ceph-osd@*.service

Problem: Ceph cluster shows OSDs down after reboot
# 1. Bring up TB4 interfaces:
/usr/local/bin/pve-en05.sh
/usr/local/bin/pve-en06.sh
# 2. Wait for interfaces to stabilize:
sleep 10
# 3. Restart Ceph OSDs:
systemctl restart ceph-osd@*.service
# 4. Monitor recovery:
watch ceph -s

Problem: Inactive PGs or slow performance
# Check cluster status:
ceph -s
# Verify optimizations are applied:
ceph config dump | grep -E '(memory_target|cache_size|compression)'
# Check network binding:
ceph config get osd cluster_network
ceph config get osd public_network

Changes from the original guide:
- Cleaned up commands for direct terminal execution (no SSH wrappers)
- Fixed formatting and organization
- Updated Ceph version references (Nautilus → Reef)
- Clarified step-by-step execution flow