@Mixermachine
Created January 3, 2026 15:33
Dying ZFS raid pool rescue

Hi everyone,

I recently reactivated my NAS setup. It consists of an Intel N100 and four Western Digital Red CMR drives in a RAID-Z2 (comparable to RAID 6: two parity drives). The software is TrueNAS.

Two of the drives had already degraded pretty badly (read, write, and checksum errors), and the others were also throwing checksum errors. Multiple files were already corrupted, and attempting to copy a broken file from the pool would stall the whole copy operation.

So I needed a robust script that attempts to copy files one by one.
The script uses Python 3 and rsync to copy files individually from a folder to a destination via SSH. It worked well for me: I could rescue all important, non-corrupted files (old photos) from the pool.

What does the script do?

The script is executed in the TrueNAS shell and picks files from a specified folder one by one. It checks whether rsync produces steady progress output. If there is silence for more than 15 seconds, we assume the pool can no longer read the file and kill the transfer. We retry up to three times; if the file still cannot be read, we add it to the bad-file log. When the script is restarted, the bad-file log is read from disk and known bad files are skipped.
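The silence-watchdog idea above can be sketched in isolation. This is a minimal version for illustration only (the function name, the demo command, and the default threshold are my own choices; the full script below adds retries, the rsync command line, and the bad-file log):

```python
import select
import subprocess
import time

def wait_with_watchdog(cmd, silence_threshold=1.0):
    """Run cmd; kill it if it prints nothing for silence_threshold seconds."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    last_activity = time.time()
    while proc.poll() is None:
        # Wait up to 0.2 s for output on stdout (non-blocking check)
        ready, _, _ = select.select([proc.stdout.fileno()], [], [], 0.2)
        if ready and proc.stdout.readline():
            last_activity = time.time()  # progress seen: reset the timer
        if time.time() - last_activity > silence_threshold:
            proc.kill()   # silent too long: assume the read is stuck
            proc.wait()
            return False
    return proc.returncode == 0

# A chatty process finishes normally; a silent one gets killed:
wait_with_watchdog(["echo", "hi"])                        # True
wait_with_watchdog(["sleep", "3"], silence_threshold=0.5) # False, after ~0.5 s
```

The key point is that the timer resets on every line of output, so a large file that transfers slowly but steadily is never killed; only a transfer that goes completely silent trips the watchdog.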

How to use the script?

I wrapped the script in a cat-to-file command.
Copy everything into an editor, change the source folder and the destination, and paste the full content into a shell window. Once the file has been written, run it with

python3 rescue_folder.py

Disclaimer: the script was partially created with AI. I read and verified every line and improved some pieces.

cat << 'EOF' > rescue_folder.py
import os
import select
import subprocess
import sys
import time

# ================= CONFIGURATION =================
# EDIT THIS
SOURCE_DIR = "/SOURCE/DIR/"
REMOTE_DEST = "DESTINATION_USER@DESTINATION_IP:/DESTINATION/DIR/"
BAD_FILES_LOG = "bad_files.txt"
MAX_RETRIES = 3
# WATCHDOG TIMEOUT:
# If rsync prints NOTHING for this many seconds, we assume it's stuck and kill it.
# Large files are safe because rsync prints progress updates constantly.
SILENCE_THRESHOLD = 15
# =================================================


def load_bad_files():
    if not os.path.exists(BAD_FILES_LOG):
        return set()
    with open(BAD_FILES_LOG, 'r') as f:
        return set(line.strip() for line in f if line.strip())


def log_bad_file(relative_path):
    try:
        with open(BAD_FILES_LOG, 'a') as f:
            f.write(f"{relative_path}\n")
        print(f"  [!] Marked as bad: {relative_path}")
    except Exception as e:
        print(f"  [!] Failed to write to bad files log: {e}")


def rsync_file_watchdog(local_path, relative_path):
    ssh_cmd = "ssh -o ConnectTimeout=5 -o ServerAliveInterval=5"
    cmd = [
        "rsync",
        "-avz",
        "--progress",
        "--partial",
        "--relative",
        f"--rsh={ssh_cmd}",
        relative_path,
        REMOTE_DEST
    ]
    # Start the process in the background.
    # stderr is merged into stdout so an unread stderr pipe cannot
    # fill up and deadlock the transfer.
    process = subprocess.Popen(
        cmd,
        cwd=SOURCE_DIR,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        bufsize=1
    )
    last_activity = time.time()
    try:
        # Loop while the process is running
        while True:
            # Check if the process finished
            if process.poll() is not None:
                break
            # Check for output on stdout (non-blocking)
            reads = [process.stdout.fileno()]
            ret = select.select(reads, [], [], 1.0)  # Wait up to 1 sec for output
            if ret[0]:
                # We have data! Read it to keep the buffer clear
                line = process.stdout.readline()
                if line:
                    last_activity = time.time()  # RESET TIMER
            # Check watchdog
            if time.time() - last_activity > SILENCE_THRESHOLD:
                print(f"  [WATCHDOG] No output for {SILENCE_THRESHOLD}s. Killing...")
                process.kill()
                process.wait()
                return False
        # Process finished naturally. Check the exit code.
        return process.returncode == 0
    except Exception as e:
        print(f"  [ERROR] {e}")
        try:
            process.kill()
            process.wait()
        except Exception:
            pass
        return False
    finally:
        # Close the stream to prevent resource leaks
        process.stdout.close()


def main():
    print("--- Starting Rescue Sync v3 (Smart Watchdog) ---")
    print(f"Source: {SOURCE_DIR}")
    print(f"Target: {REMOTE_DEST}")
    bad_files = load_bad_files()
    print(f"Loaded {len(bad_files)} previously skipped files.\n")
    for root, dirs, files in os.walk(SOURCE_DIR):
        for filename in files:
            full_path = os.path.join(root, filename)
            relative_path = os.path.relpath(full_path, SOURCE_DIR)
            if relative_path in bad_files:
                print(f"Skipping known bad file: {relative_path}")
                continue
            success = False
            for attempt in range(1, MAX_RETRIES + 1):
                print(f"Syncing: {relative_path} (Attempt {attempt}/{MAX_RETRIES})...", end="\r")
                if rsync_file_watchdog(full_path, relative_path):
                    success = True
                    print(f"OK: {relative_path}" + " " * 20)
                    break
                else:
                    print(f"FAIL: {relative_path} (Attempt {attempt}) - waiting 2s...")
                    time.sleep(2)
            if not success:
                print(f"GIVING UP: {relative_path}")
                log_bad_file(relative_path)
                bad_files.add(relative_path)


if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt:
        print("\n\nScript stopped by user.")
        sys.exit(0)
EOF
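The resume behaviour (known bad files are skipped after a restart) can be exercised in isolation. This is a standalone sketch, not the script's exact code: the helpers take the log path as an argument and the file names are made up, but the log format (one relative path per line) matches the script above:

```python
import os
import tempfile

def load_bad_files(log_path):
    """Read the bad-file log into a set; missing log means nothing to skip."""
    if not os.path.exists(log_path):
        return set()
    with open(log_path) as f:
        return set(line.strip() for line in f if line.strip())

def log_bad_file(log_path, relative_path):
    """Append one unreadable file to the log."""
    with open(log_path, "a") as f:
        f.write(f"{relative_path}\n")

# First run: two files fail and get logged
log = os.path.join(tempfile.mkdtemp(), "bad_files.txt")
log_bad_file(log, "photos/2019/img_0001.jpg")
log_bad_file(log, "photos/2019/img_0002.jpg")

# "Restarted" run: the log is reloaded and both files would be skipped
bad = load_bad_files(log)
print(sorted(bad))  # ['photos/2019/img_0001.jpg', 'photos/2019/img_0002.jpg']
```

Because the log is plain text, you can also edit it by hand, e.g. remove a line to force the script to retry that file on the next run.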