@thehale
Created March 9, 2026 22:09
Shell script to extract text from PDFs, spreadsheets, Word/PowerPoint files, and more
#!/usr/bin/env bash
# Self-contained script to extract text from a variety of file types
# Supported file types -- https://textract.readthedocs.io/en/stable/#currently-supporting
# Usage: ./textract <input-file> [-o <output.txt>]
set -euo pipefail
# Install uv if not available
if ! command -v uv &> /dev/null; then
  echo "Installing uv..."
  curl -LsSf https://astral.sh/uv/install.sh | sh
  export PATH="$HOME/.local/bin:$PATH"
fi
# Run textract using uv with Python 3.11 (installs textract if needed)
exec uv tool run --python 3.11 --from textract textract "$@"
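
The install step in the script relies on `command -v` to decide whether `uv` is already on `PATH`. The guard can be tried in isolation with the minimal sketch below. The helper name `have_cmd` and the placeholder tool name are assumptions made for illustration; neither appears in the original script.

```shell
#!/usr/bin/env bash
set -euo pipefail

# have_cmd is a hypothetical helper, not part of the original script.
# It returns 0 when the named command resolves on PATH, non-zero otherwise.
have_cmd() {
  command -v "$1" > /dev/null 2>&1
}

# A command that exists on practically every system.
if have_cmd sh; then
  echo "sh: found"
fi

# A placeholder name that should not exist; this is the branch in which
# the original script would run the uv installer.
if ! have_cmd definitely-not-a-real-tool-xyz; then
  echo "definitely-not-a-real-tool-xyz: missing"
fi
```

Because the check sits inside an `if` condition, a missing command does not trip `set -e`; the script simply falls into the install branch.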