Created
March 9, 2026 22:09
Shell script to extract text from PDFs, spreadsheets, Word/PowerPoint files, and more
#!/usr/bin/env bash
# Self-contained script to extract text from a variety of file types
# Supported file types -- https://textract.readthedocs.io/en/stable/#currently-supporting
# Usage: ./textract <input.pdf> -o <output.csv>

set -euo pipefail

# Install uv if not available
if ! command -v uv &> /dev/null; then
    echo "Installing uv..."
    curl -LsSf https://astral.sh/uv/install.sh | sh
    export PATH="$HOME/.local/bin:$PATH"
fi

# Run textract using uv with Python 3.11 (installs textract if needed)
exec uv tool run --python 3.11 --from textract textract "$@"
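
The script handles one input file per call. A minimal sketch of a batch wrapper follows, assuming the script is saved as `./textract` and marked executable. The `TEXTRACT_CMD` variable and the `extract_all` function are hypothetical names introduced here for illustration; neither is part of the gist. The only option used is `-o`, as shown in the usage comment.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumption: TEXTRACT_CMD names the gist script; override it if saved elsewhere.
TEXTRACT_CMD="${TEXTRACT_CMD:-./textract}"

# Hypothetical helper: extract text from every regular file in a directory.
# Usage: extract_all <input-dir> <output-dir>
extract_all() {
    local in_dir="$1" out_dir="$2"
    local f base
    mkdir -p "$out_dir"
    for f in "$in_dir"/*; do
        # Skip subdirectories and the unexpanded glob of an empty directory.
        [ -f "$f" ] || continue
        base="$(basename "$f")"
        # Keep the original extension in the output name so that
        # report.pdf and report.docx do not overwrite each other.
        "$TEXTRACT_CMD" "$f" -o "$out_dir/$base.txt"
    done
}
```

Calling `extract_all ./documents ./extracted` would then write one `.txt` file per input. Because the wrapper runs under `set -e`, it stops at the first file the extractor fails on; whether that is desirable depends on how mixed the input directory is.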