Last active
July 13, 2025 19:36
A workaround for python_thai_ocr crashing on some of my PDFs. Run it inside a folder of images: it OCRs each image and appends the result to the output file passed as the script's argument. Example usage: ~/scripts/thai_images2ocr ocr.txt
#!/bin/bash
# first argument is the output file (abort with a usage message if it is missing)
output=${1:?usage: thai_images2ocr OUTPUT_FILE}
# remove the output file if it already exists
rm -f "$output"
# create the output file
touch "$output"
# page counter
i=0
# process each image in the current directory
# nocaseglob: match extensions case-insensitively; nullglob: skip patterns with no match
shopt -s nocaseglob
shopt -s nullglob
for f in *.{jpg,jpeg,tiff,bmp,png}
do
    # increase the page counter
    i=$((i+1))
    # process the image and save output to temporary file
    # uses https://github.com/nanonymoussu/python_thai_ocr
    python "$HOME/python_thai_ocr/main.py" "$f" temp
    # add page header
    echo -e "✧˖°─ .✦──── ・ 。゚⟡ ☽ Page ${i} ☾ ⟡ ˚。 ・ ────✦.─ °˖✧\n" >> "$output" # cute version (more at https://emojicombos.com/divider)
    # echo -e "===============Page ${i}===============\n" >> "$output" # normal version
    # add page content
    cat temp >> "$output"
    # add empty lines at end of page
    echo -e "\n\n" >> "$output"
done
# remove temporary file (-f: no error if no image was processed)
rm -f temp
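
One thing to be aware of with the loop: the brace pattern expands one extension at a time, and each group is sorted alphabetically, so `page10.png` would come before `page2.png`, and every `.jpg` would come before every `.png`. If page order matters, a possible fix is to collect the matches first and sort them naturally. The sketch below is not part of the original script: `list_images_sorted` is a hypothetical helper name, it assumes a `sort` that supports `-V` (version sort, as in GNU coreutils), and it assumes filenames contain no newlines.

```shell
#!/bin/bash
# Hypothetical helper (not in the original script): print the image files
# in the current directory, one per line, in natural page order.
list_images_sorted() {
    # Run in a subshell so the shopt changes do not leak to the caller.
    (
        shopt -s nocaseglob nullglob
        files=( *.{jpg,jpeg,tiff,bmp,png} )
        # Nothing matched: print nothing rather than an empty line.
        [ "${#files[@]}" -eq 0 ] && exit 0
        # Assumption: sort -V (version sort) is available, e.g. GNU coreutils.
        printf '%s\n' "${files[@]}" | sort -V
    )
}
```

With this in place, the `for f in ...` / `do` lines could become `while IFS= read -r f; do`, with `done < <(list_images_sorted)` closing the loop; the loop body stays the same.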