#!/bin/bash
# Method found here https://askubuntu.com/a/122604/423332
# Dependencies:
# On Ubuntu, you can install ocrodjvu and pdfbeads with:
#   sudo apt install ocrodjvu
#   gem install pdfbeads
# The path and filename given can only contain ASCII characters

f=$1

# Get filename, extension, and filename without extension
filename=$(basename -- "$f")
extension="${filename##*.}"
file_no_ext="${filename%.*}"

# Count number of pages
echo "f=$f"
p=$(djvused -e n "$f")
echo -e "The document contains $p pages.\n"

# Number of digits, used to zero-pad page numbers so the pg* files sort in page order
pp=${#p}

echo "###############################"
echo "### Extracting page by page ###"
echo "###############################"

# For each page, extract the text (as hOCR) and the image (as TIFF)
for i in $(seq 1 "$p")
do
    ii=$(printf "%0${pp}d" "$i")
    djvu2hocr -p "$i" "$f" | sed 's/ocrx/ocr/g' > "pg$ii.html"
    ddjvu -format=tiff -page="$i" "$f" "pg$ii.tiff"
done

echo ""
echo "##############################"
echo "### Building the final pdf ###"
echo "##############################"

# Build the final pdf from the pg* files in the current directory
pdfbeads > "$file_no_ext.pdf"

echo ""
echo "Done"

# Remove temp files
echo ""
read -p "Do you want to delete temp files? (pg*.html, pg*.tiff, pg*.bg.jpg) " -n 1 -r
echo # (optional) move to a new line
if [[ $REPLY =~ ^[Yy]$ ]]
then
    rm pg*.html pg*.tiff pg*.bg.jpg
fi
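The zero-padding in the loop is easy to overlook: without it, a file such as pg10.tiff sorts before pg2.tiff by name, and the build step presumably picks pages up in name order. Here is a minimal sketch of that one step in isolation. The helper name pad_page is hypothetical and not part of the script above.

```shell
#!/bin/bash
# pad_page is a hypothetical helper (not in the original script).
# It zero-pads a page number to the number of digits in the total page count.
pad_page() {
    local page=$1 total=$2
    local width=${#total}          # e.g. 120 pages -> width 3
    printf "%0${width}d\n" "$page"
}

pad_page 7 120     # prints 007
pad_page 42 120    # prints 042
```

With 120 pages, every name then has three digits (pg001 ... pg120), so sorting by name and sorting by page number agree.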
I packaged pdfbeads (with patches) to work on Debian without warnings (including the RMagick vs. rmagick thing) and with all the dependencies set to be pulled in. It should work on a sufficiently new Ubuntu version (I don't know how new, since I don't follow Ubuntu releases that closely). That being said, if I introduce that package in Debian, then getting it to work on Ubuntu should be relatively simple.
I can try to provide a precompiled version of it on a PPA that I have (where I keep other tools that I find useful).
In the meantime, the unfinished (but working) package is at: https://github.com/rbrito/pkg-pdfbeads
It works very well for me and I will try this script to see how well things go when we mix everything together.
Used it just now and everything worked perfectly. The only quirk was that I had to roll back RubyGems with gem update --system 3.0.8 to get rmagick to install properly and to stop the warning that the constant Gem::ConfigMap is deprecated (issue and fix discussed here).
> gem list rmagick iconv pdfbeads
*** LOCAL GEMS ***
rmagick (4.2.5, 2.16.0)
iconv (1.0.8)
pdfbeads (1.1.3)
Thanks so much for this helpful script!!
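Since most of the trouble reported here comes from installation rather than from the script itself, a quick check before running it can save some time. The sketch below only tests whether the four commands the script calls are on the PATH. The helper name missing_tools is hypothetical, and the check says nothing about whether the installed versions work together.

```shell
#!/bin/bash
# missing_tools is a hypothetical helper (not in the original script).
# It prints the name of every given command that is not found on the PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# The four external commands the conversion script calls.
missing=$(missing_tools djvused djvu2hocr ddjvu pdfbeads)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Missing tools:" $missing
fi
```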
I get :