Rebuilding and Recompressing a PDF with pdfimages, jbig2enc, jbig2topdf.py, and a Python Image-Replacement Script
This article explains a full workflow to:
- Extract images from an existing PDF (PDF A) using pdfimages from the Poppler suite,
- Convert all extracted images to JBIG2 format using jbig2enc,
- Assemble those JBIG2 images into a temporary JBIG2-only PDF (PDF B) with jbig2topdf.py, and
- Replace all raster images in the original PDF A with the JBIG2 images from PDF B using a Python script based on pypdf.
The end result is an output PDF that keeps the original structure of PDF A (bookmarks, annotations, vector graphics, etc.) while replacing its raster images with JBIG2-compressed images taken from PDF B, giving you both smaller size and better preservation of the original PDF’s logical structure.
pdfimages:
- Part of the Poppler utilities.
- Extracts embedded images from a PDF file without re-rendering pages.
- Suitable for scanned PDFs where each page is primarily an image.
jbig2enc:
- Encodes bitmap images into the JBIG2 format.
- Excellent for compressing monochrome (black-and-white) scanned pages.
- Supports lossy and lossless modes.
jbig2topdf.py:
- Helper script (typically shipped with jbig2enc or related projects).
- Combines a JBIG2 symbol dictionary and per-page JBIG2 streams into a single PDF (PDF B).
- Produces a simple, image-only PDF with one JBIG2 image per page.
Python image-replacement script (pypdf):
- Reads both PDF A and PDF B.
- Collects JBIG2 image XObjects from PDF B.
- Walks through pages in PDF A, replacing its raster image XObjects with JBIG2 XObjects from PDF B.
- Updates both page resources and content streams so that the new images are actually used when rendering.
You will need:
- A Unix-like environment (Linux, macOS, or WSL on Windows).
- Poppler utilities (pdfimages).
- jbig2enc and jbig2topdf.py.
- Python 3.
- pypdf Python library (installable via pip install pypdf).
Optionally verify tools:
pdfimages -v
jbig2 -h
python3 jbig2topdf.py --help
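python3 -c "import pypdf; print(pypdf.__version__)"
Extract all page images from the original PDF A:
pdfimages -tiff input.pdf img
Explanation:
- input.pdf is the original PDF A.
- img is the image name prefix; you will get files like img-000.tif, img-001.tif, etc.
- -tiff writes the extracted images as TIFF files. It does not convert them: the output is bitonal only where the image stored in the PDF is already 1-bit, while grayscale or color scans stay grayscale or color and must be binarized before or during JBIG2 encoding.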
After this step, you should have one raster image per page (for scanned documents).
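Because -tiff keeps each image's native bit depth, it can help to check which images are already bitonal before encoding. The following is a minimal sketch that parses the table printed by `pdfimages -list input.pdf`. The column layout in SAMPLE (page, num, type, width, height, color, comp, bpc, enc, ...) is assumed from typical Poppler output, the sample rows are invented for illustration, and the function names are hypothetical.

```python
# Sketch: find images that are not 1-bit, based on `pdfimages -list` output.
# SAMPLE is made-up illustrative data in the assumed column layout.
SAMPLE = """\
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    2480  3508  gray    1   1  ccitt  no         7  0   300   300 45.2K 4.3%
   2     1 image    2480  3508  rgb     3   8  jpeg   no        12  0   300   300  812K 3.2%
"""


def parse_pdfimages_list(text):
    """Turn the `pdfimages -list` table into a list of dicts."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a page number; the header and dashed rule do not.
        if len(fields) < 9 or not fields[0].isdigit():
            continue
        rows.append({
            "page": int(fields[0]),
            "num": int(fields[1]),
            "type": fields[2],
            "width": int(fields[3]),
            "height": int(fields[4]),
            "color": fields[5],
            "comp": int(fields[6]),
            "bpc": int(fields[7]),
            "enc": fields[8],
        })
    return rows


def non_bitonal(rows):
    """Return rows that are not single-channel, 1 bit per component."""
    return [r for r in rows if not (r["comp"] == 1 and r["bpc"] == 1)]


if __name__ == "__main__":
    for row in non_bitonal(parse_pdfimages_list(SAMPLE)):
        print(f"page {row['page']}: {row['color']} {row['bpc']} bpc ({row['enc']})")
```

In real use, the text would come from running `pdfimages -list` on PDF A, for example through subprocess.run with capture_output=True, instead of the SAMPLE string.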
Next, encode the TIFF images using jbig2enc. For typical lossy symbol-compression (good compression, adequate for many scanned texts):
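jbig2 -s -p -v -o output img-*.tif
Key options:
- -s: Symbol compression (segment repeated shapes, such as glyphs).
- -p: PDF mode; upstream jbig2enc documents this as "produce PDF ready data", meaning the streams are written in a form that can be embedded in a PDF. With -s, all input images are treated as one document that shares a symbol dictionary.
- -v: Verbose output.
- -o output: Use output as the base name. Upstream jbig2enc documents the base-name option as -b <basename>, so check the usage text of your build.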
This usually generates:
- output.jb2 – JBIG2 page-streams for all pages.
- output.sym (or similar) – global symbol dictionary.
For lossless encoding:
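jbig2 -lossless -p -v -o output img-*.tif
Lossless mode is safer for archival purposes but may not compress as much. If your build does not accept -lossless, note that upstream jbig2enc encodes losslessly with its generic region coder whenever -s is omitted.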
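Both encoding commands rely on the shell expanding img-*.tif in page order. Assuming pdfimages zero-pads its index to three digits (as in img-000.tif), a document with more than 1000 images would sort img-1000.tif before img-101.tif under plain string ordering. The sketch below orders the files by their numeric suffix instead. The helper names are hypothetical, and the jbig2 flags are simply copied from the command shown earlier.

```python
import re
from pathlib import Path

# Matches the numeric index in names such as "img-042.tif".
_INDEX = re.compile(r"-(\d+)\.tif$")


def page_order(paths):
    """Sort extracted TIFF files by numeric suffix, not as plain strings."""
    def key(path):
        match = _INDEX.search(Path(path).name)
        if match is None:
            raise ValueError(f"unexpected file name: {path}")
        return int(match.group(1))
    return sorted(paths, key=key)


def jbig2_command(paths, basename="output"):
    """Build the encoder argument list with inputs in page order."""
    # Flags mirror the article's command; adjust them to your jbig2 build.
    return ["jbig2", "-s", "-p", "-v", "-o", basename,
            *(str(p) for p in page_order(paths))]


if __name__ == "__main__":
    files = ["img-1000.tif", "img-101.tif", "img-002.tif"]
    print(sorted(files))      # string order
    print(page_order(files))  # numeric order
```

Passing the resulting list to subprocess.run avoids shell globbing entirely, so the page order no longer depends on how the shell sorts file names.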
Now convert the JBIG2 streams into a simple PDF B:
python3 jbig2topdf.py -o pdf_b.pdf output.sym output.jb2
Explanation:
- pdf_b.pdf is the new JBIG2-based PDF B.
- output.sym is the symbol dictionary.
- output.jb2 holds the per-page JBIG2 streams.
PDF B is typically an image-only PDF with one JBIG2 image per page and minimal additional structure.
Up to now, you have:
- PDF A: original document (input.pdf), with its full structure.
- PDF B: JBIG2-only document (pdf_b.pdf), one JBIG2 image per page.
The next step is to create a third PDF (call it PDF C) that uses the page layout and structure of PDF A but whose raster images are replaced by JBIG2 images from PDF B.
The replacement itself is performed by a Python script built on pypdf.
Conceptually, the script:
- Reads PDF B and collects all XObjects of subtype /Image that use /JBIG2Decode as a filter.
- Iterates over all pages in PDF A:
  - Copies the page object.
  - Looks at its /Resources → /XObject dictionary.
  - Finds all image XObjects on that page (in reversed order, to match the original script's behavior).
  - For each image, sequentially assigns the next JBIG2 XObject from PDF B.
  - Inserts the JBIG2 XObject into the page’s /XObject dictionary under a new name (/ImJB2_page_index_image_index).
  - Rewrites the page’s content stream(s), replacing usages of the old image name with the new one.
- Writes out a new PDF where, as far as the renderer is concerned, the pages are the same, but they now reference JBIG2 images instead of the original raster images.
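Two pieces of this logic, the renaming scheme and the content-stream rewrite, can be illustrated without any PDF library. This is a simplified sketch under stated assumptions, not the script itself. It works on an already decoded content stream given as bytes, uses a regular expression rather than a real PDF tokenizer (so it could also match inside string literals or inline image data), and assumes the image index in /ImJB2_page_index_image_index counts images in the reversed order mentioned above. The function names are hypothetical.

```python
import re

# A PDF name followed by the "Do" (paint XObject) operator, e.g. b"/Im0 Do".
_DO = re.compile(rb"(/[^\s/\[\]<>(){}%]+)(\s+Do)(?![^\s\[\]<>(){}/%])")


def plan_names(page_index, old_names):
    """Map each old image name to a new /ImJB2_<page>_<image> name.

    Names are visited in reversed order, following the description above.
    """
    return {
        old: f"/ImJB2_{page_index}_{i}"
        for i, old in enumerate(reversed(old_names))
    }


def rewrite_content(data, mapping):
    """Replace `/Old Do` with `/New Do` in a decoded content stream."""
    table = {k.encode("latin-1"): v.encode("latin-1") for k, v in mapping.items()}

    def swap(match):
        # Names that are not in the mapping are left untouched.
        return table.get(match.group(1), match.group(1)) + match.group(2)

    # A single pass avoids re-replacing a name that was just substituted.
    return _DO.sub(swap, data)


if __name__ == "__main__":
    stream = b"q 595 0 0 842 0 0 cm /Im0 Do Q\nq /Im10 Do Q\n"
    names = plan_names(0, ["/Im0"])
    print(rewrite_content(stream, names).decode("latin-1"))
```

Matching the whole name token before Do is what keeps /Im10 intact when only /Im0 is being replaced; a plain substring replacement would corrupt it. In a pypdf-based script the same substitution would more robustly be done on parsed operators rather than raw bytes.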
This approach preserves:
- Page count and page order.
- Vector graphics, text objects, annotations, bookmarks, and other higher-level PDF constructs.
- Layout and coordinates of images (because only the underlying XObjects are switched).
Save the script as replace_images_jbig2.py, then run:
python3 replace_images_jbig2.py input.pdf pdf_b.pdf output_jbig2.pdf
Where:
- input.pdf is your original PDF A (with structure you want to keep).
- pdf_b.pdf is the JBIG2-only PDF B created by jbig2topdf.py.
- output_jbig2.pdf is the final PDF C where images have been replaced by JBIG2.
If everything works correctly, output_jbig2.pdf should:
- Look visually similar (or identical) to input.pdf from a user perspective.
- Be significantly smaller than input.pdf, thanks to JBIG2 compression.
- Preserve all non-image content and structure from input.pdf.
Putting all steps together:
# 1. Extract page images from PDF A
pdfimages -tiff input.pdf img
# 2. Encode extracted images as JBIG2 (multi-page)
jbig2 -s -p -v -o output img-*.tif
# 3. Build PDF B from JBIG2 streams
python3 jbig2topdf.py -o pdf_b.pdf output.sym output.jb2
# 4. Replace images in PDF A with JBIG2 images from PDF B
python3 replace_images_jbig2.py input.pdf pdf_b.pdf output_jbig2.pdf
Result:
- input.pdf: original (unmodified).
- pdf_b.pdf: simple JBIG2-only PDF used as a source of compressed images.
- output_jbig2.pdf: final optimized PDF, combining PDF A’s structure with JBIG2-compressed images.
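The four steps can also be driven from one Python entry point that stops at the first failing command. This is a sketch only: run_pipeline and its run and find_tiffs parameters are hypothetical names, the command-line flags are copied unchanged from the listing above, and both jbig2topdf.py and replace_images_jbig2.py are assumed to be in the current directory.

```python
import glob
import subprocess
import sys


def run_pipeline(pdf_a, pdf_b, pdf_c, prefix="img", base="output",
                 run=None, find_tiffs=None):
    """Run extraction, encoding, PDF B assembly and replacement in order."""
    # check=True makes a non-zero exit status raise CalledProcessError,
    # so later steps never run on incomplete output.
    run = run or (lambda cmd: subprocess.run(cmd, check=True))
    find_tiffs = find_tiffs or (
        lambda: sorted(glob.glob(glob.escape(prefix) + "-*.tif"))
    )

    # 1. Extract page images from PDF A.
    run(["pdfimages", "-tiff", pdf_a, prefix])

    # The TIFF files only exist after step 1, so they are listed here.
    tiffs = find_tiffs()
    if not tiffs:
        raise RuntimeError("pdfimages produced no TIFF files")

    # 2. Encode extracted images as JBIG2.
    run(["jbig2", "-s", "-p", "-v", "-o", base, *tiffs])

    # 3. Build PDF B from the JBIG2 streams.
    run([sys.executable, "jbig2topdf.py", "-o", pdf_b,
         f"{base}.sym", f"{base}.jb2"])

    # 4. Replace images in PDF A with the JBIG2 images from PDF B.
    run([sys.executable, "replace_images_jbig2.py", pdf_a, pdf_b, pdf_c])


if __name__ == "__main__":
    run_pipeline("input.pdf", "pdf_b.pdf", "output_jbig2.pdf")
```

The run and find_tiffs hooks exist so the command sequence can be inspected or dry-run, for example by passing a list's append method as run, without any of the external tools installed.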
- Lossy JBIG2 can sometimes introduce glyph substitution errors (visually similar but incorrect characters).
- This can be critical for legal, financial, or archival documents.
- For such cases, prefer lossless mode or thoroughly validate results.
- The replacement script assumes that the number and order of images in PDF B correspond appropriately to the images in PDF A.
- If PDF A and B differ in page count or image count, the script will log that there are not enough or extra JBIG2 images.
- For complex PDFs with multiple images per page or mixed content, you may need to adapt the mapping logic.
- The final PDF (output_jbig2.pdf) is not automatically PDF/A-compliant.
- Additional tooling and validation are required if PDF/A is a requirement.
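The mapping assumption noted above (images in PDF B line up one-to-one, in order, with images in PDF A) can be checked before any replacement is attempted. The sketch below assumes you have already counted the image XObjects on each page of PDF A and the JBIG2 images in PDF B; how those counts are obtained is left out, and check_image_counts is a hypothetical helper rather than part of the replacement script.

```python
def check_image_counts(images_per_page, jbig2_available):
    """Compare images needed by PDF A with JBIG2 images supplied by PDF B.

    images_per_page: one integer per page of PDF A.
    jbig2_available: number of JBIG2 images collected from PDF B.
    """
    needed = sum(images_per_page)

    # Find the first page at which sequential assignment would run dry.
    first_short_page = None
    used = 0
    for page_number, count in enumerate(images_per_page, start=1):
        used += count
        if used > jbig2_available:
            first_short_page = page_number
            break

    return {
        "needed": needed,
        "available": jbig2_available,
        "shortage": max(0, needed - jbig2_available),
        "surplus": max(0, jbig2_available - needed),
        "first_short_page": first_short_page,
        # Pages where a simple one-image-per-page mapping would not hold.
        "pages_with_multiple_images": [
            n for n, c in enumerate(images_per_page, start=1) if c > 1
        ],
    }


if __name__ == "__main__":
    print(check_image_counts([1, 1, 2, 1], jbig2_available=4))
```

A non-zero shortage or surplus, or any entry in pages_with_multiple_images, is a signal to inspect the mapping by hand before trusting the output.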