aont/0 recompressing a pdf.md

## 0 recompressing a pdf.md

      
Rebuilding and Recompressing a PDF with pdfimages, jbig2enc, jbig2topdf.py, and a Python Image-Replacement Script

This article explains a full workflow to:

Extract images from an existing PDF (PDF A) using pdfimages from the Poppler suite,
Convert all extracted images to JBIG2 format using jbig2enc,
Assemble those JBIG2 images into a temporary JBIG2-only PDF (PDF B) with jbig2topdf.py, and
Replace all raster images in the original PDF A with the JBIG2 images from PDF B using a Python script based on pypdf.

The end result is an output PDF that keeps the page-level structure of PDF A (vector graphics, text, annotations, etc.) while replacing its raster images with JBIG2-compressed images taken from PDF B. This gives you a smaller file without flattening the original pages into plain images.

1. Overview of the Tools

pdfimages (Poppler)


Part of the Poppler utilities.
Extracts embedded images from a PDF file without re-rendering pages.
Suitable for scanned PDFs where each page is primarily an image.

jbig2enc


Encodes bitmap images into the JBIG2 format.
Excellent for compressing monochrome (black-and-white) scanned pages.
Supports lossy and lossless modes.

jbig2topdf.py


Helper script (typically shipped with jbig2enc or related projects).
Combines a JBIG2 symbol dictionary and per-page JBIG2 streams into a single PDF (PDF B).
Produces a simple, image-only PDF with one JBIG2 image per page.

Python + pypdf (image replacement script)


Reads both PDF A and PDF B.
Collects JBIG2 image XObjects from PDF B.
Walks through pages in PDF A, descending into Form XObjects, and replaces raster image XObjects with JBIG2 XObjects from PDF B.
Overwrites each entry in the /XObject resource dictionary under its existing name, so the content streams keep working without modification.


2. Prerequisites

You will need:

A Unix-like environment (Linux, macOS, or WSL on Windows).
Poppler utilities (pdfimages).
jbig2enc and jbig2topdf.py.
Python 3.
pypdf Python library (installable via pip install pypdf).

Optionally verify tools:
pdfimages -v
jbig2 -h
python3 jbig2topdf.py --help
python3 -c "import pypdf; print(pypdf.__version__)"
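
The same checks can be scripted. The following is a minimal sketch, assuming the executables are named pdfimages and jbig2 and are on your PATH; the function name check_tools is hypothetical. It only reports whether each item can be found and does not validate versions.

```python
import importlib.util
import shutil


def check_tools(executables=("pdfimages", "jbig2"), modules=("pypdf",)):
    """Return a dict mapping each required tool or module to True/False."""
    status = {}
    for exe in executables:
        # shutil.which returns None when the executable is not on PATH
        status[exe] = shutil.which(exe) is not None
    for mod in modules:
        # find_spec returns None when a top-level module cannot be located
        status[mod] = importlib.util.find_spec(mod) is not None
    return status


if __name__ == "__main__":
    for name, found in check_tools().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

jbig2topdf.py is not covered here because it is usually run from a known path rather than looked up on PATH.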

3. Step 1 – Extract Images from PDF A with pdfimages

Extract all page images from the original PDF A:
pdfimages -tiff input.pdf img
Explanation:

input.pdf is the original PDF A.
img is the image name prefix; you will get files like img-000.tif, img-001.tif, etc.
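-tiff writes the extracted images as TIFF files, a format jbig2enc can read. It does not convert anything to bitonal: color and grayscale images are extracted as they are.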

After this step, you should have one raster image per page (for scanned documents).
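
One detail worth checking before encoding is file order. A shell glob such as img-*.tif expands in lexicographic order, which matches page order only while every counter has the same number of digits. Assuming the three-digit padding seen in the file names above, a document with more than 1000 images would sort img-1000.tif before img-999.tif. The sketch below (helper names are hypothetical) orders files by their numeric counter instead.

```python
import re
from pathlib import Path


def numeric_key(path):
    """Sort key: the last run of digits in the file stem, as an integer."""
    digits = re.findall(r"\d+", Path(path).stem)
    return int(digits[-1]) if digits else -1


def ordered_images(names):
    """Return file names ordered by their numeric counter."""
    return sorted(names, key=numeric_key)


names = ["img-1000.tif", "img-999.tif", "img-002.tif"]
print(ordered_images(names))  # ['img-002.tif', 'img-999.tif', 'img-1000.tif']
```

The ordered list can then be passed to the encoder explicitly instead of relying on the glob.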

4. Step 2 – Encode Extracted Images to JBIG2 with jbig2enc

Next, encode the TIFF images using jbig2enc. For typical lossy symbol-compression (good compression, adequate for many scanned texts):
jbig2 -s -p -v -b output img-*.tif
Key options (as documented for the upstream jbig2enc; verify with jbig2 -h on your build):

-s: Symbol compression (repeated shapes, such as glyphs, are stored once and reused).
-p: Produce PDF-ready data, i.e. streams suitable for embedding in a PDF.
-v: Verbose output.
-b output: Use output as the base name of the generated files.

This usually generates:

output.0000, output.0001, and so on – one JBIG2 page stream per input image.
output.sym – global symbol dictionary shared by all pages.

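For lossless encoding, note that upstream jbig2enc has no -lossless switch. Omitting -s selects the generic region coder, which is lossless. In that mode the encoder typically handles one image per invocation and writes the result to standard output (check jbig2 -h for your build):
jbig2 -p -v img-000.tif > img-000.jb2
Lossless mode is safer for archival purposes but may not compress as much.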

5. Step 3 – Build PDF B from JBIG2 with jbig2topdf.py

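Now convert the JBIG2 streams into a simple PDF B. The jbig2topdf.py shipped with upstream jbig2enc takes the base name from Step 2 and writes the PDF to standard output:
python3 jbig2topdf.py output > pdf_b.pdf
Explanation:

output is the base name of the encoder output; the script picks up the symbol dictionary (output.sym) and the per-page JBIG2 streams that share this base name.
pdf_b.pdf is the new JBIG2-based PDF B, created by redirecting standard output.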

PDF B is typically an image-only PDF with one JBIG2 image per page and minimal additional structure.

6. Step 4 – Replace Images in PDF A with JBIG2 Images from PDF B

Up to now, you have:

PDF A: original document (input.pdf), with its full structure.
PDF B: JBIG2-only document (pdf_b.pdf), one JBIG2 image per page.

The next step is to create a third PDF (call it PDF C) that uses the page layout and structure of PDF A but whose raster images are replaced by JBIG2 images from PDF B.
6.1 Python script for image replacement

The replacement is performed by the Python script image_replace.py, listed in full at the end of this document.
6.2 What the script does

Conceptually, the script:


Reads PDF B and collects all XObjects of subtype /Image that use /JBIG2Decode as a filter.


Iterates over all pages in PDF A:

Copies the page object.
Looks at its /Resources → /XObject dictionary.
Visits every XObject on that page in dictionary order, descending recursively into Form XObjects.
For each image XObject, sequentially assigns the next JBIG2 XObject from PDF B.
Stores the JBIG2 XObject in the /XObject dictionary under the image's existing name, overwriting the original entry.
Leaves the page's content stream(s) untouched: because the name is unchanged, the existing references already resolve to the new image.


Writes out a new PDF where, as far as the renderer is concerned, the pages are the same, but they now reference JBIG2 images instead of the original raster images.
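
The assignment logic can be modeled without pypdf. The following is a simplified sketch using plain dictionaries, not the script itself: keys such as "Subtype" and "XObject" stand in for the corresponding PDF names, and replace_images is a hypothetical name. It shows the two properties that matter: replacements are consumed strictly in sequence, and each one is stored under the existing key.

```python
def replace_images(xobjects, replacements, next_index=0):
    """Overwrite image entries in place; return the next unused index."""
    for name, obj in xobjects.items():
        if obj["Subtype"] == "Image":
            if next_index >= len(replacements):
                continue  # ran out of replacements: keep the original image
            # Same key, new value: the dictionary size does not change,
            # so this is safe while iterating.
            xobjects[name] = replacements[next_index]
            next_index += 1
        elif obj["Subtype"] == "Form":
            # Forms carry their own resources; recurse into them.
            inner = obj.get("XObject", {})
            next_index = replace_images(inner, replacements, next_index)
    return next_index


page = {
    "Im0": {"Subtype": "Image", "Filter": "DCTDecode"},
    "Fm0": {
        "Subtype": "Form",
        "XObject": {"Im1": {"Subtype": "Image", "Filter": "FlateDecode"}},
    },
}
jbig2 = [
    {"Subtype": "Image", "Filter": "JBIG2Decode", "id": 0},
    {"Subtype": "Image", "Filter": "JBIG2Decode", "id": 1},
]
used = replace_images(page, jbig2)
print(used, page["Im0"]["id"], page["Fm0"]["XObject"]["Im1"]["id"])  # 2 0 1
```

Because the index is threaded through the recursion and returned, the caller can carry it from one page to the next.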


This approach preserves:

Page count and page order.
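Vector graphics, text objects, annotations, and other page-level content.
Layout and coordinates of images (because only the underlying XObjects are switched).

Document-level items such as bookmarks are a separate matter. The script adds pages one at a time with PdfWriter.add_page, which does not carry over the document outline, so bookmarks have to be copied separately if they are needed.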
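
The reason image placement survives is that a content stream paints an image by name with the Do operator, while position and size come from the transformation matrix set just before it with cm. Swapping the object behind a name therefore leaves the placement alone. The sketch below uses a made-up content stream and a hypothetical helper, painted_names, to list the names a stream paints. It is a naive regular-expression scan for illustration and does not handle string literals or inline images the way a real PDF parser would.

```python
import re

# Matches "/Name Do", the operator that paints a named XObject
DO_PATTERN = re.compile(rb"/([^\s/\[\]<>()]+)\s+Do\b")


def painted_names(content_stream):
    """Return XObject names painted by Do operators, in order of appearance."""
    return [m.group(1).decode("ascii") for m in DO_PATTERN.finditer(content_stream)]


# Two images, each scaled and positioned by its own cm matrix
stream = b"q 612 0 0 792 0 0 cm /Im0 Do Q\nq 100 0 0 50 20 20 cm /Logo Do Q"
print(painted_names(stream))  # ['Im0', 'Logo']
```

As long as /Im0 and /Logo still exist in the /XObject dictionary after replacement, this stream renders with the new images at the same coordinates.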

6.3 How to run the script

Save the script as replace_images_jbig2.py, then run:
python3 replace_images_jbig2.py input.pdf pdf_b.pdf output_jbig2.pdf
Where:

input.pdf is your original PDF A (with structure you want to keep).
pdf_b.pdf is the JBIG2-only PDF B created by jbig2topdf.py.
output_jbig2.pdf is the final PDF C where images have been replaced by JBIG2.

If everything works correctly, output_jbig2.pdf should:

Look visually similar (or identical) to input.pdf from a user perspective.
Be significantly smaller than input.pdf, thanks to JBIG2 compression.
Preserve all non-image content and structure from input.pdf.
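
The size claim is easy to check. A minimal sketch, with a hypothetical helper name, that reports the reduction as a percentage of the original file size:

```python
import os


def size_reduction(original_path, optimized_path):
    """Return the size reduction as a percentage of the original size."""
    before = os.path.getsize(original_path)
    after = os.path.getsize(optimized_path)
    if before == 0:
        raise ValueError("original file is empty")
    # Negative values mean the "optimized" file is actually larger
    return 100.0 * (before - after) / before


if __name__ == "__main__" and os.path.exists("input.pdf") and os.path.exists("output_jbig2.pdf"):
    print(f"{size_reduction('input.pdf', 'output_jbig2.pdf'):.1f}% smaller")
```

A negative result is worth watching for: if the original images were already compressed efficiently, the output can end up larger.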


7. End-to-End Workflow Summary

Putting all steps together:
# 1. Extract page images from PDF A
pdfimages -tiff input.pdf img

# 2. Encode extracted images as JBIG2 (symbol mode, PDF-ready output)
jbig2 -s -p -v -b output img-*.tif

# 3. Build PDF B from JBIG2 streams (PDF is written to standard output)
python3 jbig2topdf.py output > pdf_b.pdf

# 4. Replace images in PDF A with JBIG2 images from PDF B
python3 replace_images_jbig2.py input.pdf pdf_b.pdf output_jbig2.pdf
Result:

input.pdf: original (unmodified).
pdf_b.pdf: simple JBIG2-only PDF used as a source of compressed images.
output_jbig2.pdf: final optimized PDF, combining PDF A’s structure with JBIG2-compressed images.


8. Important Considerations

Lossy JBIG2 risks


Lossy JBIG2 can sometimes introduce glyph substitution errors (visually similar but incorrect characters).
This can be critical for legal, financial, or archival documents.
For such cases, prefer lossless mode or thoroughly validate results.
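
One simple form of validation is to compare each decoded page bitmap against the original and flag pages where many pixels changed. The sketch below assumes both bitmaps have already been decoded to packed 1-bit-per-pixel bytes of identical dimensions (the decoding step is not shown), and the function names are hypothetical.

```python
def differing_bits(a, b):
    """Count bit positions that differ between two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("bitmaps must have the same size")
    # XOR leaves a 1 wherever the two bitmaps disagree
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def diff_ratio(a, b):
    """Fraction of pixels that differ between two packed 1-bpp bitmaps."""
    total_bits = len(a) * 8
    return differing_bits(a, b) / total_bits if total_bits else 0.0


print(diff_ratio(b"\xff\x00", b"\xfe\x01"))  # 0.125
```

This only helps pick pages for manual review. A substituted glyph changes very few pixels, so a low ratio does not prove that no substitution occurred.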

Page and image alignment assumptions


The replacement script assumes that the number and order of images in PDF B correspond appropriately to the images in PDF A.
If PDF B has fewer JBIG2 images than PDF A has images, the script logs this and leaves the remaining images in PDF A unchanged; if PDF B has more, the number of unused JBIG2 images is reported at the end.
For complex PDFs with multiple images per page or mixed content, you may need to adapt the mapping logic.
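
As one possible adaptation, a page-based mapping could replace only the largest image on each page (usually the scan) with the JBIG2 image from the same page index of PDF B, leaving small images such as logos alone. The sketch below models only the selection step with plain data: each page is a dict from image name to a (width, height) tuple, and largest_image_per_page is a hypothetical name.

```python
def largest_image_per_page(pages):
    """For each page, return the name of its largest image, or None."""
    chosen = []
    for images in pages:
        if not images:
            chosen.append(None)  # page without images: nothing to replace
            continue
        # Compare images by pixel area (width * height)
        name = max(images, key=lambda n: images[n][0] * images[n][1])
        chosen.append(name)
    return chosen


pages = [
    {"Im0": (2480, 3508), "Logo": (200, 80)},
    {},
    {"Scan": (1240, 1754)},
]
print(largest_image_per_page(pages))  # ['Im0', None, 'Scan']
```

With this strategy, Step 1 would also have to extract exactly one image per page so that page indices in PDF B line up with PDF A.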

PDF/A and long-term archiving


The final PDF (output_jbig2.pdf) is not automatically PDF/A-compliant.
Additional tooling and validation are required if PDF/A is a requirement.



  
## image_replace.py
import argparse
from copy import copy

from pypdf import PdfReader, PdfWriter
from pypdf.generic import NameObject


def collect_jbig2_images(reader_b):
    """
    Collect JBIG2 image XObjects from PDF B and return them as a list.

    Returns:
        list: [xobj, xobj, ...]
    """
    images = []
    for page_index, page in enumerate(reader_b.pages):
        print(f"[DEBUG] PDF B Page {page_index + 1}: start collecting JBIG2 images")
        resources = page.get("/Resources")
        if resources is None:
            print(f"[DEBUG]   No /Resources on PDF B Page {page_index + 1}")
            continue

        xobjects = resources.get("/XObject")
        if xobjects is None:
            print(f"[DEBUG]   No /XObject in /Resources on PDF B Page {page_index + 1}")
            continue

        for name, xobj_ref in xobjects.items():
            xobj = xobj_ref.get_object()
            subtype = xobj.get("/Subtype")
            flt = xobj.get("/Filter")
            print(f"[DEBUG]   XObject {name}: Subtype={subtype}, Filter={flt}")

            if subtype == "/Image":
                if isinstance(flt, list):
                    is_jbig2 = "/JBIG2Decode" in flt
                else:
                    is_jbig2 = (flt == "/JBIG2Decode")

                if is_jbig2:
                    print(f"[DEBUG]     -> JBIG2 image detected and collected: {name}")
                    images.append(xobj)
                else:
                    print(f"[DEBUG]     -> Image but not JBIG2: {name}")

    print(f"[DEBUG] Finished collecting JBIG2 images: {len(images)} found")
    return images


def process_xobjects_recursive(owner, xobjects, jbig2_images, jbig2_index, page_idx, img_counter, indent=""):
    """
    Recursively traverse XObjects and replace /Image with JBIG2.
    * XObject names are not changed; only their contents are replaced.
    """
    if xobjects is None:
        print(indent + "[DEBUG]   No /XObject in this level")
        return jbig2_index

    print(indent + f"[DEBUG]   XObject keys at this level = {list(xobjects.keys())}")

    for name, xobj_ref in xobjects.items():
        xobj = xobj_ref.get_object()
        subtype = xobj.get("/Subtype")
        flt = xobj.get("/Filter")
        print(indent + f"[DEBUG]   XObject {name}: Subtype={subtype}, Filter={flt}")

        if subtype == "/Image":
            if jbig2_index >= len(jbig2_images):
                print(indent + "[DEBUG]   Not enough JBIG2 images. Remaining images will be kept as-is.")
                continue

            img_counter[0] += 1
            old_name = name
            jbig2_xobj = jbig2_images[jbig2_index]
            print(indent + f"[DEBUG]     Replacing image {old_name} on page {page_idx + 1}, img_counter={img_counter[0]}")
            print(indent + f"[DEBUG]       Using JBIG2 image index {jbig2_index} from PDF B")
            jbig2_index += 1

            # ★ Do not change the name: assign the new XObject to the same key
            xobjects[NameObject(old_name)] = jbig2_xobj
            print(indent + f"[DEBUG]       Overwrote XObject {old_name} with JBIG2 image")
            print(indent + f"[DEBUG]       Current XObject keys = {list(xobjects.keys())}")

            # Since the name is unchanged, there is no need to rewrite /Contents

        elif subtype == "/Form":
            print(indent + f"[DEBUG]     Descend into Form XObject {name}")
            form_resources = xobj.get("/Resources")
            if form_resources is None:
                print(indent + f"[DEBUG]       Form XObject {name} has no /Resources")
            else:
                inner_xobjs = form_resources.get("/XObject")
                if inner_xobjs is None:
                    print(indent + f"[DEBUG]       Form XObject {name} has no inner /XObject")
                else:
                    jbig2_index = process_xobjects_recursive(
                        owner=xobj,
                        xobjects=inner_xobjs,
                        jbig2_images=jbig2_images,
                        jbig2_index=jbig2_index,
                        page_idx=page_idx,
                        img_counter=img_counter,
                        indent=indent + "    ",
                    )
        else:
            print(indent + f"[DEBUG]     XObject {name} is not Image or Form, skipping replacement")

    return jbig2_index


def replace_images_with_sequential_jbig2(pdf_a_path, pdf_b_path, output_path):
    """
    Replace raster images in PDF A with JBIG2 images taken sequentially from PDF B.

    PDF B is assumed to be created by jbig2topdf.py, with one JBIG2 image per page.
    The JBIG2 images are taken in page order from PDF B and assigned to image XObjects
    in PDF A, recursively descending into Form XObjects.
    """
    print(f"[DEBUG] Opening PDF A: {pdf_a_path}")
    reader_a = PdfReader(pdf_a_path)

    print(f"[DEBUG] Opening PDF B: {pdf_b_path}")
    reader_b = PdfReader(pdf_b_path)

    writer = PdfWriter()

    # 1. Collect all JBIG2 images from PDF B
    jbig2_images = collect_jbig2_images(reader_b)
    total_jbig2 = len(jbig2_images)
    print(f"Number of JBIG2 images in PDF B: {total_jbig2}")

    jbig2_index = 0  # Index of the next JBIG2 image to use

    for page_idx, page_a in enumerate(reader_a.pages):
        print(f"[DEBUG] ===== Processing PDF A Page {page_idx + 1} =====")
        new_page = copy(page_a)

        resources = new_page.get("/Resources")
        if resources is None:
            print(f"[DEBUG] Page {page_idx + 1}: No /Resources, copying as-is")
            writer.add_page(new_page)
            continue

        xobjects = resources.get("/XObject")
        if xobjects is None:
            print(f"[DEBUG] Page {page_idx + 1}: No /XObject in /Resources, copying as-is")
            writer.add_page(new_page)
            continue

        img_counter = [0]

        print(f"[DEBUG] Page {page_idx + 1}: start recursive XObject processing")
        jbig2_index_before = jbig2_index

        jbig2_index = process_xobjects_recursive(
            owner=new_page,
            xobjects=xobjects,
            jbig2_images=jbig2_images,
            jbig2_index=jbig2_index,
            page_idx=page_idx,
            img_counter=img_counter,
            indent="  ",
        )

        if img_counter[0] == 0:
            print(f"[DEBUG] Page {page_idx + 1}: No image XObjects found even recursively (used JBIG2 index {jbig2_index_before} -> {jbig2_index})")
        else:
            print(f"[DEBUG] Page {page_idx + 1}: Replaced {img_counter[0]} image(s) (JBIG2 index {jbig2_index_before} -> {jbig2_index})")

        writer.add_page(new_page)

    # If there are unused JBIG2 images, just report it.
    if jbig2_index < total_jbig2:
        print(f"{total_jbig2 - jbig2_index} JBIG2 image(s) remain unused.")
    else:
        print("[DEBUG] All JBIG2 images were used or there was exact match")

    print(f"[DEBUG] Writing output PDF to: {output_path}")
    with open(output_path, "wb") as f:
        writer.write(f)
    print("[DEBUG] Done.")


def parse_args():
    parser = argparse.ArgumentParser(
        description="Replace images in a PDF (A) with JBIG2 images from another PDF (B)."
    )
    parser.add_argument(
        "pdf_a",
        help="Path to the original PDF A (images will be replaced in this structure).",
    )
    parser.add_argument(
        "pdf_b",
        help="Path to PDF B created via jbig2topdf.py (one JBIG2 image per page).",
    )
    parser.add_argument(
        "output",
        help="Path to the output PDF file with images replaced.",
    )
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    replace_images_with_sequential_jbig2(args.pdf_a, args.pdf_b, args.output)