
@jerieljan
Created February 26, 2026 04:22
pdf-parse-with-llm - skill to process PDFs with `llm`

pdf-parse-with-llm

This is a simple Opencode / Claude Code skill for delegating document and image processing to another LLM, in this case gemini-2.5-pro.

Why?

While Claude Code and other harnesses come with their own ways to read files, I sometimes run into issues when there are a lot of them, especially when they're handwritten or scanned PDFs.

Maybe I'm just unlucky sometimes, but I usually end up with a session that gets stuck or takes longer than usual, and this approach works around that for me.

I also just prefer Gemini models for document processing. They read documents far better for my use cases and cost less than throwing other, more expensive models at the task.

How to Use

  1. Install llm: https://llm.datasette.io/en/stable/
  2. Configure it to use llm-gemini or whichever model plugin you'd like for document processing.
  3. Place the SKILL.md from this page into your coding agent's skill directory (e.g., in Opencode, ~/.config/opencode/skill/pdf-parse-with-llm/SKILL.md)
  4. Verify that it works by asking your agent to use the skill (or by running something like /skills)
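For steps 1 and 2, a minimal setup might look like the following (a sketch; it assumes pipx is available, so adjust to pip or brew as you prefer):

```shell
# Install the llm CLI and the Gemini plugin
pipx install llm
llm install llm-gemini

# Store your Gemini API key (prompts interactively)
llm keys set gemini

# Confirm the Gemini models are registered
llm models list | grep -i gemini
```

The `llm keys set` step is only needed if you haven't already exported the key as an environment variable.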
---
name: pdf-parse-with-llm
description: Parse PDFs (including scanned/image PDFs) using the `llm` CLI with an LLM attachment workflow; extract structured data and/or text when native PDF text is unreliable.
compatibility: opencode
metadata:
  input: pdf
  tool: llm
  default_model: gemini/gemini-2.5-pro
---

What I do

  • When a PDF needs parsing—especially scanned PDFs or PDFs where text extraction is unreliable—I use the llm CLI with the PDF attached.
  • I craft prompts to extract:
    • clean plain text
    • structured JSON (tables, fields, invoices, forms)
    • specific answers with citations to page numbers (when feasible)
  • I iterate: first get a high-level extraction, then refine prompts for missing fields, tables, or ambiguous sections.

When to use me

Use this skill when:

  • You encounter a .pdf that is likely scanned (image-based) or has broken/non-native text.
  • You need reliable extraction into text/JSON for downstream processing.
  • You need to interpret tables, forms, or multi-column layouts.

Do not use this skill when:

  • You already have high-quality extracted text and only need summarization (unless verification against the PDF is required).

Required tool

The llm utility is available for use.

Command template (required)

Use this exact structure:

llm -m {{llm-id}} "{{prompt}}" -a "{{path-to-pdf}}"

Attachment guidance

  • Attach up to five files at a time; use this only for images and small PDFs.
  • For large files or long documents, attach one file per request.
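The llm CLI accepts repeated -a flags, so a small batch can be sent in a single request (filenames here are hypothetical):

```shell
# A few small attachments in one request; -a may be repeated per file
llm -m gemini/gemini-2.5-pro \
  "Transcribe the text in each attached page image, labeling each by filename." \
  -a page1.png -a page2.png -a page3.png
```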

Default model

Unless instructed otherwise, use gemini/gemini-2.5-pro by default.

Prompting patterns

1) Fast full-text extraction

Use when you just need readable text.

Prompt example:

  • "Extract all readable text from this PDF. Preserve headings and paragraph breaks. If the PDF is scanned, perform OCR. Return plain text."

Command:

llm -m gemini/gemini-2.5-pro "Extract all readable text from this PDF. Preserve headings and paragraph breaks. If the PDF is scanned, perform OCR. Return plain text." -a "PATH/TO/FILE.pdf"

2) Structured JSON extraction (preferred for automation)

Use when you need schema’d output.

Prompt example:

  • "Extract the following fields as JSON: invoice_number, invoice_date, vendor_name, vendor_address, customer_name, customer_address, line_items[{description, quantity, unit_price, amount}], subtotal, tax, total. If a field is missing, use null. Also include page_number for each extracted top-level field if possible."

Command:

llm -m gemini/gemini-2.5-pro "Extract the following fields as JSON: invoice_number, invoice_date, vendor_name, vendor_address, customer_name, customer_address, line_items[{description, quantity, unit_price, amount}], subtotal, tax, total. If a field is missing, use null. Also include page_number for each extracted top-level field if possible." -a "PATH/TO/FILE.pdf"

3) Table extraction

Prompt example:

  • "Locate the table(s) in this PDF and return them as JSON arrays. Preserve column headers. If multiple tables exist, return an array of tables with {title, page_number, rows}."
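Command, following the same template as the patterns above (the path is a placeholder):

```shell
llm -m gemini/gemini-2.5-pro "Locate the table(s) in this PDF and return them as JSON arrays. Preserve column headers. If multiple tables exist, return an array of tables with {title, page_number, rows}." -a "PATH/TO/FILE.pdf"
```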

4) Targeted Q&A with page grounding

Prompt example:

  • "Answer: What is the contract termination notice period? Quote the exact clause and provide the page number."
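Command, again instantiating the required template (the path is a placeholder):

```shell
llm -m gemini/gemini-2.5-pro "Answer: What is the contract termination notice period? Quote the exact clause and provide the page number." -a "PATH/TO/FILE.pdf"
```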

Workflow guidance

  1. Identify the user’s objective (full text vs structured fields vs tables vs Q&A).
  2. Run an initial extraction prompt.
  3. If the output is incomplete/garbled:
    • narrow to specific pages/sections in the prompt
    • request alternative renderings (e.g., “focus on the header area of each page”)
    • extract tables separately from narrative text
  4. Return results in the format the user asked for (plain text / JSON / CSV-like text), and note uncertainties explicitly.
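When the goal is JSON for downstream processing, one way to sanity-check step 3 is to ask for bare JSON and validate it before use (a sketch; jq and the output filename are assumptions, not part of the skill):

```shell
# Ask for bare JSON so the output can be machine-validated
llm -m gemini/gemini-2.5-pro \
  "Extract the invoice fields as JSON. Return only the JSON object, with no prose or code fences." \
  -a "PATH/TO/FILE.pdf" > invoice.json

# jq exits non-zero if the output is not valid JSON
jq empty invoice.json && echo "invoice.json is valid JSON"
```

If validation fails, re-prompt for the missing or malformed fields rather than retrying the whole extraction.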