Vincent Richard VRichardJP

## jpepub2anki.md

      
Last active: November 18, 2025 11:52
            
              
Generate Anki deck from Japanese epub files using LLMs

This document describes a method for generating Anki vocabulary decks from Japanese books using LLMs. Traditionally, readers create new Anki cards one at a time as they stumble upon unfamiliar words and expressions in the books they read, which is tedious. Instead, an LLM can skim through the book first and automatically generate Anki cards that should cover most (roughly 90%) of one's needs.


1. Extract the XHTML chapters from the epub file (e.g. using Calibre).


2. Extract the raw text from the XHTML. Note that Japanese epubs may contain furigana. These readings are normally found within `<rt>` or `<html:rt>` elements and should be removed from the raw text. For example, with sed:


sed -e 's/<rt>[^<]*<\/rt>//g' -e 's/<html:rt>[^<]*<\/html:rt>//g' -e 's/<[^>]*>//g' chapter01.xhtml > chapter01.txt
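
The two steps above can also be sketched in Python, which avoids a separate extraction tool because an epub file is a ZIP archive. This is only a minimal sketch under stated assumptions, not the method used by the author: the function names `xhtml_to_text` and `epub_chapter_texts` are hypothetical, chapters are assumed to be the archive members ending in `.xhtml` or `.html`, and their content is assumed to be UTF-8 encoded.

```python
import re
import zipfile

# Furigana readings live in <rt> (or namespaced <html:rt>) elements.
RT_RE = re.compile(r"<(?:html:)?rt\b[^>]*>.*?</(?:html:)?rt>", re.DOTALL)
# Any remaining markup tag.
TAG_RE = re.compile(r"<[^>]*>")


def xhtml_to_text(xhtml):
    """Drop furigana readings first, then strip every remaining tag."""
    without_readings = RT_RE.sub("", xhtml)
    return TAG_RE.sub("", without_readings)


def epub_chapter_texts(epub_path):
    """Map each .xhtml/.html member of the epub archive to its raw text.

    Assumption: every such member is a UTF-8 encoded chapter.
    """
    texts = {}
    with zipfile.ZipFile(epub_path) as archive:
        for name in sorted(archive.namelist()):
            if name.lower().endswith((".xhtml", ".html")):
                xhtml = archive.read(name).decode("utf-8")
                texts[name] = xhtml_to_text(xhtml)
    return texts
```

Calling `epub_chapter_texts("book.epub")` (with `book.epub` as a placeholder path) returns a dictionary keyed by archive member name. Members are sorted by name, which may not match the book's reading order. Like the sed command, this sketch leaves HTML entities such as `&amp;` untouched.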

  
## cr2book.py
#!/usr/bin/env python3

# Simple script which converts all Critical Role transcripts from <https://www.kryogenix.org/crsearch/html/index.html> to text format. Text transcripts may then easily be converted to ebook formats using pandoc.
#
# How to use:
#
# 1. Download and extract all Critical Role transcripts from <https://www.kryogenix.org/crsearch/cr_full.zip>.
# 2. Run this script from the `cr_full/` directory (or change the value of `INDEX` below).
# 3. For each campaign, a new text file containing all episode transcripts will be saved under the `txt/` directory.
# 4. You may then easily convert each text to e.g. an ebook using pandoc. For example: `pandoc txt/c1.txt -o c1.epub`