Vincent Richard VRichardJP

VRichardJP / jpepub2anki.md
Last active November 18, 2025 11:52
Generate Anki deck from Japanese epub files using LLMs

This document describes a method for generating Anki vocabulary decks from Japanese books using LLMs. Traditionally, readers create new Anki cards one at a time as they stumble upon unfamiliar words and expressions, which is tedious. Instead, an LLM can skim through the book first and automatically generate Anki cards that should cover about 90% of one's needs.

  1. Extract XHTML chapters from the epub file (e.g. using Calibre)

  2. Extract raw text from the XHTML. Note that Japanese epubs may contain furigana. These readings normally sit inside <rt> or <html:rt> elements and should be removed from the raw text. For example, with sed:

sed -e 's/<rt>[^<]*<\/rt>//g' -e 's/<html:rt>[^<]*<\/html:rt>//g' -e 's/<[^>]*>//g' chapter01.xhtml > chapter01.txt
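
For readers who would rather script both steps, the following is a minimal Python sketch of the same idea, not part of the original method. It relies on an epub being a zip archive, so the chapters can be read directly without Calibre. The names `strip_furigana` and `iter_chapters` are hypothetical helpers introduced here for illustration. The sketch also makes two assumptions beyond the sed command: it removes `<rp>` fallback parentheses alongside `<rt>`, and it lists chapters in alphabetical order of their file names, which may differ from the book's actual reading order.

```python
import html
import re
import zipfile

# Ruby annotations: <rt> holds the furigana reading, <rp> holds fallback
# parentheses (removing <rp> is an assumption beyond the sed command).
# The optional "html:" prefix covers the namespaced form <html:rt>.
_RUBY_NOISE = re.compile(
    r"<(?:html:)?(rt|rp)\b[^>]*>.*?</(?:html:)?\1>",
    re.DOTALL | re.IGNORECASE,
)
_ANY_TAG = re.compile(r"<[^>]*>")


def strip_furigana(xhtml: str) -> str:
    """Return the text of an XHTML chapter without furigana readings."""
    text = _RUBY_NOISE.sub("", xhtml)  # drop readings before stripping tags
    text = _ANY_TAG.sub("", text)      # remove all remaining markup
    return html.unescape(text)         # decode entities such as &amp;


def iter_chapters(epub_path: str):
    """Yield (entry_name, plain_text) for each XHTML/HTML entry in an epub.

    Entries are visited in alphabetical order, which is only an
    approximation of the reading order defined by the book itself.
    """
    with zipfile.ZipFile(epub_path) as book:
        for name in sorted(book.namelist()):
            if name.lower().endswith((".xhtml", ".html")):
                raw = book.read(name).decode("utf-8")
                yield name, strip_furigana(raw)
```

Calling `strip_furigana` on the contents of `chapter01.xhtml` plays the same role as the sed pipeline above. Like that pipeline, it keeps any text found in the document head, such as the chapter title.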

VRichardJP / cr2book.py
Last active October 27, 2025 00:39
Convert Critical Role transcripts from https://www.kryogenix.org/crsearch/html/index.html to epub using pandoc
#!/usr/bin/env python3
# Simple script which converts all Critical Role transcripts from <https://www.kryogenix.org/crsearch/html/index.html> to text format. Text transcripts may then easily be converted to ebook formats using pandoc.
#
# How to use:
#
# 1. Download and extract all Critical Role transcripts from <https://www.kryogenix.org/crsearch/cr_full.zip>.
# 2. Run this script from the `cr_full/` directory (or change the value of `INDEX` below).
# 3. For each campaign, a new text file containing all episode transcripts will be saved under the `txt/` directory
# 4. You may then easily convert each text to e.g. an ebook using pandoc. For example: `pandoc txt/c1.txt -o c1.epub`