This document describes a method for generating Anki vocabulary decks from Japanese books using LLMs. Traditionally, a reader creates a new Anki card each time they stumble upon an unfamiliar word or expression in a book. This is tedious and interrupts the reading. Instead, an LLM can skim through the book first and automatically generate Anki cards that should cover most (roughly 90%) of one's needs.
- Extract XHTML chapters from the EPUB file (e.g. using Calibre).
- Extract raw text from the XHTML. Note that Japanese EPUBs may contain furigana. These are normally found within <rt> or <html:rt> blocks and should be removed from the raw text. For example, with sed:
sed -e 's/<rt>[^<]*<\/rt>//g' -e 's/<html:rt>[^<]*<\/html:rt>//g' -e 's/<[^>]*>//g' chapter01.xhtml > chapter01.txt
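
The sed command above handles a single chapter. As a rough sketch, the same idea can be wrapped in a small function and applied to every chapter in a directory. The function names `strip_furigana` and `convert_chapters` are hypothetical, and the extra `<rp>` rule is an assumption that is not part of the original command: some books wrap furigana in `<rp>` fallback parentheses, which would otherwise leave stray "()" in the text. Like the original, this only matches `<rt>` tags without attributes or nested markup, and tags that do not span multiple lines.

```shell
# Strip furigana and all remaining markup from XHTML read on stdin.
strip_furigana() {
  # 1-2: drop furigana readings; 3: drop <rp> fallback parentheses
  # (assumption, not in the original command); 4: drop every remaining tag.
  sed -e 's/<rt>[^<]*<\/rt>//g' \
      -e 's/<html:rt>[^<]*<\/html:rt>//g' \
      -e 's/<rp>[^<]*<\/rp>//g' \
      -e 's/<[^>]*>//g'
}

# Convert every *.xhtml file in directory $1 to a .txt file next to it.
convert_chapters() {
  for f in "$1"/*.xhtml; do
    [ -e "$f" ] || continue   # skip if the glob matched nothing
    strip_furigana < "$f" > "${f%.xhtml}.txt"
  done
}
```

For example, `convert_chapters ./chapters` would write `chapter01.txt` next to `chapter01.xhtml`, and so on for each chapter in that directory.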