@punit-naik
Created September 20, 2021 07:59
Parsing a document's content using Apache Tika with Clojure
;; Run the following `lein try` command to start a REPL with these libraries on its classpath:
;; lein try org.apache.tika/tika-parsers 1.14 org.apache.tika/tika-core 1.14
(import '[org.apache.tika Tika]
        '[java.nio.file Files]
        '[java.io File ByteArrayInputStream])

;; A single Tika facade instance, reused for every parse
(def tika (Tika.))

;; Read the whole file into a byte array
(def f-bytes (-> (File. "/path/file.pdf") .toPath Files/readAllBytes))

(defn parse-resume
  "Returns the text content of the document held in file-byte-array."
  [^bytes file-byte-array]
  (.parseToString tika (ByteArrayInputStream. file-byte-array)))

(parse-resume f-bytes)
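
;; A possible extension, not part of the original gist. It relies on the imports
;; and `f-bytes` defined above. The Tika facade can also report a MIME type via
;; `.detect`, and `.parseToString` truncates its output at the facade's maximum
;; string length (100,000 characters by default) unless `.setMaxStringLength`
;; is called; passing -1 removes the limit.
;; `describe-document` is a hypothetical helper name used only for illustration.
(defn describe-document
  [^bytes file-byte-array]
  (let [t (doto (Tika.) (.setMaxStringLength (int -1)))]
    ;; Each call gets its own stream so that neither consumes the other's input
    {:mime-type (.detect t (ByteArrayInputStream. file-byte-array))
     :text      (.parseToString t (ByteArrayInputStream. file-byte-array))}))
(describe-document f-bytes)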