@cshirley
Last active May 20, 2019 18:35
HTML Parse and export
require 'nokogiri'
require 'json'
require 'fileutils'

# Parse one mirrored HTML page and return its exportable fields as a Hash.
def export_html(filename)
  doc = Nokogiri::HTML(File.read(filename))
  body = doc.css('div.light') # content container
  {
    src: filename,
    # .../foo/index.html is exported as .../foo.json
    url: filename.sub(%r{/index\.html\z}, '.json'),
    title: body.css('h1').text,
    content: body.inner_html.gsub(/[\r\n]/, ''),
    images: body.css('img').map { |img| img['src'] },
    links: body.css('a').map { |a| a['href'] }
  }
end

# Export every HTML file below the current directory to ../export as JSON.
Dir['**/*.html'].each do |filename|
  puts "Processing #{filename}"
  doc = export_html(filename)
  fname = File.expand_path(File.join('../export', doc[:url]))
  FileUtils.mkdir_p(File.dirname(fname)) # no-op if the directory exists
  File.write(fname, doc.to_json)
end
# --adjust-extension is the current name of the deprecated --html-extension
wget \
  --recursive \
  --no-clobber \
  --page-requisites \
  --adjust-extension \
  --convert-links \
  --restrict-file-names=windows \
  --domains website.org \
  --no-parent \
  www.website.org/tutorials/html/
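
The path mapping in the export script is easy to get wrong, so here is a minimal, stdlib-only sketch of that step in isolation. The helper names `export_path` and `write_export` are hypothetical and not part of the original script, and the fallback that renames other `.html` files to `.json` is an assumption: the original only rewrites paths ending in `/index.html`.

```ruby
require 'json'
require 'fileutils'
require 'tmpdir'

# Hypothetical helper: map a mirrored page path to its JSON export path.
def export_path(filename)
  if filename.end_with?('/index.html')
    # Same rule as the script: .../foo/index.html becomes .../foo.json
    filename.sub(%r{/index\.html\z}, '.json')
  else
    # Assumption, not in the original: other pages swap .html for .json
    filename.sub(/\.html\z/, '.json')
  end
end

# Hypothetical helper: write one record below export_root, creating
# parent directories as needed, and return the path that was written.
def write_export(export_root, record)
  target = File.join(export_root, record[:url])
  FileUtils.mkdir_p(File.dirname(target))
  File.write(target, JSON.generate(record))
  target
end

# Try it against a throwaway directory instead of ../export.
Dir.mktmpdir do |root|
  src = 'tutorials/html/intro/index.html'
  record = { src: src, url: export_path(src), title: 'Intro' }
  path = write_export(root, record)
  puts record[:url]
  puts JSON.parse(File.read(path))['title']
end
```

Writing into a temporary directory keeps the check separate from a real export run; pointing `export_root` at `../export` would mirror what the script does.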