@masutaka
Created February 16, 2017 13:19
Parse HTML and print JSON for the Elasticsearch Bulk API
#!/usr/bin/env ruby

require 'json'
require 'oga'
require 'time'

# Bulk API action line, shared by every document.
action = {
  'index' => {
    '_index' => 'chalow',
    '_type' => 'article'
  }
}

Dir.chdir(File.expand_path('..', __dir__)) do
  Dir.glob('webroot/chalow/*-*-*-*.html') do |file|
    File.open(file) do |f|
      parser = Oga::HTML::Parser.new(f)

      # File names look like YYYY-MM-DD-N.html; the named captures become locals.
      /(?<y>\d+)-(?<m>\d+)-(?<d>\d+)-(?<i>\d+)\.html/ =~ file
      id = "#{y}-#{m}-#{d}-#{i}"

      parser.parse.xpath('//div[@class="section"]').each do |article|
        title = article.at_xpath('h3/text()[1]').text.gsub('[', '').strip

        # Strip everything that is not article text. `&.` skips elements that
        # are absent instead of raising NoMethodError on nil.
        article.at_xpath('h3')&.remove
        article.at_xpath('div[@class="caption"]')&.remove
        article.xpath('blockquote[@class="twitter-tweet"]').remove
        article.at_xpath('script[contains(., "socialplus")]')&.remove
        article.at_xpath('id("google-adsense")')&.remove

        body = article.text.gsub("\n", '').strip

        document = {
          'id' => id,
          'title' => title,
          'body' => body,
          # Entry N of the day is mapped to hour N - 1 (JST); the offset is not printed.
          '@timestamp' => Time.parse("#{y}-#{m}-#{d} #{i.to_i - 1}:00:00 +0900").strftime('%FT%T'),
        }

        # One action line, one source line, then a blank line.
        puts action.to_json, document.to_json, ''
      end
    end
  end
end
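
The output format may be easier to see in isolation. The following is a minimal sketch, not part of the original script: it takes a made-up file name and shows how the id, the timestamp, and the two JSON lines per document could be derived. The helper names `id_and_timestamp` and `bulk_lines`, the sample path, and the sample title and body are all hypothetical. It needs only the standard library, so no HTML parsing is involved.

```ruby
require 'json'
require 'time'

# Hypothetical helper: derive the document id and timestamp from a file name
# shaped like YYYY-MM-DD-N.html, using the same regex as the script.
def id_and_timestamp(path)
  match = /(?<y>\d+)-(?<m>\d+)-(?<d>\d+)-(?<i>\d+)\.html/.match(path)
  return nil unless match

  id = "#{match[:y]}-#{match[:m]}-#{match[:d]}-#{match[:i]}"
  hour = match[:i].to_i - 1
  time = Time.parse("#{match[:y]}-#{match[:m]}-#{match[:d]} #{hour}:00:00 +0900")
  [id, time.strftime('%FT%T')]
end

# Hypothetical helper: build the action line and the source line for one document.
def bulk_lines(index, type, document)
  action = { 'index' => { '_index' => index, '_type' => type } }
  [action.to_json, document.to_json]
end

# Made-up sample input; the title and body are placeholders.
id, timestamp = id_and_timestamp('webroot/chalow/2017-02-16-1.html')
sample = {
  'id' => id,
  'title' => 'Example title',
  'body' => 'Example body',
  '@timestamp' => timestamp
}

lines = bulk_lines('chalow', 'article', sample)
puts lines
```

Each document becomes a pair of lines: the action line names the index and type, and the line after it is the document source. Assuming the script's output has been redirected to a file, that file would typically be sent to the `_bulk` endpoint as newline-delimited JSON.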