require 'nokogiri'
require 'open-uri'
require 'colored'

# this is just a preview of what's to come - a proof of concept.
# it will be converted to an API-style library, gemified, and put in its own repo
# for now, a cool way to experiment with amazon's data

query = 'ruby'
page = '2'

# URI.open is used because the bare Kernel#open no longer accepts URLs as of Ruby 3.0
doc = Nokogiri::HTML(URI.open("http://www.amazon.com/s/field-keywords=#{query}?page=#{page}"))
puts "Amazon search for '#{query}', page ##{page}\n".red.underline

doc.css('div.product').each do |el|
  # grab the title
  title = el.css('a.title').first.content
  # grab the author (can be linked or not, hence the logic)
  # gsub rather than gsub!, because gsub! returns nil when nothing is replaced
  author = el.css('.ptBrand a').empty? ? el.css('.ptBrand').first.content.gsub(/by /, '') : el.css('.ptBrand a').first.content
  # grab the image
  image = el.css('.productImage').attribute 'src'
  # grab the product link
  link = el.css('a.title').attribute 'href'

  puts "#{title} by #{author}".green
  puts "image url:".yellow + " #{image}"
  puts "amazon link:".yellow + " #{link}"
  puts ""
end
Yeah, this gist was created 8 years ago, so I'm not surprised.
@jonbarlo, have you found the solution?
@codemicky yeah, but it involves paying a third-party service for proxying. Amazon is super strict and doesn't like headless browsers.
Another solution is running Capybara with a non-headless browser. If you create a dummy Amazon account and log in before visiting the Amazon URL, I think you won't have issues, but I might be wrong (I have done the same for another platform using this approach).
One last thing: you might try a gem called kimurai
https://github.com/vifreefly/kimuraframework
I wonder what the result would be (I have used this as well, so it's another approach).
I see. Thank you 👍
@codemicky try kimurai and see if it works; otherwise try nokogiri behind a proxy, something like https://scrapinghub.com/crawlera
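For the "nokogiri behind a proxy" route, a minimal sketch might look like this. It relies only on open-uri's documented `:proxy` and `:proxy_http_basic_authentication` options; the proxy address, the credentials, the User-Agent string, and the helper names `proxy_options` and `fetch_via_proxy` are placeholders made up for illustration, not values from any real service.

```ruby
require 'open-uri'

# Hypothetical helper: builds the option hash passed to URI.open.
# proxy_url, user and password stand in for whatever your proxy provider gives you.
def proxy_options(proxy_url, user: nil, password: nil)
  # The User-Agent value is an arbitrary example, not a recommended string.
  options = { 'User-Agent' => 'Mozilla/5.0 (compatible; example-scraper)' }
  if user && password
    # open-uri expects [proxy_uri, user, password] for an authenticated proxy
    options[:proxy_http_basic_authentication] = [proxy_url, user, password]
  else
    options[:proxy] = proxy_url
  end
  options
end

# Hypothetical helper: fetches the page body through the proxy.
# Hand the returned string to Nokogiri::HTML to parse it.
def fetch_via_proxy(url, proxy_url, user: nil, password: nil)
  URI.open(url, proxy_options(proxy_url, user: user, password: password)) { |io| io.read }
end
```

You would then do something like `Nokogiri::HTML(fetch_via_proxy(search_url, 'http://proxy.example:8080'))`. Whether Amazon still blocks the request depends entirely on the proxy, so treat this as a starting point only.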
This will throw an exception:
OpenURI::HTTPError (503 Service Unavailable). Looks like Amazon is behind Cloudflare DNS to prevent attacks.
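The exception can at least be caught so the script fails gracefully instead of crashing. A rough sketch, assuming you just want `nil` back when the request is rejected; `fetch_html` and its `opener` parameter are invented names (the parameter exists only so the method can be exercised without hitting the network).

```ruby
require 'open-uri'

# Hypothetical helper: returns the page body, or nil if the server
# answers with an HTTP error such as 503.
# opener defaults to URI.open; it is injectable only for offline testing.
def fetch_html(url, opener: URI.method(:open))
  opener.call(url).read
rescue OpenURI::HTTPError => e
  # e.message carries the status line, e.g. "503 Service Unavailable"
  warn "Request for #{url} failed: #{e.message}"
  nil
end
```

In the script above you could then write `html = fetch_html(url)` and only call `Nokogiri::HTML(html)` when `html` is not nil. This does not get around the block, it only keeps the 503 from killing the script.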