class SuperCrawler::Scrap

Scrapes a single HTML page. Responsible for extracting all relevant information within a page: internal links and assets (images, stylesheets and scripts).

Attributes

url[R]

Public Class Methods

new(url)
# File lib/super_crawler/scrap.rb, line 15
def initialize url
  # Normalize the URL, by adding a scheme (http) if not present in the URL
  @url = URI.encode( !!(url =~ /^(http(s)?:\/\/)/) ? url : ('http://' + url) )
end
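The normalization step can be sketched standalone. Note that `URI.encode` was removed from Ruby's stdlib in 3.0; `URI::DEFAULT_PARSER.escape` is the closest replacement. `normalize_url` below is a hypothetical helper, not part of the gem:

```ruby
require 'uri'

# Standalone sketch of the constructor's normalization: prepend "http://"
# when no scheme is present, then percent-escape unsafe characters.
def normalize_url(url)
  with_scheme = url.match?(%r{\Ahttps?://}) ? url : "http://#{url}"
  URI::DEFAULT_PARSER.escape(with_scheme)
end

normalize_url('example.com/a page')  # => "http://example.com/a%20page"
normalize_url('https://example.com') # => "https://example.com"
```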

Public Instance Methods

get_all()

Get links and assets within a page. Returns a hash of links, images, stylesheets and scripts URLs.

# File lib/super_crawler/scrap.rb, line 109
def get_all
  {
    links: get_links,
    images: get_images,
    stylesheets: get_stylesheets,
    scripts: get_scripts
  }
end
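The shape of the returned hash can be illustrated with a stand-in class (no network, no Nokogiri; the stubbed getter names mirror the real methods, and the URLs are made up for the example):

```ruby
# Stand-in sketch of get_all's return shape.
class FakeScrap
  def get_links;       ['http://example.com/about']    end
  def get_images;      ['http://example.com/logo.png'] end
  def get_stylesheets; ['http://example.com/app.css']  end
  def get_scripts;     ['http://example.com/app.js']   end

  # Same structure as SuperCrawler::Scrap#get_all: one key per asset kind,
  # each mapping to an array of absolute URLs.
  def get_all
    { links: get_links, images: get_images,
      stylesheets: get_stylesheets, scripts: get_scripts }
  end
end

FakeScrap.new.get_all.keys # => [:links, :images, :stylesheets, :scripts]
```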
get_assets()

Get all assets within a page. Returns a hash of images, stylesheets and scripts URLs.

# File lib/super_crawler/scrap.rb, line 97
def get_assets
  {
    images: get_images,
    stylesheets: get_stylesheets,
    scripts: get_scripts
  }
end
get_images()

Get all the images within a page. Note: these are images referenced by the <img src="…"> tag.

# File lib/super_crawler/scrap.rb, line 47
def get_images
  return [] unless page_exists?

  # Get all the images sources (URLs), using Nokogiri
  images_links = get_doc.css('img').map{ |image| image['src'] }.compact

  # Create the absolute path of the images
  images_links.map!{ |image| create_absolute_url( image ) }

  return images_links.uniq # Return links to images without duplicates
end
get_scripts()

Get all the JS scripts within a page. Note: these are scripts referenced by the <script src="…"> tag.

# File lib/super_crawler/scrap.rb, line 81
def get_scripts
  return [] unless page_exists?

  # Get all the script sources (URLs), using Nokogiri
  scripts_links = get_doc.css('script').map{ |script| script['src'] }.compact

  # Create the absolute path of the scripts
  scripts_links.map!{ |script| create_absolute_url( script ) }

  return scripts_links.uniq # Return links to scripts without duplicates
end
get_stylesheets()

Get all the CSS links within a page. Note: these are stylesheets referenced by the <link href="…" rel="stylesheet"> tag.

# File lib/super_crawler/scrap.rb, line 63
def get_stylesheets
  return [] unless page_exists?

  # Get all the stylesheet links (URLs), using Nokogiri
  css_links = get_doc.css('link').select{ |css_link| css_link['rel'] == 'stylesheet' }
                                 .map{ |css_link| css_link['href'] }
                                 .compact

  # Create the absolute path of the CSS links
  css_links.map!{ |css_link| create_absolute_url( css_link ) }

  return css_links.uniq # Return links to CSS files without duplicates
end
page_exists?()

Check whether the page exists, i.e. whether it can be fetched and parsed.

# File lib/super_crawler/scrap.rb, line 121
def page_exists?
  !!( get_doc rescue false )
end
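The `!!(expr rescue false)` idiom used here converts "raised an error" into `false` and any successful (truthy) result into `true`. A minimal sketch with a hypothetical helper (`safe_truthy?` is not part of the gem; note the inline `rescue` only catches `StandardError` and its subclasses):

```ruby
# Mirror of the page_exists? idiom: run a block, report success as a boolean.
def safe_truthy?
  !!(yield rescue false)
end

safe_truthy? { Integer('42') }   # => true  (parses fine)
safe_truthy? { Integer('oops') } # => false (ArgumentError rescued)
```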

Private Instance Methods

create_absolute_url(url)

Given a possibly relative URL, return its absolute form.

# File lib/super_crawler/scrap.rb, line 142
def create_absolute_url url
  # Prepend the base URL (scheme + host) if the provided URL is relative
  if URI.parse(URI.encode url).host.nil?
    "#{URI.parse(@url).scheme}://#{URI.parse(@url).host}#{url}"
  else
    url
  end
end
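The string concatenation above only handles host-relative paths. A sketch of a more robust alternative is stdlib's `URI.join`, which also resolves protocol-relative (`//cdn…`) and document-relative (`image.png`) references against a base URL:

```ruby
require 'uri'

base = 'http://example.com/blog/post'

# Host-relative path: replaces the whole path.
URI.join(base, '/assets/app.js').to_s        # => "http://example.com/assets/app.js"
# Document-relative path: resolved against the base's directory.
URI.join(base, 'image.png').to_s             # => "http://example.com/blog/image.png"
# Protocol-relative URL: inherits only the scheme.
URI.join(base, '//cdn.example.com/x.css').to_s # => "http://cdn.example.com/x.css"
```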
get_doc()

Get the page `doc` (document) from Nokogiri. Cache it for performance.

# File lib/super_crawler/scrap.rb, line 131
def get_doc
  begin
    @doc ||= Nokogiri( open(@url) ) # `open` is provided by open-uri
  rescue Exception => e
    raise "Problem with URL #{@url}: #{e}"
  end
end
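The `@doc ||= …` caching pattern evaluates its right-hand side only when `@doc` is `nil` or `false`, so the expensive fetch-and-parse runs at most once per instance (one caveat: a `nil`/`false` result would be re-fetched on every call). A self-contained sketch with a counter, using made-up names:

```ruby
# Minimal memoization sketch mirroring get_doc's caching.
class CachedFetch
  attr_reader :fetch_count

  def initialize
    @fetch_count = 0
  end

  def doc
    @doc ||= fetch # fetch runs only while @doc is nil
  end

  private

  def fetch
    @fetch_count += 1
    'parsed document'
  end
end

c = CachedFetch.new
c.doc
c.doc
c.fetch_count # => 1 (second call reused the cached value)
```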