module Scrapifier::Support
Support methods to get, check, and organize data.
Public Instance Methods
sf_check_img_ext(images, allowed = [])
Filter images, returning only those with an allowed extension.
Example:
  >> sf_check_img_ext('http://source.com/image.gif', :jpg)
  => []
  >> sf_check_img_ext(
       ['http://source.com/image.gif', 'http://source.com/image.jpg'],
       [:jpg, :png]
     )
  => ['http://source.com/image.jpg']
Arguments:
  images: (String or Array) - Images which will be checked.
  allowed: (String, Symbol or Array) - Allowed types of image extension.

  # File lib/scrapifier/support.rb, line 57
  def sf_check_img_ext(images, allowed = [])
    allowed ||= []
    if images.is_a?(String)
      images = images.split
    elsif !images.is_a?(Array)
      images = []
    end
    images.select { |i| i =~ sf_regex(:image, allowed) }
  end
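The filtering idea can be tried without the gem. The following is a minimal sketch, not Scrapifier's implementation: `filter_by_extension` is a hypothetical helper that assumes the extension ends the URL (optionally followed by a query string) and, like the listing above, falls back to jpg, jpeg, png and gif when no extension is given.

```ruby
# Hypothetical stand-alone helper (not part of Scrapifier): keep only the
# URLs whose file extension is in the allowed list.
def filter_by_extension(urls, allowed = [])
  # Accept a single value or a list for both arguments.
  urls = Array(urls)
  exts = Array(allowed).map { |ext| Regexp.escape(ext.to_s) }
  # Assumption: an empty list means "any common image type".
  exts = %w[jpg jpeg png gif] if exts.empty?

  # The extension must end the URL, optionally followed by a query string.
  pattern = /\.(?:#{exts.join('|')})(?:\?.*)?\z/i
  urls.select { |url| url.match?(pattern) }
end

p filter_by_extension('http://source.com/image.gif', :jpg)
# => []
p filter_by_extension(
  ['http://source.com/image.gif', 'http://source.com/image.jpg'],
  [:jpg, :png]
)
# => ["http://source.com/image.jpg"]
```

Unlike the library method, this sketch does not treat :jpg as implying :jpeg.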
sf_domain(uri)
Return the URI domain.
Example:
  >> sf_domain('http://adtangerine.com')
  => 'adtangerine.com'
Arguments:
uri: (String) - URI.
  # File lib/scrapifier/support.rb, line 186
  def sf_domain(uri)
    uri = uri.to_s.split('/')
    uri.empty? ? '' : uri[2]
  end
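To see why the third element of the split holds the domain, and where that approach differs from a real URI parser, here is a small sketch. `domain_by_split` and `domain_by_uri` are hypothetical names used only for this comparison; the second relies on the standard library's `URI.parse`.

```ruby
require 'uri'

# Hypothetical helper: same split-on-slash idea as the listing above.
# "http://host/path" splits into ["http:", "", "host", "path"].
def domain_by_split(uri)
  uri.to_s.split('/')[2]
end

# Hypothetical helper: standard-library alternative using URI#host.
def domain_by_uri(uri)
  URI.parse(uri.to_s).host
end

p domain_by_split('http://adtangerine.com/about')  # => "adtangerine.com"
p domain_by_uri('http://adtangerine.com/about')    # => "adtangerine.com"

# Without a scheme there is no third element, so both return nil.
p domain_by_split('adtangerine.com')               # => nil
p domain_by_uri('adtangerine.com')                 # => nil

# The split keeps an explicit port; URI#host drops it.
p domain_by_split('http://localhost:3000/status')  # => "localhost:3000"
p domain_by_uri('http://localhost:3000/status')    # => "localhost"
```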
sf_eval_uri(uri, exts = [])
Evaluate the URI’s HTML document and get its metadata.
Example:
  >> sf_eval_uri('http://adtangerine.com', [:png])
  => {
       :title => "AdTangerine | Advertising Platform for Social Media",
       :description => "AdTangerine is an advertising platform that...",
       :images => [
         "http://adtangerine.com/assets/logo_adt_og.png",
         "http://adtangerine.com/assets/logo_adt_og.png"
       ],
       :uri => "http://adtangerine.com"
     }
Arguments:
  uri: (String) - URI.
  exts: (Array) - Allowed image types.

  # File lib/scrapifier/support.rb, line 27
  def sf_eval_uri(uri, exts = [])
    doc = Nokogiri::HTML(open(uri).read)
    doc.encoding, meta = 'utf-8', { uri: uri }

    [:title, :description, :keywords, :lang, :encode, :reply_to, :author].each do |k|
      node = doc.xpath(sf_xpaths[k])[0]
      meta[k] = node.nil? ? '-' : node.text
    end

    meta[:images] = sf_fix_imgs(doc.xpath(sf_xpaths[:image]), uri, exts)

    meta
  rescue SocketError
    {}
  end
sf_fix_imgs(imgs, uri, exts = [])
Check and return only the valid image URIs.
Example:
  >> sf_fix_imgs(
       ['http://adtangerine.com/image.png', '/assets/image.jpg'],
       'http://adtangerine.com',
       :jpg
     )
  => ['http://adtangerine.com/assets/image.jpg']
Arguments:
  imgs: (Array) - Image URIs obtained from the HTML document.
  uri: (String) - Used as the base for URIs that don't have a protocol/domain set.
  exts: (Symbol or Array) - Allowed image extensions.

  # File lib/scrapifier/support.rb, line 145
  def sf_fix_imgs(imgs, uri, exts = [])
    sf_check_img_ext(imgs.map do |img|
      img = img.to_s
      unless img =~ sf_regex(:protocol)
        img = sf_fix_protocol(img, sf_domain(uri))
      end
      img if img =~ sf_regex(:image)
    end.compact, exts)
  end
sf_fix_protocol(path, domain)
Fix image URIs that don’t have a protocol/domain set.
Example:
  >> sf_fix_protocol('/assets/image.jpg', 'adtangerine.com')
  => 'http://adtangerine.com/assets/image.jpg'
  >> sf_fix_protocol('//s.ytimg.com/yts/img/youtub_img.png', 'youtube.com')
  => 'http://s.ytimg.com/yts/img/youtub_img.png'
Arguments:
  path: (String) - URI path that has no protocol/domain set.
  domain: (String) - Domain that will be prepended to the path.

  # File lib/scrapifier/support.rb, line 170
  def sf_fix_protocol(path, domain)
    if path =~ %r{^//[^/]+}
      'http:' << path
    else
      "http://#{domain}#{'/' unless path =~ %r{^/[^/]+}}#{path}"
    end
  end
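The listed method always prefixes `http`, whatever scheme the page itself uses. As a point of comparison only, the sketch below resolves the same kinds of references with the standard library's `URI.join`, which takes the full page URI as its base and keeps that base's scheme. `resolve_image_uri` is a hypothetical helper, not part of Scrapifier.

```ruby
require 'uri'

# Hypothetical helper (not part of Scrapifier): resolve an image reference
# against the full page URI using the standard library.
def resolve_image_uri(page_uri, reference)
  URI.join(page_uri, reference).to_s
end

# Absolute path: the scheme and host come from the page URI.
p resolve_image_uri('http://adtangerine.com', '/assets/image.jpg')
# => "http://adtangerine.com/assets/image.jpg"

# Protocol-relative reference: the host comes from the reference,
# the scheme from the page URI.
p resolve_image_uri('https://youtube.com', '//s.ytimg.com/yts/img/youtub_img.png')
# => "https://s.ytimg.com/yts/img/youtub_img.png"

# A reference that is already absolute is returned unchanged.
p resolve_image_uri('http://adtangerine.com', 'http://source.com/image.jpg')
# => "http://source.com/image.jpg"
```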
sf_img_regex(exts = [])
Build image regexes according to the required extensions.
Example:
  >> sf_img_regex
  => /(^http{1}[s]?:\/\/([w]{3}\.)?.+\.(jpg|jpeg|png|gif)(\?.+)?$)/i
  >> sf_img_regex([:jpg, :png])
  => /(^http{1}[s]?:\/\/([w]{3}\.)?.+\.(jpg|png|jpeg)(\?.+)?$)/i

When :jpg is requested without :jpeg, :jpeg is appended automatically.
Arguments:
exts: (Array) - Image extensions which will be included in the regex.
  # File lib/scrapifier/support.rb, line 107
  def sf_img_regex(exts = [])
    exts = [exts].flatten unless exts.is_a?(Array)
    if exts.nil? || exts.empty?
      exts = %w(jpg jpeg png gif)
    elsif exts.include?(:jpg) && !exts.include?(:jpeg)
      exts.push :jpeg
    end
    %r{(^http{1}[s]?://([w]{3}\.)?.+\.(#{exts.join('|')})(\?.+)?$)}i
  end
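Building a pattern from a list of extensions can also be done with `Regexp.union`, which escapes each string before joining them with `|`. The sketch below is an illustration under that assumption; `extension_pattern` is a hypothetical helper, and its pattern is a simplified variant of the one in the listing, not a drop-in replacement.

```ruby
# Hypothetical helper (not part of Scrapifier): build one case-insensitive
# pattern that accepts http(s) URLs ending in any of the given extensions.
def extension_pattern(exts)
  names = Array(exts).map(&:to_s)
  names = %w[jpg jpeg png gif] if names.empty?
  # Treat jpg and jpeg as the same type, as the listing above does.
  names << 'jpeg' if names.include?('jpg') && !names.include?('jpeg')

  # Regexp.union escapes each name and joins them with "|".
  alternation = Regexp.union(names).source
  %r{\Ahttps?://.+\.(?:#{alternation})(?:\?.+)?\z}i
end

pattern = extension_pattern([:jpg, :png])
p pattern.match?('http://source.com/image.jpeg')     # => true
p pattern.match?('https://source.com/logo.PNG?v=2')  # => true
p pattern.match?('http://source.com/image.gif')      # => false
p pattern.match?('/assets/image.jpg')                # => false
```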
sf_regex(type, *args)
Select regexes for URIs, protocols and image extensions.
Example:
  >> sf_regex(:uri)
  => /\b((((ht|f)tp[s]?:\/\/).../i
  >> sf_regex(:image, :jpg)
  => /(^http{1}[s]?:\/\/([w]{3}\.)?.+\.(jpg|jpeg)(\?.+)?$)/i
Arguments:
  type: (Symbol or String) - Regex type: :uri, :protocol or :image.
  args: (*) - Extra arguments; for :image, the allowed extensions.

  # File lib/scrapifier/support.rb, line 79
  def sf_regex(type, *args)
    type = type.to_sym unless type.is_a? Symbol
    type == :image && sf_img_regex(args.flatten) || sf_uri_regex[type]
  end
sf_uri_regex()
Build a hash with the URI regexes.
  # File lib/scrapifier/support.rb, line 85
  def sf_uri_regex
    {
      uri: %r{\b(
        (((ht|f)tp[s]?://)|([a-z0-9]+\.))+
        (?<!@)
        ([a-z0-9\_\-]+)
        (\.[a-z]+)+
        ([\?/\:][a-z0-9_=%&@\?\./\-\:\#\(\)]+)?
        /?
      )}ix,
      protocol: /((ht|f)tp[s]?)/i
    }
  end
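The `protocol` pattern in this hash is unanchored, so it matches `http`, `https`, `ftp` or `ftps` anywhere in a string, not only at the start. The sketch below copies that pattern from the listing and contrasts it with an anchored variant; `ANCHORED_PROTOCOL` and both helper methods are hypothetical names introduced only for the comparison.

```ruby
# Copied from the listing above.
PROTOCOL = /((ht|f)tp[s]?)/i

# Hypothetical stricter variant: the scheme must start the string and be
# followed by "://".
ANCHORED_PROTOCOL = %r{\A(?:https?|ftps?)://}i

def loose_protocol?(str)
  str.match?(PROTOCOL)
end

def strict_protocol?(str)
  str.match?(ANCHORED_PROTOCOL)
end

p loose_protocol?('https://adtangerine.com')  # => true
p loose_protocol?('/assets/image.jpg')        # => false

# A relative path that merely contains "http" passes the loose check.
p loose_protocol?('/img/http-logo.png')       # => true
p strict_protocol?('/img/http-logo.png')      # => false
p strict_protocol?('FTP://files.example/a')   # => true
```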
sf_xpaths()
Build a hash that maps each metadata key to its XPath expression.

  # File lib/scrapifier/support.rb, line 118
  def sf_xpaths
    {
      title: XPath::TITLE,
      description: XPath::DESC,
      keywords: XPath::KEYWORDS,
      lang: XPath::LANG,
      encode: XPath::ENCODE,
      reply_to: XPath::REPLY_TO,
      author: XPath::AUTHOR,
      image: XPath::IMG
    }
  end