class WebPageParser::BaseRegexpParser

BaseRegexpParser is designed to be subclassed to write new parsers that use regular expressions. It provides some basic help, but most of the work needs to be done by the subclass.

Parsers for simple pages can be implemented by just defining new regular expression constants; more advanced parsing can be achieved by overriding the *_processor methods.
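
The following is a minimal sketch of that pattern, not the library's own code. MiniRegexpParser is a hypothetical stand-in for the base class so that the example runs without the gem installed; HeadlineParser, SiteSuffixParser and the "Example News" markup are likewise invented for illustration.

```ruby
# Hypothetical stand-in for the base class. It mirrors only the
# constant-lookup and hook pattern described above.
class MiniRegexpParser
  TITLE_RE = /<title>(.*?)<\/title>/im

  def initialize(page)
    @page = page
  end

  def title
    return @title if @title
    if (matches = class_const(:TITLE_RE).match(@page))
      @title = matches[1].to_s.strip
      title_processor
    end
    @title
  end

  private

  # Look the constant up on the instance's own class so that a
  # subclass definition shadows the base-class default.
  def class_const(sym)
    self.class.const_get(sym)
  end

  # Hook for subclasses; does nothing by default.
  def title_processor
  end
end

# A "simple page" parser: only the regular expression changes.
class HeadlineParser < MiniRegexpParser
  TITLE_RE = /<h1 class="headline">(.*?)<\/h1>/im
end

# A more advanced parser: the hook cleans up the extracted text.
class SiteSuffixParser < MiniRegexpParser
  private

  def title_processor
    @title = @title.sub(/\s*\|\s*Example News\z/, '')
  end
end

html = '<title>Budget passes | Example News</title>' \
       '<h1 class="headline"> Budget passes </h1>'
HeadlineParser.new(html).title    # => "Budget passes"
SiteSuffixParser.new(html).title  # => "Budget passes"
```

In this sketch the first subclass gets its result purely from a different constant, while the second keeps the default expression and does its work in the hook.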

Constants

CONTENT_RE

The regular expression to extract the content

DATE_RE

The regular expression to extract the date

HTML_ENTITIES_DECODER

The object used to turn HTML entities into real characters

KILL_CHARS_RE

The regular expression to find all characters that should be removed from any content.

TITLE_RE

The regular expression to extract the title

Public Class Methods

new(options = { })
Calls superclass method WebPageParser::BaseParser::new
# File lib/web-page-parser/base_parser.rb, line 97
def initialize(options = { })
  super(options)
  @page = encode(@page)
end

Public Instance Methods

content()

The content method returns the important body text of the web page.

It does basic extraction and pre-processing of the page content and then calls the content_processor method for any custom processing. Lastly, it does some basic post-processing and returns the content as an array of paragraph strings (an empty array if nothing matched).

When writing a new parser, the CONTENT_RE constant should be defined in the subclass. The KILL_CHARS_RE constant can be overridden if necessary.

# File lib/web-page-parser/base_parser.rb, line 165
def content
  return @content if @content
  matches = class_const(:CONTENT_RE).match(page)
  if matches
    @content = class_const(:KILL_CHARS_RE).gsub(matches[1].to_s, '')
    content_processor
    @content.collect! { |p| decode_entities(p.strip) }
    @content.delete_if { |p| p == '' or p.nil? }
  end
  @content = [] if @content.nil?
  @content
end
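
The pipeline described above can be sketched as a standalone method. This is an illustration under stated assumptions rather than the library's implementation: the CONTENT_RE and KILL_CHARS_RE values are invented for a hypothetical page layout, and CGI.unescapeHTML from the standard library is swapped in for the HTML_ENTITIES_DECODER object.

```ruby
require 'cgi'

# Assumed values for a hypothetical page layout.
CONTENT_RE    = /<div id="story">(.*?)<\/div>/m
KILL_CHARS_RE = /[\n\r]+/

def extract_paragraphs(page)
  matches = CONTENT_RE.match(page)
  return [] unless matches

  # Pre-processing: remove unwanted characters from the captured body.
  text = matches[1].to_s.gsub(KILL_CHARS_RE, '')

  # Default processing step: split into paragraphs on <p>.
  paragraphs = text.split(/<p>/)

  # Post-processing: trim, decode entities, drop empty paragraphs.
  paragraphs.map! { |p| CGI.unescapeHTML(p.strip) }
  paragraphs.reject { |p| p.empty? }
end

page = %(<div id="story">\n<p>Fish &amp; chips\n<p>  Second paragraph  \n</div>)
extract_paragraphs(page)  # => ["Fish & chips", "Second paragraph"]
```
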
date()

The date method returns the timestamp of the web page as a DateTime object.

It does the basic extraction using the DATE_RE regular expression but the work of converting the text into a DateTime object needs to be done by the date_processor method.

# File lib/web-page-parser/base_parser.rb, line 146
def date
  return @date if @date
  if matches = class_const(:DATE_RE).match(page)
    @date = matches[1].to_s.strip
    date_processor
    @date
  end
end
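
As a rough sketch of the two steps (regular expression extraction, then conversion), the method below does both in one place. The DATE_RE value and the meta tag it matches are assumptions made for illustration; a real subclass would put the conversion in date_processor and might need a stricter format than DateTime.parse.

```ruby
require 'date'

# Assumed markup: a pubdate meta tag on a hypothetical page.
DATE_RE = /<meta name="pubdate" content="(.*?)"/

def extract_date(page)
  matches = DATE_RE.match(page)
  return nil unless matches

  raw = matches[1].to_s.strip
  begin
    # The conversion step that date_processor is expected to perform.
    DateTime.parse(raw)
  rescue ArgumentError
    # Unparseable text: report no date rather than raising.
    nil
  end
end

published = extract_date('<meta name="pubdate" content=" 2011-03-04T10:15:00+00:00 ">')
published.year  # => 2011
```
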
decode_entities(s)

Converts HTML entities to Unicode characters.

# File lib/web-page-parser/base_parser.rb, line 179
def decode_entities(s)
  HTML_ENTITIES_DECODER.decode(s)
end
encode(s)

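Handles string encoding. Returns the string unchanged if it is nil or already valid in its current encoding; otherwise the bytes are reinterpreted as ISO-8859-1 and transcoded to UTF-8.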

# File lib/web-page-parser/base_parser.rb, line 103
def encode(s)
  return s if s.nil?
  return s if s.valid_encoding?
  if s.force_encoding("iso-8859-1").valid_encoding?
    return s.encode('utf-8', 'iso-8859-1')
  end
  s
end
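
The fallback can be seen in isolation with a small example. to_utf8 is a hypothetical helper written for this illustration; unlike the method listed above, it duplicates the string before calling force_encoding, so the caller's string is left untouched.

```ruby
# Hypothetical helper mirroring the fallback logic described above.
def to_utf8(s)
  return s if s.nil? || s.valid_encoding?

  # Reinterpret the raw bytes as ISO-8859-1, then transcode to UTF-8.
  s.dup.force_encoding('iso-8859-1').encode('utf-8')
end

# The byte 0xE9 is "é" in ISO-8859-1 but is not valid UTF-8 on its own.
mislabelled = "caf\xE9".b.force_encoding('utf-8')
mislabelled.valid_encoding?  # => false
to_utf8(mislabelled)         # => "café"
```
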
page()

Returns the page contents, retrieving them from the server if necessary.

# File lib/web-page-parser/base_parser.rb, line 113
def page
  @page ||= retrieve_page
end
retrieve_page(rurl = nil)

Requests the page from the server and returns the raw contents.

# File lib/web-page-parser/base_parser.rb, line 118
def retrieve_page(rurl = nil)
  durl = rurl || url
  return nil unless durl
  durl = filter_url(durl) if self.respond_to?(:filter_url)
  self.class.retrieve_session ||= WebPageParser::HTTP::Session.new
  encode(self.class.retrieve_session.get(durl))
end
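
The interplay between page and retrieve_page is a lazy, memoised fetch. The sketch below shows the idea with a hypothetical LazyPage class whose fetcher lambda stands in for the HTTP session; none of these names come from the library.

```ruby
# Hypothetical class: the fetcher lambda replaces the real HTTP session.
class LazyPage
  attr_reader :fetch_count

  def initialize(url: nil, page: nil, fetcher: ->(u) { "<html>#{u}</html>" })
    @url = url
    @page = page
    @fetcher = fetcher
    @fetch_count = 0
  end

  # Memoised: the fetch happens on first access only, and not at all
  # when the page contents were supplied up front.
  def page
    @page ||= retrieve_page
  end

  def retrieve_page
    return nil unless @url
    @fetch_count += 1
    @fetcher.call(@url)
  end
end

lazy = LazyPage.new(url: 'http://example.com/a')
2.times { lazy.page }
lazy.fetch_count  # => 1
```

One consequence of the `||=` idiom is that a nil result is not cached, so a parser without a URL will attempt retrieval again on every call.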
title()

The title method returns the title of the web page.

It does the basic extraction using the TITLE_RE regular expression and handles text encoding. More advanced parsing can be done by overriding the title_processor method.

# File lib/web-page-parser/base_parser.rb, line 131
def title
  return @title if @title
  if matches = class_const(:TITLE_RE).match(page)
    @title = matches[1].to_s.strip
    title_processor
    @title = decode_entities(@title)
  end
end

Private Instance Methods

class_const(sym)

Gets the constant from this object's class.

# File lib/web-page-parser/base_parser.rb, line 186
def class_const(sym)
  self.class.const_get(sym)
end
content_processor()

Custom content parsing. It should split @content into an array of paragraphs. Conversion to UTF-8 is done after this method.

# File lib/web-page-parser/base_parser.rb, line 192
def content_processor
  @content = @content.split(/<p>/)
end
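
A subclass might replace the default split with something stricter. The helper below is one possible sketch, written as a plain method so it runs on its own; split_paragraphs and its regular expressions are assumptions for illustration. It splits on opening and closing paragraph tags and removes any remaining inline tags, leaving empty strings for the content method's later clean-up to discard.

```ruby
# Hypothetical replacement for the default split on <p>.
def split_paragraphs(content)
  content
    .split(/<\/?p(?:\s[^>]*)?>/i)           # break on <p ...> and </p>
    .map { |para| para.gsub(/<[^>]+>/, '') } # strip leftover inline tags
end

split_paragraphs('<p class="lead">First <b>bold</b></p><p>Second</p>')
# => ["", "First bold", "", "Second"]
```
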
date_processor()

Custom date parsing. It should parse @date into a DateTime object.

# File lib/web-page-parser/base_parser.rb, line 197
def date_processor
end
title_processor()

Custom title parsing. It should clean up @title as necessary. Conversion to UTF-8 is done after this method.

# File lib/web-page-parser/base_parser.rb, line 202
def title_processor
end