class WebPageParser::BaseRegexpParser
BaseRegexpParser is designed to be subclassed to write new parsers that use regular expressions. It provides some basic help, but most of the work needs to be done by the subclass.
Simple pages could be implemented by just defining new regular expression constants, but more advanced parsing can be achieved with the *_processor methods.
Constants
- CONTENT_RE
The regular expression to extract the content
- DATE_RE
The regular expression to extract the date
- HTML_ENTITIES_DECODER
The object used to turn HTML entities into real characters
- KILL_CHARS_RE
The regular expression to find all characters that should be removed from any content.
- TITLE_RE
The regular expression to extract the title
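As a rough illustration of the subclassing pattern described above, the sketch below overrides one regular expression constant and one *_processor hook. SketchBaseParser is a hypothetical stand-in that copies only the title lookup from this page, so the example runs without the gem installed; ExampleNewsParser, both TITLE_RE patterns, the positional page argument, and the sample HTML are all invented for illustration.

```ruby
# Hypothetical stand-in for WebPageParser::BaseRegexpParser. It mirrors only
# the title lookup shown in this page and is NOT the library class.
class SketchBaseParser
  TITLE_RE = /<title>(.*?)<\/title>/im # assumed default pattern

  attr_reader :page

  def initialize(page)
    @page = page
  end

  def title
    return @title if @title
    if (matches = class_const(:TITLE_RE).match(page))
      @title = matches[1].to_s.strip
      title_processor
      @title
    end
  end

  private

  # Look the constant up on the receiver's class so subclasses can override it.
  def class_const(sym)
    self.class.const_get(sym)
  end

  # Hook for subclasses; does nothing by default.
  def title_processor; end
end

# Hypothetical parser for a site that puts its headline in <h1 class="headline">.
class ExampleNewsParser < SketchBaseParser
  TITLE_RE = /<h1 class="headline">(.*?)<\/h1>/im

  private

  # Remove a trailing " | Example News" site suffix from the extracted title.
  def title_processor
    @title = @title.sub(/\s*\|\s*Example News\z/, '')
  end
end

html = '<html><h1 class="headline"> Budget passes | Example News </h1></html>'
parser = ExampleNewsParser.new(html)
puts parser.title
```

A simple site may need only the constant; the hook is there for cleanup that a single regular expression cannot express.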
Public Class Methods
WebPageParser::BaseParser::new
# File lib/web-page-parser/base_parser.rb, line 97
def initialize(options = { })
  super(options)
  @page = encode(@page)
end
Public Instance Methods
The content method returns the important body text of the web page.
It does basic extraction and pre-processing of the page content and then calls the content_processor method for any custom processing that needs doing. Lastly, it does some basic post-processing and returns the content as an array of paragraph strings (an empty array if nothing matched).
When writing a new parser, the CONTENT_RE constant should be defined in the subclass. The KILL_CHARS_RE constant can be overridden if necessary.
# File lib/web-page-parser/base_parser.rb, line 165
def content
  return @content if @content
  matches = class_const(:CONTENT_RE).match(page)
  if matches
    @content = class_const(:KILL_CHARS_RE).gsub(matches[1].to_s, '')
    content_processor
    @content.collect! { |p| decode_entities(p.strip) }
    @content.delete_if { |p| p == '' or p.nil? }
  end
  @content = [] if @content.nil?
  @content
end
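The steps in the content method can be traced with a standalone sketch. It assumes plain core Regexp objects and therefore uses String#gsub, whereas the listing above calls gsub on the constant itself, which suggests the library's constants are a different regexp type. Both patterns, the extract_paragraphs name, and the sample HTML are hypothetical, and entity decoding is left out.

```ruby
# Hypothetical patterns; a real parser would tailor these to the target site.
CONTENT_RE    = /<div id="story">(.*?)<\/div>/m
KILL_CHARS_RE = /[\n\r]+/

# Mirrors the order of operations in the content method:
# extract, remove unwanted characters, split into paragraphs, tidy up.
def extract_paragraphs(html)
  matches = CONTENT_RE.match(html)
  return [] unless matches

  text = matches[1].to_s.gsub(KILL_CHARS_RE, '')
  paragraphs = text.split(/<p>/)   # same split as the default content_processor
  paragraphs.map!(&:strip)
  paragraphs.reject(&:empty?)      # the text before the first <p> is empty
end

html = "<div id=\"story\">\n<p>First paragraph.\n<p> Second paragraph. \n</div>"
paragraphs = extract_paragraphs(html)
puts paragraphs.inspect
```

Splitting on the opening tag alone leaves any closing tags in the text, so a subclass whose pages use them would likely strip them in its own content_processor.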
The date method returns the timestamp of the web page as a DateTime object.
It does the basic extraction using the DATE_RE regular expression, but the work of converting the text into a DateTime object needs to be done by the date_processor method.
# File lib/web-page-parser/base_parser.rb, line 146
def date
  return @date if @date
  if matches = class_const(:DATE_RE).match(page)
    @date = matches[1].to_s.strip
    date_processor
    @date
  end
end
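Because the default date_processor is empty, the extracted text stays a string until a subclass converts it. The sketch below shows one way that conversion might look, using DateTime.parse from Ruby's standard date library. The DATE_RE pattern, the extract_date name, and the sample markup are assumptions made for illustration, not part of the library.

```ruby
require 'date'

# Hypothetical pattern for a page that exposes its timestamp in a meta tag.
DATE_RE = /<meta name="pubdate" content="(.*?)"/i

# Extract the raw date text, then do the conversion that a subclass's
# date_processor would be responsible for.
def extract_date(html)
  matches = DATE_RE.match(html)
  return nil unless matches

  raw = matches[1].to_s.strip
  DateTime.parse(raw)
rescue ArgumentError
  # DateTime.parse raises an ArgumentError subclass on unparseable text.
  nil
end

html = '<meta name="pubdate" content=" 2009-08-18T14:30:00+01:00 ">'
date = extract_date(html)
puts date
```

Sites that print dates in a fixed, unusual layout may be better served by DateTime.strptime with an explicit format string.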
Convert HTML entities to Unicode
# File lib/web-page-parser/base_parser.rb, line 179
def decode_entities(s)
  HTML_ENTITIES_DECODER.decode(s)
end
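Handle string encoding. A nil or validly encoded string is returned unchanged; otherwise the string is treated as ISO-8859-1 and transcoded to UTF-8.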
# File lib/web-page-parser/base_parser.rb, line 103
def encode(s)
  return s if s.nil?
  return s if s.valid_encoding?
  if s.force_encoding("iso-8859-1").valid_encoding?
    return s.encode('utf-8', 'iso-8859-1')
  end
  s
end
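To make the fallback concrete, the sketch below runs the same logic on a string whose bytes are not valid UTF-8. The to_utf8 name and the sample bytes are hypothetical. Unlike the listing above, the sketch works on a copy, so the caller's string keeps its original encoding tag.

```ruby
# Same decision order as the encode method: nil and valid strings pass
# through; anything else is reinterpreted as ISO-8859-1 and transcoded.
def to_utf8(s)
  return s if s.nil?
  return s if s.valid_encoding?

  latin1 = s.dup.force_encoding('iso-8859-1')
  return latin1.encode('utf-8', 'iso-8859-1') if latin1.valid_encoding?

  s
end

# 0xE9 is "e acute" in ISO-8859-1 but an incomplete sequence in UTF-8.
raw = "caf\xE9".b.force_encoding('UTF-8')
puts raw.valid_encoding?   # the mislabelled input is not valid UTF-8

fixed = to_utf8(raw)
puts fixed.valid_encoding?
puts fixed
```

Every byte sequence is valid ISO-8859-1, so the fallback always produces well-formed UTF-8, although the characters will be wrong if the page was really in some other legacy encoding.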
Return the page contents, retrieving them from the server if necessary
# File lib/web-page-parser/base_parser.rb, line 113
def page
  @page ||= retrieve_page
end
Request the page from the server and return its contents after encoding handling
# File lib/web-page-parser/base_parser.rb, line 118
def retrieve_page(rurl = nil)
  durl = rurl || url
  return nil unless durl
  durl = filter_url(durl) if self.respond_to?(:filter_url)
  self.class.retrieve_session ||= WebPageParser::HTTP::Session.new
  encode(self.class.retrieve_session.get(durl))
end
The title method returns the title of the web page.
It does the basic extraction using the TITLE_RE regular expression and handles text encoding. More advanced parsing can be done by overriding the title_processor method.
# File lib/web-page-parser/base_parser.rb, line 131
def title
  return @title if @title
  if matches = class_const(:TITLE_RE).match(page)
    @title = matches[1].to_s.strip
    title_processor
    @title = decode_entities(@title)
  end
end
Private Instance Methods
Get the constant from this object's class
# File lib/web-page-parser/base_parser.rb, line 186
def class_const(sym)
  self.class.const_get(sym)
end
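The indirection through self.class.const_get is what lets a subclass replace a constant that a base class method reads. A bare constant reference inside a method is resolved in the lexical scope where the method was defined, so it would always see the base class value. The class and constant names in this sketch are hypothetical and exist only to show the difference.

```ruby
class Base
  GREETING_RE = /hello/

  # Bare reference: resolved where the method is written, i.e. in Base.
  def lexical_lookup
    GREETING_RE
  end

  # Dynamic reference: resolved starting from the receiver's own class.
  def dynamic_lookup
    self.class.const_get(:GREETING_RE)
  end
end

class Child < Base
  GREETING_RE = /bonjour/
end

# A subclass that overrides nothing still finds the inherited constant.
class Plain < Base
end

child = Child.new
puts child.lexical_lookup.source
puts child.dynamic_lookup.source
puts Plain.new.dynamic_lookup.source
```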
Custom content parsing. It should split the @content up into an array of paragraphs. Conversion to utf8 is done after this method.
# File lib/web-page-parser/base_parser.rb, line 192
def content_processor
  @content = @content.split(/<p>/)
end
Custom date parsing. It should parse @date into a DateTime object
# File lib/web-page-parser/base_parser.rb, line 197
def date_processor
end
Custom title parsing. It should clean up @title as necessary. Conversion to utf8 is done after this method.
# File lib/web-page-parser/base_parser.rb, line 202
def title_processor
end