module MMETools::Webparse

Methods for processing strings while parsing web pages.

Public Instance Methods

acronymize(str)

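Transforms the string str into an acronym. Single-letter words and the words "de" and "en" are skipped; if nothing is left, the acronym is built from every word instead.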

# File lib/mme_tools/webparse.rb, line 54
def acronymize(str)
  cleared_str = clear_string(str, :encoding => 'ASCII').gsub(/\W/," ")

  # option 1: skip single-letter words and the words "de" and "en"
  unwanted_words_pttrn = %w[de en].map {|w| "\\b#{w}\\b"}.join("|")
  res = cleared_str.gsub(/\b\w\b|#{unwanted_words_pttrn}/i," ")
  res = res.split(" ").map {|s| s[0..0].upcase}.join

  # option 2: if option 1 left nothing, use every word
  if res == ""
    res = cleared_str.split(" ").map {|s| s[0..0].upcase}.join
  end
  res
end
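
The two passes above can be tried in isolation with a minimal sketch. The name `acronymize_sketch` is hypothetical, and the sketch assumes the input is already plain ASCII, so the call to clear_string is left out.

```ruby
# Hypothetical standalone sketch of the two-pass acronym logic.
# Assumes ASCII input; the original first runs clear_string(str, :encoding => 'ASCII').
def acronymize_sketch(str)
  cleared = str.gsub(/\W/, " ")
  # Pass 1: blank out one-letter words and the words "de" and "en".
  res = cleared.gsub(/\b\w\b|\bde\b|\ben\b/i, " ")
  res = res.split(" ").map { |s| s[0].upcase }.join
  # Pass 2: if nothing survived, build the acronym from every word.
  res = cleared.split(" ").map { |s| s[0].upcase }.join if res.empty?
  res
end

puts acronymize_sketch("Universitat Politecnica de Catalunya")  # prints "UPC"
puts acronymize_sketch("de en")                                  # prints "DE"
```

The second call shows why the fallback exists: a string made only of skipped words would otherwise produce an empty acronym.
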
asciify(str)

Tries to convert str to plain, strict ASCII

# File lib/mme_tools/webparse.rb, line 49
def asciify(str)
  Iconv.conv('ASCII//TRANSLIT//IGNORE', 'UTF8', str)
end
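
Iconv was removed from the Ruby standard library in Ruby 2.0, so the method above needs the separate iconv gem on current interpreters. As a rough substitute, and not the author's method, the following sketch uses `String#unicode_normalize` and `String#encode`. The name `asciify_sketch` is hypothetical, and the output is not guaranteed to match what Iconv's TRANSLIT mode produces.

```ruby
# Hypothetical Iconv-free sketch: decompose accented characters, drop the
# combining marks, then discard anything that still is not ASCII.
def asciify_sketch(str)
  decomposed = str.unicode_normalize(:nfd)   # "é" becomes "e" + combining accent
  stripped   = decomposed.gsub(/\p{Mn}/, "") # remove the combining marks
  stripped.encode("US-ASCII", invalid: :replace, undef: :replace, replace: "")
end

puts asciify_sketch("Caf\u00E9 \u00F1and\u00FA")  # prints "Cafe nandu"
```

Characters with no canonical decomposition, such as "ø" or "æ", are simply dropped by this sketch.
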
clear_string(str, opts={})

Removes unnecessary whitespace and HTML codes from the middle and the ends of a string. It cleans the string by removing all non-printable characters from its ends and replacing those in the middle with a single space. The options opts can be:

+:encoding+ => "ASCII" | "UTF8" (default)
  "ASCII" converts every character to its closest ASCII equivalent (with Iconv)
  "UTF8" returns a UTF8 string

(based on an idea by Obie Fernandez, www.jroller.com/obie/tags/unicode)

# File lib/mme_tools/webparse.rb, line 37
def clear_string(str, opts={})
  options = {:encoding=>'UTF8'}.merge opts  # default option :encoding=>'UTF8'
  str=str.chars.map { |c| (c.bytes[0] <= 127) ? c : translation_hash[c] }.join if options[:encoding]=='ASCII'
  str.gsub(/[\s\302\240]+/mu," ").strip # the UTF8 character "\302\240" corresponds to HTML's &nbsp;
end
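
The behaviour described above can be illustrated with a small self-contained sketch. `clear_string_sketch` and `TRANSLIT_SAMPLE` are hypothetical names; the sketch takes a keyword argument instead of the opts hash, writes the non-breaking space as `\u00A0`, and uses a three-entry table where the module uses translation_hash.

```ruby
# Tiny illustrative subset of a transliteration table (assumption, not the real one).
TRANSLIT_SAMPLE = { "\u00E0" => "a", "\u00E9" => "e", "\u00E7" => "c" }.freeze

# Hypothetical sketch: optional ASCII transliteration, then whitespace cleanup.
def clear_string_sketch(str, encoding: "UTF8")
  if encoding == "ASCII"
    # ASCII characters pass through; others are looked up, unknown ones vanish.
    str = str.chars.map { |c| c.ord <= 127 ? c : TRANSLIT_SAMPLE[c] }.join
  end
  # Collapse runs of whitespace and non-breaking spaces, then trim the ends.
  str.gsub(/[\s\u00A0]+/, " ").strip
end

puts clear_string_sketch("  Caf\u00E9\u00A0\u00A0 amb  llet ")                     # prints "Café amb llet"
puts clear_string_sketch("  Caf\u00E9\u00A0\u00A0 amb  llet ", encoding: "ASCII")  # prints "Cafe amb llet"
```
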
clear_uri(uri)

Returns a URI with any JavaScript invocation removed, if there is one. For example:

"javascript:openDoc('/gisa/documentos/cartes/PT.DOC')" -> "/gisa/documentos/cartes/PT.DOC"
# File lib/mme_tools/webparse.rb, line 22
def clear_uri uri
  case uri
  when /Doc\('.*'\)/ then uri.match(/Doc\('(.*)'\)/).captures[0]
  else uri
  end
end
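
A minimal standalone sketch of the same extraction follows, for experimenting outside the module. The name `clear_uri_sketch` is hypothetical; it matches once and falls back to the input when the pattern is absent.

```ruby
# Hypothetical sketch: pull the quoted argument out of a "...Doc('...')" call,
# or return the URI unchanged when no such call is present.
def clear_uri_sketch(uri)
  m = uri.match(/Doc\('(.*)'\)/)
  m ? m[1] : uri
end

puts clear_uri_sketch("javascript:openDoc('/gisa/documentos/cartes/PT.DOC')")
# prints "/gisa/documentos/cartes/PT.DOC"
puts clear_uri_sketch("/index.html")
# prints "/index.html"
```
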
datify(str)

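Extracts and returns the first probable DateTime found in a string. The date is read as day, month and year separated by / or -, optionally followed by hour:minute. Two-digit years 0..69 are read as 2000..2069, and 70..99 as 1970..1999.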

# File lib/mme_tools/webparse.rb, line 78
def datify(str)
  pttrn = /(\d+)[\/-](\d+)[\/-](\d+)(\W+(\d+)\:(\d+))?/
  day, month, year, dummy, hour, min = str.match(pttrn).captures.map {|d| d ? d.to_i : 0 }
  case year
  when 0..69
    year += 2000
  when 70..99
    year += 1900
  end
  DateTime.civil year, month, day, hour, min
end
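
As written, the method raises NoMethodError when str contains no date, because `str.match` returns nil. The sketch below shows the same parsing with a nil guard added. Both the name `datify_sketch` and the choice to return nil are assumptions made for illustration, not part of the module.

```ruby
require "date"

# Hypothetical sketch of the day/month/year parsing with a guard for "no match".
def datify_sketch(str)
  m = str.match(/(\d+)[\/-](\d+)[\/-](\d+)(\W+(\d+):(\d+))?/)
  return nil unless m  # assumption: nil instead of an exception
  # Missing captures (no time part) become 0; the 4th capture is the whole time group.
  day, month, year, _time, hour, min = m.captures.map { |d| d ? d.to_i : 0 }
  case year
  when 0..69  then year += 2000
  when 70..99 then year += 1900
  end
  DateTime.civil(year, month, day, hour, min)
end

puts datify_sketch("Published 25/12/09 14:30")  # prints "2009-12-25T14:30:00+00:00"
puts datify_sketch("3-7-1998").to_date          # prints "1998-07-03"
p    datify_sketch("no date here")              # prints nil
```
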
shorten(str)

Transforms str into a shortened version: converts it to ASCII, treats every non-alphanumeric character as a word separator, and joins the first two letters of each word, capitalized.

# File lib/mme_tools/webparse.rb, line 72
def shorten(str)
  cleared_str = clear_string(str, :encoding => 'ASCII').gsub(/\W/," ")
  cleared_str.split(" ").map {|s| s[0..1].capitalize}.join
end
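
A short sketch of the same idea follows. The name `shorten_sketch` is hypothetical, and the sketch assumes the input is already ASCII, so the clear_string call is skipped.

```ruby
# Hypothetical sketch: first two letters of every word, capitalized and joined.
# Assumes ASCII input; the original first runs clear_string(str, :encoding => 'ASCII').
def shorten_sketch(str)
  str.gsub(/\W/, " ").split(" ").map { |s| s[0..1].capitalize }.join
end

puts shorten_sketch("Generalitat de Catalunya")  # prints "GeDeCa"
puts shorten_sketch("web-page parser")           # prints "WePaPa"
```

Unlike acronymize, no words are skipped here, so "de" contributes "De" to the result.
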

Protected Instance Methods

setup_translation_hash()
# File lib/mme_tools/webparse.rb, line 96
def setup_translation_hash
  accented_chars   = "ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüý".chars.map{|c| c}
  unaccented_chars = "AAAAAACEEEEIIIIDNOOOOOxOUUUUYaaaaaaceeeeiiiinoooooouuuuy".split('')

  translation_hash = {}
  accented_chars.each_with_index { |char, idx| translation_hash[char] = unaccented_chars[idx] }
  translation_hash["Æ"] = 'AE'
  translation_hash["æ"] = 'ae'
  translation_hash
end
translation_hash()
# File lib/mme_tools/webparse.rb, line 92
def translation_hash
  @@translation_hash ||= setup_translation_hash
end
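
The two protected methods work together: one builds a character-to-replacement table from two strings aligned position by position, the other caches it with `||=` so it is built only once. The sketch below shows that pattern on a four-character sample. The module name `TranslitSketch` and its constants are hypothetical, and it memoizes in a module-level instance variable where the original uses a class variable.

```ruby
# Hypothetical sketch of a lazily built, cached transliteration table.
module TranslitSketch
  ACCENTED   = "\u00E0\u00E9\u00F1\u00FC"  # à é ñ ü
  UNACCENTED = "aenu"                      # must line up position by position

  def self.table
    # Built on the first call, reused afterwards (memoization via ||=).
    @table ||= ACCENTED.chars.zip(UNACCENTED.chars).to_h
                       .merge("\u00E6" => "ae")  # multi-letter replacement for æ
  end
end

puts TranslitSketch.table["\u00F1"]                       # prints "n"
puts TranslitSketch.table["\u00E6"]                       # prints "ae"
puts TranslitSketch.table.equal?(TranslitSketch.table)    # prints "true"
```

Because the pairing is purely positional, the two strings must have the same length; a single missing or extra character shifts every mapping after it.
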