module MMETools::Webparse
Methods for processing strings while parsing web pages.
Public Instance Methods
acronymize(str)
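Transforms the string str into an acronym. One-letter words and the words "de" and "en" are skipped, unless that would leave nothing, in which case every word contributes its initial.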
# File lib/mme_tools/webparse.rb, line 54
def acronymize(str)
  cleared_str = clear_string(str, :encoding => 'ASCII').gsub(/\W/," ")
  # option 1
  unwanted_words_pttrn = %w[de en].map {|w| "\\b#{w}\\b"}.join("|")
  res = cleared_str.gsub(/\b\w\b|#{unwanted_words_pttrn}/i," ")
  res = res.split(" ").map {|s| s[0..0].upcase}.join
  # option 2
  if res == ""
    res = cleared_str.split(" ").map {|s| s[0..0].upcase}.join
  end
  res
end
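The two-step logic of acronymize can be tried in isolation. The following is only a sketch: acronymize_sketch is a hypothetical name, and it assumes the input is already plain ASCII, so the clear_string call is left out.

```ruby
# Standalone sketch of the acronymize logic (hypothetical helper, not the
# library method). Assumes ASCII input, so clear_string is skipped.
def acronymize_sketch(str)
  cleared = str.gsub(/\W/, " ")
  # Option 1: blank out one-letter words and the stop words "de" and "en".
  unwanted = %w[de en].map { |w| "\\b#{w}\\b" }.join("|")
  res = cleared.gsub(/\b\w\b|#{unwanted}/i, " ")
  res = res.split(" ").map { |s| s[0..0].upcase }.join
  # Option 2: if nothing survived, fall back to the initial of every word.
  res = cleared.split(" ").map { |s| s[0..0].upcase }.join if res == ""
  res
end

puts acronymize_sketch("Departament de Medi Ambient")  # prints "DMA"
puts acronymize_sketch("de en")                         # prints "DE"
```

The second call shows why the fallback exists: a string made only of skipped words would otherwise yield an empty acronym.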
asciify(str)
Tries to convert str to plain, pure ASCII.
# File lib/mme_tools/webparse.rb, line 49
def asciify(str)
  Iconv.conv('ASCII//TRANSLIT//IGNORE', 'UTF8', str)
end
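Iconv has not shipped with Ruby's standard library since Ruby 2.0, so asciify as listed needs the separate iconv gem on current interpreters. A rough stand-in can be sketched with String#unicode_normalize and String#encode. This is not the library's code: asciify_sketch is a hypothetical name, and its output will differ from Iconv's TRANSLIT mode for characters that have no decomposition (they become "?").

```ruby
# Hypothetical replacement sketch for asciify on Ruby >= 2.2.
# Not equivalent to Iconv's //TRANSLIT//IGNORE in every case.
def asciify_sketch(str)
  # NFD splits "é" into "e" followed by a combining accent.
  decomposed = str.unicode_normalize(:nfd)
  # Remove the combining marks, leaving the base letters.
  stripped = decomposed.gsub(/\p{Mn}/, "")
  # Anything still outside ASCII is replaced with "?".
  stripped.encode("ASCII", undef: :replace, invalid: :replace, replace: "?")
end

puts asciify_sketch("Añoranza café")  # prints "Anoranza cafe"
```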
clear_string(str, opts={})
Removes unnecessary whitespace and HTML codes from the middle and the ends of a string. It cleans the string by stripping all non-printable characters from its ends and replacing each run of them in the middle with a single space. The options opts can be:
:encoding => "ASCII" | "UTF8" (default)
"ASCII" converts every character to its closest ASCII equivalent (with Iconv)
"UTF8" returns a UTF8 string
(Based on an idea by Obie Fernandez, www.jroller.com/obie/tags/unicode)
# File lib/mme_tools/webparse.rb, line 37
def clear_string(str, opts={})
  options = {:encoding=>'UTF8'}.merge opts  # default option :encoding=>'UTF8'
  str=str.chars.map { |c| (c.bytes[0] <= 127) ? c : translation_hash[c] }.join if options[:encoding]=='ASCII'
  str.gsub(/[\s\302\240]+/mu," ").strip # the UTF8 character "\302\240" corresponds to HTML's &nbsp;
end
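The whitespace-collapsing step can be illustrated on its own. This is a minimal sketch under assumptions: squeeze_whitespace is a hypothetical name, and the non-breaking space is written as \u00A0 rather than as the octal byte pair used in the listing.

```ruby
# Sketch of the final step of clear_string: collapse runs of whitespace and
# non-breaking spaces (U+00A0, HTML &nbsp;) into one space, then trim.
def squeeze_whitespace(str)
  str.gsub(/[\s\u00A0]+/, " ").strip
end

puts squeeze_whitespace("  Hola\u00A0\u00A0 mon \n")  # prints "Hola mon"
```

Ruby's \s does not match U+00A0, which is why the character has to be named explicitly in the class.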
clear_uri(uri)
Returns a URI with any JavaScript invocation stripped out. For example:
"javascript:openDoc('/gisa/documentos/cartes/PT.DOC')" -> "/gisa/documentos/cartes/PT.DOC"
# File lib/mme_tools/webparse.rb, line 22
def clear_uri uri
  case uri
  when /Doc\('.*'\)/
  then uri.match(/Doc\('(.*)'\)/).captures[0]
  else uri
  end
end
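The same extraction can be written with a single match, as in this sketch. clear_uri_sketch is a hypothetical name; it is meant to behave like the listing above for both the JavaScript wrapper and ordinary URIs.

```ruby
# Sketch: pull the quoted path out of a "...Doc('path')" JavaScript call,
# or return the URI unchanged when there is no such call.
def clear_uri_sketch(uri)
  m = uri.match(/Doc\('(.*)'\)/)
  m ? m[1] : uri
end

puts clear_uri_sketch("javascript:openDoc('/gisa/documentos/cartes/PT.DOC')")
# prints "/gisa/documentos/cartes/PT.DOC"
puts clear_uri_sketch("/index.html")
# prints "/index.html"
```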
datify(str)
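Extracts the first date (with an optional time) found in a string and returns it as a DateTime. Dates are read as day/month/year, and two-digit years are expanded: 0..69 become 20xx and 70..99 become 19xx.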
# File lib/mme_tools/webparse.rb, line 78
def datify(str)
  pttrn = /(\d+)[\/-](\d+)[\/-](\d+)(\W+(\d+)\:(\d+))?/
  day, month, year, dummy, hour, min = str.match(pttrn).captures.map {|d| d ? d.to_i : 0 }
  case year
  when 0..69
    year += 2000
  when 70..99
    year += 1900
  end
  DateTime.civil year, month, day, hour, min
end
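As listed, datify raises NoMethodError when the string contains no date, because match returns nil. The sketch below keeps the same pattern and year windowing but returns nil in that case. That guard is an assumption of this example, not the library's behaviour, and datify_sketch is a hypothetical name.

```ruby
require "date"

# Sketch of datify with a nil guard (the guard is this example's assumption).
def datify_sketch(str)
  m = str.match(/(\d+)[\/-](\d+)[\/-](\d+)(\W+(\d+):(\d+))?/)
  return nil unless m
  # Missing hour/minute captures are nil and default to 0.
  day, month, year, _time, hour, min = m.captures.map { |d| d ? d.to_i : 0 }
  # Two-digit year windowing, as in the original listing.
  case year
  when 0..69  then year += 2000
  when 70..99 then year += 1900
  end
  DateTime.civil(year, month, day, hour, min)
end

d = datify_sketch("Publicat el 3/11/08 14:30")
puts d.strftime("%Y-%m-%d %H:%M")  # prints "2008-11-03 14:30"
```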
shorten(str)
Transforms str into a shortened version: converts it to ASCII, treats every non-alphanumeric character as a separator, and joins the first two letters of each word, capitalized.
# File lib/mme_tools/webparse.rb, line 72
def shorten(str)
  cleared_str = clear_string(str, :encoding => 'ASCII').gsub(/\W/," ")
  cleared_str.split(" ").map {|s| s[0..1].capitalize}.join
end
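A quick way to see what shorten produces is to run its core on ASCII input. The helper below is a hypothetical sketch that omits the clear_string call and therefore assumes the input contains no accented characters.

```ruby
# Sketch of shorten's core: split on non-word characters and keep the first
# two letters of each word, capitalized. Assumes ASCII input.
def shorten_sketch(str)
  str.gsub(/\W/, " ").split(" ").map { |s| s[0..1].capitalize }.join
end

puts shorten_sketch("medi ambient i habitatge")  # prints "MeAmIHa"
```

Unlike acronymize, one-letter words are kept, which is why the lone "i" contributes "I" to the result.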
Protected Instance Methods
setup_translation_hash()
# File lib/mme_tools/webparse.rb, line 96
def setup_translation_hash
  accented_chars = "ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüý".chars.map{|c| c}
  unaccented_chars = "AAAAAACEEEEIIIIDNOOOOOxOUUUUYaaaaaaceeeeiiiinoooooouuuuy".split('')
  translation_hash = {}
  accented_chars.each_with_index { |char, idx| translation_hash[char] = unaccented_chars[idx] }
  translation_hash["Æ"] = 'AE'
  translation_hash["æ"] = 'ae'
  translation_hash
end
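The table-building idea can be condensed with Array#zip. The sketch below uses a deliberately shortened character set, and the names ACCENTED, UNACCENTED, TABLE, and transliterate are hypothetical. As in clear_string with :encoding => 'ASCII', non-ASCII characters that are missing from the table are dropped.

```ruby
# Sketch: build an accent-stripping table by pairing two parallel strings.
# The character set is abbreviated for illustration.
ACCENTED   = "ÀÉÍÒÜàéíòüçñ".chars
UNACCENTED = "AEIOUaeioucn".chars
TABLE = ACCENTED.zip(UNACCENTED).to_h.merge("Æ" => "AE", "æ" => "ae")

# ASCII characters pass through; others are looked up, or dropped if unknown.
def transliterate(str)
  str.chars.map { |c| c.ascii_only? ? c : TABLE.fetch(c, "") }.join
end

puts transliterate("Òscar Muñoz")  # prints "Oscar Munoz"
puts transliterate("Æon")          # prints "AEon"
```

Both strings must stay the same length and in the same order, since each accented character is matched to the unaccented one at the same index.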
translation_hash()
# File lib/mme_tools/webparse.rb, line 92
def translation_hash
  @@translation_hash ||= setup_translation_hash
end