class Ai::Nlp::Hasher
Class managing an n-gram hash
Public Class Methods
new(input)
Initializes the hasher and cleans the input
@param string input The string to process
# File lib/ai/nlp/n_gram/hasher.rb, line 18
def initialize(input)
  @input = input
  @hash = {}
  clean
end
Public Instance Methods
calculate()
Calculates n-gram frequencies for the dataset
@return array The [n-gram, frequency] pairs, sorted by descending frequency
# File lib/ai/nlp/n_gram/hasher.rb, line 27
def calculate
  @input.split(/[\d\s\[\]]/).each do |word|
    calculate_word_gram("_#{word}_")
  end
  drop_unwanted_keys
  @hash.sort { |one, other| other[1] <=> one[1] }
end
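As a rough illustration of what this counting amounts to, the sketch below reproduces it outside the class using only the standard library. The method name `ngram_frequencies` is hypothetical and not part of `Ai::Nlp::Hasher`; it skips the cleaning step and assumes the text is already clean and lowercase.

```ruby
# Standalone sketch, not the library class: counts the 1-, 2- and 3-grams
# of each word after padding it with "_" on both sides.
def ngram_frequencies(text)
  counts = Hash.new(0)
  text.split(/[\d\s\[\]]/).each do |word|
    padded = "_#{word}_"
    (1..3).each do |n|
      # each_cons yields every run of n consecutive characters
      padded.chars.each_cons(n) { |gram| counts[gram.join] += 1 }
    end
  end
  # Most frequent n-grams first, as [n-gram, count] pairs
  counts.sort { |one, other| other[1] <=> one[1] }
end

pairs = ngram_frequencies("ab ab")
p pairs.first # => ["_", 4]
```

Pairs with equal counts come back in no guaranteed order, because `sort` is not stable.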
Private Instance Methods
calculate_letter_gram(parameters)
Stores the mono-gram, bi-gram and tri-gram in the hash
@param hash parameters The necessary parameters:
- letter_position The position of the letter to be processed
- word The word being processed
- length The number of characters remaining in the word from letter_position
# File lib/ai/nlp/n_gram/hasher.rb, line 63
def calculate_letter_gram(parameters)
  (1..3).each do |nth|
    letters = parameters[:word][parameters[:letter_position], nth]
    next unless letters
    init_key(letters)
    @hash[letters] += 1 if parameters[:length] > (nth - 1)
  end
end
calculate_word_gram(word)
Enriches the hash with the n-grams of a word
@param string word The word to process
# File lib/ai/nlp/n_gram/hasher.rb, line 40
def calculate_word_gram(word)
  length = word.size
  (0..length).each do |letter_position|
    parameters = { letter_position: letter_position, word: word, length: length }
    calculate_letter_gram(parameters)
    length -= 1
  end
end
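To see which n-grams the two methods above actually count, the sketch below walks a padded word in the same way and collects only the slices that pass the `length > nth - 1` guard. The helper `accepted_grams` is hypothetical and exists only for this illustration.

```ruby
# String#[] with (start, length) truncates at the end of the string and
# returns "" when start equals the string's size.
word = "_ab_"
p word[2, 3] # => "b_"
p word[4, 1] # => ""

# Hypothetical helper: lists the slices that would be counted for a word.
def accepted_grams(word)
  remaining = word.size
  (0..word.size).flat_map do |position|
    # A slice of size n is kept only if n characters remain from here
    grams = (1..3).select { |n| remaining > n - 1 }
                  .map { |n| word[position, n] }
    remaining -= 1
    grams
  end
end

p accepted_grams("_ab_")
# => ["_", "_a", "_ab", "a", "ab", "ab_", "b", "b_", "_"]
```

The guard is what keeps truncated slices near the end of the word from being counted a second time as shorter n-grams.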
clean()
Cleans the input string and collapses runs of whitespace into single spaces
# File lib/ai/nlp/n_gram/hasher.rb, line 81
def clean
  safe_clean
  specific_clean
  clean_latin
  @input = @input.strip.split(" ").join(" ")
end
clean_latin()
Removes Latin letters from the string when non-Latin letters number at least half as many as the Latin letters (a-z). Strings made up of only one of the two kinds are left unchanged.
# File lib/ai/nlp/n_gram/hasher.rb, line 90
def clean_latin
  latin = @input.scan(/[a-z]/)
  nonlatin = @input.scan(/[\p{L}&&[^a-z]]/)
  nonlatin_ratio = nonlatin.size / (latin.size * 1.0)
  return if nonlatin_ratio < 0.5
  @input.gsub!(/[a-zA-Z]/, "") if !latin.empty? && !nonlatin.empty?
end
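The threshold is easier to judge on concrete strings. The sketch below applies the same ratio test as a pure function; the name `strip_latin_if_minority` is hypothetical, and unlike the method above it returns a new string instead of mutating `@input`. It assumes lowercase input, as produced by `safe_clean`.

```ruby
# Hypothetical, non-mutating version of the ratio test in clean_latin.
def strip_latin_if_minority(text)
  latin = text.scan(/[a-z]/)
  nonlatin = text.scan(/[\p{L}&&[^a-z]]/)
  # Nothing to do when only one kind of letter is present
  return text if latin.empty? || nonlatin.empty?

  ratio = nonlatin.size.to_f / latin.size
  ratio < 0.5 ? text : text.gsub(/[a-zA-Z]/, "")
end

p strip_latin_if_minority("привет abc")    # ratio 6/3, Latin letters removed
p strip_latin_if_minority("hello мир")     # ratio 3/5, Latin letters removed
p strip_latin_if_minority("hello world я") # ratio 1/10, left unchanged
```

The second call shows that Latin letters are dropped even when they are the majority, as long as the ratio reaches 0.5.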
drop_unwanted_keys()
Deletes a key if it is empty. The code tests the key's size, not its value.
# File lib/ai/nlp/n_gram/hasher.rb, line 51
def drop_unwanted_keys
  @hash.each_key do |key|
    @hash.delete(key) if key.size <= 0
  end
end
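A small sketch of the filter on sample data, using `Hash#reject` so the original hash is left untouched. The sample counts are made up for illustration; the empty key with a zero count stands for the slice taken at the very end of a word.

```ruby
# Made-up sample counts; "" is the zero-count key left by the last slice.
counts = { "_" => 2, "a" => 1, "" => 0 }

# Filter on the key's size, as the method above does
by_key = counts.reject { |key, _count| key.size <= 0 }

# Filter on the count instead, for comparison
by_value = counts.reject { |_key, count| count <= 0 }

p by_key # => {"_"=>2, "a"=>1}
```

On this sample both filters keep the same entries.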
init_key(letters)
Initializes the key to zero if it is not already set
@param string letters The group of letters
# File lib/ai/nlp/n_gram/hasher.rb, line 75
def init_key(letters)
  @hash[letters] ||= 0
end
safe_clean()
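Cleans the input with existing libraries: strips HTML with Sanitize, unescapes HTML entities with CGI, and lowercases with Unicode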
# File lib/ai/nlp/n_gram/hasher.rb, line 111
def safe_clean
  @input = Sanitize.clean(@input)
  @input = CGI.unescapeHTML(@input)
  @input = Unicode.downcase(@input)
end
specific_clean()
Removes web addresses and email addresses, and replaces polluting characters (punctuation and digits) with spaces
# File lib/ai/nlp/n_gram/hasher.rb, line 100
def specific_clean
  uri_regex = %r/(?:http|https):\/\/[a-z0-9]+(?:[\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(?:(?::[0-9]{1,5})?\/[^\s]*)?/
  @input.gsub!(uri_regex, "")
  # Remove mails
  @input.gsub!(/[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}/, "")
  # Replace polluting non-alphabetical characters, punctuation included, with a space
  @input.gsub!(%r/[\*\^><!\"#\$%&\'\(\)\*\+:;,\._\/=\?@\{\}\[\]|\-\n\r0-9]/, " ")
end
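The effect of the three substitutions can be checked on a sample string. The sketch below copies the patterns from the listing above into constants and applies them in the same order, followed by the whitespace normalization that `clean` performs. The constant names and the `strip_noise` helper are hypothetical, and the input is assumed to be lowercase already.

```ruby
# Patterns copied from specific_clean; the constant names are hypothetical.
URI_PATTERN = %r/(?:http|https):\/\/[a-z0-9]+(?:[\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(?:(?::[0-9]{1,5})?\/[^\s]*)?/
MAIL_PATTERN = /[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}/
NOISE_PATTERN = %r/[\*\^><!\"#\$%&\'\(\)\*\+:;,\._\/=\?@\{\}\[\]|\-\n\r0-9]/

# Hypothetical, non-mutating helper applying the same steps in order.
def strip_noise(text)
  cleaned = text.gsub(URI_PATTERN, "")       # drop web addresses
                .gsub(MAIL_PATTERN, "")      # drop email addresses
                .gsub(NOISE_PATTERN, " ")    # punctuation and digits to spaces
  cleaned.strip.split(" ").join(" ")         # collapse whitespace, as in clean
end

sample = "visit https://example.com/path?x=1 or mail bob@example.org, thanks!"
p strip_noise(sample) # => "visit or mail thanks"
```

The order matters: addresses have to be removed before the punctuation pass, otherwise `.`, `/` and `@` would already be spaces and the first two patterns would no longer match.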