class Opener::LanguageIdentifier::Backend::LanguageDetection

Constants

DEFAULT_PRIORITY

The default priority for non-OpeNER languages.

@return [Float]

DEFAULT_PROFILES_PATH

Path to the directory containing the default profiles.

@return [String]

DEFAULT_SHORT_PROFILES_PATH

Path to the directory containing the default short profiles.

@return [String]

PRIORITIES

Prioritizes OpeNER languages over the rest. Languages not covered by this hash are automatically given the default priority (DEFAULT_PRIORITY).

@return [Hash]

SHORT_THRESHOLD

The number of characters above which the detector switches to the longer profile set. Inputs of this length or shorter use the short profiles.

@return [Fixnum]

Public Class Methods

new()
# File lib/opener/language_identifier/backend/language_detection.rb, line 62
def initialize
  @factory = com.cybozu.labs.langdetect.DetectorFactory.new
end

Public Instance Methods

detect(input)

@return [String]

# File lib/opener/language_identifier/backend/language_detection.rb, line 81
def detect input
  detector = new_detector input
  detector.detect

# The core Java code raises an exception when it can't detect a language.
# Since this isn't fatal, we capture the exception and return "unknown"
# instead.
rescue com.cybozu.labs.langdetect.LangDetectException
  return 'unknown'
end
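
The fallback behaviour of detect can be illustrated without JRuby. The following is only a minimal sketch: FakeDetector and FakeDetectError are hypothetical stand-ins for the Java detector and for com.cybozu.labs.langdetect.LangDetectException, and the hard-coded 'en' result is an assumption made purely for illustration.

```ruby
# Hypothetical stand-in for com.cybozu.labs.langdetect.LangDetectException,
# so that this sketch runs on plain Ruby.
class FakeDetectError < StandardError; end

# Hypothetical detector stub: raises when the text has nothing to analyse,
# otherwise pretends the text is English.
class FakeDetector
  def initialize(text)
    @text = text
  end

  def detect
    raise FakeDetectError, 'no features in text' if @text.strip.empty?

    'en'
  end
end

# Same shape as #detect: a method-level rescue turns a detection failure
# into the "unknown" marker instead of letting the exception propagate.
def detect_or_unknown(input)
  FakeDetector.new(input).detect
rescue FakeDetectError
  'unknown'
end
```

With this stub, detect_or_unknown('hello world') returns 'en', while detect_or_unknown('   ') returns 'unknown' because the stub raises and the rescue clause handles it.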

new_detector(input)
# File lib/opener/language_identifier/backend/language_detection.rb, line 66
def new_detector input
  @factory.load_profile determine_profiles input
  @factory.set_seed 1

  priorities = build_priorities input, @factory.langlist
  detector   = com.cybozu.labs.langdetect.Detector.new @factory

  detector.set_prior_map priorities
  detector.append input.downcase
  detector
end

Protected Instance Methods

build_priorities(input, languages)

Builds a java.util.HashMap mapping each OpeNER and non-OpeNER language to its priority.

If the input is no longer than the short profiles threshold (SHORT_THRESHOLD), non-OpeNER languages are given a priority of 0.0, which effectively disables them. This ensures that OpeNER languages are detected properly when analysing only 1-2 words.

@param [String] input
@param [Array<String>] languages
@return [java.util.HashMap]

# File lib/opener/language_identifier/backend/language_detection.rb, line 106
def build_priorities input, languages
  priorities = java.util.HashMap.new
  priority   = if short_input? input then 0.0 else DEFAULT_PRIORITY end

  PRIORITIES.each do |lang, val|
    priorities.put(lang, val)
  end

  languages.each do |language|
    unless priorities.contains_key(language)
      priorities.put(language, priority)
    end
  end

  priorities
end
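
The priority rules described for build_priorities can be sketched in plain Ruby with an ordinary Hash in place of java.util.HashMap. The module name PrioritySketch and all constant values below are assumptions chosen for illustration; the real values of PRIORITIES, DEFAULT_PRIORITY and SHORT_THRESHOLD are not shown in this documentation.

```ruby
# Plain-Ruby sketch of the priority rules. All values are hypothetical.
module PrioritySketch
  DEFAULT_PRIORITY = 0.5                           # assumed value
  SHORT_THRESHOLD  = 15                            # assumed value
  PRIORITIES       = { 'en' => 1.0, 'nl' => 0.9 }.freeze # assumed entries

  module_function

  def build(input, languages)
    # Short inputs disable every language that has no explicit priority.
    fallback   = input.length <= SHORT_THRESHOLD ? 0.0 : DEFAULT_PRIORITY
    priorities = PRIORITIES.dup

    # Explicit priorities win; every other language gets the fallback.
    languages.each do |language|
      priorities[language] = fallback unless priorities.key?(language)
    end

    priorities
  end
end
```

Under these assumed values, PrioritySketch.build('hi', %w[en nl ja]) maps 'ja' to 0.0, whereas an input of 40 characters maps 'ja' to 0.5. The explicitly listed languages keep their priorities in both cases.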

determine_profiles(input)

@param [String] input
@return [String]

# File lib/opener/language_identifier/backend/language_detection.rb, line 127
def determine_profiles input
  if short_input? input then DEFAULT_SHORT_PROFILES_PATH else DEFAULT_PROFILES_PATH end
end

short_input?(input)

@param [String] input
@return [TrueClass|FalseClass]

# File lib/opener/language_identifier/backend/language_detection.rb, line 135
def short_input? input
  input.length <= SHORT_THRESHOLD
end
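
Because short_input? uses <=, an input whose length equals the threshold still counts as short. The sketch below shows that boundary together with the profile selection it drives. The module name ProfileSketch, the threshold of 15 and both paths are hypothetical; the real constant values are not shown in this documentation.

```ruby
# Plain-Ruby sketch of the threshold check and profile selection.
module ProfileSketch
  SHORT_THRESHOLD     = 15               # assumed value
  SHORT_PROFILES_PATH = 'profiles/short' # hypothetical path
  LONG_PROFILES_PATH  = 'profiles/long'  # hypothetical path

  module_function

  # An input exactly at the threshold is still treated as short.
  def short_input?(input)
    input.length <= SHORT_THRESHOLD
  end

  # Picks the profile directory based on the input length.
  def determine_profiles(input)
    short_input?(input) ? SHORT_PROFILES_PATH : LONG_PROFILES_PATH
  end
end
```

With the assumed threshold of 15, a 15-character input selects the short profiles and a 16-character input selects the long ones.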