class Elasticrawl::JobStep

Represents an Elastic MapReduce job flow step. For a parse job this will process a single Common Crawl segment. For a combine job a single step will aggregate the results of multiple parse jobs.

Public Instance Methods

job_flow_step(job_config) click to toggle source

Returns a custom jar step that is configured with the jar location, class name and input and output paths.

For parse jobs optionally specifies the maximum # of Common Crawl data files to process before the job exits.

# File lib/elasticrawl/job_step.rb, line 14
def job_flow_step(job_config)
  jar = job_config['jar']
  max_files = self.job.max_files

  step_args = []
  step_args[0] = job_config['class']
  step_args[1] = self.input_paths
  step_args[2] = self.output_path
  # All arguments must be strings.
  step_args[3] = max_files.to_s if max_files.present?

  step = Elasticity::CustomJarStep.new(jar)
  step.name = set_step_name
  step.arguments = step_args

  step
end

Private Instance Methods

set_step_name() click to toggle source

Sets the Elastic MapReduce job flow step name based on the type of job it belongs to.

# File lib/elasticrawl/job_step.rb, line 35
def set_step_name
  case self.job.type
    when 'Elasticrawl::ParseJob'
      if self.crawl_segment.present?
        max_files = self.job.max_files || 'all'
        "#{self.crawl_segment.segment_desc} Parsing: #{max_files}"
      end
    when 'Elasticrawl::CombineJob'
      paths = self.input_paths.split(',')
      "Combining #{paths.count} jobs"
  end
end