class Elasticrawl::CrawlSegment
Represents a segment of a web crawl released by the Common Crawl
Foundation. Each segment contains archive, metadata and text files.
Public Class Methods
create_segment(crawl, segment_name, file_count)
click to toggle source
Creates a crawl segment based on its S3 path if it does not exist.
# File lib/elasticrawl/crawl_segment.rb, line 14 def self.create_segment(crawl, segment_name, file_count) s3_uri = build_s3_uri(crawl.crawl_name, segment_name) segment = CrawlSegment.where(:crawl_id => crawl.id, :segment_name => segment_name, :segment_s3_uri => s3_uri, :file_count => file_count).first_or_create end
Private Class Methods
build_s3_uri(crawl_name, segment_name)
click to toggle source
Generates the S3 location where this segment is stored.
# File lib/elasticrawl/crawl_segment.rb, line 25 def self.build_s3_uri(crawl_name, segment_name) s3_path = ['', Elasticrawl::COMMON_CRAWL_PATH, crawl_name, Elasticrawl::SEGMENTS_PATH, segment_name, ''] URI::Generic.build(:scheme => 's3', :host => Elasticrawl::COMMON_CRAWL_BUCKET, :path => s3_path.join('/')).to_s end
Public Instance Methods
segment_desc()
click to toggle source
Description shows name and number of files in the segment.
# File lib/elasticrawl/crawl_segment.rb, line 9 def segment_desc "Segment: #{segment_name} Files: #{file_count}" end