class UkParliament::HttpHouseMembers

Class to load house member data from the web.

Public Class Methods

new(house_id)

Initialise our parent class and set about scraping data from the web.

Calls superclass method UkParliament::HouseMembers::new
# File lib/uk_parliament/http_house_members.rb, line 9
def initialize(house_id)
  super

  @q_manager = QueueManager.new(house_id)

  retrieve_members_list
  assemble_members_data
end

Private Instance Methods

assemble_members_data()

Trigger scraping of more detailed house member information and save the results to file.

# File lib/uk_parliament/http_house_members.rb, line 63
def assemble_members_data
  @q_manager.enqueue(@members)

  process_members_list { |member|
    scrape_member_summary(member)
  }

  save_file

  if @q_manager.error_queue_size > 0
    log.info("#{@q_manager.error_queue.length} entries in the error queue to reprocess")
  end
end
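
process_members_list() { |member| ... }

Process the house members list to retrieve more information about each member. The work is split across multiple threads to reduce the time taken; the thread count and the delay between requests come from the :scrape_no_of_threads and :scrape_request_delay config values.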

# File lib/uk_parliament/http_house_members.rb, line 79
def process_members_list
  threads = []

  @config[:scrape_no_of_threads].times do
    threads << Thread.new do
      until @q_manager.main_queue.empty?
        id = @q_manager.main_queue.pop

        if id
          member = @members.find { |item|
            item['id'] == id.to_i
          }

          yield member

          sleep(@config[:scrape_request_delay])
        end
      end
    end
  end

  threads.each { |t| t.join }
end
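
The worker loop above can be sketched in isolation. The following is a minimal sketch, not the library's implementation: `drain_queue` is a hypothetical helper, and it assumes the queue behaves like Ruby's core `Queue`. Under that assumption, a thread could pass an `empty?` check and then block in `pop` if another thread takes the last item first, so the sketch uses a non-blocking `pop(true)` and treats `ThreadError` as the signal to stop.

```ruby
# Hypothetical helper, not part of UkParliament: drains `ids` across
# `thread_count` worker threads and yields each id to the caller's block.
def drain_queue(ids, thread_count:, delay: 0)
  queue = Queue.new
  ids.each { |id| queue << id }

  workers = Array.new(thread_count) do
    Thread.new do
      loop do
        begin
          # Non-blocking pop: raises ThreadError once the queue is empty.
          id = queue.pop(true)
        rescue ThreadError
          break
        end

        yield id
        sleep(delay) if delay > 0
      end
    end
  end

  # Wait for every worker to finish before returning.
  workers.each(&:join)
end

# Usage: collect results in a thread-safe Queue, then sort for a stable order.
results = Queue.new
drain_queue(%w[1 2 3 4 5], thread_count: 2) { |id| results << id.to_i }
processed = Array.new(results.size) { results.pop }.sort
```

Whether the original loop can actually stall depends on what `@q_manager.main_queue` returns, which is not shown here.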

retrieve_members_list()

Gets the list of house members. Depending on the circumstances, we either load the list from an existing file or go to the parliament.uk site and scrape the list from there.

When the file is loaded, the reprocessed error entries are merged into the existing file data and saved. This behaviour continues until there are no more errors to process.

# File lib/uk_parliament/http_house_members.rb, line 27
def retrieve_members_list
  if @q_manager.scrape_errors?
    load_file
  else
    scrape_members_list
  end
end
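
The merge described above is not shown in this class. As an illustration only, one way to fold reprocessed entries back into an existing list is to match on the 'id' key. `merge_reprocessed` and the sample data are hypothetical, and the library may instead update the member hashes in place.

```ruby
# Hypothetical helper, not part of UkParliament: returns a copy of `existing`
# in which any entry with a matching 'id' in `reprocessed` is replaced.
def merge_reprocessed(existing, reprocessed)
  by_id = reprocessed.each_with_object({}) { |member, index| index[member['id']] = member }
  existing.map { |member| by_id.fetch(member['id'], member) }
end

# Made-up sample data for illustration.
existing = [
  { 'id' => 1, 'alphabetical_name' => 'Example, A' },
  { 'id' => 2, 'alphabetical_name' => 'Example, B' }
]
reprocessed = [
  { 'id' => 2, 'alphabetical_name' => 'Example, B', 'timestamp' => '2020-01-01T00:00:00+00:00' }
]

merged = merge_reprocessed(existing, reprocessed)
```

Entries without a reprocessed counterpart pass through unchanged, so the list keeps its original order and length.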
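
scrape_member_summary(member)

Scrape more detailed information about a house member from that member's own page. On success the member is stamped with the current time; on failure the member's ID is pushed onto the error queue for reprocessing.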

# File lib/uk_parliament/http_house_members.rb, line 48
def scrape_member_summary(member)
  log.info("Fetching (#{member['id']}) #{member['alphabetical_name']}")

  document = Nokogiri::HTML(open(member['url']))
  pipeline = MemberSummaryDocPipeline.new(@house_id, document)
  pipeline.enrich_member_data(member)

  member['timestamp'] = Time.now.strftime('%FT%T%:z')
rescue => e
  log.info("Error processing '#{@house_id}' member ID #{member['id'].to_s}, URL #{member['url']}, Exception #{e.message}")
  @q_manager.error_queue.push(member['id'].to_s)
end
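
The error handling and timestamping above can be tried without any network access. This is a minimal sketch under stated assumptions: `enrich_with_error_capture` is a hypothetical stand-in for the method, the block replaces the fetch-and-parse step, and a core `Queue` stands in for the error queue.

```ruby
# Hypothetical stand-in for scrape_member_summary: the block plays the role of
# the fetch-and-enrich step, and `error_queue` collects the IDs that failed.
def enrich_with_error_capture(member, error_queue)
  yield member

  # '%FT%T%:z' produces an ISO 8601 style timestamp with a colon in the offset.
  member['timestamp'] = Time.now.strftime('%FT%T%:z')
rescue StandardError
  # Record the ID as a string so the member can be retried later.
  error_queue.push(member['id'].to_s)
end

errors = Queue.new
ok  = { 'id' => 7 }
bad = { 'id' => 8 }

enrich_with_error_capture(ok, errors)  { |m| m['party'] = 'Example' }
enrich_with_error_capture(bad, errors) { |_m| raise IOError, 'fetch failed' }
```

After these calls, `ok` carries a 'timestamp' key, `bad` does not, and `errors` holds the single ID that failed.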

scrape_members_list()

Scrape a particular house's membership list from its list page.

# File lib/uk_parliament/http_house_members.rb, line 36
def scrape_members_list
  url = (@house_id == Lords::HOUSE_ID) ? Lords::MEMBER_LIST_URL : Commons::MEMBER_LIST_URL
  log.info("Fetching '#{@house_id}' member list from #{url}")

  document = Nokogiri::HTML(open(url))
  pipeline = MemberListDocPipeline.new(@house_id, document)
  pipeline.house_member_list(@members)
rescue => e
  log.info("Error retrieving '#{@house_id}' member list, URL #{url}, Exception #{e.message}")
end