class CsvImportAnalyzer::CsvSanitizer

Public Instance Methods

process(filename, options) click to toggle source

Public interface for the entire library What id does?

Sets "options" varaible by merging default values and passed values
Finds the delimiter by analyzing a sample
Sanitizes or preprocesses the csv file by creating a temporary processed file
         Replacing null and empty values with NULL
         Replace single quotes with double quotes if needed
Handle CSVMalformedError by logging the error to error report

Passes the options to DatatypeAnalysis

# File lib/csv-import-analyzer/csv_sanitizer.rb, line 25
def process(filename, options)
  options = defaults.merge(options)
  if File.exist?(filename)
    delimiter = identify_delimiter(filename)
    options[:delimiter] = delimiter
    # create tempfiles to update any changes being made
    temp_file, processed_file = create_tempfiles(filename, options)
    options[:temp_file] = temp_file.path
    line_count = 1
    File.foreach(filename) do |line|
      if line.length > 1 #Check if the line is empty - no point in processing empty lines
        line = replace_line_single_quotes(line,delimiter)
        begin
          line = CSV.parse_line(line, {:col_sep => delimiter})
        rescue CSV::MalformedCSVError
          # MalformedCSVError is due to illegal quoting or unclosed quotes
          # Try to add a quote at the end and resume processing
          # Log the changes to report
          temp_file.write("MalformedCSVError at line #{line_count}")
          line = line.insert(-2, "\"")
          line = CSV.parse_line(line, {:col_sep => delimiter})
        end
        line = replace_null_values(line)
        processed_file.write(line.to_csv({:col_sep => delimiter, :converters => :numeric}))
      end
      line_count += 1
    end
    options[:rows] = line_count
    temp_file.close
    processed_file.close
    # Cleaned the file - Now analyze for datatypes
    CsvImportAnalyzer::CsvDatatypeAnalysis.new(options).datatype_analysis
  else
    FileNotFound.new
  end
end

Private Instance Methods

create_tempfiles(filename, options) click to toggle source

Uses ruby tempfile to create temp files for

1. Store processed file
2. Error reporting

Returns the file handler for a temp file. This tempfile holds any modifications being done to the file.

# File lib/csv-import-analyzer/csv_sanitizer.rb, line 122
def create_tempfiles(filename, options)
  options[:original_filename] = filename
  filename = File.basename(filename)
  processed_filename = File.join(Dir.tmpdir, "processed_"+filename)
  options[:filename] = processed_filename
  # filename += Time.now.strftime("%Y%m%d%H%M%S")
  # temp_file = Tempfile.new(filename)
  # temp_file = File.open(File.join(Dir.tmpdir, filename), "w+")
  temp_file = File.join(Dir.tmpdir, "error_report_"+filename)
  temp_file = File.open(temp_file, "w+")
  processed_file = File.open(processed_filename, "w+")
  return temp_file, processed_file
end
defaults() click to toggle source

Hash of default values that would be merged with user passed in values returns [Hash] defaults

# File lib/csv-import-analyzer/csv_sanitizer.rb, line 68
def defaults
  {
    :metadata_output => nil,      # To be set if metadata needs to be printed to a file
    :processed_input => nil,      # To be set if processed input is needed
    :unique => 5,                 # Threshold for number of defaults values that needs to identified
    :check_bounds => true,        # Option to check for min - max bounds for each column [true => find the bounds]
    :datatype_analysis => 200,    # Number of rows to be sampled for datatype analysis
    :chunk => 200,                # Chunk size (no of rows) that needs to processed in-memory [Important not to load entire file into memory]
    :database => [:mysql],        # Databases for which schema needs to be generated
    :quote_convert => true,       # Convert single quotes to double quotes
    :replace_nulls => true,       # Replace nulls, empty's, nils, Null's with NULL
    :out_format => :json          # Set what type of output do you need as analysis
  }
end
replace_line_single_quotes(line, delimiter) click to toggle source

Replaces single quotes with doubles in each line Escapes the double quotes if it’s between two single quotes before returns [String] result

# File lib/csv-import-analyzer/csv_sanitizer.rb, line 88
def replace_line_single_quotes(line, delimiter)
  delimiter = "\\|" if delimiter == "|"
  pattern = "#{delimiter}'.*?'#{delimiter}" # set the pattern to opening and closing single quote found between delimiters
  res = line.gsub(/#{pattern}/)
  result = res.each { |match|
    replace = "#{delimiter}\""
    replace = "\|\"" if delimiter == "\\|"
    match = match.gsub(/^#{delimiter}'/,replace)
    replace = "\"#{delimiter}"
    replace = "\"\|" if delimiter == "\\|"
    match = match.gsub(/'#{delimiter}$/,replace)
  }
  result = result.gsub(/''/,'\'') #replace any single quote that might have been used twice to escape single quote before
  return result
end
replace_null_values(line) click to toggle source

Replace all nil, “NAN”, empty values with NULL for maintaining consistency during database import

# File lib/csv-import-analyzer/csv_sanitizer.rb, line 105
def replace_null_values(line)
  line.each do |value|
    if null_like?(value)
      replace_index = line.index(value)
      line[replace_index] = "NULL"
    end
  end
  return line
end