class CsvImportAnalyzer::CsvDatatypeAnalysis

Attributes

csv_column_datatypes[RW]
nullable[RW]

Public Class Methods

new(options)
# File lib/csv-import-analyzer/csv_datatype_analysis.rb, line 15
def initialize(options)
  @options = options
  @csv_column_datatypes = {}
  @nullable = []
end

Public Instance Methods

datatype_analysis()

Processes the first chunk of the CSV file and counts the possible datatypes of the values in each column. The result of this datatype analysis is later used to determine:

Minimum and maximum values of each column
Distinct values of each column
Enumeration eligibility
# File lib/csv-import-analyzer/csv_datatype_analysis.rb, line 35
def datatype_analysis
  SmarterCSV.process(filename, {:col_sep => delimiter, :chunk_size => chunk_size, 
    :remove_empty_values => false, :remove_zero_values => false}) do |chunk|
    chunk.each do |row|
      row.each do |key, value|
        if null_like?(value)
          nullable.push(key) unless nullable.include?(key)
        else
          datatype = determine_dataype(value)
          add_to_datatype(key, datatype.to_sym)
        end
      end
    end
    break # stop after the first chunk; only this sample is analyzed
  end
  options[:csv_datatype_analysis] = csv_column_datatypes.clone # To retain the current state of csv_column_datatypes since it's altered further
  finalize_datatypes_for_csv
  options[:csv_column_datatypes] = csv_column_datatypes
  options[:nullable] = nullable
  take_further_actions
end
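
The pass described above can be sketched without SmarterCSV on a small in-memory chunk. This is a minimal illustration, not the gem's implementation: null_like? and guess_type below are hypothetical stand-ins for the helper methods, whose real rules are not shown on this page, and analyze_rows is a name invented for the example.

```ruby
# Hypothetical stand-ins for the gem's helpers (real rules not shown here).
def null_like?(value)
  text = value.to_s.strip
  text.empty? || text.casecmp("null").zero?
end

def guess_type(value)
  case value.to_s
  when /\A-?\d+\z/      then :int
  when /\A-?\d+\.\d+\z/ then :float
  else                       :string
  end
end

# Count candidate datatypes per column and collect columns that contain
# at least one null-like value.
def analyze_rows(rows)
  counts = Hash.new { |hash, column| hash[column] = Hash.new(0) }
  nullable = []
  rows.each do |row|
    row.each do |column, value|
      if null_like?(value)
        nullable << column unless nullable.include?(column)
      else
        counts[column][guess_type(value)] += 1
      end
    end
  end
  [counts, nullable]
end

rows = [
  { id: 1, price: "9.99", note: nil },
  { id: 2, price: "12",   note: "sale" }
]
counts, nullable = analyze_rows(rows)
```

With this sample, price is counted once as :float and once as :int, and note is the only column recorded as nullable.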
filename()
# File lib/csv-import-analyzer/csv_datatype_analysis.rb, line 25
def filename
  @options[:filename]
end
options()
# File lib/csv-import-analyzer/csv_datatype_analysis.rb, line 21
def options
  @options
end

Private Instance Methods

add_to_datatype(key, datatype)

Builds the hash of hashes that holds the count of each possible datatype for each column.

# File lib/csv-import-analyzer/csv_datatype_analysis.rb, line 79
def add_to_datatype(key, datatype)
  if csv_column_datatypes[key].nil?
    csv_column_datatypes[key] = {datatype => 1}
  else
    if csv_column_datatypes[key][datatype].nil?
      csv_column_datatypes[key][datatype] = 1
    else
      csv_column_datatypes[key][datatype] += 1
    end
  end
end
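
The same counting can be written more compactly with a default value on the inner hash. The sketch below is a standalone illustration; tally is a hypothetical name, and the hash is passed in explicitly instead of being read from the csv_column_datatypes attribute.

```ruby
# Standalone version of the counting step (hypothetical helper name).
# Hash.new(0) makes a missing datatype start at 0, so no nil checks are needed.
def tally(column_datatypes, key, datatype)
  column_datatypes[key] ||= Hash.new(0)
  column_datatypes[key][datatype] += 1
  column_datatypes
end

column_datatypes = {}
tally(column_datatypes, :age, :int)
tally(column_datatypes, :age, :int)
tally(column_datatypes, :age, :string)
tally(column_datatypes, :name, :string)
```

After these calls, :age maps to an :int count of 2 and a :string count of 1, and :name maps to a :string count of 1. One difference from the nil-checking version: reading a datatype that was never counted returns 0 instead of nil.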
chunk_size()
# File lib/csv-import-analyzer/csv_datatype_analysis.rb, line 63
def chunk_size
  return options[:chunk]
end
delimiter()
# File lib/csv-import-analyzer/csv_datatype_analysis.rb, line 59
def delimiter
  return options[:delimiter]
end
determine_dataype(value)

Calls DatatypeValidator in the helper module to determine the possible datatype of the value. Open design questions: Is this the right way to hide the dependency on external classes or objects? Would a static method do? Should a single object be created and reused instead of instantiating a new object on each call?

# File lib/csv-import-analyzer/csv_datatype_analysis.rb, line 72
def determine_dataype(value)
  return validate_field(value)
end
finalize_datatypes_for_csv()

Finalizes the datatype for each column. A column's datatype is set to string (varchar) if even one of its values is a string. If the column has no string values, it is assigned the datatype with the highest count among the identified possibilities.

# File lib/csv-import-analyzer/csv_datatype_analysis.rb, line 96
def finalize_datatypes_for_csv
  csv_column_datatypes.map { |column_name, possible_datatypes|
    # If at least one value is a string, there is no option but to set the datatype to string => varchar
    if possible_datatypes.has_key?(:string)
      csv_column_datatypes[column_name] = :string
    else
      # Set the datatype with the most occurrences as the datatype of the column
      csv_column_datatypes[column_name] = possible_datatypes.key(possible_datatypes.values.max)
    end
  }
end
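
The finalization rule can be checked on a small hand-built tally. This is an illustrative sketch: finalize_datatypes is a hypothetical standalone function that returns a new hash, whereas the method above overwrites csv_column_datatypes in place.

```ruby
# Collapse per-column datatype counts into a single datatype per column.
# Any :string count forces :string; otherwise the most frequent type wins.
def finalize_datatypes(column_datatypes)
  column_datatypes.each_with_object({}) do |(column, counts), final|
    final[column] =
      if counts.key?(:string)
        :string
      else
        counts.key(counts.values.max)
      end
  end
end

counts = {
  id:    { int: 10 },
  price: { float: 7, int: 3 },
  code:  { int: 9, string: 1 }
}
final = finalize_datatypes(counts)
```

Here id resolves to :int, price to :float, and code to :string, because a single string value outweighs nine integers.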
take_further_actions()

Decides whether the simple datatype analysis is enough or whether to proceed further. Proceeding further means:

Identifying the minimum and maximum bounds for each column
Identifying whether the number of distinct values is below the set threshold
# File lib/csv-import-analyzer/csv_datatype_analysis.rb, line 114
def take_further_actions
  if options[:check_bounds]
    min_max_bounds = CsvImportAnalyzer::CsvCheckBounds.new(options)
    res = min_max_bounds.get_min_max_values
    options[:min_max_bounds] = res[:min_max]
    options[:uniques] = res[:uniques]
  end
  query = CsvImportAnalyzer::SqlQueryBuilder.new(options)
  query.generate_query
end