-.- --. .-. --..

Cleanly preprocessing CSV data in Ruby

05 Aug 2016

Recently, I stumbled upon a nifty feature in Ruby’s CSV library that makes preprocessing very neat. The library’s usual entry point is CSV.parse, which takes an option called converters. This option accepts an array of anonymous functions that are used to transform each field as it is parsed. For example, consider the following CSV data:

age,group
 48,A
74 ,E
73,A
57,C
85,A
78 ,B
110 ,C

There are a couple of problems with our data. First, the age column holds numbers, but when we parse the file using CSV.parse, every field comes back as a string. Second, some of the fields have leading or trailing spaces, and those spaces are kept in the parsed strings. For example, this is what the second data row looks like with the following parsing code:

require "csv"

# file holds the CSV text shown above
rows = CSV.parse(file, headers: true)
rows[1]
# => 74 ,E
rows[1][0].class
# => String
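
To make the problem easier to see, here is a small self-contained sketch. It assumes a shortened copy of the sample data held in a string called data (a name picked only for this illustration) and prints the raw fields with p so the stray spaces are visible:

```ruby
require "csv"

# A shortened copy of the sample data; note the stray spaces around some ages.
data = "age,group\n 48,A\n74 ,E\n73,A\n"

# With headers: true, CSV.parse returns a CSV::Table of CSV::Row objects.
rows = CSV.parse(data, headers: true)

raw_fields = rows[1].fields
p raw_fields        # ["74 ", "E"]
p rows[0]["age"]    # " 48"
```

Without any converters, every value is a String and the whitespace is preserved exactly as it appears in the input.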

To clean this data up, we can use the converter feature:

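# Converters run on every field, so only the age column is converted here;
# calling to_i on a group such as "E" would return 0.
string_converter = lambda { |field, info| info.header == "age" ? field.strip.to_i : field }

rows = CSV.parse(file, headers: true, converters: [string_converter])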
rows[1]
# => 74,E
rows[1][0].class
# => Fixnum

Note that I’m using a Ruby version older than 2.4, so the class is Fixnum. Ruby 2.4 and newer show this as Integer instead.

The string_converter lambda strips the surrounding whitespace from a field and converts the result to an integer.
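
One thing to keep in mind is that a converter is called for every field in a row, not only for the age column. The sketch below is one way to see the difference. It assumes a shortened copy of the sample data in a string called data, and the names blanket and age_only are made up for this illustration. The second lambda argument is a CSV::FieldInfo struct, which exposes the index, line and header of the field being converted:

```ruby
require "csv"

# A shortened copy of the sample data.
data = "age,group\n 48,A\n74 ,E\n"

# Blanket converter: runs to_i on every field, including the group letters.
blanket = lambda { |field, _| field.strip.to_i }

# Column-aware converter: only fields under the "age" header are converted;
# everything else is returned untouched.
age_only = lambda do |field, info|
  info.header == "age" ? field.strip.to_i : field
end

blanket_row  = CSV.parse(data, headers: true, converters: [blanket])[1]
age_only_row = CSV.parse(data, headers: true, converters: [age_only])[1]

p blanket_row.fields    # [74, 0]
p age_only_row.fields   # [74, "E"]
```

Since "E".to_i is 0, the blanket version silently replaces the group letters with zeros, so checking the header (or the index) is the safer choice when only some columns are numeric.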

The CSV library also ships with a handful of built-in converters, which you can look up through the CSV::Converters constant. For example, to convert the first column to Float values, we can use the built-in :float converter like so:

rows = CSV.parse(file, headers: true, converters: [CSV::Converters[:float]])
rows[1]
# => 74.0,E
rows[1][0].class
# => Float
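
Built-in converters can also be referred to by name, and custom lambdas can sit in the same array next to them. The following is a minimal sketch of that combination, again assuming a shortened copy of the sample data in a string called data; the strip lambda is a hypothetical helper written for this example. Converters are tried in order, and the chain stops for a field as soon as one of them returns something other than a String:

```ruby
require "csv"

# A shortened copy of the sample data.
data = "age,group\n 48,A\n74 ,E\n110 ,C\n"

# Custom step: trim whitespace but keep the value a String, so the
# next converter in the list still gets a chance to run.
strip = lambda { |field| field.strip }

# :integer names the built-in converter stored in CSV::Converters[:integer].
# It leaves a field unchanged when the field is not a valid integer.
rows = CSV.parse(data, headers: true, converters: [strip, :integer])

ages   = rows["age"]
groups = rows["group"]
p ages     # [48, 74, 110]
p groups   # ["A", "E", "C"]
```

Here the group letters survive because the built-in :integer converter falls back to the original string when the conversion fails, so no header check is needed.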

This API is so much fun to use!