The Problem
I recently had a client who needed some data conversion done. They had contact records from a number of different sources that they needed to import into a new CRM system that they were deploying. They established the desired output format and provided the input files (i.e. files containing the contact information), along with some rules for how they wanted some of the fields to be handled. One of the big issues was identifying and eliminating duplicates. There were roughly 45,000 records among all of the input files.
In the past, I would have used awk or maybe Perl for most of the data manipulation. Both have excellent text processing capabilities. However, for the last year or so I’ve been doing this sort of work using Ruby. It’s been quite refreshing, as Ruby’s syntax is clean and clear (Perl is neither, in my opinion) and its standard library is powerful (an area where awk falls short). It’s also an object oriented programming language, something neither of the others is, which makes managing larger programs easier.
FasterCSV
In past projects done with Ruby, I’d usually been manipulating a single file with a relatively simple format. Most were tab delimited and contained only a few columns. It was easy enough to read in the file using Ruby’s standard File class and split each line on the delimiter. For the recent project, though, I was dealing with much larger files and much more complex data. In addition, I had to produce a tab delimited file containing around 75 columns and 31,000 records as the output.
After a bit of research, I found the FasterCSV gem. A “gem” is Ruby’s term for a packaged library installed through its RubyGems system, similar to PEAR for PHP or CPAN for Perl. I installed it and started playing around. I was pleasantly surprised not only by how easy it is to use (I shouldn’t have been, since it’s a Ruby library), but also by how powerful and featureful it is. In fact, when Ruby 1.9 was released, FasterCSV replaced the original CSV library in the standard library and was renamed to simply “CSV”.
FasterCSV enabled me to work with the data easily. By default it expects the delimiter to be a comma, but you can easily override that behavior and tell it to use something else. In the case of this project, some of my input files were comma delimited and some were tab delimited, so being able to change the behavior for each file was handy. It also allows one to access the fields by position (i.e. an integer index) or, if the first row contains column headers, by the column’s name. Accessing fields by position is fine when dealing with a small number of columns, but quickly becomes unwieldy when dealing with tens or even hundreds of columns. Also, using the name of the column makes verification of the mapping of input fields to output fields much easier.
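To make the delimiter and header options concrete, here is a minimal sketch. It assumes Ruby 1.9 or later and uses the standard library's CSV class (the renamed FasterCSV) rather than the gem itself. The sample data and column names are made up for illustration and are not the client's.

```ruby
require 'csv'

# Hypothetical sample data standing in for the real input files.
comma_data = "Name,Zipcode\nAlice,02134\n"
tab_data   = "Name\tZipcode\nBob\t90210\n"

# headers: true treats the first row as column names.
comma_rows = CSV.parse(comma_data, headers: true)

# col_sep overrides the default comma delimiter, here with a tab.
tab_rows = CSV.parse(tab_data, headers: true, col_sep: "\t")

first_row = comma_rows[0]
by_name  = first_row['Zipcode'] # access by column name
by_index = first_row[1]         # access by position (zero-based)

puts by_name
puts by_index
puts tab_rows[0]['Zipcode']
```

With 75 output columns, the by-name form is the one that stays readable, since the mapping from input field to output field is visible on each line.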
A host of other options are available, from specifying the file encoding (ANSI, UTF-8, etc.) to setting the row separator and quotation character. The documentation of the “new” method has a list of them. One of the more interesting features is the ability to create custom converters to handle conversion of data in a field or header. I didn’t take advantage of this feature for my project, but it looks quite useful.
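I didn’t use converters in this project, so the following is only a sketch of how they might look with the standard library's CSV class. The converter name strip_fields and the sample data are hypothetical. It passes one lambda for fields and another for headers, each trimming surrounding whitespace.

```ruby
require 'csv'

# Hypothetical field converter: trim whitespace, leaving nil fields alone.
strip_fields = ->(field) { field.nil? ? field : field.strip }

# Hypothetical header converter: trim whitespace from column names.
strip_headers = ->(header) { header.strip }

# Made-up input with stray spaces around headers and values.
data = "Name, Zipcode \n  Alice , 02134 \n"

rows = CSV.parse(
  data,
  headers: true,
  converters: [strip_fields],
  header_converters: [strip_headers]
)

name = rows[0]['Name']
zip  = rows[0]['Zipcode']
puts "#{name} #{zip}"
```

Doing the cleanup in a converter keeps it out of the per-field mapping code, at the cost of applying it to every column whether or not it is needed.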
OO Design
A number of the input files were variations on a theme; they had only small differences from each other in how the data had to be handled, while the formats were identical. Because Ruby is an object oriented language, I was able to create a base class to handle all of the common functionality, and then derive subclasses to handle only the few small differences. This pattern isn’t unique to Ruby, of course. Any OO language would have supported it. Unfortunately neither awk nor Perl, my usual suspects for this kind of work, falls into that category (yes, that includes Perl 5, protestations from Perl fan boys aside). Ruby’s particular brand of OO also supports method chaining, so I could do things like this:
output_row['Zip'] = row['Zipcode'].strip.rjust(5, '0')
In this example, “row['Zipcode']” returns an instance of String. We then call the strip method on that instance, which returns another String stripped of leading and trailing whitespace, against which we call the rjust method to left pad the string with zeros, up to five places. This ability to chain method calls can make the code short and clear.
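The base class and subclass arrangement described above might be sketched as follows. The class names, the “Last, First” name rule, and the column names are all hypothetical, chosen only to illustrate the pattern. The actual classes from the project are not shown here. Plain hashes stand in for CSV rows.

```ruby
# Hypothetical base class: the mapping logic shared by every input source.
class ContactConverter
  # Maps one input row to one output row.
  def convert(row)
    {
      'Name' => clean_name(row),
      'Zip'  => row['Zipcode'].strip.rjust(5, '0')
    }
  end

  private

  # Default rule: the name is already in "First Last" order.
  def clean_name(row)
    row['Name'].strip
  end
end

# Hypothetical subclass for a source that stores names as "Last, First".
# Only the differing rule is overridden; convert is inherited unchanged.
class LastFirstConverter < ContactConverter
  private

  def clean_name(row)
    last, first = row['Name'].split(',').map(&:strip)
    "#{first} #{last}"
  end
end

base_result = ContactConverter.new.convert('Name' => ' Alice Smith ', 'Zipcode' => ' 2134 ')
sub_result  = LastFirstConverter.new.convert('Name' => 'Smith, Alice', 'Zipcode' => '90210')
puts base_result.inspect
puts sub_result.inspect
```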
Issues
The main issue that I found was related to method chaining. As is frequently the case with delimited files, fields could sometimes be unexpectedly empty. When a field is empty, FasterCSV returns “nil” (Ruby’s version of null) instead of an empty string. Now nil is an object in Ruby (because everything is an object) and so does have methods. However, it does not have methods like strip and rjust, as used in the example above. If the Zipcode field for the current row was empty, the line above would raise an exception, halting the program. Given my background in languages like C and C++, the most obvious solution was to use the ternary operator like this:
output_row['Zip'] = row['Zipcode'].nil? ? nil : row['Zipcode'].strip.rjust(5, '0')
This strikes me as uglier than Ruby should be, however, so I asked about it in comp.lang.ruby. Interestingly, the most idiomatic way to handle it appears to be to patch the standard classes Object (the base class for all classes) and NilClass. The thread has the details, but the approach that I thought was the cleanest syntactically was to load Rails’ ActiveSupport gem and use the try method that Rails adds to Object and NilClass. If I were creating a non-Rails application to be distributed or installed on a production server, I don’t know that I’d want to create a dependency on Rails like that, but in this case it was an application to be used on a one-week project, so it wouldn’t be a maintenance issue.
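For comparison, here is a sketch of the same nil handling without any patching or Rails dependency. It swaps in the safe navigation operator (&.), which assumes Ruby 2.3 or later and so was not an option when I did this project. The helper name format_zip and the sample data are hypothetical. The ActiveSupport form appears only as a comment, since it needs the gem installed.

```ruby
require 'csv'

# An empty field in the input comes back as nil, not as an empty string.
row = CSV.parse("Name,Zipcode\nAlice,\n", headers: true)[0]
empty_zip = row['Zipcode']

# Hypothetical helper: &. skips the call and yields nil when the receiver is nil.
# It is applied at each step because it only guards the call it is attached to.
def format_zip(raw)
  raw&.strip&.rjust(5, '0')
end

# With ActiveSupport loaded, a similar chain could be written with try:
#   row['Zipcode'].try(:strip).try(:rjust, 5, '0')

padded  = format_zip(' 2134 ')
skipped = format_zip(empty_zip)
puts padded
puts skipped.inspect
```

Note that calling to_s on the field first would not be equivalent: nil would become an empty string and then be padded out to five zeros.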
Conclusion
I really liked using Ruby for this project. People claim that Ruby, more than most languages, gets out of your way and lets you solve the problem at hand. I certainly found that to be the case in this project. With the exception of the one issue mentioned above, I spent almost no time worrying about language issues or working around shortcomings. As a programmer, it was a very satisfying experience.