Thursday, June 4, 2009

Processing large files with Ruby and Rails

UPDATE: As "stefano" pointed out in the comments, the standard "gets" method does indeed accept a parameter. My fault for not checking the documentation!

Although we as web developers prefer working on neat features for our websites, sometimes we need to get down and dirty with data processing. I know, I don't like it any more than you do, but if you want to run a business sometimes you have to do stuff that isn't that fun. That's one of the reasons I like Ruby; the file library for reading and writing files makes data processing a lot simpler than similar tasks I've written in C++, Java, and other mainstream languages. Usually I just do something like this:



file = File.open("some_file")
#read all the contents of the file into "str" variable
str = file.read
file.close
...#do some processing...


Wow, that was easy! However, sometimes code like this just won't cut it. For example, if your file is too big to read into memory all at once and could cause performance issues for the server, you may want to process the file iteratively. Again, Ruby provides a pleasant way of handling the scenario. Each call to the "gets" method returns the next line of the file (or nil once you reach the end), so a simple loop hands you one line at a time, thus conserving your precious memory. See the example below:



file = File.open("some_file")
while (cur_line = file.gets)
  ...#do some processing...
end
file.close
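A variation on the loop above, offered as a minimal sketch rather than the way I actually wrote it: the block form of File.open closes the file for you, and each_line does the same one-line-at-a-time iteration as the gets loop. The helper name count_non_blank_lines and the sample data are made up for illustration.

```ruby
require "tempfile"

# Count the non-blank lines in a file without loading it all into memory.
# `count_non_blank_lines` is a hypothetical helper, used only for illustration.
def count_non_blank_lines(path)
  count = 0
  # The block form of File.open closes the file when the block exits,
  # even if an exception is raised inside it.
  File.open(path) do |file|
    file.each_line do |line|
      count += 1 unless line.strip.empty?
    end
  end
  count
end

# Usage with a throwaway file (Tempfile.create needs Ruby 2.1 or newer).
Tempfile.create("sample") do |tmp|
  tmp.write("first\n\nsecond\n")
  tmp.flush
  puts count_non_blank_lines(tmp.path)
end
```

The payoff of the block form is that you can't forget the file.close at the end.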


Also pretty easy. Today, however, I ran into a new problem. What happens when your file is too big to read into memory at one time, but all the data is on one single line? Don't believe that would ever happen? Check out EDI sometime and see what you think about it (On second thought, never check that out. Never ever look at EDI. I wouldn't want to make you cry). Sometimes even XML or HTML files are written all on one line when human readability isn't of any particular concern.

Well, I had never dealt with that situation before. I had this file I needed to process, roughly 30 MB, all on one line. Now, iterative processing would be okay, because each segment in the file was in the proper order for processing and the segments were delimited by a single character, but I couldn't find any built-in method on the File object that does what the "gets" method does with a delimiter other than newline. So I wrote one:


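class File
  # Yields each chunk of the file that ends with the single-character delimiter.
  def uber_gets(delimiter)
    segment = ""
    each_byte do |byte|
      char = byte.chr
      if char == delimiter
        yield segment
        segment = ""
      else
        # append in place instead of building a new string for every byte
        segment << char
      end
    end
    # don't drop trailing data that has no closing delimiter
    yield segment unless segment.empty?
  end
end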


With this modification, you can now do iterative processing in small pieces based on any single-character delimiter. In my case, using EDI files, each record is separated by a "~". So, I used the above method as follows:

file = File.open("some_file")
file.uber_gets("~") do |segment|
  ...#do some processing...
end
file.close

There you go. The whole file is on one line, but the code is still respecting memory consumption. If it helps you, enjoy. I'll post a link to the gist:

http://gist.github.com/123924
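Following up on the update at the top of this post: since the standard "gets" takes a separator argument, the same segment-by-segment loop can be written without patching File at all. This is only a sketch; the helper name each_segment and the sample data are made up for illustration, and StringIO stands in for a real file.

```ruby
require "stringio"

# Yield each delimiter-terminated segment from an IO, one at a time.
# `each_segment` is a hypothetical helper name, not something from the gist.
def each_segment(io, delimiter)
  # IO#gets(sep) reads up to and including the next `sep`,
  # and returns nil at end of file.
  while (segment = io.gets(delimiter))
    yield segment.chomp(delimiter) # strip the trailing delimiter, if present
  end
end

# StringIO stands in for a real file here; File objects respond to
# gets(separator) the same way.
data = StringIO.new("ISA*00~GS*PO~ST*850~")
each_segment(data, "~") { |segment| puts segment }
```

Unlike a byte-by-byte loop, this version also accepts a delimiter longer than one character, because the separator handling is done by "gets" itself.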

A new way to Blog

Most of my day I spend in front of a computer. I write code, I answer e-mails, and then when I get home, I blog. All that typing can be hard on the hands. I try to do most of the right things, ergonomically speaking. But I still end up with tendinitis. Being only 22 years old, this is obviously something I really want to avoid if I'd like to continue a career in the software business.

Enter dictation. This isn't the first time that I've played with the idea of speaking to my computer. Both Windows Vista, which I had installed on my old computer, and Mac OS X, which I have on all my new computers, have built-in speech recognition software. However, this is all mostly for command and control. The software helps you do things; you can open new windows, push menu buttons, click links, and do all sorts of other command-based tasks. But when it comes to actually writing, these solutions fall short. Today, though, I'm happy to say that I'm now past that point. Every word that you are reading on this page was put there by dictation software. MacSpeech Dictation is the program I'm using, and I have to admit I'm impressed. Everything I say seems to just end up on the screen without me having to use my hands or forearms.

There are some drawbacks to the software. For one, it's not cheap: $200 for the current release. Now admittedly, that's not the most expensive piece of software ever seen, but for regular consumers that price point seems a bit steep. On top of that, there is no trial version you can download. In fact, even when you buy it you can't download it. You have to have it shipped to you as if we were back in the 90s. So if you decide you want to buy MacSpeech Dictation, be aware that it's all or nothing.

For someone like me though, the advantage of being able to do my blog posts without my hands far outweighs the drawbacks inherent in MacSpeech's distribution system. I hope that as my experience with the package progresses, I'll be able to say that all my e-mails and all my blog posts are done without putting any unnecessary strain on my forearms. That way, I can save my limited typing capacity for what I enjoy most: code!

If this is something you'd like to try, check out the link below.

Mac Speech

Tuesday, May 26, 2009

Metric-Fu

The latest addition to my utility-belt of Ruby tools, Metric-Fu is giving me plenty of ideas on how to refactor my codebase.

Essentially, this library gives you the ability to take 8 of the most common code analyzing tools and run them all on your codebase at once, producing one consolidated report. I love it!

The full list can be found at the Metric_Fu page on rubyforge.org, but 2 of my favorites are listed here:

Roodi gives you some design help as it checks all kinds of common programming problems. Method have too many parameters? Cyclomatic complexity too high? Forget the else clause on a case statement? Roodi will give you the heads up you need.

Flay checks your code for duplicate constructs and segments. It found several pieces of duplicated logic that I had never noticed before; I was really impressed. It also does soft matches on "similar" code, things that could probably be combined and simplified if you're clever. Very nice.

The great thing about metric_fu is that you just run one command, and all the packages get run for you, giving you one page afterwards containing links to the results of whatever package you want. Check it out, you might be surprised at how much your fingers start itching to go back and fix all the problems you didn't even know you had.

Monday, May 25, 2009

Curse you, rake db:migrate!

Have you ever been in this situation?

"Hmm, this feature will require a big change to my database! I mean, we're going to have to touch every single record in that table."

If you have, you've probably followed up with this thought:

"Good thing I'm using Rails! They make it so easy!"

And then you went and wrote this migration:


def self.up
  Model.all.each do |m|
    #..some important function
    #performed on every object..
  end
end


And then you were really proud of how quickly that went, and you ran it on your development machine, and it worked really well. But THEN you pushed it to your staging or production server that has way more data than your dev machine, and you got this staring back at you from your command line:


** [out :: 123.123.123.100:8063] == YourCrazyMigration: migrating =========================================


And you stare at that for about 30 seconds before shouting:

"Mother F%^&er! I did it again! I can't believe I did it again! I built a stupid migration that uses the stupid 'all' method which is now dominating the memory on that box and I can either kill it and pick up the pieces or let it run for the next 3 hours as it pages the hell out of the hard disk!"

Well, since I did EXACTLY THAT just now, I decided that from now on we'll be using a new migration task at our development shop called "safety_migrate", which you're welcome to take advantage of if you'd like. It runs through every file in your migration directory, checking for the dreaded "all" method, and WILL NOT run the migrations unless every file is clean!

Check the gist:

http://gist.github.com/117625
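To give a feel for the idea without opening the gist, here is a rough sketch of the check in plain Ruby. It is not the gist's code: the method names migrations_using_all and check_migrations! are made up for illustration, and the match is a simple text search, so it can flag a comment or an "all?" call and it will miss other ways of loading every record.

```ruby
require "tmpdir"

# Return the migration files under `dir` that appear to call `.all`.
# Hypothetical helper: a plain text match, not a real Ruby parser.
def migrations_using_all(dir)
  Dir.glob(File.join(dir, "*.rb")).sort.select do |path|
    File.read(path) =~ /\.all\b/
  end
end

# Raise if any migration looks unsafe; otherwise return true so the
# caller can go ahead and run the migrations.
def check_migrations!(dir)
  offenders = migrations_using_all(dir)
  return true if offenders.empty?

  names = offenders.map { |path| File.basename(path) }.join(", ")
  raise "Refusing to migrate, found .all in: #{names}"
end

# Usage with a throwaway directory standing in for db/migrate.
Dir.mktmpdir do |dir|
  File.write(File.join(dir, "001_safe.rb"), "add_column :users, :age, :integer\n")
  File.write(File.join(dir, "002_risky.rb"), "Model.all.each { |m| m.save }\n")
  puts migrations_using_all(dir).map { |path| File.basename(path) }.inspect
end
```

Once a migration gets flagged, the usual fix is to walk the table in batches instead of loading it whole; Active Record has shipped find_each for exactly that since Rails 2.3.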

Happy Migrating!

Sunday, May 17, 2009

Maximum Impact for Minimal Effort

Today was a day for working outside, and I got a lot of mulch spread over the landscaping at my house.

I love mulch because it's really easy to use, really cheap to buy, and really makes a difference in the way your yard looks. You can't get much more bang for your buck as far as landscaping goes.

Believe it or not, I was thinking a lot about work as I was hauling chips of cedar over to my flower beds, and I started considering what sorts of things are "mulchy" in the web development world.

Color

Unfortunately, people aren't often impressed by performance or functionality nearly as much as they are by aesthetics. A white page full of blue links that all do really cool stuff is just a huge turn off. The difference that can be made with a simple header, a 3-column layout, and a pleasant color scheme is phenomenal.

Central Navigation

Sometimes you have websites that have spaghetti links all over the place. Trying to get back to the homepage usually means either clicking the back arrow 14 times, or re-typing in the domain address. A very common and successful approach is to have a set of navigational links as part of the header, and it's successful for a reason: people know where to go. No matter where they are, the critical places on the website can be accessed easily.

Intelligent defaults

If you have a dropdown that has a list of states, and 80% of your users are local to your state, go ahead and default the selection to your state. If you have 25 reports users can run on your website, and 3 of them are used more often than any others, put those three at the top. It doesn't take a lot of work to reposition things, but it makes a big difference.

There you go. Three easy things to do that will make a large impact on your users.