Thursday, February 4, 2010

Tricky Little find_in_batches (watch your :select clause)

If you are like me, you have a few background processes that deal with tons of data (reporting, etc). To run these processes, you may use ActiveRecord's "find_in_batches" method which allows you to only pull so many records into memory at a time (a good idea when processing large numbers of records). You may also explicitly use a "select" clause in your AR queries from time to time to only pull in the fields you need for a given purpose. Be wary, as I tripped up over something silly today and you should know about it.

You see, I had something like this:


class SomeModel < ActiveRecord::Base
  named_scope :only_essentials,
              :select=>"some_models.info,some_models.name"
end

class BackgroundProcess
  def run
    SomeModel.only_essentials.
              find_in_batches(:batch_size=>200) do |batch| 
      #....some processing of each record in the batch
    end
  end
end


The problem? It was running way too fast. Over in mere seconds. You might think, "Hey, that's not such a bad problem to have", until you realize that the reason it was running so fast is it was only processing the first 200 records (that is, the first batch).

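Why? It's simple if you know how "find_in_batches" works. It uses "order" and "limit" to order your models by primary key and limit the result set, and it fetches each following batch by asking for records whose primary key is greater than the id of the last record in the previous batch. What field did I not include in my named scope? "some_models.id" is correct. Without it, the last record's id comes back as nil, the next query matches nothing, and the loop quietly stops after the first batch. Add that field so your reference point is maintained, and everything runs as expected.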
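Here's a rough plain-Ruby simulation of that batching loop, just to make the failure visible. Nothing here is ActiveRecord: ROWS, fetch_batch, each_batch and count_processed are made-up stand-ins, and the nil check imitates SQL's "id > NULL" matching no rows.

```ruby
# Plain-Ruby simulation of batching by primary key; no ActiveRecord involved.
ROWS = (1..5).map { |i| { :id => i, :name => "row#{i}" } }

# Hypothetical stand-in for: SELECT columns WHERE id > last_id ORDER BY id LIMIT limit.
# A nil last_id imitates "id > NULL", which matches no rows.
def fetch_batch(columns, last_id, limit)
  return [] if last_id.nil?
  ROWS.select { |row| row[:id] > last_id }
      .sort_by { |row| row[:id] }
      .first(limit)
      .map { |row| row.select { |key, _| columns.include?(key) } }
end

# Yields batches, using the last id of each batch as the next starting point.
def each_batch(columns, limit)
  batch = fetch_batch(columns, 0, limit)
  until batch.empty?
    yield batch
    break if batch.size < limit
    batch = fetch_batch(columns, batch.last[:id], limit)
  end
end

def count_processed(columns, limit)
  processed = 0
  each_batch(columns, limit) { |batch| processed += batch.size }
  processed
end

WITH_ID = count_processed([:id, :name], 2)   # id selected: every row is visited
WITHOUT_ID = count_processed([:name], 2)     # id missing: stops after one batch
```

With the id selected the loop walks all five rows; without it, only the first batch of two is ever processed, which is the same shape as my "way too fast" run.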

Cheers,

~Ethan

Monday, February 1, 2010

Smart quotes and dumb errors

Wow, this one just sucked. :)

I'll make it quick. We have a model in our rails app that captures large strings and then displays an abbreviated version to the user later. For example:

class Narrative < ActiveRecord::Base
  validates_presence_of :text

  def text_in_brief
    return nil if text.nil?
    (text.size > 20) ? (text[0..19] + "...") : text
  end
end


Hopefully that made sense. We want an ellipsis to indicate that this value continues on for some time, so if it's longer than 20 characters, just cut it off and add the three periods.

Enter JSON. We sometimes want to send this value to JSON. And every now and then the to_json method blares this exception:

JSON::GeneratorError: source sequence is illegal/malformed

But only very rarely. What the hell is going on here?

Analyzing the data shows that all the strings that cause this explosion have one thing in common - a smart quote as the 19th or 20th character. I mean the one that is actually represented as "\342\200\235". See where this is going?

By splitting off the string at the byte level (which is how plain string indexing behaves on Ruby 1.8), it's possible to cut off the smart quote somewhere in the middle, because a smart quote is actually 3 bytes long in UTF-8. This leaves an invalid string, which bombs the to_json call. F@(#!
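To see the failure in isolation, here's a small sketch. It assumes Ruby 1.9 or later, where strings are indexed by character, so byteslice stands in for the old byte-based cut; the sample text is made up.

```ruby
# Assumes Ruby 1.9+ (String#byteslice and #valid_encoding? are available).
SMART_QUOTE = "\u201D"   # the bytes "\342\200\235" in UTF-8
TEXT = ("a" * 18) + SMART_QUOTE + " and the rest"

# Cutting by bytes keeps only 2 of the quote's 3 bytes.
BYTE_CUT = TEXT.byteslice(0, 20)

# Cutting by characters keeps the quote whole.
CHAR_CUT = TEXT[0, 20]

puts SMART_QUOTE.bytesize       # number of bytes in one smart quote
puts BYTE_CUT.valid_encoding?   # is the byte-cut string still valid UTF-8?
puts CHAR_CUT.valid_encoding?   # is the character-cut string still valid UTF-8?
```

The byte cut is the broken one. On a Rails app still running Ruby 1.8, ActiveSupport's mb_chars is one way to get character-aware slicing, though I haven't wired that into the model here.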

Quick and Dirty solution?


class Narrative < ActiveRecord::Base
  validates_presence_of :text

  before_save :escape_smart_quotes
  
  def escape_smart_quotes
    # Swap the 3-byte right smart quote for a plain double quote.
    self.text.gsub! "\342\200\235", '"'
  end  

  def text_in_brief
    return nil if text.nil?
    (text.size > 20) ? (text[0..19] + "...") : text
  end
end

Maybe when I've cooled off I'll do this better and extract it to be usable in other models. For now, just be glad I figured it out at all. :)

Wednesday, January 27, 2010

gem Read-through: slim_scrooge

Ok, new project. I believe it's dangerous to rely on code that you do not understand. As a rails-developer, I have tons of plugins and gems that I do not understand. See the problem?

To rectify this, I'm making it my goal to read through one of my main project's many dependencies each week. Two side benefits:

1) I will probably be better at writing my own open source libraries if I've seen a larger sample of how they're usually constructed.

2) code reading is good for you, but it's tough to find time to just sit down and crack open a library. This will give me a good reason.

So without further ado, today I'm doing a read-through of http://github.com/sdsykes/slim_scrooge, a great ActiveRecord optimizing library that has made a difference in the performance of my current main project. Don't expect anything linear here, I'm just going to record my notes and if you want to use them too you're welcome to them.

Slim Scrooge

The point of the slim scrooge library is to monitor your active record queries, and optimize them so that they only pull back the columns that you end up using in that section of code. Let's find out how it works:

NOTES

1) First thing I noticed. There is a test directory, but no tests. Problem? maybe....

2) Scratch my first note. It appears that SlimScrooge::ActiveRecordTest actually runs the ActiveRecord tests that are included with Rails. I guess this makes sense, as a regression test. Anything that filters activerecord should still pass the activerecord test suite. Still, this definitely means that the code itself is not under test. The gem could do nothing, and the tests would still go green. I'm not here to judge, though. I've written my own share of untested code.

3) First included file in the main library is a C extension called 'callsite_hash'. Looking in the /ext directory of the plugin. My "C" is a little rusty since I've been out of it for 3 years, but I think I get that it's defining the global ruby function "callsite_hash", and mapping it to the C function "rb_f_callsite" in this callsite_hash.c file. I don't know what it does yet, as the rb_f_callsite function is a little dense for my limited C skills, but maybe it will make more sense in context. So, moving on.

4) Next inclusion is SlimScrooge::SimpleSet (a subclass of Hash, /lib/slim_scrooge/simple_set.rb). This class stores a set of keys based on a submitted array, all mapped to the value "true". Because of the syntax, each time an element is added, it will only create a new entry if it's not already in the set. So basically it's a set of unique elements with some helper methods to keep operations restricted to only the keys (like a collect method that only runs over the keys array). Knowing what the gem does, at this point I'm guessing this is the structure that column names are stored in so you know which ones were used and which ones weren't after a query. We'll see.

5) Moving on to /lib/slim_scrooge/callsites.rb, which defines the class SlimScrooge::Callsites (no parent class). This class only has class-level methods, so I guess it's never instantiated. It has a class-level variable called @@callsites, which is a hash. Write access to the hash is synchronized through the use of a Mutex which is instantiated at the time of class definition as a class-level constant (SlimScrooge::Callsites::CallsitesMutex). Given that I don't know what's being stored here, I don't feel like I can accurately analyze it. Therefore, I'm jumping over to the top-level algorithm in /lib/slim_scrooge/slim_scrooge.rb

6) lib/slim_scrooge/slim_scrooge.rb definitely is the meat of the gem. SlimScrooge uses good old alias_method_chain to bring about "find_by_sql_with_slim_scrooge" (defined in the gem) and "find_by_sql_without_slim_scrooge" (the original "find_by_sql" method in ActiveRecord). This is how the gem inserts itself into every activerecord call. In the "find_by_sql_with_slim_scrooge", we see what's being done step by step:

A) if the sql passed in is an array (that is, a custom query directly from a programmer writing Model.find_by_sql("blah")), don't bother. Let it run like normal.
B) if this "callsite" has been seen before, try to optimize it.
C) if it hasn't been seen before, try to monitor it
D) otherwise, let it go (find_by_sql_without_slim_scrooge)

7) So what is a "callsite"? How do you know if you've been here before? Well, apparently that's what the C extension (callsite_hash.c) is for. The query is passed into this black-magic-extension which by some occult method creates a unique key for it (called a callsite_key). This is then stored in that class-level hash in the "Callsites" class.

8) There is logic written in here to pass the query through unoptimized if it is not "scroogable", and several conditions make a query unscroogable. For one, if there's any joining, it won't bother. The same goes if it's not a recognizable "select" query (that is, one that starts with SELECT, includes the expected table name, and has a "FROM" in it). [These were limitations I was unaware of before].

9) The monitoring of a query is done by attaching a MonitoredHash to each row in the first query. This hash maintains a reference to the callsite, and can be configured to not monitor certain columns. Anytime a column is accessed that was previously unseen, the callsite is notified.

10) next time the query is run, the callsite has a record of which columns were used and uses "scrooged_sql()" to only produce a select query for those columns.
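Notes 9 and 10 can be sketched in a few lines of plain Ruby. To be clear, this is not slim_scrooge's code: TrackedRow, TinyScrooge, the :report key and the column list are all hypothetical names I made up to show the monitor-then-narrow idea.

```ruby
# Illustrative sketch only; none of these names come from the gem.
class TrackedRow
  def initialize(values, seen)
    @values = values   # column name => value for one row
    @seen = seen       # shared record of columns read at this call site
  end

  def [](column)
    @seen[column] = true   # note the access before returning the value
    @values[column]
  end
end

class TinyScrooge
  ALL_COLUMNS = %w[id name info notes]

  def initialize
    @callsites = {}   # call-site key => { column => true }
  end

  # Columns to select: everything for an unseen site, else only what was read.
  def columns_for(key)
    seen = @callsites[key]
    seen ? seen.keys : ALL_COLUMNS
  end

  # Wraps raw rows so that reads are reported back to the call site.
  def monitor(key, rows)
    seen = (@callsites[key] ||= {})
    rows.map { |row| TrackedRow.new(row, seen) }
  end
end

SCROOGE = TinyScrooge.new
FIRST_SELECT = SCROOGE.columns_for(:report)    # first visit: all columns
RAW = [{ "id" => 1, "name" => "a", "info" => "b", "notes" => "c" }]
SCROOGE.monitor(:report, RAW).each { |row| row["id"]; row["name"] }
SECOND_SELECT = SCROOGE.columns_for(:report)   # second visit: only id and name
```

The first pass selects everything and watches what gets read; the second pass only asks for those columns.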

Well, this was fun. I feel like I've learned a bit about how my site works under the hood, and a little more qualified to comment on the use of this gem in the future. Here are a few things I learned that are not directly about the slim_scrooge gem:

1) The Mutex class can be used to synchronize access to an object.

2) ActiveRecord appears to direct all queries through the "find_by_sql" method. That's the place to hit it if you want to get in some sort of filtering.

3) C extensions for ruby use an "Init_*" method to integrate themselves into the runtime.
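As a quick illustration of the Mutex point, here's a minimal sketch of a class-level hash whose writes are synchronized. CallsiteRegistry and its methods are hypothetical names of my own, loosely shaped like what I described in note 5, not the gem's actual class.

```ruby
# Illustrative only: a class-level hash guarded by a Mutex for writes.
class CallsiteRegistry
  LOCK = Mutex.new   # created once, when the class is defined
  @@entries = {}

  # Only one thread at a time may run the block passed to synchronize.
  def self.record(key, value)
    LOCK.synchronize { @@entries[key] = value }
  end

  def self.[](key)
    @@entries[key]
  end

  def self.size
    @@entries.size
  end
end

# Ten threads all writing to the same shared hash.
threads = (1..10).map do |i|
  Thread.new { CallsiteRegistry.record("site#{i}", i) }
end
threads.each(&:join)
```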

Until next time,

~Ethan

Friday, January 22, 2010

Un-Joining your Scopes

I had a surprising problem today when one of my tests started failing after I had done a little refactoring. You see, I'd had this ActiveRecord class that was doing some reporting (massive data extraction) and it was originally using a pretty ugly SQL statement:


Model.find_in_batches("Massive SQL statement") do |models|
  models.each do |model|
    models_to_compare = Model.scope_with_other_joins
  end
end

Naturally, I wanted to make that SQL statement go away, and use a bundle of named scopes instead. I had good tests wrapping this area already, so I set autotest running and started hacking away:


Model.scope_with_some_joins.find_in_batches do |models|
  models.each do |mdl|
    other_comparisons = Model.scope_with_other_joins
  end
end

Note that both queries (line 1 and line 3) have similar joins involved. Now, my tests started failing on line 3 -- I get a runtime error showing me that for some reason when running the second query it's maintaining the join scope from the outer query, giving me an "ambiguous column" error because there is one table that is joined in from both queries. Now, this "some reason" is really just that this is the way it's designed: the whole point of a "scope" is to be able to nest other things inside of it. In my case, though, I needed the line 3 query to be totally separate and distinct. It took some googling and stack-overflowing (love that community), but here's what I discovered:

Model.scope_with_some_joins.find_in_batches do |models|
  models.each do |mdl|
    Model.send(:with_exclusive_scope) do
      other_comparisons = Model.scope_with_other_joins
    end
  end
end

This protected "with_exclusive_scope" method resets the scope entirely for that model within that block. Thus, you're able to have a clean query regardless of the surrounding context. Now, I'm not saying that this hack of sending a protected method is a good idea in general, but in my case I didn't have an easy way to get around it (other than leaving the SQL statement in place). It's still cleaner to me than having that giant SQL string I had in the code before, and maybe once I do a little more reading on the subject I'll get an even better idea. Other suggestions welcome!
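If it helps to picture the difference, here's a toy model of nested versus exclusive scoping in plain Ruby. ToyModel, current_joins, demo and the :accounts join are invented for illustration; this only mimics the merge-versus-replace behavior and is not how ActiveRecord is implemented.

```ruby
# Toy model of scope nesting; not ActiveRecord.
class ToyModel
  @scopes = []   # stack of active join lists

  class << self
    # Nested scope: adds its joins on top of whatever is already active.
    def with_scope(joins)
      @scopes.push(current_joins + joins)
      yield
    ensure
      @scopes.pop
    end

    # Exclusive scope: ignores the surrounding joins for the block.
    def with_exclusive_scope(joins = [])
      @scopes.push(joins)
      yield
    ensure
      @scopes.pop
    end

    def current_joins
      @scopes.last || []
    end
  end
end

def demo
  result = {}
  ToyModel.with_scope([:accounts]) do
    ToyModel.with_scope([:accounts]) { result[:nested] = ToyModel.current_joins }
    ToyModel.with_exclusive_scope([:accounts]) { result[:exclusive] = ToyModel.current_joins }
  end
  result
end

DEMO = demo
```

The nested case ends up with the same table joined twice, which is the shape of my "ambiguous column" error; the exclusive case sees it only once.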

Thursday, January 21, 2010

Autotest saves the day

This is not a tutorial on how to set up autotest on your machine. People have already done that plenty of times; a good one is here.

I'm just writing to say what a big difference it's made for me. My first job was at a "test-everything" development shop. I really agreed with the notion of having solid tests surrounding all possible code, and running them before every commit/deployment. The problem for me arose when I moved to start my own business. Without all that peer pressure (I'm working as the only development talent currently), it's easy to fall off the wagon. Especially when using tools that don't exactly integrate testing into your workflow. Typically I'd write tests around all new code, and run them before committing, but if I was making a quick change just to format something better or to fix a bug, I was often hurried enough to not only write no new tests, but to not run any of my current tests before committing just to get the damn thing out the door.

Autotest silently runs in the background, running your tests anytime you change a file. Not only that, it can be configured to use Growl to notify you every time a test breaks. Now I don't even have to think about it. Every time I press command-S, my tests get run and I know that at the least I haven't broken anything that's currently covered.

Of course the limitation is that you must write tests in the first place. Having your tests run all the time without any real coverage doesn't save you much. For me, though, just knowing that my tests are being run consistently gives me more motivation to write more of them, more often. If you're a rails developer, just try it. It's really not too much of a time commitment to set up, and if you're like me you'll be suddenly one giant step closer to being the unit-testing-guru that you always wished you were.