We've spent most of the last two weeks trying to fix a difficult-to-find bug in the edu2.0 site. After the site has been running for a few hours, an occasional page request freezes. It's frustrating, because the rest of the time the site works fine.
So far, we've tried a bunch of things, each of which has improved site performance, but none of which has fixed the bug. Improvements include:
- lowering the debug level of our production servers to reduce the amount of logfile output
- changing from the Apache2 prefork model to the threading model
- adding 2GB memory to the mongrel servers
- increasing the possible number of open connections to the database
- upgrading to the latest version of all gems
- upgrading to Rails 1.6
- doing more fragment caching using memcached
We also plan on moving to an evented mongrel, adding more mongrel servers, and upgrading to Rails 2.1 soon, but we don't expect any of those things to fix the bug.
Today, we noticed that a couple of the mongrels had frozen with timeout error messages, and our production log indicated a couple of timeouts blocked on a session write to memcached. So my best guess right now is that the memcached clients are somehow freezing.
According to the blogs here and here, others have noticed the same thing. So we're going to experiment with adding the timeout code to see if this fixes the issue.
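To make the idea concrete, here is a minimal sketch of what wrapping a cache write in a timeout might look like, using Ruby's standard `Timeout` module. It is not the code from those blog posts or our production code: `SlowCacheClient` and `write_with_timeout` are hypothetical names invented for illustration, and the client just sleeps to simulate a stalled memcached write.

```ruby
require 'timeout'

# Hypothetical stand-in for a memcached client. A real client would talk
# to a server over a socket; this one sleeps to simulate a stalled write.
class SlowCacheClient
  def initialize(delay)
    @delay = delay
    @store = {}
  end

  def set(key, value)
    sleep(@delay)
    @store[key] = value
  end
end

# Hypothetical helper: attempt the write, but give up after `seconds`.
# Returns true if the write finished, false if it timed out.
def write_with_timeout(client, key, value, seconds)
  Timeout.timeout(seconds) do
    client.set(key, value)
  end
  true
rescue Timeout::Error
  false
end

fast = SlowCacheClient.new(0)
slow = SlowCacheClient.new(2)

puts write_with_timeout(fast, "session:abc", "data", 1)    # write completes
puts write_with_timeout(slow, "session:abc", "data", 0.1)  # write is abandoned
```

The point is that a stalled write costs the request a bounded amount of time rather than freezing the mongrel. `Timeout` works by interrupting the block from another thread, so it is a blunt instrument; if the client library offers its own socket timeout setting, that is probably the better place to set the limit.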
For the record, this is an area where Rails could do with some improvement; it seems like some of the standard libraries are still a little immature. As long as we can fix this bug soon, I will remain happy with our choice of Ruby on Rails.