A few weeks ago, we noticed that the edu2.0 site started to get quite sluggish around noon. At times, it got so slow that we were worried that perhaps the code had an infinite loop in it, and we'd reboot the servers to speed things up again.
About a week later, it got so bad that we decided to focus 100% on site performance. The first thing we did was to install a new Rails performance monitoring system. We had used FiveRuns for a while, but hit several bugs that were not fixed after a couple of months of waiting. So we changed to NewRelic and it has been fantastic - rock solid and very easy to use.
We also upgraded our system from Rails 2.1 to Rails 2.3. This was relatively painless; it took about a day of work to iron out the little incompatibilities, including some in the custom work we've done in the security subsystem.
Last of all, we replaced Mongrel with Phusion Passenger for running our Rails processes. We really like the way that Passenger shuts down Rails processes if they're not used for a while or misbehave and then spins up new ones when required.
During this time, my Windows development system started to freak out and go super-slow as well - 30 seconds for a compile-and-run! Since we couldn't afford slower iteration during this crunch, I decided to scrap my Windows machine entirely and move to a 3GHz iMac with a 24" monitor. It's beautiful. It took about 2 hours to get all the basics set up, and another 8 hours to get the entire development system installed (we now use Aptana Studio as an Eclipse plugin). It was a pain having to go through so many learning curves at the same time, but my productivity is much better now.
The NewRelic reports were very useful, and their "transaction trace" feature showed us details about the requests that were taking longer than usual to process. Based on this feedback, we made the following changes:
- optimized the SQL statements generated by many of the slow requests
- improved the way that chat rooms worked, resulting in 95% fewer database accesses
- paginated message boxes, forums, blogs and chat transcripts, which reduced rendering time
- staggered the offline "housekeeping" processes which were originally all happening at once
- reduced how often we re-index the Sphinx database for text search
- various other miscellaneous tweaks
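To illustrate the pagination item in the list above, here is a minimal sketch in plain Ruby. It is not our production code: the `paginate` and `page_count` helpers and the page size of 20 are assumptions made up for this example. The point is only that each request renders one bounded slice instead of the whole collection.

```ruby
# Hypothetical helper: returns the slice of `items` for a 1-based page number.
def paginate(items, page:, per_page: 20)
  page = [page.to_i, 1].max        # treat missing or invalid page numbers as page 1
  offset = (page - 1) * per_page   # number of items to skip
  items[offset, per_page] || []    # Array#[] returns nil when offset is past the end
end

# Hypothetical helper: how many pages are needed for `total` items.
def page_count(total, per_page: 20)
  (total + per_page - 1) / per_page  # integer ceiling division
end

messages = (1..45).map { |i| "message #{i}" }
first_page = paginate(messages, page: 1)  # 20 items
last_page  = paginate(messages, page: 3)  # the remaining 5 items
```

In a Rails 2.3 application the same offset arithmetic would normally be pushed down into the query through the `:limit` and `:offset` options of `find`, so that the database returns only the rows for the current page.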
The result of all these changes was noticeable. During peak hours, the average site response time dropped from 4 seconds to just 0.2 seconds, a twentyfold improvement. We're also going to add some more servers, which should reduce this response time further still and provide capacity for growth.
Last but not least, we noticed that memory usage sometimes grows slowly, by a few percent every 10 minutes, during peak hours. We originally thought this was a memory leak, and ran a bunch of diagnostics to see if this was the case. It turns out that it's probably not a memory leak but is instead due to the increased number of file uploads during the day. When you upload a file in Rails 2.3, it reads the file in chunks and then stores it in a Tempfile. It takes a while for the chunks to get garbage collected, so memory usage increases during the upload. Some users upload files that are 10MB - 100MB, so during this time the memory usage increases quite a bit. Once the file has uploaded, the usage goes down. We might put in an attachment size limit in order to prevent extreme cases.
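Here is a rough sketch of what such an attachment size limit could look like. It is not how Rails 2.3 handles uploads internally and it is not code we have deployed; the `spool_upload` method, the `UploadTooLarge` error, the 10MB limit and the 64KB chunk size are all assumptions made up for illustration. It copies an upload into a Tempfile chunk by chunk, reuses a single read buffer so that fewer short-lived strings wait around for the garbage collector, and gives up as soon as the limit is exceeded.

```ruby
require "tempfile"
require "stringio"

# Hypothetical error raised when an upload exceeds the limit.
class UploadTooLarge < StandardError; end

MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # assumed limit: 10MB
CHUNK_BYTES      = 64 * 1024         # assumed chunk size: 64KB

# Copies `io` into a Tempfile one chunk at a time and returns the Tempfile.
# Raises UploadTooLarge once more than `max_bytes` have been read.
def spool_upload(io, max_bytes: MAX_UPLOAD_BYTES, chunk_bytes: CHUNK_BYTES)
  tempfile = Tempfile.new("upload")
  tempfile.binmode
  written = 0
  buffer = String.new                  # reused for every chunk
  while io.read(chunk_bytes, buffer)   # IO#read returns nil at end of input
    written += buffer.bytesize
    raise UploadTooLarge, "upload exceeds #{max_bytes} bytes" if written > max_bytes
    tempfile.write(buffer)
  end
  tempfile.rewind
  tempfile
rescue StandardError
  tempfile.close! if tempfile          # delete the partial file on any failure
  raise
end

# Usage with an in-memory stand-in for an uploaded file.
upload = StringIO.new("x" * 150_000)
stored = spool_upload(upload, max_bytes: 200_000)
puts stored.size
```

With a limit like this, a 100MB upload would be rejected after roughly the first 10MB has been read, rather than being held in memory and on disk in full.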