Dealing with Memory Leaks in DelayedJobPosted on April 23rd, 2011 1 comment
Update: David Genord pointed out a *huge* bug in my code, where jobs will never be retried; his fix has replaced the existing gist.
Our workers (DelayedJob) leak memory. Not heinously fast, but enough that monit bounces them fairly often. I tried out perftools.rb on one of our longer running jobs, reindexing contacts. I cut the job down to indexing only 25 contacts, and profiled objects instantiated. To index 25 contacts, which took about 25 seconds, the app instantiated 3.1 million objects.
Just in case the title is misinterpreted – DJ is not leaking memory. My code is.
With a little bit of memoization and a few short-circuits, I got it down to 800 thousand. That was nice, and it sped things up by an order of magnitude, but the jobs were still leaking memory
So, I decided to steal a cool concept from the next async processor I really want to work with: Resque. I changed the worker to fork and wait for every job it performs. This means that there’s an overhead added to every worker of about 200MB, but that’s nothing compared to how bad things got if all the workers started sucking up huge chunks of memory at the same time.
One caveat: I tried to look around for someone who’d already done this and just incorporated those changes, but it turns out googling “delayed job fork” doesn’t reveal much. If there’s an existing project that’s already trying to do this for delayed job, please let me know
The code changes around this are pretty simple, just changing the run method and adding a hook for handling fork reconnection stuff (it looks like DJ has some hooks built for this, but I didn’t see them in my local copy)
Before this patch, if I ran 1 worker locally, after about half an hour my machine ran out of ram (it has 8gb). If I ran 4, my computer was unusable. After the patch, I can run 4 workers locally with no perceived performance hit. Money shot:
You have a major problem with your code. Failed jobs will NEVER be retried.
Try this https://gist.github.com/942589
Also there are a few autoscaling non single process forks of delayed_job. Here is the one I did and have been using in production.
This does not fork for every job but instead forks off worker processes which stay alive for as long as they have work to do. After they have been idle for a short period, 30 seconds if I remember correctly, the worker is killed.