5 Whys: Production Solr = pwnt

Posted on November 12th, 2010
I’ve been introduced to a process known as the “5 Whys.” When something goes terribly wrong, like your Solr index getting corrupted, you ask the question, “why did it go wrong?” And then after the answer, “why was that?” (Rinse, repeat, until you’ve asked the question 5 times).
The idea is that if you ask that question five times and figure out solutions to those five answers, you'll end up with a significantly better process than you had before.
About 2 months ago we released 1.0 of our first module, and in the last month our users have really started to use our application. As such, the last two weeks have been an (incredibly painful) learning experience.
While there are a bunch of boring ones, I think this one’s pretty interesting, and the lessons that come from it can be applied universally.
I’ll start this story with a timeline. Last Tuesday morning, around 7:45AM, a customer emails our support, saying search is broken. One of our guys looks into it, and sure enough, search is returning erratic results. My phone starts blowing up with texts and emails. I sit down, look at it, think back a week before when we turned on Solr’s autoCommit functionality, blame that, and do a non-destructive reindex of everything. After 15 minutes (our dataset isn’t big yet), the problem’s resolved.
Talking through the issue, we thought it was a reasonable hypothesis that too many DJ workers had posted big commits to Solr at once (at the time we were reindexing nightly), and that resulted in Solr running out of memory, corrupting its index. It’s worth noting that Solr’s documentation says that the index won’t be corrupted when this happens, but it was all we had to go on at the time.
Wednesday afternoon, about 3:15, it happened again. We knew how to handle it because we’d talked about what processes we should follow should we see this again, but were a little confused. We’d reduced the workers, we’d turned off full reindexing and were only doing nightly optimization calls, we reduced the commit batch size to 50 (from 300) to reduce the likelihood of some huge dataset being posted, and we’d lowered Solr’s commit threshold to keep it from blowing up.
We decided to reduce another worker and hope the black magic would help.
3:15 rolls around Thursday. It happens again.
After quickly setting the fix in motion, this piques my interest. Clearly something scheduled is doing it now. Looking at the crontabs for our different servers, I don't see anything suspicious, but then I look through the Solr request logs and see a POST to /solr/update from our office right before our phones started blowing up. Looking at the day before, I see the same thing.
The easiest fix was to change the Solr password on production and not put that change into Git. It didn't happen today, so it seems that was the right fix. Then tonight, watching my logs while running a rake task, I saw "Error, couldn't connect to Solr server at <production solr ip>". Boom.
Someone had put ENV['RAILS_ENV'] ||= 'production' in one of the rake files, which, it turns out, are globally scoped. This, coupled with us putting production passwords in source control, let us corrupt the Solr index remotely. Epic fail.
There were lots of process changes that came from this, but here are a few code notes that might save you a stroke or two:
Never, ever, ever, ever store production passwords in source control
I’d previously argued against this, but now I’ve seen the error of my ways. If you have production passwords locally, and you don’t/can’t put some sort of IP security around your server, it’s just a matter of time before someone accidentally gets an environment variable screwed up and ruins *everything*.
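As a sketch of the alternative (variable names here are illustrative, not from our codebase): pull the credential from an environment variable that only exists on the production box, so the password never enters Git and a developer machine physically can't authenticate against production.

```ruby
# Simulate a developer machine, where the production credential is never set.
ENV.delete('SOLR_PASSWORD')

# At boot, read the credential from the environment rather than a committed
# config file. On dev boxes it's nil, so we can only talk to a local Solr.
solr_password = ENV['SOLR_PASSWORD']
if solr_password.nil?
  puts 'SOLR_PASSWORD not set; falling back to local unauthenticated Solr'
else
  puts 'using production credential from the environment'
end
```

An environment check like this fails safe: forgetting to set the variable degrades you to localhost instead of pointing you at production.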
Everything in lib/tasks is globally scoped
If you write something cool at the top of one rake file, keep in mind that it applies to all rake tasks in your app. If you define a method (even in a namespace) in a rake file, it’ll be available to all other rake tasks. If you have two identically named methods that do different things in different namespaces, only one of them is going to exist, and it’s quite possible that it’s not the one you want.
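The collision is easy to demonstrate. File names and method bodies below are hypothetical; methods defined at the top of .rake files land at the top level, so two files defining the same name clobber each other, simulated here in a single script:

```ruby
# Imagine lib/tasks/reports.rake defines a helper:
def cleanup
  'deleting old report files'
end

# ...and lib/tasks/search.rake, loaded afterwards, defines its own:
def cleanup
  'wiping the search index'
end

# Every rake task that calls cleanup now gets whichever definition
# loaded last -- quite possibly not the one its author intended:
result = cleanup
puts result # => "wiping the search index"
```

A common workaround is to put shared helpers in a module and call them fully qualified, so identical names can't silently shadow each other.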
Don’t set RAILS_ENV=production by default
While this one is technically covered by the previous note, it's worth listing explicitly. There's no reason to assume RAILS_ENV=production. If someone wants a task run in production mode, let them set the environment variable themselves. Setting it by default only means you'll break every other environment where the variable isn't set. Plus, if someone fails on note #1, there's a good chance you'll ruin someone's day.
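A minimal demonstration of why that one line is a trap: ||= fills in 'production' exactly when the variable is unset, which is the normal state on a developer's machine running a rake task.

```ruby
# A developer who never sets RAILS_ENV...
ENV.delete('RAILS_ENV')
ENV['RAILS_ENV'] ||= 'production'  # ...now runs every task as production
puts ENV['RAILS_ENV']              # prints "production"

# An explicit setting does survive the ||=, which is why the bug hides
# so well: anyone who sets their environment properly never sees it.
ENV['RAILS_ENV'] = 'development'
ENV['RAILS_ENV'] ||= 'production'
puts ENV['RAILS_ENV']              # prints "development"
```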