Facebook’s Ben Maurer makes some great points in his article Fail at Scale. I didn’t watch the accompanying video presentation, but the article is an extremely interesting read about how they try to anticipate and manage failures. The observation that failures are so often linked back to configuration changes stood out. I also enjoyed the bit about canary releases and the adaptive LIFO queues.
Being the Allspaw fan that I am, I always cringe a bit when I see someone so cavalierly throw out the phrases “human error” and “root cause”, no matter what their data say. But their “DERP” methodology softens the blow a bit. If you’re not doing post mortems using something like that, then there’s a good chance that yours are toxic.