canary releases and adaptive LIFO queues

Fail at Scale:

Great points by Facebook’s Ben Maurer, and an extremely interesting read about how they try to anticipate and manage failures. The observation that failures are so often linked back to configuration changes is a telling one. I also enjoyed the bit about canary releases and the adaptive LIFO queues.
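For anyone who hasn’t read the article yet, the adaptive LIFO idea as I understand it is: serve requests in FIFO order under normal load, but switch to LIFO once a backlog starts to form, so the newest requests (whose clients are most likely still waiting) get served first. Here is a rough Python sketch of that idea. The class name and the `backlog_threshold` knob are my own assumptions for illustration, not anything taken from Facebook’s code, and it leaves out the queue-timeout side of their approach entirely.

```python
from collections import deque


class AdaptiveLifoQueue:
    """Hypothetical sketch: FIFO while the backlog is small, LIFO once it grows."""

    def __init__(self, backlog_threshold=3):
        # backlog_threshold is an assumed tuning knob, not a value from the article.
        self.backlog_threshold = backlog_threshold
        self._items = deque()

    def put(self, request):
        # New requests always join at the right-hand (newest) end.
        self._items.append(request)

    def get(self):
        if not self._items:
            raise IndexError("queue is empty")
        if len(self._items) > self.backlog_threshold:
            # A backlog is forming: serve the newest request first (LIFO).
            return self._items.pop()
        # Normal load: serve the oldest request first (FIFO).
        return self._items.popleft()

    def __len__(self):
        return len(self._items)
```

With a threshold of 2, a queue holding two requests hands back the oldest one, but once a third is waiting it hands back the newest until the backlog drains again.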

Being the Allspaw fan that I am, I always cringe a bit when I see someone so cavalierly throw out the phrases “human error” and “root cause” – no matter what their data say. But their “DERP” methodology (detection, escalation, remediation, prevention) softens the blow a bit. If you’re not doing post-mortems, or rather incident reviews, using something like that, then there’s a good chance that yours are toxic.