Comment by kragen
One of the interesting things that came out of Google's "SRE" system is that they deliberately add outages to a service if it hasn't had enough of them. They learned years ago that if you build a service that promises 99% uptime and it delivers 99.99%, other people in the company will unintentionally come to depend on that 99.99%. So they chaos-monkey it: they inject planned outages so that dependents have to cope with the promised level of failure, which keeps the inevitable real failures from being catastrophic.