As I read the morning news today, I noticed many comments about a global IT service disruption, apparently caused by a system software configuration deployment. Incidentally, I saw it on an Android tablet while browsing a Linux-hosted social media site. The effects are wide-ranging, to the point where our local drive-in movie theater could not sell tickets online.
When I worked on the front lines of a global IT shop, we constantly looked to eliminate single points of failure, and were regularly surprised by them. For example, our single backup generator once failed to deliver power, and the knee-jerk fix was to have two generators.
Any suggestions on best practices for finding and eliminating single points of failure (beyond working for a manager who has your back no matter what)?