The Rules - At Least As I See Them (Well, the First Two)
Since I’ve been dealing with computers, I’ve developed some rules of thumb. The first rule seems obvious, although I’m constantly surprised by the people who break it. It is:

Rule 1: Never run a command on a computer that affects the communications path through which you are connected to that machine.

This is slightly more complicated than it sounds - especially when configuring routing protocols on routers. Change things such that you lose your route from where you are to that machine, and it’s time for Plan B. For the most part, though, it’s straightforward, which is why it became the first rule.

The second is less obvious and more controversial, although it’s potentially more important. It is:

Rule 2: System add-ons included solely to reduce downtime by means of failover or redundancy will cause more downtime through bugs and misconfiguration than the hardware failures they guard against would have caused, unless the level of diligence and effort is greatly increased.

Let’s take that a piece at a time. There is a lot of technology in the world that people use to protect against hardware failures - SANs and clusters, for example. That function, protecting against hardware failures, is inherently complicated, which means the technology that performs it has to be inherently complicated. And Murphy’s Law (and experience) tells us that the more things there are that could go wrong, the more likely it is that something will. In fact, I contend that unless you go to extraordinary lengths to test every last thing that can go wrong, those many complicated bits will cause more problems than you would have had if you’d just left them out.

I’ve seen a lot of network outages in my time caused by routers that got confused and sent out (or listened to) the wrong routes. Network engineers have many names for this phenomenon - my favorite is “flapping” - and it’s a very common occurrence. I have seen far fewer outages caused by router hardware that just dies, and most of those involved routers that spent time in places with very dirty electrical power. Of course, I have also seen networks that lose routers without any hiccup at all, but those are generally the networks that require “pull tests” (where you unplug routers and make sure things fail over as you expect) after every non-trivial configuration change, and on a regular schedule besides.

Likewise, I’ve dealt a lot lately with a network that has regular issues due to “automatic spanning-tree reconfigurations,” and with a database cluster that blue-screens whenever the underlying SAN hiccups.

Think about it for a second: there are many different ways a system can go wrong - many different pieces that can fail in different ways. What are the odds that the code meant to deal with one specific failure will behave exactly as you want it to the very first time that section of code is executed in your environment? (The rough sketch at the end of this post puts some numbers on that question.)

I’m not saying “never use any High-Availability add-on.” I’m saying “if you use a High-Availability add-on, either spend far more effort configuring and testing it than you would spend on the non-HA version, or expect it to cause you more problems than you would have had if you’d gone with the non-HA version.”

It’s okay if you don’t believe me. A lot of vendors have spent a lot of money trying to convince you that it isn’t true. But think about it, and start paying more attention to what’s causing problems for your enterprise.
After that, I think it will become clear.
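To put some rough numbers behind Rule 2, here’s a back-of-the-envelope sketch in Python. Every figure in it is invented for illustration - the hardware failure rate, the odds that an untested failover path works, the rate of HA-induced incidents - so treat it as a way of framing the argument, not as data; plug in estimates from your own incident history instead.

```python
# Back-of-the-envelope model for Rule 2. Every number below is a made-up
# assumption for illustration only - substitute your own estimates.

hw_failures_per_year = 0.05        # assumed: one hardware death per box per 20 years
hw_repair_hours = 8.0              # assumed: outage length with no HA, just a rebuild

# With an HA add-on, hardware failures are (mostly) masked...
failover_success_rate = 0.7        # assumed: chance an untested failover path works
failover_flub_hours = 4.0          # assumed: outage when the failover itself misbehaves

# ...but the HA machinery brings its own failure modes (flapping, split-brain,
# bad quorum config, a SAN hiccup that wedges the cluster, and so on).
ha_induced_incidents_per_year = 0.5    # assumed: one HA-caused incident every 2 years
ha_incident_hours = 3.0                # assumed: average length of those incidents

downtime_without_ha = hw_failures_per_year * hw_repair_hours

downtime_with_ha = (
    hw_failures_per_year * (1 - failover_success_rate) * failover_flub_hours
    + ha_induced_incidents_per_year * ha_incident_hours
)

print(f"Expected downtime without HA: {downtime_without_ha:.2f} hours/year")
print(f"Expected downtime with HA:    {downtime_with_ha:.2f} hours/year")
```

Under these made-up numbers, the incidents caused by the HA machinery swamp the hardware failures it was meant to mask. The only way the comparison flips is if you drive that second term down - which is exactly the extra diligence and testing the rule demands.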