I know that this email and blog entry is way too long. But it is definitely worth reading and, I think, sort of fun. So print it out and read part of it now and, if you can put it down, part of it tomorrow. I think you will enjoy it. Don’t forget to send it to a couple of friends. Share the fun.
When I was VP of IT at Barnes and Noble.com, we were on the bleeding edge of ecommerce. We were one of 5 sites that were the biggest, most influential, most complex, fastest moving ecommerce sites on the Internet. Like the others, we were breaking new ground constantly as we invented how to do everything. Along with all the excitement came mistakes. Some of the mistakes turned into crises that impacted the business. Managing crises became one of my unwritten jobs. I became a student of crises to learn how to deal with them better. Here are 9 rules that I learned along the way:
Nine rules for managing crises
1. Train your people to pull the rip-cord early
2. Avoid the cube magnet
3. Always assume there is more than one unrelated problem occurring
4. Ignore the obvious
5. Schedule the fire-fighting effort for effectiveness
6. Help people get off the call
7. Fire the fire enthusiasts
8. Learn from every crisis
9. Engineer your way to sleeping at night
1. Train your people to pull the rip-cord early. I found it fascinating that the onset of a crisis was naturally ignored by, well, everyone. Maybe people think it will go away like a cold if they just leave it alone. Or maybe they think that if they close their eyes the boogeyman can’t be there, or that someone else will fix it so they can focus on what is important to them right now. Or maybe they are afraid that someone will shoot the messenger. The problem with ignoring an event is that the longer you put off diagnosing it, the more noise enters the system from increased usage, planned changes, and unrelated events. All of this clutter can make it much more difficult to gather the pertinent information to understand what initiated the crisis. Understanding is the first step to solving a problem. Employees should be encouraged to identify issues and rewarded for raising the alarm as soon as possible.
2. Avoid the cube magnet. I am sure you have seen this occur. Shortly after the alarm is sounded, one of the best engineers announces, “I know what the problem is!” You see everyone stop their own investigation and start surrounding the engineer’s cube, admiring how fast she can type. Uh-oh. After a while you find out that the engineer was wrong, or that she discovered some interesting information but not the root cause of the event. The result is lost time. Rather than letting this happen, you should say, “Great, Sue! Let us know what you found on the conference call. In the meantime, everyone should carry on with their own fact finding.”
3. Always assume there is more than one unrelated problem occurring. IT systems naturally grow to become complex systems of systems. At any moment in time, there are unimportant events occurring that should, when time permits (ya, right), be cleaned up in the code or in the configuration of servers, SAN, or network. Because these are happening all the time, when you start looking for problems, you are guaranteed to find them. Unfortunately, they are not necessarily the clues that relate to the root cause of the current crisis. How should you deal with these? Track them all. You should absolutely look for how spurious metrics may be related to the problem at hand, but it is good to train the firefighters to keep an open mind. Any two events, no matter how much you want them to be correlated, may not be.
4. Ignore the obvious. “The cause cannot be the fuzzywuz master database. There is no real-time relationship between that database and the order path.” “There is no way that the test code could have ended up loaded on that production server. There is no way that putting in that server name could have ended up pointing to that server.” Regardless of how we think systems work, we may not completely understand the errors in systems, the mistakes that people make, or the configuration impacts of planned and unplanned changes. I have always found that the root cause of a problem made sense. Not the sense that the experts expected, but still very logical. Net-net, it is not a personal thing, but crises need to be treated very logically to get resolved. People, even the smartest ones, may be wrong sometimes. What they think is obvious may not turn out to be so obvious.
5. Schedule the fire-fighting effort for effectiveness. Ok. So we have declared a crisis. Someone opens the “bridge”. Everyone who could help gets on the call. This will be effective for at least 30 minutes as data is shared. Then, when you find that the root cause has not been discovered, it is time for more fact-finding. In many cases, the best way to manage the diagnosis is in waves: share information, go collect data, then come back and share information again. To do this, it is best to plan the waves. Sometimes they are 45-minute waves, sometimes longer. This planned fire-fighting approach gives the experts the time that they need to accomplish tasks. It is effective.
6. Help people get off the call. The crisis has been going on. The call has been open for 5 hours. Everyone who could have offered any help at the beginning joined the call, voluntarily or not. Now, it is likely that half of them no longer add any value on the call. They want to get back to work, visit their family or go back to sleep. To get off the bridge, a) the problem must be solved, b) an individual must be able to prove without a shadow of a doubt that they have nothing to offer, or c) the individual dies. I have seen engineers feign their own death to get off conference calls. Rather than forcing on them the embarrassment of explaining the next day how they were resuscitated through the wonders of medicine, it is better to actively help people get off the call. If it is reasonable to conclude that their area of expertise will not help, tell them to go to bed. We know how to find them.
7. Fire the fire enthusiasts. Every organization has them. They are the first ones that volunteer to get on the bridge. They are the only ones that know how System Z works. They are so fast at fixing the problem when it occurs. But, interestingly, they don’t ensure that the root cause gets fixed. They live for crises. It may seem like they are indispensable. Temporarily they are. In the long term, they tend to inhibit your ability to stabilize systems, increase the mean time between failures, and improve the service levels to your customers. If it is possible, find a way to give them the religion to improve process and reduce defects. But if they are adrenaline addicts, your only choice may be to replace them.
8. Learn from every crisis. I love to learn how to avoid pain. There are many things to learn from crises. These can include how the systems really work, and the thresholds past which they become unstable. What should be monitored in the future to foresee looming problems and avoid them. What caused the defect, and how it could have been avoided in the development process. How to improve change and release management. Which tools were most useful to you and which ones weren’t helpful. How to improve crisis management in your organization.
9. Engineer your way to sleeping at night. As far as I can tell, 90% of all significant crises start at 1am. They follow the sun to ensure that it is always 1am wherever the engineers live. This will ensure that there is no way that twenty people in your organization sleep at night. It takes effort to fight this law of nature. It can be done. We all know how to do this. We merely implement good quality assurance throughout the application development life cycle; we implement ITIL ITSM to manage change, configuration items, release management and service desk processes; and we implement end-to-end monitoring so that we learn to eliminate critical alarms and start managing warnings.
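To make that last point a little more concrete, here is a minimal sketch in Python of what “managing warnings” could look like: each metric gets a warning threshold well below its critical threshold, so someone can act during the day instead of at 1am. The metric names and threshold values are made up for illustration; they are assumptions, not taken from any real monitoring product or system.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Warning and critical levels for one metric (higher is worse)."""
    warning: float
    critical: float


def classify(value: float, threshold: Threshold) -> str:
    """Return 'critical', 'warning', or 'ok' for a single reading."""
    if value >= threshold.critical:
        return "critical"
    if value >= threshold.warning:
        return "warning"
    return "ok"


# Hypothetical metrics and limits, chosen only for this example.
THRESHOLDS = {
    "disk_used_pct": Threshold(warning=80.0, critical=95.0),
    "order_queue_depth": Threshold(warning=500.0, critical=2000.0),
}


def triage(readings: dict[str, float]) -> dict[str, str]:
    """Classify every reading that has a configured threshold."""
    return {
        name: classify(value, THRESHOLDS[name])
        for name, value in readings.items()
        if name in THRESHOLDS
    }


if __name__ == "__main__":
    print(triage({"disk_used_pct": 85.0, "order_queue_depth": 120.0}))
```

The idea is simply that a reading in the warning band becomes a daytime task, and the critical band is what you are trying never to reach.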
At T3 Dynamics, we are rolling out Monitoring as a Service to complement our professional services of strategy, architecture, implementation and managed services for enterprise monitoring. We are looking for additional beta testers for a no-charge implementation of monitoring so you too can sleep better at night. Please let me know if you have an interest.