When Best Efforts Aren’t Good Enough
“Have you tried rebooting it?”
There was a time, not so long ago, when that was the first question a technician would ask when troubleshooting a PC, or one of the servers that evolved from PC architecture. Nor was this limited to servers; IT appliances, network equipment, and other computing devices could all be expected to behave oddly if not rebooted regularly. As enterprise IT departments matured, reboot schedules were built into routine preventive maintenance. Initially, IT departments developed policies, procedures, and redundant architectures to minimize the impact of regular reboots on clients. Hardware and O/S manufacturers did their part by addressing most of the issues that made those reboots necessary, and the practice has gradually faded from memory. But while the routine reboots are mostly gone, the architectures, metrics, and SLAs they left behind remain.
Five Nines (99.999%, or roughly five minutes of downtime per year) availability SLAs became the gold standard for infrastructure and are assumed in most environments today. As business applications have become more complex, integrated, and distributed, the availability of each individual system supporting them has become increasingly critical. Fault tolerance in application development is not trivial, and in application integration it is orders of magnitude harder, particularly when the team performing the integration does not have the source code. These complex systems are fragile and behave unpredictably if not shut down and restarted in an orderly fashion. If a single server supporting one piece of a large distributed application fails, it can cause system or data corruption that takes significant time to resolve, cutting off client access to applications. This fragility is what makes Five Nines architectures so important. Today, applications hosted in data centers rely on infrastructure and operating systems that are rock solid, rarely fail, and are reliable to a Five Nines standard or better.
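To make the nines concrete, here is a quick illustrative calculation (mine, not from any provider's documentation) of the downtime budget each SLA tier actually permits per year:

```python
# Downtime budget implied by an availability SLA.
# "Nines" notation: Five Nines = 99.999% uptime.

MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes

def downtime_minutes_per_year(availability_pct: float) -> float:
    """Return the maximum annual downtime (in minutes) an SLA permits."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for label, pct in [("Three Nines", 99.9),
                   ("Four Nines", 99.99),
                   ("Five Nines", 99.999)]:
    print(f"{label} ({pct}%): {downtime_minutes_per_year(pct):.2f} min/year")
```

The numbers work out to roughly 526 minutes a year at Three Nines, 53 at Four Nines, and just over 5 at Five Nines: each nine you drop multiplies your potential outage time by ten.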
As we look at the cloud, it’s easy to believe there is an equivalency between a host in your data center and an instance in the cloud. While the specifications look similar, critical differences often get overlooked. For example, instances in the cloud (as well as all other cloud services) carry a significantly lower SLA standard than we are used to; some are even provided on a Best Efforts basis. It’s easy to understand why this important difference is missed: the hardware and operating systems we currently place in data centers are designed to meet Five Nines standards, so availability is assumed, and nobody asks about it anymore. Cloud-hosted services are designed to resemble the systems we deploy in our data centers, and although cloud providers are clear and honest about their SLAs, they don’t exactly trumpet from the rooftops the gap between traditionally accepted SLAs and the ones they offer.
A Best Efforts SLA essentially boils down to your vendor promising to do whatever they are willing to do to make your systems available to you. There is no guarantee of uptime, availability, or durability, and if a system goes down, you have little or no legal recourse. It is, of course, in the vendor’s interest, and good for their reputation, to restore systems as quickly as possible, but they (not you) determine how the outage will be addressed and how resources will be applied to resolve it. If the vendor decides that their most senior technicians should not be redirected from other priorities to address your outage, more junior technicians will handle the issue and may take longer to resolve it: a situation that serves your vendor’s self-determined best interest, not yours.
There are cases in which a cloud provider will offer an SLA better than the default of Best Efforts. AWS S3 is one example: Amazon is proud of its Eleven Nines (99.999999999%) of data durability. Don’t be confused by this. It is a promise that your data stored there won’t be lost, not a promise that you’ll be able to access it whenever you want. You can find published SLAs for several AWS services, but none of them exceeds Four Nines. That is effectively 10x the potential outage time of Five Nines (roughly 52 minutes per year versus five), and it applies only to the services provided by the cloud provider, not to the infrastructure you use to connect to them or the applications you run on top of them.

The nature of a cloud service outage is also different from one that happens in a data center. In your data center, catastrophic, all-encompassing outages are rare, and your technicians will typically still have access to systems and data even while your users do not. They can work on restoring services and on “Plan B” approaches concurrently. When systems fail in the cloud, technicians often have no access at all, and the work of restoring services cannot begin until the cloud provider has restored access. This typically leads to more application downtime. Additionally, when systems go down in your data center, your teams can usually provide an ETA for restoration and status updates along the way. Cloud providers are notorious for not offering status updates while systems are down, and in some cases the systems they use to report failures and provide updates rely on the failed systems themselves, meaning you’ll get no information about the outage until it is resolved. Admittedly, these types of events are rare, but the possibility should still give you pause.

So, you’ve decided to move your systems to the cloud, and now you’re wondering how you are going to deal with the inevitable outages.
There are really only a few options available to you. First, you can do nothing and hope for the best. For some business applications, this may be the optimal (though riskiest) path. Second, you can design your cloud infrastructure the way your data centers have been designed for years. My last two posts explored how expensive this path is, and depending on how you design it, it may not deliver the availability you desire anyway. Third, you can implement cloud infrastructure automation and develop auto-scaling/healing designs that identify outages as they happen and often respond before your team is even aware of a problem. This option is more cost-effective than the second, but it requires significant upfront investment, and its effectiveness depends on people well-versed in deploying this type of solution, people who are in high demand and hard to find right now. Finally, the ideal way to handle this challenge is to rewrite application software to be cloud-native: modular, fault-tolerant applications that are infrastructure-aware, able to self-deploy and re-deploy through CI/CD patterns and embedded infrastructure as code. For most enterprise applications, this would be a herculean effort and a bridge too far.

Over the past several decades, as we’ve made progress in IT toward total availability of services, you’ve come to rely on, take comfort in, and expect your applications and business features to be available all the time. Without proper thought, planning, and an understanding of the revolutionary nature of cloud-hosted infrastructure, that availability is likely to take a step backward. Don’t be like so many others and pay a premium for lower uptime. Be aware that there are hazards out there, and bring in experienced people to help you identify the risks and mitigate them. You’re looking for people who view your move toward the cloud as a business effort, not merely a technical one.
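For readers wondering what the auto-scaling/healing designs in the third option actually look like, here is a deliberately simplified, simulated sketch of the core control loop. Everything in it is a stand-in of my own devising, not any provider’s real API: in practice the health check would probe a real endpoint and replacement would call your cloud provider’s SDK.

```python
# Illustrative auto-healing control loop (simulated; all names are placeholders).
# A real version would probe instance health endpoints and call a cloud
# provider's API to terminate and relaunch instances.

fleet = {"web-1": "healthy", "web-2": "unhealthy", "web-3": "healthy"}
replaced = []  # record of instances the loop has replaced

def is_healthy(instance: str) -> bool:
    # Stand-in for an HTTP health probe against the instance.
    return fleet[instance] == "healthy"

def replace_instance(instance: str) -> None:
    # Stand-in for terminating the bad instance and launching a fresh one.
    replaced.append(instance)
    fleet[instance] = "healthy"

def heal_once() -> list[str]:
    """One pass of the loop: find and replace unhealthy instances."""
    bad = [i for i in fleet if not is_healthy(i)]
    for instance in bad:
        replace_instance(instance)
    return bad

print(heal_once())  # replaces web-2
print(heal_once())  # nothing left to do
```

The point of the pattern is that the loop, not a human, notices the failure: run continuously (or triggered by monitoring), it replaces a dead instance in seconds, often before anyone has opened a ticket.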
Understand the challenges that lie ahead, make informed decisions regarding the future of your cloud estate, and above all, Cloud Confidently™!