A Tale of Two Models: Provisioning vs. Capacity

A couple of weeks ago, I wrote about current IT trends being ‘revolutionary’ as opposed to ‘evolutionary’ in nature.

Today, I want to expand on that concept and share one of the planning models that make cloud systems in particular, and automated infrastructure in general, more cost-effective and efficient. When talking to clients, I refer to this as “The Provisioning vs. Capacity Model”.

First, let’s look at the Provisioning Model, which, with some adaptation, has underpinned infrastructure decisions for the last five decades of IT planning. The basic formula looks something like this:

((((CurrentPeakApplicationRequirements * GrowthFactor) * HardwareLifespan) + FudgeFactor) * HighAvailability) * DisasterRecovery

Let’s look at a practical example of what this means. As an IT leader asked to host a new application, I would work with the app vendor and/or developers to understand the compute, storage, and networking configurations they recommend per process/user. Let’s say we determine that a current processor core can support 10 concurrent users and that a single user creates roughly 800 KB of data per day.

I would then work with the business to identify the number of users we expect to begin with, their estimate for peak concurrent users and what expected annual growth will be. Ultimately, we project that we will start with 20 users who may all be using the system at the same time. Within the first year, they anticipate scaling to 250 users, but only 25% of them will be expected to be using the system concurrently. By year five (our projected hardware lifespan) they are projecting to have 800 users, 300 of whom may be using the system at any given time. I can now calculate the hardware requirements of this application:

 

| Year | Users | Storage (GB) | Concurrent Users | Cores |
|------|-------|--------------|------------------|-------|
| 1    | 250   | 49.59        | 63               | 6     |
| 2    | 450   | 138.85       | 135              | 14    |
| 3    | 600   | 257.87       | 228              | 23    |
| 4    | 700   | 396.73       | 259              | 26    |
| 5    | 800   | 555.42       | 300              | 30    |
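As a quick sanity check on those rows, here is a minimal Python sketch of the same arithmetic, assuming one core per 10 concurrent users and 800 KB per user per day as stated above. The exact storage figures depend on how the user ramp within each year is modeled (which the table doesn’t spell out), so the storage output is illustrative rather than an exact match:

```python
import math

USERS_PER_CORE = 10      # vendor guidance: one core supports ~10 concurrent users
KB_PER_USER_DAY = 800    # vendor guidance: ~800 KB of new data per user per day

# (year, total users, peak concurrent users) from the business projections
projections = [(1, 250, 63), (2, 450, 135), (3, 600, 228), (4, 700, 259), (5, 800, 300)]

cumulative_kb = 0
for year, users, concurrent in projections:
    # Round cores up; the table's rounding convention differs slightly in year one.
    cores = math.ceil(concurrent / USERS_PER_CORE)
    # Crude storage accrual: year-end user count for the whole year (overstates a ramp).
    cumulative_kb += users * KB_PER_USER_DAY * 365
    print(f"Year {year}: {cores} cores, ~{cumulative_kb / 1024**2:.1f} GB cumulative")
```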

Being an experienced IT leader, I ‘know’ that these numbers are wrong, so I’m going to pad them. Since the storage is inconsequential in size (I’ll likely use some of my heavily over-provisioned SAN), from here on out I’ll focus on compute. The numbers tell me that I’ll need 2 servers, each with 4 quad-core processors, for a total of 32 cores. Out of caution, I would probably increase that to 3 servers. Configuring memory would follow a similar pattern.

Because the application is mission critical, it’ll be deployed in a Highly Available (HA) configuration, so I’ll need a total of six servers in case there is a failure with the first three. This application will also require infrastructure in our DR site, so we’ll replicate those six servers there, for a total order of twelve servers. In summary, on day one, this business would have a dozen servers in place to support 20 users.
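To make those multipliers concrete, the chain from cores to the final order is just the Provisioning formula above applied step by step. A small sketch, using the server sizing and padding assumptions from this example:

```python
import math

peak_cores = 30          # year-5 requirement from the table above
cores_per_server = 16    # 4 quad-core processors per server

base_servers = math.ceil(peak_cores / cores_per_server)  # 2 servers
padded_servers = base_servers + 1                        # fudge factor: round up to 3
ha_servers = padded_servers * 2                          # High Availability doubles it: 6
total_servers = ha_servers * 2                           # DR site doubles it again: 12

print(total_servers)  # 12 servers on day one, supporting 20 initial users
```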

The Provisioning Model can lead to overkill

Under the provisioning model, a Highly Available solution with sufficient Disaster Recovery infrastructure could result in a large server deployment to support a very small number of users.

I know what you’re thinking: “This is insanity! If my IT people are doing this, they are robbing me blind!” No, they aren’t robbing you blind; they are following a “Provisioning” model of IT planning. The reason they plan this way is simple: it usually takes months from the time an infrastructure need is identified to the time it is deployed in production. In most enterprises, it looks something like this:

  • 1-2 weeks – Identify a need and validate requirements
  • 1 week – Solicit quotes from 3 approved vendors (if the solution comes from a non-approved vendor, add 3 months to a year for vendor approval)
  • 2-3 weeks – Generate a Capital Request with documented Business Justification
  • 2 weeks – Submit Capital Request to Finance for approval
  • 2-3 weeks – Request a PO from purchasing & submit to vendor
  • 2-3 weeks – Wait for vendor to deliver hardware & Corporate receiving to move equipment to configuration lab
  • 3-4 weeks – Manually configure solution (Install O/S & Applications, request network ports, firewall configurations, etc.)
  • 2 weeks – Install and Burn-In

The total turnaround time here is 15-20 weeks. Given the cost, time, pain, and labor it takes to provision new infrastructure, we want to do it right and be prepared for the future; there is no quick fix if we aren’t. Under a provisioning model, the ultimate cost of deploying a solution is not the hardware being deployed, but rather the process of deploying it.

The upshot of all this: most of your IT infrastructure sits idle or nearly idle most, if not all, of the time. As we assess infrastructure, it is not uncommon to see utilization numbers below 10%.

Over the past 15 years as configuration management, CI/CD, virtualization and containerization technologies have been adopted by IT, the math above has changed, but because those technologies are evolutionary in nature, the planning process hasn’t. In the Provisioning model, we are always planning for and paying for capacity that we will need in the future, not what we need today.

Enter Cloud Computing, Infrastructure Automation, Infrastructure as Code (IaC) and AI. Combined, these technologies have ushered in a revolutionary way to plan for IT needs. IaaS and PaaS platforms provide nearly limitless compute and storage capability with few geographic limitations. Infrastructure Automation and IaC allow us to securely and reliably deploy massive server farms in minutes. AI and Machine Learning can be leveraged to autonomously monitor utilization patterns, identify trends, and predictively trigger scaling activities to ensure sufficient compute power is delivered “Just in Time” to meet demand, then scaled back as demand wanes. In cases where IaaS and PaaS providers experience localized outages, the same combination of IaC and AI can deploy your infrastructure in an unaffected region, likely before most of your user base or IT is even aware that an outage has occurred. Software updates and patches can be deployed without requiring system outages. The possibilities and opportunities are truly mind-boggling.
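As one small, hedged illustration of the automation piece, a target-tracking scaling policy on an AWS Auto Scaling group keeps a fleet near a utilization target with no human in the loop. This sketch uses boto3; the group name `app-asg` is a placeholder, and predictive scaling and cross-region failover would layer on top of the same building blocks:

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Track average CPU across the group; AWS adds and removes instances
# automatically as demand rises and falls ("Just in Time" capacity).
autoscaling.put_scaling_policy(
    AutoScalingGroupName="app-asg",   # placeholder: an existing Auto Scaling group
    PolicyName="target-cpu-utilization",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization",
        },
        # Leave headroom for the few minutes a replacement instance takes to spin up.
        "TargetValue": 70.0,
    },
)
```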

Taking advantage of these capabilities requires a complete change in the way our IT teams think about planning and supporting the applications our users consume. As I mentioned above, the incremental hardware costs of over-provisioning in the data center are inconsequential compared with the often-unaccounted-for cost of deploying that hardware. In forward-looking IT, where IaaS and PaaS are billed monthly on a cost-per-deployed-capacity model and infrastructure can be deployed nearly instantly, we need to abandon the Provisioning Model and adopt the Capacity Model.

Before I proceed, you need to understand that these three pillars (IaaS/PaaS, Infrastructure Automation, and AI) must all be in place to effectively take advantage of the cost savings and efficiency of the Capacity Model while still delivering secure, reliable services to your users. Merely moving your servers to the cloud (often referred to as “Lift and Shift”) and optimizing them for utilization may provide some initial cost savings, but at significant risk to the security, availability, and reliability of services.

3 Pillars of the Capacity Model

IaaS/PaaS, Infrastructure Automation, and AI must all be in place to effectively take advantage of the cost savings and efficiency of the Capacity Model.

Following the Capacity Model, we try to align deployed infrastructure to utilization requirements as closely as we can, hour by hour. You may have noticed that in my Provisioning example above, I was primarily planning for the required capacity at the end of the lifespan of the infrastructure supporting the application. I was also building to a standard in which no system would ever exceed 35%-40% utilization. In the Capacity Model, I want every one of my services running as close to 90% utilization as possible, ideally with only enough headroom to absorb rising demand for as long as it takes to spin up a new resource (typically only a few minutes). As demand wanes, I want to intelligently terminate services as they become idle. I use the word “intelligently” here for a reason: it’s important to understand that many of these resources are billed by the hour, so if I automatically spin up and terminate a resource within 15 minutes, I am billed for a full hour; if I do it three times in a single hour, I’m billed for three hours.
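What “intelligently” might look like in code: a hypothetical scale-in check that only terminates idle instances near the end of an hour that has already been paid for. This is a pure-Python sketch of the billing rule above, not any provider’s API:

```python
from datetime import datetime, timezone
from typing import Optional

BILLING_PERIOD_SECS = 3600  # hourly billing, as in the example above

def minutes_left_in_billed_hour(launch_time: datetime, now: datetime) -> float:
    """Minutes remaining in the instance's current, already-paid billing hour."""
    age_secs = (now - launch_time).total_seconds()
    return (BILLING_PERIOD_SECS - (age_secs % BILLING_PERIOD_SECS)) / 60

def should_terminate(launch_time: datetime, is_idle: bool,
                     now: Optional[datetime] = None) -> bool:
    """Terminate only idle instances, and only near the top of a paid hour:
    killing an idle instance 15 minutes in saves nothing."""
    now = now or datetime.now(timezone.utc)
    return is_idle and minutes_left_in_billed_hour(launch_time, now) <= 5
```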

Let’s look at a sample cost differential between Provisioning and Capacity modeling in the cloud. For this exercise, I’m using the published on-demand rates for AWS infrastructure. I am not applying any of the available discounting mechanisms, and I’m using simple calculations to illustrate the point.

 

Provisioning Model – 5 Year Costs:

| Year | Instance  | Cost/Hour | Qty | Hours/Month | Annual Cost |
|------|-----------|-----------|-----|-------------|-------------|
| 1    | c5.xlarge | $0.17     | 48  | 720         | $70,502.40  |
| 2    | c5.xlarge | $0.17     | 48  | 720         | $70,502.40  |
| 3    | c5.xlarge | $0.17     | 48  | 720         | $70,502.40  |
| 4    | c5.xlarge | $0.17     | 48  | 720         | $70,502.40  |
| 5    | c5.xlarge | $0.17     | 48  | 720         | $70,502.40  |
Total Cost: $352,512.00

 

Capacity Model – 5 Year Costs:

| Year | Instance  | Cost/Hour | Qty | Hours/Month | Annual Cost |
|------|-----------|-----------|-----|-------------|-------------|
| 1    | c5.xlarge | $0.17     | 2   | 410         | $1,672.80   |
| 2    | c5.xlarge | $0.17     | 4   | 293         | $2,390.88   |
| 3    | c5.xlarge | $0.17     | 6   | 255         | $3,121.20   |
| 4    | c5.xlarge | $0.17     | 7   | 245         | $3,498.60   |
| 5    | c5.xlarge | $0.17     | 8   | 237         | $3,867.84   |
Total Cost: $14,551.32
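Both totals fall out of the same arithmetic (hourly rate × quantity × hours per month × 12 months), so they are easy to verify:

```python
RATE = 0.17  # c5.xlarge on-demand $/hour, as in the tables above

provisioning = [(48, 720)] * 5                                # (qty, hours/month) per year
capacity = [(2, 410), (4, 293), (6, 255), (7, 245), (8, 237)]

def five_year_cost(plan):
    return sum(RATE * qty * hours * 12 for qty, hours in plan)

print(f"Provisioning: ${five_year_cost(provisioning):,.2f}")  # $352,512.00
print(f"Capacity:     ${five_year_cost(capacity):,.2f}")      # $14,551.32
```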

In the model above, for simplicity, I only adjusted the compute requirements on a yearly basis; in reality, with the ability to dynamically adjust both instance size and quantity hourly based on demand, actual spend would likely be closer to $8k over 5 years. It’s also important to remember that Revolution is neither free nor easy; developing and refining the technologies to support these savings for this new application will cost $50k-$100k over the five years, depending on the application requirements. At the end of the day, or at the end of five years, following the Capacity Model may result in spending well under half the cost of the Provisioning Model, and you will have enjoyed much higher security, reliability, and availability of applications with a significantly lower support cost.

To wrap up this very long post: yes, it is true that massive cost savings can be realized through 21st-Century IT Transformation, but it will require a Revolution in the way you think about supporting your business applications. Without people experienced in these very new technologies, you’re not likely to be happy with the outcome. Finally, if you encounter anyone who leads the charge to the cloud with words like “Lift and Shift”, please don’t be hesitant to laugh in their face. If you don’t, you may end up spending $350,000+ for what could otherwise cost you $8,000.

Andrew Pope is a Co-Founder of Effectual, Inc.
