This blog post touches on the broader themes of our whitepaper, Reducing the Runaway Costs of a Hybrid Big Data Architecture. To get the full story, download the whitepaper here.
Large-scale cloud migration begins for a number of reasons. Contributing factors include an on-prem data center reaching its capacity limit, aging hardware, and expiring licenses. The most common reason, though, is the frustration of internal, often customer-facing, teams.
Picture this: internal departments, pursuing their own KPIs, request more computing resources, putting pressure on the IT operations team to provide them. IT operations, meanwhile, receives a conflicting message from the CFO: minimize spend and be as streamlined as possible. And so the IT operations team ends up tearing its hair out, attempting to free up resources and meet requests from within the organization.
Legacy big data approaches often don’t leave IT operations teams with many options. In an on-prem data center, there is an inherent and internal limit to compute capacity. An on-prem data center will never double its capacity overnight, and any utilization gains are hard-won.
The cloud is seen as the obvious solution to this problem. With AWS, Azure, or Google Cloud, you face none of the baked-in limitations of an on-prem data center. The technical and internal bottlenecks of the legacy architecture vanish. Compute capacity is theoretically unlimited.
This is also where the trouble starts.
In the CapEx model (under which legacy, on-prem data centers operate), the balance sheet is very clear and projections are simpler. Traditionally, the CFO would oversee strict cost control mechanisms. Though this translated into constraints on compute capacity, the trade-off was watertight budgeting.
However, in the cloud-based OpEx paradigm, control over how money is spent suddenly becomes much looser and harder to define, because there is no hard capacity ceiling.
As cloud migration is usually a large-scale, exciting, multi-stakeholder internal project, enterprises throw resources at it. To unleash the full computing capacities of internal teams, the migration leaders are afforded a large budget. The engineers are delighted, loving the idea of new capacity, and for every internal team, an all-you-can-eat approach to resources sounds like the promised land.
An OpEx spending model + the infinite resources of the cloud = overspending disaster
In this brave new world, an engineer can spin up a hundred-node cluster in AWS on a Friday, forget about it and go home, and then discover on Monday that it racked up thousands in cloud costs over the weekend.
Although scenarios like the one above are all too common when organizations migrate, a cloud-based architecture can be more streamlined and cheaper if the processes and controls are strict and clear. There are ways to potentially get a 30-50% increase in enterprise cluster throughput. Put another way, the same workload then needs correspondingly fewer infrastructure resources.
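To make the link between throughput and infrastructure concrete, here is a minimal Python sketch of the arithmetic. It rests on a simplifying assumption that is ours, not the whitepaper's: for a fixed workload, the resources required scale inversely with cluster throughput. The function name `resource_reduction` is hypothetical and used only for illustration.

```python
def resource_reduction(throughput_gain: float) -> float:
    """Fraction of infrastructure no longer needed for a fixed workload.

    Assumption (illustrative only): required resources scale inversely
    with throughput, so a gain g means the same work needs 1 / (1 + g)
    of the original resources.
    """
    if throughput_gain <= -1.0:
        raise ValueError("throughput_gain must be greater than -1")
    return 1.0 - 1.0 / (1.0 + throughput_gain)


# The 30% and 50% throughput gains mentioned above.
for gain in (0.30, 0.50):
    saved = resource_reduction(gain)
    print(f"{gain:.0%} more throughput -> {saved:.1%} fewer resources")
```

Under this assumption, a 30-50% throughput gain works out to roughly 23-33% less infrastructure for the same workload. The reduction is a little smaller than the throughput gain itself, but it is still a meaningful share of the bill.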