In this week’s blog, we’ll describe various approaches to successful big data chargeback as observed by Pepperdata from our unique vantage point providing performance management solutions for hundreds of big data platforms around the world.
Big data chargeback models have been fluid since the earliest deployments of Hadoop. A long-standing question has been: what is the most effective way to quantify and recoup the cost of providing a big data platform to multiple tenants? The answer has changed over time as enterprises continue trying various, evolving approaches. As you’ll see, the answer is neither simple nor static.
The Old Way
Hadoop itself is synonymous with big data, so it’s no surprise that early chargeback models were driven by the amount of data an individual or group stored in the Hadoop Distributed File System (HDFS). Initially, charging by the amount of data stored was fairly effective. Early on, groups or individuals were pulling in datasets for manipulation and managing the space they were allotted as standard practice. Data lifecycle management as a big data practice began to break this model.
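A storage-based model like the one described above can be sketched in a few lines. This is a hypothetical illustration only: the tenant names, usage figures, and flat per-terabyte rate are all assumptions, and in practice the byte counts would come from a tool such as `hdfs dfs -du`.

```python
# Hypothetical sketch of the old storage-based chargeback model:
# each tenant is billed for the HDFS space it occupies.

# Assumed sample figures: bytes stored per tenant
hdfs_usage_bytes = {
    "marketing": 12 * 1024**4,  # 12 TB
    "fraud":      4 * 1024**4,  #  4 TB
    "etl_svc":   30 * 1024**4,  # 30 TB
}

RATE_PER_TB_MONTH = 25.00  # assumed flat monthly rate per TB

def storage_chargeback(usage_bytes, rate_per_tb=RATE_PER_TB_MONTH):
    """Return a per-tenant monthly bill keyed by tenant name."""
    return {
        tenant: round(size / 1024**4 * rate_per_tb, 2)
        for tenant, size in usage_bytes.items()
    }

print(storage_chargeback(hdfs_usage_bytes))
# → {'marketing': 300.0, 'fraud': 100.0, 'etl_svc': 750.0}
```

Note that this model bills whoever "owns" the bytes, which is exactly the assumption that breaks down once shared ETL pipelines own most of the data.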
When big data platforms hit their early growth curves, businesses took a critical look at what was driving the growth. They found data duplication and replicated processing pipelines, both of which had a strong negative impact on ROI for the platform as a whole. To tackle these problems, production processes were stood up to ingest and prepare data for multiple tenants, enabling over-arching data management without the waste. Extract, transform, and load (ETL) processes became the norm, at which point production or service accounts began to “own” the majority of the data in these environments. The old storage-based chargeback models needed to be revisited.
The New Way
When individuals no longer own the bulk of the data, the model naturally shifts to charging for the use of the data as the primary chargeback method. Charging tenants for the use of data required quantifying the workloads that were run on any data set. Resource management in Hadoop relies heavily on the segmentation of access to CPU and RAM. Given the modern state of data management, charging tenants for the amount of CPU and RAM they used to extract value from data became the norm.
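A usage-based model of this kind can be sketched as an aggregation over finished workloads. The per-unit rates and the sample application records below are assumptions for illustration; the `vcore_seconds` and `mb_seconds` fields mirror the vcore-seconds and memory-seconds counters that YARN reports per application.

```python
# Sketch of a resource-based chargeback model: tenants are billed
# for the CPU and RAM their workloads consumed, not for data owned.

VCORE_SECOND_RATE = 0.00002    # assumed price per vcore-second
MB_SECOND_RATE    = 0.0000001  # assumed price per MB-second

def resource_chargeback(apps):
    """Aggregate per-tenant cost from a list of finished applications.

    Each app is a dict with 'tenant', 'vcore_seconds', and 'mb_seconds'.
    """
    bills = {}
    for app in apps:
        cost = (app["vcore_seconds"] * VCORE_SECOND_RATE
                + app["mb_seconds"] * MB_SECOND_RATE)
        bills[app["tenant"]] = bills.get(app["tenant"], 0.0) + cost
    return {tenant: round(cost, 2) for tenant, cost in bills.items()}

# Hypothetical usage: two jobs from one tenant, one from another
apps = [
    {"tenant": "fraud",     "vcore_seconds": 500_000, "mb_seconds": 2_000_000_000},
    {"tenant": "fraud",     "vcore_seconds": 100_000, "mb_seconds":   500_000_000},
    {"tenant": "marketing", "vcore_seconds": 250_000, "mb_seconds": 1_000_000_000},
]
print(resource_chargeback(apps))
```

Because the bill follows the workload rather than the dataset, seven groups reading the same transaction data each pay only for the compute they actually consume.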
This new approach has a number of advantages over the old model. First, you eliminate any gray areas that arise when data ownership is not clear cut. Should the fraud prevention group own the transaction data they use to train their models when that same data is used by seven other groups within the business? No need to answer this question if you charge by resources used. Second, this system is a lot harder to game. Believe it or not, tenants will adjust usage patterns to incur less cost. In the extreme, there are cases where the data owner becomes aware of the audit schedule and deletes or compresses data only to reingest or expand it again after the measurements are taken. Better to incentivize the more efficient use of CPU and RAM, as this behavioral adjustment actually improves the overall health of the environment.
The Next Way?
If there’s one constant in technology, it’s change. The current best practice may not be the best way after the next wave of complexity or innovation. Enterprises are already exploring new ways to improve chargeback along two key vectors:
- What is a tenant?
- Can we change how a tenant is defined on the