Our previous post, Empower Shared Services Chargeback Models to Generate Better Business Outcomes, argued that deep, timely, accurate, and accessible platform usage data empowers shared services chargeback models to generate better business outcomes. This post goes a level deeper into cost controls and data placement. It describes the challenges of multi-temperature data management as it applies to financial services and shows how to take a simple next step: reviewing your data temperature today.
What is Data Temperature?
Data temperature is an expressive way to describe how often data is used. Hot data is accessed most frequently and cold data is seldom used, while warm and cool data fall in between. Fast data solutions are expensive, so priorities need to be set across enterprise-wide big data platforms based on how hot or cold the data is. The purpose of multi-temperature data management is to ensure the most cost-effective allocation of primary storage (fast) and secondary storage (not so fast).
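To make the idea concrete, here is a minimal sketch of how you might bucket data into temperature tiers by last-access age. The thresholds are illustrative assumptions, not recommendations; every organization tunes them to its own access patterns and retention requirements.

```python
from datetime import datetime, timedelta

# Illustrative thresholds only; tune them to your own access patterns.
TIERS = [
    ("hot",  timedelta(days=7)),     # touched within the last week
    ("warm", timedelta(days=30)),    # within the last month
    ("cool", timedelta(days=180)),   # within the last six months
]

def temperature(last_accessed: datetime, now: datetime | None = None) -> str:
    """Map a last-access timestamp to a data temperature tier."""
    now = now or datetime.utcnow()
    age = now - last_accessed
    for tier, limit in TIERS:
        if age <= limit:
            return tier
    return "cold"  # older than every threshold: a candidate for archive

# A file last read 90 days ago lands in the "cool" tier.
print(temperature(datetime.utcnow() - timedelta(days=90)))
```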
Hot data in financial services requires low-latency network and storage solutions (e.g., in-memory computing and time series databases). This is a critical component of algorithmic (algo) trading in capital markets, where trading decisions are calculated in real time on streaming market data. Hot data is also needed in personal banking scenarios, such as payments decisioning to mitigate fraud without inconveniencing the customer.
On the other hand, end-of-month treasury reporting is a cooler data access scenario, while the requirement to archive seven years of customer data gets pretty cold.
A holistic view is needed, one that considers a variety of architectures, such as microservices, in-memory computing, and HDFS storage best practices, even though those categories can't be compared apples to apples. Effective multi-temperature data management is only possible if you have detailed knowledge of how data is used across your enterprise. That in turn requires deep, timely, and accurate platform utilization data that is accessible to decision makers.
The Challenges of Multi-Temperature Data Management
It is pretty straightforward to allocate hot data to primary storage and cold data to secondary storage. The challenge of evaluating data storage economics comes when functions overlap and when warm and cool temperatures change, abruptly or over time.
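If your cluster uses HDFS heterogeneous storage, one way to act on a temperature classification is with HDFS's built-in storage policies. The sketch below is a minimal example, not a definitive implementation: the tier-to-policy mapping and the paths are illustrative assumptions, and it presumes a mix of SSD, DISK, and ARCHIVE volumes is configured.

```python
import subprocess

# Illustrative mapping from data temperature to HDFS's built-in storage
# policies. Assumes heterogeneous storage (SSD, DISK, and ARCHIVE volumes);
# the paths below are placeholders for your own directory layout.
POLICY_BY_TIER = {
    "hot":  "ALL_SSD",
    "warm": "ONE_SSD",
    "cool": "WARM",   # one replica on DISK, the rest on ARCHIVE
    "cold": "COLD",   # all replicas on ARCHIVE-class volumes
}

def set_policy(path: str, tier: str) -> None:
    """Tag an HDFS path with the storage policy matching its temperature."""
    subprocess.run(
        ["hdfs", "storagepolicies", "-setStoragePolicy",
         "-path", path, "-policy", POLICY_BY_TIER[tier]],
        check=True,
    )

set_policy("/data/marketdata/streaming", "hot")
set_policy("/data/archive/customer_history", "cold")

# Relocate existing blocks so they match their (possibly changed) policies.
subprocess.run(["hdfs", "mover", "-p", "/data"], check=True)
```

Because a policy is just metadata until the mover runs, the same mechanism handles a temperature change: update the policy, rerun the mover, and the blocks migrate.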
In the algo trading example above, not all data an algo uses is necessarily hot. Decisioning on real-time market conditions could refer to an economic data model that is rarely used elsewhere. This is a functional overlap: the reference data is cold, yet it is still important to a hot workload. Identifying a high-value application is not enough because, in this case, both hot and cold data are used.
With regard to changing data temperatures, consider an economic crisis. High-priority data used to generate revenue can temporarily become secondary to mitigating risk. What was cool yesterday is hot today, but it will revert when the economy stabilizes again.
To facilitate fast, sound decision making, you need a complete and current picture of how data is being utilized across your big data platform. Below, I use Pepperdata Platform Spotlight to explain how you would obtain and use the necessary data, because it provides a correlated view of your infrastructure and resource utilization.
Increase Operational Alpha
If you’re paying for hot data storage to hold cold data for any period of time, you’re losing money that could have been better allocated elsewhere. The data temperature dynamics described above can only be ascertained if you have deep, timely, accurate, and accessible platform utilization data, as described below:
- Depth and timeliness: Pepperdata instruments every node to continuously collect and correlate hundreds of real-time operational metrics: host-level CPU, RAM, disk I/O, and network metrics, as well as job, task, queue, workflow, and user metrics.
- Accuracy: Because the data is captured at the source, you will know actual utilization, not a calculated abstraction of it.
- Accessibility: Platform Spotlight makes all of the data available and usable via a streamlined UI, direct downloads, REST API, and data science reports.
Figure 1: The correlated view of infrastructure and resource utilization within Platform Spotlight. Pepperdata cluster performance monitoring includes real-time and historical information, including system demand, abusive users, and wasteful applications.
Leveraging the full set of capabilities described above facilitates cost-effective data placement decisions between primary and secondary storage that reflect real, current data temperature. It also goes a long way toward increasing operational alpha by improving operating margins while freeing up resources to focus on higher-value activities.
Simple Next Steps
While this post described some of the more sophisticated challenges of multi-temperature data management as it applies to financial services, the first step on this journey should be a simple one.
Right now, you probably don’t know the exact age of every file you have stored in HDFS, for example. You don’t know which files should be archived in cold storage and which files need to be moved to a faster data solution.
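As a quick, do-it-yourself sanity check, you can at least inventory file ages directly from HDFS metadata. The sketch below uses pyarrow (assuming libhdfs is available on the machine) and treats modification time as a conservative proxy for age; "/data" is a placeholder for your own root path.

```python
import time
import pyarrow.fs as pafs

# Connect using the local Hadoop configuration (fs.defaultFS);
# "/data" is a placeholder for your own root path.
hdfs = pafs.HadoopFileSystem(host="default")
infos = hdfs.get_file_info(pafs.FileSelector("/data", recursive=True))

now_ns = time.time_ns()
for info in infos:
    if info.type != pafs.FileType.File:
        continue
    # Modification time is a conservative proxy for age; true access times
    # require dfs.namenode.accesstime.precision to be enabled on the NameNode.
    age_days = (now_ns - info.mtime_ns) / (86_400 * 1_000_000_000)
    print(f"{info.path}\t{info.size} bytes\t{age_days:.0f} days old")
```

Feeding each file's age into a classification like the one sketched earlier gives a first-cut temperature inventory, but it stops well short of knowing which jobs and users actually touch the data.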
Pepperdata Platform Spotlight provides a comprehensive data temperature report (see figures 2-4), which gives you:
- The age and size of your HDFS data files
- The exact file names for each temperature
- Which files don’t match their current policy based on access times
Take the simple next step and review your data temperature today.
Furthermore, to understand how the Pepperdata integrated toolset enables finance and big data success through DevOps and Platform Ops collaboration, read this solution brief, “Optimize Data Analytics in Finance.”