Pepperdata Increases CPU Utilization by 157% When Compared to AWS Auto Scaling

TPC-DS on Amazon EMR Benchmark Results

Introduction

The Pepperdata 2021 Benchmark Report demonstrates the efficacy of Pepperdata Capacity Optimizer compared to the AWS Custom Auto Scaling Policy. The benchmarking work in this report uses TPC-DS, an industry-standard big data benchmarking workload, and measures the following:

• Instance hours
• CPU utilization
• Memory utilization
• Price/performance, defined as the cost to run the workload divided by the time it took to run (see the formula below)

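In equation form, the price/performance metric used throughout this report is:

\[
\text{price/performance} = \frac{\text{cost to run the workload}}{\text{workload runtime}}
\]
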
The purpose of this report is to demonstrate the performance improvements Capacity Optimizer can deliver using TPC-DS, a documented standard workload. Pepperdata ran the benchmarks “out of the box” and did not modify or recompile them using any special libraries.

Key Findings

This report highlights three groups of findings that demonstrate that Capacity Optimizer can automatically:

• Optimize resource utilization, as measured by an increase in CPU and memory utilization and a decrease in instance hours.
• Decrease overall duration.
• Reduce costs by up to 50%.

On average, Capacity Optimizer decreased overall duration by 8% and instance hours by 38%, while increasing CPU utilization by 157% and memory utilization by 38% compared to AWS Custom Auto Scaling for the TPC-DS workload.

Big Data Benchmarking in the Cloud

Benchmarking is the process of running a set of standard tests against some object to produce an assessment of that object’s relative performance. Imagine driving three different sports cars on the same course and measuring each car’s maximum speed, torque, and fuel consumption to compare the overall performance of the three cars.

This report covers our initial work with TPC-DS, the decision support benchmark from the Transaction Processing Performance Council (TPC). TPC-DS is a sophisticated, industry-standard big data analytics benchmark developed over decades and is a de facto standard for SQL-based big data systems, including Hadoop. Our work is not an official audited benchmark as defined by the TPC.

The TPC-DS workload consists of three distinct disciplines: Database Load, Query Run, and Data Maintenance. The query run is executed twice, once before and once after the data maintenance step. Each query run executes the same 99 query templates with different variables in permuted order, thereby simulating a workload of multiple concurrent users accessing the system.

Resource Utilization Results

On average, Capacity Optimizer decreased overall duration by 8% and instance hours by 38%, while increasing CPU utilization by 157% and memory utilization by 38% compared to AWS Custom Auto Scaling for the TPC-DS workload:

[Charts: instance hours, CPU utilization, and memory utilization]

Cost Ratio

Pepperdata compared the cost of the two options for each of the 103 queries. In the following chart, the baseline represents the average cost of the queries on TPC-DS data using AWS Custom Auto Scaling. Queries where Capacity Optimizer provided savings are represented by bars below the baseline. More than 90% of the queries run with Capacity Optimizer showed savings versus running the AWS Custom Auto Scaling policy alone.

[Chart: cost ratio by query]

In this set of runs, the five queries where Capacity Optimizer did not perform as well as the AWS Custom Auto Scaling Policy tend to be the simpler queries: TPC-DS Query 6, Query 8, Query 10, Query 20, and Query 98. Capacity Optimizer excelled on the most complicated queries, which reflect the most demanding real-world environments.

Overall Duration

Although overall duration was not a primary metric of interest in evaluating the effectiveness of Capacity Optimizer, we observed an approximately 8% decrease in the overall runtime of the entire suite of 103 queries, with overall duration decreasing from 3.62 hours to 3.33 hours.

[Chart: overall duration]

At the individual query level, the duration difference ranged from a 38% increase to a 74% decrease with Capacity Optimizer, as shown in the following graph. As in the previous chart, queries where Capacity Optimizer decreased the duration are represented by bars below the baseline.

[Chart: duration ratio by query]

Conclusion

Using the industry-standard TPC-DS benchmark, Pepperdata demonstrated that Capacity Optimizer provides an uplift over the AWS Custom Auto Scaling Policy in both CPU and memory utilization while decreasing instance hours and the overall time to run the entire suite of 103 queries.

As more companies migrate big data workloads to the cloud, these findings have important implications for cost and resource management. A recent survey conducted by Pepperdata identified budget control as a primary challenge for organizations moving to the cloud. With minimal effort, Capacity Optimizer can help big data workloads in the cloud run more efficiently, resulting in substantial cost and resource savings. This helps make the cloud an even more attractive and viable option for big data workloads.

Sign up to try out the Pepperdata solution free for 30 days, and see how much simpler managing your big data applications in the cloud can be.

Appendix: Methodology and Configurations

This appendix presents the methodology and configurations we used.

Methodology

The methodology used in our benchmarking study focused on three variables:

1. The full suite of 103 TPC-DS queries
2. An AWS Custom Auto Scaling configuration selected at random from among those used by our prospects and customers who have implemented AWS Custom Auto Scaling on their Amazon EMR clusters
3. A consistent application of Pepperdata Capacity Optimizer

We created two 3 TB TPC-DS datasets in Amazon EMR, stored in Amazon S3. Each dataset contains approximately 6.35 billion records across 24 tables. We ran the full query suite twice with Capacity Optimizer and twice without, for a total of four runs, and then averaged the results within each configuration. A summary of the four runs is shown in the following table:

AWS Custom Auto Scaling Policy          Pepperdata Capacity Optimizer
Run 1                                   Run 3
Run 2                                   Run 4
Results: Average of Run 1 and Run 2     Results: Average of Run 3 and Run 4

To prevent the cluster from caching previous results, we created a new cluster between each run.
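
In AWS CLI terms, the cadence between runs looks roughly like the following (the cluster ID shown is a placeholder, not one from these runs):

    # Tear down the cluster from the previous run so no cached state carries over
    aws emr terminate-clusters --cluster-ids j-XXXXXXXXXXXXX

    # Then provision a fresh cluster for the next run; see "Example AWS CLI Export"
    # below for a sketch of the corresponding create-cluster command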

AWS Settings

EMR 6.1 with Spark 3 was used in all four runs. The AWS Custom Auto Scaling rules were selected at random and, we believe, are representative of what a typical AWS customer might use when setting up a new cluster. There was no customization of the AWS settings for the workloads. (See the AWS Custom Auto Scaling Policy section below for further details.)

Pepperdata Settings

Pepperdata used the default Capacity Optimizer configuration in the standard XML file that we provide to our prospects and customers implementing Capacity Optimizer in their environments.

AWS Cluster Setup

We used the following AWS configuration and instances:

Release label: emr-6.1.0
Hadoop distribution: Amazon 3.2.1
Applications: Hive 3.1.2, Spark 3.0.0

Master: 1 m5.4xlarge
Core: 8 m5.4xlarge
Task: 1 m5.4xlarge

Example AWS CLI Export

[Screenshot: example AWS CLI export]
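
As a rough guide, an equivalent aws emr create-cluster command based on the cluster setup above might look like the following sketch. The cluster name, key pair, subnet, and region are placeholder assumptions for illustration, not values from the benchmark:

    aws emr create-cluster \
      --name "tpcds-benchmark" \
      --release-label emr-6.1.0 \
      --applications Name=Hadoop Name=Hive Name=Spark \
      --instance-groups \
        InstanceGroupType=MASTER,InstanceType=m5.4xlarge,InstanceCount=1 \
        InstanceGroupType=CORE,InstanceType=m5.4xlarge,InstanceCount=8 \
        InstanceGroupType=TASK,InstanceType=m5.4xlarge,InstanceCount=1 \
      --use-default-roles \
      --ec2-attributes KeyName=my-key-pair,SubnetId=subnet-0123456789abcdef0 \
      --region us-east-1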

AWS Custom Auto Scaling Policy

We used the following AWS Custom Auto Scaling Policy settings inside AWS:

[Screenshot: AWS Custom Auto Scaling Policy settings]
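
For illustration, a custom auto scaling policy of this general shape is attached to an instance group as JSON via the AWS CLI. The following is a minimal sketch; the thresholds, capacities, and IDs are assumptions, not the rules used in the benchmark:

    # Attach a simple scale-out rule to an instance group of a running cluster
    # (cluster and instance-group IDs are placeholders)
    aws emr put-auto-scaling-policy \
      --cluster-id j-XXXXXXXXXXXXX \
      --instance-group-id ig-XXXXXXXXXXXX \
      --auto-scaling-policy '{
        "Constraints": { "MinCapacity": 1, "MaxCapacity": 8 },
        "Rules": [{
          "Name": "ScaleOutOnLowMemory",
          "Description": "Add one node when available YARN memory drops below 15 percent",
          "Action": {
            "SimpleScalingPolicyConfiguration": {
              "AdjustmentType": "CHANGE_IN_CAPACITY",
              "ScalingAdjustment": 1,
              "CoolDown": 300
            }
          },
          "Trigger": {
            "CloudWatchAlarmDefinition": {
              "ComparisonOperator": "LESS_THAN",
              "EvaluationPeriods": 1,
              "MetricName": "YARNMemoryAvailablePercentage",
              "Namespace": "AWS/ElasticMapReduce",
              "Period": 300,
              "Statistic": "AVERAGE",
              "Threshold": 15.0,
              "Unit": "PERCENT"
            }
          }
        }]
      }'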

Appendix: Raw Data

This appendix presents the raw data we gathered.

[Table: raw benchmark data]

(*) Note: We were not able to determine instance hours directly from the metrics we collected, so we calculated them from resource usage and instance size. We considered two ways to calculate instance hours:
1. Instance GB-hours / instance memory (GB)
2. Instance core-hours / instance cores
Both yielded similar values. To create this table, we took the maximum of the two in case there was any difference.
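
As a worked sketch of that calculation (the usage totals below are hypothetical, not the benchmark's; an m5.4xlarge provides 16 vCPUs and 64 GiB of memory):

    # Hypothetical aggregate usage for one run on m5.4xlarge nodes (16 vCPUs, 64 GiB)
    GB_HOURS=6400        # instance GB-hours observed
    CORE_HOURS=1664      # instance core-hours observed

    BY_MEM=$(echo "scale=2; $GB_HOURS / 64" | bc)    # 6400 / 64 = 100.00 instance hours
    BY_CPU=$(echo "scale=2; $CORE_HOURS / 16" | bc)  # 1664 / 16 = 104.00 instance hours

    # Take the larger of the two estimates, as described in the note above
    echo "$BY_MEM $BY_CPU" | awk '{ if ($1 > $2) print $1; else print $2 }'   # -> 104.00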


Individual Query Cost

[Tables: individual query cost, parts 1–9]

Individual Query Duration

[Tables: individual query duration, parts 1–9]