This blog is the first in a series that introduces big data developers to Pepperdata Application Summary. Application Summary is the first in a series of Pepperdata guided application performance management (APM) user experiences. In these experiences, we solve a particular user problem (or use case) by providing all the relevant information, insights, and calls to action in one place so that the user can perform these tasks easily and quickly.
What is Application Summary?
Before I start the tour, let me first introduce Application Summary, a self-service performance solution created for application developers of Spark, MapReduce, and other big data applications. When we talk about application performance, we mean running applications faster, using fewer resources, or, when errors occur, quickly getting to their root cause and mitigating them. For developers who want to make their applications perform better, we target the following use cases:
- Find my applications easily
- Get meaningful recommendations for improving application performance
- Identify system bottlenecks that affect application performance
- Easily determine the root cause of application failures
Let’s start with finding applications by using the App Search function.
Based on user feedback, we simplified the search options so you can more easily search for all the applications running on your cluster or just specific ones that you are interested in. Either way, you can optionally specify a time range for your search, as well as an application’s full or partial name. If you want to narrow down your search to just one user or one queue, you can specify that as well. And, you can save your searches to use later so you don’t have to re-enter the same search criteria. Let’s see this in action. I’m going to specify “ScalaPageRank” as my specific app name, “prod” as the user name, and “root.prod” for the queue.
After I clicked the Find Matching Apps button, App Search returned eight results. The search criteria are displayed, and the results appear in tabular form that can be sorted by column; in this instance, we sorted by start time. You can compare the stats of any two runs of an app to see why one run took significantly longer than another, or how the performance characteristics changed as the result of a small code or operational parameter change.
Another APM feature of Application Summary is the ability to alert on duration or on peak memory usage. You can click the alarm icon in either column heading—I clicked on the duration alarm icon—and it opens a pane where I can set an alarm for future runs of this app, such as those that exceed a threshold of 25 minutes. This means that for any future runs of the app, if it takes more than 25 minutes to run, I receive an alert. If an application has an important SLA associated with it, you can use this feature to set associated alarms.
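Under the hood, a duration alarm like this reduces to a threshold comparison on each future run of the app. Here is a minimal sketch of that idea; the function and names are illustrative, not Pepperdata APIs:

```python
from datetime import timedelta

# Illustrative SLA threshold matching the 25-minute example above.
SLA_THRESHOLD = timedelta(minutes=25)

def check_duration_alarm(run_duration: timedelta,
                         threshold: timedelta = SLA_THRESHOLD) -> bool:
    """Return True if this run's duration breaches the alarm threshold."""
    return run_duration > threshold

# A 28-minute run breaches the 25-minute SLA; a 20-minute run does not.
print(check_duration_alarm(timedelta(minutes=28)))  # True
print(check_duration_alarm(timedelta(minutes=20)))  # False
```

In practice the product watches future runs for you, but the decision each time is this same simple comparison against the threshold you configured.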
Returning to the tabular search results, let’s click the App ID for one of the ScalaTeraSort apps so we can take a look at its Application Summary. For this demonstration, I’m using ScalaTeraSort, a Spark application. There are three sections of Application Summary that I’ll discuss. After I do that, I’ll discuss Pepperdata recommendations as they relate to APM.
To start with, we have the header, which gives the app name, the app type, and the user who ran it. It answers questions like, “In which queue did the app run?”, “How long did it run?”, and “How many resources did it consume?” In this case, the app took 87 percent of the cluster memory and held it for 18 minutes. It also took 53 percent of the CPU, likewise for 18 minutes.
The second section of Application Summary, Issues, gathers all the issues related to the app into one place and provides actionable recommendations for improving performance. By “issues”, we mean alarms, bottlenecks, and status and error information specific to the type of app, such as Spark or MapReduce. By working through the tabs from left to right, following the recommendations, and addressing the root causes of the identified bottlenecks and app failures, you can address all aspects of APM.
Let’s start with the Recommendations tab. Our aim is to provide specific advice that users can understand and act on. For example, in the screenshot above, we show that your application experienced Spark executor shuffle read bytes skew and that you need to increase the number of data partitions. We also tell you that you can achieve this by using the RDD repartition transformation or by decreasing the cluster’s dfs.block.size value.
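As a rough illustration of this kind of skew check (a hypothetical heuristic, not Pepperdata’s actual detection logic), shuffle read bytes skew can be flagged by comparing the largest partition’s shuffle reads against the median across partitions:

```python
from statistics import median

def has_shuffle_skew(partition_bytes, ratio=2.0):
    """Flag skew when the largest partition reads far more than the median.

    The 2x ratio is an arbitrary illustrative threshold.
    """
    med = median(partition_bytes)
    return med > 0 and max(partition_bytes) / med >= ratio

# Hypothetical per-partition shuffle read bytes: one partition dominates.
shuffle_reads = [120, 130, 125, 118, 900]
print(has_shuffle_skew(shuffle_reads))  # True
```

In Spark itself, the recommended fix is to raise the partition count, for example with `rdd.repartition(numPartitions)`, so that the skewed data is spread across more tasks.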
The third section of the Application Summary shows stats—the underlying data for the second section’s recommendations and issues. In the Resource Usage tab, the first thing that we show is how much memory is being wasted by the application. Right now the severity levels correspond to hard-coded thresholds, but our goal is to tune the thresholds for your particular environment so that you know whether this application falls within the norm of all of the other applications running on your cluster. Next, we provide charts that show resource usage over the lifetime of the app’s run. For example, in the Memory Used by Type chart, we break down the memory by total, heap, non-heap, and new I/O, which is of particular interest to Spark developers. So if your app asks YARN for a portion of cluster or queue memory, we’ll tell you how much was allocated in terms of memory and CPU, and how much was actually used.
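One hedged way to picture the wasted-memory figure (the exact Pepperdata formula isn’t shown here; this is just the allocated-versus-used gap the text describes) is:

```python
def wasted_memory_pct(allocated_mb: float, peak_used_mb: float) -> float:
    """Percent of the YARN allocation that the app never actually used."""
    return 100.0 * (allocated_mb - peak_used_mb) / allocated_mb

# Hypothetical numbers: 8 GiB allocated, 3 GiB peak usage.
print(round(wasted_memory_pct(8192, 3072), 1))  # 62.5
```

A severity level then follows from comparing this percentage against a threshold, hard-coded today and ideally environment-specific later, as the text notes.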
Another chart in the Resource Usage tab is App Container Asks, which provides insight into the lifecycle of your application: “What’s the backlog of these asks, and what is running?” If you have a significant backlog, you know that the app is going to be constrained and therefore take longer to run. In this example, there was very little backlog, and the app got the majority of its containers as soon as it asked for them. The other tab in the Stats section is App History. It provides key metrics such as runtime duration and peak memory usage for the app’s current run and five previous runs.
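The backlog idea behind the App Container Asks chart can be sketched as outstanding requests minus granted containers at each point in time (the numbers below are hypothetical, not real chart data):

```python
# Cumulative containers requested vs. allocated at successive sample times.
asked   = [10, 40, 40, 40, 40]
granted = [10, 38, 40, 40, 40]

# Backlog at each instant: requests YARN has not yet satisfied.
backlog = [a - g for a, g in zip(asked, granted)]
print(backlog)  # [0, 2, 0, 0, 0]
```

A chart like the one in this example shows the backlog staying near zero, meaning the scheduler granted containers almost as fast as the app asked for them.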
Returning to the Issues section of Application Summary, I’d like to briefly walk through the Bottlenecks tab to talk about the types of bottlenecks that can occur for an application:
- The app could be running on nodes that are CPU bound. We say that a node is CPU bound if it is pegged at 95 percent CPU usage or higher. If your app ran on CPU bound nodes for 80 percent of its runtime, we say that the app experienced a CPU bound bottleneck.
- The app could be spending a lot of time doing garbage collection (GC), which is an intrinsic determinant of application performance. If your app spent more than 25 percent of its runtime doing GC, we say that the app experienced a GC bottleneck.
- The app could be idle a lot of the time, just waiting for the scheduler to launch it. If the app was idle for more than 30 percent of its runtime, and the runtime was longer than ten minutes, we say that the app experienced a scheduling delay bottleneck.
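Taken together, the three definitions above amount to a simple classification rule. The following sketch applies those thresholds literally; the function and parameter names are illustrative, not Pepperdata’s:

```python
def classify_bottlenecks(runtime_min: float,
                         pct_on_cpu_bound_nodes: float,
                         gc_pct: float,
                         idle_pct: float) -> list:
    """Apply the rule-of-thumb bottleneck thresholds described above."""
    issues = []
    # App spent >= 80% of its runtime on nodes pegged at >= 95% CPU.
    if pct_on_cpu_bound_nodes >= 80:
        issues.append("CPU bound")
    # App spent more than 25% of its runtime in garbage collection.
    if gc_pct > 25:
        issues.append("GC")
    # App was idle more than 30% of a runtime longer than ten minutes.
    if idle_pct > 30 and runtime_min > 10:
        issues.append("scheduling delay")
    return issues

# The example app below: just over ten minutes, 99.12% of it spent waiting.
print(classify_bottlenecks(10.5, 0, 5, 99.12))  # ['scheduling delay']
```

Note that a single run can trip several of these checks at once, which is why the Bottlenecks tab lists each type separately.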
In this example, the app ran for just over ten minutes. But almost all of that time (99.12 percent) was spent just waiting, so this bottleneck is highlighted in red in the Bottlenecks tab. Now that we’ve seen how to display bottlenecks that affect your app performance, let’s look at app status and, specifically, information about failures. Next, I’m going to show you a Spark application.
It’s important to note that as far as YARN is concerned, this app finished with the same successful status, COMPLETED, as the previous examples. However, when we look at the job history, we see that the app consisted of one job, which failed—meaning that the application itself effectively failed. The Spark tab in the Issues section summarizes the failures.
In this case, there were 65 failures, which we break down by jobs, stages, executors, and tasks. It’s much easier to determine the root cause of the job failures by using this contextual breakdown than by navigating through the Spark Web UI and analyzing the many log files. In addition, we’ve translated the complicated stack traces into plain English, which is much easier to act on. In this case, the job failed because a stage failed four consecutive times, and the stage failed because an executor it was relying on failed. This example showed how much easier it is to diagnose a Spark failure by using Application Summary than by working directly with the stack traces from Spark. And when you’ve used Application Summary to trace a Spark application’s failure down to the root cause, you can use Pepperdata Code Analyzer for Apache Spark to further diagnose such failures and resolve them.
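Conceptually, this breakdown is just a grouping of failure events by the level at which they occurred. The sketch below uses made-up counts that sum to 65 purely for illustration—the real split among jobs, stages, executors, and tasks comes from the Spark event data:

```python
from collections import Counter

# Hypothetical failure events for one app run (counts are illustrative).
failures = (["job"] * 1 + ["stage"] * 4 + ["executor"] * 10 + ["task"] * 50)

# Group the 65 failure events by level, as the Spark tab does.
counts = Counter(failures)
print(sum(counts.values()))  # 65
```

Presenting the counts per level first, and only then drilling into individual stack traces, is what makes the root cause (here, a failing executor cascading up through stage retries to a job failure) quick to spot.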
Thank you for taking this tour of Pepperdata Application Summary. To recap, we demonstrated:
- An easy-to-use, effective application search function that lets you save your searches
- How to set up an alert for future runs of an app, which is useful for scenarios where there’s an associated SLA
- Easy, actionable recommendations for improving app performance
- How to learn about system bottlenecks that affect application performance
- Using the consolidated errors information in Application Summary, derived from stack traces, metrics data, and log files, to pinpoint exactly which part of an application failed
Pepperdata works closely with customers to understand their unique requirements, improve our products, and provide the best user experience. Please look for upcoming blogs in this series. I look forward to working with you and making our products more useful and valuable to you.
Things that you can do next: