Webinar: Proven Approaches to Hive Query Tuning

For those of us who are APM industry “insiders”, there’s not a lot of mystery when it comes to understanding the functionality and value of application performance management. We tend to take for granted that everyone else shares the same knowledge, but that’s not a wise assumption to make. So we thought it would be worthwhile to provide a brief APM tutorial for those who may not be fully familiar with the subject.

Application performance management (APM) is the monitoring and management of performance and availability of software applications. The primary goal of APM solutions is to maintain an expected level of service and availability by understanding why transactions in your application/workload are slow or failing. For example, a development or operations team can instantly tell from their APM solution if an application is causing some performance spikes. They can then leverage their APM solution to determine the root cause, identify which queries were affected and make appropriate adjustments to resolve the issue.

Here are some examples of common tasks that APM solutions can help teams carry out quickly:

  • Track application usage to understand spikes in traffic
  • Find bottlenecks or latency problems with application dependencies including SQL, queues, caching, etc.
  • Identify slow SQL queries
  • Find the highest volume and slowest web pages or transactions
  • Find the root cause(s) for application problems
  • Identify ways to optimize application performance

APM solutions perform a plethora of functions in the quest to optimize application performance, including:

  • Anomaly detection identifies events that do not conform to expected behavioral patterns, enabling a rapid response when application behavior begins to fall outside the normal operating range.
  • Application debugging is the process of finding the causes of undesirable application behavior. This requires identifying defects and errors in program code.
  • Distributed profiling of the many potential sources of application performance degradation. This involves identifying and localizing the sources of performance degradation across an ecosystem consisting of applications, services and machines.
  • IT service monitoring shows whether response time and uptime commitments are being met, including adhering to service level agreements and performance thresholds
  • Root cause analysis attempts to determine the probable cause of a problem, then constructs a causality chain correlating cause and effect.
  • End-user experience monitoring captures user-based performance data to gauge how well the application is performing and identify potential performance problems.
  • Application topology discovery and visualization is the visual expression of the application in a flow-map to establish all the different components of the application and how they interact with each other.
  • User-defined transaction profiling examines specific interactions to recreate conditions that lead to performance problems for testing purposes.
  • IT operations analytics is the discovery of usage patterns, identification of performance problems, and anticipation of potential problems before they happen.
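To make the anomaly detection item above more concrete, here is a minimal sketch of one simple approach: flagging samples whose z-score (distance from the mean in standard deviations) exceeds a cutoff. This is an illustration only, not how any particular APM product works; the function name `find_anomalies`, the sample latencies and the cutoff of 2.0 are all assumptions made for the example.

```python
from statistics import mean, stdev


def find_anomalies(samples, threshold=3.0):
    """Return (index, value) pairs whose z-score magnitude exceeds threshold.

    Hypothetical helper for illustration; real APM tools typically use
    more robust baselining than a plain z-score.
    """
    if len(samples) < 2:
        return []  # not enough data to estimate a spread
    mu = mean(samples)
    sigma = stdev(samples)
    if sigma == 0:
        return []  # all samples identical, nothing stands out
    return [
        (i, x)
        for i, x in enumerate(samples)
        if abs(x - mu) / sigma > threshold
    ]


# Made-up response times in milliseconds with one obvious spike.
latencies_ms = [120, 118, 125, 122, 119, 121, 900, 123, 117, 120]
anomalies = find_anomalies(latencies_ms, threshold=2.0)
print(anomalies)
```

Note that a single large outlier inflates the standard deviation itself, which is why this example uses a cutoff of 2.0 rather than 3.0. Median-based baselines are less sensitive to that effect.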

APM solutions also gather an incredible set of metrics as they continuously monitor the cluster and application environment. Here are just a few of the key metrics that are captured and analyzed.

Average Response Time – The amount of time an application takes to respond to a user's query or request. To measure average response time, an application is tested under different conditions (e.g. number of concurrent users, number of transactions requested). Typically, this metric is measured from the start of the request to the time the last byte is sent. Other factors, such as the geographic location of the user and the complexity of the information being requested, can also affect the average response time. These should all be considered in the overall evaluation of application performance.
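As a rough sketch of the measurement described above, the snippet below averages request durations (request start to last byte sent) and groups them by test scenario. The function names, the scenario labels and the timing values are hypothetical and exist only for illustration.

```python
from statistics import mean


def average_response_time(timings):
    """Average duration for a list of (start, last_byte_sent) pairs in seconds."""
    durations = [end - start for start, end in timings]
    return mean(durations)


def average_by_scenario(results):
    """Map each scenario name to its average response time."""
    return {
        name: average_response_time(timings)
        for name, timings in results.items()
    }


# Made-up timings for two load levels (seconds since the test began).
results = {
    "10_concurrent_users": [(0.0, 0.25), (1.0, 1.5), (2.0, 2.75)],
    "100_concurrent_users": [(0.0, 1.0), (1.0, 3.0)],
}
averages = average_by_scenario(results)
print(averages)
```

Comparing the averages across scenarios gives a first indication of how response time changes as load increases.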

Error Rates – The last thing you want your users to see is an error, which makes the error rate a critical application performance metric.
There are three common ways to track application errors:

  • HTTP Error % – Percentage of web requests that ended in an error
  • Logged Exceptions – Number of unhandled and logged errors from your application
  • Thrown Exceptions – Number of all exceptions that have been thrown

It is common to see thousands of exceptions being thrown and ignored within an application. Hidden application exceptions can also cause a lot of performance problems.
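The three error measures above could be tracked with a small counter object like the one sketched here. This is a sketch under stated assumptions: the `ErrorCounters` class is hypothetical, and treating every status code of 400 or above as an error is a choice made for the example (some teams count only 5xx responses).

```python
from dataclasses import dataclass


@dataclass
class ErrorCounters:
    """Hypothetical in-memory counters for the three error measures."""

    total_requests: int = 0
    error_requests: int = 0
    logged_exceptions: int = 0
    thrown_exceptions: int = 0

    def record_request(self, status_code):
        self.total_requests += 1
        # Assumption: any 4xx or 5xx status counts as an HTTP error.
        if status_code >= 400:
            self.error_requests += 1

    def record_exception(self, logged):
        # Every exception counts as thrown; only some are also logged.
        self.thrown_exceptions += 1
        if logged:
            self.logged_exceptions += 1

    @property
    def http_error_percent(self):
        if self.total_requests == 0:
            return 0.0
        return 100.0 * self.error_requests / self.total_requests


counters = ErrorCounters()
for status in [200, 200, 500, 404, 200, 200, 200, 200]:
    counters.record_request(status)
for was_logged in [True, False, False]:
    counters.record_exception(logged=was_logged)

print(counters.http_error_percent)
print(counters.thrown_exceptions - counters.logged_exceptions)
```

The gap between thrown and logged exceptions is a rough indicator of the hidden exceptions mentioned above.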

Request Rate – Understanding how much traffic your application receives is important to its success, because potentially every other application performance metric is affected by increases or decreases in traffic. Correlating the request rate with other application performance metrics helps you understand how your application scales. It is also worth watching the request rate for spikes or sudden inactivity: if a busy API suddenly gets no traffic at all, something is likely wrong. A similar but slightly different metric to track is the number of concurrent users.
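One simple way to watch for spikes or sudden inactivity is to bucket request timestamps into fixed windows and look for windows with unusually high or zero counts. The sketch below assumes timestamps in seconds and a 60-second window; the function name and the data are hypothetical.

```python
from collections import Counter


def requests_per_window(timestamps, window_seconds=60):
    """Count requests in each fixed-size window, including empty windows."""
    if not timestamps:
        return []
    counts = Counter(int(t // window_seconds) for t in timestamps)
    first = int(min(timestamps) // window_seconds)
    last = int(max(timestamps) // window_seconds)
    # Walk every window in range so silent periods show up as zeros.
    return [counts.get(window, 0) for window in range(first, last + 1)]


# Made-up request arrival times in seconds.
timestamps = [0, 10, 59, 60, 61, 185]
rates = requests_per_window(timestamps, window_seconds=60)
silent_windows = [i for i, count in enumerate(rates) if count == 0]

print(rates)
print(silent_windows)
```

Here the third window (index 2) received no traffic, which is the kind of gap that may deserve a closer look on a normally busy API.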

Application and Server CPU and Memory – If the CPU usage on your machines is extremely high, or you have limited memory resources, you can expect to eventually suffer from application performance problems. Monitoring the CPU and memory usage of your servers, cluster and applications is both basic and critical.

Application Availability – Whether your application is online and available is a key metric you should be tracking. Most companies use it to measure uptime for service level agreements (SLAs).
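As an illustration, availability can be estimated as the share of periodic health checks that succeeded, then compared with an SLA target. The helper names, the 99.9% default target and the check results below are assumptions for the example, not values from any specific SLA.

```python
def availability_percent(checks):
    """Percentage of health checks that succeeded (True means 'up')."""
    if not checks:
        raise ValueError("at least one health check result is required")
    successes = sum(1 for ok in checks if ok)
    return 100.0 * successes / len(checks)


def meets_sla(checks, target_percent=99.9):
    """Hypothetical SLA test: is measured availability at or above target?"""
    return availability_percent(checks) >= target_percent


# Made-up data: 1000 health checks, 2 of which failed.
checks = [True] * 998 + [False] * 2
uptime = availability_percent(checks)

print(uptime)
print(meets_sla(checks, target_percent=99.9))
print(meets_sla(checks, target_percent=99.5))
```

With two failed checks out of 1000, this sample would miss a 99.9% target but satisfy a 99.5% one.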

All metrics should be evaluated over time, and those that are critical should be fed into a rules engine that raises alerts when a set threshold is exceeded. Ultimately, all metrics can and should be used to understand what is normal for your application so that abnormal behavior can be detected, analyzed and resolved.
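The idea of a threshold-based rules engine can be sketched in a few lines. This is a deliberately minimal illustration: the `Rule` class, the metric names and the threshold values are hypothetical, and production alerting systems usually add features such as alert deduplication and time-based conditions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical alert rule: fire when a metric exceeds its threshold."""

    metric: str
    threshold: float
    message: str


def evaluate(rules, snapshot):
    """Return one alert string per rule whose threshold is exceeded."""
    alerts = []
    for rule in rules:
        value = snapshot.get(rule.metric)
        # Metrics missing from the snapshot are skipped, not treated as zero.
        if value is not None and value > rule.threshold:
            alerts.append(
                f"{rule.metric}={value} exceeds {rule.threshold}: {rule.message}"
            )
    return alerts


# Made-up rules and a made-up snapshot of current metric values.
rules = [
    Rule("cpu_percent", 90.0, "CPU usage is very high"),
    Rule("http_error_percent", 5.0, "too many failed web requests"),
]
snapshot = {"cpu_percent": 95.0, "http_error_percent": 1.2}
alerts = evaluate(rules, snapshot)

for alert in alerts:
    print(alert)
```

Only the CPU rule fires for this snapshot, because the error percentage is still below its threshold.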
