Six Key Takeaways from Spark Summit and DataWorks Summit

It’s been an extremely busy, but very productive, couple of weeks. Pepperdata sponsored two critically important events attended by many of our clients and target customers.

Thousands of developers, data engineers, data scientists, and business professionals attended each event. Spark Summit is dubbed the world’s largest event for the Apache Spark community, and DataWorks Summit is one of the industry’s premier big data community events. Combined, these events offered the opportunity to participate in hundreds of formal presentations as well as countless informal conversations.

Events like these let us hear directly from clients and prospects on a range of issues. What types of projects are they working on? What works well? Where do they struggle? What are their primary concerns today, tomorrow, and in the future?

I wanted to share six key takeaways from both events based on the conference sessions that I attended, as well as direct conversations that I had with attendees:

  • Formerly known as the Hadoop Summit, DataWorks Summit has expanded its sphere of influence beyond Hadoop. This includes a greater focus on Spark and the use cases enabled by Spark like machine learning, predictive analytics, and artificial intelligence. The focus on Spark and use cases matches our own focus and what we observe within our own customer base.
  • There is great momentum behind Spark as observed at both events. Spark was a hot topic in keynotes, in breakout sessions, at the booth, and in the line for coffee. Customers understand the benefits of Spark, but face challenges in writing applications that deliver great performance. Many of them are new to Spark.
  • Developers are using Spark to quickly develop projects with sample data on small development clusters, but face significant challenges when deploying those projects on large production clusters with production scale data.
  • DevOps for Big Data strongly resonated with participants at both events who expressed the corresponding need for tools that accelerate the DevOps cycle from code to monitor.
  • As customers maximize their Big Data investments, there is a corresponding growth in multitenancy usage, and an acute need expressed by operators for tools that can help them deliver performance across clusters.
  • Developers and operators are coming to grips with the importance of understanding “cluster weather” when looking at the performance of any single application. Cluster weather is a term we use at Pepperdata to describe the performance impact, at any given point in time, of all the applications vying for resources on a cluster, along with the health of the cluster resources available to an application of interest. We observe cluster weather as a combined view of all Spark and Hadoop applications running on the cluster and the health of all of its nodes. What many customers think of as a “Big Data” problem is often a DevOps issue that requires a detailed understanding of cluster performance and the impact of individual applications in a multitenant environment. Look for a blog post focused on cluster weather in the coming weeks.

Based on hundreds of conversations over the past couple of weeks, I am confident that Pepperdata is well positioned to serve these industry trends. Pepperdata collects fine-grained time-series data across the full stack, combined with active automatic controls that maximize cluster utilization, to show the performance impacts of developing and running Big Data clusters. Other tools don’t provide performance views spanning applications, nodes, and clusters that identify where problems originate. Several customers revealed that prior to selecting Pepperdata, they spent months using other tools to troubleshoot performance problems in their clusters without diagnosing the root cause. In most cases, Pepperdata can pinpoint the root cause of performance issues within days.

To help customers solve the problem of scaling Spark applications from development to production clusters, we recently announced Code Analyzer for Apache Spark, which identifies the lines of code, and related stages, in applications that cause performance issues tied to CPU, memory, garbage collection, network, and disk I/O consumption. This is just the latest addition to our comprehensive portfolio, which shortens time to production, increases cluster ROI on-premises or in the cloud, and improves communication and resolution of performance issues between Dev and Ops.

To learn more about the value Pepperdata provides, tune into our Webinar series and contact us directly if you have any questions.
