Thousands of developers, data engineers, data scientists, and business professionals attended Strata Data Conference last week in New York City. Events like these allow us to speak directly with customers, partners, and prospects on a range of issues: What Big Data projects are they working on? What are their biggest challenges and concerns? What trends are they seeing in the industry?
I wanted to share some of my observations and takeaways from last week, based on the conference sessions I attended and on the direct conversations the Pepperdata team had with attendees:
- AI was a dominant trend. We saw far more discussion of AI, and far more vendors in the AI space, than a couple of years ago, when storage, networking, and related infrastructure topics were the prominent ones.
- Hadoop isn’t a buzzword anymore. As Alex Woodie wrote and others echoed, “Hadoop was hard to find at Strata this week.” People were talking about Spark, the data lake concept, and applications, not about Hadoop. Whether it’s “dead” or just assumed to be part of the landscape, it’s no longer a hot topic.
- Debugging Spark is harder than debugging MapReduce. Nearly everyone we talked to about Spark development mentioned that Spark has enabled many more people to use Big Data, and that writing applications is faster and easier with Spark than with MapReduce, because Spark abstracts away the execution details. But that same abstraction hides the execution details, which makes it much harder to debug Spark applications and to achieve good performance.
- More people are writing Spark code with Python and SQL than with Scala. Another factor contributing to the larger population of Spark users vs. MapReduce users is the widespread use of Python and (especially) SQL for writing Spark applications and queries. Spark SQL has a large and growing user base of analysts who would previously have been using traditional data warehouses.
- Many companies are moving to the cloud, or at least a hybrid model. Whether it’s using a service like Amazon EMR, Google Cloud Platform, Databricks, or Cloudera Altus, or spinning up their own clusters using cloud instances, we talked to many companies that are moving to a cloud-only Big Data deployment or using the cloud for elasticity and experimentation.
- Many deployments are moving to a decoupled storage model. As networks become increasingly fast and moving the computation to the data becomes less important, many companies are moving to an architecture where compute and storage are decoupled. This approach allows each to be scaled independently, reducing cost and making it easier to add, remove, or move compute nodes.
- Orchestration frameworks like Kubernetes and Mesos are on the horizon. We heard a lot of interest in Kubernetes in particular, especially since companies are starting to move to it as the underpinning of their general IT framework, beyond Big Data. As part of the open source community effort to enable Spark on Kubernetes, Pepperdata has been spearheading the HDFS on Kubernetes work. Our engineer Kimoon Kim gave a great talk on that topic, diving deep into several of the technical challenges we’ve faced.
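A recurring example of the Spark debugging pain mentioned above is lazy evaluation: a bug in a transformation doesn't surface until an action finally runs the job, often far from the code that introduced it. As a minimal sketch of that effect (a plain-Python analogy using generators, not actual Spark code), a lazily declared pipeline fails only at consumption time:

```python
# Analogy for Spark's lazy evaluation, using plain Python generators
# (no Spark required). "Transformations" are declared lazily; the bug
# only surfaces when the "action" consumes the pipeline.

def build_pipeline(records):
    # Transformation step: declared here, but nothing executes yet.
    # The bug (division by zero when rate == 0) is latent at this point.
    return (total / rate for total, rate in records)

records = [(100, 4), (90, 3), (50, 0)]  # the last record is bad
pipeline = build_pipeline(records)      # no error is raised here

try:
    # Action step: only now does the pipeline actually run, and the
    # traceback points at the consumption site rather than at the line
    # that defined the pipeline -- the same effect that makes a Spark
    # job's failure hard to trace back to the offending transformation.
    results = list(pipeline)
except ZeroDivisionError:
    results = ["failed during consumption"]

print(results)
```

In real Spark the gap is wider still: the failing transformation may run on a remote executor, with a stack trace full of framework internals, which is exactly why attendees described Spark as easier to write but harder to debug than MapReduce.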
This was my fifth year making the trip out to Strata NY, and it’s been exciting to see the market mature from the early days of “how do I make this stuff work?” to rich discussions of interesting applications and how people are using Big Data to solve real problems in the real world.