Apache Spark has quickly become an open source framework with widespread appeal for large-scale data processing. Last year, more than 1,000 organizations were using Spark in production, and many run it on clusters of thousands of nodes, processing up to petabytes of data, according to the project’s FAQ. Another indicator of its growth: according to Indeed, Spark-related job postings are up more than 10x over the past two years.
But beyond the impressive numbers, what users say about production Spark speaks volumes about its impact.
We recently invited experts to take part in a “Production Spark” webinar series to share their thoughts on trends, challenges, use cases and the future of production Spark. The first conversation in the series featured experts from SAP/Altiscale, Clearsense, and Silicon Valley Data Science, along with my co-founder Sean Suchter.
The topic “sparked” an interactive and lively discussion with our panel, and we’ve distilled some of the highlights (edited for brevity):
Spark’s popularity is driven by features like easy integration, built-in machine learning and support for streaming data.
Dr. Babak Behzad, Senior Software Engineer, SAP/Altiscale:
“I would say the most important feature Spark has is its integration of different ecosystems under the same API and programming model. So you have data flow transformations for ETL, you have SQL (Spark SQL), there is streaming (Spark Streaming), and then you have graph processing for graph computations.”
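To make that concrete, here’s a minimal sketch of the “one engine, several ecosystems” idea. It uses Spark 2.x APIs; the paths and data are our own hypothetical illustration, not code from the panel:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.graphx.{Edge, Graph}

val spark = SparkSession.builder.appName("one-engine").getOrCreate()
val sc = spark.sparkContext

// ETL-style transformation with the DataFrame API (hypothetical path)
val logs = spark.read.json("hdfs:///data/logs")
val errors = logs.filter("level = 'ERROR'")

// The same data queried through Spark SQL
errors.createOrReplaceTempView("errors")
spark.sql("SELECT host, count(*) AS n FROM errors GROUP BY host").show()

// Graph processing (GraphX) in the same application, same programming model
val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, "links-to")))
Graph(vertices, edges).degrees.collect()
```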
Charles Boicey, Chief Innovation Officer, Clearsense:
“Within healthcare, we have a lot of streaming data, so the ability to actually handle and utilize streaming data, and its compatibility with YARN, is huge. Probably the two most important Spark use cases for us are being able to build our offline and predictive models in R, and then to apply them directly within a Spark environment. The machine learning built into Spark (in Scala) is also a benefit, as is Spark’s support for multi-tenancy, allowing multiple clients to use the same instance.”
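Clearsense builds its models in R, but for readers curious what Spark’s built-in machine learning looks like, here’s a hedged MLlib sketch. The dataset, columns, and model choice are hypothetical and ours, not Clearsense’s:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

// Hypothetical training data: vitals plus a label for the event of interest
val vitals = spark.read.parquet("hdfs:///data/vitals_labeled")

// Assemble raw columns into the feature vector MLlib expects
val assembler = new VectorAssembler()
  .setInputCols(Array("heart_rate", "resp_rate", "systolic_bp"))
  .setOutputCol("features")

val lr = new LogisticRegression().setLabelCol("label")

// Fit once offline, then apply the fitted model to new data
val model = new Pipeline().setStages(Array(assembler, lr)).fit(vitals)
val scored = model.transform(spark.read.parquet("hdfs:///data/vitals_new"))
```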
Sean Suchter, Co-founder and CTO, Pepperdata:
“Spark handles many different use cases in one engine. You can do batch analytics, you can do ETL, you can do streaming analytics and stream processing, and you can also run interactive analytics, all within one familiar wrapper. So the flexibility of Spark is one important feature, and the other thing that has helped it take off is the programming model.”
Andrew Ray, Principal Data Engineer, Silicon Valley Data Science:
“I think Spark has become so popular because of its ease of use. My favorite feature is DataFrames, which let you seamlessly transition between SQL and programmatic manipulation of your data, going back and forth in one API. I think this is amazing from an ease-of-use standpoint.”
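Here’s a quick sketch of that back-and-forth; the dataset and column names are our own hypothetical illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("dataframes-sketch").getOrCreate()

val orders = spark.read.parquet("hdfs:///data/orders")  // hypothetical path

// Programmatic manipulation with the DataFrame API...
val recent = orders.filter(col("order_date") >= "2017-01-01")

// ...then SQL over the same data, in the same API...
recent.createOrReplaceTempView("recent_orders")
val totals = spark.sql(
  "SELECT store_id, sum(amount) AS total FROM recent_orders GROUP BY store_id")

// ...and back to programmatic again
totals.orderBy(desc("total")).show(10)
```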
Spark is driving interesting and breakthrough applications in healthcare and retail.
Charles Boicey, Chief Innovation Officer, Clearsense:
“Healthcare affects us all. We have patients in high-acuity areas, ICUs, operating rooms, and so forth that are connected to monitoring systems with real-time output (streaming heart rate, blood pressure, respirations, temperature, and ventilator parameters). What we don’t have in healthcare is a dedicated team watching each individual patient. By streaming data and using the capabilities of Spark, we’re able to build models within the Spark environment that alert clinicians to changes in a patient’s condition well before humans can detect them. When we look outside of healthcare, as more and more medical IoT devices become available, this is a perfect environment to process that information as well. Spark resides at the center and allows us to do much more than we could when we used Storm. It lends itself to being the center of the Big Data ecosystem.”
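As an illustration of that pattern (not Clearsense’s actual pipeline), here’s a minimal Structured Streaming sketch; the Kafka topic, payload format, and alert threshold are all hypothetical, and it assumes the spark-sql-kafka package is available:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("vitals-alerts").getOrCreate()

// Streaming vitals arriving via Kafka (assumed ingestion layer)
val vitals = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")  // hypothetical broker
  .option("subscribe", "patient-vitals")             // hypothetical topic
  .load()
  .selectExpr("CAST(value AS STRING) AS line")
  // Assume a simple CSV payload: patient_id,heart_rate
  .select(split(col("line"), ",").getItem(0).as("patient_id"),
          split(col("line"), ",").getItem(1).cast("double").as("heart_rate"))

// Stand-in "model": in practice this would be a fitted model's transform()
val alerts = vitals.filter(col("heart_rate") > 140)

// Surface alerts continuously; a real pipeline would notify clinicians
alerts.writeStream.format("console").start().awaitTermination()
```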
Richard Williamson, Principal Engineer, Silicon Valley Data Science:
“A few years ago we were helping a major retailer in the Midwest re-architect its inventory platform. The platform was based on Cassandra, and they were running into some consistency issues. We were also using lightweight transactions at that point, so the only place we could come back and reconcile the discrepancies we were seeing was in Spark. We built a batch pipeline to load the data in from Cassandra, do the reconciliation, and then write the corrections back out.”
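In rough outline, that kind of pipeline might look like the following sketch. It assumes the DataStax spark-cassandra-connector is on the classpath; the keyspace, tables, and reconciliation rule are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("inventory-reconciliation").getOrCreate()

// Helper for reading a Cassandra table as a DataFrame via the connector
def cassandraTable(keyspace: String, table: String) =
  spark.read.format("org.apache.spark.sql.cassandra")
    .options(Map("keyspace" -> keyspace, "table" -> table))
    .load()

// Load both sides, find discrepancies, and write corrections back out
val ledger   = cassandraTable("retail", "inventory_ledger")
val snapshot = cassandraTable("retail", "inventory_snapshot")

val corrections = ledger.join(snapshot, Seq("sku", "store_id"))
  .where(ledger("qty") =!= snapshot("qty"))          // the discrepancy rule
  .select(col("sku"), col("store_id"), ledger("qty").as("qty"))

corrections.write.format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "retail", "table" -> "inventory_snapshot"))
  .mode("append")
  .save()
```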
Spark Streaming is bringing data to the table in real time, which benefits applications like retail POS.
Richard Williamson, Principal Engineer, Silicon Valley Data Science:
“With retail, the main dataset most people start with in this context is probably the point-of-sale data coming into the system from stores or the online sales channels. In the past, legacy data warehouses have operated at either a daily or an hourly bucket level for that data. What Spark Streaming now brings to the table is the ability to bring that data into the environment and start processing it in real time. It’s typically coupled with something like Kafka on the ingestion side and a persistence layer like Kudu or Cassandra to store it.”
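A minimal Structured Streaming version of that pipeline might look like this. The broker, topic, and paths are hypothetical, we assume the spark-sql-kafka package, and a Parquet sink stands in for the Kudu/Cassandra persistence layer to keep the sketch self-contained:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("pos-stream").getOrCreate()

// Ingest point-of-sale events from Kafka as they arrive
val pos = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")  // hypothetical broker
  .option("subscribe", "pos-transactions")           // hypothetical topic
  .load()
  .selectExpr("CAST(value AS STRING) AS line", "timestamp")

// Process continuously instead of waiting for an hourly/daily batch window
val query = pos.writeStream
  .format("parquet")                          // stand-in persistence layer
  .option("path", "hdfs:///data/pos")         // hypothetical path
  .option("checkpointLocation", "hdfs:///checkpoints/pos")
  .start()

query.awaitTermination()
```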
Experts offer tips for production Spark.
Andrew Ray, Principal Data Engineer, Silicon Valley Data Science:
“If you’ve been primarily just doing MapReduce on your YARN cluster, you might have a maximum container size that’s too low for optimal usage in Spark. Spark, unlike MapReduce, likes to have large containers, where you have dozens of cores and tens of gigabytes of RAM all in one container for an executor. You want to make sure you set your maximum allocation in YARN to enable that. Second, you might want to consider upgrading the RAM on the nodes in your cluster if it’s older hardware and you don’t have, say, 128 gigabytes on those nodes, because more memory can really help with performance in Spark.”
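On the Spark side, the executor requests then need to fit under YARN’s maximum allocation (yarn.scheduler.maximum-allocation-mb and -vcores in yarn-site.xml). A hedged sketch, with numbers that are illustrative rather than recommended:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("big-executors")
  // Ask YARN for large containers; these values must fit under YARN's
  // maximum-allocation settings or the container request will be rejected.
  .config("spark.executor.memory", "24g")
  .config("spark.executor.cores", "12")
  // Off-heap headroom on top of the JVM heap (Spark 1.x/2.x property name)
  .config("spark.yarn.executor.memoryOverhead", "4096")
  .getOrCreate()
```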
Sean Suchter, Co-founder and CTO, Pepperdata:
“A flipside to our ‘make sure you give your containers enough RAM’ advice is that once you have done that, you might want to do at least one round of making sure you don’t give them too much. You might range up to make sure you have enough, but then inspect how much they’re really using and dial it back down to what they actually need, because it’s very easy to configure a massive excess you don’t actually need. This is especially true because JVMs will run up to however much RAM you give them, and the garbage collector will just keep them at that line. So pay attention to how much memory Spark says it actually needs and dial yourself back down, so you don’t waste cluster resources.”
Dr. Babak Behzad, Senior Software Engineer, SAP/Altiscale:
“As you migrate from MapReduce to Spark, you can keep running the MapReduce jobs you haven’t yet ported, and start migrating application by application on the same cluster with Spark on YARN. We have seen very good performance for Spark on YARN, so you should be able to do this migration step by step rather than stopping all the applications and trying to migrate them all at once.”
Andrew Ray, Principal Data Engineer, Silicon Valley Data Science:
“We definitely see our customers using multiple versions of Spark concurrently, for simple compatibility reasons. The first part is, Spark 1.x and 2.x are binary incompatible, and they also have different default versions of Scala, if you care about that. Everyone should at least have a version of 1.x and 2.x, but you might have more versions so that you can pin a production application to a certain version of Spark that you know it is good with, because you have tested it with that version and might not have gone to the expense of testing it with a newer one or migrating it. At my previous employer, we actually had pretty much every released version of Spark installed on the cluster concurrently. This is no big deal with YARN: whatever the current latest version is becomes the default people see, but if they specify the right path, they can get whatever version they want. It’s seamless with YARN, because it just submits the application with that version of Spark, and all the jars get shipped along with it.”
Sean Suchter, Co-founder and CTO, Pepperdata:
“Spark has this nice interactive mode where you can start up a Spark client, whether through tools like Zeppelin or Jupyter or just through the interactive Spark CLI. You start your client, load up all the data you’re trying to operate on (in RDDs, or DataFrames now), and then you can very quickly run lots of little queries against it, like an interactive analytics session. The reason it’s so fast is that you’ve loaded your data into RAM, and you’ve got tens or hundreds or thousands of processes on the cluster with all this data cached, which is really great when you’re running your queries. The problem is if you do this and then walk away – I kid you not, ‘go to lunch’ seems to be the common mode – your processes don’t know when you might come back and start typing, so they hold that RAM for you and keep it from everyone else on the cluster. We’ve multiple times had people use the Pepperdata product to literally find those people so that the operators can call them and say, ‘Hey, are you at lunch? Okay, I’m going to kill your job.’”
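For what it’s worth, the polite exit from that scenario is a one-liner. A small sketch (the path is hypothetical; the same applies in spark-shell, Zeppelin, or Jupyter):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("interactive-session").getOrCreate()

val df = spark.read.parquet("hdfs:///data/clickstream")
df.cache()                            // pins the data in executor RAM across queries
df.filter("country = 'US'").count()   // fast: served from cached partitions

// Before going to lunch: free the cluster memory you are holding...
df.unpersist()
// ...or shut the session down entirely and release the executors
// spark.stop()
```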
The mix of benefits and challenges of running Spark in production is a timely topic, given the growing number of organizations that already have applications in production or are moving in that direction. It was fun to hear about the interesting use cases out there and to get some great advice from the experts.