What is Scalability in Cloud Computing

Apache Spark has quickly become a widely adopted open source framework for large-scale data processing. Last year, more than 1,000 organizations were using Spark in production. According to the project’s FAQ, many run Spark on clusters of thousands of nodes, processing up to petabytes of data. Another indicator of its growth: according to Indeed, Spark-related job postings are up more than 10x over the past two years.

But beyond the impressive numbers, what users say about production Spark speaks volumes about its impact.

We recently invited experts to take part in a “Production Spark” webinar series to share their thoughts on trends, challenges, use cases and the future of production Spark. The first conversation in the series featured panelists from SAP/Altiscale, Clearsense, and Silicon Valley Data Science, along with my co-founder Sean Suchter.

The topic “sparked” an interactive and lively discussion with our panel, and we’ve distilled some of the highlights (edited for brevity):

Spark’s popularity is driven by features like easy integration, built-in machine learning and support for streaming data.

Dr. Babak Behzad, Senior Software Engineer, SAP/Altiscale:
“I would say the most important key feature that Spark has is its integration with different ecosystems under the same API and programming model. So you have data flow transformations and the ETL ecosystem, you have SQL, Spark SQL. There is streaming (Spark Streaming) and then you have graph processing for graph computations.”

Charles Boicey, Chief Innovation Officer, Clearsense:
“Within healthcare, we have a lot of streaming data, so the ability to actually handle and utilize streaming data and its compatibility with YARN is huge. Probably the two most important Spark use cases for us are to be able to build our offline models and predictive models in R, and then directly apply them within a Spark environment. Also Scala built-in for machine learning is a benefit, and then the ability for Spark to handle multi-tenancy, allowing multiple clients to use the same instance.”

Sean Suchter, Co-founder and CTO, Pepperdata:
“Spark handles many different use cases in one engine. You can do batch analytics, you can do ETL, you can do the streaming analytics and streaming processing. And you can also run interactive analytics, and within one familiar wrapper. So the flexibility of Spark is one important feature. And the other thing that has helped it take off is the programming model.”

Andrew Ray, Principal Data Engineer, Silicon Valley Data Science:
“I think Spark has become so popular because of its ease of use. My favorite feature is DataFrames, allowing you to seamlessly transition between both SQL and programmatic manipulation of your data going back and forth in one API. I think this is amazing from an ease of use standpoint.”

Spark is driving interesting and breakthrough applications in healthcare and retail.

Charles Boicey, Chief Innovation Officer, Clearsense:
“Healthcare affects us all. We have patients in high acute areas, ICUs, operating rooms, and so forth that are connected to monitoring systems with real-time output (streaming heart rate, blood pressures, respirations, temperatures, and ventilator parameters). What we don’t have in healthcare is for a team to be on each individual patient. By streaming data and using the capabilities of Spark, we’re able to build models within the Spark e