Spark SQL, Hold the Hadoop

THE APACHE SPARK processing engine is often paired with Hadoop, helping users to accelerate analysis of data stored in the Hadoop Distributed File System. But Spark can also be used as a standalone big data platform. That’s the case at online marketing and advertising services provider Sellpoints Inc., and it likely wouldn’t be possible without the technology’s Spark SQL module.

Sellpoints initially used a combination of Hadoop and Spark running in the cloud to process data on the Web activities of consumers for analysis by its business intelligence (BI) and data science teams. But in early 2015, the Emeryville, Calif., company converted to a Spark system from Databricks, also cloud-based, to streamline its architecture and reduce technical support issues.

Benny Blum, vice president of product and data at Sellpoints, said the analysts there use a mix of Spark SQL and the Scala programming language to set up extract, transform and load (ETL) processes for turning the raw data into usable information.

The BI team in particular leans heavily on Spark SQL since it doesn’t require the same level of technical skills as Scala does—some BI analysts do all of their ETL programming with the SQL technology, according to Blum.

“Spark SQL is really an enabler for someone who’s less technical to work with Spark,” he explained. “If we didn’t have it, a platform like Databricks wouldn’t be as viable for our organization, because we’d have a lot more reliance on the data science and engineering teams to do all of the work.”

Sellpoints collects hundreds of millions of data points from Web logs on a daily basis, amounting to a couple of terabytes per month. The raw data is streamed into an Amazon Simple Storage Service data store. It is then run through the ETL routines in Spark to convert it into more understandable metrics-based formats and to translate it for output to Tableau’s business intelligence software. The software is used to build reports and data visualizations for the company’s corporate clients.
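
To give a rough sense of what a SQL-only transform step of this kind can look like, here is a minimal sketch that rolls raw Web log events up into daily per-client metrics. It is not Sellpoints' code. The table name, column names and sample rows are all hypothetical, and Python's built-in sqlite3 module stands in for a Spark SQL session so the example can run anywhere. The query sticks to basic aggregate SQL, which would be expected to carry over to Spark SQL largely unchanged, though that is an assumption rather than something verified here.

```python
import sqlite3

# Hypothetical raw Web log events: (event_date, client_id, visitor_id, action).
# These rows are invented for illustration only.
RAW_EVENTS = [
    ("2015-06-01", "client_a", "v1", "view"),
    ("2015-06-01", "client_a", "v1", "click"),
    ("2015-06-01", "client_a", "v2", "view"),
    ("2015-06-01", "client_b", "v3", "view"),
    ("2015-06-02", "client_a", "v1", "view"),
]

# The "transform" step, written entirely in SQL: one output row per day
# and client, with event, unique-visitor and click counts.
DAILY_METRICS_SQL = """
    SELECT event_date,
           client_id,
           COUNT(*) AS events,
           COUNT(DISTINCT visitor_id) AS visitors,
           SUM(CASE WHEN action = 'click' THEN 1 ELSE 0 END) AS clicks
    FROM raw_events
    GROUP BY event_date, client_id
    ORDER BY event_date, client_id
"""


def build_daily_metrics(rows):
    """Load raw events into an in-memory table and return the daily metrics."""
    # sqlite3 is a stand-in here; it is not Spark.
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE raw_events ("
            "event_date TEXT, client_id TEXT, visitor_id TEXT, action TEXT)"
        )
        conn.executemany("INSERT INTO raw_events VALUES (?, ?, ?, ?)", rows)
        return conn.execute(DAILY_METRICS_SQL).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in build_daily_metrics(RAW_EVENTS):
        print(row)
```

The point of keeping the whole transform in one SQL string is that an analyst who knows SQL, but not Scala, could read and change it without touching the surrounding code.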

Spark SQL isn’t a perfect match for standard SQL at this point. “There are certain commands that I expect to be there that aren’t there or may be there but under a different name,” Blum said. Despite such kinks, the technology is familiar enough to get the job done, he noted, adding, “If you know SQL, you can work with it.”