Spark SQL, Hold the Hadoop

THE APACHE SPARK processing engine is often paired with Hadoop, helping users to accelerate analysis of data stored in the Hadoop Distributed File System. But Spark can also be used as a standalone big data platform. That’s the case at online marketing and advertising services provider Sellpoints Inc.—and it likely wouldn’t be possible without the technology’s Spark SQL module.

Sellpoints initially used a combination of Hadoop and Spark running in the cloud to process data on the Web activities of consumers for analysis by its business intelligence (BI) and data science teams. But in early 2015, the Emeryville, Calif., company converted to a Spark system from Databricks, also cloud-based, to streamline its architecture and reduce technical support issues.

Benny Blum, vice president of product and data at Sellpoints, said the analysts there use a mix of Spark SQL and the Scala programming language to set up extract, transform and load (ETL) processes for turning the raw data into usable information.

The BI team in particular leans heavily on Spark SQL since it doesn’t require the same level of technical skills as Scala does—some BI analysts do all of their ETL programming with the SQL technology, according to Blum.

“Spark SQL is really an enabler for someone who’s less technical to work with Spark,” he explained. “If we didn’t have it, a platform like Databricks wouldn’t be as viable for our organization, because we’d have a lot more reliance on the data science and engineering teams to do all of the work.”

Sellpoints collects hundreds of millions of data points from Web logs on a daily basis, amounting to a couple of terabytes per month. The raw data is streamed into an Amazon Simple Storage Service data store. It is then run through the ETL routines in Spark to convert it into more understandable, metrics-based formats and to translate it for output to Tableau’s business intelligence software, which is used to build reports and data visualizations for the company’s corporate clients.
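The pipeline Blum describes, raw web-log events transformed with SQL into reporting-ready metrics, follows a common pattern. The sketch below illustrates that SQL transform step using Python's stdlib sqlite3 as a stand-in for a Spark cluster; the table names, columns, and sample events are all hypothetical.

```python
import sqlite3

# Hypothetical raw web-log events, as they might land from a log stream.
raw_events = [
    ("2015-03-01", "client_a", "pageview"),
    ("2015-03-01", "client_a", "click"),
    ("2015-03-01", "client_b", "pageview"),
    ("2015-03-02", "client_a", "pageview"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_logs (event_date TEXT, client TEXT, event_type TEXT)")
conn.executemany("INSERT INTO raw_logs VALUES (?, ?, ?)", raw_events)

# The "transform" step: aggregate raw events into a metrics table that a
# BI tool such as Tableau could consume directly.
conn.execute("""
    CREATE TABLE daily_metrics AS
    SELECT event_date,
           client,
           COUNT(*) AS total_events,
           SUM(CASE WHEN event_type = 'click' THEN 1 ELSE 0 END) AS clicks
    FROM raw_logs
    GROUP BY event_date, client
""")

for row in conn.execute("SELECT * FROM daily_metrics ORDER BY event_date, client"):
    print(row)
```

In Spark SQL the same aggregation query could be submitted against a DataFrame registered as a temporary view, which is what lets less technical analysts express the whole transform in SQL rather than Scala.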

Spark SQL isn’t a perfect match for standard SQL at this point. “There are certain commands that I expect to be there that aren’t there or may be there but under a different name,” Blum said. Despite such kinks, the technology is familiar enough to get the job done, he noted, adding, “If you know SQL, you can work with it.” 


What we mean by “data lake” and DW “augmentation”

When we say “data lake,” we’re referring to a centralized repository, typically in Hadoop, for large volumes of raw data of any type from multiple sources. It’s an environment where data can be transformed, cleaned and manipulated by data scientists and business users. A “managed” data lake is one that uses a data lake management platform to manage ingestion, apply metadata and enable data governance so that you know what’s in the lake and can use the data with confidence.

“Augmentation” means enhancing what you already have, not starting from scratch. With a data warehouse augmentation, you keep your data warehouse and existing BI tools, but add a complementary, integrated data lake. For example, you could use a complementary data lake to prepare datasets and then feed them back into a traditional data warehouse for business intelligence analysis, or to other visualization tools for data science, data discovery, analytics, predictive modeling and reporting.
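The feed-back flow described above, preparing a dataset in the lake and then loading the result into the warehouse, can be sketched with two SQLite databases standing in for the lake and the warehouse. All table names and payloads here are hypothetical illustrations of the pattern, not any particular product's API.

```python
import sqlite3

# Two in-memory stand-ins: the "lake" holds raw, loosely structured records;
# the "warehouse" holds a curated table with a predefined schema.
lake = sqlite3.connect(":memory:")
lake.execute("CREATE TABLE raw_orders (payload TEXT)")
lake.executemany(
    "INSERT INTO raw_orders VALUES (?)",
    [("client_a|120.50",), ("client_b|75.00",), ("client_a|30.25",)],
)

# Prepare a clean dataset in the lake: parse and aggregate the raw payloads.
prepared = lake.execute("""
    SELECT substr(payload, 1, instr(payload, '|') - 1) AS client,
           SUM(CAST(substr(payload, instr(payload, '|') + 1) AS REAL)) AS revenue
    FROM raw_orders
    GROUP BY client
""").fetchall()

# Feed the prepared dataset back into the warehouse's conforming schema,
# where existing BI tools can query it as usual.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE fact_revenue (client TEXT, revenue REAL)")
warehouse.executemany("INSERT INTO fact_revenue VALUES (?, ?)", prepared)

for row in warehouse.execute("SELECT * FROM fact_revenue ORDER BY client"):
    print(row)
```

The point of the pattern is the division of labor: heavy parsing and aggregation happen on cheap lake storage, and only the small, conforming result lands in the warehouse.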

Why DW augmentation? 

We find that companies typically consider a DW augmentation project for two scenarios: 

Blue sky – You want to be able to do new things, beyond the capabilities of the data warehouse. This could include supporting specific business use cases for more advanced big data analytics or data science to find new insights or generate revenue; for example, with new products and services or through improved, more personalized customer experience. 

Cut costs – You want to continue doing what you’re already doing with your data warehouse, but do it cheaper using commodity hardware.

Reference Architecture

What could a data warehouse augmentation look like in your environment? Let’s review some architecture diagrams.

The differences between a traditional data warehouse architecture and a data lake are significant. A DW is fed data from a broad variety of enterprise applications. Each application’s data has its own schema, so the data must be transformed to conform to the DW’s predetermined schema. Because it is designed to collect only data that is quality-controlled and conforms to an enterprise data model, the DW can answer only a limited range of questions. Further, storing and processing all data in the data warehouse is cost prohibitive.

Typically, an organization will augment a data warehouse with a data lake to reduce storage costs. The data lake is fed information in its native form, with little or no processing performed to adapt its structure to an enterprise schema. The data can be stored on commodity hardware rather than expensive proprietary hardware, and required data can be pulled from the lake into the data warehouse as needed. While this model provides significant cost savings, it does not take advantage of the strategic business improvements that a data lake can provide.

Reference Diagram #3: Data Warehouse Augmentation and Offload Diagram

The biggest advantage of data lakes is flexibility. Because data can remain in its native format, a far greater stream of data is available for analysis. When an organization enlists the data lake to offload expensive data processing, in addition to storage, the entire business can benefit from more timely access to more data.