What we mean by “data lake” and DW “augmentation”

When we say “data lake,” we’re referring to a centralized repository, typically in Hadoop, for large volumes of raw data of any type from multiple sources. It’s an environment where data can be transformed, cleaned and manipulated by data scientists and business users. A “managed” data lake is one that uses a data lake management platform to manage ingestion, apply metadata and enable data governance so that you know what’s in the lake and can use the data with confidence.

“Augmentation” means enhancing what you already have, not starting from scratch. With a data warehouse augmentation, you keep your data warehouse and existing BI tools, but add a complementary, integrated data lake. For example, you could use a complementary data lake to prepare datasets and then feed them back into a traditional data warehouse for business intelligence analysis, or to other visualization tools for data science, data discovery, analytics, predictive modeling and reporting.

Why DW augmentation? 

We find that companies typically consider a DW augmentation project for two scenarios: 

Blue sky – You want to be able to do new things, beyond the capabilities of the data warehouse. This could include supporting specific business use cases for more advanced big data analytics or data science to find new insights or generate revenue; for example, with new products and services or through improved, more personalized customer experience. 

Cut costs – You want to continue doing what you’re already doing with your data warehouse, but do it cheaper using commodity hardware.

Reference Architecture

What could a data warehouse augmentation look like in your environment? Let’s review some architecture diagrams.

The differences between a traditional data warehouse architecture and data lakes are significant. An DW is fed data from a broad variety of enterprise applications. Naturally, each application’s data has its own schema, requiring the data to be transformed to conform to the DW’s own predetermined schema. Designed to collect only data that is controlled for quality and conforming to an enterprise data model, the DW is capable of answering only a limited number of questions. Further, storing and processing all data in the data warehouse is cost prohibitive.

Typically an organization will augment a data warehouse with a data lake in order to enjoy a reduction in storage costs. The data lake is fed information in its native form and little or no processing is performed for adapting the structure to an enterprise schema. The data can be stored on commodity hardware, rather than expensive proprietary hardware. Required data can be pulled from the lake to leverage in the data warehouse. While this model provides significant cost savings it does not take advantage of the strategic business improvements that a data lake can provide. 

Reference Diagram #3: Data Warehouse Augmentation and Offload Diagram

The biggest advantage of data lakes is flexibility. By allowing the data to remain in its native format, a far greater stream of data is available for analysis. When an organization enlists the data lake to offload expensive data processing, in addition to storage, the entire business can benefit from more timely access to more data.