How to build and execute a data lakehouse strategy

By Jonny Dixon, Senior Product Manager at Dremio.

Wednesday, 13th September 2023

The data lakehouse has captured the hopes of modern enterprises looking to combine the best of data lakes with the best of data warehouses. Like a data lake, it consolidates multi-structured data in flexible object stores. And like a data warehouse, it transforms and queries data at high speed.

While still early in the adoption cycle, businesses that implement data lakehouses can streamline their architectures, reduce costs and strengthen governance of self-service analytics. From supporting a data mesh to providing a unified access layer for analytics and modernising data for the hybrid cloud, there are myriad use cases, with more to come.

But many don’t know where to start – and the risk of spending time and money only for it to go wrong is putting many off taking advantage of data lakehouses’ benefits. However, building and executing a data lakehouse strategy can be broken down into four simple steps.

Understand what a data lakehouse is, and isn’t

The data lakehouse seeks to combine the structure and performance of a data warehouse with the flexibility of a data lake. It’s a type of data architecture that uses data warehouse commands, often in structured query language (SQL), to query data lake object stores, on premises or in the cloud, at high speed.
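To make that concrete, the sketch below shows the kind of warehouse-style SQL a lakehouse engine can run directly against files in object storage. The table name, columns and storage path are hypothetical, and the exact DDL varies by engine (Spark, Dremio, Trino and others each have their own syntax), so treat this as illustrative rather than a specific product's dialect.

    -- Illustrative only: expose Parquet files in an object store as a table,
    -- then query them with ordinary warehouse-style SQL. Names and paths are
    -- hypothetical; the DDL shown follows a Spark-style syntax.
    CREATE TABLE sales_orders (
        order_id     BIGINT,
        customer_id  BIGINT,
        order_date   DATE,
        amount       DECIMAL(12, 2)
    )
    USING PARQUET
    LOCATION 's3://example-lake/sales/orders/';

    -- A BI-style aggregation run directly on the lake, with no extract or copy.
    SELECT order_date, SUM(amount) AS daily_revenue
    FROM sales_orders
    GROUP BY order_date
    ORDER BY order_date;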

The data lakehouse supports both data science workloads and business intelligence (BI) by running queries against both relational data and multi-structured data stored as files. The adoption of data lakehouses is driven by enterprises’ need to simplify how they meet exploding business demand for analytics.

By running fast queries directly on the object store within the data lake, enterprises don’t need to copy or move data to meet BI performance requirements. This reduces the need for data extracts and a separate data warehouse, which in turn minimises the pain of managing multiple copies and streamlines costs. Further, by supporting both BI and data science, the data lakehouse enables enterprises to consolidate workloads, as they no longer need two distinct platforms. And because the architecture and file formats remain open, the same data continues to interoperate with other tools.
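A hedged sketch of that interoperability: write the data once into an open table format (Apache Iceberg is used here as an example), and any engine that supports the format can query the same table without a copy being made. Table names are hypothetical and catalog configuration is engine-specific.

    -- Illustrative only: create an open-format (Iceberg) table in the lake
    -- from existing raw files. Any compatible engine can then read the same
    -- table; no proprietary copy is created.
    CREATE TABLE lakehouse.web_events
    USING ICEBERG
    PARTITIONED BY (event_date)
    AS
    SELECT event_id, user_id, event_type, event_date
    FROM raw_events_parquet;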

Prioritise your business use cases

Once you have a solid understanding of how a data lakehouse can benefit your enterprise, it's essential to define the most pressing business use cases. Only then can you identify the ‘quick wins’ and prioritise the architectural characteristics required to support them – such as being unified, simple, accessible, high performance, economical, governed or open.

Common use cases could include periodic reporting, interactive reports and dashboards, ad-hoc queries, 360-degree customer views or artificial intelligence and machine learning.

For example, a consumer packaged goods firm could implement a lakehouse platform on Microsoft Azure Data Lake Storage to eliminate duplicative data silos, thereby improving data quality to support supply chain analytics. Rather than replicating data from the lake into satellite data warehouses or BI extracts, the business could use the lakehouse to accelerate queries on the lake itself. Consolidating the environment in this fashion enables data teams to reduce duplicative copies, improve efficiency and support compliance.
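One sketch of what that consolidation might look like in practice: instead of exporting an extract for the BI tool, the team defines a reusable view over tables already sitting in the lake. The schema, table and column names below are hypothetical, and the SQL is generic rather than tied to any one engine.

    -- Illustrative only: a governed view over lake tables replaces a BI extract.
    -- supply_chain.shipments is assumed to be a table backed by files in ADLS.
    CREATE VIEW supply_chain.on_time_delivery AS
    SELECT
        s.distribution_centre,
        DATE_TRUNC('week', s.ship_date)           AS ship_week,
        COUNT(*)                                  AS shipments,
        AVG(CASE WHEN s.delivered_date <= s.promised_date
                 THEN 1.0 ELSE 0.0 END)           AS on_time_rate
    FROM supply_chain.shipments s
    GROUP BY s.distribution_centre, DATE_TRUNC('week', s.ship_date);

BI tools and ad-hoc users then query the view directly; the underlying data never leaves the lake.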

Tackle your first project

Having identified the priority use cases, plan and execute the first project to support the highest priority. With the right stakeholders assembled – including an executive sponsor, data analyst or scientist, data engineer, architect and governance manager – a roadmap can be created and implemented in order to incrementally change the environment.

For example, to support 360-degree customer views, the team might migrate semi-structured customer data from HDFS to a cloud object store for the target lakehouse. Or to provide distinct views of consolidated datasets to the marketing, trading, supply chain management, and ecommerce business units, the team might spin up departmental data lakehouses without replicating or migrating datasets within the enterprise object store. This would enable the departments to build their own business insights and support their own projects, while using the same query interface and semantic layer as other teams.
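A minimal sketch of that second pattern, assuming the consolidated customer data already lives as a single table in the enterprise object store: each department gets its own schema of views over the shared table, so nothing is replicated and everyone queries through the same semantic layer. All names here are hypothetical.

    -- Illustrative only: departmental views over one shared, consolidated dataset.
    -- shared.customers_360 is assumed to be a single table in the object store.
    CREATE VIEW marketing.customer_segments AS
    SELECT customer_id, segment, lifetime_value, last_campaign_response
    FROM shared.customers_360;

    CREATE VIEW ecommerce.customer_baskets AS
    SELECT customer_id, last_order_date, avg_basket_value, preferred_channel
    FROM shared.customers_360;

    -- Each business unit builds its own insights on its view, while the data
    -- itself is stored once and governed centrally.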

Expand the data lakehouse

Once the business value has been demonstrated, the ‘quick win’ should unlock the budget, executive support and architectural platform needed to expand the data lakehouse. It might lead to planning and executing a second project that migrates other functional data – perhaps finance or supply-chain records – from HDFS to the lakehouse. Or the decision might be made to extend the unified access layer to support legacy databases on premises. This could support a data mesh in which business domain owners publish data products to self-service users throughout the business.
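As a hedged illustration of that data mesh pattern, a domain team might publish a curated ‘data product’ as a governed view that combines a lakehouse table with a legacy on-premises source reached through the unified access layer. The source names are assumptions, and how an external database is connected and exposed differs by platform.

    -- Illustrative only: a finance domain team publishes a data product as a view.
    -- lake.invoices is a lakehouse table; erp.payments is assumed to be a legacy
    -- on-premises database surfaced through the unified access layer.
    CREATE VIEW finance_products.receivables_summary AS
    WITH paid AS (
        SELECT invoice_id, SUM(amount) AS paid_amount
        FROM erp.payments
        GROUP BY invoice_id
    )
    SELECT
        i.customer_id,
        SUM(i.amount)                              AS invoiced,
        SUM(COALESCE(p.paid_amount, 0))            AS paid,
        SUM(i.amount - COALESCE(p.paid_amount, 0)) AS outstanding
    FROM lake.invoices i
    LEFT JOIN paid p ON p.invoice_id = i.invoice_id
    GROUP BY i.customer_id;

Self-service users across the business can then discover and query the product without needing to know where the underlying data physically lives.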

The goal is to create a sequence of incremental, achievable projects that each demonstrate the lakehouse’s ROI. From there, look to rinse and repeat until every part of the business is benefiting from the data lakehouse.