My previous post explained that, in my mind, the data lakehouse differs hardly at all from the traditional data warehouse architectural design pattern (ADP). It consists largely of the application of new cloud-based technology to the same requirements and constraints we tackled over the decades of data warehousing. My conclusion there: vive le data warehouse!
In this post, I move to a more substantial challenge to my declared position on the future of the data warehouse. Let’s unpick the data mesh.
Data mesh was first described in a May 2019 article, “How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh,” by Thoughtworks consultant, Zhamak Dehghani. Her choice of starting point—a Monolithic Data Lake—for this journey offers a clue to much of the thinking that follows. I return to this point in a moment, but first, what is a data mesh?
Data Mesh Defined
More than three years after its introduction, many definitions of a data mesh exist, but one of the clearest is the one provided by Thoughtworks in 2020: A data mesh is “a domain-driven analytical data architecture where data is treated as a product and owned by teams that most intimately know and consume the data … apply[ing] the principles of modern software engineering and the learnings from building robust, internet-scale solutions to unlock the true potential of enterprise data.”
Key to this definition are the phrases domain-driven and data as a product. Both draw attention to the governance aspects of creating data. In a data mesh, this is done in a highly distributed manner by the teams and functions most familiar with the data, i.e., the data owners, in a loose use of this phrase. This is, in essence, a focus on responsibility for data quality, and is, of course, a welcome and vital aspect of delivering data/information for decision-making support. It is also at the heart of the data warehouse ADP, where it is often governed in a relatively centralized manner, particularly using enterprise data models. Data mesh designers believe that such governance should be entirely distributed, on the valid basis that centralized solutions are enormous bottlenecks to innovation and change.
Rather less importance is attached in a data mesh to the governance aspects of using and reusing data. A largely implicit assumption is that data, once created by those who know it well, can be directly and easily used as-is by multiple consumers. This is at the heart of the concept of data as a product. Business intelligence professionals who built early data warehouses can attest to the weakness of this assumption. A glance at the plethora of spreadsheets derived from any provided data set will suffice. Good data governance must extend to data use across its many contexts. Although only poorly supported in traditional data warehousing, data virtualization offers important functionality in this area, which can be applied more easily in a data warehouse with its more centrally organized approach to data modelling.
The second part of this definition—applying the principles of modern software engineering—calls for the use of tools and techniques, particularly microservices architecture, that have proven successful in the building of distributed, cloud-centric—and largely operational—applications.
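To make the data-as-a-product idea a little more concrete, the sketch below models a hypothetical data product contract: the domain that owns it, the team accountable for its quality, the output schema it commits to, and a freshness guarantee. All of the names and fields here are illustrative assumptions for discussion, not part of any data mesh standard or product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataProduct:
    """A hypothetical, minimal data product contract owned by a domain team."""
    domain: str               # owning business domain
    name: str                 # product identifier within the domain
    owner_team: str           # team accountable for data quality
    output_schema: dict       # column name -> type: the published contract
    freshness_sla_hours: int  # how stale the served data may become


# A domain team publishes its product with an explicit, versionable contract
# that consumers elsewhere in the organization can rely on.
orders_product = DataProduct(
    domain="sales",
    name="orders_daily",
    owner_team="sales-analytics",
    output_schema={"order_id": "string", "order_date": "date", "amount": "decimal"},
    freshness_sla_hours=24,
)
```

The point of such a contract is that accountability for quality and meaning stays with the domain team, while consumers program against the published schema rather than the team's internal tables.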
Mesh vs. Warehouse
Let’s return to the Monolithic Data Lake as the launchpad for the initial discussion of data mesh. Dehghani uses this term to describe an evolution of the traditional data warehouse via the original data lake, calling it a third-generation data platform. This is, in my view, an unfortunate starting point, because the Monolithic Data Lake as described implements very few true data warehousing principles.
By starting her journey here, Dehghani over-emphasizes some specific governance limitations of data lakes that often lead to “data swamps.” As a result, centralization is identified as the key problem in delivering analytics. A data warehouse, seen (incorrectly) as the ultimate embodiment of such centralization, is therefore immediately rejected. In a 2021 tweet, Dehghani declares “there are no warehouses in a mesh.” Some vendors and implementors disagree, seeing data warehouses (or more correctly data marts) as a form of data product, adding to the definitional confusion. The above-referenced tweet continues: “There are autonomous data product quantums that provide multimodal access to domain data for analytical workloads – connected together in a graph – each both transforming and serving/controlling immutable bitemporal data.”
(As an aside, the mention of graph here and in other data mesh definitions seems to lead to some confusion in the market between data mesh and data fabric, where graph technology underpins its approach to metadata definition and use. Caveat emptor!)
Beyond the conceptual complexity and novelty of the above phrasing—which is widespread in discussions of data mesh—the data mesh ADP poses one significant challenge for anybody who has ever built a data warehouse: where and how is data from multiple sources reconciled in a consistent, cross-enterprise manner? Domain-driven design and data-as-a-product are valid and powerful approaches in cases where the scope of the analysis and, therefore, the information needed are confined to a well-bounded organizational entity. These methods offer no obvious way of tackling “single version of the truth” requirements such as consolidated reporting at board level or for regulatory bodies.
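To make the reconciliation challenge concrete, here is a minimal, hypothetical sketch: two domain data products report revenue for the same customer under different local identifiers, and a consolidated board-level figure requires a cross-domain identifier mapping that neither domain naturally owns. All identifiers and figures below are invented for illustration; the open question in a fully decentralized mesh is who builds and maintains that mapping.

```python
# Two domain products report revenue under their own local customer IDs.
sales_domain = {"CUST-001": 120_000.0}      # sales domain's ID for Acme Corp
support_domain = {"AcmeCorp-EU": 15_000.0}  # support domain's ID for the same firm

# A cross-domain mapping to a shared enterprise ID -- the kind of
# "single version of the truth" asset a data warehouse centralizes.
enterprise_id = {"CUST-001": "E-42", "AcmeCorp-EU": "E-42"}


def consolidated_revenue(*domain_books):
    """Roll domain-level figures up to shared enterprise customer IDs."""
    totals = {}
    for book in domain_books:
        for local_id, amount in book.items():
            key = enterprise_id[local_id]
            totals[key] = totals.get(key, 0.0) + amount
    return totals


print(consolidated_revenue(sales_domain, support_domain))  # {'E-42': 135000.0}
```

The rollup itself is trivial; the hard, cross-enterprise work is agreeing on and governing the `enterprise_id` mapping, which sits outside any single domain's boundary.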
Long Live the Data Mesh?
As a very different and novel approach to delivering data for analytical and decision-making support, the data mesh—unlike the data lakehouse—offers a comprehensive logical architecture and thus deserves the moniker of ADP. On a very positive note, the data mesh focuses on federated governance, embedding computational and data governance policies directly in data products and in the infrastructure. This approach is timely and worth considering even if the broader aspects of the ADP may be disputed.
The approach has generated interest and excitement among a wide audience of developers struggling with delivering analytical function in the emerging cloud environment. Extensive prototyping and open-source development are underway in this community. However, many of these folks come from an operational application development background, with relatively little experience of large-scale data management challenges, often leading to a focus on point solutions.
The software required for data mesh implementation is in its very early stages. As a result, and given the novelty and complexity of some of the underlying concepts, Gartner’s 2022 Hype Cycle for Data Management has declared that data mesh will be “obsolete before plateau,” that is, it will never reach widespread mainstream adoption. I’m not so bold as to make such a damning prediction. Although it offers an interesting and powerful approach to delivering well-bounded analytical solutions, as well as worthwhile thinking on distributed, embedded data governance, I certainly counsel caution in considering the data mesh as an overall approach to replace a data warehouse at this stage of its evolution.
So, my bottom line, at least for now: vive le data warehouse!
- The Data Warehouse is Dead, Long Live the Data Warehouse, Part II - November 24, 2022