If you are in the data management world, you are probably helping your company redefine its data analytics architecture, especially in the context of cloud adoption. At some point in the process, someone has probably stared at the myriad boxes and acronyms in your architecture and said something like: “Now that we are redefining all this, couldn’t we simply consolidate all of the data in a single system like a data lake or a data lakehouse?”
This is a natural question. Managing different systems, built on different technologies, is complex and costly from many points of view, so a single platform that takes care of all of our data analytics needs makes intuitive sense. And some vendors in the market are telling companies that this is now possible, because cloud storage is cheap and their products can cover almost every analytics style.
Let’s Take a Step Back
But this is an idea with a complicated history. Not even in the much simpler world of the 90s were we able to store all of the data relevant for analytics in a single data warehouse. So there is some reason for skepticism now that data volumes are hundreds of times bigger and data is more distributed than ever. We heard a very similar story about data lakes (often from the same companies) about 10 years ago. As always tends to happen, reality then showed that data lake technology was great for some tasks but not so great for others.
Will this time be any different? I am skeptical, because heterogeneity and widely distributed data were never driven by storage costs. They come about because organizations are composed of autonomous units that make independent decisions. They also come about because different departments use the best tool for each task, because one size never fits all.
That is why most large companies today maintain several different data warehouses and data lakes, both on-premises and in the cloud. That is why they also have operational data stores, NoSQL repositories, cloud object stores, and SaaS applications. They would probably like to decommission a few of these systems, but they cannot do so without facing costly migration processes; in many other cases, these systems need to stay simply because they are the best way to solve a particular need.
As Gartner said recently, “a single data persistence tier and type of processing is inadequate when trying to meet the full scope of modern data and analytics demands.” And if you look closely at the reference architectures of cloud providers, you will see that even if you are able to move all your data to the cloud (and that is a very big “if”), every cloud provider offers different processing engines for different tasks and data types.
Of course, it is still true that the power provided by this diversity of systems comes at a price. If data is distributed across different systems, integrating that data can be slow and costly. Also, users no longer have a single access point to the available data. Security and governance also become more difficult because you need to ensure that consistent policies are applied across all systems.
A Better Way
The solution that major market analysts like Gartner and Forrester are proposing is distributed architectures such as the logical data warehouse and the data fabric (also see Gartner’s “Demystifying the Data Fabric”). A key component in these architectures is the data virtualization layer, which bridges the gap between the data processing engines and the data consumers.
This layer provides the following crucial features:
- Agile data integration of data distributed across different systems, much faster and cheaper than traditional alternatives based on data replication (Gartner estimates savings at 45% when using data virtualization)
- A unified semantic layer, enabling organizations to create virtual models that present data to each type of consumer in the shape, format, and structure that each one needs, and enabling stakeholders to organize virtual models in layers, to encourage the reusability of semantic definitions and data transformations
- The ability for consumers to access data using the technology of their choice, such as SQL, REST, OData, and GraphQL; data APIs can be created in minutes, with zero code (see the sketch after this list)
- A single point from which to apply semantic, security, and governance policies across all data processing engines, and a single source of truth for data consumers
- The ability to implement “data marketplaces,” where business users can go to find and get access to relevant data, no matter where it is located
- A data-abstraction layer, enabling organizations to move data from one location or system to another, without affecting data consumers
- Intelligent caching and acceleration, which enables the selective replication of small data subsets, accelerating queries on slow data sources, as well as the ability to accelerate distributed queries (See this post for the technical details)
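To give a flavor of what that consumer-side access can look like, here is a minimal Python sketch of querying the same virtual view first through SQL over ODBC and then through a REST-style data API. The DSN, host, view name, endpoint path, and response shape are hypothetical placeholders for illustration, not the actual API of any specific product:

```python
import pyodbc      # generic ODBC client; assumes an ODBC driver/DSN for the virtualization layer
import requests    # plain HTTP client for the REST-style data API

# --- Option 1: SQL over ODBC (hypothetical DSN and credentials) ---
conn = pyodbc.connect("DSN=virtual_layer;UID=analyst;PWD=secret")
cursor = conn.cursor()
# "customer_360" is a hypothetical virtual view that joins CRM, billing,
# and support data living in different underlying source systems.
cursor.execute("SELECT country, COUNT(*) AS customers FROM customer_360 GROUP BY country")
for country, customers in cursor.fetchall():
    print(country, customers)

# --- Option 2: the same view through a REST data API (hypothetical URL) ---
resp = requests.get(
    "https://dv.example.com/api/views/customer_360",
    params={"$filter": "country eq 'DE'", "$top": 100},  # OData-style query parameters
    auth=("analyst", "secret"),
)
resp.raise_for_status()
for row in resp.json()["elements"]:   # the response shape is an assumption
    print(row)
```

The point of the sketch is that the consumer works against one logical view and one security context, regardless of how many physical systems sit behind it.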
Summing Up
My advice would be simple: learn from history, assume you will never have a magic bullet that solves all of your analytics needs, and plan your data management architecture accordingly.
Hi,
I am from the traditional DW paradigm. I am unable to reconcile two points about data virtualization. Would you be able to point me in the right direction?
1. If I am pushing some part of the processing/aggregation towards the source [possibly OLTP?] layers, how can I be sure that I am not impacting their operational performance? Or are we assuming that they are set aside / dedicated to answering our calls in the first place? When we say this is less costly compared to replication, are we saying so because we transfer aggregated [at times partial] query results instead of the whole data?
2. How can the performance be better than a well-organized DW, when networks, source system performance issues, and other operational issues are at play? If we say the cost-based optimizer takes the best possible route, what is the process that builds and updates the statistics on a continuous basis? Wouldn’t a well-scheduled batch processing system help avoid these challenges easily?
Thanks for the help!
Regards,
Sreekanth.V
Thank you, Sreekanth, for your questions.
About your first question, the first thing worth clarifying is that in logical analytics architectures like the Logical Data Warehouse or the Data Fabric, the data sources are usually not OLTP systems. These architectures are typically used by big companies that have several repositories used for analytics, such as data warehouses, multidimensional engines, data marts, and data lakes. The Data Virtualization layer provides a common semantic layer and implements a unified security and governance layer on top of all those systems. Therefore, most of the queries that are pushed down to data sources by the Data Virtualization engine are processed by systems specifically designed for analytic workloads.
That said, it often happens that you need some data that is only available in operational data sources (OLTP databases, packaged applications, SaaS applications, and so on). For these scenarios, the DV layer offers a number of options to avoid introducing too much workload on the data sources, such as caching, query acceleration (similar to complex materialized views), limiting the number of concurrent queries that can be pushed down to a data source, or limiting the type of queries that can be pushed down (for instance, you can specify that queries pushed down to a data source must include a condition on a certain attribute, to guarantee that a certain index is used).
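To make the pushdown idea concrete, here is a small, engine-agnostic sketch that uses two in-memory SQLite databases as stand-ins for independent sources. It only illustrates the principle of shipping aggregated results rather than raw rows; the table and column names are invented, and this is not a depiction of Denodo’s actual execution engine:

```python
import sqlite3

# Stand-ins for two independent sources (in reality: a DW, a data lake, a SaaS API, ...)
sales_db = sqlite3.connect(":memory:")
crm_db = sqlite3.connect(":memory:")

sales_db.executescript("""
    CREATE TABLE sales(customer_id INTEGER, amount REAL);
    INSERT INTO sales VALUES (1, 100.0), (1, 50.0), (2, 200.0);
""")
crm_db.executescript("""
    CREATE TABLE customers(customer_id INTEGER, segment TEXT);
    INSERT INTO customers VALUES (1, 'enterprise'), (2, 'smb');
""")

# Conceptual "pushdown": each source computes its part of the query locally,
# so only small, already-aggregated result sets travel over the network.
totals = dict(sales_db.execute(
    "SELECT customer_id, SUM(amount) FROM sales GROUP BY customer_id"
))
segments = dict(crm_db.execute(
    "SELECT customer_id, segment FROM customers"
))

# The virtualization layer then only has to join the two small result sets.
report = {cid: (segments.get(cid), total) for cid, total in totals.items()}
print(report)   # {1: ('enterprise', 150.0), 2: ('smb', 200.0)}
```

The network transfer here is two tiny result sets instead of the full sales table, which is the main reason this approach is usually cheaper than wholesale replication.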
About your second question, a Data Warehouse is great for structured questions on structured data, but other engines excel at other types of needs. For instance, Data Science workloads are quite different and may run significantly better on other types of systems. An architecture like the Logical Data Warehouse or the Data Fabric allows you to use the best engine for each type of workload, and also to combine them when needed. In addition, it allows more agile data delivery (e.g., creating virtual data marts is much faster than creating physical data marts) and shields data consumers from changes in the underlying infrastructure (e.g., if a DW is moved to the cloud).
About your more specific question on data statistics for optimization, Denodo is able to automatically obtain cost statistics from the system tables of the underlying databases and refreshes them periodically. If you are interested in how the optimizer works in general to solve distributed queries, you can check, for instance, this and this.
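As a simplified illustration of the kind of statistics that source systems already expose, the snippet below reads approximate row counts and column-level statistics from PostgreSQL’s catalog views (pg_class and pg_stats). The connection details and the orders table are placeholders, and this is not how any particular virtualization engine is implemented internally:

```python
import psycopg2   # assumes a reachable PostgreSQL source; connection details are placeholders

conn = psycopg2.connect(host="dw.example.com", dbname="sales",
                        user="stats_reader", password="secret")
cur = conn.cursor()

# Approximate row counts maintained by the database itself (refreshed by ANALYZE/autovacuum).
cur.execute("SELECT relname, reltuples FROM pg_class WHERE relname = 'orders'")
print(cur.fetchall())

# Column-level statistics: fraction of NULLs and estimated number of distinct values.
cur.execute("""
    SELECT attname, null_frac, n_distinct
    FROM pg_stats
    WHERE tablename = 'orders'
""")
for column, null_frac, n_distinct in cur.fetchall():
    print(column, null_frac, n_distinct)
```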
Hope this helps!