Some still think it makes sense to compare ETL tools with data virtualization servers. Let me be clear: it really doesn't. Unquestionably, both tool categories belong to the class of data integration tools, alongside data replication tools, data wrangling tools, and enterprise service buses; but that doesn't imply they should be compared in detail.
Two things always have at least a few characteristics in common by which they can be compared. We can compare elephants with trucks based on weight, for example, and computers with toasters based on their power consumption. But what do such comparisons tell us? Sometimes they do make sense. When loading airplanes, for example, it may be important to know the weights of trucks and elephants, but you would never compare the two to determine which one you need for a specific use case.
So, ETL tools and data virtualization servers can be compared on common characteristics such as price, data filtering capabilities, and resource consumption, but it's a waste of time, because the two tool categories have different use cases. ETL tools specialize in extracting data from source systems, transforming and aggregating that data, and finally storing the result in some target database. In fact, that's what the letters stand for: Extract, Transform, and Load. They lift and shift the data. Because of this copying, the data used for reporting is never the most up-to-date data; there is always a time delay between the moment data is inserted in the source system and the moment it can be accessed by a report. ETL is very well suited to architectures in which data must be replicated before it's used.
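To make the lift-and-shift pattern concrete, here is a minimal sketch of an ETL job in Python. It is not taken from any specific product: it uses two SQLite in-memory databases as stand-ins for a source system and a target reporting database, and the orders table, the revenue_per_customer table, and all column names are purely illustrative.

```python
import sqlite3

# Stand-ins for a source system and a target reporting database.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")

# Hypothetical source data: raw order lines.
source.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'ACME', 100.0), (2, 'ACME', 250.0), (3, 'Globex', 80.0);
""")

# Extract: pull the raw rows out of the source system.
rows = source.execute("SELECT customer, amount FROM orders").fetchall()

# Transform: aggregate revenue per customer inside the ETL job.
totals = {}
for customer, amount in rows:
    totals[customer] = totals.get(customer, 0.0) + amount

# Load: physically store the transformed result in the target database.
target.execute("CREATE TABLE revenue_per_customer (customer TEXT, revenue REAL)")
target.executemany("INSERT INTO revenue_per_customer VALUES (?, ?)", totals.items())
target.commit()

# Reports read this stored copy; orders inserted in the source after the load
# are not visible until the next run, which is the data latency discussed above.
print(target.execute("SELECT * FROM revenue_per_customer").fetchall())
```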
The essence of data virtualization is abstraction, or, if you prefer, decoupling: decoupling data consumers from data producers and data stores. By doing this, changes made on the data production and data storage side won't impact the reports, and vice versa. Abstraction leads to a more agile environment for delivering data to data consumers. Let me give a few examples:
Centralization of specifications: Specifications required to extract, integrate, filter, aggregate, and calculate data are commonly spread across reporting tools, ETL programs, and database servers. This makes lineage and impact analysis complex. If data virtualization servers are used properly, most of these transformation specifications are defined in their repository. This centralized storage of specifications simplifies lineage and impact analysis, improves productivity and maintainability, and leads to more transparency.
Decrease of data latency: With ETL, if the data required for reporting and analysis must be transformed first, the transformed data must be physically stored before it can be used, which increases data latency. With data virtualization, the data can be transformed at the moment it's used; there is no need to store a transformed copy. The result can still be physically stored, but that's an option, not a requirement as it is with ETL (see the sketch after these examples).
Database migration: In my previous blog, I indicated that with data virtualization database servers can be replaced easily without impacting the reports. Many powerful new data storage and data processing technologies have been introduced lately, and leveraging them requires a migration. Data virtualization simplifies such a migration, because it operates as an abstraction layer between the reports and the databases.
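For contrast with the ETL sketch above, here is a minimal illustration of the on-demand transformation idea behind data virtualization. It is not how any particular DV product works; it simply uses a plain SQL view through Python's sqlite3 module, with illustrative table and view names, to show that a transformation can be defined once and evaluated at query time without storing a copy.

```python
import sqlite3

# One connection stands in for an abstraction layer over a source system.
db = sqlite3.connect(":memory:")

# Hypothetical source table, owned by the operational system.
db.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'ACME', 100.0), (2, 'Globex', 80.0);
""")

# A virtual view: the transformation is defined once, but no transformed
# copy is stored. Consumers query the view; the work happens at query time.
db.execute("""
    CREATE VIEW revenue_per_customer AS
    SELECT customer, SUM(amount) AS revenue FROM orders GROUP BY customer
""")

# A new order inserted in the source...
db.execute("INSERT INTO orders VALUES (3, 'ACME', 250.0)")

# ...is immediately visible through the view: no reload, no data latency.
print(db.execute("SELECT * FROM revenue_per_customer").fetchall())
```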
Evidently, data virtualization has characteristics comparable to ETL and there is overlap in functionality, which is why they both belong to the data integration category. But if you find yourself comparing these two types of products in detail, you probably haven't specified your precise requirements clearly. Again, if you're comparing elephants and trucks in detail, you probably haven't specified your use case in detail either.
Don’t see data virtualization as a replacement for ETL, because that’s not what it is; they can co-exist. If there is a need to replicate data, ETL is the preferred approach. When an abstraction layer is needed to increase the agility of a system, data virtualization is the right choice.
Dear Rick,
Interesting article. What does this mean for the concept of a 'logical data warehouse', which you introduced a couple of years ago?
Kind Regards,
Ralf
Hi Ralf,
In a logical data warehouse we may use both data virtualization technology and ETL technology. DV would be used for data abstraction, and because of that it would be the heart of the logical data warehouse. In that same architecture, ETL may be used, for example, to copy data from data sources to a data hub.
Best regards,
Rick van der Lans
Rick, I think I understand DV but how is data sucked out of and placed back into legacy apps like SAP, ORACLE, Siemens Team Centre etc? (We have hundreds of these various legacy apps).
Is there a useful guide to the market of DV products?
The Decoupling effect is because data is taken from legacy systems on demand, correct? Or does the DV layer provide a form of cache?
How does such an architecture cope with massive peak input/output with the legacy platforms, many of which were never designed/architected to work in such a manner?
Do DV products come with ADAPTERs or do these have to be custom developed?
thanks
Hi Jeremy,
Good questions. I have tried to answer them all:
how is data sucked out of and placed back into legacy apps like SAP, ORACLE, Siemens Team Centre etc? (We have hundreds of these various legacy apps).
>>> Through APIs or by retrieving data directly from their databases. Note that DV products do not have adapters/connectors for every app developed on this planet, but the well-known ones are commonly supported. SDKs exist so you can develop adapters yourself.
Is there a useful guide to the market of DV products?
>>> Not that I am aware of, although my seminars on this topic contain a lot of material on the more popular products.
The Decoupling effect is because data is taken from legacy systems on demand, correct?
>>> Yes, that’s one way of decoupling.
Or does the DV layer provide a form of cache?
>>> Yes, it does.
How does such an architecture cope with massive peak input/output with the legacy platforms many of which were never designed/architected to work in such a manner.
>>> DV is always restricted by what the underlying apps can do for you.
Do DV products come with ADAPTERs or do these have to be custom developed?
>>> Yes, most products come with adapters for the well-known systems; for the others, you can develop adapters yourself with an SDK (see my first answer above).
Thanks Rick – much appreciated.
One final question if I may
DV layers in full stack mobile/web development environments – is it a common practice? Is there a trend? It strikes me as a very very good option – to use a DV layer.
In this way users can access the Enterprise Systems (Legacy) via native user sign-ons to do the low-level transactional, OLTP, record-type stuff (what the app was originally intended to do), and the DV layer can 'cache' data to/from them, allowing 2nd- and 3rd-order derivative apps (web/mobile etc.) to be built fully decoupled and independent from the heavyweight architectures/skill sets of the legacy apps themselves, making our dev team far, far quicker and more agile.
Is this a correct way to 'see' things? I am trying to get a bunch of people on board who have even less understanding of Enterprise Architecture than I do!
thanks
Hi Jeremy,
Yes, that would be a good use case for data virtualization. The DV layer would make life very easy for the developers.
Thanks and greatly appreciated