In a previous post, I discussed the unbridled copying of data by IT and the new design principle called data minimization, and I said that it was time for data-on-demand architectures. In fact, the post was a plea for developing zero-copy architectures.
Although zero copying of data may not be realistic in the near future, we should still strive for it. It is comparable to zero-emission cars. Considered over their entire life cycle, such cars do emit CO2: if the energy used to build and drive them does not come from wind, solar, or nuclear power, CO2 is still emitted. Yet the car industry strives for zero emissions, and that is commendable. It is time we did something similar with data architectures: we should strive for zero copying of data in our architectures.
In this post I focus on intra-organization copying of data and how data virtualization can help to make this dream come true. I will provide four examples of what can be done.
Example 1: Replace file transfer by a data-on-demand solution
In one popular form of copying, a file containing data extracted from source systems is created every night or weekend and transferred to a receiving party. There, the data is received, transformed, and loaded into the receiver's systems to make it available to users. This architecture creates many copies of the data: the extracted file is the first copy, the received file is the second, and the loaded data is the third.
The disadvantages of this approach are many. Data latency is high: in many organizations it can take a day or more for newly transferred data to become available. The file transfer process can also fail, leaving the new data unavailable. Significant human resources are needed to develop the programs that copy, transfer, load, and transform the data, and to manage the scheduling of those copying processes. And we should not underestimate that with each copy created, it becomes harder to comply with data privacy and security regulations.
Such an architecture can be replaced by installing a data virtualization server on the side of the sender and allowing the receiving party to send queries to retrieve the data (the same data they would normally receive through a file transfer mechanism). In this case, the data virtualization server receives the queries, extracts the data from the sources, transforms it, and returns the result.
This data-on-demand solution leads to less copying and solves many of the problems mentioned above. The new challenge is the workload. The workload of a file transfer solution is highly predictable; that of a data-on-demand solution is not, because how often and when the receiving party will request data is unknown. The sender must therefore run the system on a platform that offers sufficient scalability and availability.
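As a rough illustration, the difference between the two approaches can be sketched in a few lines of Python. The `sqlite3` module and the `orders` table here are invented stand-ins for the sender's source system and its data virtualization endpoint; in practice the receiver would send SQL over a network connection to the sender's virtualization server.

```python
import sqlite3

# Invented stand-in for the sender's source system.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
src.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, "EU", 120.0), (2, "US", 80.0), (3, "EU", 45.5)])

# File-transfer style: extract everything on a schedule and ship it,
# creating a copy whether or not the receiver needs all of it.
nightly_extract = src.execute("SELECT * FROM orders").fetchall()

def on_demand(sql, params=()):
    """Data-on-demand style: the receiver sends a query when it needs
    data, and only the live result set travels -- no intermediate files."""
    return src.execute(sql, params).fetchall()

fresh = on_demand("SELECT region, SUM(amount) FROM orders GROUP BY region")
print(fresh)
```

The key difference is that `nightly_extract` is a copy with a fixed age, while `fresh` reflects the sources at the moment the receiver asks.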
Example 2: Replace file transfer to many receivers by a data-on-demand solution
The second example is comparable to the previous one. The difference is that the file is not transferred to one but to many receivers. Such an architecture can result in an enormous number of copies. Each receiver stores multiple copies of the data. By replacing such a solution with a data virtualization-driven, data-on-demand solution, it is transformed into a zero-copy architecture. The advantages mentioned in the previous example also apply here, but they are much more extensive due to the number of receivers.
Example 3: Replace full file transfer by a single-record, data-on-demand solution
In some situations, files are transferred that contain all the data, while the receiver uses only a few records. The whole file is transferred because the receiver does not know in advance which records will be needed. In a zero-copy alternative, a data virtualization server handles the queries and returns only those records that are needed at the time.
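The contrast can be sketched with a parameterized query; the `customers` table and its records are made up for illustration, with `sqlite3` again standing in for the virtualization server. Instead of shipping the whole file, the receiver asks for exactly the record it needs at that moment.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
con.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                [(1, "Alice", "Utrecht"), (2, "Bob", "Boston"), (3, "Carol", "Utrecht")])

# Full-file transfer: every record is copied, whether it is used or not.
full_file = con.execute("SELECT * FROM customers").fetchall()

def fetch_customer(customer_id):
    """Single-record, data-on-demand: only the needed record travels."""
    return con.execute("SELECT * FROM customers WHERE id = ?",
                       (customer_id,)).fetchone()

print(fetch_customer(2))
```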
Example 4: Replace file transfer by real-time, ad-hoc queries
In the previous examples, the receiver always needs the same data. It is as if the query to determine the required data is predefined. In some file transfer solutions, however, the receiver needs the data to execute ad-hoc queries. In other words, the receiver does not know in advance what data it needs and therefore asks for all of it. Such an architecture can be simplified and turned into a zero-copy architecture by allowing the receiver to run the queries directly on the sources of the sender through a data virtualization server.
Data Virtualization Comes to the Rescue
These are just a few examples of architectures in which file-transfer solutions resulting in many copies can be replaced by zero-copy architectures.
The role of data virtualization in all of them is manifold: it retrieves the requested data from the sources, optimizes that access, integrates data from multiple sources where needed, and transforms, aggregates, filters, calculates, and secures the data. It turns raw data into consumable data and becomes the gateway to the data for external parties.
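That combined role of transforming and securing data before anything reaches the consumer can be mimicked in miniature with a SQL view. The table and the masking rule below are invented for illustration; a real data virtualization server would apply such rules across heterogeneous sources rather than inside a single database.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE patients (id INTEGER, name TEXT, ssn TEXT, visits INTEGER)")
con.executemany("INSERT INTO patients VALUES (?, ?, ?, ?)",
                [(1, "Alice", "111-22-3333", 4), (2, "Bob", "444-55-6666", 2)])

# The 'virtual' consumable layer: it masks sensitive data on the fly,
# so consumers never see (or copy) the raw records.
con.execute("""
    CREATE VIEW consumable AS
    SELECT id, name,
           '***-**-' || substr(ssn, -4) AS ssn_masked,
           visits
    FROM patients
""")

rows = con.execute("SELECT * FROM consumable").fetchall()
print(rows)
```

Consumers query `consumable` as if it were a table, while the security rule stays with the sender instead of being re-implemented by every receiver.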
The movie and music industries have already transformed into video-on-demand and music-on-demand industries, respectively, through streaming services. It is time the same happened for data. The right technology is available. The future is zero-copy architectures.
- Zero-Copy Architectures Are the Future - October 27, 2022