In this post, I’m going to cover:
- How different data virtualization products interpret “data virtualization” differently
- The 5 unique characteristics of true data virtualization products
- Where the alternatives to data virtualization solutions underperform
Data virtualization has outgrown its infancy. Many providers offer “something” in this field, but there is a substantial difference between their “something” and real data virtualization. What is “real” data virtualization? What must a data virtualization product be able to do? In this post, I will cover the 5 essential characteristics of a data virtualization solution and compare a few data virtualization alternatives.
Essential Data Virtualization Characteristics
For a while, various software suppliers have been trying to secure a spot in the data virtualization market. Take Microsoft, for example, which recently introduced PolyBase as part of SQL Server 2019. IBM also has a data virtualization stack, as do long-standing providers such as Data Virtuality, Tibco, and Denodo. In addition, there are platforms such as MuleSoft and SAP HANA, and other products that offer data virtualization in one form or another. Each of these providers interprets “data virtualization” in a different way.
The question, however, is whether or not these solutions do what they are supposed to do. In other words, when can you safely call something a genuine data virtualization product? You will recognize it by these 5 characteristics:
1. It decouples logic and source complexity
A data virtualization product connects multiple data sources, each with its own complexities. You want to shield your data end-users from these complexities, so that nothing stands in the way of data access. You don’t want end-users to have to apply transformation logic themselves. It is better to apply the logic to your data at one point within the data platform, so that all your users can consume the data from there. In other words, with data virtualization, you must be able to decouple the source from the end-user. That is how you isolate the data complexity and logic.
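To make the idea concrete, here is a minimal sketch (Python with SQLite, purely for illustration; the table, view, and column names are hypothetical) of how a single virtual view can encapsulate the transformation logic once, so that consumers never touch the raw source directly:

```python
import sqlite3

# Purely illustrative: one "virtual" view applies the logic once,
# so consumers never see the raw source table or its quirks.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER, status TEXT);
    INSERT INTO raw_orders VALUES (1, 1250, 'SHIPPED'), (2, 980, 'CANCELLED');

    -- Filtering and unit conversion are applied once, inside the view.
    CREATE VIEW orders AS
    SELECT id, amount_cents / 100.0 AS amount_eur
    FROM raw_orders
    WHERE status != 'CANCELLED';
""")

# End-users query the decoupled view; the source complexity stays hidden.
print(conn.execute("SELECT * FROM orders").fetchall())  # [(1, 12.5)]
```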
2. It runs in a technology-independent way, integrating with all data storage technologies
Another important requirement is for a data virtualization product to integrate and communicate with a multitude of technologies; today’s data landscape is simply too complex for anything less. Relational sources, data services, cloud sources, flat files: you must be able to unlock and access all your data sources, regardless of the technology they are based on. The same applies to your data storage.
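As a rough illustration of what technology independence means in practice, the following sketch (Python with pandas and SQLite; the file contents and table names are made up) combines a flat file and a relational table into one result set that the consumer queries as a whole:

```python
import io
import sqlite3
import pandas as pd

# Illustrative only: one logical query spans a flat-file source and a
# relational source; the consumer sees a single combined result.
csv_source = io.StringIO("customer_id,region\n1,EMEA\n2,APAC\n")
customers = pd.read_csv(csv_source)                      # flat-file source

conn = sqlite3.connect(":memory:")                       # relational source
conn.execute("CREATE TABLE sales (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 120.0), (2, 75.5)])
sales = pd.read_sql_query("SELECT * FROM sales", conn)

# One integrated view, regardless of the underlying storage technology.
print(customers.merge(sales, on="customer_id"))
```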
3. It provides a strong backbone for a hybrid data architecture
Today’s data architectures are often hybrid in nature. A true data virtualization product integrates the various databases within one architecture, whether they are on-premises or in the cloud, so that the different technologies can complement each other.
4. It supports different forms of data use
What is the benefit of data if it cannot be used? With data virtualization, companies can always focus on the purpose of their data. Take governed BI, for example: the origin of any displayed figure must always be clear. In other words, the data lineage must be traceable from the report back to the source, including all transformations in between. What about information that is needed at a specific moment? Can companies make that data available quickly enough for it to be of use? Or perhaps companies want to support data scientists by letting them roam through all kinds of data, so that they can discover new trends and arrive at new insights. Data virtualization must support all of these forms of data use.
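For the governed BI case, the sketch below shows the kind of lineage record a catalog might keep, so that any figure in a report can be traced back to its sources and transformations. This is a simplified, hypothetical structure, not the catalog of any particular product:

```python
from dataclasses import dataclass, field

# Hypothetical lineage record: every published view knows which sources
# it came from and which transformations were applied along the way.
@dataclass
class LineageRecord:
    view: str
    sources: list
    transformations: list = field(default_factory=list)

record = LineageRecord(
    view="sales_per_region",
    sources=["crm.customers (CSV export)", "erp.sales (SQL Server)"],
    transformations=["join on customer_id", "aggregate amount by region"],
)

# Governed BI: a user of the view can trace any figure back to its origin.
print(f"{record.view} <- {record.sources} via {record.transformations}")
```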
5. It has an effective cost-based optimizer
Performance is of the utmost importance when it comes to data virtualization. But how does one ensure optimal performance when running a query? With cost-based optimization. A cost-based optimizer analyzes the possible query strategies, or execution plans, and determines the most efficient one, which always depends on how the data is being used. In data virtualization, there are often several paths to the required data; the optimizer selects the fastest route for each use case.
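The following toy sketch illustrates the principle of cost-based plan selection (the cost model and numbers are entirely made up): the optimizer estimates the cost of each candidate execution plan and picks the cheapest one for the use case at hand.

```python
# Toy cost model, for illustration only: network transfer is assumed to be
# far more expensive than local processing.
def estimate_cost(rows_transferred: int, rows_processed_locally: int) -> int:
    return rows_transferred * 10 + rows_processed_locally * 1

# Two hypothetical execution plans for the same query.
plans = {
    "push filter down to the source": estimate_cost(
        rows_transferred=1_000, rows_processed_locally=1_000),
    "pull everything, filter locally": estimate_cost(
        rows_transferred=1_000_000, rows_processed_locally=1_000_000),
}

# The optimizer activates the cheapest route to the required data.
best = min(plans, key=plans.get)
print(best, plans)
```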
Alternatives to Data Virtualization
Now that we know the must-haves for data virtualization products, it is worth taking a quick look at the alternatives. A few examples are data federation extensions for databases, SQL-on-Hadoop engines, in-memory databases, enterprise service buses (ESBs), service-oriented architecture (SOA) and other message-based products, and cloud databases. Let us briefly review these options, keeping in mind the 5 core characteristics of data virtualization.
- No lineage with data federation
Data federation resembles data virtualization in many ways, but there are essential differences. Data federation can indeed decouple users from source complexity, but data federation extensions offer little flexibility in unlocking and accessing sources. If we look at data use, focusing on governed BI, data federation does not provide a way to trace the origin of the data, whereas current data virtualization products have catalogs that track data lineage to the metadata level. Data federation may be good at query optimization, but it remains restricted in the number of different sources it can connect to.
- Data replication needed for SQL-on-Hadoop
For SQL-on-Hadoop, decoupling the logic from the source is not that simple, because the data must reside in the Hadoop file system, which is not always the case. Data replication is then unavoidable. What’s more, this solution falls short when it comes to unlocking and accessing various sources and combining and integrating their structures; administrators soon get bogged down in a mishmash of data copies. In addition, as with data federation, SQL-on-Hadoop does not inherently provide a searchable metadata catalog. SQL-on-Hadoop does offer a cost-based optimizer, but that only helps once all the data is in Hadoop.
- In-memory databases: not technology-independent
In-memory databases decouple the logic from the source complexity for the user, just like data virtualization products. They retrieve full sets of data from source systems, store the data in memory, and combine it with other data needed for the query. However, this solution is usually tied to a specific database server, so it is not technology-independent, and it provides little to no support for data management. Because the data is first retrieved from external sources and loaded into memory, none of the capabilities of the data server that stores the source data can be leveraged. As a result, the execution path is often far from optimal, and it makes sense that in-memory databases do not score highly on cost-based optimization.
- ESB, SOA, and messaging: not suitable as the backbone
Companies leverage ESB, SOA, and messaging solutions mainly to transport data from one application to the next. They are indeed capable of decoupling logic and source complexity, and they operate in a technology-independent way. The downside is that these solutions are mainly suited to point-to-point integration rather than to different types of data use, which also makes them ill-suited as the backbone of a hybrid data architecture. In addition, there is no form of cost-based optimization whatsoever; the messages simply go from A to B.
- Cloud databases: transfer all the data first
Cloud databases offer the familiarity and integration benefits of the classic on-premises SQL database. The main difference is that the database is not hosted locally within the organization but consumed as a platform-as-a-service (PaaS) offering. Database management, such as patching, performance, and security, is part of the service, so you spend less time on it and need less in-house expertise. Cloud databases truly excel in the flexibility with which storage capacity and computing power can be added: such upgrades are often dynamic and can be performed automatically to meet users’ (temporary) needs. The downside, though, is that to take advantage of these benefits, companies must first transfer all of their data to the cloud database, which requires building extraction and load processes for each applicable source. Want to establish logic in transformation views? That works well and rather quickly, but cloud databases usually do not provide a decent development studio that gives you an overview of the developed logic across multiple, sometimes stacked, views. This makes management complicated and disorganized.
True Data Virtualization
Multiple products on the market support one or more of the 5 characteristics of data virtualization. Each solution has its advantages and disadvantages, but the more of these characteristics it has, the more it can be considered a true data virtualization solution. With these unique characteristics, data virtualization is a promising approach to managing data, compared with the alternatives.
How does your organization benefit from true data virtualization?