The star of SnowflakeDB keeps rising. This cloud-based, analytical SQL database server offers a set of powerful features, not the least of which are high performance, end-to-end encryption, cloud platform independence, a self-driving engine, automatic scalability (up and down), and flexibility.
Some may wonder what the advantage would be of using data virtualization in combination with SnowflakeDB. Data virtualization can extract data from other data sources, and it offers transformation and data integration power, but most of these features are supported by SnowflakeDB as well. So what would be the reason for combining them? Although the two offer some overlapping features, data virtualization also offers features that SnowflakeDB does not support, or supports only in a limited way.
Distinguishing Features
First of all, it is unlikely that all of an organization’s data will eventually be stored in SnowflakeDB. Most enterprise data isn’t stored in SnowflakeDB today and won’t be for a long time: big data can be too big to move centrally and may have to remain at remote sites; some data may be stored in cloud applications; and other data sources are highly secured, in which case the data is not allowed to be copied to another database. Still, a single interface to all that data would simplify the development of reports and dashboards. Data virtualization can help by making a heterogeneous set of data sources look like one logical database. One of those data sources can be SnowflakeDB.
SnowflakeDB can access data from other data sources through external tables. However, these data sources are restricted to files and specific Hadoop-related file formats. Data virtualization, on the other hand, can access data from many more sources, including SQL databases, numerous file formats, applications, spreadsheets, and service buses. The technology was developed precisely to address the inherent heterogeneity of today’s data processing systems.
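To make this concrete, here is a minimal sketch, in Python, of the work a data virtualization layer performs behind the scenes: it retrieves data from SnowflakeDB and from a source that Snowflake’s external tables cannot reach (here, the REST API of a cloud application) and joins the two results. All account details, table names, and URLs are hypothetical, and a real data virtualization server would generate and optimize this logic for you.

```python
# A hand-written sketch of a federated ("virtual") query: data is fetched from
# SnowflakeDB and from a non-Snowflake source (a REST API), then joined on the fly.
# All names, credentials, and URLs below are hypothetical.
import pandas as pd
import requests
import snowflake.connector  # requires the snowflake-connector-python package

# Fetch order data from SnowflakeDB (hypothetical account and table).
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    warehouse="my_wh", database="sales_db", schema="public",
)
orders = pd.DataFrame(
    conn.cursor().execute("SELECT customer_id, order_total FROM orders").fetchall(),
    columns=["customer_id", "order_total"],
)

# Fetch customer master data from a cloud application's REST API (hypothetical endpoint).
customers = pd.DataFrame(requests.get("https://crm.example.com/api/customers").json())

# The "distributed join": combine both sources into one logical result set.
result = orders.merge(customers, on="customer_id", how="inner")
print(result.head())
```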
Also, data virtualization enables stakeholders to centrally manage security across diverse data sources. Instead of having to define separate security specifications for different data sources using different specification languages, all security specifications can be handled in one uniform way using data virtualization.
Another difference relates to views. With data virtualization, all transformation, integration, aggregation, and filtering specifications are defined as views. SnowflakeDB is a SQL database server and supports views as well. Understandably, this is the main reason why some think that data virtualization is superfluous when SnowflakeDB is in the picture: both support views.
But there are key differences. First, data virtualization processes views that join multiple sources more efficiently. Its query optimizer is built to run distributed joins. Second, data virtualization allows for the definition of metadata; views and their columns can be described, defined, and tagged. This metadata is searchable by business users and professional IT developers.
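As a rough illustration of the metadata point, think of the catalog a data virtualization server maintains: every view carries a definition, a business description, and tags, all of which are searchable. The toy sketch below mimics that idea with a plain Python dictionary; the view names, descriptions, and tags are made up, and commercial products obviously offer far richer metadata and search capabilities.

```python
# A toy "metadata catalog" for virtual views: each view carries a definition,
# a business description, and searchable tags. All names are made up.
catalog = {
    "customer_revenue": {
        "definition": "SELECT customer_id, SUM(order_total) FROM orders GROUP BY customer_id",
        "description": "Total revenue per customer, combined from CRM and order data.",
        "tags": ["finance", "customer", "gdpr-relevant"],
    },
    "open_tickets": {
        "definition": "SELECT * FROM tickets WHERE status = 'open'",
        "description": "All unresolved support tickets.",
        "tags": ["support", "operational"],
    },
}

def search_views(term: str) -> list[str]:
    """Return the names of all views whose description or tags mention the term."""
    term = term.lower()
    return [
        name for name, meta in catalog.items()
        if term in meta["description"].lower() or term in (t.lower() for t in meta["tags"])
    ]

print(search_views("customer"))  # ['customer_revenue']
```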
The third difference regarding views relates to lineage and impact analysis. Data virtualization can show which views depend on which sources and/or other views. Lineage can be used to trace where data originates and which operations have been applied to it along the way; essentially, it offers business users more transparency. Impact analysis can be used to determine what will happen to other views when the definition of a view is modified, so the impact of such a change can be understood beforehand.
The data stored in SnowflakeDB can only be accessed through SQL. Data virtualization allows data consumers to use other languages or APIs. Depending on the requirements of the data consumer, this can be SQL, but it can also be REST, OData, GraphQL, SOAP, and other interfaces.
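As a simple illustration, the sketch below exposes the result of a virtual view over REST using Flask. The fetch_virtual_view function is a hypothetical placeholder for whatever the data virtualization layer returns; in a real product, the REST, OData, or GraphQL endpoints are generated for you rather than hand-coded.

```python
# A minimal sketch: serving a virtual view's result over REST with Flask.
# fetch_virtual_view() stands in for a call to the data virtualization layer.
from flask import Flask, jsonify

app = Flask(__name__)

def fetch_virtual_view(name: str) -> list[dict]:
    """Placeholder: in reality the virtualization server resolves the view."""
    return [{"customer_id": 1, "order_total": 99.50}]

@app.route("/views/<name>")
def get_view(name: str):
    # Data consumers that prefer REST over SQL simply call this endpoint.
    return jsonify(fetch_virtual_view(name))

if __name__ == "__main__":
    app.run(port=8080)
```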
The last feature I want to mention here is database server independence. If an organization needs to deploy another database technology for specific applications, data virtualization can hide that change. Applications and reports do not have to be rewritten to use the new database technology correctly. In other words, data virtualization hides the SQL dialect of the data source in use.
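SQLAlchemy offers a convenient way to illustrate this idea in a few lines of Python: one query definition is rendered into different SQL dialects depending on the target engine, which is analogous to how a data virtualization layer shields applications and reports from the dialect of the underlying database. The table and query below are purely illustrative.

```python
# One query expressed once, rendered into different SQL dialects.
# This mirrors how data virtualization hides the dialect of the data source.
from sqlalchemy import MetaData, Table, Column, Integer, String, select
from sqlalchemy.dialects import postgresql, sqlite

metadata = MetaData()
customers = Table(
    "customers", metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String(50)),
)

# One query definition, independent of any specific database server.
query = select(customers).where(customers.c.id == 1)

# The library renders the dialect-specific SQL; the query itself never changes.
print(query.compile(dialect=sqlite.dialect()))
print(query.compile(dialect=postgresql.dialect()))
```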
Data Virtualization and SnowflakeDB: The Best of Both Worlds
The advantage of this combination of technologies is the streamlined data access of data virtualization plus the speed, scalability, and flexibility of the SnowflakeDB engine. SnowflakeDB can also serve data virtualization as the storage engine for cached views.
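Conceptually, caching a virtual view in SnowflakeDB means materializing the view’s result set into a Snowflake table so that subsequent queries hit that table instead of the original sources. The sketch below hints at this with pandas and the write_pandas helper from the Snowflake connector; the connection details, the view result, and the table name are hypothetical, and real data virtualization products manage cache creation and refresh automatically.

```python
# A minimal sketch of caching a virtual view's result in SnowflakeDB:
# the combined result is written to a Snowflake table that later queries can use.
# Connection details, the view result, and table names are hypothetical.
import pandas as pd
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

# Pretend this is the (already integrated) result of a virtual view.
view_result = pd.DataFrame({"CUSTOMER_ID": [1, 2], "ORDER_TOTAL": [99.50, 42.00]})

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    warehouse="my_wh", database="cache_db", schema="public",
)

# Materialize the view's result as a cache table inside SnowflakeDB.
conn.cursor().execute(
    "CREATE OR REPLACE TABLE customer_revenue_cache "
    "(customer_id NUMBER, order_total FLOAT)"
)
write_pandas(conn, view_result, "CUSTOMER_REVENUE_CACHE")
```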
The combination of data virtualization and SnowflakeDB is a powerful one. Data virtualization may feel superfluous when SnowflakeDB is used, but that’s not the case if you consider the entire data architecture responsible for data processing, storage, and analytics. The two clearly augment each other, enabling a flexible, scalable data architecture.