Complex questions that need simple … but not simplistic … answers.
Update: make sure you also read the second part of this article, “Data Virtualization Performance and Source System Impact (Part 2)”.
Sixty seconds into a data virtualization conversation, the question invariably comes up: “What about DV performance?” And if it doesn’t, we bring it up. Why? Because performance is one of the key attributes that differentiates a true data virtualization platform from products that offer some of the features but lack robust performance management under different mixed workloads. With so many products and vendors claiming to provide “virtualization” … including SQL on Hadoop, data preparation, cloud integration, BI with built-in federation, etc. … it is important to understand data virtualization performance in a comprehensive way, not limited to a specific use case scenario.
A robust data virtualization platform has to address several aspects of performance and differentiate among them:
- DV Performance: typically refers to the latency, throughput, concurrency, etc. of sending queries through the DV layer to access heterogeneous data in real time. This is performance as experienced by users of the DV platform. Several articles on this blog address DV performance, such as “Dynamic Query Optimization”, “Myths in Data Virtualization Performance”, “Physical vs Logical Data Warehouse Performance: The Numbers”, and “Data Virtualization And Performance, Can They Be Achieved Simultaneously?”.
- Hybrid Strategies: using primarily data virtualization, supplemented by intelligent caching and ETL/batch processing seamlessly combined with federation. If you want more information about this topic, read “Breathe new life into your ETL Processes” and “Intelligent Caching in Data Virtualization”.
- Scalability: how performance under any of the above can be increased by adding more resources vertically (cores) or horizontally (servers, clusters) or using in-memory fabrics. If you want to learn more about this topic, read “Performance of Data Virtualization in Logical Data Warehouse Scenarios”.
- Source System Impact: ensuring that direct users of operational applications (e.g. Oracle EBS, SAP ECC, cloud/SaaS applications, etc.) are impacted as little as possible when those sources are exposed broadly through a data virtualization layer.
- Service Level and Resource Management: facilities for prioritizing some workloads and users over others; adapting resources dynamically based on external variables to maintain service levels; and, if necessary, allowing for graceful denial of service to low-priority or disruptive queries, mindful of both DV performance and source system impact.
Since the first three topics are addressed in several other blog articles, here we will focus on the last two: source system impact, and service level and resource management.
Source System Impact and Resource Management
Customers are particularly interested in how DV could affect real-time performance for users of operational systems, since DV can impose additional workload on them. A variant of this question is how DV affects the cost of using certain sources, such as mainframes, where usage incurs a cost per MIPS.
The first step in addressing source system impact questions is to acknowledge the difference between analytical and operational DV queries. Operational queries are those issued in the context of a transaction, such as responding to a customer in a call center where DV provides a “Customer 360” view to the agent across multiple underlying systems. Operational queries do not typically add extra load to the underlying operational systems, since the same queries that would have gone to those systems directly are now routed via DV and enriched with data from other sources. So there is no incremental source impact compared with querying the operational systems directly.
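To make “routing via DV” concrete, here is a minimal sketch in Python of an operational “Customer 360” lookup. The three source adapters (query_crm, query_billing, query_support) are hypothetical stand-ins for the connectors a DV platform would generate; they are not Denodo APIs.

```python
from typing import Any

# Hypothetical source adapters; in a real DV deployment these would be
# connectors (JDBC/ODBC/web service) to each underlying system.
def query_crm(customer_id: str) -> dict[str, Any]:
    return {"name": "Jane Doe", "segment": "Gold"}            # stubbed response

def query_billing(customer_id: str) -> dict[str, Any]:
    return {"balance": 120.50, "last_invoice": "2016-03-01"}  # stubbed response

def query_support(customer_id: str) -> dict[str, Any]:
    return {"open_tickets": 1}                                # stubbed response

def customer_360(customer_id: str) -> dict[str, Any]:
    """One DV call replaces several separate application lookups: the agent
    issues a single query, and the DV layer fans out to each source and
    merges the results into one enriched view."""
    view: dict[str, Any] = {"customer_id": customer_id}
    for fetch in (query_crm, query_billing, query_support):
        view.update(fetch(customer_id))
    return view

print(customer_360("C-1001"))
```

The point of the sketch: the source systems see the same per-system lookups they would have seen anyway, so federating them behind a single call adds no incremental source load.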
In fact, operational queries via DV can actually IMPROVE operational performance. How so? Query volumes to underlying sources can decrease because frequently requested intermediary data sets can be cached in DV. Also, in the example of a “Customer 360” call center application, the ability to obtain richer and more complete data about a customer, or to deliver that information via a self-service portal, can reduce query volumes and avoid over-taxing the systems, compared to doing “on-screen” or “swivel-chair” integration using multiple calls to the underlying applications … all while improving customer response time and “first call resolution”. A major telco customer reported 25-30% lower operational system workload after introducing a “single customer view” through data virtualization and using caching for frequently requested, but relatively unchanging, data.
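The caching effect is easy to picture as a time-to-live (TTL) cache placed in front of a source query. The sketch below is illustrative only … the 5-minute TTL and the fetch_customer_summary wrapper are assumptions for the example, not Denodo’s caching implementation, which is configured declaratively on views.

```python
import time
from typing import Any, Callable

def ttl_cached(ttl_seconds: float, fetch: Callable[[str], Any]) -> Callable[[str], Any]:
    """Wrap a source query so repeated requests within the TTL are served
    from memory, cutting the query volume that reaches the source."""
    cache: dict[str, tuple[float, Any]] = {}

    def cached_fetch(key: str) -> Any:
        now = time.monotonic()
        hit = cache.get(key)
        if hit is not None and now - hit[0] < ttl_seconds:
            return hit[1]      # cache hit: zero load on the source
        value = fetch(key)     # cache miss: one query to the source
        cache[key] = (now, value)
        return value

    return cached_fetch

# Frequently requested but relatively unchanging data is the ideal cache
# candidate; a 5-minute TTL absorbs repeat lookups for the same customer.
fetch_customer_summary = ttl_cached(300.0, lambda cid: {"id": cid, "tier": "Gold"})
print(fetch_customer_summary("C-1001"))  # first call goes to the source
print(fetch_customer_summary("C-1001"))  # second call is served from cache
```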
Analytical queries were, in the past, sent to intermediate systems such as the enterprise data warehouse (EDW) or data marts, where data had been copied from operational systems using ETL during low-usage batch windows. With DV, analytical queries are typically sent to the DV layer acting as a logical data warehouse (LDW), beneath which sit both the erstwhile intermediate data stores (EDW, data marts) and the operational systems themselves. Part of the query therefore retrieves data in real time from operational systems, and that does add source system impact, so it is a valid question to ask how data virtualization manages this impact for real-time queries.
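One common way an LDW bounds this impact is to partition an analytical query by date, answering history from the EDW copy and sending only the most recent, not-yet-loaded slice to the operational system. Here is a minimal sketch, assuming a hypothetical ETL cutoff date and a sales query; real platforms do this partition pruning inside the query optimizer.

```python
from datetime import date, timedelta

EDW_LOADED_THROUGH = date(2016, 3, 31)  # assumed date of the last completed ETL batch

def route_sales_query(start: date, end: date) -> list[str]:
    """Split an analytical date-range query between the EDW and the
    operational source so that only the post-ETL slice adds source load."""
    plan: list[str] = []
    if start <= EDW_LOADED_THROUGH:
        plan.append(f"EDW: sales {start} .. {min(end, EDW_LOADED_THROUGH)}")
    if end > EDW_LOADED_THROUGH:
        rt_start = max(start, EDW_LOADED_THROUGH + timedelta(days=1))
        plan.append(f"OLTP: sales {rt_start} .. {end} (real-time slice)")
    return plan

# A quarter-spanning query touches the operational system only for the
# two most recent weeks, not for the whole range.
print(route_sales_query(date(2016, 1, 1), date(2016, 4, 15)))
```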
The short answer is, “Very well, and in multiple ways that can be combined to create the optimal solution”. Below are some of the advanced features found in Denodo 6.0 with respect to management of resources, source impact and service levels:
- Data View Parameters – Limit the type of queries sent to the data source.
- Source-aware Query Optimizer – Design queries to leverage source capabilities best.
- Caching Strategies – Cache frequent and/or costly views and queries.
- Denodo Scheduler – Selective ETL / batch operations.
- Resource Throttling – Managing resources and service levels by throttling at multiple levels – data source, server requests, and client application queries (see the sketch after this list).
- Resource Manager in Denodo 6.0 – Dynamic and custom policy decisions.
- Monitoring – Managing through real-time and historical reports.
- Combining Denodo capabilities with other tools for adaptive management.
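To give a flavor of what per-source throttling means in practice, here is a minimal sketch using a semaphore to cap concurrent queries per data source. The source names and limits are hypothetical, and in Denodo such caps are declarative configuration on the data source, not application code.

```python
import threading
import time

# Hypothetical per-source concurrency caps; in a DV platform these would
# be administrative settings per data source, not application code.
SOURCE_LIMITS = {"oracle_ebs": 4, "sap_ecc": 2}
SEMAPHORES = {src: threading.BoundedSemaphore(n) for src, n in SOURCE_LIMITS.items()}

def run_query(source: str, sql: str) -> None:
    """Run a query only when a slot for its source is free, so a burst of
    analytical queries cannot overwhelm an operational system."""
    with SEMAPHORES[source]:   # blocks while the source is saturated
        print(f"[{source}] running: {sql}")
        time.sleep(0.1)        # stand-in for real query latency

# Six concurrent queries against a source capped at 2: at most two run
# at a time; the rest queue instead of hitting SAP ECC all at once.
threads = [
    threading.Thread(target=run_query, args=("sap_ecc", f"SELECT ... /* q{i} */"))
    for i in range(6)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```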
We will stop here and go into the details of each of these capabilities in Part 2 of this article. There we will see how these capabilities can be adopted, as the need arises, to deliver information to data consumers without impacting source systems. We will also discuss how these capabilities can be exposed via APIs to other complementary data services management tools in the enterprise.