Every day, organizations collect massive volumes of data, and those volumes continue to grow. The growth comes from diverse data generators across the consumer and enterprise landscapes, including hundreds of cloud applications, smartphones, websites, and social media networks, with each source producing meaningful data. By 2025, a World Economic Forum report estimates, 463 exabytes of data will be created each day globally – the equivalent of 212,765,957 DVDs per day! But collecting all of this data doesn’t mean organizations get the most from it.
Organizations face two critical data-related challenges. The first is that much of the collected data remains unused. Data has a limited shelf life, and if it is not analyzed within a suitable timeframe, it becomes a wasted resource. That is a separate topic, however, which I will discuss in a future article about maximizing the time-to-value of data and using it to gain actionable insights.
The Danger of Collecting Data
For this post, I want to focus on the second challenge: data collection, and more specifically data copies. Most collected data is not new or original; much of it is copied. Most organizations store customer data in multiple repositories such as transactional systems, staging areas, data warehouses, data marts, and perhaps also a data lake. Data can even be stored several times within a single datastore to support different use cases. We also find redundant copies stored in development and test environments. IT and business users copy data as well, often from transactional application databases and operational data stores into private files and spreadsheets.
Beyond intra-organizational data exchange, data is also exchanged between organizations. In almost all inter-organizational exchanges, the receiving organizations store the data in their own systems, resulting in yet more copies. Large volumes of data are routinely sent back and forth on a daily basis, especially between government organizations. Data growth within organizations is also enormous, but it largely consists of redundant data.
However, data copies, or redundant data, quickly become a massive liability if not handled with care.
Expressed in mathematical terms:
No. of Data Copies ∝ Risk of Data Breach
That is, every additional copy increases the risk of a breach. Copies end up in systems with different levels of security, each with its own authentication, authorization, and governance model.
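To make the proportionality concrete with an illustrative, back-of-the-envelope calculation (the probabilities here are assumptions for illustration, not figures from any study): if each copy independently carries, say, a 1% chance of being exposed in a given year, then ten copies raise the chance of at least one breach to 1 − 0.99¹⁰ ≈ 9.6%, roughly ten times the single-copy risk. And in practice the overall exposure is governed by the least protected system that holds a copy.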
I’m not advocating for the complete elimination of data copies. There are numerous valid reasons to keep some, such as strategic enterprise compliance reporting and audits. The most common motive for copying data into another system is to offload excess load from the production system. However, with the advent of dynamically programmable cloud infrastructure services, orchestration engines such as Kubernetes, containers, and improvements in network speed and performance, server offloading is no longer necessary. Data sharing should simply be treated as incremental user workload, and systems can be scaled up and down dynamically to match it.
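As a minimal sketch of that idea (generic Kubernetes, not anything prescribed in this article; the Deployment name, namespace, and thresholds below are hypothetical), an autoscaling policy can be attached to an existing data-serving workload with the official kubernetes Python client, so that extra sharing workload triggers a scale-out instead of a copy into another system:

```python
# Sketch: scale an existing Deployment with CPU-based autoscaling instead of
# copying its data to a second system. Assumes a reachable cluster and a
# hypothetical Deployment named "analytics-query-engine" in namespace "data".
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster

hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="analytics-query-engine-hpa"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="analytics-query-engine"
        ),
        min_replicas=2,                        # baseline capacity
        max_replicas=10,                       # ceiling for peak sharing workloads
        target_cpu_utilization_percentage=70,  # add replicas when average CPU exceeds 70%
    ),
)

client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="data", body=hpa
)
```

The same policy can be expressed declaratively in a YAML manifest or with `kubectl autoscale`; the design point is simply that incremental read workload becomes a scaling event rather than a justification for replicating the data.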
The second root cause of data copies and duplication is not a technical limitation; it involves people and processes. Organizations need to use data continuously to guide and justify decisions. The process transformation starts at the top, with sustained, engaged leadership that establishes and rewards collaborative behavior and treats data as an organizational asset rather than departmental property. Process improvement must precede, or at least accompany, the technology upgrades that support the new way of working and collaborating. We must resist the old habit of copying data and storing it redundantly as our primary option for data delivery and sharing. Redundant copies create several problems: keeping them synchronized with the source system, maintaining data pipelines, handling schema and format changes, data latency between systems, and data quality issues. All of these severely impact organizational agility.
Reduce the Data Footprint
Data copy reduction should be a guiding principle when modernizing a data strategy and selecting data management platforms. We are all aware that data velocity, volume, and variety are growing in every organization. Data is decentralized and spread across the network, and no single repository can serve all enterprise needs. Enterprises will continue to build specific, purpose-driven data warehouses and data lakes to support investigative analytics. As a result, we are witnessing the rise of distributed data architectures and management strategies, such as data fabric and data mesh, that address this distributed reality.
So, how can organizations embrace data decentralization and adopt a distributed data management strategy?
The Denodo Platform is the leading solution for unified data fabric architecture and data mesh process management, powered by data virtualization. It implements a logical data fabric architecture through modern capabilities such as semantic data layers, metadata management, automated artificial intelligence/machine learning (AI/ML)-driven data catalogs, and hybrid and multi-cloud deployment options, breaking down the boundaries that separate applications, data, clouds, and people. With the Denodo Platform’s copyless data access, integration, and delivery, organizations can quickly achieve their data copy reduction objectives while improving their agile data utilization strategy.
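To show what copyless delivery looks like from a consumer’s point of view, here is a generic sketch (not taken from Denodo documentation; the DSN, credentials, view name, and columns are all hypothetical). The application queries a single logical view over ODBC, and the virtualization layer federates to the underlying sources at query time, so no extract ever lands in the consuming system:

```python
# Sketch: query a logical (virtual) view instead of building a local extract.
# "DataVirtualizationLayer" is a hypothetical ODBC DSN pointing at the logical
# data layer; "customer_360" is a hypothetical view that federates several
# underlying sources at query time.
import os
import pyodbc

conn = pyodbc.connect(
    f"DSN=DataVirtualizationLayer;UID=report_user;PWD={os.environ['DV_PASSWORD']}"
)
cursor = conn.cursor()

# Query the logical view directly; no copy is created in the consuming system.
cursor.execute(
    """
    SELECT customer_id, region, lifetime_value
    FROM customer_360
    WHERE region = ?
    """,
    ("EMEA",),
)

for customer_id, region, lifetime_value in cursor.fetchall():
    print(customer_id, region, lifetime_value)

conn.close()
```

If the underlying sources move or their schemas change behind the view, the consumer’s query stays the same, which is exactly the agility argument made above.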