The market for data lakes has recently seen an impressive wave of new-generation engines that provide highly efficient processing of very large data volumes stored in distributed storage systems such as S3, ADLS, and others. With the low cost of storage in those cloud systems, these technologies enable a very cost-efficient platform, especially when compared with a traditional enterprise data warehouse (EDW). But the data lake engine is just one piece of a data lake strategy. So how cost-efficient is your data lake strategy?
A good way to measure efficiency in the data world is time-to-market, which normally translates into time-to-insights, which in turn depends on time-to-data. That is, to make business decisions based on data, you need to put the right data in front of the decision makers as soon as possible. You don’t want to miss your product launch before the Xmas season!
When working with a data lake, time-to-data depends on a number of factors:
- Is the data you need already in the data lake? If not, you need to create and operate ingestion pipelines. These can be more or less complex, depending on the type of data you need to ingest. In addition, sources change, so you need to maintain and adapt the pipelines as those sources evolve.
- Is the data you need fresh enough? For some use cases, daily or weekly loads are enough, but others require real-time or near-real-time replication, which usually requires additional software and is harder to operate.
- Is the quality of the data good enough? Raw data, which is often what lands in the lake, may need cleansing jobs, for example in the form of extract, load, and transform (ELT) flows.
- Is the data understandable by end users? Even when lake data is of good quality, it may be in a shape or structure that is hard for end users to understand. Custom views or additional ELT processes may be needed to make it usable. In addition, documentation and data catalogs play a crucial role in enabling data literacy across different types of users.
- Is the data lake directly accessible by consuming applications? If it is not, you may need to develop a custom access method, such as a web service endpoint with specific security and workload restrictions, to feed a web or mobile application.
All of these problems can be addressed, but doing so takes both time and money. Ideally, we could replicate all of our data from internal and external sources into the data lake in near real time. But is that cost-effective? What do we do with data that is seldom used?
It’s a catch-22: if we don’t do it, we may be limiting the quality of our insights, but if we do it, the ROI of the data lake becomes questionable. Why did you spend so much time and money ingesting data that nobody has ever used? Was putting every kind of data in the lake actually a good idea to start with?
A Logical Approach
In this context, the idea of a logical data lake starts to take shape: keep the core data in the data lake, with the corresponding ingestion and curation pipelines, and leave the rest of the data wherever it may be. A virtual layer, based on the Denodo Platform, enables end users to access all of that data as if it were stored in the same database. It is “logically” together, even though it is physically separated.
Queries can encompass data lake data (which is processed by the data lake engine), external data, or a mix of both. Since external data comes directly from the sources, it is always fresh. The Denodo Dynamic Query Optimizer automatically decides how to execute each query, whether with an in-memory merge or by bringing the external data into the lake on demand. The structure of the data models can also be reshaped through virtual views, easily defined using graphical wizards in the Denodo Platform’s web-based Design Studio.
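To make this concrete, here is a minimal sketch of what such a federated query might look like from a client application, assuming an ODBC connection to the virtual layer (a DSN named denodo_vdp) and two illustrative views, sales_lake (stored in the lake) and customer_crm (an external source); the connection, view, and column names are hypothetical.

```python
# Minimal sketch: one SQL query over the virtual layer that joins lake data
# with an external source. DSN, credentials, and view names are hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=denodo_vdp;UID=analyst;PWD=secret")
cursor = conn.cursor()

# To the client this is a single SQL statement; the optimizer decides where
# and how each part of the join is actually executed.
cursor.execute("""
    SELECT c.region, SUM(s.amount) AS total_sales
    FROM sales_lake s
    JOIN customer_crm c ON s.customer_id = c.customer_id
    WHERE s.sale_date >= '2021-01-01'
    GROUP BY c.region
""")

for region, total_sales in cursor.fetchall():
    print(region, total_sales)

cursor.close()
conn.close()
```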
Access to that logical data lake can be accomplished using SQL, just like a physical data lake, but also through REST, OData, and GraphQL web services, which the Denodo Platform automatically provides for all data models.
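As an illustration, a consuming application could read a published view with nothing more than an HTTP client. The sketch below assumes the view has been exposed as a REST service; the host, path, and query parameters are illustrative, not a documented endpoint layout.

```python
# Sketch of consuming a published view over REST. URL, credentials, and
# parameter names are illustrative placeholders.
import requests

url = "https://denodo.example.com:9443/server/sales_vdb/views/sales_by_region"

response = requests.get(
    url,
    params={"$filter": "region eq 'EMEA'", "$format": "json"},  # OData-style filter
    auth=("analyst", "secret"),  # basic authentication; other schemes are possible
    timeout=30,
)
response.raise_for_status()
print(response.json())  # JSON rows of the virtual view
```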
Security is also a no-brainer. All data, internal and external, can be secured down to the column and row level, and masked, regardless of its original location. Advanced features like tag-based security bring additional power to the lake: security becomes semantic, and the meaning of the data (conveyed by the tags associated with it) can determine access control, thereby reducing development time and human error.
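To make the idea tangible, here is a conceptual sketch in plain Python (not the platform’s policy engine) of what tag-based masking means: the tag attached to a column, rather than the column’s physical location, decides how its values are exposed to a given role. All names and rules are hypothetical.

```python
# Conceptual sketch of tag-driven masking: tags, not table locations,
# determine how values are exposed. Everything here is illustrative.
from typing import Callable, Dict, Optional

MASKING_RULES: Dict[str, Callable[[str], str]] = {
    "PII": lambda value: value[0] + "***" if value else value,  # partial mask
    "CONFIDENTIAL": lambda value: "<redacted>",                 # full redaction
}

COLUMN_TAGS: Dict[str, Optional[str]] = {
    "customer_name": "PII",
    "salary": "CONFIDENTIAL",
    "region": None,  # untagged columns pass through unchanged
}

def apply_tag_policy(row: dict, role: str) -> dict:
    """Return a copy of the row with tag-driven masking applied for this role."""
    if role == "admin":  # privileged roles see everything
        return dict(row)
    masked = {}
    for column, value in row.items():
        rule = MASKING_RULES.get(COLUMN_TAGS.get(column))
        masked[column] = rule(str(value)) if rule else value
    return masked

print(apply_tag_policy({"customer_name": "Alice", "salary": 90000, "region": "EMEA"}, role="analyst"))
```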
If direct access to the source is not the best approach (e.g., if it is too slow or puts too much load on the underlying system), the Denodo Platform can bring the data into the lake with the click of a button. It will automatically generate the corresponding Parquet files and register their structure in the data lake’s metadata catalog. It can even accelerate the execution of Athena queries, thanks to advanced features like Denodo’s AI-based Smart Query Acceleration.
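For context, the sketch below uses the AWS SDK for pandas (awswrangler) to illustrate roughly what that button automates behind the scenes: writing Parquet files to object storage and registering the table in the Glue catalog so that an engine like Athena can query it. The bucket, database, and table names are hypothetical, and this is not the Denodo Platform’s own API.

```python
# Rough sketch of the steps being automated: land Parquet files in the lake
# and register the table in the metadata catalog. Names are hypothetical.
import awswrangler as wr
import pandas as pd

df = pd.DataFrame(
    {"customer_id": [1, 2], "region": ["EMEA", "APAC"], "amount": [120.0, 75.5]}
)

wr.s3.to_parquet(
    df=df,
    path="s3://my-data-lake/curated/sales_by_region/",
    dataset=True,            # write as a dataset of Parquet files
    database="analytics",    # Glue database acting as the lake's metadata catalog
    table="sales_by_region", # table registered so Athena can query it
    mode="overwrite",
)
```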
A Cost-Efficient Data Lake Strategy
In short, a logical data lake powered by the Denodo Platform can be a game changer that brings agility, security, and cost-efficiency to your data lake strategy.