Data Scientists hold coveted positions in the analytics arena. We picture them spending most of their time gazing into the data and making predictions, and some of the time preparing the data for analysis. Unfortunately, the reverse seems to be true.
The results of a recent survey of data scientists, reported by CrowdFlower in its Data Scientist Report 2017, revealed the following allocation of time across typical data science activities:
| Activities of Data Scientists | Allocation of Time |
| --- | --- |
| Collecting, labeling, cleaning, and organizing data | 51% |
| Building and modeling data | 19% |
| Mining data for patterns | 10% |
| Refining algorithms | 9% |
| Other activities | 8% |
The results suggest that data scientists spend roughly half their time preparing data and only about 38% of it modeling data, mining data for patterns, or refining algorithms. This heavy investment in preparation is unavoidable: without it, the data simply would not be ready for analysis.
Virtual sandbox liberates data scientists
Enterprises face the challenge of shifting this allocation so that data scientists spend most of their time building and deploying models, uncovering patterns in the data, and refining algorithms, rather than devoting more than half of it to data preparation.
For data scientists, preparing data for predictive analysis is a multi-part, iterative process: discovering, profiling, blending, transforming, cleansing, and augmenting the data, then storing it in a repository or file from which samples are drawn for the model. Once these preparation tasks are complete, data scientists can focus on selecting or creating a model and then training and testing it. Depending on the results, they typically repeat one or more of the earlier steps before operationalizing the model. Additional iterations may be necessary after operationalization, when the model is fed live data, because results may be inconsistent with those obtained on test data.
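As a rough illustration of that loop, the sketch below runs a single prepare, train, and evaluate pass with pandas and scikit-learn. The file names, columns, cleaning rules, and model choice are hypothetical placeholders, not part of the survey or of any specific toolchain.

```python
# Minimal sketch of one prepare -> train -> evaluate iteration.
# All file names, column names, and cleaning rules are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Discover / profile: load a raw extract and inspect it
raw = pd.read_csv("customer_events.csv")       # hypothetical source file
print(raw.describe(include="all"))             # quick profiling pass

# Blend / cleanse / augment: join a second source, fill nulls, derive a feature
accounts = pd.read_csv("accounts.csv")         # hypothetical second source
data = raw.merge(accounts, on="customer_id", how="left")
data["last_purchase_days"] = data["last_purchase_days"].fillna(
    data["last_purchase_days"].median()
)
data["is_recent"] = (data["last_purchase_days"] < 30).astype(int)

# Sample, train, and test; in practice the model's results often send the
# data scientist back to the earlier preparation steps, and the loop repeats.
features = ["last_purchase_days", "is_recent", "account_age_months"]
X_train, X_test, y_train, y_test = train_test_split(
    data[features], data["churned"], test_size=0.3, random_state=42
)
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```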
A virtual sandbox environment, fueled by data virtualization, addresses this imbalance. Because it supports the highly iterative preparation and processing of data described above and furnishes data in real time rather than through batch-oriented ETL processing, it frees data scientists to focus on the critical activities that require their specific skill set, such as modeling data and refining algorithms.
Data Virtualization for Maximum Flexibility
Because data virtualization enables real-time data views across multiple sources, even highly dynamic, transactional ones, without moving data to a new repository, data scientists can iterate freely to derive the highest-performing, best-fit model for each predictive task. And because data virtualization serves as a unified access layer across the different sources, it can record every query and transformation a data scientist performs, providing visibility that significantly supports their efforts.
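As a hedged sketch of what that unified access layer looks like from the data scientist's seat, the snippet below samples from a virtual view that joins several underlying sources through a single SQL endpoint. The connection string and the view name dv.customer_360 are assumptions made for illustration and are not tied to any particular data virtualization product.

```python
# Hedged sketch: sampling a cross-source virtual view through one SQL endpoint.
# The endpoint URL and the view dv.customer_360 are hypothetical placeholders.
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection to a data virtualization server that speaks SQL
engine = create_engine("postgresql://analyst@dv-server:5432/virtual_db")

query = """
    SELECT customer_id, total_spend, web_sessions, support_tickets
    FROM dv.customer_360   -- virtual view joining CRM, web, and ticketing sources
    WHERE signup_date >= DATE '2017-01-01'
"""

with engine.connect() as conn:
    sample = pd.read_sql(query, conn)

# Each modeling iteration can simply re-run the query; no ETL job is rebuilt.
print(sample.head())
```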
Since data virtualization requires no data extraction, model deployment is greatly simplified: the operationalized model can read live data through the same virtual views it was trained and tested against, and much of the delivery process can be automated.
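Continuing the same hypothetical example, a production scoring job could reuse the view queried during training, which is the sense in which deployment becomes simpler; the function, view, and column names below are illustrative only.

```python
# Hedged sketch: the production scoring job reads live rows from the same
# hypothetical virtual view (dv.customer_360) that was used for training.
import pandas as pd

def score_new_customers(engine, model, feature_cols):
    """Pull today's records from the virtual view and apply the trained model."""
    live = pd.read_sql(
        "SELECT * FROM dv.customer_360 WHERE signup_date = CURRENT_DATE",
        engine,
    )
    live["churn_risk"] = model.predict_proba(live[feature_cols])[:, 1]
    return live[["customer_id", "churn_risk"]]
```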
Data Virtualization solves many problems
The virtual sandbox is just one of many solutions enabled by data virtualization. Watch the DataFest Conference sessions to learn about the other ways data virtualization can simplify and accelerate data delivery. DataFest is intended for data practitioners and for strategic roles such as Enterprise Architect, CDO, CIO, and Business Intelligence Director.
- In and out of Blockchain - February 12, 2018
- A Virtual Sandbox for Data Scientists - September 11, 2017
- Growing Demand for Metadata-driven Integration Solutions – The Biggest Change in the 2017 Magic Quadrant for Data Integration Tools - August 24, 2017