By typical estimates, developing a new small-molecule drug involves preparing and testing more than a thousand candidate molecules and results in a New Drug Application of over 100,000 pages. The data supporting such projects exists in diverse sets that vary greatly in size, complexity and structure. Creating this data represents an enormous investment of time, labor and money. Yet efficiently finding, sharing, and reusing it across an organization remains a significant challenge, and knowledge management initiatives have a high failure rate. Describing data with an accurate and complete set of metadata is the first step in realizing its full value.
The table below contains a small sample from a master data repository of manufacturing lot information for a drug candidate. It illustrates a common occurrence when integrating data generated by multiple systems or users. Without a common terminology, different ad hoc “formats” and definitions emerge, and data harmony breaks down. In these four rows, the strength is variously split across two columns, combined in a single column, or embedded in the description.
Description | Amount | Size |
Film-coated tablet | 20 | mg |
FC tablet | 20 mg | |
20 mg tablet, film coated | | |
Red film coated tablet | 20 mg | |
Metadata in the form of controlled vocabularies, taxonomies, thesauri and ontologies are collectively known as “vocabularies.” They differ in the complexity of the information represented and in how they’re expressed. However, all identify and categorize digital content and provide contextual information about that content. One of the key benefits of this type of metadata is harmonization of terminology across systems.
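As a rough illustration of what harmonization can look like in practice, the sketch below maps the four sample rows onto a single preferred term and a structured strength. The vocabulary contents, the `harmonize` function, and the choice of "film-coated tablet" as the preferred term are all assumptions made for this example, not part of any particular product or standard.

```python
import re

# Hypothetical controlled vocabulary: preferred term -> accepted variants.
DOSAGE_FORM_VARIANTS = {
    "film-coated tablet": (
        "film-coated tablet",
        "film coated tablet",
        "tablet, film coated",
        "fc tablet",
    ),
}

# Matches a strength such as "20 mg" or "20mg" (units limited to mg and g here).
STRENGTH = re.compile(r"(\d+(?:\.\d+)?)\s*(mg|g)\b", re.IGNORECASE)


def harmonize(description, amount="", size=""):
    """Map one raw row to (dosage_form, strength_value, strength_unit)."""
    # The strength may sit in any of the three columns, so search them together.
    combined = " ".join(part for part in (description, amount, size) if part)
    match = STRENGTH.search(combined)
    value = float(match.group(1)) if match else None
    unit = match.group(2).lower() if match else None

    # Remove any embedded strength, then look for a known variant spelling.
    cleaned = STRENGTH.sub("", description).lower().strip()
    dosage_form = None
    for preferred, variants in DOSAGE_FORM_VARIANTS.items():
        if any(variant in cleaned for variant in variants):
            dosage_form = preferred
            break
    return dosage_form, value, unit


rows = [
    ("Film-coated tablet", "20", "mg"),
    ("FC tablet", "20 mg", ""),
    ("20 mg tablet, film coated", "", ""),
    ("Red film coated tablet", "20 mg", ""),
]
harmonized = [harmonize(*row) for row in rows]
```

In this sketch every row resolves to the same dosage form, value, and unit. The color in the last row is simply dropped; a fuller model would probably capture it as a separate attribute rather than discard it.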
A useful model to understand the different types of vocabulary and their respective applications is the semantic spectrum. The semantic spectrum describes the logical rigor of a vocabulary’s underlying knowledge representation system. This, in turn, informs the capabilities and limitations of the vocabulary when it is used by consuming systems.
Increasing logical rigor enables powerful applications, particularly in terms of machine-based manipulation of data, but at the cost of a vocabulary that requires significantly greater skill and effort to build and maintain. From least to most rigorous, the main types are:
- Enumerated lists: Glossaries and controlled vocabularies
- Taxonomies: Concepts are organized into a hierarchical structure of broader and narrower meanings
- Thesauri: Includes non-hierarchical relationships such as equivalence (for example synonyms, acronyms and abbreviations) and association
- Ontologies: Rigorous models that strictly follow rules of description logic and are encoded in a formal ontology language
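To make the first three levels of the list above concrete, here is a minimal sketch using plain Python data structures. All terms, relationships, and function names are hypothetical and chosen only for illustration. Ontologies are left out because they would normally be written in a formal ontology language such as OWL rather than in ad hoc dictionaries.

```python
# Enumerated list: the set of allowed terms, with no structure between them.
CONTROLLED_TERMS = {
    "dosage form", "tablet", "film-coated tablet", "capsule", "film coating",
}

# Taxonomy: each term points to its single broader term.
BROADER = {
    "tablet": "dosage form",
    "capsule": "dosage form",
    "film-coated tablet": "tablet",
}

# Thesaurus: equivalence (variant -> preferred term) and association.
SYNONYMS = {
    "fc tablet": "film-coated tablet",
    "film coated tablet": "film-coated tablet",
}
RELATED = {"film-coated tablet": {"film coating"}}


def preferred(term):
    """Resolve a label to its preferred term, or raise if it is not controlled."""
    key = term.strip().lower()
    key = SYNONYMS.get(key, key)
    if key not in CONTROLLED_TERMS:
        raise KeyError(f"not a controlled term: {term!r}")
    return key


def ancestors(term):
    """Return the chain of broader terms, nearest first."""
    chain = []
    current = preferred(term)
    while current in BROADER:
        current = BROADER[current]
        chain.append(current)
    return chain


def narrower_terms(term):
    """Return the term itself plus every term below it in the hierarchy."""
    root = preferred(term)
    return {t for t in CONTROLLED_TERMS if t == root or root in ancestors(t)}
```

With these definitions, a search for "tablet" could be expanded with `narrower_terms("tablet")` so that it also finds records tagged "film-coated tablet", including those originally entered as "FC tablet". Each step up the spectrum adds this kind of capability, and also adds relationships that someone has to curate.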
Finally, it’s worth pointing out that vocabularies in isolation provide little value. Their value is realized when they are used to describe an organization’s data assets. In conjunction with other technologies, this enables data integration, data lifecycle management, search and other capabilities that create business value. An effective metadata development and governance strategy takes into account both the generation and consumption of data and the specific needs and constraints of the relevant systems and users. This guides modeling at the appropriate level of complexity to meet current needs and provides a structure that can support change as business and technology evolve.
This blog was penned by John Tulinsky, PhD, Senior Consultant at LabAnswer. LabAnswer is a proud supporter of Denodo DataFest.
- Controlled Vocabularies, Taxonomies, Thesauri and Ontologies for Knowledge Management: A Primer - January 11, 2017