R&D Data Cloud: Moving Your Digital Lab Beyond SDMS

May 11, 2021

Introduction

In 2007, Nokia ruled the mobile phone market as its category leader. Its flip phones, among the first to support text messaging and limited video, were a worldwide phenomenon. And yet, Nokia hardly noticed an emerging threat: Apple was about to release the iPhone, a service-modeled platform that depended on multiple application developers sharing and co-existing in a more open digital landscape. Within six years, Apple outsold Nokia 5-to-1, and the iPhone went on to be the fastest-adopted device in history.

We foresee a similar paradigm shift in life sciences R&D when comparing legacy, silo-prone data storage and management to an open, cloud-native, data-centric model. In this blog, we will compare life sciences’ traditional data management choice, the Scientific Data Management System (SDMS), to the emerging R&D Data Cloud.

Let’s first understand what an SDMS is and what an R&D Data Cloud is.

What is an SDMS, anyway?

An SDMS imitates a filing cabinet using software. It captures, catalogs, and archives all versions of a data file from scientific instruments (HPLCs, mass specs, flow cytometers, sequencers) and from scientific applications like LIMS, ELNs, or analysis software. Specialties of an SDMS include rapid access, compliance with regulatory requirements, and specific workflows around data management. It’s certainly an improvement over the ’90s-era practice of having your quality team maintain paper reports!

Essentially, an SDMS is a file server and data warehouse. Unfortunately, for many organizations an SDMS ends up as a "black hole" for experimental data that's carelessly tossed in and quickly forgotten.

What is an R&D Data Cloud?

The R&D Data Cloud embraces several fundamentally different design and architectural principles compared to SDMS.

First, it is data-centric. Instead of focusing only on storing the data, it focuses on the full life cycle of the data and the full-stack data experience: data acquisition from disparate sources, harmonization into vendor-agnostic formats, data processing and pipelining to facilitate data flow, and data preparation and labeling for data science.

Data does not get stuck here; the goal is to present and deliver the data to the best-of-breed applications where it can be analyzed.

Second, it is cloud-native. Instead of treating the cloud as just another data center, it uses cloud services natively for traceability, scalability, security, and portability into private clouds.

The Problem

Let's compare the two by first understanding the challenges life sciences R&D is facing.

Data volumes in biology, chemistry, materials science, agriculture, and other common R&D fields continue to explode. The zettabyte era, in which total global data is measured in zettabytes, began in 2016, and that total is projected to expand 175 times by 2025. Systemic complexity, data heterogeneity, and interface discrepancies increase as biological targets become tougher to drug and regulations multiply. Rising complexity and data volume present difficulties for legacy systems: outmoded data structures, physical storage limitations, and equipment obsolescence contribute to a trend we’ve observed. R&D organizations are moving away from Scientific Data Management Systems (SDMS) and towards cloud-based or cloud-native solutions.

Now consider the current capabilities of an SDMS and how its limitations might impair critical innovations instead of facilitating an open, interactive data future.

#1: Integration issues

SDMS solutions traditionally rely on files exported from instrument control or processing software. However, labs are full of data sources that do not produce files, such as blood gas analyzers and chromatography data systems. These data sources require more sophisticated, non-file-based connections, such as IoT-based or software-based programmatic integrations.
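
To make the contrast concrete, here is a minimal sketch, not a description of any vendor's connector. `collect_exported_files` picks up whatever files instrument software has dropped into a folder, while `poll_instrument` pulls readings directly from a data source through a client object. `FakeBloodGasAnalyzer` and its `read_new_results` method are hypothetical stand-ins for a real device or software API.

```python
import tempfile
from pathlib import Path


def collect_exported_files(export_dir: Path, suffix: str = ".csv") -> list[Path]:
    """File-based integration: pick up files the instrument software exported."""
    return sorted(p for p in export_dir.iterdir() if p.suffix == suffix)


class FakeBloodGasAnalyzer:
    """Hypothetical device client; a real one would talk to hardware or an API."""

    def __init__(self) -> None:
        self._buffer = [{"pH": 7.41, "pCO2": 40.2}, {"pH": 7.38, "pCO2": 42.0}]

    def read_new_results(self) -> list[dict]:
        # Hand over results not yet delivered, then clear the buffer.
        results, self._buffer = self._buffer, []
        return results


def poll_instrument(client: FakeBloodGasAnalyzer) -> list[dict]:
    """Programmatic integration: pull readings directly, no exported file needed."""
    return [{"source": "blood_gas_analyzer", **r} for r in client.read_new_results()]


with tempfile.TemporaryDirectory() as tmp:
    export_dir = Path(tmp)
    (export_dir / "run_001.csv").write_text("sample,area\nA,1.0\n")
    (export_dir / "notes.txt").write_text("not an instrument export")
    exported = [p.name for p in collect_exported_files(export_dir)]

readings = poll_instrument(FakeBloodGasAnalyzer())
print(exported)
print(readings)
```

The file-based path only ever sees what someone or something exported; the programmatic path can reach sources that never write a file at all.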

Below is a short analysis of file-based integration used in SDMS solutions versus programmatic integrations used by the Tetra R&D Data Cloud.

#2: Scaling issues

Once the data is in the SDMS, the SDMS faces a scaling issue. None of the existing SDMS solutions are cloud-native; they are largely on-prem or use cloud storage simply as another data center. This poses two major challenges:

Challenge 1: Difficulty upgrading and maintaining on-prem SDMS

An on-prem SDMS typically requires multiple modules: Oracle databases, application servers, and file storage. Every upgrade requires changes in every module, which is a non-trivial effort.

A cloud-native architecture enables fully automated upgrades and deployments via infrastructure as code and rolling updates, and it continues to benefit from the innovations introduced as the underlying infrastructure platforms evolve.

Challenge 2: Big R&D data

Did we mention zettabytes? As R&D becomes more complex, chemistry and assay outputs on the 100MB scale are giving way to proteomes, genomes (GB), images (TB), and exome maps (PB). Since cloud-native solutions rely on the elasticity and scalability of cloud storage, R&D organizations no longer need to worry about expansion or tedious upgrades.

Cloud-native infrastructure allows tiered storage, so you can assign each data set to a storage class that balances cost against availability. If you’re not going to access microscopy images for a while, park them in an archival tier such as Glacier instead of purchasing extra server racks or hard disks.
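
As a rough illustration of tiering, the sketch below assigns a storage tier from the time since a data set was last accessed. The tier names and day thresholds are assumptions made up for this example; real cloud providers define their own storage classes, pricing, and lifecycle rules.

```python
from datetime import date, timedelta

# Hypothetical tiers and thresholds, for illustration only.
TIERS = [
    (30, "hot"),          # accessed within the last 30 days
    (180, "infrequent"),  # accessed within the last 180 days
]
ARCHIVE_TIER = "archive"  # everything older, e.g. a Glacier-style cold tier


def choose_tier(last_accessed: date, today: date) -> str:
    """Pick a storage tier from the number of days since last access."""
    idle_days = (today - last_accessed).days
    for max_idle_days, tier in TIERS:
        if idle_days <= max_idle_days:
            return tier
    return ARCHIVE_TIER


today = date(2021, 5, 11)
last_access = {
    "hplc_run_0412.json": today - timedelta(days=3),
    "plate_reader_q1.csv": today - timedelta(days=90),
    "microscopy_2019.tiff": today - timedelta(days=700),
}
placement = {name: choose_tier(seen, today) for name, seen in last_access.items()}
print(placement)
```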

#3: Modern R&D requires flexible data flow

Beyond the on-prem vs. cloud-native architecture, the R&D Data Cloud differs from an SDMS in another fundamental way: an SDMS supports very limited data flow or processing, since it’s largely an archival dumping ground. Data flow grinds to a halt.

Without data flow and processing capabilities, an SDMS cannot accommodate the dynamic and sophisticated data flow required by modern R&D organizations. Consider a common R&D workflow:

Scientists produce data and reports from biochemical assays
These results pass through quality checks or enrichment
If the data check out, they are further pushed to ELN or LIMS
If an issue arises, stakeholders should be notified to take action and rectify the data mistake with full traceability
Data is also parsed to enable data visualization and data science

Flexibility is key; R&D organizations should also be able to easily submit the data to multiple destinations, such as both the ELN and the LIMS (and any other data targets), as their process requires. To accomplish this, simple archiving inside an SDMS is not enough. Configurable data processing extracts metadata from the data set, categorizes it with user-defined terms or tags, harmonizes it into vendor-agnostic formats, runs data integrity checks, and then initiates submission of the data to downstream applications.
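
The workflow above can be sketched in a few lines of Python. This is an illustrative outline under our own assumptions, not the Tetra R&D Data Cloud's actual pipeline API: `process_result`, the record layout, and the non-negative-reading quality rule are all hypothetical, and plain lists stand in for the ELN, the LIMS, and the notification channel.

```python
import hashlib
import json
from typing import Callable

Record = dict
Sender = Callable[[Record], None]


def process_result(raw: bytes, tags: list[str], destinations: dict[str, Sender],
                   notify: Sender) -> Record:
    """Extract metadata, tag, checksum, quality-check, and route one result file."""
    payload = json.loads(raw)
    values = payload["values"]
    record = {
        "metadata": {"instrument": payload["instrument"], "assay": payload["assay"]},
        "tags": list(tags),
        "sha256": hashlib.sha256(raw).hexdigest(),  # lets downstream verify integrity
        "values": values,
    }
    # Illustrative quality check: readings must be non-negative numbers.
    passed = all(isinstance(v, (int, float)) and v >= 0 for v in values)
    record["status"] = "delivered" if passed else "needs_review"
    if passed:
        for send in destinations.values():  # fan out to every configured target
            send(record)
    else:
        notify(record)  # flag the problem instead of forwarding bad data
    return record


eln_inbox, lims_inbox, alerts = [], [], []
destinations = {"eln": eln_inbox.append, "lims": lims_inbox.append}

good = json.dumps({"instrument": "plate-reader-1", "assay": "ELISA",
                   "values": [0.12, 0.48]}).encode()
bad = json.dumps({"instrument": "plate-reader-1", "assay": "ELISA",
                  "values": [0.12, -1.0]}).encode()

first = process_result(good, ["project-x"], destinations, alerts.append)
second = process_result(bad, ["project-x"], destinations, alerts.append)
print(first["status"], second["status"], len(eln_inbox), len(lims_inbox), len(alerts))
```

Adding a third destination is a one-line change to the `destinations` mapping, which is the kind of flexibility a pure archive cannot offer.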

The R&D Data Cloud has elevated the strategic importance of data engineering and integration through cloud-native and Docker-based data pipelines. A user can configure data processing after exploring the data, reprocess the files using another data pipeline, and create their own customized data extraction. They can also merge data from multiple sources, dynamically perform quality control or validation of the data sets, and integrate the data into other informatics applications such as ELN/LIMS.

Users of the R&D Data Cloud can develop such data processing pipelines based entirely on their own business logic by writing Python scripts in a self-service fashion.

In short, the R&D Data Cloud fundamentally assumes that data should flow and that this flow merits the same consideration as data capture and data storage. Data flow reinvigorates siloed, stale experimental data and enables fully automated workflows.

#4: An SDMS isn't optimized to support data analytics or queries

SDMS solutions’ static nature makes them largely closed systems; simply putting data together in one place does not make the data discoverable, queryable, or ready for data science and analytics.

To truly leverage the power of the data, it needs to be properly labeled and categorized, and its content needs to be properly extracted. To support API-based queries and data analytics, data needs to be properly prepared, indexed, and partitioned. After all, data scientists can’t do much with a binary file!

The R&D Data Cloud introduces the concept of Intermediate Data Schema (IDS) files, which the following analogy may help explain.

In a nutshell, think about your lab as a big warehouse, where there are multiple parcels of different content, shape, weight, and color (like the data produced by your lab instruments, their software, your CRO/CDMOs, and other data sources). These parcels are heavily packaged in their own unique way, and when you have too many parcels, it becomes difficult to find the ones you need. You can't compare the content within each of these parcels, or make any meaningful observation beyond, “these are all very different things.”

Now imagine that each parcel comes with attached IDS metadata, which describes the parcel’s content in a consistent manner. With IDS, finding what you need is much more efficient, since you do not have to unpack each parcel.

You can also leverage the IDS’s content consistency to compare different parcels; for example:

Which parcels contain more items
Which parcel is the heaviest
Show me all the parcels that contain a bottle of wine from before April 2000
Only select the parcels that contain books with blue covers
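
To show what those comparisons look like once every parcel carries a consistent description, here is a toy sketch. The field names (`weight_kg`, `items`, `bottled`, `cover`) are invented for the analogy and are not the actual IDS schema.

```python
from datetime import date

# Each parcel carries the same descriptive structure, whatever it contains.
parcels = [
    {"id": "P1", "weight_kg": 2.5,
     "items": [{"type": "book", "cover": "blue"}, {"type": "book", "cover": "red"}]},
    {"id": "P2", "weight_kg": 7.0,
     "items": [{"type": "wine", "bottled": date(1998, 9, 1)}]},
    {"id": "P3", "weight_kg": 1.2,
     "items": [{"type": "wine", "bottled": date(2005, 3, 15)},
               {"type": "book", "cover": "blue"},
               {"type": "pen"}]},
]

# Which parcel contains the most items?
most_items = max(parcels, key=lambda p: len(p["items"]))["id"]

# Which parcel is the heaviest?
heaviest = max(parcels, key=lambda p: p["weight_kg"])["id"]

# Parcels that contain a bottle of wine from before April 2000.
old_wine = [
    p["id"] for p in parcels
    if any(i["type"] == "wine" and i["bottled"] < date(2000, 4, 1) for i in p["items"])
]

# Parcels that contain books with blue covers.
blue_books = [
    p["id"] for p in parcels
    if any(i["type"] == "book" and i.get("cover") == "blue" for i in p["items"])
]

print(most_items, heaviest, old_wine, blue_books)
```

None of these questions required opening a parcel; the consistent description was enough.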

Data curation (labeling, categorization, content extraction, harmonization, indexing, partitioning) requires that the data be put through a configurable and orchestrated processing layer, which an SDMS does not have and simply is not designed for (see #3 above). This processing layer is fundamental, since data is never ready to be queried or explored without proper curation and transformation.

In the R&D Data Cloud, all data is automatically harmonized into a vendor-agnostic file format. Because it uses open, data science-friendly file formats like JSON and Parquet, the harmonized data works with some of the most popular search and distributed query frameworks, giving you a highly reusable, flexible platform to view and structure your data as you wish.
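
A minimal sketch of the harmonization idea follows. Both vendor export layouts and the target field names are hypothetical, chosen only to show two differently shaped inputs converging on one vendor-agnostic JSON structure; they are not real instrument formats or the real IDS.

```python
import json


def from_vendor_a(raw: dict) -> dict:
    # Hypothetical vendor A export: retention time in minutes, flat layout.
    return {
        "sample_id": raw["SampleName"],
        "retention_time_s": raw["RT_min"] * 60,
        "peak_area": raw["Area"],
    }


def from_vendor_b(raw: dict) -> dict:
    # Hypothetical vendor B export: retention time in seconds, nested peak data.
    return {
        "sample_id": raw["sample"],
        "retention_time_s": raw["rt_sec"],
        "peak_area": raw["peak"]["area"],
    }


vendor_a_row = {"SampleName": "S-001", "RT_min": 2.5, "Area": 1234.0}
vendor_b_row = {"sample": "S-002", "rt_sec": 151.0, "peak": {"area": 1180.5}}

harmonized = [from_vendor_a(vendor_a_row), from_vendor_b(vendor_b_row)]

# One JSON document per line, a layout many query tools can ingest.
json_lines = "\n".join(json.dumps(row, sort_keys=True) for row in harmonized)
print(json_lines)
```

Once both sources share field names and units, a single query can compare them without knowing which vendor produced which row.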

Keep in mind, too, that even though an SDMS may provide a preview or analysis of an individual run or data set, it is not optimized for aggregated insights, trending, or clustering. The R&D Data Cloud, by contrast, provides accessible, harmonized data through a common software interface, so visualization and analysis tools can act on massive amounts of data from different sources, produced at different time points.
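
As a small illustration of aggregated trending, the sketch below averages a measurement per month across runs from more than one instrument. The run records, instrument names, and `purity_pct` field are made-up example data, and the grouping logic is our own, not a platform feature.

```python
from collections import defaultdict
from statistics import mean

# Made-up harmonized run summaries from two instruments.
runs = [
    {"instrument": "hplc-1", "month": "2021-01", "purity_pct": 98.0},
    {"instrument": "hplc-2", "month": "2021-01", "purity_pct": 99.0},
    {"instrument": "hplc-1", "month": "2021-02", "purity_pct": 97.0},
    {"instrument": "hplc-2", "month": "2021-02", "purity_pct": 96.0},
    {"instrument": "hplc-1", "month": "2021-03", "purity_pct": 95.0},
]


def monthly_trend(records: list[dict]) -> dict[str, float]:
    """Average purity per month, across all instruments, in chronological order."""
    by_month = defaultdict(list)
    for record in records:
        by_month[record["month"]].append(record["purity_pct"])
    return {month: mean(values) for month, values in sorted(by_month.items())}


trend = monthly_trend(runs)
print(trend)
```

A view like this only becomes possible when runs from different sources share one schema; a per-run preview cannot reveal the downward drift.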

Summary

Combining the cloud-native data lake, Docker container-based data harmonization, and productized data acquisition, the Tetra R&D Data Cloud distinguishes itself from a traditional SDMS, in that the R&D Data Cloud supports:

Powerful, productized integrations with data sources
Scalability based on business needs by leveraging cloud-native architecture
Flexible data processing through self-serviceable and modular data pipelines
Harmonization into Intermediate Data Schema (IDS) files for compatibility with a wider range of data analytics tools

As cloud-native services offer more features, accessibility, and post-processing options, it’s time to abandon the SDMS “data flip-phone” for the better, more innovative data cloud option. By adopting the R&D Data Cloud ahead of the curve, you position your company for maximal flexibility and derive deeper analytics from more connected instruments and devices, while benefiting from contemporary best-of-breed technologies to handle big data in your lab digital transformation journey.
