R&D Data Cloud: Moving Your Digital Lab Beyond SDMS

Spin Wang
Mike Tarselli, Ph.D., MBA
May 11, 2021

Introduction

In 2007, Nokia ruled the mobile phone market as its category leader. Their flip phones, among the first to support text messaging and limited video, were a worldwide phenomenon. And yet, Nokia hardly realized an emerging threat: Apple was about to release the iPhone, a service-modeled platform that depended on multiple application developers sharing and co-existing in a more open digital landscape. Within six years, Apple outsold Nokia 5-to-1 and went on to be the fastest-adopted device in history. 

We foresee a similar paradigm shift in life sciences R&D when comparing legacy, silo-prone data storage and management to an open, cloud-native, data-centric model. In this post, we compare life sciences’ traditional data management choice, the scientific data management system (SDMS), to the emerging R&D Data Cloud.

Let’s first define what an SDMS is and what an R&D Data Cloud is.

What is an SDMS, anyway?

An SDMS imitates a filing cabinet using software. It captures, catalogs, and archives all versions of the data files produced by scientific instruments (HPLCs, mass specs, flow cytometers, sequencers) and by scientific applications like LIMS, ELNs, or analysis software. Specialties of an SDMS include rapid access, compliance with regulatory requirements, and specific workflows around data management. It’s certainly an improvement over the ‘90s-era process of running your quality team on paper reports!

Essentially, an SDMS is a file server and data warehouse. Unfortunately, for many organizations an SDMS ends up as a "black hole" for experimental data that's carelessly tossed in and quickly forgotten.

What is an R&D Data Cloud?

The R&D Data Cloud embraces several fundamentally different design and architectural principles compared to SDMS. 

First, it is data-centric. Instead of focusing only on storing the data, it addresses the full life cycle of the data and the full-stack data experience: acquisition from disparate sources, harmonization into vendor-agnostic formats, processing and pipelining to facilitate data flow, and preparation and labeling for data science.

Data does not get stuck here; the goal is to present and deliver the data to the best-of-breed applications where it can be analyzed.

Second, it is cloud-native. Instead of treating the Cloud as just another data center, it leverages cloud services natively for traceability, scalability, security, and portability into private clouds. 

The Problem

Let's compare the two by first understanding the challenges life sciences R&D is facing. 

Data volumes in biology, chemistry, materials science, agriculture, and other common R&D fields continue to explode. The zettabyte era (the zettabyte being a measure of total data) began in 2016, and total data will have expanded 175 times by 2025. Systemic complexity, data heterogeneity, and interface discrepancies increase as biological targets become tougher to drug and regulations multiply. Rising complexity and data volume present difficulties for legacy systems: outmoded data structures, physical storage limitations, and equipment obsolescence contribute to a trend we’ve observed. R&D organizations are moving away from Scientific Data Management Systems (SDMS) and towards cloud-based or cloud-native solutions.

Now consider the current capabilities of SDMS solutions and how their limitations might impair critical innovations rather than facilitate an open, interactive data future.

#1: Integration issues

SDMS solutions traditionally rely on files exported from instrument control or processing software. However, labs are full of data sources that do not produce files, such as blood gas analyzers and chromatography data systems. These data sources require more sophisticated, non-file-based connections, such as IoT-based or software-based programmatic integrations.

Below is a short analysis of file-based integration used in SDMS solutions versus programmatic integrations used by the Tetra R&D Data Cloud.
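
To make the distinction concrete, here is a minimal Python sketch contrasting the two styles. It is illustrative only: `InstrumentClient` and its `read_latest()` method are hypothetical stand-ins for whatever API or IoT connection a given data source exposes, not a real vendor or Tetra interface.

```python
from pathlib import Path


def collect_exported_files(export_dir: Path, seen: set) -> list:
    """File-based integration: pick up files the instrument software exported.

    Only data that has already been written to disk as a file is visible here.
    """
    new_files = [
        p for p in sorted(export_dir.glob("*"))
        if p.is_file() and p.name not in seen
    ]
    seen.update(p.name for p in new_files)
    return new_files


class InstrumentClient:
    """Hypothetical programmatic connection to a source with no file export."""

    def __init__(self, readings: list):
        self._readings = readings
        self._cursor = 0

    def read_latest(self) -> list:
        """Return the readings produced since the previous call."""
        fresh = self._readings[self._cursor:]
        self._cursor = len(self._readings)
        return fresh
```

In the first function, nothing happens until a file lands in the export folder. In the second, the integration asks the data source directly, which is the only option when the source never writes a file.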

#2: Scaling issues

Once the data is brought in, the SDMS faces a scaling issue. None of the existing SDMS solutions is cloud-native; they are largely on-prem or use cloud storage simply as another data center. This poses two major challenges:

Challenge 1: Difficulty upgrading and maintaining on-prem SDMS

An on-prem SDMS typically requires multiple modules: Oracle databases, application servers, and file storage. Every upgrade requires changes in every module, a non-trivial effort.

A cloud-native architecture enables fully automated upgrades and deployments via infrastructure as code and rolling updates, and it can continue to leverage the continuous innovations introduced as these infrastructure platforms evolve.

Challenge 2: Big R&D data

Did we mention zettabytes? As R&D becomes more complex, chemistry and assay outputs on the 100MB scale are giving way to proteomes, genomes (GB), images (TB), and exome maps (PB). Since cloud-native solutions rely on the elasticity and scalability of cloud storage, R&D organizations no longer need to worry about expansion or tedious upgrades.

Cloud-native infrastructure allows tiered storage, providing more flexibility to classify the storage layer to optimize for cost and availability. If you’re not going to access microscopy images for a while, park them in Glacier instead of purchasing extra server racks or hard disks.
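
As a rough sketch of what tiering can look like, the snippet below builds an Amazon S3 lifecycle configuration that moves objects under a prefix to the Glacier storage class after a set number of days. The prefix, rule ID format, and 90-day threshold are assumptions chosen for illustration, not recommended settings.

```python
def glacier_lifecycle_config(prefix: str, days_until_archive: int) -> dict:
    """Build an S3 lifecycle configuration that archives a prefix to Glacier.

    The prefix and threshold are illustrative assumptions, not recommendations.
    """
    if days_until_archive < 0:
        raise ValueError("days_until_archive must be non-negative")
    rule_id = "archive-" + prefix.strip("/").replace("/", "-")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": days_until_archive, "StorageClass": "GLACIER"}
                ],
            }
        ]
    }


# Hypothetical example: archive raw microscopy images after 90 days.
config = glacier_lifecycle_config("microscopy/raw/", 90)
```

A dictionary of this shape could be passed as the `LifecycleConfiguration` argument of boto3's `put_bucket_lifecycle_configuration`, together with the name of your own bucket.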

#3: Modern R&D requires flexible data flow

Beyond the on-prem vs cloud-native architecture, the R&D Data Cloud differs from SDMS in another fundamental way: SDMS supports very limited data flow or processing since it’s largely an archival dumping ground. Data flow grinds to a halt.  

Without these capabilities, SDMS cannot accommodate the dynamic and sophisticated data flow required by modern R&D organizations. Consider a common R&D workflow:

  • Scientists produce data and reports from biochemical assays
  • These results pass through quality checks or enrichment 
  • If the data check out, they are further pushed to ELN or LIMS
  • If an issue arises, stakeholders should be notified to take action and rectify the data mistake with full traceability
  • Data is also parsed to enable data visualization and data science
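
The workflow above can be sketched as a small routing function. Everything named here (the `AssayResult` fields, the acceptance range, and the `eln`, `lims`, and `alerts` destinations) is a hypothetical placeholder rather than an actual ELN, LIMS, or Tetra API.

```python
from dataclasses import dataclass, field


@dataclass
class AssayResult:
    sample_id: str
    signal: float  # e.g. a plate-reader readout, arbitrary units


@dataclass
class Destinations:
    """In-memory stand-ins for downstream systems (all hypothetical)."""
    eln: list = field(default_factory=list)
    lims: list = field(default_factory=list)
    alerts: list = field(default_factory=list)
    audit_log: list = field(default_factory=list)


def passes_quality_check(result: AssayResult, low: float = 0.0, high: float = 100.0) -> bool:
    """Assumed QC rule: the signal must fall inside an accepted range."""
    return low <= result.signal <= high


def route(result: AssayResult, dest: Destinations) -> None:
    """Push good data downstream; raise an alert for bad data. Log both paths."""
    if passes_quality_check(result):
        dest.eln.append(result)
        dest.lims.append(result)
        dest.audit_log.append((result.sample_id, "accepted"))
    else:
        dest.alerts.append(f"QC failed for {result.sample_id}: signal={result.signal}")
        dest.audit_log.append((result.sample_id, "rejected"))


dest = Destinations()
route(AssayResult("S-001", 42.0), dest)   # within range: goes to ELN and LIMS
route(AssayResult("S-002", 250.0), dest)  # out of range: triggers an alert
```

The audit log records both outcomes, which is what gives stakeholders traceability when they go back to rectify a rejected result.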

Flexibility is key: R&D organizations should also be able to easily submit the data to multiple destinations, such as both the ELN and LIMS (and any other data targets), as their process requires. To accomplish this, simple archiving inside an SDMS is not enough. Configurable data processing enables metadata extraction from the data set, proper categorization with user-defined terms or tags, harmonization of data into vendor-agnostic formats, and data integrity checks, and it then initiates the submission of the data to downstream applications.

The R&D Data Cloud has elevated the strategic importance of data engineering and integration through cloud-native and Docker-based data pipelines. A user can configure data processing after exploring the data, reprocess the files using another data pipeline, and create their own customized data extraction. They can also merge data from multiple sources, dynamically perform quality control or validation of the data sets, and integrate the data into other informatics applications such as ELN/LIMS. 

Users of the R&D Data Cloud will be able to develop such data processing pipelines based entirely on their own business logic by writing Python scripts in a self-service fashion.
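
To give a feel for what such a script might contain, here is a minimal sketch of a pipeline built from small, reorderable steps. It does not use the actual Tetra pipeline API; the step functions, the record fields, and the `<project>_<instrument>_<run>` file naming convention are all assumptions made for illustration.

```python
import hashlib
from typing import Callable

Record = dict
Step = Callable[[Record], Record]


def add_checksum(record: Record) -> Record:
    """Integrity check support: fingerprint the raw bytes with SHA-256."""
    return {**record, "sha256": hashlib.sha256(record["raw"]).hexdigest()}


def extract_metadata(record: Record) -> Record:
    """Parse an assumed '<project>_<instrument>_<run>.<ext>' file name."""
    stem = record["filename"].rsplit(".", 1)[0]
    project, instrument, run = stem.split("_")
    return {**record, "project": project, "instrument": instrument, "run": run}


def add_tags(record: Record) -> Record:
    """Categorize the record with tags derived from its metadata."""
    return {**record, "tags": sorted({record["project"], record["instrument"]})}


def run_pipeline(record: Record, steps: list) -> Record:
    """Apply each step in order; swapping or adding steps reconfigures the flow."""
    for step in steps:
        record = step(record)
    return record


processed = run_pipeline(
    {"filename": "oncology_hplc_007.csv", "raw": b"time,signal\n0,1.2\n"},
    [add_checksum, extract_metadata, add_tags],
)
```

Because each step takes a record and returns a new one, reprocessing the same file with a different pipeline is just a matter of passing a different list of steps.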

In short, the R&D Data Cloud fundamentally assumes that data should flow and that this flow merits equal consideration to data capture and data storage. The data flow reinvigorates the siloed and stale experimental data and enables fully automated workflows. 

#4: An SDMS isn't optimized to support data analytics or queries

SDMS solutions’ static nature makes them largely closed systems; simply putting data together in one place does not make the data discoverable, queryable, or ready for data science and analytics. 

To truly leverage the power of the data, it needs to be properly labeled and categorized, and its content needs to be properly extracted. To support API-based queries and data analytics, data needs to be properly prepared, indexed, and partitioned. After all, data scientists can’t do much with a binary file! 

The R&D Data Cloud introduces the concept of Intermediate Data Schema files, aka IDS files, which you may find helpful to understand via the following analogy.

In a nutshell, think about your lab as a big warehouse, where there are multiple parcels of different content, shape, weight, and color (like the data produced by your lab instruments, their software, your CRO/CDMOs, and other data sources). These parcels are heavily packaged in their own unique way, and when you have too many parcels, it becomes difficult to find the ones you need. You can't compare the content within each of these parcels, or make any meaningful observation beyond, “these are all very different things.” 

Now imagine that each parcel comes with attached IDS metadata, which describes the parcel’s content in a consistent manner. With IDS, finding what you need is much more efficient, since you do not need to unpack each parcel.

You can also leverage the IDS’s content consistency to compare different parcels; for example:

  • Which parcels contain more items 
  • Which parcel is the heaviest 
  • Show me all the parcels that contain a bottle of wine from before April 2000
  • Only select the parcels that contain books with blue covers
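
The parcel questions above can be sketched as queries over a list of consistent metadata records. The field names and values are invented for this analogy and do not reflect any real IDS.

```python
from datetime import date

# Each parcel carries metadata in the same shape (hypothetical fields).
parcels = [
    {"id": "P1", "weight_kg": 2.5, "items": [
        {"type": "book", "cover": "blue"},
        {"type": "book", "cover": "red"},
    ]},
    {"id": "P2", "weight_kg": 7.0, "items": [
        {"type": "wine", "vintage": date(1998, 9, 1)},
    ]},
    {"id": "P3", "weight_kg": 1.2, "items": [
        {"type": "wine", "vintage": date(2005, 3, 15)},
        {"type": "book", "cover": "blue"},
        {"type": "pen"},
    ]},
]

# Which parcel contains the most items?
most_items = max(parcels, key=lambda p: len(p["items"]))

# Which parcel is the heaviest?
heaviest = max(parcels, key=lambda p: p["weight_kg"])

# Parcels that contain a bottle of wine from before April 2000.
old_wine = [
    p["id"] for p in parcels
    if any(i["type"] == "wine" and i["vintage"] < date(2000, 4, 1) for i in p["items"])
]

# Parcels that contain books with blue covers.
blue_books = [
    p["id"] for p in parcels
    if any(i["type"] == "book" and i.get("cover") == "blue" for i in p["items"])
]
```

None of these queries opens a parcel. They only read the attached metadata, which is the point of the analogy.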

Data curation (labeling, categorization, content extraction, harmonization, indexing, partitioning) requires that the data be put through a configurable and orchestrated processing layer, which an SDMS does not have and is simply not designed for (see #3 above). This processing layer is fundamental, since data is never ready to be queried or explored without proper curation and transformation.

In the R&D Data Cloud, all data is automatically harmonized into a vendor-agnostic file format. Because it uses open, data science-friendly file formats like JSON and Parquet, the data can be combined with some of the most popular search and distributed query frameworks, providing a highly reusable, flexible platform to view and structure your data as you wish.
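
As a simple sketch of harmonization, the example below maps two made-up vendor outputs onto one common structure and serializes it as JSON. The vendor field names, the target fields, and the unit conversion are assumptions for illustration and are not an actual IDS definition.

```python
import json


def harmonize_vendor_a(raw: dict) -> dict:
    """Hypothetical vendor A: absorbance as a string under 'Abs', in AU."""
    return {
        "sample_id": raw["SampleName"],
        "measurement": "absorbance",
        "value": float(raw["Abs"]),
        "unit": "AU",
        "source": "vendor_a",
    }


def harmonize_vendor_b(raw: dict) -> dict:
    """Hypothetical vendor B: absorbance as a number in milli-AU."""
    return {
        "sample_id": raw["sample"],
        "measurement": "absorbance",
        "value": raw["reading_mAU"] / 1000.0,  # convert mAU to AU
        "unit": "AU",
        "source": "vendor_b",
    }


records = [
    harmonize_vendor_a({"SampleName": "S-001", "Abs": "0.52"}),
    harmonize_vendor_b({"sample": "S-002", "reading_mAU": 480}),
]

# Same keys and units regardless of vendor, ready for indexing or querying.
harmonized_json = json.dumps(records, indent=2)
```

Once every record shares the same keys and units, downstream tools can query across vendors without knowing which instrument produced which value. Writing the same records to Parquet would need an additional library such as pyarrow.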

Also keep in mind that even though an SDMS may provide a preview or analysis of one individual run or data set, it is not optimized for aggregated insights, trending, or clustering. The R&D Data Cloud, by contrast, provides accessible and harmonized data through a common software interface for visualization and analysis tools that act upon massive amounts of data from different sources, produced at different time points.
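
To illustrate the difference between viewing one run and aggregating across many, here is a small sketch that groups harmonized results from several runs and instruments. The instrument names, field names, and values are invented for the example.

```python
from collections import defaultdict
from statistics import mean

# Harmonized results from different instruments and days (made-up values).
runs = [
    {"instrument": "HPLC-1", "day": 1, "retention_min": 4.0},
    {"instrument": "HPLC-1", "day": 2, "retention_min": 4.2},
    {"instrument": "HPLC-2", "day": 1, "retention_min": 5.0},
    {"instrument": "HPLC-2", "day": 2, "retention_min": 5.0},
]


def mean_by_instrument(rows: list, field_name: str) -> dict:
    """Aggregate one numeric field across all runs, grouped by instrument."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["instrument"]].append(row[field_name])
    return {instrument: mean(values) for instrument, values in grouped.items()}


averages = mean_by_instrument(runs, "retention_min")
```

This kind of grouping only works because every run exposes the same field names, whichever instrument or day it came from.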

Summary

Combining the cloud-native data lake, Docker container-based data harmonization, and productized data acquisition, the Tetra R&D Data Cloud distinguishes itself from a traditional SDMS, in that the R&D Data Cloud supports:

  • Powerful, productized integrations with data sources
  • Scalability based on business needs by leveraging cloud-native architecture 
  • Flexible data processing through self-serviceable and modular data pipelines 
  • Harmonization into the Intermediate Data Schema (IDS), providing a wider range of compatibility across data analytics tools

As cloud-native services offer more features, accessibility, and post-processing options, it’s time to abandon the SDMS “data flip-phone” for the better, more innovative data cloud option. By adopting the R&D Data Cloud ahead of the curve, you position your company for maximal flexibility and derive deeper analytics from more connected instruments and devices, while benefiting from contemporary best-of-breed technologies to handle big data in your lab digital transformation journey.

