How an IDS Complements Raw Experimental R&D Data in the Digital Lab

Spin Wang | March 26, 2021

Our previous post (The Digital Lab Needs an Intermediate Data Schema (IDS): a First Principle Analysis) established the need for an intermediate data file format for Digital Labs and explained why we chose JSON + Parquet. Now let's dive deeper into its use cases and discuss what an IDS is intended for, and what it's not.

In a nutshell, think of your lab as a big warehouse filled with parcels of different contents, shapes, weights, and colors (like the data produced by your lab instruments, their software, your CRO/CDMOs, and other data sources). Because these parcels are heavily packaged, each in its own unique way, it's really difficult to find the ones you need once there are too many of them. You can't compare the contents of these parcels, or make any meaningful observation beyond "these are all very different things."

Now imagine each parcel has a sidekick called an IDS, attached to the cardboard box and describing the parcel's contents in a consistent manner. With the IDS, it's much easier to find what you need: you don't have to unpack each parcel, you can simply search through the sidekick.

You can also leverage the sidekicks' consistent content to compare different parcels; for example:

  • Which parcels contain the most items?
  • Which parcel is the heaviest?
  • Show me all the parcels that contain a bottle of wine from before April 2000.
  • Only select the parcels that contain books with blue covers.

Enabling a data sidekick

Hopefully the analogy above is helpful. Now let's move from physical packages in a warehouse to data "packages" in a life sciences company. In more technical terms, IDS is a vendor-agnostic, data science-friendly file format that captures all the scientifically meaningful and data science-relevant information from the vendor-specific or vendor-proprietary file formats produced by a fragmented lab data source ecosystem.
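
To make this concrete, here is a minimal sketch of what an IDS-style document might look like, written as JSON from Python. The field names below are purely illustrative, not the actual TetraScience schema; the point is that the same harmonized keys can describe runs from instruments made by different vendors.

```python
import json

# Hypothetical, illustrative IDS-style document (not the real schema):
# harmonized keys describe the instrument, method, user, and results,
# regardless of which vendor produced the RAW file.
ids_example = {
    "ids_type": "liquid-chromatography",
    "instrument": {"vendor": "VendorA", "model": "LC-1000", "serial_number": "SN-0042"},
    "method": {"name": "assay_method_v3", "run_time_minutes": 12.5},
    "user": "jdoe",
    "results": [
        {"peak_name": "analyte_1", "retention_time": 4.21, "area": 15234.7},
        {"peak_name": "analyte_2", "retention_time": 7.88, "area": 9821.3},
    ],
}

# Serialize to JSON; large numerical tables would typically go to Parquet instead.
print(json.dumps(ids_example, indent=2))
```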

IDS serves several important functions: data search, data flow/liquidity, and data analytics, which we'll discuss below.

IDS enables search 

With an IDS, data scientists can leverage API-based or full-text search to quickly locate an instrument RAW file. As in the package analogy, all data fields are extracted and indexed.
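
Because IDS documents are plain JSON, even a simple script can query them on harmonized fields. The sketch below scans a local folder of IDS files for a given instrument serial number; the directory, field names, and serial number are hypothetical, and in practice the Tetra Data Platform indexes these fields for API-based and full-text search rather than scanning files.

```python
import json
from pathlib import Path

def find_runs(ids_dir: str, instrument_serial: str):
    """Return paths of IDS JSON documents matching an instrument serial number."""
    matches = []
    for path in Path(ids_dir).glob("*.json"):
        doc = json.loads(path.read_text())
        # 'instrument.serial_number' is an illustrative harmonized field.
        if doc.get("instrument", {}).get("serial_number") == instrument_serial:
            matches.append(path)
    return matches

# Locate every run acquired on a specific instrument, then use the match
# to retrieve the corresponding RAW file.
print(find_runs("./ids_documents", "SN-0042"))
```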

IDS enables data liquidity and data flow

Data can now flow to third-party systems without requiring each of these systems to parse the vendor-specific file formats. This fundamentally breaks down internal data silos and enables more seamless information sharing among instruments, informatic applications, analytics tools and contract organizations (CRO/CDMOs).
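
As a rough illustration of this liquidity, a downstream system (an ELN, LIMS, or analytics tool) can simply receive the harmonized IDS JSON over HTTP instead of parsing a vendor-proprietary binary. The endpoint URL and file name below are hypothetical.

```python
import json
import requests  # third-party HTTP client

# Load a harmonized IDS document produced from a vendor RAW file.
with open("run_001.ids.json") as f:
    ids_doc = json.load(f)

# Push it to a hypothetical downstream system's import endpoint.
response = requests.post(
    "https://eln.example.com/api/experiments/import",
    json=ids_doc,
    timeout=30,
)
response.raise_for_status()
```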

IDS enables cloud-based, distributed data processing and analytics at massive scale

The majority of existing lab informatics software solutions were designed as Windows-based on-premises applications. For example, in order to analyze Mass Spec data, a scientist would drag and drop 20-100 RAW files into their favorite analysis software program, spend hours processing the data, and generate a report. This workflow consumes time unnecessarily and limits the potential for data reuse and sharing.

Now, imagine you need to analyze data using big data technologies like Hadoop, Spark, or others. You wouldn’t be able to effectively use those big data tools while manually managing RAW files. However, IDS (JSON + Parquet) makes big data analytics possible, and the Tetra Data Platform will store, partition and index the IDS files in anticipation of such use cases.
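
As a minimal sketch of what this enables, assuming IDS result tables have been written to partitioned Parquet (the S3 path and column names below are hypothetical), Spark can aggregate across thousands of runs in parallel without opening a single vendor RAW file:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ids-analytics").getOrCreate()

# Read all Parquet partitions of harmonized peak results (hypothetical path).
peaks = spark.read.parquet("s3://example-ids-bucket/parquet/peaks/")

# Average peak area per instrument for one analyte, across every run.
summary = (
    peaks.filter(F.col("peak_name") == "analyte_1")
         .groupBy("instrument_serial_number")
         .agg(F.avg("area").alias("mean_area"), F.count("*").alias("n_runs"))
)
summary.show()
```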

Will IDS replace proprietary formats? 

Not at all. IDS is not a replacement for vendor-specific formats. Organizations should always save these RAW files so that scientists can:

  • Understand the nuances of the content
  • Potentially import the RAW file back to the vendor software
  • Regenerate the IDS for further information extraction or error correction

IDS makes it easier to find the RAW file and makes large-scale data analysis much more efficient, since the RAW file does not need to be opened. Just as in the parcel-and-sidekick analogy, the sidekick is not meant to replace the parcel, only to make it more useful. In fact, the Tetra Data Platform first ingests RAW data and then triggers RAW-to-IDS conversion, or harmonization, using data pipelines.

Can IDS be used for archiving your data? 

This is essentially a variation of the question above, but usually asked in the context of experimental data management. 

Let's first define "archive." In the context of experimental R&D data, an instrument typically produces "RAW" data - the points, ramps, intensities, depths, counts, and response factors produced by the detector or sensor - which is then processed by instrument control/analysis software to produce an "Analysis Result." Sometimes the "Analysis Result" is saved in the same "RAW" file; for the purposes of this article, we'll refer to both as RAW data. RAW data is most often stored in a vendor-specific format, a vendor-proprietary format, or a non-file-based data system.

Archiving RAW data means creating a backup from the instrument software or analysis software as a file, stashing it somewhere, and then, when needed, restoring it back to the original system as if the data had never left the source system. (AWS S3 Glacier Deep Archive, for example, is a popular, secure, durable, and extremely low-cost Amazon S3 storage class for data archiving and long-term backup.)
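
For illustration only, here is a minimal sketch of that archival pattern using boto3, the AWS SDK for Python; the bucket, key, and file names are hypothetical, and this is not TDP's actual archival mechanism.

```python
import boto3  # AWS SDK for Python

# Copy a RAW file to S3 under the Glacier Deep Archive storage class
# for long-term, low-cost retention. Bucket and key are hypothetical.
s3 = boto3.client("s3")
s3.upload_file(
    Filename="run_001.raw",
    Bucket="example-raw-archive",
    Key="instrument-a/2021/03/run_001.raw",
    ExtraArgs={"StorageClass": "DEEP_ARCHIVE"},
)
```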

Thus, the answer is: No, IDS itself isn't intended for long-term archival.

In most cases, it's not possible to convert a vendor-proprietary, often binary, format into JSON and Parquet and then re-import the data into the source system without loss of information.

Moreover, the goal of IDS is not to capture everything in the RAW file; in that sense, it's lossy. Therefore, IDS files alone are not sufficient to fulfill experimental data archival requirements. However, as described in the previous section, the content in IDS files serves as an abstraction layer that greatly augments search, liquidity, and analytics.

IDS helps you archive and manage your data

Don't get us wrong: we believe data archival is extremely important. GxP compliance, patient safety, plumbing "dark data" for new insights: in these use cases and more, data archival critically impacts scientific success. The Tetra Data Platform (TDP) archives original instrument RAW files to the cloud with full traceability and data integrity. You can easily restore processed versions and track lineage back to the original instrument or instrument software.

On top of this, having IDS as a harmonization layer makes the archived data much easier to locate and the information stored in these RAW files much more accessible. Vendor and third-party binaries can be searched by file name or extension, but TDP's IDS enables users to search on granular fields like experiment ID, user name, method parameter, or instrument serial number; everything is indexed for faster and easier discovery.

For example, take a look at our Empower Cloud Archival App, which leverages IDS to help you search your Waters Empower projects archived in the Tetra Data Platform.

Hopefully we've helped you understand what happens to your irreplaceable RAW files (we save them!) and how IDS leads to better analytics, facile search, and ready indexing. In other words, IDS provides everything you need to track, find, and sort the heterogeneous and invaluable "packages" in your R&D data warehouse.

If you'd like to dive deeper into the Tetra Data Platform, we recommend this whitepaper covering many of its key capabilities.

As always, follow TetraScience for ongoing updates on experimental R&D data and related topics:

LinkedIn | Twitter | YouTube
