The Digital Lab Needs an Intermediate Data Schema (IDS): a First Principle Analysis

Spin Wang | April 26, 2020

We believe one of the critical components of the Digital Lab is the flow of data. Without the flow of information, data is siloed and fragmented, and it is nearly impossible to take action based on the information. To enable scalable data flow, the Digital Lab needs an Intermediate Data Schema (IDS).

The IDS is used to decouple the data sources (producers) and data targets (consumers). This decoupling significantly reduces the total cost of ownership for exchanging data in a heterogeneous R&D lab environment. It also enables data science, visualization, and analytics.

IDS is also a stepping stone towards the Allotrope Data Format (ADF). We do not recommend directly converting RAW lab data into ADF, since ADF is still an evolving standard. We need something "intermediate" as the bridge to use what we have now, while future-proofing to easily adapt to the rapid changes.

Authors:

Rae Wu - Life Sciences Data Analyst
Yi Jin - Solution Architect
Evan Anderson - Solution Architect
Kai Wang - Delivery Lead and Sr. Software Engineer
Spin Wang - Co-Founder and CEO

Stay tuned for the second post in this blog series. We will cover our experience and suggestions on how to avoid losing data while transforming in and out of IDS, how to manage and govern IDS and leverage ontologies, and how to create IDS for your own data sets and business logic.

The Digital Lab Landscape

Data Sources
The life sciences R&D lab is full of heterogeneous data sets and diverse interfaces for data acquisition. Instruments and CRO/CDMOs produce a large amount of experimental data.


Fundamentally, these "source" systems are designed to run a task, execute a job, and/or deliver services. Providing the data in an accessible format and interface is not the data sources’ main consideration.

Many instrument manufacturers make their data format proprietary as a barrier against competition, incentivizing users to continue purchasing their software and hardware.

Data Targets
The landscape is similarly heterogeneous on the data target side.


Heterogeneity is inherent to the scientific R&D process, which requires data and information to be handled according to its particular domain and context. Since data is consumed in a very specific context, many data targets (e.g., ELN, LIMS) require data to be provided in their own format or through their provider's APIs.

In such a many-to-many data landscape, it is essential to use an Intermediate Data Schema (IDS) to harmonize heterogeneous data and decouple data sources from data targets.

As the following diagram shows, if every data source has to communicate directly with every data target, N × M point-to-point connections (data flows or translations) are needed. However, if data from the sources is first harmonized into an intermediate format, only N + M connections are needed, which can cut the total cost of ownership by an order of magnitude in a lab with many sources and targets. The IDS approach makes data flow manageable.

Image: N × M point-to-point connections versus N + M connections through an IDS
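To make the N × M versus N + M comparison concrete, here is a minimal arithmetic sketch. The function names and the lab sizes are hypothetical, chosen only to show how quickly the gap grows as sources and targets are added.

```python
def point_to_point(n_sources: int, n_targets: int) -> int:
    """Integrations needed when every source talks to every target directly."""
    return n_sources * n_targets


def via_intermediate(n_sources: int, n_targets: int) -> int:
    """Integrations needed when all data passes through one intermediate schema."""
    return n_sources + n_targets


# Hypothetical lab sizes (N sources, M targets), for illustration only.
for n, m in [(3, 3), (10, 10), (50, 20)]:
    direct = point_to_point(n, m)
    harmonized = via_intermediate(n, m)
    print(f"N={n}, M={m}: direct={direct}, via IDS={harmonized}")
```

For small labs the difference is modest; it only becomes dramatic once there are dozens of sources and targets.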

From first principles, such an IDS format should have the following characteristics:

  • Comprehensive information capture
  • Easy to write and transform data to IDS files
  • Easy to read and transform from IDS files
  • Minimum effort to consume the IDS files out of the box
  • Popular, well supported and exists among an ecosystem of mature tooling
  • Decoupling data sources and data targets to avoid vendor lock in

Intermediate Data Schema (IDS)

Such a heterogeneous data landscape presents an exciting challenge, but not a new one. The broader tech community has faced a similar challenge since the Internet and the web started to dominate the world more than 20 years ago.

"JSON has taken over the world. Today, when any two applications communicate with each other across the internet, odds are they do so using JSON." - The Rise and Rise of JSON

JSON is adopted and supported by Google, Facebook, Twitter, Pinterest, Reddit, Foursquare, LinkedIn, Flickr, etc. Without any exaggeration, it is safe to say that the Internet communicates using JSON.

Here is an example of a JSON file describing a puppy (not just any puppy, her name is Ponyo and you can follow her on Instagram).

Image: Example JSON describing a puppy

Without any prior training, just about anyone can easily understand and read the basic information about this puppy, in this format.

Here at TetraScience, we decided to use the most popular data interchange format of the Internet for the Digital Lab: our Intermediate Data Schema (IDS) is built on top of JSON Schema.

Here are some additional examples:

Image: Example IDS JSON describing an injection


Image: Example IDS JSON describing a gradient table in liquid chromatography


Image: Example IDS JSON describing a cell counter measurement


Inside the IDS, you can capture as much information as possible.
From the top level, this includes:

  • user
  • system
  • sample
  • run
  • system_status
  • experiment
  • method
  • result(s)
  • data cubes
  • logs
  • vendor RAW file
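Because the IDS is built on JSON Schema, top-level sections like the ones listed above can be declared and checked mechanically. The sketch below is only an illustration: the schema is hypothetical (it is not the real IDS definition), the choice of required fields is an assumption, and the `check` function understands just three JSON Schema keywords (`type`, `required`, `properties`). A real pipeline would use a full JSON Schema validator.

```python
# Hypothetical, heavily trimmed schema; NOT the actual TetraScience IDS.
IDS_SKETCH_SCHEMA = {
    "type": "object",
    "required": ["user", "system", "sample", "method"],
    "properties": {
        "user": {"type": "object"},
        "system": {"type": "object"},
        "sample": {
            "type": "object",
            "required": ["id"],
            "properties": {"id": {"type": "string"}},
        },
        "method": {"type": "object"},
        "results": {"type": "array"},
        "datacubes": {"type": "array"},
    },
}

_TYPES = {"object": dict, "array": list, "string": str,
          "number": (int, float), "boolean": bool}


def check(instance, schema, path="$"):
    """Return a list of problems; handles only `type`, `required`, `properties`."""
    expected = schema.get("type")
    if expected is not None:
        ok = isinstance(instance, _TYPES[expected])
        # bool is a subclass of int in Python, so reject it for "number".
        if expected == "number" and isinstance(instance, bool):
            ok = False
        if not ok:
            return [f"{path}: expected {expected}"]
    problems = []
    if isinstance(instance, dict):
        for key in schema.get("required", []):
            if key not in instance:
                problems.append(f"{path}: missing required field '{key}'")
        for key, sub_schema in schema.get("properties", {}).items():
            if key in instance:
                problems.extend(check(instance[key], sub_schema, f"{path}.{key}"))
    return problems


# Invented documents, for illustration only.
good = {"user": {"name": "jdoe"}, "system": {}, "sample": {"id": "S-001"},
        "method": {}, "results": []}
bad = {"user": {}, "sample": {"id": 42}, "results": {}}

print(check(good, IDS_SKETCH_SCHEMA))
for problem in check(bad, IDS_SKETCH_SCHEMA):
    print(problem)
```

The point of the design is that a producer can be rejected at write time with a readable message, rather than a consumer discovering a missing field later.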

Data Cubes

JSON seems great for hierarchical key-value pairs, but what about a numerical matrix? These are pretty common in labs. In this case, we use a data cube. If you are familiar with Allotrope or HDF5, you can see that this is inspired by the concept of data cubes in ADF and data sets in HDF5.


{
  "datacubes": [{
    "name": "Plate Reads",
    "description": "More information (Optional)",
    "mode": "absorbance",
    "another property": "you decide",
    "measures": [{
      "name": "OD_600",
      "unit": "ArbitraryUnit",
      "value": [
        [1.19, 1.05, 1.05],
        [1.11, 0.90, 0.95],
        [1.11, 0.93, 0.95]
      ]
    }],
    "dimensions": [{
      "name": "row",
      "unit": "count",
      "scale": [1, 2, 3]
    }, {
      "name": "column",
      "unit": "count",
      "scale": [1, 2, 3]
    }]
  }]
}
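As a rough illustration of how a consumer might read such a data cube, the sketch below turns the plate-reader example into one flat record per well. It assumes that the i-th entry in `dimensions` indexes the i-th nesting level of each measure's `value` array; that reading is inferred from the example above rather than taken from a specification, and `cube_to_records` is a hypothetical helper name.

```python
import json
from itertools import product

# Same shape as the plate-reader cube above, trimmed to the fields used here.
ids = json.loads("""
{
  "datacubes": [{
    "name": "Plate Reads",
    "measures": [{
      "name": "OD_600",
      "unit": "ArbitraryUnit",
      "value": [[1.19, 1.05, 1.05], [1.11, 0.90, 0.95], [1.11, 0.93, 0.95]]
    }],
    "dimensions": [
      {"name": "row", "unit": "count", "scale": [1, 2, 3]},
      {"name": "column", "unit": "count", "scale": [1, 2, 3]}
    ]
  }]
}
""")


def cube_to_records(cube):
    """Yield one flat dict per cell: a coordinate per dimension plus each measure."""
    dims = cube["dimensions"]
    index_ranges = [range(len(d["scale"])) for d in dims]
    for idx in product(*index_ranges):
        record = {d["name"]: d["scale"][i] for d, i in zip(dims, idx)}
        for measure in cube["measures"]:
            cell = measure["value"]
            for i in idx:  # descend one nesting level per dimension (assumed layout)
                cell = cell[i]
            record[measure["name"]] = cell
        yield record


records = list(cube_to_records(ids["datacubes"][0]))
print(len(records), "cells")
print(records[0])
```

Each record is an ordinary dictionary, so the result drops straight into tabular tools.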

You can use data cubes to model many types of experimental data sets, such as:

INSTRUMENT           MEASURE                  DIMENSION
Chromatogram         1: detector intensity    1: wavelength
                                              2: retention time
Raman Spectroscopy   1: intensity             1: wavenumber shift
Plate Reader         1: absorbance            1: row position
                     2: concentration         2: column position
Mass Spectrometer    1: intensity             1: mass charge ratio
                                              2: time

We do not recommend creating JSON files larger than 100MB. Instead, consider using File Pointers.

File Pointers

JSON is not suitable for storing large data sets. We do not have to force everything into JSON either. Leveraging the cloud and other purpose-built, efficient, and analytics-ready file formats, like Apache Parquet and HDF5, you can easily reference large binary files from an IDS.

{
 "result": {
   "cell": {
     "viability": {
       "value": 0.2,
       "unit": "Percent"
     }
   }
 },
 "image": {
   "type": "s3file",
   "fileKey": "uuid/RAW/folder/batch1_day2.parquet",
   "bucket": "ts-dev-datalake",
   "version": "3/L4kqtJl...",
   "fileID": "9ff46a4e-c6a0-4cfe-b6ff-1e7b20bbf561"
 }
}
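A consumer of this document needs to find the file pointers and turn them into something it can fetch. The sketch below is one possible way to do that, under the assumption that any object with `"type": "s3file"` is a pointer and that `bucket` plus `fileKey` identify the object. `find_pointers` and `s3_uri` are hypothetical helpers, and nothing is actually downloaded.

```python
# The file-pointer example above, as a Python dict.
ids = {
    "result": {"cell": {"viability": {"value": 0.2, "unit": "Percent"}}},
    "image": {
        "type": "s3file",
        "fileKey": "uuid/RAW/folder/batch1_day2.parquet",
        "bucket": "ts-dev-datalake",
        "version": "3/L4kqtJl...",  # truncated in the source; kept as-is
        "fileID": "9ff46a4e-c6a0-4cfe-b6ff-1e7b20bbf561",
    },
}


def find_pointers(node, path="$"):
    """Recursively yield (path, pointer) for every dict whose type is 's3file'."""
    if isinstance(node, dict):
        if node.get("type") == "s3file":
            yield path, node
            return
        for key, value in node.items():
            yield from find_pointers(value, f"{path}.{key}")
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from find_pointers(item, f"{path}[{i}]")


def s3_uri(pointer):
    """Build an s3:// URI from a pointer's bucket and fileKey (assumed convention)."""
    return f"s3://{pointer['bucket']}/{pointer['fileKey']}"


for path, pointer in find_pointers(ids):
    print(path, "->", s3_uri(pointer))
```

Keeping only a small pointer in the JSON means the document stays light while the large file lives in a format suited to it.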


Life sciences organizations commonly pick from the following three formats, depending on the use case.

Use Cases by format:

  • Parquet: Big data, data science, cloud computing. Highly recommended for data science and big data applications. Supported by Databricks, Spark, AWS Athena, and Facebook Presto.
  • HDF5: Grouping disparate files in one file, for example packaging images and Excel file(s) together. HDF5 is a portable container format for storing numerical arrays and is well supported across the scientific computing ecosystem.
  • Instrument RAW Format: Using vendor specific software or analysis tools. Also needed when you need to import a file back into the vendor's software. Typically, images and instrument binary files are kept in the original format to leverage domain specific tooling and ecosystem.

The best informatics software uses JSON

Hopefully now you have a basic understanding of JSON, its popularity in the broader technology ecosystem, and its ability to capture scientific data.

Let's now take a look at its compatibility with some of the best informatics software in the Digital Lab space.


JSON is used to send data into some of the best informatics software, like Dotmatics, Riffyn, Benchling, Scilligence, IDBS, and many others.

The best analytics and data sciences tools use JSON

Let's turn our attention to some trends in data science:

All of these tools have seamless native support for JSON.

Another benefit of JSON is that it's interchangeable with tabular formats (those with rows and columns). At first glance, IDS JSON does not look very tabular, but it is actually quite easy to transform JSON into a tabular structure.

Image: Use Pandas to "flatten" or "normalize" the JSON. There are also countless other tools to do this.
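The flattening idea can be sketched without any dependencies. The hand-rolled `flatten` below is a hypothetical stand-in written for illustration; pandas offers `json_normalize` for the same purpose, which similarly produces dotted column names from nested objects. The two run records are invented.

```python
def flatten(node, prefix=""):
    """Collapse nested dicts into one level with dotted keys; other values are kept whole."""
    flat = {}
    for key, value in node.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# Two invented IDS-like documents, for illustration only.
runs = [
    {"sample": {"id": "S-001"},
     "result": {"cell": {"viability": {"value": 91.5, "unit": "Percent"}}}},
    {"sample": {"id": "S-002"},
     "result": {"cell": {"viability": {"value": 88.0, "unit": "Percent"}}}},
]

rows = [flatten(run) for run in runs]
columns = sorted({column for row in rows for column in row})

print(columns)
for row in rows:
    print([row.get(column) for column in columns])
```

Each nested path becomes a column name, so every document becomes one row of a table.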

As a result, some of the best visualization and analytics tools that rely on tabular data and a SQL interface (Dotmatics Vortex, Tibco Spotfire, Tableau) can leverage IDS JSON with minimal overhead, which is one of the important criteria for an Intermediate Data Schema.
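To illustrate the SQL side of this, here is a small sketch that loads already-flattened rows into an in-memory SQLite table and queries them. SQLite is used only because it ships with Python; the table name, column names, and values are invented for the example and do not reflect how any of the tools named above ingest data.

```python
import sqlite3

# Hypothetical rows, already flattened from IDS JSON (values invented).
rows = [
    {"sample_id": "S-001", "instrument": "cell_counter", "viability_pct": 91.5},
    {"sample_id": "S-002", "instrument": "cell_counter", "viability_pct": 88.0},
    {"sample_id": "S-003", "instrument": "cell_counter", "viability_pct": 62.5},
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE measurements (sample_id TEXT, instrument TEXT, viability_pct REAL)"
)
# Named placeholders are filled from each dict's keys.
conn.executemany(
    "INSERT INTO measurements VALUES (:sample_id, :instrument, :viability_pct)",
    rows,
)

# Which samples fall below a (hypothetical) 80% viability threshold?
low = conn.execute(
    "SELECT sample_id FROM measurements WHERE viability_pct < ? ORDER BY sample_id",
    (80,),
).fetchall()
print(low)
conn.close()
```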

Image: Example visualization using Dotmatics Vortex


What about Allotrope Data Format (ADF)?

This is a good question. There have been a lot of conversations about converting all experimental data into ADF files and using Semantic Web RDF graphs. Naturally, we asked the question, "Is ADF a good option as the intermediate data schema/format?"

Let's think about it.

Allotrope Data Format is really HDF5 + Semantic Web RDF. Let's discuss these two separately.

HDF5

HDF5 is a popular binary data format for storing multi-dimensional numerical arrays along with heterogeneous data in one single portable container. In fact, it is one of the recommended file formats in the section above when it comes to saving large numerical data sets and binary files in the IDS JSON. See File Pointers. As a result, HDF5 is already part of the IDS JSON.

RDF

The image below is an RDF representation of an LCUV injection. This only captures one part of the information (10-20% of the experimental information from the run) and, as you can see, it is already getting quite complicated.

Image: RDF representation of 10-20% of the data from an LCUV injection


If we apply the criteria of an Intermediate Data Schema (IDS) to RDF, here are our observations:

WHY RDF IS NOT SUITABLE

  • Comprehensive information capture: Every relationship needs to be carefully designed and governed. There are not enough terms to capture the lab data.
  • Easy to write and transform data to IDS files: Non-trivial effort to create RDF files and validate they are correct.
  • Easy to read and transform from IDS files: Difficult to query from the RDF. Complicated SPARQL queries are not friendly to scientists or data scientists.
  • Minimum effort to consume the IDS files out of the box: Very few ELN/LIMS or visualization tools support semantic web RDF.
  • Popular, well supported and exists among an ecosystem of mature tooling: Limited adoption in the life sciences industry and broader IT ecosystem. Thus there is a small user base and steep learning curve.
  • Decoupling data sources and data targets to avoid vendor lock in: Currently RDF data in ADF files can only be read by the Allotrope Foundation's Java/.NET SDK.

As a result, we do not believe RDF is suitable as the Intermediate Data Schema for R&D labs. Major barriers include the learning curve and the lack of tooling.

BUT....

  • Although RDF is not a good workhorse for day-to-day transformation and processing, or for feeding many of your data targets, RDF graphs can be useful to unlock specialized semantic analysis tools. If you are considering RDF, we recommend you view it as another data target, not as the middle layer in place of the Intermediate Data Schema (IDS).
  • We also recommend that you consider simplifying the semantic web RDF approach and use the Allotrope Leaf Node Pattern, which focuses on the data that scientists care about while preserving its semantic meaning, making the resulting file data science ready. Read more about converting from IDS to Leaf Node RDF in these blog posts: TetraScience ADF Converter and Leaf Node Model.

Summary: use JSON as IDS to harmonize your lab data sets

These are all the reasons we chose JSON as our Intermediate Data Schema (IDS) for the Digital Lab, and we think you should too.

Pick something that is popular, battle-tested, easy to read and write, and supported by a vibrant tech community as well as almost all the websites on the Internet.

Pick something that is inclusive and transferable to other formats when use cases justify further transformation. That's why it's called an Intermediate Data Schema :)

Screen-Shot-2020-04-12-at-1.53.50-PM


Note: the list above is inclusive but not exhaustive.

References

Here is a video illustrating how easy it is to learn and use JSON.

Here is a roundup of curated articles about Semantic Web and RDF for further reading.
