The Digital Lab Needs an Intermediate Data Schema (IDS): a First-Principles Analysis
We believe one of the critical components of the Digital Lab is the flow of data. Without the flow of information, data is siloed and fragmented, and it is nearly impossible to take action based on the information. To enable scalable data flow, the Digital Lab needs an Intermediate Data Schema (IDS).
The IDS is used to decouple the data sources (producers) and data targets (consumers). This decoupling significantly reduces the total cost of ownership for exchanging data in a heterogeneous R&D lab environment. It also enables data science, visualization, and analytics.
IDS is also a stepping stone towards the Allotrope Data Format (ADF). We do not recommend directly converting RAW lab data into ADF, since ADF is still an evolving standard. We need something "intermediate" as the bridge to use what we have now, while future-proofing to easily adapt to the rapid changes.
Stay tuned for the second post in this series. We will cover our experience and suggestions on how to avoid losing data while transforming in and out of IDS, how to manage and govern IDS and leverage ontologies, and how to create IDS for your own data sets and business logic.
The Digital Lab Landscape
The life sciences R&D lab is full of heterogeneous data sets and diverse interfaces for data acquisition. Instruments and CRO/CDMOs produce a large amount of experimental data.
Fundamentally, these "source" systems are designed to run a task, execute a job, and/or deliver services. Providing the data in an accessible format and interface is not the data sources’ main consideration.
Many instrument manufacturers make their data format proprietary as a barrier against competition, incentivizing users to continue purchasing their software and hardware.
There is a similarly heterogeneous landscape on the data target side.
Heterogeneity is inherent in the scientific R&D process, which requires data and information to be handled according to its particular domain and context. Since data is consumed in a very specific context, many data targets (e.g., ELN, LIMS) require data to be provided in their own format or through their provider's APIs.
In such a many-to-many data landscape, it is essential to use an Intermediate Data Schema (IDS) to harmonize heterogeneous data and decouple data sources from data targets.
As the following diagram shows, if every data source has to communicate directly with every data target, N × M point-to-point connections (data flows or translations) are needed. However, if data from the data sources is harmonized into an intermediate format, only N + M connections are needed: one per source and one per target. For a lab with many sources and targets, this can reduce the total cost of ownership by an order of magnitude. The IDS approach makes data flow manageable.
From first principles, such an IDS format should have the following characteristics:
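To make the arithmetic concrete, here is a minimal sketch that counts the translations needed in each approach. The function names and the lab size (20 sources, 10 targets) are hypothetical and chosen only for illustration.

```python
def point_to_point_connections(n_sources: int, m_targets: int) -> int:
    """Without an IDS, every source needs its own translation to every target."""
    return n_sources * m_targets


def ids_connections(n_sources: int, m_targets: int) -> int:
    """With an IDS, each source writes to it once and each target reads from it once."""
    return n_sources + m_targets


# Hypothetical lab: 20 kinds of data sources feeding 10 data targets.
n, m = 20, 10
print(point_to_point_connections(n, m))  # 200
print(ids_connections(n, m))             # 30
```

The gap widens as the lab grows: adding one more data source costs M new translations in the point-to-point model, but only one in the IDS model.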
- Comprehensive information capture
- Easy to write and transform data to IDS files
- Easy to read and transform from IDS files
- Minimum effort to consume the IDS files out of the box
- Popular, well supported and exists among an ecosystem of mature tooling
- Decoupling data sources and data targets to avoid vendor lock in
Intermediate Data Schema (IDS)
Such a heterogeneous data landscape presents an exciting challenge, but not a new one. The broader tech community has faced a similar challenge since the Internet and the web started to dominate the world more than 20 years ago.
"JSON has taken over the world. Today, when any two applications communicate with each other across the internet, odds are they do so using JSON." - The Rise and Rise of JSON
JSON is adopted and supported by Google, Facebook, Twitter, Pinterest, Reddit, Foursquare, LinkedIn, Flickr, etc. Without any exaggeration, it is safe to say that the Internet communicates using JSON.
Here is an example of a JSON file describing a puppy (not just any puppy, her name is Ponyo and you can follow her on Instagram).
Without any prior training, just about anyone can easily understand and read the basic information about this puppy, in this format.
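As a rough illustration of how little effort it takes to read such a document programmatically, the sketch below parses a small puppy description with Python's standard `json` module. Only the name "Ponyo" comes from this post; every other field and value is invented for the example.

```python
import json

# Hypothetical document: only the name "Ponyo" comes from the post.
# All other keys and values are made up for illustration.
puppy_json = """
{
  "name": "Ponyo",
  "species": "dog",
  "age_months": 6,
  "favorite_toys": ["ball", "rope"]
}
"""

puppy = json.loads(puppy_json)   # parse the JSON text into a Python dict
print(puppy["name"])             # Ponyo
print(puppy["favorite_toys"])    # ['ball', 'rope']
```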
Here at TetraScience, we decided to use the most popular data interchange format of the Internet for the Digital Lab, building our Intermediate Data Schema (IDS) on top of JSON Schema.
Here are some additional examples:
Image: Example IDS JSON describing an injection
Image: Example IDS JSON describing a gradient table in liquid chromatography
Image: Example IDS JSON describing a cell counter measurement
Inside the IDS, you can capture as much information as possible.
From the top level, this includes:
- data cubes
- vendor RAW file
JSON seems great for hierarchical key-value pairs, but what about a numerical matrix? These are pretty common in labs. In this case, we use a data cube. If you are familiar with Allotrope or HDF5, you can see that this is inspired by the concept of data cubes in ADF and data sets in HDF5.
{
  "name": "Plate Reads",
  "description": "More information (Optional)",
  "another property": "you decide",
  "measures": [{
    "name": "absorbance",
    "value": [
      [1.19, 1.05, 1.05],
      [1.11, 0.90, 0.95],
      [1.11, 0.93, 0.95]
    ]
  }],
  "dimensions": [
    { "name": "row position", "scale": [1, 2, 3] },
    { "name": "column position", "scale": [1, 2, 3] }
  ]
}
You can use data cubes to model many types of experimental data sets, such as:
- Chromatogram. Measures: (1) detector intensity. Dimensions: (1) wavelength, (2) retention time.
- Raman Spectroscopy. Measures: (1) intensity. Dimensions: (1) wavenumber shift.
- Plate Reader. Measures: (1) absorbance, (2) concentration. Dimensions: (1) row position, (2) column position.
- Mass Spectrometer. Measures: (1) intensity. Dimensions: (1) mass-to-charge ratio.
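The sketch below shows one way a consumer might read a value out of a plate-reader data cube. It assumes a layout in which a `measures` entry holds the value matrix and each `dimensions` entry holds the scale for one axis of that matrix; these key names and the `lookup` helper are assumptions for illustration, not a definitive description of the schema.

```python
import json

# Assumed layout: "measures" holds the matrix, "dimensions" holds one scale
# per matrix axis. Key names are illustrative assumptions.
cube_json = """
{
  "name": "Plate Reads",
  "measures": [
    {"name": "absorbance",
     "value": [[1.19, 1.05, 1.05],
               [1.11, 0.90, 0.95],
               [1.11, 0.93, 0.95]]}
  ],
  "dimensions": [
    {"name": "row position", "scale": [1, 2, 3]},
    {"name": "column position", "scale": [1, 2, 3]}
  ]
}
"""


def lookup(cube: dict, measure_name: str, coordinates: list):
    """Return the measure value at the given coordinate along each dimension."""
    measure = next(m for m in cube["measures"] if m["name"] == measure_name)
    cell = measure["value"]
    for dimension, coordinate in zip(cube["dimensions"], coordinates):
        # Translate the coordinate (e.g. row 2) into a list index via the scale.
        cell = cell[dimension["scale"].index(coordinate)]
    return cell


cube = json.loads(cube_json)
print(lookup(cube, "absorbance", [2, 3]))  # row 2, column 3 -> 0.95
```

Keeping the scales next to the matrix means the numbers stay self-describing: a reader does not need to know the plate layout in advance to interpret a cell.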
We do not recommend creating JSON files larger than 100MB. Instead, consider using File Pointers.
JSON is not suitable for storing large data sets. We do not have to force everything into JSON either. Leveraging the cloud and other purpose-built, efficient, and analytics-ready file formats, like Apache Parquet and HDF5, you can easily reference large binary files from an IDS.
Life sciences organizations commonly pick from the following three formats, depending on the use case.
Use Cases by format:
- Parquet: Big data, data science, cloud computing. Highly recommended for data science and big data applications. Supported by Databricks, Spark, AWS Athena, and Facebook Presto.
- HDF5: Grouping disparate files in one file, for example packaging images and Excel file(s) together. HDF5 is a portable container format for storing numerical arrays, with a mature ecosystem in scientific computing.
- Instrument RAW Format: Using vendor specific software or analysis tools. Also needed when you need to import a file back into the vendor's software. Typically, images and instrument binary files are kept in the original format to leverage domain specific tooling and ecosystem.
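A minimal sketch of the size guideline: embed a data set in the JSON when it is small, and otherwise store only a reference to an external file. The `file_pointer` shape, its `location` and `format` fields, the function name, and the example URI are all hypothetical; writing the external Parquet or HDF5 file itself is out of scope here.

```python
import json

MAX_INLINE_BYTES = 100 * 1024 * 1024  # the 100MB guideline mentioned in the text


def inline_or_pointer(values, location: str, file_format: str,
                      limit: int = MAX_INLINE_BYTES) -> dict:
    """Embed values in the IDS if small enough; otherwise reference an external file."""
    encoded_size = len(json.dumps(values).encode("utf-8"))
    if encoded_size <= limit:
        return {"values": values}
    # Hypothetical pointer shape; the actual file-pointer fields may differ.
    return {"file_pointer": {"location": location, "format": file_format}}


# A small data set stays inline.
print(inline_or_pointer([1, 2, 3], "s3://example-bucket/run-001.parquet", "parquet"))

# With an artificially low limit, the same call returns a pointer instead.
print(inline_or_pointer(list(range(100)), "s3://example-bucket/run-001.parquet",
                        "parquet", limit=10))
```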
The best informatics software uses JSON
Hopefully now you have a basic understanding of JSON, its popularity in the broader technology ecosystem, and its ability to capture scientific data.
Let's now take a look at its compatibility with some of the best informatics software in the Digital Lab space.
JSON is used to send data into some of the best informatics software, like Dotmatics, Riffyn, Benchling, Scilligence, IDBS, and many others.
The best analytics and data sciences tools use JSON
Let's turn our attention to some trends in data science:
- Python and R have quickly become de facto programming languages for data scientists
- Tools like Jupyter Notebook have become data scientists' computational notebook of choice
- Big data frameworks like Spark rose quickly to prominence; Spark has been going strong since its invention in 2009
All of these tools have seamless native support for JSON.
Another benefit of JSON is that it's interchangeable with tabular formats (those with rows and columns). At first glance, IDS JSON does not look very tabular, but it is actually quite easy to transform JSON into a tabular structure.
Image: Use Pandas to "flatten" or "normalize" the JSON. There are also countless other tools to do this.
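To show what "flattening" means, here is a small standard-library sketch that turns nested records into rows with dotted column names, similar in spirit to what pandas offers through `json_normalize`. The `flatten` helper and the sample injection records are hypothetical and only illustrate the idea.

```python
def flatten(record: dict, parent_key: str = "", sep: str = ".") -> dict:
    """Collapse nested dicts into a single-level dict with dotted column names."""
    flat = {}
    for key, value in record.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, column, sep))
        else:
            flat[column] = value
    return flat


# Hypothetical IDS-like records; field names are invented for illustration.
injections = [
    {"sample": {"name": "S1", "volume": {"value": 10, "unit": "uL"}}, "vial": "A1"},
    {"sample": {"name": "S2", "volume": {"value": 5, "unit": "uL"}}, "vial": "A2"},
]

rows = [flatten(record) for record in injections]
columns = list(rows[0])
print(columns)
for row in rows:
    print([row[column] for column in columns])
```

Each nested path becomes one column (for example `sample.volume.value`), so the result can be loaded into any tool that expects rows and columns.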
As a result, some of the best visualization and analytics tools that rely on tabular data and a SQL interface (Dotmatics Vortex, Tibco Spotfire, Tableau) can leverage IDS JSON with minimal overhead, which is one of the important criteria for an Intermediate Data Schema.
Image: Example visualization using Dotmatics Vortex
What about Allotrope Data Format (ADF)?
This is a good question. There have been a lot of conversations about converting all experimental data into ADF files and using Semantic Web RDF graphs. Naturally, we asked the question, "Is ADF a good option as the intermediate data schema/format?"
Let's think about it.
An ADF file is an HDF5 container that also carries RDF graphs describing the data, so we look at each part in turn.
HDF5 is a popular binary data format for storing multi-dimensional numerical arrays along with heterogeneous data in one single portable container. In fact, it is one of the recommended file formats in the section above when it comes to saving large numerical data sets and binary files in the IDS JSON. See File Pointers. As a result, HDF5 is already part of the IDS JSON.
The image below is an RDF representation of an LCUV injection. This only captures one part of the information (10-20% of the experimental information from the run) and, as you can see, it is already getting quite complicated.
Image: RDF representation of 10-20% of the data from an LCUV injection
If we apply the criteria of an Intermediate Data Schema (IDS) to RDF, here are our observations:
WHY RDF IS NOT SUITABLE
- Comprehensive information capture: Every relationship needs to be carefully designed and governed. There are not enough terms to capture the lab data.
- Easy to write and transform data to IDS files: Non-trivial effort to create RDF files and validate they are correct.
- Easy to read and transform from IDS files: Difficult to query from RDF. Complicated SPARQL queries are not friendly to scientists or data scientists.
- Minimum effort to consume the IDS files out of the box: Very few ELN/LIMS or visualization tools support semantic web RDF.
- Popular, well supported and exists among an ecosystem of mature tooling: Limited adoption in the life sciences industry and broader IT ecosystem. Thus there is a small user base and steep learning curve.
- Decoupling data sources and data targets to avoid vendor lock in: Currently RDF data in ADF files can only be read by the Allotrope Foundation's Java/.NET SDK.
As a result, we do not believe RDF is suitable as the Intermediate Data Schema for R&D labs. Major barriers include the learning curve and the lack of tooling.
- Though RDF is not a good workhorse for day-to-day transformation and processing, or for feeding many of your data targets, RDF graphs can be useful to unlock specialized semantic analysis tools. If you are considering RDF, we recommend you view it as another data target, not as the intermediate layer itself.
- We also recommend that you consider simplifying the semantic web RDF approach and use the Allotrope Leaf Node Pattern, which focuses on the data that scientists care about while preserving its semantic meaning, making the resulting file data science ready.
Summary: use JSON as IDS to harmonize your lab data sets
These are all the reasons we chose JSON as our Intermediate Data Schema (IDS) for the Digital Lab, and we think you should too.
Pick something that is popular, battle-tested, easy to read/write, and supported by a vibrant tech community as well as almost all the websites on the Internet.
Pick something that is inclusive and transferable to other formats when use cases justify further transformation. That's why it's called Intermediate Data Schema :)
Note: the list above is inclusive but not exhaustive.
Here is a video illustrating how easy it is to learn and use JSON.
Here is a roundup of curated articles about Semantic Web and RDF for further reading.