Our previous post (The Digital Lab Needs an Intermediate Data Schema (IDS): a First Principle Analysis) established the need for an intermediate data file format for Digital Labs and explained why we chose JSON + Parquet. Now let’s dive deeper into its use cases and discuss what IDS is NOT useful for.
In a nutshell, think about your lab as a big warehouse, where there are multiple parcels of different content, shape, weight, and color (like the data produced by your lab instruments, their software, your CRO/CDMOs, and other data sources). Because these parcels are heavily packaged in their own unique way, when you have too many parcels, it’s really difficult to find the ones you need. You can't compare the content within each of these parcels, or make any meaningful observation beyond “these are all very different things.”
Now imagine that each parcel has a sidekick called IDS, attached to the cardboard box and describing the parcel’s content in a consistent manner. With IDS, it is much easier to find what you need: instead of unpacking each parcel, you just search through the sidekicks.
You can also leverage the sidekick's content consistency to compare different parcels; for example:
Hopefully the analogy above is helpful. Now let's move from physical packages in a warehouse to data "packages" in a life sciences company. In more technical terms, IDS is a vendor-agnostic, data science-friendly file format that captures the scientifically meaningful and data science-relevant information from the vendor-specific or proprietary file formats produced by a fragmented ecosystem of lab data sources.
IDS serves multiple important functions: data search, data flow/liquidity and data analytics, which we'll discuss a bit below.
Since IDS harmonizes the content of the vendor file into JSON, and JSON is well suited for API-based search or full-text search, users can quickly locate an instrument RAW file.
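To make the search idea concrete, here is a minimal sketch in plain Python. It assumes a handful of IDS documents are already available as JSON strings. The field names (`raw_file`, `instrument.vendor`, `method.name`, `user`) and the helpers `get_field` and `find_raw_files` are hypothetical and chosen for illustration only; they are not the actual IDS schema or a Tetra Data Platform API.

```python
import json

# Hypothetical IDS documents; the field names are illustrative, not the real schema.
IDS_DOCUMENTS = [
    json.dumps({
        "raw_file": "run_001.raw",
        "instrument": {"vendor": "ThermoFisher", "type": "mass-spectrometer"},
        "method": {"name": "peptide_mapping"},
        "user": "asmith",
    }),
    json.dumps({
        "raw_file": "run_002.raw",
        "instrument": {"vendor": "ThermoFisher", "type": "mass-spectrometer"},
        "method": {"name": "intact_mass"},
        "user": "bjones",
    }),
]


def get_field(document, dotted_path):
    """Follow a dotted path such as 'instrument.vendor' through nested dicts."""
    value = document
    for key in dotted_path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def find_raw_files(json_documents, criteria):
    """Return the RAW file names whose IDS document matches every criterion."""
    matches = []
    for text in json_documents:
        document = json.loads(text)
        if all(get_field(document, path) == expected
               for path, expected in criteria.items()):
            matches.append(document["raw_file"])
    return matches


# Locate the RAW file for a given user and method without opening any RAW file.
hits = find_raw_files(IDS_DOCUMENTS, {"user": "asmith", "method.name": "peptide_mapping"})
```

In a real deployment the same kind of query would run against an index rather than a Python list, but the principle is the same: the search touches only the JSON sidekick, never the binary parcel.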
For example, imagine you have a binary ThermoFisher Xcalibur Mass Spec RAW file. With an IDS representation of that RAW file, scientists and any third-party data science or software applications can search relevant information about the mass spec run, like:
In a similar example, imagine you have a significant number of microscopy images. With an IDS representation of the images, scientists and any third-party data science or software applications can search relevant information about the image capture, such as:
Data can now flow to third-party systems without requiring each of these systems to parse the vendor file formats. This use case fundamentally breaks down internal data silos and enables much more seamless information sharing and data flow.
The majority of existing lab informatics software solutions were designed as Windows-based on-premise applications. For example, in order to analyze her Mass Spec data, a scientist would drag and drop 20-100 RAW files into her favorite analysis software program, process the data, and generate a report. This is the limit to the functionality available in these apps.
Now imagine you need to work at a much larger scale using Hadoop, Spark, or similar tools. The RAW files are a significant blocker, because each one has to be opened and parsed in its vendor-specific format first. IDS (JSON + Parquet) makes this kind of analysis possible, and the Tetra Data Platform will store, partition, and index the IDS files in anticipation of such use cases.
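As a rough illustration of why a column-oriented layout suits this kind of analysis, the sketch below uses a plain Python dictionary of lists as a stand-in for a Parquet table. The column names and values are made up, and `mean_by_group` is a hypothetical helper; a real workload would read Parquet files with a tool such as Spark rather than in-memory lists.

```python
from collections import defaultdict
from statistics import mean

# Stand-in for a Parquet table: each key is a column, each list holds one value per run.
# Column names and values are invented for illustration.
runs_table = {
    "instrument_id": ["ms-01", "ms-02", "ms-01", "ms-02"],
    "peak_count": [120, 95, 130, 105],
    "run_minutes": [30.0, 45.0, 32.0, 44.0],
}


def mean_by_group(table, group_column, value_column):
    """Average value_column per distinct value of group_column.

    Only the two named columns are read; the rest of the table is never touched.
    """
    grouped = defaultdict(list)
    for group, value in zip(table[group_column], table[value_column]):
        grouped[group].append(value)
    return {group: mean(values) for group, values in grouped.items()}


# Average peak count per instrument across all runs.
average_peaks = mean_by_group(runs_table, "instrument_id", "peak_count")
```

The aggregation never needs the `run_minutes` column, and it never needs the original RAW files; that selective access is what makes the same question tractable across a very large number of runs.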
Not at all. IDS does not replace the vendor-specific formats. Organizations should always save these RAW files so that scientists can:
IDS makes it easier to find the RAW file and makes large-scale data analysis much more efficient, since the RAW file does not need to be opened. As in the parcel and sidekick analogy, the sidekick is not meant to replace the parcel, only to improve its use. In fact, the Tetra Data Platform first ingests the RAW data and then triggers RAW-to-IDS conversion, or harmonization, using data pipelines.
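The ingest-then-convert order can be sketched as follows. This is a simplified illustration under stated assumptions, not the Tetra Data Platform's pipeline: `ingest_raw` and `convert_to_ids` are hypothetical helpers, the "store" is an in-memory dictionary, and the extracted fields are passed in by hand where a real pipeline would parse them out of the vendor file.

```python
import hashlib


def ingest_raw(store, file_name, raw_bytes):
    """Save the RAW bytes unchanged and return a checksum that identifies them."""
    checksum = hashlib.sha256(raw_bytes).hexdigest()
    store[file_name] = raw_bytes
    return checksum


def convert_to_ids(file_name, checksum, extracted_fields):
    """Build an IDS-like record that points back to the RAW file instead of replacing it."""
    return {
        "source": {"raw_file": file_name, "sha256": checksum},
        "fields": dict(extracted_fields),
    }


# Step 1: the RAW file is ingested and kept as-is.
raw_store = {}
raw_bytes = b"\x00\x01vendor-binary-payload"
checksum = ingest_raw(raw_store, "run_001.raw", raw_bytes)

# Step 2: the conversion produces the sidekick, which carries a reference to the parcel.
# The field below is a placeholder for what a real parser would extract.
ids_record = convert_to_ids("run_001.raw", checksum, {"method": "peptide_mapping"})
```

After both steps, the RAW bytes are still stored untouched, and the IDS record holds only the selected fields plus a pointer (file name and checksum) back to the original.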
This is essentially a variation of the question above, but usually asked in the context of experimental data management.
Let’s first define “archive.” In the context of experimental R&D data, an instrument typically produces “RAW” data - the points, ramps, intensities, depths, counts, and response factors produced by the detector or sensor - which is processed by instrument control/analysis software to produce an “Analysis Result.” Since the “Analysis Result” is sometimes saved in the same “RAW” file, for the purpose of this document we'll call both RAW data. RAW data is most often stored in a vendor-specific format, a vendor-proprietary format, or a non-file-based data system.
Archiving “RAW” data means backing it up from the instrument software or analysis software as a file and storing it somewhere so that, when needed, it can be restored to the original system it was backed up from. (AWS S3 Glacier Deep Archive is a popular, secure, durable, and extremely low-cost Amazon S3 storage class for data archiving and long-term backup.)
Thus, the answer is: No, IDS isn't itself intended for long-term archival.
In most cases, it’s not possible to convert a vendor-proprietary and often binary format into JSON and Parquet and then import the JSON and Parquet back into the instrument software without loss of information. In fact, IDS’ goal is not to capture EVERYTHING possible in the RAW file, so it is lossy in this sense. Thus IDS files by themselves are not sufficient to fulfill the archive requirements of your experimental data. However, as described in the previous section, IDS files will greatly augment search, liquidity, and analytics.
A lot of experimental data is in fact stored in instrument control software, such as Waters Empower or ThermoFisher Chromeleon. This makes it impossible to use ANY vendor-agnostic file to archive the data, since any file exported from the instrument software will be lossy and cannot be imported back into the vendor software. The only way is to export the database file and archive the data in the vendor's native format.
The Tetra Data Platform (TDP) archives original instrument RAW files to the cloud with full traceability and data integrity. You can easily restore processed versions and track lineage back to the original instrument or instrument software.
Having IDS as a harmonization layer makes the archived data much easier to locate and the information stored in these RAW files much more easily accessible. Vendor and third-party binaries can be searched by file name or extension, but TDP's data schema allows search on granular fields like experiment ID, user, method, or instrument version - everything is indexed in Elastic for simpler discovery.
Hopefully we've helped explain what happens to your irreplaceable RAW files - we save them! - and shown how IDS complements them with better analytics, facile search, and ready indexing: all the things you'll need to track and sort the heterogeneous and invaluable packages in your R&D data warehouse.