
The Tetra Way of Making Data Vendor-agnostic and AI-ready

July 21, 2025

The challenges

The boom in AI technologies is shaping the way we think, work, and solve problems. Even for the little things in life, like how to tend to a cut on a finger, we now consult an AI chatbot. We use AI without thinking too much about how the data behind it is crawled, sanitized, tokenized, learned from, analyzed, and finally presented back to us. But not all data is AI-ready out of the box. We still live with heterogeneous, incompatible data formats: structured and unstructured, proprietary, and vendor-locked. Without acquisition, parsing, and harmonization, that data does not comply with the FAIR principles (findability, accessibility, interoperability, reusability) and is unfit for AI.

Various data sources and formats

Proprietary or vendor-locked file formats often require specific vendor software to access the data. We've heard countless stories of researchers needing to physically go into the lab, even during peak COVID restrictions, simply because the necessary software was only installed on lab computers. But even when file formats aren’t proprietary, they can still present significant challenges.

Take a familiar example: the TXT file. While it's intuitive and easy for humans to read and edit, it's far less friendly to machines. TXT files lack a defined structure or schema, forcing parsers to make assumptions about where the headers are, what the delimiters might be, how to interpret empty lines, and so on. Anyone can open and modify a TXT file with complete freedom, which makes it incredibly flexible, but also fragile. A single unexpected line can break the entire parsing logic. When it comes to data structure, freedom and rigor often sit at opposite ends of the spectrum.
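To make this concrete, here is a minimal sketch (in Python, with an entirely hypothetical file layout and column names) of the kind of assumptions a parser is forced to bake in when reading a plain-text export:

```python
import csv
import io

# A hypothetical plain-text export: the parser has to assume that the first
# line is a header and that the delimiter is a tab. None of these assumptions
# is declared anywhere in the file itself.
raw_txt = """Sample ID\tViable Cells\tViability (%)
S-001\t1.2e6\t95.4
S-002\t8.7e5\t91.2
"""

def parse_export(text: str) -> list[dict]:
    """Parse a tab-delimited export under hard-coded layout assumptions."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)

rows = parse_export(raw_txt)
print(rows[0]["Viable Cells"])  # works today...

# ...but one unexpected line (an extra comment, a renamed column, a different
# delimiter) invalidates every assumption baked in above.
```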

The vendor-agnostic Intermediate Data Schema

With over a decade in the industry, TetraScience has encountered hundreds of data formats. We realized early on that providing machine-readable, vendor-agnostic data is key to unlocking the power of machine learning and AI. That is why we introduced the concept of the Intermediate Data Schema (IDS) from the very beginning (we have a blog post explaining the IDS concept in detail). The IDS, designed by TetraScience in collaboration with instrument manufacturers, scientists, and customers, is an open, vendor-agnostic JSON schema that is applied to raw instrument data or report files. It maps vendor-specific information (such as the name of a field) to vendor-agnostic information. The IDS standardizes naming, data types (whether a field is a string, an integer, or a date), data ranges (e.g., a number must be positive), and data hierarchy. IDS fields can also be linked to external ontologies to enable semantic interoperability.

To keep IDSs consistent, TetraScience has been creating IDS component libraries; some components are general to all instruments, while others are specific to a particular instrument technique or category. Creating a new IDS is therefore like playing with LEGO® bricks: you pick pieces from the existing building blocks and assemble them.

Some of the reusable IDS components

A generic “cell-counter” IDS using IDS components
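To give a rough feel for what this looks like under the hood, here is a deliberately simplified, hypothetical schema fragment in the spirit of a cell-counter IDS, expressed as a JSON Schema and checked with the Python jsonschema library. The field names and constraints are illustrative only; the actual IDS components published by TetraScience are far richer.

```python
from jsonschema import validate  # pip install jsonschema

# Simplified, hypothetical sketch in the spirit of an IDS: standardized field
# names, declared types, constrained numeric ranges, and a clear hierarchy.
cell_counter_schema = {
    "type": "object",
    "properties": {
        "sample": {
            "type": "object",
            "properties": {
                "id": {"type": "string"},
                "batch": {"type": "string"},
            },
            "required": ["id"],
        },
        "result": {
            "type": "object",
            "properties": {
                "viable_cell_density": {"type": "number", "minimum": 0},
                "viability_percent": {"type": "number", "minimum": 0, "maximum": 100},
                "measured_at": {"type": "string", "format": "date-time"},
            },
            "required": ["viable_cell_density", "viability_percent"],
        },
    },
    "required": ["sample", "result"],
}

# A conforming instance passes validation; a negative cell density or a
# missing sample ID would be rejected here rather than discovered downstream.
validate(
    instance={
        "sample": {"id": "S-001"},
        "result": {"viable_cell_density": 1.2e6, "viability_percent": 95.4},
    },
    schema=cell_counter_schema,
)
```

Because the schema declares types and ranges up front, bad values are caught at validation time instead of silently propagating into analyses.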

How is IDS used?

First, raw data files are uploaded to the data lake of the Tetra Data Platform (TDP), the core of the Tetra Scientific Data and AI Cloud. The files are then parsed by a dedicated data pipeline, which generates a JSON file using the IDS designed for that instrument or instrument category as a guide. During processing, the data is validated and search indexes are generated. The newly generated JSON file is thus an IDS JSON.
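Conceptually, the harmonization step looks something like the sketch below: hypothetical vendor-specific column names are mapped onto vendor-agnostic, IDS-style field names. This is illustrative only, not the actual Tetra pipeline code.

```python
# Hypothetical row as parsed from one vendor's raw export
vendor_row = {"SampleID": "S-001", "TotalViable(cells/mL)": "1.2E6", "Viab%": "95.4"}

# Vendor-specific name -> vendor-agnostic IDS field path (and target type)
FIELD_MAP = {
    "SampleID": ("sample.id", str),
    "TotalViable(cells/mL)": ("result.viable_cell_density", float),
    "Viab%": ("result.viability_percent", float),
}

def to_ids(row: dict) -> dict:
    """Build a nested, IDS-shaped document from a flat vendor-specific row."""
    doc: dict = {}
    for vendor_key, (ids_path, cast) in FIELD_MAP.items():
        parent, leaf = ids_path.split(".")
        doc.setdefault(parent, {})[leaf] = cast(row[vendor_key])
    return doc

ids_json = to_ids(vendor_row)
print(ids_json)  # ready to be validated against the IDS and indexed for search
```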

Data formats from various data sources being converted to IDS JSON

Furthermore, the data is now fluid: it is indexed into the Tetra Data Lake and has become Tetra Data. You can query it in the TDP using Elasticsearch, SQL, or API calls, or port it into any tool of your choice.
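For illustration, the snippet below shows what such queries could look like; the index, table, and field names are hypothetical stand-ins, not the actual Tetra Data schema.

```python
import json

# An Elasticsearch-style query for all results below 90% viability
# (hypothetical index and field names)
es_query = {
    "query": {
        "range": {"result.viability_percent": {"lt": 90}}
    }
}

# The same idea expressed as SQL against a hypothetical harmonized table
sql_query = """
SELECT sample_id, viable_cell_density, viability_percent
FROM cell_counter_results
WHERE viability_percent < 90
ORDER BY viability_percent ASC;
"""

print(json.dumps(es_query, indent=2))
print(sql_query)
```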

Tetra Data can be ported into various applications

The compelling advantage

The IDS harmonizes the different datasets typically created by biopharma organizations, such as instrument data, CRO assay data, and software data. Because IDS JSON files are predictable, consistent, self-describing, ontology-enabled, and vendor-agnostic, companies can consume the data in their preferred applications and build searches and aggregations on top of it. They can seamlessly integrate the data into visualization and analysis software, as well as ML/AI applications. The flywheel is turning and the momentum is building, because AI compels customers to recognize and demand what should have always been available: liberated, vendor-agnostic data, free from silos. Together with the Tetra Scientific Data and AI Cloud, you can unlock the insights in your data and uncover new possibilities.