March 26, 2021

What is a Tetra Data Integration, Anyway?

Data Integration is one of the biggest challenges for the Life Sciences industry in its journey to leverage AI/ML. Data gets stuck in proprietary, vendor-specific formats and interfaces. Data silos are linked by rigid, one-off, point-to-point connections that are unsustainable to maintain. As the number of disparate systems grows - thanks to equipment obsolescence, new instrument models, evolving data standards, acquisitions, and a host of other factors - maintaining this spider web of connections quickly exceeds any Life Sciences organization's internal IT capabilities.


Here, we’ll introduce our definition of a proper integration with an R&D data system, in particular with a lab instrument or its software. We’ll also explain why our audacious approach - building and maintaining an expanding library of agents, connectors, and apps - might unify ALL pharma and biotech data silos, something no other company can match.


Authors:

Mike Tarselli, Ph.D., MBA - Chief Scientific Officer

Spin Wang - Co-founder, President & CTO


How can we knit together disparate instruments and software systems into a logical platform?


Fragmented Ecosystem 

Let’s start from an assumption we believe is well-understood: Life Science data systems are notoriously fragmented. We’ve touched on this before (post and video), and if you need to hear it from another perspective, we wholeheartedly recommend Gartner’s 2019 analysis.


Systems diverge on:

  • File formats - more than 20 in common use for mass spectrometry alone
  • Data schemas
  • Physical and data connectors (OPC/UA, API, Software Toolkit, Serial Port, etc.)
  • FDA Submission standards
  • Terminology differences between ELNs, LIMS, LES (workflow, reaction, scheme, process)
  • And numerous other variables


AI/ML and digital transformation depend on data liquidity

Do you want clean, curated data sets with consistent headers, attuned to FAIR standards, auditable, traceable, and portable? High-quality data is the only way to enable AI/ML. To achieve this data liquidity, your divergent instrument outputs must “flow” across the system, connecting the disjointed landscape via a vendor-agnostic, open network.


What is a Tetra Integration?

So what is TetraScience’s definition of a Data Integration that enables automation, unites divergent data systems, and powers AI/ML?


A true Tetra Integration must clear a high bar (a brief sketch follows this list). It must:

  • Collect data from the source in an automated, configurable manner
  • Retain scientific context: What instrument collected this? How are the fields mapped?
  • Permit harmonization of the data into a vendor-agnostic and easily consumable format 
  • Enable data liquidity and flexible data science downstream
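
To make the second and third criteria concrete, here is a hypothetical harmonized record that keeps its scientific context. The field names are illustrative only; they are not the actual Tetra Intermediate Data Schema.

```python
# Hypothetical harmonized record; field names are illustrative, not the real IDS.
harmonized_record = {
    "source": {                                # which instrument collected this?
        "vendor": "ExampleVendor",
        "model": "ExampleModel 300",
        "software_version": "2.4.1",
    },
    "method": {"name": "Stability assay 42", "operator": "jdoe"},   # scientific context
    "field_map": {"Resp": "response", "SampID": "sample_id"},       # how vendor fields were mapped
    "results": [
        {"sample_id": "S-001", "response": 1.23, "unit": "AU"},
        {"sample_id": "S-002", "response": 0.98, "unit": "AU"},
    ],
}
```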


Integration must necessarily be “full-stack” -- spanning data collection, transformation, validation, metadata decoration, harmonization, processing, labeling, and AI/ML preparation -- so that users, systems, and agents can access and act on previously siloed data.
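
As a rough illustration of that flow, the sketch below chains toy versions of those stages together. The function names and parsing logic are ours for illustration only, not TetraScience internals.

```python
# A minimal sketch of the "full-stack" flow; names and parsing logic are illustrative.

def collect() -> bytes:
    """Stand-in for an agent pulling a raw export from the source system."""
    return b"sample,response\nS-001,1.23\nS-002,0.98\n"

def transform(raw: bytes) -> list[dict]:
    """Decode the vendor export into structured rows."""
    header, *rows = raw.decode().strip().splitlines()
    keys = header.split(",")
    return [dict(zip(keys, row.split(","))) for row in rows]

def decorate(rows: list[dict], instrument_id: str) -> list[dict]:
    """Attach metadata so downstream users know where the data came from."""
    return [{**row, "instrument_id": instrument_id} for row in rows]

def harmonize(rows: list[dict]) -> dict:
    """Wrap the rows in a vendor-agnostic, schema-tagged document."""
    return {"schema": "example-ids/v1", "results": rows}

print(harmonize(decorate(transform(collect()), instrument_id="HPLC-007")))
```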


A Tetra Integration can be achieved via a combination of Tetra agents, connectors, and pipelines depending on the specific data sources. For example: 

  • To integrate with Waters Empower, we leverage a data agent to draw information from a Chromatography Data System (CDS) using a vendor toolkit
  • To integrate with NanoTemper Prometheus, we leverage the Tetra file-log agent and pipeline
  • To integrate with Solace and AGU SDC, we build connectors using RESTful API services
  • To integrate with osmometers, blood gas analyzers, or shaking incubators, we use an IoT agent to stream continuous data from a physical, mounted instrument to the Cloud through secure MQTT 
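
As an example of that last pattern, here is a minimal sketch of an IoT-style polling loop that publishes readings over secure MQTT using the paho-mqtt client. The broker address, topic, and read_osmometer() helper are placeholders, not details of the actual Tetra IoT Agent.

```python
# Illustrative IoT-style agent loop; broker, topic, and the instrument helper are placeholders.
import json
import time
import paho.mqtt.client as mqtt

def read_osmometer() -> dict:
    """Placeholder for polling a physical instrument over its local interface."""
    return {"osmolality_mOsm_kg": 290, "timestamp": time.time()}

client = mqtt.Client()
client.tls_set()                                   # secure MQTT (TLS)
client.connect("broker.example.com", 8883)
client.loop_start()

while True:
    reading = read_osmometer()
    client.publish("lab/osmometer-01/readings", json.dumps(reading), qos=1)
    time.sleep(5)
```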


Tetra’s “Special Sauce”: Productized and data-centric integrations

Important Note: When TetraScience uses the term Tetra Integration, we do not mean a simple “drag-and-drop” of instrument RAW files into a Data Lake. To meet our criteria and quality standards, source data must be contextually transformed into a harmonized format, like JSON. These integrations are differentiators for the Tetra Data Platform; to us, moving files without true parsing and interpretation adds no value.

Differentiated from LIMS/SDMS: IoT agent and software agent/connector

Most Life Sciences data integrations are performed by LIMS and SDMS software, which bring data from different sources into the ELN/LIMS for tracking and reporting, and into the SDMS for storage. LIMS and SDMS rely on two major methods:

  • A serial-to-Ethernet adapter for instruments such as osmometers and analyzers
  • File-based export and import 


While these may be viable options, they are far from optimal for an organization moving toward an Industry 4.0 model. Consider the following comparisons:

 
Network resilience
  • SDMS, LIMS, or ELN via serial-to-Ethernet adapter: network interruptions between the server and the instrument result in lost data.
  • Tetra Data Platform + IoT Agent: resilient to network interruptions; the IoT Agent buffers data locally during downtime (sketched below).

Scalability
  • SDMS, LIMS, or ELN via serial-to-Ethernet adapter: a centralized server must maintain and manage connections to an ever-growing number of endpoints, which does not scale.
  • Tetra Data Platform + IoT Agent: each IoT Agent computes in a distributed fashion, tracks its own state, handles the handshake with its instrument, and preprocesses data locally.

Supported communication protocols
  • SDMS, LIMS, or ELN via serial-to-Ethernet adapter: only the serial port is supported.
  • Tetra Data Platform + IoT Agent: handles additional protocols such as OPC, GPIO, and CAN.

Architecture
  • SDMS, LIMS, or ELN via serial-to-Ethernet adapter: data sources are strongly coupled to their destination.
  • Tetra Data Platform + IoT Agent: data sources and data targets are decoupled.
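
To illustrate the network-resilience point above, here is a minimal sketch of local buffering: if a reading cannot be sent, it is appended to a local file and retried once connectivity returns. The publish() stub is hypothetical and stands in for whatever upstream call an agent makes; this is not Tetra IoT Agent code.

```python
# Conceptual sketch of local buffering during network downtime.
import json
import os

BUFFER_FILE = "buffered_readings.jsonl"

def publish(reading: dict) -> bool:
    """Hypothetical upstream call; returns False (or raises OSError) when the network is down."""
    return True

def send_or_buffer(reading: dict) -> None:
    """Try to send a reading; queue it locally if the send fails."""
    try:
        sent = publish(reading)
    except OSError:
        sent = False
    if not sent:
        with open(BUFFER_FILE, "a") as f:
            f.write(json.dumps(reading) + "\n")

def flush_buffer() -> None:
    """Re-send anything that accumulated while the network was down."""
    if not os.path.exists(BUFFER_FILE):
        return
    with open(BUFFER_FILE) as f:
        pending = [json.loads(line) for line in f]
    if all(publish(r) for r in pending):
        os.remove(BUFFER_FILE)
```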

ELNs, LIMS, and SDMS have traditionally relied on file export for the majority of their integrations with instruments and with instrument control / processing software.


 
Completeness
  • File export from ELN/LIMS/SDMS: exported files lack information needed for sophisticated AI/ML, such as the audit trail, method, and comprehensive raw data.
  • Tetra Data Platform + software Agents and Connectors: collects everything that can be extracted from the source system, including raw data, logs, methods, and processing results.

Automation and human error
  • File export from ELN/LIMS/SDMS: scientists must initiate and configure exports on the spot; these manual operations are subject to variability, typos, and other human error.
  • Tetra Data Platform + software Agents and Connectors: cloud-native agents and connectors communicate with data sources programmatically, capturing comprehensive data sets and detecting changes on update.

Change detection
  • File export from ELN/LIMS/SDMS: requires manual export; if scientists forget to export the data, changes remain stuck in the source system and are never reflected in analytics or reporting.
  • Tetra Data Platform + software Agents and Connectors: automatically detects changes using the audit trail, timestamps, and other mechanisms to keep data current (sketched below).

Bi-directionality
  • File export from ELN/LIMS/SDMS: files are “unidirectional” - data can be exported from the instrument software, but a file cannot instruct automation software on how to run an experiment.
  • Tetra Data Platform + software Agents and Connectors: TDP permits third-party systems to send commands and instructions to the instrument or instrument control software via the Agent.
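
The change-detection point above can be pictured as a simple polling loop that asks the source system for anything modified since the last scan. The query_records_modified_since() helper is hypothetical, and real agents also lean on audit trails and other mechanisms, as noted above.

```python
# Illustrative timestamp-based change detection; the query helper is hypothetical.
from datetime import datetime, timezone

last_scan = datetime(2021, 3, 26, tzinfo=timezone.utc)

def query_records_modified_since(cutoff: datetime) -> list[dict]:
    """Stand-in for asking the source system (e.g., a CDS) for records changed after `cutoff`."""
    return []

def poll_once() -> None:
    """Pick up only what changed since the previous scan, then advance the watermark."""
    global last_scan
    scan_started = datetime.now(timezone.utc)
    for record in query_records_modified_since(last_scan):
        print("picked up change:", record)   # hand off to a pipeline in a real integration
    last_scan = scan_started
```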

Data harmonization for data science and data liquidity

Extraction from source systems alone is insufficient to claim true integration. For example, imagine a data scientist with access to thousands of PDFs from a Malvern particle sizer, thousands of mass spec binary files from Waters MassLynx, or TA Instruments differential scanning calorimetry (DSC) binary files; in these formats, the data’s value stays locked away and cannot impact R&D.

Other than the file name and path, these binary files are essentially meaningless to other data analytics applications. The data must be further harmonized into our Intermediate Data Schema (IDS), based on JSON and Parquet, so that any application can consume it and R&D teams can apply their data science tools.
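
As a small example of that last step, harmonized rows (already parsed out of a vendor file) can be written to Parquet for downstream analytics. The column names below are illustrative and are not the actual IDS definition.

```python
# Write already-harmonized rows to Parquet; columns are illustrative, not the real IDS.
import pandas as pd

records = [
    {"sample_id": "S-001", "peak_area": 1532.7, "retention_time_min": 4.21},
    {"sample_id": "S-002", "peak_area": 1498.2, "retention_time_min": 4.19},
]

pd.DataFrame(records).to_parquet("ids_records.parquet", index=False)  # needs pyarrow or fastparquet
```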


Fostering a community based on data liquidity 

TetraScience has taken on the challenging, audacious task of building, maintaining, and upgrading sophisticated integrations; we believe this to be a first in the Life Sciences data industry, which has long suffered from the vendor data silo problem. Consider the incentives elsewhere in the ecosystem:

  • An instrument OEM’s primary business driver is selling instruments and consumables
  • An informatics application provider’s primary goal is to get more data flowing into its own software

The R&D Data Cloud and its companion Tetra Integrations are designed entirely to serve the data itself, liberating it without introducing any proprietary layer of interpretation. If your software can read JSON or Parquet and talk to SQL, you can immediately benefit from the Tetra Integration philosophy.
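
For instance, assuming harmonized records have landed as Parquet (as in the earlier sketch), any SQL-capable tool can query them directly; DuckDB is used here purely as one example of such a tool, not as something the platform prescribes.

```python
# Query harmonized Parquet output with plain SQL; DuckDB is just one example engine.
import duckdb

high_responders = duckdb.query(
    "SELECT sample_id, peak_area FROM 'ids_records.parquet' WHERE peak_area > 1500"
).to_df()
print(high_responders)
```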

Our cloud-native and Docker-based platform allows us to leverage the entire industry’s momentum to rapidly develop, test and enhance these integrations with real customer feedback. Rapid iteration and distribution of consistent, reproducible integrations across our customer base introduces more use cases, more test cases, and more battle-tested improvements for the entire scientific community. 


Check out some of our Tetra Integrations, and request an integration for your team right on that page. We're always interested in hearing from you!

Mike Tarselli, Ph.D., MBA
Chief Scientific Officer, TetraScience. Mike remains curious about chemistry, drug development, data flows, and scientific collaboration.
