Data integration is one of the biggest challenges the life sciences industry faces on its journey to leverage AI/ML. Data gets stuck in proprietary, vendor-specific formats and interfaces. Data silos are connected via rigid, unsustainable, one-off point-to-point connections. As the number of disparate systems grows exponentially - thanks to equipment obsolescence, new instrument models, evolving data standards, acquisitions, and a host of other factors - maintaining this spider web of connections quickly exceeds any life sciences organization's internal IT capabilities.
Here, we’ll introduce our definition of a true integration with an R&D data system - in particular, with a lab instrument or its control software. We’ll also explain why our audacious approach - building and maintaining an expanding library of agents, connectors, and apps - can unify ALL pharma and biotech data silos, the foundation for accelerating drug discovery and R&D.
Systems diverge on:
Do you want clean, curated data sets with consistent headers, aligned to FAIR principles - auditable, traceable, and portable? High-quality data is a prerequisite for AI/ML. To achieve this data liquidity, your divergent instrument outputs must “flow” across the organization, connecting the disjointed landscape via a vendor-agnostic open network.
So what is TetraScience’s definition of a true R&D Data Integration - one that enables automation, unites diverse data systems, and powers AI/ML?
A true R&D Data Integration with a particular data system must clear a high bar. It must be:
A true R&D Data Integration must necessarily be “full-stack and purpose-built” — configurable data collection, harmonization to vendor-neutral formats, preparation for analytics consumption, automated push to data targets, and tailored to the system and related scientific use cases — so that scientists and data scientists can access and take actions on previously siloed data in order to accelerate discovery.
A true R&D Data Integration can be achieved via a combination of Tetra agents, connectors, and pipelines depending on the specific data systems. For example:
Important Note: When we use the term true R&D Data Integration, we reject simple “drag-and-drop” of instrument RAW files into a data lake. To meet our criteria and quality standards, we must contextually transform source data into a harmonized format, like JSON. These integrations are differentiators for the Tetra Data Platform; to us, if you’re moving files without true parsing and interpretation, no value is added.
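To make the distinction concrete, here is a minimal sketch of what “contextual transformation” means in practice, as opposed to copying a raw file. The field names and mapping below are illustrative assumptions, not the actual Tetra IDS:

```python
import json

# Hypothetical raw export from an instrument: vendor-specific headers,
# numeric values stored as strings, no consistent naming.
raw_row = {"Smpl ID": "S-042", "Temp (C)": "25.0", "Result": "7.31"}

# Harmonization maps vendor-specific fields onto a consistent,
# vendor-neutral schema (names here are made up for illustration).
FIELD_MAP = {
    "Smpl ID": "sample_id",
    "Temp (C)": "temperature_celsius",
    "Result": "measurement_value",
}

def harmonize(row: dict) -> dict:
    out = {FIELD_MAP[key]: value for key, value in row.items()}
    # Cast numeric strings so downstream analytics can consume them directly.
    out["temperature_celsius"] = float(out["temperature_celsius"])
    out["measurement_value"] = float(out["measurement_value"])
    return out

print(json.dumps(harmonize(raw_row), indent=2))
```

A simple drag-and-drop of the raw file would preserve the cryptic headers and string-typed values; the harmonized JSON is what makes the record searchable and analytics-ready.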
Most life sciences data integrations are performed by LIMS and SDMS software, which bring data from different sources into the ELN/LIMS for tracking and reporting, and into the SDMS for storage. LIMS and SDMS rely on two major methods:
While these may be viable options, they are far from optimal for an organization trying to upgrade to an Industry 4.0 motif. Consider the following scenarios:
ELN, LIMS, and SDMS have traditionally relied on file export from the instrument for the majority of their integrations with instruments and instrument control/processing software.
Extraction from source systems is not enough to claim true integration. Imagine a data scientist with access to thousands of PDFs from a Malvern particle sizer, thousands of mass spec binary files from Waters MassLynx, or TA Instruments differential scanning calorimetry (DSC) binary files; in these formats, the data’s value stays locked away and its impact on R&D is lost.
Other than the file name and path, these binary files are essentially meaningless to other data analytics applications. The data needs to be further harmonized into our Intermediate Data Schema (IDS), based on JSON and Parquet, so that any application can consume it and R&D teams can apply their data science tools.
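Once harmonized, records that were opaque binaries become trivially consumable by standard data tools. A minimal illustration using only the Python standard library (the record fields are hypothetical, not the actual IDS):

```python
import json
from statistics import mean

# Harmonized records as they might land in a data lake: one JSON
# document per acquisition, with consistent, self-describing fields.
records = [
    json.loads(doc) for doc in (
        '{"sample_id": "S-001", "peak_area": 1520.4}',
        '{"sample_id": "S-002", "peak_area": 1498.7}',
        '{"sample_id": "S-003", "peak_area": 1611.2}',
    )
]

# Any analytics tool can now operate on the data directly --
# no vendor-specific reader required.
avg_area = mean(record["peak_area"] for record in records)
print(f"mean peak area: {avg_area:.1f}")
```

The same aggregation is impossible against the original binary files without each vendor’s proprietary reader.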
TetraScience has taken the audacious stance of building out, maintaining, and continually upgrading sophisticated integrations - a first, we believe, in a life sciences data industry that has long suffered from the vendor data silo problem:
The R&D Data Cloud and its companion Tetra Integrations are designed entirely to serve the data itself, liberating it without introducing any proprietary layer of interpretation. If your software can read JSON or Parquet and talk to SQL, you can immediately benefit from the Tetra Integration philosophy.
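To show what “talk to SQL” buys you, here is a sketch that loads harmonized records into a SQL engine and answers a scientific question with a plain query. SQLite stands in for a cloud warehouse, and the field names are illustrative assumptions:

```python
import sqlite3

# Harmonized, vendor-neutral records (field names are made up here).
records = [
    {"sample_id": "S-001", "instrument": "plate_reader", "od600": 0.42},
    {"sample_id": "S-002", "instrument": "plate_reader", "od600": 0.87},
    {"sample_id": "S-003", "instrument": "plate_reader", "od600": 0.15},
]

# Load into any SQL engine; an in-memory SQLite database stands in
# for a production data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (sample_id TEXT, instrument TEXT, od600 REAL)")
conn.executemany(
    "INSERT INTO results VALUES (:sample_id, :instrument, :od600)", records
)

# Standard SQL answers the question without touching any vendor format.
grown = conn.execute(
    "SELECT sample_id FROM results WHERE od600 > 0.4 ORDER BY sample_id"
).fetchall()
print(grown)  # [('S-001',), ('S-002',)]
```

The point is not the specific engine: once data is harmonized into open formats, any JSON-, Parquet-, or SQL-capable tool can consume it with no further integration work.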
Our cloud-native and Docker-based platform allows us to leverage the entire industry’s momentum to rapidly develop, test, and enhance these integrations with real customer feedback. Rapid iteration and distribution of consistent, reproducible integrations across our customer base introduces more use cases, more test cases, and more battle-tested improvements for the entire scientific community.
Check out some of our Tetra Integrations, and request an integration for your team right on that page. We're always interested in hearing from you!