(This blog is Part 2 of a two-part series. For Part 1, see What is a True Data Integration, Anyway?)
In this blog post, we introduce:
- What TetraScience and pharma define as a true Scientific Data Integration: i.e., what biopharma needs today in its journey of digitization and adoption of AI/ML
- Why integrations, though conceptually straightforward, are extremely challenging to implement...especially at scale
- How TetraScience approaches the massive challenge of integration with a holistic organizational commitment from different angles: positioning, process, product, team, and capital. We discuss why we raised an $80M Series B round to fuel this effort, explain why more than 60% of our integrations squad are life sciences Ph.D.s, and describe how Tetra Data Platform is purpose-built to accelerate integration development
What is a True Scientific Data Integration?
In a previous article, we started defining what we mean by a true Scientific Data Integration. To fundamentally shift the scientific data landscape from a point-to-point, project-based paradigm to one that is platform-oriented, productized, and cloud-native, biopharma needs integrations with disparate systems such as lab instruments (e.g., HPLC, mass spec, robotics systems, etc.), informatics applications (e.g., registration systems, Electronic Lab Notebooks (ELNs), Laboratory Information Management Systems (LIMS)), and contract organizations (CRO/CDMOs). A true Scientific Data Integration must therefore be:
- Configurable and productized: with flexible parameters to adjust how data is ingested from sources and provided to targets. Such configuration should be well-documented and tailored closely to particular data source and target systems of interest
- Simple to manage: Integration with sources and targets should be achievable with a couple of clicks, and should be centrally managed and secured, eliminating the need to log into multiple on-prem IT environments
- Bidirectional: able to pull data from a system and push data into it, treating it as both a data source and data target. This capability is critical for enabling iterative automation around the Design, Make, Test, Analyze (DMTA) cycle of scientific discovery
- Automated: able to detect when new data is available, and requiring little (or no) manual intervention to retrieve it from data sources or push it to data targets
- Compliant: designed so that any change to the integration and all operations performed on data sets are fully logged and easily traced
- Complete: engineered to extract, collect, parse, and preserve all scientifically meaningful information from data sources, including raw data, processed results, context of the experiment, etc.
- Capable of enabling true data liquidity: able to harmonize data into a vendor-agnostic and data science-friendly format such that data can be consumed or pushed to any data target
- (Bidirectionally) Chainable: The output of one integration must be able to trigger a subsequent integration. For example, an integration that pulls data from an ELN must be able to trigger an integration that pushes data to instrument control software, eliminating the need for manual transcription. The reverse may also be required: the latter integration (extracting data from the instrument control software) may need to trigger the former integration to push data back, submitting it to the ELN or LIMS.
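The chaining requirement can be illustrated with a minimal event-driven sketch. This is a hypothetical illustration, not TetraScience's actual implementation: each integration subscribes to event labels, and completing one run emits a new label that can trigger the next integration in the chain.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical sketch of chainable integrations: each integration declares
# which event labels trigger it; finishing a run emits a "<name>.done" event
# that downstream integrations can subscribe to.
@dataclass
class Integration:
    name: str
    handler: Callable[[dict], dict]                 # transforms an incoming payload
    triggers_on: set = field(default_factory=set)   # labels that start this integration

class Chain:
    def __init__(self):
        self.integrations = []
        self.log = []

    def register(self, integration):
        self.integrations.append(integration)

    def emit(self, label, payload):
        """Deliver an event; any subscribed integration runs, and its
        completion event is emitted in turn (the 'chaining')."""
        for itg in self.integrations:
            if label in itg.triggers_on:
                result = itg.handler(payload)
                self.log.append(itg.name)
                self.emit(f"{itg.name}.done", result)

# Example: an ELN export triggers run-list preparation, whose completion
# triggers a push of results back to the ELN (names are illustrative).
chain = Chain()
chain.register(Integration("prepare-runlist", lambda p: {"runlist": p["samples"]},
                           triggers_on={"eln.export"}))
chain.register(Integration("push-to-eln", lambda p: {"submitted": p["runlist"]},
                           triggers_on={"prepare-runlist.done"}))
chain.emit("eln.export", {"samples": ["S1", "S2"]})
print(chain.log)  # ['prepare-runlist', 'push-to-eln']
```

The key design point is that neither integration knows about the other directly; the event labels decouple them, which is what makes chains composable in both directions.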
Why is Integration so Hard?
Challenges in building a true Scientific Data Integration
Integrations meeting the above requirements can be extremely difficult to build. Commonly-encountered challenges include:
- Binary and Proprietary File Formats. Software and instrument vendors often lock data into proprietary formats, obligating use of their proprietary software (often a Windows desktop application) for data retrieval, analysis, and storage
- Unsupported Change Detection. Primitive (and often undocumented) interfaces can make it difficult to detect when new data is available (e.g., when an injection has been reprocessed and new results calculated), undermining efforts to fully automate the Design, Make, Test, Analyze (DMTA) cycle at the center of hypothesis-driven inquiry
- IoT-based Integration. Many common instruments (e.g., balances, pH meters, osmometers, blood gas analyzers, etc.) are not networked, provide no standard API for data retrieval, and may lack even the ability to export new data files to a shared drive. Smart, network-connected devices are required to retrieve data from such sources, eliminating the need for manual transcription
- ELN/LIMS Integration. Pushing experimental results back to the ELN and LIMS sounds conceptually simple, but can be extremely complex in practice. Data structures used by ELNs and LIMS can be highly flexible and variable, requiring deep understanding of specific use cases in order to make data consumable by the target application
- Data Harmonization. Data collection from disparate data silos with proprietary vendor data formats adds complexity beyond simply moving the data around. Harmonization requires data to be vendor-agnostic and available in data science compatible formats; this allows data to be consumed or pushed to any targets easily and lets it flow freely among all systems. Building these data models requires a deep understanding of instruments, data structures, data interfaces, and their limitations. It requires thorough review of how data is being used in any given scenario, to determine the right entity relationships to optimize for data consumption
There is a lot of hand-to-hand combat with each system that needs to be properly integrated. Building a new integration requires deep understanding of actual scientific use cases, how to detect changes, how to package data from the source system, how to harmonize the data to vendor-agnostic and data science compatible formats, and how to push the data in a generic and scalable manner to data consumers.
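The harmonization step above can be sketched concretely. In this hypothetical example (the vendor names, column headings, and schema are invented for illustration), two instruments report the same measurement with different column names; a per-vendor mapping projects both onto one vendor-agnostic schema so downstream consumers see a single shape:

```python
import csv
import io

# Hypothetical mapping: two plate-reader vendors label the same absorbance
# measurement differently; harmonization maps both onto one schema.
VENDOR_MAPPINGS = {
    "vendor_a": {"well": "Well", "od": "OD600"},
    "vendor_b": {"well": "Position", "od": "Abs_600nm"},
}

def harmonize(vendor, raw_csv):
    """Turn a vendor-specific CSV export into vendor-agnostic records."""
    mapping = VENDOR_MAPPINGS[vendor]
    rows = csv.DictReader(io.StringIO(raw_csv))
    return [
        {"well": r[mapping["well"]],
         "value": float(r[mapping["od"]]),
         "measure": "absorbance",
         "source_vendor": vendor}
        for r in rows
    ]

a = harmonize("vendor_a", "Well,OD600\nA1,0.42\n")
b = harmonize("vendor_b", "Position,Abs_600nm\nA1,0.42\n")
assert a[0]["value"] == b[0]["value"] == 0.42  # same data, one schema
```

Real harmonization is far richer (units, experiment context, entity relationships), but the principle is the same: the vendor-specific shape is absorbed at the boundary so everything downstream is vendor-agnostic.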
Challenges in Scaling Up the Number of Integrations
Building one integration is challenging. The typical complexity of a biopharma, including the sheer number and variety of systems used in its labs and facilities, creates a brand new challenge: how to scale integrations to hundreds of types of data systems and thousands of data system instances while keeping integrations productized and maintainable, and while continuing to meet the stringent requirements above for true integrations.
Variations are inevitable in the real world. All the iterations required to deploy, test, and harden thousands of variant integrations make data integration one of the most painful challenges biopharma faces in its digitization journey. Here are some examples of frequently-encountered variations that must routinely be accommodated:
- Variations in technical interface (software SDK, OPC, file, SOAP API, RS232 serial port)
- Variations in data structure and schema
- Variations in report formats
- Variations in network and host environments
- Variations due to instrument makes, models and upgrades
- Variations in configuration reflecting specific use cases
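One common way to keep an integration productized in the face of these variations is to absorb them through configuration rather than per-site custom code. The sketch below is a hypothetical illustration (the strategy names and config keys are invented): each interface variant maps to a retrieval strategy, while the core pipeline stays unchanged.

```python
# Hypothetical sketch: interface variations (file drop, API, serial port)
# are selected by configuration; the core integration code never forks.
def fetch_via_file(cfg):
    return f"watching {cfg['path']} for new exports"

def fetch_via_api(cfg):
    return f"polling {cfg['endpoint']} every {cfg['interval_s']}s"

def fetch_via_serial(cfg):
    return f"reading {cfg['port']} at {cfg['baud']} baud"

STRATEGIES = {"file": fetch_via_file, "api": fetch_via_api, "serial": fetch_via_serial}

def build_source(cfg):
    # Unsupported interfaces fail loudly at configuration time, not mid-run.
    try:
        strategy = STRATEGIES[cfg["interface"]]
    except KeyError:
        raise ValueError(f"no strategy for interface {cfg['interface']!r}")
    return strategy(cfg)

print(build_source({"interface": "serial", "port": "COM3", "baud": 9600}))
# reading COM3 at 9600 baud
```

Each new variation then becomes a new configuration entry (or, at worst, a new strategy), rather than a forked copy of the whole integration.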
How Can We Possibly Pull This Off? The TetraScience Approach
As mentioned above, biopharma companies typically have hundreds of disparate kinds of data sources and targets, and may be operating many instances of each. Historically, biopharma has relied on two approaches for integrating these many data sources and targets together:
- They contract with consulting companies, often producing one-off custom integrations between individual data sources and targets. Such custom integrations tend to be brittle, hard to maintain, impossible to scale, and narrowly scoped to their initial use case or process
- They call on ELN/LIMS providers to create integrations as “customizations” or as professional services projects tied to ELN/LIMS setup. This inevitably leads to integrations that are hard-coded to work only with that specific ELN or LIMS, resulting in data silos, vendor lock-in, and incomplete data extraction
Neither of these approaches has solved the fundamental problem with integrations in a scalable way. Why does TetraScience think we can do so?
We want to be the first to acknowledge that there is no magic or silver bullet. Integrating all the data sources and targets for a large biopharma research, development, and/or manufacturing organization is hard work. However, TetraScience has evolved, and is successfully implementing, a process that breaks the integration logjam and enables biopharma organizations to better leverage the benefits of data, automation, cloud, and analytics.
We approach the challenge from five angles:
- Vendor agnostic and data-centric positioning
- Cross-functional factory line process
- Product that is purpose-built for scientific data and for scaling productized integrations
- A team that is at the intersection of science, data science, integration and cloud
- Significant capital investment
Positioning: Vendor-Agnostic and Data-Centric
To build true Scientific Data Integrations, vendor collaboration is crucial, and a vendor-agnostic business model with purely data-oriented positioning is critical. A Tetra Scientific Data Integration connects products in the ecosystem without any agenda other than providing stewardship of data. At TetraScience we take this very seriously. We believe scientific data should flow freely across all data systems to ensure biopharma can get the most out of its core asset.
We stand by this core belief. We do not make hardware, we do not build informatics or analytics or workflow applications. We feed data into whatever tools customers and partners want to use, whether this is Dataiku, Data Robot, Spotfire, Biovia, IDBS, Dotmatics, or Benchling — enhancing these tools’ capabilities and value. We collect and harmonize data from any data source, whether ThermoFisher, PerkinElmer, Agilent, or Waters instruments.
Process: Cross-Functional Factory Line
At TetraScience, we don’t just have one sub-team focusing on integration. Instead, we treat integration as a company-wide core focus, leveraging participation from the entire organization. Here are the stages of our integration factory line:
Prioritization, Use Case Research, and Prototyping
- The core deliverable at this stage is to collect all inbound requests from our customers and partners and use this information to construct our short-, medium-, and long-term integrations roadmap
- Our science team, all industry veterans, then begins investigating: reaching out to vendors and gaining access to sandbox environments and documentation in order to experiment with each new data source and prototype new integrations
- One non-obvious but tremendously important task is to also document integration and system compatibility. For example, which version of LabChip’s exported file will we develop around, and how will scientists properly set up the instrument software such that the exported file is compatible with the integration we’re building?
Build and Productization
- This stage is handled and driven by our product and engineering team, who turn the prototype integrations into modular building blocks, which we also call artifacts. These can be agents, connectors, pipelines, Intermediate Data Schema (IDS), and other elements, all of which work together to enable an integration
- While building out these reusable artifacts, we sometimes discover that required functionality stretches the current capabilities of our platform. In such cases, we use these discoveries as forcing functions to improve the engine, the substrate on which integrations run
Documentation and Packaging
- Integrations are then fully documented, helping users understand how to set up their instruments to best use our integration, how to leverage the resulting harmonized data structure, and how the integration handles detailed edge cases
- Design decisions made to ensure that integrations can be robust and repeatable may, in some cases, limit user choice about how data sources and/or targets can be configured. In such situations, engineering decisions must be explained and documented transparently so that users understand and can adapt to them
Feedback and Further Collaboration with Customer and Vendors
- Once the integration is deployed, customers often gain access for the first time to structured, clean data in data science compatible formats. At this point, they will often provide feedback on the integration based on their use cases
- Vendors will also often provide feedback to TetraScience, helping us minimize unnecessary API calls, manage performance limits of the source system, and implement other improvements
This factory line model ensures that integrations are prioritized to meet customer needs, engineered with partner feedback, and delivered with high quality in a repeatable, predictable way. Acquired knowledge and best practices are shared across our entire organization, and we leverage patterns and lessons learned from one integration to inform subsequent efforts.
External Facing Process: Ecosystem Partnership
Simply having an internal engine or process is not sufficient, since integrations depend heavily on customer use cases and integration endpoints, and on knowledge sharing with vendors.
We believe in working with biopharma organizations to identify the best integration strategy to set them up for long-term success in digital transformation. We often find that the right approach is not always the most straightforward.
- Directly connecting to instruments like blood gas analyzers or mass balances seems simple, but may be suboptimal compared to using a control layer such as Mettler Toledo LabX or AGU Smartline Data Cockpit to add proper sample metadata and user/workflow information needed for full traceability
- Plate reader control software such as SoftMax Pro can package data in Excel spreadsheet-based reports for parsing. Customizable reports, however, are subject to variation and user error, significantly increasing future maintenance overhead. For this reason, we will document how to set up the instrument to reduce the chance of manual mistakes
- Scientists tend to save files in their preferred folder structures and usually without much convention or consistency. However, lack of a consistent, standard folder structure makes it very difficult for the rest of the organization to browse data. And the folder structure (folder names, etc.) itself may contain metadata important to placing data in context. We will share best practices and guide the customer to start defining their file hierarchy
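The folder-structure point can be made concrete with a small sketch. Assuming a hypothetical convention of `<project>/<instrument>/<YYYY-MM-DD>/<file>` (the hierarchy and field names here are illustrative, not a prescribed standard), the path itself becomes metadata that can be captured at ingestion time instead of being lost:

```python
import re

# Hypothetical sketch: once a consistent hierarchy such as
# <project>/<instrument>/<YYYY-MM-DD>/<file> is adopted, the path itself
# carries context that an integration can extract automatically.
PATH_PATTERN = re.compile(
    r"(?P<project>[^/]+)/(?P<instrument>[^/]+)/"
    r"(?P<date>\d{4}-\d{2}-\d{2})/(?P<filename>[^/]+)$"
)

def metadata_from_path(path):
    """Return the metadata encoded in the path, or {} if it doesn't conform."""
    match = PATH_PATTERN.search(path)
    return match.groupdict() if match else {}

meta = metadata_from_path("oncology-042/hplc-02/2021-06-15/run_117.csv")
print(meta["project"], meta["date"])  # oncology-042 2021-06-15
```

A non-conforming path returns an empty dict, which is exactly the situation described above: without a convention, the context that could have been captured for free is simply unavailable.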
As we work with each customer, we outline and discuss trade-offs among different approaches and share with the customer what we have learned from the rest of the industry. We also work with customers to help them better articulate their data science and visualization use cases, which will in turn inform the right data model to harmonize their data.
In collaboration, we turn a multi-year effort into bite-sized tasks, enabling consistent progress. Scientists who have been burned in the past by hugely distracting IT projects that yielded little tangible benefit can now achieve immediate value and provide feedback.
We believe in partnership with vendors, defining value, and creating abundance, together.
As a vendor-neutral integration provider, TetraScience promotes forward-looking partners, introduces use cases to them, provides feedback on their data formats and software interfaces, and connects them to analytics-related use cases, for example, helping an instrument manufacturer leverage harmonized data and rapidly expand its offering into the application or analytics layer.
As a result, vendors leverage TetraScience to build out better data interfaces and establish their analytics and cloud strategy. Yes, their data can now be accessed by customers without their proprietary software: this can feel a little scary. But vendor-neutral data handling is already an unstoppable trend. On the other hand, because the vendor is typically the expert in their own data and use cases, they are well positioned to explore new possibilities in the visualization, analytics and AI/ML layer which they can build out on top of Tetra Data Platform. Instead of building their own vertical and vendor-specific data stack, vendors can now accelerate their efforts to create value in the application and analytics layer, while benefiting from a vendor-neutral integration layer (which can also provide data from other ecosystem products).
Product: Enabling the Factory Line
TetraScience’s focus on integration has also helped shape and influence our product design and strategy. Tetra Data Platform is crafted to help our team and customers create and maintain true Scientific Data Integrations rapidly and at scale.
Team: Intersection of the Four
Behind the positioning, process, and product is a team dedicated to this mission and vision. Anyone who has tried to tackle this problem will quickly appreciate the diverse expertise needed to get the job done. Imagine the skill sets behind just a single integration, such as the Waters Empower Data Science Link (EDSL; see our video and blog post):
- Science. Deep understanding of HPLC and chromatography, of Empower and its toolkit for extracting data, and of how to harmonize chromatography data into intuitive, vendor-agnostic models
- Integration. Deep understanding of data integration patterns to design for speed, resilience and configurability
- Cloud. Deep understanding of cloud infrastructure to architect a scalable and portable platform
- Data Science. Deep understanding of use cases to provide an optimized partitioning, indexing, and query experience, plus example data apps to help biopharma data scientists jumpstart their own analytics and visualization projects
Deep expertise introduced from these four kinds of backgrounds, coupled with obsessive attention paid to training and knowledge transfer, forges a unicorn team that is positioned to tackle such a challenge.
As a reference, more than 60% of the team members driving integration design and data modeling hold life science-related Ph.D.s. They have a deep understanding of scientific processes and data engineering/science, plus hands-on experience using these scientific instruments and informatics applications.
Capital: Powering the Vision
To scale the number of productized integrations, the organization must be obsessively focused on the integrations and all the nuances associated with building a true Scientific Data Integration.
This necessitates a big financial commitment over a long period of time, since TetraScience will need to dedicate significant resources to investigate, prototype, build, document, maintain, and upgrade each integration. Scale also allows us the luxury to remain pure-play and focus only where we are uniquely positioned to deliver value. That’s part of the motivation and thesis of our most recent funding round. To read more about our funding announcement, please see: Announcing Our Series B: The What, When, Why, Who, and Where.
Some Last Thoughts
True Scientific Data Integrations are the foundation for life sciences organizations to digitize their workflow and move towards a more compliant and data-driven R&D process. These integrations are the lines that connect the dots; they form the network that all participants can leverage and benefit from by focusing on their unique science and business logic. Integrations define the substrate where innovation can thrive based on the harmonized data sets that flow freely across whole organizations.
Each integration can be challenging, and hundreds of such integrations represent a challenge this industry has never solved before. Without true Scientific Data Integrations as the connective tissue, scientific R&D processes move significantly more slowly than in other industries and fail to keep up with the demands of discovering, creating, and marketing life-changing therapies.
We believe it’s the right time to solve the integration challenge, once and for all.