Blog
Tetra Data for Shimadzu LabSolutions

Chromatography, a technique essential for separating, identifying, quantifying, and purifying mixtures, is crucial in advancing scientific research. As the importance of chromatography continues to grow, so does the need for efficient data management. Moreover, there is a growing aspiration to make chromatography data easily accessible for advanced analytics, including artificial intelligence (AI), thereby maximizing its overall value. In this two-part blog series, we will review how to prepare chromatography data, centralizing it and engineering it into a vendor-agnostic format, and explore various use cases where it can be effectively employed.

Shimadzu LabSolutions: Empowering Chromatographers Worldwide

Shimadzu LabSolutions is a suite of software designed to manage data and workflows in analytical laboratories. It can control a wide range of scientific instrumentation, including high-performance liquid chromatography (HPLC) and gas chromatography (GC) systems. LabSolutions ensures data integrity, security, and traceability, supporting compliance with regulatory standards such as FDA 21 CFR Part 11.

Figure 1: Screen capture of a chromatogram rendered in LabSolutions software, as generated by a Shimadzu HPLC system.

Integration with Data Analytics and AI

Shimadzu and TetraScience have entered into a groundbreaking partnership. This collaboration seamlessly integrates LabSolutions with the Tetra Scientific Data and AI Cloud. The result? Data generated through LabSolutions can now be automatically centralized within a customer's dedicated instance of the TetraScience cloud-based platform. This integration paves the way for versatile data utilization across various use cases and marks a significant step toward making chromatography data AI-ready.

Benefits for Chromatographers Worldwide

This partnership is not just about technology; it's about empowering chromatographers worldwide. Data generated through LabSolutions is transformed into a standardized, accessible format, opening up new possibilities for data analysis and utilization with diverse data analytics tools and AI applications. Regardless of their location, chromatographers can now harness the power of AI for enhanced research.

Figure 2: The immutable order of operations that must be carried out to enable the widespread use of data science, machine learning, and AI on the data generated within an organization.

Standardization for Global Accessibility

Data standardization is key in chromatography. LabSolutions and TetraScience offer a standardized data structure that is ideal for storage and accessibility. This format is vendor-agnostic, ensuring that chromatographers worldwide can access and analyze their data consistently and efficiently.

So what’s new?

Let's delve into the details of this new release. In collaboration with Shimadzu, TetraScience is proud to introduce the first release of a Tetra LabSolutions intermediate data schema (IDS). With this IDS, data is stored in an open and vendor-agnostic format, JavaScript Object Notation (JSON). This schema captures scientific data in a format that is easy to interpret, vendor-neutral, and, for some fields, even instrument class agnostic. This means data can now be queried and aggregated in ways previously impossible. For example, users can query by sample ID or user ID across all data generated within their organization.
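To make this concrete, here is a minimal, hypothetical sketch of what an IDS-style JSON document could look like and how it can be handled programmatically. The field names below are illustrative only and are not the actual Tetra LabSolutions IDS schema.

```python
import json

# Illustrative only: field names are hypothetical, not the actual Tetra LabSolutions IDS.
ids_document = {
    "system": {"vendor": "Shimadzu", "software": "LabSolutions"},
    "sample": {"id": "SAMPLE-0042", "holder_type": "vial", "injection_location": "1:A,3"},
    "users": [{"id": "jdoe"}],
    "results": {
        "peaks": [
            {"name": "main", "retention_time": {"value": 4.21, "unit": "min"}, "area_percent": 98.7},
            {"name": "impurity-1", "retention_time": {"value": 5.03, "unit": "min"}, "area_percent": 1.3},
        ]
    },
}

# Because every document shares the same vendor-agnostic, key-value structure,
# simple lookups (for example, by sample ID) work the same way across instruments.
as_json = json.dumps(ids_document, indent=2)
print(json.loads(as_json)["sample"]["id"])  # -> SAMPLE-0042
```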

Figure 3: The Tetra LabSolutions IDS as depicted in the Tetra Data Platform’s interactive schema viewer.

Why JSON?

JSON is an attractive format for lab automation and data science due to its lightweight nature and ease of access for both humans and machines. Its key-value pairing structure enables efficient data serialization and parsing. This structure makes storing, transmitting, and working with complex datasets straightforward in most of today’s programming languages. Beyond creating standardized structures for raw data, the Tetra Data Platform extends its utility by automatically extracting data into relational database table views. These views are readily accessible through SQL queries, offering additional versatility and ease of access for data manipulation and analysis.
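To give a rough feel for what SQL access to these table views looks like, the sketch below builds a stand-in peaks table in an in-memory SQLite database and queries it by sample ID. The table and column names are hypothetical; the Tetra Data Platform generates and exposes its own views.

```python
import sqlite3

# Stand-in for a flattened IDS table view; names are hypothetical, not the platform's actual schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE labsolutions_peaks ("
    "  sample_id TEXT, peak_name TEXT, retention_time_min REAL, area_percent REAL)"
)
conn.executemany(
    "INSERT INTO labsolutions_peaks VALUES (?, ?, ?, ?)",
    [
        ("SAMPLE-0042", "main", 4.21, 98.7),
        ("SAMPLE-0042", "impurity-1", 5.03, 1.3),
        ("SAMPLE-0043", "main", 4.19, 99.1),
    ],
)

# Pull processed peak results (retention time, area percent) for one sample across runs.
rows = conn.execute(
    "SELECT peak_name, retention_time_min, area_percent "
    "FROM labsolutions_peaks WHERE sample_id = ? ORDER BY retention_time_min",
    ("SAMPLE-0042",),
).fetchall()
for row in rows:
    print(row)
```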

Figure 4: Entity Relationship Diagram (ERD) for Tetra LabSolutions IDS.


The Tetra LabSolutions IDS is parsed from data-rich .lcb and .lcd files after they are ingested into the Tetra Data Platform. This inaugural release of the LabSolutions data schema includes support for HPLC instruments utilizing UV-visible and refractive index (RID) detection methods. The raw chromatographic detector values are stored as “datacubes.” They can be used to recreate chromatograms or even customize new visualizations, such as use case–specific chromatogram overlays created with a scientist's favorite third-party and open-source tools.
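For a sense of what that looks like with open-source tools, the sketch below overlays two synthetic detector traces using matplotlib. The arrays stand in for datacube values retrieved from the platform, and the peak shapes are invented purely for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-ins for two chromatographic datacubes (time in minutes, UV response in mAU).
time = np.linspace(0, 10, 1000)
run_a = 120 * np.exp(-((time - 4.2) ** 2) / 0.02) + 4 * np.exp(-((time - 5.0) ** 2) / 0.03)
run_b = 115 * np.exp(-((time - 4.3) ** 2) / 0.02) + 6 * np.exp(-((time - 5.1) ** 2) / 0.03)

# Overlay the two chromatograms for a side-by-side comparison.
plt.plot(time, run_a, label="Run A")
plt.plot(time, run_b, label="Run B")
plt.xlabel("Retention time (min)")
plt.ylabel("Detector response (mAU)")
plt.title("Chromatogram overlay from datacube values")
plt.legend()
plt.show()
```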

Figure 5: Overlaid chromatograms rendered in Tibco Spotfire using Tetra Data as the data source.

In addition to raw measurements, the Tetra IDS also includes processed results as calculated by LabSolutions, including retention time, area percent, asymmetry, and resolution. These values are cataloged in the Peaks section of the IDS under Results. This data can now be easily accessed for downstream task-specific calculations and integration into downstream repositories, such as an ELN or LIMS, facilitated by industrialized Tetra Connectors.

A few other highlights of the new Tetra LabSolutions IDS are the inclusion of method data, such as pump control parameters, high-level system details like software and software version, and instrument components. The IDS also captures sample details, including sample ID, sample holder type, and injection location. The comprehensive list of IDS values is assembled to build Tetra Data that is foundational to establishing numerous value-adding use cases.

In the second segment of this blog series, we will delve into the diverse applications of Tetra LabSolutions IDS, demonstrating its capability to generate substantial value for chromatographers.

Existing customers can learn more about these updates by visiting the Shimadzu LabSolutions README on the TDP (slug: shimadzu-labsolutions-raw-to-ids). It thoroughly describes our parsing strategies and provides examples of accessing the data via SQL. Feel free to reach out to your Customer Success Manager with any questions.

Not a customer, but curious to learn more? Please connect with us. We’d be excited to hear from you.

Blog
Use the Tetra GxP Package to reduce your validation cost and time
Quality & Regulatory
Quality & Compliance

Efforts to comply with GxP regulations and guidelines are often seen as significant but not value-adding activities. While you cannot buy GxP compliance, there are means to significantly reduce the burden.

First of all, we highly recommend leveraging the Computer Software Assurance (CSA) approach to improve efficiency and lower costs. With updated risk guidance, your system should be substantially derisked, and the assessment activities can be scaled down correspondingly.  

The Tetra GxP Package is designed to hew closely to best practices and guidance on CSA, streamline your validation process, and provide substantial savings in time and effort. The GxP Package is an annual subscription executed as part of the annual subscription license agreement. Details about the package can be found in our GxP solution brief and V&V strategy document.  

With the Tetra GxP Package, you receive (for every major/minor release):

  • Platform requirements and traceability
  • Platform validation scripts
  • Platform V&V summary report
  • File-Log Agent requirements and traceability
  • File-Log Agent validation scripts
  • File-Log Agent V&V summary report
  • Empower Agent requirements and traceability
  • Empower Agent validation scripts
  • Empower Agent V&V summary report

…and more as we add our industrialized integrations to the Tetra GxP Package.

We estimate an 80% decrease in time and cost to perform GxP system initial validation, change control, and recurring annual upgrades/revalidations. This represents a 2-3x ROI for your investment.  

The GxP package supports your validation efforts with:  

  • Platform updates: When a major or minor update to the platform is released, you will receive updated validation documentation and release notes that describe the changes. This includes updated requirements for new functionality, executed validation scripts for new and existing functionality, and a summary report of the verification testing performed pre-release, including the GxP assessment of the new release. To complete your change management activities, your validation team would otherwise have to determine what new requirements are met by the update, revise the requirements specification and validation scripts, update traceability, and so on. These time-consuming tasks are already performed by TetraScience and are part of the GxP Package, allowing customers to avoid duplicating these non-value-adding activities.
  • Agents and connectors: As we add more industrialized integrations to the Tetra GxP Package, its value rises correspondingly. With our completed documentation set, including the documentation of requirements, validation scripts, and traceability, customers don’t have to perform the related validation from scratch. We estimate a 2-3x ROI for new validations.  
  • Agent and connector updates: Just like platform updates, modifications to agents and connectors lead to a proportional increase in change control effort. To mitigate this, customers can use our GxP package.
  • Time delay: Teams can avoid time-consuming assurance activities and documentation as well as delays in value realization from the updates.  
  • Support: As part of the Tetra GxP Package, you receive support for do-it-yourself (DIY) validation efforts, even if they don’t align with our actual validation strategy or evidence.  
  • Supplier assessment: The Tetra GxP Package includes the opportunity to perform a live supplier assessment audit for up to two days free of charge.

The adoption of the Tetra GxP Package will accelerate your transition to industry best practices of CSA. It will significantly reduce your total validation cost, speed up value realization, and ensure your continued success leveraging the Tetra Scientific Data and AI Cloud™ in your validated workflows.  

To learn more, read the GxP solution brief.

Blog
Multiple advantages of multi-tenant deployments

Multi-tenancy is a typical approach for cloud solution providers. But we often see customers who are reluctant to opt for multi-tenant deployments when a single-tenant option is available. In this blog post, I will discuss the benefits of a multi-tenant deployment.

Multi-tenant architecture

While a single-tenant deployment provides each customer with a distinct software instance running on infrastructure that is not shared with other users, a multi-tenant deployment uses shared infrastructure to provide access to a cloud-based solution for multiple customers. In this software architecture, the shared infrastructure serves multiple tenants simultaneously, giving multiple user groups access to one instance of an application or system. The users share the infrastructure resources, while each tenant is completely isolated and has no access to other tenants' data.

TetraScience and multi-tenant deployments

“We observe a clear trend toward multi-tenancy. While only about 25% of our legacy customers opted for a multi-tenant deployment, over 60% of our current installations are multi-tenant. All our customers who have been onboarded in the last few quarters have opted for a multi-tenant deployment.” -Naveen Kondapalli, SVP Products & Engineering

Tetra Data Platform

A Tetra Data Platform (TDP) deployment is a single instance of TDP running in a dedicated Amazon Web Services (AWS) account. Each deployment can be provisioned on single- or multi-tenant infrastructure, and each entity in the TDP infrastructure provides isolation of users, data, and workflows. Customers typically have two or three deployments, depending on their processes, which might require development, test, and production deployments, often in a GxP environment.

Multi-tenancy advantages

We typically recommend multi-tenant deployments as they come along with significant advantages for customers.

Reduced total cost of ownership (TCO)

A multi-tenant cloud architecture is usually more cost-efficient than a single-tenant one. The infrastructure and the cost of a single environment (i.e., hardware, software, and maintenance) are shared by all the tenants. Customers do not need to allocate any internal resources to manage and maintain the environment.

Faster access to new releases

A multi-tenant deployment gives customers immediate access to new features once the software is released. Naturally, they can also delay an update when their environment is GxP regulated and the production system needs to be validated before deployment.

Faster time to resolution

The provider of a multi-tenant environment has access to the system to troubleshoot any issues. This reduces support turnaround time and the customer's operational burden.

Scalability

Cloud deployments are highly scalable: customers use only the resources they need. Customers don't need to plan the purchase and ongoing maintenance of an extension to their environments; they simply request an expansion of their current deployment. And if a company needs to scale down, it can easily reduce the size of the deployment to avoid unused server capacity.

Increased security

Multi-tenancy is secure, especially when maintained by an experienced service provider like TetraScience leveraging AWS. Customer-hosted environments depend on in-house expertise, whereas relying on TetraScience's security team provides effective threat and intruder detection and prevention.

TetraScience executes the following security operations for our customers:

  • Cloud infrastructure entitlement management (AWS roles and privileges for the environment)
  • Cloud security posture monitoring (detecting AWS misconfigurations that could lead to security issues or compliance risks in the environment)
  • Intrusion and anomaly detection (threats potentially targeting the environment)
  • Incident response (active response to threats targeting the environment)

Summary

Multi-tenancy is the typical approach for cloud deployments and is increasingly the tenancy of choice for biopharmaceutical companies. Customers benefit from reduced total cost of ownership (TCO), faster access to new releases, faster turnaround times, scalability, and high security for their environment. TetraScience supports multi-tenant customers with the execution of key security operations and provides the required architecture for GxP-regulated organizations.

Contact our experts to learn more!

Blog
How to overcome the hurdles to Pharma 4.0

The biopharmaceutical industry is in the midst of a digital revolution, referred to as "Pharma 4.0," in which innovative technologies and data strategies are being introduced to drive efficiency, reduce costs, and accelerate drug development. In a recent SLAS Technology publication [1], teams at Biogen and TetraScience outline the challenges facing this movement and how cloud data pipelines, harmonized data, artificial intelligence, and other tools and technologies will pave the way for digital transformation. The authors describe a proof-of-concept case study to digitize a Thermo Fisher ViiA7 qPCR data workflow that enhances data accessibility, interoperability, and analysis efficiency for protein therapeutic development.

Key takeaways

  • Replatforming scientific data: Lab Data Capture (LDC) is a foundational strategy in Analytical Development at Biogen to collect and harmonize siloed scientific data from analytical instruments, reporting systems, and operational platforms. This process facilitates the transformation of diverse, proprietary instrument data into a standardized, harmonized format, enabling seamless data integration across platforms. The result is data that is FAIR: findable, accessible, interoperable, and reusable.
  • Cloud architecture: The Tetra Data Platform significantly contributes to the implementation of LDC by providing cloud infrastructure, tools, and the largest and fastest-growing library of instrument and software data models and integrations. It addresses the hurdles to Pharma 4.0 and data integrity by centralizing, standardizing, and managing scientific data. This is crucial for effective data utilization and regulatory compliance.
  • Compliance-by-code: The transition to a compliance-by-code model, supported by LDC and the Tetra Data Platform, ensures laboratory processes and data handling are defined by code and verifiable through an audit trail. This offers a major advantage over manual operating procedures that are written with ambiguity that leads to creative interpretation and differences in execution. Automated processes will standardize workflows and reduce manual deviations from operating procedures, ensuring data quality and confidence. This is especially critical for pharmaceutical activities that adhere to regulated data integrity standards.
  • Web applications: Cloud-based analytical applications that integrate with the Tetra Data Platform facilitate easy access and analysis of harmonized data for scientists, simplifying routine analysis and improving the reliability and efficiency of laboratory workflows.
  • Cultural and technological shift: The adoption of LDC requires a cultural shift towards digital transformation and compliance-by-code within laboratories. This change is supported by training and upskilling staff to adapt to new technologies and workflows, enabling Biogen to more effectively leverage data for therapeutic development.

Example of the Lab Data Capture strategy. Reproduced from the original abstract figure with permission from SLAS Technology and Elsevier.

Summary

The Pharma 4.0 digital revolution signifies a major transformation for the biopharma industry, driven by advanced technologies and strategic data management. Through the implementation of Lab Data Capture, the use of the Tetra Scientific Data and AI Cloud, and the move toward compliance-by-code, the industry is poised for significant advancements. These strategies produce scientific data that is harmonized, data integrity and regulatory compliance that is automated, and laboratory workflows that are optimized, all facilitating faster and more efficient drug development. This revolution necessitates a cultural and technological shift within the industry that emphasizes the importance of upskilling and embracing digital transformation at all levels. By addressing these key areas, these advancements promise to enhance data integrity, streamline laboratory operations, and ultimately accelerate the delivery of new therapies to patients.

Learn more

  • Read the full publication for complete insights from the authors.
  • Interested in joining the digital revolution? Reach out to the experts at TetraScience. 

References

  1. Van Den Driessche GA, Bailey D, Anderson EO, Tarselli MA, Blackwell L. Improving protein therapeutic development through cloud-based data integration. SLAS Technol. 2023 Oct;28(5):293-301. doi: 10.1016/j.slast.2023.07.002. Epub 2023 Jul 16. PMID: 37454764.
Blog
Replatform and engineer your scientific data with the world’s largest, fastest-growing, purpose-built library
IT
Integration
Contextualization
Harmonization

Over the last four years, TetraScience has been replatforming and engineering scientific data from hundreds of thousands of data silos. Our goal is to combat the inherent fragmentation of the scientific data ecosystem to unlock Scientific AI. 

The approach we’ve taken—leveraging productization, using a data stack designed for scientific data, and adhering to a vendor-agnostic business model—has been very challenging. However, we firmly believe that our strategy is the only way to generate the large-scale, liquid, purpose-engineered datasets that Scientific AI requires. 

In this blog, we’ll share where we stand today, what we’ve learned, and what we’re planning to do next. 

It’s all about scientific data and it’s about all your data

TetraScience started its mission by replatforming and engineering one of the most notoriously challenging types of scientific data: instrument data. These datasets are often highly fragmented and mostly trapped in vendor-proprietary or vendor-specific formats.

Over time, the Tetra library of integrations and data schemas has greatly expanded beyond instrument data and most notably includes:

  1. Experimental context and design via bi-directional ELN/LIMS integration
  2. Analysis results via apps accessed through the Tetra Data Workspace
  3. Data from contract research organizations (CROs) and contract development and manufacturing organizations (CDMOs)

For each endpoint system, TetraScience aims to extract as much of the scientifically meaningful data as possible. In a previous blog post, we shared why our strategy is fundamentally data-driven and AI-native in contrast to a middleware approach that is application-driven and, therefore, more limited by nature. 

The largest library and highest coverage 

Considering the vast array of scientific data sources within the biopharma industry, how does TetraScience’s library of productized integrations and data schemas measure up? Let's evaluate our library from three different perspectives.

For a typical biopharma

Since the beginning of 2024, three of the many biopharmaceutical companies we are partnering with have shared their instrument inventory lists with us. This allowed us to identify how to replatform their scientific data to the cloud and enable analytics and AI.

Upon analyzing their lists, we discovered that, on average, TetraScience's existing integrations already covered over 95 percent of their lab instruments. Thus, by and large, we could fulfill their immediate needs for lab data automation and scientific data management.

Our process of engineering scientific data involves harmonizing vendor-specific data into Tetra Data, using an open, vendor-agnostic format. Initially, our data schemas supported roughly 60 percent of their instruments. The resulting transformed data (Tetra Data) allows these organizations to harness their scientific data for analytics and AI whenever they are ready, thereby future-proofing their data strategy.

Digging into the priorities of their instruments, we found that our library covered 100 percent of the highest-priority instruments for data replatforming and data engineering. This is unsurprising since TetraScience's library has been developed based on customer requests and thus supports the most popular scientific use cases. Therefore, it serves biopharma’s expected business outcomes and provides significant value. 

For a popular endpoint category 

Another way to illustrate TetraScience’s coverage is to map the TetraScience library to common endpoint categories. Here are some examples: 

Each entry lists the vendor and its software or instrument, grouped by instrument technique or endpoint type.

Chromatography and HPLC
  • Cytiva: UNICORN
  • Shimadzu: LabSolutions
  • Thermo: Chromeleon
  • Waters: Empower
  • Agilent: OpenLab

Mass Spectrometry
  • Waters: MassLynx, Empower
  • Thermo Fisher: Xcalibur
  • Sciex: Analyst

Plate Reader
  • BioTek: Gen5, Synergy
  • BMG LABTECH: CLARIOstar, FLUOstar, NEPHELOstar, PHERAstar, POLARstar, SPECTROstar
  • Bruker: OPUS
  • Luminex: Magpix
  • MSD: MESO Quickplex SQ 120, MESO SECTOR S 600
  • Revvity: Microbeta2, TopCount, EnVision
  • Sartorius: Octet
  • Tecan: Spark, Sunrise
  • Thermo Fisher: CellInsight CX5, CX7, FluoroSkan, VarioSkan LUX
  • Unchained Labs: Stunner, Lunatic, DropSense
  • Wyatt: DynaPro

ELN/LIMS
  • Benchling: Notebook
  • BIOVIA: ONE Lab
  • IDBS: E-Workbook
  • Revvity: Signals Notebook
  • LabWare: LIMS
  • LabVantage: LIMS
  • Dotmatics: Studies

For an end-to-end scientific workflow

Next, let's consider a scientific use case. Bioprocessing involves a series of carefully controlled and optimized steps to produce pharmaceutical products from living cells. A typical workflow is divided into three stages: upstream processing, downstream purification, and characterization and critical quality attribute (CQA) monitoring. Throughout this process, a large number and diversity of scientific data sources are used. Below, we list the endpoints from the Tetra library that cover each stage of the bioprocessing workflow end to end: 

Upstream Bioprocessing (Cell Culture Fermentation)

Each entry lists the vendor and its software or model, grouped by data source.

Bioreactors
  • Eppendorf: DASGIP
  • Sartorius: Ambr250 (via KepServerEx Connector)

Plate Readers (titer, productivity)
  • Biotek: ELX808, Synergy HTX, Synergy 2, Synergy H1
  • BMG LABTECH: CLARIOstar, POLARstar, FLUOstar, NEPHELOstar, PHERAStar
  • MesoScale Discovery: Quickplex SQ 120, MESO Sector S 600
  • Molecular Devices: SpectraMax
  • Revvity: Envision, Microbeta 2
  • Sartorius: Octet HTX, Octet Red96
  • Tecan: Spark, Sunrise, Infinite 200
  • Thermo Fisher: CellInsight CX5/CX7, FluoroSkan
  • Unchained Labs: Lunatic, Little Lunatic, DropSense
  • Wyatt: DynaPro DLS

Cell Counters / Viability
  • Beckman Coulter: ViCell Blu, ViCell XR
  • Chemometec: NucleoCounter NC-200, NC-202
  • Roche: Cedex Hi-Res (via AGU SDC)

Chemistry Analyzer (cell viability, metabolites)
  • Beckman Coulter: MetaFlex (via AGU SDC)
  • Nova Biomedical: BioProfile FLEX2
  • Roche: Cedex Bio, Cedex BioHT (via AGU SDC)

Liquid Chromatography (protein quality)
  • Thermo Fisher: Chromeleon
  • Waters: Empower (bidirectional capability)

Downstream Bioprocessing and Purification

Each entry lists the vendor and its software or model, grouped by data source or target type.

Electronic Lab Notebook (ELN)
  • IDBS: E-Workbook
  • Revvity: Signals Notebook
  • Benchling: Notebook

Laboratory Information Management System (LIMS)
  • LabWare: LIMS
  • LabVantage: LIMS

Fast Protein Liquid Chromatography
  • Cytiva: UNICORN, ÄKTA avant, ÄKTA express, ÄKTA micro, ÄKTA pure

Characterization and CQA Monitoring

Each entry lists the vendor and its software or model, grouped by data source or target type.

Electronic Lab Notebook (ELN)
  • IDBS: E-Workbook
  • Revvity: Signals Notebook
  • Benchling: Benchling

Laboratory Information Management System (LIMS)
  • LabWare: LIMS
  • LabVantage: LIMS

pH Meter
  • Mettler Toledo: LabX

Spectrophotometer
  • Agilent: Cary 60
  • Thermo Fisher: NanoDrop 8000

Plate Reader
  • MesoScale Discovery: Quickplex SQ 120, MESO Sector S 600
  • Revvity: EnVision, Microbeta 2
  • BMG LABTECH: CLARIOstar, POLARstar, FLUOstar, NEPHELOstar
  • Molecular Devices: SpectraMax
  • Tecan: Spark, Sunrise, Infinite 200
  • BioTek: ELX808, Synergy HTX, Synergy 2, Synergy H1
  • Sartorius: Octet HTX, Octet Red96
  • Thermo Fisher: CellInsight CX5/CX7, FluoroSkan
  • Unchained Labs: Lunatic, Little Lunatic
  • Wyatt: DynaPro DLS

Capillary Electrophoresis
  • Revvity: LabChip GXII Touch
  • ProteinSimple: Maurice

Gel Imager
  • Bio Rad: Gel Doc XR+
  • Azure Biosystems: Imaging System

Light Scattering System
  • Malvern Panalytical: ZetaSizer
  • NanoTemper: Prometheus

High Performance Liquid Chromatography (HPLC)
  • Waters: Empower (bidirectional capability)
  • Thermo Fisher: Chromeleon
  • Wyatt: ASTRA SEC-MALS

Fast Protein Liquid Chromatography (FPLC)
  • Cytiva: UNICORN, ÄKTA avant, express, micro, pure

Quantitative Polymerase Chain Reaction Instrument (qPCR)
  • Thermo Fisher: QuantStudio 7 Pro, 7 Flex, 12K Flex, ViiA7

Nuclear Magnetic Resonance (NMR)
  • Bruker: AVANCE 700

Surface Plasmon Resonance (SPR)
  • Bruker: Sierra SPR32

Liquid handlers
  • Tecan: D300e, Fluent
  • Beckman Coulter: Biomek i7
  • Hamilton: Microlab STAR

The fastest-growing library 

For any data sources not yet supported by TetraScience, you can read about our approach in this blog post: What customers need to know about Tetra Integrations and Tetra Data Schema. In short:

  • TetraScience publishes our library and roadmap transparently so you know exactly what is currently in the library and how we plan to grow it: Data replatforming/integration library (layer 1) and Data engineering library (layer 2)
  • TetraScience provides monthly newsletters announcing the latest releases for data integrations and data models: TetraConnect News 
  • Customers can request to add or accelerate items in the roadmap. TetraScience prioritizes requests based on criticality and impact. If there is a component to productize, TetraScience will create and maintain it for all customers. 

In the last two years, TetraScience has delivered more than fifty new or materially improved library components every six months. With the introduction of our Pluggable Connector Framework, TetraScience will further accelerate this tempo.

TetraScience also publishes guidelines to select customers on how to build Intermediate Data Schemas (IDSs), which accelerates their ability to extend the library. For example, this video from our training team teaches users how to create their own pipelines: Tetra Product Short Cuts: Self-Service Tetra Data Pipelines.

In addition to industrializing components for our library, TetraScience has rolled out ready-to-use validation scripts as part of our GxP Package. Our verification and validation (V&V) document set is designed to help customers save as much as 80 percent of their validation effort, allowing them to focus on the “last mile validation.” 

Learn more about some of the components in our data replatforming and engineering library:

The only purpose-built library 

Our journey will never be completed. However, we’re eager to share the significant amount of investment and work TetraScience has made to fulfill our promise to the industry. This commitment involves combining our expertise in technology, data, and science to deliver material impact. 

Understand and overcome the limitations of endpoint systems 

Most of these systems are not designed to interface with data analytics and AI. Their primary function is to execute specific scientific workflows, rather than to preserve and surface information for analytics or AI. Here are some of the most challenging situations we have observed:

  • Change detection is extremely difficult for common lab data systems, such as chromatography data systems (CDS). A typical CDS controls hundreds of HPLCs, holds data from thousands of projects, and supports hundreds to thousands of scientists. As a result, it can be virtually impossible to efficiently detect modifications or new runs.
  • Binary data files are a prevalent choice among vendors, and they are only readable inside the vendor's own analysis software. Sometimes these vendors provide a software development kit (SDK). However, because the instrument control software must be installed concurrently for the SDK to function, it does not qualify as a true standalone SDK. Vendors also often restrict third parties from using key libraries in the SDK.
  • Data interfaces are often undocumented, incorrect, or not designed for analytics or data automation. For example, some lab data systems can return incorrect or conflicting data if using different interfaces, or fail to handle periodic polling on the order of minutes. Anticipating or reproducing these scenarios is often impossible without large-scale data or real lab facilities.

Understand the science and the scientific purpose

A typical approach in the industry is to focus on the integration of instruments and scientific applications without considering the larger picture of the scientific use case. While having many industrialized and validated integrations and data schemas is undeniably essential, it is critical to also have the scientific workflow and purpose in mind. 

  • What is the scientific end-to-end data workflow of the scientist? 
  • Is this workflow part of a larger process?
  • What does the scientist want to achieve, and what is the desired outcome? 
  • Which data and results are relevant to achieve it? 
  • What is the relevant scientific metadata? 
  • For which purpose does the scientist need this metadata today (e.g., search, data aggregation, analytics)? 
  • How might the scientist want to leverage this data later in different applications?
  • What other purposes might the scientist have for the data in the future? 
  • What are the functional and nonfunctional requirements to fulfill the scientific use cases?

Being able to answer these questions helps create the best possible data workflows using suitable integrations and data schemas. To ensure our library is purpose-built for science, 48 percent of TetraScience's staff have a scientific background and 54 percent have advanced degrees (MS or PhD).

Mimic a scientific workflow via live instrument testing 

One of the most important lessons we have learned is that scientific data formats vary widely and are subject to different kinds of configurations, assays, hardware modules, and operating systems used. As a result, TetraScience has started to contract with various institutions to perform live testing while scientists conduct real scientific workflows. This ensures that our integrations and schemas perform as intended, delivering value to scientific workflows.

Next steps

TetraScience has, is, and will continue to invest in differentiating capabilities for the replatforming and engineering of scientific data from hundreds of thousands of data silos. This endeavor is combating the widespread fragmentation of the scientific data ecosystem. In 2024, we will: 

  1. Continue to evolve our foundational schema component library
  2. Adapt our existing schemas to strengthen harmonization across specific scientific workflows and data sources
  3. Evolve platform architecture to accelerate expansion of the data engineering library 
  4. Rapidly detect edge cases and remediate them through alerting and monitoring
  5. Deploy and manage components at scale from a centralized platform 
  6. Perform exploratory testing focused on scientific use cases

We are on this journey together

Each of the endpoint systems holds your scientific data. In the modern data world, every organization like yours is demanding seamless access to their data and liquid data flow in and out of these systems. The "tax or tariff" that certain vendors put on your data is no longer acceptable—nor does it have to be. This endpoint-centric data ownership mindset fundamentally does not work in our era of data automation, data science, and AI. Industrializing the building blocks of data replatforming and engineering is inevitable for the industry to move forward. 

TetraScience provides the vehicle for this paradigm shift. Where there is a tax or tariff on your data, we encourage every biopharma organization to take an active role in the success of its own data journey. For example, you can:

  • Submit justified requests to your endpoint provider, insisting that your data be freely accessible along with the related documentation.  
  • Involve TetraScience in your planning process. TetraScience can help ensure that your agreements with endpoint vendors include sufficient requirements for openness and liquidity of your data generated or stored in these endpoint systems.

We are on this journey together. Contact one of our experts today.

Blog
Tetra Data upgrade for Chromeleon CDS: unlocking the comparability of chromatography data
Chromatography
IT
Science
Integration
Contextualization

Chromatography is a biopharma workhorse and its use is expanding worldwide at a staggering pace. To leverage the growing datasets from this technology, organizations will need forward-looking strategies for data management. Merely storing, securing, and archiving chromatography data is not enough. Biopharmas need a next-generation solution. 

The Tetra Scientific Data and AI Cloud addresses this need by transforming chromatography data into future-proof, AI-native Tetra Data. Our latest updates to Tetra Data help you unlock even more value from your chromatography data. They provide enhanced support for analytical and AI-enabled use cases, such as method transfer/validation and human-in-the-loop automated peak integration.

Pressing challenges for chromatography scientists

Conversations with biopharma customers who use chromatography systems revealed common pain points and desired outcomes. Here are the key challenges that inspired our most recent updates:

  • High vendor diversity. The chromatography landscape is marked by a broad range of vendors, equipment, software, ontologies, and raw data formats. This diversity results in a fragmented data environment with minimal standardization and highly heterogeneous data.
  • Comparability issues. The lack of standardization across data formats, analytical algorithms, and system capabilities leads to chromatograms, methods, and results that are only useful for their initial purpose. Scientists struggle to compare data across diverse hardware and software within an organization, limiting their ability to draw broader or deeper insights.
  • Limited context. Hundreds of parameters contribute to the generation, analysis, and reporting of chromatography data. However, only a small percentage of this information is included in standard outputs. Without such context, it is difficult to accurately interpret the data or reproduce the results.

So what’s new in Tetra Data?

With the release of Tetra Chromeleon Agent v2.1.1 and IDS v6.0.1, chromatography Tetra Data is harmonized and enhanced in three ways:

  • Common components for chromatography methods introduce a foundation for data and method comparability.
  • A comprehensive list of reportable parameters provides valuable context for chromatography data.
  • Broader, inclusive support of third-party components offers expanded compatibility with a wide and growing list of hardware.

Let’s take a closer look.

1. Comparability of data and methods

Chromatography data provides valuable insights on the separation, identification, and purification of sample components. However, with so many vendors, methods, and peak-resolving algorithms, comparing data can be arduous. Tetra Chromeleon IDS v6.0.1 unveils an exciting new component in Tetra Data that aligns common reportable fields across different vendors’ CDS outputs. Once replatformed on the Tetra Data Platform (TDP), chromatography data is engineered with a standardized ontology to become vendor-agnostic Tetra Data. Harmonized data allows a parallel query structure between different IDSs, making an SQL query directly transferable from one CDS to another. As a result, methods and results on one instrument using Chromeleon can be compared to those of another using Empower, Shimadzu, UNICORN, or any other CDS following the same method. Simplified comparisons of chromatograms, methods and results across instruments and software vendors help scientists transfer and validate methods throughout the organization, enhancing the value and reach of each experiment. 
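As a simple illustration of that parallel query structure, the sketch below builds the same peak query for two hypothetical harmonized views. The view and column names are placeholders, not the platform's actual table names.

```python
# Illustrative only: view and column names are placeholders for harmonized, per-CDS IDS views.
QUERY_TEMPLATE = """
SELECT sample_id, peak_name, retention_time_min, resolution
FROM {view}
WHERE method_name = :method
ORDER BY retention_time_min
"""

# Because the harmonized component aligns field names across vendors, the same query text
# applies whether the data originated in Chromeleon, Empower, or another supported CDS.
chromeleon_query = QUERY_TEMPLATE.format(view="chromeleon_peaks")
empower_query = QUERY_TEMPLATE.format(view="empower_peaks")
print(chromeleon_query)
print(empower_query)
```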

Chromeleon IDS v6.0.0 marks the first release of the chromatography method IDS component, with more vendors to come in future updates.

The chromatography method Tetra Data component aligns common output fields across multiple CDS vendors.

2. Broader access to context-providing information

The complexity of chromatographic data requires a flexible toolbox for analyzing injection report variables. While retrieving a default parameter value like peak resolution from the CDS is straightforward, there are many ways to calculate this value. The context or formula used for this calculation may not accompany the data, despite its utility to the user. Contextual parameters, in contrast, provide a better understanding of the method and peak integration conditions. The Tetra Chromeleon IDS now supports more than 600 injection report variables, including peak widths at given heights, peak amount calibration data, peak resolution, and theoretical plates. This additional context will improve downstream decision-making, reproducibility, and compliance.

The Tetra Chromeleon IDS v6.0.1 surfaces a comprehensive list of reportable values.

3. Expanded compatibility

The chromatography equipment industry encompasses a wide range of vendors, systems, and components. Chromeleon is capable of collecting and analyzing data from a variety of them. Historically, the Tetra Chromeleon IDS required updates to parse and support new hardware integrated with the CDS. With Chromeleon IDS v6.0.1, however, a dynamic approach is used to retrieve data from all connected hardware, even if it was previously unencountered. This ensures that users can immediately utilize all data from the CDS as the hardware setup evolves.

When outputs become outcomes

Replatforming and engineering data into a large-scale, liquid, and compliant resource is an impressive output, but the outcomes it can deliver are what truly matter. Let’s consider an example to illustrate this point.

One prominent biotech monitors product stability using a reversed-phase ion-pair liquid chromatography assay to measure the purity of messenger ribonucleic acid (mRNA) products. This is a time-consuming process, taking 24 hours to run 96 samples across four Thermo Scientific high-performance liquid chromatography (HPLC) systems and another 45 minutes to analyze the results in the Chromeleon CDS. This workflow faces two major challenges:

  • Manual processing. mRNA molecules are large, and degradation products often exhibit subtle differences that are challenging to differentiate—that is, their peaks in the chromatogram do not fully resolve but rather overlap. Experimental variation complicates the results further, making it difficult to automate the analysis. As a result, processing parameters need to be manually optimized for each run within Chromeleon.
  • Inaccessible data. Process monitoring is difficult without ready access to data. Longitudinal charting depends on liquid data to track instrument variability and performance and to predict column degradation. Failing to do so can be costly—resulting in repairs, delays from repeat experiments, and lost materials. Manual efforts to aggregate and process this data for analysis are burdensome and prone to errors.

The Tetra Scientific Data and AI Cloud provides the capabilities to address these issues. The Tetra Chromeleon Agent and IDS pipelines automate the assembly and engineering of chromatography data, making the data fully accessible to scientists. For instance, they can now access data through round-trip data automation with their ELN. The resulting open, liquid, and purpose-engineered Tetra Data create value-adding outcomes for this customer. They can achieve operational and quality control enhancements through process monitoring and predictive maintenance, and support analytical use cases such as ML model building for human-in-the-loop semi-automated analysis. With TetraScience, the biotech company is improving how it captures and leverages its scientific data, eliminating tedious manual processes, and enhancing its AI readiness.

Conclusion

The growing superset of chromatography data in biopharma is highly diverse and unstandardized. Results often lack the full context of how they were captured and analyzed. Comparing across organizations, labs, or even different systems is difficult. The Tetra Scientific Data and AI Cloud is well positioned as the next-generation solution with the largest and fastest-growing library of integrations. The latest Tetra Chromeleon IDS lays a foundation for standardized, interoperable chromatography data (Tetra Data) that can be queried using the same terms and compared across all supported CDSs—and for emerging AI use cases that depend on it.

Chromeleon CDS Agent v2.1.1 also includes performance and robustness enhancements that minimize delays in the Agent to ensure reliability. More information can be found in the release notes.

Interested in learning more about leveling up your chromatography data? Contact one of our experts today.

Blog
Enhancing the scientific data engineering experience
IT
AI & Data Science
Integration
Contextualization

Scientific data engineers face obstacles every day—fragmented data ecosystems, proprietary data formats, data silos, lack of ontology, manual data aggregation, and more. These challenges must be navigated before the value-adding data analytics work even starts. But there’s a solution that offers a smooth path forward: the Tetra Scientific Data and AI Cloud™. It enables you to efficiently replatform, transform, and update data. Let’s take a tour through the latest improvements for scientific data engineers in the release of Tetra Data Platform v3.6.

Tetra Data Platform v3.6 enhances the scientific data engineering experience in three ways:

Low-code solutions

Tetra Data Platform v3.6 introduces a low-code solution for creating and updating protocols, essential to managing the business logic for data pipelines. Previously, modifying a protocol usually required the use of developer tools to create and deploy changes via the Tetra Data Platform API. This latest update simplifies the process by enabling you to directly incorporate DataWeave scripts through the pipeline’s user interface. This enhancement empowers users to directly input and edit the logic of pipeline protocols within the Tetra Data Platform, reducing the reliance on external development tools.

Example of a DataWeave script protocol within a Tetra Data Pipeline.

Bulk file processing

This feature enables the reprocessing of large numbers of files with fine control over which files to include (e.g., filtering by date range or process status). Users can name reprocessing jobs and monitor their progress in a new dashboard. As the scientific use cases for your data grow, you can easily re-examine large historical datasets or enrich them with more context to enhance their utility. Bulk file processing ensures that data can be efficiently updated at scale across an enterprise, allowing users to focus on more impactful work.

Bulk pipeline process jobs can be created on the Pipeline File Processing page of the Tetra Data Platform.
See low-code protocols and bulk file processing in action

Entity relationship diagrams

Another powerful tool introduced with Tetra Data Platform v3.6 is the entity relationship diagram (ERD). These diagrams visually represent the JSON Intermediate Data Schema (IDS) structure as tables. Users can intuitively explore the ERD by zooming and panning through the structure and relationships across tables.

Significantly, the ERD features a search function that simplifies the process of locating and selecting columns of interest. With just a single click, you can generate an SQL query for the fields you care about, making it easier and faster to parse through enterprise-level data and locate the specific data you need. This functionality allows users to focus on data utility rather than troubleshooting SQL syntax errors. As a result, users are better equipped to create robust data solutions and derive valuable insights from their data.

Example of an ERD used to view the relational data structures of the IDS, search for terms, select fields, and generate SQL queries from selections.
ERD SQL queries can be reviewed, modified, copied, or run directly on the Tetra Data Platform.
See ERDs in action

Summary

The Tetra Data Platform v3.6 offers a host of new features that will delight scientific data engineers, helping them build a simple, robust, and adaptable data ecosystem for their stakeholders. These new features streamline scientific data replatforming and engineering while simultaneously fueling downstream analytics and AI applications with engineered scientific data.

If you’d like to see how our scientific data engineering improvements can help your team, contact one of our experts today.

If you’re interested in other new features released with Tetra Data Platform v3.6, you can read about our self-service pipelines, pluggable connectors, file journey, and multi-byte character support.

Want expert training on these new features?

TetraU provides in-depth training for the Tetra Scientific Data and AI Cloud, including our new TDP v3.6 features. Check out TetraU for workshops and training content developed by the data and life science experts at TetraScience.

Blog
Pluggable Connector Framework accelerates development and release of data connectors
IT
Integration

The fast-paced and data-intensive field of biopharmaceutical research and development requires a future-facing scientific data management solution. Storing, securing, and archiving raw data are table stakes. While superior to paper log books, the traditional SDMS model is becoming obsolete. Many SDMSs have few external integrations and lack interoperability across diverse data sources and applications. These features are essential for advanced functions including data engineering and AI use cases. The traditional SDMS functions like a walled garden. It serves its basic role of protecting your scientific data but fails to promote connections that allow data to be used by outside applications to unlock its true value.

A next-generation solution must provide seamless connectivity with every component of a modern lab data ecosystem, from data generators like instruments to data consumers like analytics and AI. The Tetra Scientific Data and AI Cloud™ has the tools to do just that, including the Pluggable Connector Framework. Released in Tetra Data Platform v3.6, the Pluggable Connector Framework enhances the speed and efficiency of updating connectors, bringing agility to lab connectivity and data management. Let’s explore how it works and the benefits it provides.

See the Pluggable Connector Framework in action for real-time connector updates. In this example, an Amazon Web Services (AWS) S3 bucket Tetra Connector is updated with just a few clicks to a newer version that is capable of file type filtering using glob patterns.

Dynamic integration capabilities with Tetra Connectors

The new Pluggable Connector Framework introduced with Tetra Data Platform v3.6 provides functionality that accelerates the connector release cycle, offers rapid iteration, and scales to the growing needs of your organization.

The Tetra Data Platform v3.6 improves lab connectivity in three ways:

  1. Independent connector release cycles

The new Pluggable Connector Framework allows scientific data to be rapidly integrated by allowing new and updated connectors to be deployed independent of Tetra Data Platform releases. Pluggable Connectors can be upgraded and reconfigured using a user-friendly interface in the Tetra Data Platform. Moreover, this interface improves connector lifecycle efficiency to accelerate data assembly, enabling much faster integration with new data sources.

  2. Designed for rapid development

For technical solutions, the freedom to create and iterate is a key to success. The Tetra Scientific Data and AI Cloud is designed with iteration in mind, unlike monolithic traditional SDMSs. The Pluggable Connector Framework unlocks the potential to update and deploy new connectors and versions as needed. Rapid iteration in connector development can accelerate data assembly and enable faster deployment of new features, bug fixes, and connectors for new applications and data sources. The framework comes with many built-in features, such as configuration and lifecycle management, and standardized functions that Tetra’s developers can leverage to build connectors faster. The framework responds to shifting priorities by enabling the delivery of beta connector versions to customers that provide users with early access to solutions.

  3. Scaling at your pace

Business cases with expanding needs require scalable solutions. The Pluggable Connector Framework incorporates design enhancements that unlock more granular scaling for connector deployment. The optimized framework now provisions computing resources for each connector, offering outstanding performance for customers with large numbers of connectors. The result is a platform that can scale data inputs to your growing needs.

A closer look at Tetra Connectors and the Pluggable Connector Framework

Tetra Connectors are integral onramps for data from instruments, software, and file-hosting sources capable of exporting or publishing data. When a connected data source performs an activity (e.g., a Benchling assay run is created), an event record becomes available and actionable. The data source may publish to a networked event broker (such as AWS EventBridge), messaging queue (AWS Simple Queue Service [SQS] or Message Queuing Telemetry Transport [MQTT]), or it may have direct communication with the Tetra Connector through a REST Application Programming Interface (API).

For Tetra Connectors using our subscription model, the broker or queue holds a list of events to be processed by the connector, while the connector is subscribed to the queue and monitors for new events. For those using our API model, the connector queries the data source’s inbound API at a specified polling interval (e.g., every 10 minutes). For both models, the connector processes the event information and queries additional resources if applicable. The relevant scientific data is then replatformed into the Tetra Scientific Data and AI Cloud.

The figure below shows an example of a Tetra Connector workflow. In this workflow, the Benchling notebook will publish an event to EventBridge, which is then transferred to the SQS message queue. The Tetra Benchling Connector monitors SQS for new events and removes messages after an event is processed. The Tetra Connector will then use a URL contained in the event message to contact the Benchling API for event details, and subsequently trigger a data pipeline to process the event.
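A minimal sketch of this subscription pattern, assuming the standard boto3 and requests libraries, is shown below. The queue URL, payload fields, and downstream hand-off are placeholders and do not represent the actual Tetra Benchling Connector implementation.

```python
import json

import boto3
import requests

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/benchling-events"  # placeholder


def trigger_pipeline(details):
    # Placeholder for handing the event details off to downstream data processing.
    print("Would trigger a pipeline for:", details.get("id", "<unknown>"))


def poll_once():
    """Receive Benchling-style events from the queue, fetch details, and hand them off."""
    response = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
    for message in response.get("Messages", []):
        event = json.loads(message["Body"])
        # The event message is assumed to carry a URL pointing at the source system's API.
        details = requests.get(event["resource_url"], timeout=30).json()
        trigger_pipeline(details)
        # Delete the message only after it has been processed successfully.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```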

Communication path for the Tetra Benchling Connector.

With Tetra Data Platform v3.6, the Pluggable Connector Framework surfaces important elements of this sophisticated design and provides an intuitive user interface (UI) to deploy or reconfigure connectors. Within the Tetra Data Platform, users can create or update an existing connector with a few mouse clicks. They can customize connector information, including connector type, version, subscription addresses, client credentials, and other parameters. The connector configuration is updated immediately, and status checks run at a configured interval. Release notes for each version can be found in the Connector README in the Details tab on the Tetra Data Platform.

Edit Connector Information page in the Tetra Data Platform v3.6.

Summary

The Tetra Scientific Data and AI Cloud represents a significant departure from the traditional SDMS model. It has been purpose built to integrate all data producers and consumers throughout the modern laboratory environment, from instruments and software to analytics and AI, independently of the vendors. This functionality exemplifies what a next-generation solution should do. It is achieved, in part, through a flexible and dynamic integration strategy that provides data accessibility and interoperability with speed. The Pluggable Connector Framework helps meet these requirements with faster release cycles, rapid development, and scalable solutions. This framework increases the cadence of innovation, establishing the necessary interoperability for analytics and AI use cases.

To learn more about Pluggable Connectors and other exciting features of Tetra Data Platform v3.6, read the release notes. To learn how to create, configure, update, and monitor your own Pluggable Connector, read the documentation.

If you’d like to discuss how our Pluggable Connector Framework can help your team, contact one of our experts today.

If you’re interested in other new features released with Tetra Data Platform v3.6, you can read about our self-service pipelines, file journey, and multi-byte character support.

Want to try out these new features?

TetraU provides in-depth training for the Tetra Scientific Data and AI Cloud, including our new TDP v3.6 features. Check out TetraU for workshops and training content developed by the data and life science experts at TetraScience.

Blog
Fundamental challenges of scientific data

Biopharma organizations are poised for a generational paradigm shift. It will disrupt the industry, just as Amazon disrupted the traditional retail bookstore industry and Uber disrupted the taxi industry. 

Driving this change are the surge in computing power, widespread availability of cloud services, advancements in artificial intelligence (AI) and machine learning (ML) tools, shifts in the regulatory landscape, and Eroom’s law. Pharmaceutical and biotech companies will be able to achieve new breakthroughs and significantly enhance operational efficiency—if they can fully capitalize on emerging technologies.

Given that, many organizations have already begun to:  

  • Shift focus from tangible experiments to data as the core IP asset
  • Digitize and automate data flow
  • Replatform data to the cloud
  • Leverage analytics and AI/ML for breakthroughs

Each of these major trends centers on scientific data. And to succeed with these initiatives, organizations must properly replatform, engineer, and then leverage scientific data for analytics and AI. 

These activities are complicated by several unique structural factors related to scientific data, which result from the nature of the scientific industries. Let’s unpack these structural factors and see how they pose challenges to pharmaceutical and biotech organizations. 

What is scientific data?

The fundamental building blocks of scientific activities are experiments. Scientists design an experiment to investigate a hypothesis or gather evidence. In the life sciences, scientists typically create samples for testing and then perform experiments to test those samples. They subsequently collect the data and analyze it. This process can be represented by the design-make-test-analyze (DMTA) cycle:

  1. Design: Experimental design is typically recorded in an electronic lab notebook (ELN), defining experimental conditions and identifying the protocol to follow. The design is often a result of analyzing the results from previous experiments or data provided by other teams. Scientists might use computational tools to speed up the design process. 
  2. Make: Scientists prepare samples for testing using various instruments. These can include bioreactors for controlled growth of cells, purification instruments for isolating proteins or analytes, and liquid handlers for precise sample preparation.
  3. Test: Scientists may use instruments to perform the experiments, often leveraging partners such as a contract development and manufacturing organization (CDMO) or contract research organization (CRO) to perform in vivo or in vitro studies. Instruments are set with detailed parameters to ensure accurate testing, and in silico (computational) methods might also be employed to simulate tests or assess performance.
  4. Analyze: Scientists use a variety of analysis tools, including the same software used to perform the experiments as well as specialized applications. The results are then integrated with data from other stages of the DMTA cycle as well as historical data to make decisions.

Each step includes a wealth of associated information—either input or output. This collective set of information is defined as “scientific data.” Irrespective of the therapeutic area or target pathway, small-molecule discovery teams repeat a common workflow, the DMTA cycle, to optimize their identified hits toward clinical candidates. Other modalities such as biologics and cell and gene therapy (CGT) use similar versions of DMTA.

Inherent complexity of scientific data

Scientific data is inherently complex. Each step in experimental workflows—whether for the development of small molecules or biologics—generates data in some form.

For example, scientific data is generated by lab instruments, such as spectrometers, plate readers, liquid chromatography systems, and electron microscopes. It might represent the results of experiments and assays, outputs of computational algorithms, and observations.

Instrument data follows proprietary formatting designed by the manufacturer. There are thousands of vendor-model combinations, and for each combination, data can be produced in vastly different formats or structures depending on configuration, operating modes, and settings. For example, Molecular Devices has nine plate reader models. BMG Labtech has six models. Multimode plate readers can have three or more modes of operation. If a biopharma works with 12 vendors, each offering multiple models and detection modes, it could easily face more than 200 unique data formats (for example, 12 vendors × 6 models × 3 modes ≈ 216). And data from these instruments is roughly tripling annually.

There are also additional sources of data. For example, organizations collect data during other in vitro and in vivo experiments, as well as modeling experiments conducted in silico.

Moreover, different scientific techniques add to the complexity of data. Properly describing the physical, chemical, and biological properties of compounds, proteins, cells, or any drug product requires special expertise and scientific training. The data generated from these techniques is equally demanding and complex to understand and interpret. 

There are many types of scientific data, including:

  1. Time-series data: Bioreactors report conditions and their change over time as time-series data. 
  2. Audit trails and logs: Audit trails and instrument logs are important, since they record what actually happened during the experiment, when it happened, and who performed it.
  3. Multi-dimensional numerical data: Very often there is multi-dimensional numerical data produced in instrument techniques, such as chromatography, mass spectrometry, flow cytometry, and X-ray crystallography.
  4. Long-term studies: High-throughput screening (HTS) and stability studies take place over weeks, months, or even years, and outcomes are often derived across the full dataset.
  5. Wells and cells: Microplate assays may generate data in numerous formats. For instance, each well can be measured to produce a single-value readout (e.g., absorbance). Cells in each well can be imaged for high-content analysis. Or fluorescence may be measured over time, as in quantitative polymerase chain reaction (qPCR) assays.
  6. Unstructured text: Scientific data often includes unstructured text, such as notes taken by scientists during the experiment, assay reports created by CROs/CDMOs, and notes recorded in the ELN. 
  7. Key-value pairs: Key-value pairs are common when describing the method or context of the scientific data. Such data may be highly unstructured and inconsistent between vendors and assays due to a lack of widely adopted schema, taxonomy, and ontology conventions (see the sketch after this list).
  8. Proprietary file formats: There are plenty of vendor-proprietary file formats, binary data blobs, and images that can only be read in the software in which the data was generated.
  9. Reports: In a more distributed and collaborative ecosystem, an increasing portion of reports and data are curated by humans. Formats and ontologies may vary widely depending on the intended recipient, including submissions to regulatory authorities, company reports, and external communication.
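To make the key-value problem in item 7 concrete, here is a minimal sketch of how two instruments might describe the same injection with different, vendor-specific keys and units. The field names are invented for illustration only.

```python
# Two vendors describing the same measurement with different keys and units (illustrative only)
vendor_a_metadata = {
    "SampleID": "INJ-0042",
    "InjVol_uL": 10,
    "ColTemp": "40 C",
}
vendor_b_metadata = {
    "sample_name": "INJ-0042",
    "injection_volume": {"value": 0.01, "unit": "mL"},
    "column_temperature": {"value": 313.15, "unit": "K"},
}

# Without a shared schema, even a simple question ("which injections used 10 uL?")
# requires custom parsing logic per vendor, per assay, and sometimes per scientist.
```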

Complex scientific workflow

Scientific workflows are also highly complex. Hundreds or thousands of individuals might participate in key tasks from discovery to production. These activities span all phases from discovery to commercialization, including many that conform to Good Laboratory Practice (GLP) and Good Manufacturing Practice (GMP) regulatory requirements.

The flow charts below generalize the workflows for small molecules and biologics. As these diagrams show, even the shortest workflow might include over 40 steps.* However, it’s important to note that these flowcharts don’t fully capture the interdependent network that supports these campaigns. That network comprises equipment and informatics applications; inventories of lab consumables; external CRO/CDMOs; supply chains; and the data systems that store, integrate, and contextualize the experiments so companies can focus on making decisions to advance treatments.

4D Process map for biologics therapeutic development. Source: National Institutes of Health (NIH).
4D Process map for small molecule therapeutic development. Source: National Institutes of Health (NIH).

*Can’t count them all? Here’s a sample small molecule workflow: Hypothesis generation, disease pathophysiology (molecular pathway, animal models, target biology), assay development, HTS (libraries, cheminformatics), hit prioritization, hit validation, lead optimization, in silico modeling, medicinal chemistry, process chemistry, safety screening, non-GMP manufacturing, early Chemistry Manufacturing and Controls (CMC) (Quality Control (QC)/ Quality Assurance (QA)), non-GLP in vitro/vivo toxicity, non-GLP pharmacology, cGMP manufacturing, GLP tox (Pharmacokinetics (PK) / Pharmacodynamics (PD)), long-term toxicity, Investigational New Drug (IND) filing, late CMC (formulation, stability studies, pharmaceutical tech), Institutional Review Board (IRB), recruitment, informed consent, clinical trials phase I, II, III; data review, translational pharmacology, New Drug Application (NDA) submission, approval, pricing strategy, insurance/access advocacy, standard of care, patient practice, phase IV (post-approval studies), reporting, new safety trials/observational cohort studies.

If these workflows weren’t complex enough, each step may be replicated, generating multiple datasets. Or a process might be performed by multiple teams, using different data standards or legacy instrumentation. Moreover, transitions from one step to another can introduce additional complexities, such as geographic or language barriers, skill gaps, procedural changes, and differences in documentation practices. 

With so many actors producing data—including biologists, modelers, data scientists, chemists, analysts, and pharmacologists—it’s easy to see how data silos arise and metadata inconsistencies multiply. These problems trap results forever in non-convertible PDFs, PowerPoint decks, and legacy databases.

In the complex workflow described above, data is produced and resides everywhere. Here’s just a sampling of where your scientific data might be locked:

  • On-premises Windows systems with local control software and outmoded operating systems
  • Processing/analysis software
  • A thick-client ELN, laboratory information management system (LIMS), scientific data management system (SDMS), or compound registration system
  • Methods defined in chromatography data system (CDS) software or internal repositories
  • PowerPoint, Excel, Word, PDF assay reports, and email attachments
  • Domain-specific software such as GraphPad Prism or PerkinElmer ChemDraw

Fragmented scientific vendor ecosystem

The complexity of scientific data and workflows is amplified when you overlay the fragmented instrument vendor ecosystem. Biopharma organizations use a large number and wide variety of instruments and related software applications. Each scientist might use numerous software systems, especially in R&D departments. And each of those software systems might produce a distinct data format. 

Take, for example, the test phase within the DMTA cycle. This phase could involve a range of new discovery technologies, such as DNA-encoded libraries, organoid screens, virtual libraries, and lab-on-a-chip. These technologies exponentially increase the numbers of samples tracked, properties predicted, and assay metadata associated with a given project. Each of these systems adds unique data formats into the ecosystem. 

To make matters even more complex, each type of instrument may be offered by numerous vendors, each with a proprietary data format. In other words, data formats can differ across various types of instruments, and even among instruments of the same type.

For example, there are over a dozen vendors and file types for a mass spectrometer:

Company | Extension | File type | Processing packages
Agilent, Bruker | .D (folder) | BAF / YEP / TDF data format | Agilent MassHunter, Chemstation, MaxQuant
Bruker | .YEP | instrument data format | —
Bruker | .BAF / .FID | instrument data format | —
Bruker | .TDF | timsTOF instrument data format | —
ABI, Sciex | .WIFF | instrument data format | Analyst, BioAnalyst
ABI, Sciex | .T2D | 4700 and 4800 file format | —
Waters | .PKL | peak list format | MassLynx
Waters | .RAW (folder) | — | Waters MassLynx
Thermo Fisher Scientific, PerkinElmer | .RAW | raw data | Xcalibur, TurboMass
Chromtech, Thermo Fisher Scientific, VG | .DAT | ITDS file format | MassLab, MAT95
Thermo Fisher Scientific | .MS | ITS40 instrument data format | —
Shimadzu | .QGD | GCMS solution format | LabSolutions Insight, Open Source Analytical
Shimadzu | .QGD | instrument data format | —
Shimadzu | .LCD | QQQ / QTOF instrument data format | —
Shimadzu | .SPC | library data format | —
Bruker | .SMS / .XMS | instrument data formats | ACD Labs, MS Workstation
ION-TOF | .ITM | raw data | SurfaceLab
ION-TOF | .ITA | analysis data | —
Physical Electronics, ULVAC-PHI | .RAW | raw data | Quantes
Physical Electronics, ULVAC-PHI | .TDC | spectrum data | —

File-type lists for more specific applications, such as proteomics (Mol Cell Proteom, 2012), would contain hundreds of entries. And note that this does not include the myriad processing packages, such as MaxQuant, Byos, and Virscidian, that often control highly specific workflows in “omics,” protein science, or high-throughput experimentation.
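To give a feel for what this fragmentation means in practice, the sketch below routes files to candidate vendors by extension, using a handful of the formats from the table above. It is illustrative only; real ingestion needs actual format converters, and extensions such as .RAW are shared by several vendors, so the extension alone is not enough.

```python
from pathlib import Path

# A tiny, incomplete routing table based on a few formats from the table above (illustrative only)
EXTENSION_TO_VENDORS = {
    ".wiff": ["ABI/Sciex"],
    ".baf": ["Bruker"],
    ".qgd": ["Shimadzu"],
    ".raw": ["Waters", "Thermo Fisher Scientific", "Physical Electronics/ULVAC-PHI"],
}


def candidate_vendors(path: str) -> list[str]:
    """Guess which vendors' software could have produced a file, by extension alone."""
    return EXTENSION_TO_VENDORS.get(Path(path).suffix.lower(), ["unknown"])


print(candidate_vendors("run_042.wiff"))  # ['ABI/Sciex']
print(candidate_vendors("run_042.raw"))   # ambiguous: several vendors share .RAW
```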

What makes fragmentation so much worse: vendor lock-in

In addition to the challenges created by a fragmented scientific vendor ecosystem, common vendor practices also complicate data workflows. Vendors often lack the motivation to be compatible with each other. In fact, they are usually incentivized to do the opposite—that is, trap their users within the vendor’s ecosystem. As a result, vendor lock-in has become prevalent across the industry.

Lock-in occurs when customers can’t use their scientific data with products outside a vendor’s suite of offerings. Although customers generate the data in their own labs with instruments and software they paid for, vendors often behave as if they have the right to dictate how customers use this data.

In this situation, vendors effectively hold customer data hostage. Scientific data generated by instruments and software applications can be difficult, perhaps impossible, to move outside the vendor’s walled garden. Common characteristics of this lock-in include:

  • Proprietary data formats: Typically, instruments generate data in proprietary formats (e.g., binary) that can only be used by software sold by the same vendor. This approach limits the ability to read and process data using third-party tools.
  • Lack of data export options: Vendors frequently do not provide a programmatic way to export data. There may be no software development kit (SDK), application programming interface (API), or documentation on how to export the data into non-proprietary formats or consume the scientific data programmatically. Oftentimes, the only option for scientists is to manually record data on paper, which rarely captures all the results and settings related to the data.
  • Restrictions on data extraction: In some cases, scientific data is contained in a file and cannot be extracted. Even if data is provided in clear text, the file content might not have a predefined structure, or it may vary significantly based on the instrument model and method settings.
  • Obstacles to external integrations: Instrument and lab informatics software vendors often don’t support integrations with external products, or only allow access to some but not all data. They may charge a fee (read: a toll) to export the data, or outright block integrations through technical means or draconian licensing terms that restrict API usage for certain purposes.
  • Prioritizing integrations within the vendor’s ecosystem: Even when vendors support external integrations, they usually deprioritize them in favor of their own products or those of affiliated companies. This bias can mean significant delays in fixing bugs, ensuring compatibility with new software versions, or leveraging new features in the software.

Vendor lock-in restricts how biopharma companies can use their own scientific data. It limits the freedom to use the best tools on the market, regardless of manufacturer or developer. It also creates data silos, where organizations struggle to know what data they have or how it was generated. Aggregated analysis across instruments, experiments, or studies becomes nearly impossible. And, importantly, scientific data held captive in proprietary formats or software can’t be engineered and used for advanced analytics and AI. The data hits a dead end; its value is severely capped. This all has a deep impact on the decision making, innovation, and competitiveness of a company. 

Unfortunately, there is little incentive for vendors to change. Instrument manufacturers and application vendors want to sell more hardware or licenses. If data becomes “stuck” in an instrument’s control software, it may limit a scientist’s ability to leverage popular data science tools such as Jupyter Notebooks or Spotfire, but it won’t necessarily affect—and could potentially bolster—the sales of a specific instrument or software.

In fact, as vendors continue to have success within particular applications, the strategy to bind scientists to a specific vendor's ecosystem has become an ingrained “way of working” across the industry. So, even as biopharma companies recognize the need to aggregate, standardize, and harmonize scientific data across all lab instruments, informatics applications, and analysis tools, data lock-in has remained a key goal for many scientific vendors. 

In sum, while instruments and their control software are often powerful scientific tools, they are not designed to be enterprise data systems. They are not optimized to interface with other systems or to enable data access. And sometimes, they are even developed to withhold these capabilities, essentially trapping biopharma organizations’ data in walled gardens.

Science is getting more distributed  

The biopharma industry has increasingly embraced outsourcing to contract organizations as a strategic approach to streamline operations, reduce costs, and accelerate drug development. This trend is underscored by the fact that most global biopharmaceutical companies nowadays work with 10 to 30 CRO/CDMO providers. 

The shift towards external collaboration presents significant challenges, particularly in managing and maintaining the integrity, traceability, completeness, and timeliness of scientific data. Contract organizations (CxOs) work with multiple biopharma companies and must track the data of each. Operating under tight deadlines, CxOs usually provide results in formats that are easy and fast to generate, such as Excel or PDF reports. They overwhelmingly use general-purpose tools like email or file-sharing platforms (e.g., Egnyte or Dropbox) to send the data. These technologies, while sufficient for basic data transfer, fall short of meeting the needs of biopharma scientists. The reasons include:

  • Metadata and information loss: Traditional communication channels fail to provide the full scientific context of data—such as how, when, and where it was collected, processed, and analyzed. Raw data and metadata are often omitted in data exchanges. Incomplete information often leads to incomplete insights. Furthermore, ambiguous or incomplete results may trigger the need for additional emails or meetings, causing delays.
  • Inconsistent data formats: Reports and spreadsheets can vary widely—between CxOs and even from the same provider—because the industry lacks standards. Much data resides in unstructured text. Biopharma scientists have to spend substantial time interpreting, reformatting, and consolidating the data. Without this effort, the data would remain siloed with limited value. 
  • Lack of sample lineage: CxOs rarely provide a complete history of sample handling and processing. Partial information about sample tracking, if available, may be buried in a report or spreadsheet. In addition, CxOs and biopharma companies may assign different unique identifiers to the same sample, making it harder to integrate data from the two parties.

Given the complexity of scientific data, simple tools for data exchange, like email, are woefully inadequate for CxO-biopharma collaboration. Continuing the old paradigm will lead to the following drawbacks.

  • Low efficiency and speed: The inefficiencies in data communication and management can slow down the drug development process, negating some of the intended benefits of outsourcing. The lack of real-time feedback hinders collaboration.
  • Uncertain data integrity and quality: Issues with metadata loss, sample lineage, and data siloing can compromise the quality and integrity of the scientific data, which is crucial for drug development and regulatory approval.
  • Higher costs: Inefficient data management can lead to increased costs due to data rework, delays in project timelines, and potential regulatory setbacks.

Cultural habits of scientists 

The typical scientific mindset is curious, logical, and driven by discovery. Scientists understand that proper experimentation requires patience. However, introducing business elements like tight deadlines and demands for assay robustness can influence their practices in the lab. Scientists often adopt habits in the name of expediency—sometimes at the expense of rigor—that affect how scientific data is handled, and documentation is often treated as a burden that adds no value.

  • Experiment first, record second: Scientists commonly conduct experiments before documenting them. The scientific process is iterative and often involves repeating and refining assays, whereas recordkeeping is more routine. Scientists may try to speed experiments by delaying documentation until a study is complete. However, this approach can compromise data quality (see the next point).
  • Documentation backlogs: Prioritizing experimentation often causes a backlog in documentation. As time passes, data and information accumulate and memories fade, which can lead to incomplete, unannotated, or low-quality data.
  • Transient storage: In many labs, data is temporarily stored in on-premises network folders, or even USB thumb drives, en route from an instrument to the ELN. These transient storage solutions create more silos within the organization, jeopardizing data accessibility and integrity.
  • Informal data recording: Even critical data, such as sample IDs and measurements, are often scribbled on Post-It notes, disposable gloves, or the sash of a fume hood. This informal recordkeeping risks data loss or misinterpretation, lacks contextual information, and falls short of good laboratory and scientific practice.
  • Data gatekeeping: Scientists may fail to record experimental work if they doubt its accuracy or relevance. Results from failed experiments, which can offer valuable insights, might be neglected.
  • Narrow documentation focus: When recording an experiment, scientists may omit comprehensive details about the experimental context, instrument parameters, or raw data, especially if they seem irrelevant to the immediate results. This short-sighted documentation diminishes the long-term value of the data.
  • Varied data formats: While routine tests may follow regimented standard operating procedures reflected in an ELN template, research and development work often relies on free-form spreadsheets and documents. As a result, data interpretability can vary widely across teams, departments, or organizations. Even a single individual can be inconsistent in their own documentation.
  • Inconsistent naming: Even diligent scientists may fail to conform to proper naming conventions. Consequently, the data becomes harder to retrieve, limiting its reuse and future utility.

Practices adopted as time-saving measures or workarounds pose significant challenges to accessing and using scientific data. While such issues are typically better controlled in regulated GxP environments, they remain common in many research-focused labs. Without the automation and incentives provided by comprehensive data management, scientists will naturally handle and organize scientific data with approaches that may diminish its impact and value.

Summary

Scientific data within biopharma companies likely contains some of the most valuable insights in the world. But this data is complex, resides in many forms, and is generated through equally complex scientific workflows. To make matters worse, it’s often fragmented and locked away in proprietary data formats and software, may originate from external sources, and is heavily influenced by the scientist who generates it. The complexity of both scientific data and the ecosystem that produces it significantly limits the effectiveness of analytical tools and AI, diminishing the quality of insights and lengthening the time it takes to reach them across biopharma organizations.

Blog
Tetra Benchling Connector, Round-trip Lab Data Automation and Contextualization of Instrument Data using Metadata from Benchling
Chromatography
IT
Integration
Contextualization

Automating manual processes is one of the key improvements the biopharma industry needs: it improves efficiency and throughput, reduces data collection errors, and makes it possible to build complete scientific datasets without human intervention.

The power of harnessing scientific data lies in pairing laboratory instrument data from analytical systems with contextual metadata recorded in electronic lab notebooks (ELNs). Contextualization is the process of using experimental metadata to assign context and establish findability by answering the following questions: what is the data, who will use it, and how will it be used? Historically, the context describing a test was isolated in an ELN, and manual processes were required to identify, locate, and pair raw instrument data with experimental metadata.

The Tetra Benchling Connector, and other Tetra ELN connectors, bring that context into the Tetra Data Platform (TDP) and simultaneously drive round-trip lab automation of scientific workflows to provide a foundation for analytics and AI.

The Cost of Manual Processes

Scientific data makes many stops on the journey from experimental ideation to reported results. Scientists invest great effort into routinely moving data between systems during this journey using manual processes or point-to-point integrations. Even in the world’s most innovative biopharma organizations, countless manual processes still underpin the development of next-generation therapeutics. One industry-leading customer reported that 500 of their scientists collectively spend 5,000 hours a week entering chromatography data into ELN and chromatography data system (CDS) software.

A recent study in Drug Discovery Today determined that, at current R&D efficiency, discovering a new drug costs $6.16 billion. Biopharma is faced with the challenge of evolving its digital technology stack to lower drug discovery costs while meeting patient needs. One opportunity is onboarding solutions that enable full round-trip lab automation, eliminating manual data entry across biopharma systems. Round-trip lab automation is a seamless process where an experiment created in an ELN triggers a workflow that automatically sends test orders to instruments and reports assay results back to the ELN. Scientists can then invest their time, energy, and resources in high-value, innovative work instead of manual, routine data entry tasks.

TetraScience’s new Tetra Benchling Connector, paired with the TDP and Tetra instrument Agents, enables round-trip lab data automation for biopharma. Additionally, these tools help TDP contextualize instrument data using metadata siloed in Benchling Notebooks.

A Round-trip Lab Data Automation Journey for Chromatography Data

Chromatography serves as a pivotal technique in biopharma for the purification, characterization, and quality control of therapeutics. Typically, these instruments are managed by a CDS, like Thermo Fisher Scientific’s Chromeleon. A CDS connects multiple instruments across laboratories and provides a controlled environment for chromatography data analysis.

A major challenge that biopharma scientists and IT staff face is creating a seamless flow of data from ELN to CDS and back again. In practice, scientists rely on manual processes to transcribe scientific metadata from an ELN into a CDS, and they are forced to manually report chromatography results back into the ELN. These workflows result in data loss: only the bare minimum of results is captured, while the remaining raw instrument data stays siloed. The fact that most CDS data is stored in proprietary binary formats further exacerbates these challenges.

Bidirectional data workflows using the Tetra Benchling Connector and Tetra Chromeleon Agent solve these challenges and bring round-trip lab automation into laboratory workflows. The following video will showcase these tools in action.

Benchling Notebook to ThermoFisher Chromeleon

The workflow begins when an operator initializes a Benchling assay run schema to start an experiment. An assay run schema defines the metadata tracked throughout an experiment’s execution and captures information like sample ID, experiment ID, and device ID. Next, the operator registers entities for injections and creates the Chromeleon sequence to select injections for the test. When the sequence is generated, the Tetra Benchling Connector is activated and transfers the sequence to TDP. Notably, the arrival of the sequence file activates a series of pipelines that contextualize the file with scientific metadata from the assay run schema and convert it into an intermediate data schema (IDS). TDP leverages the resulting labels to surface information for easy searchability. Additionally, the pipelines send the sequence order to the Chromeleon CDS. When the sequence arrives in Chromeleon, the scientist executes the injections and performs their regular peak integration analysis and result sign-off to complete the test.
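The forward path can be pictured roughly as the sketch below: label the incoming sequence file with metadata from the assay run, harmonize it, and queue the sequence order for the CDS. The tdp client object, its methods, and the pipeline names are hypothetical stand-ins; in the actual product these steps run as Tetra Data Platform pipelines and Agent transfers rather than a single script.

```python
def contextualize_and_forward(sequence_file: dict, assay_run: dict, tdp) -> None:
    """Rough sketch of the forward path from Benchling to Chromeleon.

    `tdp` is a hypothetical platform client used here only to show the order of
    operations; it is not an actual TetraScience SDK.
    """
    # Contextualization: attach ELN metadata to the instrument-bound file as labels
    labels = {
        "sample_id": assay_run["sample_id"],
        "experiment_id": assay_run["experiment_id"],
        "device_id": assay_run["device_id"],
    }
    tdp.add_labels(file_id=sequence_file["id"], labels=labels)

    # Harmonization: convert the raw sequence file into the vendor-agnostic IDS
    ids_doc = tdp.run_pipeline("raw-to-ids", file_id=sequence_file["id"])

    # Orchestration: hand the sequence order to the Chromeleon Agent for execution
    tdp.send_to_agent("chromeleon", payload=ids_doc)
```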

ThermoFisher Chromeleon to Benchling Notebook

Result sign-off triggers the reverse process, which sends peak data back into Benchling. A Tetra raw-to-IDS pipeline is activated when the results from Chromeleon arrive in TDP. Next, the IDS is transformed into a processed result file that conforms to the Benchling result schema. Result schemas formally define what measurement data will be displayed in a Benchling Notebook. Notably, each pipeline creates a new file that is fully contextualized with appropriate scientific metadata to increase result searchability. Finally, a pipeline sends the processed results to the Benchling Notebook, and the experiment is ready for formal review.
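At its core, the return trip is a field mapping from harmonized peak data to rows that match a Benchling result schema, roughly as sketched below. The field names on both sides are invented; the real result schema is defined in Benchling, and the mapping runs inside a Tetra pipeline.

```python
def ids_peaks_to_benchling_results(ids_doc: dict) -> list[dict]:
    """Map harmonized (IDS) peak records to rows shaped like a Benchling result schema.

    Field names on both sides are illustrative only, not the actual schemas.
    """
    results = []
    for peak in ids_doc.get("peaks", []):
        results.append(
            {
                "injection_id": ids_doc["injection"]["id"],
                "peak_name": peak.get("name"),
                "retention_time_min": peak["retention_time"]["value"],
                "area": peak["area"]["value"],
                "area_percent": peak.get("relative_area", {}).get("value"),
            }
        )
    return results
```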

The combination of the Tetra Benchling Connector and the Tetra Chromeleon Agent enables a fully automated lab data workflow in which the operator’s job is simplified to loading the instrument.

Contextualized Instrument Data

The immediate value of contextualizing raw instrument data is increased file searchability and connected information across workflows. A standard multi-tenant deployment of the Tetra Scientific Data and AI Cloud will contain multiple organizations. Each organization will leverage numerous Agents and Connectors to capture data and various pipelines to harmonize and transform data. Scientific metadata can be extracted from Benchling’s assay run schema into an IDS, and contextualization pipelines are then executed to label files. Notably, these contextualization pipelines can be deployed in a customized manner, tailored to each customer’s requirements. This flexibility is important because each customer has a unique implementation of their ELN and CDS environments.

Labels extracted from the assay run schema are then used to search for data by various conditions. In this use case, the Benchling entity label for injection ID was added to the Chromeleon-generated files. Performing a label-based search for “INJ009” will return all files (e.g., Benchling assay run IDS, Chromeleon IDS, Peaks.csv, and Benchling result schema) associated with that particular label.
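Conceptually, a label-based search is just a filter over file metadata that would otherwise be scattered across systems. The sketch below shows the idea with a toy in-memory list; the file names and label keys are invented for illustration.

```python
# Illustrative only: a label-based search is conceptually a filter over file metadata
files = [
    {"name": "assay_run.ids.json", "labels": {"injection_id": "INJ009"}},
    {"name": "sequence.ids.json", "labels": {"injection_id": "INJ009"}},
    {"name": "Peaks.csv", "labels": {"injection_id": "INJ009"}},
    {"name": "unrelated_run.ids.json", "labels": {"injection_id": "INJ010"}},
]

matches = [f for f in files if f["labels"].get("injection_id") == "INJ009"]
print([f["name"] for f in matches])
# ['assay_run.ids.json', 'sequence.ids.json', 'Peaks.csv']
```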

The use case above highlights a simple value realization of contextualized instrument data. Bidirectional integration expands ELN notebook capabilities to include hard-to-surface data points like instrument ID or column ID. Instrument and column identifiers are recorded in CDS systems for each injection performed, but manually transcribing that information from CDS to ELN is a tedious and error-prone task. However, the Tetra Chromeleon IDS readily captures and surfaces that information through IDS harmonization and file labeling. Bidirectional integrations ensure that this information can be captured in an ELN notebook too. A list of select contextual data use cases is provided in Table 1. Future articles will dive deeper into contextualized instrument data analytics and AI applications.

Table 1. Contextual Data Use Cases

Lab Data Automation and Foundation for Analytics/AI

Round-trip lab automation provides immediate data replatforming (layer 1) value to biopharma organizations, along with the unquestionable business value of reallocating scientists from data entry tasks to strategic work. Contextualization helps advance biopharma into layer 2, data engineering, where data science teams now have the appropriate metadata to search and query scientific data across their organization and begin designing custom datasets. Together, round-trip lab automation and contextualization improve overall operating efficiency and elevate biopharma into the data analytics (layer 3) and AI (layer 4) layers of the data journey pyramid.

Data science teams are already developing AI applications to automate chromatography data analysis tasks like peak detection and peak integration (Anal. Chem. 2020, J. Chromatogr. A. 2022, Bioinformatics 2022). The challenge these teams face is fine-tuning peak detection models using their own custom chromatography data sets. Building a custom data set for model development often requires a scientist to navigate through a CDS to locate the appropriate data and export it using a copy-and-paste procedure. Additionally, the scientist must navigate their ELN to ensure the appropriate metadata is included in the training set. This approach often forces data science teams to settle for the initial data set created for training, preventing them from building an optimal training set. If the data science team wants to continuously improve its model using a human-in-the-loop mechanism, it must repeat this manual process on a recurring basis.

One top-25 biopharma customer has developed their own chromatography peak detection model with human-in-the-loop functionality for a reversed-phase ion-pairing (RP-IP) assay. The human-in-the-loop component is triggered when model-detected peaks fail to pass quality thresholds and scientists are required to perform manual peak detection and integration in the CDS. A new ELN entry is created to collect the manually integrated peak data from failed analyses, and the data is added to a training set to further refine model performance. Notably, the scientists are also required to export custom CDS files to capture the raw chromatogram and peak data for model training purposes.

The Tetra Scientific Data and AI Cloud can drastically improve this manual process by providing round-trip lab automation and contextualization. An experiment failure automatically activates the human-in-the-loop process described above: TDP automatically creates re-annotation worklists and captures and contextualizes scientist-integrated chromatogram data. A combination of file versioning and appropriate ELN contextualization data can be used to differentiate model integrations from scientist integrations, and a series of pipelines can be created that automatically add scientist integration data to the appropriate training data sets. This capability transforms a model retraining strategy from a critical-mass approach to an active-learning approach. A critical-mass strategy is when a modeling team waits until it has collected enough data to justify retraining efforts. Active learning is a modeling technique that introduces new training data into a model in near real time to continuously improve the model’s predictive capability and domain. Round-trip lab automation is estimated to save the customer approximately $5 million for one assay by eliminating manual data transfers between Benchling and Chromeleon. But the value proposition extends well beyond the reduction of operating expense once you capitalize on the Genius of “And.”
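A minimal sketch of that active-learning loop is shown below. The model, quality check, and manual-integration step are placeholder stubs for work that, in practice, happens in the customer's model code, the CDS, and TDP pipelines.

```python
import random


class PeakModel:
    """Stand-in for a customer-trained peak detection model (illustrative only)."""

    def detect_peaks(self, chromatogram):
        confidence = random.random()  # pretend QC/confidence score
        return ["peak_1", "peak_2"], confidence

    def update(self, example):
        pass  # placeholder for incremental retraining on the new example


def request_manual_integration(chromatogram):
    """Placeholder for the scientist's manual peak integration in the CDS."""
    return ["peak_1", "peak_2", "peak_3"]


def process_injection(chromatogram, model, training_set, qc_threshold=0.9):
    # Human-in-the-loop, active-learning flow: a QC failure routes the chromatogram
    # to a scientist, and the corrected integration flows straight back into training.
    peaks, confidence = model.detect_peaks(chromatogram)
    if confidence >= qc_threshold:
        return peaks  # model integration accepted
    manual_peaks = request_manual_integration(chromatogram)
    training_set.append((chromatogram, manual_peaks))  # contextualized example captured automatically
    model.update(training_set[-1])  # retrain incrementally instead of waiting for a "critical mass"
    return manual_peaks


training_set = []
model = PeakModel()
process_injection("chromatogram_001", model, training_set)
```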

The Tetra Scientific Data and AI Cloud is helping the pharmaceutical industry accelerate scientific discovery and development with harmonized, contextualized, and AI-ready scientific data in the cloud. Reach out to TetraScience scientific data architects and scientific business analysts to learn more about how your organization can leverage the Tetra Scientific Data and AI Cloud and the newly released bidirectional Tetra Benchling Connector to support Notebook-to-Notebook integrations.

Blog
Simplifying data journey monitoring with easy-to-use dashboard tools
IT
Integration
Quality & Compliance

The Tetra Scientific Data and AI Cloud helps biopharma enterprises reach industrial-scale data transformation. The volume of data in these systems is staggering. One global biopharma standardized and onboarded 17 million files, from 107 pipelines, across 13 departments, in just 9 months with TetraScience. Tracking where this data is coming from and where it’s going can be a massive challenge. That’s why our team is excited to publish a new set of file lineage tools called File Journey, as part of our Tetra Data Platform v3.6 release.

Confirm your data’s status with a few clicks

File Journey is a suite of monitoring tools that help platform administrators ensure every file processing step is being performed correctly throughout the data journey—from creation within an instrument, through file processing within the Tetra Data Platform (TDP). Admins can now easily perform quality control on thousands of files and pipelines with a few clicks. Additionally, the simple dashboard interface helps make proving traceability, accountability, and data integrity a quick and painless process.

How to find your file info with File Journey

The File Journey can be accessed within TDP through the “Health Monitoring” button in the admin navigation.

Health Monitoring dashboard access

Within the Health Monitoring dashboard, users can view file ingestion details under the “Integration Events” tab. There, they can find descriptions of where the file is located, whether the file was scanned, whether the upload has started, and whether the upload has been completed.

Integration Events tab

Once a file has been uploaded into TDP, admins can see file details under the “File Info” tab. This includes when the file was uploaded, the source type and name, file size and category, etc. Additionally, users can see what pipelines were triggered by the file’s ingestion. This combination of monitoring tools allows admins to oversee and track huge numbers of files, confirm their status and location quickly, and verify that the correct data engineering processes have been triggered.

File Info tab

The File Journey also contains an “Event Timeline” tab for each file. These intuitive charts automatically display every file action in chronological order, providing a clear and simple audit trail for every file, throughout its lifecycle, across the entire enterprise.

Events Timeline tab

Building trust (via verification)

The Tetra Data Platform helps admins responsible for industrial-level data transformations build trust with internal stakeholders and simplify compliance processes by providing easy-to-use tools that ensure each step of the file path is executed appropriately, documented, and attributed. If you’d like to see how the File Journey can help your team, contact one of our experts today.

If you’re interested in other new features released with Tetra Data Platform v3.6, you can read about our self-service pipelines and our multi-byte character support.

Need help with your File Journey?

TetraU provides in-depth training for the Tetra Scientific Data and AI Cloud, including our new TDP v3.6 features. Check out TetraU for workshops and training content developed by the data and life science experts at TetraScience.

Blog
Accelerate Tetra Data availability as external tables in Snowflake
IT
Contextualization

The Tetra Scientific Data Cloud has adopted the modern data lakehouse architecture. It brings together the ACID-compliant (atomicity, consistency, isolation, and durability) transactions offered by data warehouses and the large-scale object storage capability for structured, semi-structured, and unstructured data offered by the data lake architecture. The data lakehouse architecture also offers advanced functionality such as schema evolution, customized partitioning, sophisticated compression techniques, metadata co-resident with the data, and features such as time travel. These advantages ultimately contribute to reducing storage needs, transferring less data over the network, and loading only required data into memory for processing. As a result, users experience expedited query execution times.

Tetra Scientific Data Cloud: Platform

By adopting an open table format, the Tetra Scientific Data Cloud can be leveraged in workflow automation, AI, ML, and BI workloads with high query performance. The platform eliminates the requirement to physically relocate organizations’ engineered scientific data (Tetra Data). Tetra Data is presented as a data model within third-party data warehouses such as Snowflake and AWS Redshift, while all physical Tetra Data remains stored within the Tetra Scientific Data Cloud. 

The Snowflake platform, with its ability to scale object storage and compute layers independently, supports multiple workloads, such as data warehousing, BI, and ML, as well as custom applications and data sharing. Availability in all major public clouds allows Snowflake to make all of an organization’s data available, either by centralizing data in Snowflake storage or, alternatively, by exposing it as external tables when governance and other workload constraints mandate such an approach. Combining data from a multitude of instruments with other corporate data helps biopharmas accelerate drug development cycles through both internal and external collaboration.

Analyzing query performance 

One of the primary and obvious reasons for moving data from the Tetra Scientific Data Cloud to Snowflake is to enhance query speed. The graphic below displays benchmark tests conducted by the TetraScience product development team, comparing query speeds between Tetra Data copied into Snowflake (SF-Copy) and Tetra Data exposed as external tables in Snowflake (SF-Delta).

Query execution times: SF-Copy vs. SF-Delta

Tetra Data exposed as external tables experiences marginally longer query execution times, due to network-related effects, when compared to tables stored internally in Snowflake. As depicted above, query performance is better for most SQL operations when data is copied into Snowflake.

The query execution times for Tetra Data exposed through external tables can be improved by optimizing individual data file sizes, employing partition columns and keys for efficient table scanning, and utilizing Snowflake’s Results Caching feature for repetitive queries. The overall read efficiencies offered by the data lakehouse architecture result in external tables providing acceptable query times, negating the necessity to move and administer the data in Snowflake storage.
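As a rough sketch of what this looks like from the Snowflake side, the snippet below queries Tetra Data exposed as an external table and filters on a partition column so Snowflake can prune files instead of scanning everything. The connection parameters, table, and column names are hypothetical; the actual external table definitions are created as part of the integration.

```python
import snowflake.connector  # pip install snowflake-connector-python

# Connection parameters are placeholders; use your own account, warehouse, and credentials.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="...",
    warehouse="ANALYTICS_WH",
    database="TETRA",
    schema="TETRA_DATA",
)

# Hypothetical external table over Tetra Data; filtering on a partition column
# (here, an ingestion date) limits how much of the table Snowflake has to scan.
QUERY = """
SELECT sample_id, peak_name, area
FROM CHROMATOGRAPHY_PEAKS_EXT
WHERE ingest_date >= '2024-01-01'
"""

cur = conn.cursor()
for sample_id, peak_name, area in cur.execute(QUERY):
    print(sample_id, peak_name, area)
cur.close()
conn.close()
```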

There are reasons beyond just query execution times that one needs to consider when making Tetra Data available through Snowflake. Exposing Tetra Data hosted in the data lakehouse as external tables provides the following advantages over physically copying the same into Snowflake:

  • Reduced storage and ELT costs
  • Reduced data redundancy
  • Simplified update management and evolution of schemas
  • Access to real-time data
  • Ease of data sharing
  • Consistent data governance and compliance 
  • Support for data formats and specialized applications

Advantages of Snowflake external tables vs. Snowflake copied tables

This section delves deeper into the above-mentioned key advantages of offering Tetra Data as external tables within the Snowflake environment rather than physically relocating (i.e., copying) the data to Snowflake. The graphic below illustrates the flow of Tetra Data from the Tetra Scientific Data Cloud to downstream Snowflake accounts.

Tetra Scientific Data Cloud–Snowflake external tables: data flow 

Reduced storage and ELT costs

Using Tetra Data as external tables in Snowflake allows data to reside in the Tetra Scientific Data Cloud without physically transferring it to Snowflake storage. This can be a more cost-effective approach, especially with large volumes of data, as users incur expenses only for storage in the source system, not for additional Snowflake storage.

For example, instruments using the latest next-generation sequencing (NGS) techniques produce hundreds of terabytes of data. Replicating this data across multiple storage systems and managing ELT pipelines for data migration is costly. The use of Tetra Data as external tables simplifies the data stack, eliminating the need for additional investment and consumption costs associated with ELT applications.

Reduced data redundancy

The Tetra Scientific Data Cloud allows bidirectional communication for writebacks from various scientific tools like electronic lab notebooks (ELNs) and third-party applications through SQL and REST API interfaces. The diverse nature of analyses, workflow integration, and scientific use cases necessitates deep integration of the Tetra Scientific Data Cloud with other applications in the early drug development and manufacturing lifecycle. 

Instead of duplicating data in Snowflake, leveraging Tetra Data as external tables in Snowflake allows querying and processing data directly from its original location. This reduces the need for data replication, minimizing redundancy and ensuring data consistency and integrity across systems. Large organizations typically use multiple compute environments (e.g., Databricks, AWS Redshift), and duplicating Tetra Data across each of these systems would require significant resources and costs.

Simplified update management and evolution of schemas

As the instrument landscape evolves, adapting new versions of Tetra Data schemas for data harmonization and other scientific workflows is necessary. Large life science organizations deal with thousands of scientific instruments and systems, each with its own evolving schema. Even though Snowflake supports schema-on-read for semi-structured data, manually creating and adjusting these schemas in the Snowflake environment is resource-intensive and an unnecessary duplication exercise.

Tetra Data available as external tables eliminates the need to manually adapt and change schemas in Snowflake. This aids scaling, simplifies the management of updates, and minimizes disruptions in downstream analytics, ML, and integrated scientific workflow implementations.

Access to real-time data

Using Tetra Data as external tables facilitates real-time or near-real-time access to data as it’s updated in the Tetra Scientific Data Cloud. There’s no delay caused by data loading processes, ensuring the most current data is available for analysis. The Tetra Scientific Data Cloud can host real-time data from IoT agents and KEPServerEX agents, moving data to storage for downstream enrichment and harmonization upon arrival. Moving rapidly updating Tetra Data to Snowflake storage would necessitate using Snowpipe, incurring extra costs and pipeline maintenance. Leveraging Tetra Data through external tables eliminates the need for this.

Ease of data sharing

Using Tetra Data through Snowflake’s external tables simplifies data sharing across multiple Snowflake accounts or with external partners, as the data remains in its original location. Tetra Data as external tables circumvents the need for exporting and importing data, enhancing collaboration efficiency. Complex ELT and Snowpipe pipelines do not need to be maintained, and all Tetra Data available in the Tetra Scientific Data Cloud can be immediately shared.

Different domains within an organization may have varying levels of autonomy in selecting data product stacks. By retaining Tetra Data (the engineered scientific data) in the Tetra Scientific Data Cloud as a single source of truth, all domains in the organization share a common denominator between teams. These autonomous domains can then implement data products with lineage, ownership, federated governance, and self-service capabilities for scientific analysis.

Consistent data governance and compliance

Making Tetra Data available as external tables simplifies data governance by offering a single source of truth. It reduces the complexity of managing data in multiple locations and systems, thereby improving governance and compliance efforts. The Tetra Scientific Data Cloud is an industry vertical–focused platform, offering the necessary depth of adherence to industry-specific compliance requirements such as 21 CFR Part 11, GxP, ALCOA principles, GAMP 5, and Annex 11. With numerous departments and functions dependent on the same scientific data, maintaining a single version of the truth is crucial for data traceability and GxP compliance. For instance, the FDA, through its drug inspection program, has pursued official or voluntary action against organizations for data-related breaches more than 900 times in the last decade alone, underscoring the importance of data integrity and access controls.

The Tetra Scientific Data Cloud stores all incoming instrument data in its raw format, along with changes applied through data transformation workflows and archival processes, as mandated by the FDA in its electronic records requirements. The Tetra Scientific Data Cloud also supports the verification and validation (V+V) process, a key requirement mandated by the FDA. Moving data into Snowflake would detach all the compliance-related controls on Tetra Data and would require significant reimplementation and revalidation effort.

Support for data formats and specialized applications

Making Tetra Data available as external tables supports various file formats and data sources, enabling organizations to work with diverse datasets and unstructured data like images. Users can access both JSON file formats and data stored in the data lakehouse using SQL and a REST API. 

Unlike Snowflake, the Tetra Scientific Data Cloud not only supports SQL passthrough but also allows bidirectional orchestration, which is leveraged by Tetra Data Apps to trigger pipelines and data transfer of various file types, like WSP (workspace) files from the FlowJo application and even microscopy image files like ZEISS .CZI. Moving such unstructured data to Snowflake and analyzing it there (via user-defined functions [UDFs]) isn’t straightforward, as Snowflake primarily supports structured and semi-structured data.

Summary

The integration of Tetra Data into Snowflake using a modern data lakehouse architecture enables users to access Tetra Data in Snowflake as external tables, eliminating the need for physical data movement. Employing this approach helps ensure improved query performance, reduced storage costs, minimized data redundancy, simplified schema evolution, and real-time data access without compromising data integrity or compliance.

By retaining Tetra Data in its original location and accessing it through Snowflake’s external tables, this integration streamlines data sharing and supports distributed processing. It maintains a single source of truth and enhances governance efforts. Moreover, this approach addresses security concerns, compliance challenges, and data sovereignty issues. 

Blog
Proprietary data formats are killing your AI initiative
IT
AI & Data Science
Harmonization
Data Analytics
Scientific AI

Legacy data architectures and a do-it-yourself approach to data management present key obstacles to achieving your AI goals and improving scientific outcomes. But unfortunately, they’re not the only impediments. A lack of harmonization among data formats and metadata can grind your AI journey to a complete halt. 

Your organization might have hundreds or thousands of scientific instruments and applications. And they produce data, often unstructured, in vendor-specific proprietary formats, with unique metadata terms and organizational schemes. 

These formats have limited accessibility outside of each vendor’s walled garden. For example, you might be unable to access data produced by a chromatographic instrument from within an application from another vendor without first doing manual data transformation. Want to analyze or visualize experimental results from multiple instruments? You’ll need to harmonize data so that it’s all accessible from within the analytics or visualization software of your choice. 

Preparing the large-scale datasets for AI applications is a much bigger task. But until you can harmonize all that data, you won’t be able to take advantage of the full potential of AI.

Adopting a single, open, and vendor-agnostic format for scientific data could go a long way toward reducing this barrier in the AI journey. So, why hasn’t the life science industry successfully established a standard? And how does metadata complicate the work of harmonizing data from various instruments and applications? 

Attempts at standards

There have actually been a number of efforts to create data standards in the life science industry. But none of these standards has been widely adopted as of today. For example: 

  • The Standardization in Lab Automation (SiLA) consortium is working to standardize software interfaces, specifically for working with robotic automation. But the consortium is not providing a true solution for harmonizing the actual data for other uses.

Moreover, individual vendors have attempted to establish their own “standards,” which are really just proprietary formats disguised as standards. They are clearly not open or vendor-agnostic formats.

In parallel, an international consortium of scientists and organizations introduced FAIR data principles to optimize reuse of data. FAIR data is findable, accessible, interoperable, and reusable. These principles provide a strong guide for enhancing data reuse, but they are still only principles. They do not establish a single industry-standard format.

Standards inertia

There are several reasons why standardization efforts have faltered or failed. And those same reasons have restrained interest in launching new standards projects. Overall, there are few incentives for life science technology companies to establish new standards, adopt standards created by other organizations, or endorse open, vendor-agnostic data formats:

  • The quest for competitive advantage: Vendors develop proprietary data formats that support their particular capabilities—capabilities that provide them with competitive advantages. Vendors typically do not want to adopt or help establish open, industry-wide standards that might in any way diminish the benefits of their unique, differentiating capabilities.
  • Legacy designs: Much of the instrument control and data acquisition software available today was initially built on legacy technology and meant to be installed on lab PCs. Each application was designed to give scientists all the functionality to complete their work without leaving that application. Vendors have no motivation to develop an open, vendor-agnostic data format or push for standards.
  • Customer retention: For instrument and application companies, vendor lock-in is generally a good thing. Maintaining proprietary data formats helps them retain their existing customers. Vendors want to keep customers within their walled gardens, binding customers to their particular ecosystem.
  • Complexity: Of course, it’s not just that vendors want to preserve their investments and retain their customers. Developing industry-wide standards is also hard work. Different instruments and modalities produce different bits of information—it is difficult to find one size (that is, one data format) to fit all. The efforts of consortia to create standards have shown how challenging it is to achieve compromise among multiple parties.

If the industry could establish a standard data format, biopharma organizations would still need to get their data to conform to that standard, which would not be an easy task. But without a standard, biopharma companies are stuck with multiple data silos. Data remains trapped in vendor-specific data formats within on-premises environments. To use that data with AI applications, an organization would need to invest significant time and money in harmonization—or find a commercial solution purpose-built to solve standardization challenges.

Metadata harmonization

Harmonizing data formats is essential for enabling data liquidity and producing the large-scale datasets needed for AI. But harmonization can’t stop with the data format—it must also be applied to the taxonomies and ontologies in the metadata. 

A data taxonomy defines elements and their organization. A data ontology describes the relationships among elements. Together, the taxonomy and ontology form a vocabulary that captures key scientific information about samples, materials, equipment, results, and more. This vocabulary is vital for finding, interpreting, analyzing, and assembling data—and using the right data for AI algorithms.

Instruments, applications, and users often use their own terminology and add unique contextual labels and other metadata. As your lab collects data from more and more sources, you are likely to be left with multiple terms to describe the same data. You might even see this inconsistency if you were to compare metadata from different departments, sites, or regions within your organization—or even between scientists.
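
To make the idea concrete, here is a minimal sketch, in Python with entirely hypothetical field names, of what term-level harmonization looks like: two sources describe the same run with different labels, and a shared vocabulary maps both onto common fields so the records can be queried together.

```python
# Minimal sketch with hypothetical field names: map vendor-specific metadata
# terms onto one shared vocabulary so records from different sources can be
# queried and aggregated on the same fields.

VENDOR_TERM_MAP = {
    "vendor_a": {"smpl_id": "sample_id", "operator": "user_id", "col_temp_c": "column_temperature_c"},
    "vendor_b": {"SampleName": "sample_id", "AcquiredBy": "user_id", "OvenTemp": "column_temperature_c"},
}

def harmonize(record: dict, source: str) -> dict:
    """Rename known vendor-specific keys to the shared vocabulary; pass through the rest."""
    mapping = VENDOR_TERM_MAP[source]
    return {mapping.get(key, key): value for key, value in record.items()}

raw_a = {"smpl_id": "LOT-042", "operator": "jdoe", "col_temp_c": 35.0}
raw_b = {"SampleName": "LOT-042", "AcquiredBy": "asmith", "OvenTemp": 35.0}

print(harmonize(raw_a, "vendor_a"))  # {'sample_id': 'LOT-042', 'user_id': 'jdoe', 'column_temperature_c': 35.0}
print(harmonize(raw_b, "vendor_b"))  # {'sample_id': 'LOT-042', 'user_id': 'asmith', 'column_temperature_c': 35.0}
```

A real harmonization effort goes far beyond key renaming, since it must also reconcile units, controlled vocabularies, and the relationships an ontology defines, but even this toy mapping shows why a shared vocabulary is a prerequisite for querying across sources.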

Unless you harmonize taxonomies and ontologies, it will be difficult for users to find, compare, and reuse data from all of your different sources—and it will be impossible to assemble that data into the large-scale datasets needed for advanced analytics and AI.

This metadata harmonization work must be an ongoing effort because taxonomies and ontologies can evolve. For example, the Allotrope Taxonomies and Ontologies included with the Allotrope Data Format were initially based on existing vocabularies but then grew to several thousand terms and properties as more companies began using the data format. Metadata flexibility is helpful as organizations refine their workflows and use cases over time—but these evolving taxonomies and ontologies must be maintained.

Understanding the Scientific AI gap

There’s no question that biopharma executives have ambitious goals for using AI to accelerate delivery of breakthrough therapeutics, streamline manufacturing, improve quality control, and more. They are investing heavily in AI initiatives and expect results fast. 

Unfortunately, most companies are still far away from meeting AI objectives. Along with legacy architectures and a DIY approach to data integration, a diversity of data formats and metadata vocabularies presents formidable obstacles on the AI journey. Organizations need to dramatically simplify data and metadata harmonization efforts before they can start realizing the tremendous potential of AI in science.

Learn more about the AI gap in the white paper, "The Scientific AI Gap."

Blog
Legacy data architectures are slowing your AI journey
IT
AI & Data Science
Integration
Contextualization
Harmonization

What is keeping your company from capitalizing on the tremendous potential of artificial intelligence (AI) today? Your legacy data architecture is a major stumbling block.

Biopharma leaders understand that employing AI and machine learning (ML) applications can drive innovation and help transform nearly every aspect of operations—from research and development to manufacturing and quality control (QC). Biopharmas can use AI to deliver better scientific outcomes, faster, while driving down the costs of finding and producing therapeutics. Given those benefits, it’s not surprising that many companies are already investing heavily in AI initiatives.

But the reality is that many companies are still years away from realizing the benefits of AI in science. Legacy data architectures—along with siloed storage environments—are significant obstacles to moving forward with AI initiatives. Until your company can implement modern, cloud-based data architectures and free your data from silos, you won’t be able to use your data to generate new insights with AI.

The problems with silos

Many biopharma organizations continue to store data in multiple isolated repositories, ranging from local hard drives and workstations to file shares and tape archives. These silos represent a decades-old approach to scientific data storage that is incompatible with advanced analytics and AI applications.

What’s wrong with silos? Moving and sharing siloed data is extremely cumbersome. In the worst case, a scientific workflow might still require using a “sneakernet”—walking a drive from one system to another, and then manually copying files onto a network file drive or an unstructured document storage environment such as Box or Egnyte.

Whenever data is distributed across the organization and maintained in legacy systems, processes are slow, collaboration is challenging, and capitalizing on AI is impossible. AI and ML algorithms require large-scale datasets associated with labeled outputs. To ensure a proper outcome, data scientists first need to understand the current data. Then they must evaluate different modeling techniques, and train and test different approaches on appropriate labeled datasets. Trying to develop and run algorithms when some of your data is stuck on disparate file shares is a Sisyphean endeavor.

Siloed environments also present serious risks. Organizations cannot implement adequate data protection and backup for every individual silo, placing that data at risk of loss. Moreover, the need to manually transfer data from one silo to another—such as an electronic lab notebook (ELN) or laboratory information management system (LIMS)—introduces the possibility of data entry errors.

To make matters worse, siloed data often remains in the proprietary formats created by scientific instrument or application vendors. Those proprietary formats lock you into small vendor ecosystems—walled gardens with limited applications where vendors hold your data hostage. You cannot use that data in other applications or build predictive algorithms with it.

Before you can visualize, analyze, or use that data in AI/ML applications, you would need to prepare it for exploratory data analysis by a data scientist, then train and evaluate models. After the algorithm is developed, running an AI workload in production requires data that has been transformed into a standardized format. That format must also harmonize metadata taxonomies (definitions of data elements and structures) and ontologies (descriptions of relationships among data elements).

Data stuck in silos, locked in proprietary formats, remains static. It does not have the liquidity needed to streamline collaborative scientific work or tap into the potential of AI applications. 

Why legacy architectures can slow your journey

Some biopharma organizations have attempted to centralize data using a scientific data management system (SDMS). Traditional SDMSs were designed to store and archive data for regulatory compliance, not to prepare data for AI applications. 

They might be adequate for collecting instrument and application data; cataloging data by adding some metadata; and archiving data in a compliant manner. But most traditional SDMSs have serious limitations for supporting AI initiatives.

Inflexible data flow: Traditional SDMSs have few options for data flow and processing. For example, they might be unable to send data to multiple destinations. If they can’t provide the flexible data liquidity required by biopharma teams, they become a data graveyard.  

Little data engineering: SDMSs are designed to store data but not transform it. Traditional SDMSs don’t attempt to engineer data for scientific use cases. They don’t produce data in a standardized, harmonized, future-proofed format that is engineered specifically for data science, analytics, or AI.

Poor discoverability: SDMSs might add metadata to files, but because they don’t typically harmonize metadata taxonomies and ontologies, they can make it difficult for scientists to discover new or historical datasets. Data is searchable and consumable only if someone knows precisely what terms or labels to query. In many cases, lab scientists end up re-running an assay or an experiment because that’s easier than finding historical data.

Inflexible accessibility: SDMSs are certainly several steps above thumb drives. But they are still largely closed, siloed data repositories. Traditional SDMSs require users to access data only through the SDMS interface, not through their usual interfaces and applications, such as ELNs, analytics tools, or AI applications.

Lack of scalability: On-premises SDMSs cannot be scaled easily or cost effectively: Each upgrade requires multiple changes, including upgrades for the database, servers, and file storage. If SDMSs employ cloud services at all, they often use the cloud as another data center. Consequently, SDMSs are not the best environment for assembling the large-scale datasets required for AI.

SDMSs simply aren’t designed to prepare data for AI. Some SDMS vendors might tack on capabilities to address deficiencies. But in general these legacy solutions cannot provide sufficient data liquidity, allow adequate searchability, enable data accessibility, or efficiently scale up to support the massive volumes of AI-native data needed for AI algorithms.

(Read more about why a traditional SDMS is not an option anymore.)

Before you can close the AI gap

When it comes to AI, there is a large gap between the goals of biopharma executives and the reality in labs. Until biopharma organizations can address key data obstacles, they will be unable to realize the benefits that AI can deliver for science. Retaining a legacy data architecture in the form of a traditional SDMS and leaving data in siloed environments prevent organizations from producing the open, vendor-agnostic, purpose-engineered, liquid, large-scale data that they need for AI applications.

Unfortunately, legacy architectures and data silos are not the only obstacles. A do-it-yourself (DIY) approach to data management plus a lack of data standardization and harmonization can also slow your progress. Sufficiently addressing all of these obstacles will be necessary before you can accelerate your AI journey and improve scientific outcomes.

Learn more about the AI gap in the white paper, “The Scientific AI Gap.”

Blog
Biopharma’s DIY problem: 5 reasons why DIY for AI fails
IT
AI & Data Science
Integration
Harmonization
Scientific AI

The great AI race is here. In April of 2023, Microsoft, Google, Meta, and Amazon mentioned “AI” 168 times combined during their earnings calls. By July, Microsoft alone mentioned AI 175 times.

Despite its reputation as a late adopter, biopharma is responding. CEOs are banging the AI drum on earnings calls and on MSNBC (more on data42 later), and some have even formed data-sharing coalitions to leverage AI capabilities.

From within the maelstrom of hype, it may seem like the only limitation on AI is our imagination. Organizations simply have to pluck AI experts from leading universities—find and purchase best-in-class AI technology—fund the right projects—and voilà, life-altering AI capabilities.

However, as organizations are discovering over, and over, and over again, bridging the gap between current capabilities and AI in biopharma comes down to how they manage their most precious asset—their scientific data. More specifically, AI models require large-scale, liquid, compliant, and engineered data to function. Many organizations see this challenge and roll up their sleeves. While admirable, this DIY mentality leads to lasting competitive disadvantages. Allow me to explain.

What’s the deal with DIY?

Many biopharmas use in-house resources rather than commercial solutions to solve their data challenges—the most glaring being the need to prepare data for advanced analytics and AI applications. These DIY approaches tend to share characteristics across the industry. At the bottom layer of the data pyramid, the data infrastructures are built from point-to-point integrations for specific instruments and software applications. Contextualization and harmonization are less common but can be found within individual workflows.

DIY’s long-tail surprise

Taking a DIY approach to managing scientific data seems to make intuitive sense. At least initially. Budgets look agreeable, organizations have some internal expertise, and there is a perceived security that comes with control.

But as months and years go by, biopharma companies that have adopted a DIY approach find that establishing, maintaining, and updating data workflows is far more resource- and cost-intensive than expected. The brittle, point-to-point integrations are tedious to build and unable to scale. Maintaining and updating software and integrating new instruments are extremely time-consuming and resource-intensive tasks. Meanwhile, validation and documentation require intense effort.

What’s more, as delays occur, there is no one to “blame.” Even if organizations hire the most brilliant data specialists on earth, data management and data transformation are new trades for most biopharmas, and each step of the data journey must be constructed, tested, and debugged from the ground up. The “plug-and-play” model many biopharma leaders imagine is simply not a natural part of developing a new technological infrastructure.

Even after investing significant time and effort in a DIY approach, organizations often experience disappointing results: The user experience is often poor, and data remains trapped at the most basic layer of data maturity. The data is stuck in proprietary formats, illiquid, and subscale—it is in no way ready for AI. Engineering and preparing data for AI requires additional investment and additional projects. If an organization does devote internal resources to data engineering, timelines almost always exceed initial expectations and teams become frustrated. These hidden expenses give the costs of DIY a long tail that could last years.

The tyranny of now

Immediacy bias also plagues DIY projects. Given their limited scope and definition of success, DIY projects tend to favor current needs instead of creating future-proof systems. For example, a point-to-point integration might solve the problem of transferring a single instrument’s data to an electronic lab notebook (ELN). However, what happens when the lab wants to add a second, third, or fourth instrument? What happens when someone wants to analyze data across the lab, across groups, or compare suites of like instruments from different vendors? Suddenly the point-to-point paradigm breaks down. Operating with this relatively myopic viewpoint often forces companies to update entire systems as new technology, software, or data protocols evolve.
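
A rough, purely illustrative sketch of why the point-to-point paradigm breaks down as a lab grows: wiring every instrument directly to every application scales multiplicatively, while routing everything through a common data layer scales additively. The numbers below are hypothetical.

```python
# Illustrative only: count the integrations needed as a lab grows.

def point_to_point(instruments: int, applications: int) -> int:
    # One custom integration per (instrument, application) pair.
    return instruments * applications

def central_layer(instruments: int, applications: int) -> int:
    # One connector per system, all routed through a common data layer.
    return instruments + applications

applications = 4  # e.g., an ELN, a LIMS, an analytics tool, an AI pipeline
for instruments in (1, 5, 20, 100):
    print(
        f"{instruments:>3} instruments x {applications} applications: "
        f"{point_to_point(instruments, applications):>3} point-to-point integrations "
        f"vs {central_layer(instruments, applications):>3} connectors via a central layer"
    )
```

Each of those point-to-point integrations must be built, validated, documented, and maintained separately, which is why an approach that looks agreeable for one instrument becomes untenable for a fleet.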

What does DIY failure look like?

DIY failure is all too common. Just look for the familiar pattern: Data projects greatly exceed their initial scope, fail to deliver promised data capabilities, and produce only rigid datasets that aren’t ready for AI.

One top 20 biopharma we spoke with reported that three years into a DIY data initiative, the company had only succeeded in integrating 20–25 percent of its total instrument base, despite costs continuing to balloon past initial estimates. And this project was for relatively straightforward integrations.

Things become more complicated on larger AI-based initiatives that attract nine-figure budgets, promise world-altering results, and still struggle to deliver. Take data42, for example, a Novartis project initiated at the end of 2019. The project’s goal was to leverage 2 million patient-years of data with AI technology. But the day-to-day work would include a massive amount of data management.

The nature of the work wasn’t a surprise either. In an early press release, Peter Speyer, the lead of product development, said “All of those data need to be cleaned and curated to make them machine-learnable. This is hard and cumbersome work, but it frees up our data scientists to focus on answering questions with data.” 

Fast forward to 2023: Reports have emerged that Novartis severely curtailed the data42 project. Why? There has been no official explanation. However, Achim Pleuckebaum, data42’s former Head of R&D, provided a few key learnings as he exited the company and project. His first piece of advice: “Do not try to build it yourself.”

5 drawbacks to DIY

So why do DIY scientific data solutions fail in modern biopharma companies? It comes down to five problems:

1. Inability to produce AI-native data

Using AI in biopharma requires large-scale, liquid, engineered, and compliant data for optimal performance. Creating that data calls for significant processing, contextualization, and centralization. DIY initiatives typically do not produce data with all these qualities—and if they do, the cost/benefit ratio is poor. DIY teams are generally solving short-term problems, such as integrating lab instruments and informatics applications, instead of laddering up to higher-order problems such as AI enablement.

2. Significant investment of time and capital

Creating, validating, and thoroughly documenting point-to-point integrations is extremely labor intensive and requires a unique combination of technical expertise and collaboration with vendors across the scientific data ecosystem. Both factors contribute to long timelines and high costs. Furthermore, once integrations are established, moving data within and between research, development, and manufacturing quality assurance (QA)/quality control (QC) workflows still requires manual data transfers, transformation, and QC. These processes require sophisticated scientific expertise and significant resource investment—and they are still error prone.

Organizations incur additional expenses as internal teams continuously maintain and update integrations and data transformation workflows for new software versions and instruments.

3. Lack of flexibility

IT project-based point-to-point integrations yield static, complex, and rigid data architectures. These architectures result in scattered, subscale data stores that house raw, fragmented datasets. The architectures are not future-proofed. Their lack of flexibility creates a poor user experience in the short term and prevents companies from leveraging AI in the long term.

4. Limited application options

If a DIY project leaves files trapped in proprietary formats and scattered data silos, it doesn’t matter how effective analytical applications are—they will not be able to function optimally. Without large-scale, liquid, engineered, and compliant data, organizations won’t be able to leverage best-of-breed analytics and AI applications, greatly reducing their ability to innovate. 

5. Poor scalability

The DIY model creates an “n of 1” internal customer business model. IT teams create, test, and validate individual data solutions for each workflow or team. With every integration and workflow augmentation being built from scratch, companies establish best practices at an organizational level as opposed to an industry level. They cannot reuse products from these types of projects, and they will have to build, test, and validate any new instrument or software integration. This makes it impossible to achieve economies of scale. Companies certainly cannot create the kind of scientific AI factory needed to produce large-scale datasets for AI applications.  

Bridging the gap

The most important asset for biopharmas in the next decade is AI-native data. Yet biopharma organizations are struggling to bridge the gap between where they are today and where they need to go. DIY approaches have failed, and will continue to fail, to produce the data required for AI. Biopharmas need a partner that understands how companies can increase their data maturity so they can leverage AI, accelerate time to market, and deliver better medicines to patients faster.

TetraScience is the only company dedicated to transforming biopharma’s data into AI-native Tetra Data. See how we’re approaching the AI-readiness data problem in our latest white paper, “The Scientific AI Gap.”

Blog
Scientific AI trends in biopharma
Science
AI & Data Science
Scientific AI

The biopharmaceutical industry stands at a crossroads. R&D costs are rising. Sales are falling. And the process from discovery to commercialization remains long and complicated. Many biopharma leaders believe artificial intelligence (AI) will reverse these trends, potentially revolutionizing the entire value chain. They’re placing big bets on AI through investments in technology, people, and partnerships. Let’s take a closer look.

The R&D productivity problem

Biopharma companies are spending more than ever on research and development, but the returns are diminishing. Last year it cost an average of $2.3 billion to bring a drug to market—76 percent higher than a decade ago. During the same period, sales have slumped 25 percent on a per-drug basis. Both trends have contributed to a five-fold reduction in R&D productivity.

Drug development is not only costly but also long and risky. It lasts over 10 years on average, with a measly 8 percent of drug candidates earning regulatory approval.

The industry needs a paradigm shift.

Rising costs and slumping drug sales have contributed to declining R&D productivity over the last decade. Source: Deloitte, 2023.

Manufacturing is getting harder to scale

Producing safe and effective drugs is already complicated. But the next generation of therapies will make the process even more challenging. The last 20 years have seen the development of more than 17 new drug modalities, such as antibody-drug conjugates, bispecific proteins, and cell and gene therapies. Small molecules still make up the vast majority of products on the market, but drug development pipelines are shifting. In 2022, new modalities accounted for about half of drug approvals. By 2025, the FDA expects to greenlight 10 to 20 cell and gene therapy products annually.

Biologics and advanced therapies, derived from living cells, are more complicated to produce, handle, store, and analyze than small molecules. This makes their manufacturing costly and difficult to scale. The complexity is even more pronounced for personalized medicines like chimeric antigen receptor T-cell (CAR-T) therapy, where a patient’s cells are harvested, modified, and reintroduced.

Although far from perfect, manufacturing processes for traditional therapeutics are significantly more mature than those for newer modalities. Yields for small molecules are often measured in kilograms, whereas gram-scale batches are common for biologics like monoclonal antibodies. Cell and gene therapies are produced in much smaller quantities—micrograms to milligrams. Manufacturers will need to lean on smart factory capabilities and AI to scale up production for large populations.

The promise of AI

Biopharma organizations are looking to AI to revolutionize the entire value chain for therapeutics. It has the potential to significantly shorten drug development timelines, boost R&D success rates, and radically improve manufacturing end to end.

The gains from Scientific AI could be immense. Morgan Stanley estimates that using AI in early-stage drug development over the next decade could bring an additional 50 therapies to market worth over $50 billion in sales. A McKinsey analysis predicts that generative AI could unlock the equivalent of 2.6 to 4.5 percent of annual revenue ($60 billion to $110 billion) for the industry. 

Executives in biopharma are becoming increasingly optimistic about the impact of AI on their business. Nearly half of the top 50 biopharma companies have mentioned AI on earnings calls over the past five years. Emma Walmsley, CEO of GSK, said AI can "improve the biggest challenge of the [pharmaceutical] sector, which is the productivity of R&D." Sanofi recently announced its ambition "to become the first pharma company powered by artificial intelligence at scale." And Christophe Weber, CEO of Takeda, said "AI will reduce the cost of R&D [per] molecule. It has to."

AI investment is surging

To bring AI ambitions to fruition, biopharma companies are poised to spend billions. According to one report, spending will climb from $1.64 billion in 2023 to $4.61 billion in 2027. Morgan Stanley predicts that AI investments will grow from 1.5 percent of R&D budgets in 2023 to 4 percent in 2030.

Top biopharma companies are building dedicated AI teams, with clear mandates from leadership. Hiring over the last year has been brisk. In late 2022, AI-related jobs accounted for 7 percent of new job postings by biopharma companies, more than double the average across all sectors. These new positions include both AI generalists and expertise-based roles. The latter typically require advanced degrees in data science and proficiency in the functional domain.

Biopharma companies are also looking to collaborate with partners to gain the necessary technology and know-how to carry out their AI initiatives. About 800 companies are currently applying AI to drug discovery and development. Many are startups offering an array of platforms and services, including software as a service (SaaS), custom data science services, drug discovery (drug candidate as a service), and clinical trial support.

R&D partnerships between leading biopharma organizations and AI companies have surged over the last six years, totaling 249 by early 2023. Half of the 50 largest biopharma companies have entered into partnerships or licensing agreements with AI companies. They have invested over $1 billion in upfront payments over the last five years.

Biopharma companies are ramping up their investments in AI. Source: Morgan Stanley, 2022; Deep Pharma Intelligence, 2023.

The Scientific AI gap

Eager to get started with AI? Your scientific data may be holding you back. 

Read our white paper to understand why biopharma companies risk falling short of their AI goals due to their scientific data strategy. Learn how to close the gap between the vision of AI and present-day scientific data reality.


