Why Biopharma Needs an End-to-End, Purpose-Built Platform for Scientific Data — Part 1

Life Sciences’ Data Problem, and Why “Do-it-Yourself” Doesn’t Work (Part One of a Two-Part Series)
Spin Wang | March 23, 2022


Biopharma professionals are on a mission to accelerate discovery and improve human life — exploiting rapidly-evolving technologies for analytics, AI/ML, and automation to shorten time to market for new therapeutics. This need has led to a rapid, industry-wide paradigm shift in how scientific data is understood and valued:

  • Stakeholders throughout biopharma organizations, from bench and data scientists to R&D IT professionals, manufacturing and compliance specialists, and executives, now recognize that data quantity, quality, validity, and accessibility are major competitive assets
  • Beneficiaries beyond bench scientists: Data scientists, tech transfer, external collaborators, procurement, operations, strategic and business management, and many other stakeholders must now be considered data consumers and producers — their ability to improve scientific and business outcomes depends on their ability to access high-quality data easily, and to provide high-quality data back to the groups and systems that originate it
  • Meanwhile, organizations are automating and replatforming dataflows to the cloud to enhance access, protect against data loss, leverage elastic compute/storage, and trade capital expenditure for operational expenditure, amongst other benefits

Mastering data changes the game. Biopharma organizations labor to make data work harder, seeking to speed scientific insight, enable new applications, and improve business outcomes. Practical uses for aggregated data surround us — in fundamental science, lab automation, resource management, quality control, compliance, and oversight (for further examples, see our blog about use cases for harmonized data from Waters Empower Data Science Link (EDSL)). Deep learning holds out the promise of discovering new value hidden in data. Applications for data analytics, including AI/ML, range from predictive maintenance on lab and manufacturing equipment to discovering novel ligands within huge small-molecule datasets and predicting their biological effects.

The Snarl of Scientific Data Sources, Targets, and Workflows

Managing data across a whole biopharma organization is, however, a daunting challenge. Life sciences R&D and manufacturing notoriously suffer from fragmentation and data silos — with valuable data produced (and also trapped) in myriad locations, formats, systems, workflows, and organizational domains. How can biopharma begin to cope with this complexity?

We find it helpful to think of each workflow in terms of a minimum quantum of organizational and logical effort. To gain scientific or business benefit, you need to find and move information from where it’s created or resides (a data source) to a system or application that can usefully consume it (a data target).
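This minimum quantum can be sketched in a few lines of code. The sketch below is purely illustrative — the `DataSource` and `DataTarget` classes are hypothetical stand-ins for real systems (an instrument PC, a LIMS, an analytics tool), not part of any actual platform:

```python
from dataclasses import dataclass, field

@dataclass
class DataSource:
    """Hypothetical stand-in for a system where data is created or resides."""
    name: str
    records: list = field(default_factory=list)

@dataclass
class DataTarget:
    """Hypothetical stand-in for a system that usefully consumes data."""
    name: str
    received: list = field(default_factory=list)

def move(source: DataSource, target: DataTarget) -> int:
    """The minimum quantum of effort: find data at a source, deliver it to a target."""
    target.received.extend(source.records)
    return len(source.records)

# One quantum of work: a plate reader's result lands in an ELN
instrument = DataSource("plate-reader-01", records=[{"well": "A1", "od600": 0.42}])
eln = DataTarget("eln")
moved = move(instrument, eln)
```

Everything that follows in this post — integrations, harmonization, enrichment — is about making this one conceptual step reliable at organizational scale.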

Some common data sources include:

  • Lab instruments and instrument control software
  • Informatics applications, like Electronic Lab Notebooks (ELNs) and Lab Information Management Systems (LIMS)
  • Contract Research Organizations (CROs) and Contract Development Manufacturing Organizations (CDMOs)
  • Sensors and facility monitoring systems
  • General-purpose SaaS systems (Egnyte, Gmail, Box)

Data targets, on the other hand, are systems that consume the data to deliver insights and conclusions or reports. For example:

  • Data science-oriented applications and tools, including visualization and analytics tools like Spotfire, and AI/ML tools and platforms such as Streamlit, Amazon SageMaker, H2O.ai, Alteryx, and others
  • Lab informatics systems, including LIMS, Manufacturing Execution Systems (MES), ELNs, plus instrument control software like Chromatography Data Systems (CDS) and lab robotics automation systems. Note that these data targets are also treated as data sources by certain workflows

Integrating these sources and targets is seldom simple (we comment on some of the reasons for this complexity, below). In many organizations, moving data from sources to targets remains partly or wholly a manual process. As a result:

  • Scientists’ and data scientists’ time is wasted on error-prone, manual data collection and transcription — time spent on manual data handling is time unavailable for analysis and insight
  • Meanwhile, pressure to collaborate, distribute research, and speed discovery introduces further challenges in data sharing and validation, often resulting in simplistic procedures that rob data of context and long-term utility, and may compromise data integrity and compliance

Seeking to automate some of these transactions, biopharma IT teams often struggle to build point-to-point integrations connecting data sources and targets, frequently discovering that their work is fragile, inflexible, and difficult to maintain. We’ve written several blogs (e.g., Data Plumbing for the Digital Lab, What is a True Data Integration Anyway?, and How TetraScience Approaches the Challenge of Scaling True Scientific Data Integrations) on the complexities of integration building, inadequacies of point-to-point approaches, and requirements for engineering fully-productized, maintainable integrations.

Building and maintaining point-to-point integrations among myriad scientific data sources and targets is costly, time-consuming, inefficient, and distorts organizational priorities.

“Lego” Solution Assembly: More Complexity

This frustrating experience with pure point-to-point integrations leads many biopharma IT organizations to begin considering a different solution: building a centralized data repository (i.e., a data lake or data warehouse) and using it to mediate connections between data sources and targets. A common approach is to try assembling such a solution from a plethora of available open source and proprietary, industry-agnostic components and services for data collection, storage, transformation, and other functions.
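The appeal of a mediating hub comes down to simple arithmetic: wiring every source directly to every target requires one integration per source-target pair, while a central repository needs only one connector per system. A quick back-of-the-envelope comparison (the counts below are illustrative numbers, not figures from this post):

```python
def point_to_point(n_sources: int, n_targets: int) -> int:
    """Direct wiring: one integration per source-target pair."""
    return n_sources * n_targets

def hub_and_spoke(n_sources: int, n_targets: int) -> int:
    """A central repository mediates: one connector per system."""
    return n_sources + n_targets

# e.g., 40 instruments and applications feeding 10 analytics/informatics targets
p2p = point_to_point(40, 10)   # 400 integrations to build and maintain
hub = hub_and_spoke(40, 10)    # 50 connectors
```

The arithmetic explains why organizations reach for a central data lake or warehouse — but as the rest of this post argues, assembling that hub from generic components introduces problems of its own.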

A recent article from Bessemer Ventures describes versions of a componentized architecture. Note that none of these components is purpose-built for scientific data.

Creating a data solution involves selecting from among myriad components, then integrating them to support scientific and business needs, comply with regulation, and meet other stringent requirements.

The Problem with “Do-it-Yourself”

The challenges we’ve solved with 20+ global biopharma companies have convinced us that there are two major problems with this approach:

Problem #1: Organizational Spread

None of the components of a cloud data platform — data lakes/warehouses, integration platforms, batch processing and data pipeline systems, query engines, clustering systems, monitoring and observability — are simple, or “just work” out of the box. Most are complex: serious configuration effort, experimentation, and best-practices knowledge are required to make each component do its job in context, and to enable all components to work together well. You also need to develop serious external tooling (which requires specially-trained headcount to create, maintain, and operate) to make updates, scaling, and other lifecycle management operationally efficient and safe in production.

Specialized (but non-strategic) skills required. To execute, you’ll need to assemble teams with specialized skills — each managing as few as one vendor, component, or subsystem of the complete solution — plus software architects and project managers to orchestrate team efforts. You'll also need expertise in GxP-compliant software design, data integrity, and security, and a cadre of data engineers to work with scientists, create integrations and workflows, and help prepare stored data for use by automation, analytics, and AI/ML.

You’ll need these teams long term since you’ll be responsible for evolving, scaling, and maintaining the full solution stack plus a growing number of integrations. While this expert headcount is critical to the timeliness and success of your project, they’re also a cost center — focused on running, integrating, and scaling your platform, but outside the critical path of extracting value from scientific data and helping you do business faster and better.

A data science manager at a global biopharma organization comments, “This organizational spread creates bottlenecks, slows down operations, and in turn, delays data usage. The additional need to ensure that a data ingestion pipeline is GxP-validated further increases this problem — in fact, it might even add an additional organizational unit to the mix!”

Focus on the big picture becomes compromised. Meanwhile, as teams around the project grow, cost and time pressure push toward delivering a minimum viable product quickly. Focusing on low-hanging fruit and immediate requirements can easily lead to a partial solution that doesn't scale well, doesn't generalize to many use cases, and may be unmaintainable.

Two practical use case examples illustrate the need for a feature-rich, life sciences data-focused, end-to-end solution:

  • In high-throughput screening (HTS) workflows, robotic automation generates a massive amount of data. These data need to be automatically collected, harmonized, labeled, and sent to screening analytics tools in order to configure the robots for the next set of experiments.
  • In late-stage development and manufacturing, labs are constantly checking the quality of batches and the performance of their methods. Harmonizing these data enables analytics to compare method parameters, batches, and trends over time — flagging anomalies and potentially yielding important insights into batch quality and system suitability.

In both examples, merely storing the data, collecting the data, or providing data transformation is not enough. To yield benefits, these key operations need to be architected, implemented, tracked, and surfaced in a holistic way, targeting the end-to-end flow of those particular data sets.
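To make the harmonization step in these examples concrete, here is a minimal sketch. The two "vendor" formats and field names below are invented for illustration — real instrument outputs are far more complex — but the principle is the same: map each proprietary readout into one shared schema so downstream analytics can compare batches across instruments:

```python
def harmonize_vendor_a(raw: dict) -> dict:
    """Hypothetical vendor A reports absorbance directly in AU."""
    return {"sample_id": raw["SampleID"], "absorbance": raw["Abs"], "unit": "AU"}

def harmonize_vendor_b(raw: dict) -> dict:
    """Hypothetical vendor B reports milli-absorbance units; convert to AU."""
    return {"sample_id": raw["id"], "absorbance": raw["mAU"] / 1000.0, "unit": "AU"}

# Two batches measured on different instruments, now in one common shape
records = [
    harmonize_vendor_a({"SampleID": "B-001", "Abs": 0.52}),
    harmonize_vendor_b({"id": "B-002", "mAU": 498.0}),
]

# With a shared schema, a single rule can trend both instruments together,
# e.g., flag any batch that drifts more than 0.1 AU from the expected value
anomalies = [r for r in records if abs(r["absorbance"] - 0.5) > 0.1]
```

Without the shared schema, that last comparison would require bespoke logic for every instrument and format in the workflow.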

Data is stripped of context, limiting utility. While data collected without context may be meaningful to scientists who recognize the file, such data will be useless for search/query, post-facto analytics, and data science at scale. Metadata providing scientific context — instrument details, environmental state, and more — must be determined and attached soon after data is ingested. If this doesn't happen:

  • It can be difficult to appropriately enrich (or sometimes, even parse) vendor-specific or vendor-proprietary data
  • Data integrity issues — common for experimental data and when working with external partners — may be missed
  • A significant fraction of total data cannot be used easily by data scientists because it lacks fundamental information about how it was created and what it means
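A sketch of what context enrichment at ingestion might look like. The field names, the instrument registry, and the `enrich` helper are all hypothetical examples of attaching the kind of context described above — not the API of any real platform:

```python
import datetime

def enrich(raw_file: dict, instrument_registry: dict) -> dict:
    """Attach scientific context (instrument details, ingestion time) to a raw record."""
    inst = instrument_registry[raw_file["instrument_id"]]
    return {
        **raw_file,
        "instrument_model": inst["model"],
        "last_calibrated": inst["last_calibrated"],
        "ingested_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }

# A registry of instrument metadata, maintained independently of the raw files
registry = {"hplc-07": {"model": "Model-X", "last_calibrated": "2022-03-01"}}

# At ingestion time, the bare file reference gains the context that makes it
# searchable and usable by data scientists who never saw the original run
record = enrich({"instrument_id": "hplc-07", "path": "run_0423.raw"}, registry)
```

The point is timing: this context is cheap to capture at ingestion and often impossible to reconstruct months later.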

For more detail, see Executive Conversations: Evolving R&D with Siping “Spin” Wang, President and CTO of TetraScience | Amazon Web Services and Move Beyond Data Management.

The "impedance mismatch" between horizontal data solutions and scientific data and use-case requirements is amplified by the complexity of biopharma data sources, targets, and workflows. Overcoming this complexity requires resources, specialized knowledge, and time, and may produce slow return on investment.


Problem #2: Impedance Mismatch

The “impedance mismatch” between industry-agnostic, “horizontal” data solutions and biopharma workflows is amplified by the complexity of the life sciences domain.

Scientific workflows are complex. A single small-molecule or biologics workflow can comprise dozens of sequential phases, each with many iterative steps that consume and produce data. As workflows proceed, they fork and duplicate, and may move among multiple organizations with different researchers, instruments, and protocols.

Biopharma has myriad instruments and software systems per user, producing and consuming complex, diverse, and often proprietary file and data types. Distributed research (e.g., collaboration with CROs and CDMOs) adds new sources, formats, standards, and validation requirements to every workflow. The result is research data locked within a huge number of systems and formats, each requiring specific knowledge to validate, enhance, consume, or reuse — a daunting task, to say the least.

Building effective integrations is EXTREMELY difficult. If a life sciences organization builds its own data platform using horizontal components, such as Mulesoft, Pentaho, Boomi, Databricks, or Snowflake, it inevitably also needs to build and maintain all the integrations required to pull data from or push data to instruments, informatics applications, CROs/CDMOs, and other data sources and targets. This last-mile integration challenge is a never-ending exercise — the work of creating and maintaining fully-serviceable integrations exceeds the capacity of biopharma IT organizations and distracts from other, more strategic and scientifically important, work. For a closer look at technical and organizational requirements for engineering life sciences integrations, see our blog: What is a True Data Integration, Anyway?

Two strategies are often considered for managing integration development and maintenance workload:

  • Outsourcing to consulting companies as professional services projects. Integrations produced this way typically take a long time to build, and almost invariably become one-off solutions that require significant ongoing investment to maintain.
  • Handing off to vendors of an important data source/target (e.g., a LIMS or ELN) as “customization” or professional services work. Such efforts often produce vendor-specific and rigid, point-to-point integrations that become obsolete when changes occur or end up locking data into that particular vendor’s offering.

Neither of these two approaches treats connectivity or integration as a first-class business or product priority, meaning that these do-it-yourself projects often bog down the organization and fail to deliver ROI.

Towards a Solution for Scientific Data

In our next installment, we’ll discuss four critical requirements for untangling life sciences’ complex challenges around data and show how fulfilling these requirements enables a solution that:

  • Delivers benefits quickly, helping speed replatforming scientific data to the cloud and enabling rapid implementation of high-value scientific dataflow use cases
  • Scales out efficiently, enabling IT to plan and resource effectively, and freeing scientists and data scientists to refocus on scientific innovation instead of non-strategic, technical wheel spinning

An effective data cloud platform for managing scientific data requires more than just IT/cloud know-how and coding smarts. It requires a partnership with an organization that has deep life sciences understanding, a disciplined process for integration building, and a commitment to open collaboration with a broad ecosystem of partners.

Manual operations on data waste a huge percentage of scientists' and data scientists' time. To learn more about automating critical scientific workflows, saving time, and improving repeatability and accuracy, read our whitepaper: Manual No More: Automating the Scientific Data Lifecycle.

