Life Sciences’ Data Problem, and Why DIY Doesn’t Work
Biopharma R&D professionals are on a mission to accelerate discovery and improve human life by shortening time-to-market for new therapeutics. That mission has driven a rapid, industry-wide shift in how data is understood and valued.
Data is now increasingly seen as having large potential utility and value:
- Data quantity, quality, validity, and accessibility are now recognized as major competitive assets.
- Beneficiaries beyond bench scientists: Data scientists, tech transfer, external collaborators, procurement, operations, and multiple other departments or platforms must now be considered data consumers and producers — their contributions (to science, to profit) gated by their ability to access high-quality data easily, and provide high-quality data back.
- Meanwhile dataflows are being automated and replatformed to the cloud to enhance access, protect against data loss, leverage elastic compute/storage, and trade CapEx for OpEx, among other hoped-for benefits.
Mastering data can change the game. Biopharma organizations are now laboring to make data work harder, seeking to accelerate discovery and improve business outcomes. Practical uses for aggregated data are all around — in fundamental science, lab automation, resource management, quality control, compliance and oversight (for further examples, see our blog about use cases for harmonized data from Waters Empower Data Science Link (EDSL)). Undiscovered value hidden in data is also becoming a powerful lure — R&D focus is expanding beyond tangible experiments to embrace work done on data directly. Applications for ML range from predictive maintenance on lab and manufacturing equipment to discovering novel ligands within huge small molecule datasets and predicting their biological effects.
Data Sources and Targets
Mastering data across a whole biopharma organization is, however, a daunting challenge. Life sciences R&D suffers notoriously from fragmentation and data silos — with valuable data produced (and also entrapped) in myriad locations, formats, systems, workflows, and organizational domains. How can biopharma organizations begin to cope with this mountain of complexity?
We find it helpful to think of each R&D workflow in terms of a minimum quantum of organizational and logical effort. To gain scientific or business benefit, you need to find and move information from where it’s created or resides (a data source) to a system or application that can usefully consume it (a data target).
Some common data sources include:
- Lab instruments and instrument control software
- Informatics applications, like Electronic Lab Notebook (ELN) and Lab Information Management System (LIMS)
- Contract Research Organizations (CROs) and Contract Development and Manufacturing Organizations (CDMOs)
- Sensors and facility monitoring systems
Data targets, on the other hand, are systems that consume the data to deliver insights and conclusions or reports. There are roughly two types of data targets:
- Data science-oriented applications and tools, including visualization and analytics tools like Spotfire, and AI/ML tools and platforms such as Streamlit, Amazon SageMaker, H2O.ai, Alteryx, and others.
- Lab informatics systems, including Lab Information Management Systems (LIMS), Manufacturing Execution Systems (MES), Electronic Lab Notebooks (ELNs), plus instrument control software like Chromatography Data Systems (CDS) and lab robotics automation systems. Interestingly, these data targets are also treated as data sources in certain workflows.
Integrating these sources and targets is seldom simple (we comment on some of the reasons for this complexity below). In many organizations, moving data from sources to targets remains partly or wholly a manual process. As a result:
- Scientists’ and data scientists’ time is wasted on manual data collection and transcription, with attendant risks of human error and process non-compliance, leaving less time for analysis and insights.
- Meanwhile, pressure to collaborate, distribute research, and speed discovery introduces further challenges in data sharing and validation, often resulting in simplistic procedures that rob data of context and long-term utility.
Seeking to automate some of these transactions, R&D IT teams often start by struggling to build point-to-point integrations connecting data sources and targets, frequently discovering that their work is minimally viable, fragile, inflexible, and difficult to maintain. We’ve written several blogs (e.g., Data Plumbing for the Digital Lab; What is a True Data Integration, Anyway?; and How TetraScience Approaches the Challenge of Scaling True R&D Data Integrations) on the complexities of integration building, inadequacies of point-to-point approaches, and requirements for engineering fully-productized, maintainable integrations.
DIY Solution Assembly
Frustrating experience with pure point-to-point integrations leads many biopharma IT organizations to swiftly begin considering a second-order solution: building a centralized data repository (i.e., a data lake or data warehouse) and using it to mediate connections between data sources and targets. A common approach is to try assembling such a solution from a plethora of available open source and proprietary industry-agnostic components and services for data collection, storage, transformation, and other functions.
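To see why a central repository looks attractive after a bout of point-to-point work, it helps to count connections. The sketch below is only an illustration under simplifying assumptions: it assumes the worst case in which every source must feed every target, and the system names are hypothetical examples drawn from the categories listed earlier.

```python
from itertools import product

# Hypothetical example systems; any names would do.
sources = ["hplc_cds", "plate_reader", "eln", "cro_uploads"]
targets = ["spotfire", "sagemaker", "lims"]


def point_to_point_links(sources, targets):
    """Worst case: every source is wired directly to every target."""
    return list(product(sources, targets))


def hub_links(sources, targets, hub="data_lake"):
    """Every source writes to the hub; every target reads from the hub."""
    return [(s, hub) for s in sources] + [(hub, t) for t in targets]


p2p = point_to_point_links(sources, targets)
hub = hub_links(sources, targets)
print(len(p2p), len(hub))  # 4 * 3 = 12 direct links vs. 4 + 3 = 7 via the hub
```

Under these assumptions, adding a fifth source costs three new integrations in the point-to-point layout but only one with a hub. The count says nothing about how hard each integration is to build and maintain, which is where the problems described next come in.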
The Problem with DIY
The challenges we’ve solved with 20+ global biopharma companies have convinced us that there are two major problems with this approach:
Problem #1: Organizational Spread
Data lakes/warehouses, integration platforms, batch processing and data pipeline systems, query engines, clustering systems, monitoring and observability: none of these components is simple, and none “just works” out of the box. Most are flexible to the point where serious configuration effort, experimentation, and best-practices knowledge are required to make each component do its job and all components work together well. Serious external tooling (plus skilled operations headcount, training, and planning) is required to make updates, scaling, and other life cycle management operationally efficient and safe in production.
Specialized (but non-strategic) skills required. To execute, you’ll need to assemble teams with specialized skills, each managing as few as one vendor, component, or subsystem of the complete solution, plus architects and project managers to orchestrate team efforts. You’ll need these teams long-term, since you’ll be responsible for maintaining and evolving the full stack. Their skills are critical. But they’re also a cost center — focused on running, integrating, and expanding your platform, but remote from the critical path of accelerating discovery (i.e., extracting value from data).
A data science manager at a global biopharma organization comments, “This organizational spread creates bottlenecks, slows down operations, and in turn, delays data usage. The additional need to ensure that a data ingestion pipeline is GxP-validated further increases this problem — in fact, it might even add an additional organizational unit to the mix! We would like the experts to be able to directly handle their data.”
Architectural goals become compromised. Meanwhile, as teams around the project grow, tensions emerge between the need to deliver minimum viable product quickly and longer-term vision — architecture and data flow can start to get out of sync. The danger is that project focus can begin to dwell on parts of the problem: collecting data, storing it, providing data transformation, harmonizing data from multiple sources — rather than creating an R&D-focused, data-centric solution that can steward the whole life cycle of data, end to end.
This swiftly adds friction. In fact, it can make many important applications impossible. Two examples:
- In high-throughput screening (HTS) workflows, robotic automation generates a huge volume of data. This data needs to be automatically collected, harmonized, labeled, and sent to the screening analytics tools in order to inform the next set of experiments the robots need to perform.
- In late-stage development and manufacturing, quality labs are constantly checking the quality of batches and the performance of their methods. Harmonizing this data enables analytics to compare method parameters, batches, and trends over time, flagging anomalies and potentially yielding important insights into batch quality and system suitability.
In both examples, merely storing the data, collecting the data, or providing data transformation is not enough. To yield benefits, these key operations need to be architected, implemented, tracked, and surfaced in a holistic way, targeting the end-to-end flow of those particular data sets.
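One way to picture "holistic" here is a single pipeline that owns every stage and records what happened to each data set, rather than separate tools that each handle one stage in isolation. The following is a minimal sketch under stated assumptions: the stage names mirror the HTS example above, while the functions, field names, sample values, and in-memory history are all hypothetical stand-ins for real collection, harmonization, and delivery systems.

```python
from dataclasses import dataclass, field
from typing import Callable, Sequence

Stage = Callable[[dict], dict]


@dataclass
class DataSet:
    payload: dict
    history: list = field(default_factory=list)  # names of stages already run


def collect(p: dict) -> dict:
    # Hypothetical: stands in for pulling raw plate-reader output.
    return {**p, "raw": [0.12, 0.98, 0.05]}


def harmonize(p: dict) -> dict:
    # Hypothetical: map vendor-specific fields onto common names and units.
    return {**p, "values": p["raw"], "unit": "AU"}


def label(p: dict) -> dict:
    # Hypothetical: attach labels that downstream analytics can query on.
    return {**p, "labels": {"assay": "demo", "plate": p["plate_id"]}}


def run_pipeline(ds: DataSet, stages: Sequence[Stage]) -> DataSet:
    """Run stages in order, tracking each one on the data set itself."""
    for stage in stages:
        ds.payload = stage(ds.payload)
        ds.history.append(stage.__name__)
    return ds


result = run_pipeline(DataSet({"plate_id": "P-001"}), [collect, harmonize, label])
print(result.history)  # ['collect', 'harmonize', 'label']
```

Because the history travels with the data set, a later consumer can tell whether a given record was harmonized and labeled before it reached the analytics tool, which is the kind of end-to-end visibility that separately managed components tend to lose.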
Data is stripped of context, limiting utility. The case for unifying these data operations goes further: data collected without adequate context may be meaningful to scientists who recognize the file, but will be useless for search/query, post-facto analytics, and data science at scale. Without a data processing layer that “decorates” data with critical metadata:
- Vendor-specific or vendor-proprietary data cannot be enriched (or sometimes, even parsed).
- Data integrity issues — common for experimental data and when working with outsourced partners — cannot be caught.
- A significant portion of the data cannot be used easily by data scientists.
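As a rough illustration of what such a “decorating” layer does, the sketch below wraps a raw file in a record that carries searchable metadata and flags missing context at ingestion time. It is a sketch under assumptions, not a description of any specific platform: the required fields, the record layout, and the sample values are all hypothetical.

```python
import hashlib
from datetime import datetime, timezone

# Assumed minimal context every file should carry; a real schema would be richer.
REQUIRED_FIELDS = ("instrument", "operator", "sample_id")


def decorate(raw_bytes: bytes, context: dict) -> dict:
    """Wrap a raw file in a record with metadata and basic integrity checks."""
    missing = [f for f in REQUIRED_FIELDS if not context.get(f)]
    return {
        "sha256": hashlib.sha256(raw_bytes).hexdigest(),  # detects later changes
        "size_bytes": len(raw_bytes),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "metadata": dict(context),
        "integrity_issues": [f"missing field: {f}" for f in missing],
    }


record = decorate(
    b"RT,Area\n1.2,345\n",
    {"instrument": "HPLC-07", "sample_id": "S-42"},  # operator left out on purpose
)
print(record["integrity_issues"])  # ['missing field: operator']
```

The point of the design is that the gap is caught when the file arrives, while the person who produced it can still fix it, instead of surfacing months later when a data scientist tries to query across files.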
Problem #2: Impedance Mismatch
The “impedance mismatch” between industry-agnostic, “horizontal” data solutions and biopharma R&D is amplified by the complexity of the R&D domain.
Scientific workflows are complex. A single small-molecule or biologics workflow can comprise dozens of sequential phases, each with many iterative steps that consume and produce data. As workflows proceed, they fork, reduplicate, and may transition among multiple organizations — different researchers, instruments, protocols.
Biopharma R&D has the largest variety of instruments and software systems per user of any industry, producing and consuming complex, diverse, and often proprietary file and data types. Distributed research (e.g., collaboration with CROs and CDMOs) adds new sources, formats, standards, and validation requirements to every workflow. This results in research data locked into systems and formats that make it effectively impossible to validate, enhance, consume, or reuse.
Building effective integrations is EXTREMELY difficult. If an R&D organization builds its own R&D data platform using horizontal components, such as MuleSoft, Pentaho, Boomi, Databricks, or Snowflake, it inevitably also needs to build and maintain all the integrations required to pull data from or push data to instruments, informatics applications, CROs/CDMOs, and other data sources and targets. This last-mile integration challenge can easily become a never-ending exercise, where the work of creating and maintaining fully-serviceable integrations exceeds the capacity of biopharma IT organizations and distracts from other, more strategic and scientifically important work. For a closer look at technical and organizational requirements for engineering life sciences integrations, see our blog: What is a True Data Integration, Anyway?
Two strategies are often considered for managing this integration workload:
- Outsourcing to consulting companies as professional services projects. Integrations produced this way typically take a long time to build, and almost invariably become one-off solutions that require significant ongoing investment to maintain.
- Handing off to vendors of an important data source/target (e.g., a LIMS or ELN) as “customization” or professional services work. Such efforts often produce highly vendor-specific and rigid point-to-point integrations that become obsolete when changes occur, or end up further locking data into that particular vendor’s offering.
Neither of these two approaches treats connectivity or integration as a first-class business/product priority, meaning that DIY projects often bog down, failing to deliver ROI in reasonable timeframes.
Towards a Data-Centric, R&D-Focused Solution
In our next installment, we’ll discuss four critical requirements for untangling life sciences’ complex challenges around data, and show how fulfilling these requirements enables a solution that delivers benefits quickly, scales effectively, and enables refocusing on discovery vs. non-strategic, technical wheel-spinning. Delivering an effective R&D data cloud platform requires more than just IT/cloud know-how and coding smarts. It requires a partnership with an organization that has deep life sciences understanding, a disciplined process for integration-building, and a commitment to open collaboration with a broad ecosystem of partners.