R&D biopharma professionals are on a mission to accelerate discovery and improve human life — shortening time-to-market for new therapeutics. This has occasioned a rapid, industry-wide paradigm shift in how data is understood and valued.
Data is now increasingly seen as having large potential utility and value:
Mastering data can change the game. Biopharma organizations are now laboring to make data work harder, seeking to accelerate discovery and improve business outcomes. Practical uses for aggregated data are all around — in fundamental science, lab automation, resource management, quality control, compliance and oversight (for further examples, see our blog about use cases for harmonized data from Waters Empower Data Science Link (EDSL)). Undiscovered value hidden in data is also becoming a powerful lure — R&D focus is expanding beyond tangible experiments to embrace work done on data directly. Applications for machine learning (ML) range from predictive maintenance on lab and manufacturing equipment to discovering novel ligands within huge small-molecule datasets and predicting their biological effects.
Mastering data across a whole biopharma organization is, however, a daunting challenge. Life sciences R&D suffers notoriously from fragmentation and data silos — with valuable data produced (and also entrapped) in myriad locations, formats, systems, workflows, and organizational domains. How can biopharma organizations begin to cope with this mountain of complexity?
We find it helpful to think of each R&D workflow in terms of a minimum quantum of organizational and logical effort. To gain scientific or business benefit, you need to find and move information from where it’s created or resides (a data source) to a system or application that can usefully consume it (a data target).
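That minimum "quantum" of effort can be sketched in a few lines of Python. The names here (DataSource, DataTarget, move) are hypothetical, used only to illustrate the pattern — not a real product API:

```python
# Minimal sketch of the source-to-target "quantum" described above.
# All class and function names are illustrative, not a real API.

class DataSource:
    """Where data is created or resides, e.g. an instrument PC."""
    def __init__(self, name):
        self.name = name

    def read(self):
        # e.g., pull a raw result file from the instrument
        return {"source": self.name, "payload": "raw result file"}

class DataTarget:
    """A system that can usefully consume the data, e.g. an analytics app."""
    def __init__(self, name):
        self.name = name
        self.received = []

    def consume(self, record):
        self.received.append(record)

def move(source, target):
    # The minimum unit of organizational and logical effort:
    # find the data where it lives and deliver it to a system
    # that can extract scientific or business value from it.
    target.consume(source.read())

hplc = DataSource("HPLC-01")
analytics = DataTarget("analytics-app")
move(hplc, analytics)
```

Trivial as it looks here, every real-world instance of `move` hides the questions the rest of this post is about: where the data actually lives, what format it is in, and what context must travel with it.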
Some common data sources include lab instruments, informatics applications, and data delivered by external partners such as CROs and CDMOs.
Data targets, on the other hand, are systems that consume the data to deliver insights, conclusions, or reports. Broadly, they fall into two types: applications that analyze data to generate insights and conclusions, and systems that compile it into reports.
Integrating these sources and targets is seldom simple (we comment on some of the reasons for this complexity, below). In many organizations, moving data from sources to targets remains partly or wholly a manual process. As a result:
Seeking to automate some of these transactions, R&D IT teams often start by building point-to-point integrations connecting data sources and targets, only to discover that the result is minimally viable, fragile, inflexible, and difficult to maintain. We’ve written several blogs (e.g., Data Plumbing for the Digital Lab, What is a True Data Integration Anyway?, and How TetraScience Approaches the Challenge of Scaling True R&D Data Integrations) on the complexities of integration building, the inadequacies of point-to-point approaches, and the requirements for engineering fully-productized, maintainable integrations.
Frustrating experience with pure point-to-point integrations leads many biopharma IT organizations to swiftly begin considering a second-order solution: building a centralized data repository (i.e., a data lake or data warehouse) and using it to mediate connections between data sources and targets. A common approach is to try assembling such a solution from a plethora of available open source and proprietary industry-agnostic components and services for data collection, storage, transformation, and other functions.
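One way to see the appeal of a mediating repository is simple connector arithmetic. The figures below are illustrative, not taken from any customer data: with S sources and T targets, point-to-point wiring needs on the order of S × T connectors, while a central hub mediating all traffic needs only S + T.

```python
# Back-of-the-envelope connector counts (illustrative numbers only).

def point_to_point(sources: int, targets: int) -> int:
    # Every source wired directly to every target it feeds:
    # worst case, one integration per (source, target) pair.
    return sources * targets

def hub_and_spoke(sources: int, targets: int) -> int:
    # With a central repository in the middle, each system
    # needs only one integration -- to the hub.
    return sources + targets

S, T = 40, 10  # e.g., 40 instrument/application sources, 10 consuming systems
print(point_to_point(S, T))  # 400 connectors to build and maintain
print(hub_and_spoke(S, T))   # 50 when a central repository mediates
```

The arithmetic is why the hub architecture is attractive — but, as the rest of this post argues, it is not by itself sufficient.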
The challenges we’ve solved with 20+ global biopharma companies have convinced us that there are two major problems with this approach:
Data lakes/warehouses, integration platforms, batch processing and data pipeline systems, query engines, clustering systems, monitoring and observability — none of these components are simple, or “just work” out of the box. Most are flexible to the point where serious configuration effort, experimentation, and best-practices knowledge are required to make each component do its job and all components work together well. Serious external tooling (plus skilled operations headcount, training, and planning) is required to make updates, scaling, and other life cycle management operationally efficient and safe in production.
Specialized (but non-strategic) skills required. To execute, you’ll need to assemble teams with specialized skills, each responsible for as little as a single vendor, component, or subsystem of the complete solution, plus architects and project managers to orchestrate their efforts. You’ll need these teams long-term, since you’ll be responsible for maintaining and evolving the full stack. Their skills are critical. But they’re also a cost center — focused on running, integrating, and expanding your platform, yet remote from the critical path of accelerating discovery (i.e., extracting value from data).
A data science manager at a global biopharma organization comments, “This organizational spread creates bottlenecks, slows down operations, and in turn, delays data usage. The additional need to ensure that a data ingestion pipeline is GxP-validated further increases this problem — in fact, it might even add an additional organizational unit to the mix! We would like the experts to be able to directly handle their data.”
Architectural goals become compromised. Meanwhile, as teams around the project grow, tensions emerge between the need to deliver a minimum viable product quickly and the longer-term vision — architecture and data flow can start to drift out of sync. The danger is that project focus begins to dwell on parts of the problem — collecting data, storing it, providing data transformation, harmonizing data from multiple sources — rather than on creating an R&D-focused, data-centric solution that can steward the whole life cycle of data, end to end.
This swiftly adds friction. In fact, it can make many important applications impossible. Two examples:
In both examples, merely storing the data, collecting the data, or providing data transformation is not enough. To yield benefits, these key operations need to be architected, implemented, tracked, and surfaced in a holistic way, targeting the end-to-end flow of those particular data sets.
Data is stripped of context, limiting utility. To further elaborate on the need to unify these data operations: data collected without adequate context may be meaningful to the scientists who recognize the file, but it is useless for search/query, post-facto analytics, and data science at scale. All of these depend on a data processing layer that “decorates” data with critical metadata.
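To make the “decoration” idea concrete, here is a minimal sketch. The field names are illustrative — not a real schema from any product — but they show the kind of context that turns an opaque raw file into a searchable, analyzable record:

```python
import hashlib
from datetime import datetime, timezone

def decorate(raw_bytes, *, instrument, method, scientist):
    """Wrap an otherwise opaque raw file in contextual metadata.
    Field names here are illustrative, not a real schema."""
    return {
        "sha256": hashlib.sha256(raw_bytes).hexdigest(),  # integrity / dedup
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "instrument": instrument,   # where the data came from
        "method": method,           # how it was produced
        "scientist": scientist,     # who produced it
        "size_bytes": len(raw_bytes),
    }

# The raw bytes alone answer none of the questions a search, audit,
# or ML workload would ask; the decorated record does.
record = decorate(b"...raw chromatogram...",
                  instrument="HPLC-01", method="assay-42", scientist="j.doe")
```

Without a layer performing this step systematically at ingestion time, each downstream consumer has to rediscover (or guess at) the same context by hand.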
The “impedance mismatch” between industry-agnostic, “horizontal” data solutions and biopharma R&D is amplified by the complexity of the R&D domain.
Scientific workflows are complex. A single small-molecule or biologics workflow can comprise dozens of sequential phases, each with many iterative steps that consume and produce data. As workflows proceed, they fork and repeat, and may transition among multiple organizational domains — involving different researchers, instruments, and protocols.
Biopharma R&D has the largest variety of instruments and software systems per user of any industry, producing and consuming complex, diverse, and often proprietary file and data types. Distributed research (e.g., collaboration with CROs and CDMOs) adds new sources, formats, standards, and validation requirements to every workflow. The result is research data locked into systems and formats that make it effectively impossible to validate, enhance, consume, or reuse.
Building effective integrations is EXTREMELY difficult. If an R&D organization builds its own R&D data platform using horizontal components such as Mulesoft, Pentaho, Boomi, Databricks, or Snowflake, it inevitably also needs to build and maintain all the integrations required to pull data from or push data to instruments, informatics applications, CROs/CDMOs, and other data sources and targets. This last-mile integration challenge can easily become a never-ending exercise — one where the work of creating and maintaining fully-serviceable integrations exceeds the capacity of biopharma IT organizations and distracts from other, more strategic and scientifically important work. For a closer look at technical and organizational requirements for engineering life sciences integrations, see our blog: What is a True Data Integration, Anyway?
Two strategies are often considered for managing this integration workload:
Neither of these two approaches treats connectivity or integration as a first-class business/product priority, meaning that DIY projects often bog down, failing to deliver ROI in reasonable timeframes.
In our next installment, we’ll discuss four critical requirements for untangling life sciences’ complex challenges around data, and show how fulfilling these requirements enables a solution that delivers benefits quickly, scales effectively, and enables refocusing on discovery vs. non-strategic, technical wheel-spinning. Delivering an effective R&D data cloud platform requires more than just IT/cloud know-how and coding smarts. It requires a partnership with an organization that has deep life sciences understanding, a disciplined process for integration-building, and a commitment to open collaboration with a broad ecosystem of partners.