Fundamental challenges of scientific data

January 19, 2024
Spin Wang

Biopharma organizations are poised for a generational paradigm shift. It will disrupt the industry, just as Amazon disrupted the traditional retail bookstore industry and Uber disrupted the taxi industry. 

Driving this change are the surge in computing power, widespread availability of cloud services, advancements in artificial intelligence (AI) and machine learning (ML) tools, shifts in the regulatory landscape, and Eroom’s law. Pharmaceutical and biotech companies will be able to achieve new breakthroughs and significantly enhance operational efficiency—if they can fully capitalize on emerging technologies.

Given that, many organizations have already begun to:  

  • Shift focus from tangible experiments to data as the core IP asset
  • Digitize and automate data flow
  • Replatform data to the cloud
  • Leverage analytics and AI/ML for breakthroughs

Each of these major trends centers on scientific data. And to succeed with these initiatives, organizations must properly replatform, engineer, and then leverage scientific data for analytics and AI. 

These activities are complicated by several unique structural factors related to scientific data, which result from the nature of the scientific industries. Let’s unpack these structural factors and see how they pose challenges to pharmaceutical and biotech organizations. 

What is scientific data?

The fundamental building blocks of scientific activities are experiments. Scientists design an experiment to investigate a hypothesis or gather evidence. In the life sciences, scientists typically create samples for testing and then perform experiments to test those samples. They subsequently collect the data and analyze it. This process can be represented by the design-make-test-analyze (DMTA) cycle:

  1. Design: Experimental design is typically recorded in an electronic lab notebook (ELN), defining experimental conditions and identifying the protocol to follow. The design is often informed by analysis of results from previous experiments or by data provided by other teams. Scientists might use computational tools to speed up the design process. 
  2. Make: Scientists prepare samples for testing using various instruments. These can include bioreactors for controlled growth of cells, purification instruments for isolating proteins or analytes, and liquid handlers for precise sample preparation.
  3. Test: Scientists may use instruments to perform the experiments, often leveraging partners such as a contract development and manufacturing organization (CDMO) or contract research organization (CRO) to perform in vivo or in vitro studies. Instruments are set with detailed parameters to ensure accurate testing, and in silico (computational) methods might also be employed to simulate tests or assess performance.
  4. Analyze: Scientists use a variety of analysis tools, including the same software used to perform the experiments as well as specialized applications. The results are then integrated with data from other stages of the DMTA cycle as well as historical data to make decisions.

Each step includes a wealth of associated information—either input or output. This collective set of information is defined as “scientific data.” Irrespective of the therapeutic area or target pathway, small-molecule discovery teams repeat a common workflow, the DMTA cycle, to optimize their identified hits toward clinical candidates. Other modalities such as biologics and cell and gene therapy (CGT) use similar versions of DMTA.

Inherent complexity of scientific data

Scientific data is inherently complex. Each step in experimental workflows—whether for the development of small molecules or biologics—generates data in some form.

For example, scientific data is generated by lab instruments, such as spectrometers, plate readers, liquid chromatography systems, and electron microscopes. It might represent the results of experiments and assays, outputs of computational algorithms, and observations.

Instrument data follows proprietary formatting designed by the manufacturer. There are thousands of vendor-model combinations, and for each combination, data can be produced in vastly different formats or structures depending on configuration, operating mode, and settings. For example, Molecular Devices has nine plate reader models and BMG Labtech has six, and multimode plate readers can have three or more modes of operation. If a biopharma works with 12 vendors, each offering multiple models and detection modes, the result could easily exceed 200 unique data formats. And data from these instruments is roughly tripling annually.
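A quick back-of-the-envelope sketch makes the combinatorics concrete. The vendor names and counts below are purely illustrative assumptions, not an inventory of any real lab:

    # Hypothetical plate reader fleet: vendor -> (number of models, detection modes per model).
    # Counts are illustrative only; real fleets vary widely.
    fleet = {
        "Vendor A": (9, 3),
        "Vendor B": (6, 3),
        "Vendor C": (4, 2),
    }

    # Each (vendor, model, mode) combination can emit its own file layout.
    unique_formats = sum(models * modes for models, modes in fleet.values())
    print(unique_formats)  # 53 for just three vendors

    # Scaling the same arithmetic to 12 vendors at roughly 6 models x 3 modes each
    # lands above 200 distinct formats, in line with the estimate above.
    print(12 * 6 * 3)  # 216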

There are also additional sources of data. For example, organizations collect data during other in vitro and in vivo experiments, as well as from modeling experiments conducted in silico.

Moreover, different scientific techniques add to the complexity of data. Properly describing the physical, chemical, and biological properties of compounds, proteins, cells, or any drug product requires special expertise and scientific training. The data generated from these techniques is equally demanding and complex to understand and interpret. 

There are many types of scientific data, including the following (a short code sketch after this list illustrates a few of these shapes):

  1. Time-series data: Bioreactors report conditions and their change over time as time-series data. 
  2. Audit trails and logs: Audit trails and instrument logs are important because they record what actually happened during the experiment, when it happened, and who performed each action.
  3. Multi-dimensional numerical data: Instrument techniques such as chromatography, mass spectrometry, flow cytometry, and X-ray crystallography very often produce multi-dimensional numerical data.
  4. Long-term studies: High-throughput screening (HTS) and stability studies take place over weeks, months, or even years, and outcomes are often derived across the full dataset.
  5. Wells and cells: Microplate assays may generate data in numerous formats. For instance, each well can be measured to produce a single-value readout (e.g., absorbance). Cells in each well can be imaged for high-content analysis. Or fluorescence may be measured over time, as in quantitative polymerase chain reaction (qPCR) assays.
  6. Unstructured text: Scientific data often includes unstructured text, such as notes taken by scientists during the experiment, assay reports created by CROs/CDMOs, and notes recorded in the ELN. 
  7. Key-value pairs: Key-value pairs are common when describing the method or context of the scientific data. Such data may be highly unstructured and inconsistent between vendors and assays due to the lack of widely adopted schema, taxonomy, and ontology conventions.
  8. Proprietary file formats: There are plenty of vendor proprietary file formats, binary data blobs, or images that can only be read in the software in which the data was generated. 
  9. Reports: In a more distributed and collaborative ecosystem, an increasing portion of reports and data are curated by humans. Formats and ontologies may vary widely depending on the intended recipient, including submissions to regulatory authorities, company reports, and external communication.
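To make this heterogeneity concrete, here is a minimal Python sketch of three of these shapes side by side: a bioreactor time series, a single-readout microplate, and key-value method metadata. All values and field names are invented for illustration and do not reflect any vendor's actual output:

    import numpy as np
    import pandas as pd

    # 1. Time-series data: bioreactor conditions sampled once per minute (illustrative values).
    bioreactor_ts = pd.DataFrame({
        "timestamp": pd.date_range("2024-01-19 09:00", periods=3, freq="min"),
        "temperature_C": [36.9, 37.0, 37.1],
        "pH": [7.02, 7.01, 7.00],
    })

    # 2. Wells and cells: a 96-well plate of single-value absorbance readouts,
    #    naturally an 8 x 12 matrix rather than a flat table.
    plate_od450 = np.random.default_rng(0).uniform(0.05, 1.2, size=(8, 12))

    # 3. Key-value pairs: method context with no agreed schema; another vendor
    #    might express the same wavelength as "WL [nm]": 450.
    method_metadata = {
        "Instrument": "Plate reader (model unspecified)",
        "Read Mode": "Absorbance",
        "Wavelength": "450 nm",
        "Operator": "jdoe",
    }

Even in this toy example, the three objects call for different storage, indexing, and search strategies, which is why a single flat schema rarely fits scientific data.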

Complex scientific workflow

Scientific workflows are also highly complex. Hundreds or thousands of individuals might participate in key tasks from discovery to production. These activities span all phases from discovery to commercialization, including many that conform to Good Laboratory Practice (GLP) and Good Manufacturing Practice (GMP) regulatory requirements.

The flow charts below generalize the workflows for small molecules and biologics. As these diagrams show, even the shortest workflow might include over 40 steps.* However, it’s important to note that these flowcharts don’t fully capture the interdependent network that supports these campaigns. That network comprises equipment and informatics applications; inventories of lab consumables; external CRO/CDMOs; supply chains; and the data systems that store, integrate, and contextualize the experiments so companies can focus on making decisions to advance treatments.

4D Process map for biologics therapeutic development. Source: National Institutes of Health (NIH).
4D Process map for small molecule therapeutic development. Source: National Institutes of Health (NIH).

*Can’t count them all? Here’s a sample small molecule workflow: Hypothesis generation, disease pathophysiology (molecular pathway, animal models, target biology), assay development, HTS (libraries, cheminformatics), hit prioritization, hit validation, lead optimization, in silico modeling, medicinal chemistry, process chemistry, safety screening, non-GMP manufacturing, early Chemistry Manufacturing and Controls (CMC) (Quality Control (QC)/ Quality Assurance (QA)), non-GLP in vitro/vivo toxicity, non-GLP pharmacology, cGMP manufacturing, GLP tox (Pharmacokinetics (PK) / Pharmacodynamics (PD)), long-term toxicity, Investigational New Drug (IND) filing, late CMC (formulation, stability studies, pharmaceutical tech), Institutional Review Board (IRB), recruitment, informed consent, clinical trials phase I, II, III; data review, translational pharmacology, New Drug Application (NDA) submission, approval, pricing strategy, insurance/access advocacy, standard of care, patient practice, phase IV (post-approval studies), reporting, new safety trials/observational cohort studies.

If these workflows weren’t complex enough, each step may be replicated, generating multiple datasets. Or a process might be performed by multiple teams, using different data standards or legacy instrumentation. Moreover, transitions from one step to another can introduce additional complexities, such as geographic or language barriers, skill gaps, procedural changes, and differences in documentation practices. 

With so many actors producing data, including biologists, modelers, data scientists, chemists, analysts, and pharmacologists, it's easy to see how data silos arise and metadata inconsistencies multiply. These problems trap results forever in non-convertible PDFs, PowerPoint decks, and legacy databases.

In the complex workflow described above, data is produced and resides everywhere. Here’s just a sampling of where your scientific data might be locked:

  • On-premises Windows systems with local control software and outmoded operating systems
  • Processing/analysis software
  • A thick-client ELN, laboratory information management system (LIMS), scientific data management system (SDMS), or compound registration system
  • Methods defined in chromatography data system (CDS) software or internal repositories
  • PowerPoint, Excel, Word, PDF assay reports, and email attachments
  • Domain-specific software such as GraphPad Prism or PerkinElmer ChemDraw

Fragmented scientific vendor ecosystem

The complexity of scientific data and workflows is amplified when you overlay the fragmented instrument vendor ecosystem. Biopharma organizations use a large number and wide variety of instruments and related software applications. Each scientist might use numerous software systems, especially in R&D departments. And each of those software systems might produce a distinct data format. 

Take, for example, the test phase within the DMTA cycle. This phase could involve a range of new discovery technologies, such as DNA-encoded libraries, organoid screens, virtual libraries, and lab-on-a-chip. These technologies exponentially increase the number of samples tracked, properties predicted, and assay metadata associated with a given project. Each of these systems adds unique data formats into the ecosystem. 

To make matters even more complex, each type of instrument may be offered by numerous vendors, each with its own proprietary data format. In other words, data formats can differ across various types of instruments, and even among instruments of the same type.

For example, there are over a dozen vendors and file types for a mass spectrometer:

Company | Extension | File type | Processing packages
Agilent, Bruker | .D (folder) | BAF / YEP / TDF data format | Agilent MassHunter, Chemstation, MaxQuant
Agilent, Bruker | .YEP | instrument data format |
Bruker | .BAF / .FID | instrument data format |
Bruker | .TDF | timsTOF instrument data format |
ABI, Sciex | .WIFF | instrument data format | Analyst, BioAnalyst
ABI, Sciex | .T2D | 4700 and 4800 file format |
Waters | .PKL | peak list format | MassLynx
Waters | .RAW (folder) | | Waters MassLynx
Thermo Fisher Scientific, PerkinElmer | .RAW | raw data | Xcalibur, TurboMass
Chromtech, Thermo Fisher Scientific, VG | .DAT | ITDS file format | MassLab, MAT95
Thermo Fisher Scientific | .MS | ITS40 instrument data format |
Shimadzu | .QGD | GCMS solution format | LabSolutions Insight, Open Source Analytical
Shimadzu | .QGD | instrument data format |
Shimadzu | .LCD | QQQ / QTOF instrument data format |
Shimadzu | .SPC | library data format |
Bruker | .SMS / .XMS | instrument data formats | ACD Labs, MS Workstation
ION-TOF | .ITM | raw data | SurfaceLab
ION-TOF | .ITA | analysis data |
Physical Electronics, ULVAC-PHI | .RAW | raw data | Quantes
Physical Electronics, ULVAC-PHI | .TDC | spectrum data |

File-type lists for more specific applications, such as proteomics (Mol Cell Proteom, 2012), would contain hundreds of entries. And note, this does not include the myriad of processing packages such as MaxQuant, Byos, and Virscidian that often control highly specific workflows in “omics,” protein science, or high-throughput experimentation.
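One common coping strategy is a thin routing layer that at least recognizes which vendor ecosystem a raw file likely came from before handing it off to a converter or vendor tool. The sketch below is only illustrative: the extension hints are drawn from the table above, the function name is made up, and it deliberately exposes the limits of the approach, since extensions such as .RAW and .D are reused by multiple vendors and the payloads remain proprietary binaries:

    from pathlib import Path

    # Extension hints drawn from the table above. Note the collisions: .RAW alone can mean
    # Waters, Thermo Fisher Scientific/PerkinElmer, or Physical Electronics/ULVAC-PHI data.
    EXTENSION_HINTS = {
        ".wiff": ["ABI/Sciex"],
        ".baf":  ["Bruker"],
        ".tdf":  ["Bruker (timsTOF)"],
        ".raw":  ["Waters", "Thermo Fisher Scientific", "PerkinElmer", "Physical Electronics/ULVAC-PHI"],
        ".d":    ["Agilent", "Bruker"],
        ".qgd":  ["Shimadzu"],
    }

    def candidate_vendors(path: str) -> list[str]:
        """Guess which vendor ecosystems a raw mass spec file might belong to."""
        suffix = Path(path).suffix.lower()
        return EXTENSION_HINTS.get(suffix, ["unknown vendor"])

    print(candidate_vendors("plasma_run_042.raw"))
    # ['Waters', 'Thermo Fisher Scientific', 'PerkinElmer', 'Physical Electronics/ULVAC-PHI']

Even a perfect router only identifies the likely source; actually reading the contents still requires the vendor's software, an SDK, or conversion to an open format, which leads directly to the lock-in problem described next.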

What makes fragmentation so much worse: vendor lock-in

In addition to the challenges created by a fragmented scientific vendor ecosystem, common vendor practices also complicate data workflows. Vendors often lack the motivation to be compatible with each other. In fact, they are usually incentivized to do the opposite—that is, trap their users within the vendor’s ecosystem. As a result, vendor lock-in has become prevalent across the industry.

Lock-in occurs when customers can’t use their scientific data with products outside a vendor’s suite of offerings. Although customers generate the data in their own labs with instruments and software they paid for, vendors often behave as if they have the right to dictate how customers use this data.

In this situation, vendors effectively hold customer data hostage. Scientific data generated by instruments and software applications can be difficult, perhaps impossible, to move outside the vendor's walled garden. Common lock-in practices include:

  • Proprietary data formats: Typically instruments generate data in proprietary formats (e.g., binary) that can only be used by software sold by the same vendor. This approach limits the ability to read and process data using third-party tools.
  • Lack of data export options: Vendors frequently do not provide a programmatic way to export data. There may be no software development kit (SDK), application programming interface (API), or documentation on how to export the data into non-proprietary formats or consume the scientific data programmatically. Often the only option for scientists is to manually transcribe data onto paper, which rarely captures all the results and settings related to the data.
  • Restrictions on data extraction: In some cases, scientific data is contained in a file and cannot be extracted. Even if data is provided in clear text, the file content might lack a predefined structure or vary significantly based on the instrument model and method settings.
  • Obstacles to external integrations: Instrument and lab informatics software vendors often don't support integrations with external products, or they allow access to only some of the data. They may charge a fee (read: a toll) to export the data. Or they can outright block integrations through technical means or draconian licensing terms that restrict API usage for certain purposes.
  • Prioritizing integrations within the vendor’s ecosystem: Even when vendors support external integrations, they usually deprioritize them in favor of their own products or those of affiliated companies. This bias can mean significant delays in fixing bugs, ensuring compatibility with new software versions, or leveraging new features in the software.

Vendor lock-in restricts how biopharma companies can use their own scientific data. It limits the freedom to use the best tools on the market, regardless of manufacturer or developer. It also creates data silos, where organizations struggle to know what data they have or how it was generated. Aggregated analysis across instruments, experiments, or studies becomes nearly impossible. And, importantly, scientific data held captive in proprietary formats or software can’t be engineered and used for advanced analytics and AI. The data hits a dead end; its value is severely capped. This all has a deep impact on the decision making, innovation, and competitiveness of a company. 

Unfortunately, there is little incentive for vendors to change. Instrument manufacturers and application vendors want to sell more hardware or licenses. If data becomes “stuck” in an instrument’s control software, it may limit a scientist’s ability to leverage popular data science tools such as Jupyter Notebooks or Spotfire, but it won’t necessarily affect—and could potentially bolster—the sales of a specific instrument or software.

In fact, as vendors continue to have success within particular applications, the strategy to bind scientists to a specific vendor's ecosystem has become an ingrained “way of working” across the industry. So, even as biopharma companies recognize the need to aggregate, standardize, and harmonize scientific data across all lab instruments, informatics applications, and analysis tools, data lock-in has remained a key goal for many scientific vendors. 

In sum, while instruments and their control software are often powerful scientific tools, they are not designed to be enterprise data systems. They are not optimized to interface with other systems or to enable data access. And sometimes, they are even developed to withhold these capabilities, essentially trapping biopharma organizations’ data in walled gardens.

Science is getting more distributed  

The biopharma industry has increasingly embraced outsourcing to contract organizations as a strategic approach to streamline operations, reduce costs, and accelerate drug development. This trend is underscored by the fact that most global biopharmaceutical companies nowadays work with 10 to 30 CRO/CDMO providers. 

The shift towards external collaboration presents significant challenges, particularly in managing and maintaining the integrity, traceability, completeness and timeliness of scientific data. Contract organizations (CxOs) work with multiple biopharma companies and must track the data of each. Operating under tight deadlines, CxOs usually provide results in formats that are easy and fast to generate, such as Excel or PDF reports. They overwhelmingly use general-purpose tools like email or file-sharing platforms (e.g., Egnyte or Dropbox) to send the data. These technologies, while sufficient for basic data transfer, fall short of meeting the needs of biopharma scientists. The reasons include:

  • Metadata and information loss: Traditional communication channels fail to provide the full scientific context of data—such as how, when, and where it was collected, processed, and analyzed. Raw data and metadata are often omitted in data exchanges. Incomplete information often leads to incomplete insights. Furthermore, ambiguous or incomplete results may trigger the need for additional emails or meetings, causing delays.
  • Inconsistent data formats: Reports and spreadsheets can vary widely—between CxOs and even from the same provider—because the industry lacks standards. Much data resides in unstructured text. Biopharma scientists have to spend substantial time interpreting, reformatting, and consolidating the data. Without this effort, the data would remain siloed with limited value. 
  • Lack of sample lineage: CxOs rarely provide a complete history of sample handling and processing. Partial information about sample tracking, if available, may be buried in a report or spreadsheet. In addition, CxOs and biopharma companies may assign different unique identifiers to the same sample, making it harder to integrate data from the two parties.
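As a minimal illustration of that last point, consider reconciling a CxO's results with an internal sample registry when the two parties use different identifiers. The sketch below is hypothetical (all column names and IDs are invented) and assumes someone has already curated a cross-reference table, which in practice is often the hardest part:

    import pandas as pd

    # Internal registry keyed by the sponsor's sample IDs (illustrative).
    internal = pd.DataFrame({
        "sponsor_sample_id": ["S-0001", "S-0002", "S-0003"],
        "lot": ["L23-07", "L23-07", "L23-09"],
    })

    # CxO report keyed by the contract lab's own IDs, typically arriving as Excel.
    cxo_results = pd.DataFrame({
        "cxo_sample_id": ["ABC-101", "ABC-102", "ABC-103"],
        "potency_pct": [98.2, 97.5, 99.1],
    })

    # The manually curated cross-reference that makes the join possible at all.
    id_map = pd.DataFrame({
        "sponsor_sample_id": ["S-0001", "S-0002", "S-0003"],
        "cxo_sample_id": ["ABC-101", "ABC-102", "ABC-103"],
    })

    merged = internal.merge(id_map, on="sponsor_sample_id").merge(cxo_results, on="cxo_sample_id")
    print(merged)

Without that cross-reference, a potency value cannot be traced back to a lot, which is exactly the lineage gap described above.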

Given the complexity of scientific data, simple tools for data exchange, like email, are woefully inadequate for CxO-biopharma collaboration. Continuing the old paradigm will lead to the following drawbacks.

  • Low efficiency and speed: The inefficiencies in data communication and management can slow down the drug development process, negating some of the intended benefits of outsourcing. The lack of real-time feedback hinders collaboration.
  • Uncertain data integrity and quality: Issues with metadata loss, sample lineage, and data siloing can compromise the quality and integrity of the scientific data, which is crucial for drug development and regulatory approval.
  • Higher costs: Inefficient data management can lead to increased costs due to data rework, delays in project timelines, and potential regulatory setbacks.

Cultural habits of scientists 

The typical scientific mindset is curious, logical, and driven by discovery. Scientists understand that proper experimentation requires patience. However, business pressures such as tight deadlines and demands for assay robustness can influence their practices in the lab. In the name of expediency, and sometimes at the expense of rigor, scientists adopt habits that affect how scientific data is handled, and documentation is often treated as a burden that adds no value.

  • Experiment first, record second: Scientists commonly conduct experiments before documenting them. The scientific process is iterative and often involves repeating and refining assays, whereas recordkeeping is more routine. Scientists may try to speed experiments by delaying documentation until a study is complete. However, this approach can compromise data quality (see next).
  • Documentation backlogs: Prioritizing experimentation often causes a backlog in documentation. As time passes, data and information accumulate and memories fade, which can lead to incomplete, unannotated, or low-quality data.
  • Transient storage: In many labs, data is temporarily stored in on-premises network folders, or even USB thumb drives, en route from an instrument to the ELN. These transient storage solutions create more silos within the organization, jeopardizing data accessibility and integrity.
  • Informal data recording: Even critical data, such as sample IDs and measurements, is often scribbled on Post-It notes, disposable gloves, or the sash of a fume hood. This informal recordkeeping risks data loss or misinterpretation, lacks contextual information, and falls short of good laboratory and scientific practice.
  • Data gatekeeping: Scientists may fail to record experimental work if they doubt its accuracy or relevance. Results from failed experiments, which can offer valuable insights, might be neglected.
  • Narrow documentation focus: When recording an experiment, scientists may omit comprehensive details about the experimental context, instrument parameters, or raw data, especially if they seem irrelevant to the immediate results. This short-sighted documentation diminishes the long-term value of the data.
  • Varied data formats: While routine tests may follow regimented standard operating procedures reflected in an ELN template, research and development work often relies on free-form spreadsheets and documents. As a result, data interpretability can vary widely across teams, departments, or organizations. Even a single individual can be inconsistent in their own documentation.
  • Inconsistent naming: Even diligent scientists may fail to conform to proper naming conventions. Consequently, the data becomes harder to retrieve, limiting its reuse and future utility.

These practices, often adopted as time-saving measures or workarounds, pose significant challenges to accessing and using scientific data. While such issues are typically better controlled in regulated GxP environments, they remain common in many research-focused labs. Without the automation and incentives provided by comprehensive data management, scientists will naturally handle and organize scientific data in ways that may diminish its impact and value.

Summary

Scientific data within biopharma companies holds some of the most valuable insights in the world. But this data is complex, resides in many forms, and is generated through equally complex scientific workflows. To make matters worse, it’s often fragmented and locked away in proprietary data formats and software, may originate from external sources, and is heavily shaped by the scientists who generate it. The complexity of both scientific data and the ecosystem that produces it significantly limits the effectiveness of analytical tools and AI, degrading the quality of insights and lengthening time to insight across biopharma organizations.