Biopharma’s DIY problem: 5 reasons why DIY for AI fails

December 21, 2023
Dr. Daniela Pedersen

The great AI race is here. In April of 2023, Microsoft, Google, Meta, and Amazon mentioned “AI” 168 times combined during their earnings calls. By July, Microsoft alone mentioned AI 175 times.

Despite its reputation as a late adopter, biopharma is responding. CEOs are banging the AI drum in earnings calls, on MSNBC (more on data42 later), and have even formed data-sharing coalitions to leverage AI capabilities.

From within the maelstrom of hype, it may seem like the only limitation on AI is our imagination. Organizations simply have to pluck AI experts from leading universities—find and purchase best-in-class AI technology—fund the right projects—and voilà, life-altering AI capabilities.

However, as organizations are discovering over, and over, and over again, bridging the gap between current capabilities and AI in biopharma comes down to how they manage their most precious asset—their scientific data. More specifically, AI models require large-scale, liquid, compliant, and engineered data to function. Many organizations see this challenge and roll up their sleeves. While admirable, this DIY mentality will lead to definitive competitive disadvantages. Allow me to explain.

What’s the deal with DIY?

Many biopharmas use in-house resources as opposed to commercial solutions to solve their data challenges—the most glaring being their need to prepare data for advanced analytics and AI applications. These DIY approaches tend to share characteristics across the industry. To start, at the bottom layer of the data pyramid, the data infrastructures are composed of point-to-point integrations for specific instruments and software applications. Contextualization and harmonization are less common but can be found within individual workflows.

DIY’s long-tail surprise

Taking a DIY approach to managing scientific data seems to make intuitive sense. At least initially. Budgets look agreeable, organizations have some internal expertise, and there is a perceived security that comes with control.

But as months and years go by, biopharma companies that have adopted a DIY approach find that establishing, maintaining, and updating data workflows is far more resource- and cost-intensive than expected. The brittle, point-to-point integrations are tedious to build and unable to scale. Maintaining and updating software, or integrating new instruments, is an extremely time-consuming and resource-intensive task. Meanwhile, validation and documentation require intense effort.

What’s more, as delays occur, there is no one to “blame.” Even if organizations hire the most brilliant data specialists on earth, data management and data transformation are new trades for most biopharmas, and each step of the data journey must be constructed, tested, and debugged from the ground up. The “plug-and-play” model many biopharma leaders imagine is simply not a natural part of developing a new technological infrastructure.

Even after investing significant time and effort in a DIY approach, organizations often experience disappointing results: The user experience is poor, and data remains trapped at the most basic layer of data maturity. The data is stuck in proprietary formats, illiquid, and subscale—it is in no way ready for AI. Engineering and preparing data for AI requires additional investment and additional projects. If an organization does devote internal resources to data engineering, timelines almost always exceed initial expectations and teams become frustrated. These hidden expenses give the costs of DIY a long tail that could last years.

The tyranny of now

Immediacy bias also plagues DIY projects. Given their limited scope and definition of success, DIY projects tend to favor current needs instead of creating future-proof systems. For example, a point-to-point integration might solve the problem of transferring a single instrument’s data to an electronic lab notebook (ELN). However, what happens when the lab wants to add a second, third, or fourth instrument? What happens when someone wants to analyze data across the lab, across groups, or compare suites of like instruments from different vendors? Suddenly the point-to-point paradigm breaks down. Operating with this relatively myopic viewpoint often forces companies to update entire systems as new technology, software, or data protocols evolve.
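The breakdown described above is combinatorial: wiring each instrument directly to each application means the number of integrations to build and maintain grows with the product of the two counts, while a centralized data layer grows only with their sum. A minimal sketch of that arithmetic (the instrument and application counts are hypothetical, for illustration only):

```python
# Illustrative sketch: how integration counts scale under point-to-point
# wiring versus a central data layer. Counts below are hypothetical.

def point_to_point(instruments: int, applications: int) -> int:
    """Each instrument wired directly to each application."""
    return instruments * applications

def central_layer(instruments: int, applications: int) -> int:
    """Each system connects once to a shared data layer."""
    return instruments + applications

# One instrument, one ELN: the two models look equivalent...
print(point_to_point(1, 1), central_layer(1, 1))    # 1 vs. 2
# ...but a modest lab quickly diverges.
print(point_to_point(20, 10), central_layer(20, 10))  # 200 vs. 30
```

The gap only widens as labs add instruments, which is why a paradigm that looks economical for the first integration becomes untenable at scale.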

What does DIY failure look like?

DIY failure is all too common. Just look for the familiar pattern: Data projects greatly exceed their initial scope, fail to deliver promised data capabilities, and produce only rigid datasets that aren’t ready for AI.

One top 20 biopharma we spoke with reported that three years into a DIY data initiative, the company had only succeeded in integrating 20–25 percent of its total instrument base, despite costs continuing to balloon past initial estimates. And this project was for relatively straightforward integrations.

Things become more complicated on larger AI-based initiatives that attract nine-figure budgets, promise world-altering results, and still struggle to deliver. Take data42, for example, a Novartis project initiated at the end of 2019. The project’s goal was to leverage 2 million patient-years of data with AI technology. But the day-to-day work would include a massive amount of data management.

The nature of the work wasn’t a surprise either. In an early press release, Peter Speyer, the lead of product development, said “All of those data need to be cleaned and curated to make them machine-learnable. This is hard and cumbersome work, but it frees up our data scientists to focus on answering questions with data.” 

Fast forward to 2023: Reports have emerged that Novartis severely curtailed the data42 project. Why? There has been no official reporting. However, Achim Pleuckebaum, data42’s former Head of R&D, provided a few key learnings as he exited the company and project. His first piece of advice: “Do not try to build it yourself.”

5 drawbacks to DIY

So why do DIY scientific data solutions fail in modern biopharma companies? It comes down to five problems:

1. Inability to produce AI-native data

Using AI in biopharma requires large-scale, liquid, engineered, and compliant data for optimal performance. Creating that data calls for significant processing, contextualization, and centralization. DIY initiatives typically do not produce data with all these qualities—and if they do, the cost/benefit ratio is poor. DIY teams are generally solving short-term problems, such as integrating lab instruments and informatics applications, instead of laddering up to higher-order problems such as AI enablement.

2. Significant investment of time and capital

Creating, validating, and thoroughly documenting point-to-point integrations is extremely labor intensive and requires a unique combination of technical expertise and collaboration with vendors across the scientific data ecosystem. Both factors contribute to long timelines and high costs. Furthermore, once integrations are established, moving data within and between research, development, and manufacturing quality assurance (QA)/quality control (QC) workflows still requires manual data transfers, transformation, and QC. These processes demand sophisticated scientific expertise and significant resource investment—and they are still error prone.

Organizations incur additional expenses as internal teams continuously maintain and update integrations and data transformation workflows for new software versions and instruments.

3. Lack of flexibility

IT project-based point-to-point integrations yield static, complex, and rigid data architectures. These architectures result in scattered, subscale data stores that house raw, fragmented data sets. The architectures are not future-proofed. Their lack of flexibility creates a poor user experience in the short term and prevents companies from leveraging AI in the long term.

4. Limited application options

If a DIY project leaves files trapped in proprietary formats and scattered data silos, it doesn’t matter how effective analytical applications are—they will not be able to function optimally. Without large-scale, liquid, engineered, and compliant data, organizations won’t be able to leverage best-of-breed analytics and AI applications, greatly reducing their ability to innovate. 

5. Poor scalability

The DIY model creates an “n of 1” internal customer business model. IT teams create, test, and validate individual data solutions for each workflow or team. With every integration and workflow augmentation being built from scratch, companies establish best practices at an organizational level as opposed to an industry level. The outputs of these projects cannot be reused, and every new instrument or software integration must be built, tested, and validated anew. This makes it impossible to achieve economies of scale. Companies certainly cannot create the kind of scientific AI factory needed to produce large-scale datasets for AI applications.

Bridging the gap

The most important asset for biopharmas in the next decade is AI-native data. Yet biopharma organizations are struggling to bridge the gap between where they are today and where they need to go. DIY approaches have failed, and will continue to fail, to produce the data that AI requires. Biopharmas need a partner that understands how companies can increase their data maturity so they can leverage AI, accelerate time to market, and deliver better medicines to patients faster.

TetraScience is the only company dedicated to transforming biopharma’s data into AI-native Tetra Data. See how we’re approaching the AI-readiness data problem in our latest white paper, “The Scientific AI Gap.”