Experimental Data in Life Sciences R&D — It’s How Many Copies of Jaws?!

July 22, 2021

What does replatforming to the cloud have to do with Jaws?

Last month, Tetra’s Chief Scientific Officer, Mike Tarselli, joined John Conway (Science and Technology Visioneering, 20/15 Visioneers) for a webinar on the growing momentum in life sciences R&D to replatform experimental data to the cloud. The two debated the challenges surrounding data generation and management, explored the need for future-proofed, data-centric approaches to digital transformation, and made the case for why FAIR (Findable, Accessible, Interoperable, and Reusable) data and processes have become de facto industry standards.

The webinar focused on digitalization and cloud replatforming initiatives, time spent on manual data wrangling and lab connectivity, and the reality of data FAIRness in life sciences R&D. In addition to an interactive debate between the speakers, audience participation was encouraged, and the responses provided key insights into the current state of attendees’ digital data initiatives and highlighted why implementing these initiatives now is more critical than ever.

The blog highlights these insights a little further down — you can also check out the webinar recording.

But before we dive too deep into the vast ocean, some context...

Data Explosion in Drug Discovery

“Big Data” was coined by John Mashey in 1987 to describe large volumes of information. As technology has advanced since the late 1900s (to my fellow Millennials, take a beat and let “late 1900s” sink in for a moment…), the velocity of data generation has exploded across industries, especially in healthcare and life sciences. As Dell EMC reports:

“Healthcare and life sciences organizations have seen an explosive healthcare data growth rate of 878% since 2015.”

It’s estimated that by 2025, over 175ZB (zettabytes) of data will be generated worldwide. It’s a number so big that wrapping your brain around it, and truly understanding just how big it is, is pretty challenging.

That’s why, when researching how big the “Big Data” explosion in life sciences really is, I was delighted to stumble upon the National Human Genome Research Institute’s fact sheet on Genomic Data Science, where they equate the volume of genomic data being generated to copies of the movie Jaws. Yes, Jaws, of “scaring children out of the ocean” fame. That one…


NHGRI states that a single human genome = 200GB of data = ~200 copies of Jaws. 

Genomic data, however, is only one piece of the puzzle. Experimental data generated in life sciences runs the gamut of small and large molecules, imaging, proteomics… the list goes on. However, connecting all these disparate data is a challenge due to the fragmented ecosystem of instrumentation and systems in life sciences, which perpetuates data silos. In turn, these data silos significantly impede innovation in R&D - oy vey.

As drug and therapeutic discovery moves more and more into precision medicine territory, other heterogeneous, disparate data will be thrown into the mix, for example:

  • Clinical trial data
  • EHRs (Electronic Health Records)
  • Smartphone and wearable data
  • Socio-demographic data

That’s a lot of data. 

A lot of data from a vast array of instrumentation, software systems, and informatics applications. 

A lot of data locked in vendor-proprietary data silos, requiring manual point-to-point integration. 

A lot of data that requires manual wrangling to standardize for more advanced analytics. 

So how do life sciences R&D organizations plan for and manage all the considerations around the V’s of Big Data and ensure the FAIRness of their data and processes, all while future-proofing their investments and accelerating time-to-market? 


We're Gonna Need a Bigger Boat


All this Jaws talk got us thinking - so we sharpened our pencils and did a bit of our own arithmetic. 

We asked: If 200GB of data = ~200 copies of Jaws, how many copies of Jaws = 175ZB of data?

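The answer: 175 trillion copies of Jaws. (At roughly 1GB per copy, and with 1ZB equal to one trillion GB, 175ZB works out to 175 trillion copies.)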

(Fun fact: 175 trillion copies of Jaws equals 8 quadrillion Great White Shark teeth. Give us a shout out the next time you win trivia at your local watering hole).
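
For anyone who wants to check our pencil work, here is a minimal Python sketch of the arithmetic. It assumes decimal (SI) units, where 1GB is 10^9 bytes and 1ZB is 10^21 bytes, and it takes the "200GB = ~200 copies" comparison at face value, so one copy is treated as roughly 1GB. The names (`copies_of_jaws`, `BYTES_PER_COPY`) are ours, purely for illustration.

```python
# Back-of-the-envelope check of the Jaws arithmetic.
# Assumption: decimal (SI) units, so 1 GB = 10**9 bytes and 1 ZB = 10**21 bytes.
GB = 10**9
ZB = 10**21

# From the comparison "200GB = ~200 copies of Jaws": about 1 GB per copy.
BYTES_PER_COPY = (200 * GB) // 200


def copies_of_jaws(total_bytes: int) -> int:
    """Return how many ~1 GB copies of Jaws fit in total_bytes."""
    return total_bytes // BYTES_PER_COPY


copies = copies_of_jaws(175 * ZB)
print(f"{copies:,} copies of Jaws")
```

Swap in a different estimate for the size of one copy and the total scales accordingly; the shark teeth are left as an exercise for the reader.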

Now, we’re not saying that all of the 175ZB of data generated will be experimental R&D data by 2025, but as data generation continues to explode into a massive tsunami, who’s to say that in the next 5 to 10 years, the number won’t be close to it?

Organizations need to ask themselves critical questions as they begin to map out and implement a future-proofed, holistic, data-centric framework to harness the power of experimental R&D data. Questions such as:

  • How, and more importantly where, will you store all of that data?
  • How do you plan for and find the resources necessary for storage and compute?
  • How will you access actionable, high-quality data while ensuring security and compliance and still facilitating collaboration?
  • How do you future-proof your investment today, tomorrow, months and/or years from now?

Data proliferation, the recent explosion in data creation and the efforts to store and manage the data, has become an all too familiar challenge in life sciences R&D as the volume of both structured and unstructured data continues to grow exponentially. 

So how can organizations within life sciences not only manage data proliferation, but actually benefit from the data explosion? 

The Cloud Replatforming Movement in Life Sciences R&D

Simple — use cloud-native technologies and solutions that address challenges with a “data-first” mindset and replatform experimental R&D data to the cloud.

As mentioned earlier in the post, we polled the audience to get a better understanding of their thoughts on the data landscape, data maturity, and digital data initiatives within their organizations... and we’re sharing some of the results below!

For deeper insights to sink your teeth into (har har) and to watch an interactive Q&A session, check out the webinar recording.

Has your organization implemented a digital strategy?

33% of the audience has only begun conceptualizing a digital strategy within their organizations, but has yet to implement one. 

There are many moving pieces involved in planning, building, and implementing a data-centric digital strategy for the long term. Not only are there technical considerations, but how do you also rally your organization and gain cultural buy-in? 

Where does the cloud fit into your strategy?

0% of attendees are committed to being totally on-prem and have no interest in replatforming to the cloud. 

Clearly, on-prem has “jumped the shark.” It’s costly, requires lots of resources to run day in and day out, and perpetuates data silos. Cloud-native solutions improve collaboration, reduce costs, accelerate innovation, and enhance security and compliance. The COVID-19 pandemic made it abundantly clear that facilitating collaboration across hallways, organizations, and the globe speeds discovery.

How much time do you currently spend creating/maintaining your own data ingestion/integration/connectivity?

70% of the attendees spend most, or almost all, of their time creating and maintaining data ingestion, integration, and connectivity.

What do you seek to gain from easily accessible and dynamic data? When the focus is shifted away from manual data wrangling, time can be spent focused on advancing scientific discovery — and isn’t that the whole point?

Do you understand the concept of a productized integration? Aware of the benefits?

58% stated that productized integrations are fundamental to a future-proofed data strategy. 

True connectivity in life sciences integrates the fragmented ecosystem of the latest technologies, solutions, software, instrumentation, and even systems still running on Windows 95. By standardizing and productizing integrations, the larger scientific community benefits from the time and resources repurposed to more meaningful scientific work.

A Tetra Data Integration has to reach a very high bar to even be considered: it must be able to automatically acquire, harmonize, centralize, and prepare the data, enable data provenance, and then ultimately push data back into determined targets, like ELN/LIMS systems. 

What percentage of the data in your organization are considered FAIR? And what percentage of your data can you access today?

0% of attendees said all of their data could be considered FAIR. And 0% have full accessibility to their data.

Ahh... FAIR data (Findable, Accessible, Interoperable, Reusable). We need it, we want it, we GOTTA have it.

Back in 2020, John guest-authored the blog post “The Science Behind Trash Data,” where he stated that up to 80% of data produced is trash - i.e. not FAIR. This shines a spotlight on the need not only to build foundational frameworks for data generation, but also to contextualize data and to automate the full R&D data life cycle (acquisition, harmonization, centralization, and preparation for data science and advanced analytics).

What is the benefit of data if it cannot be found, easily accessed, or reused?  


It’s clear the movement to replatform experimental data to the cloud is gaining serious momentum in life sciences R&D. Organizations are beginning to think strategically about their data initiatives and implement long-term, data-centric approaches to their scientific workflows - and the cloud plays a major role. Benefits of replatforming include future-proofed investments, accelerated time-to-market, enhanced security and compliance, and easier collaboration. 

However, the cloud can only close the gap so much - what is imperative is the organizational and cultural understanding and buy-in as to why replatforming now is so critical. Allow us to leave you with a thought:

“It’s time to take a stance on it. Start taking a real prescriptive approach on your data integrity, on your FAIRness, on your openness in terms of connectivity and yet, maintaining security, then we can all make this work.” - Mike Tarselli, Ph.D., MBA, Chief Scientific Officer, TetraScience

Don’t be scared of the ocean, jump in - we've got you.

Watch the full recording and interactive discussion on the "Why Life Sciences Organizations are Replatforming Their Experimental Data to the Cloud" webinar page.
