The Science Behind Trash Data

John F. Conway | July 23, 2020

The Suffering Science and the Missed or Delayed Discoveries Buried Beneath Trash Data


Author: John F. Conway, Chief Visioneer Officer, 20/15 Visioneers

Trash Data, or Dark Data, is collected data that is never touched again or given a secondary use. We are not talking about tracking data, ancillary system data, and the like. We are, unfortunately, talking about R&D data that has come from experimentation and testing, both physical and virtual. This means individuals and organizations are not taking advantage of key decision-making data and information, and are missing very valuable insights.

In many R&D organizations, Trash Data has accumulated and can account for upwards of 80% of their known data. (1), (2)

The interesting part of this conundrum is that it's both structured and unstructured data! If data isn't properly contextualized, it's probably not going to be reused; hence, it becomes Trash Data. So don't be fooled into thinking that simply pushing mis-contextualized or uncontextualized data to the cloud solves the problem, because you will now have Trash Data in the cloud!


Unfortunately, in the year 2020, this is a major problem for R&D organizations. It is telling you that either you have no Scientific Data and Process Strategy, or, even worse, you are not following the one you have written and committed to following! It is also telling you that some of your AI/ML strategies are going to be delayed until the organization solves its major data and process problems. Model Quality Data (MQD), and lots of it, is needed for AI/ML approaches in R&D. And remember, your processes produce the data, so they go hand in hand. Data generation processes need to be formulated and established during the initial step of any data project. To mitigate the risk of being inundated with Trash Data, a foundation with strict requirements for data production and integration needs to be established. “Data plumbing” is a critical step that will ensure your properly contextualized data is routed and deposited in the right location.
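To make the idea of strict, enforced data production requirements concrete, here is a minimal sketch in Python of the kind of capture-time gate a “data plumbing” layer might apply. The required fields and the dataset shape are illustrative assumptions, not a description of any particular product.

```python
# Illustrative sketch only: field names and the dataset shape are assumptions.
from dataclasses import dataclass, field

# Context every captured dataset must carry before it is accepted downstream
REQUIRED_CONTEXT = {"project", "study", "instrument_id", "method", "operator", "timestamp"}

@dataclass
class CapturedDataset:
    payload_uri: str                      # where the raw file actually lives
    metadata: dict = field(default_factory=dict)

def missing_context(ds: CapturedDataset) -> list:
    """Return the required context fields the dataset is missing."""
    return sorted(REQUIRED_CONTEXT - ds.metadata.keys())

# Example: a dataset captured with only partial context
ds = CapturedDataset(
    payload_uri="s3://lab-raw/plate-0042.csv",
    metadata={"project": "ONC-17", "instrument_id": "HPLC-03"},
)
gaps = missing_context(ds)
if gaps:
    # Send it to a remediation queue instead of quietly creating Trash Data
    print("Capture rejected, missing context:", gaps)
```

The point is not the specific fields but the behavior: data that arrives without its context is stopped and fixed at the source, not discovered years later as Trash Data.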

Another major issue is what to do with legacy data! Based on what was just discussed, my experience has shown that ~80% of legacy data is not worth migrating. The best strategy is probably to leave it where it is and connect to it with your new data integration platform. You can then determine the value of the legacy data by performing a data assessment through the new platform.

So, how has this happened? How did we arrive at this place where 80% of our hard-won R&D data ends up as trash? The truth is that it comes down to human behavior and discipline. Writing, agreeing on, and committing to a Scientific Data and Process Strategy is step number one. However, to take a written agreement and turn it into standard processes that are embedded within the organization, you need a company culture that includes true leadership and a focus on REPRODUCIBLE science. Tactically, it starts with data capture and storage principles. You need a Data Plumber!

R&D organizations are like snowflakes - no two are identical - but there is much overlap in process and in the types of data generated! Variation is the real problem: instrument types and varied lab equipment with or without accompanying software (e.g., CDS - chromatography data systems), ancillary scientific software like entity registration, ELN (electronic laboratory notebook), LIMS (laboratory information management system), SDMS (scientific data management system), exploratory analysis and decision support tools, Microsoft Office tools, and the list goes on. Hopefully, you have consistent business rules for managing and curating your data, but chances are this varies as well. What you, unfortunately, end up with is a decidedly unFAIR (see FAIR data principles) data and process environment.

Why did you, a mature R&D organization, end up in this position? (Startup Biotechs – BEWARE! Learn from the mistakes of those who have gone before and don’t let this happen to you!) (3) It may be hard to deconvolute, but I think, and this can be up for debate or confirmation, that the mindset of “data and processes as an asset” got lost somewhere along the way. Perhaps management and others became impatient and didn’t see a return on their investment. Poorly implemented scientific software tools that were designed to help with these problems compounded the situation. In some cases, environments were severely underinvested in. Taking shortcuts in the overall data strategy, without establishing the other foundational steps or processes, is like building a house without a proper foundation. At first, this leads to sagging windows and doors and a poor living experience; eventually, the foundation-less house collapses or needs to be demolished. In other cases, churn and turnover in different IT/Informatics and business functions created a “kick the problem down the road” situation. Many times, the “soft” metrics were never gathered: the repeated experiments, and the time wasted trying to find things or make heads or tails of poorly documented data and experiments. At the end of the day, humans worked harder instead of smarter to make up for the deficiencies.

Imagine you can start to solve this problem both strategically and tactically. Tactically, it starts with business process mapping and completely understanding your processes. Understanding that detail is an insurance policy for much better outcomes on your journey. As discussed, strategically you need a sound, written Scientific Data and Process (SD&P) Strategy that the organization can follow. Tactically, you need to capture your structured and unstructured data in both raw and processed forms, and it must be properly contextualized; in other words, you need to execute on your SD&P Strategy. Make sure you are adding the right metadata so that your data is relatable and easily searchable.
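As a small, hypothetical illustration of what “properly contextualized” can look like, the record below keeps both the raw and processed forms of an experiment together with searchable metadata. The field names and values are assumptions made for the example, not a prescribed schema.

```python
# Hypothetical contextualized experiment record; all field names are illustrative.
experiment_record = {
    "experiment_id": "EXP-2020-0457",
    "hypothesis": "Buffer B improves protein stability at 4 C",
    "raw_data_uri": "s3://lab-raw/hplc/run-0457.zip",               # untouched instrument output
    "processed_data_uri": "s3://lab-processed/hplc/run-0457.parquet",
    "metadata": {
        "project": "BIO-22",
        "technique": "size-exclusion chromatography",
        "instrument": {"vendor": "ExampleVendor", "model": "HPLC-X", "id": "HPLC-03"},
        "operator": "jdoe",
        "run_date": "2020-07-21",
        "vocabulary_terms": ["assay", "chromatography"],            # shared vocabulary terms aid search
    },
}
```

Because the metadata uses consistent names and shared vocabulary terms, the record can be found by project, technique, instrument, or operator long after the scientist who ran the experiment has moved on.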

This can’t all be done on the shoulders of your scientists. Instead, use smart technology and adopt standards wherever possible. You need to purposefully design the framework for the “data plumbing” in your organization. Both hot and cold data need to be routed to where they belong. And besides an ELN, SDMS, or LIMS, the data may belong in a knowledge graph, where its true value can be exploited by all personas: data scientists, computational scientists, and savvy scientists making bigger decisions! When you accomplish this purposeful routing, you will end the broken pipes that lead to data silos and disparate data! Finally, you are on the road to being a FAIR-compliant R&D organization: Findable! Accessible! Interoperable! And last, but not least, secondary use of your data - Reusable!
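Here is what purposeful routing might look like, sketched under the assumption that destinations are decided from a record’s metadata rather than hard-wired per instrument; the destination names and rules are hypothetical.

```python
# Hypothetical routing rules; destination names and the decision logic are illustrative.
def route(record: dict) -> list:
    """Decide which systems a contextualized record should flow to."""
    destinations = ["sdms"]                     # raw and processed artifacts are always archived
    metadata = record.get("metadata", {})
    if record.get("hypothesis"):
        destinations.append("eln")              # the experiment narrative belongs with the notebook entry
    if metadata.get("vocabulary_terms"):
        destinations.append("knowledge_graph")  # well-annotated data can serve data-science personas
    if metadata.get("regulated", False):
        destinations.append("lims")             # sample-centric, GxP-relevant results
    return destinations

# For the record sketched above: route(experiment_record) -> ['sdms', 'eln', 'knowledge_graph']
```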

Your organization must make very important decisions about how it is going to guard some of its top assets - proprietary data and processes. All R&D organizations need their high-quality science to be reproducible and repeatable. The data and process capture must be exact for this to occur. This means understanding both the instrument integration AND the “plumbing,” or shepherding of the data into the right repositories, is critical.

Let’s consider one example. On the surface, an ELN keeps track of the researcher’s Idea/Hypothesis through to his or her Conclusion. The ELN captures some data artifacts, including images, files, etc. However, in many cases, the ELN does not support the storage of massive amounts of experimental data. Instead, it records a pointer to this data. This “pointer” strategy prevents application data bloat and encourages the proper storage and curation of experimental data. This is just one example of where “data plumbing” design comes into play in many medium-to-large R&D organizations. Having a platform that you can plug into that handles data, instrument, application, and process integration is a high-ROI (return on investment) need.
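A hypothetical sketch of that pointer strategy: the ELN entry stores a reference to where the bulk experimental data lives (for example in an SDMS or object store), plus enough information to verify it, rather than the data itself. Class and field names below are assumptions for illustration.

```python
# Hypothetical sketch of the ELN "pointer" strategy; names are illustrative.
from dataclasses import dataclass

@dataclass
class DataPointer:
    uri: str           # where the experimental data is stored and curated
    checksum: str      # lets the ELN verify the referenced data has not changed
    size_bytes: int    # bulk data stays out of the ELN's own database

@dataclass
class ElnEntry:
    title: str
    hypothesis: str
    conclusion: str
    attachments: list  # DataPointer objects, not embedded files

entry = ElnEntry(
    title="Run 0457: buffer screen",
    hypothesis="Buffer B improves stability",
    conclusion="Confirmed at 4 C",
    attachments=[DataPointer("s3://lab-raw/hplc/run-0457.zip", "sha256-placeholder", 4_816_322_560)],
)
```

The notebook stays lean and responsive, while the heavy data is stored and curated where it belongs yet can still be retrieved and verified from the entry.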

Having worked in this space for thirty years in a plethora of roles, from individual contributor to leading strategy and teams, I have seen that this is a very big problem, and it will take true teamwork to combat. Many bespoke systems have been built; maybe some have worked well, but I haven't seen it myself.

I believe that we are finally able to solve this Trash Data problem once and for all. You need to partner with companies who are taking a platform approach to get into production as quickly as possible. You need partners who truly understand data diversity, contextualization, and FAIR principles.

Every day you don’t have a solution in production, the Trash Data continues to pile higher. The predictions are out there: IBM estimates a jump to 93% in the years to come.

The platform needs to be cloud-native to provide the scalability and agility needed for future-proofing. It needs to be enterprise-grade, meeting the security and compliance needs of Life Sciences R&D. It also needs to elegantly handle the complexity of not only the automated collection of data – everyone can do that these days – but also the “plumbing” of the data to and from multiple ELNs, LIMS, SDMS, knowledge bases/graphs for data science tools, etc. We all know that big pharma is never going to consolidate to one provider across the enterprise. The platform also needs to harmonize all the data – raw and processed – and significantly reduce or eliminate Trash Data. And it needs to automate all repeatable tasks and metadata collection/capture to remove the burden from the scientists and improve data integrity. This is a serious endeavor, and one you can't afford to ignore. After all, you don’t know what you don’t know; even worse, you don’t know what you already should know, and you can’t find that experiment or data in your current environment!



About the author:
John has spent 30 years in R&D environments that include Life Sciences, Material Sciences, Computational Sciences, Software, and Consulting. In his last BioPharma role, John was Global Head of R&D IT for AstraZeneca and Global Head of Data Sciences and AI. John started his own Consultancy in late 2019, 20/15 Visioneers, which has been taking off and keeps him busy. To read more of John's thought leadership in R&D Informatics, check out the 20/15 Visioneers Blog.

  1. Dark analytics: Illuminating opportunities hidden within unstructured data
  2. Dark Data, Wikipedia
  3. The Biotech's Manifesto for Scientific Informatics
