The Lab of the Future is a concept guiding the design and creation of new environments for doing science more quickly and efficiently. The details are continually evolving. What remains consistent, however, is that Labs of the Future are data-centric: they accelerate discovery by making increasingly effective use of data, software (including AI and machine learning), and automation (including robotics).
If you’re looking to learn:
… then you’re in the right place!
In this article, we'll answer these questions and provide all the information you need to begin building your digital, data-centric Lab of the Future. But first, let’s dive into why creating a Lab of the Future is important.
Because new knowledge is valuable, and gaining new knowledge faster helps us create opportunities, meet growing challenges, and improve human life.
The success of almost every human undertaking is gated by how quickly we can learn new things and then apply that knowledge to improve outcomes. In biopharma, chemicals and materials, energy, and many other industries, laboratory science (often referred to as 'research and development' or 'R&D') is how new products are imagined and created, and where processes for building them are architected and refined. New knowledge gained here is then applied to claim patents, validate safety and efficacy, submit materials for regulatory approval, and to enable reliable, efficient, and compliant manufacturing at scale.
Doing all this faster, better, more repeatably, and more predictably is incredibly valuable. In the case of biopharma, for example, getting safe, effective new therapeutics into production quickly can save many lives.
It's also very profitable: saving only 5% of science teams' time (e.g., by reducing the time they spend manually entering and curating experimental data) over the average ten-year drug development cycle can be worth tens of millions of dollars in additional profits earned during the period of patent exclusivity for a new (non-blockbuster) drug.
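The rough arithmetic behind a claim like this can be sketched in a few lines. Every input below is an illustrative assumption, not a figure from this article:

```python
# Back-of-envelope estimate of the value of a 5% time savings.
# All inputs below are illustrative assumptions.
scientists = 200             # assumed headcount of an R&D organization
fully_loaded_cost = 150_000  # assumed annual cost per scientist (USD)
years = 10                   # average drug development cycle
time_saved = 0.05            # fraction of working time recovered

labor_value = scientists * fully_loaded_cost * years * time_saved
print(f"Recovered labor value: ${labor_value:,.0f}")  # $15,000,000
```

Note that this counts only recovered labor; earlier market entry during the window of patent exclusivity would add revenue on top of this figure.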
Applying today's best technologies for data integration — a cornerstone concept for building the Lab of the Future — is already paying off for many of today's most advanced biopharma research and development organizations.
By automating retrieval of data from instruments, ELNs, and other sources, and by parsing that data into a standard Intermediate Data Schema (IDS), these labs are eliminating time-wasting manual toil and errors while enabling quick access to searchable, enriched, standardized data for bench science, analytics, data science, and automation.
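As a minimal sketch of what such parsing involves, the snippet below maps one line of a hypothetical CSV instrument export into a standardized, machine-readable record. The field names here are invented for illustration; they are not the actual Tetra IDS:

```python
import json

def to_standard_record(raw_line: str) -> dict:
    """Parse one line of a hypothetical CSV instrument export into a
    standardized record. Field names are illustrative only, not the
    actual Tetra Intermediate Data Schema."""
    instrument_id, timestamp, analyte, value = raw_line.strip().split(",")
    return {
        "instrument": {"id": instrument_id},
        "measurement": {
            "timestamp": timestamp,
            "analyte": analyte,
            "value": float(value),
            "unit": "mAU",  # assumed unit for this sketch
        },
    }

record = to_standard_record("hplc-07,2024-05-01T10:30:00Z,caffeine,12.4")
print(json.dumps(record, indent=2))
```

Once every source is parsed into the same shape, records from different instruments and vendors become directly searchable and comparable.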
Speeding up science — also known as building the Lab of the Future — requires digitalization and automation.
Digitalization is a term that literally means "turning everything into numbers," and has been extended to mean "storing all of an organization's data on computers in forms easily consumed by software, and making it findable, accessible, interoperable, and reusable (FAIR)."
Automation is eliminating manual labor by using software to drive logical and physical processes, and handle certain kinds of decision-making. Today, the emergence of artificial intelligence and machine learning is expanding the meaning of automation to include acts of perception and even forms of automated "reasoning."
Digitalization and automation are closely linked. The greatest part of digitalization depends on using automation to extract, transform, harmonize, and store data, making it searchable and comparable. That data can then be leveraged for inquiry, decision-making, and to drive further automation efforts.
But big headwinds are slowing digitalization's advance. Science is complicated: biopharma workflows with hundreds of steps, thousands of kinds of instruments and software, and a plethora of proprietary file types pose huge barriers to digitalization efforts. Right now, in fact, science is pretty slow — in large part because science-based organizations aren't managing data well.
Bench scientists spend a huge percentage of working time not doing science.
Instead, they spend time finding data and extracting it from silos, manually transcribing data from non-machine-readable sources, finding and importing data files from file shares, and perusing and laboriously transforming data with spreadsheets and other crude tools to enable analysis and reporting. They spend additional time at the bench, setting up and executing experiments, often working manually with macroscopic apparatus (e.g., literal test tubes).
The same is true of data scientists (and everybody else).
Away from the lab bench, data scientists and others (e.g., IT, business analysts, leadership) also spend large percentages of time — widely-accepted estimates range around 70% — finding and curating lab and other data to make it useful for analysis.
The amount of time spent on manually wrangling data slows the entire enterprise, enables persistent inefficiencies, and makes science-focused organizations less agile and proactive. It makes management less able to lead and slows implementation of critical and transformative technologies like AI and machine learning.
Manual data-wrangling invites errors and omissions.
Invalid and/or incomplete data translates to wasted time and materials, and increased risk. Experiments can't be replicated. Processes developed at the bench can't easily be ported to manufacturing. Manufacturing runs fail because inputs are incorrect, critical equipment breaks, consumables aren't available when needed. Regulatory compliance can become impossible to achieve and to prove.
Steps aren't being taken to preserve data's long-term (and undiscovered) value.
Heavy reliance on manual or very simple automated data-wrangling tends to focus on data's local utility in the present moment: i.e., just what scientists need to move experiments forward and build required reports. Important metadata (context, provenance, the specific instrument used, environmental conditions, and much more) may never be recorded, or can become detached from raw experimental results over time.
Complete datasets (vs. results) can become difficult to find or impossible to reconstitute months or years after experiments were performed. Unstructured data may be stored raw, in forms that are difficult to parse and/or compare with data generated on functionally-similar equipment of different brands/models/types/software generations. All this makes it very hard to do wide-ranging research on data itself.
A simple example: experimental data can provide important clues about the state of instruments and equipment, revealing (for example) when a critical instrument will shortly require maintenance. Keeping vital equipment online is one simple way of making science faster (or at least, preventing it from slowing down). But if experimental datasets don't include metadata that identifies the specific instrument used to generate them (i.e., one instrument among dozens in a biopharma chromatography 'core' installation), it becomes impossible to implement such proactive strategies by leveraging existing data.
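The idea can be sketched as follows, assuming each run's dataset carries an instrument identifier and a simple health metric. The instrument IDs, pressure readings, and 5% drift threshold below are all hypothetical:

```python
from statistics import mean

# Illustrative sketch: each run's metadata identifies the specific
# instrument that produced it, plus a hypothetical health metric
# (here, a baseline pump-pressure reading in bar).
runs = [
    {"instrument_id": "hplc-03", "pressure_bar": p}
    for p in (180, 182, 181, 196, 199)
] + [
    {"instrument_id": "hplc-07", "pressure_bar": p}
    for p in (178, 179, 177, 180, 178)
]

def needs_maintenance(runs, instrument_id, threshold=1.05):
    """Flag an instrument whose recent readings exceed its historical
    average by more than `threshold` (an assumed heuristic)."""
    history = [r["pressure_bar"] for r in runs
               if r["instrument_id"] == instrument_id]
    recent = mean(history[-2:])     # last two runs
    baseline = mean(history[:-2])   # earlier history
    return recent > baseline * threshold

print(needs_maintenance(runs, "hplc-03"))  # True
print(needs_maintenance(runs, "hplc-07"))  # False
```

Without the `instrument_id` metadata attached to each run, the grouping step at the heart of this heuristic is impossible, no matter how sophisticated the analytics.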
Most labs have no "single source of data truth." Manual processes don't preclude data entry into centralized lab systems of record, but they also do virtually nothing to aggregate all data in one warehouse, harmonize it for general utility (e.g., routinely convert mass spectrometry data in 20+ different file-types into a single, standard schema enabling comparison of results from any given mass-spec to results from any other), and make it all readily available via a single interface to applications and stakeholders.
Data volume isn't being estimated or prepared for adequately. As this blog scarily details, healthcare data volumes have been growing at almost 900% per year since 2015. By 2025, some estimates suggest a total global data volume of over 175 zettabytes (175 trillion gigabytes). And as technology improves, data volumes tend to increase even faster. In biopharma, increased use of high-resolution imaging, sequencing (200GB per human genome), and other technologies is magnifying storage requirements, creating new challenges of storage scale, transport speed, relative location of data and compute (needed to process data), and, of course, total cost.
The problems attendant on storage are mitigated, but not completely solved, by replatforming to the cloud. Yes, the cloud can and will scale beyond the practical capacity of on-premises storage solutions, and it is more reliable. But cloud providers charge for storage and transport, and provide mechanisms and cost incentives for careful storage management (e.g., tiering, archiving, etc.). Cloud-native data management solutions need to optimize use of tiering and other services to manage growing storage volumes at lowest cost, while also implementing dynamic storage policies to ensure adequate performance.
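As one concrete illustration, a tiering policy in the style of an AWS S3 lifecycle configuration might look like the sketch below. The prefix, rule name, and day counts are illustrative assumptions, not recommendations:

```python
# Sketch of a storage lifecycle policy in the style of an AWS S3
# lifecycle configuration. Rule names, prefixes, and day counts
# are illustrative assumptions.
lifecycle_policy = {
    "Rules": [
        {
            "ID": "tier-raw-instrument-data",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [
                # Keep fresh data on fast storage, then tier down
                # as access frequency drops.
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 365, "StorageClass": "GLACIER"},
            ],
        }
    ]
}
```

Policies like this trade retrieval latency for cost: recent experimental data stays hot for active analysis, while older raw data migrates to cheaper archival tiers without being deleted.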
Scientists, data scientists, data engineers, and R&D-focused IT specialists have already adopted and implemented many powerful tools and strategies for speeding up science.
At the same time, scientific work is more and more often becoming distributed, geographically and organizationally. The web, video conferencing, and other commonplace technologies let scientists collaborate globally -- the pandemic has only accelerated this trend. Specialized Contract Research Organizations (CROs) and Contract Development and Manufacturing Organizations (CDMOs) help science-based organizations move faster while minimizing costs.
In short, science-based organizations -- like other enterprises -- are already working hard to transform themselves digitally.
But the rapid evolution of scientific equipment, data management systems, analytical tools, scientist workstyles, and strategic opportunities for collaboration are also, in some cases, compounding existing problems while making them more complex:
The result is that — to date — many efforts to accelerate discovery have delivered less benefit than anticipated, failing to keep up with emerging challenges like:
Validating new knowledge requires being able to repeat experiments and obtain the same results that you, or other researchers, obtained originally. But even as humanity does more science than ever before, scientists around the globe acknowledge that they're facing a growing replication crisis: in more and more cases, published experimental results are proving impossible to replicate, and may be false.
There are doubtless many reasons for this phenomenon. But one obvious contributor is likely simple human error, resulting in erroneous and/or incomplete data and/or incomplete or erroneous descriptions of methods. Better data hygiene, including automated ingestion of data, contextual and environmental metadata (e.g., what was the temperature in the lab when you performed the experiment?), data standardization, and automation (encoding methods explicitly in full detail for execution by automated instruments and systems) are part of the solution.
Turning scientific discovery into useful products (e.g., safe and effective new medicines) means being able to improve, optimize, and scale up processes developed in laboratories, so that they work in factories. This is often hard to do, and can be viewed as another aspect of the more general replication crisis. Again, better data hygiene, and use of enriched data for process design, monitoring, automation, and quality assurance, is an important part of meeting these challenges.
Truly game-changing medicines — "blockbuster" drugs, like key statins, that markedly improve life for many people and can be marketed profitably, for long periods, under patent protection — are harder and more expensive than ever to discover, develop, and bring to market. Instead, many biopharmaceutical makers are now striving to build broader and more consistently-profitable portfolios of non-blockbuster therapeutics: medicines with narrower markets.
Doing this successfully means getting more drug candidates into the discovery pipeline: mounting more efforts and doing more systematic research across broad categories of potential therapeutics to identify useful effects. This, in turn, requires efficiently scaling up and automating research (e.g., for "hit to lead" optimization) and leveraging new technologies, including applying machine learning and AI to fold proteins and model other biochemical processes in silico. This work can only be accomplished within a mature, modern environment for scientific data flow management.
Outsourcing R&D and manufacturing to global partners has potential to reduce costs while speeding development of drugs (among other scientific products ideally produced under strict regulation and quality controls). But current technology, economics, and politics permits insufficient oversight, leading to what some have called a "global pandemic" of counterfeit and substandard drugs.
Overcoming the problem requires regulatory harmonization, global inspection standards, and many other organizational, legal, political, and economic adaptations. It doubtless also requires new, data-based technologies that help distant organizations communicate, share rich datasets that model experimental processes in great detail, and improve remote oversight and near-real time process monitoring, making R&D processes and their data outputs harder to falsify and easier to review both by humans and machines.
Many new, high potential therapeutic approaches have markets of one: personalized therapies, tuned to match the genetic characteristics governing a disease process in a single individual. Applying this science at scale, to whole populations of patients, means extending the R&D (and manufacturing, and clinical trial, and regulatory) domains to the patient bedside. This will demand every anticipated faculty of the Lab of the Future, including robotics and AI. And all of this depends absolutely on end-to-end management of R&D data, in close-to-real time.
Challenges facing humanity today, including climate change and pandemics, can only be overcome by achieving new levels of efficiency and collaboration. Of special importance here is the question of data harmonization: standards that help align self-similar data from many sources, making it possible for many researchers to share their work with one another, quickly create applications for comparison and analysis, and work together efficiently in scenarios where critical datasets pass through many hands and/or work as canonical resources for many independent research efforts.
The Lab of the Future is fundamentally about making best use of FAIR data to accelerate discovery. Access to FAIR data unblocks and accelerates practical application of a host of emerging technologies and techniques for doing science faster and better, starting today.
Electronic Lab Notebooks (ELNs) are already popular for planning experiment workflows and analyzing data. Using an organization-wide, cloud-native data platform purpose-built for science (e.g., Tetra Data Platform), you can integrate your ELN with automated instrument control software, LIMS, and other systems, turning it into a command center for high-speed, hands-free science and automated laboratory management.
ELNs integrated with TDP serve scientists better: accelerating the Design/Make/Test/Analyze loop, eliminating manual instrument and data wrangling, removing sources of error, enabling compilation of richer and more complete datasets, and letting scientists more easily exploit ELN tools for data analysis, experiment planning, and optimization. TDP-mediated integration with LIMS automatically assures that samples are properly tracked, resources maintained in good supply, and instrument fleets proactively maintained. Meanwhile, TDP itself makes FAIR data easily accessible to others within the organization: for analytics, reporting, and inquiry.
Bringing data from many instruments and systems into Tetra Data Platform's cloud-based data lake creates an easily accessible single point of integration for analytics, visualization, AI/ML, and other applications. Ensuring that all this data is indexed and harmonized, and providing powerful, well-documented, easy-to-use REST-based search APIs based on SQL and ElasticSearch (as TDP also does), lets all your apps share a common, search-based integration methodology to get the data they need. This makes it easy to build new apps using programming languages and tools scientists and data scientists already know: Python, R, Streamlit, Tableau, and many others.
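To illustrate the pattern, the snippet below builds an Elasticsearch-style query body of the kind an application might POST to such a search API. The field names and query shape are assumptions for illustration, not the actual TDP API:

```python
import json

# Sketch of an Elasticsearch-style query body. Field names and the
# endpoint are illustrative, not the actual Tetra Data Platform API.
query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"technique": "chromatography"}},
                {"range": {"created": {"gte": "2024-01-01"}}},
            ]
        }
    },
    "size": 50,
}
payload = json.dumps(query)

# In practice, this body would be POSTed to the platform's search
# endpoint over HTTPS, e.g. (hypothetical URL):
#   requests.post("https://example.com/api/search", data=payload, ...)
```

Because every application speaks the same query language against the same harmonized index, a dashboard in Tableau, a script in Python, and a pipeline in R can all retrieve the same data the same way.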
TetraScience maintains a fast-growing public repository of demo data applications built by TetraScience data engineers and others in the growing TetraScience community. Among these is a simple application for aligning chromatograms produced in different conditions, making them easy to compare, simply by overlaying the aligned plots on one another.
Why is this valuable? To start with, why compare chromatograms? One reason might be to determine if a scaled-up, chromatography-based purification process is running correctly by comparing a current plot to an earlier one, previously generated in a research lab. Your first hitch: finding the earlier plot — in some labs, this alone can be difficult; but it's no problem with harmonized, metadata-enriched Tetra Data and ElasticSearch.
Next potential hitch: enabling comparison. Raw data from two different chromatography systems might be stored in different formats, so might require parsing, pulling into spreadsheets or other software, and manual fiddling, with attendant possibility of error. Not so with Tetra Data, since all the plots have been parsed from their raw forms, harmonized into Tetra's Intermediate Data Schema, and made easily accessible as self-similar JSON datasets.
In fact, most of what this application is doing is simply scaling the plots to match in the time dimension, making them visually comparable. But even an application this simple can be impactful, particularly since the Tetra Data Platform lets it work in real time. Being early to spot problems with a long, expensive, large-scale purification run can potentially save an organization hundreds of thousands or even millions of dollars.
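The core of such a time-dimension alignment can be sketched in a few lines of Python. This is a simplified, hypothetical version; the actual demo application in the TetraScience repository presumably does more:

```python
def align_time_axis(times, values, target_duration):
    """Linearly rescale a chromatogram's time axis so its total run
    length matches `target_duration`, making two plots visually
    comparable. A simplified sketch of time-dimension scaling."""
    scale = target_duration / times[-1]
    return [t * scale for t in times], values

# Two hypothetical runs of the same purification, recorded under
# different conditions, so peaks appear at different elution times.
ref_times = [0.0, 1.0, 2.0, 3.0, 4.0]
new_times = [0.0, 2.0, 4.0, 6.0, 8.0]
new_values = [0.1, 0.3, 5.2, 0.4, 0.1]

scaled_times, _ = align_time_axis(new_times, new_values, ref_times[-1])
print(scaled_times)  # [0.0, 1.0, 2.0, 3.0, 4.0]
```

After scaling, the two traces share a common time axis and can simply be overlaid; with harmonized JSON datasets as input, none of the usual format-conversion fiddling is needed first.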
Global partnering and outsourcing for R&D and manufacturing help science-based businesses scale to meet big challenges, shorten timelines to ROI, and control costs. But outsourcing, particularly for quality-critical, highly-regulated industries like biopharma, is still potentially fraught with friction and risk.
Scientific workflows are complex, with many dependencies. And extending them past organizational boundaries -- enabling collaboration between partners -- creates new challenges. Scientific partners need to exchange data in disciplined ways, with agreement on its meaning(s). They may need to port complex workflows from one organization to another, in the process, perhaps taking a process created during R&D and re-engineering it to work at manufacturing scale. Oversight and transparency are essential to ensure partner alignment and support quality goals and compliance requirements.
Andelyn Biosciences, leaders in Cell and Gene Therapy contract research and manufacturing, is working to meet these challenges, using Tetra Data Platform to construct a "Connected Plant," along with cloud-native API and portal services for customers. This initiative will harmonize data from building and facilities, process and manufacturing, and lab instrument sources, providing Andelyn with new levels of insight and control over their own operations, ensuring GxP compliance and enabling new efficiencies. Customers will get high visibility into operations for oversight, and easy access to harmonized data, ready for visualization and analytics.
Automating science at the bench promises massive benefits for science-based organizations looking to accelerate R&D (and manufacturing) while maintaining high quality and repeatability of critical processes. High-performance liquid chromatography and mass spectrometry solutions and control software, such as Waters Empower CDS and MMS, are now well established in labs seeking to speed up precision applications (for more, see our whitepaper). Vendors like Dotmatics, Tecan, PerkinElmer, and others are fast building out portfolios of highly automated instruments and control software of their own.
Complete laboratory robotics systems — produced by organizations like HighRes Biosolutions and other key vendors — now come in dizzying variety and many modular form-factors. They use rail-based mounting frameworks and other connecting and support mechanisms to enable assembly of precise 'factory' structures, then let you mount automated instruments and link them, using liquid handlers, lifts and conveyors, manipulators, and other tools to move samples and dispense substances. Results can range from small, one-room multiprocessing systems to floor- and building-scale robotic lab/factories.
Driving all this flexible robotic infrastructure is software, of course. HRB makes several systems for managing their lab equipment, including Cellario, a powerful suite of tools for scheduling robotic lab procedures, simulating them for debugging and optimization, and executing them at scale. Tools like Cellario form a natural connection point for data-centric solutions like Tetra Data Platform, which can extract and harmonize data produced by instruments running within robotic frameworks, provide it to scientists via an integrated ELN, and can implement pipelines to evaluate, transform, make decisions, and push commands back to robotic systems and instruments, enabling rapid iteration -- getting to useful scientific results rapidly while eliminating tricky manual procedures and data wrangling.
From all the above, it should be clear that the success of Lab of the Future initiatives depends heavily on first solving challenges in handling and managing data:
Fixing the scientific data problem-set is the most pressing task for science, and for science-based businesses, in the present moment.
Happily, the job of making scientific data FAIR can begin right now, using carefully-selected present-day technology — and can generate big short-term benefits. In fact, it's arguable that solving the data problem — all by itself — has potential to fund the entire speculative arc of Lab of the Future initiatives, moving forward. Not least because FAIR data is absolutely essential to development and refinement of lab automation, robotics, machine learning, and artificial intelligence — the tools that will ultimately help scientists accelerate discovery and improve human life.