TetraScience Blog

Explore our thoughts on digital transformation, company updates and announcements, industry news and events, and more.

Subscribe to Our Mailing List


Over the long pandemic lockdown, I taught some folks Python, Docker, and Kubernetes. I learned some banjo. Lost some weight. Baked a lot of sourdough bread. Joined TetraScience and started learning how to build analytics applets with Streamlit. What did you do?

Dan Roberts, Sho Yaida, and Boris Hanin wrote a textbook called Principles of Deep Learning Theory (due out from Cambridge University Press in 2022, but available in .pdf form at the link) that rigorously explains how deep neural networks (DNNs) work from first principles. Along the way, the collaborators show how DNN aspect ratios (width vs. depth) govern performance, and advance a theoretical framework and new abstractions for categorizing DNNs and predicting their information flow and representational capacity. They use this framework to prescribe methods for extending DNNs to arbitrary depths and solving challenges like the vanishing gradient problem.

Yaida explains this very elegantly in a Facebook AI Research blog, published in mid-June. To summarize: DNNs are now being used intensively for many kinds of practical work. They're of great interest to biopharma research, where they're being applied in areas as pragmatic as predicting when equipment failures will occur (thus enabling continuous manufacturing with near-zero waste), and as open-ended as searching for hits and optimizing leads across vast virtual databases of candidate small molecules: doing the fundamental, effortful work of identifying likely valuable therapeutics automatically, with minimal human or physical resources. On the bleeding edge of research, meanwhile, machine learning is pushing research beyond the (probably structural) limits of human brainpower and intuition. The well-known AlphaFold project, for example, is advancing the use of deep learning to rapidly and accurately predict 3D protein structure from sequence data.

But training DNNs effectively is still largely a black art, because their behavior differs from that of abstract theoretical models; training can thus be time-consuming and inefficient. And validating the competence of trained models with precision has been effectively impossible, because (before Roberts et al.'s pandemic project, anyway) there was no way to rederive the function a DNN is computing from its trained behavior, or to predict it in detail from knowledge of the network's initial state and training arc.

Now this is no longer quite so impossible, or so mathematicians and physicists reviewing the book are hoping. To quote from Yaida's blog: “The book presents an appealing approach to machine learning based on expansions familiar in theoretical physics,” said Eva Silverstein, a Professor of Physics at Stanford University. “It will be exciting to see how far these methods go in understanding and improving AI.” Yaida's own assertion, central to the blog, is that this work potentially opens the door to what may be a period of rapid, systematic improvements in DNN technology, based on firm principles of physics and math, rather than trial and error. It may also (or so we hope) influence improvements to DNN toolkits that make it easier for scientists with deep understanding of biological problems to design, train, and apply DNNs to help find solutions.

Obviously, we're all going to need to read the book as our next pandemic activity.


Principles of Deep Learning Theory

A groundbreaking new book, soon to be published by Cambridge University Press, promises to simplify application of deep neural networks and make their inner workings easier to validate.

John Jainschigg

In Part one of this blog, we discussed some of the reasons DIY data platform projects often fail to deliver promised benefits, or do so at unexpectedly high cost.

To review: building a do-it-yourself (DIY) data solution from non-R&D-focused components means assuming responsibility for selecting, integrating, and lifecycle-managing all the pieces and parts, and for researching, architecting, building, and maintaining all the integrations. That's a tall order, requiring headcount and specialized skills — mostly focused on building and operating the platform and getting data into it, and much less focused on extracting maximum value from the data itself.

Making matters worse, there's an "impedance mismatch" between the capabilities offered by generic data components and services (ingestion, transformation, cloud storage and search, etc.) and biopharma infrastructure, workflow, regulatory, and scientific characteristics and requirements. So a DIY project that aims to create a quality solution needs to fill the gap: providing the needed scientific and process knowledge to extract, parse, and enrich data close to its sources (data without context has limited utility), and map data into an ontology/schema that makes it readily findable, accessible, interoperable, and reusable (FAIR).

The result: DIY efforts can easily consume vast resources and compel non-strategic organizational spread, while delivering sub-par benefits: e.g., brittle, inflexible, hard-to-maintain integrations between data sources and targets (logically equivalent to one-off, point-to-point integrations, despite traversing a transformation tier and data lake/warehouse), and aggregated data that's devoid of context and non-schematized, and thus hard to find and use for automation, analysis, visualization, and discovery.

Tetra R&D Data Cloud: Unified and Purpose-Built

To meet these challenges requires a different approach. The Tetra R&D Data Cloud represents a fundamental shift in terms of how to bridge life sciences data sources and data targets to accelerate R&D.


Data-centric: treating data as the core asset, providing stewardship of data through its whole life cycle. The Tetra R&D Data Cloud is architected to keep data the continual focus: from acquisition, harmonization, management, and processing through to preparing data for AI/ML and pushing it to informatics/automation applications. The Tetra Data Platform (TDP) includes:

  • A flexible ingestion tier (agents, connectors, IoT proxies, etc.) that robustly manages connectivity with and transactional access to any kind of data source.
  • A sophisticated pipeline architecture, engineered to facilitate rapid creation of self-scaling processing chains (for ingestion, data push, transformation, harmonization) by configuring standardized components, minimizing coding and operations burden while containing the cloud costs associated with data processing.
  • A high-performance, multi-tiered cloud storage back end.
  • An R&D-focused, fully-productized, plug-and-play, distributed integration architecture that runs across this purpose-built platform. Integrations are engineered by Tetra's IT and biopharma experts (in collaboration with our ecosystem of vendor partners) to extract, deeply parse, and fully enrich (e.g., with tags, labels, environmental data, etc.) data as it emerges from sources, then harmonize it into an open Intermediate Data Schema (IDS) to make it FAIR.
  • Open, modern REST APIs and powerful query tools (plus API-integrated data apps), providing easy access to raw and harmonized Tetra Data by automation, analytics, and AI/ML applications, and by popular data science toolkits (e.g., Python + Streamlit).
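
To make that last point concrete, here is a minimal Python sketch of what programmatic access could look like. The endpoint path, query parameters, and field names are hypothetical placeholders, not the documented TDP API; only the requests library calls themselves are real.

```python
# Illustrative only: the endpoint path, query parameters, and field names
# below are hypothetical placeholders, not the documented TDP API. Only the
# `requests` library calls themselves are real.
import requests

TDP_BASE_URL = "https://tdp.example.com/api"   # hypothetical base URL
API_TOKEN = "..."                              # issued by your TDP administrator

def search_harmonized_data(query: str, limit: int = 10) -> list:
    """Return harmonized (IDS-style) documents matching a free-text query."""
    response = requests.get(
        f"{TDP_BASE_URL}/search",              # hypothetical endpoint
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        params={"q": query, "limit": limit},
        timeout=30,
    )
    response.raise_for_status()
    return response.json().get("hits", [])

if __name__ == "__main__":
    for hit in search_harmonized_data("hplc AND project:stability-study"):
        print(hit.get("fileId"), hit.get("ids", {}).get("type"))
```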

This data-centric architecture (plus the data-centric focus applied in building true R&D data integrations) ensures that:

  • Appearance of new data from instruments and applications (and readiness of instruments and applications to accept instructions) can be detected automatically in virtually every case, enabling hands-free automation: ingestion, enrichment, parsing, transformation, harmonization and storage on the inbound side, and search/selection, transformation, and push (or synchronization) on the outbound side.
  • Newly-appearing data is enriched, parsed, harmonized, and stored as it becomes available, preserving context and meaning for the long term, ensuring lineage and traceability, and making Tetra Data immediately useful for analytics and data science, even in near-real time (i.e., while experiments are running).
  • Data is harmonized into a comprehensive set of JSON IDS schemas that are fully documented and completely open, making it searchable and comparable, and facilitating rapid (automatic) ingestion into applications.
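
For illustration only, the snippet below sketches the general shape of a harmonized, vendor-agnostic record and why that shape matters downstream; the field names are invented for this example and are not the actual IDS.

```python
# Purely illustrative: the field names below are invented for this example
# and are NOT the actual Tetra Intermediate Data Schema (IDS).
record = {
    "ids_type": "plate-reader",        # hypothetical schema identifier
    "ids_version": "v1.0.0",
    "system": {"vendor": "ExampleVendor", "model": "ExampleModel"},
    "sample": {"barcode": "PLT-000123", "labels": ["stability-study"]},
    "measurements": [
        {"well": "A1", "wavelength_nm": 450, "absorbance": 0.812},
        {"well": "A2", "wavelength_nm": 450, "absorbance": 0.097},
    ],
}

# Because every record shares a documented shape, downstream code can use the
# same access pattern regardless of which instrument produced the data.
high_signal = [m for m in record["measurements"] if m["absorbance"] > 0.5]
print(high_signal)
```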

Cloud-native: The Tetra R&D Data Cloud incorporates best-of-breed open source components and technologies (e.g., JSON, Parquet) and popular standards favored by scientists and by life sciences and data sciences professionals (e.g., SQL, Python, Jupyter) in an aggressively cloud-native architecture that ensures easy, flexible deployment, resilience, security, scalability, high performance, and minimum operational overhead, while optimizing to provide lowest total cost of ownership.
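
As a small, hedged example of that data science friendliness: assuming harmonized results have been staged as a Parquet file (the file name and column names below are hypothetical), a scientist could slice them with standard Python tooling.

```python
# Minimal sketch, assuming harmonized results have been staged as Parquet
# (the file name and column names are hypothetical); requires pandas with a
# Parquet engine such as pyarrow installed.
import pandas as pd

df = pd.read_parquet("harmonized_results.parquet")

# A typical data-science slice: filter and aggregate without vendor software.
summary = (
    df[df["project"] == "stability-study"]
    .groupby("sample_id")["peak_area"]
    .mean()
)
print(summary.head())
```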

R&D-focused: with connectivity, integrations, and data models purpose-built for experimental data at the core. TetraScience has created a large (and growing), broadly skilled (software development, cloud computing, and life sciences), and disciplined organization, and has evolved a mature process for identifying, building, and maintaining a library of fully-productized integrations to biopharma R&D data sources and targets, and for creating data models for common data sets. These integrations (including the associated data models, comprising the open IDS) are purpose-built and tailored to fulfill informatics and data analytics use cases in R&D.

Open and vendor-agnostic, leveraging a broad partner network: TetraScience has partnered with, and actively collaborates with, the industry’s leading instrument and informatics software providers. This partner network collaborates closely, benefiting all the network members (and TetraScience customers) as our collective ecosystem grows. Partnering this way significantly accelerates integration development and productization, helps ensure integration quality, keeps integrations in sync with product updates, and helps guarantee that integrations fully support high-priority, real-world customer use cases.

Conclusion

Biopharma can best exploit its most important asset — R&D data — by implementing a purpose-built, end-to-end solution that's data-centric, cloud-native, R&D-focused, and open. Hewing closely to these principles, Tetra Data Platform can help reduce non-strategic organizational spread, letting R&D IT organizations focus on adding value and pioneering new applications that accelerate discovery. At the same time, TDP can deliver to scientists, data scientists, and other end users a self-service, unified data experience — one where data experts are in control of processing and data modeling, and can configure, manage, and track dataflows from end to end.


Part 2: Why Biopharma Needs an End-to-End, Purpose-Built Platform for R&D Data

Building a platform tuned for biopharma R&D use cases means making it data-centric, cloud native, R&D-focused, and open. (Part two of a two-part series.)

Spin Wang

Life Sciences’ Data Problem, and Why DIY Doesn’t Work

R&D biopharma professionals are on a mission to accelerate discovery and improve human life — shortening time-to-market for new therapeutics. This has occasioned a rapid, industry-wide paradigm shift in how data is understood and valued.

Data is now increasingly seen as having large potential utility and value: 

  • Data quantity, quality, validity, and accessibility are now recognized as major competitive assets.
  • Beneficiaries beyond bench scientists: Data scientists, tech transfer, external collaborators, procurement, operations, and multiple other departments or platforms must now be considered data consumers and producers — their contributions (to science, to profit) gated by their ability to access high-quality data easily, and provide high-quality data back. 
  • Meanwhile dataflows are being automated and replatformed to the cloud to enhance access, protect against data loss, leverage elastic compute/storage, and trade CapEx for OpEx, among other hoped-for benefits.

Mastering data can change the game. Biopharma organizations are now laboring to make data work harder, seeking to accelerate discovery and improve business outcomes. Practical uses for aggregated data are all around — in fundamental science, lab automation, resource management, quality control, compliance and oversight (for further examples, see our blog about use cases for harmonized data from Waters Empower Data Science Link (EDSL)). Undiscovered value hidden in data is also becoming a powerful lure — R&D focus is expanding beyond tangible experiments to embrace work done on data directly. Applications for ML range from predictive maintenance on lab and manufacturing equipment to discovering novel ligands within huge small molecule datasets and predicting their biological effects.

Data Sources and Targets

Mastering data across a whole biopharma organization is, however, a daunting challenge. Life sciences R&D suffers notoriously from fragmentation and data silos — with valuable data produced (and also entrapped) in myriad locations, formats, systems, workflows, and organizational domains. How can biopharma organizations begin to cope with this mountain of complexity?

We find it helpful to think of each R&D workflow in terms of a minimum quantum of organizational and logical effort. To gain scientific or business benefit, you need to find and move information from where it’s created or resides (a data source) to a system or application that can usefully consume it (a data target).

Some common data sources include:

  • Lab instruments and instrument control software
  • Informatics applications, like Electronic Lab Notebook (ELN) and Lab Information Management System (LIMS)
  • Contract Research Organizations (CROs) and Contract Development and Manufacturing Organizations (CDMOs)
  • Sensors and facility monitoring systems

Data targets, on the other hand, are systems that consume the data to deliver insights and conclusions or reports. There are roughly two types of data targets:

  • Data science-oriented applications and tools, including visualization and analytics tools like Spotfire, and AI/ML tools and platforms such as Streamlit, Amazon SageMaker, H2O.ai, Alteryx, and others.
  • Lab informatics systems, including Lab Information Management Systems (LIMS), Manufacturing Execution Systems (MES), Electronic Lab Notebooks (ELNs), plus instrument control software like Chromatography Data Systems (CDS) and lab robotics automation systems. Interestingly, these data targets are also treated as data sources in certain workflows.

Integrating these sources and targets is seldom simple (we comment on some of the reasons for this complexity, below). In many organizations, moving data from sources to targets remains partly or wholly a manual process. As a result:

  • Scientists’ and data-scientists’ time is wasted in manual data collection and transcription, with attendant risks of human error and process non-compliance, and inability to focus on analysis and insights.
  • Meanwhile, pressure to collaborate, distribute research, and speed discovery introduces further challenges in data sharing and validation, often resulting in simplistic procedures that rob data of context and long-term utility.

Seeking to automate some of these transactions, R&D IT teams often start by struggling to build point-to-point integrations connecting data sources and targets, frequently discovering that their work is minimally viable, fragile, inflexible, and difficult to maintain. We’ve written several blogs (e.g., Data Plumbing for the Digital Lab, What is a True Data Integration Anyway?, and How TetraScience Approaches the Challenge of Scaling True R&D Data Integrations) on the complexities of integration building, inadequacies of point-to-point approaches, and requirements for engineering fully-productized, maintainable integrations. 

Siloed data leaves many stakeholders unable to find, access, and extract value from biopharma's chief asset.

DIY Solution Assembly

Frustrating experience with pure point-to-point integrations leads many biopharma IT organizations to swiftly begin considering a second-order solution: building a centralized data repository (i.e., a data lake or data warehouse) and using it to mediate connections between data sources and targets. A common approach is to try assembling such a solution from a plethora of available open source and proprietary industry-agnostic components and services for data collection, storage, transformation, and other functions.

Recent articles from Bessemer Ventures and Andreessen Horowitz described versions of a componentized architecture. Note that none of these components are purpose-built for experimental R&D data.

Building a do-it-yourself (DIY) R&D data solution means integrating and managing many interdependent components.

The Problem with DIY 

The challenges we’ve solved with 20+ global biopharma companies have convinced us that there are two major problems with this approach:

Problem #1: Organizational Spread

Data lakes/warehouses, integration platforms, batch processing and data pipeline systems, query engines, clustering systems, monitoring and observability — none of these components are simple, or “just work” out of the box. Most are flexible to the point where serious configuration effort, experimentation, and best-practices knowledge are required to make each component do its job and all components work together well. Serious external tooling (plus skilled operations headcount, training, and planning) is required to make updates, scaling, and other life cycle management operationally efficient and safe in production.

Specialized (but non-strategic) skills required. To execute, you’ll need to assemble teams with specialized skills, some dedicated to a single vendor, component, or subsystem of the complete solution, plus architects and project managers to orchestrate team efforts. You’ll need these teams long-term, since you’ll be responsible for maintaining and evolving the full stack. Their skills are critical. But they’re also a cost center — focused on running, integrating, and expanding your platform, but remote from the critical path of accelerating discovery (i.e., extracting value from data).

A data science manager at a global biopharma organization comments, “This organizational spread creates bottlenecks, slows down operations, and in turn, delays data usage. The additional need to ensure that a data ingestion pipeline is GxP-validated further increases this problem — in fact, it might even add an additional organizational unit to the mix! We would like the experts to be able to directly handle their data.”

Architectural goals become compromised. Meanwhile, as teams around the project grow, tensions emerge between the need to deliver minimum viable product quickly and longer-term vision — architecture and data flow can start to get out of sync. The danger is that project focus can begin to dwell on parts of the problem: collecting data, storing it, providing data transformation, harmonizing data from multiple sources — rather than creating an R&D-focused, data-centric solution that can steward the whole life cycle of data, end to end.

This swiftly adds friction. In fact, it can make many important applications impossible. Two examples:

  • In high-throughput screening (HTS) workflows, robotic automation generates a huge volume of data. This data needs to be automatically collected, harmonized, labeled, and sent to the screening analytics tools in order to inform the next set of experiments the robots need to perform. 
  • In late stage development and manufacturing, quality labs are constantly checking the quality of batches and the performance of their method. Harmonizing this data enables analytics to compare method parameters, batches, and trends over time — flagging anomalies and potentially yielding important insights in terms of batch quality and system suitability. 

In both examples, merely storing the data, collecting the data, or providing data transformation is not enough. To yield benefits, these key operations need to be architected, implemented, tracked, and surfaced in a holistic way, targeting the end-to-end flow of those particular data sets. 

Data is stripped of context, limiting utility. To further elaborate on the need to unify these data operations: data collected without adequate context may be meaningful to the scientists who recognize the file, but it will be useless for search/query, post-facto analytics, and data science at scale. Without a data processing layer that “decorates” data with critical metadata:

  • Vendor-specific or vendor-proprietary data cannot be enriched (or sometimes, even parsed).
  • Data integrity issues — common for experimental data and when working with outsourced partners — cannot be caught.
  • A significant portion of the data cannot be used easily by data scientists.

For more detail, see Executive Conversations: Evolving R&D with Siping “Spin” Wang, President and CTO of TetraScience | Amazon Web Services, and R&D Data Cloud - Beyond SDMS Functionality.

Problem #2: Impedance Mismatch

The “impedance mismatch” between industry-agnostic, “horizontal” data solutions and biopharma R&D is amplified by the complexity of the R&D domain.

Scientific workflows are complex. A single small-molecule or biologics workflow can comprise dozens of sequential phases, each with many iterative steps that consume and produce data. As workflows proceed, they fork, reduplicate, and may transition among multiple organizations — different researchers, instruments, protocols.

Biopharma R&D has the largest variety of instruments and software systems per user of any industry, producing and consuming complex, diverse, and often proprietary file and data types. Distributed research (e.g., collaboration with CROs and CDMOs) adds new sources, formats, and standards, and validation requirements to every workflow. This results in research data locked into systems and formats that make it effectively impossible to validate, enhance, consume, or reuse.

Building effective integrations is EXTREMELY difficult. If an R&D organization builds its own R&D data platform using horizontal components such as Mulesoft, Pentaho, Boomi, Databricks, or Snowflake, it inevitably also needs to build and maintain all the integrations required to pull data from or push data to instruments, informatics applications, CROs/CDMOs, and other data sources and targets. This last-mile integration challenge can easily become a never-ending exercise — where the challenge of creating and maintaining fully-serviceable integrations exceeds the capacity of biopharma IT organizations, and distracts from other, more strategic and scientifically important work. For a closer look at technical and organizational requirements for engineering life sciences integrations, see our blog: What is a True Data Integration, Anyway?

Two strategies are often considered for managing this integration workload:

  • Outsourcing to consulting companies as professional services projects. Integrations produced this way typically take a long time to build, and almost invariably become one-off solutions that require significant ongoing investment to maintain
  • Handing off to vendors of an important data source/target (e.g., a LIMS or ELN) as “customization” or professional services work. Such efforts often produce highly vendor-specific and rigid point-to-point integrations that become obsolete when changes occur, or end up further locking data into that particular vendor’s offering 

Neither of these two approaches treats connectivity or integration as a first-class business/product priority, meaning that DIY projects often bog down, failing to deliver ROI in reasonable timeframes.

Towards a Data-Centric, R&D-Focused Solution

In our next installment, we’ll discuss four critical requirements for untangling life sciences’ complex challenges around data, and show how fulfilling these requirements enables a solution that delivers benefits quickly, scales effectively, and enables refocusing on discovery vs. non-strategic, technical wheel-spinning. Delivering an effective R&D data cloud platform requires more than just IT/cloud know-how and coding smarts. It requires a partnership with an organization that has deep life sciences understanding, a disciplined process for integration-building, and a commitment to open collaboration with a broad ecosystem of partners.

Part 1: Why Biopharma Needs an End-to-End, Purpose-Built Platform for R&D Data

Why is biopharma R&D data so hard to manage? Why do DIY data warehouse solutions, created from industry-agnostic components, so often fail to deliver desired benefits? (Part one of a two-part series.)

Spin Wang

Biopharma is on a mission to accelerate discovery and improve human life — here’s how the Tetra R&D Data Cloud facilitates the mission.

Fragmented, inaccessible data. Manual research workflows and data wrangling that increase the risk to data integrity. Brittle, failure-prone interoperability across an enormous ecosystem of instruments, platforms, and applications. All these roadblocks impede biopharma organizations in their mission to accelerate innovation in drug discovery and development.

As organizations move to speed digital transformation and implement robust, future-proofed methods to realize the full value of scientific data, emerging solutions that provide frictionless access to FAIR (findable, accessible, interoperable, and reusable) data are paving the way to the Lab of the Future.

The infographic below highlights four characteristics of the Tetra R&D Data Cloud that empower biopharma to extract value from their data and unlock the Lab of the Future.

Read on!


R&D Data Cloud

The 4 Keys to Unlock the Lab of the Future

Infographic highlighting 4 critical “keys” that biopharma needs to unlock the Lab of the Future.

Victoria Janik

(This blog is Part 2 of a two-part series. For Part 1, see What is a True Data Integration, Anyway?)

In this blog post, we introduce:

  • What TetraScience and pharma define as a true R&D Data Integration: i.e., what biopharma needs today in its journey of digitization and adoption of AI/ML
  • Why integrations, though conceptually straightforward, are extremely challenging to implement...especially at scale 
  • How TetraScience approaches the massive challenge of integration with a holistic organizational commitment from different angles: positioning, process, product, team, and capital. We discuss why we raised an $80M Series B round to fuel this effort, explain why more than 60% of our integrations squad are life sciences Ph.D.s, and describe how Tetra Data Platform is purpose-built to accelerate integration development

What is a True R&D Data Integration? 

In a previous article, we started defining what we mean by a true R&D Data Integration. To fundamentally shift the R&D data landscape from a point-to-point, project-based paradigm to one that is platform-oriented, productized, and cloud-native, biopharma needs integrations with disparate systems such as lab instruments (e.g., HPLC, mass spec, robotics systems, etc.), informatics applications (e.g., registration systems, Electronic Lab Notebooks (ELNs), Laboratory Information Management Systems (LIMS)), and contract organizations (CROs/CDMOs). A true R&D Data Integration must therefore be:

  • Configurable and productized: with flexible parameters to adjust how data is ingested from sources and provided to targets. Such configuration should be well-documented and tailored closely to particular data source and target systems of interest
  • Simple to manage: Integration with sources and targets should be achievable with a couple of clicks, and should be centrally managed and secured, eliminating the need to log into multiple on-prem IT environments 
  • Bidirectional: able to pull data from a system and push data into it, treating it as both a data source and data target. This capability is critical for enabling iterative automation around the Design, Make, Test, Analyze (DMTA) cycle of scientific discovery
  • Automated: able to detect when new data is available, and requiring little (or no) manual intervention to retrieve it from data sources or push it to data targets
  • Compliant: designed so that any change to the integration and all operations performed on data sets are fully logged and easily traced 
  • Complete: engineered to extract, collect, parse, and preserve all scientifically meaningful information from data sources, including raw data, processed results, context of the experiment, etc. 
  • Capable of enabling true data liquidity: able to harmonize data into a vendor-agnostic and data science-friendly format such that data can be consumed or pushed to any data target
  • (Bidirectionally) Chainable: The output of one integration must be able to trigger a subsequent integration. For example, an integration that pulls data from an ELN must be able to trigger an integration that pushes data to instrument control software, eliminating the need for manual transcription. The reverse may also be required: the latter integration (extracting data from the instrument control software) may need to trigger the former to push data back, submitting it to the ELN or LIMS.
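
A toy Python sketch of that chainable idea, with invented function names and event payloads (this is not TDP code): each step's output becomes the trigger for the next, so a Design-phase pull can cascade into a Make/Test-phase push.

```python
# Toy sketch of chaining: each step's output becomes the trigger payload for
# the next step. Function names and the event structure are invented for
# illustration; this is not TDP code.
from typing import Callable, Dict, List

def pull_sample_list_from_eln(event: Dict) -> Dict:
    print(f"Pulling sample list for experiment {event['experiment_id']}")
    return {**event, "samples": ["S-001", "S-002"]}

def push_to_instrument_software(event: Dict) -> Dict:
    print(f"Queuing {len(event['samples'])} samples in instrument control software")
    return {**event, "status": "queued"}

def run_chain(steps: List[Callable[[Dict], Dict]], event: Dict) -> Dict:
    for step in steps:
        event = step(event)      # completion of one step triggers the next
    return event

run_chain([pull_sample_list_from_eln, push_to_instrument_software],
          {"experiment_id": "EXP-42"})
```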

Why is Integration so Hard?

Challenges in building a true R&D Data Integration.

Integrations meeting the above requirements can be extremely difficult to build. Commonly-encountered challenges include:

  1. Binary and Proprietary File Formats. Prevalent in the R&D world. Software and instrument vendors often lock data into proprietary formats, obligating use of their proprietary software (often a Windows desktop application) for data retrieval, analysis, and storage 
  2. Unsupported Change Detection. Primitive (and often undocumented) interfaces can make it difficult to detect when new data is available (e.g., when an injection has been reprocessed and new results calculated), undermining efforts to fully automate the Design, Make, Test, Analyze (DMTA) cycle at the center of hypothesis-driven inquiry
  3. IoT-based Integration. Many common instruments (e.g., balances, pH meters, osmometers, blood gas analyzers, etc.) are not networked, provide no standard API for data retrieval, and may lack even the ability to export new data files to a shared drive. Smart, network-connected devices are required to retrieve data from such sources, eliminating the need for manual transcription
  4. ELN/LIMS Integration. Pushing experimental results back to the ELN and LIMS sounds conceptually simple, but can be extremely complex in practice. Data structures used by ELNs and LIMS can be highly flexible and variable, requiring deep understanding of specific use cases in order to make data consumable by the target application
  5. Data Harmonization. Data collection from disparate data silos with proprietary vendor data formats adds complexity beyond simply moving the data around. Harmonization requires data to be vendor-agnostic and available in data science compatible formats; this allows data to be consumed or pushed to any targets easily and lets it flow freely among all systems. Building these data models requires a deep understanding of instruments, data structures, data interfaces, and their limitations. It requires thorough review of how data is being used in any given scenario, to determine the right entity relationships to optimize for data consumption 
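
To make challenge #5 above concrete, here is a toy harmonization sketch in Python: two invented vendor-specific result shapes mapped onto one vendor-agnostic record, after which downstream code no longer cares which vendor produced the data.

```python
# Toy harmonization sketch: two invented vendor-specific result shapes mapped
# onto one vendor-agnostic record. All field names are made up.
def harmonize_vendor_a(raw: dict) -> dict:
    return {"sample_id": raw["SampleName"],
            "analyte": raw["Compound"],
            "peak_area": float(raw["Area"])}

def harmonize_vendor_b(raw: dict) -> dict:
    return {"sample_id": raw["sample"]["id"],
            "analyte": raw["sample"]["target"],
            "peak_area": raw["results"]["area_counts"]}

records = [
    harmonize_vendor_a({"SampleName": "S-001", "Compound": "caffeine", "Area": "10234.5"}),
    harmonize_vendor_b({"sample": {"id": "S-002", "target": "caffeine"},
                        "results": {"area_counts": 9876.0}}),
]

# Once harmonized, downstream analytics can treat both vendors identically.
print(sorted(records, key=lambda r: r["peak_area"]))
```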

There is a lot of hand-to-hand combat with each system that needs to be properly integrated. Building a new integration requires deep understanding of actual scientific use cases, how to detect changes, how to package data from the source system, how to harmonize the data to vendor-agnostic and data science compatible formats, and how to push the data in a generic and scalable manner to data consumers.

Challenges in Scaling Up the Number of Integrations

Building one integration is challenging. The typical complexity of a biopharma — including the sheer number and variety of systems used in its R&D — creates a brand-new challenge: how to scale integrations to hundreds of types of data systems and thousands of data system instances, keeping integrations productized and maintainable while continuing to meet the above stringent requirements for true integrations.

Such variations are inevitable in the real world. All the iterations required to deploy, test, and harden thousands of variant integrations make data integration one of the most painful challenges biopharma faces in its digitization journey. Here are some examples of frequently-encountered variations that must routinely be accommodated (a sketch of one way to tame them follows the list):

  1. Variations in technical interface (software SDK, OPC, file, SOAP API, RS232 serial port)
  2. Variations in data structure and schema
  3. Variations in report formats
  4. Variations in network and host environments
  5. Variations due to instrument makes, models and upgrades
  6. Variations in configuration reflecting specific use cases
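
One common way to keep such variations manageable (a sketch under assumptions, not how any particular product does it) is a registry keyed by technical interface and instrument model that dispatches each payload to the right parser. All names below are invented for the example.

```python
# Illustrative sketch only: a registry keyed by (technical interface,
# instrument model) that dispatches each payload to the right parser.
# All names are invented; no real product's dispatch logic is shown.
def parse_plate_reader_export(payload: bytes) -> dict:
    return {"kind": "plate-reader", "size_bytes": len(payload)}

def parse_balance_serial_frame(payload: bytes) -> dict:
    return {"kind": "balance", "size_bytes": len(payload)}

PARSER_REGISTRY = {
    ("file", "plate-reader-x100"): parse_plate_reader_export,
    ("rs232", "balance-m5"): parse_balance_serial_frame,
}

def parse(interface: str, model: str, payload: bytes) -> dict:
    try:
        return PARSER_REGISTRY[(interface, model)](payload)
    except KeyError:
        raise ValueError(f"No productized parser yet for {interface}/{model}")

print(parse("file", "plate-reader-x100", b"raw export bytes"))
```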

How Can We Possibly Pull This Off? The TetraScience Approach

As mentioned above, biopharma companies typically have hundreds of disparate kinds of data sources and targets, and may be operating many instances of each. Historically, biopharma has relied on two approaches for integrating these many data sources and targets together:

  • They contract with consulting companies, often producing one-off custom integrations between individual data sources and targets. Such custom integrations tend to be brittle: hard to maintain and impossible to scale; narrowly scoped to their initial use case or process
  • They call on ELN/LIMS providers to create integrations as “customizations” or as professional services projects tied to ELN/LIMS setup. This inevitably leads to integrations that are hard-coded to work only with that specific ELN or LIMS, resulting in data silos, vendor lock-in, and incomplete data extraction

Neither of these approaches has solved the fundamental problem with integrations in a scalable way. Why does TetraScience think we can do so?

We want to be the first to acknowledge that there is no magic or silver bullet. Integrating all the data sources and targets for a large biopharma R&D organization is hard work. However, TetraScience has evolved and is successfully implementing a process that breaks the integrations logjam and will enable biopharma organizations to better leverage the benefits of data, automation, cloud and analytics. 

We approach the challenge from five angles:

  1. Vendor agnostic and data-centric positioning 
  2. Cross-functional factory line process
  3. Product that is purpose-built for R&D data and for scaling productized integrations
  4. A team that is at the intersection of science, data science, integration and cloud
  5. Significant capital investment

Positioning: Vendor-Agnostic and Data-Centric

To build true R&D data integrations, vendor collaboration is crucial, so a vendor-agnostic business model and purely data-oriented positioning are critical. Tetra R&D Data Integrations connect products in the ecosystem without any agenda other than providing stewardship of data. At TetraScience we take this very seriously. We believe R&D data should flow freely across all data systems to ensure biopharma can get the most out of its core asset.

We stand by this core belief. We do not make hardware; we do not build informatics, analytics, or workflow applications. We feed data into whatever tools customers and partners want to use, whether that is Dataiku, DataRobot, Spotfire, Biovia, IDBS, Dotmatics, or Benchling — enhancing these tools’ capabilities and value. We collect and harmonize data from any data source, whether ThermoFisher, PerkinElmer, Agilent, or Waters instruments.

Process: Cross-Functional Factory Line

At TetraScience, we don’t just have one sub-team focusing on integration. Instead, we treat integration as a company-wide core focus, leveraging participation from the entire organization. Here are the stages of our integration factory line:

Prioritization, Use Case Research, and Prototyping

  • The core deliverable at this stage is to collect all inbound requests from our customers and partners and use this information to construct our short-, medium-, and long-term integrations roadmap
  • Our science team, all industry veterans, then begins investigation and outreach to vendors, gaining access to a sandbox environment and documentation in order to play with each new data source and prototype new integrations
  • One non-obvious but tremendously important task is to also document integration and system compatibility. For example, which version of LabChip’s exported file will we develop around, and how will scientists properly set up the instrument software such that the exported file is compatible with the integration we’re building? 

Build and Productization

  • This stage is handled and driven by our product and engineering team, who turn the prototype integrations into modular building blocks, which we also call artifacts. These can be agents, connectors, pipelines, Intermediate Data Schema (IDS), and other elements, all of which work together to enable an integration 
  • While building out these reusable artifacts, we sometimes discover that required capabilities stretch the current capabilities of our platform. In such cases, we use these as forcing functions to improve the engine or the substrate on which integrations run

Documentation and Packaging 

  • Integrations are then fully documented, helping users understand how to set up their instruments to best use our integration, how to leverage the resulting harmonized data structure, and how the integration handles detailed edge cases
  • Design decisions made to ensure that integrations can be robust and repeatable may, in some cases, limit user choice about how data sources and/or targets can be configured. In such situations, engineering decisions must be explained and documented transparently so that users understand and can adapt to them

Feedback and Further Collaboration with Customers and Vendors

  • Once the integration is deployed, customers often gain access for the first time to structured, clean data in data science-compatible formats. At this point, customers will often provide feedback on the integration based on their use cases 
  • Vendors will also often provide feedback to TetraScience, helping us minimize unnecessary API calls, manage performance limits of the source system, and implement other improvements

This factory line model ensures that integrations are prioritized to meet customer needs, engineered with partner feedback, and delivered, at high quality, in a repeatable, predictable way. Acquired knowledge and best practices are shared across our entire organization, and we leverage patterns and lessons learned from one integration to inform subsequent efforts.


External Facing Process: Ecosystem Partnership

Simply having an internal engine or process is not sufficient, since integrations depend heavily on customer use cases and integration endpoints, and on knowledge sharing with vendors.

Customer

We believe in working with biopharma organizations to identify the best integration strategy to set them up for long-term success in digital transformation. We often find that the right approach is not always the most straightforward.

For example:

  • Directly connecting to instruments like blood gas analyzers or mass balances seems simple, but may be suboptimal compared to using a control layer such as Mettler Toledo LabX or AGU Smartline Data Cockpit to add proper sample metadata and user/workflow information needed for full traceability
  • Plate reader control software such as SoftMax Pro can package data in Excel spreadsheet-based reports for parsing. Customizable reports, however, are subject to variation and user error, significantly increasing future maintenance overhead. For this reason, we will document how to set up the instrument to reduce the chance of manual mistakes
  • Scientists tend to save files in their preferred folder structures and usually without much convention or consistency. However, lack of a consistent, standard folder structure makes it very difficult for the rest of the organization to browse data. And the folder structure (folder names, etc.) itself may contain metadata important to placing data in context. We will share best practices and guide the customer to start defining their file hierarchy 
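
That last point is easy to see in code. If a customer adopts even a simple, hypothetical convention such as project/instrument/date/file, the path itself becomes a source of searchable metadata; the convention and field names below are invented for illustration.

```python
# Hypothetical convention: <project>/<instrument>/<YYYY-MM-DD>/<file>.
# With an agreed hierarchy, the path itself yields searchable metadata.
from pathlib import PurePosixPath

def metadata_from_path(path: str) -> dict:
    project, instrument, run_date, filename = PurePosixPath(path).parts[-4:]
    return {"project": project,
            "instrument": instrument,
            "run_date": run_date,
            "source_file": filename}

print(metadata_from_path("stability-study/hplc-01/2021-06-15/injection_0042.raw"))
```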

As we work with each customer, we outline and discuss trade-offs among different approaches and share with the customer what we have learned from the rest of the industry. We also work with customers to help them better articulate their data science and visualization use cases, which will in turn inform the right data model to harmonize their data. 

In collaboration, we turn a multi-year effort into bite-sized tasks, enabling consistent progress. Scientists who in the past may have been burned and burdened by hugely distracting IT projects that yielded little tangible benefit can now achieve immediate value and provide feedback.

Vendor

We believe in partnership with vendors, defining value, and creating abundance, together. 

As a vendor-neutral integration provider, TetraScience promotes forward-looking partners, introduces use cases to our partners, provides feedback to partners on their data formats and software interfaces, and introduces them to analytics-related use cases — for example, helping an instrument manufacturer leverage harmonized data and rapidly expand its offering into the application or analytics layer.

As a result, vendors leverage TetraScience to build out better data interfaces and establish their analytics and cloud strategy. Yes, their data can now be accessed by customers without their proprietary software: this can feel a little scary. But vendor-neutral data handling is already an unstoppable trend. On the other hand, because the vendor is typically the expert in their own data and use cases, they are well positioned to explore new possibilities in the visualization, analytics and AI/ML layer which they can build out on top of Tetra Data Platform. Instead of building their own vertical and vendor-specific data stack, vendors can now accelerate their efforts to create value in the application and analytics layer, while benefiting from a vendor-neutral integration layer (which can also provide data from other ecosystem products).

Product: Enabling the Factory Line

TetraScience’s focus on integration has also helped shape and influence our product design and strategy. Tetra Data Platform is crafted to help our team and customers create and maintain true R&D Data Integrations rapidly, at scale. Some related functionalities include:

  • Self-service data pipeline module - key to harmonizing data coming from data sources and pushing data to data targets
    • The data pipeline module enables data engineers to use Git and JupyterLab (or other familiar tools) to rapidly develop, test, publish, and upgrade integrations with their applications
    • Product features such as reprocess and error code categorization allow integrations to be tested, debugged, and upgraded, recognizing that software is rarely perfect the first time and a true R&D Data Integration needs to be battle-tested and continually improved (a minimal sketch of such a step appears after this list)
  • The TetraScience Smart Edge IoT Agent handles integrations at scale and integrates with instruments via serial port, also providing resilience against network downtime
  • Pluggable connector designs decouple connectors from TDP platform releases, such that end-to-end data flow within integrations is fully version-controlled, platform-version-independent, traceable, and easily validated
  • Command service lets authorized users, operators, and applications securely send instructions and commands to the connectors and agents such that they can push information into data targets, enabling computation on the cloud to communicate with on-prem execution software
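
To make the self-service pipeline module in the first bullet more tangible, here is a minimal sketch of the kind of task script a pipeline step might run; it is purely illustrative and does not reflect TDP's actual pipeline SDK or error-code scheme.

```python
# Hypothetical sketch of a pipeline-step task: read a raw export, emit a
# harmonized document, and return a categorized error code on failure.
# Illustrative only; this is not TDP's actual pipeline SDK or error scheme.
import json

def pipeline_step(raw_file_path: str) -> dict:
    try:
        with open(raw_file_path, encoding="utf-8") as handle:
            raw = json.load(handle)          # pretend the raw export is JSON
        harmonized = {"sample_id": raw["id"], "result": raw["value"]}
        return {"status": "ok", "output": harmonized}
    except (OSError, KeyError, json.JSONDecodeError) as exc:
        # Categorized error codes make failed files easy to find and reprocess.
        return {"status": "error", "error_code": type(exc).__name__}

# A missing file simply produces a categorized error instead of crashing.
print(pipeline_step("example_export.json"))
```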

Team: Intersection of the Four

Behind the positioning, process, and product is a team dedicated to this mission and vision. Anyone who has tried to tackle this problem will quickly appreciate the diverse expertise needed to get the job done. Imagine the skill sets behind just a single integration, the Waters Empower Data Science Link (EDSL), covered in our video and blog post:

  • Science. Deep understanding of HPLC and chromatography, and of Empower and its toolkit, to extract the data and harmonize the chromatography data into intuitive, vendor-agnostic models
  • Integration. Deep understanding of data integration patterns to design for speed, resilience and configurability
  • Cloud. Deep understanding of cloud infrastructure to architect a scalable and portable platform 
  • Data Science. Deep understanding of use cases to provide optimized partitioning, indexing, and query experiences, plus example data apps to help biopharma data scientists jumpstart their own analytics and visualization projects

Deep expertise introduced from these four kinds of backgrounds, coupled with obsessive attention paid to training and knowledge transfer, forges a unicorn team that is positioned to tackle such a challenge. 

As a reference, more than 60% of the team driving integration design and data modeling hold life sciences-related Ph.D.s. They have a deep understanding of scientific processes and data engineering/science, and hands-on experience using these scientific instruments and informatics applications.

Capital: Powering the Vision

To scale the number of productized integrations, the organization must be obsessively focused on the integrations and all the nuances associated with building a true R&D Data Integration.

This necessitates a big financial commitment over a long period of time, since TetraScience will need to dedicate significant resources to investigate, prototype, build, document, maintain, and upgrade each integration. Scale also affords us the luxury of remaining pure-play and focusing only where we are uniquely positioned to deliver value. That’s part of the motivation and thesis of our most recent funding round. To read more about our funding announcement, please see: Announcing Our Series B: The What, When, Why, Who, and Where.

Some Last Thoughts

True R&D Data Integrations are the foundation for life sciences organizations to digitize their workflow and move towards a more compliant and data-driven R&D process. These integrations are the lines that connect the dots; they form the network that all participants can leverage and benefit from by focusing on their unique science and business logic. Integrations define the substrate where innovation can thrive based on the harmonized data sets that flow freely across whole organizations.

Each integration can be challenging, and hundreds of such integrations represent a challenge this industry has never solved before. Without true R&D Data Integrations as the connective tissue, scientific R&D processes move at a pace significantly slower than other industries and fail to keep up with the demands of discovering, creating, and marketing life-changing therapies.

We believe it’s the right time to solve the integration challenge, once and for all.

R&D Data Cloud

Part 2: How TetraScience Approaches the Challenge of Scaling True R&D Data Integrations

Integrations are challenging. Here's how TetraScience meets the challenge: by leveraging community, disciplined process, product architecture, a multi-skilled team, and capital investment. Part 2 of a series.

Spin Wang

The DMTA Loop

Perhaps you learned the scientific method in high school: problem, hypothesis, experiment, analyze data, draw conclusions. Industrial scientists across the scientific and technological landscape have streamlined this process into the Design, Make, Test, Analyze (DMTA) Loop. You start with a design, run your experiments, collect and test samples, then analyze the data and use the insights derived to iterate on the next design. Electronic Lab Notebooks (ELNs) and Lab Information Management Systems (LIMS) are arguably two of the most crucial systems used in the Design and Analyze phase of the DMTA cycle.

Let’s craft an analogy: an ELN is like a storybook, with a distinct beginning, middle, and end. Often, though it depends on your domain, you begin with a picture of what you hope to achieve. For chemists, it’s a reaction scheme with the predicted end product. For biologists, perhaps it’s a specific animal model of disease or a protein structure. For pharmacologists, it’s a target assay or endpoint.

Because ELNs are “experiment-oriented,” they focus on experimental results and function like an “Evernote for science labs.” They are generally more unstructured and flexible, allowing the end user to dictate how to structure their work and affirm success.

A LIMS, on the other hand, is “just the facts, ma’am”: it’s sample-oriented and focuses more on barcode tracking, workflow management, and handling analytical results performed on groups or lists of samples. A LIMS typically enforces a rigid structure and is less free-form than an ELN.

Data Sources and Data Targets

Let’s zoom out and consider the informatics landscape in biopharma R&D. Few would disagree that challenges begin to mount: data complexity increases and systems fragment. Workflows are held together with brittle, point-to-point integrations. Labs and even individual scientists produce heterogeneous data streams. To introduce some order into this otherwise chaotic data landscape, we here at TetraScience tend to apply a mental model to every R&D workflow: we move information and data from data sources to data targets.

Data sources are where valuable R&D data are produced (or entered) by scientists and lab instruments. Some common data sources include:

  • Scientific instruments and control software such as Chromatography Data System (CDS) and lab robotics automation systems
  • Informatics applications like Electronic Laboratory Notebooks (ELNs) and Laboratory Information Management Systems (LIMS)
  • Outsourced partners like CROs (research) and CDMOs (development and manufacturing)

Data targets, on the other hand, are systems that consume data to deliver insights and conclusions or to generate reports. There are roughly two types of data targets:

  • Data science-oriented applications and tools, including visualization and analytics tools, data science tools, and AI and machine learning tools such as TensorFlow and platforms such as those provided by AWS, Google, and Microsoft 
  • Lab informatics systems, including Lab Information Management Systems (LIMS), Manufacturing Execution Systems (MES), Electronic Lab Notebooks (ELNs), and instrument control software, such as Chromatography Data Systems (CDS) and lab robotics automation systems. Interestingly, these data targets are also treated as data sources in certain workflows

The Achilles’ Heel of ELN and LIMS

Biopharmas make significant investments when selecting ELN/LIMS. It pays dividends to optimize the scientists' experience when using these solutions.  

However, one of the major challenges impeding scientific ROI is the tedious busywork of capturing experimental workflow from, and getting data into, the ELN/LIMS. For example:

  1. Scientists must manually transcribe design parameters or sample information into instruments or instrument control software
  2. Scientists need to manually enter experimental data into the ELN/LIMS and very often struggle to deal with binary, unstructured, or semi-structured data

In summary, major pain points include moving information from the Analyze/Design phases to the execution (Make/Test) phases of the DMTA loop, and back again.

When compliance requirements or high data-integrity expectations apply, this problem manifests even more severely, leading to second- or third-scientist reviews. When high-throughput experimentation (HTE) is involved or large files are produced, the problem becomes unmanageable due to the sheer size of the data and the number of operations needed. Imagine spending an hour to run your assay and then 2-3x more time to collect, move, transcribe, upload, and review the same data!

That’s why using Tetra Data Platform (TDP) and ELN/LIMS together is the growing trend in biopharma’s digital lab stack — enabling a true “lab of the future” or “digital lab” experience. The scientists design their experiment in ELN/LIMS, bring their experimental design or sample list to the lab for execution, and then get the data back into the ELN/LIMS without worrying about the “plumbing.” Very often, biopharma uses multiple ELN/LIMS depending on the department or data streams. TDP serves as the data exchange layer for multiple ELN/LIMS while remaining invisible to the scientists. We provide background data stewardship: harmonizing, validating, and moving the data to facilitate the Design, Make, Test, Analyze (DMTA) loop at the core of every scientific process. 

ELN/LIMS as Data Source: From Design to Execute

Just like almost everything else in life, execution starts with planning and design. The experimental work starts as an assay design or as a request to measure a list of samples (sometimes called a batch). Using TDP, scientists can automatically send experimental designs and transfer batch information or sample lists to instrument control software like Waters Empower, ThermoFisher Chromeleon, or AGU SDC.

ELN/LIMS as Data Target: From Execute to Analyze

Once the scientist finishes executing an experiment, TDP can automatically collect, harmonize, and validate the data, and then push it into the ELN/LIMS. Sometimes, scientists will search for and retrieve the data from the TDP and then import it into the ELN/LIMS to start their analysis.

Benefits of a Combined Solution

Save time and ensure data integrity. Leveraging TDP as the data exchange layer between analysis and execution eliminates manual input of design parameters or sample identifiers to instrument control software as well as the need to manually enter data back into the ELN/LIMS, saving scientists valuable time, preventing compliance issues related to manual transcription, and eliminating tedious second- or third-scientist review by enabling a validated workflow. 

Enable data science. Importantly, the automated integration allows ALL scientifically relevant data to be collected, stored, and queried through the TDP, enabling other data science-oriented tools to leverage the experimental data and providing rich data sets for future analysis (data that scientists may wish to ingest into their ELN and LIMS).

Project agility and time-to-value. With TDP focusing on the data (exchange) layer, biopharma organizations now have one throat to choke in terms of bridging the gap between data sources and targets, simplifying project complexity, and shortening time-to-value. TetraScience is responsible for maintaining and upgrading productized integrations and moves in conjunction with ELN/LIMS vendors during rollout. 

But don’t ELN/LIMS already do this?

Our biopharma clients and partners often ask us: “Why wouldn’t an ELN or LIMS just interface with instruments directly? Wouldn’t that be more efficient?”

If you work for an “early-stage” organization with limited data, a lab under construction, or your assays and syntheses change often, this may not be a bad idea. In fact, many ELN/LIMS providers enable features for tabular data import...or your scientists can just manually transcribe the data into/out of ELN/LIMS. 

However, if your life sciences company has reached a certain scale — remember, connecting data sources and targets is not the core focus of an ELN/LIMS — very obvious disadvantages will surface when ELN/LIMS are the “connective tissue” of the data layer:

  • ELN/LIMS typically only have file capture and tabular data import features; they are not designed to handle sophisticated integrations that require interfacing with software (e.g., Waters Empower), with non-networked devices (e.g., blood gas analyzers), or parsing and storing semi-structured (e.g., plate reader reports) and binary data (e.g., mass spec files)
  • Without utilizing productized, off-the-shelf integrations, time-to-value during initial deployment (and subsequent upgrades) can be dramatically increased and becomes heavily reliant on systems integrators or professional service personnel
  • The other side of the equation — data targets — still must be addressed. ELN/LIMS are great at being data targets, but may not possess or enable sophisticated connections to visualization, analytics, or automation software
  • All of the above can lead to slowed cycle times around the DMTA — exactly the opposite of what lab automation and digitalization are meant to do!
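To illustrate the first point about semi-structured output: even a simple plate reader report mixes run metadata with a plate-shaped block and needs real parsing, not just tabular import. The sketch below parses an invented report layout; it does not correspond to any specific vendor's export format.

```python
# Hypothetical plate-reader report: metadata lines followed by a plate block.
report = """Protocol: Bradford assay
Read time: 2021-06-01 14:02
Plate: 96-well
,1,2,3
A,0.113,0.215,0.308
B,0.119,0.221,0.299
"""

def parse_report(text):
    """Split the report into metadata and per-well readings."""
    metadata, readings = {}, {}
    rows = text.strip().splitlines()
    plate_start = next(i for i, line in enumerate(rows) if line.startswith(","))
    for line in rows[:plate_start]:
        key, _, value = line.partition(":")
        metadata[key.strip()] = value.strip()
    columns = rows[plate_start].split(",")[1:]
    for line in rows[plate_start + 1:]:
        cells = line.split(",")
        for col, value in zip(columns, cells[1:]):
            readings[f"{cells[0]}{col}"] = float(value)
    return metadata, readings

meta, wells = parse_report(report)
print(meta["Protocol"], wells["A1"])
```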

See It in Action

Every day, more ELN/LIMS providers leverage TetraScience to gain a foothold in the cloud-first data landscape and partner with us to drive synergistic product value between our systems.  

Linked below are documents describing real-world data integration examples with industry-leading providers, including bidirectional (pull/push) data exchange, showing the full DMTA cycle brought to life and written in data.

R&D Data Cloud

Unleash the Power of Your ELN and LIMS by Automating Lab Data Flow Together with Tetra Data Platform

In this article, we share fundamental drivers behind the growing trend of combining the relative strengths of the Tetra Data Platform (TDP) with ELN/LIMS and learn how 1 + 1 = 4 for any biopharma organization looking to accelerate its R&D.

Mike Tarselli, Ph.D., MBA

What does it take to implement an end-to-end, cloud-native solution for unlocking and extracting value from your R&D data? Whether you are a decision maker, scientist, data scientist, or in R&D IT, you play a critical role in accelerating discovery and shortening time-to-market for new therapeutics.

TL;DR If you are performing one of the above roles, we invite you to read this whitepaper and explore why traditional approaches to solving the problem of data proliferation, data silos, and end-to-end data and lab automation frequently fail or miss the mark. You will learn how an R&D Data Cloud offers a better approach to scientific data management and value extraction. 

You know why R&D data is important

R&D data represents a huge and largely untapped store of value for biopharma organizations. Mastering data accelerates discovery: it lets you do science better, collaborate more smoothly, broaden your pipeline, clear regulatory hurdles, and get new therapeutics to market faster. It lets you leverage and/or build applications — including AI/ML — that grant scientific and strategic insight, enable course-correction, improve quality assurance, and help automate away manual steps at the bench.

Here’s what you need to know to tap its value

Doing this, however, requires a new category of solution that treats the meaning and significance of R&D data as a priority, gives data a genuine life cycle, and then takes responsibility for that life cycle (for ensuring data’s validity and helping extract its full value) end-to-end. Enter: the R&D Data Cloud.

So what’s in it for you: Benefits of an R&D Data Cloud

Our new whitepaper, aptly called What is an R&D Data Cloud?, identifies four characteristics mandatory for such a solution: it must be R&D-focused, data-centric, cloud-native, and open.

The table below highlights how a purpose-built solution for life sciences R&D fuels innovation and what significant gains can be achieved based on your organizational role: 

The take-away  

Whatever your biopharma R&D focus, if you’re seeking to derive maximum long-term value from R&D data, please check out our whitepaper, What is an R&D Data Cloud? And if you’re ready to see a real R&D Data Cloud in operation and talk more about how it can help you accelerate discovery, please contact us for a demo.

Decision Maker
  • Accelerates time-to-value: purpose-engineered platform delivers benefits quickly; reduced data wrangling speeds time to market and extends time under patent
  • Reduces toil, enables collaboration: less data wrangling with harmonized data; smooth collaboration with CROs/CDMOs
  • Aids compliance: regulatory guardrails built in; all transactions auditable

Scientists and Data Scientists
  • Accelerates time-to-insight: more time exploring data, less time wrangling; harmonized data reveals hidden insights
  • Makes data FAIR: use your own tools and benefit from OOTB integrations; faster, easier search
  • Increases data integrity: eliminates human error; facilitates data validation

R&D IT
  • Accelerates implementation: get going fast with a complete solution; robust, productized integrations speed scale-out
  • Future-proofs data strategy: focus on building strategic applications; cloud scalability
  • Simplifies operations: resilient architecture; built-in monitoring
R&D Data Cloud

What is an R&D Data Cloud? (And Why Should You Care?)

What does it take to implement an end-to-end, cloud-native solution for unlocking and extracting value from your R&D data? A guide for decision-makers, scientists, and R&D IT.

TetraScience

What does replatforming to the cloud have to do with Jaws?

Last month, Tetra’s Chief Scientific Officer, Mike Tarselli, joined John Conway (Science and Technology Visioneering, 20/15 Visioneers) for a webinar to discuss the growing momentum in life sciences R&D to replatform experimental data to the cloud. The two debated challenges surrounding data generation and management, explored the need for future-proofed, data-centric approaches to digital transformation, and made the case for why FAIR (Findable, Accessible, Interoperable, and Reusable) data and processes have become de facto industry standards.

The webinar focused on digitalization and cloud replatforming initiatives, time spent on manual data wrangling and lab connectivity, and the reality of data FAIRness in life sciences R&D. In addition to an interactive debate between speakers, audience engagement was encouraged, and attendees’ responses provided key insights into the current state of their digital data initiatives and highlighted why implementing these initiatives now is more critical than ever.

The blog highlights these insights a little further down — you can also check out the webinar recording.

But before we dive too deep into the vast ocean, some context...

Data Explosion in Drug Discovery

“Big Data” was first coined by John Mashey in 1987 to describe large volumes of information. As technology has advanced since the late 1900s (to my fellow Millennials, take a beat and let “late 1900s” sink in for a moment…), the velocity of data generation has exploded across industries, especially in healthcare and life sciences. As Dell EMC reports:

“Healthcare and life sciences organizations have seen an explosive healthcare data growth rate of 878% since 2015.”

It’s estimated that by 2025, over 175ZB (zettabytes) of data will be generated worldwide. It’s a number so big that wrapping your brain around it, and truly understanding just how big it is, is pretty challenging.

That’s why, when researching how big the “Big Data” explosion in life sciences really is, I was delighted to stumble upon the National Human Genome Research Institute’s fact sheet on Genomic Data Science, where they equate the volume of genomic data being generated to copies of the movie Jaws. Yes, Jaws - Jaws of “scaring children out of the ocean” fame. That one…


NHGRI states that a single human genome = 200GB of data = ~200 copies of Jaws. 

Genomic data, however, is only one piece of the puzzle. Experimental data generated in life sciences runs the gamut of small and large molecules, imaging, proteomics… the list goes on. However, connecting all these disparate data is a challenge due to the fragmented ecosystem of instrumentation and systems in life sciences, which perpetuates data silos. In turn, these data silos cause significant impediments to innovation in R&D - oy vey.

As drug and therapeutic discovery moves more and more into precision medicine territory, other heterogeneous, disparate data will be thrown into the mix, for example:

  • Clinical trial data
  • EHRs (Electronic Health Records)
  • Smartphone and wearable data
  • Socio-demographic data

That’s a lot of data. 

A lot of data from a vast array of instrumentation, software systems, and informatics applications. 

A lot of data locked in vendor-proprietary data silos, requiring manual point-to-point integration. 

A lot of data that requires manual wrangling to standardize for more advanced analytics. 

So how do life sciences R&D organizations plan for and manage all the considerations around the V’s of Big Data and ensure the FAIRness of their data and processes, all while future-proofing their investments and accelerating time-to-market? 

Well...

We're Gonna Need a Bigger Boat



All this Jaws talk got us thinking - so we sharpened our pencils and did a bit of our own arithmetic. 

We asked: If 200GB of data = ~200 copies of Jaws, how many copies of Jaws = 175ZB of data?

The answer: 175 Trillion copies of Jaws

(Fun fact: 175 trillion copies of Jaws equals 8 quadrillion Great White Shark teeth. Give us a shout out the next time you win trivia at your local watering hole).
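For anyone who wants to check our arithmetic, here it is in a few lines of Python, using decimal units (1 ZB = 10^21 bytes) and taking one copy of Jaws as roughly 1 GB, per the NHGRI equivalence above:

```python
# Back-of-the-envelope check of the Jaws arithmetic above.
BYTES_PER_GB = 10**9
BYTES_PER_ZB = 10**21

gb_per_copy = 200 / 200          # 200 GB ~= 200 copies of Jaws, so ~1 GB per copy
total_zb = 175                   # projected global data volume by 2025

copies = total_zb * BYTES_PER_ZB / (gb_per_copy * BYTES_PER_GB)
print(f"{copies:,.0f} copies of Jaws")    # 175,000,000,000,000 -> 175 trillion

# The "8 quadrillion teeth" figure implies roughly 46 teeth per copy:
print(f"{8e15 / copies:.0f} teeth per copy")
```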

Now, we’re not saying that all of the 175ZB of data generated will be experimental R&D data by 2025, but as data generation continues to explode into a massive tsunami, who’s to say that in the next 5 to 10 years, the number won’t be close to it?

Organizations need to ask themselves critical questions as they begin to map out and implement a future-proofed, holistic, data-centric framework to harness the power of experimental R&D data. Questions such as:

  • How, and more importantly where, will you store all of that data?
  • How do you plan for and find the resources necessary for storage and compute?
  • How will you access actionable, high-quality data while ensuring security and compliance and simultaneously facilitating collaboration?
  • How do you future-proof your investment today, tomorrow, and months or years from now?

Data proliferation, the recent explosion in data creation and the efforts to store and manage the data, has become an all too familiar challenge in life sciences R&D as the volume of both structured and unstructured data continues to grow exponentially. 

So how can organizations within life sciences not only manage data proliferation, but actually benefit from the data explosion? 

The Cloud Replatforming Movement in Life Sciences R&D

Simple — use cloud-native technologies and solutions that address challenges with a “data-first” mindset and replatform experimental R&D data to the cloud.

As mentioned previously in the post, we polled the audience to get a better understanding of their thoughts on the data landscape, data maturity, and digital data initiatives within their organizations... and we’re sharing some of them below!

For deeper insights to sink your teeth into (har har) and to watch an interactive Q&A session, check out the webinar recording.

Has your organization implemented a digital strategy?

33% of the audience has only begun conceptualizing a digital strategy within their organizations but has yet to implement one. 

There are many moving pieces to plan, build, and implement in a long-term, data-centric digital strategy. Not only are there technical considerations, but you also have to rally your organization and gain cultural buy-in. 

Where does the cloud fit into your strategy?

0% of attendees are committed to being totally on-prem and have no interest in replatforming to the cloud. 

Clearly, on-prem has “jumped the shark.” It’s costly, requires lots of resources to run day in and day out, and perpetuates data silos. Cloud-native solutions improve collaboration, reduce costs, accelerate innovation, and enhance security and compliance. The COVID-19 pandemic made it abundantly clear that facilitating collaboration across hallways, organizations, and the globe speeds discovery.


How much time do you currently spend creating/maintaining your own data ingestion/integration/connectivity?

70% of the attendees spend most, or almost all, of their time creating and maintaining data ingestion, integration, and connectivity.

What do you seek to gain from easily accessible and dynamic data? When the focus is shifted away from manual data wrangling, time can be spent focused on advancing scientific discovery — and isn’t that the whole point?

Do you understand the concept of a productized integration? Aware of the benefits?

58% stated that productized integrations are fundamental to a future-proofed data strategy. 

True connectivity in life sciences integrates the fragmented ecosystem of the latest technologies, solutions, software, instrumentation, and even systems still running on Windows 95. By standardizing and productizing integrations, the larger scientific community benefits from the time and resources repurposed to more meaningful scientific work.

A Tetra Data Integration has to clear a very high bar even to be considered: it must be able to automatically acquire, harmonize, centralize, and prepare the data, enable data provenance, and then ultimately push data back into designated targets, like ELN/LIMS systems. 

What percentage of the data in your organization are considered FAIR? And what percentage of your data can you access today?

0% of attendees said all of their data could be considered FAIR. And 0% have full access to their data.

Ahh...FAIR data, (Findable, Accessible, Interoperable, Reusable). We need it, we want it, we GOTTA have it.

Back in 2020, John guest-authored “The Science Behind Trash Data” blog, where he stated that up to 80% of data produced is trash - i.e., not FAIR. This shines a spotlight on the need not only to build foundational frameworks for data generation, but also to contextualize data and to automate the full R&D data life cycle (acquisition, harmonization, centralization, and preparation for data science and advanced analytics).

What is the benefit of data if it cannot be found, easily accessed, or reused?  

Summary

It’s clear the movement to replatform experimental data to the cloud is gaining serious momentum in life sciences R&D. Organizations are beginning to think strategically about their data initiatives and implement long-term, data-centric approaches to their scientific workflows - and the cloud plays a major role. Benefits of replatforming include future-proofed investments, accelerated time-to-market,  enhanced security and compliance, and the facilitation of collaboration. 

However, the cloud can only close the gap so much - what is imperative is the organizational and cultural understanding and buy-in as to why replatforming now is so critical. Allow us to leave you with a thought:

“It’s time to take a stance on it. Start taking a real prescriptive approach on your data integrity, on your FAIRness, on your openness in terms of connectivity and yet, maintaining security, then we can all make this work.” - Mike Tarselli, Ph.D., MBA, Chief Scientific Officer, TetraScience

Don’t be scared of the ocean, jump in - we've got you.

Watch the full recording and interactive discussion on the "Why Life Sciences Organizations are Replatforming Their Experimental Data to the Cloud" webinar page.

R&D Data Cloud

Experimental Data in Life Sciences R&D — It’s How Many Copies of Jaws?!

Key takeaways from the “Why Life Sciences R&D Organizations Are Replatforming Their Experimental Data to the Cloud” webinar with 20/15 Visioneers.

Victoria Janik

The essential feature of an R&D Data Cloud is that it gathers all your data in one place: enriching it with metadata, tags, labels, etc., and parsing and harmonizing it using consistent schemas. That’s what gives Tetra Data Platform the power to act as a single source of truth: aligning self-similar data so you can compare and analyze it.

The trick is making that both powerful and simple. The 3.1 release of Tetra Data Platform debuts a reorganized search interface sporting (among other improvements) a Google-like full-text search feature that makes it fast, easy, and intuitive to query the data lake, explore and download results, and/or quickly iterate: useful, for example, in developing performant queries for TDP’s more-technical search tools.


Illustration 1. Full-text search mode is engaged by flipping a toggle, upper right.


By default, the search screen comes up ready to accept typed queries in TDP’s standard search syntax, based on ElasticSearch’s built-in query_string search capability (a good balance of simple and powerful, but sometimes daunting to neophytes). Full-text search can be turned on by flipping a toggle on the search screen (see Illustration 1), then typing into the search box. You can find full, partial, or approximate matches, leveraging your “Google-fu” directly to explore the TDP data lake. Searches can also be refined with Filters, applied using a popdown Options dialog (see Illustration 2).
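Since the standard syntax is based on ElasticSearch's query_string capability, a feel for what such a query looks like may help. The example below is generic Elasticsearch query DSL expressed as a Python dict, with invented search terms and field names; it illustrates the syntax, not the exact query TDP issues.

```python
import json

# A generic Elasticsearch query_string query: boolean operators, wildcards,
# and fuzzy (approximate) matching. Field names here are hypothetical.
query = {
    "query": {
        "query_string": {
            "query": "hplc AND (caffeine OR caffiene~1) AND filename:*.raw",
            "default_field": "*",
        }
    }
}
print(json.dumps(query, indent=2))
```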

Matched results are displayed below (by default), with matches highlighted. You can drill down into returned data items directly, examining schema and values, download search results in .csv format, etc. The new release even lets you preview common stored-data formats (e.g., .pdf) right in the webUI. So you have everything you need to get work done, fast.


Illustration 2. Access filters to refine search by basic metadata, attributes, IDS or schema details.


As you work, your search phrases and filters are combined in an ElasticSearch Query Language query, which you can review, modify, and re-execute directly (see Illustration 3). All TDP search tools — from the simplest to the most powerful — are thus now linked together to enable a natural workflow: useful for ad-hoc checks, quickly browsing and extracting records, learning the more-technical methods of searching, or for developing well-formed queries for programmatic application via the API.
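For readers unfamiliar with how text search and filters typically combine in Elasticsearch-style queries, here is a generic sketch of the pattern: the full-text clause lives under must and exact constraints live under filter. The field names are hypothetical and not necessarily TDP's actual index mapping.

```python
# Generic Elasticsearch bool query combining full-text search with filters.
# Field names below are illustrative, not TDP's real schema.
combined = {
    "query": {
        "bool": {
            "must": [
                {"query_string": {"query": "plate reader AND absorbance"}}
            ],
            "filter": [
                {"term": {"source.type": "file"}},
                {"range": {"createdAt": {"gte": "2021-01-01"}}},
            ],
        }
    },
    "size": 25,  # page size for returned results
}
```

A query in this shape can be reviewed, tweaked, and re-run directly, which is what makes the hand-off from interactive search to programmatic use via the API natural.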

Illustration 3. Searching and defining filters builds up an ElasticSearch Query Language query, which you can then directly review, modify, and execute.



For more on TDP’s growing system of search features, please see our documentation.

New TDP operations tools

Multiple enhancements to search and new EQL affordances make TDP easier and faster for scientists and developers to use. They are complemented by several quality-of-life improvements for TDP administrators and operators.

Cloud Configuration: We want TDP to deliver value quickly, which means connecting instrument connectors to agents efficiently. With release 3.1, admins can now select a connector and create its agent in one place, instead of going elsewhere (i.e., the Data Hub, Data Management, or User-Defined Integration screens) to do so. On the same screen, admins can also view agent status and details. Fewer clicks and context switches let you wire things up faster, with fewer missteps.

System-wide health/status display: Administrators (and users) can now view the health and performance of TDP components and pipelines from a set of tabbed dashboards. Real-time updates mean you can easily locate and intervene if agents, connectors, or data hubs are stressed or offline, and quickly ascertain why pipelines are running slowly. 

Design refresh and docs update

TDP version 3.1 updates aren’t all profound — some are just skin-deep, like a newly-refreshed webUI design with contemporary branding and improved information layout. Our documentation has also been improved, both cosmetically and in terms of discoverability: finding information has never been simpler.

Please update

We encourage TDP users to update as soon as practical, to take advantage of new search and operations features (plus bug fixes). Among the latter, version 3.1 of TDP updates npm http-proxy-agent, recently discovered to have a vulnerability to man-in-the-middle attacks.

For more information on TDP Release 3.1, please consult the release notes. To see TDP in action, request a demo.

Find Experimental Data Faster with Google-Like Search in Tetra Data Platform 3.1 Release

Release 3.1 of Tetra Data Platform adds easy-to-use full-text search, cloud configuration, and a system-wide health dashboard.

John Jainschigg

Introduction

In 2007, Nokia ruled the mobile phone market as its category leader. Their flip phones, among the first to support text messaging and limited video, were a worldwide phenomenon. And yet, Nokia barely registered an emerging threat: Apple was about to release the iPhone, a service-modeled platform that depended on multiple application developers sharing and co-existing in a more open digital landscape. Within six years, Apple outsold Nokia 5-to-1, and the iPhone went on to be the fastest-adopted device in history. 

We foresee a similar paradigm shift in life sciences R&D when comparing legacy, silo-prone data storage and management to an open, cloud-native, data-centric model. In this blog, we will compare life sciences’ traditional data management choice (SDMS) to the emerging R&D Data Cloud. 

Let’s first understand what an SDMS is and what an R&D Data Cloud is.

What is an SDMS, anyway?

An SDMS imitates a filing cabinet using software. It captures, catalogs, and archives all versions of a data file from a scientific instrument -- HPLCs, mass specs, flow cytometers, sequencers -- and from scientific applications like LIMS, ELNs, or analysis software. Specialties of an SDMS include rapid access, compliance with regulatory requirements, and specific data management workflows. It’s certainly an improvement over the ‘90s-era process of keeping your quality team running on paper reports!

Essentially, an SDMS is a file server and data warehouse. Unfortunately, for many organizations an SDMS ends up as a "black hole" for experimental data that's carelessly tossed in and quickly forgotten.

What is an R&D Data Cloud?

The R&D Data Cloud embraces several fundamentally different design and architectural principles compared to SDMS. 

First, it is data-centric: instead of focusing only on storing the data, it addresses the full data life cycle and full-stack data experience, including data acquisition from disparate sources, harmonization into vendor-agnostic formats, processing and pipelining to facilitate data flow, and preparation and labeling for data science.

Data does not get stuck here; the goal is to present and deliver the data to the best-of-breed applications where it can be analyzed.

Second, it is cloud-native. Instead of treating the Cloud as just another data center, it leverages cloud services natively for traceability, scalability, security, and portability into private clouds. 

The Problem

Let's compare the two by first understanding the challenges life sciences R&D is facing. 

Data volumes in biology, chemistry, materials science, agriculture, and other common R&D organizations continue to explode. The zettabyte era of total data, which began in 2016, will have expanded 175-fold by 2025. Systemic complexity, data heterogeneity, and interface discrepancies increase as biological targets become tougher to drug and regulations multiply. Rising complexity and data volume present difficulties for legacy systems - outmoded data structures, physical storage limitations, and equipment obsolescence contribute to a trend we’ve observed: R&D organizations are moving away from Scientific Data Management Systems (SDMS) and toward cloud-based or cloud-native solutions.

Now consider the current capabilities of Scientific Data Management Systems (SDMS) and how perceived limitations might impair critical innovations instead of facilitating an open, interactive data future.

#1: Integration issues

SDMS solutions traditionally rely on files exported from instrument control/processing software. However, labs are full of data sources that are not file-based; for example, blood gas analyzers, chromatography data systems, etc. These data sources require sophisticated, non-file-based connections such as IoT-based or software-based programmatic integration.

Below is a short analysis of file-based integration used in SDMS solutions versus programmatic integrations used by the Tetra R&D Data Cloud.

 
In the comparison below, “Tetra R&D Data Cloud” means the Tetra Data Platform plus software agents and productized integrations.

Completeness
  • File-based integration (SDMS): File export lacks information needed to enable sophisticated AI/ML, such as audit trail, method, and comprehensive raw data.
  • Tetra R&D Data Cloud: Collects everything we can extract from the source system, including raw data, logs, method, processing results, and others.

Automation and human error
  • File-based integration (SDMS): Scientists must initiate and configure exports on the spot. These manual operations are subject to variability, typos, and human error.
  • Tetra R&D Data Cloud: Cloud-native agents and data connectors communicate with data sources programmatically, capturing comprehensive data sets and detecting changes on update.

Change detection
  • File-based integration (SDMS): Requires manual export. If scientists forget to export the data, changes remain stuck in the source system and will not be reflected in analytics or reporting.
  • Tetra R&D Data Cloud: Automatically detects changes using audit trails, timestamps, and other mechanisms to ensure data are always current and updates are captured.

Bidirectionality
  • File-based integration (SDMS): Files are “unidirectional”: one may export data from the instrument software, but files typically cannot instruct automation software in how to run the experiment.
  • Tetra R&D Data Cloud: The Tetra Data Platform permits third-party systems to send commands and instructions to the instrument or instrument control software via the agent.

#2: Scaling issues

Once the data is brought into the SDMS, the SDMS faces a scaling issue. None of the existing SDMS solutions are cloud-native; they are largely on-prem or use cloud storage simply as another data center. This poses two major challenges:

Challenge 1: Difficulty upgrading and maintaining on-prem SDMS

An on-prem SDMS typically requires multiple modules: Oracle databases, application servers, and file storage. Every upgrade requires changes in every module, which is a non-trivial effort. 

A cloud-native architecture enables fully automated upgrades and deployments via infrastructure as code and rolling updates, and it continues to benefit from the innovations introduced as these infrastructure platforms evolve.

Challenge 2: Big R&D data

Did we mention zettabytes? As R&D becomes more complex, chemistry and assay outputs on the 100MB scale are giving way to proteomes, genomes (GB), images (TB), and exome maps (PB). Since cloud-native solutions rely on the elasticity and scalability of cloud storage, R&D organizations no longer need to worry about expansion or tedious upgrades.

Cloud-native infrastructure allows tiered storage, providing more flexibility to classify the storage layer to optimize for cost and availability. If you’re not going to access microscopy images for a while, park them in Glacier instead of purchasing extra server racks or hard disks.
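The "park it in Glacier" idea is ordinary cloud plumbing. As a minimal sketch (not TDP configuration), an S3 lifecycle rule set with boto3 might look like the following, assuming a hypothetical bucket and prefix:

```python
import boto3

# Minimal sketch of tiered storage: transition cold microscopy data to
# Glacier after 90 days. Bucket name and prefix are hypothetical; this is
# the underlying cloud capability, not TDP configuration.
s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-lab-raw-data",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-old-microscopy-images",
                "Filter": {"Prefix": "microscopy/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```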

#3: Modern R&D requires flexible data flow

Beyond the on-prem vs cloud-native architecture, the R&D Data Cloud differs from SDMS in another fundamental way: SDMS supports very limited data flow or processing since it’s largely an archival dumping ground. Data flow grinds to a halt.  

Without these capabilities, SDMS cannot accommodate the dynamic and sophisticated data flow required by modern R&D organizations. Consider a common R&D workflow:

  • Scientists produce data and reports from biochemical assays
  • These results pass through quality checks or enrichment 
  • If the data check out, they are further pushed to ELN or LIMS
  • If an issue arises, stakeholders should be notified to take action and rectify the data mistake with full traceability
  • Data is also parsed to enable data visualization and data science

Flexibility is key; R&D organizations should be able to easily submit the data to multiple destinations, such as both the ELN and LIMS (and any other data targets), as their process requires. To accomplish this, simple archiving inside an SDMS is not enough. Configurable data processing enables metadata extraction from the data set, proper categorization with user-defined terms or tags, harmonization of data into vendor-agnostic formats, data integrity checks, and submission of the data to downstream applications. 

The R&D Data Cloud has elevated the strategic importance of data engineering and integration through cloud-native and Docker-based data pipelines. A user can configure data processing after exploring the data, reprocess the files using another data pipeline, and create their own customized data extraction. They can also merge data from multiple sources, dynamically perform quality control or validation of the data sets, and integrate the data into other informatics applications such as ELN/LIMS. 

Users of the R&D Data Cloud can develop such data processing pipelines entirely around their own business logic by creating Python scripts in a self-service fashion. 
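As a rough illustration of the self-service idea, a pipeline step can be as simple as a small Python function that reads a raw export, applies business logic, and writes a harmonized, labeled record. Everything below (the function signature, field names, and file layout) is hypothetical rather than the actual Tetra pipeline interface.

```python
import json

# Hypothetical pipeline step: read a raw export, apply business logic,
# and emit a harmonized, labeled record for downstream targets.
def harmonize_step(input_path, output_path, labels=None):
    with open(input_path) as f:
        raw = json.load(f)
    record = {
        "sample_id": raw["sample"]["id"],
        "result": raw["measurement"]["value"],
        "units": raw["measurement"]["units"],
        "labels": labels or {},
    }
    with open(output_path, "w") as f:
        json.dump(record, f, indent=2)
    return record
```

In practice such a step would be registered with the pipeline framework, triggered when new raw files land, and chained to steps that validate the record and push it to an ELN/LIMS or analytics target.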

In short, the R&D Data Cloud fundamentally assumes that data should flow and that this flow merits equal consideration to data capture and data storage. The data flow reinvigorates the siloed and stale experimental data and enables fully automated workflows. 

#4: An SDMS isn't optimized to support data analytics or queries

SDMS solutions’ static nature makes them largely closed systems; simply putting data together in one place does not make the data discoverable, queryable, or ready for data science and analytics. 

To truly leverage the power of the data, it needs to be properly labeled and categorized, and its content needs to be properly extracted. To support API-based queries and data analytics, data needs to be properly prepared, indexed, and partitioned. After all, data scientists can’t do much with a binary file! 

The R&D Data Cloud introduces the concept of Intermediate Data Schema files, aka IDS files, which you may find helpful to understand via the following analogy.

In a nutshell, think about your lab as a big warehouse, where there are multiple parcels of different content, shape, weight, and color (like the data produced by your lab instruments, their software, your CRO/CDMOs, and other data sources). These parcels are heavily packaged in their own unique way, and when you have too many parcels, it becomes difficult to find the ones you need. You can't compare the content within each of these parcels, or make any meaningful observation beyond, “these are all very different things.” 

Now imagine that each parcel comes with attached IDS metadata, which describes the parcel’s content in a consistent manner. With IDS, it’s much easier to find what you need, since you do not need to unpack each parcel - the IDS makes searching and finding what you need much more efficient.

You can also leverage the IDS’s content consistency to compare different parcels; for example:

  • Which parcels contain more items 
  • Which parcel is the heaviest 
  • Show me all the parcels that contain a bottle of wine from before April 2000
  • Only select the parcels that contain books with blue covers

Data curation (labeling, categorization, content extraction, harmonization, indexing, partitioning) requires that the data be put through a configurable and orchestrated processing layer, which an SDMS does not have and simply is not designed for (see #3 above). This processing layer is fundamental: data is never ready to be queried or explored without proper curation and transformation.

In the R&D Data Cloud, all data is automatically harmonized into a vendor-agnostic format. By leveraging open, data science-friendly file formats like JSON and Parquet, the data can be consumed by some of the most popular search and distributed query frameworks, providing a highly reusable, flexible platform to view and structure your data as you wish. 
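To make the JSON/Parquet point concrete, here is a minimal sketch: a couple of harmonized, IDS-like records (field names invented for illustration) written to Parquet with pandas and then queried like any other dataframe. It assumes pandas plus a Parquet engine such as pyarrow; it is not TDP code.

```python
import pandas as pd

# Harmonized, IDS-like records (field names invented for illustration),
# written to Parquet and filtered like any ordinary dataframe.
records = [
    {"sample_id": "S-001", "assay": "bradford", "result": 1.25, "units": "mg/mL"},
    {"sample_id": "S-002", "assay": "bradford", "result": 0.98, "units": "mg/mL"},
]
df = pd.DataFrame(records)
df.to_parquet("harmonized.parquet")          # requires pyarrow or fastparquet

reloaded = pd.read_parquet("harmonized.parquet")
print(reloaded[reloaded["result"] > 1.0])    # cross-run comparison, no unpacking
```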

Another aspect of an SDMS to keep in mind: even though it may provide preview or analysis of an individual run or data set, it is not optimized for aggregated insights, trending, or clustering. The R&D Data Cloud, by contrast, provides accessible, harmonized data through a common software interface to visualization and analysis tools that act upon massive amounts of data from different sources, produced at different time points. 

Summary

By combining a cloud-native data lake, Docker container-based data harmonization, and productized data acquisition, the Tetra R&D Data Cloud distinguishes itself from a traditional SDMS. The R&D Data Cloud supports:

  • Powerful, productized integrations with data sources
  • Scalability based on business needs by leveraging cloud-native architecture 
  • Flexible data processing through self-serviceable and modular data pipelines 
  • Harmonization into the Intermediate Data Schema (IDS), providing a wider range of compatibility across data analytics tools

As cloud-native services offer more features, accessibility, and post-processing options, it’s time to abandon the SDMS “data flip-phone” for the better, more innovative data cloud option. By adopting the R&D Data Cloud ahead of the curve, you position your company for maximal flexibility and deeper analytics from more connected instruments and devices, while benefiting from contemporary best-of-breed technologies to handle big data in your lab digital transformation journey.

Follow us for ongoing updates about all things R&D data and other related topics:

LinkedIn | Twitter | YouTube

R&D Data Cloud

R&D Data Cloud: Moving Your Digital Lab Beyond SDMS

In this blog, we compare life sciences’ traditional data management choice (SDMS) to the emerging R&D Data Cloud: an open, cloud-native, data-centric solution for experimental data.

Spin Wang

Typically, when startups issue blog posts about their funding rounds, they follow a standard template which talks about their perspective of the financing; one which often omits the perspective of their investors and the basis for their investment.

I’d like to try to summarize the investment thesis for TetraScience, since it has profound implications for all of our stakeholders – customers, partners, employees, investors, and humanity as a whole.

The following is a composite analysis which includes my own decision criteria as an investor, as my fund - Impetus Ventures - invested in the Series A just a few months ago.

A Venture Primer: Category Kings and Queens and Tech Economics

Category Kings and Queens dominate in the winner-take-all world of tech economics in which they capture virtually all of a market's mindshare, market share, margin, and market capitalization. Simply put, to the category victor belong the category spoils. Or, in the inimitable words of Ricky Bobby, “If you ain't first, you're last. You know, you know what I'm talking about?”

These market leaders don’t dominate based upon a single distinguishing attribute (e.g., best product), but rather by strategically defining and perfectly aligning their category, product, organization, and ecosystem in furtherance of their category strategy. When executed properly, and in complete harmony, these integrated components generate powerful self-reinforcing value loops and enable the category leader to raise the most capital, attract the greatest and largest talent pools, build the best products, aggregate the most partners, and deliver more value – faster – to its ecosystem, than any potential competitor.

These Kings and Queens are very clever in developing platforms, and sometimes networks, which serve as rocket fuel, driving non-linear growth, resulting in increasing returns to scale and insurmountable competitive moats.

Lastly, and most importantly, by harnessing and harmonizing these components, these Kings and Queens quickly evolve from being technology vendors to leaders of industry movements. The reference example in this regard is Salesforce. While it’s certainly a leading vendor within the CRM market, the real legacy and impact of Salesforce was its pivotal role in inspiring a generational movement to SaaS and the cloud. That category leadership also unsurprisingly led to their dominance within CRM. Ultimately, Category Kings and Queens create abundance for all ecosystem participants, including themselves, whereas subscale tech vendors live or die in a world of zero-sum transactional outcomes.

DNA markers of Category Kings and Queens

Pattern recognition is a primary tool of venture capitalists as it enables them to efficiently sift through thousands of investment opportunities per year. There are consistent “DNA markers” of Category Kings and Queens which VCs screen for and once they recognize them, they pursue those startups with uncommon energy. No group understands the power of Category Kings and Queens better than VCs, as they rely on power law distribution economics to generate their returns. Simply put, they only make real money for their limited partners and themselves if they’re able to back the Category Kings and Queens of tomorrow which generate nearly all of their returns.

Clarity and Simplicity  

Unlike me-too vendors, Category Kings and Queens have to contend with many moving parts in building out the category and engendering a movement around them. Orchestrating a category, product roadmap, organization, and ecosystem concurrently is no small logistical feat and requires complete obsession with clarity – strategic clarity, financial clarity, operational clarity, and cultural clarity – to ensure every single stakeholder is speaking the same language and singing from the same hymn sheet.

Investors who ask a startup for their “elevator pitch” are neither impatient nor unreasonable. They’re simply testing startups to determine if they have perfect clarity into who they are, what they do, why it matters, and why it’s different. If they cannot succinctly and consistently convey this, everything downstream – the company’s culture, operations, customer interactions, etc. – is likely a dumpster fire.

Even when startups are discovering greenfields and breaking new product ground, investors want to understand comparable companies and business models to help them analyze the potential shape of the business and inform their valuation assessment. They like to see plausible comps and simple taglines.

At TetraScience, we are maniacally obsessed with clarity on all dimensions. There is no e-staff session, all-hands meeting, or company activity which we undertake in which we don’t tie everything back to our North Star mission, values, and metrics.

Explaining why we exist, who we are, and why it matters is easy:

  • The world’s R&D labs, in vital industries such as life sciences, chemicals, agricultural, materials, and energy, are upstream to the transformational innovations required to solve humanity’s greatest challenges.
  • TetraScience is the R&D Data Cloud Company.
  • Our mission is to be the most trusted and valuable partner to the world’s scientific community, fueling innovation, facilitating collaboration, accelerating discovery, and enhancing and extending human life.
  • We will accomplish this by delivering the first and only cloud-native and open platform and network which allows us to migrate the world’s experimental data to the Tetra R&D Data Cloud to enable new classes of data applications and position the data for advanced analytics and AI-based discovery.

For investors, knowing that patterns and comps help inform their thesis and valuation, we explain that TetraScience is to Snowflake, as Veeva is to Salesforce - the highly focused vertical domain expert and full stack provider against the backdrop of a horizontal but distracted Gorilla. Like Veeva, TetraScience is expected to emerge as the de facto standard in Life Sciences and then rapidly move into adjacent verticals which possess many of the same lab markers.

Big Markets, Shifts, and Trends, and Information Asymmetry

Not surprisingly, VCs seek out startups which are addressing enormous markets and capitalizing on transformational technology shifts (e.g., cloud, big data, AI) and secular replatforming trends. Venture capitalists face two major challenges in this regard – with $300 billion going into startups annually, once a new technology trend emerges which can galvanize a market replatforming, there are likely to be scores of startups competing for the same mindshare and market share and hundreds of VCs competing to fund the same companies. An important reminder here – notwithstanding the size of the market, ultimately it will coalesce around a single Category King or Queen. Given the overabundance of venture capital and the scarcity of Kings and Queens, I borrow repeatedly from The Hunger Games when I root on my VC friends - “May the odds be ever in your favor.”

For VCs and founders alike, this means that they need to identify and take advantage of some market information asymmetry. They have to identify the non-obvious market opportunity which today looks less like an opportunity and more like a niche or a startup quagmire of some type. A market that scares away most of their competitors but, with the right platform, team, and go-to-market model, could yield commercial success and lead to ever-larger market expansion opportunities.

Representatively, today, TetraScience resides at the nexus of two of the most important venture capital investment themes – the movement of the world’s data to the cloud, and the application of cloud computing and artificial intelligence to life sciences and other complex experimental data challenges. This would logically lead one to conclude that TetraScience must have scores of competitors and is facing all of the byproducts thereof – long sales cycles, feature/functionality-based RFPs in which most startups are column fodder, commoditized pricing, and more.

Yet, as impossible as it may seem, TetraScience is the world’s only pure-play R&D Data Cloud company, operating unimpeded and without a direct competitor, and is building wide and deep competitive moats which will make it exceedingly difficult for new entrants to compete. In fact, we have yet to see a plausible competitor in 23 months due to our first-mover advantage, best-in-class cloud capabilities, open platform and network, and deep domain data expertise. Our competition isn’t really competition at all and is composed of internal IT and/or systems integrators/body shops, or end-point vendors who dabble in integration but are focused elsewhere.

How is this possible? Well, prior to COVID-19 this market had been overlooked, underfunded, and underdeveloped by venture-backed startups due to a misguided and/or misinformed view that experimental R&D labs represent a niche market and that these advanced labs, which have historically been technology laggards, wouldn’t move their data to the cloud. As such, founders and VCs alike have instead focused on pure-play software-enabled biotech companies and end-point lab applications.

In reality, these labs represent a large, growing, and underserved market. Global R&D labs spend more than $300 billion ($177B in Life Sciences alone) annually on research and development and we estimate that $15 billion of that Life Sciences spend is focused on contending with their data challenges via manifestly suboptimal approaches. When large pharmas do replatform – as they have in commercial activities and clinical trials – it’s yielded Category Kings and Queens such as Veeva and Medidata, with a combined value of >$50B.

Endless and Boundless Expansion Opportunities

Category Kings and Queens optimize for creating abundance in all they do and avoid unnecessary zero sum outcomes. The bigger that they can make the economic and value pie for the entire ecosystem, the more everyone benefits. In turn, the ecosystem reinforces their commitment to the King or Queen as the de facto standard. Companies and markets rely upon these standard bearers every day as they maintain clarity and consistency for all.

At TetraScience, our product, market, and business model vectors yield boundless expansion opportunities for us and all of our category stakeholders. Our product strategy, market focus, and business model – i.e., open data platform, Life Sciences, and partner network – combine to serve as a strong and leverageable foundation from which to seamlessly expand into logical adjacencies, resulting in accelerating revenue, competitive moats, increasing returns to scale, continuous innovation for our customers, and endless monetization opportunities for our partners.

We define this as the SEVENS strategy – i.e., stack evolution (SE), vertical expansion (VE), and network scale (NS):

  • Our open and cloud native data platform is designed to assemble the largest and most organized experimental data sets in the world. We believe this gives us unparalleled advantages in enabling advanced native and 3rd party data-enabled apps, as well as native and 3rd party AI/ML capabilities, which we envision as the core building blocks for future R&D discovery. Platforms are the “gifts that keep on giving” insofar as they enable heretofore unavailable capabilities, ushering in a new generation of applications that solve more problems, deliver more choice, and generate more economic activity than standalone application vendors or primitive data integration providers.
  • Our beachhead in Life Sciences, the largest and most demanding R&D lab segment, coupled with our platform and network model, allows us to move into other large R&D lab verticals with common markers including chemicals, agricultural, materials, energy, and more, with far greater leverage and velocity than potential new entrants. 
  • Our partner network represents the largest ecosystem of R&D innovators. We are enabling these partners to leverage our productized integrations as well as build configurable data apps on our platform and enter into commercial go-to-market agreements with us. These activities accelerate bookings and engender a sense of category and company inevitability across the ecosystem with a constant stream of innovation, press releases, joint marketing, references, and 3rd party validation. This also yields considerable platform cross-sell opportunities as we execute a land-and-expand GTM model following the sale of these Tetra-enabled apps.

Metrics, Metrics, Metrics

A world of capital and startup overabundance which greatly distorts the signal-to-noise ratio, a news cycle operating with ever greater velocity, and a toxic combination of social media and founder vanity which fuels endless and shameless self-promotion, converge to form a perfect storm of hype.

But in the final analysis, you can’t hype your way to best-in-class metrics and Category Kings and Queens know this. They’re the embodiment of execution and their metrics reflect this. The best VCs also know this and, as Andrew Carnegie would have said were he working on Sand Hill Road, “As I grow older, I pay less attention to what people say. I just watch what they do.”

Thanks to two decades of cloud company data and the amazing work of firms like Bessemer Venture Partners and Meritech who catalog the data and share metrics, everyone now knows what meh, good, great, epic, and legendary looks like for today’s cloud companies.

Since May 2019, when Spin and I reimagined Tetra 2.0 and launched the Tetra R&D Data Cloud, our metrics have been somewhere between epic and legendary with 2020 ARR growing 10x over 2019, and 2021 ARR expected to increase 3x over 2020. Our capital efficiency has been stellar, with a score of >1.3 in 2020 and an expected 1.5 score in 2021. To that point, we raised an $11M Series A in 2020 and had $9M left on the balance sheet when we closed our $80M Series B on March 26, 2021. Our customer acquisition costs are certainly in the realm of legendary as we’ve barely invested in marketing (although that’s about to change) and our customer retention and expansion metrics are epic. 

In connection with our $80 million Series B, I never built a “pitch deck” and instead simply gave Insight and Alkeon unfettered access to our 2020 plan, results, and metrics, and our board-approved 107-slide Plan of Record. I wanted them to “see what we do” and not “hear what we say.”

We led with our metrics and not hype, and we were rewarded by an extraordinarily efficient, professional, and rewarding fundraising process.

Summary

At TetraScience, our vision is noble, our mission is vital, and we’re passionately committed to being responsible stewards of a global movement to improve and extend human life through AI-based discovery, enabled by the Tetra R&D Data Cloud.

In preparation for this awesome and humbling responsibility, we have been extremely deliberate in designing, organizing, and aligning our category, product, organization, and ecosystem and we have taken nothing for granted in this regard.

We are guided by our mission and enabled by our values and we are assembling the very best and brightest talent from the scientific, cloud, and data science domains.

We have very purposefully built a one-of-a-kind open platform and integrated network which will deliver increasing returns to scale and generate powerful network effects fueling self-reinforcing value-creation and innovation loops resulting in deep and wide competitive moats.

We have catalyzed massive latent market demand and enjoy first-mover advantage and benefit from remarkably strong tailwinds generated by a secular replatforming to the cloud by the world’s most important labs.

In summary, TetraScience possesses all of the DNA markers of a Category King or Queen and today’s $80 million Series B announcement serves as both validation of this and an enabler of it. Evidence once again that category leaders generate self-reinforcing value creation loops and self-fulfilling prophecies.

We are grateful, blessed, and humbled by the commitment of Insight and Alkeon. Now, back to execution.

Patrick


News

Announcing our Series B: The DNA Markers of Category Kings and Queens

Category Kings and Queens dominate in the winner-take-all world of tech economics in which they capture virtually all of a market's mindshare, market share, margin, and market capitalization.

Patrick Grady

Spin and I first met in May 2017 at the Harvard Incubation Lab when he and his fellow TetraScience founders were trying to build an Internet-of-Things (IoT) and application startup to help Life Sciences research labs accelerate discovery. The three founders - former researchers from Harvard and MIT - were brilliant, enthusiastic, and passionately committed to their vision. They had assembled early customers and seed capital and were just about to close their Series A funding.

I came away from my all day session with the founders believing that they were potentially on to something important and valuable – i.e., helping the world’s most important research labs accelerate discovery - but like so many startups, they were trying to do too many things at once, a clear recipe for suboptimization, or worse.

In the ensuing months, with every subsequent high-fidelity discussion with Spin, it became more and more obvious to each of us that TetraScience was likely mistakenly focused on the “market noise” and not the “customer signal.”

In other words, while there was endless hype about IoT – the “market noise” - and biopharma companies were of course interested in improving their existing lab operations applications and intrigued by the potential of IoT technology, their most acute pain was felt at the data layer - the “customer signal.”  

Despite their critical importance to humanity, these labs remain dependent on 20th century on-prem software stacks, resulting in experimental data silos, limited collaboration, stifled innovation, and materially sub-optimized discovery. Worse, the proliferation of IoT sensors (the “noise”) would only lead to further data fragmentation and isolation. More to the point, TetraScience might have been ironically exacerbating its customers’ data problems if it only focused on IoT.

IoT and improved lab applications will invariably benefit scientists and aid in discovery – and we are strategically committed to enabling our industry partners in this regard - but to get to the required step-function breakthroughs needed to attain the holy grail of AI-based discovery, someone would have to fundamentally reimagine and then solve the data problem.

Spin and I knew that this was simply impossible in the absence of large-scale and clean data sets. Ultimately, without a centralized, cloud-native, and data-centric architecture and well-designed experimental data taxonomies and ontologies, there was no possibility of capitalizing on advanced data science and AI/ML capabilities. We also concluded that to solve the data problem we would have to be the “Switzerland of experimental data” and not compete with the companies that were generating or consuming the data – i.e., instrument manufacturers, ELNs, etc. It was the only way to ensure that everyone in the ecosystem would trust us as the data intermediary. In effect, the Switzerland of experimental data would have one goal - maximize the value of the data.

Spin and I developed a mind-meld on what could and should be built to move the industry forward. Unfortunately, there wasn’t a consensus among the founders and investors, and few resources were devoted to the R&D Data Cloud that we envisioned. Having no formal role in the company but being a devout believer in the idea and a huge fan of Spin’s, I asked him to reach out if the company ever decided to focus on the R&D Data Cloud concept.

In the Spring of 2019, Spin called me to give me the “good” and “bad” news. The good news was that he was now CEO and completely committed to focusing on the R&D Data Cloud. The bad news was that the company’s overall lack of focus had finally taken its toll, and it was out of money and had lost the confidence of its investors and creditors. With no clear financial backing, a rapidly deteriorating balance sheet, and customer and employee attrition spiking, the company would have to shut down shortly or complete a successful and difficult pivot.

Spin and Sal, the two remaining original founders, came to visit me in San Francisco for three days to brainstorm on how to bring the R&D Data Cloud vision to life. Just as importantly, we focused on how to build a company of enduring consequence and value. We had three choices – let TetraScience shut down and begin anew; recapitalize the company and build anew; or build Tetra 2.0 inside of the legal and capital structure of Tetra 1.0. While the latter approach was the most difficult to execute on (i.e., it’s always easier to start with a clean sheet of paper), we felt it was the right thing to do for all legacy stakeholders.

In May 2019, we raised a new $1.5 million seed round for Tetra 2.0, and with a small number of dedicated employees we began to reimagine and rebuild the company from the ground up. While we entered 2020 with only a handful of midsize pharma customers and proof of concepts with large pharmas, we ended the year with 12 of the world’s top 40 pharma companies as customers and successfully raised an $11 million Series A to begin building a leadership team and support these early customers. 

Since launching Tetra 2.0 our growth metrics have been simply remarkable, with 2020 ARR growing 10x over 2019 and 2021 ARR growth expected to increase another 3x over 2020. On the heels of these successes, and for the many other reasons cited in today’s press release and accompanying blog posts, we were able to raise a category-defining $80 million Series B round of funding. In connection with our Series B round we finalized my transition to the CEO role and Spin is now our President and CTO. Our partnership will continue as we scale Tetra 2.0 based on our shared values, singular vision, and mutual trust. We are deeply committed to one another and to our stakeholders, and count on each other to help realize the full potential of Tetra 2.0.

It has been a unique, magical, and immensely rewarding Tetra 2.0 journey thus far and we are just getting started. Along the way, we’ve learned an important lesson for any company and for startup leaders around the world: ignore the market noise and focus on the customer signal.

We wanted to share our unique story and express our profound gratitude to those few that believed in us and the pursuit of this vital mission. 

Patrick and Spin

News

Announcing our Series B: Tetra 1.0 and 2.0 | The Noise and the Signal

Since launching Tetra 2.0 our growth metrics have been simply remarkable, with 2020 ARR growing 10x over 2019 and 2021 ARR growth expected to increase another 3x over 2020.

Patrick Grady

Data Integration is one of the biggest challenges for the life sciences industry in its journey to leverage AI/ML. Data gets stuck in vendor-proprietary and vendor-specific formats and interfaces. Data silos are connected via rigid, one-off, point-to-point connections that are unsustainable. As the number of disparate systems increases exponentially - thanks to equipment obsolescence, new instrument models, evolving data standards, acquisitions, and a host of other factors - internal maintenance of this spider web of connections quickly exceeds any life sciences organization’s internal IT capabilities.

Here, we’ll introduce our definition of a true integration with an R&D data system, in particular with a lab instrument or its software. We’ll also explain why our audacious approach - building and maintaining an expanding library of agents, connectors, and apps - can unify ALL pharma and biotech data silos, a foundation for accelerating drug discovery and R&D.


How can we knit together disparate instruments and software systems into a logical platform?

Fragmented Ecosystem 

Let’s start from an assumption we believe is well-understood: life sciences R&D data systems are notoriously fragmented and heterogeneous. We’ve touched on this before (post and video).

Systems diverge on:

  • File formats - 20+ in common usage for mass spectrometry alone
  • Instrument data schemas
  • Physical and data interfaces (OPC/UA, API, Software Toolkit, Serial Port, etc.)
  • FDA submission standards
  • Terminology differences between ELNs, LIMS, LES (workflow, reaction, scheme, process)
  • And numerous other variables

AI/ML and digital transformation depend on data liquidity

Do you want clean, curated data sets with consistent headers, attuned to FAIR standards, auditable, traceable, and portable? The only way to enable AI/ML is with high-quality data. To achieve this data liquidity, your divergent instrument outputs must “flow” across the system, connecting the disjointed landscape via a vendor-agnostic open network.

What is a Tetra Integration?

So what is TetraScience’s definition of a true R&D Data Integration - one that enables automation, unites diverse data systems, and powers AI/ML?

A true R&D Data Integration with a particular data system must clear a high bar. It must be:

  • Configurable and productized: flexible knobs adjust the behavior of pulling data from sources or pushing data into targets. The configuration should be documented, tailored to the particular data system, and generally achievable with a few clicks from a central configuration portal, remotely and securely, without logging into the on-prem IT environment
  • Bidirectional: the integration should support both pulling data from the data system and pushing data into it, treating the system as a data source and data target simultaneously and enabling the full Design-Make-Test-Analyze (DMTA) cycle of scientific discovery
  • Automated: no or minimal manual intervention is needed to pull data from sources or push data to targets, and new data is detected automatically so that every change is captured
  • Compliant: all changes to the integration, including configuration, are traced in the user action audit trail; operations on the data set are logged and can be associated to provide a fully traceable and transparent history of the data set, enabling fully GxP-validatable solutions
  • Complete: if the integration is designed to extract data from a data source, it needs to extract ALL the scientifically meaningful information possible, including raw data, processed results, and the context of the experiment, such as sample, system, and users
  • Enable true data liquidity: the integration must not stop at moving data around; it must also harmonize the data into vendor-agnostic, data science-compatible formats so that data can be consumed by or pushed to any data target and flow freely across systems
  • Chainable: the output of one integration can trigger the next. For example, data pulled from an ELN can trigger a push integration to the instrument control software, avoiding manual transcription of the batch or sample list; conversely, data pulled from instruments can trigger a push integration that submits the data to the ELN or LIMS


A true R&D Data Integration must necessarily be “full-stack and purpose-built” — configurable data collection, harmonization to vendor-neutral formats, preparation for analytics consumption, automated push to data targets, and tailored to the system and related scientific use cases — so that scientists and data scientists can access and take actions on previously siloed data in order to accelerate discovery.

A true R&D Data Integration can be achieved via a combination of Tetra agents, connectors, and pipelines depending on the specific data systems. For example: 

  • To integrate with Waters Empower, we leverage a data agent to draw information from a Chromatography Data System (CDS) using a vendor toolkit
  • To integrate with NanoTemper Prometheus, we leverage the Tetra file-log agent and Tetra Data Pipelines (a minimal sketch of this file-based pattern follows this list)
  • To integrate with Solace and AGU SDC, we use RESTful API services to build connectors
  • To integrate with osmometers, blood gas analyzers, or shaking incubators, we use an IoT agent to stream continuous data from a physical, mounted instrument to the cloud through secure MQTT 
  • To integrate with Benchling, IDBS eWorkbook, Dotmatics Studies, PerkinElmer Signals and to push data into the ELNs, we use Tetra Data Pipelines
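
To make the file-based pattern above concrete, here is a minimal, hypothetical sketch of what a file-watching agent does conceptually: poll an instrument export folder, detect new files, and upload them to a cloud ingestion endpoint. The folder path, endpoint URL, token, and field names are illustrative assumptions, not the actual Tetra agent implementation.

```python
import time
from pathlib import Path

import requests  # assumes the requests library is available

WATCH_DIR = Path(r"C:\InstrumentOutput")  # hypothetical instrument export folder
INGEST_URL = "https://example-data-platform.invalid/v1/files"  # illustrative endpoint, not a real API
API_TOKEN = "REPLACE_ME"  # placeholder credential

def upload(path: Path) -> None:
    """Upload a single RAW file to the (hypothetical) cloud ingestion endpoint."""
    with path.open("rb") as fh:
        resp = requests.post(
            INGEST_URL,
            headers={"Authorization": f"Bearer {API_TOKEN}"},
            files={"file": (path.name, fh)},
        )
    resp.raise_for_status()

def watch(poll_seconds: int = 30) -> None:
    """Poll the watch directory and upload any file not seen before."""
    seen: set[Path] = set()
    while True:
        for path in WATCH_DIR.glob("*"):
            if path.is_file() and path not in seen:
                upload(path)
                seen.add(path)
        time.sleep(poll_seconds)

if __name__ == "__main__":
    watch()
```

A production agent also handles retries, checksums, partial writes, and change detection, which is why these integrations are productized rather than rewritten in each lab.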

Tetra’s “Special Sauce”: Productized and data-centric integrations

Important Note: When we use the term true R&D Data Integration, we reject simple “drag-and-drop” of instrument RAW files into a data lake. To meet our criteria and quality standards, we must contextually transform source data into a harmonized format, like JSON. These integrations are differentiators for the Tetra Data Platform; to us, if you’re moving files without true parsing and interpretation, no value is added.

Differentiated from LIMS/SDMS: IoT agent and software agent/connector

Most life sciences data integrations are performed by LIMS and SDMS software, which need to bring data from different sources into the ELN/LIMS for tracking and reporting, and into the SDMS for storage. LIMS and SDMS rely on two major methods:

  • Serial-to-Ethernet adapters for instruments such as osmometers and analyzers
  • File-based export and import 

While these may be viable options, they are far from optimal for an organization trying to upgrade to an Industry 4.0 motif. Consider the following scenarios:

 
SDMS, LIMS, or ELN connecting through a Serial-to-Ethernet adapter vs. Tetra Data Platform + IoT Agent:

  • Network resilience - Serial-to-Ethernet adapter: network interruptions between the server and the instrument result in lost data. Tetra IoT agent: resilient to network interruption; the agent buffers data locally in case of network downtime.
  • Scalability - Serial-to-Ethernet adapter: a centralized server maintains and manages connections to an increasing number of endpoints, which does not scale. Tetra IoT agent: each agent performs distributed computation, tracks its own state, handles the instrument handshake, and preprocesses data.
  • Supported communication protocols - Serial-to-Ethernet adapter: only the serial port is supported. Tetra IoT agent: handles more communication protocols, including OPC, GPIO, CAN, and others.
  • Architecture - Serial-to-Ethernet adapter: data sources are strongly coupled to their destinations. Tetra IoT agent: data sources and data targets are decoupled.
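
To illustrate the buffering behavior described above, here is a minimal sketch of an IoT-style agent loop that queues readings locally and publishes them over MQTT whenever the broker is reachable. The broker address, topic, and read_sensor() function are illustrative assumptions, not the actual Tetra IoT agent.

```python
import json
import random
import time
from collections import deque

import paho.mqtt.client as mqtt  # paho-mqtt 1.x style constructor; 2.x also requires a CallbackAPIVersion

BROKER = "broker.example.invalid"    # hypothetical MQTT broker
TOPIC = "lab/incubator-01/readings"  # illustrative topic name
buffer = deque(maxlen=100_000)       # bounded local buffer that survives network outages

def read_sensor() -> dict:
    """Stand-in for a real instrument handshake; returns a fake CO2/temperature reading."""
    return {"ts": time.time(), "co2_pct": 5.0 + random.uniform(-0.1, 0.1), "temp_c": 37.0}

client = mqtt.Client()
client.connect_async(BROKER, 1883)  # non-blocking connect; the network loop handles reconnects
client.loop_start()

while True:
    buffer.append(read_sensor())
    # Try to flush the buffer; if the broker is unreachable, readings stay queued locally.
    while buffer and client.is_connected():
        info = client.publish(TOPIC, json.dumps(buffer[0]), qos=1)
        if info.rc != mqtt.MQTT_ERR_SUCCESS:
            break  # leave the reading in the buffer and retry on the next pass
        buffer.popleft()
    time.sleep(10)
```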

ELNs, LIMS, and SDMS have traditionally relied on file export from the instrument for the majority of their integrations with instruments and instrument control/processing software.


 
File export from ELN/LIMS/SDMS vs. Tetra Data Platform + software Agents and Connectors:

  • Completeness - File export: lacks information needed to enable sophisticated AI/ML, such as the audit trail, method, and comprehensive raw data. Tetra: collects everything that can be extracted from the source system, including raw data, logs, methods, processing results, and more.
  • Automation and human error - File export: scientists must initiate and configure exports on the spot; these manual operations are subject to variability, typos, and human error. Tetra: cloud-native agents and data connectors communicate with data sources programmatically, capturing comprehensive data sets and detecting changes on update.
  • Change detection - File export: requires manual export; if scientists forget to export the data, changes remain stuck in the source system and are never reflected in analytics or reporting. Tetra: automatically detects changes using the audit trail, timestamps, and other mechanisms to ensure data stays current and updates are captured.
  • Bi-directionality - File export: files are “unidirectional”; one may export data from the instrument software, but files typically cannot instruct automation software on how to run an experiment. Tetra: TDP permits third-party systems to send commands and instructions to the instrument or instrument control software via the agent.

Data harmonization for data science and data liquidity

Extraction from source systems is insufficient to claim true integration. For example, imagine a data scientist has access to thousands of .pdfs from a Malvern Particle Sizer, thousands of mass spec binary files from Waters MassLynx, or TA Instruments differential scanning calorimetry (DSC) binary files; these formats can’t unlock the value of the data or drive impact in R&D.

Other than the file name and path, these binary files are essentially meaningless to other data analytics applications. These data need to be further harmonized into our Intermediate Data Schema (IDS), based on JSON and Parquet, to truly allow any applications to consume the data and R&D teams to apply their data science tools. 
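
As a rough illustration of what harmonization into a JSON + Parquet schema might look like (the field names below are invented for the example, not the actual IDS definition), a parsed result can be serialized to JSON for search and context, and to Parquet for columnar analytics:

```python
import json

import pandas as pd  # pandas with pyarrow installed for Parquet support

# Hypothetical output of a parser for one instrument result (not the real IDS fields)
parsed_result = {
    "instrument": {"vendor": "ExampleVendor", "model": "ExampleModel", "serial": "SN-001"},
    "sample": {"id": "SAMPLE-42", "user": "jdoe"},
    "measurements": [
        {"time_min": 0.0, "intensity_mau": 0.1},
        {"time_min": 0.5, "intensity_mau": 12.7},
    ],
}

# JSON view: keeps the nested, human-readable context for search and indexing
with open("result.ids.json", "w") as fh:
    json.dump(parsed_result, fh, indent=2)

# Parquet view: flatten the numeric trace into a columnar table for analytics at scale
df = pd.DataFrame(parsed_result["measurements"])
df["sample_id"] = parsed_result["sample"]["id"]
df.to_parquet("result.ids.parquet", index=False)
```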

Fostering a community based on data liquidity 

TetraScience has taken on the challenging, audacious stance of building out, maintaining, and even upgrading sophisticated integrations; we believe this to be a first in the life sciences data industry, which has long suffered from the vendor data silo problem:  

  • An instrument OEM’s primary business driver involves selling instruments and consumables
  • An informatics application provider’s primary goal is to get more data flowing into its own software

The R&D Data Cloud and companion Tetra Integrations are entirely designed to serve the data itself, liberating it without introducing any proprietary layer of interpretation.  If your software can read JSON or Parquet, and talk to SQL, you can immediately benefit from the Tetra Integration philosophy.
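
As a small illustration of that last point, any SQL engine that reads Parquet can query harmonized results directly. DuckDB is used here purely as an example; the file layout and column names are assumptions for the sketch.

```python
import duckdb  # any engine that reads Parquet and speaks SQL would work equally well

# Query a folder of harmonized Parquet files directly with SQL; no vendor parser required
df = duckdb.query("""
    SELECT sample_id, MAX(intensity_mau) AS peak_intensity
    FROM 'ids_output/*.parquet'
    GROUP BY sample_id
    ORDER BY peak_intensity DESC
""").df()

print(df.head())
```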

Our cloud-native and Docker-based platform allows us to leverage the entire industry’s momentum to rapidly develop, test, and enhance these integrations with real customer feedback. Rapid iteration and distribution of consistent, reproducible integrations across our customer base introduces more use cases, more test cases, and more battle-tested improvements for the entire scientific community. 

Check out some of our Tetra Integrations, and request an integration for your team right on that page. We're always interested in hearing from you!

What is a True Data Integration, Anyway?

We introduce a definition for proper integration with an R&D data system, in particular with a lab instrument or its software. Our audacious approach - building and maintaining an expanding library of agents, connectors, and apps - might unify ALL pharma and biotech data silos.

Spin Wang

Our previous post (The Digital Lab Needs an Intermediate Data Schema (IDS): a First Principle Analysis) established the need for an intermediate data file format for Digital Labs and explained why we chose JSON + Parquet. Now let’s dive deeper into its use cases and discuss what an IDS is intended for, and what it’s not intended for.

In a nutshell, think about your lab as a big warehouse, where there are multiple parcels of different content, shape, weight, and color (like the data produced by your lab instruments, their software, your CRO/CDMOs, and other data sources). Because these parcels are heavily packaged in their own unique way, when you have too many parcels, it’s really difficult to find the ones you need. You can't compare the content within each of these parcels, or make any meaningful observation beyond  “these are all very different things.” 

Now imagine each parcel has a sidekick called an IDS attached to its cardboard box, describing the parcel’s content in a consistent manner. With IDS, it would be much easier to find what you need: you can simply search through the sidekicks instead of unpacking each parcel.

You can also leverage the sidekick's content consistency to compare different parcels; for example:

  • Which parcels contain more items? 
  • Which is heaviest? 
  • Show me all the parcels that contain a bottle of wine from before April 2000
  • Only select the parcels that contain books with blue covers.

Enabling a data sidekick

Hopefully the analogy above is helpful. Now let's move from physical packages in a warehouse to data "packages" in a life sciences company. In more technical terms, IDS is a vendor-agnostic, data science-friendly file format that captures all the scientifically meaningful and data science-relevant information from the vendor-specific or vendor-proprietary file formats produced by a fragmented lab data source ecosystem.

IDS serves multiple important functions: data search, data flow/liquidity and data analytics, which we'll discuss a bit below.

IDS enables search 

With an IDS, data scientists can leverage API-based or full text search, enabling them to quickly locate an instrument RAW file. As with the package analogy, all data fields are acquired and indexed.

IDS enables data liquidity and data flow

Data can now flow to third-party systems without requiring each of these systems to parse the vendor-specific file formats. This fundamentally breaks down internal data silos and enables more seamless information sharing among instruments, informatic applications, analytics tools and contract organizations (CRO/CDMOs).

IDS enables cloud-based, distributed data processing and analytics at massive scale

The majority of existing lab informatics software solutions were designed as Windows-based on-premises applications. For example, in order to analyze Mass Spec data, a scientist would drag and drop 20-100 RAW files into their favorite analysis software program, spend hours processing the data, and generate a report. This workflow unnecessarily consumes time, and limits the potential for data reuse and sharing.

Now, imagine you need to analyze data using big data technologies like Hadoop, Spark, or others. You wouldn’t be able to effectively use those big data tools while manually managing RAW files. However, IDS (JSON + Parquet) makes big data analytics possible, and the Tetra Data Platform will store, partition and index the IDS files in anticipation of such use cases.
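
As a sketch of why this matters (the directory layout and column names are assumptions for the example), thousands of harmonized Parquet files can be treated as one distributed dataset without opening a single RAW file:

```python
from pyspark.sql import SparkSession  # assumes pyspark is installed

spark = SparkSession.builder.appName("ids-analytics-sketch").getOrCreate()

# Read a whole directory of harmonized Parquet files as one distributed DataFrame
df = spark.read.parquet("ids_output/")

# Example aggregate across every run: peak intensity per sample
df.groupBy("sample_id").max("intensity_mau").show()
```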

Will IDS replace proprietary formats? 

Not at all. IDS is not a replacement for the vendor-specific formats. Organizations always need to save these RAW files, so that scientists can:

  • Understand the nuances of the content
  • Potentially import the RAW file back to the vendor software
  • Regenerate the IDS, for further information extraction or error correction

IDS makes it easier to find the RAW file and makes large-scale data analysis far more efficient, since the RAW file does not need to be opened. Just like the parcel-and-sidekick analogy, the sidekick is not meant to replace the parcel, only to improve its use. In fact, the Tetra Data Platform will first ingest RAW data and then trigger RAW-to-IDS conversion, or harmonization, using data pipelines.

Can IDS be used for archiving your data? 

This is essentially a variation of the question above, but usually asked in the context of experimental data management. 

Let’s first define “archive.” In the context of experimental R&D data, an instrument typically produces “RAW” data - the points, ramps, intensities, depths, counts, and response factors produced by the detector or sensor - which is processed by instrument control/analysis software to produce an “Analysis Result.” Sometimes the “Analysis Result” is saved in the same file as the “RAW” data; for the purposes of this article, we’ll call both RAW data. RAW data is most often present in a vendor-specific format, a vendor-proprietary format, or a non-file-based data system.


Archiving RAW data means creating a backup from the instrument software or analysis software as a file, stashing it somewhere, and then, when needed, restoring it back to the original system as if the data had never left the source system. (AWS S3 Glacier Deep Archive is a popular, secure, durable, and extremely low-cost Amazon S3 storage class for data archiving and long-term backup.)

Thus, the answer is: No, IDS itself isn't intended for long-term archival.

In most cases, it’s not possible to convert a vendor-proprietary and often binary format into JSON and Parquet, and then be able to re-import the data into the source system without loss of information.

Nor is the goal of IDS to capture everything in the RAW file; in this sense it is lossy by design. Therefore, IDS files alone are not sufficient to fulfill experimental data archival requirements. However, as described in the previous section, the content in IDS files serves as an abstraction layer that greatly augments search, liquidity, and analytics.

IDS helps you archive and manage your data

Don’t get us wrong: we believe data archival is extremely important. GxP compliance, patient safety, plumbing “dark data” for new insights: in these use cases and more, data archival critically impacts scientific success. The Tetra Data Platform (TDP) archives original instrument RAW files to the cloud with full traceability and data integrity. You can easily restore processed versions and track lineage back to the original instrument or instrument software.

On top of this, having IDS as a harmonization layer makes the archived data much easier to locate and the information stored in these RAW files much more easily accessible. Vendor and third-party binaries can be searched by file name or extension, but TDP's IDS enables users to search on granular fields like experiment ID, user name, method parameter, or instrument serial number; everything is indexed for faster and easier discovery.

For example, take a look at our Empower Cloud Archival App that leverages IDS to help you search your Waters Empower projects archived in Tetra Data Platform:

Hopefully we've helped you understand what happens to your irreplaceable RAW files (we save them!) and how IDS can lead to better analytics, facile search, and ready indexing. In other words, IDS provides everything you'll need to track, find, and sort the heterogeneous and invaluable “packages” in your R&D data warehouse.

If you’d like to dive deeper into the Tetra Data Platform, we recommend this whitepaper covering many of its key capabilities.

As always, follow TetraScience for ongoing updates on experimental R&D data and related topics:

LinkedIn | Twitter | YouTube

R&D Data Cloud

How an IDS Complements Raw Experimental R&D Data in the Digital Lab

We use an intuitive analogy to help explain the purpose and use cases of our Intermediate Data Schema (IDS) and its relationship with data archival in a life sciences context.

Spin Wang

TetraScience President and CTO, Spin Wang, discussed industry trends shaping life sciences R&D with Jared Saul, AWS Head of Healthcare and Life Sciences Startup Business. 


The conversation focuses on major industry trends shaping R&D processes in pharmaceutical and biotech companies, the complex data lifecycle inherent to R&D, and what the “Lab of the Future” actually means.

Read the entire conversation on the AWS for Industries: Executive Conversations blog.

The AWS for Industries: Executive Conversations blog is a series of conversations with innovators advancing disruptive technologies in their respective industries, with a focus on discovery, ingenuity, and contributions to healthcare and life sciences.

Highlights from the blog include: 

  • Transformative solutions and a community-driven data ecosystem are fostering innovation
  • The "three pillars" of TetraScience: data-centric solution, cloud-native approach, and life sciences R&D focus
  • Ever-evolving security and compliance regulations and their impact on R&D

AWS Healthcare and Life Sciences focuses on collaboration with life sciences organizations around the globe to increase the pace of innovation, accelerate development timelines, and ultimately improve outcomes. TetraScience remains committed to our close partnership with AWS and appreciates having our story highlighted.

R&D Data Cloud

AWS Executive Conversations: Evolving R&D

AWS Executive Conversations highlight disruptive technologies across industries with a focus on ingenuity in healthcare and life sciences. TetraScience President and CTO, Spin Wang, is the latest to be featured.

TetraScience

Marshall Brennan, Scientific Director at SLAS, chats with Mike Tarselli about everything from his birth on a planet far, far away to his rise as Chief Scientific Officer at TetraScience (a SLAS2021 Innovation AveNEW company). Mike discusses his winding journey starting as a pre-med major and eventually careening to his final path - a Ph.D. in Chemistry and the battle scars accumulated from the arduous path to completion. And then hitting the Great Recession!

Mike shares his thoughts on the challenge of “Big Data” in life science R&D, from data generation, management, all the way to visualization, and what organizations at the forefront of discovery seek to gain from implementing holistic approaches to data automation. TetraScience, and the Tetra R&D Data Cloud, covers the full lifecycle of R&D data: from acquisition to harmonization, engineering, and downstream analysis with native support for state-of-the-art data science tools.

R&D Data Cloud

New Matter: Inside the Minds of SLAS Scientists Podcast

Listen to the New Matter podcast episode, "Helping Scientists Do More," with SLAS's Scientific Director, Marshall Brennan, Ph.D., and Tetra's Chief Scientific Officer, Mike Tarselli, Ph.D. where they discuss the challenge of "Big Data" in life science R&D.

TetraScience
  • Biopharma R&D has not realized the full power of the data it possesses and generates every day. There are easy-to-use data science tools and there is a vast amount of data. But disconnects exist; data are locked in on-premises silos and heterogeneous formats, and often lack the interface needed to derive insights at scale.
  • Our “Data Science & Application Use Cases for the Digital Lab” blog series shares some non-obvious ways top pharmaceutical organizations apply data science to extract novel insights from their R&D data.
  • We crowdsource use cases through our partnerships with top biotech and pharmaceutical organizations, enabled by our ever-growing Tetra Partner Network of connections to common data sources and data science tools. Our cloud-native Tetra Data Platform automatically collects, centralizes, harmonizes and prepares R&D data for analysis.

Authors: Evan Anderson, Cheng Han, Mike Tarselli, Spin Wang

Overview

Protein therapeutics are integral to the pharmaceutical landscape. Consider insulin for diabetes, Factor VIII to treat clotting conditions, or monoclonal antibodies against emerging pathogens or cancerous cells: all are protein-based and require characterization and purification techniques orthogonal to those used in more common small-molecule workflows. Fast protein liquid chromatography (FPLC)[1], a common separation technique, utilizes multiple methods to purify proteins based on their size, charge, or affinity to column packing. For this post, we’ll consider the nearly-ubiquitous Cytiva ÄKTA series of instruments, an adaptive protein purification platform. These instruments are controlled by Cytiva UNICORN software, which holds the chromatographic data and associated metadata for each result.

As with many R&D instruments and their control software, accessing the data contained within the Cytiva UNICORN has traditionally been challenging:

  • Tedious comparison of results across projects
  • Searching for specific results by diverse metadata is non-native
  • Time-consuming manual analysis

Our Data Science Link makes the data inside UNICORN instantly accessible and actionable. Scientists can now perform analyses and identify insights using their preferred data science tools, without manual data wrangling[2].

Image: Data Science Link Overview


Search, Select, Overlay, Evaluate, and Report

Our partners sought to reduce the scientist hours spent on manual data transfer, comparison, and analysis. On the Tetra Data Platform, all data from UNICORN systems is automatically harmonized. Using data applications and visualization tools like Streamlit, Jupyter Notebook, Spotfire, Tableau, and others, R&D organizations can build interactive applications that enable scientists to streamline search, select, overlay, evaluate, and report.

Search

Scientists seek to compare FPLC run results by comparing output chromatograms. They can conduct flexible queries to obtain the results of interest. For example, in order to select a column with better performance for a therapeutic protein of interest, a scientist can search by molecules, resins, column diameter, and/or start or end date ranges.

Image: Search by multiple terms


Pre-aggregated values from UNICORN results - molecules, HPLC systems and column parameters - permit rapid selection.  Scientists can leverage partial name matches to find specific resins and fetch relevant chromatograms.
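
As a minimal sketch of what such a search applet could look like in Streamlit (the Parquet file, column names, and filters are assumptions for the example, not the actual Data Science Link implementation):

```python
import datetime as dt

import pandas as pd
import streamlit as st

# Assumes harmonized UNICORN results have been exported to a Parquet file with
# illustrative columns: molecule, resin, run_date, ... (not the actual Tetra schema)
results = pd.read_parquet("unicorn_results.parquet")

st.title("FPLC result search (sketch)")

molecules = st.multiselect("Molecule", sorted(results["molecule"].unique()))
resin_query = st.text_input("Resin (partial name match)")
date_from = st.date_input("Start date", dt.date(2020, 1, 1))
date_to = st.date_input("End date", dt.date.today())

filtered = results[
    (results["run_date"] >= pd.Timestamp(date_from))
    & (results["run_date"] <= pd.Timestamp(date_to))
]
if molecules:
    filtered = filtered[filtered["molecule"].isin(molecules)]
if resin_query:
    filtered = filtered[filtered["resin"].str.contains(resin_query, case=False)]

st.dataframe(filtered)  # scientists pick the chromatograms to overlay from this table
```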

Select

Among the search results, scientists can select all or a subset of the chromatograms of interest.

Image: Surface chromatograms from selected results


After locating the specific data they are looking for, scientists can select the chromatograms (see: elution peak overlay) to overlay and analyze. In the subsequent sections, we will discuss some common use cases in method development, column performance, and fast access to structured and complete experiment data.

Overlay

Chromatographic overlays from multiple runs can help visualize trends in column behavior, or spot anomalies in flow rate under certain conditions. Clear communication is key; interdepartmental reports help analytical staff visually report their optimization studies.

However, there’s a fly in the ointment: run start time, injection time, and injection volume can vary by sample. As a result, the chromatograms arrive misaligned on both time and intensity axes. Simple plotting and overlay of the chromatograms together will lead to erroneous conclusions. Normalization, the process of aligning peaks and baseline with awareness of  injection volume, time, and column volume, consumes hours of manual effort for process development scientists daily.

If scientists have all the data from Cytiva UNICORN harmonized, centralized and available for query, they can use the injection volume value to normalize the intensity or height of the chromatogram; scientists can use custom set marks defined in the run log to realign the chromatograms. Enabling scientists to align their chromatograms at a setmark such as elution or wash allows them to compare protein mobility - and, thus, method performance - between results.
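
A minimal sketch of that normalization and alignment logic follows, assuming each chromatogram is a DataFrame with volume_ml and intensity_mau columns plus a known injection volume and setmark volume (all names are illustrative, not the actual harmonized schema):

```python
import numpy as np
import pandas as pd

def normalize_and_align(trace: pd.DataFrame, injection_volume_ml: float,
                        setmark_volume_ml: float) -> pd.DataFrame:
    """Scale intensity by injection volume and shift the volume axis so the chosen setmark sits at zero."""
    out = trace.copy()
    out["intensity_norm"] = out["intensity_mau"] / injection_volume_ml
    out["volume_aligned_ml"] = out["volume_ml"] - setmark_volume_ml
    return out

# Two synthetic chromatograms standing in for harmonized UNICORN traces
volume = np.linspace(0, 500, 1001)
run_a = pd.DataFrame({"volume_ml": volume, "intensity_mau": 120 * np.exp(-((volume - 310) / 5) ** 2)})
run_b = pd.DataFrame({"volume_ml": volume, "intensity_mau": 80 * np.exp(-((volume - 325) / 5) ** 2)})

# Align both runs at their "Elution Start" setmark and normalize by injection volume
overlay_a = normalize_and_align(run_a, injection_volume_ml=2.0, setmark_volume_ml=300.0)
overlay_b = normalize_and_align(run_b, injection_volume_ml=1.5, setmark_volume_ml=315.0)
# overlay_a and overlay_b can now be plotted on a shared "volume_aligned_ml" axis
```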

Image: Perform overlay with auto alignment for peaks and baseline


Image: Align peaks by Elution Start to compare method performance between different results


Image: Perform custom baseline adjustments (see that the y position at x=1170mLs is now at y=0 mAU) for downstream peak integration


Evaluate

Comparison of peak integrations drives rapid conclusions on recovery yields, impurities, or reaction progression without extra clicks or reinterpretation. This allows scientists to go the “extra step” of automating screening runs, calibrating QC runs, or monitoring method development. Once finished, the final visualization helps scientists determine optimal conditions.

Image: Perform custom peak integration to evaluate elution efficiency


Report

In any large scientific organization, communications are key to ensuring alignment between large teams operating in parallel functions. As interpretation of data - in addition to manual processes, file conversions, and kludging this into PowerPoint - may take longer than the experiment itself, it’s unsurprising to see scientists dedicating as much or more of their time to reporting and publishing data sets and notebook pages as they do to their science.

In the instance below, scientists can share chromatogram overlay images and peak table results in .csv format so they can be included in emails, presentations, or Excel. The .csv results can also be imported into statistical software such as JMP (pronounced “jump”).

Image: Dynamically Zoom in on overlay and save resulting graphs


Applications: Method Development

Identifying the protocol that achieves optimal yield and quality requires iterative method development. Simply put, one changes method parameters and compares purification results. Capturing key method parameters systematically leads to highly efficient and transparent optimization. For example, some of these important parameters include flow rate, pH gradient, buffer, pressure, and various scouting variables defined by users.


Capturing and tracking run results and the associated operating conditions is not only important in method development but is also essential, in the context of Quality by Design[3], to support decision making during scale-up and method transfer.

Achieve the highest resolution size separation in final polishing steps

Every time chromatography is performed, a portion of the scientists’ material is lost. This may not be an issue for a commodity material, but in a therapeutic context it may mean losing much of an antibody, protein therapeutic, or oligonucleotide that took weeks to manufacture and characterize.

Polishing runs are critical to produce quality therapeutics[4]. After bulk separations and a concentrating  chromatography run, the remaining material may still hide among closely-related contaminants like protein isoforms, polymers, n-1 adducts, or other post-translational modifications. Higher resolution obtained during polishing correlates to less expense, minimized use of customized resin, lower chance of rework (and therefore further material loss), and a better chance that the material so derived will meet specifications.

In this instance, chromatographic overlay will quickly allow scientists to:

  • Determine the yield with the integral of peak area and volume
  • Observe overall peak shape and symmetry, as more symmetric peaks indicate pure final material
  • Screen various resins or techniques of the purification scheme to determine which combination decreases manual handling, balances recovery against purity, and delivers material to spec prior to polishing

Flow Rate Scouting

Flow rate, though a simple parameter (volume through the column per unit time), controls the number of theoretical “plates”[5] that the eluent travels through, which defines the maximal achievable resolution. Granted, this also depends on the type of chromatography; size exclusion and gel permeation techniques function differently than silica or hydrophobic stationary phases. As a general rule, scientists want the fastest flow rate that their processes can tolerate (this reduces solvent cost and material degradation) while maintaining resolution.

In a typical flow rate scouting study, a scientist conducts several runs at different speeds, with a standard amount of material and standard run conditions, and monitors the peak shape, retention time (where the peak appears in the chromatogram), and overall recovery (area under the curve or a physical measurement at the end).

With all UNICORN results available on the Tetra Data Platform, scientists no longer need to manually open each run and compare on screen. Instead, they can overlay and time-sync the chromatograms of interest, with the areas under the curve (AUCs) pre-calculated. With a liquid handler and auto-injector, scientists could in fact physically automate this process and automate the data analysis using Tetra Data Platform pipelines.
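
For instance, the area under an elution peak can be computed directly from a harmonized trace (the column names and peak window are again illustrative):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one harmonized chromatogram trace
volume = np.linspace(0, 500, 1001)
trace = pd.DataFrame({"volume_ml": volume,
                      "intensity_mau": 120 * np.exp(-((volume - 310) / 5) ** 2)})

# Integrate intensity over the peak window to estimate recovery (area under the curve)
peak = trace[(trace["volume_ml"] >= 290) & (trace["volume_ml"] <= 330)]
auc = np.trapz(peak["intensity_mau"], peak["volume_ml"])
print(f"Peak AUC: {auc:.1f} mAU*mL")
```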

Applications: Column Performance

When transferring methods to an external organization - a CDMO, a collaborator, or another corporate site - it's important to communicate all critical variables. This might include (but is not limited to): column type, size, packing material, flow rate, diffusion rate, material physical properties, pH sensitivity, and many more.

Scientists can plot the performance of the purification across a large number of runs and better understand how column performance changes over time.

Fast Access to Structured and Complete Experiment Data

Check out this video to see how easy it is to set up the Cytiva UNICORN integration with the Tetra Data Platform.

With the data readily available for scientists to manipulate in data science and data analytics tools, they can create more sophisticated analyses, such as:

  • Comparing control runs to detect operating and machine anomalies over time, across batches and systems
  • Extracting chromatographic features such as peak shapes
  • Process optimization to achieve higher yields (elution peak volumes) and better purity

When structured and complete data is made available and data integrity is reinforced, scientific research or process improvements are no longer limited by the functionalities of the instrument control software.

Summary

Cytiva UNICORN protein purification data provide tremendous insights. When scientists can compare current runs with historical results thanks to harmonized, analysis-ready data, they save time, avoid manual processing headaches, and reach better technical conclusions.

Follow our blog, where we will continuously share use cases from our Tetra Partner Network. These demonstrate how to harness the power of harmonized and vendor-agnostic scientific data. Whether your goal is to build reports, conduct correlation or causality analysis, or run AI/ML models to discover untold truths, we hope that you will find relevant answers here.

References

  1. "Fast protein liquid chromatography - Wikipedia."
  2. "Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Tasks, Survey Says"
  3. "Quality by design - Wikipedia."
  4. "How to combine chromatography techniques | Cytiva"
  5. "Theoretical Plate - an overview | ScienceDirect Topics."

Share another data science use case or inquire about the Data Science Link application by contacting us at solution@tetrascience.com or any of the channels below.

R&D Data Cloud

Protein Purification with Cytiva UNICORN: Enhanced Analytics through Harmonization and Integration

For this post, we'll consider the nearly-ubiquitous Cytiva AKTA series of instruments, an adaptive protein purification platform.

Cheng Han

Maintaining an environment fit for a cell
Pharmaceutical and biotech organizations globally use carbon dioxide (CO₂) incubators to maintain stable environments in labs working with cells or tissue cultures. Monitoring CO₂ incubators is tricky due to the sensitivity of the CO₂ sensors. The high humidity in incubators may cause drift or failure in the sensors. These deviations may also not be reported on time, causing serious business and financial impacts.

This blog focuses on the direct integration capabilities of the Tetra Lab Monitoring application, which facilitates CO₂ incubator monitoring without the use of independent probes. The benefits include shorter installation times, no need to remove probes before sterilization, and lower upfront costs in the lab.

CO₂ incubator monitoring provides unique challenges
Properly monitoring CO₂ incubators in the lab is crucial to ensure there are no unexpected changes in the equipment or in the environment it maintains, such as temperature, humidity, and CO₂ levels.

Monitoring CO₂ incubators is different from monitoring freezers, environmental temperatures, light levels, gas manifold alarms, or other equipment/factors in the lab. Some considerations lab managers and scientists need to keep in mind when monitoring CO₂ incubators are:

  1. Incubators must be regularly sterilized. Probes installed internally must be removed whenever a sterilization cycle is run.
  2. All CO₂ sensors (whether independent or built into the incubator) need to be calibrated periodically.
  3. Independent probe installation is time-consuming for incubators, especially if there are strict policies around moving incubators or the materials inside of them.

As more incubators are added to the lab, more and more overhead is added as well. An average lab can house 10-20 incubators, with each set of independent CO₂ sensors taking 10-15 minutes to remove and reinstall. The process takes more time if samples need to be moved to other incubators.

Battle-tested CO₂ monitoring in the lab
The Tetra Lab Monitoring application, built on the Tetra Data Platform, is used by the top global pharmaceutical and biotech organizations to monitor a variety of equipment in R&D and GxP labs, such as freezers and refrigerators, in addition to environmental parameters. The application also tracks temperature, humidity, CO₂, O₂, and agitation levels in incubators.

For monitoring, sturdy CO₂ sensors equipped with protective filters that keep out humidity are used. The main benefit is that they can be installed in any incubator; the only thing needed is access to an independent sensor port. Monitoring this way allows for a uniform solution for all incubators in the lab.

The Tetra Lab Monitoring application simplifies the CO₂ monitoring process by leveraging the data acquisition capabilities of the Tetra Data Platform. We have developed direct incubator integrations that are easy to install and maintain. Now we can pull data from the data output of an incubator - no external sensors necessary.
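
Conceptually, a direct integration of this kind reads the incubator's own data port rather than an external probe. The sketch below is a generic, hypothetical example using pyserial; the port name, baud rate, and command/response format are invented for illustration and do not reflect the actual protocol of any supported incubator.

```python
import time

import serial  # pyserial; assumes the incubator exposes a serial/USB data port

PORT = "/dev/ttyUSB0"  # hypothetical device path (e.g., COM3 on Windows)
POLL_SECONDS = 60

with serial.Serial(PORT, baudrate=9600, timeout=2) as conn:
    while True:
        conn.write(b"READ?\r\n")                 # invented query command
        raw = conn.readline().decode().strip()   # e.g., "37.0,5.0,95.0" (invented reply format)
        temp_c, co2_pct, humidity_pct = (float(v) for v in raw.split(","))
        print({"temp_c": temp_c, "co2_pct": co2_pct, "humidity_pct": humidity_pct})
        time.sleep(POLL_SECONDS)
```

In practice, each incubator model speaks its own protocol, which is why these direct integrations are built and maintained per instrument family.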

IMAGE: View of Heracell Incubator USB data port and the real time CO₂ level monitoring via direct integration


When monitoring incubators via direct integration, you can:

  1. Leave the monitor set up during a stericycle. The IoT gateway connected to the USB output of the incubator will have no problem recording that the inside of the chamber reached 180℃. If you left an independent probe connected, you’d be cleaning melted plastic out of the incubator once your sterilization was completed. This can be an expensive mistake!
  2. Calibrate just your incubator, and know that your monitoring system is reading a calibrated value.
  3. Install the TetraScience monitoring system in just a few minutes - saving both time and installation costs.
  4. Unlock more data using fewer wires. Some incubators are equipped with sensors that can monitor parameters like water level or agitation rate. In the case of shaking incubators, loss of agitation could inhibit optimal cell growth and lead to a failed experiment, costing the scientist time and money on reagents.
  5. Monitor incubators that could not be monitored using independent probes because they lack an access port.

IMAGE: View of the agitation rate in RPM for Infors HT Multitron measured via direct integration


To start, we’ve released a direct integration with the following manufacturers and models of incubators:

The integration allows for the monitoring of parameters like CO₂, O₂, temperature, humidity, spinning rate, and water level via data port output from the incubator.

Summary
Supporting the most popular incubators in the Tetra Lab Monitoring application ensures that experimental conditions in incubators across biotech and life science labs stay stable and lowers the probability of failure. Scientists do not need to worry about their cell or tissue cultures dying due to improper environments and losing work that took months. Over the years, our easy-to-use and configurable application has made work in labs more efficient. Developing integrations across a variety of lab equipment, alongside our dedicated customer support team, has saved organizations like Notable Labs millions of dollars, countless pieces of equipment, and potential losses.

Thanks to its unique ability to integrate with any instrument, TetraScience is continuously growing its portfolio of direct integrations with CO₂ incubators. Future integrations include the ThermoFisher Forma™ Steri-Cult™ CO₂ incubator and the ThermoFisher Forma™ Steri-Cycle™ CO₂ incubator, to name a few. We continuously work alongside our customers and partners to develop new integrations and features. If you have a specific CO₂ incubator that needs monitoring via direct integration, contact us to learn how easy direct integrations can be.

Lab Monitoring

Simplified Remote Monitoring of CO₂ Incubators via Direct Integration

Remote monitoring of CO₂ incubators via direct integrations simplifies the process. This results in shorter installation times and lower upfront costs in the lab, and eliminates the need to remove independent probes for sterilization.

Salvatore Savo

Eliminate the need for independent probes - lower failures in the lab and make monitoring CO₂ incubators more efficient


Maintaining an environment fit for a cell
Pharmaceutical and biotech organizations globally use carbon dioxide (CO₂) incubators to maintain stable environments in labs working with cells or tissue cultures. Monitoring CO₂ incubators is tricky due to the sensitivity of the CO₂ sensors. The high humidity in incubators may cause drift or failure in the sensors. These deviations may also not be reported on time, causing serious business and financial impacts.

This blog focuses on the direct integration capabilities of the TetraScience Lab Monitoring application, which facilitates CO₂ incubator monitoring without the use of independent probes. The benefits include shorter installation times, no need to remove probes before sterilization, and lower upfront costs in the lab.

CO2 incubator monitoring provides unique challenges
Properly monitoring CO₂ incubators in the lab is crucial to ensure there are no unexpected changes in the equipment or in the environment it maintains, such as temperature, humidity, and CO₂ levels.

Monitoring CO₂ incubators is different from monitoring freezers, environmental temperatures, light levels, gas manifold alarms, or other equipment/factors in the lab. Some considerations lab managers and scientists need to keep in mind when monitoring CO₂ incubators are:

  1. Incubators must be regularly sterilized. Probes installed internally must be removed whenever a sterilization cycle is run.
  2. All CO₂ sensors (whether independent or built into the incubator) need to be calibrated periodically.
  3. Independent probe installation is time-consuming for incubators, especially if there are strict policies around moving incubators or the materials inside of them.

As more incubators are added to the lab, more and more overhead is added as well. An average lab can house 10-20 incubators, with each set of independent CO₂ sensors taking 10-15 minutes to remove and reinstall. The process takes more time if samples need to be moved to other incubators.

Battle-tested CO2 monitoring in the lab
The TetraScience Lab Monitoring application, built on the TetraScience Data Platform, is used by the top global pharmaceutical and biotech organizations to monitor a variety of equipment in R&D and GxP labs, such as freezers and refrigerators, in addition to environmental parameters. The application also tracks temperature, humidity, and CO₂ levels in incubators.

For monitoring, sturdy CO₂ sensors equipped with protective filters that keep out humidity are used. The main benefit is that they can be installed in any incubator; the only thing needed is access to an independent sensor port. Monitoring this way allows for a uniform solution for all incubators in the lab.

The TetraScience Lab Monitoring application simplifies the CO₂ monitoring process by leveraging the data acquisition capabilities of the TetraScience Platform. We have developed direct incubator integrations that are easy to install and maintain. Now we can pull data from the data output of an incubator - no external sensors necessary.

IMAGE: View of real time CO₂ level monitoring via direct integration


When monitoring incubators via direct integration, you can:

  1. Leave the monitor set up during a stericycle. The IoT gateway connected to the USB output of the incubator will have no problem recording that the inside of the chamber reached 180℃. If you left an independent probe connected, you’d be cleaning melted plastic out of the incubator once your sterilization was completed. This can be an expensive mistake!
  2. Calibrate just your incubator, and know that your monitoring system is reading a calibrated value.
  3. Install the TetraScience monitoring system in just a few minutes - saving both time and installation costs.

To start, we’ve released a direct integration with the VWR Basic and VWR Symphony CO₂ incubators. The integration allows for the monitoring of temperature and CO₂ via the data port output from the incubator. The incubator does not report a percent humidity reading; however, an external humidity sensor can always be added.

Summary
Supporting the VWR incubators in the TetraScience Lab Monitoring application ensures that CO₂ levels in incubators across biotech and life sciences labs stay stable and lowers the probability of failure. Scientists do not need to worry about their cell or tissue cultures dying due to improper environments and losing work that took months. Over the years, our easy-to-use and configurable application has made work in labs more efficient. Developing integrations across a variety of lab equipment, alongside our dedicated customer support team, has saved organizations like Notable Labs millions of dollars, countless pieces of equipment, and potential losses.

In addition to the direct integration with VWR incubators, TetraScience also supports the following incubators:

Thanks to its unique ability to integrate with any instrument, TetraScience is continuously growing its portfolio of direct integrations with CO₂ incubators. Heracell is only the tip of the iceberg. Future integrations include the Panasonic Cell-IQ, ThermoFisher Forma™ Steri-Cult™ CO₂ incubator, and ThermoFisher Forma™ Steri-Cycle™ CO₂ incubator, to name a few. We continuously work alongside our customers and partners to develop new integrations and features. If you have a specific CO₂ incubator that needs monitoring via direct integration, contact a solution specialist to learn how easy direct integrations can be.

Lab Monitoring

Remote Monitoring of VWR CO₂ Incubators via Direct Integration

Remote monitoring of VWR CO₂ incubators via direct integration results in shorter installation times and lower upfront costs in the lab, and eliminates the need to remove independent probes for sterilization.

Salvatore Savo

Break down data silos and make managing scientific data easier


Authors: Mike Tarselli, CSO, Spin Wang, CEO

Data volumes in biology, chemistry, and materials science continue to explode. This leads to rising complexity for processing workflows: legacy systems just can’t keep up with innovation demand. Holistic data solutions, coupled with data-centric capabilities to enable the full lifecycle of R&D data, bring these glaring limitations to light. Can we stem the data tide?

Open, configurable data solutions with vendor-agnostic connectivity will empower life science organizations to advance discovery and innovation.[1]

Let’s consider how a scientific data management system (SDMS) might impair the critical innovation path instead of facilitating an open, interactive data future.

What is an SDMS, anyway?

So many letters, so little time - an SDMS is software acting like a filing cabinet. It captures, catalogues, and archives all versions of data from instruments -- HPLCs, mass specs, flow cytometers, sequencers -- and scientific applications like LIMS or ELNs. The extracted data are stored and handled in context- and process-specific data formats, which usually maintain consistent metadata and a defined structure.

Essentially, it’s a data warehouse with known limitations: potential incompatibility with heterogeneous data landscapes, difficulty ingesting new experimental data, and limited data prep for downstream analysis tools. This perpetuates a common biotech Achilles’ heel - the dreaded data silo.


Break out of the SDMS and eliminate all your woes

Woe #1: Missed connections
An SDMS doesn’t play well with others: no linking of disparate data sources and targets, like devices, external collaborators, or software systems (ELNs, databases, etc.).

Open access to more information
The Tetra R&D Data Cloud easily ingests heterogeneous data sources across the R&D ecosystem. By unifying data in the cloud via native connections, scientists are freed from the data yoke.

Woe #2: Wrangling Data
Data harmonization involves converting vendor-specific or proprietary formats, which today often requires manual steps. An SDMS doesn’t have the functionality to automatically harmonize data into query-ready formats or make it easy to apply the data to visualization or analyses.

Standardize, Harmonize
Automatically harmonize across vendors and formats, converting experimental data into a structured open format -- an Intermediate Data Schema (IDS) based on JSON and Parquet -- enabling data science and Big Data applications. This means disparate sources can finally be united in standard formats, allowing you to glean insights from all data captured. This wrings value from your precious data, no matter where it came from.

Woe #3: Too much static
An SDMS is a stale, limited repository with no customization opportunities to open your data flows. It simply cannot handle the evolving complexity of research projects.[2]

Let it flow
Comprehensive data engineering capabilities will streamline data prep for AI/ML and advanced analytics. Data flows seamlessly throughout the development process and allows for easy iteration. The TetraScience Data Platform allows pharmas to create their own processing pipelines, configure triggers, view statuses and files, and automate notifications via a modern centralized dashboard...a veritable control nexus for all your operations.

Woe #4: Stuffing the cabinet
Limited integration with the myriad file formats found in R&D means an SDMS can easily become yet another data silo, preventing straightforward data management and collaboration.

Declutter and streamline
Automatic centralization of results in the Tetra R&D Data Cloud enables seamless access, ensures data integrity, and allows for secure collaboration anywhere.

Woe #5: You're gonna need a bigger boat!
R&D data is Big Data.[3] An on-premises SDMS can’t digest data at this scale and complexity, and organizations miss out on the digital benefits of the cloud. Without the needed flexibility or the ability to scale, innovation plummets. To swim this data lake, you’re gonna need a bigger boat.

Floating in the cloud...on a yacht
TetraScience’s cloud-native architecture provides organizations the right tools to tackle the complexities in R&D. It makes it easy to manage high volumes of data and complicated workflows with access to the data in a secure environment.

Summary

Certainly, an SDMS will work for your organization if you only use one data format or just need a place to stash your files. It’s simple, easy, and stable...but also capability limited. R&D data are complex by nature. Complex challenges require novel approaches. Holistic data solutions that are adaptable, flexible, and scalable optimize data management and eliminate data silos. Get rid of outdated systems that perpetuate roadblocks and cause data woes - and say “Whoa!” to the breakneck speed of digital transformation!

  1. N. Limaye, "Data Integration: Changing the Pharma and Healthcare Landscape," www.Technologynetworks.com/biopharma, 27 February 2020
  2. T. Broekel, "Measuring technological complexity - Current approaches and a new measure of structural complexity," Utrecht University, 12 March 2018
  3. J. Cumbers, "How The Cloud Can Solve Life Science's Big Data Problem," www.Forbes.com, 19 December 2019
R&D Data Cloud

99 Problems, but an SDMS Ain't One

Data volumes are exploding. Workflows continue to get more complex. Legacy systems and processes can't keep up. All these factors are compounding into one big cluster...

Mike Tarselli, Ph.D., MBA

Factors to consider when temperature discrepancies and fluctuations arise while monitoring freezers in life science labs


The display on your -80℃, -20℃, or +4℃ cold storage units will almost always say just that - -80℃, -20℃, or +4℃. However, it is not uncommon for discrepancies to arise between the temperature reported on your freezer and on lab monitoring systems. For instance, a freezer in the lab will display -80℃ while the monitoring solution reads -75℃. This blog highlights how fluctuations in freezer temperature are common, along with other considerations for properly monitoring freezers in R&D labs.

A brief overview of how freezers work

Most freezers and refrigerators have a compressor (or compressors) that cool the unit. The compressor runs when the temperature gets too high, and turns off once the temperature has gotten low enough. The freezer insulation keeps the temperature from rising quickly, even when the compressor is off.

What does uniformity mean?

Uniformity generally refers to a state of consistency. In this context, we’re interested in temperature uniformity throughout different locations in the freezer. When analyzing temperature uniformity, temperature readings over time are averaged. Variation due to compressor cycles is not taken into account because while this affects temperature over time, it should affect all locations in the freezers in a consistent way.

In an ideal world, a ULT freezer averages -80℃ on all sides and in the middle. However, in reality, it might be -75℃ at the top, -78℃ at the side, and -82℃ in the middle. The University of California Riverside published a study[1] in 2016 evaluating ULT freezer temperature uniformity, which is very helpful in understanding what the temperature distribution can look like.

The image below is a figure from that study illustrating a typical temperature profile:


What causes the temperature variability, or lack of uniformity?

Two common reasons why there may be a lack of uniformity in freezers being monitored could be:

  • Convection inside the freezer
    The top of the freezer will have a higher average temperature
  • Differences in insulation thickness
    All freezers have certain locations which are less well-insulated than the rest (e.g. around the door seal)

What factors increase and decrease this variability?

Common factors include:

  • Number of racks
    When a freezer has big empty spaces, it allows more air to circulate. This means more temperature change due to convection, and more temperature fluctuations overall.
  • Amount of material stored
    The more frozen material stored in a freezer, the more uniform the temperature stays during normal use. In thermodynamics terms, the freezer has a high thermal mass, and so it takes more heat for the temperature to change.
  • Freezer make and model
    All freezer models have slightly different performance characteristics. For example, a Stirling freezer and ThermoFisher TSX could both be set to -80℃, but their average temperatures and typical temperature variation could differ. This doesn’t mean that anything is wrong with either freezer!

How do discrepancies in temperature monitoring affect day-to-day lab management?

There are two key take-aways to keep in mind when working with freezers in the lab and how to appropriately monitor them for your business needs:

Takeaway 1: If precise temperature matters to your samples, understand the spots in your equipment with variable temperatures and advise scientists appropriately. If you’ve ever accidentally frozen a dozen eggs because you put them on the wrong shelf in your refrigerator at home - the same problem can happen in the lab.

Takeaway 2: For installation and maintenance of your lab monitoring system, be aware of how probe placement might affect the temperature data collected. If you are a TetraScience Lab Monitoring customer, read on for specifics. If not, request a demo to find out how the solution can streamline freezer monitoring in your lab.

How does this affect my TetraScience installation?

Overall, freezer temperature uniformity should not affect the day-to-day usage of your alarm system. Your alert thresholds should be set above the highest average temperature typically seen in your freezer during normal operation.
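
As a rough illustration of that rule of thumb, a threshold could be derived from a window of recent probe readings. This is a minimal Python sketch with made-up readings and an assumed safety margin, not the logic the TetraScience application uses:

    # Derive an alert threshold from recent probe readings (illustrative only).
    readings = [-79.2, -77.8, -80.5, -76.9, -78.3, -77.1]   # hypothetical deg C samples

    highest_typical = max(readings)   # warmest value seen during normal operation
    margin = 3.0                      # assumed safety margin in deg C
    alert_threshold = highest_typical + margin

    print(f"Highest typical reading: {highest_typical:.1f} C")
    print(f"Suggested alert threshold: {alert_threshold:.1f} C")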

Understanding temperature distribution is more important during installation or if you need to move, reposition, or replace the temperature probe.

If you are doing a self-installation, you should be aware of these temperature differences and take the following steps to minimize issues due to normal temperature variation:

  • Place the probe towards the back of the freezer; do not place it too close to the door where the average temperature is higher and there will be more variability in temperature.
  • If possible, avoid placing the probe on a completely empty shelf.
  • Avoid placing the probe on the back wall of the freezer if the shelf is completely full. The probe can end up too well insulated by the freezer racks, and it will take longer to detect temperature excursions.

Keep in mind that every model of freezer is different! If you think the temperature you’re recording is too high or low on average, you can always try moving the probe to a different location.

Summary

The key takeaway here is not to panic when there is a difference between the temperature shown on the freezer and the temperature reported by the monitoring system in R&D life science labs! Fluctuations in temperature, and discrepancies between the freezer display and the monitoring system, do not indicate a failure in either system. Knowing these intricacies can help alleviate worry, establish appropriate alert set points to eliminate alert fatigue, and generally bring peace of mind to scientists and lab managers. The key is to employ solutions that are built and battle-tested to support best practices for how important lab equipment is not only used, but also properly monitored.

  1. Faugeroux, Delphine. Ultra-Low Temperature Freezer Performance and Energy Use Tests. University of California, Riverside, 2016.
Lab Monitoring

Understanding Why Freezer Temperatures May Not Be Uniform

Discrepancies between the temperature shown on freezers and the temperature reported by monitoring systems do not indicate a failure in either system. This blog highlights key takeaways for monitoring freezer temperature.

Erika Tsutsumi

Let’s talk about data as the plumbing of the Digital Lab.

John Conway of 20/15 Visioneers draws an analogy between R&D organizations and a builDing. The capital D is not a typo - he argues that the word is active (builDing) rather than finished (builT) because it is never really done. There will always be some kind of upgrade, improvement, enhancement, or repair. He talks about power - the electricity and networks that allow for communication. He talks about virtual capabilities like software and advanced analytics. He talks about infrastructure as the critical foundation that protects and allows for proper existence. He also talks about plumbing as the processes and instruments that produce and move data. Here at TetraScience, we think this analogy is spot on, and we are inspired to dive deeper into the concept of "data plumbing."

“Data Plumbing” for the Digital Lab

When you think about plumbing for your home, running water, dishwashers, washing machines, and so on are what pop into your head. At face value, it doesn't seem too complicated. However, plumbing can be incredibly complex when you think about how it needs to work in a NYC skyscraper, a sports stadium, or the New England Aquarium. For biotech and pharmaceutical companies today, data flow in the lab is equally - if not more - complex than the water plumbing systems in those buildings. In the lab, there are heterogeneous instruments - different brands and models producing disparate file formats. There are distributed partners, such as CROs and CDMOs, doing outsourced research. Different applications are also found in the lab, such as registration, inventory, ELN, LIMS, and home-grown applications. There are many workflows that scientists are running, different kinds of experiments and assays, and increasingly, different visualization and analysis tools that data scientists are introducing into the ecosystem.

We define lab data plumbing as the collection, cleansing, harmonization, and movement of all lab data.

Just like the plumbing in a building, it can be messy and dirty. It involves the unrestricted flow of a valuable substance (information or water) through a complex system that can be simple to start, but quickly becomes complicated. This requires special skill sets, tools, and careful design to implement properly.

Let’s consider some examples of data plumbing in today’s lab:

  • A CRO sends a huge volume of Excel files and PDFs via file share. Scientists then have to manually download the files and perform a quality check - to make sure the barcode is available in the registration system, for example.
  • The next step involves scientists copying and pasting the values into lab notebooks, then manually transcribing them into an ELN. In a GxP-compliant environment, this process often includes additional review steps.
  • Results are manually compiled or aggregated from multiple batches or samples at different time points into an Excel spreadsheet to align the curves and illustrate trends and anomalies.

This type of primitive data plumbing is happening every day in every biopharma lab around the world. It isn’t automated, but it is already present - like manually pumping water out of a well instead of simply turning on the faucet.

Data-Plumbing-1

Requirements for a modern lab data plumbing system

Using building plumbing as inspiration, let’s think about some design requirements for a modern Digital Lab Data Plumbing system:

  1. Prevent dirt and use filtration. We do not want dirty water coming into our buildings. Incoming water will often undergo additional sanitization and filtration. Similarly, we do not want dirty or untagged data going into the data lake, ELN, visualizations, or reports. It is crucial to collect as much data as possible from different sources. Equally crucial is to attach the right metadata, perform validation checks, and harmonize the data (a minimal sketch of such a check follows this list).
  2. Fix leaks and clogs. There will always be leaks, broken pipes, and clogs causing drips and floods in buildings. This can be very annoying, or it can be disastrous. Similarly, in data plumbing, there can be processing errors. We need alerting, a notification system to proactively detect missing data, processing failures, and any kind of throughput bottlenecks or fluctuations so we can fix them before they become disasters. Imagine a file size is much larger than expected and causes memory issues. This needs to be tracked and proactively alerted. Trying to identify the source of a pipe leak would be a nightmare without a building floor plan to map out how the pipes are connected. It is the same for data flow - it is important to view and manage from a central dashboard.
  3. Configurability. We need to be able to swap out a sink or shower head without impacting or changing the rest of the system. In a data plumbing system, we need the same, or maybe a higher, degree of configurability, enabling labs to change an instrument or application without impacting the rest of the system. We need to be able to easily plug in multiple ways to consume data - Spotfire, Tableau, Jupyter Notebook, R, visualization tools, or your own applications. This should be possible in a plug-and-play fashion. The lab instruments and applications should enable the science; the science shouldn’t be limited by the available lab instruments and applications.
  4. Manage pressure. Water enters the building under pressure. That’s how it travels upstairs and around corners. In data plumbing, when you connect a new data source, the data ingestion pressure will build up. Maybe you want to ingest two years’ worth of experiments. Or a new CRO, or a new type of experiment, or a new robotic system. In each case, the amount of new data builds up. You will need proper load balancing, throttling, and auto-scaling to handle the pressure.
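
To make requirement 1 concrete, here is a minimal Python sketch of a "filtration" step that rejects files missing required metadata before they reach the data lake; the field names and function are hypothetical illustrations, not part of the TetraScience API:

    # Hypothetical "filtration" step: only well-tagged files enter the data lake.
    REQUIRED_METADATA = {"instrument_id", "sample_barcode", "timestamp"}

    def validate_incoming(file_record: dict) -> bool:
        """Return True if the record carries all required metadata fields."""
        missing = REQUIRED_METADATA - set(file_record.get("metadata", {}))
        if missing:
            print(f"Rejected {file_record.get('name', '?')}: missing {sorted(missing)}")
            return False
        return True

    incoming = [
        {"name": "plate_42.csv",
         "metadata": {"instrument_id": "PR-7", "sample_barcode": "S-001",
                      "timestamp": "2020-06-15T10:00:00Z"}},
        {"name": "untagged.xlsx", "metadata": {"instrument_id": "PR-7"}},
    ]
    clean = [f for f in incoming if validate_incoming(f)]
    print(f"{len(clean)} of {len(incoming)} files accepted")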

These are just some of the parallels we can draw from plumbing in a building, and they can inspire the proper design of a lab data plumbing system.

Of course, labs have their own unique challenges to consider:

  • Collecting data from instruments is very difficult - this can be a fight, as many instruments do not consider the downstream use cases of the data they produce
  • Comprehensive and immutable data processing logs - a lot of the transformed experimental data will be used in FDA submissions or quality processes
  • Ability to rerun or replay the data flow - you may want to extract new metadata out of the raw data, merge the data, or direct it to new places
  • Rapidly customize and/or upgrade pipelines - for new types of studies or new types of insights
  • Keep track of changes and upgrades to the data processing tools - you need to maintain a thorough audit log for 21 CFR Part 11

Hopefully by now you agree that data flow in the Digital Lab is quite complex and quite important, and deserves a renovation. Just to be safe, let’s put some numbers to the concept.

The impact of not taking action to improve your lab Data Plumbing

Let’s go back to the 3 data plumbing examples we talked about: 1) CROs sending a large volume of files that need to be quality checked manually, 2) scientists copy-pasting into notebooks and then manually transcribing into ELNs, and 3) data scientists manually aggregating results. Customers tell us it is common for scientists to spend as much as 10-20 hours per week on these types of manual data wrangling activities.

If we do some quick, back-of-the-envelope math to scale this up to a large organization with thousands of scientists, it could look something like this:

  • 15 hours per week x 44 working weeks x 1500 scientists = almost 1,000,000 hours per year on manual data wrangling.
  • If we assume $125/hour, that translates to $125 million per year wasted on manual data processes.
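
The same back-of-the-envelope arithmetic, as a quick Python check (the hours, weeks, headcount, and rate are the assumptions stated above):

    # Back-of-the-envelope cost of manual data wrangling (assumptions from above).
    hours_per_week = 15
    working_weeks = 44
    scientists = 1500
    hourly_rate = 125  # USD

    total_hours = hours_per_week * working_weeks * scientists
    total_cost = total_hours * hourly_rate
    print(f"{total_hours:,} hours per year")   # 990,000, i.e. almost 1,000,000
    print(f"${total_cost:,} per year")         # 123,750,000, i.e. roughly $125 million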

This is a meaningful, sizable, fundamental problem in the lab. In terms of opportunity cost, what could your scientists do with that million hours of productive, uninterrupted time? Could they run 10% or even 20% more experiments?

What could your data scientists do with clean, accessible, prepared data at their fingertips? What is the implication, or ripple effect, of transcription errors or missing reported information propagating downstream?

Recommendations to get started on “home improvements” for your lab data plumbing

First, plan the data plumbing system for your lab as a first-level architecture consideration, not as an afterthought. This does not mean that you have to be building a new lab. For existing labs, the typical approach is to buy the instrument, buy the ELN, and then figure out how to connect them. In your building, you design the plumbing system knowing that there will be a dishwasher and a sink, even if the exact fixtures haven’t been purchased yet. It's the same for the lab: you know you will need to move data around, so plan a configurable “message bus” (for our IT readers) that is capable of plugging into a variety of instruments and applications. And then plug them in.
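
For our IT readers, the "message bus" idea can be sketched in a few lines of Python. This toy in-process publish/subscribe bus is only meant to illustrate the plug-and-play pattern; a real lab would use a managed queue or a platform such as the Tetra Data Platform, and all names below are invented for the example:

    # Toy publish/subscribe "message bus": instruments publish, applications subscribe.
    from collections import defaultdict

    class MessageBus:
        def __init__(self):
            self._subscribers = defaultdict(list)

        def subscribe(self, topic, handler):
            self._subscribers[topic].append(handler)

        def publish(self, topic, payload):
            for handler in self._subscribers[topic]:
                handler(payload)

    bus = MessageBus()
    # An ELN and a data lake both subscribe to plate-reader results.
    bus.subscribe("plate_reader.result", lambda p: print("ELN received", p["plate_id"]))
    bus.subscribe("plate_reader.result", lambda p: print("Data lake stored", p["plate_id"]))
    # A new instrument can be plugged in without changing any subscriber.
    bus.publish("plate_reader.result", {"plate_id": "P-0042", "wells": 384})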

Second, START NOW! Don’t wait. Don’t let scientists and data scientists waste their time on data wrangling. Every little bit helps make a dent in that 1 million hours and $125 million wasted every year. If the job seems daunting, just pick off one small component at a time. You are not in this alone! We are here to help. Follow our blog and our social media for tips, or check out our website to learn more about our data engineering platform that provides the data plumbing for the Digital Lab.

Watch our "Data Plumbing for the Digital Lab" presentation from the Discovery IT Digital Week 2020 to learn more!

R&D Data Cloud

"Data Plumbing" for the Digital Lab

Just like the plumbing in a huge building, data plumbing in the digital lab can be messy and dirty. It's simple to start, but quickly becomes complicated.

Spin Wang

Adhere to industry best-practices by enabling GxP lab compliance through automatic incident reporting


Automatic reporting of deviations = GxP Compliance

As Life Sciences labs continue to ramp up the speed of innovation, standardized management of controls and processes needs to be enforced to ensure the reliability, reproducibility, and integrity not only of all the data produced, but also of the processes for monitoring and managing all activities in the labs themselves. GxP quality guidelines and standards are necessary for the safety and integrity of manufactured products - especially important in the life sciences industry, as most final products are intended for human use.

GxP compliance requires organizations to ensure that all processes and digital data records are trustworthy, reliable, and cannot be manipulated. This includes keeping track of any unusual events detected by systems, such as deviations or anomalies across all equipment found in the lab. This information must be provided to government agencies, like the FDA. The reports need to include supporting documentation that reflects incidents and shows that no step in the process was compromised or manually edited. This is necessary to determine whether a deviation negatively impacts the final product and ultimately renders it unsafe for use.

The challenge for many organizations seeking GxP compliance is having access to best-in-class lab monitoring solutions that automatically export the thorough reports government agencies require to successfully complete audits. Without automated generation of reports in non-editable formats, the task is not only laborious and prone to human error, but also fails to comply with agency guidelines stating that data and reports cannot be manipulated. By implementing incident reporting features, organizations can monitor and report on any deviation or anomaly detected by lab monitoring solutions and automate the resulting reports. This further enforces GxP and ensures compliance across the organization.

IMAGE: Automatic incident report generated by TetraScience Lab Monitoring application

Copy-of--BLOG--INCIDENT-REPORTING
Summary

The Tetra Lab Monitoring application, built on top of the Tetra Data Platform, is designed for GxP labs to enable 21 CFR Part 11 compliance through features such as comprehensive audit trails, electronic signatures, multi-factor authentication, and others. The addition of incident reporting to the Lab Monitoring application enables Life Sciences organizations to automatically generate comprehensive incident reports that reflect all deviations detected across equipment in the lab - at the click of a button. This saves organizations time and resources, and strengthens their compliance and operational integrity.

The application combines a cloud-native, web-based user experience with reliable hardware that supports 5GHz Wi-Fi, cellular, and hardwired connections. Built on infrastructure that makes data accessible and secure, it meets all necessary data integrity requirements.

Lab Monitoring

Incident Reporting for GxP Compliance

The TetraScience Lab Monitoring application now provides comprehensive incident reporting for monitoring and tracking of all potential deviations across equipment in Life Sciences labs.

Salvatore Savo

HighRes Biosolutions Cellario and Tetra Data Platform optimize data management and data flows in the cloud

HighRes Biosolutions designs and builds innovative laboratory automation systems, dynamic scheduling software, and lab automation instruments. Cellario, the industry's state-of-the-art lab automation software, enables instrument and robotics scheduling in the lab. Lab automation systems and software generate massive volumes of R&D data that can accelerate therapeutic discovery.

The integration between Cellario and TetraScience automates and streamlines the collection of R&D data generated from Cellario into downstream data science applications and other informatics applications such as ELN and LIMS.

Cellario, with its coupled API layers, is designed to support a wide range of upstream and downstream data integration requirements. Cellario’s RESTful API can be used to integrate with any other software platform. In this case, TetraScience is leveraging Cellario’s publisher/subscriber event APIs to receive data events. In addition to standard data events that every reader creates, end users can easily customize the data stream by using scripts to create data events.

Data is centralized in the Tetra Data Platform and harmonized into the Intermediate Data Schema (IDS-JSON), a structured, vendor-neutral format. Once R&D data is harmonized, it is directly queryable via web API or SQL, and can be further transformed into any format needed. Cellario-produced data is now accessible by your favorite data science tools. It can also be combined with other R&D data that customers store in the TetraScience platform for further analysis.
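
To give a sense of what "directly queryable" can look like once data is harmonized, here is a minimal Python sketch that filters harmonized JSON records the way a SQL WHERE clause would; the field names and values are invented for illustration and are not the actual IDS schema:

    # Hypothetical example: querying harmonized, IDS-style JSON records.
    import json

    records = [
        json.dumps({"order_id": "ORD-17", "plate": "P-0042", "protocol": "ELISA", "max_od": 1.92}),
        json.dumps({"order_id": "ORD-18", "plate": "P-0043", "protocol": "ELISA", "max_od": 0.31}),
    ]

    # Filter the structured records, e.g. "WHERE protocol = 'ELISA' AND max_od > 1.0".
    hits = [r for r in map(json.loads, records) if r["protocol"] == "ELISA" and r["max_od"] > 1.0]
    for r in hits:
        print(r["order_id"], r["plate"], r["max_od"])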

How the integration works

Step 1: Configuration
Our connector was developed in collaboration with HighRes Biosolutions as part of this integration. You can simply configure the connection to the Cellario software in the TetraScience platform web interface. There is no need to kick off a multi-month customization project, write code from scratch, and then spend even more effort and money maintaining the connection [1].

The product roadmap for this integration includes more features in future releases, such as filtering by event/data type and selecting data within a defined time frame. These are based on use cases crowdsourced from Life Sciences companies within the Tetra Network. Continuous platform innovations, like new features and capabilities, are made available to customers regularly.

IMAGE: Configuration of HighRes Biosolutions Cellario software to the Tetra Data Platform


cellario-connector-config

Step 2: Collect RAW Files and Attach Metadata
The first step after configuration is collecting the RAW files. The files are automatically extracted by our Cellario connector and uploaded into the Tetra Data Lake. The connector also collects and attaches important metadata to the files. Metadata and tags are customizable, and often include information about the order, request, plate, Cellario protocol, and/or other relevant metadata. Tagging metadata provides powerful context to integrate data with ELN/LIMS and perform advanced data science and analytics. Sufficient context is one of the foundational steps in FAIR data principles.

IMAGE: Metadata attached to Cellario extracted files


Screen-Shot-2020-06-14-at-11.09.07-PM

IMAGE: Select Cellario Metadata & Tags


Screen-Shot-2020-06-15-at-10.40.44-PM

Step 3: Data Engineering + Data Science in the Cloud
After the automatic data collection process, a data pipeline parses the data into IDS-JSON format. The data is now harmonized in the cloud-native data lake. Two immediate benefits of this data engineering are:

  1. Data is queryable, which means scientists and data scientists can find it
  2. Once your data is accessible, queryable, and in a common format like JSON, it can be imported into a myriad of data science tools to discover actionable insights

IMAGE: Visualization of HighRes Biosolutions lab automation systems usage


Screen-Shot-2020-06-16-at-10.36.54-PM

IMAGE: Visualization of data produced during a plate reader run


heatmap

This process enables data and workflows that are accessible and scalable in a secure, cloud-native environment.
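
For readers who want to try something similar on their own exports, a plate-reader readout like the heatmap above can be rendered in a few lines of Python; the data below is randomly generated and the plotting choices are assumptions, not the platform's built-in visualization:

    # Render a 96-well plate readout as a heatmap (synthetic data for illustration).
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    plate = rng.random((8, 12))          # 8 rows (A-H) x 12 columns of fake signal

    fig, ax = plt.subplots()
    im = ax.imshow(plate, cmap="viridis")
    ax.set_xticks(range(12))
    ax.set_xticklabels([str(c + 1) for c in range(12)])
    ax.set_yticks(range(8))
    ax.set_yticklabels(list("ABCDEFGH"))
    fig.colorbar(im, label="signal (a.u.)")
    plt.show()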

Step 4: Inline Cloud-based Data Analysis and DOE
Step 4: Inline Cloud-based Data Analysis and DOE
This integration establishes a closed loop between lab automation and Design of Experiments (DOE) software. There are also APIs to place orders and manage automation systems and Cellario-controlled instruments and devices.

IMAGE: Data automation feedback loop between HighRes Biosolutions and TetraScience


To take it one step further, you can introduce cloud-based in-line analysis and calculation, using interactive data science tools like Jupyter Notebook. The lab automation system now benefits from in-line cloud computation. You can also leverage historical data sets - contributing to the model and instructing the next point to search in the parameter space.
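
As a heavily simplified illustration of "use historical data to pick the next point," the sketch below suggests the untested condition closest to the best result so far; real DOE tooling would fit a proper model (e.g. Bayesian optimization), and every number here is made up:

    # Toy heuristic: suggest the untried condition nearest the best historical result.
    history = [   # (temperature_C, pH) -> measured response, all values invented
        ((30.0, 6.8), 1.2),
        ((32.0, 7.0), 2.1),
        ((34.0, 7.2), 1.7),
    ]
    candidates = [(31.0, 6.9), (33.0, 7.1), (36.0, 7.4)]

    best_point, _ = max(history, key=lambda h: h[1])

    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    next_point = min(candidates, key=lambda c: distance(c, best_point))
    print("Best so far:", best_point, "-> suggest testing:", next_point)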


Summary

The TetraScience + HighRes Biosolutions best-in-class workflow, control, and orchestration solution will no doubt allow scientists to unlock discoveries in life sciences faster. The manual effort associated with data integration, and the human error that comes with it, is virtually eliminated - allowing for high integrity and consistency of the data. Leveraging an enterprise-grade, cloud-native platform enables actionable insights across the entire discovery and development process.

  1. https://www.econnectivity.se/app-maintenance-cost-can-be-three-times-higher-than-development-cost/#:~:text=Enterprise%20App%20Maintenance%20Costs&text=More%20than%2070%25%20of%20the,successfully%20maintaining%20an%20enterprise%20app
R&D Data Cloud

Cloud-based Data Management with Lab Automation: HighRes Biosolutions Cellario + TetraScience

Integration between HighRes Biosolutions' Cellario lab automation software and the Tetra Data Platform provides cloud-based data management and unlocks data science

Kai Wang

The Suffering Science, and Missed or Delayed Discoveries Behind and Underneath Trash Data


Author: John F. Conway, Chief Visioneer Officer, 20/15 Visioneers

Trash or Dark Data is collected data that is not being touched or getting secondary use. We are not talking about tracking data or ancillary system data, etc. We are unfortunately talking about R&D data that has come from experimentation and testing, both physical and virtual. This means individuals and organizations are not taking advantage of key decision-making data and information and are missing very valuable insights.

In many R&D organizations, Trash Data has accumulated and can account for upwards of 80% of known data. (1), (2)

The interesting part of this conundrum is that it's both structured and unstructured data! If data isn't properly contextualized, it's probably not going to be reused, and hence it becomes Trash Data. So, don't be fooled into just pushing mis- or uncontextualized data to the cloud, because you will now have Trash Data in the cloud!

Trash-Data

Unfortunately, in the year 2020, this is a major problem for R&D organizations. It is telling you that either you have no Scientific Data and Process Strategy, or, even worse, you are not following the one you have written and committed to following! It's also telling you that some of your AI/ML strategies are going to be delayed until the organization solves its major data and process problems. Model Quality Data (MQD), and lots of it, is needed for AI/ML approaches in R&D. And remember, your processes produce the data, so they go hand-in-hand. Data generation processes need to be formulated and established during the initial step of any data project. To mitigate the risk of being inundated with trash data, a foundation with strict requirements for data production and integration needs to be established. "Data plumbing" is a critical step that will ensure your properly contextualized data is routed and deposited in the right location.

Another major issue is what to do with legacy data! Based on what was just discussed, my experience has shown that ~80% of legacy data is not worth migrating. The best strategy is probably to leave it where it is and connect to it with your new data integration platform. However, you will determine the value and worth of the legacy data by performing a data assessment through the new platform.

So, how has this happened? How did we arrive at this place where 80% of our hard-won R&D data ends up as trash? The truth is that it comes down to human behavior and discipline. Writing, agreeing, and committing to a Scientific Data and Process Strategy is step number one. However, to take a written agreement and turn it into standard processes that are embedded within the organization, you need a company culture that includes true leadership and a focus in the area of REPRODUCIBLE science. Tactically, it starts with data capture and storage principles. You need a Data Plumber!

R&D organizations are like snowflakes - no two are identical - but there is much overlap in process and in the types of data generated! Variation is the real problem: instrument types and various lab equipment with or without accompanying software (e.g. CDS - chromatography data systems), ancillary scientific software like entity registration, ELN (electronic laboratory notebook), LIMS (laboratory information management system), SDMS (scientific data management system), exploratory analysis, decision support, Microsoft Office tools, and the list goes on. Hopefully, you have consistent business rules for managing and curating your data, but chances are this varies as well. What you unfortunately end up with is a decidedly unFAIR (see FAIR data principles) data and process environment.

Why did you, a mature R&D organization, end up in this position? (Startup biotechs - BEWARE! Learn from the mistakes of those who have gone before and don't let this happen to you!) (3). It may be hard to deconvolute, but I think - and this can be up for debate or confirmation - the mindset of "data and processes as an asset" got lost somewhere along the way. Perhaps management and others became impatient and didn't see a return on their investment. Poorly implemented scientific software tools that were designed to help with these problems compounded the situation. In some cases, environments were severely underinvested in. Taking shortcuts in the overall data strategy, without establishing the other foundational steps or processes, is like building a house without a proper foundation. At first, this leads to sagging windows and doors, and a poor living experience with the house. Eventually, the foundation-less house collapses or needs to be demolished. In other cases, churn and turnover in different IT/informatics and business functions created a "kick the problem down the road" situation. Many times, the "soft" metrics were never gathered for repeated experiments and the time wasted trying to find things or make heads or tails of poorly documented data or experiments. At the end of the day, humans worked harder instead of smarter to make up for the deficiencies.

Imagine you can start to solve this problem both strategically and tactically. Tactically, it starts with business process mapping and completely understanding your processes. The understanding of the detail is an insurance policy for much better outcomes in your journey. As discussed, strategically you need a sound written Scientific Data and Process (SD&P) Strategy that the organization can follow. Tactically, you need to capture your structured and unstructured data in both raw and processed forms. It must be properly contextualized; in other words, you need to execute on your SD&P Strategy. Make sure you are adding the right metadata so that your data is relatable and easily searchable.

This can’t all be done on the shoulders of your scientists. Instead, use smart technology and adopt standards wherever possible. You need to purposefully design the framework for the “data plumbing” in your organization. Both hot and cold data need to be routed to where they belong. And... besides an ELN, SDMS, or LIMS, data may belong in a knowledge graph where its true value can be exploited by all personas, like data scientists, computational scientists, and savvy scientists making bigger decisions! When you can accomplish this purposeful routing, you will end the broken pipes that lead to data silos and disparate data! Finally, you are on the road to being a FAIR-compliant R&D organization! Findable! Accessible! Interoperable! And last, but not least, secondary use of your data - Reusable!

Your organization must make very important decisions about how it is going to guard some of its top assets - proprietary data and processes. All R&D organizations need their high-quality science to be reproducible and repeatable. The data and process capture must be exact for this to occur. This means understanding both the instrument integration AND the “plumbing”, or shepherding of the data into the right data repositories, is critical.

Let’s consider one example. On the surface, an ELN is keeping track of the researcher’s Idea/Hypothesis through to his or her Conclusion. The ELN captures some artifacts of data including images, files, etc. However, in many cases, the ELN does not support the storage of massive amounts of experimental data. Instead, it records a pointer to this data. This “pointer” strategy prevents application data bloat and encourages the proper storage and curation of experimental data. This is just one example where “data plumbing” design comes into play in many medium to large R&D organizations. Having a platform that you can plug into that captures data, instrument, application, and process integration is a high ROI (return on investment) need.

Having worked in this space for thirty years in a plethora of roles, from individual contributor to leading strategy and teams, it has become obvious that this is a very big problem that will need true teamwork to combat. Many bespoke systems have been built; maybe some have worked well, but I haven't seen it myself.

I believe that we are finally able to solve this Trash Data problem once and for all. You need to partner with companies who are taking a platform approach to get into production as quickly as possible. You need partners who truly understand data diversity, contextualization, and FAIR principles.

Every day you don’t have a solution in production, the Trash Data continues to pile higher. The predictions are out there: IBM estimates a jump to 93% in the short years to come.

The platform needs to be cloud-native to provide the scalability and agility needed for future-proofing. It needs to be enterprise-grade, meeting the security and compliance needs of Life Sciences R&D. It also needs to elegantly handle the complexity of not only the automated collection of data - everyone can do that these days - but also the “plumbing” of the data to and from multiple ELNs, LIMS, SDMS, knowledge bases/graphs for data science tools, etc. We all know that big pharma is never going to be able to consolidate to one provider across the enterprise. And it needs to harmonize all the data - RAW and processed - and significantly reduce or eliminate Trash Data. The platform needs to automate all repeatable tasks and metadata collection/capture to remove the burden from the scientists and improve data integrity. This is a serious endeavor, and one you can't afford to ignore. After all, you don’t know what you don’t know, but even worse, you don’t know what you already should know, and you can’t find that experiment or data in your current environment!

John-Conway-Photo


About the author:
John has spent 30 years in R&D environments that include Life Sciences, Material Sciences, Computational Sciences, Software, and Consulting. In his last BioPharma role, John was Global Head of R&D IT for AstraZeneca and Global Head of Data Sciences and AI. John started his own Consultancy in late 2019, 20/15 Visioneers, which has been taking off and keeps him busy. To read more of John's thought leadership in R&D Informatics, check out the 20/15 Visioneers Blog.

  1. Dark analytics: Illuminating opportunities hidden within unstructured data
  2. Dark Data Wikipedia
  3. The Biotech's Manifesto for Scientific Informatics
Expert's Corner

The Science Behind Trash Data

John F. Conway, Chief Visioneer Officer at 20/15 Visioneers, dives into the subject of Trash Data and its impact on R&D organizations. With thirty years of experience, Mr. Conway believes the problem of Trash Data can finally be solved.

John F. Conway

Eliminate the need for independent probes - lower failures in the lab and make monitoring CO₂ incubators more efficient


Maintaining an environment fit for a cell
Pharmaceutical and biotech organizations globally use carbon dioxide (CO₂) incubators to maintain stable environments in labs working with cells or tissue cultures. Monitoring CO₂ incubators is tricky due to the sensitivity of the CO₂ sensors. The high humidity in incubators may cause drift or failure in the sensors. These deviations may also not be reported on time, causing serious business and financial impacts.

This blog focuses on the direct integration capabilities of the Tetra Lab Monitoring application, which facilitate CO₂ incubator monitoring without the use of independent probes. The benefits include shorter installation times, no need to remove probes before sterilization, and lower upfront costs in the lab.

CO2 incubator monitoring presents unique challenges
Properly monitoring CO₂ incubators in the lab is crucial to catch any changes in the equipment or in environmental conditions such as temperature, humidity, and CO₂ levels.

Monitoring CO₂ incubators is different from monitoring freezers, environmental temperatures, light levels, gas manifold alarms, or other equipment/factors in the lab. Some considerations lab managers and scientists need to keep in mind when monitoring CO₂ incubators are:

  1. Incubators must be regularly sterilized. Probes installed internally must be removed before a sterilization cycle is run.
  2. All CO₂ sensors (whether independent or built into the incubator) need to be calibrated periodically.
  3. Independent probe installation is time consuming for incubators, especially if there are strict policies around moving incubators or the materials inside of them.

As more incubators are added to the lab, more and more overhead is added as well. An average lab can house 10-20 incubators, with each set of independent CO₂ sensors taking 10-15 minutes to remove and reinstall. The process takes more time if samples need to be moved to other incubators.

Battle-tested CO2 monitoring in the lab
The TetraScience Lab Monitoring application, built on the Tetra Data Platform, is used by the top global pharmaceutical and biotech organizations to monitor a variety of equipment in R&D and GxP labs such as freezers and refrigerators in addition to environmental parameters. The application also tracks temperature, humidity, and CO₂ levels in incubators.

For monitoring, sturdy CO₂ sensors equipped with protective filters that keep out humidity are used. Their main benefit is that they can be installed in any incubator; all that is needed is access to an independent sensor port. Monitoring this way allows for a uniform solution for all incubators in the lab.

The TetraScience Lab Monitoring application simplifies the CO₂ monitoring process by leveraging the data acquisition capabilities of the TetraScience Platform. We have developed direct incubator integrations that are easy to install and maintain. Now we can pull data from the data output of an incubator - no external sensors necessary.
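
Purely for illustration, reading a line-oriented instrument output over a USB/serial connection might look like the sketch below. The port name, message format, and field layout are assumptions for the example - this is not the Heracell VIOS protocol or the TetraScience connector code:

    # Hypothetical reader for a line-oriented incubator output over USB/serial.
    # Requires the third-party "pyserial" package; every format detail is assumed.
    import serial

    with serial.Serial("/dev/ttyUSB0", baudrate=9600, timeout=5) as port:
        line = port.readline().decode("ascii", errors="replace").strip()
        # Assume a CSV-style line such as "37.0,5.0,OK" (temp C, CO2 %, water status).
        temp_c, co2_pct, water_status = line.split(",")[:3]
        print(f"temperature={temp_c} C, CO2={co2_pct} %, water={water_status}")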

IMAGE: View of real time CO₂ level monitoring via direct integration

-BLOG-IMAGE--Real_Time_CO2_Monitoring-1

When monitoring incubators via direct integration, you can:

  1. Leave the monitor set up during a sterilization cycle. The IoT gateway connected to the USB output of the incubator will have no problem recording that the inside of the chamber reached 180℃. If you left an independent probe connected, you’d be cleaning melted plastic out of the incubator once the sterilization was completed. This can be an expensive mistake!
  2. Calibrate just your incubator, and know that your monitoring system is reading a calibrated value.
  3. Install the TetraScience monitoring system in just a few minutes - saving both time and installation costs.

To start, we’ve released a direct integration with the Heracell VIOS. The integration allows for the monitoring of temperature, water level, CO₂, and O₂ (when required) via the incubator's USB output. The incubator does not report a percent humidity reading; however, a notification is sent when the water in the equipment is low. An external sensor for humidity percentage can always be added.

Summary
Supporting the Heracell VIOS incubator in the TetraScience Lab Monitoring application ensures that CO₂ levels in incubators across biotech and life sciences labs are stable and lowers the probability of failure. Scientists do not need to worry about their cell or tissue cultures dying due to improper environments and losing work that took months. Over the years, our easy-to-use and configurable application has made work in labs more efficient. Developing integrations across a variety of lab equipment, alongside our dedicated customer support team, has saved organizations like Notable Labs millions of dollars, countless pieces of equipment, and untold potential losses.

Heracell is only the tip of the iceberg. Future integrations include Infors HT Multitron shaking incubator, ThermoFisher Forma™ Steri-Cult™ CO2 incubator, and LiCONiC StoreX Series of incubators to name a few. We continuously work alongside our customers and partners to develop new integrations and features. If you have a specific CO2 incubator that needs monitoring via direct integration, fill out our Tetra Lab Monitoring Contact Us form to learn how easy direct integrations can be.

Lab Monitoring

Direct Data Acquisition and Sensor Integration for ThermoFisher Heracell™ VIOS CO₂ Incubators

Direct data acquisition and integration for monitoring CO2 incubators results in shorter installation times and lower upfront costs in the lab, and eliminates the need to remove independent probes for sterilization.

Erika Tsutsumi

Integration between AGU's Sm@rtLine Data Cockpit and TetraScience unifies life sciences R&D data in the cloud


Automating lab workflows to remove manual data management will spearhead a new era of therapeutic development
Bioprocessing is important for discovering effective biologics that can be leveraged in the development of treatments and therapeutics. In the first half of 2019, 36% of the new molecular entities (NMEs) approved by the FDA were biologics, up from 29% in 2018 and 26% in 2017. This is a trend that is expected to continue.[1]

However, developing biologics is a challenging and expensive problem. Since 2014, biologic drugs account for nearly all of the growth in net drug spending: 93 percent of it, in fact.[2] The methods, workflows, and APIs (Active Pharmaceutical Ingredients) used by biopharma in the discovery of biologics, and the subsequent development and implementation of therapeutics, need to be streamlined for maximum efficiency. This means bioprocessing workflows and data need to be fully automated, from data collection to analysis. By automating and connecting equipment and systems, manual tasks and the associated human error are eliminated, which keeps the total cost of ownership (TCO) low.

Anywhere between 20% - 30% of time spent by researchers is wasted on manual data transcription and tedious manual data integration.

In addition to manual data management, bioprocessing data is typically in heterogeneous formats. The data is produced by different lab equipment and systems throughout multiple stages of the workflow. This leads to data silos that are difficult to integrate. These challenges - manual tasks, human error, heterogeneous formats - slow down bioprocess development significantly.

The new era of bioprocessing, known as Bioprocessing 4.0, focuses on the integration and connectivity of lab equipment and systems. This will give scientists access to higher quality data and allow for deeper insights.

Connecting bioprocessing data with disparate life sciences R&D data for further analysis
Unifying BIO-API (or bioprocessing) data with other R&D data produced in the development process will provide scientists with deeper insights that have the potential to accelerate the drug discovery process. AGU’s Sm@rtLine Data Cockpit (SDC) enables the use of sensors and analyzers for the collection, review, and approval of trial results in laboratories. AGU and SDC have over ten years of experience serving top global pharmas in bioprocessing (BIO-API). In that time, they have improved quality, reduced costs significantly, and shortened discovery times for both biopharmaceutical companies and the BIO-API industry. The cloud-native, enterprise-scale Tetra Data Platform powers the centralization and harmonization of all R&D data, preparing it for advanced analytics and data science. The partnership not only lowers the total cost of ownership (TCO) and is significantly quicker to deploy than building internally, but will also speed up the discovery and development of therapeutics and treatments.

IMAGE: SDC + TetraScience example architecture diagram

NEW-SDC

Bioprocess data collected through the SDC-TetraScience integration is immediately ready for data science.

Higher data integrity and deeper insights power development
Life sciences, biotech, and biopharmaceutical organizations will reap massive benefits through adopting the concept of Bioprocessing 4.0 in their workflows. The foundation of Bioprocessing 4.0 - the automatic collection, centralization, harmonization, and unification of bioprocessing, BIO-API, and R&D data in a central cloud-native platform - is not easy to achieve. Once fully adopted, Bioprocessing 4.0 will allow scientists to stop wasting time with tedious manual data management. It will reduce human error, keeping data integrity high. Additionally, data silos will be eliminated, allowing scientists to query large volumes of bioprocessing data and providing the ability to reach actionable insights efficiently.

Summary
The partnership between SDC and TetraScience expands the network of connectors and integrations that automate data collection - the first step towards Bioprocessing 4.0. Scientists can leverage bioprocessing data collected by SDC. The data is uploaded into the cloud, where it is centralized, harmonized, and prepared for advanced analytics and data science, making R&D data in the cloud truly accessible and actionable.

Learn more about the TetraScience and AGU partnership and how bioprocessing data is automatically harmonized and centralized, connecting disparate data silos to activate the flow of data across the R&D and Bioprocessing ecosystem.

Our recent blog post - Data Science Use Cases for the Digital Lab - highlights how scientists are leveraging our network of connectors and integrations to truly harness the power of life sciences R&D data. The blog post features seven data science use cases crowdsourced from our partners and customers.

  1. https://dcatvci.org/6057-what-s-new-in-biologics
  2. https://www.forbes.com/sites/theapothecary/2019/03/08/biologic-medicines-the-biggest-driver-of-rising-drug-prices/#1adc95c018b0
R&D Data Cloud

Powering Bioprocessing 4.0 for Therapeutic Development

AGU's Sm@rtLine Data Cockpit and TetraScience partner to unify life sciences R&D data in the cloud and automate lab workflows to power Bioprocessing 4.0.

Spin Wang

TetraScience Lab Monitoring is now 21 CFR Part 11 ready

Author:
Salvatore Savo, Co-Founder, TetraScience

Since its inception, the Tetra Lab Monitoring application has been assisting the life sciences industry with the monitoring of critical equipment. Our best-in-class Lab Monitoring application is used by top global pharmaceutical and biotech organizations to monitor freezers and CO2 incubators. Over the years, we have saved our customers millions of dollars, countless pieces of equipment, and untold potential losses.

What's new

Over the last several months, our team has been working tirelessly to take the product to the next level so that we could also add value to GxP labs. We are excited to announce that our Lab Monitoring Application is now 21 CFR part 11 ready and can enable your lab to be GxP compliant.

For decades, life science organizations relied on paper-based records as the only source of truth. With the advent of electronic records, the life sciences industry had to adopt guidelines to ensure that digital data could be trusted. The 21 CFR Part 11 regulations define the criteria under which electronic records and electronic signatures are considered trustworthy, reliable, and equivalent to paper records by the Food and Drug Administration (FDA).

These are some of the most relevant new product features and services that we have introduced:

  • Audit trail
  • Electronic signature
  • Session inactivity timeout
  • Multi-factor authentication
  • NIST traceable sensors
  • Annual sensors certification and calibration

Record Keeping and Auditability
The TetraScience Audit Trail chronologically catalogs events to provide support documentation and history used to authenticate operational actions. These records provide proof of compliance and operational integrity. These are some of the critical elements contained in the audit records for each event:

  • Event description
  • User
  • Date & time of event

IMAGE: Example of an Audit Trail report in the TetraScience Lab Monitoring application

audittrail
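
For illustration only, an audit-trail record containing those elements could be modeled like the minimal sketch below; this is a generic example, not the schema used by the TetraScience application:

    # Generic illustration of an immutable audit-trail record.
    from dataclasses import dataclass, field
    from datetime import datetime, timezone

    @dataclass(frozen=True)   # frozen: a record is never edited after creation
    class AuditRecord:
        event_description: str
        user: str
        event_time: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    record = AuditRecord("Alert threshold changed from -70 C to -72 C", "s.savo")
    print(record)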

Security
To enable the highest level of security and to prevent fraud, the TetraScience Lab Monitoring application supports multi-factor authentication. This adds a second layer of security to protect user accounts and personal information. Additionally, users are asked to prove their identity by entering their username and password when performing critical actions. Lastly, administrators can require users to re-enter their credentials after prolonged inactivity in the dashboard.

Why this is important

Whether you are planning to build a new manufacturing facility or working in an existing one, you need to use the most reliable lab monitoring system on the market to keep your samples safe while staying compliant. TetraScience Lab Monitoring combines a best-in-class, cloud-based software user experience with reliable hardware that supports 5GHz Wi-Fi, cellular, and hardwired connections, all built on an infrastructure that ensures your data is always accessible and never compromised, thus meeting your data integrity requirements.

What innovations we are bringing to the GxP world

Superior User Experience
The Lab Monitoring application is designed to be intuitive and easy to use. Whether you are monitoring one piece of equipment or hundreds, you'll be able to seamlessly integrate our monitoring system with your internal workflows.

Customer-centric
We work with you and your QA team through every step of the process, and are always available to assist you. Our modern dashboard includes a live chat feature for communicating with our customer support team in real time when you have questions or emergencies.

Low Onboarding and Maintenance Overhead
Traditional monitoring systems require you to manage and configure the alerting system. We do everything for you, from equipment data entry and instrument naming to alert configuration.

Lab Monitoring

Enabling Compliance in GxP Labs

TetraScience's Lab Monitoring application is now 21 CFR part 11 ready. This means we can enable your lab to be GxP compliant with features like audit trail, electronic signature, multi-factor authentication, and more.

Salvatore Savo

TetraScience presented at the AWS Healthcare & Life Sciences event.


About AWS and the event:
AWS Healthcare and Life Sciences focuses on collaboration with healthcare providers, public health organizations, government agencies, and life sciences businesses around the globe to increase the pace of innovation, accelerate development timelines, engage with patients, and ultimately improve outcomes. The free virtual event focused on how healthcare and life science organizations are accelerating research, rethinking patient care, and maintaining clinical and operational continuity during this unprecedented time for the global healthcare system.

Watch our presentation on-demand: "Unifying R&D Data in the Cloud: Making biopharma R&D data truly accessible and actionable."

We discussed the Digital Lab landscape and how fragmented data and silos hinder organizations in life sciences from gaining actionable insights quickly and efficiently. TetraScience CTO Punya Biswal and Vice President of Marketing Rachel Daricek also spoke about how we partner with AWS to leverage its best-in-class technologies, and featured our case study with Biogen, which highlighted the platform in action.

Check out the full event on-demand, including keynote and break-out sessions.

Events

AWS Healthcare & Life Sciences Web Day | Virtual Industry Event

Watch our presentation on-demand from the AWS Healthcare & Life Sciences Web Day Virtual Industry Event.

TetraScience

The current state of lab scheduling during COVID-19

COVID-19 is shining a light on many laboratory operations processes, making organizations prioritize the implementation of new safety measures to ensure the health and safety of their employees. Pre-COVID-19, users could safely share the same areas, and laboratory scheduling did - and still does - rely on manual methods such as calendar printouts, paper sign-up sheets, and whiteboards. While these methods are quick and inexpensive ways to determine available instrumentation and schedule equipment, they are inflexible, tedious, and inefficient. Oftentimes employees stop using them, and they cause a lot of wasted time and resources. Post-COVID-19, lab managers and scientists are facing a new and significant type of problem - working with and utilizing instruments that are located in close proximity to each other or in the same room.

With organizations enacting social distancing measures in their laboratories, lab managers and scientists need to be able to know, in real-time, if and when their colleagues are using a particular instrument in the same room they also need to access. Additionally, allowing teams the ability to schedule their workdays around instrument reservations while still adhering to social distancing guidelines will help maximize operational efficiencies while limiting potential health risks.

Organizations and employees need processes that are intuitive, easy to use, and reflect real-time data; they cannot run the risk of miscommunication or failure when it comes to employees scheduling instruments.

TetraScience Scheduling is a purpose-built solution that allows laboratories not only to schedule instrumentation and equipment, but also to assign specific devices to users, along with the tasks that need to be run on them. TetraScience Scheduling enables full visibility into resource availability, reservations, and maintenance, ensuring the safety of your employees while maximizing output.

Product features

We’ve designed TetraScience Scheduling with the idea that planning your experiments and lab resources should be the easy part. Here are some of the top features that we hope will both improve your daily workflow and provide key insights into your resources, especially now when it matters most.

Dynamic calendar

See what’s happening in your lab at a glance with our dynamic calendar view. TetraScience Scheduling displays real-time availability of your equipment as well as the metadata associated with reservations and instruments.

frank-sally-v2

Instrument metadata

As with the Tetra Lab Monitoring application, all instrument metadata is easily searchable. You can quickly find your preferred instruments by name, instrument type, location, and more.

scheduling2-web

One-click reservations

Create reservations or block out time for repairs and preventative maintenance with a single click. Drag your cursor for your desired time window, or simply type it in for fine-tuning.

One-click reservations | TetraScience Scheduling

You can also mark a broken instrument as out of service. This immediately changes the status of the instrument for everyone in your organization to see, preventing additional bottlenecks for your fellow lab mates.

Out of service instruments | TetraScience Scheduling

Operational reporting

Dive into your scheduling data to surface the operational insights most important to you. Track project progress and understand which instruments are out-of-service the most. Using the Logbook, you can export CSVs containing reservation metadata to create flexible reports using Excel or the reporting tool of your choice.
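
As a hedged example of what that reporting could look like, the pandas sketch below counts out-of-service events per instrument from an exported CSV; the file name and column names are assumptions about the export, not a documented schema:

    # Summarize an exported reservation CSV (file and column names are assumed).
    import pandas as pd

    df = pd.read_csv("reservations_export.csv")   # e.g. a Logbook export
    out_of_service = df[df["reservation_type"] == "out_of_service"]
    counts = out_of_service.groupby("instrument_name").size()
    print(counts.sort_values(ascending=False).head(10))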

Lab Monitoring

Remote Lab Scheduling is No Longer Optional, it is a Requirement

TetraScience Scheduling is a purpose-built solution that allows laboratories to schedule not only instrumentation and equipment, but also the people who use those devices, and the tasks that need to be run on them.

Salvatore Savo

Automation is more than robots.


Authors:
Yi Jin - Solution Architect
Spin Wang - CEO and Co-Founder

High-throughput screening (HTS) methods are used extensively in the pharmaceutical industry, leveraging robotics and automation to quickly test the biological or biochemical activity of a large number of chemical and/or biological compounds. The primary goal of HTS is to identify high-quality 'hits' that are active at a fairly low concentration and that have a novel structure. Hits generated during HTS can then be used as the starting point for the subsequent 'hit to lead' drug discovery effort.

Automation has played a huge role in the development of HTS to date. Tools like automated liquid handlers and robotic, high-throughput plate readers significantly improve compound screening efficiency and consistency. Robotic automation has transformed the physical aspect of the process. However, experiment data sets remain isolated from one another, requiring manual data acquisition and handling for storage and analysis. Experiments are automated, but data flow is not.

In order to truly reap the benefits of HTS, biopharma companies need to automate the accompanying data flow. They also need to connect the data with the rest of the Digital Lab to make it accessible and actionable. Otherwise, the vast volumes of generated data will be stuck in yet another silo.

This blog post identifies opportunities for improvement in an example HTS data workflow, based on our experience with biopharma customers, and offers our approach to evolving the HTS data flow to be as efficient and consistent as the physical process.

Analyzing Today's Manual Data Flow

The following diagram illustrates an example biopharma customer's HTS assay workflow for small molecule compound libraries (edited for confidentiality), leveraging best-in-class instruments and tools in the market.

Diagram 1: Example High-Throughput Screening Workflow

diagram-1

Let's break down this example HTS data flow in R&D labs today, based on our experience working with top pharma and biotech companies:

Step 1: Scientists register the compounds in a compound registry, like Dotmatics Register, generating the experiment ID and compound ID. Scientists also enter compound information, such as molecular weight and initial amount, into an ELN like Dotmatics Studies.

Step 2: Scientists create child samples by sample dissolution or aliquoting. Each child sample is identified by a unique barcode in the sample inventory software like Titian Mosaic. The sample inventory system contains extensive information such as parent sample ID, batch ID, amount, solvent type, sample volume, location in the freezer, etc.

Step 3A: After compound registration and child sample preparation, scientists enter the compounds’ information into the liquid handler software to set up the assay plate in the liquid handler workstation, like Tecan. The sample plate output file information is saved locally to the lab Windows computer.

Step 3B: The HTS assay is incubated based on assay design protocol and detected on a plate reader, like Perkin Elmer Envision. The assay result is also saved locally to the lab Windows computer.

Step 4: After the experiment finishes, the scientist manually moves the files back to the office to analyze the assay data with GraphPad to give either IC50 or target binding information.

Step 5: The analysis result is manually updated in the ELN, like Dotmatics Studies, to complete the experiment.

The key takeaway here is that while robots are automating the physical component of the experiment, there are multiple instruments and software systems involved in the process, and data needs to flow seamlessly into and out of all of them. Today, this is not the case. The Dotmatics components may integrate with one another since they are part of the same product portfolio. The others may have some point-to-point integrations that you can set up, or APIs that you can use to write your own integrations, but building these is real work, the integrations must be maintained, and ultimately the approach does not scale. We need an easy way to connect all of these instruments and software systems to a common network to knock down the data silos and get the data flowing, both within this work process and across the broader R&D data ecosystem.

Optimizing HTS Data Flow

The TetraScience Platform is that common network. Connecting the instruments and software systems needed to conduct HTS transforms the manual steps of data entry, processing, and transfer into an automated solution, saving time, reducing errors, and increasing throughput. It also harmonizes and transforms the data, preparing it for data science, AI, and other advanced analytics - we'll get to this at the end.

Let's take a look at how it works.

Diagram 2: Automating the High-Throughput Screening Data Workflow


Steps 1 and 2: Scientists register the compounds in the compound registry, as before. Except now, the TetraScience Dotmatics connector automatically detects new or modified compounds in the Dotmatics Register and triggers a pipeline to automatically push the information to the inventory management software. As part of this process, the data is also now available to query via RESTful API in the TetraScience Data Lake.

Steps 3A, 3B, 4, and 5: After setting up the assay plates with the liquid handler, scientists run the assay and read the assay readout with various types of plate readers. The liquid handler output file contains sample plate information, including the sample concentration of each well. The plate reader file contains the assay readout. The TetraScience File connector automatically detects the files produced by Tecan and the Envision plate reader, moves the raw instrument files into the Data Lake and then triggers pipelines to parse, merge, and push to Dotmatics Studies. IC50 is then automatically calculated.
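
As a sketch of what "available to query via RESTful API" can look like in practice, the snippet below uses Python's requests library against a placeholder endpoint; the base URL, path, query fields, and authentication shown are illustrative assumptions, not the documented TetraScience API.

import requests

BASE_URL = "https://platform.example.com"   # placeholder; not a real TetraScience URL
TOKEN = "YOUR_API_TOKEN"                    # placeholder credential

# Query harmonized HTS results from the Data Lake (endpoint and fields are illustrative)
response = requests.get(
    f"{BASE_URL}/v1/datalake/search",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"query": "assay:IC50 AND experiment_id:EXP-001"},
)
response.raise_for_status()

for record in response.json().get("hits", []):
    print(record.get("compound_id"), record.get("ic50"))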

Image: Automatically generated IC50 calculation results


In this optimized workflow, only one manual data workflow step remains - initiating the experiment by registering the compound. Scientists also have to physically set up the experiment, but once the experiment is complete, all the data automatically appears in the ELN, with the calculation results shown above completed, as well as in the Data Lake, ready for further querying and analysis.

The TetraScience Data Integration Platform automates the HTS data workflow, providing greater efficiency by removing painful manual data handling and processing from scientists' daily work and by improving data integrity.

Let's compare the two processes side-by-side:

 
Step | Before | After
Step 1 | Manually register compound in Dotmatics Register | No change
Step 2 | Manually synchronize compound registry with child samples in inventory management software | Automated
Step 3A/B | Manually transfer instrument output files from local computer to shared drive | Automated
Step 4 | Perform calculation in GraphPad | Automated
Step 5 | Manually upload results to Dotmatics ELN | Automated

Beyond Data Automation: Data Science

Now that the data workflow accompanying the high-throughput screening process is automated, what's next? Compound information, sample information, type of assays performed, and screening results are now centralized and harmonized in the TetraScience Platform. This seems like a prime opportunity to apply some data science! Check out a related blog post about our Intermediate Data Schema (IDS) to learn more about how we harmonize disparate data, knocking down the data silos. IDS is the open standards method we use to seamlessly move data between and across all the different HTS instruments and software systems, unifying the unique data structure and format from each.

A benefit of the centralized, harmonized data is that it is also prepared for use with various data science and data analytics tools such as Spotfire, Tableau, or Dotmatics Vortex. Our open standards approach means that scientists and data scientists can use the software, platforms, and languages they already know and use - no need to install or learn something new.

Diagram 3: Applying Data Science to High-Throughput Screening Data


You can now fully utilize your HTS data, including querying and visualization of data sets, using your existing data science and analytics tools. For example, scientists can easily query and visualize all active compounds at a certain threshold level in a particular screen, or the behavior of all compounds of similar structure across different screens. Scientists can derive insights that develop more efficient HTS assays, design more active compound libraries, and significantly speed up the drug discovery process.

Watch this video to see the optimized data flow in action, enabled by the TetraScience Platform.

R&D Data Cloud

Data Automation for High-Throughput Screening with Dotmatics, Tecan, and PerkinElmer Envision

Despite robotic automation, experiment data sets are still isolated from one another, requiring manual data acquisition and handling. The Tetra Data Platform brings automation to the data.

Spin Wang
  • Biopharma R&D has not realized the full power of the data it possesses and generates every day. There are advanced, easy-to-use data science tools, and there is a vast amount of data. But the two are disconnected; the data is locked in on-premise silos and heterogeneous formats, and often lacks the interface needed to derive insights at scale.
  • This “Data Science Use Cases for the Digital Lab” series of blog posts shares some obvious and non-obvious ways top pharmaceutical organizations are applying data science to extract novel insights from their R&D data.
  • We crowdsource use cases through our partnerships with top pharmaceutical organizations, enabled by our ever-growing network of connections to common data sources and data science tools, and by our cloud-native platform, which automatically collects, centralizes, harmonizes, and prepares R&D data for analysis.
  • Use cases in this blog post include: analytical method performance | column degradation | instrument usage and operational insights | stability studies trending | chromatogram overlay | quality procedure assessment | metadata quality reporting.

Overview

The use of Data Science across all industries has been rising in recent years. At the same time, modern laboratories have experienced massive growth in data volumes. In a single pharmaceutical company, there can be hundreds of High Performance Liquid Chromatography (HPLC) instruments used by process development, quality control, downstream bioprocess, manufacturing, and many other stages of drug R&D. It is common to have millions or tens of millions of injections that need to be available for analysis, and thousands of new injections produced daily.

Waters Empower is one of the most widely used chromatography data systems (CDS). A CDS controls chromatographic instruments like HPLCs, runs the injections, collects the raw data, and performs analysis on the chromatograms to detect peaks.

As with most R&D instruments and their control software, accessing the data contained within the Waters Empower CDS has traditionally been challenging. Our Data Science Link for Waters Empower makes the CDS data instantly accessible and actionable. Data scientists can now perform analyses and identify insights using their preferred data science tools, without hours of data wrangling.

We have included more information about how the Data Science Link for Waters Empower works at the bottom of this post.

Waters Empower CDS Data: Data Science Use Cases

Once your Empower data is in the cloud, harmonized, structured, and connected to your favorite data science tools – a non-trivial effort – what meaningful analyses can your data scientists and analysts perform? After partnering with many of the world’s leading pharmaceutical companies, we have collected several obvious and non-obvious data science use cases that will help you get started. The use cases are logically organized and have overlaps. We will continue to add to this list over time. If you have a use case to add, we want to hear it! Contact information is listed at the bottom of this post.

1 - Analytical method performance

Data science tools can easily automate trending and cluster analysis of method performance characteristics (for example, peak tailing factor, resolution, and relative retention time), ensuring continued method performance and providing R&D organizations with a fast and flexible feedback cycle.

Such analysis can be used in experiment design to drive continuous improvements and optimization of the methods.
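
As a rough illustration of this kind of trending, the sketch below assumes the injection results have already been exported to a CSV with illustrative column names (acquired_at, method_name, peak_tailing_factor); it is a sketch, not a prescribed workflow.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Illustrative export of harmonized injection results
df = pd.read_csv("empower_injections.csv", parse_dates=["acquired_at"])

# Trend a method performance characteristic (peak tailing factor) per method over time
ax = sns.lineplot(data=df, x="acquired_at", y="peak_tailing_factor", hue="method_name")
ax.set_title("Peak tailing factor by analytical method")
plt.show()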

2 - Column degradation

This is similar to the previous use case but focused primarily on column performance. Plotting the key performance or suitability parameters, such as peak tailing factor, resolution, or symmetry, for one set of sample runs at different times will show a very clear column degradation trend.

Such information is crucial when teams try to generate control charts, predict the lifetime of a column, or transfer a method to different teams or to CROs/CDMOs.

Use factors such as peak area, retention time, and tailing factor from the System Suitability Test (SST) runs to "predict" failures before they occur. Use this information to define safe operating limits, automate reporting and alerts when instruments/columns are approaching these limits, and build more intelligent control charts.
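
One way to turn SST runs into safe operating limits is a simple control-limit calculation like the sketch below; the CSV layout, column names, and the three-sigma rule are illustrative assumptions.

import pandas as pd

# Illustrative SST results: one row per run, with the column used and key parameters
sst = pd.read_csv("sst_runs.csv")  # columns: column_id, run_date, peak_area, tailing_factor

# Per-column mean and standard deviation define simple control limits
limits = (
    sst.groupby("column_id")["tailing_factor"]
    .agg(mean="mean", std="std")
    .assign(upper=lambda d: d["mean"] + 3 * d["std"])
    .reset_index()
)

# Flag runs approaching or exceeding the upper limit
flagged = sst.merge(limits, on="column_id")
flagged = flagged[flagged["tailing_factor"] > flagged["upper"]]
print(flagged[["column_id", "run_date", "tailing_factor", "upper"]])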

Image: Peak area mean value vs. time for each column


Image: Peak tailing factor vs. time for each column


3 - Instrument usage and operational insights

HPLC is a major workhorse in R&D. Therefore, it is crucial to optimize the usage of these instruments and understand the operational efficiency of R&D teams via HPLC data. Data science can help you gain operational insights for the following aspects of HPLC instruments:

  • Instrument utilization. Easily analyze the distribution of your instrument usage. Identify the instruments that are not used. Better understand why and optimize your capital spend.
  • Instrument uptime for better preventative maintenance. For example, sum up the injection run time from all the injections performed on each HPLC system and identify which instrument has the longest cumulative run time. This can serve as a great indicator for preventative maintenance (see the sketch after this list).
  • Understand the distribution of your sample and method parameters. For example, if the team runs a large percentage of injections using a particular method with a long run time, optimizing that method may give you the biggest return on investment.
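
A minimal sketch of the cumulative run time calculation described above, assuming injection records with illustrative column names (instrument_name, run_time_minutes):

import pandas as pd

injections = pd.read_csv("empower_injections.csv")  # illustrative export

# Cumulative run time per HPLC system, as a rough preventative-maintenance indicator
usage = (
    injections.groupby("instrument_name")["run_time_minutes"]
    .sum()
    .sort_values(ascending=False)
)

print(usage.head())  # heaviest-used systems: maintenance candidates
print(usage.tail())  # lightly used systems: candidates for reallocation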

Image: Instrument usage analysis


4 - Stability trending

A stability study is a common type of analysis. It usually takes a lot of time and manual effort to organize the data, create the right report or analysis, and predict shelf life. However, once your data is accessible and prepared, data science can automate stability analysis and enhance the insights, as sketched after the list below.

  • Plot the key metrics of an Active Pharmaceutical Ingredient (API) and other impurities at different time points
  • Use visualizations and data science tools to perform statistical analysis on the key stability indicator trend
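
The sketch below illustrates the idea under simple assumptions: stability results in a CSV with illustrative columns (time_point_months, api_purity_percent), a linear trend, and an illustrative specification limit.

import numpy as np
import pandas as pd

stability = pd.read_csv("stability_results.csv")  # illustrative export

# Fit a straight line to the stability-indicating metric over time
slope, intercept = np.polyfit(
    stability["time_point_months"], stability["api_purity_percent"], deg=1
)

# Project when the trend would cross an illustrative specification limit
spec_limit = 95.0
if slope < 0:
    projected_months = (spec_limit - intercept) / slope
    print(f"Projected time to reach {spec_limit}% purity: {projected_months:.1f} months")
else:
    print("No downward trend detected; shelf life not limited by this metric")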

5 - Chromatogram overlay

Getting a set of tables and charts is useful; however, scientists often need to visually inspect a group of chromatograms to understand anomalies and subtle changes in the results. Such a visual representation helps the eye detect insights easily and often triggers more in-depth, quantitative analysis and comparison.

Image: Chromatogram overlay


6 - Quality procedure assessment

It is crucial to follow the appropriate quality procedures while working in an R&D environment. Data science can help quality teams assess adherence to established procedures.

For example:

  • Flag when manual integration is used instead of the CDS processing method
  • Flag injections that are processed multiple times
  • Flag injections that are not signed off properly or that do not have enough sign-offs

Quality teams can closely monitor any deviation or anomaly from the organization's established process and immediately take action to remediate it and prevent errors from propagating downstream. For example, if the ratio of results without sign-off suddenly increases, that may indicate a change in process, which may be intentional or a mistake.
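
A minimal sketch of such checks, assuming a results table with illustrative columns (injection_id, manual_integration, processing_count, signoff_count) and an illustrative two-sign-off policy:

import pandas as pd

results = pd.read_csv("empower_results.csv")  # illustrative export

flags = pd.DataFrame(
    {
        "manual_integration": results["manual_integration"].astype(bool),
        "reprocessed": results["processing_count"] > 1,
        "missing_signoff": results["signoff_count"] < 2,  # illustrative policy
    },
    index=results["injection_id"],
)

# Injections that need quality review
print(flags[flags.any(axis=1)])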

Constantly assessing the quality procedure will rapidly reduce mistakes, increasing trust and synergy across departments.

7 - Metadata quality reporting

Metadata integrity is often difficult to assess. Missing and mislabeled information has a significant impact on the reporting and usage of R&D data. Data science tools can assess the completeness and consistency of key business-related metadata entered in the Waters Empower CDS. Armed with such an assessment, teams can take action to improve metadata quality. After all, in the end, you get what you measure and track.

So, what does "metadata quality" mean? Here are some examples:

  • Each injection may have a custom field defined in Empower called ELN_Experiment_ID. Entering this accurately is crucial to aggregate or "join" the Empower data with experiment set-up information in the ELN, or to further automate data transfer into the ELN or LIMS.
  • For HPLCs that do not use a Waters column, Empower is not able to automatically gather the column information because the eCord will be missing. If you want to run a column degradation analysis like the one described above, then it is crucial to understand which data sets have the Waters eCord populated. For those using non-Waters columns, it is crucial to know if ColumnId is entered properly and consistently as a custom field.
  • Scientists likely use several slightly different naming conventions for their sample sets and methods that mean the same thing. It is helpful to look at the variation and uncover obvious or hidden conventions.

With access to the metadata, in a consumable format, data science tools can automatically flag experiments without the proper sample naming convention, method naming convention, or instances missing specific custom fields (for example, ColumnId or ELN_Experiment_ID).

Identifying these metadata quality issues and providing feedback in real time will vastly improve metadata integrity, making your R&D data much more actionable.

To improve metadata quality, here are three steps you can follow.

Note: this approach applies to any other type of instrument or workflow.

Surface metadata quality as a metric

Using tools like Spotfire or Power BI, you can flag the injections with incorrect naming. For example, a dashboard can be updated automatically on a daily basis, providing a real-time view of the quality of crucial metadata.
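
A minimal sketch of computing such a metric, assuming a metadata export with illustrative columns and an illustrative naming convention expressed as a regular expression:

import pandas as pd

meta = pd.read_csv("empower_metadata.csv", parse_dates=["acquired_date"])
# illustrative columns: acquired_date, sample_name, ELN_Experiment_ID, ColumnId

naming_convention = r"^[A-Z]{3}-\d{4}-\d{2}$"  # illustrative pattern

meta["bad_name"] = ~meta["sample_name"].str.match(naming_convention, na=False)
meta["missing_fields"] = meta[["ELN_Experiment_ID", "ColumnId"]].isna().any(axis=1)

# Daily fraction of injections failing each check, suitable for a dashboard
daily_quality = (
    meta.set_index("acquired_date")[["bad_name", "missing_fields"]].resample("D").mean()
)
print(daily_quality.tail())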

Reporting

For the injections run last month, what is the percentage of incorrect or missing metadata? Is this percentage decreasing or increasing over time?

Notification and action

If you have well-defined business rules on the structure, TetraScience can automatically trigger data pipelines that notify scientists by sending an email to a distribution list. The content of the email can be something like the following:

Tuesday 2020 Feb 20, 10:10:10AM CET
Pipeline “Waters Empower metadata verification”
Pipeline failed due to

  • Error: Sample name “xxxxxxx” does not match the pre-defined naming convention
  • Error: Missing critical Custom field “ColumnID” and "LIMS_Request_ID"
  • Warning: “ExperimentID 112313 is not available in the ELN, recommend...”

Summary

The R&D data locked inside the Waters Empower CDS silo has tremendous potential, if only you could access it via the cloud and use your data science tools to identify insights and take action! The use cases listed above will drive value and improve efficiency for your teams.

Read on for details about how our Data Science Link for Waters Empower enables these analyses. We can also collect and aggregate important information from other instruments, ELN, or LIMS, together with your Empower CDS data, using the TetraScience Platform.

And let us know if you have a use case to add to the list!

TetraScience’s Data Science Link

The Data Science Link is an end-to-end application of the TetraScience Platform that automates data acquisition from the most complex (and frequently used) R&D lab instruments, harmonizes it, and moves it to popular data science tools where it can be analyzed.

Image: Data Science Link for Waters Empower schematic

TetraScience's Empower Data Agent automatically collects CDS data

Waters Empower CDS is one of the most complex and most frequently used systems in biopharma R&D. TetraScience’s Empower Data Agent is the world's fastest and most sophisticated product for data extraction from the Waters Empower CDS. It supports advanced features such as:

  • Configuring data extraction based on projects of interest and sign-off status
  • Detecting changes or re-processed injections

Deployment

The Empower Data Agent installs in a matter of minutes; point it to the TetraScience Platform, and Empower data will immediately flow to the cloud, where it is harmonized and made available to your data science tools within a few minutes.

Data Science and Analytics Tools

Once the Data Science Link has extracted, harmonized, and prepared your Empower data, you can easily connect data science and analytics tools, such as Spotfire, Tableau, or Dotmatics Vortex, to access the data.

These are the tools used most frequently by our current customers; we regularly add new connections. We also support customers in developing and configuring their own applications, written in any programming language they choose.

Share another data science use case or inquire about the Data Science Link application by contacting us at solution@tetrascience.com or any of the channels below.

R&D Data Cloud

Data Science Use Cases for the Digital Lab: Novel Analyses with Waters Empower CDS Data

Once your Empower data is in the cloud - harmonized, structured, and connected to your data science tools - what meaningful analyses can your data scientists perform? We've collected several obvious and non-obvious data science use cases from our network.

Spin Wang

Advanced Data Engineering Improves Data Integrity Using an Allotrope-Compatible, Data Science Ready File Format




Authors:
Evan Anderson - Delivery Engineer, TetraScience
George Van Den Driessche – Scientist I, Biogen
Spin Wang - CEO & Co-Founder, TetraScience

The pharmaceutical industry generates experimental data every day that is stored on local PCs or on instrument vendor servers, with vendor-dependent data schemas. These practices create data silos across pharma that do not adhere to FAIR (Findable, Accessible, Interoperable, and Reusable) data principles and prevent companies from utilizing all the information found in their raw data sets.

“The lab of the future is built on data. Right now, our cell counter data is largely inaccessible, and the heterogeneous nature of the information makes it difficult to analyze without significant manual manipulation. My team needs to make this data accessible and actionable for our scientists and data scientists,” says Len Blackwell, Associate Director of Strategic Analytics at Biogen.

This blog post highlights how our recent collaboration with Biogen knocks down the data silos associated with their Beckman Coulter Vi-CELL cell counter results, making the data readily available for further analysis using standard data science tools.

Cell Counters generate valuable, but heterogeneous, data

Cell counters enable process development scientists to differentiate the number of viable versus non-viable cells using the trypan blue exclusion cell counting method. These values are then used to measure overall cell density and cell viability percentage. Cell density measurements help monitor cell culture feed requirements and the cell viability percentage assesses the overall health of a cell culture.

“Cell counting is a critical step in the biomanufacturing process that provides information about the density and viability of the mammalian cells that produce our protein products. In the process development laboratories, daily cell counts inform process development engineers of the results of experimental conditions with the goal of optimizing cell health and productivity,” says Brandon Moore, Cell Culture Engineer at Biogen.

A cell culture study typically lasts 14 days, with one sample analyzed from each condition per day. Biogen’s cell counter sample analysis produces 50 images that are analyzed with the trypan blue exclusion cell counting method. The resulting measurements are stored in a single text file as aggregated values and raw data arrays. The image files, used for the measurements, are exported and stored separately from the numerical data.

This presents a few challenges:

  1. Multiple file types (image and text files)
  2. 51 files generated for each sample, multiplied by the number of days in a study
  3. Major data integrity risk due to the separate file storage

Image: The Beckman Coulter Vi-CELL exports images and a .txt report for each experiment.


Automating cell counter data movement and conversion

TetraScience addresses these data integrity challenges with the Tetra Data Platform. The platform provides automated cell counter file movement from the instrument PC to an AWS data lake. Once the files are moved to the data lake, they undergo two conversion pipelines. First, the instrument data schema is mapped into an Intermediate Data Schema JSON file (IDS-JSON) and then the IDS-JSON is mapped into a pharmaceutical standard Allotrope Data Format (ADF) file. The ADF file provides a single output that captures both image files and the numerical data reports generated per cell counter sample.
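
To make the first conversion step concrete, here is a minimal sketch of mapping values parsed from a Vi-CELL text report into an IDS-style JSON document; the parsed values and field names are illustrative, not the actual TetraScience schema.

import json

# Values a parser might extract from the .txt report (illustrative)
parsed_report = {
    "sample_id": "CELL-DAY-02",
    "viable_cells_per_ml": 2.1e6,
    "viability_percent": 96.4,
}

# Map into an IDS-style document (field names are illustrative)
ids_document = {
    "sample": {"id": parsed_report["sample_id"]},
    "result": {
        "cell": {
            "density": {"value": parsed_report["viable_cells_per_ml"], "unit": "Cells/mL"},
            "viability": {"value": parsed_report["viability_percent"], "unit": "Percent"},
        }
    },
}

print(json.dumps(ids_document, indent=2))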

Image: an excerpt of the cell counter report as a JSON in the IDS format.


Harnessing Cell Counter Data

Once the cell counter data is moved to an AWS data lake and packaged as either IDS-JSON or ADF, data scientists can begin analyzing these files with Python notebooks like Google Colab, or with other common data science tools. Scientists can find the files related to a cell growth study by querying the IDS-JSON files with Elasticsearch. Next, the cell density data is plotted versus date using the Pandas and Seaborn Python libraries. ADF files enhance cell counter data integrity by combining numerical and image data into one file output; scientists can access this information with the h5py Python library. The combined data outputs enable new types of data analysis, such as cell contamination monitoring with cell image analysis. For more information on the ADF file format, check out our blog post about the ADF graph model and leaf node model.
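
A minimal sketch of the plotting step, assuming the relevant IDS-JSON documents have been downloaded locally and using illustrative field paths:

import glob
import json
import pandas as pd
import seaborn as sns

records = []
for path in glob.glob("ids_json/*.json"):  # illustrative local copies of IDS-JSON files
    with open(path) as f:
        doc = json.load(f)
    records.append(
        {
            "day": doc["run"]["day"],                                    # illustrative field path
            "cell_density": doc["result"]["cell"]["density"]["value"],   # illustrative field path
        }
    )

growth = pd.DataFrame(records).sort_values("day")
sns.lineplot(data=growth, x="day", y="cell_density")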

Image: Converting cell counter data into an ADF wraps up the standardized data along with associated images and ontology specified by the Allotrope Foundation


We can also easily use the TetraScience REST API to import data into interactive Python notebooks for more detailed image analysis.

Image: Use TetraScience REST API to access your data using popular data science tools, like Jupyter iPython notebooks


Conclusion

“There are two really important improvements that come out of this process for Biogen. First, analysis is fully automated; once someone reads a sample in the cell counter, the data for the growth curve is visualized in the BI tool. This was previously a manual process with a lot of data movement. Second, data integrity is improved by aggregating multiple files from each sample into a single file and automating storage of the data. These are key points and should not be overlooked,” says Blackwell.

We are delighted to have the opportunity to work with innovators at Biogen like Len Blackwell, Associate Director, and George Van Den Driessche, Scientist I. This collaboration will save their scientists time, make their cell counter data truly accessible and actionable, and negate the risk presented by separately storing files.

R&D Data Cloud

Applying Data Automation and Standards to Cell Counter Files

Use case illustrating how experimental data from a cell counter is automatically collected, centralized in a data lake, and converted into an ADF-compatible format, saving time, reducing error, and making the data accessible and actionable.

Evan Anderson

TetraScience presented at the virtual Rapid Fire event on May 26 at 12 pm ET.


About LRIG and the event:
LRIG New England is a rapidly growing special interest group focused on laboratory automation. Membership consists of scientists and engineers, primarily from the pharmaceutical and biotechnology industry, with chapters across the US and in Europe. The LRIG mission is to provide and facilitate instruction for both self-development and the benefit of the laboratory automation community.

LRIG-New England's first virtual Rapid Fire event brought new ideas and technologies to members in a quick and concise format. There were 3 speakers, each presenting for a total of 15 minutes. Topics focused on new technologies for the lab: Data Engineering, Alpha CETSA assays, and IoT in the lab.

TetraScience CEO and Co-Founder, Spin Wang, discussed "Data as 'Plumbing' of the Digital Lab: Evolving to a Sophisticated System with Data Engineering."

Check out our presentation at the event.

Events

LRIG-New England: Lunchtime Virtual Rapid Fire Event - May 26, 2020

TetraScience CEO and Co-Founder, Spin Wang, presented during the May 26 LRIG-New England lunchtime virtual Rapid Fire event.

TetraScience

We believe one of the critical components of the Digital Lab is the flow of data. Without the flow of information, data is siloed and fragmented, and it is nearly impossible to take action based on the information. To enable scalable data flow, the Digital Lab needs an Intermediate Data Schema (IDS).

The IDS is used to decouple the data sources (producers) and data targets (consumers). This decoupling significantly reduces the total cost of ownership for exchanging data in a heterogeneous R&D lab environment. It also enables data science, visualization, and analytics.

IDS is also a stepping stone towards the Allotrope Data Format (ADF). We do not recommend directly converting RAW lab data into ADF, since ADF is still an evolving standard. We need something "intermediate" as the bridge to use what we have now, while future-proofing to easily adapt to the rapid changes.

Authors:

Rae Wu - Life Sciences Data Analyst
Yi Jin - Solution Architect
Evan Anderson - Solution Architect
Kai Wang - Delivery Lead and Sr. Software Engineer
Spin Wang - Co-Founder and CEO

Stay tuned for the second blog post of this series. We will cover our experience and suggestions on how to avoid losing data while transforming in and out of IDS, how to manage and govern IDS and leverage ontologies, and how to create IDS for your own data sets and business logic.

The Digital Lab Landscape

Data Sources
The life sciences R&D lab is full of heterogeneous data sets and diverse interfaces for data acquisition. Instruments and CRO/CDMOs produce a large amount of experimental data.


Fundamentally, these "source" systems are designed to run a task, execute a job, and/or deliver services. Providing the data in an accessible format and interface is not the data sources’ main consideration.

Many instrument manufacturers make their data format proprietary as a barrier against competition, incentivizing users to continue purchasing their software and hardware.

Data Targets
There is a similar heterogeneous landscape on the data target side.


Heterogeneity is inherent to the scientific R&D process, which requires data and information to be handled according to its particular domain and context. Since data is consumed in a very specific context, many data targets (e.g., ELN, LIMS) require data to be provided in their specific format or via their provider's APIs.

In such a many-to-many data landscape, it is essential to use an Intermediate Data Schema (IDS) to harmonize heterogeneous data and decouple data sources from data targets.

As the following diagram shows, if each data source and data target is left to communicate directly with every other, N × M point-to-point connections or data translations are needed. However, if data from the data sources is harmonized into an intermediate format, the total cost of ownership can be reduced by roughly an order of magnitude, to N + M. For example, 20 sources and 10 targets would require 200 point-to-point integrations, but only 30 translations into and out of the IDS. The IDS approach makes data flow manageable.


From first principles, such an IDS format should have the following characteristics:

  • Comprehensive information capture
  • Easy to write and transform data to IDS files
  • Easy to read and transform from IDS files
  • Minimum effort to consume the IDS files out of the box
  • Popular, well supported, and surrounded by an ecosystem of mature tooling
  • Decoupling data sources and data targets to avoid vendor lock-in

Intermediate Data Schema (IDS)

Such a heterogeneous data landscape presents an exciting challenge, but not a new one. The broader tech community has faced a similar challenge since the Internet and the web started to dominate the world, more than 20 years ago.

"JSON has taken over the world. Today, when any two applications communicate with each other across the internet, odds are they do so using JSON." - The Rise and Rise of JSON

JSON is adopted and supported by Google, Facebook, Twitter, Pinterest, Reddit, Foursquare, LinkedIn, Flickr, etc. Without any exaggeration, it is safe to say that the Internet communicates using JSON.

Here is an example of a JSON file describing a puppy (not just any puppy, her name is Ponyo and you can follow her on Instagram).

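A sketch of such a file, written here as a Python dictionary and dumped to JSON (only the name comes from the post; the other fields and values are illustrative):

import json

puppy = {
    "name": "Ponyo",
    "species": "dog",
    "age_months": 6,                            # illustrative
    "favorite_toys": ["tennis ball", "rope"],   # illustrative
}

print(json.dumps(puppy, indent=2))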

Without any prior training, just about anyone can easily understand and read the basic information about this puppy, in this format.

Here at TetraScience, we made the decision to use the most popular data interchange format of the Internet as our Intermediate Data Schema (IDS) for the Digital Lab, building the IDS on top of JSON Schema.

Here are some additional examples:

Image: Example IDS JSON describing an injection



Image: Example IDS JSON describing a gradient table in liquid chromatography



Image: Example IDS JSON describing a cell counter measurement


Inside the IDS, you can capture as much information as possible.
From the top level, this includes (a skeleton sketch follows the list):

  • user
  • system
  • sample
  • run
  • system_status
  • experiment
  • method
  • result(s)
  • data cubes
  • logs
  • vendor RAW file
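
A minimal skeleton of an IDS document with these top-level sections, written as a Python dictionary; the nested values are illustrative placeholders rather than a published TetraScience schema.

import json

ids_skeleton = {
    "user": {"name": "jdoe"},
    "system": {"vendor": "ExampleVendor", "model": "ExampleModel"},
    "sample": {"id": "123", "batch_barcode": "xyz"},
    "run": {"started_at": "2020-04-12T09:00:00Z"},
    "system_status": {},
    "experiment": {"name": "my test"},
    "method": {},
    "results": [],
    "datacubes": [],
    "logs": [],
    "raw_file": {"type": "s3file", "fileKey": "uuid/RAW/example.raw"},  # pointer to the vendor RAW file
}

print(json.dumps(ids_skeleton, indent=2))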

Data Cubes

JSON seems great for hierarchical key-value pairs, but what about a numerical matrix? These are pretty common in labs. In this case, we use a data cube. If you are familiar with Allotrope or HDF5, you can see that this is inspired by the concept of data cubes in ADF and data sets in HDF5.


{
 "datacubes":[{
   "name": "Plate Reads",
   "description": "More information (Optional)",
   "mode": "absorbance",
   "another property": "you decide",
   "measures": [{
     "name": "OD_600",
     "unit": "ArbitraryUnit",
     "value": [
       [1.19, 1.05, 1.05],
       [1.11, 0.90, 0.95],
       [1.11, 0.93, 0.95]
      ]
    }],
    "dimensions": [{
      "name": "row",
      "unit": "count",
      "scale": [1, 2, 3]
    }, {
      "name": "column",
      "unit": "count",
      "scale": [1, 2, 3]
    }]
  }]
}
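
To show that such a data cube is straightforward to consume, here is a sketch that loads the JSON above (assumed to be saved in a file named datacube.json) and turns the plate reads into a labeled pandas DataFrame.

import json
import pandas as pd

with open("datacube.json") as f:   # the JSON shown above, saved to a file (assumed name)
    ids = json.load(f)

cube = ids["datacubes"][0]
measure = cube["measures"][0]
rows = cube["dimensions"][0]["scale"]
cols = cube["dimensions"][1]["scale"]

plate = pd.DataFrame(measure["value"], index=rows, columns=cols)
plate.index.name = cube["dimensions"][0]["name"]
plate.columns.name = cube["dimensions"][1]["name"]

print(plate)            # 3 x 3 grid of OD_600 values
print(plate.loc[2, 3])  # the read at row 2, column 3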

You can use data cubes to model many types of experimental data sets, such as:

Instrument | Measures | Dimensions
Chromatogram | 1: detector intensity | 1: wavelength; 2: retention time
Raman Spectroscopy | 1: intensity | 1: wavenumber shift
Plate Reader | 1: absorbance; 2: concentration | 1: row position; 2: column position
Mass Spectrometer | 1: intensity | 1: mass charge ratio; 2: time

We do not recommend creating JSON files larger than 100MB. Instead, consider using File Pointers.

File Pointers

JSON is not suitable for storing large data sets. We do not have to force everything into JSON either. Leveraging the cloud and other purpose-built, efficient, and analytics-ready file formats, like Apache Parquet and HDF5, you can easily reference large binary files from an IDS.

{
 "result": {
   "cell": {
     "viability": {
       "value": 0.2,
       "unit": "Percent"
     }
   }
 },
 "image": {
   "type": "s3file",
   "fileKey": "uuid/RAW/folder/batch1_day2.parquet",
   "bucket": "ts-dev-datalake",
   "version": "3/L4kqtJl...",
   "fileID": "9ff46a4e-c6a0-4cfe-b6ff-1e7b20bbf561"
 }
}


Life sciences organizations commonly pick from the following three formats, depending on the use case.

Use Cases by format:

  • Parquet: Big data, data science, cloud computing. Highly recommended for data science and big data applications. Supported by Databricks, Spark, AWS Athena, and Facebook Presto.
  • HDF5: Grouping disparate files in one file; for example, packaging images plus Excel file(s) together into one file. HDF5 is a portable container format for storing numerical arrays, with a strong ecosystem in scientific computing.
  • Instrument RAW Format: Using vendor specific software or analysis tools. Also needed when you need to import a file back into the vendor's software. Typically, images and instrument binary files are kept in the original format to leverage domain specific tooling and ecosystem.

The best informatics software uses JSON

Hopefully now you have a basic understanding of JSON, its popularity in the broader technology ecosystem, and its ability to capture scientific data.

Let's now take a look at its compatibility with some of the best informatics software in the Digital Lab space.



JSON is used to send data into some of the best informatics software, like Dotmatics, Riffyn, Benchling, Scilligence, IDBS, and many others.

The best analytics and data sciences tools use JSON

Let's turn our attention to some trends in data science:

All of these tools have seamless native support for JSON.

Another benefit of JSON is that it's interchangeable with tabular formats (those with rows and columns). At first glance, IDS JSON does not look very tabular, but it is actually quite easy to transform JSON into a tabular structure.

Image: Use Pandas to "flatten" or "normalize" the JSON. There are also countless other tools to do this.

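A minimal sketch of that flattening step with pandas.json_normalize, using a small IDS-style document assembled from examples elsewhere in this post:

import pandas as pd

doc = {
    "sample": {"id": "123", "batch_barcode": "xyz"},
    "result": {"cell": {"viability": {"value": 0.2, "unit": "Percent"}}},
}

flat = pd.json_normalize(doc, sep="_")
print(flat.columns.tolist())
# ['sample_id', 'sample_batch_barcode', 'result_cell_viability_value', 'result_cell_viability_unit']
print(flat)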

As a result, some of the best visualization and analytics tools that rely on tabular data and a SQL interface, such as Dotmatics Vortex, Tibco Spotfire, and Tableau, can leverage IDS JSON with minimal overhead, which is one of the important criteria for an Intermediate Data Schema.

Image: Example visualization using Dotmatics Vortex


What about Allotrope Data Format (ADF)?

This is a good question. There have been a lot of conversations about converting all experimental data into ADF files and using Semantic Web RDF graphs. Naturally, we asked the question, "Is ADF a good option as the intermediate data schema/format?"

Let's think about it.

Allotrope Data Format is really HDF5 + Semantic Web RDF. Let's discuss these two separately.

HDF5

HDF5 is a popular binary data format for storing multi-dimensional numerical arrays along with heterogeneous data in one single portable container. In fact, it is one of the recommended file formats in the section above when it comes to saving large numerical data sets and binary files in the IDS JSON. See File Pointers. As a result, HDF5 is already part of the IDS JSON.

RDF

The image below is an RDF representation of an LCUV injection. It captures only part of the information (10-20% of the experimental information from the run) and, as you can see, it is already getting quite complicated.

Image: RDF representation of 10-20% of the data from an LCUV injection


If we apply the criteria of an Intermediate Data Schema (IDS) to RDF, here are our observations:

WHY RDF IS NOT SUITABLE

  • Comprehensive information capture: Every relationship needs to be carefully designed and governed. There are not enough terms to capture the lab data.
  • Easy to write and transform data to IDS files: Non-trivial effort to create RDF files and validate they are correct.
  • Easy to read and transform from IDS files: Difficult to query from the RDF. Complicated SPARQL queries are not scientist- or data scientist-friendly.
  • Minimum effort to consume the IDS files out of the box: Very few ELN/LIMS or visualization tools support semantic web RDF.
  • Popular, well supported, and surrounded by an ecosystem of mature tooling: Limited adoption in the life sciences industry and the broader IT ecosystem, resulting in a small user base and a steep learning curve.
  • Decoupling data sources and data targets to avoid vendor lock-in: Currently, RDF data in ADF files can only be read by the Allotrope Foundation's Java/.NET SDK.

As a result, we do not believe RDF is suitable as the Intermediate Data Schema for R&D labs. Major barriers include the learning curve and the lack of tooling.

BUT....

  • Though not a good workhorse for day-to-day transformation, processing, and feeding many of your data targets, RDF graphs can be useful for unlocking specialized semantic analysis tools. If you are considering RDF, we recommend you view it as another data target, not as the middle layer of the Intermediate Data Schema (IDS).
  • We also recommend that you consider simplifying the semantic web RDF approach and use the Allotrope Leaf Node Pattern, which focuses on the data that scientists care about while preserving its semantic meaning, making the resulting file data science ready. Read more about converting from IDS to Leaf Node RDF in these blog posts: TetraScience ADF Converter and Leaf Node Model.

Summary: use JSON as IDS to harmonize your lab data sets

These are all the reasons we chose JSON as our Intermediate Data Schema (IDS) for the Digital Lab, and we think you should too.

Pick something that is popular, battle-tested, easy to read and write, and supported by a vibrant tech community as well as almost all the websites on the Internet.

Pick something that is inclusive and transferable to other formats when use cases justify further transformation. That's why it's called an Intermediate Data Schema :)

Screen-Shot-2020-04-12-at-1.53.50-PM


Note: the list above is inclusive but not exhaustive.

References

Here is a video illustrating how easy it is to learn and use JSON.

Here is a roundup of curated articles about Semantic Web and RDF for further reading.

R&D Data Cloud

The Digital Lab Needs an Intermediate Data Schema (IDS): a First Principle Analysis

We believe one of the key building blocks of a Digital Lab is the flow of data. Without the flow of information, data is siloed and fragmented, and it is nearly impossible to act on. To enable scalable data flow, the Digital Lab needs an Intermediate Data Schema...

Spin Wang

Google won the "search wars" by using algorithms to make intelligent choices and return information that actually answers your search query, rather than simply returning all of the information that matches the query string. In many cases, I believe this is the type of goal that folks are trying to achieve, using semantic web as the path to get there.

Back in my college days, I worked on (or near) the Haystack project — an attempt to build a UI for the semantic web. It was built on top of the Eclipse IDE, and focused on personal information management and bioinformatics as its two domains. Each semantic data type (think IRI) was mapped to a program written in a custom language called Adenine, that rendered a UI specific to that type. Adenine’s syntax was itself a subset of RDF, so you could write it in N3. It was a cool vision, but never came to fruition.

Ever since then, the semantic web (or SemWeb) has been something I've thought about a lot. This is very relevant to my role as CTO at TetraScience because our platform is dedicated to collecting, centralizing, harmonizing, and preparing R&D lab data for exploration and analysis with techniques like machine learning and data science. Ontologies, metadata, and formatting are all critical to making data accessible and actionable (aka useful). While we each have our own individual opinions about the future and practicality of the semantic web, our goal as a company, and my responsibility as the CTO, is to ensure we achieve the end goal of accessible, actionable data, using the best means possible.

Here is a collection of thought leadership articles about semantic web that I'm tracking. This is a question we are asked often, so I thought I'd share.

I'll add to this list as new, interesting articles are published. What are your thoughts about semantic web? Are you using or building it? Contact us via one of the methods below to discuss further.

Resource Center

Round-up of Semantic Web thought leadership articles

Here is a collection of thought leadership articles about semantic web that we are tracking. This is a question we are asked often, so we thought we'd share.

Spin Wang

This week, we are very excited to launch the first cloud-native, data science ready ADF Converter. Our application converts your heterogeneous experimental data sets into ADF files, leveraging the Leaf Node pattern, the Allotrope Foundation Ontology and QUDT for ontology, and HDF5 for the underlying format. We would like to share how we got started and how we gained the clarity and conviction to officially support ADF on our journey.

We believe that storing your heterogeneous experimental data sets in one single HDF5 file, embedding ontological labels on key data fields, and doing that in a way that can be consumed by any data science and 3rd party software, and can be extended to full semantic graph, is indeed a promising data management strategy for organizations to consider.

Authors: Evan Anderson - Delivery Engineer, Vincent Chan - Product Owner/Software Engineer, and Spin Wang - CEO & Co-Founder

Phase 1: Excitement

TetraScience joined the Allotrope Foundation in 2017 as a member of the Allotrope Partner Network. For the last 2.5 years, we have worked actively with the foundation and its working groups, many pharmaceutical companies, and instrument manufacturers to pursue the vision of experimental data standardization for the Digital Lab. We are deeply impressed and motivated by the community built by and around the Allotrope Foundation, and by their deep belief in data standardization to reduce error and enable analytics and data science.

Phase 2: Questions and Exploration

However, truth be told, for the team at TetraScience, it was not a smooth journey; we struggled and frequently questioned our choices.

In the beginning, when the challenge was mainly about learning various concepts involved in Allotrope, the community provided a lot of assistance and helped us to become more knowledgeable. Our team must have read the ADF Primer more than 10 times. After being able to convert one file, we then struggled to convert other data types scalably and repeatedly; then we struggled to make sure the file could be consumed easily and to get the information out.

Because of our conviction in the vision, and also the unwavering enthusiasm we saw from the community members, these challenges did not really bother us. Instead, these became exciting problems that our engineering and product team discussed on a daily basis. We treated these as opportunities to contribute and received tremendous help from the Foundation and the community.

We published our internal training document: Allotrope 101, and contributed automation and libraries we built to the Foundation. We provided at least two presentations in each of the past five Allotrope Connect workshops to report our findings and progress. As we contributed, we also voiced our concerns and questions:

  • Why not just zip a folder to collect all the experimental data?
  • Why not use popular file formats like Apache Parquet and JSON?
  • The full semantic web (full graph) approach is very time consuming: building a model, generating instance graphs, validating, and reading all take significant effort. How could we use ontology in a more scalable way?
  • ADF files can only be read using Foundation's Java/C# library. How could we lower the barrier of consumption and make ADF compatible with Data Science applications?
  • How could we quickly and cost effectively access a large amount of ADF files stored in the cloud? In other words, what is a Hadoop/Spark equivalent for ADF?

For a start-up with high opportunity cost and limited resources, this is a major decision. We cannot support Allotrope without conviction that it has a differentiated value proposition and without genuinely feeling confident. We relentlessly asked ourselves: what are the use cases, why do those use cases have to be achieved by ADF and not the alternatives, and, by the way, what are the alternatives?

That’s why, over the last 2.5 years, despite our active participation, we have always been hesitant about what we should do. Our strategy is not to follow and adopt blindly. Instead, we questioned why the use cases have to be achieved by ADF, and what ADF’s unique advantages are.

Phase 3: Clarity and Conviction

With the help of the Foundation, our customers, and through many debates and iterations, we started to gain more clarity around ADF’s unique value proposition.

ADF, as a file format, is really HDF5 + semantic web RDF graphs. These solve very different use cases and have specific, unique benefits.

On the HDF5 side

  • HDF5 is a well-supported and widely adopted open technology for storing heterogeneous data sets, such as text files, binary files, images, and large multi-dimensional numerical matrices, together in one file. This makes HDF5 more suitable than a zip file, JSON, or Parquet for packaging everything related to an experiment into one portable file. A folder with large numerical matrices stored in clear text is not efficient, and Parquet is not suitable for handling multiple different file formats. HDF5 simplifies data integrity checks, portability, and archiving, especially for data sets with multi-dimensional numerical arrays.
  • ADF is built on HDF5, allowing for the theoretical possibility that any tools supporting HDF5 could open ADF files (if the ADF files are constructed with this in mind).
  • The HDF Group’s Kita and HSDS have started to provide a viable and promising option to store a vast number of HDF5 files in the cloud (e.g., AWS S3) and perform distributed queries. Now you can slice and dice your chromatograms from HDF5 files stored in the cloud.

On the semantic web RDF side

  • Full semantic graph can be complicated and time consuming to implement (here are some good articles about semantic web: What Happened to the Semantic Web? and Hacker News Discussion). However, it does not need to be the only way to get to an ontologically aware data set.
  • Leaf node pattern simplifies the semantic web framework, allowing automatic generation and query of knowledge graphs by leveraging these reusable building blocks. Allotrope Ontologies can be easily embedded in the leaf node, while the ontologies themselves can provide semantic relationships and explanations for each term. Leaf node allows scientists, software engineers, and data scientists to easily create, validate, and read the data, while semantic experts can focus on the ontology. We dive deeper into the Leaf Node Model in this blogpost: Allotrope Leaf Node Model -- the Balance between Practical Solution and Semantics Compatibility

It’s also important that we realize ADF is not suitable for all use cases; it is not a magic bullet that will solve all the issues.

  • For example, if you are interested in a single file for your Pandas data frame, HDF5 may not be the best option and you may consider Feather or Apache Parquet.
  • For example, systems like GE UNICORN, Waters Empower, and Shimadzu LabSolutions client/server edition are based on databases; it is not possible to simply export the data related to one experiment and save it into ADF. Thus, ADF is not suitable as the archive format for these systems. You can back up the entire database and save it in ADF, or you can export a single experiment's data into a file and then put it into ADF, but you cannot restore that file back into the vendor software. For the majority of file-based data outputs, ADF can indeed be used for archiving.

Instead of discouraging us, understanding the limitations provided us with much more clarity on how to recommend ADF to our customers and thus gave us conviction on where ADF is uniquely positioned to excel!

We believe that storing your heterogeneous experimental data sets in one single HDF5 file, embedding ontological labels on key data fields, and doing that in a way that can be consumed by any data science and 3rd party software, and can be extended to full semantic graph, is indeed a promising data management strategy for organizations to consider.

During this process, we continue to be encouraged by the community’s enthusiasm -- people from different pharmaceutical companies and manufacturers working together collaboratively, iterating towards a common vision, willing to learn from past experience, and finding the right balance between vision and execution. A lot of these ideas and opinions were inspired by others in the community. We could not be more grateful for the inspirations we received from the community!

Today: ADF Converter

This week (early April of 2020), we are very excited to launch the first cloud-native, data science ready ADF Converter. It is an application built on the Tetra Data Platform that converts your experimental data into ADF files, leveraging the Leaf Node pattern, the Allotrope Foundation Ontology (AFO) and QUDT for ontology, and HDF5 for the underlying file format. It supports all the instrument types modeled thus far by the Allotrope Modeling Working Group using the Leaf Node pattern, and is expected to support up to 40 instrument models by the end of 2020. You can leverage these models to document and analyze key experimental data in ELNs, LIMS, visualization and data science tools, or any software you write yourself.

Here are our core convictions that shaped the ADF Converter.

First, we believe that the barrier to data access needs to be as low as possible. The data format serves the need of use cases and is a means to the end.

Right now, the only way to read an ADF file is via the Java/C# library provided by the Allotrope Foundation. This limits the ability for Electronic Lab Notebooks, Data Science tools (Python, R), and Business Intelligence & Analytics tools (Dotmatics Vortex, Tibco Spotfire, Tableau) to use the ADF files.

To address this limitation, we made sure that ADF files can now also be opened using Python and R, easily consumed by data scientists, and made compatible with the HDF Group’s Kita and HSDS. Now, leveraging any tool or programming language compatible with HDF5, or using R Data Functions in Spotfire, you can easily extract the information from ADF files into your data science applications.

Image: use Python to explore and visualize your ADF file

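A minimal sketch of opening an ADF file with plain h5py; the file name is illustrative, and the group and dataset names printed are simply whatever the file contains.

import h5py

def describe(name, obj):
    kind = "dataset" if isinstance(obj, h5py.Dataset) else "group"
    print(f"{kind}: {name}")

# ADF is HDF5 under the hood, so a generic HDF5 reader can walk its contents
with h5py.File("example_experiment.adf", "r") as adf:  # illustrative file name
    adf.visititems(describe)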





Second, we believe that a programmatic API and automation are crucial for data standardization. Standardization is only meaningful when the number of files following the standard exceeds a critical amount. To reach that critical mass, automation is necessary. As a result, the ADF Converter has a flexible programmatic interface for you to ingest input files, monitor conversion progress, and search for and retrieve converted ADF files. Third-party applications can now leverage this API for conversion.

Image: ADF Converter ingesting files + third party application retrieving the files




Third, we believe in starting with simple patterns, because simplification allows automation, automation leads to scale, and scale leads to adoption and momentum, which in turn surfaces more user requirements and use cases. These requirements and use cases help to solidify the next iteration. The Agile development philosophy also applies to data modeling. The ADF Converter starts with the Leaf Node pattern and will include the aggregation pattern after the working groups finish the design.

Image: Allotrope Leaf Node pattern


End

It has been a great journey. We encourage more people and organizations to be involved and participate actively. It’s a collaborative community that will result in value and deep relationships! We hope our effort will enable organizations to overcome the barrier of initial adoption and start to truly evaluate and explore ADF in real production settings against your use cases. By doing this, a data format like this one can be truly battle-tested and demonstrate its vitality. There are still many unknowns, and we look forward to further collaborating on this journey of Data Standardization for Digital Labs!

Read more about the ADF Converter and our approach, or contact us with questions, comments, or feedback.

TetraScience ADF Converter -- Delivering on the Promise of Allotrope and a Startup’s Journey

Our goal is to enable other organizations to overcome the barrier of initial adoption in order to evaluate and explore ADF in real production settings against your use cases. By doing this, such a data format can be truly battle-tested and demonstrate its vitality.

Spin Wang

Using the Leaf Node model, scientists, software engineers, and data scientists can rapidly iterate on the data model and create and query standardized data sets, while semantic experts focus on the ontology and semantics. It’s a division of labor, allowing common activities to be easy and fast while maintaining compatibility with the semantic world.

An added benefit of the Leaf Node model is that it can be easily transformed to and from popular formats such as JSON, tabular formats like CSV, and columnar formats like Parquet. These formats make the data easily compatible with common software engineering, data science, and visualization tools. Being interchangeable with popular, data science ready formats is another major advantage of the Leaf Node model.

Written by: Spin Wang

Earlier this year, we discussed the basic concepts of Allotrope Data Format (ADF) in this blog post: Allotrope 101. One of the keys of Allotrope is its Data Description, a triple store leveraging the semantic web and Resource Description Framework (RDF) graphs.

Traditionally, the Allotrope data description is presented by what we call a “Full Graph” stored in RDF, something that looks like the following:

Full Graph

The goal of the full graph is very exciting: it captures the relationships between entities. For example:

  • material A has role of experiment sample
  • the experiment sample is realized in an HPLC injection
  • the HPLC injection has participant an autosampler

Such relationships can potentially allow a machine (computer software) to understand what a sample, a chromatography injection, and an autosampler are, and how they are related to each other in an abstract way, like a lab scientist does. However, the downside is that it introduces a lot of ontological, taxonomic, and even philosophical complexity and overhead. In preparation for machines to understand the data in the future, scientists now incur more overhead and get stuck.

The community quickly realized that it is non-trivial to build and use a Full Graph data model, for the following reasons:

  • The semantic meaning and relationships of different entities are not something that a typical scientist would understand; significant semantic and ontology knowledge is needed to build and understand the Full Graph.
  • There is no effective validation mechanism. Namely, it is not yet possible (or at least extremely challenging) to break the Full Graph into smaller pieces, validate each module separately, combine the modules, and then ensure that validation still passes. In fact, SHACL, the shapes constraint language used to validate the Full Graph, was only published in summer 2017.
  • Even when the Full Graph is available, its complexity quickly becomes a barrier for scientists, data scientists, software developers, and 3rd party software to consume the data. (Most vendors in the lab informatics market do not support graphs, let alone the Full Graph.)

In light of these observations, the community proposed the Leaf Node model, which is the theme of this blog post.

Essentially, the Leaf Node model asks: “What if we focus only on the leaves of the Full Graph, namely those nodes that are directly associated with the data?”

This allows scientists, software developers, and data scientists to quickly zoom in on the most important part of the graph: the actual data fields and their values. Example Leaf Nodes look like this:

  • Experiment name is “my test”
  • Sample id is “123”
  • Sample batch barcode is "xyz"
  • The injection volume is 1 microliter
  • Cell viability is 60.5 percent

When represented in RDF and saved in ADF, a Leaf Node is composed of the following triples.

Image: Leaf Node triples
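
As a rough sketch with the rdflib Python library, a single Leaf Node such as “cell viability is 60.5 percent” can be written as a handful of triples. The ex: namespace and its value/unit properties are hypothetical; only the AFR_0001111 IRI for cell viability comes from the Allotrope Foundation Ontology.

# A minimal sketch of one Leaf Node -- "cell viability is 60.5 percent" --
# expressed as triples with rdflib. The ex: namespace and its value/unit
# properties are hypothetical; AFR_0001111 is the AFO IRI for cell viability.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

AFR = Namespace("http://purl.allotrope.org/ontologies/result#")
EX = Namespace("http://example.org/leaf#")   # hypothetical data namespace

g = Graph()
node = EX["leafNode1"]
g.add((node, RDF.type, AFR["AFR_0001111"]))                  # semantic hook: cell viability
g.add((node, EX.value, Literal(60.5, datatype=XSD.double)))  # the actual data value
g.add((node, EX.unit, Literal("percent")))                   # its unit

print(g.serialize(format="turtle"))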

The Leaf Node model is essentially a collection of Leaf Nodes. It also supports arrays of related Leaf Nodes, such as an array of chromatogram peaks; we will explain this in more detail in future articles.

You will probably ask: what about the semantic meanings and relationships? How is information such as “sample is input to an experiment” captured, and how do we tell the machine what a “sample” is?

The answer is actually quite delightful. An IRI is attached to each node (for example, result#AFR_0001111 is attached to cell viability), and because that IRI is already part of an ontology, the relationships between nodes have already been rigorously and elegantly defined in that ontology. The IRI serves as a semantic hook, or label, that explains what “viability” is and bridges the Leaf Node graph with the ontology. If you want a “Full Graph”, simply combine the Leaf Node graph with the ontology.
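
That combination really is just a union of two triple sets. Here is a minimal sketch with rdflib, assuming the leaf-node data and the ontology live in placeholder Turtle files:

# Recovering a "Full Graph" by combining the Leaf Node graph with the ontology.
# The file names below are placeholders for your own leaf-node data and the AFO.
from rdflib import Graph

leaf_nodes = Graph().parse("leaf_nodes.ttl", format="turtle")
ontology = Graph().parse("afo.ttl", format="turtle")

full_graph = Graph()
for source in (leaf_nodes, ontology):
    for triple in source:            # copy every (subject, predicate, object)
        full_graph.add(triple)

print(len(leaf_nodes), len(ontology), len(full_graph))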

With this approach, lab scientists, software engineers, and data scientists can rapidly iterate on the data model and easily create and query standardized data sets, while knowledge engineers and semantic experts focus on developing the ontology, which is the right place to describe what a sample is and how it can be part of an experiment.

The Leaf Node approach is essentially a division of labor, allowing what should be easy and fast to actually be easy and fast, while enabling compatibility with the semantic world via the IRI. The delineation of data (captured in Leaf Nodes) and semantics (captured in the ontology) lets scientists, software engineers, and data scientists easily create, validate, and read the data, while semantic experts focus on the ontology.

An added benefit of the Leaf Node model is that it can be easily transformed to and from popular formats such as JSON, tabular formats like CSV, and columnar formats like Parquet. These formats make the data readily compatible with common software engineering, data science, and visualization tools; being interchangeable with these data-science-ready formats is another major advantage of the Leaf Node model. For example, the Leaf Nodes above map naturally to the following JSON:

{
 "experiment": {
   "name": "my test"
 },
 "sample": {
   "id": "123"
 },
 "injection": {
   "volume": {
     "value": 1,
     "unit": "Microliter"
   }
 }
}

Or, flattened into a tabular (CSV) form:

EXPERIMENT_NAME,SAMPLE_ID,INJECTION_VOLUME_VALUE,INJECTION_VOLUME_UNIT
my test,123,1,Microliter
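
As a sketch of how mechanical this flattening can be, pandas can turn the nested JSON above into the tabular row shown. The snippet below is illustrative only and is not part of the ADF Converter itself.

# Flattening the Leaf-Node-style JSON above into a single tabular row with pandas.
# Illustrative only -- this is not part of the ADF Converter itself.
import json
import pandas as pd

record = json.loads("""
{
  "experiment": {"name": "my test"},
  "sample": {"id": "123"},
  "injection": {"volume": {"value": 1, "unit": "Microliter"}}
}
""")

df = pd.json_normalize(record, sep="_")   # columns: experiment_name, sample_id, injection_volume_value, injection_volume_unit
print(df.to_csv(index=False))
# df.to_parquet("leaf_nodes.parquet")     # columnar output (needs a Parquet engine such as pyarrow)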

To find out more about how you can use the Allotrope Leaf Node Model and automatically standardize your lab data for analytics, automation, and archiving, please reach out to us at www.tetrascience.com/contact-us!

R&D Data Cloud

Allotrope Leaf Node Model — a Balance between Practical Solution and Semantics Compatibility

Specialization and compatibility are both possible with the Leaf Node model: scientists, engineers, and data scientists can rapidly iterate on the data model to create and query standardized data sets, while ontology experts focus on semantics.

Spin Wang

Use this guide to ensure you don't miss anything important while avoiding alert fatigue.

Image: Lab Monitoring alert set points infographic

You’ve got your monitoring system installed: Great!

Now you just need to set up your alert set points, and then you won’t need to think about it anymore.

But how do you do that? What happens if you don’t choose the right alert set point? How do you know you chose correctly?

If you’ve never done this before, it can be hard to know what to do here - under-sensitive alerts might mean you’ll miss a problem with your equipment, but over-sensitive alerts are annoying and will cause alert fatigue.

Alert fatigue happens when you receive so many alerts that you learn to ignore them. If you receive 99 alerts that are false alarms due to normal usage, it is pretty likely that you will ignore the 100th alert. Of course, that 100th alert is the one that indicates a freezer failure, or something equally important.

At TetraScience, we’ve monitored more than 3,600 pieces of equipment over the last four years. We've collected lots of data about how freezers and incubators perform and what alert settings work for the Lab Managers and Ops/Facilities folks who rely on our Lab Monitoring system. One of the most common requests we receive is for alert setting recommendations. We decided to gather our data and anecdotal experience and write it down - so here it is:

Alert variables: threshold and delay

When setting alerts, the variables we have to work with are alert threshold and delay. The way alerting generally works is this - once the temperature (or other data feed) has crossed the set alert threshold, a timer starts. That timer continues for the duration of the set delay, and if the temperature is still out of bounds after that period of time, you receive an alert.

As an example, if you have a -80℃ freezer, where you have set an alert threshold of -65℃ and a delay of 30 minutes, you would receive an alert if the freezer temperature went above -65℃ and remained above -65℃ for 30 minutes. The figure below shows what this looks like on your TetraScience Lab Monitoring dashboard.

Image: Temperature profile showing an alert threshold and delay on the TetraScience Lab Monitoring dashboard
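
To make the mechanics concrete, here is a simplified Python sketch of that threshold-plus-delay logic. The function and data structures are illustrative, not our production alerting code.

# A simplified sketch of the threshold-plus-delay logic described above.
# It is illustrative only, not TetraScience's production alerting code.
from datetime import datetime, timedelta

def time_of_alert(readings, threshold_c, delay):
    """readings: (timestamp, temperature) pairs in time order.
    Returns the time an alert should fire, or None if no alert is needed."""
    excursion_start = None
    for ts, temp in readings:
        if temp > threshold_c:
            if excursion_start is None:
                excursion_start = ts          # crossed the threshold: start the timer
            elif ts - excursion_start >= delay:
                return ts                     # still out of bounds after the delay: alert
        else:
            excursion_start = None            # back in bounds: reset the timer
    return None

# Example: a -80C freezer with a -65C threshold and a 30-minute delay, sampled every 5 minutes.
temps = [-79, -78, -60, -58, -59, -61, -60, -57, -59]
readings = [(datetime(2020, 3, 1, 9, 0) + timedelta(minutes=5 * i), t) for i, t in enumerate(temps)]
print(time_of_alert(readings, threshold_c=-65, delay=timedelta(minutes=30)))
# -> 2020-03-01 09:40:00 (the excursion began at 09:10 and persisted for 30 minutes)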

Setting alert thresholds

The purpose of setting an alert threshold is pretty straightforward: if your equipment exceeds a set temperature, you want to know so you can take corrective action.

When selecting a threshold, the important thing to remember is that temperature naturally fluctuates during normal usage.

For example, a primary cause of temperature fluctuation is users opening and closing the door during the course of a normal day - this will cause excursions that last for a few minutes before your equipment recovers. Freezers will also have a sinusoidal temperature variation even when nobody is using them. This is caused by the compressor cycle and can be a noticeable variation (+/- 5℃ would be very typical).

Because these variations are part of normal operation, we typically recommend thresholds that are a few degrees outside of normal operations. Below are some examples of reasonable thresholds:

Image: Examples of reasonable alert thresholds

Note that we do not recommend low-temperature alerts for -20s or -80s: these generally are not useful and can cause alert fatigue. We do, however, sometimes recommend low-temperature thresholds for 4℃ refrigerators, as you may have material that you do not want to accidentally freeze. That said, the risk of hitting freezing temperatures does vary by type of refrigerator; for example, low-temperature thresholds can be helpful for refrigerators that are likely to have frost on top.

Setting alert delays

The purpose of setting a delay might be a little less obvious - wouldn’t you want to know immediately if your equipment is out-of-bounds? In some cases, this is true (if you’re monitoring a gas manifold, and it goes into alarm, there’s no reason for a delay).

But for cold storage and incubators, the time it takes to return to the temperature set point after opening the door can be 20-30 minutes or more - especially if users open and close the door several times in rapid succession.

Because of this, we typically suggest alert delays in this range:

Image: Recommended alert delays

We can use a shorter delay for incubator high temperature alarms because the incubator temperature does not generally rise with normal use. High incubator temperature normally means there is a problem with your HVAC system or the incubator itself.

Summary

The best monitoring system is one that you use, not one that you tune out.

Alert setup shouldn’t be difficult, and it shouldn’t take a lot of fine-tuning down the line. Though lab equipment can show some variation in typical operating ranges, we usually find that you should be able to set universal alert thresholds and delays for each type of instrument. The important thing to remember when you are setting up your lab monitoring system is to avoid the extremes of alert sensitivity - the best monitoring system is one that you use, not one that you tune out.

Do you have alert threshold/delay setting best practices to share? Additional questions? Let us know!

Lab Monitoring

Choose the right alert set points for your freezers, refrigerators, and incubators

Use this guide to ensure you don't miss anything important while avoiding alert fatigue.

Erika Tsutsumi

Authors:

Benjamin J. Woolford-Lim, Senior Laboratory Automation Software Engineer, GSK

Vincent Chan, Product Owner & Software Engineer, TetraScience

Spin Wang, Chief Technology Officer and Co-founder, TetraScience

Mike Tarselli, Chief Scientific Officer, TetraScience

Overview

How to learn the Allotrope Framework and use the Allotrope Data Format (ADF):

Don’t worry if you get stuck or have questions - try these helpful resources:

Underlying Concepts and Motivations

Motivation

As life sciences organizations become increasingly data-driven, the strategic importance of high-quality data sets grows. Scientific instruments were not historically designed as “open” systems and typically generate data in proprietary or vendor-specific formats. The resultant data silos greatly reduce the ability to gain insights and perform analytics on scientific data.

The Allotrope Foundation aims to revolutionize the way scientific data is acquired and shared, and the way actionable observations are drawn from it, by establishing a community and framework for standardization and linked data.

Underlying Concepts

Semantic Web

The Semantic Web, an extension of the World Wide Web, provides a common framework for data sharing and reuse across applications, enterprises, and community boundaries. Its goal is to make Internet data machine-readable and to integrate content, information, applications, and systems. With a web of information, users can contribute knowledge openly and freely, which leads to unprecedented growth. To implement the Semantic Web, however, there needs to be a common way to interchange information.

RDF

The Semantic Web promotes common data formats and exchange protocols through a popular modeling language, the Resource Description Framework (RDF), which is the basis of languages such as the Web Ontology Language (OWL). RDF relies on the concept of "triples": it extends the linking structure of the Web by using Uniform Resource Identifiers (URIs) to name the relationship between things as well as the two ends of the link.

This simple model allows structured and semi-structured data to be mixed, exposed, and shared across different applications. Note that RDF as a data model is distinct from RDF/XML, a means of representing RDF in XML that has largely been superseded by easier-to-use formats like Turtle.

Triples

Triples store elements or facts, and a set of triples can be combined to represent a graph of data and relationships in an ontology. A triple consists of a subject, a predicate, and an object:

  • The subject is the resource the fact is about and is usually either a class from an ontology, or some instance of an entity in the overall graph
  • The predicate is the relationship between the subject and the object, such as the type of detector an instrument has
  • The object is the value this fact asserts is related to the subject

The object can either be another resource, like the subject (i.e., an instance of an entity or an ontological class), or it may be a fixed literal value such as 500.14 or "Allotrope".
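
To make this concrete, here is a small hypothetical example: a few triples about an imaginary instrument, written in Turtle and loaded with the rdflib Python library. The ex: namespace and property names are made up purely for illustration.

# A few hypothetical triples about an imaginary instrument, in Turtle syntax.
# The ex: namespace and property names are made up purely for illustration.
from rdflib import Graph

triples_ttl = """
@prefix ex:  <http://example.org/lab#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:instrument1  ex:hasDetectorType      "UV" ;
                ex:detectionWavelength  "500.14"^^xsd:double ;
                ex:usesDataStandard     "Allotrope" .
"""

g = Graph()
g.parse(data=triples_ttl, format="turtle")
for subject, predicate, obj in g:      # each fact is one (subject, predicate, object) triple
    print(subject, predicate, obj)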

The QUDT (Quantities, Units, Dimensions and Types) ontology, for example, contains facts about the unit of atomic mass, namely u or dalton (Da), expressed in Turtle syntax.

Once you have a graph, it can be queried using SPARQL, a query language similar to SQL but designed specifically for semantic content. Every ADF file can store triples about its data in the Data Description. Allotrope Foundation Ontologies (AFO) provide consistent terms and relationships across instruments, techniques, disciplines, and vendors.

RDF provides a mechanism that allows anyone to make a basic statement about anything and layer those statements into a single graph.

Now imagine an instrument being able to automatically produce those statements and add them into a big pool of scientific data with consistent terms and relationships between the data sets. Scientists and data analysts can now spend the majority of their time analyzing and gaining insights into the data, instead of trying to interpret or interchange it.

Wouldn’t that be pretty powerful?

URI and IRI

Remember URIs, from the RDF section above? URIs can uniquely identify a resource or name something. By leveraging URIs in the RDF framework, one can represent ontologies in a unique and even resolvable way. 

Internationalized Resource Identifiers (IRIs) are like URIs, but they may use the full Unicode character set, whereas URIs are limited to ASCII characters.

SPARQL

SPARQL, pronounced "sparkle", is a recursive acronym for SPARQL Protocol and RDF Query Language. It queries triples in a triplestore and is structurally similar to SQL. To learn more, check out the SPARQL tutorial.

A small set of triples in Turtle format can form a graph that describes a cell counter measurement: a total cell count of 1972.0.

A SPARQL query can then obtain the total cell count from that graph.
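
Here is a minimal sketch of what that can look like with rdflib: a tiny Turtle graph recording a total cell count of 1972.0 and a SPARQL query that retrieves it. The IRIs and property names are illustrative placeholders rather than actual ontology terms.

# A tiny graph recording a total cell count of 1972.0, plus a SPARQL query
# that retrieves it. The IRIs and property names are illustrative placeholders.
from rdflib import Graph

data_ttl = """
@prefix ex:  <http://example.org/cellcounting#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:measurement1  ex:measures  ex:totalCellCount ;
                 ex:value     "1972.0"^^xsd:double .
"""

query = """
PREFIX ex: <http://example.org/cellcounting#>
SELECT ?total
WHERE {
  ?m ex:measures ex:totalCellCount ;
     ex:value    ?total .
}
"""

g = Graph()
g.parse(data=data_ttl, format="turtle")
for row in g.query(query):
    print(row[0])    # -> 1972.0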

Leveraging triples and SPARQL, you can perform powerful queries on top of highly connected data sets - the key benefit of using the Semantic Web.

SHACL

The Shapes Constraint Language (SHACL) defines and validates constraints on RDF graphs. It is a relatively new standard from the W3C.

RDF triples let you express and notate facts so they can connect to other facts across datasets and domains. Super! However, this same flexibility can be a weakness: it can lead to inconsistent data representations. This is where SHACL comes in.

SHACL validates an RDF graph against a set of constraints or rules. By encoding our data models in SHACL, we can automatically check that triples and graphs conform to the expected model, with all required information in the correct place (and linked in the right way). Validated datasets can then be searched with the same SPARQL queries, enabling easy and consistent linking to other datasets with a validated data structure.

As an example, consider a SHACL snippet defining a shape called unnamed:checkForEntityNode.

This snippet ensures that for any entity node in the graph belonging to the class viability (defined as http://purl.allotrope.org/ontologies/result#AFR_0001111 in the Allotrope Foundation Ontology), there is only one numerical value and only one unit. Notice that the SHACL file is itself a set of triples in Turtle format.
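
Here is a schematic reconstruction of such a shape, expressed in Turtle and validated with the open-source pySHACL library. The shape's prefix and the ex:value / ex:unit property IRIs are placeholders, not the actual AFO properties.

# A schematic SHACL shape along the lines described above: every node of class
# AFR_0001111 (viability) must have exactly one value and exactly one unit.
# The ex:value / ex:unit property IRIs are placeholders, not actual AFO terms.
from rdflib import Graph
from pyshacl import validate

shapes_ttl = """
@prefix sh:     <http://www.w3.org/ns/shacl#> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .
@prefix result: <http://purl.allotrope.org/ontologies/result#> .
@prefix ex:     <http://example.org/shapes#> .

ex:checkForEntityNode
    a sh:NodeShape ;
    sh:targetClass result:AFR_0001111 ;                 # the AFO class for viability
    sh:property [ sh:path ex:value ; sh:datatype xsd:double ;
                  sh:minCount 1 ; sh:maxCount 1 ] ;
    sh:property [ sh:path ex:unit ;
                  sh:minCount 1 ; sh:maxCount 1 ] .
"""

data_ttl = """
@prefix result: <http://purl.allotrope.org/ontologies/result#> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:     <http://example.org/shapes#> .

ex:viability1 a result:AFR_0001111 ;
    ex:value "60.5"^^xsd:double ;
    ex:unit  "percent" .
"""

conforms, _, report = validate(
    Graph().parse(data=data_ttl, format="turtle"),
    shacl_graph=Graph().parse(data=shapes_ttl, format="turtle"),
)
print(conforms)   # True -- this leaf node has exactly one value and one unit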

Data Organization and Hierarchy

Taxonomy

A hierarchical classification of entities using the same relationship type, e.g., "is a subclass of", throughout. Taxonomies are typically represented by a tree structure (think of biological classification: Kingdom, Phylum, Class, Order, Family, Genus, Species).

Ontology

A generalization of a taxonomy, with several different relationship types, e.g., "is a", "has a", "contains a", and with multiple inheritance allowed within the same ontology. While taxonomies can be represented as trees due to their hierarchical nature, ontologies have more complex relationships and are modeled as graphs.

Graphs and Graph Databases

Graphs are a powerful way to store and explore unstructured and semi-structured data to identify relationships between data and quickly query these relationships.

Data is most often represented in tabular form, e.g., in relational databases. What are the advantages of modeling data as a graph over the traditional relational format?

Graph databases have advantages over relational databases for use cases like social networking, recommendation engines, and fraud detection, where the relationships between data are arguably as important as the data itself. If you use traditional relational databases, you would need a large number of tables with multiple foreign keys to store the data, which are difficult to understand and maintain. Furthermore, using SQL to navigate this data would require nested queries and complex joins that quickly become unwieldy, and the queries would not perform well as your data size grows over time.

In graph databases, relationships are stored as first-class citizens of the data model, as opposed to relational databases, which require us to establish relationships using foreign keys. This allows data in nodes to be directly linked, dramatically improving the performance of queries that navigate relationships in the data. It also enables the model to map more closely to our physical world.

Additional graph information:

Tools and Technologies

  • BFO: Basic Formal Ontology, an upper-level ontology used to ensure consistent usage and linking of terms across different ontologies. It is widely used in the biomedical space, including serving as the basis for every ontology in the Open Biological and Biomedical Ontology Foundry (OBOFoundry). 
  • HDF5: A binary file format, optimized for high-performance access to large datasets. Used as an underlying technology in ADF. 
  • Jena: An Apache open-source Java API supporting the use of Semantic Web approaches such as triples and SPARQL queries. Used as an underlying technology for the Data Description layer of ADF.
  • Jena Fuseki: A popular tool to easily test SPARQL queries. 
  • Protégé: A standard ontology development and exploration tool, developed by Stanford University and provided free for general use. Watch a basic tutorial on the use of Protégé.
  • Triplestore: A database-like storage mechanism for triples, such as Jena-Fuseki.
  • Turtle: A syntax for representing RDF triples in a more human-readable form than the RDF/XML standard. It is structurally similar to the SPARQL language. 

Summary

By establishing a framework and unified data formats to structure the multitude of experimental data generated in life sciences R&D, the scientific community can focus on moving the needle of scientific innovation.

R&D Data Cloud

Allotrope 101

The Allotrope Foundation, an international consortium of pharma, biotech, and other research-intensive industries, has developed advanced data architecture to transform the acquisition, exchange, and management of laboratory data throughout its lifecycle.

Spin Wang