In Part one of this blog, we discussed some of the reasons DIY data platform projects often fail to deliver promised benefits, or do so at unexpectedly high cost.
To review: building a do-it-yourself (DIY) data solution from non-R&D-focused components means assuming responsibility for selecting, integrating, and lifecycle-managing all the pieces, and for researching, architecting, building, and maintaining all the integrations. That's a tall order, requiring headcount and specialized skills, most of which end up focused on building and operating the platform and getting data into it, and much less on extracting maximum value from the data itself.
Making matters worse, there's an "impedance mismatch" between the capabilities offered by generic data components and services (ingestion, transformation, cloud storage and search, etc.) and the infrastructure, workflow, regulatory, and scientific requirements of biopharma. A DIY project that aims to deliver a quality solution must fill that gap: supplying the scientific and process knowledge needed to extract, parse, and enrich data close to its sources (data without context has limited utility), and mapping data into an ontology/schema that makes it readily findable, accessible, interoperable, and reusable (FAIR).
The result: DIY efforts can easily consume vast resources and drive non-strategic organizational sprawl while delivering sub-par benefits: e.g., brittle, inflexible, hard-to-maintain integrations between data sources and targets (logically equivalent to one-off, point-to-point integrations, despite traversing a transformation tier and a data lake/warehouse), and aggregated data that is stripped of context and unschematized, and therefore hard to find and use for automation, analysis, visualization, and discovery.
Tetra R&D Data Cloud: Unified and Purpose-Built
Meeting these challenges requires a different approach. The Tetra R&D Data Cloud represents a fundamental shift in how life sciences data sources and data targets are bridged to accelerate R&D.
Data-centric: treating data as the core asset and providing stewardship of it across its whole life cycle. The Tetra R&D Data Cloud is architected to keep data the continual focus: from acquisition and harmonization, through data management and processing, to preparing data for AI/ML and pushing it to informatics and automation applications. The Tetra Data Platform (TDP) includes:
- A flexible ingestion tier (agents, connectors, IoT proxies, etc.) that robustly manages connectivity with and transactional access to any kind of data source.
- A sophisticated pipeline architecture, engineered for rapid creation of self-scaling processing chains (for ingestion, data push, transformation, and harmonization) by configuring standardized components, reducing the coding and operations burden while containing the cloud costs associated with data processing.
- A high-performance, multi-tiered cloud storage back end.
- An R&D-focused, fully-productized, plug-and-play, distributed integration architecture that runs across this purpose-built platform. Integrations are engineered by Tetra's IT and biopharma experts (in collaboration with our ecosystem of vendor partners) to extract, deeply parse, and fully enrich (e.g., with tags, labels, environmental data, etc.) data as it emerges from sources, then harmonize it into an open Intermediate Data Schema (IDS) to make it FAIR.
- Open, modern REST APIs and powerful query tools (plus API-integrated data apps), providing easy access to raw and harmonized Tetra Data by automation, analytics, and AI/ML applications, and by popular data science toolkits (e.g., Python + Streamlit).
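To make the API access above concrete, here is a minimal sketch of how a client might compose a search for harmonized documents. The endpoint URL, the `idsType` field, and the Elasticsearch-style query shape are illustrative assumptions for this post, not the actual Tetra API:

```python
import json
from urllib.parse import urlsplit

# Placeholder endpoint: real deployments would use their own TDP URL.
BASE_URL = "https://example-tdp.invalid/v1/datalake/search"

def build_search_request(ids_type: str, filters: dict) -> tuple[str, str]:
    """Compose a hypothetical search request for harmonized (IDS) documents.

    Returns the endpoint URL and a JSON body shaped like an
    Elasticsearch-style boolean query (an assumption for illustration).
    """
    query = {
        "query": {
            "bool": {
                # Restrict to one schema type, then AND in each filter.
                "must": [{"term": {"idsType": ids_type}}]
                + [{"term": {field: value}} for field, value in filters.items()]
            }
        }
    }
    return BASE_URL, json.dumps(query)

# Example: find chromatography documents labeled with a project tag.
url, body = build_search_request("chromatography", {"labels.project": "stability-study"})
print(urlsplit(url).path)  # /v1/datalake/search
```

A client would POST this body to the search endpoint and receive matching harmonized documents, which is what makes schematized Tetra Data directly consumable by analytics and data science tools.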
This data-centric architecture (plus the data-centric focus applied in building true R&D data integrations) ensures that:
- Appearance of new data from instruments and applications (and readiness of instruments and applications to accept instructions) can be detected automatically in virtually every case, enabling hands-free automation: ingestion, enrichment, parsing, transformation, harmonization and storage on the inbound side, and search/selection, transformation, and push (or synchronization) on the outbound side.
- Newly appearing data is enriched, parsed, harmonized, and stored as it becomes available, preserving context and meaning for the long term, ensuring lineage and traceability, and making Tetra Data immediately useful for analytics and data science, even in close to real time (i.e., while experiments are still running).
- Data is harmonized into a comprehensive set of JSON IDS schemas that are fully documented and completely open, making the data searchable and comparable and enabling rapid (automatic) ingestion into applications.
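As an illustration of what enrichment and harmonization produce, the sketch below parses a raw, comma-separated instrument reading into an IDS-style JSON document. The field names (`@idsType`, `measurement`, `enrichment`) are hypothetical stand-ins chosen for this example, not the actual IDS schema:

```python
import json
from datetime import datetime, timezone

def harmonize_reading(raw_line: str, tags: list[str]) -> dict:
    """Parse a raw instrument reading ("sample_id,wavelength_nm,absorbance")
    and harmonize it into an illustrative, IDS-style document.

    All field names here are hypothetical, not the actual IDS schema.
    """
    sample_id, wavelength_nm, absorbance = raw_line.strip().split(",")
    return {
        "@idsType": "uv-vis",  # hypothetical schema identifier
        "sample": {"id": sample_id},
        "measurement": {
            # Values carry explicit units so documents stay comparable.
            "wavelength": {"value": float(wavelength_nm), "unit": "nm"},
            "absorbance": {"value": float(absorbance), "unit": "AU"},
        },
        "enrichment": {
            # Context (tags, provenance) attached at ingestion time.
            "tags": tags,
            "ingestedAt": datetime.now(timezone.utc).isoformat(),
        },
    }

doc = harmonize_reading("S-042,260,0.873", tags=["stability-study"])
print(json.dumps(doc, indent=2))
```

The point of the structure is that every reading lands in the lake as a self-describing document: searchable by schema type, comparable across instruments via shared units, and carrying the context that makes it FAIR.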
Cloud-native: The Tetra R&D Data Cloud incorporates best-of-breed open source components and technologies (e.g., JSON, Parquet) and popular standards favored by scientists and by life sciences and data science professionals (e.g., SQL, Python, Jupyter) in an aggressively cloud-native architecture that ensures easy, flexible deployment, resilience, security, scalability, high performance, and minimal operational overhead, while optimizing for the lowest total cost of ownership.
R&D-focused: connectivity, integrations, and data models are purpose-built for experimental data at the core. TetraScience has created a large (and growing), broadly skilled (software development, cloud computing, and life sciences), and disciplined organization, and has evolved a mature process for identifying, building, and maintaining a library of fully productized integrations to biopharma R&D data sources and targets, along with data models for common data sets. These integrations (including the associated data models, which comprise the open IDS) are tailored to fulfill informatics and data analytics use cases in R&D.
Open and vendor-agnostic, leveraging a broad partner network: TetraScience has partnered with, and actively collaborates with, the industry's leading instrument and informatics software providers. This network works closely together, benefiting all of its members (and TetraScience customers) as our collective ecosystem grows. Partnering this way significantly accelerates integration development and productization, helps ensure integration quality, keeps integrations in sync with product updates, and helps guarantee that integrations fully support high-priority, real-world customer use cases.
Biopharma can best exploit its most important asset, R&D data, by implementing a purpose-built, end-to-end solution that's data-centric, cloud-native, R&D-focused, and open. Hewing closely to these principles, the Tetra Data Platform can help reduce non-strategic organizational sprawl, letting R&D IT organizations focus on adding value and pioneering new applications that accelerate discovery. At the same time, TDP can give scientists, data scientists, and other end users a self-service, unified data experience in which data experts remain in control of processing and data modeling, and can configure, manage, and track dataflows from end to end.