Allotrope 101

Spin Wang | April 3, 2019

Authors:

Benjamin J. Woolford-Lim, Senior Laboratory Automation Software Engineer, GSK

Vincent Chan, Product Owner & Software Engineer, TetraScience

Spin Wang, Chief Technology Officer and Co-founder, TetraScience

Mike Tarselli, Chief Scientific Officer, TetraScience

Overview

How to learn the Allotrope Framework and use the Allotrope Data Format (ADF):

Don’t worry if you get stuck or have questions - try these helpful resources:

Underlying Concepts and Motivations

Motivation

As life sciences organizations become increasingly data-driven, the strategic importance of high-quality data sets grows. Scientific instruments were not historically designed as “open” systems and typically generate data in proprietary or vendor-specific formats. The resultant data silos greatly reduce the ability to gain insights and perform analytics on scientific data.

The Allotrope Foundation aims to revolutionize the way scientific data is acquired and shared, and the way actionable observations are drawn from it, by establishing a community and framework for standardization and linked data.

Underlying Concepts

Semantic Web

The Semantic Web, an extension of the World Wide Web, provides a common framework for data sharing and reuse across applications, enterprises, and community boundaries. Its goal is to make Internet data machine-readable and to integrate content, information applications, and systems. With a Web of Information, users can contribute knowledge about resources openly and freely, which leads to unprecedented growth. To implement the Semantic Web, however, there needs to be a way to interchange information.

RDF

The Semantic Web promotes common data formats and exchange protocols through a popular modeling language, the Resource Description Framework (RDF). RDF is the basis of languages such as the Web Ontology Language (OWL). It relies on the concept of “triples”: it extends the linking structure of the Web by using uniform resource identifiers (URIs) to name the relationship between things as well as the two ends of the link.

This simple model allows structured and semi-structured data to be mixed, exposed, and shared across different applications. Note that RDF as a data model is distinct from RDF/XML, a means of representing RDF in XML, largely superseded by easier-to-use formats like Turtle.

Triples

Triples store elements or facts. A set of triples can be combined to represent a graph of data and relationships in an ontology. Triples consist of a subject, a predicate, and an object:

  • The subject is the resource the fact is about. It is usually either a class from an ontology or some instance of an entity in the overall graph.
  • The predicate is the relationship between the subject and the object, such as the type of detector an instrument has.
  • The object is the value this fact asserts is related to the subject. It can either be another resource like the subject (an instance of an entity or an ontological class), or it may be some fixed literal value such as 500.14 or "Allotrope".
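The subject, predicate, and object structure described above can be sketched in plain Python. This is only an illustration of the data model, not how ADF or any RDF library stores triples; the `example.org` namespace and the property names `hasDetector`, `hasName`, and `hasValue` are hypothetical.

```python
# A triple is modeled here as a plain (subject, predicate, object) tuple.
# All example.org IRIs and property names are hypothetical placeholders.
EX = "http://example.org/"

graph = {
    # Object is another resource in the graph.
    (EX + "instrument1", EX + "hasDetector", EX + "detector1"),
    # Object is a string literal.
    (EX + "detector1", EX + "hasName", "Allotrope"),
    # Object is a numeric literal.
    (EX + "detector1", EX + "hasValue", 500.14),
}


def objects(graph, subject, predicate):
    """Return every object asserted for the given subject and predicate."""
    return {o for (s, p, o) in graph if s == subject and p == predicate}


print(objects(graph, EX + "detector1", EX + "hasName"))
```

Because the graph is just a set of tuples, adding a new fact is a matter of adding one more tuple; nothing about the existing facts has to change.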

Here is a snippet of the QUDT ontology that contains some facts about the unit of atomic mass, namely u or dalton (Da), expressed in Turtle syntax.

Once you have a graph, it can be queried using SPARQL, a query language similar to SQL but designed specifically for semantic content. Every ADF file can store triples about its data in the Data Description. Allotrope Foundation Ontologies (AFO) provide consistent terms and relationships across instruments, techniques, disciplines, and vendors.

RDF provides a mechanism that allows anyone to make a basic statement about anything and layer those statements into a single graph.

Now imagine an instrument being able to automatically produce those statements and add them into a big pool of scientific data with consistent terms and relationships between the data sets. Scientists and data analysts can now spend the majority of their time analyzing and gaining insights into the data, instead of trying to interpret or interchange it.

Wouldn’t that be pretty powerful?

URI and IRI

Remember URIs, from the RDF section above? URIs can uniquely identify a resource or name something. By leveraging URIs in the RDF framework, one can represent ontologies in a unique and even resolvable way. 

Internationalized Resource Identifiers (IRIs) extend URIs: where a URI is limited to ASCII characters, an IRI may also contain characters from the Unicode character set.
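To make the ASCII versus Unicode distinction concrete, here is a minimal Python sketch that maps an IRI to a URI by percent-encoding its non-ASCII characters as UTF-8. The function name `iri_to_uri` and the example IRI are made up for illustration, and the sketch deliberately ignores internationalized host names, which need separate handling.

```python
from urllib.parse import quote


def iri_to_uri(iri: str) -> str:
    """Percent-encode non-ASCII characters as UTF-8 bytes.

    URI delimiters and existing percent-escapes are left untouched.
    Simplification: non-ASCII host names are not handled here.
    """
    return quote(iri, safe=":/?#[]@!$&'()*+,;=%")


iri = "http://example.org/résultat"  # hypothetical IRI with a non-ASCII character
uri = iri_to_uri(iri)

print(iri.isascii())  # False: valid as an IRI, but not as a URI
print(uri)            # http://example.org/r%C3%A9sultat
print(uri.isascii())  # True
```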

SPARQL

SPARQL, pronounced "sparkle", is a recursive acronym for SPARQL Protocol and RDF Query Language. It is used to query triples stored in a database and is structurally similar to SQL. To learn more, check out the SPARQL tutorial.

Below is a set of triples in Turtle format. These triples form a graph describing a cell counter measurement: a total cell count of 1972.0.

Here is an example of a SPARQL query that obtains the total cell count from the previous graph.

Leveraging triples and SPARQL, you can perform powerful queries on top of highly connected data sets - the key benefit of using Semantic Web.
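The idea behind such a query, matching triple patterns that share variables, can be imitated in a few lines of plain Python. This is a toy sketch rather than a SPARQL engine: the `example.org` terms are hypothetical stand-ins for real ontology terms, and only the 1972.0 value comes from the example above.

```python
EX = "http://example.org/"  # hypothetical namespace, not an Allotrope one

graph = {
    (EX + "measurement1", EX + "type", EX + "TotalCellCount"),
    (EX + "measurement1", EX + "numericValue", 1972.0),
    (EX + "measurement1", EX + "unit", EX + "cell"),
}


def match(graph, pattern, binding=None):
    """Yield variable bindings for one triple pattern ('?name' marks a variable)."""
    binding = binding or {}
    for triple in graph:
        candidate = dict(binding)
        for term, value in zip(pattern, triple):
            if isinstance(term, str) and term.startswith("?"):
                # Bind the variable, or reject if it is already bound differently.
                if candidate.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield candidate


def select(graph, patterns):
    """Join several triple patterns, loosely like a SPARQL WHERE clause."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for b2 in match(graph, pattern, b)]
    return bindings


rows = select(graph, [
    ("?m", EX + "type", EX + "TotalCellCount"),
    ("?m", EX + "numericValue", "?count"),
])
print([row["?count"] for row in rows])  # [1972.0]
```

The shared variable `?m` is what ties the two patterns together: only a node that is typed as a total cell count and also carries a numeric value produces a result row.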

SHACL

The Shapes Constraint Language (SHACL) defines and validates constraints on RDF graphs. It is a relatively new standard from the W3C.

RDF triples let you express facts and connect them to other facts across datasets and domains. Super! However, this flexibility can also be a weakness: it can cause inconsistency in how data is represented. This is where SHACL comes in.

SHACL can validate an RDF graph against a set of constraints or rules. Once data models are encoded in SHACL, automatic validation can check that our triples and graphs conform to the expected model, with all required information in the correct place and linked in the right way. Validated datasets can then be searched with the same SPARQL queries, which makes linking to other datasets with the same structure easy and consistent.

Here’s an example of a SHACL file snippet:

This snippet (unnamed:checkForEntityNode) tries to make sure that for an entity node in the graph that belongs to class viability (defined as http://purl.allotrope.org/ontologies/result#AFR_0001111 in the Allotrope Foundation Ontology), there is only one numerical value and only one unit. Notice that the SHACL file is also presented as a set of triples in the Turtle format.
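The rule described above (exactly one numerical value and exactly one unit for every viability node) can be approximated by hand in plain Python to show what a SHACL validator checks. This is a rough sketch, not SHACL itself: the class IRI is the one quoted above, while the `numericValue` and `unit` predicates and the `example.org` nodes are hypothetical.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
VIABILITY = "http://purl.allotrope.org/ontologies/result#AFR_0001111"
EX = "http://example.org/"      # hypothetical namespace
HAS_VALUE = EX + "numericValue"  # hypothetical predicate
HAS_UNIT = EX + "unit"           # hypothetical predicate


def validate_viability(graph):
    """Return a list of violation messages; an empty list means the graph conforms."""
    violations = []
    nodes = sorted(s for (s, p, o) in graph if p == RDF_TYPE and o == VIABILITY)
    for node in nodes:
        values = [o for (s, p, o) in graph if s == node and p == HAS_VALUE]
        units = [o for (s, p, o) in graph if s == node and p == HAS_UNIT]
        if len(values) != 1 or not isinstance(values[0], (int, float)):
            violations.append(f"{node}: expected exactly one numeric value")
        if len(units) != 1:
            violations.append(f"{node}: expected exactly one unit")
    return violations


good = {
    (EX + "v1", RDF_TYPE, VIABILITY),
    (EX + "v1", HAS_VALUE, 95.2),
    (EX + "v1", HAS_UNIT, EX + "percent"),
}
bad = {
    (EX + "v2", RDF_TYPE, VIABILITY),
    (EX + "v2", HAS_VALUE, 95.2),
    (EX + "v2", HAS_VALUE, 96.0),  # a second value violates the rule; the unit is missing too
}

print(validate_viability(good))  # []
print(validate_viability(bad))
```

In real SHACL the same rule is declared as data (a shape with cardinality constraints) rather than written as procedural code, which is what lets the constraints travel alongside the ontology.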

Data Organization and Hierarchy

Taxonomy

A taxonomy is a hierarchical classification of entities that uses the same relationship type, e.g. "is a subclass of", throughout. Taxonomies are typically represented by a tree structure (think of the ranks of the animal kingdom: kingdom, phylum, class, order, family, genus, species).

Ontology

An ontology is a generalization of a taxonomy, with several different relationships, e.g. "is a", "has a", "contains a", and with multiple inheritance allowed in the same ontology. Whilst taxonomies can be represented as a tree due to their hierarchical nature, ontologies have more complex relationships and are modeled as graphs.
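One way to see the difference is to test a set of relationships for the two properties that make a taxonomy tree-like: a single relationship type and at most one parent per entity. The Python sketch below does only that (it does not check for cycles), and the animal examples are invented for illustration.

```python
def is_taxonomy(edges):
    """Check whether (child, relation, parent) edges could form a simple taxonomy.

    Simplified test: one relationship type, and no child with more than one
    parent. Cycle detection is intentionally left out of this sketch.
    """
    edges = set(edges)
    relations = {relation for (_, relation, _) in edges}
    children = [child for (child, _, _) in edges]
    return len(relations) <= 1 and len(children) == len(set(children))


taxonomy = [
    ("dog", "is a subclass of", "mammal"),
    ("cat", "is a subclass of", "mammal"),
    ("mammal", "is a subclass of", "animal"),
]

# Adding a second relationship type and a second parent turns the tree into a graph.
ontology = taxonomy + [
    ("dog", "has a", "tail"),
    ("dog", "is a subclass of", "pet"),
]

print(is_taxonomy(taxonomy))  # True
print(is_taxonomy(ontology))  # False
```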

Graphs and Graph Databases

Graphs are a powerful way to store and explore unstructured and semi-structured data to identify relationships between data and quickly query these relationships.

Data is most often represented in tabular form, e.g. in relational databases. What are the advantages of modeling data as a graph over the traditional relational format?

Graph databases have advantages over relational databases for use cases like social networking, recommendation engines, and fraud detection, where the relationships between data are arguably as important as the data itself. With a traditional relational database, you would need a large number of tables with multiple foreign keys to store the data, which is difficult to understand and maintain. Furthermore, using SQL to navigate this data would require nested queries and complex joins that quickly become unwieldy, and the queries would not perform well as your data size grows over time.

In graph databases, relationships are stored as first-class citizens of the data model, as opposed to relational databases, which require us to establish relationships using foreign keys. This allows data in nodes to be directly linked, dramatically improving the performance of queries that navigate relationships in the data. It also enables the model to map closely to our physical world.
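As a small illustration of direct linking, the Python sketch below stores each node's outgoing relationships next to the node and follows them hop by hop, with no join step. It is a toy in-memory model using a made-up social network, not a description of how any particular graph database is implemented.

```python
from collections import defaultdict, deque

# Hypothetical social-network edges: (source, relationship, target).
edges = [
    ("alice", "follows", "bob"),
    ("bob", "follows", "carol"),
    ("carol", "follows", "dave"),
]


def build_index(edges):
    """Store each node's outgoing relationships directly with the node."""
    index = defaultdict(list)
    for source, relation, target in edges:
        index[source].append((relation, target))
    return index


def reachable(index, start, relation, max_hops):
    """Breadth-first traversal along one relationship type, up to max_hops."""
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue
        for rel, target in index.get(node, []):
            if rel == relation and target not in seen:
                seen.add(target)
                found.append(target)
                queue.append((target, hops + 1))
    return found


index = build_index(edges)
print(reachable(index, "alice", "follows", 2))  # ['bob', 'carol']
```

Each extra hop here is one more lookup in the index, whereas the equivalent relational query would typically need one more self-join per hop.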

Additional graph information:

Tools and Technologies

  • BFO: Basic Formal Ontology, an upper-level ontology used to ensure consistent usage and linking of terms across different ontologies. It is widely used in the biomedical space, including serving as the basis for every ontology in the Open Biological and Biomedical Ontology Foundry (OBOFoundry). 
  • HDF5: A binary file format, optimized for high-performance access to large datasets. Used as an underlying technology in ADF. 
  • Jena: An Apache open source Java API supporting the use of Semantic Web approaches such as triples and SPARQL queries. Used as an underlying technology for the Data Description layer of ADF.
  • Jena Fuseki: A SPARQL server from the Apache Jena project, and a popular tool to easily test SPARQL queries.
  • Protégé: A standard ontology development and exploration tool, developed by Stanford University and provided free for general use. Watch a basic tutorial on the use of Protégé.
  • Triplestore: A database-like storage mechanism for triples, such as Jena Fuseki.
  • Turtle: A syntax for representing RDF triples in a more human-readable form than the RDF/XML standard. It is structurally similar to the SPARQL language.

Summary

By establishing a framework and unified data formats to structure the multitude of experimental data generated in life sciences R&D, the scientific community can focus on moving the needle of scientific innovation.
