Using the Leaf Node model, scientists, software engineers, and data scientists can rapidly iterate on the data model and create and query standardized data sets, while semantic experts focus on the ontology and semantics. It is a division of labor that keeps common activities easy and fast while remaining compatible with the semantic world.
Written by: Spin Wang
Earlier this year, we discussed the basic concepts of the Allotrope Data Format (ADF) in this blog post: Allotrope 101. One of the key components of Allotrope is its Data Description, a triple store that leverages the semantic web and Resource Description Framework (RDF) graphs.
Traditionally, the Allotrope data description is represented by what we call a “Full Graph” stored in RDF, which looks something like the following:
The goal of the Full Graph is very exciting: it captures the relationships between the entities. For example:
Such relationships can potentially allow the machine (computer software) to understand what a sample, a chromatography injection, and an autosampler are, and how they relate to each other in an abstract way, as a lab scientist would. However, the downside is that this introduces a lot of ontological, taxonomic, and even philosophical complexity and overhead. In preparing for the machine to understand the data in the future, scientists incur more overhead today and get stuck.
The community quickly realized that it is non-trivial to build and use a Full Graph data model, for the following reasons:
In light of this observation, the community proposed the concept of the Leaf Node model, which is the theme of this blog post.
Essentially, the Leaf Node model asks: “What if we only focus on the leaves of the Full Graph, namely those nodes that are directly associated with the data?”
This allows scientists, software developers, and data scientists to quickly zoom in on the important part of the graph: the actual data fields and their values. Example Leaf Nodes look like the following:
When represented in RDF and saved in ADF, a Leaf Node is composed of the following triples:
The Leaf Node model is essentially a collection of Leaf Nodes. It also supports arrays of Leaf Nodes that are related to each other, such as an array of chromatogram peaks. We will leave this to future articles to explain in more detail.
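As a rough illustration of the idea (not the actual ADF serialization), a Leaf Node can be thought of as a small record that carries an ontology IRI, a value, and optionally a unit, and that expands into a handful of triples. In the Python sketch below, the class `LeafNode`, the node identifier `ex:cell_viability`, the predicates `ex:value` and `ex:unit`, and the value 95.0 are all hypothetical placeholders chosen for illustration; only the IRI fragment result#AFR_0001111 comes from this post.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple

Triple = Tuple[str, str, Any]


@dataclass(frozen=True)
class LeafNode:
    """Hypothetical in-memory view of a Leaf Node (not the ADF schema)."""
    node_id: str                # placeholder identifier for the node
    iri: str                    # ontology term the node is tagged with
    value: Any                  # the actual data value
    unit: Optional[str] = None  # unit, when the value is a quantity

    def to_triples(self) -> List[Triple]:
        # "ex:value" and "ex:unit" are made-up predicates for this sketch.
        triples: List[Triple] = [
            (self.node_id, "rdf:type", self.iri),
            (self.node_id, "ex:value", self.value),
        ]
        if self.unit is not None:
            triples.append((self.node_id, "ex:unit", self.unit))
        return triples


# Illustrative value only; it is not taken from a real data set.
viability = LeafNode("ex:cell_viability", "result#AFR_0001111", 95.0, "Percent")

for triple in viability.to_triples():
    print(triple)
```

An array of related Leaf Nodes, such as chromatogram peaks, could then simply be a list of such records, although the real model may represent arrays differently.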
You will probably ask: what about the semantic meaning and relationships? How is information such as “a sample is input to an experiment” captured, and how can we tell the machine what a “sample” is?
The answer to this is actually quite delightful. Since an IRI is attached to each node (for example, result#AFR_0001111 is attached to cell viability), and because that IRI is already part of an ontology, the relationships between the nodes have already been rigorously and elegantly defined in the ontology. The IRI serves as a semantic hook or label that explains what “viability” is and bridges the Leaf Node graph with the ontology. If you want a “Full Graph”, simply combine the Leaf Node graph with the ontology.
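The "combine the Leaf Node graph with the ontology" step can be sketched with plain Python sets of triples. This is a toy model under stated assumptions, not how ADF tooling does it: the node name `ex:cell_viability`, the predicate `ex:value`, the value 95.0, and the helper `labels_for` are hypothetical, and the one-triple "ontology" merely stands in for a real one.

```python
# Leaf Node graph: data only, tagged with an ontology IRI.
leaf_graph = {
    ("ex:cell_viability", "rdf:type", "result#AFR_0001111"),
    ("ex:cell_viability", "ex:value", 95.0),  # illustrative value
}

# Ontology: maintained separately by semantic experts (toy stand-in).
ontology = {
    ("result#AFR_0001111", "rdfs:label", "cell viability"),
}

# Combining the two is a plain set union of triples.
full_graph = leaf_graph | ontology


def labels_for(graph, node):
    """Follow node -> rdf:type -> rdfs:label and return the labels found."""
    types = [o for s, p, o in graph if s == node and p == "rdf:type"]
    return [o for t in types for s, p, o in graph
            if s == t and p == "rdfs:label"]


print(labels_for(leaf_graph, "ex:cell_viability"))  # no semantics yet
print(labels_for(full_graph, "ex:cell_viability"))  # label resolved via IRI
```

The point of the sketch is that the Leaf Node graph stays small and data-focused, while the meaning is recovered only when the ontology is brought in.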
With this approach, lab scientists, software engineers, and data scientists can rapidly iterate on the data model and easily create and query standardized data sets, while knowledge engineers and semantic experts can focus on developing the ontology, which is the right place to describe what a sample is and that it can be part of an experiment.
The Leaf Node approach is essentially a division of labor: it allows what should be easy and fast to actually be easy and fast, while enabling compatibility with the semantic world via the IRI. The delineation of data (captured in Leaf Nodes) and semantics (captured in the ontology) enables scientists, software engineers, and data scientists to easily create, validate, and read the data, while semantic experts focus on the ontology.
An added benefit of the Leaf Node model is that it can be easily transformed to and from popular formats such as JSON, tabular formats like CSV, and columnar formats like Parquet. These formats make the data readily compatible with common software engineering, data science, and visualization tools. Being interchangeable with popular software and data-science-ready formats is another major advantage of the Leaf Node model.
{
  "experiment": {
    "name": "my test"
  },
  "sample": {
    "id": "123"
  },
  "injection": {
    "volume": {
      "value": 1,
      "unit": "Microliter"
    }
  }
}

| EXPERIMENT_NAME | SAMPLE_ID | INJECTION_VOLUME_VALUE | INJECTION_VOLUME_UNIT |
| --- | --- | --- | --- |
| my test | 123 | 1 | Microliter |

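One way to get from the nested JSON to the flat, tabular form is to join the keys along each path with underscores and upper-case the result. The Python sketch below uses only the standard library; the helper names `flatten` and `to_csv` are our own for illustration, and the column naming rule is inferred from the example above rather than taken from any Allotrope specification.

```python
import csv
import io
import json

LEAF_JSON = """
{
  "experiment": {"name": "my test"},
  "sample": {"id": "123"},
  "injection": {"volume": {"value": 1, "unit": "Microliter"}}
}
"""


def flatten(record, prefix=""):
    """Turn nested dicts into one flat dict with PATH_STYLE column names."""
    flat = {}
    for key, value in record.items():
        column = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, column))  # descend into nested object
        else:
            flat[column.upper()] = value         # leaf: becomes one column
    return flat


def to_csv(rows):
    """Write a list of flat dicts as CSV text, using the first row's columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


row = flatten(json.loads(LEAF_JSON))
print(to_csv([row]))
```

The same flat rows could be handed to a columnar writer for Parquet. Going the other way (CSV back to nested JSON) would need a rule for splitting column names, which this sketch does not attempt.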
To find out more about how you can use the Allotrope Leaf Node model to automatically standardize your lab data for analytics, automation, and archiving, please reach out to us at www.tetrascience.com/contact-us!