April 6, 2020
Data Integration

TetraScience ADF Converter -- Delivering on the Promise of Allotrope and a Startup’s Journey

This week, we are very excited to launch the first cloud-native, data science ready ADF Converter. Our application converts your heterogeneous experimental data sets into ADF files, leveraging the Leaf Node pattern, the Allotrope Foundation Ontology and QUDT for ontology, and HDF5 for the underlying format. We would like to share how we got started and how we gained the clarity and conviction to officially support ADF on our journey.

We believe that storing your heterogeneous experimental data sets in a single HDF5 file, embedding ontological labels on key data fields, and doing so in a way that can be consumed by any data science or third-party software and extended to a full semantic graph, is indeed a promising data management strategy for organizations to consider.

Authors: Evan Anderson - Delivery Engineer, Vincent Chan - Product Owner/Software Engineer, and Spin Wang - CEO & Co-Founder

Phase 1: Excitement

TetraScience joined the Allotrope Foundation in 2017 as a member of the Allotrope Partner Network. For the last 2.5 years, we have worked actively with the foundation and its working groups, many pharmaceutical companies, and instrument manufacturers to pursue the vision of experimental data standardization for the Digital Lab. We are deeply impressed and motivated by the community built by and around the Allotrope Foundation, and by their deep belief in data standardization to reduce error and enable analytics and data science.

Phase 2: Questions and Exploration

However, truth be told, for the team at TetraScience, it was not a smooth journey; we struggled and frequently questioned our choices.

In the beginning, when the challenge was mainly about learning the various concepts involved in Allotrope, the community provided a lot of assistance and helped us become more knowledgeable. Our team must have read the ADF Primer more than 10 times. After we were able to convert one file, we struggled to convert other data types scalably and repeatably; then we struggled to make sure the files could be consumed easily and the information extracted.

Because of our conviction in the vision, and the unwavering enthusiasm we saw from community members, these challenges did not really discourage us. Instead, they became exciting problems that our engineering and product teams discussed on a daily basis. We treated them as opportunities to contribute, and received tremendous help from the Foundation and the community.

We published our internal training document, Allotrope 101, and contributed the automation and libraries we built to the Foundation. We gave at least two presentations at each of the past five Allotrope Connect workshops to report our findings and progress. As we contributed, we also voiced our concerns and questions:

  • Why not just zip a folder to collect all the experimental data?
  • Why not use popular file formats like Apache Parquet and JSON?
  • The full semantic web (full graph) approach is very time consuming: building a model, generating instance graphs, validating, and reading all take significant effort. How could we use ontology in a more scalable way?
  • ADF files can only be read using the Foundation's Java/C# library. How could we lower the barrier to consumption and make ADF compatible with data science applications?
  • How could we quickly and cost-effectively access a large number of ADF files stored in the cloud? In other words, what is the Hadoop/Spark equivalent for ADF?

For a start-up with high opportunity cost and limited resources, this is a major decision. We cannot support Allotrope without conviction that it has a differentiated value proposition and without genuinely feeling confident. We relentlessly asked ourselves: what are the use cases, why do those use cases have to be achieved by ADF and not the alternatives, and, by the way, what are the alternatives?

That’s why, in the last 2.5 years, despite our active participation, we have always been hesitant about what we should do. Our strategy is not to follow and adopt blindly. Instead, we kept questioning why the use cases have to be achieved by ADF, and what ADF’s unique advantages are.

Phase 3: Clarity and Conviction

With the help of the Foundation, our customers, and through many debates and iterations, we started to gain more clarity around ADF’s unique value proposition.

ADF, as a file format, is really HDF5 + semantic web RDF graphs. These solve very different use cases and have specific, unique benefits.

On the HDF5 side

  • HDF5 is a well supported and widely adopted open technology for storing heterogeneous data sets, such as text files, binary files, images, and large numerical multi-dimensional matrices, together in one file. This makes HDF5 more suitable than a zip file, JSON, or Parquet for packaging everything related to an experiment into one portable file: a folder with large numerical matrices stored in clear text is not efficient, and Parquet is not designed to hold multiple different file formats. HDF5 reduces the burden of data integrity checks, portability, and archiving, especially for data sets with multi-dimensional numerical arrays (see the short sketch after this list).
  • ADF is built on HDF5, allowing for the theoretical possibility that any tools supporting HDF5 could open ADF files (if the ADF files are constructed with this in mind).
  • The HDF Group’s Kita and HSDS have started to provide a viable and promising option for storing a vast number of HDF5 files in the cloud (e.g., AWS S3) and performing distributed queries. Now you can slice and dice chromatograms stored as HDF5 files in the cloud.
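To make the packaging benefit concrete, here is a minimal sketch using h5py (not the actual ADF layout; the file, group, and dataset names below are hypothetical) of how heterogeneous experimental outputs can live together in one HDF5 file:

    import numpy as np
    import h5py

    # A minimal sketch (not the actual ADF layout): packaging heterogeneous
    # experimental outputs into a single HDF5 file. All names are hypothetical.
    with h5py.File("experiment.h5", "w") as f:
        # Large numerical matrix (e.g., a plate reader or chromatography trace),
        # stored with compression.
        f.create_dataset("raw/absorbance", data=np.random.rand(1000, 96), compression="gzip")

        # Free-text method description stored as a variable-length string.
        f.create_dataset("method/description",
                         data="Gradient method, 0-95% B over 20 min",
                         dtype=h5py.string_dtype())

        # An opaque binary artifact (e.g., an instrument vendor file) kept as raw bytes.
        vendor_bytes = b"\x89VENDOR..."  # stand-in for the real vendor file contents
        f.create_dataset("attachments/vendor_output",
                         data=np.frombuffer(vendor_bytes, dtype=np.uint8))

        # Key metadata attached as attributes on the root group.
        f.attrs["instrument"] = "HPLC-UV"
        f.attrs["operator"] = "jdoe"

Everything related to the experiment, from the raw numerical matrix to the vendor file and the metadata, then travels as one portable file.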

On the semantic web RDF side

  • A full semantic graph can be complicated and time consuming to implement (here are some good articles about the semantic web: What Happened to the Semantic Web? and the Hacker News discussion). However, it does not need to be the only way to get to an ontologically aware data set.
  • The Leaf Node pattern simplifies the semantic web framework, allowing automatic generation and querying of knowledge graphs by leveraging reusable building blocks. Allotrope Ontologies can easily be embedded in the leaf nodes, while the ontologies themselves provide semantic relationships and explanations for each term. Leaf nodes allow scientists, software engineers, and data scientists to easily create, validate, and read the data, while semantic experts can focus on the ontology (a minimal illustration follows this list). We dive deeper into the Leaf Node Model in this blog post: Allotrope Leaf Node Model -- the Balance between Practical Solution and Semantics Compatibility
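As a rough illustration of what "embedding ontological labels" can look like, here is a hedged sketch using rdflib; this is not the official Leaf Node model, and the AFO class IRI and QUDT terms below are purely illustrative placeholders:

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, XSD

    # Illustrative namespaces; the specific IRIs below are placeholders, not official terms.
    AFO = Namespace("http://purl.allotrope.org/ontologies/result#")
    QUDT = Namespace("http://qudt.org/schema/qudt/")
    UNIT = Namespace("http://qudt.org/vocab/unit/")
    EX = Namespace("http://example.com/experiment/")

    g = Graph()
    measurement = EX["peak-area-001"]

    # One "leaf": a measurement labeled with an ontology class, a QUDT unit, and a value.
    g.add((measurement, RDF.type, AFO["AFR_0001073"]))            # illustrative AFO class IRI
    g.add((measurement, QUDT.unit, UNIT["MilliAbsorbanceUnit"]))  # illustrative unit IRI
    g.add((measurement, QUDT.value, Literal(42.7, datatype=XSD.double)))

    print(g.serialize(format="turtle"))

The point is that a scientist or software engineer can generate such leaf nodes mechanically, while the semantic experts curate the ontology terms they reference.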

It is also important to realize that ADF is not suitable for all use cases; it is not a magic bullet that will solve every issue.

  • For example, if you simply need a single file for a pandas DataFrame, HDF5 may not be the best option; you may want to consider Feather or Apache Parquet instead (see the short sketch after this list).
  • For example, systems like GE UNICORN, Waters Empower, and Shimadzu LabSolutions client/server edition are based on databases; it is not possible to simply export the data related to one experiment and save it into ADF, so ADF is not suitable as the archive format for them. You can back up the entire database and save it in ADF, or export a single experiment's data into a file and then put it into ADF, but you cannot restore that file back to the vendor software. For the majority of file-based data outputs, however, ADF can indeed be used for archiving.
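For the single-DataFrame case mentioned above, a columnar format is often the simpler choice. A quick sketch (file names are arbitrary, and pyarrow is assumed to be installed):

    import pandas as pd

    # A single tabular DataFrame is often better served by columnar formats than by HDF5.
    df = pd.DataFrame({"time_min": [0.1, 0.2, 0.3], "absorbance_mau": [1.2, 5.4, 3.3]})
    df.to_parquet("trace.parquet")   # requires pyarrow or fastparquet
    df.to_feather("trace.feather")   # requires pyarrow
    roundtrip = pd.read_parquet("trace.parquet")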

Rather than discouraging us, understanding these limitations gave us much more clarity on how to recommend ADF to our customers, and conviction about where ADF is uniquely positioned to excel!

We believe that storing your heterogeneous experimental data sets in a single HDF5 file, embedding ontological labels on key data fields, and doing so in a way that can be consumed by any data science or third-party software and extended to a full semantic graph, is indeed a promising data management strategy for organizations to consider.

During this process, we continue to be encouraged by the community’s enthusiasm -- people from different pharmaceutical companies and manufacturers working together collaboratively, iterating towards a common vision, willing to learn from past experience, and finding the right balance between vision and execution. Many of these ideas and opinions were inspired by others in the community, and we could not be more grateful for that inspiration!

Today: ADF Converter

This week (early April of 2020), we are very excited to launch the first cloud-native, data science ready ADF Converter. It is an application built on the TetraScience Data Integration Platform that converts your experimental data into ADF files, leveraging the Leaf Node pattern, the Allotrope Foundation Ontology (AFO) and QUDT for ontology, and HDF5 for the underlying file format. It supports all the instrument types modeled thus far by the Allotrope Modeling Working Group using the Leaf Node pattern, and is expected to support up to 40 instrument models by the end of 2020. You can leverage these models to document and analyze key experimental data in ELNs, LIMS, visualization and data science tools, or any software you write yourself.

Here are our core convictions that shaped the ADF Converter.

First, we believe that the barrier to data access needs to be as low as possible. The data format serves the needs of the use cases and is a means to an end.

Right now, the only way to read an ADF file is via the Java/C# library provided by the Allotrope Foundation. This limits the ability of Electronic Lab Notebooks, data science tools (Python, R), and business intelligence and analytics tools (Dotmatics Vortex, TIBCO Spotfire, Tableau) to use ADF files.

To address this limitation, we made sure that ADF files can also be opened using Python and R, easily consumed by data scientists, and made compatible with the HDF Group's Kita and HSDS. Now, using any tool or programming language compatible with HDF5, or R Data Functions in Spotfire, you can easily extract information from ADF into your data science applications.
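Here is a minimal sketch of what that access pattern can look like, assuming h5py and, for cloud-hosted files, the HDF Group's h5pyd client; the file path and dataset names below are hypothetical:

    import h5py

    # A converted ADF file is plain HDF5 underneath, so standard HDF5 tooling can open it.
    # The path and dataset names below are hypothetical.
    with h5py.File("converted.adf", "r") as f:
        f.visit(print)                        # list every group/dataset in the file
        absorbance = f["raw/absorbance"][:]   # read a numerical array into NumPy

    # For files hosted on Kita/HSDS, h5pyd offers a near drop-in replacement for h5py:
    # import h5pyd
    # with h5pyd.File("/shared/converted.adf", "r", endpoint="https://hsds.example.com") as f:
    #     ...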

Image: use Python to explore and visualize your ADF file

Read ADF using Python





Second, we believe that a programmatic API and automation are crucial for data standardization. Standardization is only meaningful when the number of files following the standard reaches a critical mass, and to reach that critical mass, automation is necessary. As a result, the ADF Converter has a flexible programmatic interface for you to ingest input files, monitor conversion progress, and search and retrieve converted ADF files. Third-party applications can now leverage this API for conversion.
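To give a feel for the kind of automation this enables, here is a purely illustrative sketch of driving a conversion service over HTTP; the endpoint paths, parameters, and response fields are hypothetical placeholders, not the actual ADF Converter API:

    import requests

    # Hypothetical endpoints and fields, shown only to illustrate an automated
    # ingest -> monitor -> retrieve workflow; not the real ADF Converter API.
    BASE_URL = "https://adf-converter.example.com/api"
    headers = {"Authorization": "Bearer <api-key>"}

    # 1. Ingest an input file for conversion.
    with open("hplc_run.txt", "rb") as src:
        job = requests.post(f"{BASE_URL}/conversions", headers=headers, files={"file": src}).json()

    # 2. Monitor conversion progress.
    status = requests.get(f"{BASE_URL}/conversions/{job['id']}", headers=headers).json()["status"]

    # 3. Retrieve the converted ADF file once the job completes.
    if status == "completed":
        adf_bytes = requests.get(f"{BASE_URL}/conversions/{job['id']}/adf", headers=headers).content
        with open("hplc_run.adf", "wb") as out:
            out.write(adf_bytes)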

Image: ADF Converter ingesting files + third party application retrieving the files

ADF Converter API



Third, we believe in starting with simple patterns: simplification allows automation, automation leads to scale, and scale leads to adoption and momentum, which in turn surface more user requirements and use cases. These requirements and use cases help solidify the next iteration. The Agile development philosophy also applies to data modeling. The ADF Converter starts with the Leaf Node pattern and will include the aggregation pattern after the working groups finish the design.

Image: Allotrope Leaf Node pattern

Leaf Node

End

It has been a great journey. We encourage more people and organizations to get involved and participate actively. It is a collaborative community, and participating in it builds both value and deep relationships! We hope our effort will help organizations overcome the barrier of initial adoption and start to truly evaluate and explore ADF in real production settings against their own use cases. Only then can a data format like this be truly battle-tested and demonstrate its vitality. There are still many unknowns, and we look forward to further collaborating on this journey of Data Standardization for Digital Labs!

Read more about the ADF Converter and our approach, or contact us with questions, comments, or feedback.

