Creating open data interfaces with ODPi

July 10, 2020

This article was contributed by Sean Kerner

Connecting one source of data to another isn't always easy; differing standards, data formats, and APIs are among the many obstacles to contend with. One of the groups that is trying to help with the challenge of data interoperability is the Linux Foundation's Open Data Platform initiative (ODPi). At the 2020 Open Source Summit North America virtual event on July 2, ODPi Technical Steering Committee chairperson Mandy Chessell outlined the goals of ODPi and the projects that are part of it. She also described how ODPi is taking an open-source development approach to make data more easily accessible.

While perhaps not as well-known as other Linux Foundation efforts, ODPi has actually been around since 2015. Chessell explained that ODPi's initial role was to help different vendors using Apache Hadoop to interoperate, since each had its own set of data connectors. As Hadoop usage and the number of Hadoop vendors have declined in recent years, ODPi defined a broader vision: to be an initiative focused on creating open-source data standards that help users understand and make use of data across different platforms.

ODPi has multiple projects

ODPi is an umbrella organization that is composed of multiple sub-projects, including OpenDS4All, BI & AI, and Egeria. The governance structure of ODPi follows a similar pattern to other initiatives within the Linux Foundation, with a Board of Directors handling the business operations alongside a Technical Steering Committee (TSC). Chessell described the role of the TSC as providing mentoring to the projects that make up ODPi, as well as helping the leaders of the projects with best practices and ideas for technical improvement.

While code is an important part of ODPi, technology isn't the only way to make effective use of data. Chessell noted that as organizations try to become more data-driven, they face cultural, organizational, and technology problems. "For many organizations they operate as a sort of hierarchy that creates silos between each of the different IT systems, and to make use of data you have to sort of break down those silos and allow data and collaboration to flow laterally across the organization," she said.

The OpenDS4All project is one such non-code effort within ODPi. OpenDS4All is an open data-science project that is focused entirely on education, creating materials that educators and organizations can use to build a data-science curriculum. The project got started in February 2020 based on materials originally created by professors at the University of Pennsylvania.

The material that OpenDS4All provides includes Python-based tools that can help students to learn about different aspects of data science, including data modeling, integration, and analysis. The project makes use of Jupyter Notebooks to set up the data-science environments. (LWN has looked at the use of Jupyter notebooks for education and collaboration in the past.) All of the OpenDS4All components are available via the project's GitHub repository.

BI & AI

ODPi is also host to the Business Intelligence and Artificial Intelligence (BI & AI) project. Chessell explained that BI is a marketing term used to describe data platforms that are used by companies to create reports about their operations. BI platforms also typically include the ability to make various charts from the data, as well as dashboards for companies to use to analyze their data. Chessell remarked that BI platforms include a large range of data sources that have been specially created to support the reporting process. The data that is created for BI can also be useful for AI use cases.

The goal of the BI & AI project is to help make BI data more accessible so that it can be used by AI frameworks. She added that the BI & AI project team members are working on defining a standard interface to allow AI models to be plugged into a BI platform. "So the first phase they're working on now is around the specification for this bridge between AI and BI," Chessell said. "The second phase will be actually to build a reference implementation of that bridge so that the vendors can choose to demonstrate the bridge operating with their platform."
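The idea of a standard plug-in point between a BI platform and an AI model can be illustrated with a minimal Python sketch. The talk did not describe the BI & AI specification in detail, so every name here (Model, BIPlatform, score(), MeanRevenueModel) is a hypothetical stand-in rather than anything taken from the project; the sketch only shows the general pattern of a platform coding against a shared interface instead of against one model's internals.

```python
from typing import Protocol, Sequence


class Model(Protocol):
    """Hypothetical contract a model satisfies to be plugged into a BI platform."""

    def predict(self, rows: Sequence[dict]) -> list[float]:
        ...


class MeanRevenueModel:
    """Toy model: predicts the mean of the 'revenue' column for every row."""

    def predict(self, rows: Sequence[dict]) -> list[float]:
        mean = sum(row["revenue"] for row in rows) / len(rows)
        return [mean] * len(rows)


class BIPlatform:
    """Stand-in for a vendor's BI platform that holds report data."""

    def __init__(self, report_rows: list[dict]) -> None:
        self.report_rows = report_rows

    def score(self, model: Model) -> list[dict]:
        # The platform depends only on the shared interface (assumed here),
        # so any object with a matching predict() method can be plugged in.
        predictions = model.predict(self.report_rows)
        return [dict(row, prediction=p)
                for row, p in zip(self.report_rows, predictions)]


platform = BIPlatform([
    {"region": "EU", "revenue": 10.0},
    {"region": "US", "revenue": 30.0},
])
scored = platform.score(MeanRevenueModel())
```

In this sketch, swapping in a different model requires no change to BIPlatform, which is the property a common bridge specification would presumably aim for.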

ODPi Egeria

Chessell spent most of the time during her presentation talking about the Egeria project, which is all about metadata: data that describes some collection of data. The project includes code, as well as best practices and educational elements, to help users better understand and create data systems that can interoperate with different types of metadata. She noted that many organizations use software tools that can store information about the data that they have; that metadata is used to build what's often referred to as a data catalog. The metadata can be used to provide different types of information about the data structure and how it can be organized.

Typically, each vendor's metadata repository is proprietary and locked down, such that if a company wants to move to a new tool, there is no easy way to migrate the existing metadata. "So the result was that different business units within an organization were operating in isolated silos and knowledge wasn't being shared," Chessell said. "A data-driven organization needs knowledge and data to flow laterally between the different silos."

Chessell added that although some vendors have tried to create open APIs with their own technologies, competing vendors are often reluctant to integrate with each other. That's where the Egeria project fits in; it is an open-source initiative for data APIs and protocols that brings together vendors to work cooperatively on metadata interoperability. She emphasized that the goal with Egeria is not to create a central metadata repository that every tool connects to, but rather to enable peer-to-peer communication between different metadata repositories.
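The difference between a central repository and peer-to-peer exchange can be sketched in a few lines of Python. This is not Egeria's protocol or API; the MetadataRepository class and its connect(), publish(), and find() methods are invented for illustration. Each repository keeps ownership of its own metadata and simply tells its peers about changes.

```python
class MetadataRepository:
    """Hypothetical repository that owns its metadata and notifies its peers."""

    def __init__(self, name):
        self.name = name
        self.local = {}      # metadata this repository owns
        self.replicas = {}   # copies received from peers, keyed by (origin, key)
        self.peers = []

    def connect(self, other):
        # Symmetric link: there is no central hub, each member knows its peers.
        if other is self or other in self.peers:
            return
        self.peers.append(other)
        other.peers.append(self)

    def publish(self, key, description):
        # Store locally, then tell every peer about the new entry.
        self.local[key] = description
        for peer in self.peers:
            peer.receive(self.name, key, description)

    def receive(self, origin, key, description):
        # Keep a reference copy that remembers which repository owns the entry.
        self.replicas[(origin, key)] = description

    def find(self, key):
        # Search both locally owned metadata and copies learned from peers.
        hits = []
        if key in self.local:
            hits.append((self.name, self.local[key]))
        hits.extend((origin, desc)
                    for (origin, k), desc in self.replicas.items() if k == key)
        return hits


catalog_a = MetadataRepository("catalog_a")
catalog_b = MetadataRepository("catalog_b")
catalog_a.connect(catalog_b)
catalog_a.publish("customers", "Customer master table")
catalog_b.publish("orders", "Order history table")
```

After these calls, catalog_b.find("customers") returns the entry along with its owner, catalog_a, even though catalog_b never took ownership of it.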

Metadata is not one format or data type either, which adds to the complexity of management and interoperability. Chessell said that the Egeria project mapped out the types of metadata used by organizations and found 500 different metadata types. So what the Egeria project has done is create an approach to classifying what a given piece of metadata is being used for and how it relates to other data. According to Chessell, by having a way to define what a specific piece of metadata is about, there is a common language that can be used as the basis for creating an interoperable metadata system. "As each vendor maps to the same model, we start to understand the correspondence between the different technologies and the data that they store," she said.
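As a rough illustration of how mapping to a shared model exposes the correspondence between tools, consider the Python sketch below. The common type names, the two vendors, and their native type names are all made up for this example and are not drawn from Egeria's actual type system.

```python
# Hypothetical mappings from each vendor's native type names to a shared model.
VENDOR_MAPPINGS = {
    "vendor_a": {"DB_TABLE": "Table", "DB_FIELD": "Column"},
    "vendor_b": {"dataset": "Table", "attribute": "Column"},
}


def to_common(vendor, record):
    """Translate one vendor-specific metadata record into the shared model."""
    common_type = VENDOR_MAPPINGS[vendor][record["type"]]
    return {"type": common_type, "name": record["name"], "source": vendor}


def correspondences(vendor_x, vendor_y):
    """Pair up native types from two vendors that map to the same common type."""
    by_common = {common: native
                 for native, common in VENDOR_MAPPINGS[vendor_y].items()}
    return sorted(
        (native, by_common[common])
        for native, common in VENDOR_MAPPINGS[vendor_x].items()
        if common in by_common
    )


record = to_common("vendor_a", {"type": "DB_TABLE", "name": "customers"})
pairs = correspondences("vendor_a", "vendor_b")
```

Neither vendor needs to know about the other; because both map to the same model, the pairing of DB_TABLE with dataset and DB_FIELD with attribute falls out of the mappings.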

How Egeria works

A core part of Egeria is developing the Open Metadata and Governance (OMAG) Server Platform. Chessell explained that OMAG servers can be deployed in on-premises or cloud environments to integrate a series of services for metadata discovery, governance, and interoperability.

There are three main types of OMAG servers. "Cohort Members" are used for peer-to-peer exchange with a metadata access point. Cohort Members also include a conformance test server that helps to validate that a server adheres to the OMAG specifications and can properly share metadata information. "View Servers" provide REST interfaces to connect with data repositories behind firewalls. Then there are "Governance Servers", which provide a layer of management and security services for handling metadata. Governance Servers also provide discovery services that can enable deduplication when there are multiple copies of the same data asset.
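The talk did not go into how discovery services detect duplicate assets. One simple approach, assumed here purely for illustration, is to fingerprint each asset's content with a hash and group assets that share a fingerprint. The function names and asset paths below are hypothetical.

```python
import hashlib
from collections import defaultdict


def fingerprint(content: bytes) -> str:
    """Return a SHA-256 digest that identifies byte-identical content."""
    return hashlib.sha256(content).hexdigest()


def find_duplicates(assets: dict[str, bytes]) -> list[list[str]]:
    """Group asset names whose content is byte-for-byte identical."""
    groups = defaultdict(list)
    for name, content in assets.items():
        groups[fingerprint(content)].append(name)
    # Only groups with more than one member are duplicates.
    return sorted(sorted(names) for names in groups.values() if len(names) > 1)


# Made-up assets: two identical copies and one unrelated file.
assets = {
    "crm/customers.csv": b"id,name\n1,Ada\n",
    "lake/customers_copy.csv": b"id,name\n1,Ada\n",
    "crm/orders.csv": b"id,total\n1,9.99\n",
}
duplicates = find_duplicates(assets)
```

Exact hashing only catches byte-identical copies; a real discovery service would likely need fuzzier matching to spot copies that have been reformatted or partially edited.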

Egeria is still in a state of active development. Chessell noted that the developer pieces, which include metadata repository services, are production-ready, though she said that the View Server component, which enables integration with backend metadata servers, is currently less mature.

The promise of ODPi is nothing short of making data better (which was the title of Chessell's presentation). To make use of data, organizations obviously need to be able to get access to it, so it is surprising in some respects that so much data is still in silos and locked behind proprietary interfaces. With the continued effort of ODPi and its constituent projects, hopefully more data will be open and accessible in the years to come.

