Data Catalog for Data Mesh

Data catalogs help users identify the appropriate data across multiple data stores.

Arup Nanda
11 min read · Aug 5, 2021

TL;DR

In the previous article, you learned the concept of a Data Mesh. In this article you will learn that one of the two keys to a successful Data Mesh architecture is an accurate data catalog. Creating, maintaining and validating the catalog is essential for users to find out where the data they need lives. The catalog can be updated when the data is acquired by the system, or the data can land in a raw zone and be cataloged later. Creating a single point of entry for the data, called an ingestor, allows the catalog to be maintained accurately at the time of entry.

Preamble

“I couldn’t wait anymore,” declared the visibly excited Ted. “Thank you for explaining the evolution of data systems over the years so lucidly to us, Debbie. Now please continue explaining how the Data Mesh can be successfully implemented at Acme.”

Ted, the Chief Technology Officer at Acme Widgets, was referring to an earlier meeting introducing the Data Mesh concept, as captured in this article. [Readers are urged to read that before reading this piece.]

Data Mesh, re-explained

To start the ball rolling, Debbie brings everyone up to speed on the concepts introduced in the last meeting. Most data is located in databases all over the company, in formats not necessarily conducive to analytical workloads. Analytical workloads typically benefit from a columnar datastore, where columns are stored together instead of rows. The technology used also makes a huge difference, with many vendors claiming their product is the best. Traditionally, many companies tried to bring all data into a single data lake or data warehouse for the entire company, allowing users to retrieve the data in a single place. Doing so invites the same arguments about the right format, right technology, right vendor and so on for that single data repository. Rather than being embroiled in that unfruitful debate, a Data Mesh architecture lets the data be located as close to the source as possible, with users accessing multiple datastores to get the data they want instead of a single data lake or warehouse, as shown below in Figure 1.

Figure 1: Data Mesh

Here each user has access to all the datastores on the left. The data is not copied to a single location. This leads to several advantages.

  1. Data stays in the most logical location as appropriate, close to the sources
  2. These datastores can employ the most appropriate technology, format and vendor as needed
  3. Since the data is not copied in order to be consumed, the latency of the data is reduced, making it more valuable
  4. Each datastore is independently managed, with its own desired SLAs. Downtime on one doesn’t affect the consumers of the other stores.
  5. The absence of a single datastore for the entire company makes it easier to scale and manage.
  6. Each datastore can have its own privilege management system, which could be optionally tied to a central privilege management system.

Finding the Right Data

Aaron, the Chief Architect of Acme, had been silently taking it all in. He interjects. “Getting all data into one place, be it a data lake or a data warehouse, has an enormous benefit. The data consumers can easily find all they want in a single place. Distributing it across multiple datastores will make it very hard for them, as they have to scour many places to find what they are looking for,” he challenges, “isn’t it?”

Debbie smiles. In her long career as a data leader she has seen her share of different problems getting conflated, and it is no wonder those problems remain unsolved. She clarifies that there are two different problems here.

  1. Getting the right data from the right place, e.g. running a SELECT statement on the exact database with the exact table and column. This requires the user to have the privilege on the data and to use the right technology to get it. For instance, running this against an Oracle Database may require a SQL statement over a database connection, while getting the same data from an S3 bucket in AWS may require an Athena query (see the sketch after this list).
  2. Locating the right data, i.e. before running the query, the user must identify that right database, right table and the right column.
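To make the first problem concrete, here is a minimal sketch of what “same question, different access technology” can look like. The connection details, table name and S3 output location are hypothetical, and the sketch assumes the python-oracledb and boto3 libraries are available.

```python
# Hypothetical sketch: the same logical question against two different stores.
import boto3      # AWS SDK, used here to submit an Athena query
import oracledb   # driver for Oracle Database

QUESTION = "SELECT COUNT(*) FROM sales WHERE sale_dt >= DATE '2021-07-01'"

# If the data lives in an Oracle Database: plain SQL over a database connection.
with oracledb.connect(user="sally", password="***", dsn="acme-db/ORCL") as conn:
    with conn.cursor() as cur:
        cur.execute(QUESTION)
        print("Oracle:", cur.fetchone()[0])

# If the data lives in S3: the same logical question goes through Athena.
# (Athena is asynchronous; results are fetched separately once the query completes.)
athena = boto3.client("athena")
athena.start_query_execution(
    QueryString=QUESTION,
    QueryExecutionContext={"Database": "acme_sales"},
    ResultConfiguration={"OutputLocation": "s3://acme-athena-results/"},
)
```

The point is not these particular APIs; it is that the access mechanics differ per store, which is the first problem. The second problem, knowing which store to go to at all, is what the catalog addresses.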

Debbie points to Sally, the sales analyst, who wants to get the following question answered:

How many widgets were sold last month?

To answer the question, Sally needs to know:

  1. Which specific database to go to
  2. Which specific table and column has that information
  3. If multiple tables and columns have that information, which ones to run the query against
  4. What type of technology to use to get that information

“Ever wonder how Sally usually gets this information?” Debbie asks the audience. In many cases, she says, this becomes tribal knowledge. Sally just knows it, perhaps because she has been doing this for a while after getting the information from someone earlier and trusting it. If she does not know, she has to ask someone for this information.

That is highly unscalable, Aaron observes. Of course, Debbie completes his thought, that is precisely the point. Relying on tribal knowledge of where to look for data is not a practical approach. Instead, the information about the data — sometimes referred to as metadata — needs to be deliberately developed and maintained. The place where it is kept is generally known as a Catalog.

Elements of Catalog

A catalog is where users like Sally would go to find out where the data they are looking for can be found. Going back to the question Sally was trying to answer:

How many widgets were sold last month?

Sally will first need to find out which datastore, and which tables and columns, contain data on sales. Imagining a search interface for a hypothetical catalog search engine, Debbie explains that Sally would likely search for a term like SALES across both table and column names, and the catalog would, hopefully, show where the data resides along with the table and column names.

Another element to be stored in the catalog is the “Destination”, i.e. where the data is stored. When users like Sally look for the presence of data they are interested in, they need to look into the Catalog anyway, and they will need to get the datastore name in addition to the tables and columns. Whether the data sits in a single data lake or warehouse, or in multiple places, does not really matter as long as the location of the data is also available to the end users.
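As a rough illustration, here is what a minimal catalog entry and search could look like. The field names, the example entry and the raw flag (used again later for the raw zone) are illustrative assumptions, not a prescribed catalog schema.

```python
from dataclasses import dataclass

@dataclass
class CatalogEntry:
    """One registered dataset: what it contains and where it lives (the destination)."""
    datastore: str           # which database, bucket or warehouse holds the data
    table: str
    columns: dict[str, str]  # column name -> description
    description: str = ""
    raw: bool = False        # raw-zone data is hidden from general search

CATALOG = [
    CatalogEntry(
        datastore="oracle://acme-db",
        table="SALES",
        columns={
            "SALE_ID": "unique sale identifier",
            "SALE_DT": "date of the sale",
            "CUST_ID": "customer who bought the widget",
            "PRODUCT_ID": "widget sold",
            "AMOUNT": "sale amount",
        },
        description="One row per completed widget sale",
    ),
]

def search(term: str) -> list[CatalogEntry]:
    """Return non-raw entries whose table, columns or description mention the term."""
    term = term.upper()
    return [
        e for e in CATALOG
        if not e.raw
        and (term in e.table.upper()
             or any(term in c.upper() for c in e.columns)
             or term in e.description.upper())
    ]

# Sally's hypothetical search: shows the destination along with tables and columns.
for hit in search("sales"):
    print(hit.datastore, hit.table, list(hit.columns))
```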

Aaron mulls over his own question. It is true that under Data Mesh the user may find the information in many stores rather than a single datastore. That discovery is done in the Catalog anyway; hence the number of datastores is not important. That answers the first question about locating the right data.

Keeping the Catalog Current

This brings up another question: how to make sure the Catalog reflects the most current state of the data. For instance, a table SALES in the Catalog may show up as:

SALE_ID
SALE_DT
CUST_ID
PRODUCT_ID
AMOUNT

However, the actual data may contain an additional column — PRICE_PAID — to reflect the actual price paid. The catalog entry assumed that the price paid by the customer would be derived from the PRODUCT table. This was probably correct when the data structures were designed; but later the data modelers found that the actual prices paid were different, so they stored the actual price paid on each sales record in the data itself but neglected to update the catalog. A user checking the catalog will find wrong information. The problem may manifest in other, worse ways: for instance, the Catalog may contain a column name that the actual data does not have, and a user will be mistakenly led to the data only to find that it doesn’t have what is needed.

Hence the catalog must be kept in sync with the actual data. To do that, Debbie suggests a two-pronged approach:

  • Front Book — where each dataset (a single record or a whole file with multiple records) coming in must have an entry in the catalog before it is stored in a data store.
  • Back Book — despite the best efforts in the front book, some datasets will fall out of sync. A process needs to exist that extracts the schema from the actual data, compares it against the Catalog entry and updates it as necessary (see the sketch below).

Metadata should be able to be registered in two ways: Frontbook: while ingesting; and Backbook: from already ingested data. Both are needed in a Data Mesh.
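Here is a minimal sketch of the back-book comparison: extract the columns actually present in the stored data, diff them against what the catalog says, and repair the drift. It reuses the illustrative CatalogEntry from the earlier sketch and uses pandas only for convenience; a real system would read just the schema rather than the whole file.

```python
import pandas as pd

def extract_schema(path: str) -> set[str]:
    """Back book: pull the actual column names from stored data (a Parquet file here)."""
    return set(pd.read_parquet(path).columns)

def reconcile(entry: "CatalogEntry", actual_columns: set[str]) -> None:
    """Compare the catalog entry against the actual data and repair or report drift."""
    registered = set(entry.columns)
    missing_in_catalog = actual_columns - registered  # e.g. PRICE_PAID added to the data
    missing_in_data = registered - actual_columns     # cataloged but not actually present
    for col in missing_in_catalog:
        entry.columns[col] = "TODO: description pending"  # register the new column
    if missing_in_data:
        print("Catalog lists columns absent from the data:", sorted(missing_in_data))

# Hypothetical usage against the SALES dataset:
# reconcile(CATALOG[0], extract_schema("/data/sales/2021-07.parquet"))
```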

Enter the Ingestor

The group silently mulls the proposal. Ted breaks the silence: “The front book approach sounds the best.” Of course, Debbie adds; but the key is to make sure that the process to register the data before it is stored is in place. To enforce this mandatory registration, she proposes that data be accepted into the data ecosystem only via a single interface known as the “Ingestor”. The producers will send data by calling this ingestor process, which can execute the registration. Since this is the only way to get data in, the front-book registration is guaranteed, Debbie contends. She presents a flowchart, shown in Fig 2, to explain the process.

Figure 2 Flowchart for Front Book Registration in Catalog

Walking everyone through the flowchart, Debbie explains to the team that the concept is simple. The activities are:

  1. Infer or extract the schema from every dataset coming in
  2. If there is already an entry in the Catalog, check that the incoming schema conforms to it
  3. If the schemas do not match, then put that dataset in a special zone called Quarantine
  4. If it is a new dataset and the Catalog entry does not exist, simply register it in the Catalog

In all cases, the data producer needs to be notified of the outcome at the end so they can take corrective action on the quality of the data coming in.

Deploying an Ingestor as a single point of entry of any data into the data ecosystem ensures the catalog is accurate and current.
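As a concrete illustration of Figure 2, here is a minimal sketch of an ingestor, reusing the illustrative CATALOG and CatalogEntry from the earlier sketch. Quarantine handling, the actual write to the datastore and the notification channel are stubbed out, since the point is the decision flow, not a particular implementation.

```python
def ingest(datastore: str, table: str, dataset: list[dict], producer: str) -> str:
    """Single point of entry: every dataset passes through here before it is stored."""
    inferred = {col for row in dataset for col in row}             # 1. infer the schema
    entry = next((e for e in CATALOG
                  if e.datastore == datastore and e.table == table), None)
    if entry is None:                                              # 4. new dataset: register it
        CATALOG.append(CatalogEntry(
            datastore, table, {c: "TODO: description pending" for c in inferred}))
        outcome = "registered in catalog and stored"
    elif inferred == set(entry.columns):                           # 2. conforms to the catalog
        outcome = "stored"
    else:                                                          # 3. mismatch: quarantine zone
        outcome = "quarantined for schema mismatch"
    notify(producer, outcome)                                      # always notify the producer
    return outcome

def notify(producer: str, outcome: str) -> None:
    """Stub: in practice this would be an email, a ticket or a message on a queue."""
    print(f"notify {producer}: dataset {outcome}")
```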

Backbook Cataloging

But the whole point of a data lake is providing a place for unstructured data, counters Aaron, and therefore forcing schema conformance upfront is not a good idea.

Of course, concedes Debbie. The data lake can ingest raw data whose structure is yet to be established and whose meaning is yet to be derived. Once that data lands in the lake, there are two potential issues to handle:

  1. It can inadvertently be exposed to the normal data consumers who may either not make any sense of it or draw incorrect conclusions about the schema leading to wrong interpretations.
  2. The metadata could change as new data comes in, e.g. the schema could change for the same dataset but only for a new partition.

Hence, Debbie explains, there is a need to extract the schema from existing datasets in the data destinations where it could change, and then compare the extracted schema against the currently registered schema in the catalog. This is typically called back-book catalog maintenance. The good news, she says, is that this does not have to be a separate system. Back-book catalog maintenance can leverage the same process developed for front-book catalog entry (shown in Fig 2). In fact, Debbie explains, in some cases it may be necessary to accept data whose schema is not known and infer the schema afterwards by applying predefined or experimental logic specific to the datasets. In those cases the datasets can be stored in a special raw zone to be explored later. The key, Debbie cautions, is to make sure that the raw zone:

  • has minimum registration in the catalog, e.g. the data owner, the location, etc.
  • is marked as such, i.e. it is not exposed to the consumers as fully formed schemas

These two measures, Debbie contends, are enough to satisfy the key requirements of a data mesh, i.e.

  1. Producers can place the data wherever its best location is
  2. Consumers can identify the location of the data from the single catalog
  3. The data is still restricted; not generally available, which prevents misinterpretation.

The raw zone, Debbie explains, does not have to be a separate physical location. It can be in the same physical location as the other discoverable data, with one exception: its attributes should label it as raw, which keeps it undiscovered in general.

Unclassified data with an unknown structure needs to land in a logical raw zone where it is not discoverable to the general audience until the structure is established. Without this, the data may be misinterpreted.
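Building on the earlier catalog sketch, a raw dataset can be given just a minimal entry and flagged so that general search skips it; the owner, location and dataset name below are hypothetical.

```python
# Raw data gets a minimal catalog entry, labeled so general discovery skips it.
CATALOG.append(CatalogEntry(
    datastore="s3://acme-raw-zone/clickstream/",    # location is still recorded
    table="clickstream_2021_08",
    columns={},                                     # structure not yet established
    description="owner: web team; schema to be inferred later",
    raw=True,                                       # hidden from general search
))

# Not discoverable to the general audience until the schema is established:
assert search("clickstream") == []
```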

But Catalog is not Enough

The audience members took some time to digest the proposal. By and large, the design makes sense — they agree; but Sally still looks a bit skeptical. Sure, the catalog, when searched, brings up a large number of tables and columns; but how she is supposed to pick the right one for the data she wants, she muses.

That is a thorny problem, Debbie immediately concurs. The mere presence of tables and columns will not help Sally. The catalog needs to show the comments or descriptions of the columns as well. The description provides additional explanation to aid the consumer in the task of zeroing in on the right tables and columns.

“Hmm…,” Sally seems even more skeptical. “Who actually writes these descriptions and how do I know that these are even accurate?”

Debbie smiles. This is why, she explains, a single catalog is not enough to locate the right data. Other attributes of the data elements are needed as well, such as who has used it, where it comes from, what classification is assigned to it, where it goes, what kind of checks were performed on it, when it was last refreshed, and so on. This is a whole different focus on metadata called data observability, which Acme’s data management team has implemented.

Unfortunately, the time was up, Ted observed, and asked Debbie to hold another session on that later. The meeting was adjourned.

Summary

  1. Since the data can be found in various places, a single catalog for the entire company is important for consumers to find the correct location of the data
  2. The catalog needs to contain the correct description of the structure of the data, without which the consumer may misinterpret the data. For instance, the data may hold a social security number in positions 1–9; without the description, the consumer will not be able to know that.
  3. The schema can be derived at the time of entry of the data into the ecosystem (frontbook), or later (backbook) using the same process.
  4. If the structure of the data can’t be, or shouldn’t be, discovered at the time of entry, the data should be landed in a special zone and its discoverability should be limited.
  5. Using a single point of entry into the ecosystem will ensure schema extraction, registration in the catalog and conformance to an existing entry. This single point of data acquisition is called an Ingestor.

