What is Data Mesh Anyway?

<a href=’https://www.freepik.com/vectors/technology'>Technology vector created by macrovector — www.freepik.com</a>

TLDR;

Background

Almost as if on a cue her phone rang. It was Ted, the CTO of the company, asking her to drop everything and to get on a video conference call. On the call were many top executives including Ted. “I just heard about this wonderful concept called Data Mesh”, starts Ted after thanking Debby for joining the call. “And we in the executive committee are perplexed: why are we not using it?”

Debby sighed. Distilling concepts of data management had always been a bit challenging; but it has been particularly so in the era of video conferencing thanks to the pandemic that has redefined the workplace lately. Without wasting a single second, she answered, “But of course, Ted, we are using it”.

“Huh?” asks the Aaron, the Chief Architect of the company. “I don’t remember seeing a single document that drives this concept at a high level.”

“Debby always knows her stuff and this time likely be no exception,” suggests Ted. “Why don’t you give us an overview of this concept, why it is useful and how we are using it in Acme?”

Over her entire career Debby has been championing clear thinking amidst the swirling jargons, using advancement in technology to drive strategy instead of shoehorning the strategy to fit into some buzzword. Now, she realizes, is such a moment.

Evolution of Data Systems

Figure 1 File Based Applications
Figure 1 File Based Applications
Figure 1 File Based Applications

This, Debby continues, led to two problems:

The data was location specific. For instance, you would have had to know that you were expecting to get balance of account number 123, you will need to read the 51st column from the 100th record from the 23rd file. If a new record were to be added, the 100th record might have become 101st and the file might have become 24th. A data consumer would have had to know the location such as the specific file and the relative position of the record. Similarly if a new column were to be added, the 51st column could have become 52nd, forcing your understanding the change as well.

You would had to write a program every time you wanted the data. Most data users were not programmers; so they had to rely on the programmers to get them the data they needed. The programmer pool was limited and so was this approach. Hence the data access was almost always for programmatic and processing oriented rather than adhoc analysis. In other words, data was just a bunch of text to be used by programs; not for consumers to glean meaningful insights.

The Evolution of SQL

First, it still used files to store data; but it separated the physical location of the data from the logical presentation. For instance, to read the balance of account 123, shown in the previous section, the user did not have to know the exact location, i.e. the file, record and column position. The user simply asked for the information and the system translated the exact location that could be found and returned the value, as shown in Fig 2.

Figure 2 Role of a Database Management System
Figure 2 Role of a Database Management System
Figure 2 Role of a Database Management System

“Sort of like the DNS where an address like acmewidgets.com actually points to a numeric IP address?” ponders Ted.

“Exactly,” confirms Debby. It translated the request to actual location, read it and returned the value to the user. Similarly when a new record or a new column is introduced, it simply made the change in the file keeping it abstracted from the user. This gave rise to a new type of system called a Database Management System. Almost all modern data systems follow this principle. The database management system also moved the records or files around to the most optimal location for performance and reliability without with complete opacity to the end user.

Second, it introduced a new language called Structured Query Language (SQL) to read and write data, which is very English-like. Most business users learned it quickly and could get the data without the reliance on a program. This led to a whole new concept called adhoc data analysis. In short, SQL brought the data and users much closer and allowed the value in data to be unlocked significantly.

Distributed Databases

Since database systems made it easy for users to get the data they need to function, the databases proliferated. Soon there were multiple databases in a single company, each one built for a specific purpose or for a department. For instance, there was a database for Finance department for the use of those in that specific department. And there was a different one in Sales used by users there, as shown in Fig 3.

Figure 3 Separation of Data Along Divisional Boundaries
Figure 3 Separation of Data Along Divisional Boundaries
Figure 3 Separation of Data Along Divisional Boundaries

This worked well as the databases were divided along functional or departmental boundaries. But soon it became a victim of its own success, Debby explains. Suddenly people realized that they could do a whole lot of analysis with the data at the disposal, only if they had access to it. For instance some folks in Finance department felt they really couldn’t do forecasting since they never had access to the data from Sales. What if they had data from Sales, and all other departments while at it? They would have been able to perform their jobs more effectively.

Data Landscape expanded the access to multiple datastores in two different ways: Copy and Bridge.

Data Copy

Data Bridge

Figure 4 Data Bridge
Figure 4 Data Bridge
Figure 4 Data Bridge

“This,” Debby clarified “is sometimes known a distributed database system where the data used by applications and users are spread over multiple databases but still available from one.

“Sounds like a reasonable solution”, the audience muses. “What went wrong?”

The problems, Debby explains, are many:

  1. This is usually a vendor specific solution. The databases need to be on the same vendor’s platform for this bridge to work in many cases.
  2. Sometimes the vendors allow exposing the data from another vendor’s database; but that is limited to only the supported file types, e.g. external tables in Hadoop on Parquet files. But the interoperability is limited.
  3. Even in those cases the recipient database may need to interpret the files, a process known as serialization/deserialization (serde), which reduces performance and negates the advantages of a database engine.

However, if an organization standardized on a single database vendor, it was possible to create a truly distributed database platform.

“But”, counters Aaron, “isn’t the whole point of database platform being the best for that purpose? Going to a single vendor for all types of database needs completely destroys that concept.”

Rise of the Warehouse

So the industry looked forward. Since the 1990s another concept was coming to the data landscape — data warehouses. More precisely, the concept was analytical processing, as opposed to transactional processing that was common for data systems until then. A warehouse was built along an offshoot of relational model called a dimensional model. The big difference between the two was that with dimensional model, only a few tables (called “fact”) are allowed to grow to capture the details while the other types of data which does not change as much are allowed to stored in tables known as “dimension” tables. This design was great for boosting performance for analytical processing and hence needed different types of technology.

“Can you give some examples of what were the different types of technology?” asks Ted.

In a typical relational database systems for transaction processing, Debby explains, the data is stored in records with each record as an amalgamation of columns packed together, as shown in the left side of Fig 5, under the label Row Database.

Figure 5 Difference between Row and Column Databases
Figure 5 Difference between Row and Column Databases
Figure 5 Difference between Row and Column Databases

Pointing to the figure, Debby explains that the row database has 999 rows and each row is marked by a “start of row” marker. That’s how the database system knows where the records are split from each other. The other type of database called Column Database, shown on the right hand side of the Fig 5. The database does not store the data as records; but as columns. In this case, it takes all the values of the column called Account No and creates a column group in that name. Then it repeats the process for all the other columns.

“But why,” asks a clearly intrigued Aaron.

“Remember, a typical analytical query asks for an aggregation function, e.g. total of all balances, or average, min, max, etc.” Debby explains. “In case of a row database, the system has to find the beginning of record marker, locates the column, gets the value, repeats it for all rows. This takes a lot of time. In a columnar database, it gets the entire column values in one shot and applies the aggregation function, making the process orders of magnitude faster.”

So, muses Ted, why all relational database technologies are designed with this columnar approach.

Referring back to the figure 5, Debby points to the row database. In a transactional system, the applications and users typically select a small number of rows but all the columns. Since all the columns are selected, the system will have to read all the column groups and select a few rows from them. This will affect the performance significantly. Similarly updates and deletes of individual records are common in transactional systems but not warehouses; hence a columnar design would have hurt the transactional systems.

Columnar systems are optimized for aggregate queries while transactional systems are optimized for individual record access.

Warehouses also have additional optimizing features such as storage indexes, bitmap indexes, etc. to make the analytical queries faster. The audience understood why the systems need to be tooled differently.

When the fit for purpose analytical platforms were built, organizations needed it to be hydrated with data from various operational systems, as shown in Fig 6.

Figure 6 All Data to Warehouse
Figure 6 All Data to Warehouse
Figure 6 All Data to Warehouse

In this approach, the data is extracted from the source systems, transformed and loaded into a single datawarehouse. This decouples the technologies used for the operational databases from that of the warehouse. The analytical users access the warehouse as much as they need to and the warehouse is built with the proper technology.

This made everyone happy until about the 2010’s, Debby says. Why ? What happened then— the audience was perplexed.

Problems with a Single Warehouse

  1. The technology is fit for purpose, leaving the right technology for the operational systems.
  2. The analytical users do not affect the performance of the operational systems
  3. The analytical platforms could apply some transformation to fix data for better understanding. For instance, while the operational systems may have been storing 0 and 1 as Inactive and Active, the warehouse could store the actual descriptive values to avoid misinterpretation.

However, there are limitations, as Debby continues:

  1. The warehouse is always a copy. The data there is never most recent. The data is as current as the most recent ETL job.
  2. The concept of a single data warehouse simply wasn’t practical for some organizations, leading to multiple warehouses. The reasons: the scale was too huge for a single warehouse, the organizational boundaries required data to be placed in multiple warehouses, regulatory requirements dictated the creation of warehouses across geographic boundaries, e.g. data of a specific country needed to be within the borders of that country, etc.

Here Comes Data Lake

Data Lake, Debby explains, is very similar to warehouse in the sense that it is for analytical purposes as well. However, the Lake is much more flexible in the kinds of data it can accept and store, even completely unstructured data. Typically, data warehouses are structured and have specialized software to optimize data placement and access while lakes are unstructured and use non-proprietary software, which obviously reduces cost; but does not optimize the data access. The cost factor is usually a significant one when deciding between the two formats. The lower cost allows organizations to store data for the longer term cheaply.

“So,” summarizes Ted, “Warehouses are structured, have proprietary software and data formats and hence have better performance. Lakes are unstructured, have no proprietary software and hence lack in performance; but that makes them less expensive and flexible to accept any types of data. Did I sum it correctly?”

In a nutshell, Debby agrees, yes; however the lines have been blurred. Datawarehouses with proprietary software can also store non-proprietary format data as external tables. However the performance is not at par with the warehouse and the costs could be elevated. Hence many organizations follow a dual pattern:

  1. Lakes for long term data stores, unstructured storage, analytical needs where performance is less of a concern. Archival systems, getting access to the long term data for Machine Learning, etc. fall into this category.
  2. Warehouses for short term, medium to high access data and where performance is necessary. Short term reporting or Business Intelligence functions and what-if analyses play well here.

In general, compared to data warehouses, data lakes offer much cheaper but less performant solution that may be acceptable in many cases.

Debby acknowledges that some vendors actually offer both capabilities in their product; but stresses that the functionalities are different. In addition, like the previous case of datawarehouses, an organization can also create multiple data lakes to address concerns of security, geographic preference, technology, vendor’s cloud platform and so on. Not a single size fits all; so most organizations have multiple platforms for lakes and warehouses.

Introducing the Data Mesh

The point is, Debby smiles, there are many different databases and datastores now. Multiple datawarehouses, multiple data lakes and multiple operational systems are all available as potential data sources to the users in the organizations. A typical user may need to access data from all of them. How will they do it? Will Acme need to create an uber Datawarehouse and copy all the data from all other sources, have all analytical users access that single datastore as shown in Fig 7.

Figure 7 An Uber Datawarehouse
Figure 7 An Uber Datawarehouse
Figure 7 An Uber Datawarehouse

Debby let the question hang for a second there for the audience to ponder the weight of the question.

Ted speaks up. “Creating such an uber lake or warehouse is not only supremely expensive; but it adds latency making it far less attractive than going after the sources themselves. So it’s impractical as well.”

Precisely, agrees Debby. This is why Acme chose not to do that. Instead, Acme recognized that there would be multiple legitimate sources of data, each with its own unique fit for purpose technologies, data formats, etc. and customers would have the freedom to pull their desired data from the most relevant source, as shown in Fig 8.

Figure 8 Data Mesh
Figure 8 Data Mesh
Figure 8 Data Mesh

Pointing to the figure, Debby explains that there are multiple datastores accessible to all users. The data from these databases will not need to be copied to a central datawarehouse or lake.

Then Debby walks the group through the benefits:

  1. There is no central lake or warehouse; so the cost associated with it is no longer a concern. This cost would have been pretty significant.
  2. There is no latency of the data being available to the customer. Since there is no ETL to move the data to the central datatstore, the data is available immediately.
  3. There is no central datastore; so there is no debate about whether it should be a lake or a warehouse. Each of the datastores can be configured as a lake or a warehouse.
  4. Since there is no central datastore, there is no debate about technology there. Each of the datastores can store the data in the appropriate technology. For instance Database 1 can be an on-prem Hadoop cluster while Database 2 can be a cloud native datastore built for time series data.
  5. Each database is independently managed. The unavailability of one does not affect the access to the others.
  6. Each database can decide to have its own entitlement strategy. Entitlement refers to the ability of users to access the datasets they need. Each datastore may follow its own privilege management system eliminating the need to have a central authorization system.

Not such a good idea?

Aaron does not seem convinced. “These are all good reasons; but there must be something less desirable.”

Yes, there are — agrees Debby. First, it’s the last advantage — the independence of entitlements — may be a disadvantage in some cases. Since the privilege management is typically specific to the database technologies, a user’s permissions to access data needs to be managed at the datastore. This creates complexity for the user to get to the data it needs. As a corollary, it also makes it hard to determine what access should be revoked from the user when the requirements change. A single central datastore will have had a single privilege management system and hence it would have been much easier to manage the entitlements. However, Debby asserts, that a small disadvantage for Acme compared to the advantages.

The audience agrees.

“But how do users find what they are looking for among all those databases?” asks Ted.

“Aha! That a perennial problem of datastores,” smiles Debby. The answer to that? A central Data Catalog.

The audience wanted to learn more about it; but they realized the time was up. They reluctantly left the video conference call but only after Debby promised them that she would hold another session to continue this line of discussion later.

Conclusion

In summary:

  1. Data Mesh is simply allowing users access the most logical location for their data needs rather than copying that data to a central location.
  2. Each database is independently managed
  3. Each database can employ the best technology fit for that purpose
  4. The data is always current as of those datastores
  5. A Central Catalog and an Entitlement Process makes this setup possible.

In the next article in this series, you will learn more about the challenges with Data Mesh and Acme solved them.

Award winning data management and engineering leader, Chief Architect of Cloud Data at JPMorganChase, raspberry pi junkie, dad and husband.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store