Data Architecture The Right Way

Learn Various Data Architecture Approaches and How to Structure the Data Architects for Each Approach

Arup Nanda
Jul 3, 2024 · 11 min read

Why Should You Care?

How do you structure your data architects? It depends on how the data landscape is architected, which has evolved significantly over the years from the monolithic database to the modern decoupled pub-sub approach. In this article, learn how a data ecosystem can be architected in different ways and how the architecture team can be structured for maximum effectiveness.

Traditional Practice

During the era of monolithic application designs, applications were typically built as separate systems from the data stores they used. To maximize the value of the data assets, data stores were accessible to every application to write to and read from, as shown in Fig 1.

Fig 1

Database design was extremely important. Since many applications updated the data assets (e.g., tables) and many more read them, it was critical to get the design of those tables right. Data architects therefore needed knowledge of the applications. Note the use of the plural: applications, many of them.

Separation of Analytical Datastores

With the emergence of analytical applications, the need to read data grew at a much faster pace than the need to generate it. But this caused a conflict with the writing of data. Readers, especially bulk readers, often caused performance issues for writers by competing for the same resources. Readers also needed additional structures such as indexes to improve their own performance, but these were impediments to the writers. Readers also wanted the data in a different physical format for efficiency, such as a “columnar” format, which was vastly inferior for the single-record updates typical in transactional systems.

This led to the development of a new type of data store, the analytical store, to differentiate it from the transactional store. Analytical stores got the data they needed from the various transactional stores via a process typically labeled Extract/Transform/Load (ETL). Analytical stores stored the data in different formats, flattened multiple tables into wider denormalized ones, used different designs (often called dimensional models, in contrast to the normalized relational models of transactional systems), and so on. This was what the T of ETL did. This is shown in Fig 2.

Fig 2

For many analytical applications, such as reports and what-if analysis, data latency was not really an issue; being a few hours behind didn’t matter. If some real-time data was needed, for instance to find the current status of an order, the transactional system could still be queried for individual records, while large reports such as end-of-day or month-over-month sales could be run from the analytical store, which was designed with that bulk, read-only access in mind. The transactional store, where the company ran its business, was not impacted. Everyone was happy.

This also led to the development of a new type of data architect, proficient in the design nuances of analytical data systems such as dimensional modeling. But it also reduced the dependence on the deep expertise required in the previous case. Most of the data in the analytical stores is created with clearly defined labels during the design of the ETL process. Consider the example in Fig 2a below (a highly simplified view of a transformation, merely to illustrate the point). The status_code column shows 0, 1, 2, etc. What do they mean? The analytical store received those values and transformed them into descriptive values such as “Put in Cart”, “Ordered”, “Payment Processed”, and so on. So the need for deep subject matter expertise was limited to the transactional systems, not the analytical systems. Everyone was happy.

Fig 2a
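A minimal sketch of that transformation step in Python, assuming a hypothetical mapping of status codes to labels (the column names and values are illustrative only):

```python
# Illustrative ETL "T" step: translate cryptic transactional codes
# into descriptive values for the analytical store.
STATUS_DESCRIPTIONS = {
    0: "Put in Cart",
    1: "Ordered",
    2: "Payment Processed",
}

def transform_order_row(row: dict) -> dict:
    """Return a copy of the row with status_code replaced by a label."""
    transformed = dict(row)
    transformed["status"] = STATUS_DESCRIPTIONS.get(row["status_code"], "Unknown")
    del transformed["status_code"]
    return transformed

# Example: {"order_id": 42, "status_code": 1} -> {"order_id": 42, "status": "Ordered"}
print(transform_order_row({"order_id": 42, "status_code": 1}))
```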

Service Oriented Architecture

Look at Fig 2. Many applications write to the central data store. For instance, there is a table called ORDERS, which contains orders from customers. All applications write to it and read from it. Like many things, this table needs maintenance: new columns need to be added, old ones deprecated, the table moved to a faster data platform, and so on. Which applications does a change affect?

It was a nightmare. The larger the company, the more complex the technology systems, and pinpointing the applications that would be affected, let alone how, was a non-trivial task. Elaborate change review processes were created to handle it, but nothing removed the dependence on human knowledge held by many individuals.

In the early 21st century, a new paradigm was buzzing around to address this problem, among many others. The proposal was to hide the actual data stores behind a single application. So the ORDERS table would not be visible to anyone except a special application called a “service”, say, order_service. This service would then accept requests from all others to write data and dispense data about orders. Initially the service used a specialized protocol called Simple Object Access Protocol (SOAP). Later the scope of this service was reduced and a new term was introduced: microservice. It followed the same concept, i.e. operating on the underlying data and exposing interfaces for anyone to access. The access protocol was HTTP and the interaction was highly standardized as Representational State Transfer (REST), shown in Fig 3.

Fig 3

This design brought much-needed benefits to data architecture. Since the underlying data assets were not exposed to anyone other than the microservice, the knowledge of the data architects did not have to extend beyond that boundary. The microservice defined the interfaces, such as accept_order along with a set of parameters. Under the covers, whether it wrote into a single table called ORDERS or multiple tables, or whether the columns stored the status as 0, 1, 2, etc. or more descriptively, was not relevant outside of the microservice. Similarly, the data store technology could be changed to meet evolving needs or growing business demands without affecting anyone else, as long as the microservice interface remained the same.
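As a sketch of what such a service boundary might look like, here is a minimal order microservice using Flask (an assumption for illustration; the endpoint names and internal storage are hypothetical):

```python
# Minimal sketch of an order microservice, assuming Flask. Callers
# only ever see the HTTP interface; the storage details stay hidden.
from flask import Flask, request, jsonify

app = Flask(__name__)

# Internal storage detail: could be one ORDERS table, several tables,
# or a different datastore entirely. Not visible to callers.
_orders = {}

@app.route("/orders", methods=["POST"])
def accept_order():
    payload = request.get_json()
    order_id = len(_orders) + 1
    # Status kept as an internal numeric code; irrelevant to callers.
    _orders[order_id] = {"items": payload["items"], "status_code": 1}
    return jsonify({"order_id": order_id}), 201

@app.route("/orders/<int:order_id>", methods=["GET"])
def get_order(order_id):
    order = _orders.get(order_id)
    if order is None:
        return jsonify({"error": "not found"}), 404
    # Expose a descriptive status, not the internal code.
    return jsonify({"order_id": order_id, "status": "Ordered"}), 200

if __name__ == "__main__":
    app.run(port=8080)
```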

The data architecture teams were then split up and aligned with the individual service teams, reducing the dependence on legacy knowledge even further.

Service Groups

Pretty soon a new problem surfaced. Microservices were too granular. Consider the example in Fig 4. There are microservices for accept_order, process_order, send_to_fulfillment and so on. They need to operate on the same table. According to the principles of microservices, tables cannot be shared across them; so each of them had its own ORDERS table, which started out pretty much the same but diverged as processing continued.

Fig 4

Creating redundant copies of data is a big no-no in data architecture. This design does not pass the common-sense test. Ideally, these services should operate on the same table. This led to a slightly changed design of the microservices, with a “shared” table used by many microservices in a “family”, sometimes referred to as a “domain”. In this case we have defined a domain called Orders, with all these individual services operating on the same table called ORDERS. The table is not visible to anyone outside the family. Likewise, there is another family of microservices for customers which interacts only with a single table called CUSTOMERS. This family can’t see the ORDERS table.

Fig 5
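A minimal sketch of the idea, with a hypothetical Orders domain whose services share one internal store that nothing outside the family can see:

```python
# Sketch of an "Orders" domain: several services in one family share
# a single internal orders store, never exposed outside the domain.
# All names are illustrative.

class OrdersDomain:
    def __init__(self):
        # The shared ORDERS "table", private to the domain.
        self._orders = {}

    # accept_order, process_order, send_to_fulfillment represent
    # separate services in the same family, all using the same store.
    def accept_order(self, order_id, items):
        self._orders[order_id] = {"items": items, "status": "Ordered"}

    def process_order(self, order_id):
        self._orders[order_id]["status"] = "Payment Processed"

    def send_to_fulfillment(self, order_id):
        self._orders[order_id]["status"] = "Sent to Fulfillment"

    def get_status(self, order_id):
        return self._orders[order_id]["status"]


# The Customers domain would have its own class and its own store;
# it cannot reach into OrdersDomain._orders.
orders = OrdersDomain()
orders.accept_order(42, ["widget"])
orders.process_order(42)
print(orders.get_status(42))  # Payment Processed
```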

This brought much-needed practicality and did not change the data architecture scope or skillset requirements. Data architects were already assigned to domains and had to know their domains well; this development just made that alignment more obvious.

Of course, this also introduced a new risk of affecting multiple microservices when the underlying data store went through changes. However, with all the services in the same family, the impact was significantly reduced.

Evolution of ETL

Notice that in the evolution above, the analytical stores were pretty much left alone. The microservice design so prevalent in the transactional systems did not gain traction in the analytical world. Why?

There were two problems:

  1. Microservices, implemented through API calls, typically process records one at a time. Many analytical applications need access to all of the data, immediately. For instance, a Spark application first pulls all the data into an RDD in its own cluster memory and operates on it. If it receives records one at a time, it will be incredibly slow. So it needs direct access to the datastore, not access through an API (see the sketch after this list).
  2. Some analytical tools need the data to be available in a certain format or location to work. For instance, Amazon SageMaker can only access data on S3 buckets or via Lake Formation, not through API calls.
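To illustrate the first point, here is a minimal PySpark sketch that reads the analytical store directly and operates on the whole dataset at once (the path and column names are placeholders):

```python
# Minimal sketch: a Spark job reads the analytical store directly and
# works on the full dataset, rather than fetching records one by one
# through a service API. Path and column names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orders-analytics").getOrCreate()

# Bulk, direct read of the ORDERS data from object storage.
orders = spark.read.parquet("s3a://analytics-bucket/orders/")

# A typical aggregate over the entire dataset.
daily_sales = orders.groupBy("order_date").sum("order_total")
daily_sales.show()
```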

Therefore, direct access to the analytical data stores remains a top requirement. So, how does this change the ballgame for data architecture teams?

The architects need to be subject matter experts in their domain as well. But this remains a challenge: people move, and legacy knowledge can’t be packaged up and moved with them. The bottleneck needs to be resolved.

Data Products

With the explosion of analytical use cases, the need for analytical stores also exploded, bringing a new challenge to scaling the data architects in that area. This is why we had to rely less on architects to tell users what to use, and move to a self-service model where the users could determine that on their own. This led to the development of data products, which carry enough information about themselves, available in a catalog and discoverable by tools. The comprehensive metadata about the data allows consumers to determine on their own whether it is what they want and can use.

For instance, a data product could be ORDERS, which is culled by the ETL process from multiple tables in the transactional store and presented as one or more tables in the analytical store. If the consumers need another element of data not already visible, they send a request to the “owner” of that data product, who can then determine how, or even whether, to add it. The elements of the data product carry extensive descriptions so they are reasonably self-explanatory and don’t need the data architects to explain much. The data architects are still needed to decide on the other aspects of the data systems: the technology of choice, the format, the resiliency, and so on.
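As an illustration, a catalog entry for such an ORDERS data product might carry metadata along these lines (the structure is hypothetical; real catalogs vary by tool):

```python
# Illustrative catalog entry for an ORDERS data product. The fields
# and their values are hypothetical, purely to show the kind of
# metadata that makes a data product self-describing.
orders_data_product = {
    "name": "ORDERS",
    "owner": "orders-domain-team",
    "description": "All customer orders, refreshed hourly from the "
                   "transactional order systems.",
    "location": "s3://analytics-bucket/orders/",
    "format": "parquet",
    "columns": [
        {"name": "order_id", "type": "string",
         "description": "Unique identifier of the order."},
        {"name": "status", "type": "string",
         "description": "Descriptive status, e.g. 'Ordered', "
                        "'Payment Processed'."},
        {"name": "order_total", "type": "decimal(10,2)",
         "description": "Total order value in USD."},
    ],
}
```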

These data products are typically created by a specialized type of role called a data steward, someone intimately familiar with their own products. I have seen some organizations where both roles, stewardship and architecture, are performed by the same individuals; but as long as their individual remits are kept distinct, the roles can be, and should be, different.

Streaming

So far we have talked about a point-to-point solution, using ETLs to move, transform, and load the data into the target store. Even in a medium-sized organization, it is almost impossible to scale this out with multiple sources and destinations. Likewise, with the advent of Data Mesh (https://medium.com/@arupnanda/what-is-data-mesh-anyway-c65886cab127), this becomes an even bigger challenge. There may be multiple destinations for the same data, and the technology may change over time. How do we solve for this?

This is where a different type of architecture comes in. As shown in Fig 6, publisher systems publish their data in a pre-determined schema format to a streaming system. The consumers pick up the payload from the stream if they are entitled to it. The publishers do not know who the consumers are; all they do is publish the data in a known structure. The operative word is “known”: the structure, or schema, is well known and is defined in some catalog where the consumers can look it up.

Fig 6
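A minimal publisher sketch, assuming a Kafka-style stream and the kafka-python client (the topic name and event fields are illustrative; in practice the schema would be registered in a catalog for consumers to look up):

```python
# Sketch of a publisher emitting events in an agreed, well-known
# structure, with no knowledge of who consumes them. Assumes Kafka
# and the kafka-python client; topic and fields are illustrative.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

order_event = {
    "order_id": "A1001",
    "status": "Ordered",
    "order_total": 59.99,
    "order_date": "2024-07-03",
}
producer.send("orders", value=order_event)
producer.flush()
```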

This arrangement frees the systems from building bridges (ETLs) from each source to each destination. If a new analytical platform is required, all we have to do is bring it up and have it pick the payload off the stream. There is no need to build a new ETL system.

The data architecture requirement then falls heavily on the accurate and effective definition of the schema the publishers will publish in. The architects need to be proficient in the business meaning of the publishing schema, the nuances the terms could pose, and so on. On the other end, the consumer systems reading from the stream will get precise, well-understood data from the publishers, but they may need to perform transformations and enrichment to make the data useful in their respective systems. Let’s explore that with an example.

Two streams of data arrive via the streaming system:

  • ORDERS: the stream of all orders coming from the order systems. It carries the order ID and the details of the order.
  • FULFILLMENTS: the stream of all active orders received at the warehouse but not yet fulfilled. It carries the order ID (but not the other details of the order) and fulfillment details such as the warehouse and the worker the order has been assigned to.
Fig 7

These are different streams of data because they come from different systems and carry different types of data. However, the customer service department needs a dashboard with all the information: the order details, the warehouse details, and so on. Hence an application on the other end of the stream combines the two streams, joining on ORDER_ID, enriches the result with customer and product information from the appropriate databases, and creates a third, enriched stream. This third stream could be picked up by a different application to be persisted in a data warehouse, or simply be available for any other application to read and act on, as sketched below.
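Here is a highly simplified, in-memory sketch of that consumer-side join (a real implementation would use a stream processor with windowing and state stores; all names are illustrative):

```python
# Simplified consumer-side join: buffer events from the ORDERS and
# FULFILLMENTS streams, match them on order_id, and emit enriched
# events to a third stream. A production system would use a stream
# processor (Kafka Streams, Flink, Spark Structured Streaming, etc.).
orders_buffer = {}        # order_id -> order event
fulfillments_buffer = {}  # order_id -> fulfillment event

def on_order(event):
    orders_buffer[event["order_id"]] = event
    _try_emit(event["order_id"])

def on_fulfillment(event):
    fulfillments_buffer[event["order_id"]] = event
    _try_emit(event["order_id"])

def _try_emit(order_id):
    # Emit an enriched event only when both halves have arrived.
    if order_id in orders_buffer and order_id in fulfillments_buffer:
        enriched = {**orders_buffer.pop(order_id),
                    **fulfillments_buffer.pop(order_id)}
        publish_enriched(enriched)

def publish_enriched(event):
    # Placeholder for publishing to the third, enriched stream.
    print("enriched event:", event)

# Example usage with illustrative events:
on_order({"order_id": "A1001", "status": "Ordered", "order_total": 59.99})
on_fulfillment({"order_id": "A1001", "warehouse": "W-17", "worker": "E-204"})
```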

Note how we still went through transformation, but we did not build ETL pipelines. We were completely free to define the systems and transformations as we saw fit, and we could roll them out quickly. That is the power of a decoupled architecture. Data architects on the consumption side need to know the data structures they are getting, whether to join them (and if so, how and with what), and how to choose the right technology to persist them. So the required knowledge is distributed and sits closer to the actual producers and consumers of data.

Conclusion

Setting up a data architecture team depends on the architecture of the data ecosystem you are planning to have. In my experience, the decoupled streaming architecture explained earlier is the only practical and scalable way for a medium-sized organization with an established Data Mesh, and the data architecture teams have to be staffed in a distributed fashion as well to be effective. But if you are planning on a monolithic, ETL-based architecture, then it makes sense to have a central architecture team, as they will need to keep tabs on what data is being passed and how it is used.

I hope this article helps you not only with structuring your data architecture team but also with architecting your data ecosystem to meet your specific needs.
