data platform series:
Cloud Native Data Platform:
Reference Architecture
Many of our clients come to us in the early stages of defining their data strategy, needing help to identify the technical solution that will support it. A key part of developing that strategy is showing them a typical reference architecture, or blueprint, for a cloud native data platform. I thought it would be useful to share this example as part of our data platform strategy series.
Whilst every organisation is different and every data solution varies, a lot of value can be gained from utilising a common Cloud Data Platform (CDP) Reference Architecture as a starting point for building internal knowledge and baselining the desired solution.
The six architectural layers
The diagram here shows an extensive, end-to-end reference architecture for a Cloud Data Platform containing six primary layers. Some layers will be relevant to all customers and use cases; others will only be relevant in certain circumstances and would be subject to further analysis.
Ingestion Layer
The ingestion layer is responsible for getting data into the platform and differentiates deliberately between batch and streaming data as they are different use cases.
Storage Layer
The Storage layer isolates slow and fast storage due to their characteristics, performance, and cost considerations.
Processing Layer
Services designed to process an abundance of data types – broken down into batch and streaming data (aligned to the ingestion layer)
Metadata Layer
Used to store and organise metadata which drives the operational efficiency of the Cloud Data Platform
Delivery Layer
Along with a Data Warehouse from where data is served, this also includes data consumers, which can be both humans (e.g. analysts) and other applications.
ETL/Orchestration Layer
Often a combination of tools which provide enhanced functionality from both a business and/or data perspective.
Each layer is a functional component (often consisting of more than one tool/technology) that performs a specific task. Logically, the architecture is Cloud Agnostic, but reference implementations exist for each of the major Cloud Providers – Azure, AWS and Google Cloud.
I’ve described each layer in further detail in the following section.
Data Ingestion Layer
At its simplest, this layer is responsible for getting data out of source systems and bringing it into the data platform. It does so by ingesting the data into a Data Lake in its original format. Both Batch and Streaming data should be considered first class citizens, although either may be omitted based on the solution and/or desired state.
Capabilities of the ingestion layer primarily focus on:
- Securely connecting to a wide range of data sources on a range of protocols and delivery patterns
- Reliably transporting data into the data lake with no changes. This is vital for scenarios where audit is required or reprocessing may be desirable.
- Populating the metadata layer with management information (MI) to ensure downstream processing is as efficient as possible and to facilitate observability of CDP health
As volumes coming into the platform will often be unknown, it is important for this layer to be highly scalable, modular, extensible and highly available.
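To make this concrete, the sketch below shows what a minimal batch ingestion step might look like: land the source file in the lake's raw zone unchanged, then record operational metadata about the run. This is a minimal sketch rather than a prescribed design – the paths, the record_metadata helper and its fields are all illustrative assumptions.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path

RAW_ZONE = Path("/data-lake/raw")  # placeholder root of the lake's raw zone


def record_metadata(entry: dict) -> None:
    """Placeholder for a write to the metadata layer (catalog, database, etc.)."""
    print(entry)


def ingest_batch_file(source_path: Path, source_system: str) -> None:
    """Land a source file in the raw zone unchanged and record the run."""
    landed_at = datetime.now(timezone.utc)
    # Partition by source system and ingestion date; keep the original format
    target_dir = RAW_ZONE / source_system / landed_at.strftime("%Y/%m/%d")
    target_dir.mkdir(parents=True, exist_ok=True)
    target_path = target_dir / source_path.name
    shutil.copy2(source_path, target_path)  # byte-for-byte copy, no transformation
    record_metadata({
        "source_system": source_system,
        "file": str(target_path),
        "bytes": target_path.stat().st_size,
        "ingested_at": landed_at.isoformat(),
        "status": "succeeded",
    })
```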
Storage Layer
The ingestion layer is responsible for getting data into the platform but does not store any data itself. It is therefore imperative that the Storage Layer provides ample capability to store data as quickly as required and to the volumes necessary – especially as there could be multiple ingestion jobs running concurrently.
Data Storage can be broadly categorised as fast or slow. Fast storage is designed for high-frequency, smaller pieces of data – often a few kilobytes or less, but with much, much higher performance characteristics. An example of this might be streaming data coming from millions of IoT devices. Another example of fast storage may be a high-capacity message queue.
Slow storage offers a completely different capability set and efficiently handles much larger data sets – from megabytes to petabytes and beyond. The capacity and (often) lifespan of slow data is much higher and the cost typically much lower. An example of slow storage might be a tiered object store.
Capabilities of the storage layer primarily focus on:
- Scalability – handling unknown and logically ‘unlimited’ amounts of data
- Reliability – handling different types of data from different delivery paradigms. Component failure should not affect reliability of the CDP
- Performance – Not just about being ‘fast enough’, performance is about providing the correct tooling and infrastructure for the data required. This will vary based on latency or throughput considerations along with volume and frequency
- Cost Efficiency – When considering large data volumes, cost becomes a key factor. Storing the data as cheaply as possible is a central consideration for almost any client and solution.
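As an illustration of the fast/slow split, the sketch below routes small, high-frequency events to a queue and large, long-lived objects to the lake. The in-memory queue and local directory are stand-ins (assumptions) for a managed message service and a tiered object store respectively.

```python
import queue
from pathlib import Path

# Stand-ins for real services: a managed message queue (fast) and an object store (slow)
fast_storage: "queue.Queue[dict]" = queue.Queue()
SLOW_STORAGE_ROOT = Path("/data-lake/raw")


def store_event(event: dict) -> None:
    """Fast path: small, high-frequency payloads needing low-latency buffering."""
    fast_storage.put(event)


def store_object(name: str, data: bytes) -> None:
    """Slow path: large, long-lived objects at a low cost per gigabyte."""
    target = SLOW_STORAGE_ROOT / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)  # placeholder for an object-store PUT
```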
Processing Layer
The processing layer is often considered the ‘brain’ of the CDP – this is where business logic is applied, data validation is performed and some transformations take place. This is often (but not always) a three-stage process.
Capabilities of the processing layer primarily focus on:
- Support for both batch and real-time data
- A programming interface supporting popular languages such as C#, Python and Java, or no-code, drag-and-drop style interfaces
- A SQL or SQL-like interface to enable ad-hoc interactions (see the sketch after this list)
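A minimal sketch of these capabilities is shown below, assuming a Spark-based processing engine (one of many possible choices); the paths and the orders dataset are illustrative assumptions. An equivalent streaming pipeline would follow the same shape using the engine's streaming read and write APIs.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdp-batch-processing").getOrCreate()

# Read raw data from the lake, then apply a simple validation rule
orders = spark.read.parquet("/data-lake/raw/orders/")
valid_orders = orders.filter(orders.amount > 0)

# Expose the data through a SQL interface for ad-hoc interaction
valid_orders.createOrReplaceTempView("orders")
daily_revenue = spark.sql(
    "SELECT order_date, SUM(amount) AS revenue FROM orders GROUP BY order_date"
)

# Write the curated output back to the lake for the delivery layer to serve
daily_revenue.write.mode("overwrite").parquet("/data-lake/curated/daily_revenue/")
```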
Metadata Layer
The metadata layer refers to technical or operational metadata, not business or application specific metadata. That is, this metadata provides a store of context (and often state) that enables the processing layer, and all other layers, to perform their duties efficiently.
Examples of data stored in the metadata layer include:
- Status of ingestion tasks
- Volumes of data ingested
- Schema information
- Pipeline metadata
- Error information
- Failure rates
- Throughput information
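As a simple illustration, an operational metadata record for a single ingestion run might look like the sketch below; the field names are assumptions for illustration, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class IngestionRunRecord:
    """One row in the metadata store describing a single ingestion run."""
    source_system: str            # e.g. "crm" or "erp"
    pipeline_name: str            # which pipeline produced this run
    status: str                   # "running", "succeeded" or "failed"
    rows_ingested: int            # volume of data ingested
    schema_version: str           # schema information captured at ingest time
    error_message: Optional[str]  # populated on failure
    started_at: datetime
    finished_at: Optional[datetime]
```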
Delivery Layer
The delivery layer provides a flexible means of getting data to consumers. Some may wish to access data through a SQL interface and query the underlying Data Warehouse, whilst others may wish to consume the data direct from storage. It is the responsibility of the delivery layer to provide these capabilities to ensure consumers can efficiently (and securely) access the data they require.
Capabilities of the delivery layer primarily focus on:
- Elastic cost model – the ability to scale up or down depending on load/volume. Compute demand may spike during business hours when lots of BI reporting is performed, but be unnecessary at other times when only background CDP tasks (e.g. overnight ingestion) are running
- NoOps – there should be little to no tuning or optimisation required beyond what the cloud vendor supplies.
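To illustrate the SQL route, the sketch below queries the warehouse through a generic SQL interface; the connection string and the sales_summary table are illustrative assumptions, and the actual warehouse (BigQuery, Synapse, Redshift, Snowflake and so on) will vary. The direct-from-storage route would simply read the curated files from the lake instead.

```python
from sqlalchemy import create_engine, text

# Placeholder DSN: swap in the connection details of whichever warehouse is in use
engine = create_engine("postgresql+psycopg2://analyst@warehouse.internal/analytics")

with engine.connect() as conn:
    result = conn.execute(text(
        "SELECT region, SUM(revenue) AS total_revenue "
        "FROM sales_summary GROUP BY region"
    ))
    for row in result:
        print(row.region, row.total_revenue)
```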
ETL/Orchestration Layer
This layer is often a polyglot assortment of tools that may be a combination of custom code, commercial products or service offerings. The orchestration layer is responsible for ordering the processing of jobs – tracking and managing dependency chains, handling errors/retries and ensuring smooth end-to-end operation. It often calls into many of the other layers to perform its role. The ETL layer often provides an interface to define and maintain data pipelines, allowing data to be shaped to business or solution needs.
The ETL layer will also update the metadata layer and in some (simpler) cases, perform the orchestration role too by coordinating multiple jobs.
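As one concrete (and purely illustrative) example, assuming Apache Airflow as the orchestrator, a simple two-step pipeline with a dependency chain and retries might be defined along these lines; the DAG name and task bodies are placeholders.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest_orders():
    ...  # call into the ingestion layer


def transform_orders():
    ...  # call into the processing layer


with DAG(
    dag_id="daily_orders_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest", python_callable=ingest_orders, retries=3)
    transform = PythonOperator(task_id="transform", python_callable=transform_orders, retries=3)

    ingest >> transform  # dependency chain: transform only runs after a successful ingest
```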
As I said at the beginning, this is a common reference architecture for a Cloud Native Data Platform but the important part is to ensure that it fits with the needs of your organisation and its Data Strategy. Not all organisations will require the full suite of capabilities outlined.
I have seen huge, costly mistakes made when overly complex solutions were designed for simple analytics routines, for example. It is about examining your use cases, starting simple and scaling up as needed (evolutionary architecture is highly applicable to data). I will examine and explain how to select the right architecture and solution in the next blog post in our data platform series.