- Data mesh is an architectural paradigm for how you think about and organize data and its services, much as microservices is an architectural pattern for designing and building applications.
- A microservices-based architecture helps solve the challenges that prevent an organization from being agile and responding quickly to change, such as delays in introducing new features, long testing cycles, dependencies between teams, and limited scalability.
- The major cause of these challenges was the monolithic design of the system, which was a single large deployable unit.
- Microservices address these problems by building an application as a suite of independent services, each running in its own process and communicating through lightweight mechanisms, often REST APIs. These services are built around business domains and are independently deployable by fully automated deployment machinery.
- Data mesh takes a similar approach to data engineering. It helps organizations become more agile by moving away from large monolithic data platforms, which are difficult to change when data sources and use cases are numerous, varied, and rapidly changing.
- A large monolithic data platform architecture has three primary goals:
  - Host data from various internal and external sources in a single repository.
  - Build data pipelines to collect, transform, and enrich the source data into the forms that consumers need.
  - Serve the datasets to a variety of consumers with diverse needs. These range from analytical consumption (exploring the data for insights) and machine-learning-based decision making to business intelligence reports that summarize the performance of the business.
- Challenges with this architecture:
  - Centralized systems are a bottleneck for innovation and agility, and such large systems are difficult to scale.
  - Building and managing such a large big data platform requires highly specialized tools and skills. This leads to a culture in which specialized data engineers are siloed from the parts of the organization where data is generated and consumed.
  - Responding to a new request requires a change across the whole data pipeline, which makes it difficult to stay agile and responsive.
  - The result is an architecture and organizational structure that does not scale and does not deliver the promised value of a data-driven organization.
- Data mesh tries to address these problems by leveraging the techniques that have been instrumental in building modern distributed architectures at scale. Here, scale refers to the constant change in the data landscape, the proliferation of both data sources and consumers, the diversity of transformation and processing that use cases require, and the speed of response to change.
- There are four underpinning principles for a data mesh implementation:
  - Domain-oriented decentralized data ownership and architecture
  - Data as a product
  - Self-serve data infrastructure as a platform
  - Federated computational governance
- Domain-oriented decentralized data ownership and architecture: As in a microservices architecture, where services are organized around business domains, data mesh advocates decomposing data around domains rather than at the pipeline level. This drives an architecture in which domains are aligned with the origin of the data and its consumers and, most importantly, ownership of the data is distributed.
- Data as a product: Each team that publishes data treats it as a product. The team is wholly responsible for the data, including its quality, its representation, and its cohesiveness. Every node on the data mesh is a product, and the data owner should ensure that the data is easy for consumers to discover and use. It therefore needs good documentation, discoverability, security, metering, and so on, much like a well-run microservice or API.
- Self-serve data infrastructure as a platform: The goal of a self-serve data platform is to lower the lead time required to create a new data product on the infrastructure. This requires tools and processes that abstract away the complexity and friction of provisioning data products and managing their lifecycle. The data infrastructure should therefore be built as a domain-agnostic platform that hides the underlying complexity and provides its components in a self-service manner.
- Federated computational governance: Data mesh requires a federated computational governance model that maintains global controls while increasing local adaptability. Semantic standards, security policies, and compliance are managed centrally, while the responsibility to comply is decentralized to data product owners. The platform enables policy execution computationally in the context of each data product.
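To make the "data as a product" and "federated computational governance" principles more concrete, here is a minimal Python sketch. It assumes a hypothetical `DataProduct` descriptor that each domain team publishes, and a hypothetical set of centrally defined policies that a platform could evaluate automatically against every product. None of these names come from a real data mesh tool; they are illustrative only.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DataProduct:
    """Hypothetical descriptor that a domain team publishes for its data product."""
    domain: str
    name: str
    owner: str
    description: str = ""
    schema: Dict[str, str] = field(default_factory=dict)  # column name -> type
    contains_pii: bool = False
    encrypted_at_rest: bool = False


# A policy is a named predicate, defined once by the central governance group.
Policy = Callable[[DataProduct], bool]

# Illustrative global policies (assumptions, not a standard rule set).
GLOBAL_POLICIES: Dict[str, Policy] = {
    "has_owner": lambda p: bool(p.owner.strip()),
    "is_documented": lambda p: bool(p.description.strip()) and bool(p.schema),
    "pii_is_encrypted": lambda p: (not p.contains_pii) or p.encrypted_at_rest,
}


def violations(product: DataProduct,
               policies: Dict[str, Policy] = GLOBAL_POLICIES) -> List[str]:
    """Return the names of the global policies that the product fails."""
    return [name for name, check in policies.items() if not check(product)]


# The sales domain owns and describes its own product ...
orders = DataProduct(
    domain="sales",
    name="orders",
    owner="sales-data-team",
    description="Confirmed customer orders, one row per order.",
    schema={"order_id": "string", "customer_email": "string", "total": "decimal"},
    contains_pii=True,
    encrypted_at_rest=False,
)

# ... while the platform applies the same central rules to every product.
print(violations(orders))  # ['pii_is_encrypted']
```

The split mirrors the principle: the policy definitions live in one central place, the domain team is responsible for fixing any violation in its own product, and the check itself runs as code rather than as a manual review.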
Data mesh is a network for exchanging data about the various domains of an organization. The nodes in the mesh are data products. These nodes produce high-quality, locally curated data within the mesh. Data is curated by the team that owns the data domain for the benefit of the other nodes. Data flows from node to node, and thus from domain to domain, as needed.
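One way to picture this network is as a directed graph in which each data product lists the products it consumes. The sketch below is a rough illustration under that assumption; the product names and the `MESH` structure are hypothetical, and a real mesh would keep this information in a catalog rather than a dictionary.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical mesh: each data product (node) lists the products it consumes.
MESH: Dict[str, List[str]] = {
    "sales.orders": [],
    "logistics.shipments": ["sales.orders"],
    "finance.revenue": ["sales.orders"],
    "finance.forecast": ["finance.revenue", "logistics.shipments"],
}


def downstream(mesh: Dict[str, List[str]], source: str) -> Set[str]:
    """Return every product that directly or indirectly consumes `source`."""
    # Invert the "consumes" edges so we can walk from producer to consumer.
    consumers: Dict[str, List[str]] = {}
    for product, inputs in mesh.items():
        for upstream in inputs:
            consumers.setdefault(upstream, []).append(product)

    # Breadth-first traversal over the consumer edges.
    seen: Set[str] = set()
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for consumer in consumers.get(node, []):
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return seen


print(sorted(downstream(MESH, "sales.orders")))
# ['finance.forecast', 'finance.revenue', 'logistics.shipments']
```

A traversal like this shows how data from one domain reaches others, and which downstream products a domain team should consider before changing its own.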
To summarize, data mesh is a distributed data platform with decentralized ownership and federated governance, with the domain as a first-class concern and data as a product. Use cases for data mesh encompass both operational and analytical data, which is one key difference from conventional data lakes and data warehouses.
Implementing data mesh requires both technical and cultural change, as it is a new and different way of thinking about data at scale.