Architecting Cloud-Native Platforms: The Role of Domain-Driven Design and Cell-Based Architecture
Modularity stands as a fundamental principle in software engineering, essential for creating systems that are both easy to manage and maintain. By breaking down a system into smaller, well-defined modules, each responsible for a specific functionality, software architects can ensure that complexity is controlled. These modules interact through established interfaces, allowing for clear communication paths and defined roles within the system.
Domain-Driven Design
Domain-driven design (DDD) represents a specialized pattern of modularity in software design. It involves a collaborative approach where developers, domain experts and business analysts work together to create software models that accurately reflect the fundamental business concepts of a domain. This strategy ensures that the design of the software is deeply integrated with, and highly reflective of, business needs. DDD simplifies complex systems by breaking them down into smaller, autonomous services, each dedicated to a specific business segment. This modular approach not only makes the system more manageable but also enhances the clarity and coherence of the overall software architecture.
DDD applied to microservices involves two key phases. In the strategic phase, the team identifies the Bounded Contexts and maps them out on a context map. In the tactical phase, each Bounded Context is modeled according to the business rules of its subdomain. DDD is inherently iterative and adaptive.
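As a small illustration of the tactical phase, a Bounded Context can be modeled with the classic DDD building blocks: value objects, entities, and an aggregate root that enforces the subdomain's business rules. The sketch below assumes a hypothetical "Orders" subdomain; all names are illustrative and not from the article.

```python
from dataclasses import dataclass

# Hypothetical "Orders" Bounded Context; names are illustrative only.

@dataclass(frozen=True)
class Money:
    """Value object: immutable and compared by value, not identity."""
    amount: int        # minor units, e.g. cents
    currency: str

@dataclass
class OrderLine:
    """Entity that lives inside the Order aggregate."""
    sku: str
    quantity: int
    unit_price: Money

class Order:
    """Aggregate root: all changes to the order go through it,
    so business rules are enforced in one place."""

    def __init__(self, order_id: str):
        self.order_id = order_id
        self.lines: list[OrderLine] = []

    def add_line(self, sku: str, quantity: int, unit_price: Money) -> None:
        # Business rule enforced at the aggregate boundary.
        if quantity <= 0:
            raise ValueError("quantity must be positive")
        self.lines.append(OrderLine(sku, quantity, unit_price))

    def total(self) -> Money:
        currency = self.lines[0].unit_price.currency if self.lines else "USD"
        return Money(
            sum(line.quantity * line.unit_price.amount for line in self.lines),
            currency,
        )
```

The point of the aggregate root is that invariants (here, positive quantities) cannot be bypassed from outside the Bounded Context, which is exactly what makes each context safe to evolve independently.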
Reference Architecture for Scalable Cloud-Native Platforms
While beginning with DDD is a strong starting point, fully realizing its benefits requires a robustly architected cloud-native platform.
Figure 1: Reference Architecture for a Scalable Cloud-Native Platform
While DDD streamlines team organization and simplifies software design, cell-based architecture, bolstered by cloud-native technologies, provides the essential building blocks for a scalable platform. Domains or subdomains can be mapped to network-bounded cells managed through well-defined gateways. This structure allows small, agile "two-pizza" teams to release frequently, supported by CI/CD infrastructure that avoids operational bottlenecks. Built-in resilience techniques and policies within the platform strengthen the robustness of architectures such as microservices. Well-defined observability is also crucial for providing essential debugging capabilities, and technologies such as service mesh play a pivotal role in advancing these efforts. Adhering to zero-trust principles ensures the necessary isolation and data security for all user applications and data. To foster collaboration among multiple teams while maintaining governance, the platform should include a robust service discovery system and support an API-first development approach that streamlines integration and functionality across services.

Reference Implementation

Let's delve into the implementation of a scalable cloud-native platform. A central strategy in this implementation is to avoid locking into a specific cloud vendor, aiming instead for a solution that is as cloud-vendor-agnostic as possible.

One challenge is selecting cloud-native tools well suited to building the platform. The current cloud-native technology landscape is complex, so platform engineers should dedicate sufficient time to evaluating the necessary toolset and developing expertise in those tools and technologies. The implementation described here uses over 20 cloud-native tools and technologies.
Figure 2: Choreo – Reference Implementation
Developers can concentrate on their business application code and link the source code via a Git repository. GitHub Actions can trigger builds using relevant build packs that support multiple languages. Once built, container images are pushed to a container registry after undergoing a security scan with tools like Trivy. Relevant Kubernetes artifacts are then generated and deployed into a cell within the Kubernetes cluster, adhering to best practices.
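The build-scan-push flow above might be sketched as a GitHub Actions workflow along these lines. The registry, image names, buildpack builder, and severity thresholds are illustrative assumptions (and the sketch assumes the `pack` and `trivy` CLIs are available on the runner); a real pipeline will differ in its details.

```yaml
# Hedged sketch of a build pipeline; names and versions are illustrative.
name: build-and-deploy
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build with Cloud Native Buildpacks
        run: |
          pack build registry.example.com/foo/f1:${{ github.sha }} \
            --builder paketobuildpacks/builder-jammy-base
      - name: Scan image with Trivy
        run: |
          trivy image --exit-code 1 --severity HIGH,CRITICAL \
            registry.example.com/foo/f1:${{ github.sha }}
      - name: Push to registry
        run: docker push registry.example.com/foo/f1:${{ github.sha }}
```

Failing the job when the scanner finds high-severity vulnerabilities (`--exit-code 1`) keeps unscanned or vulnerable images out of the registry by construction.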
A cell provides a network-bounded space, and the services running inside must be explicitly exposed to external consumers. Technically, a cell corresponds to a Kubernetes namespace integrated with Cilium network policies. Cilium serves both as a container network interface (CNI) and a service mesh, utilizing technologies such as eBPF to enhance network security and observability. Cilium's L3/L4 network policies block all network traffic to a cell, permitting only traffic from the external and internal API gateways, along with traffic generated within the cell itself.
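A cell boundary of this kind can be sketched as a CiliumNetworkPolicy. The namespace and gateway labels below are hypothetical; the key idea is that once a policy selects the cell's endpoints, any traffic not explicitly allowed is denied.

```yaml
# Illustrative cell boundary; namespace and label names are assumptions.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: cell-boundary
  namespace: foo              # one cell corresponds to one namespace
spec:
  endpointSelector: {}        # applies to every pod in the cell
  ingress:
    # Traffic originating within the cell (same namespace) is allowed.
    - fromEndpoints:
        - {}
    # Traffic from the API gateways in their own namespace is allowed.
    - fromEndpoints:
        - matchLabels:
            app: api-gateway
            k8s:io.kubernetes.pod.namespace: gateways
```

Because namespaced Cilium policies scope `fromEndpoints` selectors to the policy's own namespace unless a namespace label is given, the empty selector admits intra-cell traffic while everything else is dropped by default.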
Furthermore, all network traffic within the Kubernetes cluster is encrypted using WireGuard, integrated with the Cilium CNI, which leverages eBPF to transparently route traffic for encryption.

In Figure 2, the F1 service running in the "Foo" cell is explicitly exposed to external traffic through the external API gateway. This gateway, which utilizes Envoy proxy, verifies all traffic, ensuring proper authentication and authorization. A fundamental principle of zero-trust security, "never trust, always verify," is enforced through this API gateway implementation. Additionally, the cell-based architecture provides microsegmentation, another key principle of zero-trust security: if a service within one cell is compromised, services running in other cells remain unaffected.
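In a typical Cilium installation, enabling this transparent encryption comes down to a couple of Helm chart values (a minimal sketch; chart version and the many other settings are omitted):

```yaml
# values.yaml fragment for the Cilium Helm chart
encryption:
  enabled: true       # encrypt pod-to-pod traffic between nodes
  type: wireguard     # use WireGuard rather than IPsec
```

Because the encryption is handled in the datapath, application containers need no changes and remain unaware that their traffic is encrypted on the wire.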
The cell gateway facilitates API-first development and inherently supports service discovery and governance, which are vital for collaborative development among autonomous team structures. This aligns well with the principles of DDD and microservices.
eBPF (extended Berkeley Packet Filter) is a powerful technology that boosts the functionality of the Linux kernel without altering its source code. It enables developers to run sandboxed programs directly in the kernel, offering a flexible and efficient way to augment kernel behavior. Cilium leverages eBPF in its built-in Hubble metrics to gather network L3/L4 flow metrics, in conjunction with L7 metrics captured by the Envoy proxy. An in-cluster Prometheus deployment collects these metrics, providing deeper insights essential for troubleshooting production issues. This setup ensures minimal impact on performance, thanks to the capabilities of eBPF technology.
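The metrics path described above maps to Cilium's Helm configuration roughly as follows. This is a hedged fragment, not a complete values file; the exact metric set is a choice, though the names shown are standard Hubble metric names.

```yaml
# values.yaml fragment: Hubble flow metrics plus Prometheus scraping
hubble:
  enabled: true
  metrics:
    enabled:
      - flow          # L3/L4 flow visibility
      - drop          # dropped packets, useful for policy debugging
      - tcp
      - dns
      - httpV2        # L7 HTTP metrics observed via the Envoy proxy
prometheus:
  enabled: true       # expose cilium-agent metrics for Prometheus to scrape
```

With this in place, the in-cluster Prometheus can scrape both the L3/L4 flow metrics produced by eBPF and the L7 metrics produced by Envoy, giving a single place to correlate network behavior during troubleshooting.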
Logs are crucial for debugging. Fluentbit, configured as a daemon set in Kubernetes, collects all container logs and forwards them to OpenSearch. OpenSearch then stores, indexes, and offers a rich query language that facilitates the correlation of multiple container logs. This setup enables troubleshooting of issues and provides a comprehensive log viewing console.
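A minimal Fluent Bit configuration for this log path might look like the following classic-format sketch. The OpenSearch host, index name, and reliance on the stock `cri` parser are assumptions.

```ini
# fluent-bit.conf sketch for a Kubernetes DaemonSet deployment
[SERVICE]
    Parsers_File        parsers.conf

[INPUT]
    Name                tail
    Path                /var/log/containers/*.log
    Parser              cri
    Tag                 kube.*

[FILTER]
    Name                kubernetes
    Match               kube.*

[OUTPUT]
    Name                opensearch
    Match               kube.*
    Host                opensearch.logging.svc
    Port                9200
    Index               container-logs
    Suppress_Type_Name  On
```

The `kubernetes` filter enriches each record with pod, namespace, and container labels, which is what makes cross-container correlation queries in OpenSearch practical.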
Cilium's embedded Envoy proxy can be configured to enforce resilience policies, including automatic retries on HTTP 503 errors. It also supports additional resilience strategies such as circuit breaking, and deployment strategies such as canary and blue/green deployments.
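In Envoy's configuration model, such a retry rule is expressed as a route-level retry policy. The fragment below shows only the relevant Envoy fields with illustrative values, not a complete configuration:

```yaml
# Envoy route-level RetryPolicy fragment (illustrative values)
retry_policy:
  retry_on: retriable-status-codes   # retry only on the listed status codes
  retriable_status_codes: [503]
  num_retries: 3                     # cap attempts to avoid amplifying load
  per_try_timeout: 2s                # bound each attempt individually
```

Capping retries and per-try timeouts matters: unbounded retries against an already-struggling service can turn a transient 503 into a cascading failure, which is precisely what circuit breakers guard against.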
Takeaways
DDD offers a scalable approach to complex software design that aligns seamlessly with cell-based architecture. By integrating Kubernetes and advanced cloud-native technologies such as eBPF, an effective cell-based architecture can be implemented. This architecture provides a scalable and resilient platform, enabling businesses to enhance agility, security and efficiency.