It’s becoming widely understood that testing needs to shift left as organizations rapidly grow their development environments. The question that remains: Exactly how should these organizations approach end-to-end testing and integration testing in Kubernetes and multi-cloud environments?
To address this question, I’ll discuss how a new approach to realizing high-fidelity environments can help an organization scale the development and testing of microservices. I’ll also describe how such environments can be built on Kubernetes. Finally, I’ll review key considerations when implementing a microservices-based lightweight environment solution.
Environment Challenges at Scale
As the number of microservices grows beyond, say, 20, there’s a greater need for high-fidelity environments that are closer to production. Add cloud databases and third-party APIs into the mix, and it becomes clear that traditional approaches no longer work. In growing organizations, meaningful testing becomes possible only in pre-production environments that closely mimic production.
Limitations of Traditional Pre-Production Environments
Deploying a whole namespace, or a separate cluster that contains every single service, has limitations, especially at scale. One is infrastructure cost; another is operational burden. As the number of environments, microservices and development teams increases, these infrastructure and operational costs multiply.
Another issue with multiple environments is that different versions of services and APIs are running in each environment. Even if you were to test everything against one of those environments, the test results aren’t necessarily representative because someone might have committed changes to one of the microservices that you were using as a dependency.
Creative solutions to this problem include timesharing the pre-production environment, where only one team uses the staging environment at a time. However, this kind of workaround tends to incur substantial costs and cause bottlenecks that slow down the development process.
Testing Using Sandboxes
Here’s where sandboxes come in. A sandbox is a lightweight environment that combines the test version of one or more microservices with a shared pool of services corresponding to the latest stable versions of microservice dependencies. The fundamental idea is that you have a shared pool of dependencies that is updated constantly. These updates come from the master branch, which is always running stable versions of the microservices. Your changes are the only things being deployed into what is essentially a clean baseline environment.
This baseline environment contains a representative microservice stack, as shown in the image. Say you have a couple of services that are constantly being updated from master. A sandbox could be something that lives in a branch somewhere, or a pull request that contains a test version of a service. Here, you can see the yellow path taken by the request. At this point, you can tell the request to follow the yellow path instead of the original black path. You can then repeat this process as many times as you want with different sets of microservices. In other words, it’s possible to have separate test environments because each of these request flows is realized in isolation.
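The idea of isolated request flows can be sketched as a simple lookup: each sandbox registers test versions of specific services under a routing key, and every other hop in the chain falls through to the stable baseline. A minimal Python illustration (the service names, routing keys and image tags here are hypothetical):

```python
from typing import Optional

# Stable baseline versions, continuously updated from master.
BASELINE = {
    "frontend": "frontend:stable",
    "route": "route:stable",
    "payments": "payments:stable",
}

# Each sandbox maps a routing key to test versions of specific services.
SANDBOXES = {
    "k1": {"route": "route:pr-421"},
    "k2": {"route": "route:pr-433", "payments": "payments:pr-97"},
}

def resolve(service: str, routing_key: Optional[str]) -> str:
    """Pick the workload version a request hits at this hop."""
    overrides = SANDBOXES.get(routing_key, {})
    return overrides.get(service, BASELINE[service])
```

A request carrying key `k1` sees the test version of `route` but the stable baseline of every other service, which is exactly why many such flows can coexist in one cluster.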
Defining and Deploying Kubernetes Workloads
The next logical question is this: What does it take to build such a sandbox inside Kubernetes? Let’s look at some important considerations and design choices, starting with how workloads are defined and deployed. As shown in the image above, one or more sandboxed services are deployed. A service can be deployed as part of the CI/CD process, or you can take a different approach and include the Kubernetes service definition in the test version itself. One way to realize a test version is to push the entire YAML specification of the service into the cluster, either storing it somewhere or templatizing the baseline deployment.
In practice, we’ve found that usually only a few things change when it comes to sandboxed workloads. It’s rare to have a sandboxed workload whose configuration is different in every possible way from the baseline.
All you need to do is specify a “fork” that references a runtime deployment in Kubernetes, and then declare what has changed. Those deltas are the only customization applied.
```yaml
# Illustrative sandbox spec: only the deltas from the baseline are
# declared (field names here are for illustration).
description: sandbox env
customizations:
  images:
    - image: repo/image:latest
  env:
    - name: DEBUG
      value: "true"
```
Because the test version is derived from the baseline, a key advantage is that any changes to the baseline are automatically reflected in the test version of the workload. You don’t have to update sandboxed deployments every time the baseline is updated.
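Because the sandbox stores only deltas, the effective test workload can be derived on the fly by overlaying those deltas on whatever the baseline currently looks like. A minimal Python sketch of that merge (the spec fields and helper name are illustrative, not any specific product’s API):

```python
import copy

def apply_fork(baseline: dict, customizations: dict) -> dict:
    """Derive the sandboxed workload: start from the live baseline and
    overlay only the declared customizations (image, env vars)."""
    spec = copy.deepcopy(baseline)
    if "image" in customizations:
        spec["image"] = customizations["image"]
    for var in customizations.get("env", []):
        spec.setdefault("env", {})[var["name"]] = var.get("value", "")
    return spec

baseline = {
    "name": "route-service",
    "image": "repo/image:stable",
    "env": {"LOG_LEVEL": "info"},
}
deltas = {"image": "repo/image:latest",
          "env": [{"name": "DEBUG", "value": "true"}]}

sandboxed = apply_fork(baseline, deltas)
```

If the baseline later gains a new environment variable or label, rerunning the merge picks it up automatically, since the sandbox only ever records what differs.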
Stateful Resources in Sandboxes
The next part of the equation involves stateful resources. Some resources like databases and message queues may need additional isolation. Because it may not be possible to isolate them at the request level, you may need a way to deploy these ephemeral stateful resources alongside the sandbox.
For example, take a sandbox with one Kubernetes workload inside of it. You can deploy other resources and include them with that sandbox, like a message queue that it needs to talk to or a database that has been seeded with some data. The idea here is to tie each resource to the sandbox’s lifecycle so that the resource gets cleaned up when the sandbox goes away.
It’s also important to think about whether you need isolation at the infrastructure level. In traditional environments, everything is isolated at the infrastructure level. Here, however, you have a choice. If the data store or the message queue offers some notion of tenancy, logical isolation would be preferable because it’s a much more lightweight mechanism. (For instance, it would be like associating a Kafka topic with a sandbox as opposed to an entire Kafka cluster.)
You’ll need to decide at which level you want to isolate. In any case, those resources need to be included as part of your sandbox. You then supply the resource endpoints and credentials back to your workloads so everything is tied together. The goal is to keep the sandbox self-contained.
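One way to sketch this lifecycle coupling is a context manager that provisions logically isolated resources whose names are derived from the sandbox ID (e.g., a per-sandbox Kafka topic rather than a whole cluster) and tears them down when the sandbox exits. The naming scheme and helper names here are illustrative; real code would call the relevant admin APIs:

```python
import contextlib

@contextlib.contextmanager
def sandbox_resources(sandbox_id: str):
    """Provision logically isolated resources for a sandbox and tear
    them down when the sandbox goes away."""
    created = []

    def provision(kind: str, base_name: str) -> str:
        # Derive a per-sandbox name so tenants never collide, e.g. a
        # Kafka topic "orders--sbx-123" instead of a dedicated cluster.
        name = f"{base_name}--{sandbox_id}"
        created.append((kind, name))  # real code: call the admin API here
        return name

    try:
        yield provision
    finally:
        # Tear down in reverse order of creation when the sandbox exits.
        for kind, name in reversed(created):
            pass  # real code: delete the topic / drop the database

with sandbox_resources("sbx-123") as provision:
    topic = provision("kafka-topic", "orders")
    db = provision("database", "orders-db")
```

The derived names double as the credentials/configuration you hand back to the sandboxed workloads, keeping the sandbox self-contained.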
Request Labeling and Context Propagation
Next comes request routing, which is how the request actually knows which flow it belongs to. As shown in the image, routing key k1 needs to persist all the way through this chain so a local routing decision can be made to send requests to the test workload instead of the baseline workload.
The easiest way to accomplish this goal is through request headers, and thanks to projects like OpenTelemetry, the process is straightforward. The W3C standards define the tracestate and baggage headers, which are well supported in OpenTelemetry. In many languages, adding the instrumentation library propagates these headers without any extra effort. In other cases, you need to read the context on the incoming side of a request and push that routing key, or context, to the outgoing side.
This approach uses primitives that come from tracing, but the tracing backend itself is not needed. The only requirement is that every service persists routing key k1 from each incoming request as it passes the request along to the next service in the chain. In special instances, such as when you’re talking to an external service like Pub/Sub, the service might not support headers. In this scenario, you would use a query parameter to preserve the context all through the chain.
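Stripped to its essentials, propagation is just copying the context headers from each incoming request onto every outgoing request. A stdlib-only Python sketch (the header names follow the W3C specs; the function itself is a hypothetical stand-in for what an OpenTelemetry propagator does):

```python
# Context headers defined by the W3C Trace Context and Baggage specs.
PROPAGATED_HEADERS = ("tracestate", "baggage")

def propagate_context(incoming: dict, outgoing: dict) -> dict:
    """Copy the routing context from an incoming request onto an
    outgoing request so the routing key survives the hop."""
    result = dict(outgoing)
    for name in PROPAGATED_HEADERS:
        for key, value in incoming.items():
            if key.lower() == name:  # header names are case-insensitive
                result[name] = value
    return result

incoming = {"Baggage": "routing-key=k1", "Accept": "application/json"}
outgoing = propagate_context(incoming, {"Content-Type": "application/json"})
```

Note that only the context headers are forwarded; everything else on the incoming request stays put.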
Finally, we come to the question of how the actual routing occurs. This is a much more local decision. At every service, you want to decide whether to reroute the request to a test workload or continue to send it to the baseline. There are many approaches, each with advantages and disadvantages. The easiest is to use a sidecar: a container that runs alongside your main workload and can handle things like network policy. Adding a sidecar container lets you intercept the request and make the routing decision. Another approach is to use Istio or some other service mesh, as the mesh itself can be taught to route these requests.
For performance-sensitive development efforts, there’s also the option of using L7 protocol interceptors (e.g., HTTP and gRPC) to make the routing decision in the application layer itself. Unless you already have a mesh, the sidecar route is probably the easiest to adopt.
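The per-hop decision a sidecar or interceptor makes can be sketched as follows: parse the routing key out of the baggage header, pick the sandboxed upstream if one is registered for this service, and otherwise fall through to the baseline. The upstream names and key format here are illustrative:

```python
# Sandboxed upstreams for *this* service, keyed by routing key.
SANDBOX_UPSTREAMS = {"k1": "route-service-test:8080"}
BASELINE_UPSTREAM = "route-service:8080"

def parse_baggage(value: str) -> dict:
    """Parse a W3C Baggage header value into key/value pairs."""
    entries = {}
    for item in value.split(","):
        key, sep, val = item.strip().partition("=")
        if sep:
            entries[key] = val
    return entries

def choose_upstream(headers: dict) -> str:
    """Local, per-hop decision: reroute to the sandboxed workload if the
    request carries a matching routing key; otherwise use the baseline."""
    key = parse_baggage(headers.get("baggage", "")).get("routing-key")
    return SANDBOX_UPSTREAMS.get(key, BASELINE_UPSTREAM)
```

Because the decision is purely local, each service only needs to know about its own sandboxed versions; the routing key does the rest of the coordination.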
New Possibilities with Sandboxes
Now let’s talk a bit about what these sandboxes enable. First, you can make a backend change no matter how deep it is in your stack and test that change from the front end at any point in time. As a result, you’ll have full confidence that your feature does what it’s supposed to do before you merge code.
Second, multiple sandboxes can work together. A feature that spans different microservices can be combined into one routing context so requests flow through all of the test workloads together. That’s one way to realize cross-service testing before moving to a shared staging environment.
Third, because sandboxes are so lightweight – they typically spin up in under 30 seconds – it becomes cost-effective to perform testing at scale. You can easily spin up a lightweight environment to run a few integration tests against real environments. Using this model speeds up the development life cycle considerably without incurring a great deal of cost.
Sandboxes enable you to:
- Test every backend change from the front end (e.g., web app and mobile app).
- Test changes spanning different microservices before merging.
- Facilitate testing across the entire stack.
- Spin up thousands of lightweight environments quickly and easily at low cost.
Teams can create many sandboxes within one physical Kubernetes cluster without the infrastructure expense or operational burden of duplicating physical environments. Large companies like Uber and Lyft have built scalable, in-house testing solutions that rely on this notion of a sandbox. Other companies, like DoorDash, use sandboxes to realize environments that use minimal resources and spin up instantly.