Don’t Let Your Service Mesh Become A Service Mess

Service mesh is a hot topic in the microservices world because it is often considered to be the north star architecture for cloud-native applications. While a service mesh environment can enhance the traffic management and security of microservices’ communications and provide a complete picture of what’s happening in those applications, it can be difficult to implement and complex to manage. However, there are steps you can take to make the process easier.

Decide What’s Important and Map Out Your Journey

Before you start your service mesh journey, you must determine what’s important to you and plan your journey.

For many companies, enabling zero-trust communications among microservices has become a critical imperative, but every organization’s needs are different. Perhaps you need the advanced traffic management that service mesh can provide. Or the enhanced observability that sidecar proxies offer.

Whatever your needs, you must prioritize them and get buy-in from your developers, SREs and SecOps teams before you start, so you can focus your efforts. A service mesh implementation can fail when you try to boil the ocean and implement everything at once.

Once you’re confident you correctly prioritized your goals, create a roadmap for the service mesh journey, similar to any journey to an unfamiliar place. The roadmap should lay out the order in which you will tackle implementation, and identify how each step will align with IT and business goals. For example, you may decide to rank enhanced observability – to speed issue resolution and improve application uptime – higher than the goal of better traffic management. Use that ranking to stay focused on what you want to achieve with a service mesh and the benefits it will bring.

Choose Your Service Mesh Wisely

While there are many service mesh control planes available, they’re not all equal, and each has different strengths. When choosing a service mesh, first make sure it supports the environment you want to run it in. If you are tied to systems like Mesos, your own proprietary or legacy architecture, or a specific public cloud, confirm that the service mesh supports it.

Second, decide which service mesh control plane to deploy. While all service mesh control planes provide similar basic functionality, they have different features and levels of maturity. To determine whether a service mesh control plane suits your use case, research how it stacks up in the areas that are important to you. In general, Istio is ahead of the pack. For example, Istio has been out in front with mutual TLS for zero-trust security between microservices, whereas others are still catching up.

Third, evaluate how much complexity you can manage comfortably with the skill set and resources you have. As you add features, grow the size of the service mesh or add multiple clusters, things become more complex. Remember, it’s easy to underestimate complexity because you don’t know what lies ahead – know your limits and allow a buffer.

Select the best service mesh architecture based on your key “must-haves” – observability, security and traffic management – and the skill sets your organization already has. Ask yourself if you really need a sidecar per pod, or if an alternative or variant architecture like Citrix® service mesh lite can satisfy your needs.

Plan for Surprises and Complexity

No matter how much you plan, you will come across surprises as you implement your service mesh. So plan, plan, and plan some more. You’ll be glad you did.

Know That Proxies are Not So Transparent

More than likely, you’ll discover that proxies can be quite opaque. Normally, when a microservice makes a call to a non-existent or stressed resource, the call times out. With a sidecar in the path, however, the local proxy accepts the request almost immediately, so the calling microservice believes its request has been received even when the upstream never responds. The mere presence of the proxy can therefore distort application timeouts, and your timeouts may require careful adjustment. Read best practices for timeouts at http://github.com/chemicL/envoy-timeouts.
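To make this concrete, here is a minimal sketch of client-side timeout handling, assuming a Python service that calls a downstream API with the requests library; the service URL and timeout values are illustrative, not a recommendation.

```python
# A minimal sketch, assuming the `requests` library and a hypothetical service URL.
# With a sidecar in the path, the connection to the local proxy succeeds almost
# instantly even if the upstream is down, so a connect timeout alone no longer
# catches failures; an end-to-end read timeout still bounds the whole call.
import requests

INVENTORY_URL = "http://inventory.default.svc.cluster.local/items"  # hypothetical

def fetch_items():
    try:
        # timeout=(connect, read): keep connect short, but rely on the read
        # timeout to cap how long the request may take through the proxy.
        resp = requests.get(INVENTORY_URL, timeout=(0.5, 3.0))
        resp.raise_for_status()
        return resp.json()
    except requests.exceptions.ReadTimeout:
        # The proxy accepted the connection, but the upstream never answered in time.
        return None
```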

Also, proxies are not transparent to HTTP traffic. Many proxies convert HTTP header names to lowercase for consistency and more efficient processing; HTTP/2, in fact, requires lowercase header field names. If your application relies on case-sensitive HTTP headers, the proxy’s behavior may break it. You need to ensure that the nuances of proxy communication do not break your apps, and be prepared to fine-tune proxies or applications to fit your specific ecosystem.
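One defensive pattern – a sketch, not a prescription – is to look up headers case-insensitively in your application rather than assuming a particular casing survives the proxy:

```python
# A minimal sketch: normalize header names before lookup so the application
# behaves the same whether or not a proxy has lowercased them.
def get_header(headers: dict, name: str, default=None):
    normalized = {k.lower(): v for k, v in headers.items()}
    return normalized.get(name.lower(), default)

# Works the same with the original or a proxy-lowercased header name.
assert get_header({"X-Request-ID": "abc123"}, "x-request-id") == "abc123"
assert get_header({"x-request-id": "abc123"}, "X-Request-ID") == "abc123"
```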

Test Early and Test Often

Without a crystal ball, you don’t know what’s going to break. A service mesh is a complex, distributed system with many moving parts, and a lot of opportunity for things to fail. When an application does fail, you need to work out if it’s an issue with the application, the sidecar or something else. So, make sure you implement incrementally, monitor continuously and test frequently.

To do this, a full observability stack is a must-have, including logging, metrics, distributed tracing and service graphs. Distributed tracing and service graphs, in particular, are critical for service observability. Distributed tracing follows a request as it flows through your microservices, building a latency map across each hop and helping you troubleshoot latency issues. Service graphs are dynamic, graphical representations of microservices, their interdependencies and health, and provide an easy way to visualize your environment and spot issues as they arise.
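To illustrate the tracing piece, here is a minimal sketch of instrumenting a single service with the OpenTelemetry Python SDK; the service name, span name and console exporter are illustrative, and a real deployment would export to a collector or tracing backend instead.

```python
# A minimal tracing sketch using the OpenTelemetry SDK (opentelemetry-sdk package).
# Service and span names are illustrative.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Identify this service in trace data and register a span exporter.
provider = TracerProvider(resource=Resource.create({"service.name": "orders"}))
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def handle_order(order_id: str) -> None:
    # Each hop records a span, so the resulting latency map shows where time is spent.
    with tracer.start_as_current_span("handle_order") as span:
        span.set_attribute("order.id", order_id)
        # ... call downstream services here, propagating the trace context ...

handle_order("42")
```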

While this seems like common sense, continuous testing is frequently forgotten or skipped, leading to frustration and, often, derailing projects. Being proactive always helps. Consider writing an end-to-end, 24×7 test service that exercises your microservices on an ongoing basis.
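A sketch of what such a probe service might look like, assuming Python with the requests library and hypothetical health endpoints:

```python
# A minimal 24x7 probe sketch: periodically exercise key endpoints end to end
# and log failures so regressions surface quickly. Endpoints are hypothetical.
import logging
import time

import requests

ENDPOINTS = [
    "http://catalog.default.svc.cluster.local/healthz",
    "http://checkout.default.svc.cluster.local/healthz",
]
INTERVAL_SECONDS = 60

logging.basicConfig(level=logging.INFO)

def probe_forever():
    while True:
        for url in ENDPOINTS:
            try:
                resp = requests.get(url, timeout=5)
                resp.raise_for_status()
                logging.info("OK %s (%.0f ms)", url, resp.elapsed.total_seconds() * 1000)
            except requests.RequestException as exc:
                logging.error("FAIL %s: %s", url, exc)
        time.sleep(INTERVAL_SECONDS)

if __name__ == "__main__":
    probe_forever()
```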

Prepare for a Tsunami of Revisions

What starts as a few sidecars today can become thousands tomorrow, and you need to be prepared. You will likely need to tune the sidecars’ default CPU and RAM allocations to keep resource consumption in check. Once you start implementing your service mesh, revisions will also come at you like a tidal wave. You must have a plan for upgrading thousands of sidecar proxies without disrupting your production applications. Think of it like having to change every sidecar on every motorcycle in a caravan without anyone noticing – while people are riding in them. Your plan had better be good.
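One way to avoid upgrading every sidecar at once is to restart workloads in small batches so injection picks up the new proxy version gradually. The sketch below is only illustrative, not a production upgrade plan: it assumes the official Kubernetes Python client, sidecar injection on pod restart, and made-up namespace, batch size and pause values, and it borrows the pod-template annotation patch that kubectl rollout restart uses.

```python
# An illustrative sketch: restart deployments in small batches so upgraded
# sidecars roll out gradually. Assumes the `kubernetes` Python client and
# kubeconfig credentials; namespace, batch size and pause are made up.
import time
from datetime import datetime, timezone

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
apps = client.AppsV1Api()

def rolling_restart(namespace: str = "default", batch_size: int = 5, pause_s: int = 300):
    deployments = apps.list_namespaced_deployment(namespace).items
    for i in range(0, len(deployments), batch_size):
        for dep in deployments[i:i + batch_size]:
            # Patching this pod-template annotation triggers a rolling restart,
            # which re-injects the (upgraded) sidecar proxy into new pods.
            patch = {"spec": {"template": {"metadata": {"annotations": {
                "kubectl.kubernetes.io/restartedAt": datetime.now(timezone.utc).isoformat()
            }}}}}
            apps.patch_namespaced_deployment(dep.metadata.name, namespace, patch)
        time.sleep(pause_s)  # let each batch stabilize before touching the next

rolling_restart()
```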

They say that intelligence is learning from your mistakes, but wisdom is learning from the mistakes of others. Service meshes promise so much in terms of security, advanced traffic management and observability, but they can be complex to implement. Plan carefully and be ready to make adjustments along the way, and your journey will be smoother – and maybe even fun!

Pankaj Gupta

Pankaj is senior director of cloud-native application delivery solutions at Citrix. He advises customers on hybrid multicloud microservices application delivery strategies. In prior roles at Cisco, he spearheaded strategic marketing initiatives for its networking, security and software portfolios. Pankaj is passionate about working with the DevOps community on best practices for microservices- and Kubernetes-based application delivery.
