- A single common load balancer for all services, whether HTTP (path-based), gRPC (header-based), or even plain TCP.
- Traffic shifting (v1/v2) with simple YAML configuration.
- Circuit breaking, traffic mirroring, fault injection, and more
for all microservices, FOR FREE. Secondly, since Istio's control plane is itself installed as Kubernetes pods, scaling it should be as simple as scaling application pods. Right? Except it's NOT! Keep reading…
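To make the traffic-shifting bullet concrete: in Istio's v1alpha3 API a weighted v1/v2 split is just a `VirtualService` with two weighted destinations. A minimal sketch (the `hello-world` host and subset names here are hypothetical; the `v1`/`v2` subsets would be defined in a companion `DestinationRule`):

```yaml
# Sketch of weighted traffic shifting for a hypothetical "hello-world" service.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: hello-world
spec:
  hosts:
  - hello-world
  http:
  - route:
    - destination:
        host: hello-world
        subset: v1
      weight: 90   # 90% of traffic stays on v1
    - destination:
        host: hello-world
        subset: v2
      weight: 10   # 10% is shifted to v2
```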
We picked v0.8.0 (and installed it without authentication using istio-demo.yaml).
Our initial benchmarks with a simple hello-world service showed that Istio, with all its proxy containers, policy, ingressgateway, and pilot pods, adds a ton of latency to our endpoints' responses. It gets worse when you have a chain of microservices, since latency is added at each hop. These are our benchmark results, recorded using Fortio (HTTP) against a simple Flask-based HTTP service running as the Level-1 service, which calls a similar service at Level-2, which in turn calls a similar service at Level-3.
Here are the docker images if you would like to give it a try.
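For reference, one level of such a service chain can be sketched in a few lines. This is a stdlib-only approximation (the services we actually benchmarked were Flask-based, and the `CALL_NEXT_URL` environment variable name is our own, purely for illustration):

```python
# Sketch of one level in the hello-world chain (stdlib only; the actual
# benchmarked services used Flask). CALL_NEXT_URL is a hypothetical env var:
# set it to the next level's URL, or leave it unset at the last level.
import os
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

NEXT_URL = os.environ.get("CALL_NEXT_URL")

class HelloHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if NEXT_URL:
            # Forward to the next level and prepend our own greeting,
            # so latency accumulates at each hop, as in the benchmark.
            body = "hello -> " + urlopen(NEXT_URL).read().decode()
        else:
            body = "hello from the last level"
        data = body.encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, *args):
        # Keep request logging quiet during load tests.
        pass

def serve(port=8080):
    HTTPServer(("", port), HelloHandler).serve_forever()
```

Each level can then be load-tested with Fortio, e.g. `fortio load -qps 100 -c 8 -t 60s http://<level-1>:8080/`.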
Do you see the final column? The p99 latencies introduced by Istio were NOT at all acceptable. As application pods started to scale, scaling Istio's control plane invited more trouble. Secondly, Istio's pods were leaking system resources such as CPU and memory. Here are a few examples:
- Istio-tracing pod Memory Leak issue: https://github.com/istio/istio/issues/5782
- Istio-proxy container taking High memory : https://github.com/istio/istio/issues/7912
- At one point in our PROD-equivalent tests:
  - Some of the istio-policy pods took 3.9 GB of memory.
  - Some of the istio-ingressgateway pods took 1.4 GB of memory.
  - The istio-pilot pod count reached a max of 40 replicas (this was over a longer period of time and not under any heavy traffic): https://github.com/istio/istio/issues/6962
- We also saw cases where new pods were NOT scheduled automatically because istio-sidecar-injector was failing with a "certificate signed by unknown authority" issue. The only fix suggested was recreating the sidecar-injector pods.
Onto Istio v1.0
While it is claimed that some of the above issues are fixed in v1.0, some are not. We also found cases where Istio 1.0 takes much more system resources than 0.8.0.
After our failed experience with 0.8.0, we started looking into Istio v1.0.2, which was released close to a month back. Once Istio's control plane is installed using the same istio-demo.yaml, we have seen the ingressgateway, pilot, and policy pods take a ton of system resources, so their HPA kicks in pretty fast. (As a matter of fact, the istio-ingressgateway pod now has a 1% requested CPU quota, as opposed to none in v0.8.0, but under load this is breached pretty easily, so the HPA kicks in fast!!) p99 latencies also shot up with the increased (Istio) pod count.

Secondly, although Istio's default installation is NOT meant for production, there are not many docs or guidelines out there that clearly explain how to tweak the existing settings to make it production-ready. Although Istio is used in production at scale at some big companies like IBM and eBay, wide adoption by many companies still looks like a distant dream. One shouldn't ignore the fact that this is going to change, as the community is growing rapidly. Reducing the latency overhead to a single-digit number appears to be one of the goals Istio's contributors have set for the end of this year.
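For anyone who wants to experiment with tuning anyway, the demo profile's requests and limits can be overridden per component. A hypothetical sketch of a strategic-merge patch for the pilot Deployment (the values are purely illustrative, not recommendations; in these releases istio-pilot's main container is named `discovery`):

```yaml
# Hypothetical patch raising istio-pilot's resource requests and limits.
# Could be applied with kubectl patch against the istio-pilot Deployment
# in the istio-system namespace.
spec:
  template:
    spec:
      containers:
      - name: discovery
        resources:
          requests:
            cpu: 500m
            memory: 2Gi
          limits:
            cpu: "1"
            memory: 4Gi
```

Raising the requests also changes the HPA math, since the autoscaler compares observed usage against the requested quota, which is exactly why the tiny default requests cause it to kick in so fast.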
Any middleware will always incur costs. With Istio, the cost was a little more than expected.
So for now, we (at BigBasket) have to park Istio aside until it becomes stable, performant, robust, and most importantly "production ready at scale".
The Road Ahead
Since we (at BigBasket) are trying microservices for the first time, with the focus primarily on robustness, performance, and most importantly scalability, every middleware solution, be it a service mesh or an API gateway, will be picked only if it doesn't add too much overhead in terms of response time. Any solution with a single-digit overhead should be perfectly OK.