
Bigbasket’s Experience with Istio

by none-da

The Story

Bigbasket started moving some of its core components to microservices and wanted to give Istio v0.8.0 a go, as it gives you

  • Telemetry
  • A single, common load balancer for all services, whether HTTP (path-based), gRPC (header-based) or even plain TCP.
  • Traffic shifting (v1/v2) with simple YAML configuration (see the sketch below).
  • Circuit breaking, traffic mirroring, fault injection and more.

for all microservices FOR FREE. Secondly, since Istio itself is installed as (k8s) pods, scaling Istio's control plane should be as simple as scaling application pods. Right? Except it's NOT! Keep reading…
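To make the traffic-shifting item above concrete, here is a minimal sketch of the YAML involved. The service name `hello` and the v1/v2 subsets are hypothetical, not from our actual setup; this just shows the shape of a 90/10 split using Istio's v1alpha3 networking API.

```yaml
# Hypothetical example: shift 10% of traffic to v2 of a "hello" service.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: hello
spec:
  hosts:
    - hello
  http:
    - route:
        - destination:
            host: hello
            subset: v1
          weight: 90
        - destination:
            host: hello
            subset: v2
          weight: 10
---
# Subsets map to pod labels via a DestinationRule.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: hello
spec:
  host: hello
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
```

Shifting more traffic to v2 is then just a matter of editing the weights and re-applying, which is exactly the appeal.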

We picked v0.8.0 (and installed it without authentication, using istio-demo.yaml).

Our initial benchmarks with a simple hello-world service showed that Istio, with all its proxy containers and its policy, ingressgateway and pilot pods, adds a ton of latency to our endpoints' responses. It gets worse when you have a chain of microservices, where latency is added at each hop. These are our benchmark results, recorded with Fortio (http) against a simple Flask-based HTTP service running as a Level-1 service, calling a similar service at Level-2, which in turn calls a similar service at Level-3.
Here are the docker images if you would like to give it a try.
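For reference, a single level of that chain can be deployed with something like the sketch below. The names (`level-1`, `level-2`), the image and the `FORWARD_TO` variable are placeholders for illustration, not the actual images linked above.

```yaml
# Hypothetical sketch of one level of the benchmark chain: a Flask HTTP
# service at Level-1 that forwards requests to Level-2. Image, port and
# env var names are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: level-1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: level-1
  template:
    metadata:
      labels:
        app: level-1
    spec:
      containers:
        - name: app
          image: example/flask-hello:latest   # placeholder image
          ports:
            - containerPort: 5000
          env:
            - name: FORWARD_TO                # next hop in the chain
              value: "http://level-2:5000"
---
apiVersion: v1
kind: Service
metadata:
  name: level-1
spec:
  selector:
    app: level-1
  ports:
    - port: 5000
      targetPort: 5000
```

With the namespace labeled for sidecar injection, each hop picks up an Envoy proxy, which is where the per-hop latency comes from.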

Do you see the final column? The 99th-percentile latencies introduced by Istio were NOT at all acceptable. As application pods started to scale, scaling Istio's control plane invited more trouble. On top of that, Istio's pods were leaking system resources like CPU. Here are a few examples:

We also saw cases where new pods were NOT scheduled automatically because istio-sidecar-injector was failing with a "certificate signed by unknown authority" error. The only fix suggested was recreating the sidecar-injector pods.

Onto Istio v1.0

While some of the above issues are claimed to be fixed in v1.0, others are not. We also found cases where Istio 1.0 takes far more system resources than 0.8.0.

After our failed experience with 0.8.0, we started looking into Istio v1.0.2, which was released about a month ago. With the control plane installed using the same istio-demo.yaml, we saw the ingressgateway, pilot and policy pods take a ton of system resources, so their HPA kicks in pretty fast. (As a matter of fact, the istio-ingressgateway pod now has a 1% requested CPU quota, as opposed to none in v0.8.0, but under load this appears to be breached easily, so the HPA kicks in pretty fast!!) p99 latencies also shot up as the (Istio) pod count increased.

Secondly, although Istio's default installation is NOT meant for production, there are not many docs or guidelines out there that clearly explain how to tweak the existing settings to make it production-ready. Although Istio is used in production at scale by some big companies like IBM and eBay, wide adoption still looks like a distant dream. One shouldn't ignore the fact that this is going to change, as the community is growing rapidly. Reducing the latency overhead to single digits appears to be one of the goals Istio's contributors have set for the end of this year.
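To illustrate the kind of tuning involved in moving away from the demo profile, here is a hedged sketch of Helm values overrides for a non-demo install. The key names follow the Istio 1.0-era Helm chart and the numbers are arbitrary starting points; verify both against the chart shipped with your exact Istio version.

```yaml
# Illustrative (not verified for every 1.0.x release) Helm values overrides
# for a non-demo Istio install. Check key names against your chart version.
pilot:
  traceSampling: 1.0        # demo profile traces ~100% of requests; 1% is saner
  resources:
    requests:
      cpu: 500m
      memory: 2Gi
global:
  proxy:
    resources:              # per-sidecar Envoy requests/limits
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 500m
        memory: 256Mi
mixer:
  resources:                # istio-policy / istio-telemetry sizing
    requests:
      cpu: 1000m
      memory: 1Gi
```

Mandar Jog's comment below points at concrete sizing guidance along exactly these lines.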

The Learning

Any middleware will always incur costs. With Istio, the cost was a little more than expected.
So for now we (at BigBasket) have to set Istio aside until it becomes stable, performant, robust and, most importantly, "production ready at scale".

The Road Ahead

Since we (at BigBasket) are trying microservices for the first time, with the focus primarily on robustness, performance and, most importantly, scalability, any middleware solution, be it a service mesh or an API gateway, will be picked only if it doesn't add too much overhead in terms of response time. Any solution with a single-digit overhead should be perfectly OK.


9 comments

Mandar Jog October 20, 2018 - 12:17 am

@none-da it would be great if you could run a performance test with a setup that does not use istio-demo.yaml.
For example, the demo has tracing turned up to 100%. There are also many other parameters that are not tuned in the demo, since it is a demo of Istio's functions.

I have explained this in the issue https://github.com/istio/istio/issues/7879#issuecomment-430888771 and provided sizing guidelines.

Istio telemetry does take extra resources (and we are working to reduce CPU usage), but it also does "new" work that was not being done before. It collects mesh metrics, so this should be looked at as the cost of collecting metrics, not as overhead of Istio.

Look at the resources section for specific guidance: https://github.com/mandarjog/tools/blob/p1/perf/istio/values-istio-test.yaml

none-da October 29, 2018 - 12:01 pm

That's good news, Mandar! We will try the recommendations.

James October 21, 2018 - 5:28 pm

Appreciate the insights into Istio! Sounds like some of the resource issues are being addressed, which is heartening. Though, as you acknowledge, *any* middleware is going to add some overhead… I'd be curious whether you adopted or tried out something similar!

none-da October 29, 2018 - 11:57 am

We tried Kong for a similar need.

Aghi November 1, 2018 - 2:15 am

Hi None – how was your experience with Kong?

FYI: in the coming weeks, Kong will release 1.0RC with mesh/service-proxy support.

Ali October 23, 2018 - 5:53 pm

Hi,
Which environment did you use for deploying Istio? Kubernetes or Docker?
Thanks.

none-da October 29, 2018 - 11:56 am

Kubernetes v1.10

Ali November 5, 2018 - 8:37 pm

May I ask if you have tried the Bookinfo application on Istio, and how much latency you get for it?
I have been trying to get some kind of benchmark for Istio, to see how fast it can perform with, for example, the Bookinfo application, so that I can compare my results against it. How much should the latency be with this example, i.e. the delay introduced by Istio and the Envoys?
Thanks.

Renjish November 7, 2018 - 7:15 pm

@none-da: How many microservices do you have in total? Istio can be an overhead if you have only a few service workloads.

