Erik Jankovič | @kirecek

Published on 28 Mar 2021

[503 Stories] Istio protocol selection and h2c

I have decided to write a series of posts about HTTP 503 errors. 503s are sometimes very confusing and hard to debug, so I'd like to share a couple of interesting misconfigurations that lead to these errors.

In this post I'll describe a 503 which happened in one of our Istio meshes at the company (kiwi.com). The idea was simple: create a ServiceEntry for an external service running in another Istio cluster so we can enable cross-cluster mTLS communication. For the purpose of this blog, let's assume the external service is httpbin.

Setup

First of all, if you would like to get hands-on with the problem described in this blog, you can apply this manifest, which deploys all the necessary components (Istio, the httpbin service, gateways):

$ kubectl apply -f https://kirecek.me/manifests/istio-503-demo.yaml

And the mentioned ServiceEntry:

cat << EOF | kubectl -n demo apply -f -
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: httpbin
spec:
  addresses:
  - 240.0.0.5
  endpoints:
  - address: istio-ingressgateway.istio-system
    ports:
      http1: 15443
  hosts:
  - httpbin.demo.example.internal
  location: MESH_INTERNAL
  ports:
  - name: http1
    number: 8080
    protocol: http
  resolution: DNS
EOF

Using the address of the local ingress gateway (address: istio-ingressgateway.istio-system) is a little trick to test "multi-cluster" traffic within a single cluster. In this article I only care about replicating the request flow, so this is good enough.
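To sanity-check that the sidecars picked up the new ServiceEntry, you can list the Envoy clusters generated for the curl pod. A quick sketch, assuming istioctl is installed and the curl pod carries the app=curl label (exact output columns vary between Istio versions):

$ POD=$(kubectl -n demo get pod -l app=curl -o jsonpath='{.items[0].metadata.name}')
$ istioctl -n demo proxy-config clusters $POD | grep example.internal

If httpbin.demo.example.internal shows up as an outbound cluster, the configuration was distributed to the proxies.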

Service Unavailable

Of course, every new configuration should be tested, so let's give it a try and make an HTTP request to httpbin.demo.example.internal.

$ kubectl -n demo exec -it deploy/curl -- curl httpbin.demo.example.internal:8080
upstream connect error or disconnect/reset before headers. reset reason: connection termination
$ kubectl -n demo exec -it deploy/curl -- curl httpbin.demo.example.internal:8080 -I
HTTP/1.1 503 Service Unavailable
content-length: 95
content-type: text/plain
date: Sun, 28 Mar 2021 18:14:22 GMT
server: envoy
x-envoy-upstream-service-time: 82

And here we go. The promised 503.

I'm asking myself whether there is something wrong with the configuration or the service itself is misbehaving. The setup I shared above also contains an HTTP gateway for httpbin, so I wonder if other requests will result in the same error code. I highly doubt that, though, since a ServiceEntry simply can't affect the existing setup.

$ kubectl -n demo exec -it deploy/curl -- curl istio-ingressgateway.istio-system:80 -H 'Host: httpbin.example.com' -I
HTTP/1.1 200 OK
server: envoy
date: Sun, 28 Mar 2021 18:28:50 GMT
content-type: text/html; charset=utf-8
content-length: 9593
access-control-allow-origin: *
access-control-allow-credentials: true
x-envoy-upstream-service-time: 60

As expected, this works. Just to cover all use-cases, let me test one more request inside the local cluster.

$ kubectl -n demo exec -it deploy/curl -- curl httpbin.demo.svc.cluster.local:8080 -I

HTTP/1.1 200 OK
server: envoy
date: Sun, 28 Mar 2021 18:30:47 GMT
content-type: text/html; charset=utf-8
content-length: 9593
access-control-allow-origin: *
access-control-allow-credentials: true
x-envoy-upstream-service-time: 8

Clearly, the issue is present only for the external host httpbin.demo.example.internal. Unfortunately, the error message (upstream connect error) and the 503 status code itself don't provide much information about the failure. Let's check the logs to get a better idea of what's happening.

$ kubectl -n istio-system logs -l app=istio-ingressgateway -f

[2021-03-28T18:12:58.505Z] "- - -" 0 - - - "-" 5005 8286 2945758 - "-" "-" "-" "-" "10.244.0.36:80" outbound_.8080_._.httpbin.demo.svc.cluster.local 10.244.0.33:42484 10.244.0.33:15443 10.244.0.35:54882 outbound_.8080_._.httpbin.demo.example.internal -

TIP: Enable Envoy debug logs by exec-ing into the pod and sending a POST request to the admin endpoint: curl -XPOST localhost:15000/logging?level=debug.

Root cause

Nothing suspicious in the ingress logs. On the contrary, the route / endpoint in the log is correct. Let's jump to the next hop - the Istio sidecar logs:

$ kubectl -n demo logs deploy/httpbin -c istio-proxy -f

[2021-03-28T19:28:40.590Z] "GET / HTTP/1.1" 200 - via_upstream - "-" 0 9593 6 5 "-" "curl/7.58.0" "ccc4a992-5ed9-4978-9d18-5f1d16e0d964" "httpbin.demo:8080" "127.0.0.1:80" inbound|80|| 127.0.0.1:41530 10.244.0.36:80 10.244.0.35:57910 outbound_.8080_._.httpbin.demo.svc.cluster.local default
[2021-03-28T19:13:33.663Z] "GET / HTTP/2" 503 UC upstream_reset_before_response_started{connection_termination} - "-" 0 95 2 - "-" "curl/7.58.0" "cb36325b-a6d4-4b30-baf1-039645db4d9d" "httpbin.demo.example.internal:8080" "127.0.0.1:80" inbound|80|| 127.0.0.1:60970 10.244.0.36:80 10.244.0.33:49318 outbound_.8080_._.httpbin.demo.example.internal default

What's confusing here is that the request to httpbin.demo.example.internal is using HTTP/2 while the other, working one is using HTTP/1.1. To be honest, this behaviour surprised me a bit. The control plane has a global policy which should upgrade all connections to HTTP/2, so I would not expect different behaviour here. I'll get back to this later.
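For reference, that global policy lives in the mesh configuration. Here is a sketch of the relevant part of the istio ConfigMap (everything else omitted; your values may differ):

apiVersion: v1
kind: ConfigMap
metadata:
  name: istio
  namespace: istio-system
data:
  mesh: |
    # upgrade HTTP/1.1 connections between sidecars to HTTP/2 (h2c)
    h2UpgradePolicy: UPGRADE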

It's obvious now: only the HTTP/2 request is problematic, and I'm sure the httpbin service does not support the HTTP/2 protocol.

$ kubectl -n demo exec -it deploy/curl -- curl httpbin.demo:8080 --http2-prior-knowledge

upstream connect error or disconnect/reset before headers. reset reason: connection termination

Or:

$ kubectl -n demo exec -it deploy/httpbin -c httpbin -- curl localhost --http2-prior-knowledge

curl: (16) Error in the HTTP2 framing layer
command terminated with exit code 16

I would like to see more details about the error, so I need to dig deeper and fire up Wireshark. In order to inspect packets without hacking the pods on my own, I can just use ksniff - a kubectl plugin that utilizes tcpdump and Wireshark. Open Wireshark and capture the local lo interface in the httpbin pod. I don't care about eth0 because the traffic there is encrypted.

TIP: instead of tcpdump or Wireshark filters, I use the status.sidecar.istio.io/port: "0" pod annotation during debugging to omit annoying healthchecks from the capture.
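For completeness, a sketch of where that annotation lives - in the pod template of the httpbin Deployment:

spec:
  template:
    metadata:
      annotations:
        # disable the sidecar's status port so kubelet healthchecks
        # don't end up in the packet capture
        status.sidecar.istio.io/port: "0"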

$ kubectl sniff -n demo -p $(kubectl -n demo get pods -l app=httpbin -o=name | cut -d '/' -f 2) -i lo

$ kubectl -n demo exec -it deploy/curl -- curl httpbin.demo.example.internal:8080 -I # make request again so it's captured

In the Wireshark capture, notice the selected GOAWAY frame, which is used to initiate the shutdown of a connection or to signal serious error conditions. Also note the TCP reset (RST) at the end of the stream - this is what actually resulted in the UC Envoy error code in the sidecar logs. If we take a closer look we will see the detailed error message PROTOCOL_ERROR, which is totally expected since our httpbin app does not support HTTP/2, as I previously mentioned.
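If you prefer staying in the terminal over the Wireshark UI, the same frames can be filtered out with tshark. A sketch, assuming ksniff's -o flag to dump the capture to a file first; since the plaintext HTTP/2 flows over port 80 on lo, we also tell the dissector to decode that port as http2:

$ kubectl sniff -n demo -p <httpbin-pod> -i lo -o /tmp/httpbin.pcap
$ tshark -r /tmp/httpbin.pcap -d tcp.port==80,http2 -Y 'http2.type == 7 || tcp.flags.reset == 1'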

Solution

So now we know the problem: an incorrect protocol selected by Istio. But this raises two questions: why was this request upgraded to HTTP/2 in the first place, and why did the other requests work just fine?

Part of the answer to both questions sits in the protocol selection docs, especially in explicit protocol selection. As stated in the docs, we can specify the protocol in the name of the service port: <protocol>[-<suffix>]. If we use http[-suffix] for the httpbin service, the Envoy proxy will "downgrade" all HTTP/2 requests to HTTP/1 and the 503 should go away.

Add the named port:

$ cat << EOF | kubectl -n demo apply -f -
apiVersion: v1
kind: Service
metadata:
  name: httpbin
  labels:
    app: httpbin
spec:
  ports:
  - port: 8080
    targetPort: 80
    name: http
  selector:
    app: httpbin
EOF
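As a side note, on newer Kubernetes (1.18+) and Istio versions the appProtocol field on the service port should achieve the same thing and, as far as I know, takes precedence over the port-name convention:

  ports:
  - port: 8080
    targetPort: 80
    name: http
    appProtocol: http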

Make another request to httpbin.demo.example.internal:

$ kubectl -n demo exec -it deploy/curl -- curl httpbin.demo.example.internal:8080 -I

HTTP/1.1 200 OK
server: envoy
date: Sun, 28 Mar 2021 21:40:59 GMT
content-type: text/html; charset=utf-8
content-length: 9593
access-control-allow-origin: *
access-control-allow-credentials: true
x-envoy-upstream-service-time: 10

And of course, thanks to the explicit http protocol, there are no more HTTP/2 frames in the capture.

And for this simple reason - an explicit protocol - there was no issue with httpbin.example.com in the beginning. The httpbin Gateway used for the httpbin.example.com host uses the HTTP protocol, so h2UpgradePolicy does not have any impact there.

$ kubectl -n demo get gateway httpbin -o=yaml | grep 'protocol:'
      protocol: HTTP

So why did httpbin.demo.svc.cluster.local work despite the global upgrade policy? I could not find any references in the Istio docs, but that upgrade policy seems to be applied only to ingress gateways. Communication between services in the local cluster is not affected unless you specify it in destination rules.
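For example, to opt a single in-cluster service into the upgrade explicitly, a DestinationRule along these lines should do it (a sketch - though for this particular httpbin it would just bring the 503 back, since the app does not speak h2c):

$ cat << EOF | kubectl -n demo apply -f -
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: httpbin
spec:
  host: httpbin.demo.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        h2UpgradePolicy: UPGRADE
EOF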

Conclusion

Alright, the first 503 story is over.

The key takeaway here is that we should run istioctl validate on our services. This nice istioctl sub-command can warn users about various antipatterns and misconfigurations in manifests. For example, it will warn you about unnamed ports, which would have saved me a lot of time :)

$ istioctl validate -f service.yaml

Error: 1 error occurred:
	* Service//httpbin: service "httpbin//:" has an unnamed port. This is not recommended, See https://istio.io/v1.9/docs/ops/deployment/requirements/

Anyway, this problem was not very complicated, but I think it was interesting and fun. Looking forward to the next one!