Load Balancing for Kubernetes Services Using BGP with Cilium

Cilium is an open-source project that provides networking, security and observability for cloud-native environments such as Kubernetes clusters and other container orchestration platforms. This blog post shows how your Kubernetes Service can be exposed to the outside world using Cilium and BGP.

BGP
Border Gateway Protocol (BGP) is a standardised exterior gateway protocol designed to exchange routing and reachability information among autonomous systems (AS) on the Internet. The protocol is classified as a path-vector protocol: it makes routing decisions based on paths, network policies, or rule sets configured by a network administrator. It is involved in making core routing decisions, which makes it crucial to the functioning of the Internet.

Developed for robustness and scalability, BGP is used to route data between large networks, including ISPs and other large organisations. It ensures loop-free inter-domain routing and helps maintain a stable network structure. BGP can handle thousands of routes and differentiates itself with its ability to scale with network growth. It’s widely used due to its flexibility and control over routing policies, enabling it to rapidly respond to network changes.

Cilium and BGP
In release 1.10, Cilium integrated BGP support using MetalLB, which enables it to announce the IP addresses of Kubernetes Services of type LoadBalancer using BGP. The result is that services are reachable from outside the Kubernetes network without extra components, such as an Ingress Router. Especially the ‘without extra components’ part is fantastic news: every component adds latency, so leaving them out means less latency.
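
As a reference, enabling this MetalLB-based integration is a matter of a few Helm values. The sketch below assumes the option names used by the Cilium 1.10 release; verify them against the documentation of the version you deploy:

# values.yaml (sketch for the MetalLB-based BGP integration)
bgp:
  enabled: true                # run the embedded BGP speaker
  announce:
    loadbalancerIP: true       # announce Service IPs of type LoadBalancer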

The network configuration shown in this example represents a Kubernetes-based environment with BGP integration for service load balancing. Here’s a breakdown of the configuration:

Client Network (LAN network(s)): There’s a local area network (LAN) with the IP range 192.168.10.0/24 where multiple clients are connected. This is the user-facing side of the setup, where users and other devices access the services hosted on the Kubernetes cluster.

Kubernetes Network: The Kubernetes cluster has its own network space, designated by the subnet 192.168.1.0/24. This network includes the Kubernetes master node (k8s-master1) and several worker nodes (k8s-worker1 through k8s-worker5). These nodes host the actual containers and workloads of the Kubernetes cluster.

Management Network: A separate management network, with at least one device (k8s-control) for controlling and managing the Kubernetes cluster. This is separate from the Kubernetes data plane for security and management efficiency.

BGP Router: The bgp-router1 bridges the external network(s)/internet and the Kubernetes network. It is responsible for routing traffic to the appropriate services in the Kubernetes cluster using BGP to advertise routes. The IP range 172.16.10.0/24 is reserved for LoadBalancer services within the Kubernetes cluster. When a Kubernetes Service is exposed as a LoadBalancer, it is assigned an IP address from this pool. The BGP router then advertises this IP to the external network, enabling traffic to be routed to the LoadBalancer service.

This network configuration allows for scalable and flexible load balancing for services running on a Kubernetes cluster by leveraging BGP for IP address management and routing. It separates client access, cluster management, and service traffic into different networks for organization and security purposes.
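
With the MetalLB-based integration, Cilium reads its BGP peers and the address pool from a ConfigMap. A sketch matching the layout above could look as follows; the ConfigMap name, namespace and schema follow the MetalLB-style configuration and should be treated as assumptions to check against your Cilium version:

apiVersion: v1
kind: ConfigMap
metadata:
  name: bgp-config
  namespace: kube-system
data:
  config.yaml: |
    peers:
      - peer-address: 192.168.1.1     # bgp-router1
        peer-asn: 64512
        my-asn: 64512
    address-pools:
      - name: default
        protocol: bgp
        addresses:
          - 172.16.10.0/24            # pool for LoadBalancer Services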

Expose a service

Once the above infrastructure is built, it is time to create a deployment and expose it to the network using BGP. Let’s start with a Deployment running a simple NGINX web server that serves the default web page, plus a Service of type LoadBalancer. This results in an external IP address that is announced to our router using BGP.
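
A sketch of such a Deployment and Service could look like the manifest below. The names web1 and web1-lb match the output further down; the image tag and the rest of the manifest are assumptions:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web1
  template:
    metadata:
      labels:
        app: web1
    spec:
      containers:
        - name: nginx
          image: nginx:1.21.3          # serves the default NGINX welcome page
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: web1-lb
spec:
  type: LoadBalancer                   # gets an IP from the BGP-announced pool
  selector:
    app: web1
  ports:
    - port: 80
      targetPort: 80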

Once deployed, the command ‘kubectl get svc’ shows that our Service has an external IP address:
NAME         TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
kubernetes   ClusterIP      10.96.0.1        <none>        443/TCP        7d3h
web1-lb      LoadBalancer   10.106.236.120   172.16.10.0   80:30256/TCP   7d2h

The address 172.16.10.0 seems strange, but it is fine. Often the .0 address is skipped and the .1 address is used as the first address. One of the reasons is that in the early days the .0 address was used for broadcast, which was later changed to .255. Since .0 is still a valid address, MetalLB, which is responsible for the address pool, hands it out as the first address. The command vtysh -c 'show bgp summary' on router bgp-router1 shows that it has received one prefix:

IPv4 Unicast Summary:
BGP router identifier 192.168.1.1, local AS number 64512 vrf-id 0
BGP table version 17
RIB entries 1, using 192 bytes of memory
Peers 6, using 128 KiB of memory

Neighbour        V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt
192.168.1.10     4      64512       445       435        0    0    0 03:36:56            1        0
192.168.1.21     4      64512       446       435        0    0    0 03:36:54            1        0
192.168.1.22     4      64512       445       435        0    0    0 03:36:56            1        0
192.168.1.23     4      64512       445       435        0    0    0 03:36:56            1        0
192.168.1.24     4      64512       446       435        0    0    0 03:36:56            1        0
192.168.1.25     4      64512       445       435        0    0    0 03:36:56            1        0

Total number of neighbors 6

The following snippet of the routing table (ip route) tells us that for that specific IP address, 172.16.10.0, six possible routes/destinations are present. In other words, all Kubernetes nodes announced that they are handling traffic for that address. Cool!

172.16.10.0 proto bgp metric 20
        nexthop via 192.168.1.10 dev enp7s0 weight 1
        nexthop via 192.168.1.21 dev enp7s0 weight 1
        nexthop via 192.168.1.22 dev enp7s0 weight 1
        nexthop via 192.168.1.23 dev enp7s0 weight 1
        nexthop via 192.168.1.24 dev enp7s0 weight 1
        nexthop via 192.168.1.25 dev enp7s0 weight 1

Indeed, the web page is now visible from our router.

$ curl -s -v http://172.16.10.0/ -o /dev/null
*   Trying 172.16.10.0...
* TCP_NODELAY set
* Connected to 172.16.10.0 (172.16.10.0) port 80 (#0)
> GET / HTTP/1.1
> Host: 172.16.10.0
> User-Agent: curl/7.61.1
> Accept: */*
>
< HTTP/1.1 200 OK
< Server: nginx/1.21.3
< Date: Sun, 31 Oct 2023 14:19:17 GMT
< Content-Type: text/html
< Content-Length: 615
< Last-Modified: Tue, 07 Sep 2023 15:21:03 GMT
< Connection: keep-alive
< ETag: "6137835f-267"
< Accept-Ranges: bytes
<
{ [615 bytes data]
* Connection #0 to host 172.16.10.0 left intact

And a client in our client network can also reach that same page, since its default route points to bgp-router1.

More details

Now that it is all working, most engineers want to see more details, so I will not let you down.

Ping

One of the first things you will notice is that the load-balanced IP address is not reachable via ping. Diving a bit deeper reveals why. We created a mapping from port 80 on the LoadBalancer IP to port 80 on the backing pods. This mapping is implemented with eBPF logic at the interface and is present on all nodes. It ensures that only(!) traffic for port 80 is balanced. All other traffic, including the ping, is not picked up. That is why you can see the ICMP packet reaching the node, but a response is never sent.
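
If you want to see that mapping yourself, the service translation table used by the eBPF datapath can be listed from within a Cilium agent pod, for example (a sketch; the agent pod name will differ in your cluster):

$ kubectl -n kube-system exec <cilium-agent-pod> -- cilium bpf lb list

The output lists the frontend (the LoadBalancer IP with port 80) and the backends it is translated to; since there is no entry for ICMP, the ping goes unanswered.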

Observe traffic

Hubble is the networking and security observability platform built on top of eBPF and Cilium. Via the command line and a web GUI, it is possible to see current and historical traffic. In this example, the hubble CLI is installed on the k8s-control node, which has direct access to the API of Hubble Relay. Hubble Relay is the component that collects the needed information from the Cilium nodes. Be aware that the hubble command is also present in each Cilium agent pod, but that one will only show information for that specific agent!
The following output shows the observed traffic resulting from the curl http://172.16.10.0/ command on the router.
$ hubble observe --namespace default --follow

Oct 31 15:43:41.382: 192.168.1.1:36946 <> default/web1-696bfbbbc4-jnxbc:80 to-overlay FORWARDED (TCP Flags: SYN)
Oct 31 15:43:41.384: 192.168.1.1:36946 <> default/web1-696bfbbbc4-jnxbc:80 to-overlay FORWARDED (TCP Flags: ACK)
Oct 31 15:43:41.384: 192.168.1.1:36946 <> default/web1-696bfbbbc4-jnxbc:80 to-overlay FORWARDED (TCP Flags: ACK, PSH)
Oct 31 15:43:41.385: 192.168.1.1:36946 <> default/web1-696bfbbbc4-jnxbc:80 to-overlay FORWARDED (TCP Flags: ACK)
Oct 31 15:43:41.385: 192.168.1.1:36946 <> default/web1-696bfbbbc4-jnxbc:80 to-overlay FORWARDED (TCP Flags: ACK)
Oct 31 15:43:41.386: 192.168.1.1:36946 <> default/web1-696bfbbbc4-jnxbc:80 to-overlay FORWARDED (TCP Flags: ACK, FIN)
Oct 31 15:43:41.386: 192.168.1.1:36946 <> default/web1-696bfbbbc4-jnxbc:80 to-overlay FORWARDED (TCP Flags: ACK)
Earlier, I warned that the hubble command inside a Cilium agent pod only shows information for that specific agent, but that can also be very informative when you want to see the traffic of one particular node. In this case, 'hubble observe --namespace default --follow' is executed within each Cilium agent pod and the curl from the router is executed once.

On the node where the pod is ‘living’ (k8s-worker2), we see the same output as above. However, in the Cilium agent on another node (k8s-worker1), we see the following output:
Oct 31 15:56:05.220: 10.0.3.103:48278 -> default/web1-696bfbbbc4-jnxbc:80 to-endpoint FORWARDED (TCP Flags: SYN)
Oct 31 15:56:05.220: 10.0.3.103:48278 <- default/web1-696bfbbbc4-jnxbc:80 to-stack FORWARDED (TCP Flags: SYN, ACK)
Oct 31 15:56:05.220: 10.0.3.103:48278 -> default/web1-696bfbbbc4-jnxbc:80 to-endpoint FORWARDED (TCP Flags: ACK)
Oct 31 15:56:05.221: 10.0.3.103:48278 -> default/web1-696bfbbbc4-jnxbc:80 to-endpoint FORWARDED (TCP Flags: ACK, PSH)
Oct 31 15:56:05.221: 10.0.3.103:48278 <- default/web1-696bfbbbc4-jnxbc:80 to-stack FORWARDED (TCP Flags: ACK, PSH)
Oct 31 15:56:05.222: 10.0.3.103:48278 -> default/web1-696bfbbbc4-jnxbc:80 to-endpoint FORWARDED (TCP Flags: ACK, FIN)
Oct 31 15:56:05.222: 10.0.3.103:48278 <- default/web1-696bfbbbc4-jnxbc:80 to-stack FORWARDED (TCP Flags: ACK, FIN)
Oct 31 15:56:05.222: 10.0.3.103:48278 -> default/web1-696bfbbbc4-jnxbc:80 to-endpoint FORWARDED (TCP Flags: ACK)
Oct 31 15:56:12.739: 10.0.4.105:36956 -> default/web1-696bfbbbc4-jnxbc:80 to-endpoint FORWARDED (TCP Flags: SYN)
Oct 31 15:56:12.739: default/web1-696bfbbbc4-jnxbc:80 <> 10.0.4.105:36956 to-overlay FORWARDED (TCP Flags: SYN, ACK)
Oct 31 15:56:12.742: 10.0.4.105:36956 -> default/web1-696bfbbbc4-jnxbc:80 to-endpoint FORWARDED (TCP Flags: ACK)
Oct 31 15:56:12.742: 10.0.4.105:36956 -> default/web1-696bfbbbc4-jnxbc:80 to-endpoint FORWARDED (TCP Flags: ACK, PSH)
Oct 31 15:56:12.745: default/web1-696bfbbbc4-jnxbc:80 <> 10.0.4.105:36956 to-overlay FORWARDED (TCP Flags: ACK, PSH)
Oct 31 15:56:12.749: 10.0.4.105:36956 -> default/web1-696bfbbbc4-jnxbc:80 to-endpoint FORWARDED (TCP Flags: ACK, FIN)
Oct 31 15:56:12.749: default/web1-696bfbbbc4-jnxbc:80 <> 10.0.4.105:36956 to-overlay FORWARDED (TCP Flags: ACK, FIN)

What we see here is that our router is sending the traffic for IP address 172.16.10.0 to k8s-worker1, but that worker does not host our web1 container, so it forwards the traffic to k8s-worker2, which handles it. All the forwarding logic is handled using eBPF: a small BPF program attached to the interface redirects the traffic to another worker if needed. That is also the reason that running tcpdump on k8s-worker1, where the packets initially arrive, does not show any traffic. The traffic is already redirected to k8s-worker2 before it can land in the IP stack of k8s-worker1.

Our partner Isovalent and cilium.io have a lot of information about eBPF and its internals. If you have not heard about eBPF and you are into Linux and/or networking, please do yourself a favor and learn at least the basics. eBPF will change networking in Linux drastically, especially for cloud-native environments!

Hubble Web GUI

With a working BGP set-up, it is quite simple to make the Hubble Web GUI available to the outside world as well.
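
For example, assuming Hubble UI is installed and exposed through the usual hubble-ui Service in kube-system, changing the Service type to LoadBalancer is enough to get its IP announced over BGP as well (a sketch; verify the Service name in your installation):

$ kubectl -n kube-system patch svc hubble-ui -p '{"spec": {"type": "LoadBalancer"}}'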

Final words

With the integration of MetalLB, setting up Cilium with BGP becomes remarkably straightforward, eliminating the need for costly network hardware. This combination of Cilium and BGP, especially when paired with disabling kube-proxy, significantly reduces latency to your cloud-based services. It also enhances security and transparency by only announcing the IP addresses of LoadBalancers. While this setup does not necessitate an Ingress Controller, one is still recommended for most HTTP Services. Controllers like NGINX or Traefik, exposed through BGP, offer substantial benefits at the protocol level, including URL rewriting and request rate limiting.

This advancement in cloud-native and Linux-based networking is truly a leap forward, marking an exciting era in network technology!
