July 8, 2020

Single-node K8s: KinD, Istio, & MetalLB

I have a cheap dedicated server with the German hosting company Hetzner. It's just for my own personal use, with lots of disk space to satisfy my data hoarding and enough RAM to do the things I need to do. I have it mostly because my home internet is slow. Over time, the services piled up: Plex, a Bitcoin node, an IPFS gateway, my e-mail, and recently, in my attempts to de-Google my life, a search engine, a YouTube replacement, a Twitter replacement, and a pastebin replacement, among other things. (My rants about big data and the erosion of privacy are for another time!) Those services have been running with docker-compose, but since I've been working with Kubernetes for my master's thesis project (a distributed stock exchange order matching engine running on K8s), I wanted to move them over to that. Aside from practicing my Kubernetes-fu, I wanted a setup that could easily be transferred from my single server to a more robust, multi-node, potentially cloud-hosted platform when I eventually upgrade.

The problem until recently has been that the single-server Kubernetes options were not great. Minikube worked fine but had the added overhead of a VM, and attaching host networking so I could expose my public IP was difficult. Other Docker-based options like RKE worked well but had networking issues or other problems that made them not quite a full K8s experience. However, recently I tried Kubernetes in Docker (KinD), and found that none of the issues I had previously were present. It worked without hassle or extra setup, and it was easy to mount my host folders into the K8s nodes for services such as Plex and my backup jobs. And while logically there are many extra layers, those translate physically into nothing more than namespaces and other context wrappers -- which means they don't suffer the performance drawbacks of something like a VM. (I was initially concerned that the KinD nodes were VMs wrapped in containers a la Rancher VM, but it turns out they're just containerd wrapped in containers.) Obviously, there is still overhead, but it was little enough that the benefits finally outweighed the costs.

Although the configuration works well, there were a handful of small issues I ran into, so I thought I'd describe how I did it -- my single-node full Kubernetes instance with Kubernetes-in-Docker, MetalLB, and Istio. There's one hacky workaround to tunnel inbound traffic from the host into Kubernetes, but other than that, it's all mostly straightforward.

The first step is to have Docker and the KinD binary installed on your system. From there, you do a normal KinD installation using whatever configuration you want. For me, that's a basic installation, except that I've added some host directories to use for app storage rather than a third-party PV provider. (I considered setting up a storage backend like Minio S3 or Ceph, but at some point practicality needs to win, and especially for things like Plex, the unnecessary overhead was more than a philosophical issue.)

Wanting to use the most recent version of Kubernetes, which at the time of this writing was 1.18.4, I manually specified the node image, and ran the setup:

$ kind create cluster --image docker.io/kindest/node:v1.18.4 \
  --name kind --config /path/to/kind-config.yaml

The config file merely specified one control plane node and three workers, with my host storage mounted on every worker node. (This way, I could specify the node path in the application's pod spec and know that it would be there on whatever node the pod was created.)

$ cat kind-config.yaml
---
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
  extraMounts:
  - hostPath: /path/to/host/storage
    containerPath: /mnt/host_storage/
- role: worker
  extraMounts:
  - hostPath: /path/to/host/storage
    containerPath: /mnt/host_storage/
- role: worker
  extraMounts:
  - hostPath: /path/to/host/storage
    containerPath: /mnt/host_storage/
---
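Since the three worker entries are identical, the file could also be generated rather than written by hand. The following is only a sketch for a POSIX shell; the function name gen_kind_config and its arguments are hypothetical and not part of KinD.

```shell
# Hypothetical helper: print a KinD cluster config with one control-plane
# node and N workers that all mount the same host directory.
gen_kind_config() {
  workers="$1"         # number of worker nodes
  host_path="$2"       # directory on the host
  container_path="$3"  # mount point inside each node container

  printf '%s\n' \
    'kind: Cluster' \
    'apiVersion: kind.x-k8s.io/v1alpha4' \
    'nodes:' \
    '- role: control-plane'

  i=0
  while [ "$i" -lt "$workers" ]; do
    printf '%s\n' \
      '- role: worker' \
      '  extraMounts:' \
      "  - hostPath: $host_path" \
      "    containerPath: $container_path"
    i=$((i + 1))
  done
}

# Example: reproduce the three-worker layout shown above.
# gen_kind_config 3 /path/to/host/storage /mnt/host_storage/ > kind-config.yaml
```

Changing the first argument is then all it takes to try a different worker count.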

At this point, once KinD has bootstrapped the cluster, you need to either get the raw kubeconfig data by running kind get kubeconfig, or let KinD set your kubectl context automatically by running kind export kubeconfig. Ideally, everything shows up as working correctly:

$ kubectl get all --all-namespaces
NAMESPACE            NAME                                             READY   STATUS    RESTARTS   AGE
kube-system          pod/coredns-66bff467f8-blj4s                     1/1     Running   0          69s
kube-system          pod/coredns-66bff467f8-jvtn5                     1/1     Running   0          69s
kube-system          pod/etcd-kind-control-plane                      1/1     Running   0          78s
kube-system          pod/kindnet-fgmqb                                1/1     Running   0          52s
kube-system          pod/kindnet-k8xpw                                1/1     Running   0          69s
kube-system          pod/kindnet-q47hs                                1/1     Running   0          52s
kube-system          pod/kindnet-rflnb                                1/1     Running   2          52s
kube-system          pod/kube-apiserver-kind-control-plane            1/1     Running   0          78s
kube-system          pod/kube-controller-manager-kind-control-plane   1/1     Running   0          78s
kube-system          pod/kube-proxy-4w24t                             1/1     Running   0          52s
kube-system          pod/kube-proxy-6k5zx                             1/1     Running   0          69s
kube-system          pod/kube-proxy-tz2vh                             1/1     Running   0          52s
kube-system          pod/kube-proxy-xgf7j                             1/1     Running   0          52s
kube-system          pod/kube-scheduler-kind-control-plane            1/1     Running   0          78s
local-path-storage   pod/local-path-provisioner-67795f75bd-jx4dg      1/1     Running   0          69s

NAMESPACE     NAME                 TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE
default       service/kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP                  86s
kube-system   service/kube-dns     ClusterIP   10.96.0.10   <none>        53/UDP,53/TCP,9153/TCP   85s

NAMESPACE     NAME                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
kube-system   daemonset.apps/kindnet      4         4         4       4            4           <none>                   84s
kube-system   daemonset.apps/kube-proxy   4         4         4       4            4           kubernetes.io/os=linux   85s

NAMESPACE            NAME                                     READY   UP-TO-DATE   AVAILABLE   AGE
kube-system          deployment.apps/coredns                  2/2     2            2           85s
local-path-storage   deployment.apps/local-path-provisioner   1/1     1            1           83s

NAMESPACE            NAME                                                DESIRED   CURRENT   READY   AGE
kube-system          replicaset.apps/coredns-66bff467f8                  2         2         2       69s
local-path-storage   replicaset.apps/local-path-provisioner-67795f75bd   1         1         1       69s

and the host storage mounts are accessible everywhere an application might be deployed:

$ docker ps
CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS              PORTS                       NAMES
0c981780e342        kindest/node:v1.18.4   "/usr/local/bin/entr…"   2 minutes ago       Up 2 minutes        127.0.0.1:41459->6443/tcp   kind-control-plane
ecc5a3120914        kindest/node:v1.18.4   "/usr/local/bin/entr…"   2 minutes ago       Up 2 minutes                                    kind-worker
c126cf79ed65        kindest/node:v1.18.4   "/usr/local/bin/entr…"   2 minutes ago       Up 2 minutes                                    kind-worker3
5c05cf09e5d6        kindest/node:v1.18.4   "/usr/local/bin/entr…"   2 minutes ago       Up 2 minutes                                    kind-worker2
$ docker exec -it ecc5a3120914 ls -lAh /mnt/host_storage/
total 4.0K
-rw-r--r-- 1 1000 1000 11 Jul  8 04:17 file_on_host.txt
$ docker exec -it c126cf79ed65 ls -lAh /mnt/host_storage/
total 4.0K
-rw-r--r-- 1 1000 1000 11 Jul  8 04:17 file_on_host.txt
$ docker exec -it 5c05cf09e5d6 ls -lAh /mnt/host_storage/
total 4.0K
-rw-r--r-- 1 1000 1000 11 Jul  8 04:17 file_on_host.txt
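The same check can be looped over every worker instead of copying container IDs by hand. This is a rough sketch, assuming the kind and docker CLIs are on the PATH; the function name check_host_mounts is made up for illustration. It relies on KinD node names doubling as Docker container names, as in the docker ps output above.

```shell
# Hypothetical helper: list the shared mount on every worker of a KinD cluster.
check_host_mounts() {
  cluster="${1:-kind}"
  mount_path="${2:-/mnt/host_storage/}"

  for node in $(kind get nodes --name "$cluster"); do
    case "$node" in
      *control-plane*) continue ;;  # only the workers have the extra mount
    esac
    echo "== $node"
    docker exec "$node" ls -lAh "$mount_path" || return 1
  done
}

# Example: check_host_mounts kind /mnt/host_storage/
```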

I had in my head that Istio could also take over the function of the base overlay network for Kubernetes, but I did not manage to get that working, so I let KinD install its default network plugin, kindnet (the kindnet pods in the output above), and then I installed Istio on top of that. Even though I probably should have used the Kubernetes-native way of using a Helm chart, I'm lazy and just used the default profile that comes with Istio's istioctl program:

$ istioctl install

This will install the default Istio profile into the cluster. Proceed? (y/N) Y
Detected that your cluster does not support third party JWT authentication. Falling back to less secure first party JWT. See https://istio.io/docs/ops/best-practices/security/#configure-third-party-service-account-tokens for details.
✔ Istio core installed
✔ Istiod installed
✔ Ingress gateways installed
✔ Addons installed
✔ Installation complete

And now we see Istio up and running, with its ingress gateway waiting patiently for an external IP that we will give it shortly.

$ kubectl get all -n istio-system
NAME                                        READY   STATUS    RESTARTS   AGE
pod/istio-ingressgateway-66c7db878f-gp9vz   1/1     Running   0          3m31s
pod/istiod-7cdc645bb4-2h75j                 1/1     Running   0          4m15s
pod/prometheus-5c84c494dd-879h4             2/2     Running   0          3m31s

NAME                           TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                                                      AGE
service/istio-ingressgateway   LoadBalancer   10.100.197.167   <pending>     15021:31353/TCP,80:32377/TCP,443:31939/TCP,15443:30195/TCP   3m31s
service/istiod                 ClusterIP      10.105.18.223    <none>        15010/TCP,15012/TCP,443/TCP,15014/TCP,53/UDP,853/TCP         4m15s
service/prometheus             ClusterIP      10.107.187.215   <none>        9090/TCP                                                     3m31s

NAME                                   READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/istio-ingressgateway   1/1     1            1           3m31s
deployment.apps/istiod                 1/1     1            1           4m15s
deployment.apps/prometheus             1/1     1            1           3m31s

NAME                                              DESIRED   CURRENT   READY   AGE
replicaset.apps/istio-ingressgateway-66c7db878f   1         1         1       3m31s
replicaset.apps/istiod-7cdc645bb4                 1         1         1       4m15s
replicaset.apps/prometheus-5c84c494dd             1         1         1       3m31s

NAME                                                       REFERENCE                         TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
horizontalpodautoscaler.autoscaling/istio-ingressgateway   Deployment/istio-ingressgateway   <unknown>/80%   1         5         1          3m31s
horizontalpodautoscaler.autoscaling/istiod                 Deployment/istiod                 <unknown>/80%   1         5         1          4m15s

At this point, everything works, but I still need to somehow route traffic from the host into the Kubernetes network. Since I wanted the manifests to be portable, I also wanted to be able to specify LoadBalancer as a service type, instead of just always using NodePorts for my configurations. After some research I ended up turning to MetalLB, the "bare metal load balancer" -- i.e. a load balancer implementation that hands out LoadBalancer IP addresses from a pool you configure by hand.

The first issue I ran into with MetalLB was that it didn't create its namespace metallb-system and it would fail to install if that wasn't present. So I manually created the namespace by applying the following simple yaml:

---
apiVersion: v1
kind: Namespace
metadata:
  name: metallb-system
  labels:
    app: metallb
    istio-injection: disabled
---

Next was to create a basic config for it, describing which IP addresses it could give out as LoadBalancer addresses. I don't know if this was the right thing to do or not, but I ended up deciding to use the node IP address space for the KinD nodes. I was worried that always using my host's public IP might cause issues. I wasn't sure if the node IP address space would work or be visible from the nodes or services, but I tried it out and it worked... for the most part. The cluster nodes and the pods inside the cluster could access the IP, but the host machine could not. (Don't worry, we'll hack a fix for that shortly.)

The node cluster IP space can be found a number of ways. You can run

$ docker network inspect kind

or you can just query a node directly with something like

$ docker exec -it 0c981780e342 ip addr show dev eth0
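If only the subnet is of interest, the inspect output can be narrowed with a Go template. This is a sketch, assuming the docker CLI is on the PATH and the Docker network is named kind, as in this setup; the function name docker_network_subnets is hypothetical.

```shell
# Hypothetical helper: print the subnet(s) of a Docker network, one per line.
docker_network_subnets() {
  network="${1:-kind}"
  docker network inspect "$network" \
    --format '{{range .IPAM.Config}}{{println .Subnet}}{{end}}'
}

# Example: docker_network_subnets kind
```

A network can carry more than one subnet (for example an IPv4 and an IPv6 one), which is why the template loops over every IPAM entry.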

My cluster's node IP space was 172.18.0.0/16 with nodes being assigned inside 172.18.0.0/24. I decided to use 172.18.5.0/24 and 172.18.6.0/24 figuring ~500 IPs would be plenty for whatever I was going to do. So the config map yaml which I applied next was the following:

---
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
    - name: custom-ip-space
      protocol: layer2
      addresses:
      - 172.18.5.2-172.18.5.254
      - 172.18.6.2-172.18.6.254
---

I didn't think I needed to leave room for a gateway and broadcast, since this wasn't actually a new network, but I also wasn't sure how MetalLB worked, so I figured better safe than sorry.
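As a quick check on the "~500 IPs" estimate, the two ranges from the config map can be counted with plain shell arithmetic. This is a small sketch; the helper names ip_to_int and range_size are made up for illustration, and it assumes a shell with 64-bit arithmetic such as bash or dash on a 64-bit system.

```shell
# Convert a dotted-quad IPv4 address to a single integer.
ip_to_int() {
  old_ifs="$IFS"; IFS=.
  set -- $1            # split the address on dots into $1..$4
  IFS="$old_ifs"
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Count the addresses in an inclusive "first-last" range.
range_size() {
  first="${1%-*}"
  last="${1#*-}"
  echo $(( $(ip_to_int "$last") - $(ip_to_int "$first") + 1 ))
}

total=0
for r in 172.18.5.2-172.18.5.254 172.18.6.2-172.18.6.254; do
  total=$(( total + $(range_size "$r") ))
done
echo "$total"   # prints 506
```

Each range holds 253 addresses (.2 through .254), so the pool offers 506 in total.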

Lastly, before MetalLB will work, we need to generate a random secret for it to use. The secret is nothing more than 128 random bytes, base64-encoded, with the line breaks intact. An example method to generate this secret string would be:

$ dd if=/dev/urandom bs=1 count=128 2>/dev/null | base64

or

$ openssl rand -base64 128

The secret has to have the name memberlist, be in the metallb-system namespace, and be a single key/value, with the key name secretkey and the value being the random bytes generated earlier.

An example command to create this new secret might be:

$ kubectl create secret generic -n metallb-system \
  memberlist --from-literal=secretkey="$(openssl rand -base64 128)"
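To confirm that the generator really yields 128 bytes even though the base64 text is wrapped across lines, the value can be decoded again and counted. This is a minimal sketch, assuming the openssl CLI is installed; the variable names are arbitrary.

```shell
# Generate a value the same way as above, then decode it and count the bytes.
secret="$(openssl rand -base64 128)"
decoded_len="$(printf '%s\n' "$secret" | openssl base64 -d | wc -c | tr -d ' ')"
echo "$decoded_len"   # expected: 128
```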

Once this new secret is created, you can just install the latest version of MetalLB directly from the source. In my instance, the most recent release on GitHub was 0.9.3, so I applied the following:

$ kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.9.3/manifests/metallb.yaml

which created several new resources:

$ kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.9.3/manifests/metallb.yaml
podsecuritypolicy.policy/controller created
podsecuritypolicy.policy/speaker created
serviceaccount/controller created
serviceaccount/speaker created
clusterrole.rbac.authorization.k8s.io/metallb-system:controller created
clusterrole.rbac.authorization.k8s.io/metallb-system:speaker created
role.rbac.authorization.k8s.io/config-watcher created
role.rbac.authorization.k8s.io/pod-lister created
clusterrolebinding.rbac.authorization.k8s.io/metallb-system:controller created
clusterrolebinding.rbac.authorization.k8s.io/metallb-system:speaker created
rolebinding.rbac.authorization.k8s.io/config-watcher created
rolebinding.rbac.authorization.k8s.io/pod-lister created
daemonset.apps/speaker created
deployment.apps/controller created
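Before relying on MetalLB, it can help to wait until its workloads have actually rolled out. A possible sketch, assuming kubectl points at the KinD cluster; the function name wait_for_metallb and the default timeout are arbitrary choices, while the resource names come from the apply output above.

```shell
# Hypothetical helper: block until the MetalLB controller and speakers are
# rolled out, or fail once the timeout expires.
wait_for_metallb() {
  timeout="${1:-120s}"
  kubectl rollout status -n metallb-system deployment/controller --timeout="$timeout" &&
    kubectl rollout status -n metallb-system daemonset/speaker --timeout="$timeout"
}

# Example: wait_for_metallb 180s
```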

With that, the basic installation is complete. Istio is configured and any services that we assign LoadBalancers to will get an IP from the 172.18.5.0/24 or 172.18.6.0/24 range and be accessible from all pods and nodes. Services can also explicitly request/assign their own IP from within that range, which will come in handy shortly.
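For illustration, a service can ask for a specific address by setting spec.loadBalancerIP, which MetalLB honors when the address falls inside one of its pools. The sketch below prints such a manifest; the function name, the service name my-app, the selector, and the port are all hypothetical.

```shell
# Hypothetical helper: print a LoadBalancer Service manifest that requests a
# specific address. Name, selector, and port are made-up example values.
lb_service_manifest() {
  name="$1"; ip="$2"; port="$3"
  cat <<EOF
apiVersion: v1
kind: Service
metadata:
  name: $name
spec:
  type: LoadBalancer
  loadBalancerIP: $ip
  selector:
    app: $name
  ports:
  - port: $port
    targetPort: $port
EOF
}

# Example: lb_service_manifest my-app 172.18.5.10 8080 | kubectl apply -f -
```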

But first, let's run get all and make sure that everything is correct, including Istio's ingress gateway getting an IP address:

$ kubectl get all --all-namespaces
NAMESPACE            NAME                                             READY   STATUS    RESTARTS   AGE
istio-system         pod/istio-ingressgateway-66c7db878f-gp9vz        1/1     Running   0          107m
istio-system         pod/istiod-7cdc645bb4-2h75j                      1/1     Running   0          108m
istio-system         pod/prometheus-5c84c494dd-879h4                  2/2     Running   0          107m
kube-system          pod/coredns-66bff467f8-blj4s                     1/1     Running   0          119m
kube-system          pod/coredns-66bff467f8-jvtn5                     1/1     Running   0          119m
kube-system          pod/etcd-kind-control-plane                      1/1     Running   0          119m
kube-system          pod/kindnet-fgmqb                                1/1     Running   0          119m
kube-system          pod/kindnet-k8xpw                                1/1     Running   0          119m
kube-system          pod/kindnet-q47hs                                1/1     Running   0          119m
kube-system          pod/kindnet-rflnb                                1/1     Running   2          119m
kube-system          pod/kube-apiserver-kind-control-plane            1/1     Running   0          119m
kube-system          pod/kube-controller-manager-kind-control-plane   1/1     Running   0          119m
kube-system          pod/kube-proxy-4w24t                             1/1     Running   0          119m
kube-system          pod/kube-proxy-6k5zx                             1/1     Running   0          119m
kube-system          pod/kube-proxy-tz2vh                             1/1     Running   0          119m
kube-system          pod/kube-proxy-xgf7j                             1/1     Running   0          119m
kube-system          pod/kube-scheduler-kind-control-plane            1/1     Running   0          119m
local-path-storage   pod/local-path-provisioner-67795f75bd-jx4dg      1/1     Running   0          119m
metallb-system       pod/controller-57f648cb96-7q9cx                  1/1     Running   0          72s
metallb-system       pod/speaker-4pf7b                                1/1     Running   0          72s
metallb-system       pod/speaker-c22hp                                1/1     Running   0          72s
metallb-system       pod/speaker-c2f2l                                1/1     Running   0          72s
metallb-system       pod/speaker-crx56                                1/1     Running   0          72s

NAMESPACE      NAME                           TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                                                      AGE
default        service/kubernetes             ClusterIP      10.96.0.1        <none>        443/TCP                                                      119m
istio-system   service/istio-ingressgateway   LoadBalancer   10.100.197.167   172.18.5.2    15021:31353/TCP,80:32377/TCP,443:31939/TCP,15443:30195/TCP   107m
istio-system   service/istiod                 ClusterIP      10.105.18.223    <none>        15010/TCP,15012/TCP,443/TCP,15014/TCP,53/UDP,853/TCP         108m
istio-system   service/prometheus             ClusterIP      10.107.187.215   <none>        9090/TCP                                                     107m
kube-system    service/kube-dns               ClusterIP      10.96.0.10       <none>        53/UDP,53/TCP,9153/TCP                                       119m

NAMESPACE        NAME                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                 AGE
kube-system      daemonset.apps/kindnet      4         4         4       4            4           <none>                        119m
kube-system      daemonset.apps/kube-proxy   4         4         4       4            4           kubernetes.io/os=linux        119m
metallb-system   daemonset.apps/speaker      4         4         4       4            4           beta.kubernetes.io/os=linux   72s

NAMESPACE            NAME                                     READY   UP-TO-DATE   AVAILABLE   AGE
istio-system         deployment.apps/istio-ingressgateway     1/1     1            1           107m
istio-system         deployment.apps/istiod                   1/1     1            1           108m
istio-system         deployment.apps/prometheus               1/1     1            1           107m
kube-system          deployment.apps/coredns                  2/2     2            2           119m
local-path-storage   deployment.apps/local-path-provisioner   1/1     1            1           119m
metallb-system       deployment.apps/controller               1/1     1            1           72s

NAMESPACE            NAME                                                DESIRED   CURRENT   READY   AGE
istio-system         replicaset.apps/istio-ingressgateway-66c7db878f     1         1         1       107m
istio-system         replicaset.apps/istiod-7cdc645bb4                   1         1         1       108m
istio-system         replicaset.apps/prometheus-5c84c494dd               1         1         1       107m
kube-system          replicaset.apps/coredns-66bff467f8                  2         2         2       119m
local-path-storage   replicaset.apps/local-path-provisioner-67795f75bd   1         1         1       119m
metallb-system       replicaset.apps/controller-57f648cb96               1         1         1       72s

NAMESPACE      NAME                                                       REFERENCE                         TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
istio-system   horizontalpodautoscaler.autoscaling/istio-ingressgateway   Deployment/istio-ingressgateway   <unknown>/80%   1         5         1          107m
istio-system   horizontalpodautoscaler.autoscaling/istiod                 Deployment/istiod                 <unknown>/80%   1         5         1          108m

Looks good. The only problem is that the host (and outside traffic) cannot connect inbound to that load balancer IP address. To accomplish this, we finally get to the hacky band-aid patch that I'm not thrilled about but which has so far worked extremely well: We run socat in the host's Docker as a reverse proxy, connecting a port on the host to the KinD node network space.

As an example, let us forward the host's TCP port 80 to the Istio ingress gateway's load balancer on TCP port 80. To do this, on the host machine's Docker, we run a socat instance like so:

$ docker run -d --network kind -p 80:80 docker.io/alpine/socat:latest \
  -dd tcp-listen:80,fork,reuseaddr tcp-connect:172.18.5.2:80

You can see at the end where the load balancer IP address goes, and you can also see with the "-p 80:80" how the container port is mapped from the host. This creates a straight line from a host port bind over to socat, over to the load balancer service inside the cluster. If we imagine UDP port 9000 was also listening on the Istio ingress gateway, we could do a UDP port forward from the host in a similar manner:

$ docker run -d --network kind -p "9000:9000/udp" \
  docker.io/alpine/socat:latest -dd \
  UDP4-RECVFROM:9000,fork UDP4-SENDTO:172.18.5.2:9000

(The -dd flag is just the verbosity of logging/debugging output and can be omitted.)
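Rather than hard-coding the load balancer IP, the address could be looked up from the service at start time. The following is only a sketch of that idea, assuming kubectl and docker are on the PATH and the Docker network is named kind; the function name forward_tcp_to_lb and its argument order are hypothetical.

```shell
# Hypothetical helper: look up a LoadBalancer service's external IP and start
# a socat container that forwards a host TCP port to it.
forward_tcp_to_lb() {
  host_port="$1"; namespace="$2"; service="$3"; lb_port="$4"

  lb_ip="$(kubectl get service "$service" -n "$namespace" \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}')"
  if [ -z "$lb_ip" ]; then
    echo "no external IP assigned to $namespace/$service yet" >&2
    return 1
  fi

  docker run -d --network kind -p "$host_port:$host_port" \
    docker.io/alpine/socat:latest \
    "tcp-listen:$host_port,fork,reuseaddr" "tcp-connect:$lb_ip:$lb_port"
}

# Example: forward_tcp_to_lb 80 istio-system istio-ingressgateway 80
```

The lookup happens only once, so if MetalLB later hands the service a different address, the container has to be recreated.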

For my setup, I have a single docker-compose file with each load-balanced service port getting its own socat container. An example might look like:

---
version: "3"
services:

  socat_istio_ingress:
    image: docker.io/alpine/socat:latest
    restart: unless-stopped
    container_name: socat_istio_ingress
    networks:
      - kind
    ports:
      - "0.0.0.0:80:80/tcp"
    command: "-dd tcp-listen:80,fork,reuseaddr tcp-connect:172.18.5.2:80"

  socat_istio_ingress_udp9000:
    image: docker.io/alpine/socat:latest
    restart: unless-stopped
    container_name: socat_istio_ingress_udp9000
    networks:
      - kind
    ports:
      - "0.0.0.0:9000:9000/udp"
    command: "-dd UDP4-RECVFROM:9000,fork UDP4-SENDTO:172.18.5.2:9000"

networks:
  kind:
    external:
      name: kind
---

This feels too hacky to me, but it works phenomenally well, and I have had no issues with speed or bandwidth. It also allows me to cut off external traffic without touching the Kubernetes cluster. There are only two major downsides. The first one is that I no longer get source IP addresses, since as far as K8s is concerned, all traffic originates from the socat container IPs. That's not an issue for me (yet) on my single-node personal setup, but it would be problematic for organizations looking to block DDoS attacks, gather user geolocation metrics, or do other things that rely on source IP addresses. The other (and lesser) downside is that I can't connect from the host to NodePort addresses, so I have to set those as load balancers, and map the host port as a localhost bind rather than a 0.0.0.0 bind. I know I could use iptables for the port forwarding, and that's probably the best solution for everything, but I want to keep things containerized and predictable without messing around with host networking. This setup is mostly so I can get some K8s experience, utilizing its full feature set, so I don't mind some hacking around the edges to make my setup work. (As an aside, kubectl port-forward also works reasonably well and you can set the timeout to infinity, so I might just end up using that.)

Edit: After posting this article, I found out about a recent feature in KinD which is an additional configuration option extraPortMappings that lets you bind host ports to KinD container ports, similar to Docker's port forwarding option. This solves the issue and makes my socat hack unnecessary, and it is what I have switched to for forwarding incoming traffic.
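The exact configuration is not shown here, so the following is only a guess at what such a mapping might look like. It prints a KinD config in which the control-plane node container publishes a host port and forwards it to a NodePort inside the cluster. The function name is hypothetical, and 32377 is simply the NodePort the ingress gateway's port 80 received in the output earlier in this post; NodePorts are assigned randomly unless pinned in the service spec.

```shell
# Hypothetical helper: print a KinD config whose control-plane node container
# publishes a host port and forwards it to a NodePort inside the cluster.
kind_config_with_port_mapping() {
  host_port="$1"; node_port="$2"
  cat <<EOF
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  extraPortMappings:
  - containerPort: $node_port
    hostPort: $host_port
    protocol: TCP
EOF
}

# Example: host port 80 -> NodePort 32377
# kind_config_with_port_mapping 80 32377 > kind-config.yaml
```

Port mappings are applied when the node containers are created, so adding one means recreating the cluster with the new config.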

Even with the few small issues, this setup does indeed provide a full feature-complete Kubernetes on my single bare-metal server, with the ability to assign load balancers to services. This allows my configuration to be more portable (even though I'll have to remove the manually specified LB IP addresses). And it allows me to work with and practice using Kubernetes in managing my own services. And of course, while there is certainly overhead to all of this, it's not too terrible. Plex still works fine, and I get all the added benefits of Istio, such as easy monitoring and tracing. (Although instead of Istio's ingress gateway, I use Traefik 2 as my ingress.)

There wasn't too much that was difficult about this, but it was still a kind of interesting setup, and I had to google a few times to find all of the information to glue all the bits and pieces together. So I thought it was still worth putting out there.

Plus, I needed content for my new homepage...