Fortanix Data Security Manager (Release 4.3) Kubernetes Version Upgrade to 1.14 K8s

Introduction

The purpose of this guide is to describe the steps to upgrade Kubernetes from version 1.10 to version 1.14 for Fortanix DSM release 4.3.

Overview

The Fortanix DSM 4.3 release upgrades the system from Kubernetes version 1.10 to version 1.14.

Subsequent Kubernetes upgrades will be delivered either as part of regular Fortanix DSM releases or as independent upgrades.

After upgrading Fortanix DSM to the 4.3 version, you will not be able to downgrade to previous releases. The Fortanix DSM UI will not allow a downgrade after 4.3 is installed. Please work with Fortanix Support to ensure you have a valid backup that can be used to perform a manual recovery.

Also, you will need to upgrade Fortanix DSM to 4.3 before moving to any future release.

Prerequisites

The following are the prerequisites before upgrading:

  • Ensure that more than 15 GB of disk space is available in /var and the root directory (/) by executing the following command:

    root@us-west-eqsv2-13:~# df -h /var/ /

    The following is the output:

    Filesystem            Size  Used Avail Use% Mounted on
    /dev/mapper/main-var   46G   28G   16G  64% /var
    /dev/sda2              46G   21G   24G  47% /
  • Ensure that all software versions are available on all the endpoints by executing the following command:

    root@us-west-eqsv2-13:~# kubectl  get ep -n swdist

    The following is the output:

    NAME      ENDPOINTS                                         AGE
    swdist    10.244.0.212:22,10.244.1.191:22,10.244.2.152:22   242d
    v2649     10.244.0.212:22,10.244.1.191:22,10.244.2.152:22   4d
    v2657     10.244.0.212:22,10.244.1.191:22,10.244.2.152:22   2d
  • Ensure that the Docker registry service (systemctl status docker-registry) is active and running before and after the software is uploaded (see the status check at the end of this item). Also, ensure that the overlay mount options match the following on each node:

    cat /etc/systemd/system/var-opt-fortanix-swdist_overlay.mount.d/options.conf
    [Mount]
    Options=lowerdir=/var/opt/fortanix/swdist/data/vXXXX/registry:/var/opt/fortanix/swdist/data/vYYYY/registry

    Here, ‘vXXXX’ is the previous version and ‘vYYYY’ is the upgraded version.
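
    To confirm the registry service status on each node, you can run the following command; the Active line in the output should read active (running):

    systemctl status docker-registry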

  • Ensure that the latest backup is triggered and verify that it completed successfully (check the backup size, and so on).

Upgrading Kubernetes from 1.10 to 1.14

Ensure that you read the ‘Prerequisites’ section before upgrading.

Pre Checks

  • All nodes should report as healthy and should be running Kubernetes v1.10.13, Docker 17.03.1 (reported as docker://17.3.1 in the node output), and kernel 5.4.
    Run the command kubectl get nodes -owide and check the version under the VERSION column. It should show v1.10.13 for each of the nodes. The following is an example output:

    NAME STATUS ROLES AGE VERSION EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
    ali1 Ready master 2d v1.10.13 <none> Ubuntu 20.04.3 LTS 5.4.0-81-generic docker://17.3.1
    nuc3 Ready master 3d v1.10.13 <none> Ubuntu 20.04.3 LTS 5.4.0-81-generic docker://17.3.1
  • All pods are healthy in the default, swdist, and kube-system namespaces, for example:
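
    One way to check is to list the pods in each of these namespaces; every pod should show a Running status (or Completed, for finished jobs) with a full READY count:

    kubectl get pods -ndefault
    kubectl get pods -nswdist
    kubectl get pods -nkube-system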

  • Check the etcd status.

    • etcd should be TLS migrated. The following command should output the list of etcd members, where peerURLs should have both ports listed: 2380 (http) and 2382 (https).

      sudo KUBECONFIG=/etc/kubernetes/admin.conf kubectl exec <etcd pod name> -nkube-system -- etcdctl --ca-file /etc/kubernetes/pki/etcd/ca.crt --cert-file /etc/kubernetes/pki/etcd/healthcheck-client.crt --key-file /etc/kubernetes/pki/etcd/healthcheck-client.key --endpoints https://127.0.0.1:2379 member list

      The output of the above command should be similar to:

      37bc079e7f15c970: name=ali1 peerURLs=http://10.197.192.14:2380,https://10.197.192.14:2382 clientURLs=https://10.197.192.14:2379 isLeader=false
      e6214c803ea4e0c6: name=nuc3 peerURLs=http://10.197.192.12:2380,https://10.197.192.12:2382 clientURLs=http://10.197.192.12:2379 isLeader=true
    • Check the etcd version on each of the etcd pods. It should be 3.2.18.

      sudo KUBECONFIG=/etc/kubernetes/admin.conf kubectl exec <etcd pod name> -nkube-system -- etcd --version
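
      The first line of the output should be similar to the following:

      etcd Version: 3.2.18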
    • Check the etcd cluster health. It should report that the cluster is healthy. For example:

      $ sudo KUBECONFIG=/etc/kubernetes/admin.conf kubectl exec etcd-us-west-eqsv2-prod-1 -nkube-system -- etcdctl --ca-file /etc/kubernetes/pki/etcd/ca.crt --cert-file /etc/kubernetes/pki/etcd/healthcheck-client.crt --key-file /etc/kubernetes/pki/etcd/healthcheck-client.key --endpoints https://127.0.0.1:2379 cluster-health
      member 60bf9f6f5fbf9ee3 is healthy: got healthy result from https://10.197.64.4:2379
      member 63e02b05bdd1e768 is healthy: got healthy result from https://10.197.64.5:2379
      member 6a6e23ad086373b9 is healthy: got healthy result from https://10.197.64.7:2379
      member b4f9fd50b1fc3926 is healthy: got healthy result from https://10.197.64.1:2379
      member efe173d68e9fe699 is healthy: got healthy result from https://10.197.64.3:2379
      cluster is healthy
  • Check the image versions for all Kubernetes control plane components. On each node, go to the directory /etc/kubernetes/manifests and run the following commands; the expected output is shown with each command.

    administrator@us-west-eqsv2-prod-1:/etc/kubernetes/manifests$ sudo cat etcd.yaml | grep "image:"
    image: containers.fortanix.com:5000/etcd-amd64:3.2.18
    image: containers.fortanix.com:5000/etcd-amd64:3.2.18
    administrator@us-west-eqsv2-prod-1:/etc/kubernetes/manifests$ sudo cat kube-scheduler.yaml | grep "image:"
    image: containers.fortanix.com:5000/kube-scheduler-amd64:v1.10.13
    administrator@us-west-eqsv2-prod-1:/etc/kubernetes/manifests$ sudo cat kube-controller-manager.yaml | grep "image:"
    image: containers.fortanix.com:5000/kube-controller-manager-amd64:v1.10.13
    administrator@us-west-eqsv2-prod-1:/etc/kubernetes/manifests$ sudo cat kube-apiserver.yaml | grep "image:"
    image: containers.fortanix.com:5000/kube-apiserver-amd64:v1.10.13
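
    Equivalently, the image lines from all the manifests can be checked in a single pass:

    sudo grep "image:" /etc/kubernetes/manifests/*.yaml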
  • Make sure the Kubernetes certificates have not expired and are not about to expire.

    • Check the certificates under /etc/kubernetes/pki and /etc/kubernetes/pki/etcd.
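
      For example, the expiry date of an individual certificate can be inspected with openssl (repeat for each certificate in both directories):

      sudo openssl x509 -noout -enddate -in /etc/kubernetes/pki/apiserver.crt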

    • If the certificates have expired, renew them using
      /opt/fortanix/sdkms/bin/renew-k8s-certs.sh.

  • The kubelet, docker, and docker-registry services should be running on each node.

    systemctl status docker
    systemctl status kubelet
    systemctl status docker-registry
    
  • There should be at least 15 GB of disk space available in /var and /. If not, delete the oldest version of Fortanix DSM from the UI.

    $ df -h /var/ /
    Filesystem Size Used Avail Use% Mounted on
    /dev/mapper/main-var 47G 26G 21G 56% /var
    /dev/sda2 47G 13G 33G 28% /
    

Post Upgrade

The following are the post-upgrade details to note.

  • The Kubernetes version is now upgraded from v1.10.13 to v1.14.10. This means that the kubeadm, kubectl, and kubelet packages are upgraded to v1.14.10.

    sudo dpkg -l | grep 1.14.10
    ii kubeadm 1.14.10-00fortanix amd64 Kubernetes Cluster Bootstrapping Tool
    ii kubectl 1.14.10-00 amd64 Kubernetes Command Line Tool
    ii kubelet 1.14.10-00 amd64 Kubernetes Node Agent

  • kubectl no longer supports the -a option; the command kubectl get pods -a now throws an error.

  • The kubelet configuration now uses the file /etc/default/kubelet for extra arguments.
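
    As a reference, /etc/default/kubelet follows the standard kubeadm convention of a single KUBELET_EXTRA_ARGS line; the flag shown below is only an illustrative placeholder, not a value set by Fortanix DSM:

    $ cat /etc/default/kubelet
    KUBELET_EXTRA_ARGS="--node-labels=example.com/role=worker"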
    
  • The docker version is upgraded from version 17.03 to version 18.06.

    $ sudo dpkg -l | grep docker-ce
    ii docker-ce 18.06.3~ce~3-0~ubuntu amd64 Docker: the open-source application container engine
  • etcd is upgraded from 3.2.18 to 3.3.10.
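
    To confirm, you can re-run the version check from the pre-checks (replace <etcd pod name> with an actual etcd pod name); the output should now report 3.3.10:

    sudo KUBECONFIG=/etc/kubernetes/admin.conf kubectl exec <etcd pod name> -nkube-system -- etcd --version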

  • The member list should show all members listening for peers on 2382 (https) and for clients on 2379 (https). Some members could still show a peer listener on port 2380 (http) alongside https.

    $ sudo KUBECONFIG=/etc/kubernetes/admin.conf kubectl exec etcd-sdkms-server-1 -nkube-system -- etcdctl --ca-file /etc/kubernetes/pki/etcd/ca.crt --cert-file /etc/kubernetes/pki/etcd/healthcheck-client.crt --key-file /etc/kubernetes/pki/etcd/healthcheck-client.key --endpoints https://127.0.0.1:2379 member list
    9c5daa70fca6432: name=sdkms-server-1 peerURLs=https://10.197.192.43:2382 clientURLs=https://10.197.192.43:2379 isLeader=true
    8d2bc14567dfd781: name=sdkms-server-2 peerURLs=https://10.197.192.44:2382 clientURLs=https://10.197.192.44:2379 isLeader=false
  • Flannel is upgraded from v0.9.0 to v0.11.0-1-g3b757492.

  • The flannel CNI Docker image is no longer used; only the flannel Docker image is used. The CNI version in the kube-flannel configmap is updated from 0.3.0 to 0.3.1.

    $sudo KUBECONFIG=/etc/kubernetes/admin.conf kubectl describe ds kube-flannel-ds -nkube-system | grep Image
    Image: containers.fortanix.com:5000/flannel:v0.11.0-1-g3b757492
    Image: containers.fortanix.com:5000/flannel:v0.11.0-1-g3b757492
    
    $ sudo -E kubectl get configmap kube-flannel-cfg -nkube-system -oyaml | grep "cniVersion"
    "cniVersion": "0.3.1"
    
  • The pause Docker image is updated from version 3.0 to 3.1. To check this on a node, run the following command:

    docker ps | grep pause
  • swdist is updated because its Dockerfile pins the kubectl version. This causes the swdist pods to roll out again after the nodes have been upgraded to v1.14. Check the age of the pods using the command below:

    kubectl get pods -owide -nswdist
  • Kube Proxy patch:

    • The kube-proxy DaemonSet is patched after each intermediate Kubernetes version to use a patched kube-proxy Docker image (this is done to fix a bug in kube-proxy and avoid contention on the iptables lock).

    • The kube-proxy Docker image versions for each Kubernetes version are:

      • v1.11.10 - v1.11.10-3-cab99e3cb4b51f

      • v1.12.10 - v1.12.10-3-85f7b5925c428e

      • v1.13.12 - v1.13.12-3-6e71bdf7e97b1c

      • v1.14.10 - v1.14.10-5-740026d6e146df
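
    • To verify which kube-proxy image is currently in use, you can describe the DaemonSet; the image shown should carry the patched v1.14.10 tag listed above:

      sudo KUBECONFIG=/etc/kubernetes/admin.conf kubectl describe ds kube-proxy -nkube-system | grep Image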

  • The kernel is upgraded from 5.4.0-81-generic to 5.8.0-50-generic. The Kubernetes version on each node is upgraded to 1.14 before the kernel upgrade is performed. This can be checked under the KERNEL-VERSION column in the output of the following command:

    $kubectl get nodes -owide
    NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
    sdkms-server-3 Ready master 8h v1.14.10 10.197.192.45 <none> Ubuntu 20.04.3 LTS 5.8.0-50-generic docker://18.6.3
    
  • Kured is upgraded from version 1.1.0 to 1.2.0.
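
    Assuming kured keeps its default deployment as a DaemonSet named kured in the kube-system namespace, the running image tag can be checked with a command similar to:

    sudo KUBECONFIG=/etc/kubernetes/admin.conf kubectl describe ds kured -nkube-system | grep Image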

Troubleshooting

  • In case the kubelet client certificate has expired (/var/lib/kubelet/pki/kubelet-client.crt) and there is no /var/lib/kubelet/pki/kubelet-client-current.pem file present, you can recreate the certificate using the following commands:

    TEMP_DIR=/etc/kubernetes/tmp
    mkdir -p $TEMP_DIR
    # Kubeconfig used by the kubelet service (standard kubeadm location)
    KUBELET_CONF="/etc/kubernetes/kubelet.conf"
    BACKUP_PEM="/var/lib/kubelet/pki/kubelet-client-current.pem"
    KEY="/var/lib/kubelet/pki/kubelet-client.key"
    CERT="/var/lib/kubelet/pki/kubelet-client.crt"
    
    echo "Stopping kubelet service"
    systemctl stop kubelet
    
    echo "Creating a new key and cert file for kubelet auth"
    nodename=$(echo "$HOSTNAME" | awk '{print tolower($0)}')
    openssl req -out $TEMP_DIR/tmp.csr -new -newkey rsa:2048 -nodes -keyout $TEMP_DIR/tmp.key -subj "/O=system:nodes/CN=system:node:$nodename"
    cat > $TEMP_DIR/kubelet-client.ext << HERE
    keyUsage = critical,digitalSignature,keyEncipherment
    extendedKeyUsage = clientAuth
    HERE
    echo "Signing the generated csr with kubernetes CA"
    openssl x509 -req -days 365 -in $TEMP_DIR/tmp.csr -CA /etc/kubernetes/pki/ca.crt -CAkey /etc/kubernetes/pki/ca.key -CAcreateserial -out $TEMP_DIR/tmp.crt -sha256 -extfile $TEMP_DIR/kubelet-client.ext
    cp $TEMP_DIR/tmp.crt $CERT
    cp $TEMP_DIR/tmp.key $KEY
    
    chmod 644 $CERT
    chmod 600 $KEY
    
    if grep -q "client-certificate-data" $KUBELET_CONF; then
        echo "Updating file $KUBELET_CONF to add reference to restored certificates"
        sed -i "s|\(client-certificate-data:\s*\).*\$|client-certificate: $CERT|" $KUBELET_CONF
        sed -i "s|\(client-key-data:\s*\).*\$|client-key: $KEY|" $KUBELET_CONF
    fi
    
    echo "Starting kubelet service"
    systemctl start kubelet
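
    After the kubelet service restarts, you can verify that the kubelet is healthy and that the node reports Ready again:

    systemctl status kubelet
    kubectl get nodes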
  • An upgrade on a 2-node cluster can fail due to etcd quorum failure. In such a scenario, if the pods are healthy, you can re-run the deploy job manually using the following command; this will eventually upgrade the cluster to 1.14.

    sdkms-cluster deploy --stage DEPLOY --version <version>

    WARNING

    2-node upgrades are not recommended.

  • When a 3-node cluster is upgraded from build 4.2.2087 to <4.3.xxxx>, it is possible that the deploy job exits and is marked completed before the cluster upgrade finishes. In such a scenario, if all the pods are healthy, you can deploy the version again.