Fortanix Data Security Manager (Release 4.11) Kubernetes Version Upgrade to 1.19 K8s

Introduction

The purpose of this guide is to describe the steps to upgrade Kubernetes from version 1.16 to 1.19 for Fortanix DSM release 4.11.

Overview

The Fortanix DSM 4.11 release upgrades the system from Kubernetes version 1.16 to 1.19.

Subsequent Kubernetes upgrades will be released as part of regular upgrades or could continue to be independent upgrades.

After upgrading Fortanix DSM to the 4.11 version, you will not be able to downgrade to previous releases. The Fortanix DSM UI will not allow a downgrade after 4.11 is installed. Please work with Fortanix Support to ensure you have a valid backup that can be used to perform a manual recovery.

Also, you will need to upgrade Fortanix DSM to 4.11 before moving to any future release.

Prerequisites

The following are the prerequisites before upgrading:

  • Ensure that more than 15 GB of disk space is available in /var and the root directory (/) by executing the following command:
    root@us-west-eqsv2-13:~# df -h /var/ /

    The following is the output:

    Filesystem            Size  Used Avail Use% Mounted on
    
    /dev/mapper/main-var   46G   28G   16G  64% /var
    
    /dev/sda2              46G   21G   24G  47% /
    If not, delete the oldest version of Fortanix DSM from the UI to free space, then verify the available space again:
    $ df -h /var/ /
    Filesystem            Size  Used Avail Use% Mounted on
    /dev/mapper/main-var   47G   26G   21G  56% /var
    /dev/sda2              47G   13G   33G  28% /    
          
  • Ensure that all software versions are available on all the endpoints by executing the following command:
    root@us-west-eqsv2-13:~# kubectl  get ep -n swdist
    The following is the output:
    NAME      ENDPOINTS                                         AGE
    
    swdist    10.244.0.212:22,10.244.1.191:22,10.244.2.152:22   242d
    
    v2649     10.244.0.212:22,10.244.1.191:22,10.244.2.152:22   4d
    
    v2657     10.244.0.212:22,10.244.1.191:22,10.244.2.152:22   2d
  • Ensure that the Docker registry status (systemctl status docker-registry) is active and running before and after the software is uploaded. Also, ensure that the overlay mount matches the following on each node. The following is the command:
    cat /etc/systemd/system/var-opt-fortanix-swdist_overlay.mount.d/options.conf
    [Mount]
    Options=lowerdir=/var/opt/fortanix/swdist/data/vXXXX/registry:/var/opt/fortanix/swdist/data/vYYYY/registry
    Here, ‘vXXXX’ is the previous version and ‘vYYYY’ is the upgraded version.
  • Ensure that the latest backup is triggered and verify that it is a successful backup (size and so on).
  • All nodes must report as healthy and be running Kubernetes version 1.16.15, Docker version 18.6, and kernel 5.8. Execute the command kubectl get nodes -o wide and look for the version number under the VERSION column. It should show v1.16.15 for each of the nodes.
    NAME STATUS ROLES AGE VERSION EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
    ali1 Ready master 2d v1.16.15 <none> Ubuntu 20.04.3 LTS 5.8.0-50-generic docker://18.6.3
    nuc3 Ready master 3d v1.16.15 <none> Ubuntu 20.04.3 LTS 5.8.0-50-generic docker://18.6.3
  • All pods must be healthy in the default, swdist, and kube-system namespaces.
  • Check kubeadm configuration on the cluster.
    kubectl get configmap kubeadm-config -oyaml -nkube-system

    Verify that the returned parameter values match the expected Master configuration for the cluster.

  • Check the etcd status and verify that “isLeader=true“ is assigned to one of the etcd nodes.
    • etcd should be TLS migrated. The following command should output the list of etcd members, where peerURLs should have both ports listed, 2380 (HTTP) and 2382 (HTTPS).
      sudo KUBECONFIG=/etc/kubernetes/admin.conf kubectl exec <etcd pod name> -nkube-system -- etcdctl --ca-file /etc/kubernetes/pki/etcd/ca.crt --cert-file /etc/kubernetes/pki/etcd/healthcheck-client.crt --key-file /etc/kubernetes/pki/etcd/healthcheck-client.key --endpoints https://127.0.0.1:2379 member list
      The output of the above command should be similar to:
      6eac4cd6e44f7cb0: name=srv1-sitlab-dc peerURLs=https://10.4.65.11:2382 clientURLs=https://10.4.65.11:2379 isLeader=true
      e6214c803ea4e0c6: name=nuc3 peerURLs=http://10.197.192.12:2380,https://10.197.192.12:2382 clientURLs=http://10.197.192.12:2379 isLeader=true
  • Check the etcd version on each of the etcd pods. It should be 3.3.15.
    sudo KUBECONFIG=/etc/kubernetes/admin.conf kubectl exec <etcd pod name> -nkube-system -- etcd --version
  • Check the etcd cluster health. It should report that the cluster is healthy. For example:
    root@ip-172-31-0-231:/home/administrator# sudo KUBECONFIG=/etc/kubernetes/admin.conf kubectl exec etcd-ip-172-31-0-231 -nkube-system -- etcdctl --ca-file /etc/kubernetes/pki/etcd/ca.crt --cert-file /etc/kubernetes/pki/etcd/healthcheck-client.crt --key-file /etc/kubernetes/pki/etcd/healthcheck-client.key --endpoints https://127.0.0.1:2379 cluster-health
    member db0f79a7474b0b2c is healthy: got healthy result from https://172.31.0.231:2379
    cluster is healthy
  • Check the image versions for all Kubernetes control plane components. On each node, go to the directory /etc/kubernetes/manifests and run the following commands; the output should match the following.
    root@ip-172-31-0-231:/etc/kubernetes/manifests# ls
    etcd.yaml kube-apiserver.yaml kube-controller-manager.yaml kube-scheduler.yaml
    root@ip-172-31-0-231:/etc/kubernetes/manifests# cat etcd.yaml | grep "image:"
    image: containers.fortanix.com:5000/etcd:3.3.15-0
    image: containers.fortanix.com:5000/etcd:3.3.15-0
    root@ip-172-31-0-231:/etc/kubernetes/manifests# cat kube-apiserver.yaml | grep "image:"
    image: containers.fortanix.com:5000/kube-apiserver:v1.16.15
    root@ip-172-31-0-231:/etc/kubernetes/manifests# cat kube-controller-manager.yaml | grep "image:"
    image: containers.fortanix.com:5000/kube-controller-manager:v1.16.15
    root@ip-172-31-0-231:/etc/kubernetes/manifests# cat kube-scheduler.yaml | grep "image:"
    image: containers.fortanix.com:5000/kube-scheduler:v1.16.15
    root@ip-172-31-0-231:/etc/kubernetes/manifests#
  • Make sure the Kubernetes certificates have not expired or are about to expire.
    • Check the certificates under /etc/kubernetes/pki and /etc/kubernetes/pki/etcd.
    • If the certificates have expired, renew them using
      /opt/fortanix/sdkms/bin/renew-k8s-certs.sh.
  • The kubelet, docker, and docker-registry service should be running on each node.
    systemctl status docker
    systemctl status kubelet
    systemctl status docker-registry
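
The disk-space check above can also be scripted. The following is a minimal sketch, assuming GNU df; the check_free_gb helper and the MIN_FREE_GB variable are our own illustration, not a Fortanix tool.

```shell
#!/bin/bash
# Sketch: check that each mount point has at least MIN_FREE_GB of free space.
# MIN_FREE_GB reflects the 15 GB guidance in the prerequisites above.
MIN_FREE_GB=15

check_free_gb() {
  local mount="$1"
  # --output=avail prints only the available column; -BG forces GiB units.
  local avail
  avail=$(df -BG --output=avail "$mount" | tail -1 | tr -dc '0-9')
  if [ "$avail" -ge "$MIN_FREE_GB" ]; then
    echo "OK: $mount has ${avail}G free"
  else
    echo "WARN: $mount has only ${avail}G free (need ${MIN_FREE_GB}G)"
  fi
}

check_free_gb /var
check_free_gb /
```

If either line prints WARN, delete the oldest Fortanix DSM version from the UI and re-run the check.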

Upgrading Kubernetes from 1.16 to 1.19

Ensure that you read the ‘Prerequisites’ section before upgrading.

Post Upgrade

The following are the post-upgrade details to note.

  • The status of the deploy job should be “Completed“.
    # pod status
    $ sudo KUBECONFIG=/etc/kubernetes/admin.conf kubectl get pods | grep deploy
    deploy-rrv8v 0/1 Completed 0 18d
    
    # job status
    $ sudo KUBECONFIG=/etc/kubernetes/admin.conf kubectl get job deploy
    NAME COMPLETIONS DURATION AGE
    deploy 1/1 4h54m 18d
  • Kubernetes version is now upgraded from v1.16.15 to v1.19.16. This means that the packages kubeadm, kubectl, and kubelet are upgraded to v1.19.16.
    $ dpkg -l | grep kube
    ii kubeadm 1.19.16-00fortanix amd64 Kubernetes Cluster Bootstrapping Tool
    ii kubectl 1.19.16-00 amd64 Kubernetes Command Line Tool
    ii kubelet 1.19.16-00 amd64 Kubernetes Node Agent
    ii kubernetes-cni 0.8.7-00 amd64 Kubernetes CNI

  • Flannel is upgraded to v0.19.1 from v0.15.1-5-gd3cf066f.
    $ sudo -E kubectl describe ds kube-flannel-ds -nkube-system | grep Image
    Image: containers.fortanix.com:5000/flannel-cni-plugin:v1.0.0
    Image: containers.fortanix.com:5000/flannel:v0.19.1
    Image: containers.fortanix.com:5000/flannel:v0.19.1
  • The swdist container is updated. Check that the image version is 0.13.0:
    $ sudo -E kubectl describe ds swdist -nswdist | grep Image
    Image: containers.fortanix.com:5000/swdist:0.13.0
    Image: containers.fortanix.com:5000/swdist:0.13.0
    Image: containers.fortanix.com:5000/swdist:0.13.0
    Image: containers.fortanix.com:5000/swdist:0.13.0
    Image: containers.fortanix.com:5000/swdist:0.13.0
    Image: containers.fortanix.com:5000/swdist:0.13.0
    Image: containers.fortanix.com:5000/swdist:0.13.0
    Image: containers.fortanix.com:5000/swdist:0.13.0
  • If you are using DC Labeling (https://support.fortanix.com/hc/en-us/articles/360060524671-Fortanix-Data-Security-Manager-Data-Center-Labeling), you can verify that the zone label is added by checking the YAML of the node:
    kubectl get node node_name -o yaml | grep -i 'zone'
  • etcd version is upgraded to 3.4.13-0. Execute the following command for each of the etcd pods in the cluster.
    $ sudo -E kubectl describe pod etcd-sdkms-server-1 -nkube-system | grep Image
    Image: containers.fortanix.com:5000/etcd:3.4.13-0
    Image ID: docker-pullable://containers.fortanix.com:5000/etcd@sha256:1d142ee20719afc2168b2caa3df0c573d6b51741b2f47ea29c5afafa1e3bbe41
    Image: containers.fortanix.com:5000/etcd:3.4.13-0
    Image ID: docker-pullable://containers.fortanix.com:5000/etcd@sha256:1d142ee20719afc2168b2caa3df0c573d6b51741b2f47ea29c5afafa1e3bbe41
  • kube-proxy is upgraded to image “v1.19.14-rc.0.62-1515af898c52b9“.
    $ sudo -E kubectl describe ds kube-proxy -nkube-system | grep Image
    Image: containers.fortanix.com:5000/kube-proxy:v1.19.14-rc.0.62-1515af898c52b9
  • The kured pod is running with image version “1.3.0“.
    $ sudo -E kubectl describe ds kured -nkube-system | grep Image
    Image: containers.fortanix.com:5000/kured:1.3.0
  • The kube-apiserver, kube-controller-manager, and kube-scheduler are upgraded to v1.19.16. This should be checked on each of the nodes in the cluster.
    $ sudo cat /etc/kubernetes/manifests/kube-scheduler.yaml | grep "image:"
    image: containers.fortanix.com:5000/kube-scheduler:v1.19.16
    $ sudo cat /etc/kubernetes/manifests/kube-controller-manager.yaml | grep "image:"
    image: containers.fortanix.com:5000/kube-controller-manager:v1.19.16
    $ sudo cat /etc/kubernetes/manifests/kube-apiserver.yaml | grep "image:"
    image: containers.fortanix.com:5000/kube-apiserver:v1.19.16

  • Check the status of the nodes and the k8s version using the following command:
    $ sudo -E kubectl get nodes 
    NAME STATUS ROLES AGE VERSION
    sdkms-server-1 Ready master 11m v1.19.16
    sdkms-server-2 Ready master 6m15s v1.19.16
    sdkms-server-3 Ready master 90s v1.19.16
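
The node-version check above can be automated by parsing the kubectl output. The following is a minimal sketch; the verify_versions helper and the sample output fed to it are illustrative, not part of the product.

```shell
#!/bin/bash
# Sketch: verify every node in `kubectl get nodes` output reports the
# expected version. EXPECTED is our own variable for this illustration.
EXPECTED="v1.19.16"

verify_versions() {
  # Reads `kubectl get nodes` output on stdin; NR > 1 skips the header row,
  # and $5 is the VERSION column.
  awk -v want="$EXPECTED" 'NR > 1 && $5 != want { bad++; print "MISMATCH:", $1, $5 }
                           END { if (bad) exit 1; print "All nodes at", want }'
}

# Illustrative sample; in practice pipe in live output:
#   sudo -E kubectl get nodes | verify_versions
verify_versions <<'EOF'
NAME STATUS ROLES AGE VERSION
sdkms-server-1 Ready master 11m v1.19.16
sdkms-server-2 Ready master 6m15s v1.19.16
sdkms-server-3 Ready master 90s v1.19.16
EOF
```

The function exits non-zero and prints a MISMATCH line for any node not yet at the expected version.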

Troubleshooting

  • If the kubelet client certificate (/var/lib/kubelet/pki/kubelet-client.crt) has expired and there is no /var/lib/kubelet/pki/kubelet-client-current.pem file present, you can recreate the certificates using the following commands:
    TEMP_DIR=/etc/kubernetes/tmp
    mkdir -p $TEMP_DIR
    KUBELET_CONF="/etc/kubernetes/kubelet.conf"
    BACKUP_PEM="/var/lib/kubelet/pki/kubelet-client-current.pem"
    KEY="/var/lib/kubelet/pki/kubelet-client.key"
    CERT="/var/lib/kubelet/pki/kubelet-client.crt"

    echo "Stopping kubelet service"
    systemctl stop kubelet

    echo "Creating a new key and cert file for kubelet auth"
    nodename=$(echo "$HOSTNAME" | awk '{print tolower($0)}')
    openssl req -out $TEMP_DIR/tmp.csr -new -newkey rsa:2048 -nodes -keyout $TEMP_DIR/tmp.key -subj "/O=system:nodes/CN=system:node:$nodename"
    cat > $TEMP_DIR/kubelet-client.ext << HERE
    keyUsage = critical,digitalSignature,keyEncipherment
    extendedKeyUsage = clientAuth
    HERE
    echo "Signing the generated csr with kubernetes CA"
    openssl x509 -req -days 365 -in $TEMP_DIR/tmp.csr -CA /etc/kubernetes/pki/ca.crt -CAkey /etc/kubernetes/pki/ca.key -CAcreateserial -out $TEMP_DIR/tmp.crt -sha256 -extfile $TEMP_DIR/kubelet-client.ext
    cp $TEMP_DIR/tmp.crt $CERT
    cp $TEMP_DIR/tmp.key $KEY

    chmod 644 $CERT
    chmod 600 $KEY

    if grep -q "client-certificate-data" $KUBELET_CONF; then
      echo "Updating file $KUBELET_CONF to add reference to restored certificates"
      sed -i "s|\(client-certificate-data:\s*\).*\$|client-certificate: $CERT|" $KUBELET_CONF
      sed -i "s|\(client-key-data:\s*\).*\$|client-key: $KEY|" $KUBELET_CONF
    fi

    echo "Starting kubelet service"
    systemctl start kubelet
  • Upgrade on a 2-node cluster can fail due to etcd quorum failure. In such a scenario, if the pods are healthy, you can re-run the deploy job manually using the following command. This will eventually upgrade the cluster to 1.19.
    sdkms-cluster deploy --stage DEPLOY --version <version>
    WARNING
    2 node upgrades are not recommended.
  • When a 3-node cluster is upgraded from build 4.2.2087 to <4.3.xxxx>, it is possible that the deploy job exits and is marked completed before the cluster upgrade finishes. In such a scenario, if all the pods are healthy, you can deploy the version again.
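
Before regenerating kubelet client certificates as above, it can help to confirm which certificates have actually expired. The following is a minimal sketch; the check_cert_expiry helper is our own illustration, and the pki paths match those referenced in the prerequisites.

```shell
#!/bin/bash
# Sketch: print the expiry date of every certificate in a directory.
check_cert_expiry() {
  local dir="$1" crt
  for crt in "$dir"/*.crt; do
    [ -f "$crt" ] || continue
    # -enddate prints "notAfter=<date>"; cut keeps just the date.
    printf '%s expires %s\n' "$crt" "$(openssl x509 -enddate -noout -in "$crt" | cut -d= -f2)"
  done
}

check_cert_expiry /etc/kubernetes/pki
check_cert_expiry /etc/kubernetes/pki/etcd
```

Any certificate with a past date should be renewed with /opt/fortanix/sdkms/bin/renew-k8s-certs.sh, as noted in the prerequisites.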
