1.0 Introduction
The purpose of this article is to describe the steps to upgrade Kubernetes from version 1.21.14 to 1.27.6 for Fortanix DSM release 4.23.
2.0 Overview
The Fortanix DSM 4.23 release will upgrade the system from Kubernetes version 1.21 to 1.27.
Subsequent Kubernetes upgrades will be released as part of regular upgrades or could continue to be independent upgrades.
After upgrading Fortanix DSM to version 4.23, you will not be able to downgrade to previous releases; the Fortanix DSM UI will not allow a downgrade after 4.23 is installed. Work with Fortanix Support to ensure you have a valid backup that can be used to perform a manual recovery.
Also, you must upgrade Fortanix DSM to 4.23 before moving to any future release.
3.0 Pre-Upgrade Checks
Before upgrading Kubernetes, ensure the following:
3.1 Check and Manage Disk Space
Run the following command to check that more than 15 GB of disk space is available in the /var and root (/) directories:
$ df -h /var/ /
The following is the sample output:
Filesystem      Size  Used  Avail  Use%  Mounted on
/dev/nvme0n1p1  993G  22G   972G   3%    /
/dev/nvme0n1p1  993G  22G   972G   3%    /
If less than 15 GB is available, delete the oldest version of Fortanix DSM from the UI, and then run the following command again to confirm that space has been freed:
$ df -h /var/ /
The following is the sample output:
Filesystem            Size  Used  Avail  Use%  Mounted on
/dev/mapper/main-var  47G   26G   21G    56%   /var
/dev/sda2             47G   13G   33G    28%   /
3.2 Configure and Validate Kubernetes
Verify the following keys in the kube-apiserver.yaml file on each node and ensure that the assigned IP address is the same as the host IP:
kubeadm.kubernetes.io/kube-apiserver.advertise-address.endpoint
advertise-address
startupProbe IP
readinessProbe IP
livenessProbe IP
In case of any mismatch, edit the YAML file to replace the assigned IP address with the host IP.
The following lines are references from the /etc/kubernetes/manifests/kube-apiserver.yaml file.
Annotation:
annotations:
  kubeadm.kubernetes.io/kube-apiserver.advertise-address.endpoint: 172.31.1.166:6443
Advertise-address:
spec:
  containers:
  - command:
    - kube-apiserver
    - --advertise-address=172.31.1.166
livenessProbe:
livenessProbe:
  failureThreshold: 8
  httpGet:
    host: 172.31.1.166
readinessProbe:
readinessProbe:
  failureThreshold: 3
  httpGet:
    host: 172.31.1.166
startupProbe:
startupProbe:
  failureThreshold: 24
  httpGet:
    host: 172.31.1.166
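To spot a mismatch quickly, the comparison can be scripted. The following is a minimal sketch; using hostname -I to obtain the host IP is an assumption and may need adjusting on multi-homed nodes:
# Assumption: the first address reported by `hostname -I` is the host IP used by Kubernetes.
HOST_IP=$(hostname -I | awk '{print $1}')
echo "Host IP: ${HOST_IP}"
# Print the advertise-address annotation, the --advertise-address flag, and the probe host entries for manual comparison.
grep -nE "advertise-address|host:" /etc/kubernetes/manifests/kube-apiserver.yaml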
3.3 Check Software Versions in Endpoints
Run the following command to check if all software versions are available in all the endpoints:
kubectl get ep -n swdist
The following is the sample output:
NAME     ENDPOINTS                                          AGE
swdist   10.244.0.212:22,10.244.1.191:22,10.244.2.152:22   242d
v2649    10.244.0.212:22,10.244.1.191:22,10.244.2.152:22   4d
v2657    10.244.0.212:22,10.244.1.191:22,10.244.2.152:22   2d
Run the following command to check the status of the docker-registry service:
systemctl status docker-registry
Ensure that the status is active and running before and after the software is uploaded.
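For a quick non-interactive check of the same thing, a minimal sketch:
# Prints "active" and exits with status 0 only when the service is running.
systemctl is-active docker-registry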
3.4 Check Cluster and Node Health
Run the following command on each node and ensure that the overlay mount options match the following:
cat /etc/systemd/system/var-opt-fortanix-swdist_overlay.mount.d/options.conf
[Mount]
Options=lowerdir=/var/opt/fortanix/swdist/data/vXXXX/registry:/var/opt/fortanix/swdist/data/vYYYY/registry
Here, 'vXXXX' is the previous version and 'vYYYY' is the upgraded version.
Ensure that the latest backup is triggered and verify that it is a successful backup (size and other metrics).
All nodes must report as healthy and be running Kubernetes version 1.21.14 and kernel 5.4.0-147-generic. Run the following command to get the nodes and list their IPs:
kubectl get nodes -o wide
Look for the version number under the VERSION column; it must be v1.21.14 for each of the nodes.
NAME   STATUS   ROLES    AGE   VERSION    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
ali1   Ready    master   2d    v1.21.14                 Ubuntu 20.04.3 LTS   5.8.0-50-generic   docker://19.3.11
nuc3   Ready    master   3d    v1.21.14                 Ubuntu 20.04.3 LTS   5.8.0-50-generic   docker://19.3.11
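To list only the reported versions without scanning the full table, a minimal sketch using a kubectl JSONPath query:
# Prints one line per node: <name> <kubelet version>. Every line should show v1.21.14.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.nodeInfo.kubeletVersion}{"\n"}{end}'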
All pods must be healthy in the default, swdist, and kube-system namespaces.
Run the following command to check the kubeadm configuration on the cluster:
kubectl get configmap kubeadm-config -oyaml -nkube-system
This should return the following values for parameters in the master configuration:
kubernetesVersion: v1.21.14
imageRepository: containers.fortanix.com:5000/
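To pull just these two values out of the ConfigMap, a minimal sketch:
# Shows only the kubernetesVersion and imageRepository fields of the kubeadm ConfigMap.
kubectl get configmap kubeadm-config -nkube-system -oyaml | grep -E "kubernetesVersion|imageRepository"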
3.5 Check Etcd Cluster and Component
Check the status of etcd and confirm that isLeader=true is assigned to one of the etcd nodes; etcd should be TLS migrated.
Run the following command to generate the list of etcd members:
sudo KUBECONFIG=/etc/kubernetes/admin.conf kubectl exec etcd-ip-172-31-0-83 -nkube-system -- etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/healthcheck-client.crt --key /etc/kubernetes/pki/etcd/healthcheck-client.key --endpoints https://127.0.0.1:2379 member list
Defaulted container "etcd" out of: etcd, etcd-wait (init)
The following is the sample output of the above command:
23fa1b1fefa943ca, started, ip-172-31-2-51, https://172.31.2.51:2380, https://172.31.2.51:2379, false
319b193f3bafd483, started, ip-172-31-1-157, https://172.31.1.157:2380, https://172.31.1.157:2379, false
60fb3858c74022f5, started, ip-172-31-0-83, https://172.31.0.83:2380, https://172.31.0.83:2379, false
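The member list output does not indicate which member is the leader. A minimal sketch for the isLeader check, assuming the same pod name and certificate paths shown above; the IS LEADER column of the table output should be true for exactly one member:
sudo KUBECONFIG=/etc/kubernetes/admin.conf kubectl exec etcd-ip-172-31-0-83 -nkube-system -- etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/healthcheck-client.crt --key /etc/kubernetes/pki/etcd/healthcheck-client.key --endpoints https://127.0.0.1:2379 endpoint status --cluster -w table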
Run the following command to ensure that the version of etcd on each of the etcd pods is 3.4.13:
sudo KUBECONFIG=/etc/kubernetes/admin.conf kubectl exec etcd-ip-172-31-0-83 -nkube-system -- etcd --version
Defaulted container "etcd" out of: etcd, etcd-wait (init)
The following is the sample output of the above command:
etcd Version: 3.4.13
Git SHA: ae9734ed2
Go Version: go1.12.17
Go OS/Arch: linux/amd64
Run the following command to check the health of the etcd cluster and ensure that it is healthy:
sudo KUBECONFIG=/etc/kubernetes/admin.conf kubectl exec etcd-ip-172-31-0-83 -nkube-system -- etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/healthcheck-client.crt --key /etc/kubernetes/pki/etcd/healthcheck-client.key --endpoints https://127.0.0.1:2379 endpoint health
Defaulted container "etcd" out of: etcd, etcd-wait (init)
The following is the sample output of the above command:
https://127.0.0.1:2379 is healthy: successfully committed proposal: took = 7.286578ms
On each node, navigate to the /etc/kubernetes/manifests directory and run the following commands to check the image versions of all Kubernetes control-plane components:
ls
etcd.yaml  kube-apiserver.yaml  kube-controller-manager.yaml  kube-scheduler.yaml

cat etcd.yaml | grep "image:"
    image: containers.fortanix.com:5000/etcd:3.4.13-0
    image: containers.fortanix.com:5000/etcd:3.4.13-0

cat kube-apiserver.yaml | grep "image: "
    image: containers.fortanix.com:5000/kube-apiserver:v1.21.14

cat kube-controller-manager.yaml | grep "image: "
    image: containers.fortanix.com:5000/kube-controller-manager:v1.21.14

cat kube-scheduler.yaml | grep "image: "
    image: containers.fortanix.com:5000/kube-scheduler:v1.21.14
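The same check can be done in one pass; a minimal sketch:
# Lists every image referenced by the static control-plane manifests on this node.
grep -h "image:" /etc/kubernetes/manifests/*.yaml | sort | uniq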
Perform the following steps to check the expiry of the Kubernetes certificates:
Check the certificates under the /etc/kubernetes/pki and /etc/kubernetes/pki/etcd directories.
Run the following command to renew the expired certificates:
/opt/fortanix/sdkms/bin/renew-k8s-certs.sh
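To inspect the expiry dates of these certificates before deciding whether a renewal is needed, a minimal sketch using openssl:
# Print the notAfter date of every certificate under the Kubernetes PKI directories.
for crt in /etc/kubernetes/pki/*.crt /etc/kubernetes/pki/etcd/*.crt; do
    echo -n "$crt: "
    openssl x509 -enddate -noout -in "$crt"
done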
Run the following commands on each node to check the status of the kubelet, containerd, and docker-registry services:
systemctl status containerd
systemctl status kubelet
systemctl status docker-registry
NOTE
Ensure that the status of the services is Running.
4.0 Post-Upgrade Checks
Before upgrading Kubernetes, ensure that you have completed the Pre-Upgrade Checks. After the upgrade, perform the following checks:
4.1 Check Node and Deployment Status
Run the following command to check the status of the deploy job:
# kubectl get pods | grep deploy
The following is the sample output of the above command:
deploy-vqq7r 0/1 Completed 0 125m
NOTE
Ensure that the status of the pod is Completed.
Run the following command to list the deploy job:
# kubectl get job deploy
The following is the sample output of the above command:
NAME     COMPLETIONS   DURATION   AGE
deploy   1/1           4h54m      18d
NOTE
Verify the completion and duration of the job.
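To wait for (or confirm) completion non-interactively, a minimal sketch; the 10-minute timeout is an arbitrary example value:
# Blocks until the deploy job reports the Complete condition, or fails after the timeout.
kubectl wait --for=condition=complete job/deploy --timeout=10m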
If you are using DC Labeling, run the following command to verify that the zone label is present in the node YAML:
kubectl get node node_name -o yaml | grep -i 'zone'
Run the following command to check the status of the nodes and the Kubernetes version, and to confirm that the role is control-plane:
kubectl get nodes -o wide
The following is the sample output of the above command:
NAME              STATUS   ROLES           AGE     VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
ip-172-31-0-235   Ready    control-plane   4h41m   v1.27.6   172.31.0.235                 Ubuntu 20.04.6 LTS   5.4.0-155-generic   containerd://1.7.2
ip-172-31-1-96    Ready    control-plane   4h32m   v1.27.6   172.31.1.96                  Ubuntu 20.04.6 LTS   5.4.0-155-generic   containerd://1.7.2
ip-172-31-2-139   Ready    control-plane   4h37m   v1.27.6   172.31.2.139                 Ubuntu 20.04.6 LTS   5.4.0-155-generic   containerd://1.7.2
NOTE
Ensure the following:
Status of the nodes is Ready.
The VERSION column reflects v1.27.6.
The ROLES column reflects control-plane.
The KERNEL-VERSION column reflects 5.4.0-155-generic.
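To confirm the version and kernel on every node at once, a minimal sketch using a kubectl JSONPath query:
# Prints one line per node: <name> <kubelet version> <kernel version>.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.nodeInfo.kubeletVersion}{" "}{.status.nodeInfo.kernelVersion}{"\n"}{end}'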
4.2 Check Kubernetes and Component Version
Run the following command to generate the list of etcd members:
kubectl exec etcd-ip-172-31-0-235 -nkube-system -- etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/healthcheck-client.crt --key /etc/kubernetes/pki/etcd/healthcheck-client.key --endpoints https://127.0.0.1:2379 member list
Defaulted container "etcd" out of: etcd, etcd-wait (init)
The following is the sample output of the above command:
2a7aa68b5fd7001e, started, ip-172-31-2-139, https://172.31.2.139:2380, https://172.31.2.139:2379, false
752400b1b0eb1984, started, ip-172-31-1-96, https://172.31.1.96:2380, https://172.31.1.96:2379, false
9d46aef2058b6a38, started, ip-172-31-0-235, https://172.31.0.235:2380, https://172.31.0.235:2379, false
Run the following command to check if kube-proxy is upgraded to image v1.27.6-1-840fae1b914b0d:
$ sudo -E kubectl describe ds kube-proxy -nkube-system | grep Image
The following is the sample output of the above command:
Image: containers.fortanix.com:5000/kube-proxy:v1.27.6-1-840fae1b914b0d
Run the following command to check if the kured pod is running with image version 1.14.0:
$ sudo -E kubectl describe ds kured -nkube-system | grep Image
The following is the sample output of the above command:
Image: containers.fortanix.com:5000/kured:1.14.0
Run the following commands on each of the nodes in the cluster to check if kube-apiserver, kube-controller-manager, and kube-scheduler are upgraded to 1.27.6:
$ sudo cat /etc/kubernetes/manifests/kube-scheduler.yaml | grep "image:"
    image: containers.fortanix.com:5000/kube-scheduler:v1.27.6
$ sudo cat /etc/kubernetes/manifests/kube-controller-manager.yaml | grep "image:"
    image: containers.fortanix.com:5000/kube-controller-manager:v1.27.6
$ sudo cat /etc/kubernetes/manifests/kube-apiserver.yaml | grep "image:"
    image: containers.fortanix.com:5000/kube-apiserver:v1.27.6
Run the following command to check the version of etcd:
kubectl get pod etcd-ip-172-31-0-235 -n kube-system -o yaml | grep image:
The following is the sample output of the above command:
image: containers.fortanix.com:5000/etcd:3.5.7-0
image: containers.fortanix.com:5000/etcd:3.5.7-0
image: containers.fortanix.com:5000/etcd:3.5.7-0
image: containers.fortanix.com:5000/etcd:3.5.7-0
Run the following command to check the version of the cert-manager helm chart:
helm list -A
The following is the sample output of the above command:
NAME          NAMESPACE      REVISION   UPDATED                                   STATUS     CHART                            APP VERSION
certmanager   cert-manager   4          2023-09-25 05:18:17.673679069 +0000 UTC   deployed   cert-manager-v1.13.0             v1.13.0
csiplugin     cert-manager   4          2023-09-25 05:18:21.061921817 +0000 UTC   deployed   cert-manager-csi-driver-v0.5.0   v0.5.0
NOTE
Ensure that the cert-manager helm chart version is 1.13.0 and the csiplugin version is 0.5.0.
Run the following command to check if the Kubernetes version is upgraded to v1.27.6 (including the kubeadm, kubectl, and kubelet packages):
$ dpkg -l | grep kube
The following is the sample output of the above command:
ii  kubeadm          1.27.6-00fortanix   amd64   Kubernetes Cluster Bootstrapping Tool
ii  kubectl          1.27.6-00           amd64   Kubernetes Command Line Tool
ii  kubelet          1.27.6-00           amd64   Kubernetes Node Agent
ii  kubernetes-cni   1.2.0-00            amd64   Kubernetes CNI
Run the following command to check if the image tag for the swdist container is updated to 0.25.0:
$ sudo -E kubectl describe ds swdist -nswdist | grep Image
The following is the sample output of the above command:
Image: containers.fortanix.com:5000/swdist:0.25.0
Image: containers.fortanix.com:5000/swdist:0.25.0
Image: containers.fortanix.com:5000/swdist:0.25.0
Image: containers.fortanix.com:5000/swdist:0.25.0
Image: containers.fortanix.com:5000/swdist:0.25.0
Image: containers.fortanix.com:5000/swdist:0.25.0
Image: containers.fortanix.com:5000/swdist:0.25.0
Image: containers.fortanix.com:5000/swdist:0.25.0
Run the following command to check the replicas of the coredns deployment:
sudo -E kubectl get pods -nkube-system -owide | grep coredns
The following is the sample output of the above command:
coredns-786bdcfc9c-bvzzf   1/1   Running   0   131m   10.244.0.117   ip-172-31-0-235   <none>   <none>
coredns-786bdcfc9c-fkw7s   1/1   Running   0   131m   10.244.1.116   ip-172-31-2-139   <none>   <none>
coredns-786bdcfc9c-r2s8c   1/1   Running   0   131m   10.244.2.98    ip-172-31-1-96    <none>   <none>
NOTE
Ensure that the number of coredns replicas is equal to the number of nodes in the cluster.
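To compare the two counts directly, a minimal sketch (the k8s-app=kube-dns label on the coredns pods is the standard label and is assumed here):
# Both commands should print the same number.
kubectl get nodes --no-headers | wc -l
kubectl get pods -nkube-system -l k8s-app=kube-dns --no-headers | wc -l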
Run the following command to check the version of flannel and flannel-plugin:
kubectl get ds kube-flannel-ds -n kube-system -o yaml | grep image:
The following is the sample output of the above command:
image: containers.fortanix.com:5000/flannel:v0.22.3
image: containers.fortanix.com:5000/flannel-cni-plugin:v1.1.2
image: containers.fortanix.com:5000/flannel:v0.22.3
NOTE
Ensure that the flannel version is 0.22.3 and the flannel-cni-plugin version is 1.1.2.
4.3 Check cert-manager Configuration
Run the following command to check all the resources of cert-manager:
kubectl get all -n cert-manager
The following is the sample output of the above command:
NAME                                                        READY   STATUS    RESTARTS      AGE
pod/cert-manager-csi-driver-9lvw2                           3/3     Running   4 (14h ago)   15h
pod/certmanager-cert-manager-5fd9f859bb-7slz2               1/1     Running   0             14h
pod/certmanager-cert-manager-cainjector-5998546469-pk9kb    1/1     Running   0             14h
pod/certmanager-cert-manager-webhook-878f95fb5-699lp        1/1     Running   0             14h

NAME                                       TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/certmanager-cert-manager           ClusterIP   10.245.213.126                 9402/TCP   15h
service/certmanager-cert-manager-webhook   ClusterIP   10.245.20.237                  443/TCP    15h

NAME                                      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/cert-manager-csi-driver    1         1         1       1            1                           15h

NAME                                                  READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/certmanager-cert-manager              1/1     1            1           15h
deployment.apps/certmanager-cert-manager-cainjector   1/1     1            1           15h
deployment.apps/certmanager-cert-manager-webhook      1/1     1            1           15h

NAME                                                             DESIRED   CURRENT   READY   AGE
replicaset.apps/certmanager-cert-manager-5fd9f859bb              1         1         1       14h
replicaset.apps/certmanager-cert-manager-6c6bdd85d9              0         0         0       15h
replicaset.apps/certmanager-cert-manager-cainjector-5998546469   1         1         1       14h
replicaset.apps/certmanager-cert-manager-cainjector-7b7cbc6988   0         0         0       15h
replicaset.apps/certmanager-cert-manager-webhook-555cbb78cd      0         0         0       15h
replicaset.apps/certmanager-cert-manager-webhook-878f95fb5       1         1         1       14h
Run the following command to check the DEPLOYMENT_STAGE environment variable in all Cassandra pods. It should be set to CERT_MANAGER_ONLY, as illustrated in the example for cassandra-0:
kubectl exec -it cassandra-0 -- env | grep DEPLOYMENT_STAGE
DEPLOYMENT_STAGE=CERT_MANAGER_ONLY
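To run the same check against every Cassandra pod in one pass, a minimal sketch that selects the pods by name (the cassandra- name prefix is taken from the example above):
# Prints the DEPLOYMENT_STAGE value for each Cassandra pod.
for pod in $(kubectl get pods -o name | grep cassandra); do
    echo -n "$pod: "
    kubectl exec "$pod" -- env | grep DEPLOYMENT_STAGE
done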
Run the following command to check the configmap with the name cassandra-cert-manager-migration-state:
kubectl get cm cassandra-cert-manager-migration-state -ojsonpath='{.data}'
{"DEPLOYMENT_STAGE":"CERT_MANAGER_ONLY"}
5.0 Troubleshooting
If the kubelet client certificate (/var/lib/kubelet/pki/kubelet-client.crt) has expired and there is no /var/lib/kubelet/pki/kubelet-client-current.pem file present, you can recreate the certificate using the following commands:
TEMP_DIR=/etc/kubernetes/tmp
mkdir -p $TEMP_DIR
BACKUP_PEM="/var/lib/kubelet/pki/kubelet-client-current.pem"
KEY="/var/lib/kubelet/pki/kubelet-client.key"
CERT="/var/lib/kubelet/pki/kubelet-client.crt"
# Path to the kubelet kubeconfig; /etc/kubernetes/kubelet.conf is the standard kubeadm location (adjust if yours differs).
KUBELET_CONF=/etc/kubernetes/kubelet.conf
echo "Stopping kubelet service"
systemctl stop kubelet
echo "Creating a new key and cert file for kubelet auth"
nodename=$(echo "$HOSTNAME" | awk '{print tolower($0)}')
openssl req -out $TEMP_DIR/tmp.csr -new -newkey rsa:2048 -nodes -keyout $TEMP_DIR/tmp.key -subj "/O=system:nodes/CN=system:node:$nodename"
cat > $TEMP_DIR/kubelet-client.ext << HERE
keyUsage = critical,digitalSignature,keyEncipherment
extendedKeyUsage = clientAuth
HERE
echo "Signing the generated csr with kubernetes CA"
openssl x509 -req -days 365 -in $TEMP_DIR/tmp.csr -CA /etc/kubernetes/pki/ca.crt -CAkey /etc/kubernetes/pki/ca.key -CAcreateserial -out $TEMP_DIR/tmp.crt -sha256 -extfile $TEMP_DIR/kubelet-client.ext
cp $TEMP_DIR/tmp.crt $CERT
cp $TEMP_DIR/tmp.key $KEY
chmod 644 $CERT
chmod 600 $KEY
# Point the kubelet kubeconfig at the restored certificate files instead of the embedded data.
if grep -q "client-certificate-data" $KUBELET_CONF; then
  echo "Updating file $KUBELET_CONF to add reference to restored certificates"
  sed -i "s|\(client-certificate-data:\s*\).*\$|client-certificate: $CERT|" $KUBELET_CONF
  sed -i "s|\(client-key-data:\s*\).*\$|client-key: $KEY|" $KUBELET_CONF
fi
echo "Starting kubelet service"
systemctl start kubelet
An upgrade on a 2-node cluster can fail due to etcd quorum failure. In such a scenario, if the pods are healthy, you can re-run the deploy job manually using the following command. This will eventually upgrade the cluster to 1.14.
sdkms-cluster deploy --stage DEPLOY --version
WARNING
2-node upgrades are not recommended.
When a cluster is upgraded from build 4.2.2087 to <4.3.xxxx> on a 3-node cluster, it is possible that the deploy job exits and is marked completed before the cluster upgrade finishes. In such a scenario, if all the pods are healthy, you can deploy the version again.