Fortanix Data Security Manager (Release 4.23) Kubernetes Version Upgrade to 1.27

1.0 Introduction

The purpose of this article is to describe the steps to upgrade Kubernetes from version 1.21.14 to 1.27.6 for Fortanix DSM release 4.23.

2.0 Overview

The Fortanix DSM 4.23 release will upgrade the system from Kubernetes version 1.21 to 1.27.
Subsequent Kubernetes upgrades will be delivered either as part of regular Fortanix DSM releases or as independent upgrades.

After upgrading Fortanix DSM to the 4.23 version, you will not be able to downgrade to previous releases. The Fortanix DSM UI will not allow a downgrade after 4.23 is installed. Please work with Fortanix Support to ensure you have a valid backup that can be used to perform a manual recovery.

Also, you will need to upgrade Fortanix DSM to 4.23 before moving to any future release.

3.0 Pre-Upgrade Checks

Before upgrading Kubernetes, ensure the following:

3.1 Check and Manage Disk Space

  1. Run the following command to check whether more than 15 GB of disk space is available in the /var and root (/) directories:

    $ df -h /var/ /

    The following is the sample output:

    Filesystem Size Used Avail Use% Mounted on
    /dev/nvme0n1p1 993G 22G 972G 3% /
    /dev/nvme0n1p1 993G 22G 972G 3% /
  2. If less than 15 GB is available, delete the oldest version of Fortanix DSM from the UI to free up space, and then run the following command again to confirm the available space (a scripted check is sketched after this list):

    $ df -h /var/ /

    The following is the sample output:

    Filesystem Size Used Avail Use% Mounted on
    /dev/mapper/main-var 47G 26G 21G 56% /var
    /dev/sda2 47G 13G 33G 28% /
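If you prefer to script the check in step 2, the following is a minimal sketch that flags either directory when less than 15 GB is free (it assumes GNU df, which is present on the Ubuntu-based nodes shown in the samples above):

    # Warn when /var or / has less than 15 GB available.
    for mount in /var /; do
        avail_gb=$(df --output=avail -BG "$mount" | tail -1 | tr -dc '0-9')
        if [ "$avail_gb" -lt 15 ]; then
            echo "WARNING: only ${avail_gb}G available on $mount (need at least 15G)"
        else
            echo "OK: ${avail_gb}G available on $mount"
        fi
    done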

3.2 Configure and Validate Kubernetes

  1. Verify the following keys in the kube-apiserver.yaml file of each node and ensure that the assigned IP address is the same as the host IP.

    • kubeadm.kubernetes.io/kube-apiserver.advertise-address.endpoint

    • advertise-address

    • startupProbe IP

    • readinessProbe IP

    • livenessProbe IP

    In case of any mismatch, edit the YAML file to replace the assigned IP address with the host IP. A comparison sketch is provided after this list.

    The following lines are for reference from the /etc/kubernetes/manifests/kube-apiserver.yaml file:

    • Annotation:

      annotations:
         kubeadm.kubernetes.io/kube-apiserver.advertise-address.endpoint: 172.31.1.166:6443
    • Advertise-address:

      spec:
        containers:
        - command:
          - kube-apiserver
          - --advertise-address=172.31.1.166
    • livenessProbe:

      livenessProbe:
           failureThreshold: 8 
           httpGet: 
             host: 172.31.1.166
    • ReadinessProbe:

      readinessProbe:
           failureThreshold: 3
           httpGet:
             host: 172.31.1.166
    • startupProbe:

      startupProbe:
           failureThreshold: 24
           httpGet:
             host: 172.31.1.166
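To compare the manifest entries against the host IP in one step, a sketch such as the following can be run on each node (a minimal sketch; it assumes the first address reported by hostname -I is the host IP used by Kubernetes, so adjust if the node has multiple interfaces):

    # Print the host IP and every address-bearing line in the kube-apiserver manifest.
    HOST_IP=$(hostname -I | awk '{print $1}')
    echo "Host IP: $HOST_IP"
    sudo grep -nE 'advertise-address|host:' /etc/kubernetes/manifests/kube-apiserver.yaml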

3.3 Check Software Versions in Endpoints

  1. Run the following command to check if all software versions are available in all the endpoints:

    kubectl get ep -n swdist

    The following is the sample output:

    NAME      ENDPOINTS                                         AGE
    swdist    10.244.0.212:22,10.244.1.191:22,10.244.2.152:22   242d
    v2649     10.244.0.212:22,10.244.1.191:22,10.244.2.152:22   4d
    v2657     10.244.0.212:22,10.244.1.191:22,10.244.2.152:22   2d
  2. Run the following command to check the status of docker registry:

    systemctl status docker-registry

    Ensure that the status is active and running before and after the software is uploaded (a quick check is sketched after this list).
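Both checks can be condensed as follows (a minimal sketch; it assumes that one swdist endpoint address is expected per node):

    # The two counts below should match: one swdist endpoint address per node.
    kubectl get ep swdist -n swdist -o jsonpath='{.subsets[*].addresses[*].ip}' | wc -w
    kubectl get nodes --no-headers | wc -l
    # Prints "active" when the docker-registry service is running.
    systemctl is-active docker-registry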

3.4 Check Cluster and Node Health

  1. Run the following command on each node to ensure that the overlay mount options match the following:

    cat /etc/systemd/system/var-opt-fortanix-swdist_overlay.mount.d/options.conf
    [Mount]
    Options=lowerdir=/var/opt/fortanix/swdist/data/vXXXX/registry:/var/opt/fortanix/swdist/data/vYYYY/registry

    Here, ‘vXXXX’ is the previous version and ‘vYYYY’ is the upgraded version.

  2. Ensure that the latest backup is triggered and verify that it is a successful backup (size and other metrics).

  3. All nodes must report as healthy and be running Kubernetes version 1.21.14 and kernel 5.4.0-147-generic. Run the following command to list the nodes along with their IP addresses (a consolidated check is sketched at the end of this list):

    kubectl get nodes -o wide

    Look for the version number under the VERSION column; it must be v1.21.14 for each of the nodes.

    NAME STATUS ROLES AGE VERSION EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
    ali1 Ready master 2d v1.21.14  Ubuntu 20.04.3 LTS 5.8.0-50-generic docker://19.3.11
    nuc3 Ready master 3d v1.21.14  Ubuntu 20.04.3 LTS 5.8.0-50-generic docker://19.3.11
  4. Ensure that all pods are healthy in the default, swdist, and kube-system namespaces.

  5. Run the following command to check kubeadm configuration on the cluster:

    kubectl get configmap kubeadm-config -oyaml -nkube-system

    This should return the following values for parameters in the master configuration:

    • kubernetesVersion: v1.21.14

    • imageRepository: http://containers.fortanix.com:5000/
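The node version and pod health checks above can be run in one pass, for example (a minimal sketch; the first command should print v1.21.14 for every node and the loop should report no pods outside the Running or Succeeded phases):

    # Kubelet version reported by every node.
    kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.nodeInfo.kubeletVersion}{"\n"}{end}'
    # List any pod that is not Running or Succeeded in the namespaces of interest.
    for ns in default swdist kube-system; do
        kubectl get pods -n "$ns" --field-selector=status.phase!=Running,status.phase!=Succeeded
    done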

3.5 Check Etcd Cluster and Component

  1. Run the following command to check the status of etcd and confirm that isLeader=true is assigned to one of the etcd nodes.

    • etcd should be TLS migrated.
      Run the following command to generate the list of etcd members:

      sudo KUBECONFIG=/etc/kubernetes/admin.conf kubectl exec etcd-ip-172-31-0-83 -nkube-system -- etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/healthcheck-client.crt --key /etc/kubernetes/pki/etcd/healthcheck-client.key --endpoints https://127.0.0.1:2379 member list
      Defaulted container "etcd" out of: etcd, etcd-wait (init)

      The following is the sample output of the above command:

      23fa1b1fefa943ca, started, ip-172-31-2-51, https://172.31.2.51:2380, https://172.31.2.51:2379, false
      319b193f3bafd483, started, ip-172-31-1-157, https://172.31.1.157:2380, https://172.31.1.157:2379, false
      60fb3858c74022f5, started, ip-172-31-0-83, https://172.31.0.83:2380, https://172.31.0.83:2379, false
  2. Run the following command to ensure that the version of etcd on each of the etcd pods is 3.4.13:

    sudo KUBECONFIG=/etc/kubernetes/admin.conf kubectl exec etcd-ip-172-31-0-83 -nkube-system -- etcd --version
    Defaulted container "etcd" out of: etcd, etcd-wait (init)

    The following is the sample output of the above command:

    etcd Version: 3.4.13
    Git SHA: ae9734ed2
    Go Version: go1.12.17
    Go OS/Arch: linux/amd64
  3. Run the following command to check the health of the etcd cluster and ensure that it reports as healthy:

    sudo KUBECONFIG=/etc/kubernetes/admin.conf kubectl exec etcd-ip-172-31-0-83 -nkube-system -- etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/healthcheck-client.crt --key /etc/kubernetes/pki/etcd/healthcheck-client.key --endpoints https://127.0.0.1:2379 endpoint health
    Defaulted container "etcd" out of: etcd, etcd-wait (init)

    The following is the sample output of the above command:

    https://127.0.0.1:2379 is healthy: successfully committed proposal: took = 7.286578ms
  4. On each node, navigate to the /etc/kubernetes/manifests directory and run the following commands to check the image versions of all Kubernetes control-plane components:

    ls
    etcd.yaml  kube-apiserver.yaml  kube-controller-manager.yaml  kube-scheduler.yaml
    cat etcd.yaml | grep "image:"
        image: containers.fortanix.com:5000/etcd:3.4.13-0
        image: containers.fortanix.com:5000/etcd:3.4.13-0
    cat kube-apiserver.yaml | grep "image: "
        image: containers.fortanix.com:5000/kube-apiserver:v1.21.14
    cat kube-controller-manager.yaml | grep "image: "
        image: containers.fortanix.com:5000/kube-controller-manager:v1.21.14
    cat kube-scheduler.yaml | grep "image: "
        image: containers.fortanix.com:5000/kube-scheduler:v1.21.14
    
  5. Perform the following steps to check the expiry of the Kubernetes certificates.

    1. Check the expiry of the certificates under the /etc/kubernetes/pki and /etc/kubernetes/pki/etcd directories (see the sketch at the end of this list).

    2. Run the following command to renew the expired certificates:

      /opt/fortanix/sdkms/bin/renew-k8s-certs.sh
  6. Run the following commands on each node to check the status of the kubelet, containerd, and docker-registry services:

    systemctl status containerd
    systemctl status kubelet
    systemctl status docker-registry

    NOTE

    Ensure that the status of the services is Running.

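For step 5, the certificate expiry dates can be listed with openssl, for example (a minimal sketch run on each node):

    # Print the expiry date of every Kubernetes and etcd certificate on this node.
    for cert in /etc/kubernetes/pki/*.crt /etc/kubernetes/pki/etcd/*.crt; do
        echo -n "$cert: "
        sudo openssl x509 -noout -enddate -in "$cert"
    done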

4.0 Post-Upgrade Checks

Refer to the Pre-Upgrade Checks section before upgrading Kubernetes. After the upgrade completes, perform the following checks:

4.1 Check Node and Deployment Status

  1. Run the following command to check the status of the deploy job:

    # kubectl get pods | grep deploy

    The following is the sample output of the above command:

    deploy-vqq7r     0/1     Completed   0    125m

    NOTE

    Ensure that the status of the pod is Completed.

  2. Run the following command to get the list of the deploy job:

    # kubectl get job deploy

    The following is the sample output of the above command:

    NAME     COMPLETIONS   DURATION   AGE
    deploy   1/1           4h54m      18d
    

    NOTE

    Verify the completion and duration of the job.

  3. If you are using DC labeling, run the following command to verify that the zone label is present in the YAML of the node:

    kubectl get node node_name -o yaml | grep -i 'zone'
  4. Run the following command to check the status of the nodes and confirm that the Kubernetes version is v1.27.6 and the role is control-plane (a consolidated check is sketched after the note below):

    kubectl get nodes -o wide

    The following is the sample output of the above command:

    NAME              STATUS   ROLES           AGE     VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
    ip-172-31-0-235   Ready    control-plane   4h41m   v1.27.6   172.31.0.235           Ubuntu 20.04.6 LTS   5.4.0-155-generic   containerd://1.7.2
    ip-172-31-1-96    Ready    control-plane   4h32m   v1.27.6   172.31.1.96            Ubuntu 20.04.6 LTS   5.4.0-155-generic   containerd://1.7.2
    ip-172-31-2-139   Ready    control-plane   4h37m   v1.27.6   172.31.2.139           Ubuntu 20.04.6 LTS   5.4.0-155-generic   containerd://1.7.2

    NOTE

    Ensure the following:

    • Status of the nodes is Ready

    • VERSION column reflects v1.27.6

    • ROLE column reflects control-plane

    • KERNEL-VERSION column reflects 5.4.0-155-generic
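The checks in this note can also be confirmed in one command, for example (a minimal sketch; every row should show v1.27.6 and kernel 5.4.0-155-generic, and the ROLES column of kubectl get nodes should show control-plane):

    # Kubelet version and kernel for every node.
    kubectl get nodes -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion,KERNEL:.status.nodeInfo.kernelVersion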

4.2 Check Kubernetes and Component Version

  1. Run the following command to generate the list of etcd members:

    kubectl exec etcd-ip-172-31-0-235 -nkube-system -- etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/healthcheck-client.crt --key /etc/kubernetes/pki/etcd/healthcheck-client.key --endpoints https://127.0.0.1:2379 member list
    Defaulted container "etcd" out of: etcd, etcd-wait (init)

    The following is the sample output of the above command:

    2a7aa68b5fd7001e, started, ip-172-31-2-139, https://172.31.2.139:2380, https://172.31.2.139:2379, false
    752400b1b0eb1984, started, ip-172-31-1-96, https://172.31.1.96:2380, https://172.31.1.96:2379, false
    9d46aef2058b6a38, started, ip-172-31-0-235, https://172.31.0.235:2380, https://172.31.0.235:2379, false
  2. Run the following command to check if kube-proxy is upgraded to image v1.27.6-1-840fae1b914b0d:

    $ sudo -E kubectl describe ds kube-proxy -nkube-system | grep Image

    The following is the sample output of the above command:

    Image: containers.fortanix.com:5000/kube-proxy:v1.27.6-1-840fae1b914b0d
  3. Run the following command to check if the kured pod is running with image version 1.14.0:

    $ sudo -E kubectl describe ds kured -nkube-system | grep Image

    The following is the sample output of the above command:

    Image: containers.fortanix.com:5000/kured:1.14.0
  4. Run the following commands on each of the nodes in the cluster to check if kube-apiserver, kube-controller-manager, and kube-scheduler are upgraded to v1.27.6:

    $ sudo cat /etc/kubernetes/manifests/kube-scheduler.yaml | grep "image:"
        image: containers.fortanix.com:5000/kube-scheduler:v1.27.6
    $ sudo cat /etc/kubernetes/manifests/kube-controller-manager.yaml | grep "image:"
        image: containers.fortanix.com:5000/kube-controller-manager:v1.27.6
    $ sudo cat /etc/kubernetes/manifests/kube-apiserver.yaml | grep "image:"
        image: containers.fortanix.com:5000/kube-apiserver:v1.27.6
    
  5. Run the following command to check the version of etcd:

    kubectl get pod etcd-ip-172-31-0-235 -n kube-system -o yaml | grep image:

    The following is the sample output of the above command:

    image: containers.fortanix.com:5000/etcd:3.5.7-0
    image: containers.fortanix.com:5000/etcd:3.5.7-0
    image: containers.fortanix.com:5000/etcd:3.5.7-0
    image: containers.fortanix.com:5000/etcd:3.5.7-0
    
  6. Run the following command to check the version of cert-manager helm chart:

    helm list -A

    The following is the sample output of the above command:

    NAME            NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                          APP VERSION
    certmanager     cert-manager    4               2023-09-25 05:18:17.673679069 +0000 UTC deployed        cert-manager-v1.13.0           v1.13.0
    csiplugin       cert-manager    4               2023-09-25 05:18:21.061921817 +0000 UTC deployed        cert-manager-csi-driver-v0.5.0 v0.5.0
    

    NOTE

    Ensure that the helm chart version is 1.13.0 and csiplugin version is 0.5.0.

  7. Run the following command to check if the Kubernetes version is upgraded to v1.27.6 (including kubeadm, kubectl, kubelet packages):

    $ dpkg -l | grep kube

    The following is the sample output of the above command:

    ii kubeadm 1.27.6-00fortanix amd64 Kubernetes Cluster Bootstrapping Tool
    ii kubectl 1.27.6-00 amd64 Kubernetes Command Line Tool
    ii kubelet 1.27.6-00 amd64 Kubernetes Node Agent
    ii kubernetes-cni 1.2.0-00 amd64 Kubernetes CNI
  8. Run the following command to check that the swdist container image is updated to tag 0.25.0:

    $ sudo -E kubectl describe ds swdist -nswdist | grep Image

    The following is the sample output of the above command:

        Image:      containers.fortanix.com:5000/swdist:0.25.0
        Image:      containers.fortanix.com:5000/swdist:0.25.0
        Image:      containers.fortanix.com:5000/swdist:0.25.0
        Image:      containers.fortanix.com:5000/swdist:0.25.0
        Image:      containers.fortanix.com:5000/swdist:0.25.0
        Image:      containers.fortanix.com:5000/swdist:0.25.0
        Image:      containers.fortanix.com:5000/swdist:0.25.0
        Image:      containers.fortanix.com:5000/swdist:0.25.0
  9. Run the following command to check the replicas of coredns deployment:

    sudo -E kubectl get pods -nkube-system -owide | grep coredns

    The following is the sample output of the above command:

    coredns-786bdcfc9c-bvzzf                  1/1     Running   0              131m   10.244.0.117   ip-172-31-0-235      <none> <none> 
    coredns-786bdcfc9c-fkw7s                  1/1     Running   0              131m   10.244.1.116   ip-172-31-2-139      <none> <none> 
    coredns-786bdcfc9c-r2s8c                  1/1     Running   0              131m   10.244.2.98    ip-172-31-1-96       <none> <none>         

    NOTE

    Ensure that the number of coredns replicas is equal to the number of nodes in the cluster (a quick comparison is sketched at the end of this list).

  10. Run the following command to check the version of flannel and flannel-plugin:

    kubectl get ds kube-flannel-ds -n kube-system -o yaml | grep image:

    The following is the sample output of the above command:

    image: containers.fortanix.com:5000/flannel:v0.22.3
    image: containers.fortanix.com:5000/flannel-cni-plugin:v1.1.2
    image: containers.fortanix.com:5000/flannel:v0.22.3

    NOTE

    Ensure that the flannel version is 0.22.3 and flannel plugin version is 1.1.2.
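The replica count comparison in step 9 can be checked directly, for example (a minimal sketch; the two numbers printed should be equal):

    # Number of coredns pods versus number of nodes in the cluster.
    sudo -E kubectl get pods -nkube-system --no-headers | grep -c coredns
    kubectl get nodes --no-headers | wc -l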

4.3 Check cert-manager Configuration

  1. Run the following command to check all the resources of cert-manager:

    kubectl get all -n cert-manager

    The following is the sample output of the above command:

    NAME READY STATUS RESTARTS AGE
    pod/cert-manager-csi-driver-9lvw2 3/3 Running 4 (14h ago) 15h
    pod/certmanager-cert-manager-5fd9f859bb-7slz2 1/1 Running 0 14h
    pod/certmanager-cert-manager-cainjector-5998546469-pk9kb 1/1 Running 0 14h
    pod/certmanager-cert-manager-webhook-878f95fb5-699lp 1/1 Running 0 14h
    
    NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
    service/certmanager-cert-manager ClusterIP 10.245.213.126  9402/TCP 15h
    service/certmanager-cert-manager-webhook ClusterIP 10.245.20.237  443/TCP 15h
    
    NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
    daemonset.apps/cert-manager-csi-driver 1 1 1 1 1  15h
    
    NAME READY UP-TO-DATE AVAILABLE AGE
    deployment.apps/certmanager-cert-manager 1/1 1 1 15h
    deployment.apps/certmanager-cert-manager-cainjector 1/1 1 1 15h
    deployment.apps/certmanager-cert-manager-webhook 1/1 1 1 15h
    
    NAME DESIRED CURRENT READY AGE
    replicaset.apps/certmanager-cert-manager-5fd9f859bb 1 1 1 14h
    replicaset.apps/certmanager-cert-manager-6c6bdd85d9 0 0 0 15h
    replicaset.apps/certmanager-cert-manager-cainjector-5998546469 1 1 1 14h
    replicaset.apps/certmanager-cert-manager-cainjector-7b7cbc6988 0 0 0 15h
    replicaset.apps/certmanager-cert-manager-webhook-555cbb78cd 0 0 0 15h
    replicaset.apps/certmanager-cert-manager-webhook-878f95fb5 1 1 1 14h
  2. Run the following command to check the DEPLOYMENT_STAGE environment variable in all Cassandra pods. It should be set to CERT_MANAGER_ONLY, as illustrated in the example for cassandra-0 (a loop over all pods is sketched at the end of this list):

    kubectl exec -it cassandra-0 -- env | grep DEPLOYMENT_STAGE
    DEPLOYMENT_STAGE=CERT_MANAGER_ONLY
  3. Run the following command to check the configmap with name cassandra-cert-manager-migration-state:

    kubectl get cm cassandra-cert-manager-migration-state -ojsonpath='{.data}'
    {"DEPLOYMENT_STAGE":"CERT_MANAGER_ONLY"}

5.0 Troubleshooting

  1. If the kubelet client certificate (/var/lib/kubelet/pki/kubelet-client.crt) has expired and there is no /var/lib/kubelet/pki/kubelet-client-current.pem file present, you can recreate the certificate using the following commands:

    TEMP_DIR=/etc/kubernetes/tmp
    mkdir -p $TEMP_DIR
    BACKUP_PEM="/var/lib/kubelet/pki/kubelet-client-current.pem"
    KEY="/var/lib/kubelet/pki/kubelet-client.key"
    CERT="/var/lib/kubelet/pki/kubelet-client.crt"
    # Kubelet kubeconfig that references the client certificate (standard path on kubeadm-based nodes).
    KUBELET_CONF="/etc/kubernetes/kubelet.conf"
    
    echo "Stopping kubelet service"
    systemctl stop kubelet
    
    echo "Creating a new key and cert file for kubelet auth"
    nodename=$(echo "$HOSTNAME" | awk '{print tolower($0)}')
    openssl req -out $TEMP_DIR/tmp.csr -new -newkey rsa:2048 -nodes -keyout $TEMP_DIR/tmp.key -subj "/O=system:nodes/CN=system:node:$nodename"
    cat > $TEMP_DIR/kubelet-client.ext << HERE
    keyUsage = critical,digitalSignature,keyEncipherment
    extendedKeyUsage = clientAuth
    HERE
    echo "Signing the generated csr with kubernetes CA"
    openssl x509 -req -days 365 -in $TEMP_DIR/tmp.csr -CA /etc/kubernetes/pki/ca.crt -CAkey /etc/kubernetes/pki/ca.key -CAcreateserial -out $TEMP_DIR/tmp.crt -sha256 -extfile $TEMP_DIR/kubelet-client.ext
    cp $TEMP_DIR/tmp.crt $CERT
    cp $TEMP_DIR/tmp.key $KEY
    
    chmod 644 $CERT
    chmod 600 $KEY
    
    if grep -q "client-certificate-data" $KUBELET_CONF; then
        echo "Updating file $KUBELET_CONF to add reference to restored certificates"
        sed -i "s|\(client-certificate-data:\s*\).*\$|client-certificate: $CERT|" $KUBELET_CONF
        sed -i "s|\(client-key-data:\s*\).*\$|client-key: $KEY|" $KUBELET_CONF
    fi
    
    echo "Starting kubelet service"
    systemctl start kubelet
  2. An upgrade on a 2-node cluster can fail due to an etcd quorum failure. In such a scenario, if the pods are healthy, you can re-run the deploy job manually using the following command. This will eventually upgrade the cluster to the target Kubernetes version.

    sdkms-cluster deploy --stage DEPLOY --version <version>

    WARNING

    Upgrades on 2-node clusters are not recommended.

  3. When a 3-node cluster is upgraded from build 4.2.2087 to <4.3.xxxx>, it is possible that the deploy job exits and is marked as completed before the cluster upgrade finishes. In such a scenario, if all the pods are healthy, you can deploy the version again, as shown below.
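The redeployment uses the same command as the previous item, for example (a minimal sketch; substitute the target version ID and confirm the job status first):

    # Confirm the current state of the deploy job, then re-run the deploy stage.
    kubectl get job deploy
    sdkms-cluster deploy --stage DEPLOY --version <version>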