Fortanix Data Security Manager Software Pre-Upgrade Checks - Manual

1.0 Introduction

This article describes the prechecks that the System Administrator must perform before a Fortanix DSM software upgrade.

2.0 Prechecks

The run_prechecks.sh script provides three options during execution, and the user can select the most suitable one by entering the corresponding input number.

  1. Remote – It executes the /opt/fortanix/sdkms/bin/check-dsm-health.sh script remotely on all the cluster nodes and fetches the related files, formats them, and displays the status on the screen.

  2. Local – It executes the /opt/fortanix/sdkms/bin/check-dsm-health.sh script locally on the same machine.

  3. IPMI – It verifies the Intelligent Platform Management Interface (IPMI) connectivity.

2.1 Option 1 - Remote

Perform the following steps to execute the Fortanix DSM precheck script remotely on all the nodes:

  1. Run the following command to go to the dsm_prechecks directory:

    sudo su
    cd /opt/fortanix/sdkms/bin/dsm_prechecks
  2. Run the following command to create the config.txt file:

    kubectl get no --no-headers -owide | awk '{print $6}' > config.txt
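
    The awk filter captures the INTERNAL-IP column of the kubectl get nodes -owide output, so config.txt contains one node IP address per line, for example (illustrative addresses):

    10.10.10.10
    11.11.11.11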
  3. In the parameters.txt file, update the values for the following parameters (a filled-in example is shown after the parameter descriptions below):

    REMOTE_USER=""
    AUTH_TYPE="PASSWORD" or "PRIVATE_KEY"
    PRIVATE_KEY_FILE=""
    NO_PASSWORD="true" or "false"

    Where,

    • REMOTE_USER refers to the name of the user. For example, administrator.

    • AUTH_TYPE can be either PASSWORD or PRIVATE_KEY.

    • PRIVATE_KEY_FILE refers to the path of the private key. This is applicable only if AUTH_TYPE is set to PRIVATE_KEY.

    • NO_PASSWORD can be either true or false, depending on whether the remote user's sudo profile allows passwordless sudo.
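
    For example, a parameters.txt configured for password-based authentication as the user administrator (all values are illustrative):

    REMOTE_USER="administrator"
    AUTH_TYPE="PASSWORD"
    PRIVATE_KEY_FILE=""
    NO_PASSWORD="false"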

  4. Run the following command to execute the Fortanix DSM prechecks:

    ./run_prechecks.sh
  5. Enter 1 to select "remote".
    The script prompts for a password only if AUTH_TYPE="PASSWORD" was selected in Step 3.

    NOTE

    Create a node_password.txt file if you want to provide the passwords through a file, in the format IP|PASSWORD.
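
    For example, a node_password.txt with one entry per node (illustrative values):

    10.10.10.10|examplePassword1
    11.11.11.11|examplePassword2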

  6. The user can check the detailed Fortanix DSM precheck logs in the following directory created by the script:

    /tmp/health_checks/remote_logs

2.2 Option 2 - Local

Perform the following steps to execute the Fortanix DSM precheck script locally:

  1. Run the following command to execute the Fortanix DSM prechecks:

    sudo ./run_prechecks.sh
  2. Enter 2 to select "local".

  3. The user can check the detailed Fortanix DSM precheck logs in the following directory created by the script:

    /tmp/health_checks/logs

2.3 Option 3 - IPMI

Perform the following steps to verify the Intelligent Platform Management Interface (IPMI) connectivity:

NOTE

Ensure that the IPMI interfaces are on the same network as the node running the script.

  1. Run the following command to go to the dsm_prechecks directory:

    cd /opt/fortanix/sdkms/bin/dsm_prechecks
  2. Run the following command to create the ipmi.txt file:

    vi ipmi.txt
  3. Add the IPMI IP address and username for each node in the ipmi.txt file, one entry per line, in the following format:

    10.10.10.10|admin
    11.11.11.11|admin
  4. Run the following command to execute the Fortanix DSM prechecks:

    sudo ./run_prechecks.sh
  5. Enter 3 to select "ipmi connectivity check".
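
To spot-check a single IPMI interface manually outside the script, a minimal sketch using ipmitool (assuming ipmitool is installed on the node; the address and username are the illustrative values from ipmi.txt, and <password> is a placeholder):

    ipmitool -I lanplus -H 10.10.10.10 -U admin -P '<password>' chassis status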

3.0 Handling Upgrade Issues Related to Attestation Value (Non-SGX Only)

If you are upgrading to a version lower than Fortanix DSM 4.31 and using a non-SGX build, check the attestation value in the system configuration using the following command:

sdkms-cluster get config --system

If the attestation value is ias, follow the workaround outlined below. For more information, contact the Fortanix Support Team.
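
To isolate the attestation value from the command output, you can filter it; a simple example (assuming the output contains a line with the attestation key):

sdkms-cluster get config --system | grep -i attestation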

Perform the following steps:

  1. Add the following parameter in the config.yaml file to update the attestation value:

    attestation: null
  2. Run the following deploy command to apply the updated configurations:

    sdkms-cluster deploy --config config.yaml --stage DEPLOY

    Monitor the deployment job and ensure all pods are running.

  3. Run the following command to view the status of the pods:

    kubectl get pods -A -owide
  4. Run the following command to verify the attestation value in the config-values file:

    kubectl get cm config-values -o yaml

    The output must contain the following (the value ~ denotes YAML null, which matches the attestation: null setting from Step 1):

    apiVersion: v1
    data:
      config-values: |-
        ---
        global:
          attestation: ~

4.0 Checks Handled by the Script

Each entry below lists the check name, its type (NODE or CLUSTER), and its purpose.

HEALTH_CHECK_QUORUM (NODE)
  Checks the health of the sdkms and Cassandra pods. The check-dsm-health.sh script makes the API call sys/v1/health?consistency=quorum.

HEALTH_CHECK_ALL (NODE)
  Checks the health of the sdkms and Cassandra pods. The check-dsm-health.sh script makes the API call sys/v1/health?consistency=all. Sending "all" means the client expects a response from all replicas (for a 3-node fully replicated cluster, a response from 3 Cassandra pods is considered successful).

DISK_CHECK (NODE)
  Checks the "/var", "/", and "/data" directory usage. The threshold is 70%; if disk usage is higher than that, the status shows "WARN".

NTP_CHECK (NODE)
  The check-dsm-health.sh script executes the ntpq -p command and checks whether the node is syncing with at least one NTP server.

API_CERT (NODE)
  Checks the availability and expiry date of the /etc/kubernetes/pki/apiserver.crt certificate. If the certificate expires in less than 30 days, it is flagged as "WARN".

KUBELET_CERT (NODE)
  Checks the /etc/kubernetes/kubelet.conf file and verifies whether the certificate is embedded; if yes, it checks the expiry of the certificate-data. If kubelet.conf points to another file in /var/lib/kubelet/pki, it checks the expiry of that certificate. If kubelet.conf points to /var/lib/kubelet/pki/kubelet-cert-current.pem, no action is required.

DOCKER_REGISTRY_SRVC_CHECK (NODE)
  Checks whether the docker-registry service is up and running.

KUBELET_SRVC_CHECK (NODE)
  Checks whether the kubelet service is up and running.

CRI_SRVC_CHECK (NODE)
  Checks whether the docker or containerd service is up and running.

1M_CPU_CHECK (NODE)
  Checks whether the 1-minute CPU load average is less than the defined threshold value (90%).

5M_CPU_CHECK (NODE)
  Checks whether the 5-minute CPU load average is less than the defined threshold value (90%).

15M_CPU_CHECK (NODE)
  Checks whether the 15-minute CPU load average is less than the defined threshold value (90%).

MEM_CHECK (NODE)
  Checks whether the memory utilization is less than the defined threshold value (75%).

DB_FILES_PERM_CHECK (NODE)
  Checks whether the file permissions of /data/cassandra/public are valid (user 999 and group docker); if not, it is reported as "WARN".

CONN_Q_CHECK (NODE)
  Checks the number of public connections for each node by querying the Metrics API; if the value is greater than 2000, it is flagged as "WARN", else "OK". Also checks the backend queue size by querying the Metrics API; if the value is greater than 10000, it is flagged as "WARN", else "OK".

CAS_ACCT_CHECK (NODE)
  Checks that the row counts of the account and account_primary tables match; if not, it is flagged as "WARN", else "OK".

CAS_ADMIN_ACCT_CHECK (NODE)
  Checks the count of sysadmin accounts. If the count is <= 1, it is flagged as "WARN", else "OK".

SGX_CHECK (NODE)
  Checks whether the DSM pod is running the "enclave" process and whether the machine has the /dev/*sgx driver; if both, it reports "SGX SDKMS POD". If the driver is present but the sdkms pod is not running the "enclave" process, it reports "NON SGX SDKMS POD ON SGX MACHINE" and flags it as "WARN", because an SGX deployment is expected.

NODE_CHECK (CLUSTER)
  Checks whether all the nodes are in Ready status (kubectl get nodes).

PODS_CHECK (CLUSTER)
  Checks whether all pods are in Running or Completed status; if not, they are flagged as "WARN".

JOB_CHECK (CLUSTER)
  Checks whether all jobs are in "Complete" status; if not, all running and failed jobs are flagged as "WARN". Running jobs are flagged because all jobs must be in completed status before you start the upgrade.

CONTAINER_CHECK (CLUSTER)
  Checks whether all containers within pods are in a "ready" state; if not, they are flagged as "WARN".

IPMI_INFO_CHK (NODE)
  Verifies that the IPMI IP address, default gateway MAC address, and default gateway IP address are valid; if not, they are flagged as "WARN".

CAS_REP_CHECK (CLUSTER)
  Checks whether the configured DC labeling matches the replication strategy in Cassandra (this validation covers Network Topology and fully replicated clusters); any mismatch is flagged as "WARN". If DC labeling is present but the strategy in Cassandra is Simple Strategy, it is flagged as "WARN" (Simple Strategy is not recommended for production clusters).

CAS_CERT_CHECK (NODE)
  Checks whether the Cassandra pod certificate is valid; if not, it is flagged as "WARN".

REPLICA_CHECK (CLUSTER)
  Checks the "replicas" field value in the configuration map using the command sdkms-cluster get config --system and compares it with the "replicas" of the deployment and statefulset. If they do not match, it is flagged as "WARN", else "OK".

BIOS_VER_CHECK (NODE)
  Verifies that the node has the latest BIOS installed; if not, it is flagged as "WARN".

IMAGE_VERSION_CHECK (CLUSTER)
  Validates the version value in the /etc/fortanix/sdkms_version/sdkms_version file against the image version of the SDKMS/SDKMS-PROXY/SDKMS-UI deployments. Any mismatch is flagged as "WARN". (This check is also useful after an upgrade or cluster creation.)

BACKUP_SETUP_CHECK (CLUSTER)
  Checks whether the sdkms-backup CRON job exists; if it is not present, it is flagged as "WARN".

KUBEAPI_IP_CHECK (CLUSTER)
  Checks whether the kube-apiserver manifest file has the correct IP address; if not, it is flagged as "WARN".

SWDIST_CHECK (CLUSTER)
  Checks that the count of directories in /var/opt/fortanix/swdist/data/, the Swdist endpoints, and the versions file (/var/opt/fortanix/swdist/versions) match.

CAS_NODETOOL_CHECK (CLUSTER)
  Executes the nodetool status command and checks whether any nodes do not show the "UN" (Up and Normal) state. Also checks whether the count of "UN" entries matches the Cassandra pod count.

ETCD_HEALTH_CHECK (CLUSTER)
  Verifies whether the etcd cluster is healthy; if not, it is flagged as "WARN".

SWDIST_OVERLAY_SRVC_CHECK (NODE)
  Verifies the status of the /var/opt/fortanix/swdist_overlay service.

PERM_DAEMON_SRVC_CHECK (NODE)
  Verifies the status of the perm_daemon.service service.
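
Many of these checks can be reproduced manually when investigating a WARN status. For example, a minimal sketch of the HEALTH_CHECK_QUORUM API call using curl (assuming the Fortanix DSM API is reachable at <cluster-url>, a placeholder; the command prints the HTTP status code):

    curl -sk -o /dev/null -w "%{http_code}\n" "https://<cluster-url>/sys/v1/health?consistency=quorum"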

5.0 Fortanix DSM Prechecks Output Status

This section describes the following three types of precheck output status:

  • OK: The check passed; no action is required.

  • WARN: The check requires attention. Check the logs created on the node under /tmp/health_checks/ and share the log details with the Fortanix Support team.

  • SKIPPED: The check was not executed due to an issue in the cluster.
    For example, the Cassandra Replication Strategy check requires a healthy Cassandra pod to fetch details from; if the pod is not healthy, that check is SKIPPED.
    Check the logs to understand why a check was skipped, and share the log details with the Fortanix Support team.