Fortanix Data Security Manager Cluster Management Quick Reference

1.0 Introduction

Welcome to the Fortanix Data Security Manager (DSM) Cluster Management Quick Reference Guide. This guide helps you assess Fortanix DSM cluster health through prechecks, commonly used commands, and troubleshooting steps for mitigating known issues within the Fortanix DSM cluster environment.

This quick reference guide is intended to be used by technical stakeholders of Fortanix DSM who will be responsible for setting up and managing Fortanix DSM clusters.

2.0 Fortanix DSM Cluster Management Commands

The table below provides a comprehensive list of commands used to manage a Fortanix DSM cluster:

NOTE

Before executing kubectl commands, ensure the admin.conf file is loaded using the command: export KUBECONFIG=/etc/kubernetes/admin.conf
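The following is a minimal sketch, assuming a bash shell, that loads the kubeconfig and confirms the cluster API is reachable before running the commands below:

# Load the admin kubeconfig for the current shell session
export KUBECONFIG=/etc/kubernetes/admin.conf

# Confirm the variable is set and the control plane responds
echo "$KUBECONFIG"
sudo -E kubectl cluster-info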

TASK

COMMAND

Verify nodes and pods status

Non-root users: sudo -E kubectl get nodes,pods -owide

Root users: kubectl get nodes,pods -owide

Verify pods status in kube-system namespace

sudo -E kubectl get pods -n kube-system

List only the Fortanix DSM and Cassandra pods

  • sudo -E kubectl get pods -l app=cassandra -owide

  • sudo -E kubectl get pods -l app=sdkms -owide

Capture pod logs

sudo -E kubectl logs pod_name -f

Capture pod logs from a different namespace

sudo -E kubectl logs pod_name -n namespace_name

Open a shell inside a pod

sudo -E kubectl exec -it pod_name -- bash

Label the nodes

sudo -E kubectl label node nodename label_key=label_value

Verify nodetool status

sudo -E kubectl exec -it cassandra-0 -- nodetool status

Verify replication strategy

sudo -E kubectl exec cassandra-0 -- cqlsh -e "select * from system_schema.keyspaces where keyspace_name = 'public';"

Check the current system configuration

sdkms-cluster get config --system

Check the initial system configuration

sdkms-cluster get config --user

Create and list Kubeadm tokens

  • kubeadm token create

  • kubeadm token list

Create a cluster

sdkms-cluster create --self=self_ip_address --config config.yaml

Here, self_ip_address is the IP address of the node.

Join a cluster

sdkms-cluster join --peer=ip_address --token= --self=self_ip_address

Initiate cluster join with DC labeling

sdkms-cluster join --peer=ip_address --token= --self=self_ip_address --label datacenter=""

Reset the cluster

sdkms-cluster reset --delete-data --reset-iptables

NOTE

Do not run this command if there is any active node associated with the cluster.

Remove the node from the cluster

sdkms-cluster remove --force --node nodename

NOTE

Select the appropriate node that needs to be removed from the active cluster.

Re-deploy the cluster after modifying the configuration file (config.yaml)

sdkms-cluster deploy --config config.yaml --stage DEPLOY

Perform Fortanix DSM pods rolling restart

Run the /opt/fortanix/sdkms/bin/dsm_backend_rolling_restart.sh script to restart the Fortanix DSM pods.

View all cronjobs

sudo -E kubectl get cronjobs

Disable all cronjobs

sudo -E kubectl get cj --no-headers | awk '{print $1}' | while read name; do sudo -E kubectl patch cronjob $name -p '{"spec": {"suspend": true}}'; done

Enable all cronjobs

sudo -E kubectl get cj --no-headers | awk '{print $1}' | while read name; do sudo -E kubectl patch cronjob $name -p '{"spec": {"suspend": false}}'; done
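After suspending or resuming cronjobs with the commands above, the suspend state can be confirmed. The following is a minimal sketch, assuming kubectl custom-columns output; the column names are arbitrary:

# List each cronjob with its suspend flag (true means suspended)
sudo -E kubectl get cronjobs -o custom-columns=NAME:.metadata.name,SUSPENDED:.spec.suspend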

3.0 Safe Shutdown and Restarting of a DSM Server

Perform the following steps to ensure a safe and controlled shutdown of a Fortanix DSM server in the Kubernetes cluster:

  1. Run the following command to prevent new pods from being scheduled on the node before shutdown:

    kubectl cordon <node-name>

    Here, <node-name> refers to the name of the Fortanix DSM node.

    NOTE

    Ensure that the cluster has met global quorum so that removing the node does not impact services.

    This marks the node as unschedulable, ensuring that no new workloads are assigned.

  2. Run the following command to safely move workloads from the node before shutdown:

    kubectl drain <node-name> --ignore-daemonsets

    Here, <node-name> refers to the name of the Fortanix DSM node.

  3. Run the shutdown command to shut down the node:

    sudo shutdown -h now

    NOTE

    • For hardware DSM machines, ensure Intelligent Platform Management Interface (IPMI) access is available in case of issues bringing the server online.

    • For virtual machines (VMs) hosted on ESXi/vSphere, web console access is required to power on the machine.
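The shutdown sequence above can also be run as a single block. The following is a minimal sketch, assuming <node-name> is replaced with the actual Fortanix DSM node name and that quorum has been verified:

# Mark the node unschedulable so no new pods are placed on it
kubectl cordon <node-name>

# Evict workloads from the node; DaemonSet pods are left in place
kubectl drain <node-name> --ignore-daemonsets

# Confirm the node now reports SchedulingDisabled before powering off
kubectl get node <node-name>

# Power off the node
sudo shutdown -h now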

3.1 Restarting the DSM Server

You can start the Fortanix DSM node from IPMI and then uncordon the node to end the maintenance window.

Perform the following steps to restart the Fortanix DSM server:

  1. Power on the machine using the IPMI console.

  2. Once the node is back online, access it using SSH.

  3. Run the following command to allow scheduling on the node:

    kubectl uncordon <node-name>

    Here, <node-name> refers to the name of the Fortanix DSM node.

  4. Run the following to verify the status of the node and workloads:

    kubectl get nodes,pods
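Steps 3 and 4 can be combined with an explicit readiness wait. The following is a minimal sketch, assuming the installed kubectl supports the wait subcommand; the 300-second timeout is only an example value:

# Allow scheduling on the node again
kubectl uncordon <node-name>

# Block until the node reports the Ready condition
kubectl wait --for=condition=Ready node/<node-name> --timeout=300s

# Verify node and workload status
kubectl get nodes,pods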

4.0 Fortanix DSM Prechecks

Fortanix DSM runs the run_precheck.sh script to analyze the cluster health status. The script is located at /opt/fortanix/sdkms/bin/dsm_prechecks/run_precheck.sh.
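The following is a minimal sketch of invoking the script manually, assuming it is run with sudo and takes no arguments:

# Load the kubeconfig so the prechecks can query the cluster
export KUBECONFIG=/etc/kubernetes/admin.conf

# Run the prechecks and review the per-check status output
sudo -E /opt/fortanix/sdkms/bin/dsm_prechecks/run_precheck.sh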

The following table provides a comprehensive list of Fortanix DSM cluster management prechecks handled by the above script:

CHECK NAME

CHECK TYPE

PURPOSE

SWDIST_CHECK

NODE

To check for discrepancies in swdist endpoint files and directories

KUBEAPI_IP_CHECK

NODE

To check for kube-apiserver IP address inconsistencies

CAS_ADMIN_ACCT_CHECK

NODE

  • To check the total number of Sysadmin accounts

  • To check the best practice of having more than one Sysadmin user

1M_CPU_CHECK

NODE

To check the last 1-minute CPU load average

SWDIST_OVERLAY_SRVC_CHECK

NODE

To verify SWDIST_OVERLAY is up and running

SGX_CHECK

NODE

To verify if the machine supports Software Guard Extension (SGX) technology

PERM_DAEMON_SRVC_CHECK

NODE

To verify if PERM_DAEMON is up and running

NTP_CHECK

NODE

To verify if Network Time Protocol (NTP) is configured

MEM_CHECK

NODE

To check system memory utilization

KUBELET_SRVC_CHECK

NODE

To verify if Kubelet is up and running

KUBELET_CERT

NODE

To verify Kubelet certificate validity

KUBEAPI_SERVER_CERT

NODE

To check KUBEAPI SERVER certificate validity

HEALTH_CHECK_QUORUM

NODE

To confirm the quorum status of nodes, distinguishing between local and global quorum

HEALTH_CHECK_ALL

NODE

To verify responses from all the Fortanix DSM pods in the cluster. If they pass, it returns OK.

DOCKER_REGISTRY_SRVC_CHECK

NODE

To verify DOCKER_REGISTRY is up and running

DISK_CHECK_[/var]

NODE

To verify disk space usage on the /var partition

DISK_CHECK_[/]

NODE

To verify disk space usage on the root (/) partition

DISK_CHECK_[/data]

NODE

To verify disk space usage on the /data partition

DB_FILES_PERM_CHECK

NODE

To verify whether the files under /data/Cassandra/public have executable permissions

CRI_SRVC_CHECK

NODE

To verify if CRI (Container Runtime Interface) is up and running

CPU_MODEL_CHECK

NODE

To list information about the CPU, attestation, and Fortanix DSM appliance series type

CAS_CERT_CHECK

NODE

To verify Cassandra certificate expiry

CAS_ACCT_CHECK

NODE

To check for discrepancies between the Cassandra account_primary and account tables

5M_CPU_CHECK

NODE

To check the last 5-minute CPU load average

15M_CPU_CHECK

NODE

To check the last 15-minute CPU load average

IPMI_INFO_CHECK

NODE

To print IPMI configuration information such as the IPMI IP address, default gateway MAC address, and default gateway IP address

SWDIST_DUP_RULE_CHECK

NODE

To check for duplicate iptables entries

CONTAINER_CHECK

CLUSTER

To check the container readiness

BACKUP_SETUP_CHECK

CLUSTER

To verify if the backup has been configured for the cluster

REPLICA_CHECK

CLUSTER

To verify replica counts in deployment and configmap

PODS_CHECK

CLUSTER

To verify the health status of the pods and their readiness

NODE_CHECK

CLUSTER

To verify if nodes are in a ready state

LB_SETUP_CHECK

CLUSTER

To verify whether the load balancer (LB) setup is external or internal

JOB_CHECK

CLUSTER

To validate that the jobs have executed and completed

IMAGE_VERSION_CHECK

CLUSTER

To verify and report the Fortanix DSM version

ETCD_HEALTH_CHECK

CLUSTER

To verify if ETCD cluster is healthy

CAS_REP_CHECK

CLUSTER

To report the replication strategy, which can be SimpleStrategy or NetworkTopologyStrategy

CAS_NODETOOL_CHECK

CLUSTER

To verify Cassandra nodetool status

CONN_Q_CHECK

CLUSTER

To count all public connections to the SDKMS pods

4.1 Fortanix DSM Prechecks Output Status

The following are the types of Fortanix DSM precheck output status:

  • OK: The check passed and no action is required.

  • WARN: The check requires attention. Check the logs created on the node under /tmp/health_checks/ (see the sketch after this list) and share the log details with the Fortanix Support team.

  • SKIPPED: The check was not executed due to an issue in the cluster. For example, the Cassandra replication strategy check requires a healthy Cassandra pod to fetch details; if the pod is not healthy, the check is SKIPPED. Check the logs to understand the reason for the skip and share the log details with the Fortanix Support team.
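To review findings after a precheck run, the logs under /tmp/health_checks/ can be searched for non-OK results. The following is a minimal sketch, assuming the status keywords appear verbatim in the log files:

# List the precheck logs and search them for WARN or SKIPPED results
ls -lrt /tmp/health_checks/
grep -rn -E "WARN|SKIPPED" /tmp/health_checks/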

5.0 Troubleshooting

The following tables list potential causes of errors and exceptions, along with details on how to fix them for various Fortanix DSM use cases:

5.1 Cluster Create and Cluster Join

ISSUE

DESCRIPTION

RESOLUTION

Hostname of the node is in uppercase.

Sample error log:
[etcd] Waiting for the etcd pod to come up (this might take 2 minutes)

Error from server: (NotFound): pods "etcd-DEV-FRTNX01" not found ;

The Fortanix DSM hostname must be in lowercase letters.

The error indicates that cluster creation failed while waiting for the etcd pod to become ready.

Command to set the hostname:

sudo hostnamectl set-hostname newhostname

The user must then reset the cluster and reinitialize cluster creation.

Domain name resolution

Sample error log:
sudo sdkms-cluster create --self=ip_address --config config.yaml
sudo: unable to resolve host cslab-5: Temporary failure in name resolution
[sdkms-cluster]

WARNING: BIOS version file not found. Skipping test

ERROR: Error parsing /etc/resolv.conf

The /etc/resolv.conf file must not be empty.

Verify /etc/network/interfaces for the DNS nameservers and add the same entries in /etc/resolv.conf.

[coredns] Waiting for coredns pod to be ready

Cluster creation fails when the network configuration does not have DNS nameservers configured.

Verify /etc/network/interfaces for the DNS nameservers and add the same entries in /etc/resolv.conf.

NTP servers on the nodes are not in sync

The node joining process fails if NTP is not configured or if the peer and self node clocks differ, which causes a clock difference in the etcd pod.

Resolve the issue by properly configuring NTP.

Port requirement for intra-cluster communication between the nodes

Communication between various Kubernetes control plane components, such as the API server, scheduler, controller manager, and etcd, also occurs over specific ports.

Refer to Fortanix Data Security Manager Port Requirements – Fortanix support guide and ensure all the required ports are open for communication.

The Intel Attestation Service (IAS) URLs must be reachable from the joining node before initiating the sdkms-join process.

This can manifest as sdkms-join pods failing to attest and therefore halting the upgrade.

DSM communicates with IAS during:

  • Cluster creation

  • Addition of a new node

  • Software upgrade

Refer to Fortanix Data Security Manager Cluster Attestation Guide – Fortanix support guide.

Ensure that nodes can connect to the IAS by using the following commands:

nc -v ps.sgx.trustedservices.intel.com 80
nc -v trustedservices.intel.com 443
nc -v trustedservices.intel.com 80 
nc -v whitelist.trustedservices.intel.com 80
nc -v iasproxy.fortanix.com 443

[kubelet] Waiting for node to become ready
[kubelet] Installing cluster configuration

ERROR: Found unreplaced IP address in manifests/etcd.yaml

The error message indicates that there is an unreplaced IP address in the manifests/etcd.yaml file.

Kindly raise a support ticket with the output of the following commands:

ls -lrt /etc/kubernetes/pki/etcd
ls -lrt /etc/kubernetes
cat /etc/kubernetes/bootstrap-kubelet.conf
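When gathering the outputs requested above for a support ticket, they can be captured into a single file. The following is a minimal sketch, assuming a bash shell; the output file name is only an example:

# Collect the requested diagnostics into one file to attach to the support ticket
{
  sudo ls -lrt /etc/kubernetes/pki/etcd
  sudo ls -lrt /etc/kubernetes
  sudo cat /etc/kubernetes/bootstrap-kubelet.conf
} > /tmp/etcd-manifest-diagnostics.txt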

5.2 POD Status

STATUS

DESCRIPTION

RESOLUTION

ERROR

Pods in an error state could be due to various reasons. Performing a detailed analysis of pod logs helps to identify the root cause.

Please raise a support ticket if you encounter any pods in the ERROR state, and kindly include the output of the following commands:

sudo -E kubectl describe pod pod_name
sudo -E kubectl get pods -owide
sudo -E kubectl get pods -n kube-system
sudo -E kubectl logs pod_name

PENDING

When pods remain pending without being placed on any nodes, several factors could be responsible for preventing pod scheduling. A detailed analysis of logs is required to identify the underlying issues.

  • Verify the status of the node to ensure the node is in a ready state using the command:

    sudo -E kubectl get nodes -o wide
  • Verify the status of kube-system pods using the command:

    sudo -E kubectl get pods -n kube-system
  • Describe the pod to know the errors using the command:

    sudo -E kubectl describe pod pod_name

IMAGEPULLBACKOFF

When a pod cannot pull the container images it requires, it reports an ImagePullBackOff error.

  • Run the script located at /opt/fortanix/sdkms/bin/restart-docker-registry.sh

  • Delete the pods that are in the ImagePullBackOff state so that they are recreated and come back healthy.

If the issue persists, kindly reach out to Fortanix Support.

CRASHLOOPBACKOFF

A detailed analysis of pod logs is required to identify the underlying issues.

Kindly reach out to Fortanix Support with the required logs as mentioned above.

CREATECONFIGERROR

During pod creation, the pod sometimes fails to fetch a required configuration resource.

Restart the pods using the following command:

sudo -E kubectl delete pod pod_name
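For any of the pod states above, the diagnostic output can be gathered for all unhealthy pods in one pass. The following is a minimal sketch, assuming a bash shell; it only reads cluster state, and the output file name is only an example:

# Capture describe output and recent logs for every pod that is not Running or Completed
sudo -E kubectl get pods --no-headers | awk '$3 != "Running" && $3 != "Completed" {print $1}' | while read pod; do
  echo "===== $pod ====="
  sudo -E kubectl describe pod "$pod"
  sudo -E kubectl logs "$pod" --all-containers=true --tail=200
done > /tmp/unhealthy-pod-diagnostics.txt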

5.3 Node Not Ready

ISSUE

DESCRIPTION

RESOLUTION

kubectl commands will not be accessible from the nodes that are not in the ready state.

Nodes may enter the not-ready state due to various factors. A detailed analysis is required to determine the root cause.

Kindly create a support ticket and share the output of the following commands:

systemctl status kubelet
journalctl -fu kubelet

Running kubectl commands from other nodes within the cluster reports nodes that are not in a ready state as 'NotReady'.

-

Kindly create a support ticket and share the output of the following commands:

sudo -E kubectl get node,pods -owide
sudo -E kubectl get pods -n kube-system
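The node-level and cluster-level outputs requested above can be collected together. The following is a minimal sketch, assuming a bash shell; run the first two commands on the affected node and the last two from any healthy node:

# On the affected node: kubelet service state and recent kubelet logs (bounded instead of following)
systemctl status kubelet --no-pager
sudo journalctl -u kubelet --no-pager -n 200

# From a healthy node: node and pod status for the whole cluster
sudo -E kubectl get nodes,pods -owide
sudo -E kubectl get pods -n kube-system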