Fortanix Data Security Manager Cluster Management Quick Reference

1.0 Introduction

Welcome to the Fortanix Data Security Manager (DSM) Cluster Management Quick Reference Guide. This guide helps you assess Fortanix DSM cluster health through prechecks, commonly used commands, and troubleshooting steps for mitigating known issues within the Fortanix DSM cluster environment.

This quick reference guide is intended to be used by technical stakeholders of Fortanix DSM who will be responsible for setting up and managing Fortanix DSM clusters.

2.0 Fortanix DSM Cluster Management Commands

The table below provides a comprehensive list of commands used to manage a Fortanix DSM cluster:

NOTE

Before executing kubectl commands, ensure the admin.conf file is loaded using the command: export KUBECONFIG=/etc/kubernetes/admin.conf
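The following is a minimal sketch, assuming a bash shell, that loads the kubeconfig and confirms the cluster API is reachable before running the commands below:

# Load the admin kubeconfig for the current shell session
export KUBECONFIG=/etc/kubernetes/admin.conf

# Confirm the variable is set and the control plane responds
echo "$KUBECONFIG"
sudo -E kubectl cluster-info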

TASK

COMMAND

Verify nodes and pods status

Non-root users: sudo -E kubectl get nodes,pods -owide

Root users: kubectl get nodes,pods -owide

Verify pods status in kube-system namespace

sudo -E kubectl get pods -n kube-system

List only the Fortanix DSM and Cassandra pods

  • sudo -E kubectl get pods -l app=cassandra -owide

  • sudo -E kubectl get pods -l app=sdkms -owide

Capture pod logs

sudo -E kubectl logs pod_name -f

Capture pod logs from a different namespace

sudo -E kubectl logs pod_name -n namespace_name

Open a shell inside a pod

sudo -E kubectl exec -it pod_name -- bash

Label the nodes

sudo -E kubectl label node nodename label_key=label_value

Verify nodetool status

sudo -E kubectl exec -it cassandra-0 -- nodetool status

Verify replication strategy

sudo -E kubectl exec cassandra-0 -- cqlsh -e "select * from system_schema.keyspaces where keyspace_name = 'public';"

Check the current system configuration

sdkms-cluster get config --system

Check the initial system configuration

sdkms-cluster get config --user

Create and list Kubeadm tokens

  • kubeadm token create

  • kubeadm token list

Create a cluster

sdkms-cluster create --self=self_ip_address --config config.yaml

Here, self_ip_address is the IP address of the node.

Join a cluster

sdkms-cluster join --peer=ip_address --token= --self=self_ip_address

Initiate cluster join with DC labeling

sdkms-cluster join --peer=ip_address --token= --self=self_ip_address --label datacenter=""

Reset the cluster

sdkms-cluster reset --delete-data --reset-iptables

NOTE

Do not run this command if there is any active node associated with the cluster.

Remove the node from the cluster

sdkms-cluster remove --force --node nodename

NOTE

Select the appropriate node that needs to be removed from the active cluster.

Re-deploy the cluster after modifying the configuration file (config.yaml)

sdkms-cluster deploy --config config.yaml --stage DEPLOY

Perform Fortanix DSM pods rolling restart

Run the /opt/fortanix/sdkms/bin/dsm_backend_rolling_restart.sh script to restart the Fortanix DSM pods.

View all cronjobs

sudo -E kubectl get cronjobs

Disable all cronjobs

sudo -E kubectl get cj --no-headers | awk '{print $1}' | while read name; do sudo -E kubectl patch cronjob $name -p '{"spec": {"suspend": true}}'; done

Enable all cronjobs

sudo -E kubectl get cj --no-headers | awk '{print $1}' | while read name; do sudo -E kubectl patch cronjob $name -p '{"spec": {"suspend": false}}'; done
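After suspending or resuming cronjobs with the commands above, the suspend state can be confirmed. The following is a minimal sketch, assuming kubectl custom-columns output; the column names are arbitrary:

# List each cronjob with its suspend flag (true means suspended)
sudo -E kubectl get cronjobs -o custom-columns=NAME:.metadata.name,SUSPENDED:.spec.suspend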

3.0 Safe Shutdown and Restarting of a DSM Server

Perform the following steps to ensure a safe and controlled shutdown of a Fortanix DSM server in the Kubernetes cluster:

  1. Run the following command to prevent new pods from being scheduled on the node before shutdown:

    kubectl cordon <node-name>

    Here, <node-name> refers to the name of the Fortanix DSM node.

    NOTE

    Ensure that the cluster has met global quorum so that removing the node does not impact services.

    This marks the node as unschedulable, ensuring that no new workloads are assigned.

  2. Run the following command to safely move workloads from the node before shutdown:

    kubectl drain <node-name> --ignore-daemonsets

    Here, <node-name> refers to the name of the Fortanix DSM node.

  3. Run the shutdown command to shut down the node:

    sudo shutdown -h now

    NOTE

    • For hardware DSM machines, ensure Intelligent Platform Management Interface (IPMI) access is available in case of issues bringing the server online.

    • For virtual machines (VMs) hosted on ESXi/vSphere, web console access is required to power on the machine.
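The shutdown sequence above can also be run as a single block. The following is a minimal sketch, assuming <node-name> is replaced with the actual Fortanix DSM node name and that quorum has been verified:

# Mark the node unschedulable so no new pods are placed on it
kubectl cordon <node-name>

# Evict workloads from the node; DaemonSet pods are left in place
kubectl drain <node-name> --ignore-daemonsets

# Confirm the node now reports SchedulingDisabled before powering off
kubectl get node <node-name>

# Power off the node
sudo shutdown -h now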

3.1 Restarting the DSM Server

You can start the Fortanix DSM node from IPMI and then uncordon the node to end the maintenance window.

Perform the following steps to restart the Fortanix DSM server:

  1. Power on the machine using the IPMI console.

  2. Once the node is back online, access it using SSH.

  3. Run the following command to allow scheduling on the node:

    kubectl uncordon <node-name>

    Here, <node-name> refers to the name of the Fortanix DSM node.

  4. Run the following to verify the status of the node and workloads:

    kubectl get nodes,pods
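Steps 3 and 4 can be combined with an explicit readiness wait. The following is a minimal sketch, assuming the installed kubectl supports the wait subcommand; the 300-second timeout is only an example value:

# Allow scheduling on the node again
kubectl uncordon <node-name>

# Block until the node reports the Ready condition
kubectl wait --for=condition=Ready node/<node-name> --timeout=300s

# Verify node and workload status
kubectl get nodes,pods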

4.0 Fortanix DSM Prechecks

Fortanix DSM runs the run_precheck.sh script to analyze the cluster health status. The script is located at /opt/fortanix/sdkms/bin/dsm_prechecks/run_precheck.sh.
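The following is a minimal sketch of invoking the script manually, assuming it is run with sudo and takes no arguments:

# Load the kubeconfig so the prechecks can query the cluster
export KUBECONFIG=/etc/kubernetes/admin.conf

# Run the prechecks and review the per-check status output
sudo -E /opt/fortanix/sdkms/bin/dsm_prechecks/run_precheck.sh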

The following table provides a comprehensive list of Fortanix DSM cluster management prechecks handled by the above script:

CHECK NAME

CHECK TYPE

PURPOSE

SWDIST_CHECK

NODE

To check for discrepancies in swdist endpoint files and directories

KUBEAPI_IP_CHECK

NODE

To check for kube-apiserver IP address inconsistencies

CAS_ADMIN_ACCT_CHECK

NODE

  • To check the total number of Sysadmin accounts

  • To check the best practice of having more than one Sysadmin user

1M_CPU_CHECK

NODE

To check the last 1-minute CPU load average

SWDIST_OVERLAY_SRVC_CHECK

NODE

To verify SWDIST_OVERLAY is up and running

SGX_CHECK

NODE

To verify if the machine supports Software Guard Extension (SGX) technology

PERM_DAEMON_SRVC_CHECK

NODE

To verify if PERM_DAEMON is up and running

NTP_CHECK

NODE

To verify if Network Time Protocol (NTP) is configured

MEM_CHECK

NODE

To check system memory utilization

KUBELET_SRVC_CHECK

NODE

To verify if Kubelet is up and running

KUBELET_CERT

NODE

To verify Kubelet certificate validity

KUBEAPI_SERVER_CERT

NODE

To check KUBEAPI SERVER certificate validity

HEALTH_CHECK_QUORUM

NODE

To confirm the quorum status of nodes, distinguishing between local and global quorum

HEALTH_CHECK_ALL

NODE

To verify responses from all the Fortanix DSM pods in the cluster. If they pass, it returns OK.

DOCKER_REGISTRY_SRVC_CHECK

NODE

To verify DOCKER_REGISTRY is up and running

DISK_CHECK_[/var]

NODE

To verify disk space usage on the /var partition

DISK_CHECK_[/]

NODE

To verify disk space usage on the root (/) partition

DISK_CHECK_[/data]

NODE

To verify disk space usage on the /data partition

DB_FILES_PERM_CHECK

NODE

To verify whether the files under /data/Cassandra/public have executable permissions

CRI_SRVC_CHECK

NODE

To verify if CRI (Container Runtime Interface) is up and running

CPU_MODEL_CHECK

NODE

To list information about the CPU, attestation, and Fortanix DSM appliance series type

CAS_CERT_CHECK

NODE

To verify Cassandra certificate expiry

CAS_ACCT_CHECK

NODE

To check for discrepancies between the Cassandra account_primary and account tables

5M_CPU_CHECK

NODE

To check the last 5-minute CPU load average

15M_CPU_CHECK

NODE

To check the last 15-minute CPU load average

IPMI_INFO_CHECK

NODE

To print IPMI configuration information such as the IPMI IP address, default gateway MAC address, and default gateway IP address

SWDIST_DUP_RULE_CHECK

NODE

To check for duplicate iptables entries

CONTAINER_CHECK

CLUSTER

To check the container readiness

BACKUP_SETUP_CHECK

CLUSTER

To verify if the backup has been configured for the cluster

REPLICA_CHECK

CLUSTER

To verify replica counts in deployment and configmap

PODS_CHECK

CLUSTER

To verify the health status of the pods and their readiness

NODE_CHECK

CLUSTER

To verify if nodes are in a ready state

LB_SETUP_CHECK

CLUSTER

To verify whether the load balancer (LB) setup is external or internal

JOB_CHECK

CLUSTER

To validate that the jobs have executed and completed

IMAGE_VERSION_CHECK

CLUSTER

To verify and report the Fortanix DSM version

ETCD_HEALTH_CHECK

CLUSTER

To verify if ETCD cluster is healthy

CAS_REP_CHECK

CLUSTER

To report the replication strategy, which can be SimpleStrategy or NetworkTopologyStrategy

CAS_NODETOOL_CHECK

CLUSTER

To verify Cassandra nodetool status

CONN_Q_CHECK

CLUSTER

To count all public connections to the SDKMS pods

4.1 Fortanix DSM Prechecks Output Status

The following are the types of Fortanix DSM precheck output status:

  • OK: The check passed and no action is required.

  • WARN: The check requires attention. Check the logs created on the node under /tmp/health_checks/ (see the sketch after this list) and share the log details with the Fortanix Support team.

  • SKIPPED: The check was not executed due to an issue in the cluster. For example, the Cassandra replication strategy check requires a healthy Cassandra pod to fetch details; if the pod is not healthy, the check is SKIPPED. Check the logs to understand the reason for the skip and share the log details with the Fortanix Support team.
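To review findings after a precheck run, the logs under /tmp/health_checks/ can be searched for non-OK results. The following is a minimal sketch, assuming the status keywords appear verbatim in the log files:

# List the precheck logs and search them for WARN or SKIPPED results
ls -lrt /tmp/health_checks/
grep -rn -E "WARN|SKIPPED" /tmp/health_checks/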

5.0 Troubleshooting

The following tables list potential causes of errors and exceptions, along with details on how to fix them for various Fortanix DSM use cases:

5.1 Cluster Create and Cluster Join

ISSUE

DESCRIPTION

RESOLUTION

Hostname of the node is in uppercase.

Sample error log:
[etcd] Waiting for the etcd pod to come up (this might take 2 minutes)

Error from server: (NotFound): pods "etcd-DEV-FRTNX01" not found ;

The Fortanix DSM hostname must be in lowercase letters.

The error indicates that cluster creation failed while waiting for the etcd pod to become ready.

Command to set the hostname:

sudo hostnamectl set-hostname newhostname

The user must then reset the cluster and reinitialize cluster creation.

Domain name resolution

Sample error log:
sudo sdkms-cluster create --self=ip_address --config config.yaml
sudo: unable to resolve host cslab-5: Temporary failure in name resolution
[sdkms-cluster]

WARNING: BIOS version file not found. Skipping test

ERROR: Error parsing /etc/resolv.conf

The /etc/resolv.conf file must not be empty.

Verify /etc/network/interfaces for the DNS nameservers and add the same entries in /etc/resolv.conf.

[coredns] Waiting for coredns pod to be ready

Cluster creation fails when the network configuration does not have DNS nameservers configured.

Verify /etc/network/interfaces for the DNS nameservers and add the same entries in /etc/resolv.conf.

NTP servers on the nodes are not in sync

The node joining process fails if NTP is not configured or if the peer and self node clocks differ, which causes a clock difference in the etcd pod.

Resolve the issue by properly configuring NTP.

Port requirement for intra-cluster communication between the nodes

Communication between various Kubernetes control plane components, such as the API server, scheduler, controller manager, and etcd, also occurs over specific ports.

Refer to Fortanix Data Security Manager Port Requirements – Fortanix support guide and ensure all the required ports are open for communication.

The Intel Attestation Service (IAS) URLs must be reachable from the joining node before initiating the sdkms-join process.

This can manifest as sdkms-join pods failing to attest and therefore halting the upgrade.

DSM communicates with IAS during:

  • Cluster creation

  • Addition of a new node

  • Software upgrade

Refer to Fortanix Data Security Manager Cluster Attestation Guide – Fortanix support guide.

Ensure that nodes can connect to the IAS by using the following commands:

nc -v ps.sgx.trustedservices.intel.com 80
nc -v trustedservices.intel.com 443
nc -v trustedservices.intel.com 80 
nc -v whitelist.trustedservices.intel.com 80
nc -v iasproxy.fortanix.com 443

[kubelet] Waiting for node to become ready
[kubelet] Installing cluster configuration

ERROR: Found unreplaced IP address in manifests/etcd.yaml

The error message indicates that there is an unreplaced IP address in the manifests/etcd.yaml file.

Kindly raise a support ticket with the output of the following commands:

ls -lrt /etc/kubernetes/pki/etcd
ls -lrt /etc/kubernetes
cat /etc/kubernetes/bootstrap-kubelet.conf
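When gathering the outputs requested above for a support ticket, they can be captured into a single file. The following is a minimal sketch, assuming a bash shell; the output file name is only an example:

# Collect the requested diagnostics into one file to attach to the support ticket
{
  sudo ls -lrt /etc/kubernetes/pki/etcd
  sudo ls -lrt /etc/kubernetes
  sudo cat /etc/kubernetes/bootstrap-kubelet.conf
} > /tmp/etcd-manifest-diagnostics.txt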

5.2 POD Status

STATUS

DESCRIPTION

RESOLUTION

ERROR

Pods in an error state could be due to various reasons. Performing a detailed analysis of pod logs helps to identify the root cause.

Please raise a support ticket if you encounter any pods in the ERROR state, and kindly include the output of the following commands:

sudo -E kubectl describe pod pod_name
sudo -E kubectl get pods -owide
sudo -E kubectl get pods -n kube-system
sudo -E kubectl logs pod_name

PENDING

When pods remain pending without being placed on any nodes, several factors could be responsible for preventing pod scheduling. A detailed analysis of logs is required to identify the underlying issues.

  • Verify the status of the node to ensure the node is in a ready state using the command:

    sudo -E kubectl get nodes -o wide
  • Verify the status of kube-system pods using the command:

    sudo -E kubectl get pods -n kube-system
  • Describe the pod to know the errors using the command:

    sudo -E kubectl describe pod pod_name

IMAGEPULLBACKOFF

When a pod cannot pull the container images it requires, it reports an ImagePullBackOff error.

  • Run the script located at /opt/fortanix/sdkms/bin/restart-docker-registry.sh

  • Delete the pods that are in the ImagePullBackOff state so that they are recreated and come back healthy.

If the issue persists, kindly reach out to Fortanix Support.

CRASHLOOPBACKOFF

A detailed analysis of pod logs is required to identify the underlying issues.

Kindly reach out to Fortanix Support with the required logs as mentioned above.

CREATECONFIGERROR

During pod creation, the pod sometimes fails to fetch a required configuration resource.

Restart the pods using the following command:

sudo -E kubectl delete pod pod_name
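For any of the pod states above, the diagnostic output can be gathered for all unhealthy pods in one pass. The following is a minimal sketch, assuming a bash shell; it only reads cluster state, and the output file name is only an example:

# Capture describe output and recent logs for every pod that is not Running or Completed
sudo -E kubectl get pods --no-headers | awk '$3 != "Running" && $3 != "Completed" {print $1}' | while read pod; do
  echo "===== $pod ====="
  sudo -E kubectl describe pod "$pod"
  sudo -E kubectl logs "$pod" --all-containers=true --tail=200
done > /tmp/unhealthy-pod-diagnostics.txt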

5.3 Node Not Ready

ISSUE

DESCRIPTION

RESOLUTION

kubectl commands will not be accessible from the nodes that are not in the ready state.

Nodes may enter the not-ready state due to various factors. A detailed analysis is required to determine the root cause.

Kindly create a support ticket and share the output of the following commands:

systemctl status kubelet
journalctl -fu kubelet

Running kubectl commands from other nodes within the cluster reports nodes that are not in a ready state as 'NotReady'.

-

Kindly create a support ticket and share the output of the following commands:

sudo -E kubectl get node,pods -owide
sudo -E kubectl get pods -n kube-system
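The node-level and cluster-level outputs requested above can be collected together. The following is a minimal sketch, assuming a bash shell; run the first two commands on the affected node and the last two from any healthy node:

# On the affected node: kubelet service state and recent kubelet logs (bounded instead of following)
systemctl status kubelet --no-pager
sudo journalctl -u kubelet --no-pager -n 200

# From a healthy node: node and pod status for the whole cluster
sudo -E kubectl get nodes,pods -owide
sudo -E kubectl get pods -n kube-system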