1.0 Introduction
Welcome to the Fortanix Data Security Manager (DSM) Cluster Management Quick Reference Guide. This guide helps you assess Fortanix DSM cluster health through prechecks, commonly used commands, and troubleshooting steps for known issues in the Fortanix DSM cluster environment.
It is intended for technical stakeholders who are responsible for setting up and managing Fortanix DSM clusters.
2.0 Fortanix DSM Cluster Management Commands
The table below lists the commands used to manage a Fortanix DSM cluster:
NOTE
Before executing kubectl commands, ensure that the admin.conf file is loaded using the command: export KUBECONFIG=/etc/kubernetes/admin.conf
TASK | COMMAND |
---|---|
Verify nodes and pods status | Non-root users: Root users: |
Verify pods status in | |
List Fortanix DSM pods and Cassandra alone | |
Capture pod logs | |
Capture pod logs of a different namespace | |
Get into the pod | |
Label the nodes | |
Verify | |
Verify replication strategy | |
Check the current system configuration | |
Check the initial system configuration | |
Create and list | |
Create a cluster | Where, |
Join a cluster | |
Initiate cluster join with DC labeling | |
Reset the cluster | |
Remove the node from the cluster | |
Re-deploy the cluster after modifying the configuration file | |
Perform Fortanix DSM pods rolling restart | Navigate to |
View all cronjobs | |
Disable all cronjobs | |
Enable all cronjobs | |
3.0 Safe Shutdown and Restarting of a DSM Server
Perform the following steps to ensure a safe and controlled shutdown of a Fortanix DSM server in the Kubernetes cluster:
Run the following command to prevent new pods from being scheduled on the node before shutdown:
kubectl cordon <node-name>
Here, <node-name> refers to the name of the Fortanix DSM node. This marks the node as unschedulable, ensuring that no new workloads are assigned.
NOTE
Ensure that the cluster has met global quorum before proceeding, so that removing the node does not impact services.
Run the following command to safely move workloads from the node before shutdown:
kubectl drain <node-name> --ignore-daemonsets
Here, <node-name> refers to the name of the Fortanix DSM node.
Run the following command to shut down the node:
sudo shutdown -h now
NOTE
For hardware DSM machines, ensure Intelligent Platform Management Interface (IPMI) access is available in case of issues bringing the server online.
For virtual machines (VMs) hosted on ESXi/vSphere, web console access is required to power on the machine.
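The shutdown steps above can be sketched as a single shell function. This is a minimal sketch, not a Fortanix-provided tool: the names dsm_safe_shutdown, run, and DRY_RUN are hypothetical, and the function assumes it is executed on the node being shut down with KUBECONFIG already loaded. DRY_RUN defaults to 1, so the commands are only printed unless you explicitly set DRY_RUN=0.

```shell
#!/bin/sh
# Sketch only: helper names and the DRY_RUN switch are assumptions,
# not part of Fortanix DSM.

# Print the command in dry-run mode (the default); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Cordon, drain, and shut down the given node, stopping at the first failure.
dsm_safe_shutdown() {
  local node="$1"
  if [ -z "$node" ]; then
    echo "usage: dsm_safe_shutdown <node-name>" >&2
    return 1
  fi
  run kubectl cordon "$node" || return 1                     # no new pods scheduled
  run kubectl drain "$node" --ignore-daemonsets || return 1  # move workloads away
  run sudo shutdown -h now                                   # halts the local machine
}
```

Calling dsm_safe_shutdown <node-name> with the default DRY_RUN prints the three commands in order, which is a convenient way to review them before running with DRY_RUN=0.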
3.1 Restarting the DSM Server
Start the Fortanix DSM node from IPMI, and then uncordon the node to end the maintenance.
Perform the following steps to restart the Fortanix DSM server:
Power on the machine using the IPMI console.
Once the node is back online, access it using SSH.
Run the following command to allow scheduling on the node:
kubectl uncordon <node-name>
Here, <node-name> refers to the name of the Fortanix DSM node.
Run the following command to verify the status of the node and workloads:
kubectl get nodes,pods
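The restart steps above can be sketched in the same way. The names dsm_resume_node, run, and DRY_RUN are hypothetical. The kubectl wait step and its 300-second timeout are assumptions added here to confirm the node reports Ready before it is uncordoned; they are not part of the documented procedure.

```shell
#!/bin/sh
# Sketch only: helper names, the wait step, and the timeout are assumptions.

# Print the command in dry-run mode (the default); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Wait for the node to report Ready, allow scheduling again, then show status.
dsm_resume_node() {
  local node="$1"
  if [ -z "$node" ]; then
    echo "usage: dsm_resume_node <node-name>" >&2
    return 1
  fi
  run kubectl wait --for=condition=Ready "node/$node" --timeout=300s || return 1
  run kubectl uncordon "$node" || return 1
  run kubectl get nodes,pods
}
```

Run it once the node is reachable over SSH. With the default DRY_RUN it only prints the commands.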
4.0 Fortanix DSM Prechecks
Fortanix DSM runs the run_precheck.sh script to analyze the cluster health status. The script is located at /opt/fortanix/sdkms/bin/dsm_prechecks/run_precheck.sh.
The following table provides a comprehensive list of Fortanix DSM cluster management prechecks handled by the above script:
CHECK NAME | CHECK TYPE | PURPOSE |
---|---|---|
SWDIST_CHECK | NODE | To check the discrepancy in swdist endpoint files and directories |
KUBEAPI_IP_CHECK | NODE | To check the |
CAS_ADMIN_ACCT_CHECK | NODE | |
1M_CPU_CHECK | NODE | To check the last 1 minute CPU load average |
SWDIST_OVERLAY_SRVC_CHECK | NODE | To verify |
SGX_CHECK | NODE | To verify if the machine supports Software Guard Extension (SGX) technology |
PERM_DAEMON_SRVC_CHECK | NODE | To verify if |
NTP_CHECK | NODE | To verify if Network Time Protocol (NTP) is configured |
MEM_CHECK | NODE | To check system memory utilization |
KUBELET_SRVC_CHECK | NODE | To verify if |
KUBELET_CERT | NODE | To verify |
KUBEAPI_SERVER_CERT | NODE | To check |
HEALTH_CHECK_QUORUM | NODE | To confirm the quorum status of nodes, distinguishing between local and global quorum |
HEALTH_CHECK_ALL | NODE | To verify responses from all the Fortanix DSM pods in the cluster. If they pass, it returns |
DOCKER_REGISTRY_SRVC_CHECK | NODE | To verify |
DISK_CHECK_[/var] | NODE | To verify disk space usage |
DISK_CHECK_[/] | NODE | To verify disk space usage |
DISK_CHECK_[/data] | NODE | To verify disk space usage |
DB_FILES_PERM_CHECK | NODE | To verify if the |
CRI_SRVC_CHECK | NODE | To verify if CRI (Container Runtime Interface) is up and running |
CPU_MODEL_CHECK | NODE | To list out the information on CPU, attestation, and Fortanix DSM appliance series type |
CAS_CERT_CHECK | NODE | To verify Cassandra certificate expiry |
CAS_ACCT_CHECK | NODE | To verify the discrepancy in Cassandra |
5M_CPU_CHECK | NODE | To check the last 5 minutes CPU load average |
15M_CPU_CHECK | NODE | To check the last 15 minutes CPU load average |
IPMI_INFO_CHECK | NODE | To print IPMI configuration information such as the IPMI IP address, default gateway MAC address, and default gateway IP address |
SWDIST_DUP_RULE_CHECK | NODE | To verify |
CONTAINER_CHECK | CLUSTER | To check the container readiness |
BACKUP_SETUP_CHECK | CLUSTER | To verify if the backup has been configured for the cluster |
REPLICA_CHECK | CLUSTER | To verify replica counts in deployment and |
PODS_CHECK | CLUSTER | To verify the health status of the pods and their readiness |
NODE_CHECK | CLUSTER | To verify if nodes are in a ready state |
LB_SETUP_CHECK | CLUSTER | To verify the |
JOB_CHECK | CLUSTER | To validate if the jobs are executed and completed |
IMAGE_VERSION_CHECK | CLUSTER | To verify and report the Fortanix DSM version |
ETCD_HEALTH_CHECK | CLUSTER | To verify if |
CAS_REP_CHECK | CLUSTER | To report the replication strategy, which can be SimpleStrategy or NetworkTopologyStrategy |
CAS_NODETOOL_CHECK | CLUSTER | To verify Cassandra |
CONN_Q_CHECK | CLUSTER | To calculate all public connections to the SDKMS pod |
4.1 Fortanix DSM Prechecks Output Status
Each Fortanix DSM precheck reports one of the following output statuses:
OK: The check succeeded and no action is required.
WARN: The check requires attention. Review the logs created on the node under /tmp/health_checks/ and share the log details with the Fortanix Support team.
SKIPPED: The check was not executed due to an issue in the cluster. For example, the Cassandra Replication Strategy check needs a healthy Cassandra pod to fetch details from; if the pod is not healthy, the check is SKIPPED. Review the logs to understand why the check was skipped, and share the log details with the Fortanix Support team.
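As a minimal sketch of gathering the precheck logs for a support ticket, the helper below bundles the log directory into a single archive. Only the /tmp/health_checks/ location comes from this guide. The function name collect_precheck_logs, its optional arguments, and the archive naming are illustrative assumptions.

```shell
#!/bin/sh
# Sketch only: function name, arguments, and archive name are assumptions.

# Bundle the precheck log directory into a .tar.gz and print the archive path.
#   $1: log directory (default: /tmp/health_checks)
#   $2: output archive (default: timestamped file under /tmp)
collect_precheck_logs() {
  local log_dir="${1:-/tmp/health_checks}"
  local out="${2:-/tmp/dsm_precheck_logs_$(hostname)_$(date +%Y%m%d%H%M%S).tar.gz}"
  if [ ! -d "$log_dir" ]; then
    echo "log directory not found: $log_dir" >&2
    return 1
  fi
  # Archive relative to the parent so the tarball contains "health_checks/...".
  tar -czf "$out" -C "$(dirname "$log_dir")" "$(basename "$log_dir")" || return 1
  echo "$out"
}
```

Running collect_precheck_logs with no arguments on the affected node prints the path of the archive to attach to the ticket.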
5.0 Troubleshooting
The following tables list potential causes of errors and exceptions, along with details on how to fix them, for various Fortanix DSM use cases:
5.1 Cluster Create and Cluster Join
ISSUE | DESCRIPTION | RESOLUTION |
---|---|---|
Hostname of the node is in uppercase. Sample error log: Error from server: | The Fortanix DSM hostname must be in lowercase letters. Cluster creation fails while waiting for the etcd pods to become ready. | Set the hostname in lowercase letters, then reset the cluster and reinitialize cluster creation. |
Domain name resolution. Sample error log: WARNING: BIOS version file not found. Skipping test ERROR: Error parsing | | Verify |
[coredns] Waiting for coredns pod to be ready | Cluster creation fails when the network configuration does not have DNS nameservers configured. | Verify |
NTP servers on the nodes are not in sync | The node join process fails if NTP is not configured or if the clocks of the peer node and the joining node differ, which causes a clock difference in the etcd pod. | Resolve the issue by properly configuring NTP on the nodes. |
Port requirements for intra-cluster communication between the nodes | Communication between the Kubernetes control plane components, such as the API server, scheduler, controller manager, and etcd, occurs over specific ports. | Refer to the Fortanix Data Security Manager Port Requirements guide and ensure that all the required ports are open for communication. |
The URLs of IAS (Intel Attestation Service) must be reachable from the joining node before initiating the This can manifest as | DSM communicates with IAS during: | Refer to the Fortanix Data Security Manager Cluster Attestation Guide. Ensure that nodes can connect to the IAS by using the following commands: |
[kubelet] Waiting for node to become ready [kubelet] Installing cluster configuration ERROR: Found unreplaced IP address in | The error message indicates that there is an unreplaced IP address in the | Raise a support ticket with the output of the following commands: |
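
Two of the causes in the table above, an uppercase hostname and missing DNS nameservers, can be checked before a cluster create or join. The following is a minimal sketch under stated assumptions: the function names are hypothetical, and reading /etc/resolv.conf is an assumed way to look for nameservers that may not reflect how a given node resolves names (for example, with a local stub resolver).

```shell
#!/bin/sh
# Sketch only: function names and the resolv.conf check are assumptions.

# Succeed if the given name is non-empty and has no uppercase letters.
is_lowercase_hostname() {
  local name="$1"
  [ -n "$name" ] && [ "$name" = "$(printf '%s' "$name" | tr '[:upper:]' '[:lower:]')" ]
}

# Report on the given hostname, or on the local hostname if none is given.
check_hostname() {
  local name="${1:-$(hostname)}"
  if is_lowercase_hostname "$name"; then
    echo "OK: hostname '$name' is lowercase"
  else
    echo "FAIL: hostname '$name' is empty or contains uppercase letters" >&2
    return 1
  fi
}

# Succeed if the resolver file lists at least one nameserver entry.
has_nameserver() {
  local file="${1:-/etc/resolv.conf}"
  [ -r "$file" ] && grep -Eq '^[[:space:]]*nameserver[[:space:]]+[^[:space:]]+' "$file"
}
```

For example, `check_hostname && has_nameserver` returns a non-zero status if either condition is not met on the local node.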
5.2 Pod Status
STATUS | DESCRIPTION | RESOLUTION |
---|---|---|
ERROR | Pods can enter the error state for various reasons. A detailed analysis of the pod logs helps to identify the root cause. | Raise a support ticket if any pod is in the ERROR state, and include the output of the following commands: |
PENDING | Pods remain pending when they cannot be scheduled on any node, which can have several causes. A detailed analysis of the logs is required to identify the underlying issue. | |
IMAGEPULLBACKOFF | When the containers in a pod cannot fetch the required images, the pod throws an | If the issue persists, reach out to Fortanix Support. |
CRASHLOOPBACKOFF | A detailed analysis of the pod logs is required to identify the underlying issue. | Reach out to Fortanix Support with the required logs as mentioned above. |
CREATECONFIGERROR | During pod creation, the pod sometimes fails to fetch a required resource. | Restart the pods using the following command: |
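
To find pods in one of the states above, the pod listing can be filtered by its STATUS column. The following is a minimal sketch: the function name unhealthy_pods is hypothetical, and it assumes the default column layout of kubectl get pods --all-namespaces (NAMESPACE, NAME, READY, STATUS, and so on).

```shell
#!/bin/sh
# Sketch only: the function name is an assumption. Expects the default
# `kubectl get pods --all-namespaces` table on stdin, where STATUS is column 4.

# Print "<namespace>/<pod> <status>" for every pod that is neither Running
# nor Completed. Running pods whose containers are not ready are not flagged.
unhealthy_pods() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { print $1 "/" $2 " " $4 }'
}
```

Usage: `kubectl get pods --all-namespaces | unhealthy_pods`. Empty output means no pod is in a state other than Running or Completed.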
5.3 Node Not Ready
ISSUE | DESCRIPTION | RESOLUTION |
---|---|---|
| Nodes may enter the not-ready state due to various factors. A detailed analysis is required to determine the root cause. | Create a support ticket and share the output of the following commands: |
Running | - | Create a support ticket and share the output of the following commands: |
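
Before raising a ticket, it can help to list exactly which nodes are not ready. The following is a minimal sketch: the function name not_ready_nodes is hypothetical, and it assumes the default column layout of kubectl get nodes (NAME, STATUS, ROLES, AGE, VERSION).

```shell
#!/bin/sh
# Sketch only: the function name is an assumption. Expects the default
# `kubectl get nodes` table on stdin, where STATUS is column 2.

# Print "<node> <status>" for every node whose status does not start with
# "Ready". Cordoned nodes ("Ready,SchedulingDisabled") are treated as ready.
not_ready_nodes() {
  awk 'NR > 1 && $2 !~ /^Ready/ { print $1 " " $2 }'
}
```

Usage: `kubectl get nodes | not_ready_nodes`. Empty output means every node reports Ready.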