Fortanix Data Security Manager Cluster Management Quick Reference

1.0 Introduction

Welcome to the Fortanix Data Security Manager (DSM) Cluster Management Quick Reference Guide. This guide helps you assess Fortanix DSM cluster health through prechecks, lists commonly used commands, and provides troubleshooting steps for mitigating known issues within the Fortanix DSM cluster environment.

This quick reference guide is intended to be used by technical stakeholders of Fortanix DSM who will be responsible for setting up and managing Fortanix DSM clusters.

2.0 Fortanix DSM Cluster Management Commands

The table below provides a comprehensive list of commands used to manage a Fortanix DSM cluster:

NOTE
Before executing kubectl commands, ensure the admin.conf file is loaded using the command: export KUBECONFIG=/etc/kubernetes/admin.conf
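
For example, a non-root user would load the kubeconfig once and then run kubectl through sudo -E, which preserves the exported KUBECONFIG variable:

  export KUBECONFIG=/etc/kubernetes/admin.conf
  sudo -E kubectl get nodes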

TASK

COMMAND

Verify nodes and pods status

Non-root users: sudo -E kubectl get nodes,pods -owide

Root users: kubectl get nodes,pods -owide

Verify pods status in kube-system namespace

sudo -E kubectl get pods -n kube-system

List only the Fortanix DSM and Cassandra pods
  • sudo -E kubectl get pods -l app=cassandra -owide
  • sudo -E kubectl get pods -l app=sdkms -owide
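
The same label selectors can be combined with other kubectl subcommands; for example, a minimal sketch that tails recent logs from all Fortanix DSM pods at once:

  sudo -E kubectl logs -l app=sdkms --tail=100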

Capture pod logs

sudo -E kubectl logs pod_name -f

Capture pod logs from a different namespace

sudo -E kubectl logs pod_name -n namespace_name

Get a shell inside a pod

sudo -E kubectl exec -it pod_name -- bash

Label a node

sudo -E kubectl label node node_name label_key=label_value

Verify nodetool status

sudo -E kubectl exec -it cassandra-0 -- nodetool status

Verify the replication strategy

sudo -E kubectl exec cassandra-0 -- cqlsh -e "select * from system_schema.keyspaces where keyspace_name ='public';"

Check the current system configuration

sdkms-cluster get config --system

Check the initial system configuration

sdkms-cluster get config --user
Create and list kubeadm tokens
  • kubeadm token create
  • kubeadm token list
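
Assuming the join token used later by sdkms-cluster join is the kubeadm token created here, one convenient pattern is to capture it into a shell variable; a minimal sketch:

  TOKEN=$(kubeadm token create)
  echo "$TOKEN"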

Create a cluster

sdkms-cluster create --self=self_ip_address --config config.yaml

where self_ip_address is the IP address of the node.

Join a cluster

sdkms-cluster join --peer=ip_address --token=<token> --self=self_ip_address

Initiate cluster join with DC labeling

sdkms-cluster join --peer=ip_address --token=<token> --self=self_ip_address --label datacenter="<datacenter_name>"
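
For illustration, an end-to-end sketch of the create and join flow; the IP addresses (10.0.0.11 and 10.0.0.12) and the datacenter name (dc1) are hypothetical, and <token> remains a placeholder for the join token:

  # On the first node (10.0.0.11)
  sudo sdkms-cluster create --self=10.0.0.11 --config config.yaml

  # On the joining node (10.0.0.12)
  sudo sdkms-cluster join --peer=10.0.0.11 --token=<token> --self=10.0.0.12 --label datacenter="dc1"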

Reset the cluster

sdkms-cluster reset --delete-data --reset-iptables

NOTE
Do not run this command if there is any active node associated with the cluster.

Remove the node from the cluster

sdkms-cluster remove --force --node node_name

NOTE
Select the appropriate node to remove from the active cluster.

Re-deploy the cluster after modifying the configuration file (config.yaml)

sdkms-cluster deploy --config config.yaml --stage DEPLOY

Perform Fortanix DSM pods rolling restart

Run the script /opt/fortanix/sdkms/bin/dsm_backend_rolling_restart.sh to restart the Fortanix DSM pods.

View all cronjobs

sudo -E kubectl get cronjobs

Disable all cronjobs

sudo -E kubectl get cj --no-headers | awk '{print $1}' | while read name; do sudo -E kubectl patch cronjob $name -p '{"spec": {"suspend": true}}'; done

Enable all cronjobs

sudo -E kubectl get cj --no-headers | awk '{print $1}' | while read name; do sudo -E kubectl patch cronjob $name -p '{"spec": {"suspend": false}}'; done
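
To confirm the result of either loop, the suspend flag can be printed per cronjob; a minimal sketch using standard kubectl output formatting:

  sudo -E kubectl get cronjobs -o custom-columns=NAME:.metadata.name,SUSPENDED:.spec.suspend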

3.0 Fortanix DSM Prechecks

Fortanix DSM runs the run_precheck.sh script to analyze the cluster health status. The script is located at /opt/fortanix/sdkms/bin/dsm_prechecks/run_precheck.sh.
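
Assuming the script can be invoked directly from that path, a typical manual run looks like this:

  sudo /opt/fortanix/sdkms/bin/dsm_prechecks/run_precheck.sh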

The following table provides a comprehensive list of Fortanix DSM cluster management prechecks handled by the above script:

CHECK NAME

CHECK TYPE

PURPOSE 

SWDIST_CHECK NODE To check the discrepancy in swdist endpoint files and directories
KUBEAPI_IP_CHECK NODE To check the kube-apiserver IP address inconsistency
CAS_ADMIN_ACCT_CHECK NODE
  • To check the total number of Sysadmin accounts
  • To check the best practice of having more than one user as Sysadmin
1M_CPU_CHECK NODE To check last minute CPU load average
SWDIST_OVERLAY_SRVC_CHECK NODE To verify SWDIST_OVERLAY is up and running
SGX_CHECK NODE To verify if the machine supports Software Guard Extension (SGX) technology
PERM_DAEMON_SRVC_CHECK NODE To verify if PERM_DAEMON is up and running
NTP_CHECK NODE To verify if Network Time Protocol (NTP) is configured
MEM_CHECK NODE To check system memory utilization
KUBELET_SRVC_CHECK NODE To verify if Kubelet is up and running
KUBELET_CERT NODE To verify Kubelet certificate validity
KUBEAPI_SERVER_CERT NODE To check KUBEAPI SERVER certificate validity
HEALTH_CHECK_QUORUM NODE To confirm the quorum status of nodes, distinguishing between local and global quorum
HEALTH_CHECK_ALL NODE To verify responses from all the Fortanix DSM pods in the cluster. If they pass, it returns OK.
DOCKER_REGISTRY_SRVC_CHECK NODE To verify DOCKER_REGISTRY is up and running
DISK_CHECK_[/var] NODE To verify disk space usage on /var
DISK_CHECK_[/] NODE To verify disk space usage on /
DISK_CHECK_[/data] NODE To verify disk space usage on /data
DB_FILES_PERM_CHECK NODE To verify whether the files under /data/cassandra/public have executable permissions
CRI_SRVC_CHECK NODE To verify if CRI (Container Runtime Interface) is up and running
CPU_MODEL_CHECK NODE To list out the information on CPU, attestation, and Fortanix DSM appliance series type
CAS_CERT_CHECK NODE To verify Cassandra certificate expiry
CAS_ACCT_CHECK NODE To verify the discrepancy in Cassandra account_primary and account table
5M_CPU_CHECK NODE To check the last 5 minutes CPU load average
15M_CPU_CHECK NODE To check the last 15 minutes CPU load average
IPMI_INFO_CHECK NODE To print IPMI configuration information such as the IPMI IP address, default gateway MAC address, and default gateway IP address
SWDIST_DUP_RULE_CHECK NODE To verify duplicate iptables entries
CONTAINER_CHECK CLUSTER To check the container readiness
BACKUP_SETUP_CHECK CLUSTER To verify if the backup has been configured for the cluster
REPLICA_CHECK CLUSTER To verify replica counts in deployment and configmap
PODS_CHECK CLUSTER To verify the health status of the pods and their readiness
NODE_CHECK CLUSTER To verify if nodes are in a ready state
LB_SETUP_CHECK CLUSTER To verify whether the load balancer (LB) setup is external or internal
JOB_CHECK CLUSTER To validate if the jobs are executed and completed
IMAGE_VERSION_CHECK CLUSTER To verify and report the Fortanix DSM version
ETCD_HEALTH_CHECK CLUSTER To verify if the etcd cluster is healthy
CAS_REP_CHECK CLUSTER To report the replication strategy: this can be SimpleStrategy or NetworkTopologyStrategy
CAS_NODETOOL_CHECK CLUSTER To verify Cassandra nodetool status
CONN_Q_CHECK CLUSTER To calculate all public connections to the SDKMS pod

3.1 Fortanix DSM Prechecks Output Status

The following are the types of Fortanix DSM precheck output status:

  • OK: The check was successful and there is no action item.
  • WARN: The check requires attention. Check the logs created on the node under /tmp/health_checks/ and share the log details with the Fortanix Support team.
  • SKIPPED: The check was not executed due to an issue in the cluster. For example, the Cassandra replication strategy check requires a healthy Cassandra pod to fetch details; if the pod is not healthy, that check is SKIPPED. Check the logs to understand the reason for skipping the check (see the commands below) and share the log details with the Fortanix Support team.
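
A quick way to review the precheck logs on the affected node (the exact file names under /tmp/health_checks/ depend on the checks that ran):

  ls -lrt /tmp/health_checks/
  tail -n 50 /tmp/health_checks/<log_file_name>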

4.0 Troubleshooting

The following tables list potential causes of errors and exceptions, along with details on how to fix them for various Fortanix DSM use cases:

4.1 Cluster Create and Cluster Join

ISSUE

DESCRIPTION

RESOLUTION

Hostname of the node is in uppercase.

Sample error log:
[etcd] Waiting for the etcd pod to come up (this might take 2 minutes)

Error from server (NotFound): pods "etcd-DEV-FRTNX01" not found

The Fortanix DSM hostname must be in lowercase letters.

Cluster creation fails at this step because it keeps waiting for an etcd pod named after the uppercase hostname, which never becomes ready.

Command to set the hostname:

sudo hostnamectl set-hostname newhostname

The user needs to reset the cluster and reinitialize cluster creation.
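
Putting the resolution together, a sketch of the full recovery sequence (the lowercase hostname dev-frtnx01 is hypothetical):

  sudo hostnamectl set-hostname dev-frtnx01
  sudo sdkms-cluster reset --delete-data --reset-iptables
  sudo sdkms-cluster create --self=self_ip_address --config config.yaml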

Domain name resolution

Sample error log: sudo sdkms-cluster create --self=ip_address --config config.yaml
sudo: unable to resolve host cslab-5 Temporary failure in name resolution
[sdkms-cluster]

WARNING: BIOS version file not found. Skipping test

ERROR: Error parsing /etc/resolv.conf

The /etc/resolv.conf file must not be empty. Verify /etc/network/interfaces for the DNS nameservers and add the same entries in /etc/resolv.conf.

[coredns] Waiting for coredns pod to be ready

Cluster creation fails when the network configuration does not have DNS nameservers configured. Verify /etc/network/interfaces for the DNS nameservers and add the same entries in /etc/resolv.conf.
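
As an illustration, /etc/resolv.conf should contain nameserver entries such as the following (the addresses are placeholders; use the nameservers defined in your network configuration):

  nameserver 10.0.0.2
  nameserver 10.0.0.3
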
NTP servers on the nodes are not in sync

The node joining process fails when NTP is not configured or when the clocks on the peer and self nodes differ, which causes clock skew in the etcd pod. Resolve the issue by configuring NTP correctly.
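
One way to confirm time synchronization on each node (a sketch; the exact tooling depends on whether systemd-timesyncd, chrony, or ntpd is in use):

  timedatectl status
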
Port requirements for intra-cluster communication between the nodes

Communication between various Kubernetes control plane components, such as the API server, scheduler, controller manager, and etcd, also occurs over specific ports. Refer to the following support guide and ensure all the required ports are open for communication: Fortanix Data Security Manager Port Requirements – Fortanix

The URLs of the IAS (Intel Attestation Service) must be reachable from the joining node before initiating the sdkms-cluster join process.

Fortanix DSM communicates with IAS during:

  • Cluster creation
  • Addition of a new node
  • Software upgrade

Refer to the following support guide: Fortanix Data Security Manager Cluster Attestation Guide – Fortanix

Verify whether nodes can reach the Intel attestation service using the commands below:

curl -kv http://ps.sgx.trustedservices.intel.com:80
curl -kv https://trustedservices.intel.com/content/CRL:443
curl -kv http://trustedservices.intel.com/ocsp:80   
curl -kv http://whitelist.trustedservices.intel.com/SGX/LCWL/Linux/sgx_white_list_cert.bin:80
curl -kv https://iasproxy.fortanix.com:443

[kubelet] Waiting for node to become ready [kubelet] Installing cluster configuration

ERROR: Found unreplaced IP address in manifests/etcd.yaml

The error message indicates that there is an unreplaced IP address in the manifests/etcd.yaml file.

Kindly raise the support ticket with the output of the following commands:

ls -lrt /etc/kubernetes/pki/etcd
ls -lrt /etc/kubernetes
cat /etc/kubernetes/bootstrap-kubelet.conf
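
If it helps, the three outputs can be collected into a single file to attach to the support ticket; a minimal sketch (the output path is arbitrary):

  {
    sudo ls -lrt /etc/kubernetes/pki/etcd
    sudo ls -lrt /etc/kubernetes
    sudo cat /etc/kubernetes/bootstrap-kubelet.conf
  } > /tmp/etcd_manifest_diag.txt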

 

4.2 POD Status

STATUS

DESCRIPTION

RESOLUTION

ERROR Pods in an error state could be due to various reasons. Performing a detailed analysis of pod logs helps to identify the root cause.

Please raise a support ticket if you encounter any pods in the ERROR state, and kindly include the output of the following commands:

sudo -E kubectl describe pod pod_name
sudo -E kubectl get pods -owide
sudo -E kubectl get pods -n kube-system
sudo -E kubectl logs pod_name
PENDING When pods remain pending without being placed on any nodes, several factors could be responsible for preventing pod scheduling. A detailed analysis of logs is required to identify the underlying issues.
  • Verify the status of the node to ensure the node is in a ready state using the command:
    sudo -E kubectl get nodes -o wide
  • Verify the status of kube-system pods using the command:
    sudo -E kubectl get pods -n kube-system
  • Describe the pod to know the errors using the command:
    sudo -E kubectl describe pod pod_name
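
In addition to the commands above, reviewing recent cluster events often points to the scheduling problem; a minimal sketch:

  sudo -E kubectl get events --sort-by=.metadata.creationTimestamp | tail -n 20
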
IMAGEPULLBACKOFF When a pod cannot pull the container images it requires, it reports an ImagePullBackOff error.
  • Run the script located at /opt/fortanix/sdkms/bin/restart-docker-registry.sh
  • Delete the pods that are in the ImagePullBackOff state; they are recreated and should come back healthy (a sketch for deleting them in bulk follows below).

If the issue persists, kindly reach out to Fortanix Support.
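
A quick way to delete every pod currently stuck in ImagePullBackOff (a sketch; review the pod list before running it):

  sudo -E kubectl get pods --no-headers | awk '/ImagePullBackOff/ {print $1}' | xargs -r sudo -E kubectl delete pod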

CRASHLOOPBACKOFF A detailed analysis of pod logs is required to identify the underlying issues.

Kindly reach out to Fortanix Support with the required logs as mentioned above.

CREATECONFIGERROR During pod creation, the pod sometimes fails to fetch a required resource. Restart the affected pods using the following command:
sudo -E kubectl delete pod pod_name

 

4.3 Node Not Ready

ISSUE

DESCRIPTION

RESOLUTION

kubectl commands will not be accessible from the nodes that are not in the ready state. Nodes may enter the not-ready state due to various factors. A detailed analysis is required to determine the root cause.

Kindly create the support ticket and share the output of the following commands:

systemctl status kubelet
journalctl -fu kubelet
Running kubectl commands from other nodes within the cluster reports the affected nodes as 'NotReady'. Kindly create the support ticket and share the output of the following commands:
sudo -E kubectl get node,pods -owide
sudo -E kubectl get pods -n kube-system
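
If kubectl itself is unusable on the affected node, the kubelet state can still be captured locally for the ticket; a minimal sketch (the output paths are arbitrary):

  systemctl status kubelet > /tmp/kubelet_status.txt
  sudo journalctl -u kubelet --no-pager -n 500 > /tmp/kubelet_journal.txt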
