Introduction
Welcome to the Fortanix-Data-Security-Manager (DSM) Administration guide. This guide describes the script that runs automated prechecks before an upgrade, so that you can configure a Sensu check and run it through the Sensu agent for real-time monitoring.
Prerequisites
The script for automated prechecks is available at the following location from release 4.16 onward:
/opt/fortanix/sdkms/bin/check-dsm-health.sh
A Sensu agent running on a Fortanix DSM host, with sudo permission to run the script:
/opt/fortanix/sdkms/bin/check-dsm-health.sh
A Sensu backend server to configure checks and monitor the alerts.
A Sensu check configured in the Sensu dashboard to execute the script.
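The first two prerequisites can be verified on the host before any Sensu configuration is done. The following is a minimal sketch, not part of the Fortanix tooling: the helper name `check_dsm_script_ready` is hypothetical, and only the script path is taken from this guide.

```shell
# Hypothetical helper: confirm the precheck script exists and is executable.
# The default path is the one listed under Prerequisites.
check_dsm_script_ready() {
    script="${1:-/opt/fortanix/sdkms/bin/check-dsm-health.sh}"

    if [ ! -f "$script" ]; then
        echo "missing: $script"
        return 1
    fi
    if [ ! -x "$script" ]; then
        echo "not executable: $script"
        return 1
    fi
    echo "ready: $script"
}
```

Calling `check_dsm_script_ready` with no argument tests the default path. To confirm the sudo rule as well, running `sudo -n -l /opt/fortanix/sdkms/bin/check-dsm-health.sh` as the Sensu agent user should report whether that user may run the script without a password prompt.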
Checks Handled by the Script
Check Name | Check Type | Purpose |
---|---|---|
HEALTH_CHECK_QUORUM | NODE | Checks the health of the sdkms and Cassandra pods, the script |
HEALTH_CHECK_ALL | NODE | Checks the health of the sdkms and Cassandra pods, the script When we send |
DISK_CHECK | NODE | Checks the |
NTP_CHECK | NODE | The script |
API_CERT | NODE | Checks for the availability of |
KUBELET_CERT | NODE | Checks for the If it is pointing to some other file in If |
DOCKER_REGISTRY_SRVC_CHECK | NODE | Checks whether the docker-registry service is up and running. |
KUBELET_SRVC_CHECK | NODE | Checks whether the kubelet service is up and running. |
CRI_SRVC_CHECK | NODE | Checks whether the docker or containerd service is up and running. |
1M_CPU_CHECK | NODE | Checks whether the 1 minute CPU load average is less than the defined threshold value (90%). |
5M_CPU_CHECK | NODE | Checks whether the 5 minute CPU load average is less than the defined threshold value (90%). |
15M_CPU_CHECK | NODE | Checks whether the 15 minute CPU load average is less than the defined threshold value (90%). |
MEM_CHECK | NODE | Checks whether the memory utilization is less than the defined threshold value (75%). |
DB_FILES_PERM_CHECK | NODE | Checks whether the file permissions of |
CONN_Q_CHECK | NODE | Checks the number of public connections for each node using the Metrics API, if the value greater than 2000 then it flags it as Checks the backend Q size using Metrics API, if value is greater than 10000 then flag it as |
CAS_ACCT_CHECK | NODE | Checks the count of rows in |
CAS_ADMIN_ACCT_CHECK | NODE | Checks the count of sysadmin accounts. If the count is <= 1 then |
SGX_CHECK | NODE | Checks whether sdkms pod is running “enclave” process and also whether the machine has If driver is present and sdkms pod is not running “enclave” process, then report Flag it as |
NODE_CHECK | CLUSTER | Checks whether all the nodes are in ready status |
PODS_CHECK | CLUSTER | Checks whether all pods are in Running or Completed status, if not, then flag them as |
JOB_CHECK | CLUSTER | Checks whether all jobs are in “Complete” status; if not, it flags them as The running jobs are flagged because before you start the upgrade, all jobs must be in completed status. |
CONTAINER_CHECK | CLUSTER | Checks whether all containers within pods are in a |
IPMI_INFO_CHK | NODE | Verifies the IPMI IP address, default gateway mac and default gateway IP are valid, if not, then flags them as |
CAS_REP_CHECK | CLUSTER | Checks whether the added DC Labeling is matching with the replication strategy in Cassandra. (Please note this validation is for Network Topology and fully replicated cluster) if there is any mismatch then it flags it as If in case the DC labeling is present, but the Strategy is Simple Strategy in Cassandra, then it flags it as |
CAS_CERT_CHECK | NODE | Checks whether the Cassandra pod certificate is valid if not it flags it as |
REPLICA_CHECK | CLUSTER | Checks the and compares that with |
BIOS_VER_CHECK | NODE | Verifies that the node has latest BIOS installed, if not it flags it as |
IMAGE_VERSION_CHECK | CLUSTER | Validates the version value from If there is any mismatch, then it flags it as |
BACKUP_SETUP_CHECK | CLUSTER | Checks if the CRON job |
KUBEAPI_IP_CHECK | CLUSTER | Checks whether the |
SWDIST_CHECK | CLUSTER | Checks the count of directories in the path |
CAS_NODETOOL_CHECK | CLUSTER | Executes the nodetool status command and checks whether there are any nodes which are not of the pattern Checks whether the count of pattern |
ETCD_HEALTH_CHECK | CLUSTER | Verifies whether the etcd cluster is healthy, if not it flags it as |
NOTE
The execution time for node checks is 6 seconds, and for cluster checks it is 5.5 seconds.
Execution Using Script
The node level and cluster level checks can be executed using the following commands:
./check-dsm-health.sh --node [<info>|<monitor>] [--ignore-checks=check1,check2.. ]
./check-dsm-health.sh --cluster [<info>|<monitor>] [--ignore-checks=check1,check2.. ]
Where,
The first parameter, --node or --cluster, determines which type of checks is executed.
The second parameter takes one of two values:
info: This parameter turns off alerting and just publishes the check status.
monitor: This parameter turns on alerting, and if there are checks with WARN status, it generates an alert.
The third parameter, --ignore-checks=check1,check2.., is optional and takes a comma-separated list of check names. Use it for known issues that take time to resolve: the script does not alert for checks on the ignore list.
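The three parameters can be combined mechanically, which is convenient when the same command has to be pasted into several Sensu check definitions. The sketch below assembles the command line and validates the first two parameters. The helper name `build_dsm_check_cmd` and the `DSM_HEALTH_SCRIPT` variable are hypothetical; the script path and flags are the ones described above.

```shell
# Hypothetical helper: print the health-check command for a scope and mode.
# Path taken from the Prerequisites section of this guide.
DSM_HEALTH_SCRIPT="/opt/fortanix/sdkms/bin/check-dsm-health.sh"

build_dsm_check_cmd() {
    scope="${1:-}"    # first parameter: node or cluster
    mode="${2:-}"     # second parameter: info or monitor
    ignore="${3:-}"   # optional: comma-separated check names to ignore

    case "$scope" in
        node|cluster) ;;
        *) echo "scope must be node or cluster" >&2; return 2 ;;
    esac
    case "$mode" in
        info|monitor) ;;
        *) echo "mode must be info or monitor" >&2; return 2 ;;
    esac

    cmd="sudo $DSM_HEALTH_SCRIPT --$scope $mode"
    if [ -n "$ignore" ]; then
        cmd="$cmd --ignore-checks=$ignore"
    fi
    printf '%s\n' "$cmd"
}
```

For example, `build_dsm_check_cmd node monitor NTP_CHECK` prints the node-level monitor command with NTP_CHECK on the ignore list, and an unknown scope or mode is rejected with a non-zero return code.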
Check Status
Status can take three values: OK, WARN, and SKIPPED.
OK: The check succeeded and no action is required.
WARN: The check requires attention. Review the logs created on the node in
/tmp/health_checks/logs
SKIPPED: The check was not executed because of an issue in the cluster.
For instance, the Cassandra Replication Strategy check needs a healthy Cassandra pod to fetch details from. If the pod is not healthy, that check is SKIPPED.
Check the logs to understand why a check was skipped.
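When many checks run at once, a quick tally of the three statuses helps decide whether the logs need a closer look. The sketch below counts status words read from standard input. It rests on an assumption this guide does not confirm: that each result line contains one of OK, WARN, or SKIPPED as a separate whitespace-delimited word. The helper name `summarize_statuses` is hypothetical.

```shell
# Hypothetical helper: tally OK / WARN / SKIPPED words read from stdin.
# ASSUMPTION: the status appears as its own whitespace-separated word on
# each result line; the real output format of the script may differ.
summarize_statuses() {
    awk '
        {
            for (i = 1; i <= NF; i++)
                if ($i == "OK" || $i == "WARN" || $i == "SKIPPED")
                    count[$i]++
        }
        END {
            printf "OK=%d WARN=%d SKIPPED=%d\n", count["OK"], count["WARN"], count["SKIPPED"]
        }
    '
}
```

Output from an info run can be piped into `summarize_statuses`; any non-zero WARN or SKIPPED count is a cue to review `/tmp/health_checks/logs`.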
NOTE
Logs older than seven days will be cleaned up by the script.
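Because old logs are removed automatically, it can be useful to see which files have aged past the retention window and copy any that are still needed. The following sketch lists such files with `find`. It is an illustration only and not necessarily how the script performs its cleanup; the helper name `list_old_health_logs` is hypothetical, and the default directory is the log path given under Check Status.

```shell
# Hypothetical helper: list log files last modified more than N days ago.
# With find, "-mtime +7" matches files whose age in whole days exceeds 7.
list_old_health_logs() {
    log_dir="${1:-/tmp/health_checks/logs}"
    days="${2:-7}"

    # Nothing to report if the directory has not been created yet.
    [ -d "$log_dir" ] || return 0
    find "$log_dir" -type f -mtime +"$days"
}
```

Running `list_old_health_logs` with no arguments inspects the default log directory with a seven-day window.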
Examples
Configure the check on the Sensu dashboard: click Configuration → Checks → New.
Execute the node check with the “monitor” parameter using the following command:
sudo /opt/fortanix/sdkms/bin/check-dsm-health.sh --node monitor
Sensu dashboard snapshot:

Figure 1: Sensu Dashboard [Node Checks]
Execute the cluster check with the “info” parameter using the following command:
sudo /opt/fortanix/sdkms/bin/check-dsm-health.sh --cluster info
Configure a separate check to run in round robin, because these checks are cluster-wide and any one node can execute them.

Figure 2: Round Robin Cluster Checks
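The same round-robin check can also be created from the command line instead of the dashboard. The sketch below wraps a `sensuctl check create` call in a function so that nothing runs until it is invoked. The check name `dsm-cluster-health`, the subscription `dsm-nodes`, and the 300-second interval are hypothetical values, and it assumes `sensuctl` is already configured against your Sensu backend. Confirm the flag names with `sensuctl check create --help` for your Sensu version.

```shell
# Sketch only: the check name, subscription, and interval are hypothetical.
# The command string is the cluster check shown in this guide.
create_dsm_cluster_check() {
    sensuctl check create dsm-cluster-health \
        --command "sudo /opt/fortanix/sdkms/bin/check-dsm-health.sh --cluster info" \
        --subscriptions dsm-nodes \
        --interval 300 \
        --round-robin
}
```

With round robin enabled, each scheduled execution is handed to one subscribed agent rather than all of them, which suits cluster-wide checks. Replacing `info` with `monitor` in the command turns on alerting for WARN results.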
If required, add checks to the ignore list as shown below:
sudo /opt/fortanix/sdkms/bin/check-dsm-health.sh --node monitor --ignore-checks=NTP_CHECK