Fortanix Data Security Manager Upgrade Prechecks Using Sensu

1.0 Introduction

Welcome to the Fortanix Data Security Manager (DSM) Administration guide. This guide describes the automated prechecks that a script runs before an upgrade, and how to configure that script as a Sensu check that a Sensu agent executes for real-time monitoring.

2.0 Prerequisites

The script for automated prechecks, available from release 4.16 at the following location: /opt/fortanix/sdkms/bin/check-dsm-health.sh
A Sensu agent running on a Fortanix DSM host, with sudo permission to run /opt/fortanix/sdkms/bin/check-dsm-health.sh
A Sensu backend server to configure checks and monitor the alerts.
A Sensu check configured in the Sensu Dashboard to execute the script.
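
Before wiring the script into Sensu, it can help to confirm on the DSM host that the script is present and executable. The following is a minimal sketch; the function name `script_ready` is hypothetical and is not part of the Fortanix tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of DSM): reports whether the precheck
# script exists and is executable on this host.
script_ready() {
  local script="${1:-/opt/fortanix/sdkms/bin/check-dsm-health.sh}"
  if [ ! -f "$script" ]; then
    echo "missing"
    return 1
  fi
  if [ ! -x "$script" ]; then
    echo "not executable"
    return 1
  fi
  echo "ready"
}
```

Calling `script_ready` with no argument checks the default path. To confirm the sudo permission, one option is to run `sudo -n -l /opt/fortanix/sdkms/bin/check-dsm-health.sh` as the account that the Sensu agent uses; sudo exits with a non-zero status when the command is not permitted.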

3.0 Checks Handled by the Script

Check Name	Check Type	Purpose
HEALTH_CHECK_QUORUM	NODE	Checks the health of the sdkms and Cassandra pods. The script `check-dsm-health.sh` makes the API call `sys/v1/health?consistency=quorum`.
HEALTH_CHECK_ALL	NODE	Checks the health of the sdkms and Cassandra pods. The script `check-dsm-health.sh` makes the API call `sys/v1/health?consistency=all`. The value `all` means the client expects a response from all replicas (for a 3-node fully replicated cluster, a response from 3 Cassandra pods is considered successful).
DISK_CHECK	NODE	Checks the usage of the `/var`, `/`, and `/data` directories. The threshold is 70%; if disk usage is higher than that, the status is `WARN`.
NTP_CHECK	NODE	The script `check-dsm-health.sh` executes the `ntpq -p` command and checks whether the node is syncing with at least one NTP server.
API_CERT	NODE	Checks that the `/etc/kubernetes/pki/apiserver.crt` certificate is available and checks its expiry date. If the certificate expires in less than 30 days, it flags it as `WARN`.
KUBELET_CERT	NODE	Checks the kubelet configuration file, `/etc/kubernetes/kubelet.conf`, and verifies whether the certificate is embedded; if so, it checks the expiry of the embedded certificate data. If the file points to another file in `/var/lib/kubelet/pki`, it checks the expiry of that certificate. If `kubelet.conf` points to `/var/lib/kubelet/pki/kubelet-cert-current.pem`, no action is required.
DOCKER_REGISTRY_SRVC_CHECK	NODE	Checks whether the docker-registry service is up and running.
KUBELET_SRVC_CHECK	NODE	Checks whether the kubelet service is up and running.
CRI_SRVC_CHECK	NODE	Checks whether the docker or containerd service is up and running.
1M_CPU_CHECK	NODE	Checks whether the 1-minute CPU load average is less than the defined threshold value (90%).
5M_CPU_CHECK	NODE	Checks whether the 5-minute CPU load average is less than the defined threshold value (90%).
15M_CPU_CHECK	NODE	Checks whether the 15-minute CPU load average is less than the defined threshold value (90%).
MEM_CHECK	NODE	Checks whether the memory utilization is less than the defined threshold value (75%).
DB_FILES_PERM_CHECK	NODE	Checks whether the file permissions of `/data/cassandra/public` are valid (user `999` and group `docker`); if not, it reports `WARN`.
CONN_Q_CHECK	NODE	Checks the number of public connections for each node using the Metrics API; if the value is greater than 2000, it flags it as `WARN`, else `OK`. Also checks the backend queue size using the Metrics API; if the value is greater than 10000, it flags it as `WARN`, else `OK`.
CAS_ACCT_CHECK	NODE	Checks the row counts of the `account` and `account_primary` tables; if they do not match, it flags it as `WARN`, else `OK`.
CAS_ADMIN_ACCT_CHECK	NODE	Checks the count of sysadmin accounts. If the count is <= 1, the status is `WARN`, else `OK`.
SGX_CHECK	NODE	Checks whether the sdkms pod is running the “enclave” process and whether the machine has the `/dev/*sgx` driver; if both are true, it reports `SGX SDKMS POD`. If the driver is present but the sdkms pod is not running the “enclave” process, it reports `NON SGX SDKMS POD ON SGX MACHINE` and flags it as `WARN`, because an SGX machine is expected to run the SGX deployment.
NODE_CHECK	CLUSTER	Checks whether all the nodes are in Ready status (`kubectl get nodes`).
PODS_CHECK	CLUSTER	Checks whether all pods are in Running or Completed status; if not, it flags them as `WARN`.
JOB_CHECK	CLUSTER	Checks whether all jobs are in “Complete” status; if not, it flags all running and failed jobs as `WARN`. Running jobs are flagged because all jobs must be in completed status before you start the upgrade.
CONTAINER_CHECK	CLUSTER	Checks whether all containers within pods are in a `ready` state; if not, it flags them as `WARN`.
IPMI_INFO_CHK	NODE	Verifies that the IPMI IP address, default gateway MAC address, and default gateway IP address are valid; if not, it flags them as `WARN`.
CAS_REP_CHECK	CLUSTER	Checks whether the DC labeling that was added matches the replication strategy in Cassandra (this validation applies to Network Topology and fully replicated clusters); if there is any mismatch, it flags it as `WARN`. If DC labeling is present but the strategy in Cassandra is Simple Strategy, it also flags it as `WARN` (this is not recommended for production clusters).
CAS_CERT_CHECK	NODE	Checks whether the Cassandra pod certificate is valid; if not, it flags it as `WARN`.
REPLICA_CHECK	CLUSTER	Checks the `replicas` field value in the configuration map using the command `sdkms-cluster get config --system` and compares it with the `replicas` value of the deployment and statefulset. If they do not match, it flags it as `WARN`, else `OK`.
BIOS_VER_CHECK	NODE	Verifies that the node has the latest BIOS installed; if not, it flags it as `WARN`.
IMAGE_VERSION_CHECK	CLUSTER	Compares the version value in the `/etc/fortanix/sdkms_version/sdkms_version` file with the image version of the `SDKMS`, `SDKMS-PROXY`, and `SDKMS-UI` deployments. If there is any mismatch, it flags it as `WARN`. (This check is also useful after an upgrade or cluster creation.)
BACKUP_SETUP_CHECK	CLUSTER	Checks whether the cron job `sdkms-backup` exists; if it is not present, it flags it as `WARN`.
KUBEAPI_IP_CHECK	CLUSTER	Checks whether the `kube-apiserver` manifest file has the correct IP address; if not, it flags it as `WARN`.
SWDIST_CHECK	CLUSTER	Checks whether the count of directories in the path `/var/opt/fortanix/swdist/data/`, the Swdist endpoints, and the versions file (`/var/opt/fortanix/swdist/versions`) match.
CAS_NODETOOL_CHECK	CLUSTER	Executes the `nodetool status` command and checks whether any nodes are not in the `UN` (Up and Normal) state. Also checks whether the count of `UN` entries matches the Cassandra pod count.
ETCD_HEALTH_CHECK	CLUSTER	Verifies whether the etcd cluster is healthy; if not, it flags it as `WARN`.
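
Most checks in the table follow the same pattern: read a value, compare it with a threshold, and report `OK` or `WARN`. The sketch below illustrates that pattern for the DISK_CHECK rule (usage above 70% is `WARN`). It is not the code used by `check-dsm-health.sh`; the function names `disk_status` and `report_disk_usage` are hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative only: mirrors the DISK_CHECK rule described in the table
# (usage above 70% is WARN). Not the code used by check-dsm-health.sh.
disk_status() {
  local used_pct="$1" threshold="${2:-70}"
  if [ "$used_pct" -gt "$threshold" ]; then
    echo "WARN"
  else
    echo "OK"
  fi
}

# Print "<dir> <used>% <status>" for each directory that exists.
report_disk_usage() {
  local dir pct
  for dir in "$@"; do
    [ -d "$dir" ] || continue
    # df -P gives a stable POSIX layout; column 5 is the used percentage.
    pct="$(df -P "$dir" | awk 'NR==2 { sub(/%/, "", $5); print $5 }')"
    echo "$dir ${pct}% $(disk_status "$pct")"
  done
}
```

For example, `report_disk_usage /var / /data` prints one line per directory that exists on the host.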

NOTE
The execution time is 6 seconds for node checks and 5.5 seconds for cluster checks.

4.0 Execution Using Script

The node level and cluster level checks can be executed using the following commands:

./check-dsm-health.sh --node [info|monitor] [--ignore-checks=check1,check2,...]
./check-dsm-health.sh --cluster [info|monitor] [--ignore-checks=check1,check2,...]

Where,

The first parameter, --node or --cluster, determines which type of checks is executed.
The second parameter takes one of two values:
- info: Turns off alerting and only publishes the check status.
- monitor: Turns on alerting; if any check has WARN status, it generates an alert.
The third parameter, --ignore-checks=check1,check2,... is optional and takes a comma-separated list of check names. Use it for known issues that need time to resolve: checks on the ignore list do not generate alerts.
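
The three parameters above can be assembled as in the following sketch. The helper `build_check_cmd` is hypothetical (it is not shipped with DSM); it only validates the first two values and joins the optional check names with commas.

```shell
#!/usr/bin/env bash
# Hypothetical helper: assembles the command line described above.
# Usage: build_check_cmd <node|cluster> <info|monitor> [CHECK_NAME ...]
build_check_cmd() {
  local scope="$1" mode="$2"
  case "$scope" in
    node|cluster) ;;
    *) echo "scope must be node or cluster" >&2; return 1 ;;
  esac
  case "$mode" in
    info|monitor) ;;
    *) echo "mode must be info or monitor" >&2; return 1 ;;
  esac
  shift 2
  local cmd="sudo /opt/fortanix/sdkms/bin/check-dsm-health.sh --${scope} ${mode}"
  if [ "$#" -gt 0 ]; then
    local IFS=,   # makes "$*" join the remaining arguments with commas
    cmd="${cmd} --ignore-checks=$*"
  fi
  echo "$cmd"
}
```

For example, `build_check_cmd node monitor NTP_CHECK BIOS_VER_CHECK` prints a node-level monitor command that ignores those two checks.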

5.0 Check Status

Status can take three values: OK, WARN, and SKIPPED.

OK: The check was successful and no action is required.
WARN: Requires attention. Check the logs that are created on the node in /tmp/health_checks/logs.
SKIPPED: The check was not executed because of an issue in the cluster.
For instance, the Cassandra Replication Strategy check needs a healthy Cassandra pod to fetch its details; if the pod is not healthy, that check is SKIPPED.
Hence, check the logs to understand why the check was skipped.

NOTE
Logs older than seven days will be cleaned up by the script.
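
Because both WARN and SKIPPED point to the log directory, a quick way to find the most recent logs can be useful. The sketch below is a hypothetical helper, not part of the script; the names of the files inside /tmp/health_checks/logs are not documented here, so it simply lists the newest entries.

```shell
#!/usr/bin/env bash
# Hypothetical helper: lists the most recently modified files in the
# log directory named above, newest first.
latest_logs() {
  local dir="${1:-/tmp/health_checks/logs}" count="${2:-5}"
  if [ ! -d "$dir" ]; then
    echo "log directory not found: $dir" >&2
    return 1
  fi
  ls -1t "$dir" | head -n "$count"
}
```

Calling `latest_logs` with no arguments lists up to five entries from the default directory; a different directory and count can be passed as the first and second arguments.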

6.0 Examples

Configure the check on the Sensu dashboard: click Configuration → Checks → New.

Execute the node check with the “monitor” parameter using the following command:

sudo /opt/fortanix/sdkms/bin/check-dsm-health.sh --node monitor

Sensu dashboard snapshot:

**Figure 1: Sensu Dashboard [Node Checks]**

Execute the cluster check with the “info” parameter using the following command:

sudo /opt/fortanix/sdkms/bin/check-dsm-health.sh --cluster info

Configure a separate check to run in round robin, because these checks are cluster-wide and any one node can execute them.

**Figure 2: Round Robin Cluster Checks**
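
As an alternative to the dashboard, the same two checks could be defined from the command line with `sensuctl`. The sketch below only prints the commands so that they can be reviewed first. The check names, the `dsm` subscription, and the 300-second interval are assumptions, not values from this guide, and the flag names should be verified against `sensuctl check create --help` for the installed Sensu version.

```shell
#!/usr/bin/env bash
# Sketch only. Check names, the "dsm" subscription, and the 300-second
# interval are assumptions, not values from the Fortanix documentation.
SCRIPT=/opt/fortanix/sdkms/bin/check-dsm-health.sh

# Prints (does not run) a sensuctl command for one check definition.
# Usage: print_check_create <name> <node|cluster> <info|monitor>
print_check_create() {
  local name="$1" scope="$2" mode="$3" extra=""
  # Cluster checks are cluster-wide, so one agent per interval is enough.
  if [ "$scope" = "cluster" ]; then
    extra=" --round-robin"
  fi
  echo "sensuctl check create ${name} --command 'sudo ${SCRIPT} --${scope} ${mode}' --subscriptions dsm --interval 300${extra}"
}
```

For example, `print_check_create dsm-node-prechecks node monitor` prints a definition that every subscribed agent runs, while `print_check_create dsm-cluster-prechecks cluster info` adds the round-robin flag.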

If required, add checks to the ignore list as follows:

sudo /opt/fortanix/sdkms/bin/check-dsm-health.sh --node monitor --ignore-checks=NTP_CHECK

