Fortanix Data Security Manager Upgrade Prechecks Using Sensu - Automated

Introduction

Welcome to the Fortanix Data Security Manager (DSM) Administration guide. This guide describes the automated prechecks that a script performs before an upgrade, so that you can configure a Sensu check and run it with the Sensu agent for real-time monitoring.

Prerequisites

  • The script for automated prechecks, available from release 4.16 onward at the following location: /opt/fortanix/sdkms/bin/check-dsm-health.sh
  • A Sensu agent running on a Fortanix DSM host, with sudo permission to run the script /opt/fortanix/sdkms/bin/check-dsm-health.sh
  • A Sensu backend server to configure checks and monitor the alerts.
  • A Sensu check configured in the Sensu dashboard to execute the script.

Checks Handled by the Script

Check Name Check Type Purpose
HEALTH_CHECK_QUORUM NODE Checks the health of the sdkms and Cassandra pods. The script check-dsm-health.sh makes the following API call: sys/v1/health?consistency=quorum
HEALTH_CHECK_ALL NODE Checks the health of the sdkms and Cassandra pods. The script check-dsm-health.sh makes the following API call: sys/v1/health?consistency=all
With “all”, the client expects a response from all replicas (for a 3-node fully replicated cluster, a response from all 3 Cassandra pods is considered successful).
DISK_CHECK NODE Checks the “/var”, “/”, and “/data” directory usage. The threshold is 70%, so if disk usage is higher than that, you will see “WARN” in the status.
NTP_CHECK NODE The script check-dsm-health.sh executes the ntpq -p command and checks whether the node is syncing with at least one NTP server.
API_CERT NODE Checks that the /etc/kubernetes/pki/apiserver.crt certificate is available and checks its expiry date. If the certificate expires in less than 30 days, it flags it as “WARN”.
KUBELET_CERT NODE Checks for the kubelet.conf file, /etc/kubernetes/kubelet.conf, and verifies whether the certificate is embedded; if yes, it checks the expiry of the certificate-data.
If it is pointing to some other file in /var/lib/kubelet/pki, then it checks the expiry of that certificate.
If kubelet.conf is pointing to /var/lib/kubelet/pki/kubelet-cert-current.pem, then no action is required.
DOCKER_REGISTRY_SRVC_CHECK NODE Checks whether the docker-registry service is up and running.
KUBELET_SRVC_CHECK NODE Checks whether the kubelet service is up and running.
CRI_SRVC_CHECK NODE Checks whether the docker or containerd service is up and running.
1M_CPU_CHECK NODE Checks whether the 1 minute CPU load average is less than the defined threshold value (90%).
5M_CPU_CHECK NODE Checks whether the 5 minute CPU load average is less than the defined threshold value (90%).
15M_CPU_CHECK NODE Checks whether the 15 minute CPU load average is less than the defined threshold value (90%).
MEM_CHECK NODE Checks whether the memory utilization is less than the defined threshold value (75%).
DB_FILES_PERM_CHECK NODE Checks whether the file permissions of /data/cassandra/public are valid (user: 999, group: docker); if not, it reports “WARN”.
CONN_Q_CHECK NODE Checks the number of public connections for each node using the Metrics API; if the value is greater than 2000, it flags it as “WARN”, else “OK”.
Also checks the backend queue size using the Metrics API; if the value is greater than 10000, it flags it as “WARN”, else “OK”.
CAS_ACCT_CHECK NODE Checks the count of rows in the account and account_primary tables; if the counts do not match, it flags it as “WARN”, else “OK”.
CAS_ADMIN_ACCT_CHECK NODE Checks the count of sysadmin accounts. If the count is <= 1 then “WARN”, else “OK”.
SGX_CHECK NODE Checks whether the sdkms pod is running the “enclave” process and whether the machine has the /dev/*sgx driver; if both are true, it reports “SGX SDKMS POD”.
If the driver is present and the sdkms pod is not running the “enclave” process, it reports “NON SGX SDKMS POD ON SGX MACHINE”.
This case is flagged as “WARN” because an SGX machine is expected to run an SGX deployment.
NODE_CHECK CLUSTER Checks whether all nodes are in Ready status, as reported by kubectl get nodes.
PODS_CHECK CLUSTER Checks whether all pods are in Running or Completed status, if not, then flag them as “WARN”.
JOB_CHECK CLUSTER Checks whether all jobs are in “Complete” status; if not, it flags them as “WARN” for all running and failed jobs.
The running jobs are flagged because before you start the upgrade, all jobs must be in completed status.
CONTAINER_CHECK CLUSTER Checks whether all containers within pods are in a “ready” state; if not, then flags them as “WARN”.
IPMI_INFO_CHK NODE Verifies that the IPMI IP address, default gateway MAC address, and default gateway IP address are valid; if not, it flags them as “WARN”.
CAS_REP_CHECK CLUSTER Checks whether the configured DC labeling matches the replication strategy in Cassandra; if there is any mismatch, it flags it as “WARN”.
(Note that this validation applies to Network Topology and fully replicated clusters.)
If DC labeling is present but the strategy in Cassandra is Simple Strategy, it flags it as “WARN” (this is not recommended for production clusters).
CAS_CERT_CHECK NODE Checks whether the Cassandra pod certificate is valid; if not, it flags it as “WARN”.
REPLICA_CHECK CLUSTER Checks the “replicas” field value in the configuration map using the command:
sdkms-cluster get config --system
and compares it with the “replicas” value of the deployment and statefulset. If they do not match, it flags it as “WARN”, else “OK”.
BIOS_VER_CHECK NODE Verifies that the node has the latest BIOS installed; if not, it flags it as “WARN”.
IMAGE_VERSION_CHECK CLUSTER Validates the version value in the /etc/fortanix/sdkms_version/sdkms_version file against the image version of the SDKMS/SDKMS-PROXY/SDKMS-UI deployments.
If there is any mismatch, it flags it as “WARN”. (This check is also useful after an upgrade or cluster creation.)
BACKUP_SETUP_CHECK CLUSTER Checks if the CRON job sdkms-backup exists; if it is not present, it flags it as “WARN”.
KUBEAPI_IP_CHECK CLUSTER Checks whether the kube-apiserver manifest file has the correct IP address; if not, it flags it as “WARN”.
SWDIST_CHECK CLUSTER Checks whether the count of directories in the path /var/opt/fortanix/swdist/data/, the Swdist endpoints, and the versions file (/var/opt/fortanix/swdist/versions) match.
CAS_NODETOOL_CHECK CLUSTER Executes the nodetool status command and checks whether any nodes are not in the “UN” (Up and Normal) state.
Also checks whether the count of “UN” entries matches the Cassandra pod count.
ETCD_HEALTH_CHECK CLUSTER Verifies whether the etcd cluster is healthy; if not, it flags it as “WARN”.
NOTE
The execution time is 6 seconds for node checks and 5.5 seconds for cluster checks.
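
To make the threshold logic of a check such as DISK_CHECK concrete, the following is a minimal sketch of a 70% disk usage test. It is not the code of check-dsm-health.sh; the function names disk_usage_pct and classify_usage are hypothetical, and only the directories and the 70% threshold come from the table above.

```shell
#!/usr/bin/env bash
# Illustrative only: this is NOT the logic of check-dsm-health.sh itself.
THRESHOLD=70   # percent, as documented for DISK_CHECK

# Print the usage percentage (without the % sign) of the filesystem holding $1.
disk_usage_pct() {
  df -P "$1" | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

# Print WARN if usage ($1) is higher than the threshold ($2), otherwise OK.
classify_usage() {
  if [ "$1" -gt "$2" ]; then echo "WARN"; else echo "OK"; fi
}

for dir in /var / /data; do
  [ -d "$dir" ] || continue              # skip paths that do not exist on this host
  pct=$(disk_usage_pct "$dir")
  case "$pct" in ''|*[!0-9]*) continue ;; esac   # skip non-numeric df output
  echo "$dir ${pct}% $(classify_usage "$pct" "$THRESHOLD")"
done
```

A usage of exactly 70% is reported as OK in this sketch, because the text states that only usage higher than the threshold produces “WARN”.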

Execution Using Script

The node level and cluster level checks can be executed using the following commands:

./check-dsm-health.sh --node [info|monitor] [--ignore-checks=check1,check2..]
./check-dsm-health.sh --cluster [info|monitor] [--ignore-checks=check1,check2..]

Where,

  • The first parameter, --node or --cluster, determines which type of checks is executed.
  • The second parameter takes two values:
    • info: This parameter turns off or disables alerting and just publishes the check status.
    • monitor: This parameter turns on alerting, and if there are checks with WARN status, it generates an alert.
  • The third parameter, --ignore-checks=check1,check2.. is optional and takes a list of comma-separated values. This is used in case of any known issues that require time to resolve, so you can add these checks to the ignore list so that the script does not alert for those checks.
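
As a small illustration of the third parameter, the sketch below joins a list of check names into a single --ignore-checks argument. The helper build_ignore_arg is hypothetical and not part of the Fortanix script; the example only prints the resulting command line and does not run it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: join check names into one --ignore-checks argument.
build_ignore_arg() {
  [ "$#" -gt 0 ] || return 0     # nothing to ignore: print nothing
  local IFS=,                    # "$*" joins arguments with the first IFS character
  echo "--ignore-checks=$*"
}

# Example: print the node-level command line without running it.
echo "sudo /opt/fortanix/sdkms/bin/check-dsm-health.sh --node monitor $(build_ignore_arg NTP_CHECK BIOS_VER_CHECK)"
```

Keeping the ignore list in one place like this makes it easier to remove a check from the list once the underlying issue is resolved.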

Check Status

Status can take three values: OK, WARN, SKIPPED

  • OK: The check was successful and no action is required.
  • WARN: Requires attention; check the logs created on the node in /tmp/health_checks/logs.
  • SKIPPED: Check was not executed because of some issue in the cluster.
    For instance, Cassandra Replication Strategy check requires Cassandra to be healthy to fetch details from the Cassandra pod, but if the pod is not healthy, then that check will be SKIPPED.
    Hence, it is required to check the logs to understand the reason for skipping the check.
NOTE
Logs older than seven days will be cleaned up by the script.
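
When a check reports WARN or SKIPPED, the first step is usually to open the most recent log. The following is a minimal sketch for locating it, assuming that the logs are stored as regular files directly under /tmp/health_checks/logs; the file layout and the helper name latest_log are assumptions, not documented behavior.

```shell
#!/usr/bin/env bash
LOG_DIR="${LOG_DIR:-/tmp/health_checks/logs}"   # path taken from the text above

# Print the most recently modified entry in directory $1 (empty if none).
latest_log() {
  [ -d "$1" ] || return 0
  ls -1t "$1" | head -n 1
}

newest=$(latest_log "$LOG_DIR")
if [ -n "$newest" ]; then
  echo "Newest log: $LOG_DIR/$newest"
else
  echo "No logs found in $LOG_DIR"
fi
```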

Examples

Configure the check on the Sensu dashboard: click Configuration → Checks → New.

Execute the node check with the “monitor” parameter using the following command:

sudo /opt/fortanix/sdkms/bin/check-dsm-health.sh --node monitor

Sensu dashboard snapshot:

Figure 1: Sensu Dashboard [Node Checks]

Execute the cluster check with the “info” parameter using the following command:

sudo /opt/fortanix/sdkms/bin/check-dsm-health.sh --cluster info

Configure a separate check to run in round robin, because these checks are cluster-wide and any one node can execute them.

Figure 2: Round Robin Cluster Checks
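
For readers who prefer a file-based definition over the dashboard, the sketch below prints a round-robin check definition in YAML. It assumes Sensu Go (the CheckConfig resource of API version core/v2); the check name, the subscription "dsm", and the interval and timeout values are placeholders, and render_check is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Print a Sensu Go CheckConfig in YAML.
# $1 = check name, $2 = command, $3 = round_robin (true|false)
# The subscription "dsm", interval, and timeout are placeholder values.
render_check() {
  cat <<EOF
type: CheckConfig
api_version: core/v2
metadata:
  name: $1
spec:
  command: $2
  interval: 300
  timeout: 60
  publish: true
  round_robin: $3
  subscriptions:
  - dsm
EOF
}

render_check dsm-cluster-health \
  "sudo /opt/fortanix/sdkms/bin/check-dsm-health.sh --cluster info" true
```

The printed definition could be saved to a file and loaded with sensuctl create --file, provided the agents on the DSM hosts subscribe to the subscription used here. With round_robin set to true, Sensu schedules each execution on one agent of the subscription rather than on all of them.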

If required, add checks to the ignore list as follows:

sudo /opt/fortanix/sdkms/bin/check-dsm-health.sh --node monitor --ignore-checks=NTP_CHECK
