Fortanix DSM for Checks and Alerts on Sensu Server


1.0 Introduction

This article describes the Sensu-based checks and alerts for Fortanix Data Security Manager (DSM). The Sensu framework continuously monitors the health of DSM cluster nodes and the critical services running on those nodes. It performs periodic checks and generates alerts when issues are detected.

The solution includes a set of predefined node and service health checks by default. Alert intervals and severity levels can be configured to match deployment requirements, and the framework can be easily extended to add new checks based on customer needs. This document provides details of the supported Sensu checks and alerts, along with the actions triggered when an alert condition is met.

2.0 Terminology References

  • NTP – Network Time Protocol

  • NTPD – Network Time Protocol Daemon

  • API – Application Programming Interface

  • CSR – Certificate Signing Request

  • E&T – Engineering & Tools

  • SPS – Support

  • GB – Gigabytes

3.0 Component-level Checks and Alerts

This section explains the Sensu checks configured for different Fortanix DSM system components and the alerts generated when these checks fail. These component-level checks provide proactive monitoring to ensure DSM cluster health, availability, and timely issue detection.

3.1 Service Component: CPU

Metric: Temperature

Threshold: Warning & Critical

Alert Categorization: Low

Issue Description: This alert indicates that the appliance is operating at a non-ambient temperature, typically caused by environmental issues in the data center.

Recommended Action:

  1. Verify the data center environmental controls.

  2. If the temperature settings are above acceptable limits, contact Fortanix Support.

3.2 Service Component: Memory

Metric: Utilization

Threshold: 80% Warning, 90% Critical

Alert Categorization: Low

Issue Description: This alert indicates that the memory utilization on the host has reached its configured threshold, typically because of a high workload. If the utilization remains high for an extended period, it indicates the need for capacity expansion.

Recommended Action:

  1. High memory utilization is not always a failure condition; it often reflects a high volume of client requests. Wait for at least 15 minutes to allow temporary workloads to complete.

  2. Run the following command on the Fortanix DSM node to identify the processes consuming high memory:  

    ps aux | sort -nrk 4,4 | head -n 3                                                       
  3. Analyze the output:

    • If the output of the command includes CassandraDaemon, Elasticsearch, or /root/enclave-runner /root/backend.sgxs, then the issue is likely due to high traffic.

      • If the alert appears only on a few hosts, it indicates suboptimal load balancing. Contact Fortanix Support.

      • If the alert appears on many hosts, check the Audit Log in the Fortanix DSM user interface (UI) to identify the application generating a high volume of transactions.

        • Notify the application team to validate whether the traffic is expected.

        • If the traffic is expected, contact Fortanix Support for capacity addition.

    • If the output of the command is different, capture the output and contact Fortanix Support.    
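The triage logic in steps 2 and 3 can be sketched as a small script. The function name `classify_top_mem` and the sample `ps aux` lines below are illustrative only; the process names are the ones listed in this section:

```shell
# Hypothetical helper: classify the top memory consumers from
# ps-aux-style input. Known DSM service processes suggest high
# traffic; anything else should be reported to Fortanix Support.
classify_top_mem() {
  # Sort by %MEM (column 4), keep the top 3, and match the process
  # names called out in this section.
  sort -nrk 4,4 | head -n 3 | \
    grep -Eq 'CassandraDaemon|Elasticsearch|enclave-runner' \
    && echo "expected-high-traffic" || echo "unexpected-process"
}

# Sample ps aux output (USER PID %CPU %MEM COMMAND), for illustration:
printf 'root 101 5.0 42.1 java CassandraDaemon
root 102 2.0 18.3 /usr/share/elasticsearch Elasticsearch
root 103 1.0 9.5 /root/enclave-runner /root/backend.sgxs\n' | classify_top_mem
```

On a live node, the input would come from `ps aux` directly, as in step 2.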

3.3 Service Component: Disk

Metric: Space utilization

Threshold: 80% Warning, 90% Critical

Alert Categorization: Low

Issue Description: This alert indicates that disk utilization on the host has reached its limits. This is caused by Cassandra data occupying a large portion of disk space and may require purging of old data.

Recommended Action:

  1. Run the following command on the Fortanix DSM node to confirm if Cassandra data is consuming disk space:

    du -sh /data/cassandra 
  2. If the output shows disk usage in the range of hundreds of GBs, purge old or unused data:

    1. Identify stale accounts or keys that are not being used.

    2. Delete the identified unused accounts or keys from the Fortanix DSM UI.

  3. If Cassandra is not the cause, the issue may be due to an unaccounted log file consuming disk space. In this case, contact Fortanix Support for log file identification and remediation.
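The 80% warning / 90% critical thresholds stated above can be sketched as a simple check. The function name `disk_alert_level` is illustrative, not part of the product:

```shell
# Minimal sketch of the disk-utilization thresholds in this section:
# >= 90% is critical, >= 80% is a warning, anything lower is OK.
disk_alert_level() {
  pct=$1
  if [ "$pct" -ge 90 ]; then echo critical
  elif [ "$pct" -ge 80 ]; then echo warning
  else echo ok
  fi
}

# On a live node the percentage could be derived from, e.g.:
#   df --output=pcent /data | tail -1 | tr -d ' %'
disk_alert_level 85
```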

3.4 Service Component: NTP

Metric: Sync offset, stratum, unsynced status

Threshold:  

  • Offset ≥ 20 ms → Warning

  • Offset ≥ 200 ms → Critical

  • Stratum > 15 or NTP unsynced → Critical

Alert Categorization: Low

Issue Description: This alert indicates a possible failure to reach the external NTP server. Accurate time synchronization is critical for database operations and cluster consistency.

Recommended Action:

  1. Verify connectivity to the NTP servers using ping. If network connectivity is fine, the issue may be due to a service failure, for example, a crash of the Network Time Protocol daemon (NTPD).

  2. Run the following command on the Fortanix DSM node to check the NTP sync status:

    ntpq -p 

    In the output, verify the sync state and reachability:

    • A * symbol in the first column indicates an active sync connection.

    • If no row contains a *, NTP sync is not established.

  3. Run the following command to restart the NTP service to re-establish sync:

    sudo service ntp restart 

    If the Fortanix DSM version is 5.0 or later, run the following command instead to restart the NTP service and re-establish sync:

    sudo service ntpsec restart 

    NOTE

    This restart is performed by the Engineering & Tools (E&T) team after initial troubleshooting by the SPS team.
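The `ntpq -p` check in step 2 can be automated as a sketch. The function name `ntp_synced` and the sample peer lines are illustrative; only the `*`-in-first-column rule comes from this section:

```shell
# Sketch of checking ntpq -p output for an active sync peer.
# A '*' at the start of a line marks the peer currently selected
# for synchronization; no '*' anywhere means sync is not established.
ntp_synced() {
  grep -q '^\*' && echo synced || echo unsynced
}

# Sample ntpq -p output, standing in for a live query:
printf '     remote           refid    st t when poll reach
*time.example.com .GPS.          1 u   34   64  377
+backup.example.c 10.0.0.1       2 u   12   64  377\n' | ntp_synced
```

On a live node, pipe `ntpq -p` into the check instead of the sample text.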

3.5 Service Component: SDKMS REST API

Ports Monitored:

  • sdkms-rest-api → 443

  • sdkms-kmip-api → 5696

  • sdkms-ui-nginx → 4445

  • sdkms-proxy → 4445

  • sdkms-server → 443

Metric: Reachability

Threshold: Service not reachable

Alert Categorization: Low

Issue Description: This alert indicates a failure in reaching one or more SDKMS REST API services.

Recommended Action:

  1. Intermittent failures are usually recoverable as the service automatically restarts. Wait for at least 10 minutes for the alarm to clear.

  2. If the issue persists, contact Fortanix Support with the following information for debugging:

    1. SSH into one of the cluster nodes.

    2. Run the following command to check the status of all pods and capture the output:

      sudo KUBECONFIG=/etc/kubernetes/admin.conf kubectl get pods -owide 
    3. Verify that all Fortanix DSM pods are in a running state:

      • 1/1 → Pod is READY

      • 0/1 → Pod is not READY

    4. Verify that all the Cassandra pods are up and running. If a Cassandra pod is not in the READY state, note the pod name, then run the following command (substituting the pod name) and capture the logs:

      sudo KUBECONFIG=/etc/kubernetes/admin.conf kubectl logs -f sdkms-xxx-xxxxx 
  3. Run the following command to fetch the details of jobs executed:

    sudo KUBECONFIG=/etc/kubernetes/admin.conf kubectl get jobs 
  4. If Cassandra pods are still not in READY state, run the following command and capture the logs:

    sudo KUBECONFIG=/etc/kubernetes/admin.conf kubectl logs -f cassandra-x 
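The READY-column check in steps 2.2 and 2.3 can be sketched as a small filter. The sample `kubectl get pods` output and the function name `not_ready_pods` are illustrative only:

```shell
# Sketch: print the names of pods whose READY column shows fewer
# ready containers than desired (e.g. 0/1), per the legend above.
not_ready_pods() {
  awk 'NR > 1 { split($2, r, "/"); if (r[1]+0 < r[2]+0) print $1 }'
}

# Sample kubectl get pods output, standing in for a live query:
printf 'NAME            READY   STATUS    RESTARTS   AGE
sdkms-abc-123   1/1     Running   0          2d
cassandra-0     0/1     Running   3          2d\n' | not_ready_pods
```

On a live node, pipe the output of the `kubectl get pods -owide` command shown above into the filter; any pod it prints is a candidate for the log capture in step 2.4.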

3.6 Service Component: API Service

Metric: Status

Threshold: Service state is down

Alert Categorization: Low

Issue Description: This alert indicates a failure in reaching the Fortanix DSM API service.

Recommended Action:

  1. Intermittent failures are usually recoverable because the service automatically restarts. Wait for at least 10 minutes for the alarm to clear.

  2. If the issue persists, contact Fortanix Support with the following information for debugging:

    1. SSH into one of the cluster nodes.

    2. Run the following command to check the status of all pods and capture the output:

      sudo KUBECONFIG=/etc/kubernetes/admin.conf kubectl get pods -owide 
    3. Verify that all pods are in the running state:

      • 1/1 or 2/2 → Pod is READY

      • 0/1 or 1/2 → Pod is not READY

    4. Verify that all the Cassandra pods are up and running. If a Cassandra pod is not in the READY state, note the pod name, then run the following command (substituting the pod name) and capture the logs:

      sudo KUBECONFIG=/etc/kubernetes/admin.conf kubectl logs -f sdkms-xxx-xxxxx 
  3. Run the following command to fetch the details of jobs executed:

    sudo KUBECONFIG=/etc/kubernetes/admin.conf kubectl get jobs 
  4. If Cassandra pods are still not in READY state, run the following command and capture the logs:

    sudo KUBECONFIG=/etc/kubernetes/admin.conf kubectl logs -f cassandra-x 

3.7 Service Component: HTTPS Certification DSM

Metric: Days to Expire

Threshold: 60 days – Warning, 15 days – Critical

Alert Categorization: Low

Issue Description: This alert indicates that the HTTPS certificate for Fortanix DSM is nearing expiration. It serves as a reminder to renew the certificate before the expiry date to avoid service interruptions.

Recommended Action:

  1. Renew the expiring certificate. SSH into any machine in the cluster.

  2. Run the following command to generate a new CSR:

    sudo get_csrs 
  3. After the CSRs are signed by a trusted CA, run the following command to install the renewed certificate:

    sudo install_certs 
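The 60-day warning / 15-day critical thresholds stated above can be sketched as a simple check. The function name `cert_alert_level` is illustrative, not part of the product:

```shell
# Minimal sketch of the certificate-expiry thresholds in this
# section: <= 15 days is critical, <= 60 days is a warning.
cert_alert_level() {
  days_left=$1
  if [ "$days_left" -le 15 ]; then echo critical
  elif [ "$days_left" -le 60 ]; then echo warning
  else echo ok
  fi
}

# On a live node, days_left could be derived from the certificate's
# notAfter date, e.g.: openssl x509 -enddate -noout -in cert.pem
cert_alert_level 30
```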

3.8 Service Component: HTTPS Certification DSM UI

Metric: Days to Expire

Threshold: 60 days – Warning, 15 days – Critical

Alert Categorization: Low

Issue Description: This alert indicates that the HTTPS certificate for the Fortanix DSM UI is nearing expiration. It serves as a reminder to renew the certificate before the expiry date to ensure uninterrupted UI access.

Recommended Action:

  1. Renew the expiring certificate. SSH into any machine in the cluster.

  2. Run the following command to generate a new CSR:

    sudo get_csrs 
  3. After the CSRs are signed by a trusted CA, run the following command to install the renewed certificate:

    sudo install_certs 

3.9 Service Component: Cassandra Cluster

Metric: Schema Status

Threshold: More than 1 bad node → Critical

Alert Categorization: Low

Issue Description: This alert indicates that the Cassandra cluster status is degraded. It reports when connected nodes are in a bad state.

Recommended Action:

  1. Intermittent failures are usually recoverable because services are automatically restarted. Wait for at least 10 minutes for the alarm to clear.

  2. If the cluster state is yellow, it is a warning and does not require immediate action.

  3. If the cluster state is red and does not recover automatically, treat it as critical and contact Fortanix Support. For debugging, collect the following information:

    1. SSH into one of the cluster nodes.

    2. Run the following command to check the status of all pods:

      sudo KUBECONFIG=/etc/kubernetes/admin.conf kubectl get pods -owide 
    3. Verify that all pods are in the running state:

      • 1/1 or 2/2 → Pod is READY

      • 0/1 or 1/2 → Pod is not READY

    4. Verify that all the Cassandra pods are up and running. If a Cassandra pod is not in the READY state, note the pod name, then run the following command (substituting the pod name) and capture the logs:

      sudo KUBECONFIG=/etc/kubernetes/admin.conf kubectl logs -f cassandra-x 
  4. Run the following command (substituting the name of the failing pod) to log in to the Cassandra pod:

    sudo KUBECONFIG=/etc/kubernetes/admin.conf kubectl exec -ti cassandra-0 bash 
  5. From inside the Cassandra pod, run the following command to check the status of all nodes:

    nodetool status
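The `nodetool status` output in step 5 can be scanned for unhealthy nodes with a small sketch. The function name `bad_nodes`, the sample addresses, and the sample table are illustrative only; the UN (Up/Normal) convention is standard `nodetool status` output:

```shell
# Sketch: print the address of every node whose state is not UN
# (Up/Normal). Node lines start with a two-letter state code:
# first letter U(p)/D(own), second N(ormal)/L(eaving)/J(oining)/M(oving).
bad_nodes() {
  awk '/^[UD][NLJM] / && $1 != "UN" { print $2 }'
}

# Sample nodetool status output, standing in for a live query:
printf 'Datacenter: dc1
--  Address    Load     Tokens  Owns  Host ID  Rack
UN  10.0.0.1   1.2 GiB  256     33%%   aaa      rack1
DN  10.0.0.2   1.1 GiB  256     33%%   bbb      rack1\n' | bad_nodes
```

Any address the filter prints corresponds to a node in a bad state; capture it along with the full `nodetool status` output when contacting Fortanix Support.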