Fortanix Data Security Manager Monitoring and Alerting Solution

1.0 Introduction

This article describes the various monitoring and alerting procedures available in Fortanix Data Security Manager (DSM).

It provides the following information about Fortanix DSM:

  • Software Components

  • System Capabilities

  • Deployment

  • Checks and Alerts

2.0 Prerequisites

Download server artifacts from here.

3.0 Technology References

  • Fortanix DSM – Fortanix Data Security Manager

  • KMS – Key Management Service

  • SNMP – Simple Network Management Protocol

  • FQDN – Fully Qualified Domain Name

  • NTP – Network Time Protocol

  • TLS – Transport Layer Security

  • HSM – Hardware Security Module

  • HMG – HSM Management Gateway

4.0 Software Components

Fortanix DSM Monitoring and Alerting Solution includes the following open-source software components:

4.1 Monitoring Client Components

All FX2200 servers have a monitoring agent (sensu-agent) pre-packaged.

4.2 Monitoring Server Components

Alerting is based on “Sensu Go”, and the following components are installed on the monitoring server:

  • Sensu Go backend

  • Sensu assets related to checks and handlers for notification

5.0 Architecture

This solution is delivered as a completely self-contained virtual appliance that customers can install and set up on their own in their deployment environment. It is based on client-server architecture.

The following diagram shows the architecture of the solution:

MonitoringAlerting.png

Figure 1: Fortanix DSM Monitoring and Alerting Solution - Deployment Architecture

Fortanix DSM nodes come pre-installed with a monitoring agent that:

  • Performs various health checks

  • Publishes check results to the defined transport mechanism, which are then received by the Sensu server running in the solution VM

The Fortanix DSM node just needs to be configured to point to the customer’s deployed instance of the Fortanix DSM monitoring and alerting solution.

6.0 System Capabilities

The solution shows checks and alerts data in a web-based UI. Users can connect to the VM deployed in the customer environment to view this information. This dashboard has the following capabilities:

  • Show checks currently configured

  • Show currently active alerts in the system by node

  • Allow users to silence individual alerts/checks based on rules. This is useful when maintenance/upgrade is going on in the cluster.
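Silencing can also be scripted rather than done through the web UI. The following is a minimal sketch of a silencing entry that could be applied with sensuctl create --file silence.yml before a maintenance window; the entity name, expiry, and reason are illustrative, not values shipped with the product:

```yaml
type: Silenced
api_version: core/v2
metadata:
  name: entity:dsm-node-1:*
  namespace: default
spec:
  subscription: entity:dsm-node-1   # silence all checks on this node
  expire: 7200                      # auto-expire after 2 hours (seconds)
  expire_on_resolve: false
  reason: cluster upgrade in progress
```

The auto-expiry avoids a silencing entry being forgotten after the upgrade completes.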

6.1 Alerting Mechanisms

When the system detects an alert, it can deliver notification about the alert by a configured mechanism.

The solution supports the following alerting mechanisms by default:

  • Email: Requires an SMTP email configuration to deliver emails.

  • Slack: Requires a Slack API key to push alerts into Slack.

  • SNMP Trap: Requires SNMP trap receiver information to send traps to.

  • Custom: Based on a shell script that the alerting server invokes to send out alert notifications. This mechanism can be used to invoke any third-party client/executable.
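For the custom mechanism, Sensu pipe handlers receive the alert event as JSON on stdin. The sketch below shows the shape such a script can take; the script name, the sample event, and the one-line log format are illustrative assumptions, and the final line would normally invoke your third-party notifier instead of printing:

```shell
# Hypothetical custom handler: parse the Sensu event JSON from stdin and
# emit one summary line per alert. (python3 is used to avoid a jq dependency.)
cat > /tmp/notify.sh <<'EOF'
#!/bin/sh
python3 -c '
import json, sys
e = json.load(sys.stdin)
entity = e["entity"]["metadata"]["name"]
check  = e["check"]["metadata"]["name"]
status = e["check"]["status"]
print(f"ALERT host={entity} check={check} status={status}")
'
EOF
chmod +x /tmp/notify.sh

# Simulate what the Sensu backend would pipe to the handler:
printf '%s' '{"entity":{"metadata":{"name":"dsm-node-1"}},"check":{"metadata":{"name":"check-cpu"},"status":2}}' \
  | /tmp/notify.sh
# → ALERT host=dsm-node-1 check=check-cpu status=2
```

In a real deployment, the final print would be replaced by a call to the external tool you want to notify with.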

6.2 Adding SNMP Trap Handler on Sensu Monitoring Server

The Sensu handler (sensu-snmp-trap-handler) sends alerts to an SNMP manager using SNMP traps. You should have an SNMP manager/trap receiver on your network to receive these traps.

Please download the artifacts from here.

SNMP handler requirements:

  • SNMP trap receiver FQDN hostname or IP address

  • SNMP trap receiver port (default is UDP port 162)

  • Community string (optional)

The following are the steps to add an SNMP trap handler to the Sensu Monitoring Server:

  1. Copy the handler asset file: sensu-snmp-trap-handler_fortanix_0.2.2_linux_amd64.tar.gz to your web server's document root folder (/var/www/html).

  2. Run the following script to add the SNMP trap handler asset:

    ./add_snmp_assets.sh

    This will prompt you for your Sensu server's web server URL. Enter the URL as http://<sensu server IP> or http://<sensu FQDN>.

  3. Edit the snmp-handler.yml file to add the IP address or FQDN of your SNMP manager/trap receiver. Edit the following line by replacing "SNMP_TRAP_RECEIVER" with the actual value:

    sensu-snmp-trap-handler --host SNMP_TRAP_RECEIVER

    The following additional flags are supported and can be added if needed:

    • --community string: The SNMP community string to use when sending traps (default "public")

    • --port int: The SNMP manager trap port (UDP) (default 162)

    • --version string: The SNMP version to use (1, 2, 2c) (default "2")

  4. Add SNMP trap handler by running the following script:

    ./add_snmp_handlers.sh
  5. MIB files are included here. If needed, copy the MIB files under the folder "mibs" to your SNMP Manager/Trap receiver.
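For reference, a pipe-handler definition of this shape typically looks like the following sketch. The handler name, asset name, host, and timeout here are assumptions for illustration; the snmp-handler.yml shipped in your bundle may differ:

```yaml
type: Handler
api_version: core/v2
metadata:
  name: snmp-trap-handler
  namespace: default
spec:
  type: pipe
  command: sensu-snmp-trap-handler --host 10.0.0.50 --port 162 --community public --version 2
  runtime_assets:
    - sensu-snmp-trap-handler
  timeout: 10
```

Checks reference the handler by its metadata name, so the name chosen here must match what the check definitions use.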

7.0 Deployment

Fortanix DSM Monitoring and Alerting Solution will be delivered as a software bundle that users can deploy on their server or virtual machine.

7.1 Minimum Server Specification

  • 2 CPUs with 2 cores each

  • 8 GB RAM

7.2 List of Required Ports

The following ports need to be accessible on the monitoring server for the web UI and for receiving notifications from the monitoring agents. The values below are defaults and can be changed.

Protocol | Port Number | Purpose
-------- | ----------- | -------
TCP      | 8081        | Receive notifications from agents
TCP      | 3000        | Dashboard Web UI
TCP      | 80          | Asset download from Sensu Web Server

7.3 Setting Up Sensu Server

The procedure described here for setting up the Sensu server requires a Red Hat machine or VM. We have tested with Red Hat 7.6 and 7.8. You will also need to have Apache installed on this machine.

  1. Copy the server artifacts tarball (Monitoring-Server-Artifacts.tgz) onto your designated server/VM.

  2. Untar the tarball using the following command:

    tar zxvf Monitoring-Server-Artifacts.tgz
  3. Go to the folder Monitoring-Server-Artifacts.

  4. Install sensu-backend and sensu-cli packages.

    sudo rpm -i sensu-go-backend-6.11.0-7218.x86_64.rpm
    sudo rpm -i sensu-go-cli-6.11.0-7218.x86_64.rpm
  5. Edit and copy backend.yml file.

    cp backend.yml /etc/sensu
  6. Start sensu-backend service.

    systemctl start sensu-backend
  7. Check the status of sensu-backend service.

    systemctl status sensu-backend
  8. Enable sensu-backend service to start automatically on reboot.

    systemctl enable sensu-backend
  9. Initialize sensu-backend service.

    export SENSU_BACKEND_CLUSTER_ADMIN_USERNAME=
    export SENSU_BACKEND_CLUSTER_ADMIN_PASSWORD=
    sensu-backend init
  10. Configure the command-line tool Sensuctl.

    sensuctl configure -n  --username 'admin' --password 'P@ssw0rd!' --namespace default --url 'http://127.0.0.1:8080'
  11. It is strongly recommended to change the default admin password.

    sensuctl user change-password --interactive
  12. Sensu also creates a default agent user with a password P@ssw0rd! that corresponds to the defaults the Sensu agent uses.
    It is strongly recommended to change the default agent password.

    sensuctl user change-password agent --current-password 'P@ssw0rd!' --new-password fortanix
  13. If you have a web server of your own, then copy the “asset tar” files (from the folder sensu-assets) to the “document root” folder of your web server so it can be fetched by Fortanix servers.

    NOTE

    If the assets are being installed on a TLS-enabled web server, then install the web server CA root and the intermediate certificates in the trust store of both your Sensu systems and the DSM nodes using the following commands:

    sudo apt-get install ca-certificates -y
    sudo ln -sfv /etc/sensu/tls/ca.pem /usr/local/share/ca-certificates/sensu-ca.crt
    sudo update-ca-certificates
  14. If you do not have a web server of your own, then start the included web server.

    cd sensu-assets
    sudo ./web_server&
    cd ..
  15. Create assets.

    ./add_assets.sh
  16. Create checks.

    ./add_checks.sh
  17. Create handlers.

    ./add_handlers.sh
  18. Go to the Sensu dashboard and verify that all checks are present and you can log in.

    http://<Sensu Server IP Address>:3000
  19. If you want to use TLS to secure communication between the agent and server, make the following changes now.

    • Copy the TLS certificate, key, and CA certificate file in /etc/sensu.

    • Change following in backend.yml

      1. api-url – change the prefix from http to https.

        api-url: "https://localhost:8080"
      2. ssl configuration section – set the following lines (change the file name based on your files)

        cert-file: "/etc/sensu/cert.pem"
        key-file: "/etc/sensu/key.pem"
        trusted-ca-file: "/etc/sensu/ca.pem"
        insecure-skip-tls-verify: true
  20. Restart the sensu-backend.

  21. Access the Sensu dashboard using https://<Sensu Server IP Address>:3000. To learn how to integrate Splunk with an existing Sensu server, refer to the article Splunk with Sensu Server Integration.

    NOTE

    If you are unable to access the dashboard, please make sure that port 3000 is not blocked by a firewall.

7.4 Set Up Active Directory/LDAP Authentication on the Sensu Server

This section describes the procedure for setting up Active Directory/LDAP authentication on the Sensu server.

  1. Create a new file.

    vi ad.yml

    The following are the contents of the file:

    type: ad
    api_version: authentication/v2
    metadata:
      name: ActiveDirectory
    spec:
      groups_prefix: ad
      servers:
      - binding:
          password: <bind account password>
          user_dn: cn=<bindaccount>,ou=<group>,dc=<domain>,dc=com
        default_upn_domain: <domain.com>
        include_nested_groups: true
        host: <domain controller FQDN>
        insecure: true
        port: 636
        security: tls
        trusted_ca_file: /etc/ssl/certs/downstairs-root-ca.pem
        user_search:
          attribute: sAMAccountName
          base_dn: <DN for root of search>
          name_attribute: displayName
          object_class: user
          group_search:
            attribute: member
            base_dn: ou=groups,dc=downstairs,dc=com
            name_attribute: cn 
            object_class: group
        username_prefix: ad
  2. Create the auth resource.

    sensuctl create --file /location/ad.yml
  3. Verify that the auth resource was created successfully.

    sensuctl auth list
    1. Log in with a user that falls within the search root.

    2. The user will be able to log in but will not see any namespaces or other resources until a role binding is created.

  4. Either stop the sensu-backend service and run it without systemd to watch real-time interactions, or, for troubleshooting, execute the following:

    journalctl -xe | grep sensu
  5. Create a resource role that determines permissions.

    sensuctl role create djuser --namespace sdkms --resource=checks,entities,events --verb=get,list
  6. Create a binding between a group and a role.

    sensuctl role-binding create djuser --role=djuser --group=ad:sensu --namespace sdkms
  7. List roles.

    sensuctl role list

7.5 Setup on Fortanix Servers

Run the following on each Fortanix Server:

  1. Install the Fortanix DSM Monitoring package.

    sudo apt-get install sdkms-monitoring
  2. Copy the file /opt/fortanix/sdkms/monitoring/agent.yml to some location, and edit it to point to your Sensu server VM. Change the following line:

    backend-url:
      - "ws://<YOUR SERVER IP ADDRESS>:8081"
  3. If you want to use TLS to secure communication between the agent and the server, then do the following:

    1. Copy the CA file for the TLS certificate being used by the Sensu server to /etc/sensu folder.

    2. Set the following lines (change file name based on your files)

      trusted-ca-file: "/etc/sensu/ca.pem"
    3. For backend-url use the protocol prefix “wss” instead of “ws”.

    4. If the certificate is self-signed and its root CA is not on the Fortanix servers, then add the following line:

      insecure-skip-tls-verify: true
  4. Copy the edited agent.yml file to the /etc/sensu folder.

    sudo cp agent.yml /etc/sensu/
  5. Start and enable sensu-agent service to start automatically on reboot.

    sudo systemctl daemon-reload
    sudo systemctl start sensu-agent
    sudo systemctl enable sensu-agent
  6. Check the status of sensu-agent service.

    sudo systemctl status sensu-agent

8.0 Checks and Alerts

The checks and alerts are designed to check the health of each node (server) in the Fortanix DSM cluster and the services that each node runs. By default, the solution performs the checks and alerts listed below. This set is easily extensible and configurable: based on customer needs, additional checks can be added, and alert thresholds and intervals can be customized. The following is the list of currently supported checks and alerts, along with the recommended actions when an alert is triggered.

8.1 System Component: CPU

Metric: Temperature

Threshold: Warning & Critical

Alert Categorization: Low

Issue Description: This alert indicates environmental issues in the data center resulting in non-ambient temperature for appliances.

Recommended Action:

  1. Check data center environmental controls.

  2. If the data center temperature setting is okay, escalate to Fortanix support. 

8.2 System Component: Memory

Metric: Utilization

Threshold: 80% Warning & 90% Critical

Alert Categorization: Low

Issue Description: These alerts mean that the memory utilization on the host has reached its limits. It is indicative of a high workload, and if this stays for an extended period, this indicates capacity expansion is required.

Recommended Action:

  1. This is not necessarily an indication of failure, but indicative of high requests from clients. Please wait for at least 15 minutes to allow the temporary workloads to be completed.

  2. Verify the process with high memory usage: ps aux | sort -nrk 4,4 | head -n 3

  3. If the above output contains CassandraDaemon, Elasticsearch, or /root/enclave-runner /root/backend.sgxs, then the issue is due to high traffic. Otherwise, note the output and escalate to Fortanix support.

    1. If due to high traffic, the alarm appears only for a few hosts, then it indicates suboptimal load balancing. Please check with Fortanix support.

    2. If the alarm appears on many hosts, please note the client using Audit logs. Log in to the Fortanix DSM UI, and note the Application performing high transactions. Please notify the client team owning the App, to verify if this is non-standard traffic and to bring this down if possible. If this is expected traffic, then escalate to Fortanix support for capacity addition.

8.3 System Component: Disk

Metric: Space utilization

Threshold: 80% Warning & 90% Critical

Alert Categorization: Low

Issue Description: These alerts indicate that the disk utilization on the host has reached its limits, which means a purge of old data is required.

Recommended Action:

  1. This is indicative of:

    1. Cassandra data approaching its disk space limits.

  2. Check the exact cause:

    1. For Cassandra disk usage, run the command:

      du -sh /data/cassandra
  3. If any of the above outputs show that the disk usage is very high (hundreds of GBs), it means that the old data needs to be purged.

    1. Stale accounts or keys: Please identify accounts and keys not being used recently, and then delete them from the Fortanix DSM UI.

  4. If none of the above methods work, then the cause might be due to some unaccounted log file taking space. Escalate to Fortanix support for the correct identification of this log file and remediation.

8.4 System Component: NTP

Metric: sync offset, stratum & unsynced

Threshold: 20 ms offset – Warning, 200 ms offset – Critical, stratum > 15, or NTP not synced

Alert Categorization: Low

Issue Description: This alert indicates a possible failure to reach the external NTP server, which is very important for database synchronization.

Recommended Action:

  1. Verify that the network link to the NTP servers is up using ping. If network connectivity is fine, then it could indicate a failure in the service (the Network Time Protocol daemon (ntpd) crashed).

  2. To verify, run the command:

    1. ntpq -p: The output should show a correct sync state and reachability. (The first row should be marked with ‘*’ to indicate the sync connection; if none of the rows are marked, the sync is not set.)

  3. To fix the sync, run:

    1. sudo service ntp restart. This activity will be performed by E & T after the initial troubleshooting performed by the SPS team.

8.5 System Component: SDKMS REST API

  • port-check-external for sdkms-rest-api 443

  • sdkms-kmip-api 5696

  • sdkms-ui-nginx 4445

  • sdkms-proxy 4445

  • sdkms-server 443

Metric: reachability

Threshold: if not reachable

Alert Categorization: Low

Issue Description: This alert indicates service reachability failures.

Recommended Action:

  1. Usually, intermittent failures will be recoverable, and the service will be automatically restarted. Hence, wait for at least 10 minutes for the alarm to recover.

  2. If you are unable to recover from the failure, then please escalate to Fortanix support with the following information for easy debugging:

    1. ssh into one of the nodes of the cluster.

    2. Run the following command to get the status of all the pods and copy the output:

      sudo KUBECONFIG=/etc/kubernetes/admin.conf kubectl get pods -owide
    3. From the above output, verify that all the Fortanix DSM pods are in the running state (usually 0/1 if not READY and 1/1 if READY). Verify specifically that all the Cassandra pods are up and running. If a Cassandra pod is not in the READY state, note down the Cassandra pod name, run the following command, and capture the output:

      sudo KUBECONFIG=/etc/kubernetes/admin.conf kubectl logs -f sdkms-xxx-xxxxx
    4. Get the list of jobs that ran and collect their status using the following command:

      sudo KUBECONFIG=/etc/kubernetes/admin.conf kubectl get jobs
    5. From the command run in step 2, verify that all the Cassandra pods are up and running (usually 0/1 if not READY and 1/1 if READY). If a Cassandra pod is not in the READY state, then note down the Cassandra pod name and run the following command:

      sudo KUBECONFIG=/etc/kubernetes/admin.conf kubectl logs -f cassandra-x

8.6 System Component: api-service

Metric: status

Threshold: if the state is down

Alert Categorization: Low

Issue Description: This alert indicates Fortanix DSM API service reachability failures.

Recommended Action:

  1. Usually, intermittent failures will be recoverable, and the service will be automatically restarted. Wait for at least 10 minutes for the alarm to recover.

  2. If it is not recovered, please escalate to Fortanix support with the following information for easy debugging.

    1. ssh into one of the nodes of the cluster.

    2. Run the following command to get the status of all the pods and copy the output of the command:

      sudo KUBECONFIG=/etc/kubernetes/admin.conf kubectl get pods -owide
    3. From the output above, verify that all the pods are in the running state (usually 0/1 or 1/2 if not READY and 1/1 or 2/2 if READY). Verify specifically that all the Cassandra pods are up and running. If a Cassandra pod is not in the READY state, note down the Cassandra pod name, run the following command, and capture the output:

      sudo KUBECONFIG=/etc/kubernetes/admin.conf kubectl logs -f sdkms-xxx-xxxxx
  3. Get the list of completed jobs and collect their status using the following command:

    KUBECONFIG=/etc/kubernetes/admin.conf kubectl get jobs
  4. From the command run in step 2, verify that all the Cassandra pods are up and running (usually 0/1 if not READY and 1/1 if READY). If a Cassandra pod is not in the READY state, then note down the Cassandra pod name and run the following command:

    sudo KUBECONFIG=/etc/kubernetes/admin.conf kubectl logs -f cassandra-x

8.7 System Component: Fortanix DSM-https-certificate-dsm

Metric: days-to-expire

Threshold: 60 days – Warning & 15 days – Critical

Alert Categorization: Low

Issue Description: This alert indicates certificate expiration in the near future and acts as a reminder for certificate renewal.

Recommended Action:

  1. Renew the certificates. SSH into any of the machines in the cluster.

  2. Generate a new CSR by running the command sudo get_csrs and install the new certificates using sudo install_certs.
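The days-to-expire figure reported by this check can be cross-checked manually with openssl. The sketch below computes it against a locally generated self-signed certificate so the commands are self-contained; against a live cluster you would instead fetch the certificate with openssl s_client -connect <host>:443 (the file names and CN here are illustrative):

```shell
# Generate a throwaway certificate valid for 90 days (illustrative only).
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/key.pem -out /tmp/cert.pem \
  -days 90 -subj "/CN=dsm.example.com" 2>/dev/null

# Extract the notAfter date and convert it to days-to-expire.
# (date -d is GNU date, as found on the Linux hosts this document targets.)
end=$(openssl x509 -in /tmp/cert.pem -noout -enddate | cut -d= -f2)
days=$(( ( $(date -d "$end" +%s) - $(date +%s) ) / 86400 ))
echo "days-to-expire: $days"
```

Comparing this number against the 60-day and 15-day thresholds reproduces the check's Warning/Critical decision by hand.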

8.8 System Component: Fortanix DSM-https-certificate-dsm-ui

Metric: days-to-expire

Threshold: 60 days – Warning & 15 days – Critical

Alert Categorization: Low

Issue Description: This alert indicates certificate expiration in the near future and acts as a reminder for certificate renewal.

Recommended Action:

  1. Renew the certificates. SSH into any of the machines in the cluster.

  2. Generate a new CSR by running the command sudo get_csrs and install the new certificates using sudo install_certs.

8.9 System Component: Fortanix DSM-Cassandra-cluster

Metric: schema-status

Threshold: Bad nodes > 1 – Critical

Alert Categorization: Low

Issue Description: This alert indicates that the Cassandra cluster status is Bad for the connected nodes.

Recommended Action:

  1. Usually, intermittent failures will be recoverable, and the service will be automatically restarted. Hence wait for at least 10 minutes for the alarm to recover.

  2. If the cluster state is reported as yellow, then it is just a warning and no immediate action is required.

  3. If the cluster state is reported as red and does not recover automatically, then it is critical and escalation to Fortanix support is required. When the cluster state is red, the following steps help in debugging the problem.

    1. ssh into one of the nodes of the cluster.

    2. Run the following command to get the status of all the pods, and copy the output:

      sudo KUBECONFIG=/etc/kubernetes/admin.conf kubectl get pods -owide
    3. From the output above, verify that all the pods are in the running state (usually 0/1 or 1/2 if not READY and 1/1 or 2/2 if READY). Verify specifically that all the Cassandra pods are up and running. If a Cassandra pod is not in the READY state, note down the Cassandra pod name, run the following command, and capture the output:

      sudo KUBECONFIG=/etc/kubernetes/admin.conf kubectl logs -f cassandra-x
    4. Go into the Cassandra pod that is failing by running the command:

      sudo KUBECONFIG=/etc/kubernetes/admin.conf kubectl exec -ti cassandra-0 bash
    5. On the Cassandra pod terminal, run the following command to get a list of nodes and capture the output:

      nodetool status

9.0 Monitoring and Alerting HSM Management Gateway

This section describes the process of adding checks on the Sensu monitoring server to monitor HSM gateway instances.

These checks are run on the Sensu server using an agent installed on the Sensu server.

9.1 Setting Up Sensu Agent on Sensu Server

If you do not have a Sensu agent installed on your Sensu server, then follow the instructions in this section to set up a Sensu agent on the Sensu server for running checks for the HSM gateway instances. If you already have the Sensu agent set up, then skip step 1 below.

  1. Download and install the Sensu agent on the Sensu server.

    1. Download the file sensu-go-agent-6.11.0-7218.x86_64.rpm from the URL http://sensu.io/downloads.

    2. Install the Sensu go agent using the following command:

      sudo yum install sensu-go-agent-6.11.0-7218.x86_64.rpm
    3. Copy the following content to the file /etc/sensu/agent.yml and edit it as explained below:

      # Sensu agent configuration
      ##
      # agent overview
      ##
      #name: "hostname"
      #namespace: "default"
      subscriptions:
        - sdkms-monitoring
      #labels:
      #  example_key: "example value"
      #annotations:
      #  example/key: "example value"

      ##
      # agent configuration
      ##
      backend-url:
        - "wss://127.0.0.1:8081"
      #cache-dir: "/var/cache/sensu/sensu-agent"
      #config-file: "/etc/sensu/agent.yml"
      #log-level: "warn" # available log levels: panic, fatal, error, warn, info, debug

      ##
      # api configuration
      ##
      #api-host: "127.0.0.1"
      #api-port: 3031
      #disable-api: false
      #events-burst-limit: 10
      #events-rate-limit: 10.0

      ##
      # authentication configuration
      ##
      user: "USER_AGENT_NAME"
      password: "AGENT_PASSWORD"

      ##
      # monitoring configuration
      ##
      #deregister: false
      #deregistration-handler: "example_handler"
      #keepalive-timeout: 120
      #keepalive-interval: 20

      ##
      # security configuration
      ##
      insecure-skip-tls-verify: true
      #redact:
      #  - password
      #  - passwd
      #  - pass
      #  - api_key
      #  - api_token
      #  - access_key
      #  - secret_key
      #  - private_key
      #  - secret
      trusted-ca-file: "/etc/sensu/ssl/ca.pem"

      ##
      # socket configuration
      ##
      #disable-sockets: false
      #socket-host: "127.0.0.1"
      #socket-port: 3030

      ##
      # statsd configuration
      ##
      #statsd-disable: false
      #statsd-event-handlers:
      #  - example_handler
      #statsd-flush-interval: 10
      #statsd-metrics-host: "127.0.0.1"
      #statsd-metrics-port: 8125

      Edit the following lines:

      • backend-url - If not using TLS, then change the value to "ws://127.0.0.1:8081".

      • trusted-ca-file - If not using TLS, comment it out. If using TLS, make sure the path points to the CA certificate.

      • Replace USER_AGENT_NAME and AGENT_PASSWORD with appropriate values for your setup.

  2. Download and copy the following asset files into your web server's document root folder (for example, /var/www/html).

    • https://assets.bonsai.sensu.io/a2115474fe198f3895b953f6d90de86607f33722/sensu-plugins-network-checks_5.0.0_centos7_linux_amd64.tar.gz

    • https://assets.bonsai.sensu.io/102d2b8c9dc264b98fa7973bf7657e9216bfb0a8/sensu-ruby-runtime_0.1.0_ruby-2.4.4_centos7_linux_amd64.tar.gz

  3. Create assets.

    1. Create a file sdkms-monitoring-rhel-assets.yml with the following content and replace "SENSU_SERVER_IP" with your server's IP address or FQDN.

      type: Asset
      api_version: core/v2
      metadata:
        annotations:
          io.sensu.bonsai.api_url: https://bonsai.sensu.io/api/v1/assets/sensu-plugins/sensu-plugins-network-checks
          io.sensu.bonsai.name: sensu-plugins-network-checks
          io.sensu.bonsai.namespace: sensu-plugins
          io.sensu.bonsai.tags: ruby-runtime-2.4.4
          io.sensu.bonsai.tier: Community
          io.sensu.bonsai.url: https://bonsai.sensu.io/assets/sensu-plugins/sensu-plugins-network-checks
          io.sensu.bonsai.version: 5.0.0
        name: sensu-plugins-network-check-rhel
        namespace: default
      spec:
        builds: null
        filters:
        - entity.system.os == 'linux'
        - entity.system.arch == 'amd64'
        - entity.system.platform_family == 'rhel'
        headers: null
        sha512: f0a229918245d2156fcc34e272cb351d09f3d7ee79057cccaa88121d837723951c816593104ff959528b0dec7f18901b6735f7b7cf765ddcce85c6fdbb559378
        url: http://SENSU_SERVER_IP/sensu-plugins-network-checks_5.0.0_centos7_linux_amd64.tar.gz
      ---
      type: Asset
      api_version: core/v2
      metadata:
        annotations:
          io.sensu.bonsai.api_url: https://bonsai.sensu.io/api/v1/assets/sensu/sensu-ruby-runtime
          io.sensu.bonsai.name: sensu-ruby-runtime
          io.sensu.bonsai.namespace: sensu
          io.sensu.bonsai.tags: ""
          io.sensu.bonsai.tier: Community
          io.sensu.bonsai.url: https://bonsai.sensu.io/assets/sensu/sensu-ruby-runtime
          io.sensu.bonsai.version: 0.1.0
        name: sensu-ruby-runtime-rhel
        namespace: default
      spec:
        builds: null
        filters:
        - entity.system.os == 'linux'
        - entity.system.arch == 'amd64'
        - entity.system.platform_family == 'rhel'
        headers: null
        sha512: 2d7800432f90625a02aec4a10b084bc72e253572970694e932b5ccdc72fb30f5cf91ed4b51f90942965df5228e521b8f5f06da3d52b886b172ba08d4130251dc
        url: http://SENSU_SERVER_IP/sensu-ruby-runtime_0.1.0_ruby-2.4.4_centos7_linux_amd64.tar.gz
    2. Run the following command:

      sensuctl create --file sdkms-monitoring-rhel-assets.yml
  4. Add checks.

    1. Create a file sdkms-hmg-monitoring-checks.yml with the following content.

      type: Check
      api_version: core/v2
      metadata:
        name: sdkms-hmg-check
        namespace: default
      spec: 
        check_hooks: null
        command: 'check-port.rb -H HMG_IP -p 4442'
        env_vars: null
        executed: 0
        handlers:
        - YOUR_HANDLER_NAME
        high_flap_threshold: 0
        history: null
        interval: 60
        issued: 0
        last_ok: 0
        low_flap_threshold: 0
        occurrences: 0
        occurrences_watermark: 0
        output: ""
        output_metric_format: ""
        output_metric_handlers: null
        proxy_entity_name: ""
        publish: true
        round_robin: false
        runtime_assets:
        - sensu-plugins-network-check-rhel
        - sensu-ruby-runtime-rhel
        status: 0
        stdin: false
        subdue: null
        subscriptions:
        - sdkms-monitoring
        timeout: 0
        total_state_change: 0
        ttl: 0

      In the above content:

      • Replace "HMG_IP" with the FQDN or IP addresses of your HMG instance. For more than one HMG IP address you can specify them as comma-separated values.

      • The default value of the HMG port is 4442. This is set up in the check definition file. If your HMG instance is running on a different port, then replace the port number.

      • Replace “YOUR_HANDLER_NAME” with the name of the handler in your environment. To check the handler name run the following command:

        sensuctl handler list
    2. Run the following command using the file created in step a.

      sensuctl create --file sdkms-hmg-monitoring-checks.yml
    3. List the checks to verify the new checks have been added.

      sensuctl check list
  5. Start the agent using the following commands.

    sudo systemctl daemon-reload
    sudo systemctl enable sensu-agent
    sudo systemctl start sensu-agent
    sudo systemctl status sensu-agent

10.0 Fortanix Data Security Manager Metrics

Starting with Fortanix DSM version 3.21, metrics data is available in Prometheus format, which can be scraped by a Prometheus server and visualized using a tool like Grafana.

In Fortanix DSM version 3.21, we provide two categories of metrics time series data on each node. Currently, each category publishes the following metrics data. Later versions will add more data in each category.

  • Node Metrics

    • CPU Usage

    • Load Average

    • Memory usage

    • Disk I/O statistics

    • Filesystem usage

    • Network usage

  • DSM Metrics

    • Number of active connections

      • Public (port 443)

      • KMIP (port 5696)

      • Internal Admin (port 4444)

    • Logging Backlog queue length

      • Elasticsearch

      • Splunk

      • Other log integrations

10.1 DSM Monitoring Package Installation

If the Fortanix DSM monitoring package has not been installed yet, then install the monitoring package by running the following command on each Fortanix DSM node.

sudo apt-get install sdkms-monitoring

If this package was already installed before upgrading to 3.21, then this step is not required.

10.2 Setup

To enable publishing metrics information, you need to enable and run a few services. Use the following commands to enable and start these services:

sudo cp /opt/fortanix/sdkms/monitoring/node_exporter.default /etc/default/node_exporter
sudo cp /opt/fortanix/sdkms/monitoring/sdkms_exporter.default /etc/default/sdkms_exporter

sudo systemctl enable node-exporter
sudo systemctl start node-exporter
sudo systemctl status node-exporter

sudo systemctl enable sdkms-metrics
sudo systemctl start sdkms-metrics
sudo systemctl status sdkms-metrics

10.3 Configuring TLS for Metrics Collection

If you want to use TLS on your metrics collection endpoint, you can configure the “sdkms-metrics” service to use TLS. Follow the instructions below to set up TLS:

Obtain the TLS private key and certificate that you will use for this service and save the files in any location. Both the certificate and the private key must be in PEM format. We recommend storing them in the folder “/opt/fortanix/sdkms/monitoring/”.
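If you only want to test the TLS setup before provisioning a CA-issued certificate, you can generate a self-signed pair with OpenSSL. This is a sketch for testing only; the file names and the CN value are placeholders, and for production you should use a certificate issued by your CA:

```shell
# Generate a self-signed certificate and private key in PEM format for
# testing. The file names and the CN are examples only; adjust them and
# then copy the files to /opt/fortanix/sdkms/monitoring/.
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -subj "/CN=dsm-node.example.com" \
  -keyout metrics.key -out metrics.crt
```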

Edit the service file “/etc/systemd/system/sdkms-metrics.service” to change the “ExecStart” as follows.

ExecStart=/opt/fortanix/sdkms/monitoring/exporter_exporter \
  -config.file /opt/fortanix/sdkms/monitoring/sdkms_exporter.yml \
  -web.tls.cert /opt/fortanix/sdkms/monitoring/CERT_FILENAME \
  -web.tls.key /opt/fortanix/sdkms/monitoring/KEY_FILENAME \
  -web.tls.listen-address :9998

NOTE

  • Replace CERT_FILENAME and KEY_FILENAME with the name of the file where you stored the certificate and private key respectively.

  • In the example above we have used port 9998. You can change it to any port number you want.

After editing the service file, reload the systemd configuration and restart the service:

sudo systemctl daemon-reload
sudo systemctl restart sdkms-metrics.service

10.4 Metrics Collection Endpoints

Metrics data is published on the following endpoints by default.

NOTE

If you are using TLS, change the endpoint URL to use “https” and the corresponding port number.

10.4.1 Node Metrics

http://NODE_IP_ADDRESS:9999/proxy?module=node 
Example Data

This endpoint serves the standard system metrics from the Prometheus Node Exporter.

10.4.2 DSM Metrics

http://NODE_IP_ADDRESS:9999/proxy?module=sdkms 
Example Data

# HELP es_backlog Number of pending ES documents
# TYPE es_backlog gauge
es_backlog 0
# HELP other_log_integrations Number of pending audit logs
# TYPE other_log_integrations gauge
other_log_integrations 0
# HELP kmip_connections Number of active kmip connections
# TYPE kmip_connections gauge
kmip_connections 0
# HELP splunk_queue_len Number of pending Splunk log events
# TYPE splunk_queue_len gauge
splunk_queue_len 0
# HELP splunk_pending_logs Number of pending Splunk logs
# TYPE splunk_pending_logs gauge
splunk_pending_logs 0
# HELP admin_connections Number of active admin connections
# TYPE admin_connections gauge
admin_connections 1
# HELP public_connections Number of active public connections
# TYPE public_connections gauge
public_connections 1
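Because the exposition format is plain text, a single gauge is easy to spot-check from the command line. A minimal sketch (the metric text is inlined here for illustration; in practice you would pipe the output of curl against the endpoint instead):

```shell
# Extract the value of one gauge from Prometheus text exposition output.
# Comment lines begin with '#', so matching on the first field skips them.
metrics='# HELP public_connections Number of active public connections
# TYPE public_connections gauge
public_connections 1'

echo "$metrics" | awk '$1 == "public_connections" { print $2 }'   # prints 1
```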

10.5 Prometheus Configuration

You can add jobs to your existing Prometheus configuration to collect these metrics. Here is an example of how to add jobs to scrape metrics from Fortanix DSM nodes; fill in the targets based on your deployment.

- job_name: 'node_metrics'
  scrape_interval: 300s
  metrics_path: /proxy
  params:
    module:
      - node
  static_configs:
    - targets: ['NODE1_IP:9999']
    - targets: ['NODE2_IP:9999']
    - targets: ['NODE3_IP:9999']

- job_name: 'sdkms_metrics'
  scrape_interval: 60s
  metrics_path: /proxy
  params:
    module:
      - sdkms
  static_configs:
    - targets: ['NODE1_IP:9999']
    - targets: ['NODE2_IP:9999']
    - targets: ['NODE3_IP:9999']
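For reference, Prometheus combines metrics_path and params into the final scrape URL, which matches the endpoints listed in section 10.4. A small sketch of the URL produced for the sdkms_metrics job (NODE1_IP is the placeholder target from the job above):

```shell
# Build the effective scrape URL Prometheus uses for one target of the
# sdkms_metrics job: metrics_path plus the module query parameter.
node="NODE1_IP"    # placeholder target from static_configs
module="sdkms"     # value from the job's params section
echo "http://${node}:9999/proxy?module=${module}"
# prints: http://NODE1_IP:9999/proxy?module=sdkms
```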

10.6 Visualization

If you are using a Prometheus server to collect these metrics, you can use Grafana to visualize the data.

For node metrics, use the “Node Exporter” dashboard to visualize the data and customize it as needed.

Here is an example dashboard using Grafana Node Exporter.

Visualization.png

Figure 2: Visualization

For Fortanix DSM metrics, you can create your own dashboard using the collected data. For example, to visualize the number of active connections, you can use the “public_connections” metric and create a dashboard as shown below:

Dashboard.png

Figure 3: Dashboard
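As a starting point for such a panel, the PromQL queries might look like the following. This is a sketch; the metric name comes from the example data in section 10.4.2, and the label names depend on your Prometheus scrape configuration:

```promql
# Total active public connections across all DSM nodes
sum(public_connections)

# Per-node breakdown, grouped by scrape target
sum by (instance) (public_connections)
```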