Fortanix Data Security Manager Cluster Recovery Best Practices

1.0 Introduction

This article outlines best practices and considerations for various rollback and recovery scenarios in case issues occur after a Fortanix DSM cluster upgrade.

The article is intended to be used by technical stakeholders of Fortanix DSM who will be responsible for setting up and managing Fortanix DSM clusters.

2.0 Best Practices

This section outlines recommended best practices for various aspects of Fortanix DSM cluster operations.

2.1 Version Management and Deployment

To ensure controlled and stable deployments, Fortanix recommends the following version management practices:

  • Maintain the (N–1) DSM version in the production cluster while deploying version N in non-production environments.

  • This approach allows customers to monitor stability before rolling out changes to production.

2.2 Use of Dedicated DR Nodes for Rollback

Most Fortanix DSM cluster deployments use one or more DR nodes for disaster recovery. The use of a DR node is described in the Fortanix DSM Backup Guide. A DR node runs the same version as the production cluster, joins the cluster, and then leaves. It is upgraded along with the production cluster.

Fortanix strongly recommends dedicating at least one DR node specifically for rollback purposes. This rollback-dedicated DR node should continue to run the (N–1) version while the production cluster is being upgraded to version N. It should follow the same process as other DR nodes, that is, it must join the cluster and then leave.

This rollback-dedicated DR node can be used to bootstrap the cluster build-up during a rollback. Using this node significantly improves the Recovery Time Objective (RTO), as it allows for quickly creating a single-node cluster with the older version, enabling applications to resume using DSM while the cluster is being expanded.

2.3 Software Upgrade

To ensure a smooth upgrade and minimize potential risks, follow these best practices before and during the Fortanix DSM software upgrade process:

  • Perform a backup before any upgrade. This article refers to it as the "checkpoint backup." It serves as the restore point if a rollback is needed. For more information, refer to the Backup Guide.

  • Define a time window during which rollback is expected to be possible. During this period, Fortanix recommends the following operational guidelines:

    • Treat this time as a change freeze or at least minimize changes to the system. This includes creating new objects or updating existing ones.

    • If you must create new security objects during this period, create them with “export” permission so that they can be exported before a rollback and are not lost.

The process for creating security objects with “export” permission depends on the use case and application.

  • For PKCS#11-based applications, such as Oracle TDE, this can be done by specifying the following in the PKCS#11 configuration file. This ensures that the EXPORT key operation is always included during key creation using PKCS#11. For more information, refer to the PKCS#11 Guide:

    add_key_ops_override = "EXPORT"

  • For CNG-based applications, such as Microsoft SQL TDE, ensure that the EXPORT key operation is explicitly enabled when generating or importing a key using CNG/EKM/CSP. This can be achieved by executing the following command to override the default key operation restrictions:

    FortanixKmsClientConfig.exe machine --add-key-ops-override EXPORT
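
Before the rollback window opens, it can help to confirm that the PKCS#11 override is actually present on each client. The following is a minimal sketch, not a Fortanix-provided tool: the function name `has_export_override` is hypothetical, and the configuration file path must be supplied by the caller because this article does not state where the file lives.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) if the given PKCS#11
# configuration file contains an active add_key_ops_override setting
# whose value includes EXPORT; fails (exit status 1) otherwise.
has_export_override() {
    conf_file="$1"
    # An unreadable or missing file counts as "override not present".
    [ -r "$conf_file" ] || return 1
    # The pattern is anchored to the setting name, so lines that begin
    # with another character (for example a leading '#') do not match.
    grep -Eq '^[[:space:]]*add_key_ops_override[[:space:]]*=.*EXPORT' "$conf_file"
}
```

A possible invocation is `has_export_override /path/to/pkcs11.conf && echo "EXPORT override present"`, where the path is a placeholder.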

3.0 Upgrade and Rollback Considerations

Major Fortanix DSM software upgrades, such as the transition from DSM software version 4.27 to 4.34, introduce multiple updates to Kubernetes, system software, and the underlying kernel. As a result, an in-place rollback to the previous stable version is not entirely seamless.

If an issue is discovered after the upgrade has been completed, and new groups, keys, operations, and logs have been added, rolling back to the last stable configuration can pose significant challenges. A direct rollback may result in data inconsistencies and the potential loss of newly created objects.

To mitigate such failure scenarios, Fortanix recommends the following approach:

3.1 Preventive Measures

Proactive planning can significantly reduce the likelihood and impact of upgrade-related failures.

3.1.1 Thorough Testing in Non-Production

Perform extensive testing in a non-production environment over an extended period before upgrading the production cluster.

3.1.2 Backup Strategy and Monitoring

  • Maintain a continuous backup strategy to ensure system recovery options are available.

  • Implement robust monitoring and alerting mechanisms to detect potential issues early.

3.1.3 Issue Resolution on the Existing Version

If an issue arises with the upgraded version, the preferred approach is to determine whether it can be resolved with a patch or configuration change, without requiring a rollback. This is the least disruptive option and helps minimize downtime.

  • Work with the Fortanix engineering team to diagnose and resolve the issue without rolling back.

  • Apply targeted patches or configuration changes to stabilize the environment when possible.

4.0 Standard Operating Procedure (SOP)

The following SOP outlines the recommended steps for diagnosing and resolving issues through patch deployment, without initiating a rollback:

  • Identify and Assess the Issue

    • Confirm the issue details and their impact after the upgrade.

    • Determine whether a targeted patch can resolve the issue without requiring a rollback.

  • Develop and Test the Patch

    • Create a patch that addresses only the specific issue.

    • Validate the patch in a test or staging environment.

  • Backup and Deploy the Patch

    • Back up critical data and configurations.

    • Deploy the patch version without performing a rollback.

  • Monitor and Validate Post-Patch

    • Verify that the issue is resolved.

    • Check for any unintended side effects or regressions.

  • Closeout and Review

    • Communicate the resolution to stakeholders.

    • Provide a root cause analysis (RCA) and document lessons learned.

4.1 Operating Procedures with Dedicated Rollback DR Node

If a rollback-dedicated DR node is available and meets the criteria described in Section 2.2 (it still runs the N–1 version and has joined and then left the cluster), follow these steps:

  1. Create a cluster using this node.

  2. Perform data recovery as described in the Data Recovery Procedure section.

  3. Expand the cluster by adding new nodes. For more information, see the Adding Nodes in DSM Cluster section of the Fortanix DSM Installation Guide.

5.0 Rollback When There Is No Operating System Change

In cases where the Fortanix DSM software upgrade does not involve an Operating System (OS) change, rollback requires rebuilding the cluster with the N–1 version. An upgrade from Ubuntu 20.04 to Ubuntu 24.04 is an example of an OS change, which is covered in Section 6.0. Check the release notes or contact Fortanix Support to determine whether an OS change was involved in your upgrade scenario.

5.1 Standard Operating Procedure

The following is the SOP for rollback scenarios without OS changes:

  1. Perform a cluster reset by running the following command on all nodes:

    sudo sdkms-cluster reset --delete-data

  2. Uninstall the current Fortanix DSM version from all nodes by running the following command:

    sudo /opt/fortanix/bin/sdkms_cleanup.sh

  3. Install the previous (N–1) DSM version on all nodes and create the cluster. For more information, refer to the Fortanix DSM On-Premises Installation Guide.

  4. Perform data recovery as described in the Data Recovery Procedure section.
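
Because the reset in Step 1 deletes data on every node, it may be worth previewing the exact commands before running them. The sketch below wraps the two commands from Steps 1 and 2 in a dry-run helper. The function names, the `DRY_RUN` variable, and the use of SSH to reach the nodes are assumptions made for this illustration; running `sudo` over SSH may also require a TTY or passwordless sudo, depending on the environment.

```shell
#!/bin/sh
# The two commands below are taken verbatim from Steps 1 and 2.
RESET_CMD='sudo sdkms-cluster reset --delete-data'
CLEANUP_CMD='sudo /opt/fortanix/bin/sdkms_cleanup.sh'

# Hypothetical helper: prints the command when DRY_RUN is 1 (the default),
# otherwise runs it on the target node over SSH (an assumed access method).
run_on_node() {
    target="$1"
    cmd="$2"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "[dry-run] $target: $cmd"
    else
        ssh "$target" "$cmd"
    fi
}

# Hypothetical helper: takes the node names as arguments and stops at
# the first failure so that a partial run is noticed immediately.
reset_and_cleanup() {
    # Step 1: reset the cluster on every node.
    for n in "$@"; do
        run_on_node "$n" "$RESET_CMD" || return 1
    done
    # Step 2: uninstall the current DSM version from every node.
    for n in "$@"; do
        run_on_node "$n" "$CLEANUP_CMD" || return 1
    done
}
```

Calling `reset_and_cleanup node1 node2` prints the planned commands; setting `DRY_RUN=0` first executes them. Keeping dry-run as the default is a deliberate choice, since the reset cannot be undone.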

6.0 Rollback When There Is an Operating System Change

In cases where the Fortanix DSM software upgrade involves an OS change (for example, upgrading from Ubuntu 20.04 to Ubuntu 24.04), rollback requires re-imaging all nodes and then rebuilding the cluster with the N–1 version. Check the release notes or contact Fortanix Support to determine whether an OS change was part of your upgrade scenario.
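
Sections 4.1, 5.1, and 6.2 describe three rollback paths. The choice between them can be sketched as a small decision function. This is only an illustration: the function name and its arguments are hypothetical, the OS versions are expected to come from the release notes or Fortanix Support, and preferring the DR node path whenever such a node is available is this sketch's reading of Section 2.2 rather than a documented rule.

```shell
#!/bin/sh
# Hypothetical helper.
# Usage: choose_rollback_path <dr_node_available: yes|no> <os_before> <os_after>
# Prints the section of this article that applies.
choose_rollback_path() {
    dr_node="$1"
    os_before="$2"
    os_after="$3"
    if [ "$dr_node" = "yes" ]; then
        # A rollback-dedicated DR node still running N-1 is available.
        echo "4.1: bootstrap the cluster from the rollback-dedicated DR node"
    elif [ "$os_before" = "$os_after" ]; then
        # The upgrade did not change the OS.
        echo "5.1: reset, uninstall, and install N-1 on the existing OS"
    else
        # The upgrade changed the OS, so nodes must be re-imaged first.
        echo "6.2: re-image all nodes, then install N-1"
    fi
}
```

For example, `choose_rollback_path no 20.04 24.04` selects the re-imaging path in Section 6.2.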

6.1 Downtime and Recovery Time Objective Considerations

The following factors should be taken into account when evaluating downtime and RTO implications during rollback scenarios involving OS changes:

  • Significant downtime is expected due to full re-imaging and cluster rebuilding.

  • The RTO depends on the scale of the deployment and the speed of re-imaging, typically ranging from hours to days.

  • Plan the maintenance window accordingly and communicate the potential impact to stakeholders.

6.2 Standard Operating Procedure

The following is the SOP for rollback scenarios with OS changes:

  1. Re-image all nodes:

    • Engage Fortanix Support to plan and perform node re-imaging using the golden ISO images. This process installs the base OS image for the N–1 version.

    • Nodes are treated as fresh installations, including OS and environment setup.

  2. Configure all nodes as described in the Fortanix DSM On-Premises Installation Guide. This involves:

    • Network configuration

    • Hostname and IP address assignment

    • Security and access settings

  3. Install the previous (N–1) DSM version on all nodes and create the cluster. For more information, see the Fortanix DSM On-Premises Installation Guide.

  4. Perform data recovery as described in the Data Recovery Procedure section.

7.0 Data Recovery Procedure for Rollback

This section describes the general recovery procedure to follow if a rollback becomes necessary. Specific rollback mechanisms are outlined in the previous sections (4.1, 5.0, and 6.0).

The steps below should be followed to recover the environment before and after performing a Fortanix DSM software version rollback:

  1. Identify and Securely Export Critical Objects:

    • Before initiating the rollback, identify objects created after the upgrade, such as new keys and groups.

    • If feasible, export these objects securely. Note that any security object not created with “export” permission will be lost during the rollback.

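  2. Restore the Checkpoint Backup:

    • Restore the checkpoint backup that was taken before the upgrade (see Section 2.3). For more information, refer to the Backup Guide.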

    NOTE

    The restore process will revert the system to the state captured in the checkpoint backup. Any changes made after the upgrade will not be retained.

  3. Import Application Keys:

    • If keys were exported in Step 1, import them into the appropriate groups. You may need to recreate groups if the keys were associated with newly created ones.

    • If new applications were created after the upgrade, you will need to recreate them and reconfigure your systems with the new application credentials.

    • Perform integrity checks to validate that the system is fully operational post-rollback.
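
Step 1 depends on knowing which objects were created after the checkpoint backup and which of them can be exported. The sketch below classifies entries from an inventory file. It does not call Fortanix DSM: the tab-separated inventory format (name, creation time in Unix epoch seconds, comma-separated key operations) and the function name are assumptions for illustration, and the inventory itself must be produced separately.

```shell
#!/bin/sh
# Hypothetical helper.
# Usage: classify_objects <inventory.tsv> <checkpoint_epoch>
# Assumed inventory columns (tab-separated):
#   1: object name
#   2: creation time as Unix epoch seconds
#   3: comma-separated key operations
# Prints one line for each object created after the checkpoint, stating
# whether its key operations include EXPORT.
classify_objects() {
    awk -F '\t' -v cp="$2" '
        $2 + 0 > cp + 0 {
            # Wrap the list in commas so that EXPORT matches only as a whole entry.
            status = (("," $3 ",") ~ /,EXPORT,/) ? "exportable" : "NOT exportable (lost on rollback)"
            print $1 ": " status
        }
    ' "$1"
}
```

Objects reported as not exportable are the ones that Step 1 warns will be lost, so they need to be recreated after the rollback.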

By following these recommendations, organizations can minimize risk, improve stability, and ensure business continuity in the event of an issue following an upgrade.

8.0 References