Fortanix Data Security Manager Cluster Recovery Best Practices

1.0 Introduction

This article outlines best practices and considerations for rollback and recovery scenarios in case issues occur after a Fortanix Data Security Manager (DSM) cluster upgrade.

This article is intended for technical stakeholders who are responsible for setting up and managing Fortanix DSM clusters.

2.0 Compatibility Matrix

The following table shows the compatibility matrix between DSM software versions and Ubuntu operating system (OS) releases.

DSM SOFTWARE VERSION	UBUNTU 24.04 LTS	UBUNTU 20.04 LTS
5.0	Yes	No
4.36	No	Yes
4.34	No	Yes
4.31	No	Yes
4.23	No	Yes

NOTE
The operating system was upgraded from Ubuntu 16.04 to Ubuntu 20.04 as part of DSM software version 4.0. Ubuntu releases earlier than 20.04 are now considered obsolete and are excluded from this compatibility matrix. If you are using one of these releases, contact Fortanix Support for assistance.

Legend / Notes

Yes – Compatible OS
No – Not Supported or Incompatible OS
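
The matrix above can also be used as a first indicator of whether moving between two DSM versions crosses an Ubuntu release boundary, which determines the rollback procedure described later in this article. The following is a minimal sketch in Python, not a Fortanix tool. The dictionary simply transcribes the table, and the function names are illustrative assumptions. The release notes or Fortanix Support remain the authoritative source.

```python
# Compatibility matrix transcribed from the table above.
# Key: DSM software version. Value: the Ubuntu LTS release marked "Yes".
SUPPORTED_OS = {
    "5.0": "24.04",
    "4.36": "20.04",
    "4.34": "20.04",
    "4.31": "20.04",
    "4.23": "20.04",
}


def supported_os(dsm_version: str) -> str:
    """Return the Ubuntu release listed for a DSM version (hypothetical helper)."""
    try:
        return SUPPORTED_OS[dsm_version]
    except KeyError:
        # Versions outside the table are not covered by this sketch.
        raise ValueError(
            f"DSM {dsm_version} is not in the matrix; contact Fortanix Support"
        ) from None


def involves_os_change(from_version: str, to_version: str) -> bool:
    """True when the two DSM versions are listed against different Ubuntu releases."""
    return supported_os(from_version) != supported_os(to_version)


if __name__ == "__main__":
    print(involves_os_change("4.34", "4.36"))  # both listed on Ubuntu 20.04
    print(involves_os_change("4.36", "5.0"))   # 20.04 versus 24.04
```

A result of True suggests that a rollback would require re-imaging the nodes. Confirm the result against the release notes before planning.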

3.0 Best Practices

This section outlines recommended best practices for various aspects of Fortanix DSM cluster operations.

3.1 Version Management and Deployment

To ensure controlled and stable deployments, Fortanix recommends the following version management practices:

Maintain the (N–1) DSM version in the production cluster while deploying the (N) version in non-production environments.
This approach lets you monitor the stability of version N before rolling it out to production.

3.2 Use of Dedicated DR Nodes for Rollback

Most Fortanix DSM cluster deployments use one or more DR nodes for disaster recovery. The use of a DR node is described in Fortanix DSM Backup and Restore - Manual. A DR node runs the same version as the production cluster, joins the cluster, and then leaves. It is upgraded along with the production cluster.

Fortanix strongly recommends dedicating at least one DR node specifically for rollback purposes. This rollback-dedicated DR node should continue to run the (N–1) version while the production cluster is being upgraded to version N.

This rollback-dedicated DR node can be used to bootstrap the cluster build-up during a rollback. Using this node significantly improves the Recovery Time Objective (RTO), as it allows for quickly creating a single-node cluster with the older version, enabling applications to resume using DSM while the cluster is being expanded.

3.3 Software Upgrade

To ensure a smooth upgrade and minimize potential risks, follow these best practices before and during the Fortanix DSM software upgrade process:

Perform a backup before any upgrade. This article refers to it as the "checkpoint backup." This backup is what you restore if a rollback is needed. For more information, refer to Fortanix DSM Backup and Restore - Manual.
Define a time window during which rollback is expected to be possible. During this period, Fortanix recommends the following operational guidelines:
- Treat this time as a change freeze, or at least minimize changes to the system. Avoid creating new objects or updating existing ones wherever possible.
- If you must create new security objects during this period, Fortanix recommends creating them with “export” permission so that they are not lost if a rollback is required.
- The rollback-dedicated DR node must follow the same process as other DR nodes (join the cluster and then leave), but only after the freeze period has ended.

The process for creating security objects with “export” permission depends on the use case and application.

For PKCS#11-based applications, such as Oracle TDE, refer to Clients: PKCS#11 Library for details. To ensure that the EXPORT key operation is always included when keys are created through PKCS#11, specify the following in the PKCS#11 configuration file:
```
add_key_ops_override = "EXPORT"
```
For CNG-based applications, such as Microsoft SQL TDE, ensure that the EXPORT key operation is explicitly enabled when generating or importing a key using CNG/EKM/CSP. This can be achieved by executing the following command to override the default key operation restrictions:
```
FortanixKmsClientConfig.exe machine --add-key-ops-override EXPORT
```

4.0 Upgrade and Rollback Considerations

Major Fortanix DSM software upgrades, such as the transition from DSM software version 4.27 to 4.34, introduce multiple updates to Kubernetes, system software, and the underlying kernel. As a result, an in-place rollback to the previous stable version is not entirely seamless.

If an issue is discovered after the upgrade has been completed, and new groups, keys, operations, and logs have been added, rolling back to the last stable configuration can pose significant challenges. A direct rollback may result in data inconsistencies and the potential loss of newly created objects.

To mitigate such failure scenarios, Fortanix recommends the following approach:

4.1 Preventive Measures

Proactive planning can significantly reduce the likelihood and impact of upgrade-related failures.

4.1.1 Thorough Testing in Non-Production

Perform extensive testing in a non-production environment over an extended period before upgrading the production cluster.

4.1.2 Backup Strategy and Monitoring

Maintain a continuous backup strategy to ensure system recovery options are available.
Implement robust monitoring and alerting mechanisms to detect potential issues early.

4.1.3 Issue Resolution on the Existing Version

If an issue arises with the upgraded version, the preferred approach is to determine whether it can be resolved with a patch or configuration change, without requiring a rollback. This is the least disruptive option and helps minimize downtime.

Work with the Fortanix engineering team to diagnose and resolve the issue without rolling back.
Apply targeted patches or configuration changes to stabilize the environment when possible.

5.0 Standard Operating Procedure (SOP)

The following SOP outlines the recommended steps for diagnosing and resolving issues through patch deployment, without initiating a rollback:

Identify and Assess the Issue
- Confirm the issue details and their impact after the upgrade.
- Determine whether a targeted patch can resolve the issue without requiring a rollback.
Develop and Test the Patch
- Create a patch that addresses only the specific issue.
- Validate the patch in a test or staging environment.
Backup and Deploy the Patch
- Back up critical data and configurations.
- Deploy the patch version without performing a rollback.
Monitor and Validate Post-Patch
- Verify that the issue is resolved.
- Check for any unintended side effects or regressions.
Closeout and Review
- Communicate the resolution to stakeholders.
- Provide a root cause analysis (RCA) and document lessons learned.

5.1 Operating Procedures with Dedicated Rollback DR Node

If a rollback-dedicated DR node is available and meets the criteria described in Section 3.2 (that is, it still runs the N–1 version), follow these steps:

Create a cluster using this node.
Perform data recovery as described in Section 8.0: Data Recovery Procedure for Rollback.
Expand the cluster by adding new nodes. For more information, refer to “Section 8.0: Add Node to an Existing Fortanix DSM Cluster” in the Fortanix Data Security Manager Installation Guide - On-Prem.

6.0 Rollback When There Is No Operating System Change

In cases where the Fortanix DSM software upgrade does not involve an Operating System (OS) change (that is, the nodes stay on the same Ubuntu release, unlike an upgrade that moves from Ubuntu 20.04 to Ubuntu 24.04), rollback requires rebuilding the cluster with the N–1 version, but the nodes do not need to be re-imaged. Check the release notes or contact Fortanix Support to determine whether an OS change was involved in your upgrade scenario.

6.1 Standard Operating Procedure

The following is the SOP for rollback scenarios without an OS change:

Perform a cluster reset by running the following command on all nodes:
```
sudo sdkms-cluster reset --delete-data
```
Uninstall the current Fortanix DSM version from all nodes by running the following command:
```
sudo /opt/fortanix/bin/sdkms_cleanup.sh
```
Install the previous (N–1) DSM version on all nodes and create the cluster. For more information, refer to the Fortanix Data Security Manager Installation Guide - On-Prem.
Perform data recovery as described in Section 8.0: Data Recovery Procedure for Rollback.
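
The ordering of the steps above can be sketched as a dry-run plan. The sketch below is illustrative only and assumes hypothetical node names. It uses the two commands exactly as given in this section and does not execute anything. It only lists which command goes to which node, with the reset completed on every node before any uninstall begins.

```python
# Commands copied verbatim from the procedure above.
RESET_CMD = "sudo sdkms-cluster reset --delete-data"
CLEANUP_CMD = "sudo /opt/fortanix/bin/sdkms_cleanup.sh"

# Steps that this sketch does not automate; they follow the referenced guides.
MANUAL_STEPS = (
    "Install the previous (N-1) DSM version on all nodes and create the cluster",
    "Perform data recovery (Section 8.0)",
)


def build_plan(nodes):
    """Return (node, command) pairs: reset on every node, then cleanup on every node."""
    plan = []
    for command in (RESET_CMD, CLEANUP_CMD):
        for node in nodes:
            plan.append((node, command))
    return plan


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    for node, command in build_plan(["dsm-node-1", "dsm-node-2", "dsm-node-3"]):
        print(f"[{node}] {command}")
    for step in MANUAL_STEPS:
        print(f"[manual] {step}")
```

Printing the plan before the maintenance window gives operators a checklist to review. Running the commands themselves is left to the operator.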

7.0 Rollback When There Is an Operating System Change

In cases where the Fortanix DSM software upgrade involves an OS change (for example, upgrading from Ubuntu 20.04 to Ubuntu 24.04), rollback requires re-imaging all nodes and then rebuilding the cluster with the N–1 version. Check the release notes or contact Fortanix Support to determine whether an OS change was part of your upgrade scenario.
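
The choice between the rollback procedures in Sections 5.1, 6.1, and 7.2 can be summarized as a small decision function. This is a sketch under one stated assumption: the dedicated-node path is preferred whenever a rollback-dedicated DR node on the N–1 version exists. That is this sketch's reading of Section 3.2 rather than an explicit rule in this article. The enum and function names are hypothetical.

```python
from enum import Enum


class RollbackProcedure(Enum):
    DEDICATED_DR_NODE = "Section 5.1"
    REBUILD_SAME_OS = "Section 6.1"
    REIMAGE_AND_REBUILD = "Section 7.2"


def choose_procedure(has_rollback_dr_node: bool, os_changed: bool) -> RollbackProcedure:
    """Pick the rollback procedure that applies to an upgrade scenario.

    has_rollback_dr_node: a DR node still running the N-1 version is available.
    os_changed: the upgrade moved the nodes to a different Ubuntu release.
    """
    if has_rollback_dr_node:
        # Assumption: bootstrap from the dedicated node first to shorten the RTO.
        return RollbackProcedure.DEDICATED_DR_NODE
    if os_changed:
        return RollbackProcedure.REIMAGE_AND_REBUILD
    return RollbackProcedure.REBUILD_SAME_OS


if __name__ == "__main__":
    print(choose_procedure(has_rollback_dr_node=False, os_changed=True).value)
```

All three procedures end with the data recovery procedure in Section 8.0.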

7.1 Downtime and Recovery Time Objective Considerations

The following factors should be taken into account when evaluating downtime and RTO implications during rollback scenarios involving OS changes:

Significant downtime is expected due to full re-imaging and cluster rebuilding.
The RTO depends on the scale of the deployment and the speed of re-imaging, typically ranging from hours to days.
Plan the maintenance window accordingly and communicate the potential impact to stakeholders.
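
For planning purposes, the factors above can be combined into a rough estimate. The sketch below is a back-of-the-envelope model and not a Fortanix formula. Every input (per-node re-imaging time, the number of nodes that can be re-imaged in parallel, configuration time, cluster build time, and restore time) is a hypothetical value that you must measure or obtain for your own deployment.

```python
import math


def estimate_rto_hours(
    node_count: int,
    reimage_hours_per_node: float,
    parallel_reimages: int,
    configure_hours_per_node: float,
    install_and_cluster_hours: float,
    restore_hours: float,
) -> float:
    """Rough RTO estimate for a rollback that requires re-imaging.

    Nodes are assumed to be re-imaged and configured in batches of
    `parallel_reimages`. Cluster creation and data restore are assumed
    to happen once, after all batches finish.
    """
    if node_count < 1 or parallel_reimages < 1:
        raise ValueError("node_count and parallel_reimages must be at least 1")
    batches = math.ceil(node_count / parallel_reimages)
    per_batch_hours = reimage_hours_per_node + configure_hours_per_node
    return batches * per_batch_hours + install_and_cluster_hours + restore_hours


if __name__ == "__main__":
    # Illustrative numbers only: 5 nodes, 2 re-imaged at a time.
    print(estimate_rto_hours(5, 1.5, 2, 0.5, 3.0, 2.0))
```

With these illustrative inputs, 5 nodes in batches of 2 give 3 batches of 2 hours each, plus 3 hours to build the cluster and 2 hours to restore, for a total of 11 hours. Real values can differ substantially.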

7.2 Standard Operating Procedure

The following is the SOP for rollback scenarios with an OS change:

Re-image all nodes:
- Engage Fortanix Support to plan and perform node re-imaging using the golden ISO images. This process installs the base OS image for the N–1 version.
- Nodes are treated as fresh installations, including OS and environment setup.
Configure all nodes as described in the Fortanix Data Security Manager Installation Guide - On-Prem. This involves:
- Network configuration
- Hostname and IP address assignment
- Security and access settings
Install the previous (N–1) DSM version on all nodes and create the cluster. For more information, refer to the Fortanix Data Security Manager Installation Guide - On-Prem.
Perform data recovery as described in Section 8.0: Data Recovery Procedure for Rollback.

8.0 Data Recovery Procedure for Rollback

This section describes the general recovery procedure to follow if a rollback becomes necessary. Specific rollback mechanisms are outlined in the previous sections (Sections 5.1, 6.0, and 7.0).

The steps below should be followed to recover the environment before and after performing a Fortanix DSM software version rollback:

Identify and Securely Export Critical Objects:
- Before initiating the rollback, identify objects created after the upgrade, such as new keys and groups.
- If feasible, export these objects securely. Note that any security object not created with “export” permission will be lost during the rollback.
Restore the Checkpoint Backup:
- For detailed instructions to restore the "checkpoint backup" taken before the upgrade, refer to Fortanix DSM Restoration Using Script - Automated.
NOTE
The restore process will revert the system to the state captured in the checkpoint backup. Any changes made after the upgrade will not be retained.
Import Application Keys:
- If keys were exported in the first step (Identify and Securely Export Critical Objects), import them into the appropriate groups. You may need to recreate groups if the keys were associated with newly created ones.
- If new applications were created after the upgrade, you will need to recreate them and reconfigure your systems with the new application credentials.
- Perform integrity checks to validate that the system is fully operational post-rollback.
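
The first step above, identifying objects created after the upgrade and separating those that can be exported from those that will be lost, can be sketched as follows. The record shape used here (name, creation time, and a set of key operations) is a hypothetical stand-in for an inventory you would obtain from DSM. It is not the DSM API. The only fact taken from this article is that objects without the EXPORT key operation cannot be carried across the rollback.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SecurityObject:
    # Hypothetical inventory record; field names are assumptions.
    name: str
    created_at: datetime
    key_ops: frozenset


def triage(objects, checkpoint_time):
    """Split objects created after the checkpoint backup into two name lists.

    Returns (exportable, at_risk). Objects created at or before the
    checkpoint are skipped because the checkpoint backup already holds them.
    """
    exportable, at_risk = [], []
    for obj in objects:
        if obj.created_at <= checkpoint_time:
            continue
        if "EXPORT" in obj.key_ops:
            exportable.append(obj.name)
        else:
            at_risk.append(obj.name)
    return exportable, at_risk


if __name__ == "__main__":
    checkpoint = datetime(2025, 1, 10, 2, 0)
    inventory = [
        SecurityObject("old-key", datetime(2024, 12, 1), frozenset({"ENCRYPT"})),
        SecurityObject("tde-key", datetime(2025, 1, 11), frozenset({"ENCRYPT", "EXPORT"})),
        SecurityObject("app-key", datetime(2025, 1, 12), frozenset({"SIGN"})),
    ]
    exportable, at_risk = triage(inventory, checkpoint)
    print("Export before rollback:", exportable)
    print("Will be lost on rollback:", at_risk)
```

Reviewing the second list before the rollback makes the expected loss explicit to the owners of those objects.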

By following these recommendations, organizations can minimize risk, improve stability, and ensure business continuity in the event of an issue following an upgrade.
