Fortanix Data Security Manager Continuous Restoration

1.0 Introduction

This article outlines the procedure for performing continuous restoration on the Fortanix Data Security Manager (DSM) Disaster Recovery (DR) cluster.

NOTE

This article pertains to continuous restoration in its beta version; full support is planned for a future Fortanix DSM release.

Fortanix DSM operates as an N-node cluster (where N >= 3) that can withstand the failure of up to (N-1)/2 nodes (rounded down). However, for production environments with critical applications, a DR plan, as detailed in Section 1.1: Disaster Recovery Plan, is essential to mitigate the risk of a catastrophic disaster, such as the production cluster becoming unavailable because most of its nodes are lost.

1.1 Disaster Recovery Plan

The DR plan aims to achieve the following objectives:

  • Ensure reliable failover to a DR cluster in case of a disaster.

  • Minimize the time required to switch to the DR cluster for business continuity.

  • Automatically synchronize the DR cluster with the production cluster, eliminating the need for manual restoration during switchover.

  • Ensure a seamless transition for applications.

2.0 Terminology References and Application Behavior

This section provides an overview of terminology and application behavior across different scenarios.

2.1 Quorum

Quorum refers to the minimum number of nodes required for a cluster to be fully operational. It is defined as a majority of the nodes, calculated as floor(n/2) + 1, where n is the total number of nodes in the cluster.
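As a quick arithmetic check (not part of the product tooling), the quorum size for typical cluster sizes works out as follows:

```shell
# Quorum = floor(n/2) + 1, computed with shell integer arithmetic
# (integer division already rounds down).
for n in 3 5 7; do
  echo "n=$n nodes -> quorum=$(( n / 2 + 1 ))"
done
```

A 3-node cluster therefore needs 2 agreeing nodes, a 5-node cluster needs 3, and a 7-node cluster needs 4.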

2.2 Global Quorum

The global quorum represents the agreement threshold across the entire Fortanix DSM cluster, encompassing nodes located in different data centers. It ensures that a majority of the nodes in the entire cluster are in consensus before operations are executed.

2.3 Local Quorum

The local quorum denotes the agreement threshold within a single data center in the Fortanix DSM cluster. It ensures that operations can proceed within a data center, even when connectivity between data centers is lost, as long as the required number of nodes within that data center agree.

GLOBAL QUORUM | LOCAL QUORUM | APPLICATION BEHAVIOR
YES | YES | Everything is operational.
YES | NO | Everything is operational, although requests landing in a data center without local quorum may experience higher read latencies. The application automatically retries reads with a global quorum when the local quorum is not satisfied, which increases latency.
NO | YES | Read-only mode is enforced for requests that land in a data center with local quorum. Write operations are not permitted.
NO | NO | Cluster failure - neither read nor write operations are functional.

3.0 Pre-Configuration Steps

Before starting continuous restoration, complete the following one-time procedures:

  1. Each DR node must be joined to, and then removed from, the production cluster. This places a copy of the Cluster Master Key (CMK), wrapped with the node's device master key, in the database, which enables the bootstrapping process for the DR cluster.

    NOTE

    After joining a DR node to the production cluster, ensure all pods are in a 1/1 Running state before removing the node.

  2. Perform the data restoration of production backup on any one of the DR nodes to create a single-node DR cluster. For detailed steps, refer to the Fortanix DSM Restoration Guide - Automated documentation.

  3. Join the remaining DR nodes to the DR cluster created in Step 2.

4.0 Configure Continuous Restoration

The following steps are involved in configuring continuous restoration on any one node of the DR cluster:

  1. Run the following command to navigate to the cluster-restore directory:

    cd /opt/fortanix/sdkms/bin/cluster-restore
  2. Run the following command to generate the c_restore_config.txt file:

    ./cr_generate_config.sh

    NOTE

    The script supports restoration from the following four locations:

    • LOCAL (expects backup files on the machine, preferred path /data/backup)

    • Secure Copy Protocol (SCP) (remote backup server)

    • Amazon Web Services (AWS)

    • Microsoft Azure

    Sensitive details, such as the SSH password, the AWS access or secret key, and the Azure connection string, can be exported as environment variables before executing the script instead of being added to the configuration (config) file.
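    For example, assuming SCP is the backup source, the secrets could be exported in the shell session before running the script. The variable names below are hypothetical; use the names that cr_generate_config.sh actually reads. AWS_SECRET_ACCESS_KEY and AZURE_STORAGE_CONNECTION_STRING are shown because they are the conventional AWS and Azure names:

```shell
# Hypothetical variable names -- check cr_generate_config.sh for the exact
# names it expects. The point is that secrets live in the environment,
# not in c_restore_config.txt.
export SCP_SSH_PASSWORD='<ssh-password>'
export AWS_SECRET_ACCESS_KEY='<aws-secret-key>'
export AZURE_STORAGE_CONNECTION_STRING='<connection-string>'
```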

  3. Run the following command to initiate the restoration process using the continuous_restore.sh script with the generated config file:

    ./continuous_restore.sh config/c_restore_config.txt

    The script retrieves the latest backup files from the specified location and restores the data to the cluster.

  4. Configure the cron scheduler entry based on the primary cluster backup schedule.
    For example, to run the continuous restore every 6 hours, the command will be as follows:

    crontab -e   # when the editor opens, add the entry below and save
    0 */6 * * * /opt/fortanix/sdkms/bin/cluster-restore/continuous_restore.sh /opt/fortanix/sdkms/bin/cluster-restore/config/c_restore_config.txt
    

    NOTE

    If backup files remain unchanged, the script skips the restore operation using the snapshot_ids.txt file.
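    The skip-if-unchanged behavior can be sketched generically as follows. This is an illustrative reconstruction, not the actual script logic; maybe_restore and the STATE_FILE variable are hypothetical names, and the real snapshot_ids.txt format may differ:

```shell
# Hypothetical sketch: restore only when the backup's snapshot ID is not
# already recorded in the state file.
STATE_FILE="${STATE_FILE:-snapshot_ids.txt}"

maybe_restore() {
  local id="$1"
  if [ -f "$STATE_FILE" ] && grep -qx "$id" "$STATE_FILE"; then
    echo "skip"       # same backup as last time -> no restore needed
  else
    echo "$id" >> "$STATE_FILE"
    echo "restore"    # new backup detected -> run the restore
  fi
}
```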

  5. The continuous restoration script generates a log file. Run the following command to navigate to the logs directory:

    cd /data/continuous_restore/logs
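    To inspect the most recent run, the newest file in the logs directory can be tailed, for example (log file names depend on what continuous_restore.sh writes; this simply picks the most recently modified file):

```shell
# Show the tail of the newest file in the logs directory.
LOG_DIR="${LOG_DIR:-/data/continuous_restore/logs}"
latest="$(ls -1t "$LOG_DIR" 2>/dev/null | head -n 1)"
[ -n "$latest" ] && tail -n 50 "$LOG_DIR/$latest" || echo "no logs found"
```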
  6. Run the following command to verify the status of pods and nodes:

    kubectl get nodes,pods -o wide