Using Data Security Manager with Databricks

1.0 Introduction

This article describes how to integrate Fortanix Data Security Manager (DSM) with Databricks to protect sensitive data stored in the Databricks SQL warehouse. The primary goal of this integration is to enforce data security controls that keep sensitive warehouse data confidential.

Tokenization and detokenization are fundamental techniques in data security, enabling the protection of sensitive data while preserving its usability.
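The idea of replacing a value with a same-shaped token and later recovering it can be illustrated with a small sketch. The following is a toy keyed digit substitution written only for illustration: it is not the algorithm used by Fortanix DSM, it is not secure, and the function names (toy_tokenize, toy_detokenize) are hypothetical. In the integration described in this article, tokenization is performed by Fortanix DSM using a tokenization security object.

```python
import hashlib
import hmac

DIGITS = "0123456789"


def _shifts(key: bytes, count: int) -> list:
    """Derive `count` pseudo-random shifts (0-9) from an HMAC-SHA256 keystream."""
    shifts = []
    counter = 0
    while len(shifts) < count:
        block = hmac.new(key, counter.to_bytes(4, "big"), hashlib.sha256).digest()
        shifts.extend(byte % 10 for byte in block)
        counter += 1
    return shifts[:count]


def _transform(value: str, key: bytes, sign: int) -> str:
    """Shift every ASCII digit by a keyed amount; leave other characters as-is."""
    shifts = iter(_shifts(key, sum(ch in DIGITS for ch in value)))
    return "".join(
        str((int(ch) + sign * next(shifts)) % 10) if ch in DIGITS else ch
        for ch in value
    )


def toy_tokenize(value: str, key: bytes) -> str:
    """Toy tokenization: the output keeps the length and separators of the input."""
    return _transform(value, key, +1)


def toy_detokenize(token: str, key: bytes) -> str:
    """Reverse toy_tokenize when given the same key."""
    return _transform(token, key, -1)


key = b"demo-key"  # illustrative key only
token = toy_tokenize("555-867-5309", key)
print(token)                       # same shape as the input, different digits
print(toy_detokenize(token, key))  # recovers the original value
```

Because the token keeps the original format, downstream queries and schemas that expect a phone-number-shaped string continue to work, which is what "preserving usability" refers to.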

2.0 Overview

Within this integration framework, the Fortanix DSM's vaultless tokenization is used with Databricks Notebooks and Python user-defined functions (UDFs) to facilitate tokenization and detokenization processes within the Databricks SQL warehouse.

  • Method 1 - Databricks Notebooks: These are essential tools for data science and machine learning workflows. They offer real-time co-authoring, automatic versioning, and built-in data visualizations. In this approach, Fortanix DSM Python Software Development Kits (SDKs) are used to connect to Fortanix DSM API endpoints within Databricks Notebooks to tokenize and detokenize sensitive data stored in the SQL warehouse. This ensures the secure transformation of sensitive data columns in the table.

    NOTE

    Notebooks support both tokenization and detokenization operations.

    For more information, refer to the Introduction to Databricks Notebooks documentation.

  • Method 2 - Python UDFs: These allow for secure and governed execution of Python code through SQL functions. By integrating UDFs with the Fortanix DSM API, you can perform tokenization operations to redact email and phone information from JSON strings, returning the redacted string and preventing unauthorized access.

    NOTE

    Python UDFs are currently in public preview mode and support both tokenization and detokenization operations.

    For more information, refer to the User-defined functions (UDFs) in Unity Catalog documentation.
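The JSON redaction use case mentioned for Method 2 might be sketched as follows. This is a minimal sketch under stated assumptions: the field names (email, phone), the redact_json helper, and the stand-in tokenizer are all hypothetical and are not taken from the Fortanix sample files. In a real UDF, the tokenizer would call Fortanix DSM instead of masking locally.

```python
import json
from typing import Callable

# Hypothetical names of the JSON fields that hold sensitive values.
SENSITIVE_FIELDS = ("email", "phone")


def redact_json(payload: str, tokenize: Callable[[str], str]) -> str:
    """Return the JSON string with each sensitive string field replaced by its token."""
    record = json.loads(payload)  # assumes the payload is a JSON object
    for field in SENSITIVE_FIELDS:
        if isinstance(record.get(field), str):
            record[field] = tokenize(record[field])
    return json.dumps(record)


def mask(value: str) -> str:
    """Stand-in tokenizer used only so that the sketch runs offline."""
    return "X" * len(value)


raw = '{"name": "Ana", "email": "ana@example.com", "phone": "555-0100"}'
print(redact_json(raw, mask))
```

Passing the tokenizer in as a parameter keeps the JSON handling separate from the call to the tokenization service, so the same helper can be exercised locally with a stand-in.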

3.0 Prerequisites

Ensure the following:

4.0 Product Tested Version

  • Fortanix DSM version 4.30 and above.

5.0 Architecture Diagram

Figure 1: Architecture diagram

The integration architecture is divided into two main planes: the Databricks Control Plane and the Databricks Data Plane, both connected to Fortanix DSM for secure data operations.

The Databricks Control Plane manages credentials for connecting to Fortanix DSM APIs and is where SQL commands for operations on sensitive data are written. It handles configuration, authentication, and orchestration.

The Databricks Data Plane is where the data resides and undergoes processing. It includes resources like SQL warehouses and data lakes. The Databricks Data Plane executes the SQL commands for tokenizing and detokenizing sensitive data.

The integration flow begins in the Databricks Control Plane, where Notebooks or Python UDFs use the Fortanix DSM Python SDK to connect to Fortanix DSM. Then, the SQL commands to tokenize and detokenize sensitive data in the Databricks SQL warehouse are written in the Control Plane and executed in the Data Plane.

6.0 Configure Fortanix DSM

A Fortanix DSM service must be configured, and the URL must be accessible. To create a Fortanix DSM account and group, refer to the following sections:

6.1 Signing Up

To get started with the Fortanix DSM cloud service, you must register an account at <Your_DSM_Service_URL>. For example, https://eu.smartkey.io.

For detailed steps on how to set up the Fortanix DSM, refer to the User's Guide: Sign Up for Fortanix Data Security Manager SaaS documentation.

6.2 Creating an Account

Access <Your_DSM_Service_URL> in a web browser and enter your credentials to log in to Fortanix DSM.

Figure 2: Logging in

For more information on how to set up an account in Fortanix DSM, refer to the User's Guide: Getting Started with Fortanix Data Security Manager - UI.

6.3 Creating a Group

Perform the following steps to create a group in the Fortanix DSM:

  1. In the DSM left navigation panel, click the Groups menu item, and then click the + button to create a new group.

    Figure 3: Add groups

  2. On the Adding new group page, do the following:

    1. Title: Enter a name for your group.

    2. Description (optional): Enter a short description of the group.

  3. Click SAVE to create the new group.

The new group is added to the Fortanix DSM successfully.

6.4 Creating an Application

Perform the following steps to create an application (app) in the Fortanix DSM:

  1. In the DSM left navigation panel, click the Apps menu item, and then click the + button to create a new app.

    Figure 4: Add application

  2. On the Adding new app page, do the following:

    1. App name: Enter the name for your application.

    2. ADD DESCRIPTION (optional): Enter a short description of the application.

    3. Authentication method: Select the default API Key as the authentication method from the drop-down menu. For more information on the available authentication methods, refer to the User's Guide: Authentication.

    4. Assigning the new app to groups: Select the group created in Section 6.3: Creating a Group from the list.

  3. Click SAVE to add the new application.

The new application is added to the Fortanix DSM successfully.

6.5 Copying the API Key

Perform the following steps to copy the API key from the Fortanix DSM:

  1. In the DSM left navigation panel, click the Apps menu item, and then click the app created in Section 6.4: Creating an Application to go to the detailed view of the app.

  2. On the INFO tab, click VIEW API KEY DETAILS.

  3. From the API Key Details dialog box, copy the API Key of the app to be used later.

6.6 Creating a Security Object

NOTE

For this guide, two tokenization security objects are created to tokenize and detokenize the Last Name and Email Address.

Perform the following steps to generate a tokenization key in the Fortanix DSM:

  1. In the DSM left navigation panel, click the Security Objects menu item, and then click the + button to create a new security object.

    Figure 5: Adding security object

  2. On the Add new Security Object page, do the following:

    1. Security Object Name: Enter the name of your security object. For example, db_name_token.

    2. Group: Select the group as created in Section 6.3: Creating a Group.

    3. Select the GENERATE radio button.

    4. In the Choose a type section, select the Tokenization key type to generate.

    5. In the Key Size section, select the size of the key in bits.

    6. In the Data type section, select the type of the security object token. For more information, refer to the User's Guide: Tokenization documentation.

    7. In the Key operations permitted section, select the required operations to define the actions that can be performed with the cryptographic keys, such as encryption, decryption, signing, and verifying.

  3. Click GENERATE to create the new security object.

  4. Repeat Steps 1 to 3 to create a security object for the Email Address. For example, db_email_token.

Figure 6: Security objects added

The two new security objects have been added to the Fortanix DSM successfully.

7.0 Creating a Databricks Secret

Use Databricks Secret Management to store the Fortanix DSM API key instead of hardcoding it in Notebooks or Python UDFs. This ensures that Notebooks and UDFs obtain the API key securely at runtime and that only authorized users can access it.

Perform the following steps using the Databricks CLI:

  1. Run the following commands to create a Databricks secrets scope:

    SECRETS_SCOPE_NAME="<your secret scope name>"
    databricks secrets create-scope $SECRETS_SCOPE_NAME
  2. Run the following commands to list the scopes:

    databricks secrets list-scopes

    The output of the command will be:

    Scope       Backend Type
    hr_scope    DATABRICKS
  3. Run the following commands to add users to the scope to access the secret:

    databricks secrets put-acl $SECRETS_SCOPE_NAME <user_email> MANAGE
  4. Run the following commands to view the list of users who have the access to the secret:

    databricks secrets list-acls $SECRETS_SCOPE_NAME

    The output of the command will be:

    [
      {
        "permission":"MANAGE",
        "principal":"<user_email>"
      }
    ]
  5. Run the following commands to add a secret:

    databricks secrets put-secret $SECRETS_SCOPE_NAME FORTANIX_API_KEY --string-value "<API_KEY_VALUE>"
  6. Run the following commands to view the list of the secrets:

    databricks secrets list-secrets $SECRETS_SCOPE_NAME

    The output of the command will be:

    Key                    Last Updated Timestamp
    FORTANIX_API_KEY       1714076216500
  7. Run the following commands to get the value of the secret:

    databricks secrets get-secret $SECRETS_SCOPE_NAME FORTANIX_API_KEY

    The output of the command will be:

    {
      "key":"FORTANIX_API_KEY",
      "value":"<base64_encoded_value>"
    }
    
  8. Run the following command to get the Base64-decoded value of the secret:

    databricks secrets get-secret $SECRETS_SCOPE_NAME FORTANIX_API_KEY | jq -r '.value' | base64 -d

    For more information, refer to the Access Control List documentation.
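The decoding done in Step 8 with jq and base64 can also be done in Python, for example when the CLI output has been captured by a script. The following is a minimal sketch that assumes the JSON shape shown in Step 7; the decode_secret helper name and the sample value are hypothetical.

```python
import base64
import json


def decode_secret(cli_output: str) -> str:
    """Return the decoded secret from the JSON printed by `databricks secrets get-secret`."""
    payload = json.loads(cli_output)
    return base64.b64decode(payload["value"]).decode("utf-8")


# Made-up output in the same shape as Step 7; the value is not a real API key.
sample_output = json.dumps(
    {
        "key": "FORTANIX_API_KEY",
        "value": base64.b64encode(b"not-a-real-key").decode("ascii"),
    }
)
print(decode_secret(sample_output))
```

Avoid printing or logging the decoded value outside of a one-off verification, because it is the API key itself.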

8.0 Methods

This section describes the steps for each of the two integration methods introduced in Section 2.0: Overview.

8.1 Using Databricks Notebooks

Perform the following steps to create a new Databricks Notebook to import Fortanix DSM Python SDK, and define the tokenization and detokenization functions:

  1. Log in to your Databricks account.

  2. From the left navigation panel, click + NEW → Notebook option to create a new Notebook.

    Figure 7: Add notebook

  3. On the next screen, click the File tab → Import notebook… option from the drop-down menu to import a sample DSM Notebook.

    Figure 8: Import notebook

  4. In the Import dialog box, enter the following details:

  5. Click Import to import the sample DSM Notebook file.
    This Notebook fetches the API key from the Databricks secrets as configured in Section 7.0: Creating a Databricks Secret.

    api_key = dbutils.secrets.get(scope="hr_scope", key="FORTANIX_API_KEY")

    Where,

    • scope refers to the secret scope name defined in the Databricks Secret Management.

    • FORTANIX_API_KEY refers to the name of the Databricks secret that stores the Fortanix DSM API key of the app copied in Section 6.5: Copying the API Key.

  6. Click the Run cell option to run the Python script in the Notebook to validate if everything works.

    Figure 10: Run button

  7. In the Attach to an existing compute resource form, select the General compute radio button and click Start, attach and run to execute the Notebook.
    Wait for a few minutes to validate the status of the Notebook.
    The Notebook will be created with the name DSM_notebook.

After the status of the Notebook is validated, create a second Notebook in the account to define the key UUIDs and call the tokenize_col and detokenize_col functions defined in DSM_notebook.

Perform the following steps:

  1. From the left navigation panel, click + NEW → Notebook option.

    Figure 11: Add notebook

  2. On the next screen, click File → Import notebook… option from the drop-down menu.

    Figure 12: Import notebook

  3. In the Import dialog box, enter the following details:

  4. Click Import to import the new sample DSM Notebook file.
    This Notebook references the DSM_notebook Notebook created in the previous procedure. It defines the key UUID values and invokes the tokenization (tokenize_col) and detokenization (detokenize_col) functions.
    These functions expect the schema name, the table name, a list of column names, and a list of key UUIDs as inputs. The Notebook contains the following commands:

    1. Run the following command to source the Fortanix DSM Notebook:

      %run "./DSM_notebook"
    2. Run the following command to define the Fortanix Key UUIDs:

      lname_kid = "<LAST_NAME_KEY_UUID>"
      email_kid = "<EMAIL_ADDRESS_KEY_UUID>"

      Where,

      • lname_kid refers to the Fortanix DSM key UUID for the last name key as created in Section 6.6: Creating a Security Object.

      • email_kid refers to the Fortanix DSM key UUID for the email key as created in Section 6.6: Creating a Security Object.

    3. In this sample notebook, the pre-created table has the following structure:

      # "employees.data"
      # employee_id         bigint
      # fname               string
      # lname               string
      # email               string

      Figure 13: Employees data table

    4. Run the following command to tokenize the columns:

      tokenize_col("employees","data",["lname","email"],[lname_kid,email_kid])
      # Comment out the call to insert_tokenizedData() in tokenize_col() in DSM_notebook if you do not need another table to be created.

      This sample function tokenizes the lname and email columns and creates a new table called tokenized_<table_name>.

    5. Run the following command to detokenize the columns by specifying the table and column names:

      detokenize_col("employees","tokenized_data",["lname","email"],[lname_kid,email_kid])
  5. Click the Run cell option to run the Python script for the second Notebook.

    Figure 14: Sample notebook

  6. In the Attach to an existing compute resource form, select the General compute radio button and click Start, attach and run to execute the following functions in the Notebook:

    • Tokenize the selected columns, such as lname and email.

    • Create a new table named employees.tokenized_data in the Catalog.

    • Detokenize the same columns from the employees.tokenized_data table.

      Figure 15: Employees tokenized data table
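The column-to-key pairing used by tokenize_col and detokenize_col (a list of column names matched positionally with a list of key UUIDs) can be sketched in plain Python. This is not the implementation in the sample DSM Notebook, which is not shown in this article. The tokenize_rows helper, the stand-in tokenizer, and the sample UUIDs are hypothetical, and the sketch works on in-memory rows rather than on a Databricks table.

```python
import uuid
from typing import Callable, Dict, List, Optional


def tokenize_rows(
    rows: List[Dict[str, Optional[object]]],
    columns: List[str],
    key_ids: List[str],
    tokenize: Callable[[str, str], str],
) -> List[Dict[str, Optional[object]]]:
    """Apply tokenize(value, key_id) to each listed column, one key UUID per column."""
    if len(columns) != len(key_ids):
        raise ValueError("columns and key_ids must have the same length")
    for key_id in key_ids:
        # Raises ValueError for unreplaced placeholders such as "<LAST_NAME_KEY_UUID>".
        uuid.UUID(key_id)
    key_for = dict(zip(columns, key_ids))
    return [
        {
            name: tokenize(str(value), key_for[name])
            if name in key_for and value is not None
            else value
            for name, value in row.items()
        }
        for row in rows
    ]


def fake_tokenize(value: str, key_id: str) -> str:
    """Stand-in for a call to Fortanix DSM so that the sketch runs offline."""
    return f"tok[{key_id[:4]}]({value})"


# Illustrative key UUIDs only; real values come from the DSM security objects.
lname_kid = "11111111-1111-1111-1111-111111111111"
email_kid = "22222222-2222-2222-2222-222222222222"
rows = [{"employee_id": 1, "fname": "Ana", "lname": "Silva", "email": "ana@example.com"}]
print(tokenize_rows(rows, ["lname", "email"], [lname_kid, email_kid], fake_tokenize))
```

Validating the UUIDs before any call is made catches the common mistake of running the Notebook with the placeholder values still in place.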

8.2 Using Python User-Defined Functions

Perform the following steps to implement tokenization and detokenization operations in Databricks SQL warehouse using Python UDFs:

NOTE

Ensure that the Python UDF feature, which is currently in public preview, is enabled.

  1. From the left navigation panel, click SQL Editor.

  2. In the New query working space, paste the content from the Tokenization UDF Function file.

  3. Click the run icon to execute the function file.

    Figure 16: Run button

    The results are displayed in the Raw results section.

  4. After the script execution completes, click the + button to add a new query.

    Figure 17: Create new query

  5. Similarly, in the New query working space, paste the content from the Detokenization UDF Function file.

  6. Click the run icon to execute the function file.

    Figure 18: Run button

  7. Create a new query to tokenize the email column of the employees.data table available in the Catalog.
    Copy and paste the following commands to tokenize the email column:

    SELECT fortanix_tokenize(
        email, 
        map(
            'fortanix_api_endpoint', 'https://apac.smartkey.io',
            'fortanix_api_key', secret('hr_scope', 'FORTANIX_API_KEY'),
            'key_id', 'fdc25a25-f75b-4a5a-83d6-002c8ed71fba'
        )
    ) from data;
    

    Where,

    • fortanix_api_endpoint refers to the URL endpoint for the Fortanix DSM API.

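    • fortanix_api_key refers to the Fortanix DSM API key of the app copied in Section 6.5: Copying the API Key.

    • secret refers to the Databricks SQL function that reads a secret. Its arguments are the secret scope name (hr_scope) and the secret key (FORTANIX_API_KEY) defined in the Databricks Secret Management.

    • key_id refers to the UUID of the tokenization security object created in Section 6.6: Creating a Security Object.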

  8. Click the run button to execute the query and view the tokenized email column in the Raw results section:

    Figure 19: Tokenized query

  9. Create a new query to detokenize the email column of the employees.tokenized_data table available in the Catalog.
    Copy and paste the following commands to detokenize the email column:

    SELECT fortanix_detokenize(
        email, 
        map(
            'fortanix_api_endpoint', 'https://apac.smartkey.io',
            'fortanix_api_key', secret('hr_scope', 'FORTANIX_API_KEY'),
            'key_id', 'fdc25a25-f75b-4a5a-83d6-002c8ed71fba'
        )
    ) from tokenized_data;

    Where,

    • fortanix_api_endpoint refers to the URL endpoint for the Fortanix DSM API.

    • fortanix_api_key refers to the Fortanix DSM API key of the app copied in Section 6.5: Copying the API Key.

    • key_id refers to the UUID of the tokenization security object that was used to tokenize the column.

  10. Click the run button to execute the query and view the detokenized email column in the Raw results section:

    Figure 20: Detokenized query
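The configuration map passed to fortanix_tokenize in the queries above carries everything needed to build a request to Fortanix DSM. The following sketch shows how a UDF body might turn that map into an HTTP request. It is not the content of the Tokenization UDF Function file, which is not shown in this article. The build_tokenize_request helper is hypothetical, and the endpoint path (/crypto/v1/encrypt), the Basic authorization scheme, and the request field names (key.kid, alg, mode, plain) are assumptions that should be verified against the Fortanix DSM REST API reference. The sketch only builds the request and does not send it.

```python
import base64
import json
from typing import Dict, Tuple


def build_tokenize_request(
    config: Dict[str, str], plaintext: str
) -> Tuple[str, Dict[str, str], bytes]:
    """Build (url, headers, body) for a tokenization call from the UDF config map."""
    # Assumption: tokenization is exposed through the DSM encrypt endpoint.
    url = config["fortanix_api_endpoint"].rstrip("/") + "/crypto/v1/encrypt"
    headers = {
        # Assumption: the app API key is accepted as a Basic credential.
        "Authorization": "Basic " + config["fortanix_api_key"],
        "Content-Type": "application/json",
    }
    body = {
        "key": {"kid": config["key_id"]},  # UUID of the tokenization security object
        "alg": "AES",                      # assumed algorithm name
        "mode": "FPE",                     # assumed mode for format-preserving tokens
        "plain": base64.b64encode(plaintext.encode("utf-8")).decode("ascii"),
    }
    return url, headers, json.dumps(body).encode("utf-8")


config = {
    "fortanix_api_endpoint": "https://apac.smartkey.io",
    "fortanix_api_key": "<API_KEY_VALUE>",  # placeholder, not a real key
    "key_id": "fdc25a25-f75b-4a5a-83d6-002c8ed71fba",
}
url, headers, body = build_tokenize_request(config, "ana@example.com")
print(url)
print(json.loads(body)["key"])
```

The returned values could then be sent with any HTTP client, for example urllib.request.Request(url, data=body, headers=headers, method="POST"). Keeping request construction separate from sending makes the mapping from the SQL config map to the API call easy to inspect without network access.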