Using Data Security Manager with Databricks

1.0 Introduction

This article describes how to integrate Fortanix Data Security Manager (DSM) with Databricks to enhance data security within the Databricks SQL warehouse. The primary goal of this integration is to enforce robust data security protocols and keep sensitive data stored in the warehouse confidential.

Tokenization and detokenization serve as fundamental techniques in data security, enabling the protection of sensitive data while preserving its usability.
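
To make the idea of reversible, format-preserving tokenization concrete, the following is a toy sketch only. It is not the algorithm used by Fortanix DSM and it is not secure; the function names (toy_tokenize, toy_detokenize) and the digit-shifting scheme are hypothetical and exist purely for illustration. It shows that a token can keep the shape of the original value (same length, same separators) while the original can be recovered only with the secret.

```python
import hashlib
import hmac

DIGITS = "0123456789"


def _offsets(secret: bytes, length: int) -> list[int]:
    """Derive one pseudo-random offset per character position from the secret."""
    stream = b""
    counter = 0
    while len(stream) < length:
        stream += hmac.new(secret, counter.to_bytes(4, "big"), hashlib.sha256).digest()
        counter += 1
    return list(stream[:length])


def toy_tokenize(value: str, secret: bytes) -> str:
    """Shift every ASCII digit by a secret-derived offset; keep other characters."""
    out = []
    for ch, off in zip(value, _offsets(secret, len(value))):
        out.append(str((int(ch) + off) % 10) if ch in DIGITS else ch)
    return "".join(out)


def toy_detokenize(token: str, secret: bytes) -> str:
    """Reverse toy_tokenize by subtracting the same secret-derived offsets."""
    out = []
    for ch, off in zip(token, _offsets(secret, len(token))):
        out.append(str((int(ch) - off) % 10) if ch in DIGITS else ch)
    return "".join(out)
```

Calling toy_detokenize(toy_tokenize("555-867-5309", b"demo-secret"), b"demo-secret") returns the original string, and the token keeps the hyphens in the same positions. In the integration described here, the equivalent operations are performed by Fortanix DSM, and the key never leaves DSM.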

2.0 Overview

In this integration, Fortanix DSM vaultless tokenization is used with Databricks Notebooks and Python User-defined functions (UDFs) to tokenize and detokenize data within the Databricks SQL warehouse.

  • Method 1 - Databricks Notebooks: These are essential tools for data science and machine learning workflows. They offer real-time co-authoring, automatic versioning, and built-in data visualizations. In this approach, Fortanix DSM Python Software Development Kits (SDKs) are used to connect to Fortanix DSM API endpoints within Databricks Notebooks for tokenizing and detokenizing sensitive data stored in the SQL warehouse. This ensures secure transformation of sensitive data columns in the table.

    NOTE

    The Notebooks support both tokenization and detokenization operations.

    For more information, refer to the Introduction to Databricks Notebooks documentation.

  • Method 2 - Python User-defined functions (UDFs): They allow for secure and governed execution of Python code through SQL functions. By integrating UDFs with the Fortanix DSM API, you can perform tokenization operations, for example, to redact email and phone information from JSON strings and return the redacted string, which prevents unauthorized access to the original values.

    NOTE

    Python UDF is currently in public preview mode and supports both tokenization and detokenization operations.

    For more information, refer to the Introduction to User-defined functions (UDFs) in Unity Catalog documentation.
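
The JSON redaction mentioned for Method 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the UDF shipped with the integration: the function name redact_json, the field names email and phone, and the assumption that the input is a flat JSON object are all hypothetical. The tokenize argument stands in for whatever call reaches Fortanix DSM; a fake implementation is used here so the sketch runs without network access.

```python
import json
from typing import Callable


def redact_json(payload: str,
                tokenize: Callable[[str], str],
                fields: tuple[str, ...] = ("email", "phone")) -> str:
    """Replace the listed top-level string fields with their tokens."""
    record = json.loads(payload)
    for name in fields:
        value = record.get(name)
        # Only string values are tokenized; missing or null fields are left alone.
        if isinstance(value, str):
            record[name] = tokenize(value)
    return json.dumps(record)


def fake_tokenize(value: str) -> str:
    """Stand-in for a real DSM call: masks the value instead of tokenizing it."""
    return "*" * len(value)


if __name__ == "__main__":
    raw = '{"name": "Ada", "email": "ada@example.com", "phone": "555-0100"}'
    print(redact_json(raw, fake_tokenize))
```

Passing the tokenizer in as a parameter keeps the JSON handling separate from the DSM call, so the same logic could be reused for detokenization by supplying a different callable.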

3.0 Prerequisites

Ensure the following:

4.0 Product Tested Version

  • Fortanix DSM version 4.30 and above.

5.0 Architecture Diagram

Figure 1: Architecture Diagram

The integration architecture is divided into two main planes: the Databricks Control Plane and the Databricks Data Plane, connected to the Fortanix DSM for secure data operations.

The Databricks Control Plane is where credentials for connecting to Fortanix DSM APIs are managed, and where SQL commands for operations on sensitive data are written. It handles configuration, authentication, and orchestration.

The Databricks Data Plane is where data resides and undergoes processing. It includes resources like SQL warehouses and data lakes. The Databricks Data Plane executes the SQL commands for tokenizing and detokenizing sensitive data.

First, in the Databricks Control Plane, Notebooks or Python UDFs connect to the Fortanix DSM using the Fortanix DSM Python SDK. Then, the SQL commands to tokenize and detokenize sensitive data in the Databricks SQL warehouse are written in the Control Plane and executed in the Data Plane.

6.0 Configure Fortanix DSM

A Fortanix DSM service must be configured, and the URL must be accessible. To create a Fortanix DSM account and group, refer to the following sections:

6.1 Signing Up

To get started with the Fortanix Data Security Manager (DSM) cloud service, you must register an account at <Your_DSM_Service_URL>. For example, https://eu.smartkey.io.

For detailed steps on how to set up the Fortanix DSM, refer to the User's Guide: Sign Up for Fortanix Data Security Manager SaaS documentation.

6.2 Creating an Account

Access <Your_DSM_Service_URL> in a web browser and enter your credentials to log in to Fortanix DSM.

Figure 2: Logging In

6.3 Creating a Group

Perform the following steps to create a group in the Fortanix DSM:

  1. Click the Groups menu item in the DSM left navigation panel and click the + button on the Groups page to add a new group.

    Figure 3: Add Group

  2. On the Adding new group page, enter the following details:

    • Title: Enter a title for your group.

    • Description (optional): Enter a short description for the group.

  3. Click the SAVE button to create the new group.

The new group has been added to the Fortanix DSM successfully.

6.4 Creating an Application

Perform the following steps to create an application (app) for the group created in the previous section:

  1. Click the Apps menu item in the DSM left navigation panel and click the + button on the Apps page to add a new app.

    Figure 4: Add Application

  2. On the Adding new app page, enter the following details:

    • App name: Enter the name of your application.

    • Interface (optional): Select the interface type as REST API from the drop-down menu.

    • ADD DESCRIPTION (optional): Enter a short description for the application.

    • Authentication method: Select the default API Key as the method of authentication from the drop-down menu. For more information on the authentication methods, refer to the User's Guide: Authentication documentation.

    • Assigning the new app to groups: Select the group created in Section 6.3: Creating a Group.

  3. Click the SAVE button to add the new application.

The new application has been added to the Fortanix DSM successfully.

6.5 Copying the API Key

Perform the following steps to copy the API key from the Fortanix DSM:

  1. Navigate to the Apps menu item in the DSM left navigation panel and click the app created in Section 6.4: Creating an Application to go to the detailed view of the app.

  2. On the INFO tab, click the VIEW API KEY DETAILS button.

  3. From the API Key Details dialog box, copy the API Key of the app to be used later.

6.6 Creating a Security Object

NOTE

For this guide, two tokenization security objects are created to tokenize and detokenize Last Name and Email Address.

Perform the following steps to generate a tokenization key in the Fortanix DSM:

  1. Click the Security Objects menu item in the DSM left navigation panel and click the + button on the Security Objects page to add a security object.

    Figure 5: Add Security Object

  2. On the Add New Security Object page, enter the following details:

    • Security Object name: Enter the name of your security object. For example, db_name_token.

    • Group: Select the group as created in Section 6.3: Creating a Group.

    • Select the GENERATE radio button.

    • Choose a type: Select the Tokenization key type.

    • Key Size: Indicates the size of the key in bits.

    • Data type: Indicates the type of the security object token. For more information, refer to the User's Guide: Tokenization documentation.

    • Key operations permitted: Select the required operations to define the actions that can be performed with the cryptographic keys, such as encryption, decryption, signing, and verifying.

  3. Click the GENERATE button to create the new security object.

  4. Repeat Steps 1 to 3 to create a security object for Email Address. For example, db_email_token.

Figure 6: Security Objects Added

The two new security objects have been added to the Fortanix DSM successfully.

7.0 Creating a Databricks Secret

Use Databricks Secret Management to store the Fortanix DSM API key rather than hardcoding it in Notebooks or Python UDFs. This gives Notebooks and UDFs secure, authorized access to the API key at runtime.

Perform the following steps using Databricks CLI:

  1. Run the following commands to create a Databricks secrets scope:

    SECRETS_SCOPE_NAME="<your secret scope name>"
    databricks secrets create-scope $SECRETS_SCOPE_NAME
  2. Run the following command to list the scopes:

    databricks secrets list-scopes

    The output of the command will be:

    Scope       Backend Type
    hr_scope    DATABRICKS
  3. Run the following command to grant a user access to the secret scope:

    databricks secrets put-acl $SECRETS_SCOPE_NAME <user_email> MANAGE
  4. Run the following command to view the list of users who have access to the secret scope:

    databricks secrets list-acls $SECRETS_SCOPE_NAME

    The output of the command will be:

    [
      {
        "permission":"MANAGE",
        "principal":"<user_email>"
      }
    ]
  5. Run the following command to add a secret:

    databricks secrets put-secret $SECRETS_SCOPE_NAME FORTANIX_API_KEY --string-value "<API_KEY_VALUE>"
  6. Run the following command to view the list of secrets:

    databricks secrets list-secrets $SECRETS_SCOPE_NAME

    The output of the command will be:

    Key                    Last Updated Timestamp
    FORTANIX_API_KEY       1714076216500
  7. Run the following command to get the value of the secret:

    databricks secrets get-secret $SECRETS_SCOPE_NAME FORTANIX_API_KEY

    The output of the command will be:

    {
      "key":"FORTANIX_API_KEY",
      "value":"<base64_encoded_value>"
    }
    
  8. Run the following command to get the base64-decoded value of the secret:

    databricks secrets get-secret $SECRETS_SCOPE_NAME FORTANIX_API_KEY | jq -r '.value' | base64 -d

    For more information, refer to the Access Control List documentation.
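
Where jq and base64 are not available, the decoding in Step 8 can be done in Python instead. The following is a minimal sketch that assumes the get-secret output has the JSON shape shown in Step 7 (a "key" field and a base64-encoded "value" field); the function name decode_secret is hypothetical.

```python
import base64
import json


def decode_secret(cli_output: str) -> str:
    """Return the plain-text secret from the JSON printed by get-secret."""
    entry = json.loads(cli_output)
    # The "value" field is base64-encoded, as shown in the Step 7 output.
    return base64.b64decode(entry["value"]).decode("utf-8")


if __name__ == "__main__":
    # Build a sample payload locally instead of calling the CLI.
    sample_value = base64.b64encode(b"example-api-key").decode("ascii")
    sample_output = json.dumps({"key": "FORTANIX_API_KEY", "value": sample_value})
    print(decode_secret(sample_output))
```

Avoid printing the decoded API key in shared terminals or logs; the sample above uses a dummy value for that reason.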

8.0 Methods

This section describes the steps for each of the two integration methods introduced in Section 2.0: Overview.

8.1 Using Databricks Notebooks

Perform the following steps to create a new Databricks Notebook that imports the Fortanix DSM Python SDK and defines the tokenization and detokenization functions:

  1. Log in to your Databricks account.

  2. From the left navigation panel, click the + NEW button → Notebook option to create a new Notebook.

    Figure 7: Add Notebook

  3. On the next screen, click the File tab → Import notebook… option from the drop-down menu to import a sample DSM Notebook.

    Figure 8: Import Notebook

  4. In the Import dialog box, enter the following details:

  5. Click the Import button to import the sample DSM Notebook file.
    This Notebook fetches the API key from the Databricks secrets as configured in Section 7.0: Creating a Databricks Secret.

    api_key = dbutils.secrets.get(scope="hr_scope", key="FORTANIX_API_KEY")

    Where,

    • scope refers to the secret scope name defined in the Databricks Secret Management.

    • FORTANIX_API_KEY refers to the Fortanix DSM API key of the app copied in Section 6.5: Copying the API Key.

  6. Click the Run cell option to run the Python script in the Notebook and verify that it works.

    Figure 10: Run Button

  7. In the Attach to an existing compute resource form, select the General compute radio button and click the Start, attach and run button to execute the Notebook.
    Wait for a few minutes to validate the status of the Notebook.
    The Notebook will be created with the name DSM_notebook.
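
The sample Notebook uses the Fortanix DSM Python SDK, whose calls are not reproduced in this article. As a rough illustration of what a tokenization call carries, the following sketch builds (but does not send) an equivalent REST request using only the standard library. The endpoint path /crypto/v1/encrypt, the body fields (key.kid, alg, mode, plain), and the Basic authorization scheme are assumptions recalled from the Fortanix DSM REST API and should be verified against the Fortanix API reference; the function name build_tokenize_request is hypothetical.

```python
import base64
import json


def build_tokenize_request(endpoint: str, api_key: str, key_id: str, plaintext: str) -> dict:
    """Describe an HTTP request that would ask DSM to tokenize one value.

    Nothing is sent over the network; the caller decides how to transmit it.
    """
    body = {
        "key": {"kid": key_id},   # assumption: key is referenced by UUID
        "alg": "AES",             # assumption: tokenization keys use AES
        "mode": "FPE",            # assumption: format-preserving mode
        "plain": base64.b64encode(plaintext.encode("utf-8")).decode("ascii"),
    }
    return {
        "method": "POST",
        "url": endpoint.rstrip("/") + "/crypto/v1/encrypt",  # assumed path
        "headers": {
            "Authorization": "Basic " + api_key,  # assumed auth scheme
            "Content-Type": "application/json",
        },
        "body": json.dumps(body),
    }


if __name__ == "__main__":
    request = build_tokenize_request(
        "https://apac.smartkey.io", "<API_KEY_VALUE>", "<KEY_UUID>", "ada@example.com"
    )
    print(request["url"])
```

In a Notebook, the api_key argument would come from dbutils.secrets.get as shown in Step 5 rather than from a literal string.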

After the status of the Notebook is validated, create a second Notebook in the account to define the key UUIDs and call the tokenize_col and detokenize_col functions defined in the first Notebook.

Perform the following steps:

  1. From the left navigation panel, click the + NEW button → Notebook option.

    Figure 11: Add Notebook

  2. On the next screen, click the File button → Import notebook… option from the drop-down menu.

    Figure 12: Import Notebook

  3. In the Import dialog box, enter the following details:

  4. Click the Import button to import the new sample DSM Notebook file.
    This Notebook references the DSM_notebook Notebook created in the previous steps. It defines the key UUID values and invokes the tokenization (tokenize_col) and detokenization (detokenize_col) functions, as described in the following steps.
    These functions expect the schema and table names, the column names, and the key UUIDs as inputs.

    1. Run the following command to source the Fortanix DSM Notebook:

      %run "./DSM_notebook"
    2. Run the following command to define the Fortanix Key UUIDs:

      lname_kid = "<LAST_NAME_KEY_UUID>"
      email_kid = "<EMAIL_ADDRESS_KEY_UUID>"

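      Where,

      • lname_kid refers to the UUID of the security object created for Last Name in Section 6.6: Creating a Security Object.

      • email_kid refers to the UUID of the security object created for Email Address in Section 6.6: Creating a Security Object.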

    3. In this sample notebook, the pre-created table has the following structure:

      # "employees.data"
      # employee_id         bigint
      # fname               string
      # lname               string
      # email               string

      Figure 13: Employees Data Table

    4. Run the following command to tokenize the columns:

      tokenize_col("employees","data",["lname","email"],[lname_kid,email_kid])
      # Comment out the call to insert_tokenizedData() in tokenize_col() in DSM_notebook if you do not need another table to be created.

      This sample function tokenizes the lname and email columns and creates a new table called tokenized_<table_name>.

    5. Run the following command to detokenize the columns by specifying the table and column names:

      detokenize_col("employees","tokenized_data",["lname","email"],[lname_kid,email_kid])
  5. Click the Run cell option to run the Python script for the second Notebook.

    Figure 14: Sample Notebook

  6. In the Attach to an existing compute resource form, select the General compute radio button and click the Start, attach and run button to execute the following functions in the Notebook:

    • Tokenize the selected columns, such as lname and email.

    • Create a new table named employees.tokenized_data in the Catalog.

    • Detokenize the same columns from the employees.tokenized_data table.

      Figure 15: Employees Tokenized Data Table
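
The implementation of tokenize_col in the sample Notebook is not shown in this article. The following sketch illustrates only the column-to-key pairing that the call tokenize_col("employees","data",["lname","email"],[lname_kid,email_kid]) implies: each listed column is tokenized with the key UUID at the same position. It works on plain Python dictionaries rather than a Databricks table, and the names tokenize_rows and fake_tokenize are hypothetical stand-ins.

```python
from typing import Callable


def tokenize_rows(rows: list[dict],
                  columns: list[str],
                  key_ids: list[str],
                  tokenize: Callable[[str, str], str]) -> list[dict]:
    """Return copies of rows with each listed column tokenized by its paired key."""
    if len(columns) != len(key_ids):
        raise ValueError("columns and key_ids must have the same length")
    key_for = dict(zip(columns, key_ids))
    result = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows stay untouched
        for column, key_id in key_for.items():
            if new_row.get(column) is not None:
                new_row[column] = tokenize(new_row[column], key_id)
        result.append(new_row)
    return result


def fake_tokenize(value: str, key_id: str) -> str:
    """Stand-in for a DSM call: tags the reversed value with the key it used."""
    return f"{key_id}:{value[::-1]}"


if __name__ == "__main__":
    employees = [{"employee_id": 1, "fname": "Ada",
                  "lname": "Lovelace", "email": "ada@example.com"}]
    print(tokenize_rows(employees, ["lname", "email"],
                        ["<LAST_NAME_KEY_UUID>", "<EMAIL_ADDRESS_KEY_UUID>"],
                        fake_tokenize))
```

Returning new rows instead of modifying the input mirrors the sample's behavior of writing tokenized values to a separate table and leaving the source table unchanged.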

8.2 Using Python User-defined functions

Perform the following steps to implement tokenization and detokenization operations in Databricks SQL warehouse using Python UDFs:

NOTE

Ensure that Python UDFs, which are currently in public preview, are enabled.

  1. From the left navigation panel, click the SQL Editor button.

  2. In the New query working space, paste the content from the Tokenization UDF Function file.

  3. Click the run icon to execute the function file.

    Figure 16: Run Button

    The results are displayed in the Raw results section.

  4. After the script finishes executing, click the + button to add a new query.

    Figure 17: Create New Query

  5. Similarly, in the New query working space, paste the content from the Detokenization UDF Function file.

  6. Click the run icon to execute the function file.

    Figure 18: Run Button

  7. Create a new query to tokenize the email column of the employees.data table available in the Catalog.
    Copy and paste the following command to tokenize the email column:

    SELECT fortanix_tokenize(
        email, 
        map(
            'fortanix_api_endpoint', 'https://apac.smartkey.io',
            'fortanix_api_key', secret('hr_scope', 'FORTANIX_API_KEY'),
            'key_id', 'fdc25a25-f75b-4a5a-83d6-002c8ed71fba'
        )
    ) from data;
    

    Where,

    • fortanix_api_endpoint refers to the URL endpoint for the Fortanix DSM API.

    • fortanix_api_key refers to the Fortanix DSM API key of the app copied in Section 6.5: Copying the API Key.

    • secret retrieves the value stored under the given scope (hr_scope) and key (FORTANIX_API_KEY) in the Databricks Secret Management.

    • key_id refers to the UUID of the tokenization security object used for tokenization.

  8. Click the run button to execute the query and view the tokenized email column in the Raw results section:

    Figure 19: Tokenized Query

  9. Create a new query to detokenize the email column of the employees.tokenized_data table available in the Catalog.
    Copy and paste the following command to detokenize the email column:

    SELECT fortanix_detokenize(
        email, 
        map(
            'fortanix_api_endpoint', 'https://apac.smartkey.io',
            'fortanix_api_key', secret('hr_scope', 'FORTANIX_API_KEY'),
            'key_id', 'fdc25a25-f75b-4a5a-83d6-002c8ed71fba'
        )
    ) from tokenized_data;

    Where,

    • fortanix_api_endpoint refers to the URL endpoint for the Fortanix DSM API.

    • fortanix_api_key refers to the Fortanix DSM API key of the app copied in Section 6.5: Copying the API Key.

    • key_id refers to the UUID of the tokenization security object used for detokenization. It must be the same key that was used to tokenize the column.

  10. Click the run button to execute the query and view the detokenized email column in the Raw results section:

    Figure 20: Detokenized Query