Encryption Migrations

This how to guide explains how to migrate your plaintext data to be fully encrypted with CipherStash. It also explains the different encryption modes, and how they work. Finally, it explains how to clean up your plaintext data.

This how to guide covers:

Introduction

You have plaintext data you want to protect.

CipherStash helps you encrypt your data in your existing database, while keeping it searchable.

But what are the steps you need to follow to go from plaintext to fully encrypted?

CipherStash provides a way to do this without any application downtime, while keeping the encrypted and plaintext field values consistent.

CipherStash uses encryption modes to incrementally migrate your plaintext to ciphertext.

Say you have a patients table in your application that has an email column you want to protect.

All the values in that email column are currently in plaintext.

To encrypt that email column, you need to create two additional encrypted columns:

  1. A “source” column, which contains the encrypted value from an AES-GCM encryption operation
  2. An “index” column, which contains an encrypted index that can be used to search and order the encrypted value

The CipherStash driver can operate in 4 different modes:

  1. plaintext — the plaintext column is used for all CRUD operations.
  2. plaintext-duplicate — the plaintext column is used for all CRUD operations, but INSERTs and UPDATEs are also written to the encrypted columns.
  3. encrypted-duplicate — the encrypted columns are used for all CRUD operations, but INSERTs and UPDATEs are also written to the plaintext column.
  4. encrypted — the encrypted columns are used for all CRUD operations.

While it’s possible to do a knife-switch cutover from plaintext to encrypted modes, we recommend CipherStash users migrate their data by stepping through the modes incrementally. This builds confidence in the performance and availability of the encryption, and gives you opportunity to roll back.

This also allows for zero-downtime migrations.

Understanding the different encryption modes

When integrating CipherStash into your application, you create a dataset config which describes which fields you want to encrypt, what encrypted indexes to enable, and what encryption mode to use for each field.

The following is an example of a dataset config with a single field configured:

# dataset.yml
tables:
  - path: patients
    fields:
      - name: email
        in_place: false
        cast_type: utf8-str
        mode: plaintext-duplicate
        indexes:
          - version: 1
            kind: match
            tokenizer:
              kind: ngram
              token_length: 3
            token_filters:
            - kind: downcase
            k: 6
            m: 2048
            include_original: true
          - version: 1
            kind: ore
          - version: 1
            kind: unique

From the config, the field email of the type utf8-str is configured with match, ore and unique indexes. This field should be represented by the following database columns:

columndescription
emailplaintext source value for email
__email_encryptedencrypted source value for email
__email_matchencrypted match index for email
__email_oreencrypted ore index for email
__email_uniqueencrypted unique index for email

The encryption mode for this field is set to plaintext-duplicate.

Each field’s encryption mode can have one of the following values:

  • plaintext-duplicate
  • encrypted-duplicate
  • encrypted

With CipherStash, the driver is responsible for database reads, writes, query mapping, encryption, and decryption. Each encryption mode affects how the driver behaves.

Encryption Mode: plaintext-duplicate

columndescriptionoperations
emailplaintext source value for emailreads, writes, lookups
__email_encryptedencrypted source value for emailwrites
__email_matchencrypted match index for emailwrites
__email_oreencrypted ore index for emailwrites
__email_uniqueencrypted unique index for emailwrites

In plaintext-duplicate mode, the driver uses the plaintext source column for all read operations. However, the driver writes to the plaintext source, encrypted source, and the encrypted index columns.

plaintext-duplicate-reads

Figure: Mapping of SELECT statements and WHERE clauses for a field in plaintext-duplicate mode

The driver handles SELECT statements and WHERE clauses as follows:

  • When selecting for the field’s value, the driver retrieves the plaintext value from the plaintext column directly.
  • When the field is used as a filter in a WHERE clause, the driver will pass that through to the database.
  • In plaintext-duplicate mode, the driver does not use the encrypted source and index columns for read operations.

This works in the same way as plain SQL without the CipherStash driver.

plaintext-duplicate-writes

Figure: Mapping of INSERT and UPDATE statements for a field in plaintext-duplicate mode

The driver handles DB INSERT and UPDATE statements as follows. When inserting a new row, or updating the field, the driver will encrypt the field’s value and encrypted index. The driver will then write the plaintext value, the encrypted value, and the encrypted index values into each column respectively.

This encryption mode is used when first setting up CipherStash on an existing plaintext field. When initially adding the encrypted source column, and the encrypted indexes to the database their values have not been populated yet. This mode allows the application to keep working with the plaintext values during reads, and incrementally write the encrypted values, while also allowing you to bulk encrypt every record in the database in the background without causing application downtime.

Encryption Mode: encrypted-duplicate

In the following example, the encryption mode on the dob field is set to encrypted-duplicate

# dataset.yml
tables:
  - path: patients
    fields:
      - name: email
        in_place: false
        cast_type: utf8-str
        mode: encrypted-duplicate
        indexes:
          - version: 1
            kind: match
            tokenizer:
              kind: ngram
              token_length: 3
            token_filters:
            - kind: downcase
            k: 6
            m: 2048
            include_original: true
          - version: 1
            kind: ore
          - version: 1
            kind: unique
columndescriptionoperations
emailplaintext source value for emailwrites
__email_encryptedencrypted source value for emailreads, writes
__email_matchencrypted match index for emailwrites, match lookups (LIKE)
__email_oreencrypted ore index for emailwrites, ORE lookups (<, <=, >, >=)
__email_uniqueencrypted unique index for emailwrites, exact lookups (=)

In encrypted-duplicate mode, the CipherStash driver:

  • Uses the encrypted source column for retrieving the field value.
  • Uses the encrypted index columns for lookups by this field.
  • Writes to the plaintext source column, encrypted source column, and all index columns, in the same way as in the plaintext-duplicate mode.

encrypted-duplicate-reads

Figure: Mapping of SELECT statements and WHERE clauses for a field in encrypted-duplicate mode

The CipherStash driver handles SELECT statements and WHERE clauses as follows:

  • When selecting for the field’s value, the driver retrieves the ciphertext from the encrypted source column, and decrypts it.
  • When the field is used as a filter in a WHERE clause, the driver will encrypt the filter term, and map the filter to use the appropriate encrypted index column for the filter comparison.

To show how this works in practice, per the diagram above:

  • The clause email LIKE 'user%' is mapped to __email_match @> xxx, where xxx is the encrypted match term.
  • The clause email >= 'user@example.com' is mapped to __email_ore > xxx, where xxx is the encrypted ore term.
  • The clause email = 'user@example.com' is mapped to __email_unique = xxx, where xxx is the encrypted exact term.

Because this relies on the encrypted column and the encrypted index columns, the query will not work correctly if the table contains rows that have not been encrypted. This applies to rows where the encrypted source column and index columns are NULL. Such rows might incorrectly appear with NULL values, or might not appear in queries that filter based on the encrypted index.

Additionally, queries can fail if the required encrypted index columns are not configured. In the example above, supposed that the match index was left out, if you try to do a LIKE query on this field, the CipherStash driver will fail to map the query because a match index is required.

encrypted-duplicate-write

Figure: Mapping of INSERT and UPDATE statements for a field in encrypted-duplicate mode

In encrypted-duplicate mode, the CipherStash driver handles DB INSERT and UPDATE statements similar to plaintext-duplicate mode. When inserting a new row, or updating the field, the driver will encrypt the field’s value and encrypted index. The driver will then insert the plaintext value, the encrypted value, and the encrypted index values into each column respectively.

The encrypted-duplicate mode is used when all records in the table have been fully encrypted, and there are no more encrypted source column or index columns with NULL values. This mode makes the application operate fully on encrypted reads and writes, while still keeping the value in the plaintext column consistent and updated. This allows for the application to be tested extensively with fully encrypted read-writes. If there is a bug or broken functionality discovered at this stage, the application can easily be reverted to the plaintext-duplicate mode.

Encryption Mode: encrypted

In the following example, the field dob encryption mode is set to encrypted:

# dataset.yml
tables:
  - path: patients
    fields:
      - name: email
        in_place: false
        cast_type: utf8-str
        mode: encrypted
        indexes:
          - version: 1
            kind: match
            tokenizer:
              kind: ngram
              token_length: 3
            token_filters:
            - kind: downcase
            k: 6
            m: 2048
            include_original: true
          - version: 1
            kind: ore
          - version: 1
            kind: unique
columndescriptionoperations
__email_encryptedencrypted source value for emailreads, writes
__email_matchencrypted match index for emailwrites, match lookups (LIKE)
__email_oreencrypted ore index for emailwrites, ore lookups (<, <=, >, >=)
__email_uniqueencrypted unique index for emailwrites, exact lookups (=)

In encrypted mode, the CipherStash driver handles reads and writes similar to encrypted-duplicate mode:

  • The driver uses the encrypted source column for retrieving the field value.
  • The driver uses the encrypted index columns for query filters.
  • The driver writes to the encrypted source and the encrypted index columns.

The only difference between the encrypted and encrypted-duplicate mode is that in the encrypted mode, the driver no longer writes to the plaintext column.

encrypted-reads

Figure: Mapping of SELECT statements and WHERE clauses for a field in encrypted mode

The driver handles SELECT statements and WHERE clauses similar to encrypted-duplicate mode. When selecting for the field’s value, the driver retrieves the ciphertext from the encrypted source column, and decrypts it. When the field is used as a filter in a WHERE clause, the driver will encrypt the filter term, and map the filter to use the appropriate encrypted index column for the filter comparison.

encrypted-write

Figure: Mapping of INSERT and UPDATE statements for a field in encrypted mode

In this mode, the driver handles database INSERT and UPDATE statements similar to plaintext-duplicate and encrypted-duplicate modes. However, the driver no longer writes into the plaintext column.

When inserting a new row, or updating the field, the driver will encrypt the field’s value and encrypted index. The driver will then write the encrypted value and the encrypted index values into each column respectively.

This mode is used when the application has been verified to work correctly with CipherStash in encrypted-duplicate mode. When encrypted mode is used, the plaintext column is no longer used, and can be safely removed from the table.

Migrating a field from plaintext-duplicate mode to encrypted

Prerequisites

Before starting, please ensure that you:

  1. Have already created a CipherStash account (you can do this with the stash signup command)
  2. Have already installed Stash CLI in your local development environment
  3. Are connected to the Internet
  4. Have an application that integrates with CipherStash, either:
  5. Have an existing plaintext field that has been configured in plaintext-duplicate mode.

For this example, let’s call this field email on the users table.

Example use case: migrating users.email from plaintext-duplicate to encrypted

For this example, let’s assume that you have a field email on the table users, that have been configured in plaintext-duplicate mode. This field has the following configuration in your dataset config file.

# dataset.yml
tables:
  # other tables and fields above here
  - path: users
    fields:
      - name: email
        in_place: false
        cast_type: utf8-str
        mode: plaintext-duplicate
        indexes:
          - version: 1
            kind: match
            tokenizer:
              kind: ngram
              token_length: 3
            token_filters:
            - kind: downcase
            k: 6
            m: 2048
            include_original: true
          - version: 1
            kind: ore
          - version: 1
            kind: unique

We will also assume that the following columns have been created in your database:

  • email
  • __email_encrypted
  • __email_match
  • __email_ore
  • __email_unique

You can refer to the following tutorials on how to configure a field with CipherStash:

Encrypt all the records

In plaintext-duplicate mode, your application will write to all plaintext, encrypted, and index columns, while only reading from the plaintext column. Because of this, the application will work without issues even when retrieving records that have not been encrypted yet.

However, the application will not function correctly if it is switched to one of the other modes (plaintext-duplicate, encrypted-duplicate, or encrypted) at this stage. Before switching this field to either encrypted-duplicate or encrypted mode, we need to encrypt all records in the table.

The steps to take depends on which type of application you are working with.

Node.js and Sequelize

The following has been adapted from the example in Sequelize Getting Started - Part 2.

Use an example script like below:

// encrypt-data.js

const { sequelize } = require("./models");

/**
 * @type {import('sequelize').Sequelize['models']}
 */
const models = sequelize.models;

const PAGE_SIZE = 1000;

/**
 * @param {import('sequelize').Model} model
 */
async function forceSave(model) {
  // Since we want to encrypt all fields list them all here as changed (even if nothing was)
  model.changed("email", true);
  return await model.save();
}

(async () => {
  let offset = 0;

  // Iterate through all the patients in batches and save them
  while (true) {
    const users = await models.user.findAll({
      offset,
      limit: PAGE_SIZE,
      order: [[ 'id', 'ASC' ]]
    })

    if (users.length === 0) {
      break;
    }

    await Promise.all(users.map(forceSave))

    offset += PAGE_SIZE;
  }

  console.log("Migration successful!")
})();

And then, run it with the following:

node encrypt-data.js

Rails

The following has been adapted from the example in Rails Getting Started - Part 2.

Do the following for Rails app:

rails 'cipherstash:migrate[users]'

Migrating to encrypted-duplicate mode

Now that all the records on the table have been encrypted, it is time to migrate to encrypted-duplicate mode.

First, update encryption mode in the dataset config file to encrypted-duplicate.

# dataset.yml
tables:
# snip
  - path: users
    fields:
      - name: email
        in_place: false
        cast_type: utf8-str
        mode: encrypted-duplicate # updated
# snip

Next, upload the new dataset config to CipherStash.

stash datasets config upload --file dataset.yml --client-id $CS_CLIENT_ID --client-key $CS_CLIENT_KEY

At this point, the CipherStash driver only picks up the dataset config changes from CipherStash, during application start. In order for your application to pick up the updated config, you need to restart your application. This step depends on how your application runtime is managed, and is left as an exercise to the reader.

Once your application restarts, it should read and query for the users.email field using the encrypted source column and the encrypted index columns. At the same time, your application will continue to write to the plaintext source column. This is a good time to test your application to ensure that encrypted reads do not break any functionality, or cause any bugs. If you discovered that some functionality in your application is broken when you switched this field to encrypted-duplicate mode, it is still safe to revert to reading from plaintext.

To do that:

  1. Revert the field encryption mode in the dataset config file to plaintext-duplicate
  2. Upload the dataset config to CipherStash
  3. Restart your application

Otherwise, if your application works correctly in encrypted-duplicate mode, you are all set to migrate to the next mode.

Migrating to encrypted mode

At this point, all the records on the table have been encrypted. You have also verified that your application still works when you switch the users.email field to encrypted-duplicate mode. Your application is now ready to switch this field to encrypted mode.

As before, update the encryption mode in the dataset config file to encrypted.

# dataset.yml
tables:
# snip
  - path: users
    fields:
      - name: email
        in_place: false
        cast_type: utf8-str
        mode: encrypted # updated
# snip

Next, upload the new dataset config to CipherStash.

stash datasets config upload --file dataset.yml --client-id $CS_CLIENT_ID --client-key $CS_CLIENT_KEY

Finally, restart your application.

Once your application restarts, it will no longer write any values to the plaintext column you started with. All reads and writes should only depend on the encrypted source column and the encrypted index columns.

Cleaning up the plaintext column

At this point, it should be safe to drop the plaintext column from the database. The steps will depend on how database migrations are handled in your application.