Protect sensitive data in AI retrieval-augmented generation pipelines with encrypted vector storage and searchable encryption

Securing AI and RAG pipelines

Retrieval-Augmented Generation (RAG) pipelines commonly store sensitive documents alongside vector embeddings. Without encryption, this data is exposed at rest and during retrieval — creating a significant attack surface. CipherStash lets you encrypt sensitive content while preserving the ability to search and retrieve it.

The problem

RAG architectures typically store:

Document chunks: the original text, often containing PII, financial data, or confidential business information
Metadata: source references, user associations, access tags
Vector embeddings: numeric representations used for similarity search

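If any of this data is exfiltrated from the database, the plaintext content is immediately exposed. Disk-level encryption at rest does not help in that scenario: the database decrypts data transparently for every query, so an attacker with database access or stolen credentials reads plaintext.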

Encrypting RAG context data

Use the Encryption SDK to encrypt sensitive fields before storing them alongside your embeddings.

Define a schema for your documents

schema.ts

import { encryptedTable, encryptedColumn } from "@cipherstash/stack/schema"

export const documents = encryptedTable("documents", {
  content: encryptedColumn("content")
    .freeTextSearch(),
  source: encryptedColumn("source")
    .equality(),
  userId: encryptedColumn("user_id")
    .equality(),
})

Encrypt before storage

ingest.ts

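import { Pool } from "pg"
import { Encryption } from "@cipherstash/stack"
import { documents } from "./schema"

// Assumption: db is a node-postgres Pool configured through the standard PG* environment variables
const db = new Pool()
const client = await Encryption({ schemas: [documents] })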

async function ingestDocument(doc: { content: string; source: string; userId: string; embedding: number[] }) {
  const encryptedContent = await client.encrypt(doc.content, {
    column: documents.content,
    table: documents,
  })

  const encryptedSource = await client.encrypt(doc.source, {
    column: documents.source,
    table: documents,
  })

  const encryptedUserId = await client.encrypt(doc.userId, {
    column: documents.userId,
    table: documents,
  })

  if (encryptedContent.failure || encryptedSource.failure || encryptedUserId.failure) {
    throw new Error("Encryption failed")
  }

  // Store encrypted fields alongside the vector embedding
  await db.query(
    `INSERT INTO documents (content, source, user_id, embedding)
     VALUES ($1::jsonb, $2::jsonb, $3::jsonb, $4)`,
    [encryptedContent.data, encryptedSource.data, encryptedUserId.data, JSON.stringify(doc.embedding)]
  )
}
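
The three encrypt calls above are independent of each other, so they can run concurrently and report which field failed. The following is a minimal sketch, assuming only the result shape used above (data on success, failure on error). The names EncryptFn and encryptFields are hypothetical and not part of the SDK; the encrypt argument is an adapter you would write around client.encrypt.

```typescript
// Result shape assumed from the ingest example: `failure` is set on error, `data` on success.
type EncryptResult = { data?: unknown; failure?: unknown }

// Hypothetical adapter type: wraps client.encrypt so it takes a field name and a plaintext.
type EncryptFn = (field: string, plaintext: string) => Promise<EncryptResult>

// Encrypt every field concurrently; throw an error that names the fields that failed.
async function encryptFields<K extends string>(
  fields: Record<K, string>,
  encrypt: EncryptFn,
): Promise<Record<K, unknown>> {
  const names = Object.keys(fields) as K[]
  const results = await Promise.all(
    names.map(async (name) => ({ name, result: await encrypt(name, fields[name]) })),
  )

  const failed = results.filter((r) => r.result.failure).map((r) => r.name)
  if (failed.length > 0) {
    throw new Error(`Encryption failed for: ${failed.join(", ")}`)
  }

  // Collect the encrypted payloads under the same keys as the input.
  const out = {} as Record<K, unknown>
  for (const { name, result } of results) {
    out[name] = result.data
  }
  return out
}
```

The returned object can then supply the query parameters in place of the three separate result variables.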

Decrypt retrieved context

After vector similarity search retrieves relevant documents, decrypt the content before passing it to the LLM:

retrieve.ts

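import { Pool } from "pg"
import { Encryption } from "@cipherstash/stack"
import { documents } from "./schema"

// Assumption: db is a node-postgres Pool configured through the standard PG* environment variables
const db = new Pool()
const client = await Encryption({ schemas: [documents] })

async function retrieveContext(queryEmbedding: number[], topK = 5) {
  // Vector similarity search returns encrypted rows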
  const results = await db.query(
    `SELECT content, source FROM documents
     ORDER BY embedding <-> $1
     LIMIT $2`,
    [JSON.stringify(queryEmbedding), topK]
  )

  // Decrypt the content and source of each result; a failed decryption yields null
  const decryptedDocs = await Promise.all(
    results.rows.map(async (row) => {
      const content = await client.decrypt(row.content)
      const source = await client.decrypt(row.source)
      return {
        content: content.failure ? null : content.data,
        source: source.failure ? null : source.data,
      }
    })
  )

  return decryptedDocs.filter((doc) => doc.content !== null)
}
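
Both the ingest and retrieval queries pass the embedding through JSON.stringify, which silently turns NaN and Infinity into null and lets the error surface only at the database. The sketch below validates the embedding first and then formats it. It assumes the embedding column is a pgvector vector column, which the <-> operator suggests but the examples do not state; toVectorLiteral and its expectedDim parameter are hypothetical names.

```typescript
// Hypothetical helper: format an embedding as a pgvector text literal such as "[0.5,1,-2]".
// Assumes the target column is a pgvector `vector`; validation runs before the query is sent.
function toVectorLiteral(embedding: number[], expectedDim?: number): string {
  if (embedding.length === 0) {
    throw new Error("Embedding is empty")
  }
  if (expectedDim !== undefined && embedding.length !== expectedDim) {
    throw new Error(`Expected ${expectedDim} dimensions, got ${embedding.length}`)
  }
  // JSON.stringify would emit `null` for these values, so reject them explicitly.
  if (!embedding.every((x) => Number.isFinite(x))) {
    throw new Error("Embedding contains NaN or Infinity")
  }
  return `[${embedding.join(",")}]`
}
```

In the examples above, toVectorLiteral(doc.embedding) or toVectorLiteral(queryEmbedding) would replace the JSON.stringify call.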

Searchable encrypted retrieval

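When you need to filter documents by metadata before or alongside vector search, use searchable encryption with EQL. In the query below, $1 must be an encrypted search term produced on the client for the user_id column, not the plaintext user ID: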

-- Find documents for a specific user using encrypted equality search
SELECT content, source, embedding
FROM documents
WHERE eql_v2.eq(user_id, $1)
ORDER BY embedding <-> $2
LIMIT 10;

This combines encrypted metadata filtering with vector similarity — without ever decrypting the metadata in the database.
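
From application code, the query above might be assembled as follows. This is a sketch under assumptions: buildUserScopedSearch is a hypothetical helper, encryptedUserId stands for an encrypted search term the client has already produced (how that term is created depends on the SDK and is not shown here), and the returned object uses the text and values fields that node-postgres accepts as a query config.

```typescript
// Parameterized query in the { text, values } shape accepted by node-postgres `query`.
interface SearchQuery {
  text: string
  values: unknown[]
}

// Hypothetical helper: builds the user-scoped similarity search shown above.
// `encryptedUserId` must already be an encrypted search term, never the plaintext ID.
function buildUserScopedSearch(
  encryptedUserId: unknown,
  queryEmbedding: number[],
  topK = 10,
): SearchQuery {
  if (!Number.isInteger(topK) || topK < 1) {
    throw new Error("topK must be a positive integer")
  }
  return {
    text: `SELECT content, source, embedding
           FROM documents
           WHERE eql_v2.eq(user_id, $1)
           ORDER BY embedding <-> $2
           LIMIT $3`,
    // The embedding is serialized the same way as in the earlier examples.
    values: [encryptedUserId, JSON.stringify(queryEmbedding), topK],
  }
}
```

The result could be passed directly to db.query, and the returned rows decrypted as in retrieve.ts.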

Benefits for AI pipelines

Sensitive context stays encrypted: document chunks containing PII or confidential data are never stored in plaintext
Compliance support: field-level encryption helps satisfy the data protection requirements of GDPR, HIPAA, and SOC 2
Selective decryption: only decrypt what the LLM needs, reducing exposure surface
Audit trail: track who retrieved which documents and when using identity-aware encryption
