Match

Match indexes support full text search queries across one or more string fields.

Supported Types

  • string

Index Definition

Match indexes are defined with the "match" kind and must specify which fields to index on, along with some configuration that is common to all match index kinds (i.e. Match, DynamicMatch, and FieldDynamicMatch).

Options specific to Match indexes:

  • kind (required): must be set to "match".
  • fields (required): a list of one or more string fields to index on.

Common Options

Some options are common to all match index kinds (i.e. Match, DynamicMatch, and FieldDynamicMatch):

  • tokenFilters (required): a list of filters to apply to normalise tokens before indexing.
  • tokenizer (required): determines how input text is split into tokens.
  • filterSize (optional): the size of the backing bloom filter in bits. Defaults to 256.
  • filterTermBits (optional): the maximum number of bits set in the bloom filter per term. Defaults to 3.

tokenFilters

There are currently only two token filters available: downcase and upcase. These are used to normalise the text before indexing and are also applied to query terms. An empty array can be passed to tokenFilters if no normalisation of terms is required.
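
For example, to lower-case all tokens before indexing (using the same format as the full schema definition below):

"tokenFilters": [{ "kind": "downcase" }]

To apply no normalisation at all, pass an empty array:

"tokenFilters": []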

tokenizer

There are two tokenizers provided: standard and ngram. The standard tokenizer splits text into tokens wherever it matches the regular expression /[ ,;:!]/ (i.e. on spaces and common punctuation).
The ngram tokenizer splits the text into n-grams and accepts a configuration object that allows you to specify the tokenLength.
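
For example (the ngram form appears verbatim in the schema definition below; the standard form is assumed to follow the same "kind" convention):

"tokenizer": { "kind": "standard" }

"tokenizer": { "kind": "ngram", "tokenLength": 3 }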

filterSize and filterTermBits

filterSize and filterTermBits are optional fields for configuring bloom filters that back full text search.

filterSize is the size of the bloom filter in bits (also referred to as m). filterSize must be a power of 2 between 32 and 65536 and defaults to 256.

filterTermBits is the number of hash functions to use per term (also referred to as k). This determines the maximum number of bits that will be set in the bloom filter per term. filterTermBits must be an integer from 3 to 16 and defaults to 3.
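
For example, to use a 512-bit filter with 4 hash functions per term, an index definition could include the following (illustrative values chosen within the documented constraints):

"filterSize": 512,
"filterTermBits": 4

As with any bloom filter, a larger filterSize reduces the chance of false-positive matches at the cost of a larger index.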

Example JSON Schema Definition:

{
  // (type definition omitted)
  "indexes": {
    "nameAndJobTitle": {
      "kind": "match",
      "fields": ["name", "jobTitle"],
      "tokenFilters": [{ "kind": "downcase" }],
      "tokenizer": { "kind": "ngram", "tokenLength": 3 }
    }
  }
}

Caveats around N-Gram Tokenization

While using n-grams as a tokenization method allows greater flexibility when doing arbitrary substring matches, it is important to bear in mind the limitations of this approach. Specifically, searching for strings shorter than the tokenLength parameter will not generally work.

If you’re using the ngram tokenizer, a token that is already shorter than the tokenLength parameter will be kept as-is when indexed, so a search for that short token will match that record. However, if that same short string appears only as part of a longer token, it will not match that record. In general, therefore, you should ensure that the string you search for is at least as long as the tokenLength of the index, except in the specific case where you know that there are shorter tokens to match, and you are explicitly OK with not returning records that contain that short string only as part of a longer token.
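
To make the caveat concrete, here is a minimal sketch of n-gram tokenization in TypeScript. It is illustrative only (not the actual CipherStash tokenizer), but it follows the short-token behaviour described above:

function ngrams(text: string, tokenLength: number): string[] {
  // Tokens shorter than tokenLength are kept as-is.
  if (text.length < tokenLength) return [text]
  const tokens: string[] = []
  for (let i = 0; i + tokenLength <= text.length; i++) {
    tokens.push(text.slice(i, i + tokenLength))
  }
  return tokens
}

// Search terms are tokenized the same way as indexed text, so:
ngrams("ada", 3)     // => ["ada"] - matches a record indexed with the token "ada"
ngrams("adamant", 3) // => ["ada", "dam", "ama", "man", "ant"]
ngrams("ad", 3)      // => ["ad"] - "ad" is not among the 3-grams stored for
                     //    "adamant", so it will not match that record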

Supported Query Operations

  • match (performs full text search)

Examples

StashRB (Ruby)

StashJS (TypeScript)
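
A sketch of what a match query might look like with StashJS. The collection name ("employees"), the connection and collection-loading calls, and the shape of the result are assumptions for illustration, not verbatim API documentation; consult the StashJS documentation for the exact API:

import { Stash } from "@cipherstash/stashjs"

async function searchEmployees() {
  // Assumed connection and collection names - adjust to your setup.
  const stash = await Stash.connect()
  const employees = await stash.loadCollection("employees")

  // Full text search via the nameAndJobTitle match index defined in the
  // example schema. The downcase token filter is applied to the query
  // terms as well as the indexed text.
  return await employees.query(employee =>
    employee.nameAndJobTitle.match("Ada Lovelace")
  )
}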