Proximity searches#
Proximity indices allow Cozo to perform fast searches for similar data. The HNSW index is a graph-based index for fast approximate nearest-neighbour searches over vectors. The MinHash-LSH index is a locality-sensitive hashing index for fast near-duplicate searches. The FTS index is a full-text search index for fast string matches.
MinHash-LSH for near-duplicate indexing of strings and lists#
To use locality-sensitive search on a relation containing string values, for example:
:create table {k: String => v: String?}
you can create a MinHash-LSH index on the v field by:
::lsh create table:index_name {
    extractor: v,
    extract_filter: !is_null(v),
    tokenizer: Simple,
    filters: [],
    n_perm: 200,
    target_threshold: 0.7,
    n_gram: 3,
    false_positive_weight: 1.0,
    false_negative_weight: 1.0,
}
This creates a MinHash-LSH index on the v field of the table. The index configuration includes the following parameters:

- extractor: v specifies that the v field is used as the feature extractor. This parameter takes an expression, which must evaluate to a string, a list of values to be indexed, or null. If it evaluates to null, the row is not indexed.
- extract_filter: !is_null(v) is superfluous in this case, but in more general situations you can use it to skip indexing rows based on arbitrary logic.
- tokenizer: Simple and filters: [] specify the tokenizer and token filters to be used; see a later section on tokenizers.
- n_perm: 200 sets the number of permutations for the MinHash-LSH index. Higher values give more accurate results at the cost of increased CPU and storage usage.
- target_threshold: 0.7 sets the target similarity threshold used when searching.
- n_gram: 3 sets the size of the n-gram used for shingling.
- false_positive_weight: 1.0 and false_negative_weight: 1.0 set the relative weights of false positives and false negatives.
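To have something to search, populate the relation as usual; the index is maintained automatically. A minimal sketch with illustrative keys and texts (the last row is not indexed because the extractor evaluates to null):

?[k, v] <- [['a', 'The quick brown fox jumps over the lazy dog'],
            ['b', 'A quick brown fox jumped over two lazy dogs'],
            ['c', null]]
:put table {k => v}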
At search time:
?[k, v] := ~table:index_name {k, v |
    query: $q,
    k: 2,
    filter: 1 + 1 == 2,
}
This will look for the top 2 most similar values to the query q. The filter parameter is evaluated on the bindings for the relation, and only rows for which the filter evaluates to true are returned, before restricting to k results. The query parameter is a string and is subject to the same tokenization process as the indexed values.
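The filter expression may refer to the bound variables. As a sketch, the following keeps only rows whose key starts with an illustrative prefix before taking the top 5:

?[k, v] := ~table:index_name {k, v |
    query: $q,
    k: 5,
    filter: starts_with(k, 'doc_'),
}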
In addition to strings, you can index and search for lists of arbitrary values. In this case, the tokenizer, filters and n_gram parameters are ignored.
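As a sketch, assuming a relation items {k: String => tags: [String]} (both names illustrative), an LSH index over the list-valued column could be created as follows; tokenizer, filters and n_gram are included here only for completeness, since they are ignored for list values:

::lsh create items:by_tags {
    extractor: tags,
    extract_filter: !is_null(tags),
    tokenizer: Simple,
    filters: [],
    n_perm: 200,
    target_threshold: 0.6,
    n_gram: 3,
    false_positive_weight: 1.0,
    false_negative_weight: 1.0,
}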
Again, you can use the associated index relations as normal relations in your query. There are now two: table:index_name and table:index_name:inv. You can use ::columns to look at their structure. In our case, the first is:
{
    hash: Bytes,
    src_k: String,
}
and the second is:
{
    k: String => minhash: List[Bytes]
}
The first is more useful: it loosely groups together duplicates according to the indexing parameters.
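Since rows that land in the same bucket are candidate near-duplicates, one way to sketch a listing of candidate pairs is a self-join on hash (the pairs are approximate and may need further verification):

?[a, b] := *table:index_name{hash, src_k: a},
           *table:index_name{hash, src_k: b},
           a < b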
To drop:
::lsh drop table:index_name
Full-text search (FTS)#
Full-text search should be familiar. For the following relation:
:create table {k: String => v: String?}
we can create an FTS index by:
::fts create table:index_name {
    extractor: v,
    extract_filter: !is_null(v),
    tokenizer: Simple,
    filters: [],
}
This creates an FTS index on the v field of the table. The index configuration includes the following parameters:

- extractor: v specifies that the v field is used as the feature extractor. This parameter takes an expression, which must evaluate to a string or null. If it evaluates to null, the row is not indexed.
- extract_filter: !is_null(v) is superfluous in this case, but in more general situations you can use it to skip indexing rows based on arbitrary logic.
- tokenizer: Simple and filters: [] specify the tokenizer and token filters to be used; see a later section on tokenizers.
That’s it. At query time:
?[s, k, v] := ~table:index_name {k, v |
    query: $q,
    k: 10,
    filter: 1 + 1 == 2,
    score_kind: 'tf_idf',
    bind_score: s
}
:order -s
This query retrieves the top 10 results from the index index_name based on the search query $q. The filter parameter can be used to filter the results further based on additional conditions. The score_kind parameter specifies the scoring method; here it is set to 'tf_idf', which takes global statistics into consideration when scoring documents ('tf' is also available). The resulting scores are bound to the variable s, and the results are ordered in descending order of score (:order -s).
The search query must be a string and is processed by the same tokenizer and filters as the index, as specified by the tokenizer and filters parameters. There is a mini-language for writing queries:
- hello world, hello AND world, "hello" AND 'world': these all look for rows where both words occur. AND is case-sensitive.
- hello OR world: looks for rows where either word occurs.
- hello NOT world: looks for rows where hello occurs but world does not.
- hell* wor*: looks for rows having a word starting with hell and also a word starting with wor.
- NEAR/3(hello world bye): looks for rows where hello, world and bye occur within 3 words of each other. You can write NEAR(hello world bye) as a shorthand for NEAR/10(hello world bye).
- hello^2 OR world: looks for rows where hello or world occurs, but hello has twice its usual weight when scoring.

These can be combined and nested with parentheses (except that NEAR only takes literals and prefixes): hello AND (world OR bye).
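As a sketch, a search combining these operators might look like the following; the query string is illustrative:

?[k, v, s] := ~table:index_name {k, v |
    query: 'quick AND (fox* OR dog*)',
    k: 5,
    score_kind: 'tf_idf',
    bind_score: s
}
:order -s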
The index relation has the following schema:
{
    word: String,
    src_k: String,
    =>
    offset_from: List[Int],
    offset_to: List[Int],
    position: List[Int],
    total_length: Int,
}
Explanation of the fields:

- word: the word that occurs in the document.
- src_k: the key of the document; the name and number of key columns vary according to the original relation schema.
- offset_from: the starting offsets of the word in the document.
- offset_to: the ending offsets of the word in the document.
- position: the positions of the word in the document, counted in whole tokens.
- total_length: the total number of tokens in the document.
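Since the index relation is a normal relation, you can query it directly. As a sketch, the following counts how many times an illustrative word occurs in each document, using the length of its position list:

?[src_k, occurrences] := *table:index_name{word: 'fox', src_k, position},
                         occurrences = length(position)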
To drop:
::fts drop table:index_name
Text tokenization and filtering#
Text tokenization and filtering are used in both the MinHash-LSH and FTS indices. The tokenizer is specified by the tokenizer parameter, and the filters parameter can be used to specify additional filters to be applied to the tokens.

CozoDB uses Tantivy's tokenizers and filters (their files are incorporated directly into our source tree, as they are not available as a separate library). The tokenizer is specified in the configuration as a function call such as Ngram(9); if you omit all arguments, plain Ngram is also acceptable. The following tokenizers are available:
- Raw: no tokenization; the entire string is treated as a single token.
- Simple: splits on whitespace and punctuation.
- Whitespace: splits on whitespace.
- Ngram(min_gram?, max_gram?, prefix_only?): splits into n-grams. min_gram is the minimum size of the n-gram (default 1), max_gram is the maximum size (defaults to min_gram), and prefix_only is a boolean indicating whether to generate only n-grams that are prefixes of the text (default false).
- Cangjie(kind?): a text segmenter for the Chinese language. kind can be 'default', 'all', 'search' or 'unicode'.
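As a sketch (the index name is illustrative), an FTS index using the Ngram tokenizer with explicit arguments, producing lowercased 2- to 4-grams, might be configured like this:

::fts create table:ngram_idx {
    extractor: v,
    extract_filter: !is_null(v),
    tokenizer: Ngram(2, 4, false),
    filters: [Lowercase],
}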
After tokenization, multiple filters can be applied to the tokens. The following filters are available:

- Lowercase: converts all tokens to lowercase.
- AlphaNumOnly: removes all tokens that are not alphanumeric.
- AsciiFolding: converts all tokens to ASCII (lossily), e.g. passé becomes passe.
- Stemmer(lang): uses a language-specific stemmer. The following languages are available: 'arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hungarian', 'italian', 'norwegian', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish', 'tamil', 'turkish'. As an example, the English stemmer converts 'running' to 'run'.
- Stopwords(lang): filters out stopwords specific to the language. The stopwords come from the stopwords-iso project; use the ISO 639-1 code as specified on the project page. For example, Stopwords('en') for English removes words such as 'the', 'a', 'an', etc.
For English text, the recommended setup is Simple for the tokenizer and [Lowercase, Stemmer('english'), Stopwords('en')] for the filters.
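Putting this together, a sketch of an FTS index with this recommended English configuration (the index name is illustrative):

::fts create table:en_idx {
    extractor: v,
    extract_filter: !is_null(v),
    tokenizer: Simple,
    filters: [Lowercase, Stemmer('english'), Stopwords('en')],
}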