Cozo database documentation#

Welcome to the documentation for CozoDB, a feature-rich, transactional, relational database that uses Datalog for queries, focuses on graph data and algorithms, is capable of time travel, and makes no performance compromises.

You can start immediately with the Tutorial, which you can follow without installing anything.

For installation instructions and language-specific APIs, please refer to the GitHub page.

Tutorial#

This tutorial will teach you the basics of using the Cozo database.

You should open Cozo in Wasm in a separate browser window and run the queries as you go along. If you are familiar with the Python data science stack, you should follow the instructions here instead to run this notebook in Jupyter. The source of this file, a Jupyter notebook, can be downloaded from here.

The following cell is to set up the Jupyter notebook. Ignore if you are using the browser.

[1]:
%load_ext pycozo.ipyext_direct

First steps#

Cozo is a relational database. The “hello world” query

[2]:
?[] <- [['hello', 'world', 'Cozo!']]
[2]:
  _0 _1 _2
0 hello world Cozo!

simply passes an ad hoc relation represented by a list of lists to the database, and the database echoes it back.

You can pass more rows, or a different number of columns:

[3]:
?[] <- [[1, 2, 3], ['a', 'b', 'c']]
[3]:
  _0 _1 _2
0 1 2 3
1 a b c

This example shows how to enter literals for numbers, strings, booleans and null:

[4]:
?[] <- [[1.5, 2.5, 3, 4, 5.5],
        ['aA', 'bB', 'cC', 'dD', 'eE'],
        [true, false, null, -1.4e-2, "A string with double quotes"]]
[4]:
  _0 _1 _2 _3 _4
0 True False None -0.014000 A string with double quotes
1 1.500000 2.500000 3 4 5.500000
2 aA bB cC dD eE

The input literal representations are similar to those in JavaScript. In particular, strings in double quotes are guaranteed to be interpreted in the same way as in JSON. The output is in JSON, but how it appears on your screen depends on your setup. For example, if you are using a Python setup, booleans are displayed as True and False instead of in lowercase, since they are converted to the native Python datatypes for display.

In the last example, you may have noticed that the returned order is not the same as the input order. This is because in Cozo relations are always stored as trees, and trees are always sorted.

Another consequence of storing relations as trees is that there can be no duplicate rows:

[5]:
?[] <- [[1], [2], [1], [2], [1]]
[5]:
  _0
0 1
1 2

Relations in Cozo follow set semantics, where de-duplication is automatic. By contrast, SQL usually follows bag semantics (some databases implement this by secretly attaching a unique internal key to every row; in Cozo you must add such a key explicitly if you need to simulate duplicate rows).

Why does Cozo break tradition and go with set semantics? Set semantics is much more convenient when recursion between relations is involved, and Cozo is designed to deal with complex recursion.
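If you do need to keep duplicates, a minimal sketch (not one of the notebook's cells) is to add an explicit, distinct key column, here a made-up running number, so that otherwise identical rows are no longer equal:

?[] <- [[1, 'x'], [2, 'x'], [3, 'x']]

All three rows survive de-duplication, since they differ in their first column.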

Expressions#

The next example shows the use of various expressions and comments:

[6]:
?[] <- [[
            1 + 2, # addition
            3 / 4, # division
            5 == 6, # equality
            7 > 8, # greater
            true || false, # or
            false && true, # and
            lowercase('HELLO'), # function
            rand_float(), # function taking no argument
            union([1, 2, 3], [3, 4, 5], [5, 6, 7]), # variadic function
        ]]
[6]:
  _0 _1 _2 _3 _4 _5 _6 _7 _8
0 3 0.750000 False False True False hello 0.342518 [1, 2, 3, 4, 5, 6, 7]

See here for the full list of functions you can use to build expressions. The syntax is deliberately C-like.

Rules and relations#

Previous examples all start with ?[] <-. The notation denotes a rule named ?. It is a constant rule since it uses <-. The part after ? and before <- is the head of the rule, and the part after <- is the body.

Rules can have other names, but the rule named ? is special in that its evaluation determines the return relation of the query.

We can give bindings in the head of rules:

[7]:
?[first, second, third] <- [[1, 2, 3], ['a', 'b', 'c']]
[7]:
  first second third
0 1 2 3
1 a b c

For constant rules, the number of bindings must match the actual data (the arity), otherwise, you will get an error:

[8]:
?[first, second] <- [[1, 2, 3], ['a', 'b', 'c']]
[8]:
parser::fixed_rule_head_arity_mismatch

  × Fixed rule head arity mismatch
   ╭────
 1 │ ?[first, second] <- [[1, 2, 3], ['a', 'b', 'c']]
   · ─────────────────────────────────────────────────
   ╰────
  help: Expected arity: 3, number of arguments given: 2

Now let’s define rules that apply (use) other rules:

[9]:
rule[first, second, third] <- [[1, 2, 3], ['a', 'b', 'c']]
?[a, b, c] := rule[a, b, c]
[9]:
  a b c
0 1 2 3
1 a b c

The first line defines a constant rule named rule. The ? rule is now an inline rule, denoted by the connecting symbol :=. In its body it applies the rule named rule, by giving the name of the rule followed by three fresh bindings: the variables a, b and c.

With inline rules, you can manipulate the order of the columns, or specify which columns are returned:

[10]:
rule[first, second, third] <- [[1, 2, 3], ['a', 'b', 'c']]
?[c, b] := rule[a, b, c]
[10]:
  c b
0 3 2
1 c b

The body of an inline rule consists of atoms. The previous example has a single application atom as the body. Multiple atoms are connected by commas:

[11]:
?[c, b] := rule[a, b, c], is_num(a)
rule[first, second, third] <- [[1, 2, 3], ['a', 'b', 'c']]
[11]:
  c b
0 3 2

Here the second atom is an expression is_num(a). Expression atoms act as filters, so only rows for which the expression evaluates to true are returned. The order in which the rules are given is immaterial.

You can also bind constants to rule applications directly:

[12]:
rule[first, second, third] <- [[1, 2, 3], ['a', 'b', 'c']]
?[c, b] := rule['a', b, c]
[12]:
  c b
0 c b

You introduce additional bindings with the unification operator =:

[13]:
rule[first, second, third] <- [[1, 2, 3], ['a', 'b', 'c']]
?[c, b, d] := rule[a, b, c], is_num(a), d = a + b + 2*c
[13]:
  c b d
0 3 2 9

Having multiple rule applications in the body generates every combination of the bindings:

[14]:
r1[] <- [[1, 'a'], [2, 'b']]
r2[] <- [[2, 'B'], [3, 'C']]

?[l1, l2] := r1[a, l1],
             r2[b, l2]
[14]:
  l1 l2
0 a B
1 a C
2 b B
3 b C

This is a Cartesian product in relational algebra, since the bindings in the two rule applications are all distinct. If a binding is reused, implicit unification occurs, which corresponds to an inner join in relational algebra:

[15]:
r1[] <- [[1, 'a'], [2, 'b']]
r2[] <- [[2, 'B'], [3, 'C']]

?[l1, l2] := r1[a, l1],
             r2[a, l2] # reused `a`
[15]:
  l1 l2
0 b B

The unification = unifies with a single value. Use in to unify with each value within a list in turn:

[16]:
?[x, y] := x in [1, 2, 3], y in ['x', 'y']
[16]:
  x y
0 1 x
1 1 y
2 2 x
3 2 y
4 3 x
5 3 y

The head of inline rules does not need to use all variables that appear in the body. However, any variable in the head must appear in the body at least once (the safety rule):

[17]:
r1[] <- [[1, 'a'], [2, 'b']]
r2[] <- [[2, 'B'], [3, 'C']]

?[l1, l2, x] := r1[a, l1],
                r2[a, l2]
[17]:
eval::unbound_symb_in_head

  × Symbol 'x' in rule head is unbound
   ╭─[3:1]
 3 │
 4 │ ?[l1, l2, x] := r1[a, l1],
   · 
 5 │                 r2[a, l2]
   ╰────
  help: Note that symbols occurring only in negated positions are not considered bound

Stored relations#

Persistent relations in Cozo are called stored relations. Creating them is simple:

[18]:
r1[] <- [[1, 'a'], [2, 'b']]
r2[] <- [[2, 'B'], [3, 'C']]

?[l1, l2] := r1[a, l1],
             r2[a, l2]

:create stored {l1, l2}
[18]:
  status
0 OK

The :create query option instructs the database to store the result in a stored relation named stored, containing the columns l1 and l2.

If you just want to create the relation without adding any data, you can omit the query entirely: there is no need to write an empty ? query.
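For example, a query consisting of nothing but a :create option (with a hypothetical relation name) is valid on its own and creates an empty relation:

:create another_stored {l1, l2}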

You can verify that you now have the required stored relation in your system by running a system op:

[19]:
::relations
[19]:
  name arity access_level n_keys n_non_keys n_put_triggers n_rm_triggers n_replace_triggers
0 stored 2 normal 2 0 0 0 0

You can also investigate the columns of the stored relation:

[20]:
::columns stored
[20]:
  column is_key index type has_default
0 l1 True 0 Any? False
1 l2 True 1 Any? False

Stored relations can be applied by using an asterisk * before the name:

[21]:
?[a, b] := *stored[a, b]
[21]:
  a b
0 b B

Unlike relations defined inline, the columns of stored relations have fixed names. You can take advantage of this by selectively referring to columns by name, especially if you have a lot of columns:

[22]:
?[a, b] := *stored{l2: b, l1: a}
[22]:
  a b
0 b B

If the binding is the same as the column name, you can omit the binding:

[23]:
?[l2] := *stored{l2}
[23]:
  l2
0 B

After a stored relation is created, use :put to insert more data:

[24]:
?[l1, l2] <- [['e', 'E']]

:put stored {l1, l2}
[24]:
  status
0 OK
[25]:
?[l1, l2] := *stored[l1, l2]
[25]:
  l1 l2
0 b B
1 e E

Use :rm to remove data:

[26]:
?[l1, l2] <- [['e', 'E']]

:rm stored {l1, l2}
[26]:
  status
0 OK
[27]:
?[l1, l2] := *stored[l1, l2]
[27]:
  l1 l2
0 b B

Use ::remove (double colon!) to get rid of a stored relation:

[28]:
::remove stored
[28]:
  status
0 OK
[29]:
::relations
[29]:
  name arity access_level n_keys n_non_keys n_put_triggers n_rm_triggers n_replace_triggers

So far, all stored relations store all the data in their keys. You can instruct Cozo to only treat some of the data as keys, thereby indicating a functional dependency:

[30]:
?[a, b, c] <- [[1, 'a', 'A'],
               [2, 'b', 'B'],
               [3, 'c', 'C'],
               [4, 'd', 'D']]

:create fd {a, b => c}
[30]:
  status
0 OK
[31]:
?[a, b, c] := *fd[a, b, c]
[31]:
  a b c
0 1 a A
1 2 b B
2 3 c C
3 4 d D

Now if you insert another row with an existing key, that row will be updated:

[32]:
?[a, b, c] <- [[3, 'c', 'CCCCCCC']]

:put fd {a, b => c}
[32]:
  status
0 OK
[33]:
?[a, b, c] := *fd[a, b, c]
[33]:
  a b c
0 1 a A
1 2 b B
2 3 c CCCCCCC
3 4 d D

You can check whether a column is in a key position by looking at the is_key column in the output of:

[34]:
::columns fd
[34]:
  column is_key index type has_default
0 a True 0 Any? False
1 b True 1 Any? False
2 c False 2 Any? False

You may have noticed that columns also have types and default values associated with them, and stored relations can have triggers. These are discussed in detail here.

Before continuing, let’s remove the stored relation we introduced:

[35]:
::remove fd
[35]:
  status
0 OK

Graphs#

Now let’s consider a graph stored as a relation:

[36]:
?[loving, loved] <- [['alice', 'eve'],
                     ['bob', 'alice'],
                     ['eve', 'alice'],
                     ['eve', 'bob'],
                     ['eve', 'charlie'],
                     ['charlie', 'eve'],
                     ['david', 'george'],
                     ['george', 'george']]

:replace love {loving, loved}
[36]:
  status
0 OK

The graph we have created reads like “Alice loves Eve, Bob loves Alice”, “nobody loves David, David loves George, but George only loves himself”, and so on. Here we used :replace instead of :create. The difference is that if love already exists, it will be wiped and replaced with the new data given.

We can investigate competing interests:

[37]:
?[loved_by_b_e] := *love['eve', loved_by_b_e],
                   *love['bob', loved_by_b_e]
[37]:
  loved_by_b_e
0 alice

So far, our rule bodies have consisted of conjunctions of atoms only. Disjunction is also available, using the or keyword:

[38]:
?[loved_by_b_e] := *love['eve', loved_by_b_e] or *love['bob', loved_by_b_e],
                   loved_by_b_e != 'bob',
                   loved_by_b_e != 'eve'
[38]:
  loved_by_b_e
0 alice
1 charlie

Another way to write the same query is to have multiple rule definitions under the same name:

[39]:
?[loved_by_b_e] := *love['eve', loved_by_b_e],
                   loved_by_b_e != 'bob',
                   loved_by_b_e != 'eve'
?[loved_by_b_e] := *love['bob', loved_by_b_e],
                   loved_by_b_e != 'bob',
                   loved_by_b_e != 'eve'
[39]:
  loved_by_b_e
0 alice
1 charlie

When you have multiple definitions of the same inline rule, the rule heads must be compatible. Only inline rules can have multiple definitions.

Negation#

Negation of expressions should be familiar:

[40]:
?[loved] := *love[person, loved], !ends_with(person, 'e')
[40]:
  loved
0 alice
1 george

Rule applications can also be negated, not with the ! operator, but with the not keyword:

[41]:
?[loved_by_e_not_b] := *love['eve', loved_by_e_not_b],
                       not *love['bob', loved_by_e_not_b]
[41]:
  loved_by_e_not_b
0 bob
1 charlie

There are two sets of logical operations in Cozo, one set that acts on the level of expressions, and another set that acts on the level of atoms:

  • For atoms: , or and (conjunction), or (disjunction), not (negation)

  • For expressions: && (conjunction), || (disjunction), ! (negation)

The difference between , and and is operator precedence: and has higher precedence than or, whereas , has lower precedence than or.
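As a sketch against the *love relation above (not one of the notebook's cells; run the two queries separately), the two conjunctions group differently, and against the love graph they return different rows:

# `,` has the lowest precedence, so this reads as
#   *love['eve', x], ( *love['bob', x] or *love['alice', x] )
?[x] := *love['eve', x], *love['bob', x] or *love['alice', x]

# `and` binds tighter than `or`, so this reads as
#   ( *love['eve', x] and *love['bob', x] ) or *love['alice', x]
?[x] := *love['eve', x] and *love['bob', x] or *love['alice', x]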

There is a safety rule for negation:

[42]:
?[not_loved_by_b] := not *love['bob', not_loved_by_b]
[42]:
eval::unbound_symb_in_head

  × Symbol 'not_loved_by_b' in rule head is unbound
   ╭────
 1 │ ?[not_loved_by_b] := not *love['bob', not_loved_by_b]
   ·   ──────────────
   ╰────
  help: Note that symbols occurring only in negated positions are not considered bound

This query is forbidden because the resulting relation is infinite. For example, ‘gold’ should be in the result, since according to the facts stored in the database, Bob has no interest in ‘gold’.

To make our query finite, we have to explicitly give our query a closed world:

[43]:
the_population[p] := *love[p, _a]
the_population[p] := *love[_a, p]

?[not_loved_by_b] := the_population[not_loved_by_b],
                     not *love['bob', not_loved_by_b]
[43]:
  not_loved_by_b
0 bob
1 charlie
2 david
3 eve
4 george

Recursion#

Inline rules can apply other rules, and they can have multiple definitions. If you combine the two, you get recursion:

[44]:
alice_love_chain[person] := *love['alice', person]
alice_love_chain[person] := alice_love_chain[in_person],
                            *love[in_person, person]

?[chained] := alice_love_chain[chained]
[44]:
  chained
0 alice
1 bob
2 charlie
3 eve

You may object that you only need to be able to apply other rules to have recursion, without multiple definitions. Technically correct, but the resulting queries are not useful:

[45]:
alice_love_chain[person] := alice_love_chain[in_person],
                            *love[in_person, person]

?[chained] := alice_love_chain[chained]
[45]:
  chained

Similar to the negation case, if there is no way to deduce a fact from the given facts, then the fact itself is considered false. You need multiple definitions to “bootstrap” the query.

Aggregation#

Aggregations are usually used to compute statistics. In Cozo, aggregations are applied in the head of inline rules:

[46]:
?[person, count(loved_by)] := *love[loved_by, person]
[46]:
  person count(loved_by)
0 alice 2
1 bob 1
2 charlie 1
3 eve 2
4 george 2

The usual sum, mean, etc. are all available. Applying aggregations in the head instead of in the body may seem strange, but it is powerful, as we will see later.

Here is the full list of aggregations for you to play with.

Query options#

We have seen query options like :create, :put, :rm for manipulating stored relations. There are also query options for controlling what is returned:

[47]:
?[loving, loved] := *love{ loving, loved }

:limit 1
[47]:
  loving loved
0 alice eve

Next we want the result to be sorted by loved in descending order, then loving in ascending order, and skip the first row:

[48]:
?[loving, loved] := *love{ loving, loved }

:order -loved, loving
:offset 1
[48]:
  loving loved
0 george george
1 alice eve
2 charlie eve
3 eve charlie
4 eve bob
5 bob alice
6 eve alice

Putting - in front of a variable in the :order clause denotes descending order; nothing or + denotes ascending order.

The full list of query options are explained in this chapter.

Fixed rules#

The <- syntax for constant rules is syntax sugar. The full syntax is:

[49]:
?[] <~ Constant(data: [['hello', 'world', 'Cozo!']])
[49]:
  _0 _1 _2
0 hello world Cozo!

Here we are using the fixed rule Constant, which takes one option named data. The curly tail <~ denotes a fixed rule.

Fixed rules take input relations as arguments, apply custom logic to them, and produce their output relations. The Constant fixed rule takes zero input relations.

If you are using Cozo in the browser, Constant is the only fixed rule you can use. In all other cases, your Cozo build has the graph algorithms module enabled, and all graph algorithms are implemented as fixed rules. As an example, let’s find out who is most popular in the love graph by using the PageRank algorithm:

[50]:
?[person, page_rank] <~ PageRank(*love[])

:order -page_rank
[50]:
  person page_rank
0 george 0.280443
1 eve 0.275376
2 alice 0.191205
3 bob 0.103023
4 charlie 0.103023
5 david 0.025000

Here the input relation is the stored relation *love (as noted above, you will receive an error if you run this in the WASM implementation).

Each fixed rule is different: here is the full list.

[51]:
::remove love
[51]:
  status
0 OK

Time travel#

A very useful capability of Cozo is the ability to time travel in a stored relation. Usually when you :put into a relation, the old value is overwritten; and when you :rm, the row is completely gone. Doing it this way amounts to throwing away gold if how your data changes is valuable: we have a short story about it. In this case, time travel is the solution: instead of storing only current facts, the stored relation stores the complete history of facts, or at least all the history that is available as it unfolds.

If you believe that you don’t want this functionality at all, you can skip to the next section. At Cozo we adopt the “zero-cost, zero-mental-overhead abstraction” philosophy: if you don’t use a functionality, you don’t pay the performance or cognitive overhead.

Let’s have a simple example: storing the head of state of countries. First we have to create a new stored relation:

[52]:
:create hos {state: String, year: Validity => hos: String}
[52]:
  status
0 OK

hos is shorthand for “head of state”. The only thing different about this relation is that we are giving year the type Validity. You can think of a validity as a list of two elements, the first being an integer and the second a boolean. The integer indicates the “time” of the fact recorded by the row. The boolean, if true, indicates that the fact is an assertion valid from the indicated time; if false, it indicates that the previous fact under the same key (differing only in the validity) is retracted at that time. It is up to you to interpret what “time” really means. Here we will use it to mean the year, for simplicity.

Now let’s insert some data:

[53]:
?[state, year, hos] <- [['US', [2001, true], 'Bush'],
                        ['US', [2005, true], 'Bush'],
                        ['US', [2009, true], 'Obama'],
                        ['US', [2013, true], 'Obama'],
                        ['US', [2017, true], 'Trump'],
                        ['US', [2021, true], 'Biden']]

:put hos {state, year => hos}
[53]:
  status
0 OK

It is OK to assert a still valid fact again, as we have done above. You can use this relation like a normal relation:

[54]:
?[state, year, hos] := *hos{state, year, hos}
[54]:
  state year hos
0 US [2021, True] Biden
1 US [2017, True] Trump
2 US [2013, True] Obama
3 US [2009, True] Obama
4 US [2005, True] Bush
5 US [2001, True] Bush

The curious thing is that the result is sorted by year in descending order: validities always sort in descending order.

For any stored relation that has type Validity at the last slot of the key, the time-travel capability is enabled. Say you have forgotten who the president of the US was in 2019. Easy:

[55]:
?[hos, year] := *hos{state: 'US', year, hos @ 2019}
[55]:
  hos year
0 Trump [2017, True]

In the answer you also get the year from which this fact was known to hold.

What about a year we cannot know about yet, say 2099?

[56]:
?[hos, year] := *hos{state: 'US', year, hos @ 2099}
[56]:
  hos year
0 Biden [2021, True]

That certainly doesn’t look right. Let’s fix it by retracting facts on or after 2025, and only inserting them back once we know them for sure:

[57]:
?[state, year, hos] <- [['US', [2025, false], '']]

:put hos {state, year => hos}
[57]:
  status
0 OK

As we have hinted, you retract facts by putting a retraction. Now let’s run the query again:

[58]:
?[hos, year] := *hos{state: 'US', year, hos @ 2099}
[58]:
  hos year

Since the database does not contain facts on or after 2025, your query returns empty.

This functionality is flexible: you can mix different periods in the same query:

[59]:
?[hos2018, hos2010] := *hos{state: 'US', hos: hos2018 @ 2018},
                       *hos{state: 'US', hos: hos2010 @ 2010}
[59]:
  hos2018 hos2010
0 Trump Obama

As the relation hos is just a normal relation, you can still rm facts from it, in which case the facts are irretrievably gone. Whether that’s desirable is up to you: the database gives you the choice of how you want to use it, and trusts that you know how to use it correctly for your use case.

We have said above that it is up to you to interpret the “time”. There is also a default interpretation with a few more bells and whistles. First let’s create a new stored relation to store people’s moods:

[60]:
:create mood {name: String, at: Validity => mood: String}
[60]:
  status
0 OK

I want to record my mood now:

[61]:
?[name, at, mood] <- [['me', 'ASSERT', 'curious']]
:put mood {name, at => mood}
[61]:
  status
0 OK

Instead of giving a list of two elements as we have done above, we have simply used the string ASSERT, and the system will know that we mean an assertion of a current fact. What is “current”?

[62]:
?[name, at, mood] := *mood{name, at, mood}
[62]:
  name at mood
0 me [1672047587447466, True] curious

It’s a big number, but what does it mean? It is the UNIX timestamp at the time this fact was recorded, i.e., microseconds since the UNIX epoch:

[63]:
?[name, time, mood] := *mood{name, at, mood},
                       time = format_timestamp(at)
[63]:
  name time mood
0 me 2022-12-26T09:39:47.447+00:00 curious

To query for current facts, use the string NOW in the validity specification:

[64]:
?[name, time, mood] := *mood{name, at, mood @ 'NOW'},
                       time = format_timestamp(at)
[64]:
  name time mood
0 me 2022-12-26T09:39:47.447+00:00 curious

You can put in facts with manual timestamps as before, so it is possible that your database contains facts about the future. Let’s do just that. Instead of giving a mysterious string of numerals, you can use a string for the timestamp:

[65]:
?[name, at, mood] <- [['me', '2030-01-01T00:00:00.000+00:00', 'hopeful']]
:put mood {name, at => mood}
[65]:
  status
0 OK

Since this is in the future, it shouldn’t affect NOW:

[66]:
?[name, time, mood] := *mood{name, at, mood @ 'NOW'},
                       time = format_timestamp(at)
[66]:
  name time mood
0 me 2022-12-26T09:39:47.447+00:00 curious

There is also END for the validity specification, meaning to extract facts valid at the end of time:

[67]:
?[name, time, mood] := *mood{name, at, mood @ 'END'},
                       time = format_timestamp(at)
[67]:
  name time mood
0 me 2030-01-01T00:00:00+00:00 hopeful

Retraction at the current timestamp can be done with the string RETRACT:

[68]:
?[name, at, mood] <- [['me', 'RETRACT', '']]
:put mood {name, at => mood}
[68]:
  status
0 OK

Retractions placed in the future can also be done with timestamp strings, by prefixing them with ~:

[69]:
?[name, at, mood] <- [['me', '~9999-01-01T00:00:00.000+00:00', 'who cares']]
:put mood {name, at => mood}
[69]:
  status
0 OK

Now let’s look at the complete history:

[70]:
?[name, time, is_assert, mood] := *mood{name, at, mood},
                                  time = format_timestamp(at),
                                  is_assert = to_bool(at)
[70]:
  name time is_assert mood
0 me 2022-12-26T09:39:47.447+00:00 True curious
1 me 2022-12-26T09:40:19.999+00:00 False
2 me 2030-01-01T00:00:00+00:00 True hopeful
3 me 9999-01-01T00:00:00+00:00 False who cares

Pretty neat, isn’t it? And this time-travel facility is way faster than what you get if you try to implement it directly in Datalog: see the note for more details. Further technical details of time travel are discussed in its own chapter.

[71]:
::remove mood, hos
[71]:
  status
0 OK

Extended example: the air routes dataset#

Now that you have a basic understanding of the various constructs of Cozo, let’s deal with a small real-world dataset with about 3,700 nodes and 57,000 edges.

The data we are going to use, and many examples that we will present, are adapted from the book Practical Gremlin. Gremlin is an imperative query language for graphs, a very different take compared to Datalog.

First, let’s create the stored relations we want (wrapping queries in braces allows you to execute several queries together atomically):

[52]:
{:create airport {
    code: String
    =>
    icao: String,
    desc: String,
    region: String,
    runways: Int,
    longest: Float,
    elev: Float,
    country: String,
    city: String,
    lat: Float,
    lon: Float
}}
{:create country {
    code: String
    =>
    desc: String
}}
{:create continent {
    code: String
    =>
    desc: String
}}
{:create contain { entity: String, contained: String }}
{:create route { fr: String, to: String => dist: Float }}
[52]:
  status
0 OK

The next command applies only if you are using Jupyter notebooks: it downloads a JSON file containing the data and imports it into the database. The commented out line shows how to do the same thing with a local file. If you are using the Cozo WASM interface, click the “Import from URL” or the “Import from File” icon, and paste in the address.

[53]:
%cozo_import_remote_file 'https://raw.githubusercontent.com/cozodb/cozo/dev/cozo-core/tests/air-routes.json'
# %cozo_import_local_file '../../cozo/cozo-core/tests/air-routes.json'

If you feel that the above is too much magic, we will show you the “hard way” of importing the same data at the end of this tutorial. For now let’s move on.

Let’s verify all the relations we want are there:

[54]:
::relations
[54]:
  name arity access_level n_keys n_non_keys n_put_triggers n_rm_triggers n_replace_triggers
0 airport 11 normal 1 10 0 0 0
1 contain 2 normal 2 0 0 0 0
2 continent 2 normal 1 1 0 0 0
3 country 2 normal 1 1 0 0 0
4 route 3 normal 2 1 0 0 0

While we are at it, let’s lock all these tables to prevent accidentally changing their contents:

[55]:
::access_level read_only airport, contain, continent, country, route
[55]:
  status
0 OK

More information about what this does is explained in this chapter.

Let’s just look at some data. Start with airports:

[56]:
?[code, city, desc, region, runways, lat, lon] := *airport{code, city, desc, region, runways, lat, lon}

:limit 5
[56]:
  code city desc region runways lat lon
0 AAA Anaa Anaa Airport PF-U-A 1 -17.352600 -145.509995
1 AAE Annabah Annaba Airport DZ-36 2 36.822201 7.809170
2 AAL Aalborg Aalborg Airport DK-81 2 57.092759 9.849243
3 AAN Al Ain Al Ain International Airport AE-AZ 1 24.261700 55.609200
4 AAQ Anapa Anapa Airport RU-KDA 1 45.002102 37.347301

Airports with the most runways:

[57]:
?[code, city, desc, region, runways, lat, lon] := *airport{code, city, desc, region, runways, lat, lon}

:order -runways
:limit 10
[57]:
  code city desc region runways lat lon
0 DFW Dallas Dallas/Fort Worth International Airport US-TX 7 32.896801 -97.038002
1 ORD Chicago Chicago O'Hare International Airport US-IL 7 41.978600 -87.904800
2 AMS Amsterdam Amsterdam Airport Schiphol NL-NH 6 52.308601 4.763890
3 BOS Boston Boston Logan US-MA 6 42.364300 -71.005203
4 DEN Denver Denver International Airport US-CO 6 39.861698 -104.672997
5 DTW Detroit Detroit Metropolitan, Wayne County US-MI 6 42.212399 -83.353401
6 ATL Atlanta Hartsfield - Jackson Atlanta International Airport US-GA 5 33.636700 -84.428101
7 GIS Gisborne Gisborne Airport NZ-GIS 5 -38.663300 177.977997
8 HLZ Hamilton Hamilton International Airport NZ-WKO 5 -37.866699 175.332001
9 IAH Houston George Bush Intercontinental US-TX 5 29.984400 -95.341400

How many airports are there in total?

[58]:
?[count(code)] := *airport{code}
[58]:
  count(code)
0 3504

Let’s get a distribution of the initials of the airport codes:

[59]:
?[count(initial), initial] := *airport{code}, initial = first(chars(code))

:order initial
[59]:
  count(initial) initial
0 212 A
1 235 B
2 214 C
3 116 D
4 95 E
5 76 F
6 135 G
7 129 H
8 112 I
9 80 J
10 197 K
11 184 L
12 228 M
13 111 N
14 89 O
15 203 P
16 7 Q
17 121 R
18 245 S
19 205 T
20 77 U
21 86 V
22 59 W
23 28 X
24 211 Y
25 49 Z

More useful are the statistics of runways:

[60]:
?[count(r), count_unique(r), sum(r), min(r), max(r), mean(r), std_dev(r)] :=
    *airport{runways: r}
[60]:
  count(r) count_unique(r) sum(r) min(r) max(r) mean(r) std_dev(r)
0 3504 7 4980.000000 1 7 1.421233 0.743083

Using country, we can find countries with no airports:

[61]:
?[desc] := *country{code, desc}, not *airport{country: code}
[61]:
  desc
0 Andorra
1 Liechtenstein
2 Monaco
3 Pitcairn
4 San Marino

The route relation by itself is rather boring:

[62]:
?[fr, to, dist] := *route{fr, to, dist}

:limit 10
[62]:
  fr to dist
0 AAA FAC 48.000000
1 AAA MKP 133.000000
2 AAA PPT 270.000000
3 AAA RAR 968.000000
4 AAE ALG 254.000000
5 AAE CDG 882.000000
6 AAE IST 1161.000000
7 AAE LYS 631.000000
8 AAE MRS 477.000000
9 AAE ORN 477.000000

It just records the starting and ending airports of each route, together with the distance. This relation only becomes useful when used as a graph.

Airports with no routes:

[63]:
?[code, desc] := *airport{code, desc}, not *route{fr: code}, not *route{to: code}
[63]:
  code desc
0 AFW Fort Worth Alliance Airport
1 APA Centennial Airport
2 APK Apataki Airport
3 BID Block Island State Airport
4 BVS Breves Airport
5 BWU Sydney Bankstown Airport
6 CRC Santa Ana Airport
7 CVT Coventry Airport
8 EKA Murray Field
9 GYZ Gruyere Airport
10 HFN Hornafjordur Airport
11 HZK Husavik Airport
12 ILG New Castle Airport
13 INT Smith Reynolds Airport
14 ISL Ataturk International Airport
15 KGG Kédougou Airport
16 NBW Leeward Point Field
17 NFO Mata'aho Airport
18 PSY Stanley Airport
19 RIG Rio Grande Airport
20 SFD San Fernando De Apure Airport
21 SFH San Felipe International Airport
22 SXF Berlin-Schönefeld International Airport *Closed*
23 TUA Teniente Coronel Luis a Mantilla Airport
24 TWB Toowoomba Airport
25 TXL Berlin, Tegel International Airport *Closed*
26 VCV Southern California Logistics Airport
27 YEI Bursa Yenişehir Airport

Airports with the most out routes:

[64]:
route_count[fr, count(fr)] := *route{fr}
?[code, n] := route_count[code, n]

:sort -n
:limit 5
[64]:
  code n
0 FRA 310
1 IST 309
2 CDG 293
3 AMS 283
4 MUC 270

How many routes are there from the European Union to the US?

[65]:
routes[unique(r)] := *contain['EU', fr],
                     *route{fr, to},
                     *airport{code: to, country: 'US'},
                     r = [fr, to]
?[n] := routes[rs], n = length(rs)
[65]:
  n
0 435

How many airports are there in the US with routes from the EU?

[66]:
?[count_unique(to)] := *contain['EU', fr],
                       *route{fr, to},
                       *airport{code: to, country: 'US'}

[66]:
  count_unique(to)
0 45

How many routes are there for each airport in London, UK?

[67]:
?[code, count(code)] := *airport{code, city: 'London', region: 'GB-ENG'}, *route{fr: code}
[67]:
  code count(code)
0 LCY 51
1 LGW 232
2 LHR 221
3 LTN 130
4 STN 211

We need to specify the region, because there is another city called London, not in the UK.

How many airports are reachable from London, UK in two hops?

[68]:
lon_uk_airports[code] := *airport{code, city: 'London', region: 'GB-ENG'}
one_hop[to] := lon_uk_airports[fr], *route{fr, to}, not lon_uk_airports[to];
?[count_unique(a3)] := one_hop[a2], *route{fr: a2, to: a3}, not lon_uk_airports[a3];
[68]:
  count_unique(a3)
0 2353

What are the cities directly reachable from LGW (London Gatwick) that are furthest away?

[69]:
?[city, dist] := *route{fr: 'LGW', to, dist},
                 *airport{code: to, city}
:order -dist
:limit 10
[69]:
  city dist
0 Buenos Aires 6908.000000
1 Singapore 6751.000000
2 Langkawi 6299.000000
3 Duong Dong 6264.000000
4 Taipei 6080.000000
5 Port Louis 6053.000000
6 Rayong 6008.000000
7 Cape Town 5987.000000
8 Hong Kong 5982.000000
9 Shanghai 5745.000000

What airports are within 0.1 degrees of the Greenwich meridian?

[70]:
?[code, desc, lon, lat] := *airport{lon, lat, code, desc}, lon > -0.1, lon < 0.1
[70]:
  code desc lon lat
0 CDT Castellon De La Plana Airport 0.026111 39.999199
1 LCY London City Airport 0.055278 51.505278
2 LDE Tarbes-Lourdes-Pyrénées Airport -0.006439 43.178699
3 LEH Le Havre Octeville Airport 0.088056 49.533901

Airports in a box drawn around London Heathrow, UK:

[71]:
h_box[lon, lat] := *airport{code: 'LHR', lon, lat}
?[code, desc] := h_box[lhr_lon, lhr_lat], *airport{code, lon, lat, desc},
                 abs(lhr_lon - lon) < 1, abs(lhr_lat - lat) < 1
[71]:
  code desc
0 LCY London City Airport
1 LGW London Gatwick
2 LHR London Heathrow
3 LTN London Luton Airport
4 SOU Southampton Airport
5 STN London Stansted Airport

For some spherical geometry: what is the angle subtended by SFO and NRT on the surface of the earth?

[72]:
?[deg_diff] := *airport{code: 'SFO', lat: a_lat, lon: a_lon},
               *airport{code: 'NRT', lat: b_lat, lon: b_lon},
               deg_diff = rad_to_deg(haversine_deg_input(a_lat, a_lon, b_lat, b_lon))
[72]:
  deg_diff
0 73.992112

We mentioned before that aggregations in Cozo are powerful. They are powerful because they can be used in recursions (some restrictions apply).

Let’s find the distance of the shortest route between two airports. One way is to enumerate all the routes between the two airports, and then apply the min aggregation to the results. This cannot be implemented as stated, since the routes may contain cycles and hence there can be an infinite number of routes between two airports.

Instead, think recursively. If we already have all the shortest routes between all nodes, we can derive an equation satisfied by the shortest route: the shortest distance between a and b is either the distance of a direct route from a to b, or the sum of the shortest distance from a to some c and the distance of a direct route from c to b. We apply our min aggregation to this recursive set instead.

Write it out and it works. For example, the shortest route between the airports LHR and YPO:

[73]:
shortest[b, min(dist)] := *route{fr: 'LHR', to: b, dist}
                          # Start with the airport 'LHR', retrieve a direct route from 'LHR' to b

shortest[b, min(dist)] := shortest[c, d1], # Start with an existing shortest route from 'LHR' to c
                          *route{fr: c, to: b, dist: d2},  # Retrieve a direct route from c to b
                          dist = d1 + d2 # Add the distances

?[dist] := shortest['YPO', dist] # Extract the answer for 'YPO'.
                                 # We chose it since it is the hardest airport to get to from 'LHR'.
[73]:
  dist
0 4147.000000

There is a caveat when you try to write similar queries. Say you try to write it in the following way (don’t try to run it):

shortest[a, b, min(dist)] := *route{fr: a, to: b, dist}
shortest[a, b, min(dist)] := shortest[a, c, d1],
                          *route{fr: c, to: b, dist: d2},
                          dist = d1 + d2

?[dist] := shortest['LHR', 'YPO', dist]

You will find that the query does not complete in a reasonable amount of time, despite it being equivalent to the original query. Why?

In the changed query, you are asking the database to compute the all-pair shortest path, and then extract the answer to a particular shortest path. Normally Cozo would apply a technique called magic set rewrite so that only the needed answer would be calculated. However, in the changed query the presence of the aggregation operator min prevents that. In this case, applying the rewrite to the variable a would still yield the correct answer, but rewriting in any other way would give complete nonsense, and in the more general case with recursive aggregations this is a can of worms.

So as explained in the chapter about execution, magic set rewrites are only applied to rules without aggregations or recursions for the moment, until we are sure of the exact conditions under which the rewrites are safe. So for now at least the database executes the query as written, computing the result of the shortest rule containing more than ten million rows (to be exact, 3700 * 3700 = 13,690,000 rows) first!

The bottom line is, be mindful of the cardinality of the return sets of recursive rules.

A tour of graph algorithms#

Now let’s investigate the graph using some graph algorithms. As we have mentioned before, the Cozo running in browsers through WASM does not have the graph algorithms module enabled, so to run the following examples you will need to use some other implementation (for example, the Python one).

Since path-finding is such a common operation on graphs, Cozo has several fixed rules for that:

[74]:
starting[] <- [['LHR']]
goal[] <- [['YPO']]
?[starting, goal, distance, path] <~ ShortestPathDijkstra(*route[], starting[], goal[])
[74]:
  starting goal distance path
0 LHR YPO 4147.000000 ['LHR', 'YUL', 'YVO', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']

Not only is it more efficient, but we also get a path for the shortest route.

Not content with the shortest path, the following calculates the ten shortest paths:

[75]:
starting[] <- [['LHR']]
goal[] <- [['YPO']]
?[starting, goal, distance, path] <~ KShortestPathYen(*route[], starting[], goal[], k: 10)
[75]:
  starting goal distance path
0 LHR YPO 4147.000000 ['LHR', 'YUL', 'YVO', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']
1 LHR YPO 4150.000000 ['LHR', 'DUB', 'YUL', 'YVO', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']
2 LHR YPO 4164.000000 ['LHR', 'YUL', 'YMT', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']
3 LHR YPO 4167.000000 ['LHR', 'DUB', 'YUL', 'YMT', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']
4 LHR YPO 4187.000000 ['LHR', 'MAN', 'DUB', 'YUL', 'YVO', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']
5 LHR YPO 4202.000000 ['LHR', 'IOM', 'DUB', 'YUL', 'YVO', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']
6 LHR YPO 4204.000000 ['LHR', 'MAN', 'DUB', 'YUL', 'YMT', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']
7 LHR YPO 4209.000000 ['LHR', 'YUL', 'YMT', 'YNS', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']
8 LHR YPO 4211.000000 ['LHR', 'MAN', 'IOM', 'DUB', 'YUL', 'YVO', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']
9 LHR YPO 4212.000000 ['LHR', 'DUB', 'YUL', 'YMT', 'YNS', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']

If efficiency is really important to you, you can use the A* algorithm with a good heuristic function:

[76]:
code_lat_lon[code, lat, lon] := *airport{code, lat, lon}
starting[code, lat, lon] := code = 'LHR', *airport{code, lat, lon};
goal[code, lat, lon] := code = 'YPO', *airport{code, lat, lon};
?[] <~ ShortestPathAStar(*route[],
                         code_lat_lon[node, lat1, lon1],
                         starting[],
                         goal[goal, lat2, lon2],
                         heuristic: haversine_deg_input(lat1, lon1, lat2, lon2) * 3963);
[76]:
  _0 _1 _2 _3
0 LHR YPO 4147.000000 ['LHR', 'YUL', 'YVO', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']

There’s a lot more setup required in this case: we need to retrieve the latitudes and longitudes of airports and do processing on them first. The number 3963 above is the radius of the earth in miles.

The most important airports, by PageRank:

[77]:
rank[code, score] <~ PageRank(*route[a, b])
?[code, desc, score] := rank[code, score], *airport{code, desc}

:limit 10;
:order -score
[77]:
  code desc score
0 IST Istanbul International Airport 0.004889
1 DFW Dallas/Fort Worth International Airport 0.004696
2 ORD Chicago O'Hare International Airport 0.004452
3 DEN Denver International Airport 0.004252
4 PEK Beijing Capital International Airport 0.004044
5 FRA Frankfurt am Main 0.004027
6 ATL Hartsfield - Jackson Atlanta International Airport 0.004022
7 DXB Dubai International Airport 0.004002
8 CDG Paris Charles de Gaulle 0.003998
9 DME Moscow, Domodedovo International Airport 0.003817

The following example takes a long time to run since it calculates the betweenness centrality: up to a few seconds, depending on your machine. Algorithms for calculating the betweenness centrality have high complexity.

[78]:
centrality[code, score] <~ BetweennessCentrality(*route[a, b])
?[code, desc, score] := centrality[code, score], *airport{code, desc}

:limit 10;
:order -score
[78]:
  code desc score
0 ANC Anchorage Ted Stevens 1074868.250000
1 KEF Reykjavik, Keflavik International Airport 928450.437500
2 HEL Helsinki Ventaa 581588.500000
3 PEK Beijing Capital International Airport 532021.062500
4 DEL Indira Gandhi International Airport 472979.968750
5 IST Istanbul International Airport 457882.156250
6 PKC Yelizovo Airport 408571.000000
7 MSP Minneapolis-St.Paul International Airport 396433.250000
8 LAX Los Angeles International Airport 393309.968750
9 DEN Denver International Airport 374339.531250

These are the airports that, if disconnected from the network, cause the most disruption. As this example shows, some of the algorithms really struggle when you go beyond small or medium-sized datasets.

Community detection can collapse a graph into a supergraph. Here we store the result, since it has too many rows to display nicely:

[79]:
community[detailed_cluster, code] <~ CommunityDetectionLouvain(*route[a, b])
?[code, cluster, detailed_cluster] := community[detailed_cluster, code], cluster = first(detailed_cluster)

:replace community {code => cluster, detailed_cluster}
[79]:
  status
0 OK

We can look at the supernodes containing specific nodes. For example, the supernode for London Gatwick consists of mainly UK and other European airports, as you would expect:

[80]:
community[code] := *community{code: 'LGW', cluster}, *community{code, cluster}
?[country, count(code)] :=
    community[code],
    *airport{code, desc, country: country_code},
    *country{code: country_code, desc: country},

:order -count(code)
:limit 5
[80]:
  country count(code)
0 United Kingdom 54
1 France 50
2 Norway 49
3 Spain 40
4 Greece 38

For JFK on the other hand, its supernode consists of mainly US airports:

[81]:
community[code] := *community{code: 'JFK', cluster}, *community{code, cluster}
?[country, count(code)] :=
    community[code],
    *airport{code, desc, country: country_code},
    *country{code: country_code, desc: country},

:order -count(code)
:limit 5
[81]:
  country count(code)
0 United States 444
1 Canada 111
2 Brazil 108
3 Mexico 57
4 Colombia 50

But it does not always work according to geography. For example, Frankfurt airport is in Germany:

[82]:
?[desc, country_desc] := *airport{code: 'FRA', desc, country: country_code}, *country{code: country_code, desc: country_desc}
[82]:
  desc country_desc
0 Frankfurt am Main Germany

But its supernode:

[83]:
community[code] := *community{code: 'FRA', cluster}, *community{code, cluster}
?[country, count(code)] :=
    community[code],
    *airport{code, desc, country: country_code},
    *country{code: country_code, desc: country},

:order -count(code)
:limit 5
[83]:
  country count(code)
0 United States 444
1 Canada 111
2 Brazil 108
3 Mexico 57
4 Colombia 50

Germany does not even appear in the top five. In fact, FRA is in the same supernode as JFK. What matters is the connectivity in the graph, not the geography. As another example:

[84]:
community[code] := *community{code: 'SIN', cluster}, *community{code, cluster}
?[country, count(code)] :=
    community[code],
    *airport{code, desc, country: country_code},
    *country{code: country_code, desc: country},

:order -count(code)
:limit 5
[84]:
  country count(code)
0 China 216
1 Australia 125
2 Indonesia 68
3 Japan 65
4 Philippines 40

You’d expect SIN to be a Chinese airport. Wrong:

[85]:
?[desc, country_desc] := *airport{code: 'SIN', desc, country: country_code}, *country{code: country_code, desc: country_desc}
[85]:
  desc country_desc
0 Singapore, Changi International Airport Singapore

Finally, let’s collapse the route relation into super_route:

[86]:
?[fr_cluster, to_cluster, count(dist), sum(dist)] := *route{fr, to, dist},
                                                     *community{code: fr, cluster: fr_cluster},
                                                     *community{code: to, cluster: to_cluster}
:replace super_route {fr_cluster, to_cluster => n_routes=count(dist), total_distance=sum(dist)}
[86]:
  status
0 OK

As expected, the “diagonals” where fr_cluster == to_cluster are larger in the super_route graph:

[87]:
?[fr_cluster, to_cluster, n_routes, total_distance] := *super_route{fr_cluster, to_cluster, n_routes, total_distance}, fr_cluster < 2
[87]:
  fr_cluster to_cluster n_routes total_distance
0 0 0 9041 8933554.000000
1 0 1 434 1695379.000000
2 0 2 228 761661.000000
3 0 3 530 1681865.000000
4 0 4 163 391892.000000
5 0 8 3 300.000000
6 0 11 2 283.000000
7 0 19 1 238.000000
8 0 21 2 705.000000
9 0 22 1 975.000000
10 1 0 434 1696858.000000
11 1 1 4474 5142452.000000
12 1 2 1160 1492734.000000
13 1 3 526 1724591.000000
14 1 4 223 361986.000000
15 1 9 1 808.000000

Now the super graph is small enough that all analytics algorithms return instantly:

[88]:
?[cluster, score] <~ PageRank(*super_route[])
:order -score
:limit 5
[88]:
  cluster score
0 3 0.173309
1 0 0.093072
2 1 0.053465
3 2 0.053389
4 4 0.044654

You can now go on to investigate the supernodes, give real-world interpretations to them, etc. For example, a naïve interpretation of the above PageRank result is that North America is (still) the most prosperous part of the world, followed by East Asia in the second place, South Asia in the third place, and Europe in the fourth place.

Importing the dataset the hard way#

Previously, we imported the air-routes dataset by using Python under the hood to download a specially-crafted JSON file and feed it to the database. Here we learn how to achieve the same effect by letting Cozo fetch and import a series of CSV files, without Python’s help.

Let’s first reset the database:

[89]:
::access_level normal airport, contain, continent, country, route
[89]:
  status
0 OK
[92]:
::remove airport, contain, continent, country, route, community, super_route
[92]:
  status
0 OK

Next, some parameters to make life easier (the commented-out lines do the same thing by processing local files):

[94]:
%cozo_set AIR_ROUTES_NODES_URL 'https://raw.githubusercontent.com/cozodb/cozo/dev/cozo-core/tests/air-routes-latest-nodes.csv'
%cozo_set AIR_ROUTES_EDGES_URL 'https://raw.githubusercontent.com/cozodb/cozo/dev/cozo-core/tests/air-routes-latest-edges.csv'
# %cozo_set AIR_ROUTES_NODES_URL 'file://./../../cozo/cozo-core/tests/air-routes-latest-nodes.csv'
# %cozo_set AIR_ROUTES_EDGES_URL 'file://./../../cozo/cozo-core/tests/air-routes-latest-edges.csv'

First, import the airport relation:

[95]:
res[idx, label, typ, code, icao, desc, region, runways, longest, elev, country, city, lat, lon] <~
    CsvReader(types: ['Int', 'Any', 'Any', 'Any', 'Any', 'Any', 'Any', 'Int?', 'Float?', 'Float?', 'Any', 'Any', 'Float?', 'Float?'],
              url: $AIR_ROUTES_NODES_URL,
              has_headers: true)

?[code, icao, desc, region, runways, longest, elev, country, city, lat, lon] :=
    res[idx, label, typ, code, icao, desc, region, runways, longest, elev, country, city, lat, lon],
    label == 'airport'

:replace airport {
    code: String
    =>
    icao: String,
    desc: String,
    region: String,
    runways: Int,
    longest: Float,
    elev: Float,
    country: String,
    city: String,
    lat: Float,
    lon: Float
}
[95]:
  status
0 OK

The CsvReader utility downloads a CSV file from the internet and attempts to parse its content into a relation. When storing the relation, we specified types for the columns. The code column acts as a primary key for the airport stored relation.

Next is country:

[96]:
res[idx, label, typ, code, icao, desc] <~
    CsvReader(types: ['Int', 'Any', 'Any', 'Any', 'Any', 'Any'],
              url: $AIR_ROUTES_NODES_URL,
              has_headers: true)
?[code, desc] :=
    res[idx, label, typ, code, icao, desc],
    label == 'country'

:replace country {
    code: String
    =>
    desc: String
}
[96]:
  status
0 OK

continent:

[97]:
res[idx, label, typ, code, icao, desc] <~
    CsvReader(types: ['Int', 'Any', 'Any', 'Any', 'Any', 'Any'],
              url: $AIR_ROUTES_NODES_URL,
              has_headers: true)
?[idx, code, desc] :=
    res[idx, label, typ, code, icao, desc],
    label == 'continent'

:replace continent {
    code: String
    =>
    desc: String
}
[97]:
  status
0 OK

We need to make a translation table for the indices the data use:

[98]:
res[idx, label, typ, code] <~
    CsvReader(types: ['Int', 'Any', 'Any', 'Any'],
              url: $AIR_ROUTES_NODES_URL,
              has_headers: true)
?[idx, code] :=
    res[idx, label, typ, code],

:replace idx2code { idx => code }
[98]:
  status
0 OK

The contain relation contains information on the geographical inclusion of entities:

[99]:
res[] <~
    CsvReader(types: ['Int', 'Int', 'Int', 'String'],
              url: $AIR_ROUTES_EDGES_URL,
              has_headers: true)
?[entity, contained] :=
    res[idx, fr_i, to_i, typ],
    typ == 'contains',
    *idx2code[fr_i, entity],
    *idx2code[to_i, contained]


:replace contain { entity: String, contained: String }
[99]:
  status
0 OK

Finally, the routes between the airports. This relation is much larger than the rest and contains about 60k rows:

[100]:
res[] <~
    CsvReader(types: ['Int', 'Int', 'Int', 'String', 'Float?'],
              url: $AIR_ROUTES_EDGES_URL,
              has_headers: true)
?[fr, to, dist] :=
    res[idx, fr_i, to_i, typ, dist],
    typ == 'route',
    *idx2code[fr_i, fr],
    *idx2code[to_i, to]

:replace route { fr: String, to: String => dist: Float }
[100]:
  status
0 OK

We no longer need the idx2code relation:

[101]:
::remove idx2code
[101]:
  status
0 OK

Let’s verify all the relations we want are there:

[102]:
::relations
[102]:
  name arity access_level n_keys n_non_keys n_put_triggers n_rm_triggers n_replace_triggers
0 airport 11 normal 1 10 0 0 0
1 contain 2 normal 2 0 0 0 0
2 continent 2 normal 1 1 0 0 0
3 country 2 normal 1 1 0 0 0
4 route 3 normal 2 1 0 0 0

That’s it for the tutorial!

Queries#

CozoScript, a Datalog dialect, is the query language of the Cozo database.

A CozoScript query consists of one or many named rules. Each named rule represents a relation, i.e. a collection of data divided into rows and columns. The rule named ? is the entry to the query, and the relation it represents is the result of the query. Each named rule has a rule head, which corresponds to the columns of the relation, and a rule body, which specifies the content of the relation, or how the content should be computed.
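As a minimal sketch of this structure (a hypothetical query, not tied to any stored data), a helper rule feeds the entry rule ?:

nums[n] <- [[1], [2], [3]]       # a named rule: a one-column, three-row relation
?[n, sq] := nums[n], sq = n * n  # the entry rule; its relation is the query result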

Relations in Cozo (stored or otherwise) abide by the set semantics. Thus even if a rule computes a row multiple times, the resulting relation only contains a single copy.

There are two types of named rules in CozoScript:

  • Inline rules, distinguished by using := to connect the head and the body. The logic used to compute the resulting relation is defined inline.

  • Fixed rules, distinguished by using <~ to connect the head and the body. The logic used to compute the resulting relation is fixed according to which algorithm or utility is requested.

Constant rules, which use <- to connect the head and the body, are syntax sugar. For example:

const_rule[a, b, c] <- [[1, 2, 3], [4, 5, 6]]

is identical to:

const_rule[a, b, c] <~ Constant(data: [[1, 2, 3], [4, 5, 6]])

Inline rules#

An example of an inline rule is:

hc_rule[a, e] := rule_a['constant_string', b], rule_b[b, d, a, e]

The rule body of an inline rule consists of multiple atoms joined by commas, and is interpreted as representing the conjunction of these atoms.

Atoms#

Atoms come in various flavours. In the example above:

rule_a['constant_string', b]

is an atom representing a rule application: a rule named rule_a must exist in the same query and have the correct arity (2 here). Each row in the named rule is then unified with the bindings given as parameters in the square brackets: here the first column is unified with a constant string, and unification succeeds only when the string completely matches what is given; the second column is unified with the variable b, and as the variable is fresh at this point (this is its first appearance), the unification always succeeds. For subsequent atoms, the variable becomes bound: it takes on the value it was unified with in the named relation. When a bound variable is unified again, for example b in rule_b[b, d, a, e], the unification only succeeds when the unified value is the same as the current value. Thus, repeated use of the same variable across rule applications corresponds to inner joins in relational algebra.

Atoms representing applications of stored relations are written as:

*stored_relation[bind1, bind2]

with the asterisk before the name. When written in this way with square brackets, you must give exactly as many bindings as the arity of the stored relation.

You can also bind columns by name:

*stored_relation{col1: bind1, col2: bind2}

In this form, any number of columns may be omitted. If the name you want to give the binding is the same as the name of the column, you can write instead *stored_relation{col1}, which is the same as *stored_relation{col1: col1}.

Expressions are also atoms, such as:

a > b + 1

a and b must be bound somewhere else in the rule. Expression atoms must evaluate to booleans, and act as filters. Only rows where the expression atom evaluates to true are kept.

Unification atoms unify explicitly:

a = b + c + d

Whatever appears on the left-hand side must be a single variable and is unified with the result of the right-hand side.

Note

This is different from the equality operator ==, where the left-hand side is a completely bound expression. When the left-hand side is a single bound variable, the equality and the unification operators are equivalent.
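A small sketch of the distinction (hypothetical queries):

?[x] := x = 1 + 2          # unification: x is fresh, so it is bound to 3
?[y] := y = 1 + 1, y == 2  # equality: y is already bound here, so == acts as a filter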

Unification atoms can also unify with multiple values in a list:

a in [x, y, z]

If the right-hand side does not evaluate to a list, an error is raised.

Multiple definitions and disjunction#

For inline rules only, multiple rule definitions may share the same name, with the requirement that the arity of the head in each definition must match. The returned relation is then formed by the disjunction of the multiple definitions (a union of rows).

You may also use the explicit disjunction operator or in a single rule definition:

rule1[a, b] := rule2[a] or rule3[a], rule4[a, b]

There is also an and operator, semantically identical to the comma , except that it has higher operator precedence than or (the comma has the lowest precedence).

Negation#

Atoms in inline rules may be negated by putting not in front of them:

not rule1[a, b]

When negating rule applications and stored relations, at least one binding must be bound somewhere else in the rule in a non-negated context (another safety rule). The unbound bindings in negated rules remain unbound: negation cannot introduce new bindings to be used in the head.
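For instance, in the following sketch (rule2 is hypothetical), the negated rule1 atom is safe because both a and b are already bound by the positive rule2 atom:

?[a, b] := rule2[a, b], not rule1[a, b]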

Negated expressions act as negative filters, which is semantically equivalent to putting ! in front of the expression. Explicit unification cannot be negated, unless the left-hand side is bound, in which case it is treated as an expression atom and then negated.

Recursion and stratification#

The body of an inline rule may contain rule applications of itself, and multiple inline rules may apply each other recursively. The only exception is the entry rule ?, which cannot be referred to by other rules including itself.

Recursion cannot occur in negated positions (safety rule): r[a] := not r[a] is not allowed.

Warning

As CozoScript allows explicit unification, queries that produce infinite relations may be accepted by the compiler. One of the simplest examples is:

r[a] := a = 0
r[a] := r[b], a = b + 1
?[a] := r[a]

It is not possible, even in principle, for Cozo to rule out all infinite queries without wrongly rejecting valid ones. If you accidentally submit one, refer to the system ops chapter for how to terminate queries. Alternatively, you can give the query a timeout when you submit it.

Aggregation#

In CozoScript, aggregations are specified for inline rules by applying aggregation operators to variables in the rule head:

?[department, count(employee)] := *personnel{department, employee}

here we have used the familiar count operator. Any variables in the head without aggregation operators are treated as grouping variables, and aggregation is applied using them as keys. If you do not specify any grouping variables, then the resulting relation contains at most one row.

Aggregation operators are applied to the rows computed by the body of the rule using bag semantics. The reason for this complication is that if aggregations are applied with set semantics, then the following query:

?[count(employee)] := *personnel{employee}

does not do what you expect: it either returns a row with a single value 1 if there are any matching rows, or it returns nothing at all if the stored relation is empty.

If a rule has several definitions, they must have identical aggregations applied in the same positions.

Cozo allows aggregations for self-recursion for a limited subset of aggregation operators, the so-called semi-lattice aggregations:

shortest_distance[destination, min(distance)] :=
    *route{source: 'A', destination, distance}

shortest_distance[destination, min(distance)] :=
    shortest_distance[existing_node, prev_distance], # recursion
    *route{source: existing_node, destination, distance: route_distance},
    distance = prev_distance + route_distance

?[destination, min_distance] :=
    shortest_distance[destination, min_distance]

Here self-recursion of shortest_distance contains the min aggregation.

For a rule-head to be considered semi-lattice-aggregate, the aggregations must come at the end of the rule head. In the above example, if you write the head as shortest_distance[min(distance), destination], the query engine will complain about unsafe recursion through aggregation, since written this way min is considered an ordinary aggregation.

Fixed rules#

The body of a fixed rule starts with the name of the utility or algorithm being applied, then takes a specified number of named or stored relations as its input relations, followed by options that you provide. For example:

?[] <~ PageRank(*route[], theta: 0.5)

In the above example, the relation *route is the single input relation expected. Input relations may be stored relations or relations resulting from rules.

Each utility/algorithm expects specific shapes for their input relations. You must consult the documentation for each utility/algorithm to understand its API.

In fixed rules, bindings for input relations are usually omitted, but when they are provided, they are interpreted and used in algorithm-specific ways, as with the bindings for the DFS algorithm.

In the example above, theta is an option of the algorithm, which is required by the API to be an expression evaluating to a constant. Each utility/algorithm expects specific types for the options; some options have default values and may be omitted.

Each fixed rule has a determinate output arity. Thus, the bindings in the rule head can be omitted, but if they are provided, you must abide by the arity.
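
For example (a sketch that assumes PageRank's output has two columns, a node and its rank), head bindings, when given, must match that arity:

?[node, rank] <~ PageRank(*route[], theta: 0.5)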

Query options#

Each query can have options associated with it:

?[name] := *personnel{name}

:limit 10
:offset 20

In the example, :limit and :offset are query options with familiar meanings. All query options start with a single colon :. Query options can appear before or after rules, or even sandwiched between rules.

Several query options deal with transactions for the database. Those will be discussed in the chapter on stored relations and transactions. The rest of the query options are explained in the following.

:limit <N>

Limit output relation to at most <N> rows. If possible, execution will stop as soon as this number of output rows is collected.

:offset <N>

Skip the first <N> rows of the returned relation.

:timeout <N>

Abort if the query does not complete within <N> seconds. Seconds may be specified as an expression so that random timeouts are possible.

:sleep <N>

If specified, the query will wait for <N> seconds after completion, before committing or proceeding to the next query. Seconds may be specified as an expression so that random timeouts are possible. Useful for deliberately interleaving concurrent queries to test complex logic.

:sort <SORT_ARG> (, <SORT_ARG>)*

Sort the output relation. If :limit or :offset are specified, they are applied after :sort. Specify <SORT_ARG> as they appear in the rule head of the entry, separated by commas. You can optionally specify the sort direction of each argument by prefixing them with + or -, with minus denoting descending order, e.g. :sort -count(employee), dept_name sorts by employee count in reverse order first, then break ties with department name in ascending alphabetical order.

Warning

Aggregations must be done in inline rules, not in output sorting. In the above example, the entry rule head must contain count(employee); employee alone is not acceptable.
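
A sketch putting these options together, assuming a hypothetical *personnel relation with columns department and employee: the aggregation is computed in the entry head, and sorting and limiting are applied to the result:

?[dept_name, count(employee)] := *personnel{department: dept_name, employee}

:sort -count(employee), dept_name
:limit 10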

:order <SORT_ARG> (, <SORT_ARG>)*

Alias for :sort.

:assert none

The query returns nothing if the output relation is empty; otherwise, execution aborts with an error. Useful for transactions and triggers.

:assert some

The query returns nothing if the output relation contains at least one row; otherwise, execution aborts with an error. Useful for transactions and triggers. You should consider adding :limit 1 to the query to ensure early termination if you do not need to check all returned tuples.
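
A minimal sketch, assuming a hypothetical stored relation *rel keyed by a and a query parameter $the_key: check that the key exists, stopping at the first matching row:

?[a] := a = $the_key, *rel{a}

:limit 1
:assert some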

Stored relations and transactions#

In Cozo, data are stored in stored relations on disk.

Stored relations#

To query stored relations, use the *relation[...] or *relation{...} atoms in inline or fixed rules, as explained in the last chapter. To manipulate stored relations, use one of the following query options:

:create <NAME> <SPEC>

Create a stored relation with the given name and spec. No stored relation with the same name can exist beforehand. If a query is specified, data from the resulting relation is put into the newly created stored relation. This is the only stored relation-related query option in which a query may be omitted.

:replace <NAME> <SPEC>

Similar to :create, except that if the named stored relation exists beforehand, it is completely replaced. The schema of the replaced relation need not match the new one. You cannot omit the query for :replace. If there are any triggers associated, they will be preserved. Note that this may lead to errors if :replace leads to schema change.

:put <NAME> <SPEC>

Put rows from the resulting relation into the named stored relation. If keys from the data exist beforehand, the corresponding rows are replaced with new ones.

:ensure <NAME> <SPEC>

Ensure that rows specified by the output relation and spec exist in the database, and that no other process has written to these rows when the enclosing transaction commits. Useful for ensuring read-write consistency.

:rm <NAME> <SPEC>

Remove rows from the named stored relation. Only keys should be specified in <SPEC>. Removing a non-existent key is not an error and does nothing.

:ensure_not <NAME> <SPEC>

Ensure that rows specified by the output relation and spec do not exist in the database and that no other process has written to these rows when the enclosing transaction commits. Useful for ensuring read-write consistency.

:yield <NAME>

When chaining queries, make the return set of the current query available in the subsequent queries as the given name.

You can rename and remove stored relations with the system ops ::relation rename and ::relation remove, described in the system op chapter.

Create and replace#

The format of <SPEC> is identical for all four ops, but the semantics is a bit different. We first describe the format and semantics for :create and :replace.

A spec, or a specification for columns, is enclosed in curly braces {} and separated by commas:

?[address, company_name, department_name, head_count] <- $input_data

:create dept_info {
    company_name: String,
    department_name: String,
    =>
    head_count: Int,
    address: String,
}

Columns before the symbol => form the keys (actually a composite key) for the stored relation, and those after it form the values. If all columns are keys, the symbol => may be omitted. The order of columns matters. Rows are stored in lexicographically sorted order in trees according to their keys.

In the above example, we explicitly specified the types for all columns. In case of type mismatch, the system will first try to coerce the values given, and if that fails, the query is aborted with an error. You can omit types for columns, in which case their types default to Any?, i.e. all values are acceptable. For example, the above query with all types omitted is:

?[address, company_name, department_name, head_count] <- $input_data

:create dept_info { company_name, department_name => head_count, address }

In the example, the bindings for the output match the columns exactly (though not in the same order). You can also explicitly specify the correspondence:

?[a, b, count(c)] <- $input_data

:create dept_info {
    company_name = a,
    department_name = b,
    =>
    head_count = count(c),
    address: String = b
}

You must use explicit correspondence if the entry head contains aggregation, since names such as count(c) are not valid column names. The address field above shows how to specify both a type and a correspondence.

Instead of specifying bindings, you can specify an expression that generates default values by using default:

?[a, b] <- $input_data

:create dept_info {
    company_name = a,
    department_name = b,
    =>
    head_count default 0,
    address default ''
}

The expression is evaluated anew for each row, so if you specify a UUID-generating function, you will get a different UUID for each row.
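
A sketch, using a hypothetical people relation: each inserted row receives its own freshly generated key:

?[name] <- [['Alice'], ['Bob']]

:create people {
    id default rand_uuid_v1(),
    =>
    name: String
}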

Put, remove, ensure and ensure-not#

For :put, :rm, :ensure and :ensure_not, you do not need to specify all existing columns in the spec if the omitted columns have a default generator, or if the type of the column is nullable, in which case the value defaults to null. For these operations, specifying default values has no effect and will not replace existing ones.

For :put and :ensure, the spec needs to contain enough bindings to generate all keys and values. For :rm and :ensure_not, it only needs to generate all keys.

Chaining queries#

Each script you send to Cozo is executed in its own transaction. To ensure consistency of multiple operations on data, you can define multiple queries in a single script by wrapping each query in curly braces {}. Each query can have its own independent query options. The queries are executed serially, and execution aborts at the first error encountered. The returned relation is that of the last query.

The :assert (some|none), :ensure and :ensure_not query options allow you to express complicated constraints that must be satisfied for your transaction to commit.

This example uses three queries to put and remove rows atomically (either all succeed or all fail), and ensure that at the end of the transaction an untouched row exists:

{
    ?[a, b] <- [[1, 'one'], [3, 'three']]
    :put rel {a => b}
}
{
    ?[a] <- [[2]]
    :rm rel {a}
}
{
    ?[a, b] <- [[4, 'four']]
    :ensure rel {a => b}
}

When a transaction starts, a snapshot is used, so that only already committed data, or data written within the same transaction, are visible to queries. At the end of the transaction, changes are committed only if there are no conflicts and no errors are raised. If any mutation activates triggers, those triggers execute in the same transaction.

When chaining queries, you can yield the return set of a query to be used in subsequent queries, as the following example illustrates:

{
    ?[a] <- [[1]]
    :yield first_yield
}
{
    ?[a] := first_yield[b], a = b + 1
    :yield second_yield
}
{
    ?[a] := first_yield[a]
    ?[a] := second_yield[a]
}

The final return set should be [[1], [2]]. This example is contrived: the most frequent use of this feature is to compute some results and to insert various aspects of the results into different stored relations.

Triggers and indices#

Cozo does not have traditional indices on stored relations. Instead, you define regular stored relations that are used as indices. At query time, you explicitly query the index instead of the original stored relation.

You synchronize your indices and the original by ensuring that any mutations you do on the database write the correct data to the “canonical” relation and its indices in the same transaction. As doing this by hand for every mutation leads to lots of repetitions and is error-prone, Cozo supports triggers to do it automatically for you.

You attach triggers to a stored relation by running the system op ::set_triggers:

::set_triggers <REL_NAME>

on put { <QUERY> }
on rm { <QUERY> }
on replace { <QUERY> }
on put { <QUERY> } # you can specify as many triggers as you need

<QUERY> can be any valid query.

The on put triggers will run when new data is inserted or upserted, which can be activated by :put, :create and :replace query options. The implicitly defined rules _new[] and _old[] can be used in the triggers, and contain the added rows and the replaced rows respectively.

The on rm triggers will run when data is deleted, which can be activated by a :rm query option. The implicitly defined rules _new[] and _old[] can be used in the triggers: _new[] contains the keys of the rows requested for deletion (even if no row with that key actually exists), and _old[] contains the rows actually deleted (with both keys and non-keys).

The on replace triggers will be activated by a :replace query option. They are run before any on put triggers.

All triggers for a relation must be specified together, in the same ::set_triggers system op. If used again, all the triggers associated with the stored relation are replaced. To remove all triggers from a stored relation, use ::set_triggers <REL_NAME> followed by nothing.

As an example of using triggers to maintain an index, suppose we have the following relation:

:create rel {a => b}

We often want to query *rel[a, b] with b bound but a unbound. This will cause a full scan, which can be expensive. So we need an index:

:create rel.rev {b, a}

In the general case, we cannot assume a functional dependency b => a, so in the index both fields appear as keys.

To manage the index automatically:

::set_triggers rel

on put {
    ?[a, b] := _new[a, b]

    :put rel.rev{ b, a }
}
on rm {
    ?[a, b] := _old[a, b]

    :rm rel.rev{ b, a }
}

With the index set up, you can use *rel.rev{..} in place of *rel{..} in your queries.
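
For example, a lookup with b bound becomes an efficient scan on the index (a sketch, with $b_value a hypothetical query parameter):

?[a] := *rel.rev{b: $b_value, a}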

Indices in Cozo are manual, but extremely flexible, since you need not conform to any predetermined patterns in your use of _old[] and _new[]. For simple queries, the need to explicitly elect to use an index can seem cumbersome, but for complex ones, the deterministic evaluation entailed can be a huge blessing.

Triggers can be creatively used for other purposes as well.

Warning

Loops in your triggers can cause non-termination. A loop occurs when a relation has triggers which affect other relations, which in turn have other triggers that ultimately affect the starting relation.

Time travel#

Simply put, time travel in a database means tracking changes to data over time and allowing queries to be logically executed at a point in time to get a historical view of the data. In a sense, this makes your database immutable, since nothing is really deleted from the database ever. This story gives some motivations why time travel may be valuable.

In Cozo, a stored relation is eligible for time travel if the last part of its key has the explicit type Validity. A validity has two parts: a time part, represented by a signed integer, and an assertion part, represented by a boolean, so [42, true] represents a validity. Sorting of validity is by the timestamp first, then by the assertive flag, but each field is compared in descending order, so:

[1, true] < [1, false] < [-1, true]

All rows with identical key parts except the last validity part form the history for that key, interpreted in the following way: the fact represented by a row is valid if its flag is true, and the range of its validity is from its timestamp (inclusive) up until the timestamp of the next row under the same key (excluding the last validity part, and here time is interpreted to flow forward). A row with a false assertive flag does nothing other than making the previous fact invalid.

When querying against such a stored relation, a validity specification can be attached, for example:

?[name] := *rel{id: $id, name, @ 789}

The part after the symbol @ is the validity specification and must be a compile-time constant, i.e., it cannot contain variables. Logically, it is as if the query is against a snapshot of the relation containing only valid facts at the specified timestamp.

It is possible for two rows to have identical non-validity key parts and identical timestamps, but differ in their assertive flags. In this case when queried against the exact timestamp, the row is valid, as if the row with the false flag does not exist. The use case for this behaviour is to assert a fact only until a future time when that fact is sure to remain valid. When that time comes, a new fact can be asserted, and if the old fact remains valid there is no need to :rm the previous retraction.

You can use the function to_bool to extract the flag of a validity, and to_int to extract the timestamp as an integer.

In Cozo it is up to you to interpret the timestamp part of a validity. If you use it to represent calendar time, it is recommended that you treat it as microseconds since the UNIX epoch. For this interpretation, the following conveniences are provided:

  • When putting facts into the database, instead of specifying the exact literal validity as a list of two items, the strings ASSERT and RETRACT can be used instead, and are interpreted as assertion and retraction at the current timestamp, respectively. This comes with the additional guarantee that all insertion operations in the same transaction using this method get the same timestamp. You can also use these strings as the default values for a field, and they will do the right thing (see the sketch after this list).

  • In place of a list of two items for specifying a literal validity, you can use an RFC 3339 string for an assertion timestamp or for the validity specification in a query. For retraction, prefix the string with ~.

  • When specifying validity against a stored relation, the string NOW uses the current timestamp, and END uses a timestamp logically at the end of the world. Furthermore, the NOW timestamp is guaranteed to be the same as what would be inserted using ASSERT and RETRACT.

  • You can use the function format_timestamp to format the timestamp part of a validity directly as an RFC 3339 string.
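
A sketch putting these conveniences together, assuming a hypothetical stored relation status whose last key column at has the type Validity:

{
    ?[id, at, current] <- [[1, 'ASSERT', 'ok']]
    :put status {id, at => current}
}
{
    ?[current] := *status{id: 1, current, @ 'NOW'}
}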

An interesting use case of the time travel facility is to pre-generate the whole history for all time, and in the user-facing interface query with the current time NOW. The effect is that users see an illusion of real-time interactions: a manifestation of Laplace’s daemon.

System ops#

System ops start with a double-colon :: and must appear alone in a script. In the following, we explain what each system op does, and the arguments they expect.

Explain#

::explain { <QUERY> }

A single query is enclosed in curly braces. Query options are allowed but ignored. The query is not executed, but its query plan is returned instead. Currently, there is no specification for the return format, but if you are familiar with the semi-naïve evaluation of stratified Datalog programs subject to magic-set rewrites, you can decipher the result.

Ops for stored relations#

::relations

List all stored relations in the database.

::columns <REL_NAME>

List all columns for the stored relation <REL_NAME>.

::remove <REL_NAME> (, <REL_NAME>)*

Remove stored relations. Several can be specified, joined by commas.

::rename <OLD_NAME> -> <NEW_NAME> (, <OLD_NAME> -> <NEW_NAME>)*

Rename stored relation <OLD_NAME> into <NEW_NAME>. Several may be specified, joined by commas.

::show_triggers <REL_NAME>

Display triggers associated with the stored relation <REL_NAME>.

::set_triggers <REL_NAME> ...

Set triggers for the stored relation <REL_NAME>. This is explained in more detail in the transaction chapter.

::access_level <ACCESS_LEVEL> <REL_NAME> (, <REL_NAME>)*

Sets the access level of <REL_NAME> to the given level. The levels are:

  • normal allows everything,

  • protected disallows ::remove and :replace,

  • read_only additionally disallows any mutations and setting triggers,

  • hidden additionally disallows any data access (metadata access via ::relations, etc., are still allowed).

The access level functionality is to protect data from mistakes of the programmer, not from attacks by malicious parties.

Monitor and kill#

::running

Display running queries and their IDs.

::kill <ID>

Kill a running query specified by <ID>. The ID may be obtained by ::running.

Maintenance#

::compact

Instructs Cozo to run a compaction job. Compaction makes the database smaller on disk and faster for read queries.

Types#

Runtime types#

Values in Cozo have the following runtime types:

  • Null

  • Bool

  • Number

  • String

  • Bytes

  • Uuid

  • List

  • Validity

Number can be Float (double precision) or Int (signed, 64 bits). Cozo will auto-promote Int to Float when necessary.

List can contain any number of mixed-type values, including other lists.

Cozo sorts values according to the above order, e.g. null is smaller than true, which is in turn smaller than the list [].

Within each type values are compared according to:

  • false < true;

  • -1 == -1.0 < 0 == 0.0 < 0.5 == 0.5 < 1 == 1.0;

  • Lists are ordered lexicographically by their elements;

  • Bytes are compared lexicographically;

  • Strings are compared lexicographically by their UTF-8 byte representations;

  • UUIDs are sorted in a way that UUIDv1 with similar timestamps are near each other. This is to improve data locality and should be considered an implementation detail. Depending on the order of UUID in your application is not recommended.

  • Validity is introduced for the sole purpose of enabling time travel queries.

Warning

1 == 1.0 evaluates to true, but 1 and 1.0 are distinct values, meaning that a relation can contain both as keys according to set semantics. This is especially confusing when using JavaScript, which converts all numbers to floats, and Python, which does not show a difference between the two when printing. Using floating point numbers in keys is not recommended if the rows are accessed by these keys (instead of accessed by iteration).

Literals#

The standard notations null for the type Null, false and true for the type Bool are used.

Besides the usual decimal notation for signed integers, you can prefix a number with 0x or -0x for hexadecimal representation, with 0o or -0o for octal, or with 0b or -0b for binary. Floating point numbers include the decimal dot (may be trailing), and may be in scientific notation. All numbers may include underscores _ in their representation for clarity. For example, 299_792_458 is the speed of light in meters per second.

Strings can be typed in the same way as in JSON, using double quotes "", with the same escape rules. You can also use single quotes '', in which case the roles of double quotes and single quotes are switched. There is also a “raw string” notation:

___"I'm a raw string"___

A raw string starts with an arbitrary number of underscores, and then a double quote. It terminates when followed by a double quote and the same number of underscores. Everything in between is interpreted exactly as typed, including any newlines. By varying the number of underscores, you can represent any string without quoting.

There is no literal representation for Bytes or Uuid. Use the appropriate functions to create them. If you are inserting data into a stored relation with a column specified to contain bytes or UUIDs, auto-coercion will kick in and use decode_base64 and to_uuid for conversion.

Lists are items enclosed between square brackets [], separated by commas. A trailing comma is allowed after the last item.

Column types#

The following atomic types can be specified for columns in stored relations:

  • Int

  • Float

  • Bool

  • String

  • Bytes

  • Uuid

  • Validity

There is no Null type. Instead, if you put a question mark after a type, it is treated as nullable, meaning that it either takes value in the type or is null.

Two composite types are available. A homogeneous list is specified by square brackets, with the inner type in between, like this: [Int]. You may optionally specify how many elements are expected, like this: [Int; 10]. A heterogeneous list, or a tuple, is specified by round brackets, with the element types listed by position, like this: (Int, Float, String). Tuples always have fixed lengths.

A special type Any can be specified, allowing all values except null. If you want to allow null as well, use Any?. Composite types may contain other composite types or Any types as their inner types.

Query execution#

Databases often consider how queries are executed an implementation detail hidden behind an abstraction barrier that users need not care about, so that databases can utilize query optimizers to choose the best query execution plan regardless of how the query was originally written. This abstraction barrier is leaky, however, since bad query execution plans invariably occur, and users need to “reach behind the curtain” to fix performance problems, which is a difficult and tiring task. The problem becomes more severe the more joins a query contains, and graph queries tend to contain a large number of joins.

So in Cozo we take the pragmatic approach and make query execution deterministic and easy to deduce from how the query is written. The flip side is that we require the user to know the best way to store their data, which is in general less demanding than coaxing a query optimizer. Then, armed with the knowledge in this chapter, you can write efficient queries with ease.

Disjunctive normal form#

Evaluation starts by canonicalizing inline rules into disjunctive normal form, i.e., a disjunction of conjunctions, with any negation pushed to the innermost level. Each clause of the outermost disjunction is then treated as a separate rule. The consequence is that the safety rule may be violated even though textually every variable in the head occurs in the body. As an example:

rule[a, b] := rule1[a] or rule2[b]

is a violation of the safety rule since it is rewritten into two rules, each of which is missing a different binding.
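
Conceptually, the rewrite yields something like the following pair of rules, each of which leaves one head variable unbound in its body and is therefore rejected:

rule[a, b] := rule1[a] # b is never bound here
rule[a, b] := rule2[b] # a is never bound here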

Stratification#

The next step in the processing is stratification. It begins by making a graph of the named rules, with the rules themselves as nodes, and a link is added between two nodes when one of the rules applies the other. This application is through atoms for inline rules, and input relations for fixed rules.

Next, some of the links are labelled stratifying:

  • when an inline rule applies another rule through negation,

  • when an inline rule applies another inline rule that contains aggregations,

  • when an inline rule applies itself and contains aggregations that are not semi-lattice aggregations,

  • when an inline rule applies another rule which is a fixed rule,

  • when a fixed rule has another rule as an input relation.

The strongly connected components of the graph of rules are then determined and tested, and if it is found that some strongly connected component contains a stratifying link, the graph is deemed unstratifiable and the execution aborts. Otherwise, Cozo will topologically sort the strongly connected components to determine the strata of the rules: rules within the same stratum are logically executed together, and no two rules within the same stratum can have a stratifying link between them. In this process, Cozo will merge the strongly connected components into as few supernodes as possible while still maintaining the restriction on stratifying links. The resulting strata are then passed on to be processed in the next step.

You can see the stratum number assigned to rules by using the ::explain system op.

Magic set rewrites#

Within each stratum, the input rules are rewritten using the technique of magic sets. This rewriting ensures that the query execution does not waste time calculating results that are later simply discarded. As an example, consider:

reachable[a, b] := link[a, b]
reachable[a, b] := reachable[a, c], link[c, b]
?[r] := reachable['A', r]

Without magic set rewrites, the whole reachable relation is generated first, then most of them are thrown away, keeping only those starting from 'A'. Magic set rewriting avoids this problem. You can see the result of the rewriting using ::explain. The rewritten query is guaranteed to yield the same relation for ?, and will in general yield fewer intermediate rows.

The rewrite currently only applies to inline rules without aggregations.

Semi-naïve evaluation#

Now each stratum contains either a single fixed rule or a set of inline rules. The single fixed rules are executed by running their specific implementations. For the inline rules, each of them is assigned an output relation. Assuming we know how to evaluate each rule given all the relations it depends on, the semi-naïve algorithm can now be applied to the rules to yield all output rows.

The semi-naïve algorithm is a bottom-up evaluation strategy, meaning that it tries to deduce all facts from a set of given facts.

Note

By contrast, top-down strategies start with stated goals and try to find proof for the goals. Bottom-up strategies have many advantages over top-down ones when the whole output of each rule is needed, but may waste time generating unused facts if only some of the output is kept. Magic set rewrites are introduced to eliminate precisely this weakness.

Ordering of atoms#

The compiler reorders the atoms in the body of the inline rules, and then the atoms are evaluated.

After conversion to disjunctive normal forms, each atom can only be one of the following:

  • an explicit unification,

  • applying a rule or a stored relation,

  • an expression, which should evaluate to a boolean,

  • a negation of an application.

The first two cases may introduce fresh bindings, whereas the last two cannot. The reordering makes all atoms that introduce new bindings stay where they are, whereas all atoms that do not introduce new bindings are moved to the earliest possible place where all their bindings are bound. All atoms that introduce bindings correspond to joining with a pre-existing relation followed by projections in relational algebra, and all atoms that do not correspond to filters. By applying filters as early as possible, we minimize the number of rows before joining them with the next relation.

When writing the body of rules, we should aim to minimize the total number of rows generated. A strategy that works in almost all cases is to put the most restrictive atoms that generate new bindings first.
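
A sketch with hypothetical stored relations *users {country, id} and *orders {id => item}: the first atom restricts the rows early, and the second atom then joins on the already bound id, which is a prefix scan since id is the key of *orders:

?[item] := *users{country: 'NZ', id}, *orders{id, item}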

Evaluating atoms#

We now explain how a single atom which generates new bindings is processed.

For unifications, the right-hand side, an expression with all variables bound, is simply evaluated, and the result is joined to the current relation (as in a map-cat operation in functional languages).

Rules or stored relations are conceptually trees, with composite keys sorted lexicographically. The complexity of their applications in atoms is therefore determined by whether the bound variables and constants in the application bindings form a key prefix. For example, the following application:

a_rule['A', 'B', c]

with c unbound, is very efficient, since this corresponds to a prefix scan in the tree with the key prefix ['A', 'B'], whereas the following application:

a_rule[a, 'B', 'C']

where a is unbound, is very expensive, since we must do a full scan. On the other hand, if a is bound, then this is only a logarithmic-time existence check.

For stored relations, you need to check its schema for the order of keys to deduce the complexity. The system op ::explain may also give you some information.

Rows are generated in a streaming fashion, meaning that relation joins proceed as soon as one row is available, and do not wait until the whole relation is generated.

Early stopping#

For the entry rule ?, if :limit is specified as a query option, a counter is used to monitor how many valid rows are already generated. If enough rows are generated, the query stops. This only works when the entry rule is inline and you do not specify :order.

Tips for writing queries#

Dealing with nulls#

Cozo is strict about types. A simple query such as:

?[a] := *rel[a, b], b > 0

will throw an error if any of the b values are null: comparisons can only be made between values of the same type. There are various ways to deal with this. If you decide that the condition should be false when the values are not of the same type, you write:

?[a] := *rel[a, b], try(b > 0, false)

Alternatively, you may decide to consider any null values to be equivalent to some default values, in which case you write:

?[a] := *rel[a, b], (b ~ -1) > 0

here ~ is the coalesce operator. The parentheses are not necessary, but it reads better this way. We recommend using the coalesce operator over using try, since it is more explicit.

You can also be very explicit:

?[a] := *rel[a, b], if(is_null(b), false, b > 0)

but this is rather verbose. cond is also helpful in this case.
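
For instance, a sketch equivalent to the if version above, written with cond and a catch-all clause:

?[a] := *rel[a, b], cond(is_null(b), false, true, b > 0)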

How to join relations#

Suppose we have the following relation:

:create friends {fr, to}

Let’s say we want to find Alice’s friends’ friends’ friends’ friends’ friends. One way to write this is:

?[who] := *friends{fr: 'Alice', to: f1},
          *friends{fr: f1, to: f2},
          *friends{fr: f2, to: f3},
          *friends{fr: f3, to: f4},
          *friends{fr: f4, to: who}

Another way is:

f1[who] := *friends{fr: 'Alice', to: who}
f2[who] := f1[fr], *friends{fr, to: who}
f3[who] := f2[fr], *friends{fr, to: who}
f4[who] := f3[fr], *friends{fr, to: who}
?[who] := f4[fr], *friends{fr, to: who}

These two queries yield identical values. But on real networks, where loops abound, the second way of writing executes exponentially faster than the first. Why? Because of set semantics in relations, the second way of writing deduplicates at every turn, whereas the first way of writing builds up all paths to the final layer of friends. Depending on the size of your graph, your computer may not even have enough memory to hold all these paths!

The moral of the story is, always prefer to break your query into smaller rules. It usually reads better, and unlike in some other databases, it almost always executes faster in Cozo as well. But for this particular case, in which the query is largely recursive, prefer to make it a recursive relation:

f_n[who, min(layer)] := *friends{fr: 'Alice', to: who}, layer = 1
f_n[who, min(layer)] := f_n[fr, last_layer], *friends{fr, to: who}, layer = last_layer + 1, layer <= 5
?[who] := f_n[who, 5]

The condition layer <= 5 is necessary to ensure termination.

Are there any situations where the first way of writing is acceptable? Yes:

?[who] := *friends{fr: 'Alice', to: f1},
          *friends{fr: f1, to: f2},
          *friends{fr: f2, to: f3},
          *friends{fr: f3, to: f4},
          *friends{fr: f4, to: who}
:limit 1

in this case, we stop at the first path, and this way of writing avoids the overhead of multiple rules and is perhaps very slightly faster.

Also, if you want to count the different paths, you must write:

?[count(who)] := *friends{fr: 'Alice', to: f1},
                 *friends{fr: f1, to: f2},
                 *friends{fr: f2, to: f3},
                 *friends{fr: f3, to: f4},
                 *friends{fr: f4, to: who}

The multiple-rules way of writing gives wrong results due to set semantics. Due to the presence of the aggregation count, this query only keeps a single path in memory at any instant, so it won’t blow up your memory even on web-scale data.

Functions and operators#

Functions can be used to build expressions.

All functions except those having names starting with rand_ are deterministic.

Non-functions#

A function takes expressions as arguments, evaluates each argument in turn, and then evaluates its own implementation to produce a value that can be used in an expression. We first describe constructs that look like functions but are not.

These are language constructs that return Horn clauses instead of expressions:

  • var = expr unifies expr with var. Different from expr1 == expr2.

  • not clause negates a Horn clause clause. Different from !expr or negate(expr).

  • clause1 or clause2 connects two Horn-clauses by disjunction. Different from or(expr1, expr2).

  • clause1 and clause2 connects two Horn-clauses by conjunction. Different from and(expr1, expr2).

  • clause1, clause2 connects two Horn-clauses by conjunction.

For the last three, and binds more tightly than or, which in turn binds more tightly than ,: and and , are identical in every aspect except their binding powers.

These are constructs that return expressions:

  • try(a, b, ...) evaluates each argument in turn and stops at the first expression that does not throw, returning its value.

  • if(a, b, c) evaluates a, and if the result is true, evaluates b and returns its value; otherwise evaluates c and returns its value. a must evaluate to a boolean.

  • if(a, b) same as if(a, b, null)

  • cond(a1, b1, a2, b2, ...) evaluates a1; if the result is true, returns the value of b1, otherwise continues with a2 and b2. An even number of arguments must be given, and the a's must evaluate to booleans. If all a's are false, null is returned. If you want a catch-all clause at the end, put true as the condition (see the sketch below).
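
A minimal sketch of a cond with a catch-all clause, written as a unification atom inside a rule body (score and grade are hypothetical variables bound or used elsewhere in the rule):

grade = cond(
    score >= 90, 'A',
    score >= 80, 'B',
    true, 'C'
)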

Operators representing functions#

Some functions have equivalent operator forms, which are easier to type and perhaps more familiar. First the binary operators:

  • a && b is the same as and(a, b)

  • a || b is the same as or(a, b)

  • a ^ b is the same as pow(a, b)

  • a ++ b is the same as concat(a, b)

  • a + b is the same as add(a, b)

  • a - b is the same as sub(a, b)

  • a * b is the same as mul(a, b)

  • a / b is the same as div(a, b)

  • a % b is the same as mod(a, b)

  • a >= b is the same as ge(a, b)

  • a <= b is the same as le(a, b)

  • a > b is the same as gt(a, b)

  • a < b is the same as lt(a, b)

  • a == b is the same as eq(a, b)

  • a != b is the same as neq(a, b)

  • a ~ b is the same as coalesce(a, b)

These operators have precedence as follows (earlier rows bind more tightly, and within the same row operators have equal binding power):

  • ~

  • ^

  • *, /

  • +, -, ++

  • ==, !=

  • %

  • >=, <=, >, <

  • &&

  • ||

With the exception of ^, all binary operators are left associative: a / b / c is the same as (a / b) / c. ^ is right associative: a ^ b ^ c is the same as a ^ (b ^ c).

And the unary operators are:

  • -a is the same as minus(a)

  • !a is the same as negate(a)

Function applications using parentheses bind the tightest, followed by unary operators, then binary operators.

Equality and Comparisons#

eq(x, y)#

Equality comparison. The operator form is x == y. The two arguments of the equality can be of different types, in which case the result is false.

neq(x, y)#

Inequality comparison. The operator form is x != y. The two arguments of the equality can be of different types, in which case the result is true.

gt(x, y)#

Equivalent to x > y

ge(x, y)#

Equivalent to x >= y

lt(x, y)#

Equivalent to x < y

le(x, y)#

Equivalent to x <= y

Note

The four comparison operators can only compare values of the same runtime type. Integers and floats are of the same type Number.

max(x, ...)#

Returns the maximum of the arguments. Can only be applied to numbers.

min(x, ...)#

Returns the minimum of the arguments. Can only be applied to numbers.

Boolean functions#

and(...)#

Variadic conjunction. For binary arguments it is equivalent to x && y.

or(...)#

Variadic disjunction. For binary arguments it is equivalent to x || y.

negate(x)#

Negation. Equivalent to !x.

assert(x, ...)#

Returns true if x is true, otherwise will raise an error containing all its arguments as the error message.

Mathematics#

add(...)#

Variadic addition. The binary version is the same as x + y.

sub(x, y)#

Equivalent to x - y.

mul(...)#

Variadic multiplication. The binary version is the same as x * y.

div(x, y)#

Equivalent to x / y.

minus(x)#

Equivalent to -x.

pow(x, y)#

Raises x to the power of y. Equivalent to x ^ y. Always returns a float.

mod(x, y)#

Returns the remainder when x is divided by y. Arguments can be floats. The returned value has the same sign as x. Equivalent to x % y.

abs(x)#

Returns the absolute value.

signum(x)#

Returns 1, 0 or -1, whichever has the same sign as the argument, e.g. signum(to_float('NEG_INFINITY')) == -1, signum(0.0) == 0, but signum(-0.0) == -1. Returns NAN when applied to NAN.

floor(x)#

Returns the floor of x.

ceil(x)#

Returns the ceiling of x.

round(x)#

Returns the nearest integer to the argument (represented as Float if the argument itself is a Float). Round halfway cases away from zero. E.g. round(0.5) == 1.0, round(-0.5) == -1.0, round(1.4) == 1.0.

exp(x)#

Returns the exponential of the argument, natural base.

exp2(x)#

Returns the exponential base 2 of the argument. Always returns a float.

ln(x)#

Returns the natural logarithm.

log2(x)#

Returns the logarithm base 2.

log10(x)#

Returns the logarithm base 10.

sin(x)#

The sine trigonometric function.

cos(x)#

The cosine trigonometric function.

tan(x)#

The tangent trigonometric function.

asin(x)#

The inverse sine.

acos(x)#

The inverse cosine.

atan(x)#

The inverse tangent.

atan2(x, y)#

The two-argument inverse tangent atan2, computed with x and y passed separately.

sinh(x)#

The hyperbolic sine.

cosh(x)#

The hyperbolic cosine.

tanh(x)#

The hyperbolic tangent.

asinh(x)#

The inverse hyperbolic sine.

acosh(x)#

The inverse hyperbolic cosine.

atanh(x)#

The inverse hyperbolic tangent.

deg_to_rad(x)#

Converts degrees to radians.

rad_to_deg(x)#

Converts radians to degrees.

haversine(a_lat, a_lon, b_lat, b_lon)#

Computes with the haversine formula the angle measured in radians between two points a and b on a sphere specified by their latitudes and longitudes. The inputs are in radians. You probably want the next function when you are dealing with maps, since most maps measure angles in degrees instead of radians.

haversine_deg_input(a_lat, a_lon, b_lat, b_lon)#

Same as the previous function, but the inputs are in degrees instead of radians. The return value is still in radians.

If you want the approximate distance measured on the surface of the earth instead of the angle between two points, multiply the result by the radius of the earth, which is about 6371 kilometres, 3959 miles, or 3440 nautical miles.

Note

The haversine formula, when applied to the surface of the earth, which is not a perfect sphere, can result in an error of less than one percent.

String functions#

length(str)#

Returns the number of Unicode characters in the string.

Can also be applied to a list or a byte array.

Warning

length(str) does not return the number of bytes of the string representation. Also, what is returned depends on the normalization of the string. So if such details are important, apply unicode_normalize before length.

concat(x, ...)#

Concatenates strings. Equivalent to x ++ y in the binary case.

Can also be applied to lists.

str_includes(x, y)#

Returns true if x contains the substring y, false otherwise.

lowercase(x)#

Convert to lowercase. Supports Unicode.

uppercase(x)#

Converts to uppercase. Supports Unicode.

trim(x)#

Removes whitespace from both ends of the string.

trim_start(x)#

Removes whitespace from the start of the string.

trim_end(x)#

Removes whitespace from the end of the string.

starts_with(x, y)#

Tests if x starts with y.

Tip

starts_with(var, str) is preferred over equivalent (e.g. regex) conditions, since the compiler may more easily compile the clause into a range scan.

ends_with(x, y)#

Tests if x ends with y.

unicode_normalize(str, norm)#

Converts str to the normalization specified by norm. The valid values of norm are 'nfc', 'nfd', 'nfkc' and 'nfkd'.

chars(str)#

Returns Unicode characters of the string as a list of substrings.

from_substrings(list)#

Combines the strings in list into a big string. In a sense, it is the inverse function of chars.

Warning

If you want substring slices, indexing strings, etc., first convert the string to a list with chars, do the manipulation on the list, and then recombine with from_substrings.

List functions#

list(x, ...)#

Constructs a list from its argument, e.g. list(1, 2, 3). Equivalent to the literal form [1, 2, 3].

is_in(el, list)#

Tests the membership of an element in a list.

first(l)#

Extracts the first element of the list. Returns null if given an empty list.

last(l)#

Extracts the last element of the list. Returns null if given an empty list.

get(l, n)#

Returns the element at index n in the list l. Raises an error if the access is out of bounds. Indices start with 0.

maybe_get(l, n)#

Returns the element at index n in the list l. Returns null if the access is out of bounds. Indices start with 0.

length(list)#

Returns the length of the list.

Can also be applied to a string or a byte array.

slice(l, start, end)#

Returns the slice of the list between the index start (inclusive) and end (exclusive). Negative numbers may be used, and are interpreted as counting from the end of the list. E.g. slice([1, 2, 3, 4], 1, 3) == [2, 3], slice([1, 2, 3, 4], 1, -1) == [2, 3].

concat(x, ...)#

Concatenates lists. The binary case is equivalent to x ++ y.

Can also be applied to strings.

prepend(l, x)#

Prepends x to l.

append(l, x)#

Appends x to l.

reverse(l)#

Reverses the list.

sorted(l)#

Sorts the list and returns the sorted copy.

chunks(l, n)#

Splits the list l into chunks of n, e.g. chunks([1, 2, 3, 4, 5], 2) == [[1, 2], [3, 4], [5]].

chunks_exact(l, n)#

Splits the list l into chunks of n, discarding any trailing elements, e.g. chunks_exact([1, 2, 3, 4, 5], 2) == [[1, 2], [3, 4]].

windows(l, n)#

Splits the list l into overlapping windows of length n. e.g. windows([1, 2, 3, 4, 5], 3) == [[1, 2, 3], [2, 3, 4], [3, 4, 5]].

union(x, y, ...)#

Computes the set-theoretic union of all the list arguments.

intersection(x, y, ...)#

Computes the set-theoretic intersection of all the list arguments.

difference(x, y, ...)#

Computes the set-theoretic difference of the first argument with respect to the rest.

Binary functions#

length(bytes)#

Returns the length of the byte array.

Can also be applied to a list or a string.

bit_and(x, y)#

Calculates the bitwise and. The two byte arrays must have the same length.

bit_or(x, y)#

Calculates the bitwise or. The two byte arrays must have the same length.

bit_not(x)#

Calculates the bitwise not.

bit_xor(x, y)#

Calculates the bitwise xor. The two byte arrays must have the same length.

pack_bits([...])#

Packs a list of booleans into a byte array; if the length of the list is not divisible by 8, it is padded with false.

unpack_bits(x)#

Unpacks a byte array into a list of booleans.

encode_base64(b)#

Encodes the byte array b into the Base64-encoded string.

Note

encode_base64 is automatically applied when output to JSON since JSON cannot represent bytes natively.

decode_base64(str)#

Tries to decode the str as a Base64-encoded byte array.

Type checking and conversions#

coalesce(x, ...)#

Returns the first non-null value; coalesce(x, y) is equivalent to x ~ y.

to_string(x)#

Convert x to a string: the argument is unchanged if it is already a string, otherwise its JSON string representation will be returned.

to_float(x)#

Tries to convert x to a float. Conversion from numbers always succeeds. Conversion from strings has the following special cases in addition to the usual string representation:

  • INF is converted to infinity;

  • NEG_INF is converted to negative infinity;

  • NAN is converted to NAN (but don’t compare NAN by equality, use is_nan instead);

  • PI is converted to pi (3.14159…);

  • E is converted to the base of natural logarithms, or Euler’s constant (2.71828…).

Converts null and false to 0.0, and true to 1.0.

to_int(x)#

Converts to an integer. If x is a validity, extracts the timestamp as an integer.

to_unity(x)#

Tries to convert x to 0 or 1: null, false, 0, 0.0, "", [], and the empty bytes are converted to 0, and everything else is converted to 1.

This is useful in conjunction with aggregation functions. For example, ?[x, count(x)] := rel[x, y], y > 3 omits groups that are completely filtered out by the condition in the body. To count such groups as zero instead, use ?[x, sum(should_count)] := rel[x, y], should_count = to_unity(y > 3).

to_bool(x)#

Tries to convert x to a boolean. The following are converted to false, and everything else is converted to true:

  • null

  • false

  • 0, 0.0

  • "" (empty string)

  • the empty byte array

  • the nil UUID (all zeros)

  • [] (the empty list)

  • any validity that is a retraction

to_uuid(x)#

Tries to convert x to a UUID. The input must either be a hyphenated UUID string representation or already a UUID for it to succeed.

uuid_timestamp(x)#

Extracts the timestamp from a UUID version 1, as seconds since the UNIX epoch. If the UUID is not of version 1, null is returned. If x is not a UUID, an error is raised.

is_null(x)#

Checks for null.

is_int(x)#

Checks for integers.

is_float(x)#

Checks for floats.

is_finite(x)#

Returns true if x is an integer or a finite float.

is_infinite(x)#

Returns true if x is infinity or negative infinity.

is_nan(x)#

Returns true if x is the special float NAN. Returns false when the argument is not of number type.

is_num(x)#

Checks for numbers.

is_bytes(x)#

Checks for bytes.

is_list(x)#

Checks for lists.

is_string(x)#

Checks for strings.

is_uuid(x)#

Checks for UUIDs.

Random functions#

rand_float()#

Generates a float in the interval [0, 1], sampled uniformly.

rand_bernoulli(p)#

Generates a boolean with probability p of being true.

rand_int(lower, upper)#

Generates an integer within the given bounds, both bounds are inclusive.

rand_choose(list)#

Randomly chooses an element from list and returns it. If the list is empty, it returns null.

rand_uuid_v1()#

Generate a random UUID, version 1 (random bits plus timestamp). The resolution of the timestamp part is much coarser on WASM targets than the others.

rand_uuid_v4()#

Generate a random UUID, version 4 (completely random bits).

Regex functions#

regex_matches(x, reg)#

Tests if x matches the regular expression reg.

regex_replace(x, reg, y)#

Replaces the first occurrence of the pattern reg in x with y.

regex_replace_all(x, reg, y)#

Replaces all occurrences of the pattern reg in x with y.

regex_extract(x, reg)#

Extracts all occurrences of the pattern reg in x and returns them in a list.

regex_extract_first(x, reg)#

Extracts the first occurrence of the pattern reg in x and returns it. If none is found, returns null.

Regex syntax#

Matching one character:

.             any character except new line
\d            digit (\p{Nd})
\D            not digit
\pN           One-letter name Unicode character class
\p{Greek}     Unicode character class (general category or script)
\PN           Negated one-letter name Unicode character class
\P{Greek}     negated Unicode character class (general category or script)

Character classes:

[xyz]         A character class matching either x, y or z (union).
[^xyz]        A character class matching any character except x, y and z.
[a-z]         A character class matching any character in range a-z.
[[:alpha:]]   ASCII character class ([A-Za-z])
[[:^alpha:]]  Negated ASCII character class ([^A-Za-z])
[x[^xyz]]     Nested/grouping character class (matching any character except y and z)
[a-y&&xyz]    Intersection (matching x or y)
[0-9&&[^4]]   Subtraction using intersection and negation (matching 0-9 except 4)
[0-9--4]      Direct subtraction (matching 0-9 except 4)
[a-g~~b-h]    Symmetric difference (matching a and h only)
[\[\]]        Escaping in character classes (matching [ or ])

Composites:

xy    concatenation (x followed by y)
x|y   alternation (x or y, prefer x)

Repetitions:

x*        zero or more of x (greedy)
x+        one or more of x (greedy)
x?        zero or one of x (greedy)
x*?       zero or more of x (ungreedy/lazy)
x+?       one or more of x (ungreedy/lazy)
x??       zero or one of x (ungreedy/lazy)
x{n,m}    at least n x and at most m x (greedy)
x{n,}     at least n x (greedy)
x{n}      exactly n x
x{n,m}?   at least n x and at most m x (ungreedy/lazy)
x{n,}?    at least n x (ungreedy/lazy)
x{n}?     exactly n x

Empty matches:

^     the beginning of the text
$     the end of the text
\A    only the beginning of the text
\z    only the end of the text
\b    a Unicode word boundary (\w on one side and \W, \A, or \z on the other)
\B    not a Unicode word boundary

Timestamp functions#

now()#

Returns the current timestamp as seconds since the UNIX epoch. The resolution is much coarser on WASM targets than the others.

format_timestamp(ts, tz?)#

Interprets ts as seconds since the UNIX epoch and formats it as a string according to RFC 3339. If ts is a validity, its timestamp will be converted to seconds and used.

If a second string argument is provided, it is interpreted as a timezone and used to format the timestamp.

parse_timestamp(str)#

Parses str into seconds since the UNIX epoch according to RFC 3339.

Aggregations#

Aggregations in Cozo can be thought of as a function that acts on a stream of values and produces a single value (the aggregate).

There are two kinds of aggregations in Cozo, ordinary aggregations and semi-lattice aggregations. They are implemented differently in Cozo, with semi-lattice aggregations generally faster and more powerful (only the latter can be used recursively).

The power of semi-lattice aggregations derives from the additional properties a semi-lattice satisfies:

idempotency

the aggregate of a single value a is a itself,

commutativity

the aggregate of a then b is equal to the aggregate of b then a,

associativity

it is immaterial where we put the parentheses in an aggregate application.

In auto-recursive semi-lattice aggregations, there are soundness constraints on what can be done on the bindings coming from the auto-recursive parts within the body of the rule. Usually you do not need to worry about this at all since the obvious ways of using this functionality are all sound, but as for non-termination due to fresh variables introduced by function applications, Cozo does not (and cannot) check for unsoundness in this case.

Semi-lattice aggregations#

min(x)#

Aggregate the minimum value of all x.

max(x)#

Aggregate the maximum value of all x.

and(var)#

Aggregate the logical conjunction of the variable passed in.

or(var)#

Aggregate the logical disjunction of the variable passed in.

union(var)#

Aggregate the unions of var, which must be a list.

intersection(var)#

Aggregate the intersections of var, which must be a list.

choice(var)#

Returns a non-null value. If all values are null, returns null. Which one is returned is deterministic but implementation-dependent and may change from version to version.

min_cost([data, cost])#

The argument should be a list of two elements, and this aggregation chooses the list with the minimum cost.

shortest(var)#

var must be a list. Returns the shortest list among all values. Ties will be broken non-deterministically.

bit_and(var)#

var must be bytes. Returns the bitwise ‘and’ of the values.

bit_or(var)#

var must be bytes. Returns the bitwise ‘or’ of the values.

Ordinary aggregations#

count(var)#

Count how many values are generated for var (using bag instead of set semantics).

count_unique(var)#

Count how many unique values there are for var.

collect(var)#

Collect all values for var into a list.

unique(var)#

Collect var into a list, keeping each unique value only once.

group_count(var)#

Count the occurrences of the unique values of var, putting the result into a list of lists, e.g. when applied to 'a', 'b', 'c', 'c', 'a', 'c', the result is [['a', 2], ['b', 1], ['c', 3]].

bit_xor(var)#

var must be bytes. Returns the bitwise ‘xor’ of the values.

latest_by([data, time])#

The argument should be a list of two elements, and this aggregation returns the data with the maximum time. It is very similar to min_cost, the differences being that the maximum instead of the minimum is taken, non-numerical times are allowed, only data is returned, and the aggregation is deliberately not a semi-lattice aggregation.

Note

This aggregation is intended to be used in timestamped audit trails. As an example:

?[id, latest_by(status_ts)] := *data[id, status, ts], status_ts = [status, ts]

returns the latest status for each id. If you do this regularly, consider using the time travelling facility.

smallest_by([data, cost])#

The argument should be a list of two elements, and this aggregation returns the data of the minimum cost. Non-numerical costs are allowed, unlike min_cost. The value null for data is ignored when comparing.

choice_rand(var)#

Non-deterministically chooses one of the values of var as the aggregate. Each value the aggregation encounters has the same probability of being chosen.

Note

This version of choice is not a semi-lattice aggregation, since it is impossible to satisfy the uniform sampling requirement without maintaining state. This is an implementation restriction unlikely to be lifted.

Statistical aggregations#

mean(x)#

The mean value of x.

sum(x)#

The sum of x.

product(x)#

The product of x.

variance(x)#

The sample variance of x.

std_dev(x)#

The sample standard deviation of x.
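
As a sketch of how these can be combined, the following query computes several statistics in one pass; the stored relation *person{id, age} is made up for illustration:

?[min(age), max(age), mean(age)] := *person{id, age}

Non-aggregated head variables act as grouping keys, so ?[age, count(id)] := *person{id, age} would instead return one row per distinct age.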

Utilities and algorithms#

Fixed rules in CozoScript apply utilities or algorithms.

The algorithms described here are only available if your distribution of Cozo is compiled with the graph-algo feature flag. Currently all prebuilt binaries are compiled with this flag on.

Some algorithms make use of multiple threads to greatly improve running time if the rayon feature flag is on. All prebuilt binaries except WASM have this flag on. As a result, and also because of other platform restrictions, algorithms on WASM in general run much slower than on other platforms.

Utilities#

Constant(data: [...])#

Returns a relation containing the data passed in. The constant rule ?[] <- ... is syntax sugar for ?[] <~ Constant(data: ...).

Parameters

data – A list of lists, representing the rows of the returned relation.

ReorderSort(rel[...], out: [...], sort_by: [...], descending: false, break_ties: false, skip: 0, take: 0)#

Sort and then extract new columns of the passed in relation rel.

Parameters
  • out (required) – A list of expressions which will be used to produce the output relation. Any bindings in the expressions will be bound to the named positions in rel.

  • sort_by – A list of expressions which will be used to produce the sort keys. Any bindings in the expressions will be bound to the named positions in rel.

  • descending – Whether the sorting process should be done in descending order. Defaults to false.

  • break_ties – Whether ties should be broken, e.g. whether the first two rows with identical sort keys should be given ordering numbers 1 and 2 instead of 1 and 1. Defaults to false.

  • skip – How many rows to skip before producing rows. Defaults to zero.

  • take – How many rows at most to produce. Zero means no limit. Defaults to zero.

Returns

The returned relation, in addition to the columns specified in the parameter out, will have the ordering prepended. The ordering starts at 1.

Tip

This algorithm serves a similar purpose to the global :order, :limit and :offset options, but can be applied to intermediate results. Prefer the global options if the sorting is applied to the final output.
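
A minimal sketch, using a made-up ad hoc relation, that keeps the two highest values and prepends their ordering:

scores[] <- [['a', 10], ['b', 20], ['c', 15]]

?[rank, name] <~ ReorderSort(scores[name, val], out: [name], sort_by: [val], descending: true, take: 2)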

CsvReader(url: ..., types: [...], delimiter: ',', prepend_index: false, has_headers: true)#

Read a CSV file from disk or an HTTP GET request and convert the result to a relation.

This utility is not available on WASM targets. In addition, if the feature flag requests is off, only reading from local files is supported.

Parameters
  • url (required) – URL for the CSV file. For local file, use file://<PATH_TO_FILE>.

  • types (required) – A list of strings interpreted as types for the columns of the output relation. If any type is specified as nullable and conversion to the specified type fails, null will be the result. This is more lenient than other functions since CSVs tend to contain lots of bad values.

  • delimiter – The delimiter to use when parsing the CSV file.

  • prepend_index – If true, row index will be prepended to the columns.

  • has_headers – Whether the CSV file has headers. The reader will not interpret the header in any way but will instead simply ignore it.
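
A minimal sketch; the file path and column types below are made up for illustration:

?[idx, name, age] <~ CsvReader(url: 'file:///path/to/people.csv',
                               types: ['String', 'Int?'],
                               prepend_index: true,
                               has_headers: true)

With prepend_index set to true, the output relation has the row index followed by one column per entry in types.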

JsonReader(url: ..., fields: [...], json_lines: true, null_if_absent: false, prepend_index: false)#

Read a JSON file from disk or an HTTP GET request and convert the result to a relation.

This utility is not available on WASM targets. In addition, if the feature flag requests is off, only reading from local files is supported.

Parameters
  • url (required) – URL for the JSON file. For local file, use file://<PATH_TO_FILE>.

  • fields (required) – A list of field names, for extracting fields from JSON arrays into the relation.

  • json_lines – If true, parse the file as lines of JSON objects, each line containing a single object; if false, parse the file as a JSON array containing many objects.

  • null_if_absent – If true and a requested field is absent, will output null in its place. If false and the requested field is absent, an error will be thrown.

  • prepend_index – If true, row index will be prepended to the columns.

Connectedness algorithms#

ConnectedComponents(edges[from, to])#

Computes the connected components of a graph with the provided edges.

Returns

Pairs containing the node index, and its component index.
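
A minimal sketch, assuming a stored relation *friends{fr, to} such as the one used in the benchmarks later in this document:

?[node, component] <~ ConnectedComponents(*friends[])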

StronglyConnectedComponent(edges[from, to])#

Computes the strongly connected components of a graph with the provided edges.

Returns

Pairs containing the node index, and its component index.

SCC(...)#

See Algo.StronglyConnectedComponent.

MinimumSpanningForestKruskal(edges[from, to, weight?])#

Runs Kruskal’s algorithm on the provided edges to compute a minimum spanning forest. Negative weights are fine.

Returns

Triples containing the from-node, the to-node, and the cost from the tree root to the to-node. Which nodes are chosen to be the roots is non-deterministic. Multiple roots imply the graph is disconnected.

MinimumSpanningTreePrim(edges[from, to, weight?], starting?[idx])#

Runs Prim’s algorithm on the provided edges to compute a minimum spanning tree. starting should be a relation producing exactly one node index as the starting node. Only the connected component of the starting node is returned. If starting is omitted, which component is returned is arbitrary.

Returns

Triples containing the from-node, the to-node, and the cost from the tree root to the to-node.

TopSort(edges[from, to])#

Performs topological sorting on the graph with the provided edges. The graph is required to be connected in the first place.

Returns

Pairs containing the sort order and the node index.

Pathfinding algorithms#

ShortestPathBFS(edges[from, to], starting[start_idx], goals[goal_idx])#

Runs breadth-first search to determine the shortest path between the starting nodes and the goals. Assumes the graph to be directed and all edges to be of unit weight. Ties will be broken in an unspecified way. If you need anything more complicated, use one of the other algorithms below.

Returns

Triples containing the starting node, the goal, and a shortest path.

ShortestPathDijkstra(edges[from, to, weight?], starting[idx], goals[idx], undirected: false, keep_ties: false)#

Runs Dijkstra’s algorithm to determine the shortest paths between the starting nodes and the goals. Weights, if given, must be non-negative.

Parameters
  • undirected – Whether the graph should be interpreted as undirected. Defaults to false.

  • keep_ties – Whether to return all paths with the same lowest cost. Defaults to false, in which case any one path of the lowest cost could be returned.

Returns

4-tuples containing the starting node, the goal, the lowest cost, and a path with the lowest cost.
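
A minimal sketch, assuming a stored relation *route{fr, to, dist} holding weighted edges (the relation name and the node labels are made up for illustration):

starting[] <- [['FRA']]
goals[] <- [['ZRH'], ['YPO']]

?[start, goal, cost, path] <~ ShortestPathDijkstra(*route[], starting[], goals[])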

KShortestPathYen(edges[from, to, weight?], starting[idx], goals[idx], k: expr, undirected: false)#

Runs Yen’s algorithm (backed by Dijkstra’s algorithm) to find the k-shortest paths between nodes in starting and nodes in goals.

Parameters
  • k (required) – How many routes to return for each start-goal pair.

  • undirected – Whether the graph should be interpreted as undirected. Defaults to false.

Returns

4-tuples containing the starting node, the goal, the cost, and a path with the cost.

BreadthFirstSearch(edges[from, to], nodes[idx, ...], starting?[idx], condition: expr, limit: 1)#

Runs breadth first search on the directed graph with the given edges and nodes, starting at the nodes in starting. If starting is not given, it will default to all of nodes, which may be quite a lot to calculate.

Parameters
  • condition (required) – The stopping condition, will be evaluated with the bindings given to nodes. Should evaluate to a boolean, with true indicating an acceptable answer was found.

  • limit – How many answers to produce for each starting node. Defaults to 1.

Returns

Triples containing the starting node, the answer node, and the found path connecting them.
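
A minimal sketch; *friends{fr, to} is the edge relation used in the benchmarks later in this document, while *person{id, age} and the starting node are made up for illustration:

starting[] <- [[1]]

?[start, answer, path] <~ BreadthFirstSearch(*friends[], *person[id, age], starting[], condition: age >= 90)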

BFS(...)#

See Algo.BreadthFirstSearch.

DepthFirstSearch(edges[from, to], nodes[idx, ...], starting?[idx], condition: expr, limit: 1)#

Runs depth first search on the directed graph with the given edges and nodes, starting at the nodes in starting. If starting is not given, it will default to all of nodes, which may be quite a lot to calculate.

Parameters
  • condition (required) – The stopping condition, will be evaluated with the bindings given to nodes. Should evaluate to a boolean, with true indicating an acceptable answer was found.

  • limit – How many answers to produce for each starting node. Defaults to 1.

Returns

Triples containing the starting node, the answer node, and the found path connecting them.

DFS(...)#

See Algo.DepthFirstSearch.

ShortestPathAStar(edges[from, to, weight], nodes[idx, ...], starting[idx], goals[idx], heuristic: expr)#

Computes the shortest path from every node in starting to every node in goals by the A* algorithm.

edges are interpreted as directed, weighted edges with non-negative weights.

Parameters

heuristic (required) – The search heuristic expression. It will be evaluated with the bindings from goals and nodes. It should return a number which is a lower bound of the true shortest distance from a node to the goal node. If the estimate is not a valid lower-bound, i.e. it over-estimates, the results returned may not be correct.

Returns

4-tuples containing the starting node index, the goal node index, the lowest cost, and a path with the lowest cost.

Tip

The performance of the A* algorithm heavily depends on how good your heuristic function is. Passing in 0 as the estimate is always valid, but then you really should be using Dijkstra’s algorithm instead.

Good heuristics usually come about from a metric in the ambient space in which your data live, e.g. spherical distance on the surface of a sphere, or Manhattan distance on a grid. Func.Math.haversine_deg_input could be helpful for the spherical case. Note that you must use the correct units for the distance.

Providing a heuristic that is not guaranteed to be a lower bound might be acceptable if you are fine with inaccuracies. The errors in the answers are bounded by the sum of the margins of your over-estimates.
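
As a sketch of a well-formed heuristic (adapted from the Tutorial’s air-routes dataset, so the *airport and *route relations and the mile-based distances are assumptions here), the great-circle distance serves as the lower bound; 3963 is roughly the Earth’s radius in miles:

code_lat_lon[code, lat, lon] := *airport{code, lat, lon}
starting[code, lat, lon] := code = 'HFE', *airport{code, lat, lon}
goal[code, lat, lon] := code = 'LHR', *airport{code, lat, lon}

?[] <~ ShortestPathAStar(*route[],
                         code_lat_lon[node, lat1, lon1],
                         starting[],
                         goal[goal, lat2, lon2],
                         heuristic: haversine_deg_input(lat1, lon1, lat2, lon2) * 3963)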

Community detection algorithms#

ClusteringCoefficients(edges[from, to, weight?])#

Computes the clustering coefficients of the graph with the provided edges.

Returns

4-tuples containing the node index, the clustering coefficient, the number of triangles attached to the node, and the total degree of the node.

CommunityDetectionLouvain(edges[from, to, weight?], undirected: false, max_iter: 10, delta: 0.0001, keep_depth?: depth)#

Runs the Louvain algorithm on the graph with the provided edges, optionally non-negatively weighted.

Parameters
  • undirected – Whether the graph should be interpreted as undirected. Defaults to false.

  • max_iter – The maximum number of iterations to run within each epoch of the algorithm. Defaults to 10.

  • delta – How much the modularity has to change before a step in the algorithm is considered to be an improvement.

  • keep_depth – How many levels in the hierarchy of communities to keep in the final result. If omitted, all levels are kept.

Returns

Pairs containing the label for a community, and a node index belonging to the community. Each label is a list of integers with maximum length constrained by the parameter keep_depth. This list represents the hierarchy of sub-communities containing the node.
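
A minimal sketch, again assuming the stored relation *friends{fr, to}; here keep_depth: 1 keeps at most one level of the community hierarchy in each label:

?[community, node] <~ CommunityDetectionLouvain(*friends[], keep_depth: 1)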

LabelPropagation(edges[from, to, weight?], undirected: false, max_iter: 10)#

Runs the label propagation algorithm on the graph with the provided edges, optionally weighted.

Parameters
  • undirected – Whether the graph should be interpreted as undirected. Defaults to false.

  • max_iter – The maximum number of iterations to run. Defaults to 10.

Returns

Pairs containing the integer label for a community, and a node index belonging to the community.

Centrality measures#

DegreeCentrality(edges[from, to])#

Computes the degree centrality of the nodes in the graph with the given edges. The computation is trivial, so this should be the first thing to try when exploring new data.

Returns

4-tuples containing the node index, the total degree (how many edges involve this node), the out-degree (how many edges point away from this node), and the in-degree (how many edges point to this node).

PageRank(edges[from, to, weight?], undirected: false, theta: 0.85, epsilon: 0.0001, iterations: 10)#

Computes the PageRank from the given graph with the provided edges, optionally weighted.

This algorithm is implemented differently if the rayon feature flag is not enabled, in which case it runs much slower. This affects only the WASM platform.

Parameters
  • undirected – Whether the graph should be interpreted as undirected. Defaults to false.

  • theta – A number between 0 and 1 indicating how much weight in the PageRank matrix is due to the explicit edges. A value of 1 indicates no random restarts. Defaults to 0.85.

  • epsilon – Minimum PageRank change in any node for an iteration to be considered an improvement. Defaults to 0.0001.

  • iterations – How many iterations to run. Fewer iterations are run if convergence is reached. Defaults to 10.

Returns

Pairs containing the node label and its PageRank. For a graph with uniform edges, the PageRank of every node is 1. The L2-norm of the results is forced to be invariant, i.e. in the results those nodes with a PageRank greater than 1 are “more central” than the average node in a certain sense.

ClosenessCentrality(edges[from, to, weight?], undirected: false)#

Computes the closeness centrality of the graph. The input relation represents edges connecting node indices, optionally weighted.

Parameters

undirected – Whether the edges should be interpreted as undirected. Defaults to false.

Returns

Node index together with its centrality.

BetweennessCentrality(edges[from, to, weight?], undirected: false)#

Computes the betweenness centrality of the graph. The input relation represents edges connecting node indices, optionally weighted.

Parameters

undirected – Whether the edges should be interpreted as undirected. Defaults to false.

Returns

Node index together with its centrality.

Warning

BetweennessCentrality is very expensive for medium to large graphs. If possible, collapse large graphs into supergraphs by running a community detection algorithm first.

Miscellaneous#

RandomWalk(edges[from, to, ...], nodes[idx, ...], starting[idx], steps: 10, weight?: expr, iterations: 1)#

Performs random walk on the graph with the provided edges and nodes, starting at the nodes in starting.

Parameters
  • steps (required) – How many steps to walk for each node in starting. Produced paths may be shorter if dead ends are reached.

  • weight – An expression evaluated against bindings of nodes and bindings of edges, at a time when the walk is at a node and choosing between multiple edges to follow. It should evaluate to a non-negative number indicating the weight of the given choice of edge to follow. If omitted, which edge to follow is chosen uniformly.

  • iterations – How many times walking is repeated for each starting node.

Returns

Triples containing a numerical index for the walk, the starting node, and the path followed.
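
A minimal sketch of a short walk; as in the search examples above, *friends{fr, to} is the edge relation used in the benchmarks later in this document, and *person{id, age} together with the starting node are made up for illustration:

starting[] <- [[1]]

?[walk, start, path] <~ RandomWalk(*friends[], *person[id, age], starting[], steps: 5, iterations: 2)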

Beyond CozoScript#

Most functionalities of the Cozo database are accessible via the CozoScript API. However, other functionalities either cannot conform to the “always return a relation” constraint, or are of such a nature as to make a separate API desirable. These are described here.

The calling convention and even the names of the APIs may differ across target languages; please refer to the respective language-specific documentation. Here we use the Python API as an example to describe what they do.

export_relations(self, relations)#

Export the specified relations. It is guaranteed that the exported data form a consistent snapshot of what was stored in the database.

Parameters

relations – names of the relations in a list.

Returns

a dict with string keys for the names of relations, and values containing all the rows.

import_relations(self, data)#

Import data into the database. The data are imported inside a transaction, so that either all imports are successful, or none are. If conflicts arise because of concurrent modification to the database, via either CozoScript queries or other imports, the transaction will fail.

The relations to import into must exist beforehand, and the data given must match the schema defined.

This API can be used to batch-put or remove data from several stored relations atomically. The data parameter can contain relation names such as "rel_a", or relation names prefixed by a minus sign such as "-rel_a". For the former case, every row given for the relation will be put into the database, i.e. upsert semantics. For the latter case, the corresponding rows are removed from the database, and you should only specify the key part of the rows. As for rm in CozoScript, it is not an error to remove non-existent rows.

Warning

Triggers are not run for direct imports.

Parameters

data – should be given as a dict with string keys, in the same format as returned by export_relations. For example: {"rel_a": {"headers": ["x", "y"], "rows": [[1, 2], [3, 4]]}, "rel_b": {"headers": ["z"], "rows": []}}

backup(self, path)#

Backup a database to the specified path. The exported data is guaranteed to form a consistent snapshot of what was stored in the database.

This backs up everything: you cannot choose what to back up. It is also much more efficient than exporting all stored relations via export_relations, and only a tiny fraction of the total data needs to reside in memory during backup.

This function is only available if the storage-sqlite feature flag was on when compiling. The flag is on for all pre-built binaries except the WASM binaries. The backup produced by this API can then be used as an independent SQLite-based Cozo database. If you want to store the backup for future use, you should compress it to save a lot of disk space.

Parameters

path – the path to write the backup into. For a remote database, this is a path on the remote machine.

restore(self, path)#

Restore the database from a backup. Must be called on an empty database.

This restores everything: you cannot choose what to restore.

Parameters

path – the path to the backup. You cannot restore remote databases this way: use the executable directly.

import_from_backup(self, path, relations)#

Import stored relations from a backup.

In terms of semantics, this is like import_relations, except that data comes from the backup file directly, and you can only put, not rm. It is also more memory-efficient than import_relations.

Warning

Triggers are not run for direct imports.

Parameters
  • path – path to the backup file. For remote databases, this is a path on the remote machine.

  • relations – a list containing the names of the relations to import. The relations must exist in the database.

Notes#

Here are some miscellaneous notes about various aspects of CozoDB.

Some use cases for Cozo#

As Cozo is a general-purpose database, it can be used in situations where traditional databases such as PostgreSQL and SQLite are used. However, Cozo is designed to overcome several shortcomings of traditional databases, and hence fares especially well in specific situations:

Interconnected relations#

You have a lot of interconnected relations and the usual queries need to relate many relations together. In other words, you need to query a complex graph.

An example is a system granting permissions to users for specific tasks. In this case, users may have roles, belong to an organization hierarchy, and tasks similarly have organizations and special provisions associated with them. The granting process itself may also be a complicated rule encoded as data within the database.

With a traditional database, the corresponding SQL tends to become an entangled web of nested queries, with many tables joined together, and maybe even with some recursive CTEs thrown in. This is hard to maintain, and worse, the performance is unpredictable, since query optimizers in general fail when you have over twenty tables joined together.

With Cozo, on the other hand, Horn clauses make it easy to break the logic into smaller pieces and write clear, easily testable queries. Furthermore, the deterministic evaluation order makes identifying and solving performance problems easier.

Just a graph#

Your data may be simple, even a single table, but it is inherently a graph.

We have seen an example in the Tutorial: the air route dataset, where the key relation contains the routes connecting airports.

In traditional databases, when you are given a new relation, you try to understand it by running aggregations on it to collect statistics: what is the distribution of values, how are the columns correlated, etc.

In Cozo you can do the same exploratory analysis, except that now you also have graph algorithms that you can easily apply to understand things such as which entity is the most connected, how the nodes are connected, and what the community structures within the nodes are.

Hidden structures#

Your data contains hidden structures that only become apparent when you identify the scales of the relevant structures.

Examples are most real networks, such as social networks, which have a very rich hierarchy of structures.

In a traditional database, you are limited to doing nested aggregations and filtering, i.e. a form of multifaceted data analysis. For example, you can analyze by gender, geography, job or combinations of them. For structures hidden in other ways, or if such categorizing tags are not already present in your data, you are out of luck.

With Cozo, you can now deal with emergent and fuzzy structures by using e.g. community detection algorithms, and collapse the original graph into a coarse-grained graph consisting of super-nodes and super-edges. The process can be iterated to gain insights into even higher-order emergent structures. This is possible in a social network with only edges and no categorizing tags associated with nodes at all, and the discovered structures almost always have meanings correlated to real-world events and organizations, for example, forms of collusion and crime rings. Also, from a performance perspective, coarse-graining is a required step in analyzing the so-called big data, since many graph algorithms have high complexity and are only applicable to the coarse-grained small or medium networks.

Knowledge augmentation#

You want to understand your live business data better by augmenting it into a knowledge graph.

For example, your sales database contains product, buyer, inventory, and invoice tables. The augmentation is external data about the entities in your data in the form of taxonomies and ontologies in layers.

This is inherently a graph-theoretic undertaking and traditional databases are not suitable. Usually, a dedicated graph processing engine is used, separate from the main database.

With Cozo, it is possible to keep your live data and knowledge graph analysis together, and importing new external data and doing analysis is just a few lines of code away. This ease of use means that you will do the analysis much more often, with a perhaps much wider scope.

Cozo runs (almost) everywhere#

Version 0.1 of Cozo can be used embedded from Python, NodeJS, Java, Rust and C, in addition to running standalone as a web server. Immediately after its release, many people asked about the feasibility of using Cozo embedded on mobile devices.

There was one major obstacle to supporting mobile: Cozo 0.1 used RocksDB as the storage engine, and compiling RocksDB for mobile devices is not an easy task. We chose RocksDB because it could handle a huge amount of concurrency and is very fast, but the concurrency part may not be relevant for the mobile case: you almost always have only one process concurrently accessing the database.

So we ripped apart the storage engine code, made a nice and minimal interface out of it, and now Cozo supports swappable storage engines! At the time of this writing, you can choose from the following:

  • In-memory engine

  • SQLite engine

  • RocksDB engine

  • Sled engine

  • TiKV engine

They offer different trade-offs:

  • The in-memory engine is perfect if you just want to use Cozo as a computation engine. For us, it also made writing tests much easier. The downside is that it doesn’t persist data, and it doesn’t support much write concurrency.

  • The SQLite engine uses a minimal amount of resources, is easy to compile for almost all platforms including mobile, and is reasonably fast for reads. SQLite markets itself as a file storage format, and we took advantage of that by making SQLite the backup format for all engines. In this way when you backup your database, you get a single-file SQLite-backed Cozo database. You do not need to restore the backup to look inside: the backup is a fully functional database. As a backup format, it is also extremely space-efficient after you gzip it. The downside is as expected: SQLite is not very fast when it comes to writing and is effectively single-threaded when write concurrency is involved. But as we have said, these are usually not problems on mobile.

  • The RocksDB engine is crazy fast for both reads and writes and can handle an enormous amount of concurrency, while still being rather conservative on resource usage. In particular, its storage on disk is compressed, making its disk space requirements for live databases the smallest among all persistent options.

  • We included Sled as an engine just because we can. The only benefit is that it is pure Rust, and we are not Rust fundamentalists. It is not faster than SQLite for the usual workload that Cozo encounters and uses way more disk space.

  • The TiKV option is the slowest among all options (10x to 100x slower) since data must come from the network. The benefit is that TiKV is a distributed storage. We included it so that people may decide for themselves if it offers value for them. By the way, 100x slower than the other storage options may not be slow compared to the average graph databases in the market – see the next section.

As a result of the storage engine refactoring, Cozo now runs on a much wider range of platforms, including iOS, Android, and web browsers (with web assembly)! We have also expanded the officially supported languages where you can use Cozo embedded: Swift and Golang. Even if your platform/language combination is not supported, you can still use Cozo with the client/server mode. Or you can try to compile Cozo from source and interface it with your platform/language: let us know if you encounter problems, and we will help!

On performance#

Commercial databases like to do publicity stunts by publishing “performance comparisons” of competing databases, with the results invariably favouring their products. We are not going to do that because, first, we do not want to attract the wrong kind of attention; second and more importantly, such benchmarks don’t educate users beyond the “trust us, we are the best” preaching. Instead, we want our benchmarks to help the users answer the following questions:

  • For my particular workload, what are the pros and cons of the different setups that I can choose from?

  • What is the performance I can expect from such a setup and is it enough for me?

The setup#

We will only be comparing Cozo to itself running on two different machines:

  • Mac Mini (2020)

  • Linux Server (2017)

The Mac Mini runs MacOS 13.0.1, has an Apple M1 CPU with 4 performance cores and 4 efficiency cores, 16GB of memory and pretty fast NVMe SSD storage. Its benchmarks would be typical of recent, reasonably powerful desktop and laptop machines. The Linux server runs Ubuntu 22.04.1 LTS, has two Intel Xeon E5-2678 v3 CPUs with a total of 24 physical cores, 64GB of memory and one of the slowest SSDs I have ever seen. You can expect similar performance if you (over)pay cloud providers. If you have servers with newer hardware, you can probably expect much better performance.

We will be running Cozo embedded in Rust (in fact, we will take advantage of Rust’s built-in benchmark tools). As it is embedded, we will use different numbers of concurrent threads to run Cozo to see how a particular task scales with the number of processors. Embedding Cozo in Rust is the fastest way to run Cozo and if your use case involves embedding in another language such as Python, there will be overheads due to Python itself. Still, if for similar tasks as recorded here you experience orders of magnitude worse performance, please let us know since it could be an environment-related bug.

It seems graph databases like to use the Slovenian social network Pokec for benchmarks (see here and here). We will use three different sizes for subsets of the data:

  • “Tiny”: 10,000 vertices, 121,716 edges

  • “Small”: 100,000 vertices, 1,768,515 edges

  • “Medium”: 1,632,803 vertices, 30,622,564 edges, this is the full dataset

Note that these are the same subsets as used here, except that their “small” is our “tiny”, their “medium” is our “small”, and their “large” is our “medium”. We feel it rather presumptuous in this age to call a dataset with just over 30 million edges “large”. When will we do a benchmark for a truly web-scale dataset? When we have more time and a deeper pocket, maybe! Anyway, the “medium” size is probably large enough for most purposes.

The schema for the data is the following, written in CozoScript:

{:create user {uid: Int => cmpl_pct: Int, gender: String?, age: Int?}}
{:create friends {fr: Int, to: Int}}
{:create friends.rev {to: Int, fr: Int}}

If you don’t read CozoScript yet: the relation user has an integer primary key and three non-keys representing the vertices, the relation friends has a composite (Int, Int) primary key representing the edges, and the relation friends.rev acts as an index for the edges in the reverse direction (in Cozo, everything is very explicit).

We will be comparing Cozo running with three different storage engines as introduced in the first section:

  • In-memory engine

  • SQLite engine

  • RocksDB engine

All queries are run with snapshot isolation in effect: when mixing reads and writes, reads will only see a snapshot of the data valid at the start of the transaction, and when writes conflict with each other, at least one of them will fail to commit, in which case we manually retry the query. This is the only consistency level Cozo currently supports: there is no way to opt for a more lax consistency model.

All mutations sent to the backend engine complete only when the underlying storage engine transactions complete: there is no “fire and forget” involved to give the user a false impression of high performance.

Finally, note that we only benchmark the out-of-box experience of Cozo. In the case of the RocksDB engine, there are lots of knobs you can turn to make it much more performant for certain workloads (the easiest knob is to beg it to use more memory, which helps when dealing with large datasets), but we expect most users not to be experts in such optimizations.

You can download the complete result of the benchmarks as a spreadsheet here to do your own analysis.

Loading data#

Batch import#

The first question we are interested in is how long it takes to load our datasets into the database: do we need to wait for days? Our approach is to parse data from a text file and insert them into the database in batches of 300, single-threaded.

For the tiny dataset, the results are:

Platform   Backend   Time taken (seconds)
Mac Mini   Mem       0.265788
Mac Mini   RocksDB   0.686022
Mac Mini   SQLite    6.383260
Server     Mem       0.494136
Server     RocksDB   1.285214
Server     SQLite    27.971535

Here are the results for the small dataset:

Platform   Backend   Time taken (seconds)
Mac Mini   Mem       5.429186
Mac Mini   RocksDB   11.752198
Mac Mini   SQLite    146.848621
Server     Mem       8.741262
Server     RocksDB   19.261249
Server     SQLite    432.705856

And for the medium (full) dataset:

Platform   Backend   Time taken (seconds)
Mac Mini   Mem       155.894422
Mac Mini   RocksDB   212.731813
Server     Mem       219.956052
Server     RocksDB   348.638331

As you can see, we didn’t even test SQLite’s performance with the medium dataset, as we grew tired of waiting. If the trend continues, importing with the SQLite backend would take at least 45 minutes on the Mac Mini, and more than 2 hours on the Linux server. SQLite’s performance looks really bad here, but we once imported a similar amount of data into another graph database and it took us half a day. And even if you insist on using the SQLite backend, there is a much faster way to import data: keep reading.

For the RocksDB backend, everything can be done within a few minutes, which is more than reasonable for tens of millions of rows.

We can compare performance across the board by considering raw rows per second in imports, in which an edge counts as two raw rows since it must appear in two relations:

Batch import#

Here RocksDB performs well, especially in terms of scaling: the decrease in raw rows per second for larger datasets is very small. And it always stays within a factor of three of the mem backend, which does not persist data at all.

Some of you may say that this is not fair for the SQLite backend, since with some additional tricks and more clever batching, you can get higher numbers for SQLite. Well, we are testing for simple-minded out-of-box performance, and the fact is that with tuning, the RocksDB performance can be increased even more drastically.

How much memory does the database use during the import process? We will show the peak memory usage as reported by the system:

Batch import mem#

The benchmark infrastructure takes about 50MB of memory even if it does nothing, so the SQLite backend always uses a negligible amount of extra memory. RocksDB, on the other hand, will use memory to speed things up. As we said before, we didn’t collect data for importing the medium dataset into the SQLite backend.

The data for the mem backend is shown below separately:

Batch import mem for mem#

This measures the size of the whole dataset, as the mem backend can only store data in memory. As we can see, Apple’s OS somehow uses memory more efficiently. For almost everything we do in this benchmark, the memory usage of the mem backend is very similar to this, so we will not show the memory usage of the mem backend again. If you are interested nonetheless, you can look at the raw data in the spreadsheet.

Backup#

In Cozo we can backup the database to an SQLite-based database. How fast is the backup?

Backup#

On a Mac Mini, this is around one million raw rows per second for all backends, which should be fast enough for most purposes. On the Linux server, the bad quality of the SSD shows, but it is still quite fast. By the way, if you have lots of data and you want to use the SQLite backend, you can batch import the data into the RocksDB or mem backend, and then back up the database. The backup file is a working SQLite-backed database, and the whole process is a lot faster than importing into an SQLite-backed database directly.

Memory usage:

Backup memory#

Not much surprise here. As we said before around 50MB is used by the benchmark infrastructure, so take that into account.

Restoring from backup#

How fast is restoring from a backup?

Restore#

This is the only benchmark where RocksDB performs the worst, with 400K raw rows per second. Restoring into the SQLite backend is fast, but in fact, you can be faster still: just copy the backup file over (or use it directly if you don’t intend to write any data)!

Memory usage:

Restore memory#

No surprise.

Transactional queries (OLTP)#

Online Transaction Processing (OLTP) queries are simple read or write queries that are expected to finish quickly, and you are expected to deal with lots of them.

Point read#

This is the simplest kind of query you can imagine: given an ID, it just reads the corresponding row and gives it to you:

?[cmpl_pct, gender, age] := *user{uid: $id, cmpl_pct, gender, age}

The performance metric we are interested in is the queries per second (QPS):

Single vertex read QPS#

The effect of data size on such queries is small, and in general adding more cores helps almost linearly, though in the case of the Mac Mini only the performance cores help; the efficiency cores are pretty useless and can even get in the way. In general, you can expect at least around 100K QPS regardless of data size on all setups when you fully utilize your resources.

For memory usage:

Single vertex read mem#

RocksDB only starts using memory with the medium dataset. In all other cases, memory usage is minimal.

Point write#

This is the simplest write query: it just creates a new vertex:

?[uid, cmpl_pct, gender, age] <- [[$id, 0, null, null]] :put user {uid => cmpl_pct, gender, age}

For this query, we are only going to show multi-threaded performance for RocksDB, since writing to the other backends is protected by a big lock, making them effectively still single-threaded:

Single vertex write QPS#

RocksDB shines here, as you can expect around 100K QPS or more for both setups. Using more threads than the number of performance cores on the Mac Mini decreases performance quite a bit, so avoid that if you can. But you can’t see the SQLite bars, can you? Let’s use a logarithmic scale instead:

Single vertex write QPS zoom#

Whereas RocksDB easily manages more than 100K QPS, SQLite struggles to reach even 100 QPS on the server with the slow SSD. That is more than 1000 times slower! It is so slow because each request translates into an SQLite write transaction, and SQLite write transactions are known to be super expensive. These separate transactions are unavoidable here because that’s the rule of the game: lots of independent, potentially conflicting writes to the database. The moral of the story is to stay away from the SQLite backend if you expect lots of independent writes.

Memory usage?

Single vertex write mem#

Completely reasonable, I’d say. Even for large datasets, RocksDB keeps memory usage under 500MB.

For writing to edges, we need to put the data into both the friends relation and the reverse friends.rev relation:
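
A minimal sketch of such a query, assuming the two endpoints are supplied as the parameters $fr and $to (the two puts are issued as two separate queries):

?[fr, to] <- [[$fr, $to]]
:put friends {fr, to}

?[to, fr] <- [[$to, $fr]]
:put friends.rev {to, fr}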

Point update#

This query updates a field for a given row:

?[uid, cmpl_pct, age, gender] := uid = $id, *user{uid, age, gender}, cmpl_pct = $n
:put user {uid => cmpl_pct, age, gender}

The performance:

Single vertex update QPS#

It is slower than point writes, but within a factor of two. You can still easily manage more than 50K QPS for RocksDB. Memory usage is almost the same as the point write case:

Single vertex update mem#

Mixed queries?#

Of course in realistic situations, you would expect read, write and update to occur concurrently. We won’t show the details here, but the conclusion is that in such cases, the RocksDB backend doesn’t care if the queries are reads, writes or updates, whereas any amount of writes kills SQLite. If you want the details, you can find them in the spreadsheet.

If SQLite performs so badly at writes, why include it at all? Well, its performance is still acceptable if you are using it to build a desktop or mobile application where writes are batched, and with the SQLite engine, the database does not use more than the absolute minimal amount of memory.

Analytical queries (OLAP)#

Online analytical processing (OLAP) queries are queries which may touch lots of rows in the database, do complex processing on them, and may return a large number of rows. All graph queries should fall into this category.

For OLAP queries, we are more interested in latency: how long does a query take before it returns (on average)?

Friends of friends#

The classical graph traversal query is the “friends of friends” query: finding out who the friends of friends of a particular person are. For such queries, the intermediate results and the return sets must be stored somewhere (usually in memory). For these queries, we will only show results for the “medium” dataset: 1.6 million vertices and 32 million edges. The same queries on the smaller datasets complete much faster: refer to the raw numbers if you are interested.

We start by following the “friends” relation twice—a “2 hops” query:

?[to] := *friends{fr: $id, to: a}, *friends{fr: a, to}

On average, this will return hundreds of rows.

Friends 2 latency#

We see that the RocksDB backend performs very well, and if the storage is fast enough, it is even faster than the mem backend. The SQLite backend also performs quite competitively. Having more threads harms latency, but not by much.

For memory usage:

Friends 2 mem#

As usual, the SQLite backend doesn’t use more than the absolute minimal amount of memory, unless you have many concurrent threads. The memory usage of the RocksDB backend is also pretty small.

Let’s now go up one hop to find out friends’ friends’ friends:

l1[to] := *friends{fr: $id, to}
l2[to] := l1[fr], *friends{fr, to}
?[to] := l2[fr], *friends{fr, to}

The variance in the number of returned rows is now very high: on average thousands of rows will be returned, but if you start with certain nodes, you get tens of thousands of rows. The latency is as follows:

Friends 3 latency#

The trend is similar to the 2 hops case, except that the latency is about twenty times as long, roughly proportional to the number of returned rows.

For memory usage:

Friends 3 mem#

Because the database must keep the return set in memory, in all cases the memory usage increases. But it still manages with under 1GB of memory, even with 24 concurrent threads running on the server.

Now let’s go to the extreme, by considering the 4 hops query:

l1[to] := *friends{fr: $id, to}
l2[to] := l1[fr], *friends{fr, to}
l3[to] := l2[fr], *friends{fr, to}
?[to] := l3[fr], *friends{fr, to}

The number of returned rows now varies wildly: from tens of thousands of rows if you start with someone solitary, to more than half of the whole dataset (more than 600K rows) if you start with someone popular!

Friends 4 latency#

I’d say that for return sets this big, the average latency of a few seconds (or even less than a second) is excellent.

Peak memory usage just reflects the size of the returned sets:

Friends 4 mem#

We won’t go beyond four hops but will note instead that if you go up to six hops, by the “six degrees of separation”, you will return the majority of nodes in almost all cases. Actually, in our experiments, this already happens with a high probability for five hops.

Aggregations#

Aggregations present a different challenge to the database: here the amount of data to keep in memory is not much (in the case of counting, just a single counter), but the database must scan every row of a relation to return the result. For these queries, we will again only show results for the “medium” dataset: 1.6 million rows for the relation in question.

First, we will group users by their age and return the counts for each age group:

?[age, count(uid)] := *user{uid, age}

Aggregation group latency#

This tests single-core CPU performance and disk read performance. It takes around 1 second (within a factor of two) to scan the whole table in all cases.

The memory usage is minimal as the return set is small:

Aggregation group mem#

Now let’s add a filter to the aggregation:

?[age, count(age)] := *user{age}, age ~ 0 >= 18

This adds in a bit of processing time, but in terms of the order of magnitude the numbers are similar to before:

Aggregation filter latency#

The memory usage is almost identical:

Aggregation filter mem#

The results are similar if we compute several aggregations in tandem:

?[min(uid), max(uid), mean(uid)] := *user{uid, age}

The latency:

Aggregation stats latency#

and the memory usage:

Aggregation stats mem#

Pagerank#

Finally let’s see how one of our canned algorithms performs: the Pagerank algorithm with query

?[] <~ PageRank(*friends[])

This time we will show results for different dataset sizes. First for the tiny dataset (10K vertices, 122K edges):

Pagerank tiny latency#

Completes in the blink of an eye. Memory usage:

Pagerank tiny mem#

Not much, since the dataset is truly tiny.

Now for the small dataset (100K vertices, 1.7M edges):

Pagerank small latency#

About one second within a factor of two. Memory usage:

Pagerank small mem#

This is the amount of memory used to store the graph in the main memory, which is less than the size of the total graph on disk.

Now for the full dataset (1.6M vertices, 32M edges):

Pagerank medium latency#

About half a minute across all setups. I’d argue that this is as fast as any implementation could go. (Currently we do not implement the Pagerank algorithm ourselves: instead, we use the excellent implementation of this crate. In the future we will continue to improve the canned algorithms according to the metrics that we collect from our internal tests.) Memory usage:

Pagerank medium mem#

1GB memory for such a workload is more than reasonable.

Conclusion#

We hope that you are convinced that Cozo is an extremely performant database that excels on minimal resources. As it can run (almost) everywhere, please try it for your use case, and send us feedback so that we can improve Cozo further! In a future blog, we will talk about some of the design decisions of Cozo, and the impact on performance and memory usage of these decisions.

Time travel in a database: a Cozo story#

You have built a social network. The only thing your users can do on the network is to update their “mood”. For example, user “joe” may be “moody” yesterday, and “happy” today. How is this reflected in the database? In CozoDB, this can be stored in the following relation:

:create status {uid: String => mood: String}

The equivalent SQL for Postgres is

create table status (
    uid text primary key,
    mood text not null
)

Your home page is very simple: it is a gigantic page containing the moods of all the users all at once. To generate it, you run the query:

?[uid, mood] := *status{uid, mood}
select uid, mood from status

And when users want to update their status, you run:

?[uid, mood] <- $input
:put status {uid => mood}
update status 
set mood = $1 
where uid = $2

Simple enough. Now scientists come to you and want to buy your data for their study of the fluctuation of moods during the pandemic. Of course, you know that their real motive is nefarious, so you promptly show them the door. And then start banging your head against the door. Why have you thrown away the history, the most valuable part of your data? WHY?

So you borrow a time machine from a future you and travel back in time to warn the former you. “It’s simple to fix”, the former you says:

:create status {uid: String, ts default now() => mood: String}
create table status (
    uid text not null,
    ts timestamptz not null default now(),
    mood text not null,
    primary key (uid, ts)
)

Of course, now there is no way users can delete their accounts anymore; all they can do is send you more and more updates. Very useful feature!

Now, to generate your homepage:

?[uid, mood] := *status{uid, mood, ts}, ts == now()
select uid, mood from status
where ts = now()

Disaster! The homepage remains forever blank, no matter what the users do!

The problem is that when you generate your homepage, you can only collect data that were inserted in the past. And for past data, the condition ts == now() is never true!

After a lot of fumbling, you find that the following query works:

candidates[uid, max(ts)] := *status{uid, ts}
?[uid, mood] := candidates[uid, ts], *status{uid, ts, mood}
with candidates(uid, ts) as (
    select uid, max(ts) from status
    group by uid
)
select status.uid, status.mood from status
inner join candidates on status.uid = candidates.uid and status.ts = candidates.ts

You first find out what are the timestamps for the latest status for each user, and then use the user ID together with the timestamps to collect the moods.

Now travelling back to a particular time in the past is easy:

candidates[uid, max(ts)] := *status{uid, ts}, ts < $date_ts
?[uid, mood] := candidates[uid, ts], *status{uid, ts, mood}
with candidates(uid, ts) as (
    select uid, max(ts) from status
    where ts < $1
    group by uid
)
select status.uid, status.mood from status
inner join candidates on status.uid = candidates.uid and status.ts = candidates.ts

The cost of time travel#

Your social network becomes a runaway success, and the scientists are happy too! As time goes on, however, you notice performance problems, and it gets worse each day.

What’s happening? After all, your network caters only to the students on a certain campus, and even if everyone signed up, there would be only 10,000 users at most. After digging into your data, you notice that some (most?) of your users are hyperactive and update their mood every five minutes during their waking hours. Even though you have only run your service for three months, some of them have already accumulated over 10,000 mood updates!

So for the front-page generating query:

candidates[uid, max(ts)] := *status{uid, ts}
?[uid, mood] := candidates[uid, ts], *status{uid, ts, mood}
with candidates(uid, ts) as (
    select uid, max(ts) from status
    group by uid
)
select status.uid, status.mood from status
inner join candidates on status.uid = candidates.uid and status.ts = candidates.ts

you are doing a full scan of your data to get your results. For 10,000 users and 1,000 updates each (we use the mean number of mood updates, so it’s 1,000 instead of 10,000), that’s 10 million rows. And next year it will become more than one billion rows, since time ticks and you are thinking of expanding your service to other communities.

Dreamy indices#

Your investor suggests the “enterprisey” thing: pre-computing the front page and updating it periodically instead of calculating it in real-time. Being a hacker with a big ego, you detest all things “enterprisey” and ask yourself: “is there anything better that can be done?” Your friend, who works in finance, suggests time series databases. “It can handle data from the stock market quite well, so surely it is good enough for your data!” “Just index your data by the timestamp!” Well, unlike stock market indices, your data is sparse: it is not collected at regular intervals for all users in unison. Are you going to materialize these implicit rows so that every time a user updates her mood, everyone else also gets an automatic update? Your cloud provider is very welcoming of this idea and urges you to sign up for their proprietary time series database. Your investor is also kind of supportive since it would make you an instant “big data” company, but worries about whether you can secure additional funding in time to cover the costs. You, ever calm, make some back-of-envelope estimates and give up the idea.

Your investor still wants to talk to you over the phone, but you become very annoyed and go to sleep, clouded in anxiety.

In your dream, you come to a dark, Harry-Potteresque library, with dusty shelves neatly lined up, and on the shelves were … files for the mood of your users at different times, meticulously arranged. The tree backing your database has taken physical form! You walk mechanically to the first shelf, like a robot, and start to collect the mood of every user at midnight some days back.

“Aaron, 2022-09-01, 15:31, too early.”

“Aaron, 2022-09-01, 15:33, still too early.”

“Aaron, 2022-12-24, 23:59, still too early.”

“Aaron, 2022-12-25, 00:34, well it’s past the time we want, so the previous item contains the mood.” (The mood is “festive”, by the way.)

“Aaron, 2022-12-25, 00:42, we don’t need this anymore.”

“Aaron, 2022-12-25, 00:47, we don’t need this anymore.”

“Aaron, 2022-12-27, 12:31, why are we still looking at Aaron, by the way?”

“Bean, 2022-11-27, …”

Two things irked you. First, you are going through the data in the wrong direction, so after you have gone past the expected date, you have to go back and look at the previous record. This is especially annoying since some users signed up only today, and the previous record is someone else’s. Second, you are walking a tree, so why can’t you jump to the next user when you know you are done with a user?

As if triggered by these thoughts, the books pour out of the shelves to form a tornado, swirl all over the library, and after a while return to the shelves. “I have to do this all over again,” you grumble and walk to the first shelf. But something has changed: you can now directly jump to the beginning, and the records are in a different order, ascending by the user, as before, but descending by the timestamp:

“Aaron, 2022-12-27, 03:38, too late, let’s jump to the book past Aaron, 2022-12-25, 00:00.”

“Aaron, 2022-12-24, 23:59. Now this book contains the required mood for Aaron.” “Let’s now jump to the book past Aaron at the BIG BANG.”

“Bean, 2022-12-24, 23:11, this is already the correct book for Bean, lucky! Now let’s go past Bean at the BIG BANG.” “I wonder what happened to Mr Bean since Christmas Eve?”

Suddenly, you wake up. You rush to your computer and write down what you saw, in code.

Back to reality#

Eventually, your social network takes over the world and changes it fundamentally, with the simple schema in CozoScript:

:create status {uid: String, ts: Validity default 'ASSERT' => mood: String}

the query for the present:

?[uid, mood] := *status{uid, mood @ 'NOW'}

and the query for historical moments:

?[uid, mood] := *status{uid, mood @ '2019-12-31T23:59:59Z'}

Obviously, there are no longer any SQL translations.

The story ends here. It is the story of the new time travel feature in Cozo v0.4. We have also added a part in the tutorial giving you hands-on experience.

But what about performance?#

Real databases care about performance deeply, and at Cozo we do. So let’s do some performance tests, with the same Mac Mini as before: it runs MacOS 13.0.1, has Apple M1 CPUs with 4 performance cores and 4 efficiency cores, 16GB of memory and a pretty fast NVMe SSD storage. We only test against the RocksDB backend for simplicity.

We generated many relations, all of which contain data for 10,000 users. The ‘Plain’ relation stores no history at all. The ‘Naive’ relations store and query history using the naive approach we described in the story. We generated different versions of the naive relations, containing different numbers of mood updates per user. Finally, the ‘Hopping’ relations store and query history using the dream approach we described earlier.

The historical timestamp for the queries is chosen randomly and the results are averaged over many runs.

First, we want to know how history storage affects point query throughput, measured in queries per second (QPS):

Type      # updates per user   QPS      Performance percentage
Plain     1                    143956   100.00%
Naive     1                    106182   73.76%
Naive     10                   92335    64.14%
Naive     100                  42665    29.64%
Naive     1000                 7154     4.97%
Hopping   1                    125913   87.47%
Hopping   10                   124959   86.80%
Hopping   100                  100947   70.12%
Hopping   1000                 102193   70.99%

As we can see, for 1000 updates per user, the naive approach has a 20x reduction in throughput compared to the no-history approach. The hopping approach, on the other hand, maintains more than 70% of the original performance.

To be fair, for the simplest kind of point query where you know the complete key and the timestamp and want the result for only a single user, there is a better way to write the query so that the naive approach can achieve a similar performance to the hopping one. We deliberately wrote our query in a way to avoid this optimization, since this optimization may not always be possible, depending on the query.

Next, let’s look at aggregation results where we must go through all users. Now we measure latency instead of throughput:

Type      # updates per user   Latency (ms)   Slowdown ratio
Plain     1                    2.38           1.00
Naive     1                    8.90           3.74
Naive     10                   55.52          23.35
Naive     100                  541.01         227.52
Naive     1000                 5391.75        2267.53
Hopping   1                    9.60           4.04
Hopping   10                   9.99           4.20
Hopping   100                  39.34          16.55
Hopping   1000                 31.41          13.21

Now the naive approach scales badly. For a query that takes only 2ms in a plain relation, it takes more than 5 seconds in a relation with 1000 historical facts per user. The hopping approach, on the other hand, keeps the time complexity under control. Notice that it performs better in the relation with 1000 historical facts per user than in the relation with only 100. This is not a random fluctuation: it occurs consistently no matter how we test it. We guess that the RocksDB backend does something different when a certain threshold is passed.

Nonetheless, it is important to note that there is at least a 3x slowdown no matter how you store the history, even if you only have one historical fact per user. This is the minimal cost of time travel. And this is why Cozo does not automatically keep history for every relation regardless of whether you need it: our philosophy is “zero-cost and zero-mental-overhead abstraction”.

Before we end, let’s remark that some people maintain that data is naturally immutable and hence should always be stored immutably. We do not take this view. Use immutability and time travel only when you really need it. Are we simply collections of data in an immutable database, operated by Laplace’s daemon, perhaps? I hope not, and modern physics certainly says no, no matter how you look at it: whether it is the collapse of the wave function or the dance at the edge of chaos. So immutability is an illusion, or at best a platonic fantasy that we created so that we can make better sense of the world. That’s OK, since we can only understand the world by projecting our models onto the world. Just don’t become the slaves of our own creation and let it slow us down.

Indices#