Tutorial#

This tutorial will teach you the basics of using the Cozo database.

There are several ways you can run the queries in this tutorial:

You can run the examples in your browser without installing anything: just open Cozo in Wasm and you are ready to go.
You can download the appropriate cozo-* executable for your operating system from the release page, uncompress, rename to cozo, and run the REPL mode by running in a terminal cozo repl and follow along.
If you are familiar with the Python datascience stack, you should following the instruction here instead to run this notebook in Jupyter Lab, with the notebook file.
There are many other ways, but the above ones are the easiest.

The following cell is to set up Jupyter. Ignore if you are using other methods.

[1]:

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

%load_ext pycozo.ipyext_direct

Introduction#

Datalog is pattern matching for your data.

This sort of thing (Ruby):

config = {db: {user: 'admin', password: 'abc123'}}

case config
in db: {user:} # matches subhash and puts matched value in variable user
  puts "Connect with user '#{user}'"
in connection: {username: }
  puts "Connect with user '#{username}'"
else
  puts "Unrecognized structure of config"
end
# Prints: "Connect with user 'admin'"

Broadly speaking, you provide some values and variables that constitute a pattern for pulling apart some data structure. The matching simultaneously chooses an execution path and gets values of the variables that match out of the data structure.

First steps#

Cozo is a relational database. Relations are similar to tables in SQL. The difference is that you can’t have duplicate rows in a relation.

You can define and then retrieve all the rows from a relation in Cozo like this:

[2]:

?[] <- [['hello', 'world', 'Cozo!']]

[2]:

	_0	_1	_2
0	hello	world	Cozo!

This is a rule. The <- separates the head of the rule on the left from the body on the right. The [] <- acts like SELECT * in SQL. The above query in SQL would look something like:

SELECT * FROM (VALUES(‘hello’, ‘world’, ‘sql’)) as unnecessary_alias;

(a hint here of Datalog’s freedom from the endless ceremony that plagues SQL)

You can have multiple rows:

[3]:

?[] <- [[1, 2, 3], ['a', 'b', 'c']]

[3]:

	_0	_1	_2
0	1	2	3
1	a	b	c

You can have the usual sorts of types:

[4]:

?[] <- [[1.5, 2.5, 3, 4, 5.5],
        ['aA', 'bB', 'cC', 'dD', 'eE'],
        [true, false, null, -1.4e-2, "A string with double quotes"]]

[4]:

	_0	_1	_2	_3	_4
0	True	False	None	-0.014000	A string with double quotes
1	1.500000	2.500000	3	4	5.500000
2	aA	bB	cC	dD	eE

The input literal representations are similar to those in JavaScript. In particular, strings in double quotes are guaranteed to be interpreted in the same way as in JSON. The output are in JSON, but how they appear on your screen depends on your setup. For example, if you are using a Python setup, booleans are displayed as True and False instead of in lowercase, since they are converted to the native Python datatypes for display.

In the last example, you may have noticed that the returned order is not the same as the input order. This is because in Cozo relations are always stored as trees, and trees are always sorted.

Another consequence of relations as trees is that you can have no duplicate rows:

[5]:

?[] <- [[1], [2], [1], [2], [1]]

[5]:

	_0
0	1
1	2

Expressions#

The next example shows the use of various functions, expressions and comments:

[6]:

?[] <- [[
            1 + 2, # addition
            3 / 4, # division
            5 == 6, # equality
            7 > 8, # greater
            true || false, # or
            false && true, # and
            lowercase('HELLO'), # function
            rand_float(), # function taking no argument
            union([1, 2, 3], [3, 4, 5], [5, 6, 7]), # variadic function
        ]]

[6]:

	_0	_1	_2	_3	_4	_5	_6	_7	_8
0	3	0.750000	False	False	True	False	hello	0.061268	[1, 2, 3, 4, 5, 6, 7]

Cozo has a pretty robust set of built-in functions and operators.

Find values for these variables#

More complex queries proceed by providing some expressions with values and variables, much like pattern matching in your favourite language.

Cozo tries to find all the sets of rows that match the given values, including values for the variables that also match across all the expressions. Some examples will make this clearer.

[7]:

?[first, second, third] <- [[1, 2, 3], ['a', 'b', 'c']]

[7]:

	first	second	third
0	1	2	3
1	a	b	c

Here, we’ve just provided named variables for the values coming back.

Note that you have to do obvious things, like not having different numbers of columns within one relation. Typing can be strict or loose as you choose (so far, it’s all been loose; we’ll see examples of strict typing shortly).

For <- (called constant) rules, the number of bindings must match the actual data (the arity), otherwise, you will get an error:

[8]:

?[first, second] <- [[1, 2, 3], ['a', 'b', 'c']]

[8]:

parser::fixed_rule_head_arity_mismatch

  × Fixed rule head arity mismatch
   ╭─[1:1]
 1 │ ?[first, second] <- [[1, 2, 3], ['a', 'b', 'c']]
   · ─────────────────────────────────────────────────
 2 │
   ╰────
  help: Expected arity: 3, number of arguments given: 2

Now let’s define rules that apply (use) other rules:

[9]:

rule[first, second, third] <- [[1, 2, 3], ['a', 'b', 'c']]
?[a, b, c] := rule[a, b, c]

[9]:

	a	b	c
0	1	2	3
1	a	b	c

You’ll notice the two different connectors. The first, <-, is actually syntactic sugar for <~, which is used for invoking functions on the right hand side (as well as Cozo’s many built-in functions, you can write your own in the host language). <- is invoking a Constant function that looks like rule[first, second, third] <~ Constant(data: [[1, 2, 3], ['a', 'b', 'c']]) <- and <~ rules are called constant rules.

The second rule is an inline rule, meaning it’s not invoking anything external to the database; it’s just matching data. Note that because it is named ?, it is the result of the query.

Note how in the second line, we match a, b, c to the first, second, third in the first rule.

We don’t have to use all the variables from the body in an inline rule:

[10]:

rule[first, second, third] <- [[1, 2, 3], ['a', 'b', 'c']]
?[c, b] := rule[a, b, c]

[10]:

	c	b
0	3	2
1	c	b

Let’s get a bit fancier:

[11]:

?[c, b] := rule[a, b, c], is_num(a)
rule[first, second, third] <- [[1, 2, 3], ['a', 'b', 'c']]

[11]:

	c	b
0	3	2

Here, the first rule had multiple bodies, separated by commas. These are effectively joined by and. We see here how the variable a matches with a value from the database, but also has to return true when passed to the is_num function.

You can also bind constants to rule applications directly:

[12]:

rule[first, second, third] <- [[1, 2, 3], ['a', 'b', 'c']]
?[c, b] := rule['a', b, c]

[12]:

	c	b
0	c	b

The pattern in the head on the second line matches only rows with an ‘a’ for the first field value.

You can have multiple rules with the same head name, which is like joining their bodies with OR.

important[a] := recent[a], unread[a], not deleted[a]
important[a] := flagged[a], not deleted[a]

This means a is important if it isn’t deleted, and it’s either flagged, or recent and unread. You can also use ‘and’, ‘or’, ‘not’ and parentheses, although the style above is more idiomatic.

We can just directly set a value as part of the query, using ‘=’:

[13]:

rule[first, second, third] <- [[1, 2, 3], ['a', 'b', 'c']]
?[c, b, d] := rule[a, b, c], is_num(a), d = a + b + 2*c

[13]:

	c	b	d
0	3	2	9

The unification = unifies with a single value. Use in to unify with each value within a list in turn:

[14]:

?[x, y] := x in [1, 2, 3], y in ['x', 'y']

[14]:

	x	y
0	1	x
1	1	y
2	2	x
3	2	y
4	3	x
5	3	y

Having multiple rule applications in the body generates every combination of the bindings:

[15]:

r1[] <- [[1, 'a'], [2, 'b']]
r2[] <- [[2, 'B'], [3, 'C']]

?[l1, l2] := r1[a, l1],
             r2[b, l2]

[15]:

	l1	l2
0	a	B
1	a	C
2	b	B
3	b	C

This is a cross join in SQL

We’ve now seen enough of the basics to start looking at some more complex operations.

Joins, made easy#

[16]:

r1[] <- [[1, 'a'], [2, 'b']]
r2[] <- [[2, 'B'], [3, 'C']]

?[l1, l2] := r1[a, l1],
             r2[a, l2] # reused `a`

[16]:

	l1	l2
0	b	B

A join is as simple as using the same variable in different relations. Here, we are asking for a single value of a that matches a row in each of the sources at the same time.

Compare the above with the SQL for the same operation:

WITH
    r1 AS SELECT * FROM (VALUES (1, ‘a’), (2, ‘b’)) as t(c1, c2),
    r2 AS SELECT * FROM (VALUES(2, ‘B’), (3, ‘C’)) as u(d1, d2)
SELECT
    r1.c2, r2.d2 FROM r1 JOIN r2 ON (r1.c1 = r2.d1);

An outer join is done by considering the cases separately:

[17]:

a[x, y] <- [[1, 2], [3, 4]]
b[y, z] <- [[2, 3], [2, 4]]
?[x, y, z] := a[x, y], b[y, z]
?[x, y, z] := a[x, y], not b[y, _], z = null

[17]:

	x	y	z
0	1	2	3.000000
1	1	2	4.000000
2	3	4	nan

Note that we still only have a single output ? rule here; it is split into two parts effectively joining the right sides with or as we discussed previously.

Stored relations#

The relations we’ve seen so far were created on the fly as constant rules, and only persist for the duration of the query.

You can of course also persist data. To do this, you first declare the relation:

[18]:

:create stored {c1, c2}

[18]:

	status
0	OK

This creates a relation called ‘stored’ with two columns that will take any type. Here is a fully typed relation:

[19]:

:create dept_info {
    company_name: String,
    department_name: String,
    =>
    head_count: Int default 0,
    address: String,
}

[19]:

	status
0	OK

This should be fairly obvious, except for the => in the middle. This symbol separates the columns in the primary key before the => from the other columns. If the => isn’t present, all of the columns are considered part of the primary key.

[20]:

?[a, b, c] <- [[1, 'a', 'A'],
[2, 'b', 'B'],
[3, 'c', 'C'],
[4, 'd', 'D']]

:create fd {a, b => c}

[20]:

	status
0	OK

[21]:

?[a, b, c] := *fd[a, b, c]

[21]:

	a	b	c
0	1	a	A
1	2	b	B
2	3	c	C
3	4	d	D

Note how we create rows in fd as we define it.

In Cozo terms, the fields after the => are functionally dependent on the ones before. Those before determine identity, which lets the :put operation function as (upsert)https://wiki.postgresql.org/wiki/UPSERT.

[22]:

?[a, b, c] <- [[3, 'c', 'CCCCCCC']]

:put fd {a, b => c}

[22]:

	status
0	OK

[23]:

?[a, b, c] := *fd[a, b, c]

[23]:

	a	b	c
0	1	a	A
1	2	b	B
2	3	c	CCCCCCC
3	4	d	D

Note that the name of the stored relation should not start with an underscore _. You will get no error if you use a name starting with an underscore, but you will discover that the relation won’t be persisted. What you are dealing with is actually what is called an ephemeral relation.

You can see the stored relations in your system by running a system op:

[24]:

::relations

[24]:

	name	arity	access_level	n_keys	n_non_keys
0	dept_info	4	normal	2	2
1	fd	3	normal	2	1
2	stored	2	normal	2	0

Note that we have regular operations like :create that start with a :, and system operations with a ::.

We can investigate the columns of the stored relation:

[25]:

::columns stored

[25]:

	column	is_key	index	type	has_default
0	c1	True	0	Any?	False
1	c2	True	1	Any?	False

Stored relations can be accessed by using an asterisk * before the name:

[26]:

?[a, b] := *stored[a, b]

[26]:

	a	b

Unlike relations defined inline, the columns of stored relations have fixed names. You can take advantage of this by selectively referring to columns by name, especially if you have a lot of columns:

[27]:

?[a, b] := *stored{l2: b, l1: a}

[27]:

eval::named_field_not_found

  × stored relation 'stored' does not have field 'l1'
   ╭─[1:1]
 1 │ ?[a, b] := *stored{l2: b, l1: a}
   ·            ─────────────────────
 2 │
   ╰────

If the binding is the same as the column name, you can omit the binding:

[28]:

?[l2] := *stored{l2}

[28]:

eval::named_field_not_found

  × stored relation 'stored' does not have field 'l2'
   ╭─[1:1]
 1 │ ?[l2] := *stored{l2}
   ·          ───────────
 2 │
   ╰────

Use :rm to remove data:

[29]:

?[l1, l2] <- [['e', 'E']]

:rm stored {l1, l2}

[29]:

eval::required_col_not_found

  × required column l1 not found

[30]:

?[l1, l2] := *stored[l1, l2]

[30]:

	l1	l2

Use ::remove (double colon!) to get rid of a stored relation:

[31]:

::remove stored

[31]:

	status
0	OK

[32]:

::relations

[32]:

	name	arity	access_level	n_keys	n_non_keys	n_put_triggers	n_rm_triggers	n_replace_triggers	description
0	dept_info	4	normal	2	2	0	0	0
1	fd	3	normal	2	1	0	0	0

We’ve now seen a variety of Cozo queries. They all have one or more rules, one of which will have the name ? which determines which variables are the result of the query.

A query can also have one or more operations such as :put, which are executed after the query and can use its results. These operations might also change the query in some way, by limiting the number of results, sorting, paginating and so on.

Stored relations can have triggers. These are discussed in detail here.

Before continuing, let’s remove the stored relation we introduced:

[33]:

::remove fd

[33]:

	status
0	OK

Command blocks#

Cozo executes the entire script you send to it in a single transaction. You can do multiple query-command blocks by wrapping each block in curly braces. The blocks are executed in sequence and either all of them succeed or none of them do and the state rolls back. The result returned to a client is whatever is in the last block. Like this:

[34]:

{?[a] <- [[1], [2], [3]]; :replace test {a}}
{?[a] <- []; :replace test2 {a}}
%swap test test2
%return test

[34]:

Graphs#

Now let’s consider a graph stored as a relation:

[35]:

?[loving, loved] <- [['alice', 'eve'],
                     ['bob', 'alice'],
                     ['eve', 'alice'],
                     ['eve', 'bob'],
                     ['eve', 'charlie'],
                     ['charlie', 'eve'],
                     ['david', 'george'],
                     ['george', 'george']]

:replace love {loving, loved}

[35]:

	status
0	OK

The graph we have created reads like “Alice loves Eve, Bob loves Alice”, “nobody loves David, David loves George, but George only loves himself”, and so on. Here we used :replace instead of :create. The difference is that if love already exists, it will be wiped and replaced with the new data given.

We can investigate competing interests:

[36]:

?[loved_by_b_e] := *love['eve', loved_by_b_e],
                   *love['bob', loved_by_b_e]

[36]:

	loved_by_b_e
0	alice

It fells natural here to use the or keyword:

[37]:

?[loved_by_b_e] := *love['eve', loved_by_b_e] or *love['bob', loved_by_b_e],
                   loved_by_b_e != 'bob',
                   loved_by_b_e != 'eve'

[37]:

	loved_by_b_e
0	alice
1	charlie

Compare this with multiple rule definitions under the same name:

[38]:

?[loved_by_b_e] := *love['eve', loved_by_b_e],
                   loved_by_b_e != 'bob',
                   loved_by_b_e != 'eve'
?[loved_by_b_e] := *love['bob', loved_by_b_e],
                   loved_by_b_e != 'bob',
                   loved_by_b_e != 'eve'

[38]:

	loved_by_b_e
0	alice
1	charlie

When you have multiple definitions of the same inline rule, the rule heads must be compatible. Only inline rules can have multiple definitions.

Negation#

Negation of expressions should be familiar:

[39]:

?[loved] := *love[person, loved], !ends_with(person, 'e')

[39]:

	loved
0	alice
1	george

Rule applications can also be negated, not with the ! operator, but with the not keyword:

[40]:

?[loved_by_e_not_b] := *love['eve', loved_by_e_not_b],
                       not *love['bob', loved_by_e_not_b]

[40]:

	loved_by_e_not_b
0	bob
1	charlie

There are two sets of logical operations in Cozo, one set that acts on the level of expressions, and another set that acts within the expression (we say they work on atoms):

For atoms: , or and (conjunction), or (disjunction), not (negation)
For expressions: && (conjunction), || (disjunction), ! (negation)

The difference between , and and is operator precedence: and has higher precedence than or, whereas , has lower precedence than or.

There is a safety rule for negation:

[41]:

?[not_loved_by_b] := not *love['bob', not_loved_by_b]

[41]:

eval::unbound_symb_in_head

  × Symbol 'not_loved_by_b' in rule head is unbound
   ╭─[1:1]
 1 │ ?[not_loved_by_b] := not *love['bob', not_loved_by_b]
   ·   ──────────────
 2 │
   ╰────
  help: Note that symbols occurring only in negated positions are not considered bound

This query is forbidden because the resulting relation is infinite. For example, ‘gold’ should be in the result, since according to the facts stored in the database, Bob has no interest in ‘gold’.

To make our query finite, we have to explicitly give our query a closed world:

[42]:

the_population[p] := *love[p, _a]
the_population[p] := *love[_a, p]

?[not_loved_by_b] := the_population[not_loved_by_b],
                     not *love['bob', not_loved_by_b]

[42]:

	not_loved_by_b
0	bob
1	charlie
2	david
3	eve
4	george

Recursion#

The single greatest advantage of Datalog over SQL is that recursive/graph queries area much simpler.

Inline rules can apply other rules, and can have multiple definitions. If you combine these two, you get recursions:

[43]:

alice_love_chain[person] := *love['alice', person]
alice_love_chain[person] := alice_love_chain[in_person],
                            *love[in_person, person]

?[chained] := alice_love_chain[chained]

[43]:

	chained
0	alice
1	bob
2	charlie
3	eve

You may object that you only need to be able to apply other rules to have recursion, without multiple definitions. Technically correct, but the resulting queries are not useful:

..

[44]:

alice_love_chain[person] := alice_love_chain[in_person],
                            *love[in_person, person]

?[chained] := alice_love_chain[chained]

[44]:

	chained

Similar to the negation case, if there is no way to deduce a fact from the given facts, then the fact itself is considered false. You need multiple definitions to “bootstrap” the query.

Aggregation#

Aggregations are usually used to compute statistics. In Cozo, aggregations are applied in the head of inline rules:

[45]:

?[person, count(loved_by)] := *love[loved_by, person]

[45]:

	person	count(loved_by)
0	alice	2
1	bob	1
2	charlie	1
3	eve	2
4	george	2

The usual sum, mean, etc. are all available.

Here is the full list of aggregations for you to play with.

Query options#

We have seen query options like :create, :put, :rm for manipulating stored relations. There are also query options for controlling what is returned:

[46]:

?[loving, loved] := *love{ loving, loved }

:limit 1

[46]:

	loving	loved
0	alice	eve

Next we want the result to be sorted by loved in descending order, then loving in ascending order, and skip the first row:

[47]:

?[loving, loved] := *love{ loving, loved }

:order -loved, loving
:offset 1

[47]:

	loving	loved
0	george	george
1	alice	eve
2	charlie	eve
3	eve	charlie
4	eve	bob
5	bob	alice
6	eve	alice

Putting - in front of variables in :order clause denotes reverse order. Nothing or + denotes the ascending order.

The full list of query options are explained in this chapter.

Fixed rules#

We’ve seen how the <- syntax for constant rules is syntax sugar. The full syntax is:

[48]:

?[] <~ Constant(data: [['hello', 'world', 'Cozo!']])

[48]:

	_0	_1	_2
0	hello	world	Cozo!

Here we are using the fixed rule Constant, which takes one option named data. The curly tail <~ denotes a fixed rule.

Fixed rules take input relations as arguments, apply custom logic to them and produce its output relation. They’re functions that return relations, defined outside the Cozo engine itself (either built-in or loaded from a library or defined by the host language).

The Constant fixed rule take zero input relations.

If you are using Cozo in browser, Constant is the only fixed rule you can use. In all other cases your Cozo would have the graph algorithms module enabled, and all graph algorithms are implemented as fixed rules. As an example, let’s find out who is most popular in the love graph by using the PageRank algorithm:

[49]:

?[person, page_rank] <~ PageRank(*love[])

:order -page_rank

[49]:

	person	page_rank
0	george	0.280443
1	eve	0.275376
2	alice	0.191205
3	bob	0.103023
4	charlie	0.103023
5	david	0.025000

Here the input relation is the stored relation *love (as noted above, you will receive an error if you run this in the WASM implementation).

Each fixed rule is different: here is the full list.

[50]:

::remove love

[50]:

	status
0	OK

Time travel#

A very useful capability of Cozo is the ability to time travel in a stored relation. Usually when you :put into a relation, the old value is overwritten; and when you :rm, the row is completely gone. But the old values can be of great value — we have a short story about it.

In this case, time travel is the solution: instead of storing current facts, the stored relation stores the complete history of facts, at least all the available history as history itself unfolds.

If you believe that you don’t want this functionality at all, you can skip to the next section. At Cozo we adopt the “zero-cost, zero-mental-overhead abstraction” philosophy: if you don’t use a functionality, you don’t pay the performance or cognitive overhead.

Let’s have a simple example: storing the head of state of countries. First we have to create a new stored relation:

[51]:

:create hos {state: String, year: Validity => hos: String}

[51]:

	status
0	OK

hos is shorthand for “head of state”. The only thing different about this relation is that we are giving year the type Validity. You can think of validity as a list of two elements, the first element being an integer and the second element being a boolean. The integer indicates the “time” of the fact recorded by the row, the boolean, if true, indicates that the new version of the row holds starting from the indicated time; if the boolean is false, the row is deleted at this time.

Note that the integer only identifies a temporal sequence to Cozo. What sequence you use is up to you. UNIX Epoch is the default, as we will see.

Now let’s insert some data:

[52]:

?[state, year, hos] <- [['US', [2001, true], 'Bush'],
                        ['US', [2005, true], 'Bush'],
                        ['US', [2009, true], 'Obama'],
                        ['US', [2013, true], 'Obama'],
                        ['US', [2017, true], 'Trump'],
                        ['US', [2021, true], 'Biden']]

:put hos {state, year => hos}

[52]:

	status
0	OK

It is OK to assert a still valid fact again, as we have done above. You can use this relation like a normal relation:

[53]:

?[state, year, hos] := *hos{state, year, hos}

[53]:

	state	year	hos
0	US	[2021, True]	Biden
1	US	[2017, True]	Trump
2	US	[2013, True]	Obama
3	US	[2009, True]	Obama
4	US	[2005, True]	Bush
5	US	[2001, True]	Bush

The curious thing is that it is sorted descendingly by year. Validity sorts descendingly.

For any stored relation that has type Validity at the last slot of the key, the time-travel capability is enabled. Say you have forgotten who the president of the US was in 2019. Easy:

[54]:

?[hos, year] := *hos{state: 'US', year, hos @ 2019}

[54]:

	hos	year
0	Trump	[2017, True]

In your answer you also got the year that this fact was last known to be true.

You also don’t know about the year 2099:

[55]:

?[hos, year] := *hos{state: 'US', year, hos @ 2099}

[55]:

	hos	year
0	Biden	[2021, True]

That certainly doesn’t look right. Let’s fix it by retracting facts on or after 2025, and only inserting them back when we have the sure facts:

[56]:

?[state, year, hos] <- [['US', [2025, false], '']]

:put hos {state, year => hos}

[56]:

	status
0	OK

As we have hinted, you retract facts by putting a retraction. Now let’s run the query again:

[57]:

?[hos, year] := *hos{state: 'US', year, hos @ 2099}

[57]:

	hos	year

Since the database does not contain facts on or after 2025, your query returns empty.

This functionality is flexible: you can mix different periods in the same query:

[58]:

?[hos2018, hos2010] := *hos{state: 'US', hos: hos2018 @ 2018},
                       *hos{state: 'US', hos: hos2010 @ 2010}

[58]:

	hos2018	hos2010
0	Trump	Obama

As the relation hos is just a normal relation, you can still rm facts from it, in which case the facts are irretrievably gone. Whether that’s desirable is up to you: the database gives you the choice of how you want to use it, and trusts that you know how to use it correctly for your use case.

First let’s create a new stored relation to store people’s moods:

[59]:

:create mood {name: String, at: Validity => mood: String}

[59]:

	status
0	OK

I want to record my mood now:

[60]:

?[name, at, mood] <- [['me', 'ASSERT', 'curious']]
:put mood {name, at => mood}

[60]:

	status
0	OK

Instead of giving a list of two elements as we have done above, we have simply used the string ASSERT, and the system will know that we mean an assertion of a current fact, and you guessed it, the value used is the current UNIX Epoch.

[61]:

?[name, at, mood] := *mood{name, at, mood}

[61]:

	name	at	mood
0	me	[1700569902203327, True]	curious

[62]:

?[name, time, mood] := *mood{name, at, mood},
                       time = format_timestamp(at)

[62]:

	name	time	mood
0	me	2023-11-21T12:31:42.203+00:00	curious

To query for current facts, use the string NOW in the validity specification:

[63]:

?[name, time, mood] := *mood{name, at, mood @ 'NOW'},
                       time = format_timestamp(at)

[63]:

	name	time	mood
0	me	2023-11-21T12:31:42.203+00:00	curious

You can put in facts with manual timestamps as before, so it is possible that your database contains facts about the future. Let’s do just that. Instead of giving a mysterious string of numerals, you can use a string for the timestamp:

[64]:

?[name, at, mood] <- [['me', '2030-01-01T00:00:00.000+00:00', 'hopeful']]
:put mood {name, at => mood}

[64]:

	status
0	OK

Since this is in the future, it shouldn’t affect NOW:

[65]:

?[name, time, mood] := *mood{name, at, mood @ 'NOW'},
                       time = format_timestamp(at)

[65]:

	name	time	mood
0	me	2023-11-21T12:31:42.203+00:00	curious

In this case, there is also END for the validity specification, meaning to extract facts at the end of time:

[66]:

?[name, time, mood] := *mood{name, at, mood @ 'END'},
                       time = format_timestamp(at)

[66]:

	name	time	mood
0	me	2030-01-01T00:00:00+00:00	hopeful

Retraction at the current timestamp can be done with the string RETRACT:

[67]:

?[name, at, mood] <- [['me', 'RETRACT', '']]
:put mood {name, at => mood}

[67]:

	status
0	OK

Retraction placed in the future can also be done with stringy timestamps by prefixing with ~:

[68]:

?[name, at, mood] <- [['me', '~9999-01-01T00:00:00.000+00:00', 'who cares']]
:put mood {name, at => mood}

[68]:

	status
0	OK

Now let’s look at the complete history:

[69]:

?[name, time, is_assert, mood] := *mood{name, at, mood},
                                  time = format_timestamp(at),
                                  is_assert = to_bool(at)

[69]:

	name	time	is_assert	mood
0	me	2023-11-21T12:31:42.203+00:00	True	curious
1	me	2023-11-21T12:31:46.232+00:00	False
2	me	2030-01-01T00:00:00+00:00	True	hopeful
3	me	9999-01-01T00:00:00+00:00	False	who cares

This time-travel facility is much faster than what you get if you try to implement it directly with Datalog: see the note for more details. Some further technical details of time travel is discussed in its own chapter.

[70]:

::remove mood, hos

[70]:

	status
0	OK

Extended example: the air routes dataset#

Now you have a basic understanding of using the various constructs of Cozo, let’s deal with a small real-world dataset, with about 3700 nodes and 57000 edges.

The data we are going to use, and many examples that we will present, are adapted from the book Practical Gremlin. Gremlin is an imperative query language for graphs, a very different take compared to Datalog.

First, let’s create the stored relations we want (wrapping queries in braces allows you to execute several queries together atomically):

[71]:

{:create airport {
    code: String
    =>
    icao: String,
    desc: String,
    region: String,
    runways: Int,
    longest: Float,
    elev: Float,
    country: String,
    city: String,
    lat: Float,
    lon: Float
}}
{:create country {
    code: String
    =>
    desc: String
}}
{:create continent {
    code: String
    =>
    desc: String
}}
{:create contain { entity: String, contained: String }}
{:create route { fr: String, to: String => dist: Float }}

[71]:

	status
0	OK

The next command applies only if you are using Jupyter notebooks: it downloads a JSON file containing the data and imports it into the database. The commented out line shows how to do the same thing with a local file. If you are using the Cozo WASM interface, click the “Import from URL” or the “Import from File” icon, and paste in the address.

[72]:

%cozo_import_remote_file 'https://raw.githubusercontent.com/cozodb/cozo/dev/cozo-core/tests/air-routes.json'
# %cozo_import_local_file '../../cozo/cozo-core/tests/air-routes.json'

If you are using the cozo repl command line, the command to use instead is simply

%import https://raw.githubusercontent.com/cozodb/cozo/dev/cozo-core/tests/air-routes.json

You can replace the URL with the path to a local file as well. Other environments also have ways to do the same thing: refer to the respective documentations.

If you feel that the above is too much magic, we will show you the “hard way” of importing the same data at the end of this tutorial. For now let’s move on.

Let’s verify all the relations we want are there:

[73]:

::relations

[73]:

	name	arity	access_level	n_keys	n_non_keys
0	airport	11	normal	1	10
1	contain	2	normal	2	0
2	continent	2	normal	1	1
3	country	2	normal	1	1
4	dept_info	4	normal	2	2
5	route	3	normal	2	1
6	test	1	normal	1	0
7	test2	1	normal	1	0

While we are at it, let’s lock all these tables to prevent accidentally changing their contents:

[74]:

::access_level read_only airport, contain, continent, country, route

[74]:

	status
0	OK

More information about what this does is explained in this chapter.

Let’s just look at some data. Start with airports:

[75]:

?[code, city, desc, region, runways, lat, lon] := *airport{code, city, desc, region, runways, lat, lon}

:limit 5

[75]:

	code	city	desc	region	runways	lat	lon
0	AAA	Anaa	Anaa Airport	PF-U-A	1	-17.352600	-145.509995
1	AAE	Annabah	Annaba Airport	DZ-36	2	36.822201	7.809170
2	AAL	Aalborg	Aalborg Airport	DK-81	2	57.092759	9.849243
3	AAN	Al Ain	Al Ain International Airport	AE-AZ	1	24.261700	55.609200
4	AAQ	Anapa	Anapa Airport	RU-KDA	1	45.002102	37.347301

Airports with the most runways:

[76]:

?[code, city, desc, region, runways, lat, lon] := *airport{code, city, desc, region, runways, lat, lon}

:order -runways
:limit 10

[76]:

	code	city	desc	region	runways	lat	lon
0	DFW	Dallas	Dallas/Fort Worth International Airport	US-TX	7	32.896801	-97.038002
1	ORD	Chicago	Chicago O'Hare International Airport	US-IL	7	41.978600	-87.904800
2	AMS	Amsterdam	Amsterdam Airport Schiphol	NL-NH	6	52.308601	4.763890
3	BOS	Boston	Boston Logan	US-MA	6	42.364300	-71.005203
4	DEN	Denver	Denver International Airport	US-CO	6	39.861698	-104.672997
5	DTW	Detroit	Detroit Metropolitan, Wayne County	US-MI	6	42.212399	-83.353401
6	ATL	Atlanta	Hartsfield - Jackson Atlanta International Airport	US-GA	5	33.636700	-84.428101
7	GIS	Gisborne	Gisborne Airport	NZ-GIS	5	-38.663300	177.977997
8	HLZ	Hamilton	Hamilton International Airport	NZ-WKO	5	-37.866699	175.332001
9	IAH	Houston	George Bush Intercontinental	US-TX	5	29.984400	-95.341400

How many airports are there in total?

[77]:

?[count(code)] := *airport{code}

[77]:

	count(code)
0	3504

Let’s get a distribution of the initials of the airport codes:

[78]:

?[count(initial), initial] := *airport{code}, initial = first(chars(code))

:order initial

[78]:

	count(initial)	initial
0	212	A
1	235	B
2	214	C
3	116	D
4	95	E
5	76	F
6	135	G
7	129	H
8	112	I
9	80	J
10	197	K
11	184	L
12	228	M
13	111	N
14	89	O
15	203	P
16	7	Q
17	121	R
18	245	S
19	205	T
20	77	U
21	86	V
22	59	W
23	28	X
24	211	Y
25	49	Z

More useful are the statistics of runways:

[79]:

?[count(r), count_unique(r), sum(r), min(r), max(r), mean(r), std_dev(r)] :=
    *airport{runways: r}

[79]:

	count(r)	count_unique(r)	sum(r)	min(r)	max(r)	mean(r)	std_dev(r)
0	3504	7	4980.000000	1	7	1.421233	0.743083

Using country, we can find countries with no airports:

[80]:

?[desc] := *country{code, desc}, not *airport{country: code}

[80]:

	desc
0	Andorra
1	Liechtenstein
2	Monaco
3	Pitcairn
4	San Marino

The route relation by itself is rather boring:

[81]:

?[fr, to, dist] := *route{fr, to, dist}

:limit 10

[81]:

	fr	to	dist
0	AAA	FAC	48.000000
1	AAA	MKP	133.000000
2	AAA	PPT	270.000000
3	AAA	RAR	968.000000
4	AAE	ALG	254.000000
5	AAE	CDG	882.000000
6	AAE	IST	1161.000000
7	AAE	LYS	631.000000
8	AAE	MRS	477.000000
9	AAE	ORN	477.000000

It just records the starting and ending airports of each route, together with the distance. This relation only becomes useful when used as a graph.

Airports with no routes:

[82]:

?[code, desc] := *airport{code, desc}, not *route{fr: code}, not *route{to: code}

[82]:

	code	desc
0	AFW	Fort Worth Alliance Airport
1	APA	Centennial Airport
2	APK	Apataki Airport
3	BID	Block Island State Airport
4	BVS	Breves Airport
5	BWU	Sydney Bankstown Airport
6	CRC	Santa Ana Airport
7	CVT	Coventry Airport
8	EKA	Murray Field
9	GYZ	Gruyere Airport
10	HFN	Hornafjordur Airport
11	HZK	Husavik Airport
12	ILG	New Castle Airport
13	INT	Smith Reynolds Airport
14	ISL	Ataturk International Airport
15	KGG	Kédougou Airport
16	NBW	Leeward Point Field
17	NFO	Mata'aho Airport
18	PSY	Stanley Airport
19	RIG	Rio Grande Airport
20	SFD	San Fernando De Apure Airport
21	SFH	San Felipe International Airport
22	SXF	Berlin-Schönefeld International Airport Closed
23	TUA	Teniente Coronel Luis a Mantilla Airport
24	TWB	Toowoomba Airport
25	TXL	Berlin, Tegel International Airport Closed
26	VCV	Southern California Logistics Airport
27	YEI	Bursa Yenişehir Airport

Airports with the most out routes:

[83]:

route_count[fr, count(fr)] := *route{fr}
?[code, n] := route_count[code, n]

:sort -n
:limit 5

[83]:

	code	n
0	FRA	310
1	IST	309
2	CDG	293
3	AMS	283
4	MUC	270

How many routes are there from the European Union to the US?

[84]:

routes[unique(r)] := *contain['EU', fr],
                     *route{fr, to},
                     *airport{code: to, country: 'US'},
                     r = [fr, to]
?[n] := routes[rs], n = length(rs)

[84]:

	n
0	435

How many airports are there in the US with routes from the EU?

[85]:

?[count_unique(to)] := *contain['EU', fr],
                       *route{fr, to},
                       *airport{code: to, country: 'US'}

[85]:

	count_unique(to)
0	45

How many routes are there for each airport in London, UK?

[86]:

?[code, count(code)] := *airport{code, city: 'London', region: 'GB-ENG'}, *route{fr: code}

[86]:

	code	count(code)
0	LCY	51
1	LGW	232
2	LHR	221
3	LTN	130
4	STN	211

We need to specify the region, because there is another city called London, not in the UK.

How many airports are reachable from London, UK in two hops?

[87]:

lon_uk_airports[code] := *airport{code, city: 'London', region: 'GB-ENG'}
one_hop[to] := lon_uk_airports[fr], *route{fr, to}, not lon_uk_airports[to];
?[count_unique(a3)] := one_hop[a2], *route{fr: a2, to: a3}, not lon_uk_airports[a3];

[87]:

	count_unique(a3)
0	2353

What are the cities directly reachable from LGW (London Gatwick), but furthermost away?

[88]:

?[city, dist] := *route{fr: 'LGW', to, dist},
                 *airport{code: to, city}
:order -dist
:limit 10

[88]:

	city	dist
0	Buenos Aires	6908.000000
1	Singapore	6751.000000
2	Langkawi	6299.000000
3	Duong Dong	6264.000000
4	Taipei	6080.000000
5	Port Louis	6053.000000
6	Rayong	6008.000000
7	Cape Town	5987.000000
8	Hong Kong	5982.000000
9	Shanghai	5745.000000

What airports are within 0.1 degrees of the Greenwich meridian?

[89]:

?[code, desc, lon, lat] := *airport{lon, lat, code, desc}, lon > -0.1, lon < 0.1

[89]:

	code	desc	lon	lat
0	CDT	Castellon De La Plana Airport	0.026111	39.999199
1	LCY	London City Airport	0.055278	51.505278
2	LDE	Tarbes-Lourdes-Pyrénées Airport	-0.006439	43.178699
3	LEH	Le Havre Octeville Airport	0.088056	49.533901

Airports in a box drawn around London Heathrow, UK:

[90]:

h_box[lon, lat] := *airport{code: 'LHR', lon, lat}
?[code, desc] := h_box[lhr_lon, lhr_lat], *airport{code, lon, lat, desc},
                 abs(lhr_lon - lon) < 1, abs(lhr_lat - lat) < 1

[90]:

	code	desc
0	LCY	London City Airport
1	LGW	London Gatwick
2	LHR	London Heathrow
3	LTN	London Luton Airport
4	SOU	Southampton Airport
5	STN	London Stansted Airport

For some spherical geometry: what is the angle subtended by SFO and NRT on the surface of the earth?

[91]:

?[deg_diff] := *airport{code: 'SFO', lat: a_lat, lon: a_lon},
               *airport{code: 'NRT', lat: b_lat, lon: b_lon},
               deg_diff = rad_to_deg(haversine_deg_input(a_lat, a_lon, b_lat, b_lon))

[91]:

	deg_diff
0	73.992112

We mentioned before that aggregations in Cozo are powerful. They are powerful because they can be used in recursions (some restrictions apply).

Let’s find the distance of the shortest route between two airports. One wayis to enumerate all the routes between the two airports, and then apply min aggregation to the results. This cannot be implemented as stated, since the routes may contain cycles and hence there can be an infinite number of routes between two airports.

Instead, think recursively. If we already have all the shortest routes between all nodes, we derive an equation satisfied by the shortest route: the shortest route between a and b is either the distance of a direct route, or the sum of the shortest distance from a to c and the distance of a direct route from c to d. We apply our min aggregation to this recursive set instead.

Write it out and it works. For exmaple, the shortest routes between the airports LHR and YPO:

[92]:

shortest[b, min(dist)] := *route{fr: 'LHR', to: b, dist}
                          # Start with the airport 'LHR', retrieve a direct route from 'LHR' to b

shortest[b, min(dist)] := shortest[c, d1], # Start with an existing shortest route from 'LHR' to c
                          *route{fr: c, to: b, dist: d2},  # Retrieve a direct route from c to b
                          dist = d1 + d2 # Add the distances

?[dist] := shortest['YPO', dist] # Extract the answer for 'YPO'.
                                 # We chose it since it is the hardest airport to get to from 'LHR'.

[92]:

	dist
0	4147.000000

There is a caveat when you try to write similar queries. Say you try to write it in the following way (don’t try to run it):

shortest[a, b, min(dist)] := *route{fr: a, to: b, dist}
shortest[a, b, min(dist)] := shortest[a, c, d1],
                          *route{fr: c, to: b, dist: d2},
                          dist = d1 + d2

?[dist] := shortest['LHR', 'YPO', dist]

You will find that the query does not complete in a reasonable amount of time, despite it being equivalent to the original query. Why?

In the changed query, you are asking the database to compute the all-pair shortest path, and then extract the answer to a particular shortest path. Normally Cozo would apply a technique called magic set rewrite so that only the needed answer would be calculated. However, in the changed query the presence of the aggregation operator min prevents that. In this case, applying the rewrite to the variable a would still yield the correct answer, but rewriting in any other way would give complete nonsense, and in the more general case with recursive aggregations this is a can of worms.

So as explained in the chapter about execution, magic set rewrites are only applied to rules without aggregations or recursions for the moment, until we are sure of the exact conditions under which the rewrites are safe. So for now at least the database executes the query as written, computing the result of the shortest rule containing more than ten million rows (to be exact, 3700 * 3700 = 13,690,000 rows) first!

The bottom line is, be mindful of the cardinality of the return sets of recursive rules.

A tour of graph algorithms#

Now let’s investigate the graph using some graph algorithms. As we have mentioned before, the Cozo running in browsers through WASM does not have the graph algorithms module enabled, so to run the following examples you will need to use some other implementation (for example, the Python one).

Since path-finding is such a common operation on graphs, Cozo has several fixed rules for that:

[93]:

starting[] <- [['LHR']]
goal[] <- [['YPO']]
?[starting, goal, distance, path] <~ ShortestPathDijkstra(*route[], starting[], goal[])

[93]:

	starting	goal	distance	path
0	LHR	YPO	4147.000000	['LHR', 'YUL', 'YVO', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']

Not only is it more efficient, but we also get a path for the shortest route.

Not content with the shortest path, the following calculates ten the shortest paths:

[94]:

starting[] <- [['LHR']]
goal[] <- [['YPO']]
?[starting, goal, distance, path] <~ KShortestPathYen(*route[], starting[], goal[], k: 10)

[94]:

	starting	goal	distance	path
0	LHR	YPO	4147.000000	['LHR', 'YUL', 'YVO', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']
1	LHR	YPO	4150.000000	['LHR', 'DUB', 'YUL', 'YVO', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']
2	LHR	YPO	4164.000000	['LHR', 'YUL', 'YMT', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']
3	LHR	YPO	4167.000000	['LHR', 'DUB', 'YUL', 'YMT', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']
4	LHR	YPO	4187.000000	['LHR', 'MAN', 'DUB', 'YUL', 'YVO', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']
5	LHR	YPO	4202.000000	['LHR', 'IOM', 'DUB', 'YUL', 'YVO', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']
6	LHR	YPO	4204.000000	['LHR', 'MAN', 'DUB', 'YUL', 'YMT', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']
7	LHR	YPO	4209.000000	['LHR', 'YUL', 'YMT', 'YNS', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']
8	LHR	YPO	4211.000000	['LHR', 'MAN', 'IOM', 'DUB', 'YUL', 'YVO', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']
9	LHR	YPO	4212.000000	['LHR', 'DUB', 'YUL', 'YMT', 'YNS', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']

If efficiency is really important to you, you can use the A* algorithm with a good heuristic function:

[95]:

code_lat_lon[code, lat, lon] := *airport{code, lat, lon}
starting[code, lat, lon] := code = 'LHR', *airport{code, lat, lon};
goal[code, lat, lon] := code = 'YPO', *airport{code, lat, lon};
?[] <~ ShortestPathAStar(*route[],
                         code_lat_lon[node, lat1, lon1],
                         starting[],
                         goal[goal, lat2, lon2],
                         heuristic: haversine_deg_input(lat1, lon1, lat2, lon2) * 3963);

[95]:

	_0	_1	_2	_3
0	LHR	YPO	4147.000000	['LHR', 'YUL', 'YVO', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO']

There’s a lot more setup required in this case: we need to retrieve the latitudes and longitudes of airports and do processing on them first. The number 3963 above is the radius of the earth in miles.

The most important airports, by PageRank:

[96]:

rank[code, score] <~ PageRank(*route[a, b])
?[code, desc, score] := rank[code, score], *airport{code, desc}

:limit 10;
:order -score

[96]:

	code	desc	score
0	IST	Istanbul International Airport	0.004889
1	DFW	Dallas/Fort Worth International Airport	0.004696
2	ORD	Chicago O'Hare International Airport	0.004452
3	DEN	Denver International Airport	0.004252
4	PEK	Beijing Capital International Airport	0.004044
5	FRA	Frankfurt am Main	0.004027
6	ATL	Hartsfield - Jackson Atlanta International Airport	0.004022
7	DXB	Dubai International Airport	0.004002
8	CDG	Paris Charles de Gaulle	0.003998
9	DME	Moscow, Domodedovo International Airport	0.003817

The following example takes a long time to run since it calculates the betweenness centrality: up to a few seconds, depending on your machine. Algorithms for calculating the betweenness centrality have high complexity.

[97]:

centrality[code, score] <~ BetweennessCentrality(*route[a, b])
?[code, desc, score] := centrality[code, score], *airport{code, desc}

:limit 10;
:order -score

[97]:

	code	desc	score
0	ANC	Anchorage Ted Stevens	1074868.250000
1	KEF	Reykjavik, Keflavik International Airport	928450.437500
2	HEL	Helsinki Ventaa	581588.500000
3	PEK	Beijing Capital International Airport	532021.062500
4	DEL	Indira Gandhi International Airport	472979.968750
5	IST	Istanbul International Airport	457882.156250
6	PKC	Yelizovo Airport	408571.000000
7	MSP	Minneapolis-St.Paul International Airport	396433.250000
8	LAX	Los Angeles International Airport	393309.968750
9	DEN	Denver International Airport	374339.531250

These are the airports that, if disconnected from the network, cause the most disruption. As this example shows, some of the algorithms really struggle when you go beyond small or medium sized dataset.

Community detection can collapse a graph into a supergraph. Here we store the result, since it has too many rows to display nicely:

[98]:

community[detailed_cluster, code] <~ CommunityDetectionLouvain(*route[a, b])
?[code, cluster, detailed_cluster] := community[detailed_cluster, code], cluster = first(detailed_cluster)

:replace community {code => cluster, detailed_cluster}

[98]:

	status
0	OK

We can look at the supernodes containing specific nodes. For example, the supernode for London Gatwick consists of mainly UK and other European airports, as you would expect:

[99]:

community[code] := *community{code: 'LGW', cluster}, *community{code, cluster}
?[country, count(code)] :=
    community[code],
    *airport{code, desc, country: country_code},
    *country{code: country_code, desc: country},

:order -count(code)
:limit 5

[99]:

	country	count(code)
0	United Kingdom	54
1	France	50
2	Norway	49
3	Spain	40
4	Greece	38

For JFK on the other hand, its supernode consists of mainly US airports:

[100]:

community[code] := *community{code: 'JFK', cluster}, *community{code, cluster}
?[country, count(code)] :=
    community[code],
    *airport{code, desc, country: country_code},
    *country{code: country_code, desc: country},

:order -count(code)
:limit 5

[100]:

	country	count(code)
0	United States	444
1	Canada	111
2	Brazil	108
3	Mexico	57
4	Colombia	50

But it does not always work according to geography. For example, Frankfurt airport is in Germany:

[101]:

?[desc, country_desc] := *airport{code: 'FRA', desc, country: country_code}, *country{code: country_code, desc: country_desc}

[101]:

	desc	country_desc
0	Frankfurt am Main	Germany

But its supernode:

[102]:

community[code] := *community{code: 'FRA', cluster}, *community{code, cluster}
?[country, count(code)] :=
    community[code],
    *airport{code, desc, country: country_code},
    *country{code: country_code, desc: country},

:order -count(code)
:limit 5

[102]:

	country	count(code)
0	United States	444
1	Canada	111
2	Brazil	108
3	Mexico	57
4	Colombia	50

Germany does not even appear in the top five. In fact, FRA is in the same supernode as JFK. What matters is the connectivity in the graph, not the geography. As another example:

[103]:

community[code] := *community{code: 'SIN', cluster}, *community{code, cluster}
?[country, count(code)] :=
    community[code],
    *airport{code, desc, country: country_code},
    *country{code: country_code, desc: country},

:order -count(code)
:limit 5

[103]:

	country	count(code)
0	China	216
1	Australia	125
2	Indonesia	68
3	Japan	65
4	Philippines	40

You’d expect SIN to be a Chinese airport. Wrong:

[104]:

?[desc, country_desc] := *airport{code: 'SIN', desc, country: country_code}, *country{code: country_code, desc: country_desc}

[104]:

	desc	country_desc
0	Singapore, Changi International Airport	Singapore

Finally, let’s collapse the route relation into super_route:

[105]:

?[fr_cluster, to_cluster, count(dist), sum(dist)] := *route{fr, to, dist},
                                                     *community{code: fr, cluster: fr_cluster},
                                                     *community{code: to, cluster: to_cluster}
:replace super_route {fr_cluster, to_cluster => n_routes=count(dist), total_distance=sum(dist)}

[105]:

	status
0	OK

As expected, the “diagonals” where fr_cluster == to_cluster are larger in the super_route graph:

[106]:

?[fr_cluster, to_cluster, n_routes, total_distance] := *super_route{fr_cluster, to_cluster, n_routes, total_distance}, fr_cluster < 2

[106]:

	fr_cluster	to_cluster	n_routes	total_distance
0	0	0	9041	8933554.000000
1	0	1	434	1695379.000000
2	0	2	228	761661.000000
3	0	3	530	1681865.000000
4	0	4	163	391892.000000
5	0	8	3	300.000000
6	0	11	2	283.000000
7	0	19	1	238.000000
8	0	21	2	705.000000
9	0	22	1	975.000000
10	1	0	434	1696858.000000
11	1	1	4474	5142452.000000
12	1	2	1160	1492734.000000
13	1	3	526	1724591.000000
14	1	4	223	361986.000000
15	1	9	1	808.000000

Now the super graph is small enough that all analytics algorithms return instantly:

[107]:

?[cluster, score] <~ PageRank(*super_route[])
:order -score
:limit 5

[107]:

	cluster	score
0	3	0.173309
1	0	0.093072
2	1	0.053465
3	2	0.053389
4	4	0.044654

You can now go on to investigate the supernodes, give real-world interpretations to them, etc. For example, a naïve interpretation of the above PageRank result is that North America is (still) the most prosperous part of the world, followed by East Asia in the second place, South Asia in the third place, and Europe in the fourth place.

Importing dataset the hard way#

Previously, we imported the air-routes dataset by using Python under the hood to download a specially-crafted JSON file and feed it to the database. Here we learn how to achieve the same effect by letting Cozo fetch and import a series of CSV files, without Python’s help.

Let’s set the database magic up first:

[108]:

::access_level normal airport, contain, continent, country, route

[108]:

	status
0	OK

[109]:

::remove airport, contain, continent, country, route, community, super_route

[109]:

	status
0	OK

Next, some parameters to make life eaiser (the lines commented out do the same thing by processing local files):

[113]:

# %cozo_set AIR_ROUTES_NODES_URL 'https://raw.githubusercontent.com/cozodb/cozo/dev/cozo-core/tests/air-routes-latest-nodes.csv'
# %cozo_set AIR_ROUTES_EDGES_URL 'https://raw.githubusercontent.com/cozodb/cozo/dev/cozo-core/tests/air-routes-latest-edges.csv'
%cozo_set AIR_ROUTES_NODES_URL 'file://./../../cozo/cozo-core/tests/air-routes-latest-nodes.csv'
%cozo_set AIR_ROUTES_EDGES_URL 'file://./../../cozo/cozo-core/tests/air-routes-latest-edges.csv'

First, import the airport relation:

[114]:

res[idx, label, typ, code, icao, desc, region, runways, longest, elev, country, city, lat, lon] <~
    CsvReader(types: ['Int', 'Any', 'Any', 'Any', 'Any', 'Any', 'Any', 'Int?', 'Float?', 'Float?', 'Any', 'Any', 'Float?', 'Float?'],
              url: $AIR_ROUTES_NODES_URL,
              has_headers: true)

?[code, icao, desc, region, runways, longest, elev, country, city, lat, lon] :=
    res[idx, label, typ, code, icao, desc, region, runways, longest, elev, country, city, lat, lon],
    label == 'airport'

:replace airport {
    code: String
    =>
    icao: String,
    desc: String,
    region: String,
    runways: Int,
    longest: Float,
    elev: Float,
    country: String,
    city: String,
    lat: Float,
    lon: Float
}

[114]:

	status
0	OK

The CsvReader utility downloads a CSV file from the internet and attempts to parse its content into a relation. When we store the relation, we specified types for the columns. The code column acts as a primary key for the airport stored relation.

Next is country:

[115]:

res[idx, label, typ, code, icao, desc] <~
    CsvReader(types: ['Int', 'Any', 'Any', 'Any', 'Any', 'Any'],
              url: $AIR_ROUTES_NODES_URL,
              has_headers: true)
?[code, desc] :=
    res[idx, label, typ, code, icao, desc],
    label == 'country'

:replace country {
    code: String
    =>
    desc: String
}

[115]:

	status
0	OK

continent:

[116]:

res[idx, label, typ, code, icao, desc] <~
    CsvReader(types: ['Int', 'Any', 'Any', 'Any', 'Any', 'Any'],
              url: $AIR_ROUTES_NODES_URL,
              has_headers: true)
?[idx, code, desc] :=
    res[idx, label, typ, code, icao, desc],
    label == 'continent'

:replace continent {
    code: String
    =>
    desc: String
}

[116]:

	status
0	OK

We need to make a translation table for the indices the data use:

[117]:

res[idx, label, typ, code] <~
    CsvReader(types: ['Int', 'Any', 'Any', 'Any'],
              url: $AIR_ROUTES_NODES_URL,
              has_headers: true)
?[idx, code] :=
    res[idx, label, typ, code],

:replace idx2code { idx => code }

[117]:

	status
0	OK

The contain relation contains information on the geographical inclusion of entities:

[118]:

res[] <~
    CsvReader(types: ['Int', 'Int', 'Int', 'String'],
              url: $AIR_ROUTES_EDGES_URL,
              has_headers: true)
?[entity, contained] :=
    res[idx, fr_i, to_i, typ],
    typ == 'contains',
    *idx2code[fr_i, entity],
    *idx2code[to_i, contained]


:replace contain { entity: String, contained: String }

[118]:

	status
0	OK

Finally, the routes between the airports. This relation is much larger than the rest and contains about 60k rows:

[119]:

res[] <~
    CsvReader(types: ['Int', 'Int', 'Int', 'String', 'Float?'],
              url: $AIR_ROUTES_EDGES_URL,
              has_headers: true)
?[fr, to, dist] :=
    res[idx, fr_i, to_i, typ, dist],
    typ == 'route',
    *idx2code[fr_i, fr],
    *idx2code[to_i, to]

:replace route { fr: String, to: String => dist: Float }

[119]:

	status
0	OK

We no longer need the idx2code relation:

[120]:

::remove idx2code

[120]:

	status
0	OK

Let’s verify all the relations we want are there:

[121]:

::relations

[121]:

	name	arity	access_level	n_keys	n_non_keys
0	airport	11	normal	1	10
1	contain	2	normal	2	0
2	continent	2	normal	1	1
3	country	2	normal	1	1
4	dept_info	4	normal	2	2
5	route	3	normal	2	1
6	test	1	normal	1	0
7	test2	1	normal	1	0

That’s it for the tutorial!

	count(initial)	initial
0	212	A
1	235	B
2	214	C
3	116	D
4	95	E
5	76	F
6	135	G
7	129	H
8	112	I
9	80	J
10	197	K
11	184	L
12	228	M
13	111	N
14	89	O
15	203	P
16	7	Q
17	121	R
18	245	S
19	205	T
20	77	U
21	86	V
22	59	W
23	28	X
24	211	Y
25	49	Z

	count(initial)	initial
0	212	A
1	235	B
2	214	C
3	116	D
4	95	E
5	76	F
6	135	G
7	129	H
8	112	I
9	80	J
10	197	K
11	184	L
12	228	M
13	111	N
14	89	O
15	203	P
16	7	Q
17	121	R
18	245	S
19	205	T
20	77	U
21	86	V
22	59	W
23	28	X
24	211	Y
25	49	Z

	count(initial)	initial
0	212	A
1	235	B
2	214	C
3	116	D
4	95	E
5	76	F
6	135	G
7	129	H
8	112	I
9	80	J
10	197	K
11	184	L
12	228	M
13	111	N
14	89	O
15	203	P
16	7	Q
17	121	R
18	245	S
19	205	T
20	77	U
21	86	V
22	59	W
23	28	X
24	211	Y
25	49	Z