Tutorial#
This tutorial will teach you the basics of using the Cozo database.
There are several ways you can run the queries in this tutorial:
You can run the examples in your browser without installing anything: just open Cozo in Wasm and you are ready to go.
You can download the appropriate
cozo-*
executable for your operating system from the release page, uncompress, rename tocozo
, and run the REPL mode by running in a terminalcozo repl
and follow along.If you are familiar with the Python datascience stack, you should following the instruction here instead to run this notebook in Jupyter Lab, with the notebook file.
There are many other ways, but the above ones are the easiest.
The following cell is to set up Jupyter. Ignore if you are using other methods.
[1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
%load_ext pycozo.ipyext_direct
Introduction#
Datalog is pattern matching for your data.
This sort of thing (Ruby):
config = {db: {user: 'admin', password: 'abc123'}}
case config
in db: {user:} # matches subhash and puts matched value in variable user
puts "Connect with user '#{user}'"
in connection: {username: }
puts "Connect with user '#{username}'"
else
puts "Unrecognized structure of config"
end
# Prints: "Connect with user 'admin'"
Broadly speaking, you provide some values and variables that constitute a pattern for pulling apart some data structure. The matching simultaneously chooses an execution path and gets values of the variables that match out of the data structure.
First steps#
Cozo is a relational database. Relations are similar to tables in SQL. The difference is that you can’t have duplicate rows in a relation.
You can define and then retrieve all the rows from a relation in Cozo like this:
[2]:
?[] <- [['hello', 'world', 'Cozo!']]
[2]:
_0 | _1 | _2 | |
---|---|---|---|
0 | hello | world | Cozo! |
This is a rule. The <- separates the head of the rule on the left from the body on the right. The [] <-
acts like SELECT *
in SQL. The above query in SQL would look something like:
SELECT * FROM (VALUES(‘hello’, ‘world’, ‘sql’)) as unnecessary_alias;
(a hint here of Datalog’s freedom from the endless ceremony that plagues SQL)
You can have multiple rows:
[3]:
?[] <- [[1, 2, 3], ['a', 'b', 'c']]
[3]:
_0 | _1 | _2 | |
---|---|---|---|
0 | 1 | 2 | 3 |
1 | a | b | c |
You can have the usual sorts of types:
[4]:
?[] <- [[1.5, 2.5, 3, 4, 5.5],
['aA', 'bB', 'cC', 'dD', 'eE'],
[true, false, null, -1.4e-2, "A string with double quotes"]]
[4]:
_0 | _1 | _2 | _3 | _4 | |
---|---|---|---|---|---|
0 | True | False | None | -0.014000 | A string with double quotes |
1 | 1.500000 | 2.500000 | 3 | 4 | 5.500000 |
2 | aA | bB | cC | dD | eE |
The input literal representations are similar to those in JavaScript. In particular, strings in double quotes are guaranteed to be interpreted in the same way as in JSON. The output are in JSON, but how they appear on your screen depends on your setup. For example, if you are using a Python setup, booleans are displayed as True
and False
instead of in lowercase, since they are converted to the native Python datatypes for display.
In the last example, you may have noticed that the returned order is not the same as the input order. This is because in Cozo relations are always stored as trees, and trees are always sorted.
Another consequence of relations as trees is that you can have no duplicate rows:
[5]:
?[] <- [[1], [2], [1], [2], [1]]
[5]:
_0 | |
---|---|
0 | 1 |
1 | 2 |
Expressions#
The next example shows the use of various functions, expressions and comments:
[6]:
?[] <- [[
1 + 2, # addition
3 / 4, # division
5 == 6, # equality
7 > 8, # greater
true || false, # or
false && true, # and
lowercase('HELLO'), # function
rand_float(), # function taking no argument
union([1, 2, 3], [3, 4, 5], [5, 6, 7]), # variadic function
]]
[6]:
_0 | _1 | _2 | _3 | _4 | _5 | _6 | _7 | _8 | |
---|---|---|---|---|---|---|---|---|---|
0 | 3 | 0.750000 | False | False | True | False | hello | 0.061268 | [1, 2, 3, 4, 5, 6, 7] |
Cozo has a pretty robust set of built-in functions and operators.
Find values for these variables#
More complex queries proceed by providing some expressions with values and variables, much like pattern matching in your favourite language.
Cozo tries to find all the sets of rows that match the given values, including values for the variables that also match across all the expressions. Some examples will make this clearer.
[7]:
?[first, second, third] <- [[1, 2, 3], ['a', 'b', 'c']]
[7]:
first | second | third | |
---|---|---|---|
0 | 1 | 2 | 3 |
1 | a | b | c |
Here, we’ve just provided named variables for the values coming back.
Note that you have to do obvious things, like not having different numbers of columns within one relation. Typing can be strict or loose as you choose (so far, it’s all been loose; we’ll see examples of strict typing shortly).
For <-
(called constant) rules, the number of bindings must match the actual data (the arity), otherwise, you will get an error:
[8]:
?[first, second] <- [[1, 2, 3], ['a', 'b', 'c']]
[8]:
parser::fixed_rule_head_arity_mismatch
× Fixed rule head arity mismatch
╭─[1:1]
1 │ ?[first, second] <- [[1, 2, 3], ['a', 'b', 'c']]
· ─────────────────────────────────────────────────
2 │
╰────
help: Expected arity: 3, number of arguments given: 2
Now let’s define rules that apply (use) other rules:
[9]:
rule[first, second, third] <- [[1, 2, 3], ['a', 'b', 'c']]
?[a, b, c] := rule[a, b, c]
[9]:
a | b | c | |
---|---|---|---|
0 | 1 | 2 | 3 |
1 | a | b | c |
You’ll notice the two different connectors. The first, <-
, is actually syntactic sugar for <~
, which is used for invoking functions on the right hand side (as well as Cozo’s many built-in functions, you can write your own in the host language). <-
is invoking a Constant function that looks like rule[first, second, third] <~ Constant(data: [[1, 2, 3], ['a', 'b', 'c']])
<-
and <~
rules are called constant rules.
The second rule is an inline rule, meaning it’s not invoking anything external to the database; it’s just matching data. Note that because it is named ?
, it is the result of the query.
Note how in the second line, we match a, b, c to the first, second, third in the first rule.
We don’t have to use all the variables from the body in an inline rule:
[10]:
rule[first, second, third] <- [[1, 2, 3], ['a', 'b', 'c']]
?[c, b] := rule[a, b, c]
[10]:
c | b | |
---|---|---|
0 | 3 | 2 |
1 | c | b |
Let’s get a bit fancier:
[11]:
?[c, b] := rule[a, b, c], is_num(a)
rule[first, second, third] <- [[1, 2, 3], ['a', 'b', 'c']]
[11]:
c | b | |
---|---|---|
0 | 3 | 2 |
Here, the first rule had multiple bodies, separated by commas. These are effectively joined by and. We see here how the variable a
matches with a value from the database, but also has to return true when passed to the is_num function.
You can also bind constants to rule applications directly:
[12]:
rule[first, second, third] <- [[1, 2, 3], ['a', 'b', 'c']]
?[c, b] := rule['a', b, c]
[12]:
c | b | |
---|---|---|
0 | c | b |
The pattern in the head on the second line matches only rows with an ‘a’ for the first field value.
You can have multiple rules with the same head name, which is like joining their bodies with OR
.
important[a] := recent[a], unread[a], not deleted[a]
important[a] := flagged[a], not deleted[a]
This means a is important if it isn’t deleted, and it’s either flagged, or recent and unread. You can also use ‘and’, ‘or’, ‘not’ and parentheses, although the style above is more idiomatic.
We can just directly set a value as part of the query, using ‘=’:
[13]:
rule[first, second, third] <- [[1, 2, 3], ['a', 'b', 'c']]
?[c, b, d] := rule[a, b, c], is_num(a), d = a + b + 2*c
[13]:
c | b | d | |
---|---|---|---|
0 | 3 | 2 | 9 |
The unification =
unifies with a single value. Use in
to unify with each value within a list in turn:
[14]:
?[x, y] := x in [1, 2, 3], y in ['x', 'y']
[14]:
x | y | |
---|---|---|
0 | 1 | x |
1 | 1 | y |
2 | 2 | x |
3 | 2 | y |
4 | 3 | x |
5 | 3 | y |
Having multiple rule applications in the body generates every combination of the bindings:
[15]:
r1[] <- [[1, 'a'], [2, 'b']]
r2[] <- [[2, 'B'], [3, 'C']]
?[l1, l2] := r1[a, l1],
r2[b, l2]
[15]:
l1 | l2 | |
---|---|---|
0 | a | B |
1 | a | C |
2 | b | B |
3 | b | C |
This is a cross join in SQL
We’ve now seen enough of the basics to start looking at some more complex operations.
Joins, made easy#
[16]:
r1[] <- [[1, 'a'], [2, 'b']]
r2[] <- [[2, 'B'], [3, 'C']]
?[l1, l2] := r1[a, l1],
r2[a, l2] # reused `a`
[16]:
l1 | l2 | |
---|---|---|
0 | b | B |
A join is as simple as using the same variable in different relations. Here, we are asking for a single value of a
that matches a row in each of the sources at the same time.
Compare the above with the SQL for the same operation:
WITH
r1 AS SELECT * FROM (VALUES (1, ‘a’), (2, ‘b’)) as t(c1, c2),
r2 AS SELECT * FROM (VALUES(2, ‘B’), (3, ‘C’)) as u(d1, d2)
SELECT
r1.c2, r2.d2 FROM r1 JOIN r2 ON (r1.c1 = r2.d1);
An outer join is done by considering the cases separately:
[17]:
a[x, y] <- [[1, 2], [3, 4]]
b[y, z] <- [[2, 3], [2, 4]]
?[x, y, z] := a[x, y], b[y, z]
?[x, y, z] := a[x, y], not b[y, _], z = null
[17]:
x | y | z | |
---|---|---|---|
0 | 1 | 2 | 3.000000 |
1 | 1 | 2 | 4.000000 |
2 | 3 | 4 | nan |
Note that we still only have a single output ?
rule here; it is split into two parts effectively joining the right sides with or
as we discussed previously.
Stored relations#
The relations we’ve seen so far were created on the fly as constant rules, and only persist for the duration of the query.
You can of course also persist data. To do this, you first declare the relation:
[18]:
:create stored {c1, c2}
[18]:
status | |
---|---|
0 | OK |
This creates a relation called ‘stored’ with two columns that will take any type. Here is a fully typed relation:
[19]:
:create dept_info {
company_name: String,
department_name: String,
=>
head_count: Int default 0,
address: String,
}
[19]:
status | |
---|---|
0 | OK |
This should be fairly obvious, except for the =>
in the middle. This symbol separates the columns in the primary key before the =>
from the other columns. If the =>
isn’t present, all of the columns are considered part of the primary key.
[20]:
?[a, b, c] <- [[1, 'a', 'A'],
[2, 'b', 'B'],
[3, 'c', 'C'],
[4, 'd', 'D']]
:create fd {a, b => c}
[20]:
status | |
---|---|
0 | OK |
[21]:
?[a, b, c] := *fd[a, b, c]
[21]:
a | b | c | |
---|---|---|---|
0 | 1 | a | A |
1 | 2 | b | B |
2 | 3 | c | C |
3 | 4 | d | D |
Note how we create rows in fd as we define it.
In Cozo terms, the fields after the => are functionally dependent on the ones before. Those before determine identity, which lets the :put operation function as (upsert)https://wiki.postgresql.org/wiki/UPSERT.
[22]:
?[a, b, c] <- [[3, 'c', 'CCCCCCC']]
:put fd {a, b => c}
[22]:
status | |
---|---|
0 | OK |
[23]:
?[a, b, c] := *fd[a, b, c]
[23]:
a | b | c | |
---|---|---|---|
0 | 1 | a | A |
1 | 2 | b | B |
2 | 3 | c | CCCCCCC |
3 | 4 | d | D |
Note that the name of the stored relation should not start with an underscore _
. You will get no error if you use a name starting with an underscore, but you will discover that the relation won’t be persisted. What you are dealing with is actually what is called an ephemeral relation.
You can see the stored relations in your system by running a system op:
[24]:
::relations
[24]:
name | arity | access_level | n_keys | n_non_keys | n_put_triggers | n_rm_triggers | n_replace_triggers | description | |
---|---|---|---|---|---|---|---|---|---|
0 | dept_info | 4 | normal | 2 | 2 | 0 | 0 | 0 | |
1 | fd | 3 | normal | 2 | 1 | 0 | 0 | 0 | |
2 | stored | 2 | normal | 2 | 0 | 0 | 0 | 0 |
Note that we have regular operations like :create
that start with a :, and system operations with a ::.
We can investigate the columns of the stored relation:
[25]:
::columns stored
[25]:
column | is_key | index | type | has_default | |
---|---|---|---|---|---|
0 | c1 | True | 0 | Any? | False |
1 | c2 | True | 1 | Any? | False |
Stored relations can be accessed by using an asterisk *
before the name:
[26]:
?[a, b] := *stored[a, b]
[26]:
a | b |
---|
Unlike relations defined inline, the columns of stored relations have fixed names. You can take advantage of this by selectively referring to columns by name, especially if you have a lot of columns:
[27]:
?[a, b] := *stored{l2: b, l1: a}
[27]:
eval::named_field_not_found
× stored relation 'stored' does not have field 'l1'
╭─[1:1]
1 │ ?[a, b] := *stored{l2: b, l1: a}
· ─────────────────────
2 │
╰────
If the binding is the same as the column name, you can omit the binding:
[28]:
?[l2] := *stored{l2}
[28]:
eval::named_field_not_found
× stored relation 'stored' does not have field 'l2'
╭─[1:1]
1 │ ?[l2] := *stored{l2}
· ───────────
2 │
╰────
Use :rm
to remove data:
[29]:
?[l1, l2] <- [['e', 'E']]
:rm stored {l1, l2}
[29]:
eval::required_col_not_found
× required column l1 not found
[30]:
?[l1, l2] := *stored[l1, l2]
[30]:
l1 | l2 |
---|
Use ::remove
(double colon!) to get rid of a stored relation:
[31]:
::remove stored
[31]:
status | |
---|---|
0 | OK |
[32]:
::relations
[32]:
name | arity | access_level | n_keys | n_non_keys | n_put_triggers | n_rm_triggers | n_replace_triggers | description | |
---|---|---|---|---|---|---|---|---|---|
0 | dept_info | 4 | normal | 2 | 2 | 0 | 0 | 0 | |
1 | fd | 3 | normal | 2 | 1 | 0 | 0 | 0 |
We’ve now seen a variety of Cozo queries. They all have one or more rules, one of which will have the name ? which determines which variables are the result of the query.
A query can also have one or more operations such as :put, which are executed after the query and can use its results. These operations might also change the query in some way, by limiting the number of results, sorting, paginating and so on.
Stored relations can have triggers. These are discussed in detail here.
Before continuing, let’s remove the stored relation we introduced:
[33]:
::remove fd
[33]:
status | |
---|---|
0 | OK |
Command blocks#
Cozo executes the entire script you send to it in a single transaction. You can do multiple query-command blocks by wrapping each block in curly braces. The blocks are executed in sequence and either all of them succeed or none of them do and the state rolls back. The result returned to a client is whatever is in the last block. Like this:
[34]:
{?[a] <- [[1], [2], [3]]; :replace test {a}}
{?[a] <- []; :replace test2 {a}}
%swap test test2
%return test
[34]:
Graphs#
Now let’s consider a graph stored as a relation:
[35]:
?[loving, loved] <- [['alice', 'eve'],
['bob', 'alice'],
['eve', 'alice'],
['eve', 'bob'],
['eve', 'charlie'],
['charlie', 'eve'],
['david', 'george'],
['george', 'george']]
:replace love {loving, loved}
[35]:
status | |
---|---|
0 | OK |
The graph we have created reads like “Alice loves Eve, Bob loves Alice”, “nobody loves David, David loves George, but George only loves himself”, and so on. Here we used :replace
instead of :create
. The difference is that if love
already exists, it will be wiped and replaced with the new data given.
We can investigate competing interests:
[36]:
?[loved_by_b_e] := *love['eve', loved_by_b_e],
*love['bob', loved_by_b_e]
[36]:
loved_by_b_e | |
---|---|
0 | alice |
It fells natural here to use the or
keyword:
[37]:
?[loved_by_b_e] := *love['eve', loved_by_b_e] or *love['bob', loved_by_b_e],
loved_by_b_e != 'bob',
loved_by_b_e != 'eve'
[37]:
loved_by_b_e | |
---|---|
0 | alice |
1 | charlie |
Compare this with multiple rule definitions under the same name:
[38]:
?[loved_by_b_e] := *love['eve', loved_by_b_e],
loved_by_b_e != 'bob',
loved_by_b_e != 'eve'
?[loved_by_b_e] := *love['bob', loved_by_b_e],
loved_by_b_e != 'bob',
loved_by_b_e != 'eve'
[38]:
loved_by_b_e | |
---|---|
0 | alice |
1 | charlie |
When you have multiple definitions of the same inline rule, the rule heads must be compatible. Only inline rules can have multiple definitions.
Negation#
Negation of expressions should be familiar:
[39]:
?[loved] := *love[person, loved], !ends_with(person, 'e')
[39]:
loved | |
---|---|
0 | alice |
1 | george |
Rule applications can also be negated, not with the !
operator, but with the not
keyword:
[40]:
?[loved_by_e_not_b] := *love['eve', loved_by_e_not_b],
not *love['bob', loved_by_e_not_b]
[40]:
loved_by_e_not_b | |
---|---|
0 | bob |
1 | charlie |
There are two sets of logical operations in Cozo, one set that acts on the level of expressions, and another set that acts within the expression (we say they work on atoms):
For atoms:
,
orand
(conjunction),or
(disjunction),not
(negation)For expressions:
&&
(conjunction),||
(disjunction),!
(negation)
The difference between ,
and and
is operator precedence: and
has higher precedence than or
, whereas ,
has lower precedence than or
.
There is a safety rule for negation:
[41]:
?[not_loved_by_b] := not *love['bob', not_loved_by_b]
[41]:
eval::unbound_symb_in_head
× Symbol 'not_loved_by_b' in rule head is unbound
╭─[1:1]
1 │ ?[not_loved_by_b] := not *love['bob', not_loved_by_b]
· ──────────────
2 │
╰────
help: Note that symbols occurring only in negated positions are not considered bound
This query is forbidden because the resulting relation is infinite. For example, ‘gold’ should be in the result, since according to the facts stored in the database, Bob has no interest in ‘gold’.
To make our query finite, we have to explicitly give our query a closed world:
[42]:
the_population[p] := *love[p, _a]
the_population[p] := *love[_a, p]
?[not_loved_by_b] := the_population[not_loved_by_b],
not *love['bob', not_loved_by_b]
[42]:
not_loved_by_b | |
---|---|
0 | bob |
1 | charlie |
2 | david |
3 | eve |
4 | george |
Recursion#
The single greatest advantage of Datalog over SQL is that recursive/graph queries area much simpler.
Inline rules can apply other rules, and can have multiple definitions. If you combine these two, you get recursions:
[43]:
alice_love_chain[person] := *love['alice', person]
alice_love_chain[person] := alice_love_chain[in_person],
*love[in_person, person]
?[chained] := alice_love_chain[chained]
[43]:
chained | |
---|---|
0 | alice |
1 | bob |
2 | charlie |
3 | eve |
You may object that you only need to be able to apply other rules to have recursion, without multiple definitions. Technically correct, but the resulting queries are not useful:
..
[44]:
alice_love_chain[person] := alice_love_chain[in_person],
*love[in_person, person]
?[chained] := alice_love_chain[chained]
[44]:
chained |
---|
Similar to the negation case, if there is no way to deduce a fact from the given facts, then the fact itself is considered false. You need multiple definitions to “bootstrap” the query.
Aggregation#
Aggregations are usually used to compute statistics. In Cozo, aggregations are applied in the head of inline rules:
[45]:
?[person, count(loved_by)] := *love[loved_by, person]
[45]:
person | count(loved_by) | |
---|---|---|
0 | alice | 2 |
1 | bob | 1 |
2 | charlie | 1 |
3 | eve | 2 |
4 | george | 2 |
The usual sum
, mean
, etc. are all available.
Here is the full list of aggregations for you to play with.
Query options#
We have seen query options like :create
, :put
, :rm
for manipulating stored relations. There are also query options for controlling what is returned:
[46]:
?[loving, loved] := *love{ loving, loved }
:limit 1
[46]:
loving | loved | |
---|---|---|
0 | alice | eve |
Next we want the result to be sorted by loved
in descending order, then loving
in ascending order, and skip the first row:
[47]:
?[loving, loved] := *love{ loving, loved }
:order -loved, loving
:offset 1
[47]:
loving | loved | |
---|---|---|
0 | george | george |
1 | alice | eve |
2 | charlie | eve |
3 | eve | charlie |
4 | eve | bob |
5 | bob | alice |
6 | eve | alice |
Putting -
in front of variables in :order
clause denotes reverse order. Nothing or +
denotes the ascending order.
The full list of query options are explained in this chapter.
Fixed rules#
We’ve seen how the <-
syntax for constant rules is syntax sugar. The full syntax is:
[48]:
?[] <~ Constant(data: [['hello', 'world', 'Cozo!']])
[48]:
_0 | _1 | _2 | |
---|---|---|---|
0 | hello | world | Cozo! |
Here we are using the fixed rule Constant
, which takes one option named data
. The curly tail <~
denotes a fixed rule.
Fixed rules take input relations as arguments, apply custom logic to them and produce its output relation. They’re functions that return relations, defined outside the Cozo engine itself (either built-in or loaded from a library or defined by the host language).
The Constant
fixed rule take zero input relations.
If you are using Cozo in browser, Constant
is the only fixed rule you can use. In all other cases your Cozo would have the graph algorithms module enabled, and all graph algorithms are implemented as fixed rules. As an example, let’s find out who is most popular in the love
graph by using the PageRank algorithm:
[49]:
?[person, page_rank] <~ PageRank(*love[])
:order -page_rank
[49]:
person | page_rank | |
---|---|---|
0 | george | 0.280443 |
1 | eve | 0.275376 |
2 | alice | 0.191205 |
3 | bob | 0.103023 |
4 | charlie | 0.103023 |
5 | david | 0.025000 |
Here the input relation is the stored relation *love
(as noted above, you will receive an error if you run this in the WASM implementation).
Each fixed rule is different: here is the full list.
[50]:
::remove love
[50]:
status | |
---|---|
0 | OK |
Time travel#
A very useful capability of Cozo is the ability to time travel in a stored relation. Usually when you :put
into a relation, the old value is overwritten; and when you :rm
, the row is completely gone. But the old values can be of great value — we have a short story about it.
In this case, time travel is the solution: instead of storing current facts, the stored relation stores the complete history of facts, at least all the available history as history itself unfolds.
If you believe that you don’t want this functionality at all, you can skip to the next section. At Cozo we adopt the “zero-cost, zero-mental-overhead abstraction” philosophy: if you don’t use a functionality, you don’t pay the performance or cognitive overhead.
Let’s have a simple example: storing the head of state of countries. First we have to create a new stored relation:
[51]:
:create hos {state: String, year: Validity => hos: String}
[51]:
status | |
---|---|
0 | OK |
hos
is shorthand for “head of state”. The only thing different about this relation is that we are giving year
the type Validity
. You can think of validity as a list of two elements, the first element being an integer and the second element being a boolean. The integer indicates the “time” of the fact recorded by the row, the boolean, if true, indicates that the new version of the row holds starting from the indicated time; if the boolean is false, the row is deleted at this time.
Note that the integer only identifies a temporal sequence to Cozo. What sequence you use is up to you. UNIX Epoch is the default, as we will see.
Now let’s insert some data:
[52]:
?[state, year, hos] <- [['US', [2001, true], 'Bush'],
['US', [2005, true], 'Bush'],
['US', [2009, true], 'Obama'],
['US', [2013, true], 'Obama'],
['US', [2017, true], 'Trump'],
['US', [2021, true], 'Biden']]
:put hos {state, year => hos}
[52]:
status | |
---|---|
0 | OK |
It is OK to assert a still valid fact again, as we have done above. You can use this relation like a normal relation:
[53]:
?[state, year, hos] := *hos{state, year, hos}
[53]:
state | year | hos | |
---|---|---|---|
0 | US | [2021, True] | Biden |
1 | US | [2017, True] | Trump |
2 | US | [2013, True] | Obama |
3 | US | [2009, True] | Obama |
4 | US | [2005, True] | Bush |
5 | US | [2001, True] | Bush |
The curious thing is that it is sorted descendingly by year. Validity sorts descendingly.
For any stored relation that has type Validity
at the last slot of the key, the time-travel capability is enabled. Say you have forgotten who the president of the US was in 2019. Easy:
[54]:
?[hos, year] := *hos{state: 'US', year, hos @ 2019}
[54]:
hos | year | |
---|---|---|
0 | Trump | [2017, True] |
In your answer you also got the year that this fact was last known to be true.
You also don’t know about the year 2099:
[55]:
?[hos, year] := *hos{state: 'US', year, hos @ 2099}
[55]:
hos | year | |
---|---|---|
0 | Biden | [2021, True] |
That certainly doesn’t look right. Let’s fix it by retracting facts on or after 2025, and only inserting them back when we have the sure facts:
[56]:
?[state, year, hos] <- [['US', [2025, false], '']]
:put hos {state, year => hos}
[56]:
status | |
---|---|
0 | OK |
As we have hinted, you retract facts by putting a retraction. Now let’s run the query again:
[57]:
?[hos, year] := *hos{state: 'US', year, hos @ 2099}
[57]:
hos | year |
---|
Since the database does not contain facts on or after 2025, your query returns empty.
This functionality is flexible: you can mix different periods in the same query:
[58]:
?[hos2018, hos2010] := *hos{state: 'US', hos: hos2018 @ 2018},
*hos{state: 'US', hos: hos2010 @ 2010}
[58]:
hos2018 | hos2010 | |
---|---|---|
0 | Trump | Obama |
As the relation hos
is just a normal relation, you can still rm
facts from it, in which case the facts are irretrievably gone. Whether that’s desirable is up to you: the database gives you the choice of how you want to use it, and trusts that you know how to use it correctly for your use case.
First let’s create a new stored relation to store people’s moods:
[59]:
:create mood {name: String, at: Validity => mood: String}
[59]:
status | |
---|---|
0 | OK |
I want to record my mood now:
[60]:
?[name, at, mood] <- [['me', 'ASSERT', 'curious']]
:put mood {name, at => mood}
[60]:
status | |
---|---|
0 | OK |
Instead of giving a list of two elements as we have done above, we have simply used the string ASSERT
, and the system will know that we mean an assertion of a current fact, and you guessed it, the value used is the current UNIX Epoch.
[61]:
?[name, at, mood] := *mood{name, at, mood}
[61]:
name | at | mood | |
---|---|---|---|
0 | me | [1700569902203327, True] | curious |
[62]:
?[name, time, mood] := *mood{name, at, mood},
time = format_timestamp(at)
[62]:
name | time | mood | |
---|---|---|---|
0 | me | 2023-11-21T12:31:42.203+00:00 | curious |
To query for current facts, use the string NOW
in the validity specification:
[63]:
?[name, time, mood] := *mood{name, at, mood @ 'NOW'},
time = format_timestamp(at)
[63]:
name | time | mood | |
---|---|---|---|
0 | me | 2023-11-21T12:31:42.203+00:00 | curious |
You can put in facts with manual timestamps as before, so it is possible that your database contains facts about the future. Let’s do just that. Instead of giving a mysterious string of numerals, you can use a string for the timestamp:
[64]:
?[name, at, mood] <- [['me', '2030-01-01T00:00:00.000+00:00', 'hopeful']]
:put mood {name, at => mood}
[64]:
status | |
---|---|
0 | OK |
Since this is in the future, it shouldn’t affect NOW
:
[65]:
?[name, time, mood] := *mood{name, at, mood @ 'NOW'},
time = format_timestamp(at)
[65]:
name | time | mood | |
---|---|---|---|
0 | me | 2023-11-21T12:31:42.203+00:00 | curious |
In this case, there is also END
for the validity specification, meaning to extract facts at the end of time:
[66]:
?[name, time, mood] := *mood{name, at, mood @ 'END'},
time = format_timestamp(at)
[66]:
name | time | mood | |
---|---|---|---|
0 | me | 2030-01-01T00:00:00+00:00 | hopeful |
Retraction at the current timestamp can be done with the string RETRACT
:
[67]:
?[name, at, mood] <- [['me', 'RETRACT', '']]
:put mood {name, at => mood}
[67]:
status | |
---|---|
0 | OK |
Retraction placed in the future can also be done with stringy timestamps by prefixing with ~
:
[68]:
?[name, at, mood] <- [['me', '~9999-01-01T00:00:00.000+00:00', 'who cares']]
:put mood {name, at => mood}
[68]:
status | |
---|---|
0 | OK |
Now let’s look at the complete history:
[69]:
?[name, time, is_assert, mood] := *mood{name, at, mood},
time = format_timestamp(at),
is_assert = to_bool(at)
[69]:
name | time | is_assert | mood | |
---|---|---|---|---|
0 | me | 2023-11-21T12:31:42.203+00:00 | True | curious |
1 | me | 2023-11-21T12:31:46.232+00:00 | False | |
2 | me | 2030-01-01T00:00:00+00:00 | True | hopeful |
3 | me | 9999-01-01T00:00:00+00:00 | False | who cares |
This time-travel facility is much faster than what you get if you try to implement it directly with Datalog: see the note for more details. Some further technical details of time travel is discussed in its own chapter.
[70]:
::remove mood, hos
[70]:
status | |
---|---|
0 | OK |
Extended example: the air routes dataset#
Now you have a basic understanding of using the various constructs of Cozo, let’s deal with a small real-world dataset, with about 3700 nodes and 57000 edges.
The data we are going to use, and many examples that we will present, are adapted from the book Practical Gremlin. Gremlin is an imperative query language for graphs, a very different take compared to Datalog.
First, let’s create the stored relations we want (wrapping queries in braces allows you to execute several queries together atomically):
[71]:
{:create airport {
code: String
=>
icao: String,
desc: String,
region: String,
runways: Int,
longest: Float,
elev: Float,
country: String,
city: String,
lat: Float,
lon: Float
}}
{:create country {
code: String
=>
desc: String
}}
{:create continent {
code: String
=>
desc: String
}}
{:create contain { entity: String, contained: String }}
{:create route { fr: String, to: String => dist: Float }}
[71]:
status | |
---|---|
0 | OK |
The next command applies only if you are using Jupyter notebooks: it downloads a JSON file containing the data and imports it into the database. The commented out line shows how to do the same thing with a local file. If you are using the Cozo WASM interface, click the “Import from URL” or the “Import from File” icon, and paste in the address.
[72]:
%cozo_import_remote_file 'https://raw.githubusercontent.com/cozodb/cozo/dev/cozo-core/tests/air-routes.json'
# %cozo_import_local_file '../../cozo/cozo-core/tests/air-routes.json'
If you are using the cozo repl
command line, the command to use instead is simply
%import https://raw.githubusercontent.com/cozodb/cozo/dev/cozo-core/tests/air-routes.json
You can replace the URL with the path to a local file as well. Other environments also have ways to do the same thing: refer to the respective documentations.
If you feel that the above is too much magic, we will show you the “hard way” of importing the same data at the end of this tutorial. For now let’s move on.
Let’s verify all the relations we want are there:
[73]:
::relations
[73]:
name | arity | access_level | n_keys | n_non_keys | n_put_triggers | n_rm_triggers | n_replace_triggers | description | |
---|---|---|---|---|---|---|---|---|---|
0 | airport | 11 | normal | 1 | 10 | 0 | 0 | 0 | |
1 | contain | 2 | normal | 2 | 0 | 0 | 0 | 0 | |
2 | continent | 2 | normal | 1 | 1 | 0 | 0 | 0 | |
3 | country | 2 | normal | 1 | 1 | 0 | 0 | 0 | |
4 | dept_info | 4 | normal | 2 | 2 | 0 | 0 | 0 | |
5 | route | 3 | normal | 2 | 1 | 0 | 0 | 0 | |
6 | test | 1 | normal | 1 | 0 | 0 | 0 | 0 | |
7 | test2 | 1 | normal | 1 | 0 | 0 | 0 | 0 |
While we are at it, let’s lock all these tables to prevent accidentally changing their contents:
[74]:
::access_level read_only airport, contain, continent, country, route
[74]:
status | |
---|---|
0 | OK |
More information about what this does is explained in this chapter.
Let’s just look at some data. Start with airports:
[75]:
?[code, city, desc, region, runways, lat, lon] := *airport{code, city, desc, region, runways, lat, lon}
:limit 5
[75]:
code | city | desc | region | runways | lat | lon | |
---|---|---|---|---|---|---|---|
0 | AAA | Anaa | Anaa Airport | PF-U-A | 1 | -17.352600 | -145.509995 |
1 | AAE | Annabah | Annaba Airport | DZ-36 | 2 | 36.822201 | 7.809170 |
2 | AAL | Aalborg | Aalborg Airport | DK-81 | 2 | 57.092759 | 9.849243 |
3 | AAN | Al Ain | Al Ain International Airport | AE-AZ | 1 | 24.261700 | 55.609200 |
4 | AAQ | Anapa | Anapa Airport | RU-KDA | 1 | 45.002102 | 37.347301 |
Airports with the most runways:
[76]:
?[code, city, desc, region, runways, lat, lon] := *airport{code, city, desc, region, runways, lat, lon}
:order -runways
:limit 10
[76]:
code | city | desc | region | runways | lat | lon | |
---|---|---|---|---|---|---|---|
0 | DFW | Dallas | Dallas/Fort Worth International Airport | US-TX | 7 | 32.896801 | -97.038002 |
1 | ORD | Chicago | Chicago O'Hare International Airport | US-IL | 7 | 41.978600 | -87.904800 |
2 | AMS | Amsterdam | Amsterdam Airport Schiphol | NL-NH | 6 | 52.308601 | 4.763890 |
3 | BOS | Boston | Boston Logan | US-MA | 6 | 42.364300 | -71.005203 |
4 | DEN | Denver | Denver International Airport | US-CO | 6 | 39.861698 | -104.672997 |
5 | DTW | Detroit | Detroit Metropolitan, Wayne County | US-MI | 6 | 42.212399 | -83.353401 |
6 | ATL | Atlanta | Hartsfield - Jackson Atlanta International Airport | US-GA | 5 | 33.636700 | -84.428101 |
7 | GIS | Gisborne | Gisborne Airport | NZ-GIS | 5 | -38.663300 | 177.977997 |
8 | HLZ | Hamilton | Hamilton International Airport | NZ-WKO | 5 | -37.866699 | 175.332001 |
9 | IAH | Houston | George Bush Intercontinental | US-TX | 5 | 29.984400 | -95.341400 |
How many airports are there in total?
[77]:
?[count(code)] := *airport{code}
[77]:
count(code) | |
---|---|
0 | 3504 |
Let’s get a distribution of the initials of the airport codes:
[78]:
?[count(initial), initial] := *airport{code}, initial = first(chars(code))
:order initial
[78]:
count(initial) | initial | |
---|---|---|
0 | 212 | A |
1 | 235 | B |
2 | 214 | C |
3 | 116 | D |
4 | 95 | E |
5 | 76 | F |
6 | 135 | G |
7 | 129 | H |
8 | 112 | I |
9 | 80 | J |
10 | 197 | K |
11 | 184 | L |
12 | 228 | M |
13 | 111 | N |
14 | 89 | O |
15 | 203 | P |
16 | 7 | Q |
17 | 121 | R |
18 | 245 | S |
19 | 205 | T |
20 | 77 | U |
21 | 86 | V |
22 | 59 | W |
23 | 28 | X |
24 | 211 | Y |
25 | 49 | Z |
More useful are the statistics of runways:
[79]:
?[count(r), count_unique(r), sum(r), min(r), max(r), mean(r), std_dev(r)] :=
*airport{runways: r}
[79]:
count(r) | count_unique(r) | sum(r) | min(r) | max(r) | mean(r) | std_dev(r) | |
---|---|---|---|---|---|---|---|
0 | 3504 | 7 | 4980.000000 | 1 | 7 | 1.421233 | 0.743083 |
Using country
, we can find countries with no airports:
[80]:
?[desc] := *country{code, desc}, not *airport{country: code}
[80]:
desc | |
---|---|
0 | Andorra |
1 | Liechtenstein |
2 | Monaco |
3 | Pitcairn |
4 | San Marino |
The route
relation by itself is rather boring:
[81]:
?[fr, to, dist] := *route{fr, to, dist}
:limit 10
[81]:
fr | to | dist | |
---|---|---|---|
0 | AAA | FAC | 48.000000 |
1 | AAA | MKP | 133.000000 |
2 | AAA | PPT | 270.000000 |
3 | AAA | RAR | 968.000000 |
4 | AAE | ALG | 254.000000 |
5 | AAE | CDG | 882.000000 |
6 | AAE | IST | 1161.000000 |
7 | AAE | LYS | 631.000000 |
8 | AAE | MRS | 477.000000 |
9 | AAE | ORN | 477.000000 |
It just records the starting and ending airports of each route, together with the distance. This relation only becomes useful when used as a graph.
Airports with no routes:
[82]:
?[code, desc] := *airport{code, desc}, not *route{fr: code}, not *route{to: code}
[82]:
code | desc | |
---|---|---|
0 | AFW | Fort Worth Alliance Airport |
1 | APA | Centennial Airport |
2 | APK | Apataki Airport |
3 | BID | Block Island State Airport |
4 | BVS | Breves Airport |
5 | BWU | Sydney Bankstown Airport |
6 | CRC | Santa Ana Airport |
7 | CVT | Coventry Airport |
8 | EKA | Murray Field |
9 | GYZ | Gruyere Airport |
10 | HFN | Hornafjordur Airport |
11 | HZK | Husavik Airport |
12 | ILG | New Castle Airport |
13 | INT | Smith Reynolds Airport |
14 | ISL | Ataturk International Airport |
15 | KGG | Kédougou Airport |
16 | NBW | Leeward Point Field |
17 | NFO | Mata'aho Airport |
18 | PSY | Stanley Airport |
19 | RIG | Rio Grande Airport |
20 | SFD | San Fernando De Apure Airport |
21 | SFH | San Felipe International Airport |
22 | SXF | Berlin-Schönefeld International Airport *Closed* |
23 | TUA | Teniente Coronel Luis a Mantilla Airport |
24 | TWB | Toowoomba Airport |
25 | TXL | Berlin, Tegel International Airport *Closed* |
26 | VCV | Southern California Logistics Airport |
27 | YEI | Bursa Yenişehir Airport |
Airports with the most out routes:
[83]:
route_count[fr, count(fr)] := *route{fr}
?[code, n] := route_count[code, n]
:sort -n
:limit 5
[83]:
code | n | |
---|---|---|
0 | FRA | 310 |
1 | IST | 309 |
2 | CDG | 293 |
3 | AMS | 283 |
4 | MUC | 270 |
How many routes are there from the European Union to the US?
[84]:
routes[unique(r)] := *contain['EU', fr],
*route{fr, to},
*airport{code: to, country: 'US'},
r = [fr, to]
?[n] := routes[rs], n = length(rs)
[84]:
n | |
---|---|
0 | 435 |
How many airports are there in the US with routes from the EU?
[85]:
?[count_unique(to)] := *contain['EU', fr],
*route{fr, to},
*airport{code: to, country: 'US'}
[85]:
count_unique(to) | |
---|---|
0 | 45 |
How many routes are there for each airport in London, UK?
[86]:
?[code, count(code)] := *airport{code, city: 'London', region: 'GB-ENG'}, *route{fr: code}
[86]:
code | count(code) | |
---|---|---|
0 | LCY | 51 |
1 | LGW | 232 |
2 | LHR | 221 |
3 | LTN | 130 |
4 | STN | 211 |
We need to specify the region, because there is another city called London, not in the UK.
How many airports are reachable from London, UK in two hops?
[87]:
lon_uk_airports[code] := *airport{code, city: 'London', region: 'GB-ENG'}
one_hop[to] := lon_uk_airports[fr], *route{fr, to}, not lon_uk_airports[to];
?[count_unique(a3)] := one_hop[a2], *route{fr: a2, to: a3}, not lon_uk_airports[a3];
[87]:
count_unique(a3) | |
---|---|
0 | 2353 |
What are the cities directly reachable from LGW (London Gatwick), but furthermost away?
[88]:
?[city, dist] := *route{fr: 'LGW', to, dist},
*airport{code: to, city}
:order -dist
:limit 10
[88]:
city | dist | |
---|---|---|
0 | Buenos Aires | 6908.000000 |
1 | Singapore | 6751.000000 |
2 | Langkawi | 6299.000000 |
3 | Duong Dong | 6264.000000 |
4 | Taipei | 6080.000000 |
5 | Port Louis | 6053.000000 |
6 | Rayong | 6008.000000 |
7 | Cape Town | 5987.000000 |
8 | Hong Kong | 5982.000000 |
9 | Shanghai | 5745.000000 |
What airports are within 0.1 degrees of the Greenwich meridian?
[89]:
?[code, desc, lon, lat] := *airport{lon, lat, code, desc}, lon > -0.1, lon < 0.1
[89]:
code | desc | lon | lat | |
---|---|---|---|---|
0 | CDT | Castellon De La Plana Airport | 0.026111 | 39.999199 |
1 | LCY | London City Airport | 0.055278 | 51.505278 |
2 | LDE | Tarbes-Lourdes-Pyrénées Airport | -0.006439 | 43.178699 |
3 | LEH | Le Havre Octeville Airport | 0.088056 | 49.533901 |
Airports in a box drawn around London Heathrow, UK:
[90]:
h_box[lon, lat] := *airport{code: 'LHR', lon, lat}
?[code, desc] := h_box[lhr_lon, lhr_lat], *airport{code, lon, lat, desc},
abs(lhr_lon - lon) < 1, abs(lhr_lat - lat) < 1
[90]:
code | desc | |
---|---|---|
0 | LCY | London City Airport |
1 | LGW | London Gatwick |
2 | LHR | London Heathrow |
3 | LTN | London Luton Airport |
4 | SOU | Southampton Airport |
5 | STN | London Stansted Airport |
For some spherical geometry: what is the angle subtended by SFO and NRT on the surface of the earth?
[91]:
?[deg_diff] := *airport{code: 'SFO', lat: a_lat, lon: a_lon},
*airport{code: 'NRT', lat: b_lat, lon: b_lon},
deg_diff = rad_to_deg(haversine_deg_input(a_lat, a_lon, b_lat, b_lon))
[91]:
deg_diff | |
---|---|
0 | 73.992112 |
We mentioned before that aggregations in Cozo are powerful. They are powerful because they can be used in recursions (some restrictions apply).
Let’s find the distance of the shortest route between two airports. One wayis to enumerate all the routes between the two airports, and then apply min
aggregation to the results. This cannot be implemented as stated, since the routes may contain cycles and hence there can be an infinite number of routes between two airports.
Instead, think recursively. If we already have all the shortest routes between all nodes, we derive an equation satisfied by the shortest route: the shortest route between a
and b
is either the distance of a direct route, or the sum of the shortest distance from a
to c
and the distance of a direct route from c
to d
. We apply our min
aggregation to this recursive set instead.
Write it out and it works. For exmaple, the shortest routes between the airports LHR
and YPO
:
[92]:
shortest[b, min(dist)] := *route{fr: 'LHR', to: b, dist}
# Start with the airport 'LHR', retrieve a direct route from 'LHR' to b
shortest[b, min(dist)] := shortest[c, d1], # Start with an existing shortest route from 'LHR' to c
*route{fr: c, to: b, dist: d2}, # Retrieve a direct route from c to b
dist = d1 + d2 # Add the distances
?[dist] := shortest['YPO', dist] # Extract the answer for 'YPO'.
# We chose it since it is the hardest airport to get to from 'LHR'.
[92]:
dist | |
---|---|
0 | 4147.000000 |
There is a caveat when you try to write similar queries. Say you try to write it in the following way (don’t try to run it):
shortest[a, b, min(dist)] := *route{fr: a, to: b, dist}
shortest[a, b, min(dist)] := shortest[a, c, d1],
*route{fr: c, to: b, dist: d2},
dist = d1 + d2
?[dist] := shortest['LHR', 'YPO', dist]
You will find that the query does not complete in a reasonable amount of time, despite it being equivalent to the original query. Why?
In the changed query, you are asking the database to compute the all-pair shortest path, and then extract the answer to a particular shortest path. Normally Cozo would apply a technique called magic set rewrite so that only the needed answer would be calculated. However, in the changed query the presence of the aggregation operator min
prevents that. In this case, applying the rewrite to the variable a
would still yield the correct answer, but rewriting in any other way would give
complete nonsense, and in the more general case with recursive aggregations this is a can of worms.
So as explained in the chapter about execution, magic set rewrites are only applied to rules without aggregations or recursions for the moment, until we are sure of the exact conditions under which the rewrites are safe. So for now at least the database executes the query as written, computing the result of the shortest
rule containing more than ten million rows (to be exact, 3700 * 3700 = 13,690,000 rows) first!
The bottom line is, be mindful of the cardinality of the return sets of recursive rules.
A tour of graph algorithms#
Now let’s investigate the graph using some graph algorithms. As we have mentioned before, the Cozo running in browsers through WASM does not have the graph algorithms module enabled, so to run the following examples you will need to use some other implementation (for example, the Python one).
Since path-finding is such a common operation on graphs, Cozo has several fixed rules for that:
[93]:
starting[] <- [['LHR']]
goal[] <- [['YPO']]
?[starting, goal, distance, path] <~ ShortestPathDijkstra(*route[], starting[], goal[])
[93]:
starting | goal | distance | path | |
---|---|---|---|---|
0 | LHR | YPO | 4147.000000 | ['LHR', 'YUL', 'YVO', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO'] |
Not only is it more efficient, but we also get a path for the shortest route.
Not content with the shortest path, the following calculates ten the shortest paths:
[94]:
starting[] <- [['LHR']]
goal[] <- [['YPO']]
?[starting, goal, distance, path] <~ KShortestPathYen(*route[], starting[], goal[], k: 10)
[94]:
starting | goal | distance | path | |
---|---|---|---|---|
0 | LHR | YPO | 4147.000000 | ['LHR', 'YUL', 'YVO', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO'] |
1 | LHR | YPO | 4150.000000 | ['LHR', 'DUB', 'YUL', 'YVO', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO'] |
2 | LHR | YPO | 4164.000000 | ['LHR', 'YUL', 'YMT', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO'] |
3 | LHR | YPO | 4167.000000 | ['LHR', 'DUB', 'YUL', 'YMT', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO'] |
4 | LHR | YPO | 4187.000000 | ['LHR', 'MAN', 'DUB', 'YUL', 'YVO', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO'] |
5 | LHR | YPO | 4202.000000 | ['LHR', 'IOM', 'DUB', 'YUL', 'YVO', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO'] |
6 | LHR | YPO | 4204.000000 | ['LHR', 'MAN', 'DUB', 'YUL', 'YMT', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO'] |
7 | LHR | YPO | 4209.000000 | ['LHR', 'YUL', 'YMT', 'YNS', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO'] |
8 | LHR | YPO | 4211.000000 | ['LHR', 'MAN', 'IOM', 'DUB', 'YUL', 'YVO', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO'] |
9 | LHR | YPO | 4212.000000 | ['LHR', 'DUB', 'YUL', 'YMT', 'YNS', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO'] |
If efficiency is really important to you, you can use the A* algorithm with a good heuristic function:
[95]:
code_lat_lon[code, lat, lon] := *airport{code, lat, lon}
starting[code, lat, lon] := code = 'LHR', *airport{code, lat, lon};
goal[code, lat, lon] := code = 'YPO', *airport{code, lat, lon};
?[] <~ ShortestPathAStar(*route[],
code_lat_lon[node, lat1, lon1],
starting[],
goal[goal, lat2, lon2],
heuristic: haversine_deg_input(lat1, lon1, lat2, lon2) * 3963);
[95]:
_0 | _1 | _2 | _3 | |
---|---|---|---|---|
0 | LHR | YPO | 4147.000000 | ['LHR', 'YUL', 'YVO', 'YKQ', 'YMO', 'YFA', 'ZKE', 'YAT', 'YPO'] |
There’s a lot more setup required in this case: we need to retrieve the latitudes and longitudes of airports and do processing on them first. The number 3963
above is the radius of the earth in miles.
The most important airports, by PageRank:
[96]:
rank[code, score] <~ PageRank(*route[a, b])
?[code, desc, score] := rank[code, score], *airport{code, desc}
:limit 10;
:order -score
[96]:
code | desc | score | |
---|---|---|---|
0 | IST | Istanbul International Airport | 0.004889 |
1 | DFW | Dallas/Fort Worth International Airport | 0.004696 |
2 | ORD | Chicago O'Hare International Airport | 0.004452 |
3 | DEN | Denver International Airport | 0.004252 |
4 | PEK | Beijing Capital International Airport | 0.004044 |
5 | FRA | Frankfurt am Main | 0.004027 |
6 | ATL | Hartsfield - Jackson Atlanta International Airport | 0.004022 |
7 | DXB | Dubai International Airport | 0.004002 |
8 | CDG | Paris Charles de Gaulle | 0.003998 |
9 | DME | Moscow, Domodedovo International Airport | 0.003817 |
The following example takes a long time to run since it calculates the betweenness centrality: up to a few seconds, depending on your machine. Algorithms for calculating the betweenness centrality have high complexity.
[97]:
centrality[code, score] <~ BetweennessCentrality(*route[a, b])
?[code, desc, score] := centrality[code, score], *airport{code, desc}
:limit 10;
:order -score
[97]:
code | desc | score | |
---|---|---|---|
0 | ANC | Anchorage Ted Stevens | 1074868.250000 |
1 | KEF | Reykjavik, Keflavik International Airport | 928450.437500 |
2 | HEL | Helsinki Ventaa | 581588.500000 |
3 | PEK | Beijing Capital International Airport | 532021.062500 |
4 | DEL | Indira Gandhi International Airport | 472979.968750 |
5 | IST | Istanbul International Airport | 457882.156250 |
6 | PKC | Yelizovo Airport | 408571.000000 |
7 | MSP | Minneapolis-St.Paul International Airport | 396433.250000 |
8 | LAX | Los Angeles International Airport | 393309.968750 |
9 | DEN | Denver International Airport | 374339.531250 |
These are the airports that, if disconnected from the network, cause the most disruption. As this example shows, some of the algorithms really struggle when you go beyond small or medium sized dataset.
Community detection can collapse a graph into a supergraph. Here we store the result, since it has too many rows to display nicely:
[98]:
community[detailed_cluster, code] <~ CommunityDetectionLouvain(*route[a, b])
?[code, cluster, detailed_cluster] := community[detailed_cluster, code], cluster = first(detailed_cluster)
:replace community {code => cluster, detailed_cluster}
[98]:
status | |
---|---|
0 | OK |
We can look at the supernodes containing specific nodes. For example, the supernode for London Gatwick consists of mainly UK and other European airports, as you would expect:
[99]:
community[code] := *community{code: 'LGW', cluster}, *community{code, cluster}
?[country, count(code)] :=
community[code],
*airport{code, desc, country: country_code},
*country{code: country_code, desc: country},
:order -count(code)
:limit 5
[99]:
country | count(code) | |
---|---|---|
0 | United Kingdom | 54 |
1 | France | 50 |
2 | Norway | 49 |
3 | Spain | 40 |
4 | Greece | 38 |
For JFK
on the other hand, its supernode consists of mainly US airports:
[100]:
community[code] := *community{code: 'JFK', cluster}, *community{code, cluster}
?[country, count(code)] :=
community[code],
*airport{code, desc, country: country_code},
*country{code: country_code, desc: country},
:order -count(code)
:limit 5
[100]:
country | count(code) | |
---|---|---|
0 | United States | 444 |
1 | Canada | 111 |
2 | Brazil | 108 |
3 | Mexico | 57 |
4 | Colombia | 50 |
But it does not always work according to geography. For example, Frankfurt airport is in Germany:
[101]:
?[desc, country_desc] := *airport{code: 'FRA', desc, country: country_code}, *country{code: country_code, desc: country_desc}
[101]:
desc | country_desc | |
---|---|---|
0 | Frankfurt am Main | Germany |
But its supernode:
[102]:
community[code] := *community{code: 'FRA', cluster}, *community{code, cluster}
?[country, count(code)] :=
community[code],
*airport{code, desc, country: country_code},
*country{code: country_code, desc: country},
:order -count(code)
:limit 5
[102]:
country | count(code) | |
---|---|---|
0 | United States | 444 |
1 | Canada | 111 |
2 | Brazil | 108 |
3 | Mexico | 57 |
4 | Colombia | 50 |
Germany does not even appear in the top five. In fact, FRA
is in the same supernode as JFK
. What matters is the connectivity in the graph, not the geography. As another example:
[103]:
community[code] := *community{code: 'SIN', cluster}, *community{code, cluster}
?[country, count(code)] :=
community[code],
*airport{code, desc, country: country_code},
*country{code: country_code, desc: country},
:order -count(code)
:limit 5
[103]:
country | count(code) | |
---|---|---|
0 | China | 216 |
1 | Australia | 125 |
2 | Indonesia | 68 |
3 | Japan | 65 |
4 | Philippines | 40 |
You’d expect SIN
to be a Chinese airport. Wrong:
[104]:
?[desc, country_desc] := *airport{code: 'SIN', desc, country: country_code}, *country{code: country_code, desc: country_desc}
[104]:
desc | country_desc | |
---|---|---|
0 | Singapore, Changi International Airport | Singapore |
Finally, let’s collapse the route
relation into super_route
:
[105]:
?[fr_cluster, to_cluster, count(dist), sum(dist)] := *route{fr, to, dist},
*community{code: fr, cluster: fr_cluster},
*community{code: to, cluster: to_cluster}
:replace super_route {fr_cluster, to_cluster => n_routes=count(dist), total_distance=sum(dist)}
[105]:
status | |
---|---|
0 | OK |
As expected, the “diagonals” where fr_cluster == to_cluster
are larger in the super_route
graph:
[106]:
?[fr_cluster, to_cluster, n_routes, total_distance] := *super_route{fr_cluster, to_cluster, n_routes, total_distance}, fr_cluster < 2
[106]:
fr_cluster | to_cluster | n_routes | total_distance | |
---|---|---|---|---|
0 | 0 | 0 | 9041 | 8933554.000000 |
1 | 0 | 1 | 434 | 1695379.000000 |
2 | 0 | 2 | 228 | 761661.000000 |
3 | 0 | 3 | 530 | 1681865.000000 |
4 | 0 | 4 | 163 | 391892.000000 |
5 | 0 | 8 | 3 | 300.000000 |
6 | 0 | 11 | 2 | 283.000000 |
7 | 0 | 19 | 1 | 238.000000 |
8 | 0 | 21 | 2 | 705.000000 |
9 | 0 | 22 | 1 | 975.000000 |
10 | 1 | 0 | 434 | 1696858.000000 |
11 | 1 | 1 | 4474 | 5142452.000000 |
12 | 1 | 2 | 1160 | 1492734.000000 |
13 | 1 | 3 | 526 | 1724591.000000 |
14 | 1 | 4 | 223 | 361986.000000 |
15 | 1 | 9 | 1 | 808.000000 |
Now the super graph is small enough that all analytics algorithms return instantly:
[107]:
?[cluster, score] <~ PageRank(*super_route[])
:order -score
:limit 5
[107]:
cluster | score | |
---|---|---|
0 | 3 | 0.173309 |
1 | 0 | 0.093072 |
2 | 1 | 0.053465 |
3 | 2 | 0.053389 |
4 | 4 | 0.044654 |
You can now go on to investigate the supernodes, give real-world interpretations to them, etc. For example, a naïve interpretation of the above PageRank result is that North America is (still) the most prosperous part of the world, followed by East Asia in the second place, South Asia in the third place, and Europe in the fourth place.
Importing dataset the hard way#
Previously, we imported the air-routes dataset by using Python under the hood to download a specially-crafted JSON file and feed it to the database. Here we learn how to achieve the same effect by letting Cozo fetch and import a series of CSV files, without Python’s help.
Let’s set the database magic up first:
[108]:
::access_level normal airport, contain, continent, country, route
[108]:
status | |
---|---|
0 | OK |
[109]:
::remove airport, contain, continent, country, route, community, super_route
[109]:
status | |
---|---|
0 | OK |
Next, some parameters to make life eaiser (the lines commented out do the same thing by processing local files):
[113]:
# %cozo_set AIR_ROUTES_NODES_URL 'https://raw.githubusercontent.com/cozodb/cozo/dev/cozo-core/tests/air-routes-latest-nodes.csv'
# %cozo_set AIR_ROUTES_EDGES_URL 'https://raw.githubusercontent.com/cozodb/cozo/dev/cozo-core/tests/air-routes-latest-edges.csv'
%cozo_set AIR_ROUTES_NODES_URL 'file://./../../cozo/cozo-core/tests/air-routes-latest-nodes.csv'
%cozo_set AIR_ROUTES_EDGES_URL 'file://./../../cozo/cozo-core/tests/air-routes-latest-edges.csv'
First, import the airport
relation:
[114]:
res[idx, label, typ, code, icao, desc, region, runways, longest, elev, country, city, lat, lon] <~
CsvReader(types: ['Int', 'Any', 'Any', 'Any', 'Any', 'Any', 'Any', 'Int?', 'Float?', 'Float?', 'Any', 'Any', 'Float?', 'Float?'],
url: $AIR_ROUTES_NODES_URL,
has_headers: true)
?[code, icao, desc, region, runways, longest, elev, country, city, lat, lon] :=
res[idx, label, typ, code, icao, desc, region, runways, longest, elev, country, city, lat, lon],
label == 'airport'
:replace airport {
code: String
=>
icao: String,
desc: String,
region: String,
runways: Int,
longest: Float,
elev: Float,
country: String,
city: String,
lat: Float,
lon: Float
}
[114]:
status | |
---|---|
0 | OK |
The CsvReader
utility downloads a CSV file from the internet and attempts to parse its content into a relation. When we store the relation, we specified types for the columns. The code
column acts as a primary key for the airport
stored relation.
Next is country
:
[115]:
res[idx, label, typ, code, icao, desc] <~
CsvReader(types: ['Int', 'Any', 'Any', 'Any', 'Any', 'Any'],
url: $AIR_ROUTES_NODES_URL,
has_headers: true)
?[code, desc] :=
res[idx, label, typ, code, icao, desc],
label == 'country'
:replace country {
code: String
=>
desc: String
}
[115]:
status | |
---|---|
0 | OK |
continent
:
[116]:
res[idx, label, typ, code, icao, desc] <~
CsvReader(types: ['Int', 'Any', 'Any', 'Any', 'Any', 'Any'],
url: $AIR_ROUTES_NODES_URL,
has_headers: true)
?[idx, code, desc] :=
res[idx, label, typ, code, icao, desc],
label == 'continent'
:replace continent {
code: String
=>
desc: String
}
[116]:
status | |
---|---|
0 | OK |
We need to make a translation table for the indices the data use:
[117]:
res[idx, label, typ, code] <~
CsvReader(types: ['Int', 'Any', 'Any', 'Any'],
url: $AIR_ROUTES_NODES_URL,
has_headers: true)
?[idx, code] :=
res[idx, label, typ, code],
:replace idx2code { idx => code }
[117]:
status | |
---|---|
0 | OK |
The contain
relation contains information on the geographical inclusion of entities:
[118]:
res[] <~
CsvReader(types: ['Int', 'Int', 'Int', 'String'],
url: $AIR_ROUTES_EDGES_URL,
has_headers: true)
?[entity, contained] :=
res[idx, fr_i, to_i, typ],
typ == 'contains',
*idx2code[fr_i, entity],
*idx2code[to_i, contained]
:replace contain { entity: String, contained: String }
[118]:
status | |
---|---|
0 | OK |
Finally, the route
s between the airports. This relation is much larger than the rest and contains about 60k rows:
[119]:
res[] <~
CsvReader(types: ['Int', 'Int', 'Int', 'String', 'Float?'],
url: $AIR_ROUTES_EDGES_URL,
has_headers: true)
?[fr, to, dist] :=
res[idx, fr_i, to_i, typ, dist],
typ == 'route',
*idx2code[fr_i, fr],
*idx2code[to_i, to]
:replace route { fr: String, to: String => dist: Float }
[119]:
status | |
---|---|
0 | OK |
We no longer need the idx2code
relation:
[120]:
::remove idx2code
[120]:
status | |
---|---|
0 | OK |
Let’s verify all the relations we want are there:
[121]:
::relations
[121]:
name | arity | access_level | n_keys | n_non_keys | n_put_triggers | n_rm_triggers | n_replace_triggers | description | |
---|---|---|---|---|---|---|---|---|---|
0 | airport | 11 | normal | 1 | 10 | 0 | 0 | 0 | |
1 | contain | 2 | normal | 2 | 0 | 0 | 0 | 0 | |
2 | continent | 2 | normal | 1 | 1 | 0 | 0 | 0 | |
3 | country | 2 | normal | 1 | 1 | 0 | 0 | 0 | |
4 | dept_info | 4 | normal | 2 | 2 | 0 | 0 | 0 | |
5 | route | 3 | normal | 2 | 1 | 0 | 0 | 0 | |
6 | test | 1 | normal | 1 | 0 | 0 | 0 | 0 | |
7 | test2 | 1 | normal | 1 | 0 | 0 | 0 | 0 |
That’s it for the tutorial!