Utilities and algorithms#
Fixed rules in CozoScript apply utilities or algorithms.
The algorithms described here are only available if your distribution of Cozo is compiled with the graph-algo
feature flag.
Currently all prebuilt binaries except WASM are compiled with this flag on.
If you are using the Cozo libraries in Rust, Python or NodeJS, or if you are using the standalone executable, you can also easily define custom fixed rules in the hosting environment: see the respective documentations for how to do it.
Utilities#
- Constant(data: [...])#
Returns a relation containing the data passed in. The constant rule
?[] <- ...
is syntax sugar for?[] <~ Constant(data: ...)
.- Parameters:
data – A list of lists, representing the rows of the returned relation.
- ReorderSort(rel[...], out: [...], sort_by: [...], descending: false, break_ties: false, skip: 0, take: 0)#
Sort and then extract new columns of the passed in relation
rel
.- Parameters:
out (required) – A list of expressions which will be used to produce the output relation. Any bindings in the expressions will be bound to the named positions in
rel
.sort_by – A list of expressions which will be used to produce the sort keys. Any bindings in the expressions will be bound to the named positions in
rel
.descending – Whether the sorting process should be done in descending order. Defaults to
false
.break_ties – Whether ties should be broken, e.g. whether the first two rows with identical sort keys should be given ordering numbers
1
and2
instead of1
and1
. Defaults to false.skip – How many rows to skip before producing rows. Defaults to zero.
take – How many rows at most to produce. Zero means no limit. Defaults to zero.
- Returns:
The returned relation, in addition to the rows specified in the parameter
out
, will have the ordering prepended. The ordering starts at1
.
Tip
This algorithm serves a similar purpose to the global
:order
,:limit
and:offset
options, but can be applied to intermediate results. Prefer the global options if it is applied to the final output.
- CsvReader(url: ..., types: [...], delimiter: ',', prepend_index: false, has_headers: true)#
Read a CSV file from disk or an HTTP GET request and convert the result to a relation.
This utility is not available on WASM targets. In addition, if the feature flag
requests
is off, only reading from local file is supported.- Parameters:
url (required) – URL for the CSV file. For local file, use
file://<PATH_TO_FILE>
.types (required) – A list of strings interpreted as types for the columns of the output relation. If any type is specified as nullable and conversion to the specified type fails,
null
will be the result. This is more lenient than other functions since CSVs tend to contain lots of bad values.delimiter – The delimiter to use when parsing the CSV file.
prepend_index – If
true
, row index will be prepended to the columns.has_headers – Whether the CSV file has headers. The reader will not interpret the header in any way but will instead simply ignore it.
- JsonReader(url: ..., fields: [...], json_lines: true, null_if_absent: false, prepend_index: false)#
Read a JSON file for disk or an HTTP GET request and convert the result to a relation.
This utility is not available on WASM targets. In addition, if the feature flag
requests
is off, only reading from local file is supported.- Parameters:
url (required) – URL for the JSON file. For local file, use
file://<PATH_TO_FILE>
.fields (required) – A list of field names, for extracting fields from JSON arrays into the relation.
json_lines – If
true
, parse the file as lines of JSON objects, each line containing a single object; if false, parse the file as a JSON array containing many objects.null_if_absent – If a
true
and a requested field is absent, will outputnull
in its place. Iffalse
and the requested field is absent, will throw an error.prepend_index – If
true
, row index will be prepended to the columns.
Connectedness algorithms#
- ConnectedComponents(edges[from, to])#
Computes the connected components of a graph with the provided edges.
- Returns:
Pairs containing the node index, and its component index.
- StronglyConnectedComponent(edges[from, to])#
Computes the strongly connected components of a graph with the provided edges.
- Returns:
Pairs containing the node index, and its component index.
- SCC(...)#
- MinimumSpanningForestKruskal(edges[from, to, weight?])#
Runs Kruskal’s algorithm on the provided edges to compute a minimum spanning forest. Negative weights are fine.
- Returns:
Triples containing the from-node, the to-node, and the cost from the tree root to the to-node. Which nodes are chosen to be the roots are non-deterministic. Multiple roots imply the graph is disconnected.
- MinimumSpanningTreePrim(edges[from, to, weight?], starting?[idx])#
Runs Prim’s algorithm on the provided edges to compute a minimum spanning tree.
starting
should be a relation producing exactly one node index as the starting node. Only the connected component of the starting node is returned. Ifstarting
is omitted, which component is returned is arbitrary.- Returns:
Triples containing the from-node, the to-node, and the cost from the tree root to the to-node.
- TopSort(edges[from, to])#
Performs topological sorting on the graph with the provided edges. The graph is required to be connected in the first place.
- Returns:
Pairs containing the sort order and the node index.
Pathfinding algorithms#
- ShortestPathBFS(edges[from, to], starting[start_idx], goals[goal_idx])#
Runs breadth-first search to determine the shortest path between the
starting
nodes and thegoals
. Assumes the graph to be directed and all edges to be of unit weight. Ties will be broken in an unspecified way. If you need anything more complicated, use one of the other algorithms below.- Returns:
Triples containing the starting node, the goal, and a shortest path.
- ShortestPathDijkstra(edges[from, to, weight?], starting[idx], goals[idx], undirected: false, keep_ties: false)#
Runs Dijkstra’s algorithm to determine the shortest paths between the
starting
nodes and thegoals
. Weights, if given, must be non-negative.- Parameters:
undirected – Whether the graph should be interpreted as undirected. Defaults to
false
.keep_ties – Whether to return all paths with the same lowest cost. Defaults to
false
, in which any one path of the lowest cost could be returned.
- Returns:
4-tuples containing the starting node, the goal, the lowest cost, and a path with the lowest cost.
- KShortestPathYen(edges[from, to, weight?], starting[idx], goals[idx], k: expr, undirected: false)#
Runs Yen’s algorithm (backed by Dijkstra’s algorithm) to find the k-shortest paths between nodes in
starting
and nodes ingoals
.- Parameters:
k (required) – How many routes to return for each start-goal pair.
undirected – Whether the graph should be interpreted as undirected. Defaults to
false
.
- Returns:
4-tuples containing the starting node, the goal, the cost, and a path with the cost.
- BreadthFirstSearch(edges[from, to], nodes[idx, ...], starting?[idx], condition: expr, limit: 1)#
Runs breadth first search on the directed graph with the given edges and nodes, starting at the nodes in
starting
. Ifstarting
is not given, it will default to all ofnodes
, which may be quite a lot to calculate.- Parameters:
condition (required) – The stopping condition, will be evaluated with the bindings given to
nodes
. Should evaluate to a boolean, withtrue
indicating an acceptable answer was found.limit – How many answers to produce for each starting nodes. Defaults to 1.
- Returns:
Triples containing the starting node, the answer node, and the found path connecting them.
- BFS(...)#
- DepthFirstSearch(edges[from, to], nodes[idx, ...], starting?[idx], condition: expr, limit: 1)#
Runs depth first search on the directed graph with the given edges and nodes, starting at the nodes in
starting
. Ifstarting
is not given, it will default to all ofnodes
, which may be quite a lot to calculate.- Parameters:
condition (required) – The stopping condition, will be evaluated with the bindings given to
nodes
. Should evaluate to a boolean, withtrue
indicating an acceptable answer was found.limit – How many answers to produce for each starting nodes. Defaults to 1.
- Returns:
Triples containing the starting node, the answer node, and the found path connecting them.
- DFS(...)#
- ShortestPathAStar(edges[from, to, weight], nodes[idx, ...], starting[idx], goals[idx], heuristic: expr)#
Computes the shortest path from every node in
starting
to every node ingoals
by the A* algorithm.edges
are interpreted as directed, weighted edges with non-negative weights.- Parameters:
heuristic (required) – The search heuristic expression. It will be evaluated with the bindings from
goals
andnodes
. It should return a number which is a lower bound of the true shortest distance from a node to the goal node. If the estimate is not a valid lower-bound, i.e. it over-estimates, the results returned may not be correct.- Returns:
4-tuples containing the starting node index, the goal node index, the lowest cost, and a path with the lowest cost.
Tip
The performance of A* star algorithm heavily depends on how good your heuristic function is. Passing in
0
as the estimate is always valid, but then you really should be using Dijkstra’s algorithm.Good heuristics usually come about from a metric in the ambient space in which your data live, e.g. spherical distance on the surface of a sphere, or Manhattan distance on a grid.
Func.Math.haversine_deg_input
could be helpful for the spherical case. Note that you must use the correct units for the distance.Providing a heuristic that is not guaranteed to be a lower-bound might be acceptable if you are fine with inaccuracies. The errors in the answers are bound by the sum of the margins of your over-estimates.
Community detection algorithms#
- ClusteringCoefficients(edges[from, to, weight?])#
Computes the clustering coefficients of the graph with the provided edges.
- Returns:
4-tuples containing the node index, the clustering coefficient, the number of triangles attached to the node, and the total degree of the node.
- CommunityDetectionLouvain(edges[from, to, weight?], undirected: false, max_iter: 10, delta: 0.0001, keep_depth?: depth)#
Runs the Louvain algorithm on the graph with the provided edges, optionally non-negatively weighted.
- Parameters:
undirected – Whether the graph should be interpreted as undirected. Defaults to
false
.max_iter – The maximum number of iterations to run within each epoch of the algorithm. Defaults to 10.
delta – How much the modularity has to change before a step in the algorithm is considered to be an improvement.
keep_depth – How many levels in the hierarchy of communities to keep in the final result. If omitted, all levels are kept.
- Returns:
Pairs containing the label for a community, and a node index belonging to the community. Each label is a list of integers with maximum length constrained by the parameter
keep_depth
. This list represents the hierarchy of sub-communities containing the list.
- LabelPropagation(edges[from, to, weight?], undirected: false, max_iter: 10)#
Runs the label propagation algorithm on the graph with the provided edges, optionally weighted.
- Parameters:
undirected – Whether the graph should be interpreted as undirected. Defaults to
false
.max_iter – The maximum number of iterations to run. Defaults to 10.
- Returns:
Pairs containing the integer label for a community, and a node index belonging to the community.
Centrality measures#
- DegreeCentrality(edges[from, to])#
Computes the degree centrality of the nodes in the graph with the given edges. The computation is trivial, so this should be your first thing to try when exploring new data.
- Returns:
4-tuples containing the node index, the total degree (how many edges involve this node), the out-degree (how many edges point away from this node), and the in-degree (how many edges point to this node).
- PageRank(edges[from, to, weight?], undirected: false, theta: 0.85, epsilon: 0.0001, iterations: 10)#
Computes the PageRank from the given graph with the provided edges, optionally weighted.
- Parameters:
undirected – Whether the graph should be interpreted as undirected. Defaults to
false
.theta – A number between 0 and 1 indicating how much weight in the PageRank matrix is due to the explicit edges. A number of 1 indicates no random restarts. Defaults to 0.8.
epsilon – Minimum PageRank change in any node for an iteration to be considered an improvement. Defaults to 0.05.
iterations – How many iterations to run. Fewer iterations are run if convergence is reached. Defaults to 20.
- Returns:
Pairs containing the node label and its PageRank.
- ClosenessCentrality(edges[from, to, weight?], undirected: false)#
Computes the closeness centrality of the graph. The input relation represent edges connecting node indices which are optionally weighted.
- Parameters:
undirected – Whether the edges should be interpreted as undirected. Defaults to
false
.- Returns:
Node index together with its centrality.
- BetweennessCentrality(edges[from, to, weight?], undirected: false)#
Computes the betweenness centrality of the graph. The input relation represent edges connecting node indices which are optionally weighted.
- Parameters:
undirected – Whether the edges should be interpreted as undirected. Defaults to
false
.- Returns:
Node index together with its centrality.
Warning
BetweennessCentrality
is very expensive for medium to large graphs. If possible, collapse large graphs into supergraphs by running a community detection algorithm first.
Miscellaneous#
- RandomWalk(edges[from, to, ...], nodes[idx, ...], starting[idx], steps: 10, weight?: expr, iterations: 1)#
Performs random walk on the graph with the provided edges and nodes, starting at the nodes in
starting
.- Parameters:
steps (required) – How many steps to walk for each node in
starting
. Produced paths may be shorter if dead ends are reached.weight – An expression evaluated against bindings of
nodes
and bindings ofedges
, at a time when the walk is at a node and choosing between multiple edges to follow. It should evaluate to a non-negative number indicating the weight of the given choice of edge to follow. If omitted, which edge to follow is chosen uniformly.iterations – How many times walking is repeated for each starting node.
- Returns:
Triples containing a numerical index for the walk, the starting node, and the path followed.