User Guide¶
As described in the homepage, cliquematch
aims to do two things:
Find maximum cliques in large sparse undirected graphs, as quickly and efficiently as possible.
Construct large sparse undirected graphs in-memory for the various applications of the maximum clique/clique enumeration problem.
Let’s look at how cliquematch
can be used as described above.
Finding a Maximum Clique¶
Note
The below examples use python
. If you’re using R
and like to call
python packages via reticulate
, you can view the R Markdown file from
the examples in the Github repo.
Finding a maximum clique in an undirected graph is a well known problem. Most
exact algorithms, including the one used in cliquematch
, use some form of a
depth-first search (DFS), along with some pruning techniques to speed up the
search.
Finding a maximum clique using cliquematch
is simple (and fast): load a
Graph
from a file (or from an adjacency matrix, an adjacency
list, or a list of edges), set some search properties, and find a maximum
clique in it.
For example, we load a graph from a file: cond-mat-2003.mtx and find a maximum clique.
import cliquematch
G = cliquematch.Graph.from_file("cond-mat-2003.mtx")
We can also load a Graph
using an adjacency matrix
(from_adj_matrix
), an adjacency list
(from_adj_list
), or a list of edges
(from_edgelist
). Once the Graph
object has
been loaded, we can provide various parameters to search for a clique with get_max_clique
:
answer = G.get_max_clique(
lower_bound=1,
upper_bound=1729,
time_limit=100,
use_heuristic=True,
use_dfs=True,
)
print(answer)
# [9986, 9987, 10066, 10068, 10071, 10072, 10074, 10076,
# 10077, 10078, 10079, 10080, 10081, 10082, 10083, 10085,
# 10287, 10902, 10903, 10904, 10905, 10906, 10907, 10908, 10909]
A
lower_bound
and anupper_bound
can be set for the size for clique.A
time_limit
can be set for the clique search. If time limits are not needed, set0
as the limit.In case the graph is very large, to get a large clique quickly via a heuristic method,
use_heuristic
can be set toTrue
, anduse_dfs
can be set toFalse
to skip the depth-first search.
Warning
Both use_heuristic
and use_dfs
cannot be set to False
.
cliquematch
will then skip the clique search entirely and raise a RuntimeError
.
The Graph
object has some read-only properties:
n_vertices
andn_edges
provide the number of vertices and edges in the graph.search_done
isTrue
if the depth-first search has been completed.
The user can print the Graph
object to view the properties:
print(G)
# cliquematch.core.Graph object at 0x559e7da730c0
# (n_vertices=31163,n_edges=120029,search_done=False)
What is the search_done
property for? In case the clique
search was interrupted due to time_limit
, the user can provide a
continue_search
parameter within a loop like this:
while not G.search_done:
answer = G.get_max_clique(
lower_bound=1,
upper_bound=1729,
time_limit=100,
use_heuristic=True,
use_dfs=True,
continue_search=True,
)
If the entire Graph
needs to be searched again (say for a clique of larger size),
the user can call the reset_search
method:
G.reset_search()
G.get_max_clique(
lower_bound=1, upper_bound=31, time_limit=100, use_heuristic=True, use_dfs=True
)
Finding Multiple cliques : clique enumeration¶
The above Graph
methods deal with finding one large clique. A
related use case is to find multiple large cliques in a given graph,and
iterate through them in some order.
The Graph
class has an all_cliques
method
for finding all cliques of a given size. Taking the Graph
object
loaded as per the above section:
import cliquematch
G = cliquematch.Graph.from_file("cond-mat-2003.mtx")
# we know there exists a maximum clique of size 25
# so let's find cliques of size 24
for clique in G.all_cliques(size=24):
print(clique)
Applications of the maximum clique problem¶
Applications of the maximum clique problem primarily involve:
the construction of a graph from a different kind of dataset,
writing the graph to a file,
reading the file again,
finding a maximum clique, and then
conversion of the clique back into the existing dataset.
This process is usually repeated with tweaks to underlying dataset, leading to
different graphs and cliques. For such use cases the primary bottlenecks are
the construction of the graph, reading/processing the graph data in a
clique-friendly manner, and finding the maximum clique. cliquematch
aims to
solve these issues by keeping the graph construction in memory, and having an
optimized clique search algorithm.
Graph construction for maximum clique problems mostly involve one of the two ways below:
A graph is constructed using an edge indication function on all pairs of elements belonging to a dataset \(X\).
cliquematch
does not provide any specific code for this; an edge list can constructed from the data using a nested loop, following which aGraph
object can be loaded,and the maximum clique (ie the largest group of related elements in the dataset \(X\)) can be computed.
A correspondence graph is constructed using the elements of two datasets \(P\) and \(Q\); the vertices of the graph refer to pairs of elements \((p_i, q_j)\), and an edge between two vertices implies some common relationship between the elements from \(P\) and the elements from \(Q\).
Yeah… that’s a little dense. Let’s try again with some math.
Assume you have two sets \(P = { p_1, p_2, ... p_M }\) and \(Q = { q_1, q_2, ... q_N }\), and we want to find the largest subsets \(P^* \in P\) and \(Q^* \in Q\) such that there exists a one-to-one correspondence between \(P^*\) and \(Q^*\).
This means the elements of \(P^*\) are related to each other similar to how the elements of \(Q^*\) are related to each other. Suppose the elements of \(P^*\) (and similarly \(Q^*\)) have a pairwise relationship, then we can say that for all pairs
there exists some boolean condition function \(f\) such that`
What does this have to do with maximum cliques? Well, \(P^*\) and \(Q^*\) are the largest such subsets, so maybe finding them can be done by converting the problem to a maximum clique problem. This is where we bring in a correspondence graph: an undirected graph \(G(V,E)\), where \(V = P \times Q\) ie the vertices indicate a mapping \((p_i, q_j)\). As for edges, that’s where \(f\) comes in: an edge exists between \(v_1 = (p_{i_1}, q_{j_1})\) and \(v_2 = (p_{i_2}, q_{j_2})\) if and only if:
It can be proved that finding a maximum clique in the correspondence graph \(G\) is the same as finding the largest subsets \(P^*\) and \(Q^*\) that have a one-to-one correspondence.
cliquematch
provides classes the implement the above functionality of
correspondence graphs. The user has to provide are the sets \(P\) and
\(Q\), along with a condition function \(f\). A common use case is to
have \(f\) expressed using distance metrics:
The correspondence graph classes are all subclasses of Graph
, they all expose the same methods:
An
__init__
method that accepts the sets \(P\) (orS1
), \(Q\) (orS2
), and the distance functions forS1
andS2
respectively.A
build_edges
method that constructs the correspondence graph using only the distance metrics.A
build_edges_with_condition
method that accepts a condition functioncf
, and a booleanuse_condition_only
:If
use_cfunc_only
isTrue
, the graph is constructed using onlycf
(slower)Otherwise the graph is constructed using the distance metrics and pruned with
cf
(faster)
A
get_correspondence
method that returns the largest corresponding subsets \(P^*\) and \(Q^*\), or the indices of the elements in the subsets.A
all_correspondences
method similar to thecliquematch.Graph.all_cliques
that works similar toget_correspondence
.
The correspondence graph classes available are:
cliquematch.A2AGraph
whereS1
andS2
are 2-Dnumpy.ndarray
scliquematch.L2LGraph
whereS1
andS2
arelist
s (or any list-like object)cliquematch.A2LGraph
whereS1
is a 2-Dnumpy.ndarray
andS2
is alist
cliquematch.L2AGraph
whereS2
is alist
andS2
is a 2-Dnumpy.ndarray
cliquematch.IsoGraph
whereS1
andS2
areGraph
s (subgraph isomorphism).cliquematch.AlignGraph
which is a special case ofA2AGraph
used for image alignment.
The concept of correspondence graphs enables applying maximum cliques to many fields. Here are a couple of examples from the Github repo:
This image matching algorithm can be implemented using
cliquematch
like this .Simple molecular alignment can be implemented like this .