User Guide¶
As described in the homepage, cliquematch aims to do two things:
Find a maximum clique in large sparse undirected graphs, as quickly and efficiently as possible.
Construct large sparse undirected graphs in-memory for the various applications of the maximum clique problem.
Let’s look at how cliquematch can be used as described above.
Finding a Maximum Clique¶
Finding a maximum clique in an undirected graph is a well known problem. Most
exact algorithms, including the one used in cliquematch, use some form of a
depth-first search (DFS), along with some pruning techniques to speed up the
search.
Finding a maximum clique using cliquematch is simple (and fast): load a
Graph from a file (or from an adjacency matrix, an adjacency
list, or a list of edges), set some search properties, and find a maximum
clique in it.
For example, we load a graph from a file: cond-mat-2003.mtx and find a maximum clique.
import cliquematch
G = cliquematch.Graph.from_file("cond-mat-2003.mtx")
We can also load a Graph using an adjacency matrix
(from_adj_matrix), an adjacency list
(from_adj_list), or a list of edges
(from_edgelist). Once the Graph object has
been loaded, we can change some of the search properties of the
Graph before searching for a clique:
A
lower_boundand anupper_boundcan be set for the size for clique.A
time_limitcan be set for the clique search.In case the graph is very large, to get a large clique quickly via a heuristic method,
use_heuristiccan be set toTrue, anduse_dfscan be set toFalseto skip the depth-first search.
Warning
Both use_heuristic and use_dfs cannot be set to False.
cliquematch will then skip the clique search entirely and raise a RuntimeError.
There are some additional read-only properties:
n_verticesandn_edgesprovide the number of vertices and edges in the graph.search_doneisTrueif the depth-first search has been completed.
The user can print the Graph object to view the properties:
G.time_limit = 1
G.upper_bound = 30
G.lower_bound = 10
G.use_heuristic = False
G.use_dfs = True
print(G)
# cliquematch.core.Graph object at 0x559e7da730c0
# (n_vertices=31163,n_edges=120029,lower_bound=10,upper_bound=30,
# time_limit=1,use_heuristic=False,use_dfs=True,search_done=False)
After setting the properties, the user can call
get_max_clique to get a clique as per the above
constraints.
answer = G.get_max_clique()
print(answer)
# [9986, 9987, 10066, 10068, 10071, 10072, 10074, 10076,
# 10077, 10078, 10079, 10080, 10081, 10082, 10083, 10085,
# 10287, 10902, 10903, 10904, 10905, 10906, 10907, 10908, 10909]
What was the search_done property for? In case the clique
search has not completed, the user can continue the search within a loop like
this:
while not G.search_done:
G.continue_search()
answer = G.get_max_clique()
Finally, if the entire Graph needs to be searched again (say for a clique of larger size),
the user can call the reset_search method:
G.reset_search()
G.upper_bound = 45
G.get_max_clique()
Applications of the maximum clique problem¶
Applications of the maximum clique problem primarily involve:
the construction of a graph from a different kind of dataset,
writing the graph to a file,
reading the file again,
finding a maximum clique, and then
conversion of the clique back into the existing dataset.
This process is usually repeated with tweaks to underlying dataset, leading to
different graphs and cliques. For such use cases the primary bottlenecks are
the construction of the graph, reading/processing the graph data in a
clique-friendly manner, and finding the maximum clique. cliquematch aims to
solve these issues by keeping the graph construction in memory, and having an
optimized clique search algorithm.
Graph construction for maximum clique problems mostly involve one of the two ways below:
A graph is constructed using an edge indication function on all pairs of elements belonging to a dataset \(X\).
cliquematchdoes not provide any specific code for this; an edge list can constructed from the data using a nested loop, following which aGraphobject can be loaded,and the maximum clique (ie the largest group of related elements in the dataset \(X\)) can be computed.
A correspondence graph is constructed using the elements of two datasets \(P\) and \(Q\); the vertices of the graph refer to pairs of elements \((p_i, q_j)\), and an edge between two vertices implies some common relationship between the elements from \(P\) and the elements from \(Q\).
Yeah… that’s a little dense. Let’s try again with some math.
Assume you have two sets \(P = { p_1, p_2, ... p_M }\) and \(Q = { q_1, q_2, ... q_N }\), and we want to find the largest subsets \(P^* \in P\) and \(Q^* \in Q\) such that there exists a one-to-one correspondence between \(P^*\) and \(Q^*\).
This means the elements of \(P^*\) are related to each other similar to how the elements of \(Q^*\) are related to each other. Suppose the elements of \(P^*\) (and similarly \(Q^*\)) have a pairwise relationship, then we can say that for all pairs
there exists some boolean condition function \(f\) such that`
What does this have to do with maximum cliques? Well, \(P^*\) and \(Q^*\) are the largest such subsets, so maybe finding them can be done by converting the problem to a maximum clique problem. This is where we bring in a correspondence graph: an undirected graph \(G(V,E)\), where \(V = P \times Q\) ie the vertices indicate a mapping \((p_i, q_j)\). As for edges, that’s where \(f\) comes in: an edge exists between \(v_1 = (p_{i_1}, q_{j_1})\) and \(v_2 = (p_{i_2}, q_{j_2})\) if and only if:
It can be proved that finding a maximum clique in the correspondence graph \(G\) is the same as finding the largest subsets \(P^*\) and \(Q^*\) that have a one-to-one correspondence.
cliquematch provides classes the implement the above functionality of
correspondence graphs. The user has to provide are the sets \(P\) and
\(Q\), along with a condition function \(f\). A common use case is to
have \(f\) expressed using distance metrics:
The correspondence graph classes are all subclasses of Graph, they all expose the same methods:
An
__init__method that accepts the sets \(P\) (orS1), \(Q\) (orS2), and the distance functions forS1andS2respectively.A
build_edgesmethod that constructs the correspondence graph using only the distance metrics.A
build_edges_with_conditionmethod that accepts a condition functioncf, and a booleanuse_condition_only:If
use_cfunc_onlyisTrue, the graph is constructed using onlycf(slower)Otherwise the graph is constructed using the distance metrics and pruned with
cf(faster)
A
get_correspondencemethod that returns the largest corresponding subsets \(P^*\) and \(Q^*\), or the indices of the elements in the subsets.
The correspondence graph classes available are:
cliquematch.A2AGraphwhereS1andS2are 2-Dnumpy.ndarrayscliquematch.L2LGraphwhereS1andS2arelists (or any list-like object)cliquematch.A2LGraphwhereS1is a 2-Dnumpy.ndarrayandS2is alistcliquematch.L2AGraphwhereS2is alistandS2is a 2-Dnumpy.ndarraycliquematch.IsoGraphwhereS1andS2areGraphs (subgraph isomorphism).cliquematch.AlignGraphwhich is a special case ofA2AGraphused for image alignment.
The concept of correspondence graphs enables applying maximum cliques to many fields. Here are a couple of examples from the Github repo:
This image matching algorithm can be implemented using
cliquematchlike this .Simple molecular alignment can be implemented like this .