User Guide

As described in the homepage, cliquematch aims to do two things:

  1. Find maximum cliques in large sparse undirected graphs, as quickly and efficiently as possible.

  2. Construct large sparse undirected graphs in-memory for the various applications of the maximum clique/clique enumeration problem.

Let’s look at how cliquematch can be used as described above.

Finding a Maximum Clique

Note

The below examples use python. If you’re using R and like to call python packages via reticulate, you can view the R Markdown file from the examples in the Github repo.

Finding a maximum clique in an undirected graph is a well known problem. Most exact algorithms, including the one used in cliquematch, use some form of a depth-first search (DFS), along with some pruning techniques to speed up the search.

Finding a maximum clique using cliquematch is simple (and fast): load a Graph from a file (or from an adjacency matrix, an adjacency list, or a list of edges), set some search properties, and find a maximum clique in it.

For example, we load a graph from a file: cond-mat-2003.mtx and find a maximum clique.

import cliquematch
G = cliquematch.Graph.from_file("cond-mat-2003.mtx")

We can also load a Graph using an adjacency matrix (from_adj_matrix), an adjacency list (from_adj_list), or a list of edges (from_edgelist). Once the Graph object has been loaded, we can provide various parameters to search for a clique with get_max_clique:

answer = G.get_max_clique(
    lower_bound=1,
    upper_bound=1729,
    time_limit=100,
    use_heuristic=True,
    use_dfs=True,
)
print(answer)
# [9986, 9987, 10066, 10068, 10071, 10072, 10074, 10076,
# 10077, 10078, 10079, 10080, 10081, 10082, 10083, 10085,
# 10287, 10902, 10903, 10904, 10905, 10906, 10907, 10908, 10909]
  • A lower_bound and an upper_bound can be set for the size for clique.

  • A time_limit can be set for the clique search. If time limits are not needed, set 0 as the limit.

  • In case the graph is very large, to get a large clique quickly via a heuristic method, use_heuristic can be set to True, and use_dfs can be set to False to skip the depth-first search.

Warning

Both use_heuristic and use_dfs cannot be set to False. cliquematch will then skip the clique search entirely and raise a RuntimeError.

The Graph object has some read-only properties:

The user can print the Graph object to view the properties:

print(G)
# cliquematch.core.Graph object at 0x559e7da730c0
# (n_vertices=31163,n_edges=120029,search_done=False)

What is the search_done property for? In case the clique search was interrupted due to time_limit, the user can provide a continue_search parameter within a loop like this:

while not G.search_done:
    answer = G.get_max_clique(
        lower_bound=1,
        upper_bound=1729,
        time_limit=100,
        use_heuristic=True,
        use_dfs=True,
        continue_search=True,
    )

If the entire Graph needs to be searched again (say for a clique of larger size), the user can call the reset_search method:

G.reset_search()
G.get_max_clique(
    lower_bound=1, upper_bound=31, time_limit=100, use_heuristic=True, use_dfs=True
)

Finding Multiple cliques : clique enumeration

The above Graph methods deal with finding one large clique. A related use case is to find multiple large cliques in a given graph,and iterate through them in some order.

The Graph class has an all_cliques method for finding all cliques of a given size. Taking the Graph object loaded as per the above section:

import cliquematch
G = cliquematch.Graph.from_file("cond-mat-2003.mtx")
# we know there exists a maximum clique of size 25
# so let's find cliques of size 24
for clique in G.all_cliques(size=24):
     print(clique)

Applications of the maximum clique problem

Applications of the maximum clique problem primarily involve:

  1. the construction of a graph from a different kind of dataset,

  2. writing the graph to a file,

  3. reading the file again,

  4. finding a maximum clique, and then

  5. conversion of the clique back into the existing dataset.

This process is usually repeated with tweaks to underlying dataset, leading to different graphs and cliques. For such use cases the primary bottlenecks are the construction of the graph, reading/processing the graph data in a clique-friendly manner, and finding the maximum clique. cliquematch aims to solve these issues by keeping the graph construction in memory, and having an optimized clique search algorithm.

Graph construction for maximum clique problems mostly involve one of the two ways below:

  • A graph is constructed using an edge indication function on all pairs of elements belonging to a dataset \(X\).

    • cliquematch does not provide any specific code for this; an edge list can constructed from the data using a nested loop, following which a Graph object can be loaded,and the maximum clique (ie the largest group of related elements in the dataset \(X\)) can be computed.

  • A correspondence graph is constructed using the elements of two datasets \(P\) and \(Q\); the vertices of the graph refer to pairs of elements \((p_i, q_j)\), and an edge between two vertices implies some common relationship between the elements from \(P\) and the elements from \(Q\).

Yeah… that’s a little dense. Let’s try again with some math.

Assume you have two sets \(P = { p_1, p_2, ... p_M }\) and \(Q = { q_1, q_2, ... q_N }\), and we want to find the largest subsets \(P^* \in P\) and \(Q^* \in Q\) such that there exists a one-to-one correspondence between \(P^*\) and \(Q^*\).

This means the elements of \(P^*\) are related to each other similar to how the elements of \(Q^*\) are related to each other. Suppose the elements of \(P^*\) (and similarly \(Q^*\)) have a pairwise relationship, then we can say that for all pairs

\[\begin{split}( (p_{i_1}, p_{i_2}), (q_{j_1}, q_{j_2}) ) \in P^* \times P^* \times Q^* \times Q^* \\ i_1 \neq i_2 \\ j_1 \neq j_2 \\\end{split}\]

there exists some boolean condition function \(f\) such that`

\[\begin{split}f(p_{i_1}, p_{i_2}, q_{j_1}, q_{j_2}) = 1 \\ \forall ( p_{i_1}, p_{i_2}, q_{j_1}, q_{j_2} ) \in P^* \times P^* \times Q^* \times Q^* \\ i_1 \neq i_2 \\ j_1 \neq j_2 \\\end{split}\]

What does this have to do with maximum cliques? Well, \(P^*\) and \(Q^*\) are the largest such subsets, so maybe finding them can be done by converting the problem to a maximum clique problem. This is where we bring in a correspondence graph: an undirected graph \(G(V,E)\), where \(V = P \times Q\) ie the vertices indicate a mapping \((p_i, q_j)\). As for edges, that’s where \(f\) comes in: an edge exists between \(v_1 = (p_{i_1}, q_{j_1})\) and \(v_2 = (p_{i_2}, q_{j_2})\) if and only if:

\[f(p_{i_1}, p_{i_2}, q_{j_1}, q_{j_2}) = 1\]

It can be proved that finding a maximum clique in the correspondence graph \(G\) is the same as finding the largest subsets \(P^*\) and \(Q^*\) that have a one-to-one correspondence.

cliquematch provides classes the implement the above functionality of correspondence graphs. The user has to provide are the sets \(P\) and \(Q\), along with a condition function \(f\). A common use case is to have \(f\) expressed using distance metrics:

\[f(p_{i_1}, p_{i_2}, q_{j_1}, q_{j_2}) = 1 \iff || d_P(p_{i_1}, p_{i_2}) - d_Q(q_{j_1}, q_{j_2}) || \leq \epsilon\]

The correspondence graph classes are all subclasses of Graph, they all expose the same methods:

  • An __init__ method that accepts the sets \(P\) (or S1), \(Q\) (or S2), and the distance functions for S1 and S2 respectively.

  • A build_edges method that constructs the correspondence graph using only the distance metrics.

  • A build_edges_with_condition method that accepts a condition function cf, and a boolean use_condition_only:

    • If use_cfunc_only is True, the graph is constructed using only cf (slower)

    • Otherwise the graph is constructed using the distance metrics and pruned with cf (faster)

  • A get_correspondence method that returns the largest corresponding subsets \(P^*\) and \(Q^*\), or the indices of the elements in the subsets.

  • A all_correspondences method similar to the cliquematch.Graph.all_cliques that works similar to get_correspondence.

The correspondence graph classes available are:

The concept of correspondence graphs enables applying maximum cliques to many fields. Here are a couple of examples from the Github repo:

  1. This image matching algorithm can be implemented using cliquematch like this .

  2. Simple molecular alignment can be implemented like this .