Bidirectional Graph-Search: A Benchmark

Python Jul 15, 2015

We had some trouble with the slow identification of “simple paths” and  “shortest paths” in our application, whereby we applied the common  networkx library for python. We therefore developed our own  bidirectional path search algorithms, for which we want provide a small  benchmark at this post.

The algorithms can be downloaded under the following link

https://github.com/andre-dietrich/networkx/blob/master/networkx/algorithms/bidirectional_path_search.py

… we hope that these little helpers can be integrated into the networkx library.

In principle

This algorithm depicted below constructs sequentially two trees, one from the source in direction to the target and one from the target in direction to the source. The leaves of both growing trees are continuously compared, if there are matching leaves in both trees, a new path is identified. Doing this, the algorithm actually reduces the search space by half, which should result in a faster identification of simple paths and shortest paths, compared to the original nx.all_simple_paths and nx.all_shortest_paths algorithms (which are currently included in the networkx library). As revealed by the benchmark, this bidirectional search approach performs performs better for sparse trees, the more connected a graph is, the less beneficial is our approach.Another advantage is, that the results are returned in order, starting from the shortest paths, which are afterwards incremented stepwise in length.

Usage

The principle of the bidirectional search was implemented for three common search tasks (has_path, all_shortest_paths, all_simple_paths). The application is briefly introduced within this section, before the three approaches are compared against their networkx counterparts.

%pylab inline
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (10.0, 8.0)

from bidirectional_path_search import *

import networkx as nx
G = nx.complete_graph(5)

import matplotlib.pyplot as plt
nx.draw_circular(G)

1. all_simple_paths

This function is used to search for all simple paths. A simple path, by definition, is a path with no repeated nodes. As visible in the example below, one benefit of the bidirectional approach is that all results are generated in order.

for path in bidirectional_all_simple_paths(G,source=0,target=4):
print(path)
[0, 4]
[0, 1, 4]
[0, 2, 4]
[0, 3, 4]
[0, 1, 2, 4]
[0, 3, 2, 4]
[0, 2, 1, 4]
[0, 3, 1, 4]
[0, 1, 3, 4]
[0, 2, 3, 4]
[0, 1, 2, 3, 4]
[0, 1, 3, 2, 4]
[0, 3, 1, 2, 4]
[0, 3, 2, 1, 4]
[0, 2, 1, 3, 4]
[0, 2, 3, 1, 4]

A similar example below, shows a the usage of the cutoff parameter, which is used to define a maximal path length

paths = bidirectional_all_simple_paths(G, 0, 4,cutoff=2)
print(list(paths))
[[0, 4], [0, 1, 4], [0, 2, 4], [0, 3, 4]]

2. all_shortest_paths

all_shortest_paths is used similarly … However, the implementation of this function contains some optimizations that cannot be applied in all_simple_paths, making it even faster.

for path in bidirectional_all_shortest_paths(G, 0, 4):
print(path)
[0, 4]

3. has_path

With this function we tried to port our concept onto a simple check of whether a path between two nodes exists or not.

print( bidirectional_has_path(G, source=0, target=4) )
True

4. Note

All of the presented algorithms can be applied in the same way also for directed graphs and multigraphs.

Benchmark against networkx

The following benchmark is used to test the bidirectional approach against the implementation applied in networkx. For the test we applied the huge and dense graph that is generated below.

import time
import random

# trying out a huge graph...
G1 = nx.dense_gnm_random_graph(15000,15000,2)
print nx.info(G1)
Name: dense_gnm_random_graph(15000,15000)
Type: Graph
Number of nodes: 15000
Number of edges: 15000
Average degree:   2.0000

1. has_path – nearly equal (slightly)

As this first test reveals that our approach is a bit slower than the original networkx.has_path algorithm in identifying a connection between two nodes.

times1, times2 = [], []
for i in range(100000):
source = random.randint(0, len(G1)-1)
target = random.randint(0, len(G1)-1)

# original
t = time.time()
r1 = nx.has_path(G1, source, target)
times1.append(time.time()-t)

# bidirectional
t = time.time()
r2 = bidirectional_has_path(G1, source, target)
times2.append(time.time()-t)

print "mean NetworkX:     ", mean(times1)
print "mean Bidirectional:", mean(times2)
mean NetworkX:      0.0001208728862
mean Bidirectional: 0.0001265842175
from scipy.stats import gaussian_kde
from numpy import linspace

plt.title("Comparison - has_path")
plt.xlabel("time in seconds")

density1 = gaussian_kde(times1)
xs1 = linspace(min(times1), max(times1), 200)
plt.plot(xs1, density1(xs1), label="NetworkX")

density2 = gaussian_kde(times2)
xs2 = linspace(min(times2), max(times2), 200)
plt.plot(xs2, density2(xs2), label="Bidirectional")

plt.legend()
plt.show()

2. all_shortest_paths – win (approx 60times faster)

In contrast to the previous approach, we now really perform much better, if it is about identifying all shortest paths within the graph. As you see from the mean value and mainly in the diagrams below, even with our worst approach we are below the minimum of the original search algorithm.

times1, times2     = [], []
results1, results2 = [], []
lengths1, lengths2 = [], []
for i in range(1000):
source = random.randint(0, len(G1)-1)
target = random.randint(0, len(G1)-1)

if not nx.has_path(G1, source, target):
continue

# bidirectional
t = time.time()
res = list(bidirectional_all_shortest_paths(G1, source, target))
times2.append(time.time()-t)
results2.append(len(res))
lengths2.append(len(res[0]))

# original
t = time.time()
res = list(nx.all_shortest_paths(G1, source, target))
times1.append(time.time()-t)
results1.append(len(res))
lengths1.append(len(res[0]))

print "NetworkX:     ", mean(times1)
print "Bidirectional:", mean(times2)
NetworkX:      0.0273389468201
Bidirectional: 0.000450532889254

This difference becomes even more visible, when plotting the resulting  times against the lengths of the shortest paths. It is still a mystery  for us, why there are these clusters within the original algorithm.

plt.plot(lengths1, times1, 'bs')
plt.xlabel("path length")
plt.ylabel("time in seconds")
plt.title("NetworkX - all_shortest_paths")
plt.show()
plt.plot(lengths2, times2, 'gs')
plt.xlabel("path length")
plt.ylabel("time in seconds")
plt.title("Bidirectional - all_shortest_paths")
plt.show()

3. all_simple_paths – tie game

For all simple paths, the bidirectional approach works fine until a certain cutoff number. The following example shows that we perform much better than the original approach and better than Yens algorithm, but this benefit is decreasing (slightly) with the number of cutoffs (but we are still better than Yen). The original approach shows a nearly linear effort (at least on a logarithmic scale) while ours in creases of cutoffs. That there is such a break even point, becomes especially apparent in the second example, which applies a much smaller graph, but with a higher connectivity…

times3, times4, times5       = [0], [0], [0]
results3, results4, results5 = [0], [0], [0]
print "Cutoff\t #\tTime nx\t\t\tTime bi\t\t\tTime Yen"
for i in range(31):
# original
if times3[-1] < 1000:
t = time.time()
res = list(nx.all_simple_paths(G1, 0, 5555, cutoff=i))
times3.append(time.time()-t)
results3.append(len(res))

# bidirectional
if times4[-1] < 1000:
t = time.time()
res = []
for path in bidirectional_all_simple_paths(G1, 0, 5555):
if len(path) > i+1: break
res.append(path)
times4.append(time.time()-t)
results4.append(len(res))

# Yen
if times5[-1] < 1000:
t = time.time()
res = []
for path in nx.shortest_simple_paths(G1, 0, 5555):
if len(path) > i+1: break
res.append(path)
times5.append(time.time()-t)
results5.append(len(res))

#print results3[-1], results4[-1], results5[-1]
print "%2d\t%d\t%4.10f\t\t%4.10f\t\t%4.10f" % (i, results4[-1],
times3[-1],
times4[-1],
times5[-1])
Cutoff	#	Time nx		 Time bi	   Time Yen
0	0	0.0000169277 	 0.0009489059	   0.0003070831
1	0	0.0000131130	 0.0005350113	   0.0002679825
2	0	0.0000209808	 0.0005290508	   0.0002870560
3	0	0.0000319481	 0.0005290508	   0.0002651215
4	0	0.0000538826	 0.0005249977	   0.0002658367
5	0	0.0001230240	 0.0005271435	   0.0002629757
6	0	0.0003259182	 0.0005269051	   0.0002660751
7	0	0.0008490086	 0.0008840561	   0.0004670620
8	0	0.0025880337	 0.0009429455	   0.0003981590
9	0	0.0030961037	 0.0006442070	   0.0002808571
10	0	0.0059850216	 0.0007340908	   0.0003199577
11	0	0.0125038624	 0.0006818771	   0.0002908707
12	0	0.0252490044	 0.0006968975      0.0002899170
13	2	0.0577909946	 0.0011410713	   0.0125930309
14	3	0.0905401707	 0.0010340214	   0.0128521919
15	8	0.1711978912	 0.0015900135	   0.0400440693
16     11	0.3168399334	 0.0019509792	   0.0538990498
17     22	0.6437628269	 0.0037951469	   0.1164710522
18     49       1.2884712219	 0.0057270527	   0.2691831589
19     86	2.5660240650	 0.0100028515	   0.4873750210
20    179	5.2047691345	 0.0212469101	   1.0338859558
21    347      10.9900391102	 0.0519239902	   2.1379001141
22    699      21.2571210861	 0.1343131065	   4.6672840118
23   1432      47.2503390312	 0.3566339016	  12.2228970528
24   2988      87.9153668880	 1.0754389763	  38.3969161510
25   5778     178.2713079453	 2.6051490307	 135.4517998695
26  11810     354.3257060051	 6.7742710114	 572.2558269501
27  23885     703.5765790939	19.5102679729	2796.4162571430
28  48206    1418.3416740894	60.0605568886
29  96801		       161.4522788525
30 194713		       383.0906679630

Plotting the result on a logarithmic scale …

plt.title("Graph with 15000 nodes: Time vs. Cutoffs")
plt.ylabel("time in seconds (logarithmic scale)")
plt.xlabel("cutoffs")
plt.plot(range(len(times3)), times3, color="b", linestyle='-', marker='o', label="nx.all_simple_paths")
plt.plot(range(len(times4)), times4, color="g", linestyle='-', marker='o', label="bidirectional_all_simple_paths")
plt.plot(range(len(times5)), times5, color="r", linestyle='-', marker='o', label="nx.shortest_simple_paths")
plt.yscale('log')
plt.legend(loc="lower right")
plt.grid(True, which="both")
plt.show()

To show a more dramatic example we applied a much smaller graph with a  higher connectivity rate. In essence it shows a similar behavior, but  for a certain number of results or cutoffs, our algorithm performs worse…

from networkx.utils import powerlaw_sequence
from networkx.utils import create_degree_sequence

def pick_nodes(G):
u = next(G.nodes())
v = next(nx.non_neighbors(G, u))
return u, v

def build_power_law(n):
deg_seq = create_degree_sequence(n, powerlaw_sequence, 100)
G = nx.Graph(nx.configuration_model(deg_seq))
G.remove_edges_from(G.selfloop_edges())
G = max(nx.biconnected_component_subgraphs(G), key=len)
G.name = 'Biconnected comp. of power law model: {0}'.format(n)
return G

G2 = build_power_law(100)
u, v = pick_nodes(G2)
print(nx.info(G2))
Name: Biconnected comp. of power law model: 100
Type: Graph
Number of nodes:  56
Number of edges: 117
Average degree:    4.1786
nx.draw_networkx(G2)

Repeating the experiment …

times6, times7, times8       = [0], [0], [0]
results6, results7, results8 = [0], [0], [0]
print "Cutoff\t #\tTime nx\t\t\tTime bi\t\t\tTime Yen"
for i in range(16):
# original
if times6[-1] < 100:
t = time.time()
res = list(nx.all_simple_paths(G2, u, v, cutoff=i))
times6.append(time.time()-t)
results6.append(len(res))

# bidirectional
if times7[-1] < 100:         t = time.time()         res = []         for path in bidirectional_all_simple_paths(G2, u, v):             if len(path) > i+1: break
res.append(path)
times7.append(time.time()-t)
results7.append(len(res))

# Yen
if times8[-1] < 100:         t = time.time()         res = []         for path in nx.shortest_simple_paths(G2, u, v):             if len(path) > i+1: break
res.append(path)
times8.append(time.time()-t)
results8.append(len(res))

print "%2d\t%d\t%4.10f\t\t%4.10f\t\t%4.10f" % (i, results7[-1],
times6[-1],
times7[-1],
times8[-1])
Cutoff	 #    Time nx          Time bi          Time Yen
0	 0    0.0323719978     0.0001041889	0.0000660419
1	 0    0.0000140667     0.0000300407	0.0000369549
2	 0    0.0000200272     0.0000269413	0.0000331402
3	 2    0.0000438690     0.0000901222	0.0002820492
4	 9    0.0002310276     0.0001258850	0.0013308525
5      33    0.0010929108     0.0005600452	0.0068249702
6     145    0.0042741299     0.0014421940	0.0312778950
7     532    0.0120141506     0.0063610077	0.2335650921
8    1791    0.0422871113     0.0198018551	2.5332250595
9    5729    0.1385929585     0.0889999866    25.3470108509
10   17173    0.4517419338     0.3373081684   284.6168701649
11   48061    1.4191639423     1.3295359612
12  125349    4.1057400703     5.4134969711
13  309536   11.3220450878    18.4174230099
14  726092   30.4987280369    88.0133180618
15 1620039   75.8558409214   331.2424049377
plt.title("Graph with 100 nodes: Time vs. Cutoffs")
plt.ylabel("time in seconds (logarithmic scale)")
plt.xlabel("cutoffs")
plt.plot(range(len(times6)), times6, color="b", linestyle='-', marker='o', label="nx.all_simple_paths")
plt.plot(range(len(times7)), times7, color="g", linestyle='-', marker='o', label="bidirectional_all_simple_paths")
plt.plot(range(len(times8)), times8, color="r", linestyle='-', marker='o', label="nx.shortest_simple_paths")
plt.yscale('log')
plt.legend(loc="lower right")
plt.grid(True, which="both")
plt.show()