PageRank#

The PageRank algorithm calculates network centrality, the most important nodes in a graph.

It was introduced by Larry Page to rank search results in Google.

How it works#

Imagine random movement through a graph. The relative frequency how often a node is visited is equivalent to its PageRank.

The calculation is based on Eigenvectors and Eigenvalues.

Example#

The NetworkX library calculates PageRank in Python:

../../_images/page_rank.png

Roman cities colored by PageRank

Files: page_rank.py cities.csv

"""
PageRank algorithm

https://en.wikipedia.org/wiki/PageRank
"""

import numpy as np
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt


pairs1 = pd.read_csv('cities.csv', names=['city1', 'city2'])
pairs2 = pd.read_csv('cities.csv', names=['city2', 'city1'])
pairs = pd.concat([pairs1, pairs2])

# create adjacency matrix
adjacency = pd.crosstab(pairs['city1'], pairs['city2'])
print(adjacency.values)

# run PageRank algorithm with NetworkX library
g = nx.from_pandas_adjacency(adjacency)

rank = nx.pagerank(g, alpha=0.85)

connections = adjacency.sum(axis=1)
rank = pd.Series(rank)

df = pd.DataFrame({'connections': connections, 'rank': rank})
df = df.sort_values(by='rank', ascending=False)
print(df.head(10))


# color by rank
plt.figure()
colors = [df.loc[city]['rank'] for city in g.nodes()]
nx.draw(g, node_color=colors, labels=dict(zip(g.nodes(), g.nodes())))

plt.show()