Graph Convolutional Networks for dummies
This article explains GCNs as simply as possible. Basic knowledge of ML and matrix multiplication is needed to understand it. Please refrain from directly copying content from the post. If you want to use an excerpt or any original image, link back to the post.
Pinterest, a content-sharing social site, crossed 200 million monthly active users (MAU) in 2018. The content shared on Pinterest is mostly pictures with tags and related text. In the same year, they had around 100 billion such objects stored in their database.
Along with these numbers, they also mentioned in their engineering blog how they moved to a graph-based convolutional neural network in production to recommend pins to their users.
For many data scientists without a CS/Math background or an advanced degree, graph mathematics is hard to understand at first. In this post I attempt to explain traditional GCN (Graph Convolutional Network) concepts as simply as I can.
Level 0: Notations
Take a look at the image below. If you understand everything from the image, feel free to skip this section.
A graph has nodes and edges. For example, you and your family members are nodes of your family graph. Edges describe relationships or dependencies between nodes. An edge from you to your mother can be labelled son of or daughter of. In this case, the edge son of or daughter of is directional, because the label makes sense only one way. Such an edge is called a directed edge, and the graph is a directed graph. Now consider your friends on Facebook. Each of your friends is a node in the network, including you. The edge between two friends, friend of, is undirected, because if you are friends with someone, she/he is friends with you too.
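To make the distinction concrete, here is a minimal Python sketch. The names and edges are made up for illustration, not taken from any image in the post:

```python
# Directed graph: an edge is an ordered pair (source, target).
directed_edges = {("you", "mother")}            # "son of / daughter of" only makes sense one way

# Undirected graph: each friendship is stored once; order does not matter.
undirected_edges = {frozenset({"you", "friend"})}

print(("you", "mother") in directed_edges)                 # True
print(("mother", "you") in directed_edges)                 # False: direction matters
print(frozenset({"friend", "you"}) in undirected_edges)    # True: order is irrelevant
```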
Tasks on Graphs
What do we want to achieve with machine learning on graphs? There are a number of tasks.
Node classification - Consider a dataset of academic papers where each paper cites and is cited by other papers. Papers are the nodes and citations are the edges. If we assign topics to each paper, for example Statistics, ML, Linear Algebra and so on, we can predict the topic of a new paper by learning the graph structure of this paper-citation network.
Edge classification - Like nodes, it is sometimes useful to classify edges as well, since edges represent relationships between nodes.
Graph-level classification - Instead of nodes and edges, we might want to classify the entire graph. Here, the node- and edge-level representations are aggregated using pooling and readout functions. The final graph-level representation is used to predict the label.
Graph auto-encoding - Just as autoencoders are used on images, unsupervised learning on graphs is possible using autoencoders. The core idea is the same: a network is trained by compressing the graph into a low-dimensional representation and then reconstructing the graph.
Representing a graph
In Computer Vision, we represent an image as a grid of pixels. In NLP, we represent documents as TF-IDF vectors or count-based vectors. Similarly, a graph is represented by its graph Laplacian matrix. It's really simple to understand. Note that we talk about undirected graphs here. For such graphs, the Laplacian matrix is the difference of two matrices. Take a look at the image below.
The above image shows an example graph with three nodes. The adjacency and identity matrices are square matrices with N = number of nodes in the graph; in the example image, N = 3. The intuition behind the identity matrix is to include a node's own embedding when computing that node's representation. The adjacency matrix injects information about the neighbouring nodes. The difference between the identity matrix and the normalized adjacency matrix is the graph Laplacian matrix.
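To make this concrete, here is a minimal NumPy sketch that builds the normalized Laplacian for a hypothetical 3-node chain graph. The exact edges are my assumption for illustration, not necessarily those in the image:

```python
import numpy as np

# Hypothetical 3-node undirected graph: node 0 -- node 1 -- node 2.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])

N = A.shape[0]
I = np.eye(N)                                  # identity: keeps each node's own signal
D = np.diag(A.sum(axis=1))                     # degree matrix
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))

A_norm = D_inv_sqrt @ A @ D_inv_sqrt           # symmetrically normalized adjacency
L = I - A_norm                                 # normalized graph Laplacian

print(L)
```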
What’s the point?
Remember that our objective is to be able to classify nodes in a graph. How is the above framework useful? Consider a case where you have 1 million nodes and labels for only 0.5 million of them are available. You can perform supervised learning on these 0.5 million nodes and predict labels for the rest.
But how do convolutions happen on this matrix?
Convolutions in traditional GCNs are somewhat similar to convolutions on images. The ideas of a convolution filter and of the number of channels in a convolutional layer are pretty much the same.
Convolution Operation
Consider an input graph Laplacian X of size 3×3. This means there are three nodes in the graph. Let N be the number of nodes; N = 3 here.
Consider a convolution layer with 5 output channels. Keep in mind that, unlike in CNNs, the convolution filter does not slide over the matrix. After the convolution operation, we should expect an N×5 matrix. Why? We need a representation for each node, so the number of rows should be N; there are 5 columns because the layer has 5 output channels. In fact, even after stacking many convolution layers, the number of rows must remain N. Let's call the output of the first convolution layer H1.
We construct H1 by computing its individual columns (5 in this case). In the image above, we see how the first two columns are computed. The first column is computed by adding several N×1 matrices. Each of these N×1 matrices is built from three components: U, W and a column of X. W is a learnable matrix that holds the convolutional filter weights. U is a matrix used to perform an axes transformation (similar to what you do in PCA); it is an N×N matrix and U' is its transpose. Don't bother yourself too much with U at this moment. W is also an N×N matrix. Note that U is not learnable; only W is. X[:, 1] means we are operating on the first column of X. To understand the subscripts under W, consider this example: W12 means we are transforming the first column of the input, and the operation contributes to the second column of the output (H1[:, 2]). After the addition, we apply a non-linear activation like ReLU.
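Here is a minimal NumPy sketch of one such layer, following the column-wise recipe above. The graph, the parameter shapes (one N×N weight matrix per input/output column pair) and the random initialisation are assumptions for illustration, not the authors' exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Rebuild the 3-node normalized Laplacian from the earlier sketch and use it as the input X.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
X = np.eye(3) - D_inv_sqrt @ A @ D_inv_sqrt      # N x N input, N = 3

N, C_out = X.shape[0], 5

# U comes from the eigendecomposition of the Laplacian (the "axes transformation").
# It is fixed, not learned.
_, U = np.linalg.eigh(X)

# One learnable N x N weight matrix W[j, k] per (input column j, output column k) pair.
W = rng.normal(size=(N, C_out, N, N)) * 0.1

H1 = np.zeros((N, C_out))
for k in range(C_out):                           # build H1 one column at a time
    col = np.zeros(N)
    for j in range(N):                           # one N x 1 contribution per input column of X
        col += U @ W[j, k] @ U.T @ X[:, j]
    H1[:, k] = np.maximum(col, 0.0)              # ReLU non-linearity

print(H1.shape)                                  # (3, 5): N rows, 5 output channels
```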
If we want to stack a second convolution layer, the output of the first layer, H1, becomes its input. Finally, we get an output matrix of shape N×C, where C is the number of classes. We then compute the cross-entropy loss for each labelled node and perform back-propagation.
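A minimal sketch of that last step, with made-up scores and labels (here -1 marks an unlabelled node), computing cross-entropy only over the labelled nodes as in the semi-supervised setting described earlier:

```python
import numpy as np

rng = np.random.default_rng(1)

N, C = 3, 2                                   # 3 nodes, 2 hypothetical classes
scores = rng.normal(size=(N, C))              # final N x C output of the stacked layers
labels = np.array([0, 1, -1])                 # -1 marks an unlabelled node
labelled = labels >= 0

# Row-wise softmax, then average negative log-likelihood over labelled nodes only.
exp = np.exp(scores - scores.max(axis=1, keepdims=True))
probs = exp / exp.sum(axis=1, keepdims=True)
loss = -np.mean(np.log(probs[labelled, labels[labelled]]))
print(loss)
```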
Drawbacks of such convolutions
Notice that we process the whole Laplacian matrix at once. Imagine a company like Pinterest with millions of nodes; handling such large sparse matrices is often time-consuming and difficult.
Also, the structure of real-world graphs is constantly evolving. Incorporating those changes can often be hard in this framework.
Pinterest, of course, uses a more scalable, web-friendly version of GCN, which we will discuss in future posts.
Implementation?
Authors of the paper Semi-Supervised Classification with Graph Convolutional Networks have open-sourced their implementation of GCN here. You can also run this simple GCN tutorial on colab here.
Lastly…
If reading my article made the last few minutes of your time useful, do subscribe to my newsletter, and if you feel like helping me more, share my article on LinkedIn, Facebook, Reddit, or wherever you like.