Graph Convolution Network

Overview

The goal of graph convolution is to spread node features across the graph according to the graph structure (typically the local connectivity among the nodes).

The features after applying the \(l\)-th graph convolution layer can be denoted as \(\mathbf{H}^{(l)}\), an \(n \times k\) matrix, with \(n\) being the number of nodes in the graph and \(k\) being the number of feature dimensions. \(\mathbf{h}^{(l)}_i\) refers to the \(i\)-th row of \(\mathbf{H}^{(l)}\), i.e., the feature of the \(i\)-th node after applying the \(l\)-th layer.

The connectivity among the nodes can be represented by the adjacency matrix \(\mathbf{A}\), an \(n \times n\) matrix, where \(\mathbf{A}_{ij}\) represents the connectivity from node \(j\) to node \(i\). For an undirected graph, \(\mathbf{A}\) is symmetric.

Feature Matrix

Every node in the graph requires an initial feature vector for the GCN layers to propagate along the connections between nodes. Besides task-specific features, one may also use the following (a short sketch follows the list):

  • Node degree information (Cangea et al. 2018): one-hot encoding the node degree for all degrees up to a given upper bound.
  • One-hot index vector (Yao, Mao, and Luo 2019): one-hot encoding the index of nodes.
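As a concrete illustration, here is a minimal NumPy sketch of the two initialization schemes above (function and variable names are my own, not from the cited papers):

```python
import numpy as np

def degree_one_hot_features(A, max_degree):
    """One-hot encode each node's degree, capped at a given upper bound (Cangea et al. 2018 style)."""
    degrees = np.clip(A.sum(axis=1).astype(int), 0, max_degree)
    X = np.zeros((A.shape[0], max_degree + 1))
    X[np.arange(A.shape[0]), degrees] = 1.0
    return X

def index_one_hot_features(n_nodes):
    """One-hot encode the node index (Yao, Mao, and Luo 2019 style): simply the identity matrix."""
    return np.eye(n_nodes)

# toy undirected graph with edges 0-1 and 1-2
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
print(degree_one_hot_features(A, max_degree=3))   # rows are one-hot degree vectors
print(index_one_hot_features(3))                  # identity matrix
```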

Graph Convolution

Basic Formulation

The GCN layer can be formulated as: \[ \mathbf{H}^{(l + 1)} = f(\hat{\mathbf{A}} \mathbf{H}^{(l)}\mathbf{W}^{(l)}) \] where \(f\) is the activation function and \(\hat{\mathbf{A}}\) is the normalized adjacency matrix. A popular choice for undirected graphs is (Wu et al. 2019): \[ \hat{\mathbf{A}} = \mathbf{D}^{-1/2} \mathbf{A} \,\mathbf{D}^{-1/2} \] where \(\mathbf{D}_{ii} = \sum_j \mathbf{A}_{ij}\) is the degree of node \(i\).
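As a hedged illustration, a dense NumPy sketch of this basic layer might look like the following (names are my own; a practical implementation would use sparse matrices):

```python
import numpy as np

def sym_normalize(A):
    """Symmetric normalization D^{-1/2} A D^{-1/2}; isolated nodes keep zero rows."""
    d = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(d)
    d_inv_sqrt[d > 0] = d[d > 0] ** -0.5
    return d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def gcn_layer(A_hat, H, W, activation=np.tanh):
    """One graph convolution: H^{(l+1)} = f(A_hat @ H^{(l)} @ W^{(l)})."""
    return activation(A_hat @ H @ W)

# toy example: 3 nodes, 2 input dimensions, 4 output dimensions
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H = np.random.randn(3, 2)
W = np.random.randn(2, 4)
print(gcn_layer(sym_normalize(A), H, W).shape)    # (3, 4)
```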

Another choice for \(\hat{\mathbf{A}}\), which can be used not only for undirected graphs but also for directed ones, is: \[ \hat{\mathbf{A}} = \mathbf{D}^{-1} \mathbf{A} \] Some works add an auxiliary self-loop to the adjacency matrix, such that the normalized adjacency matrix becomes: \[ \hat{\mathbf{A}} = \mathbf{D}^{-1/2} (\mathbf{A} + \lambda\,\mathbf{I}) \,\mathbf{D}^{-1/2} \] or (used by Zhang et al. (2018)): \[ \hat{\mathbf{A}} = \mathbf{D}^{-1} (\mathbf{A} + \lambda\,\mathbf{I}) \]
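The self-loop variants differ only in how the adjacency matrix is normalized. A sketch of both, assuming \(\mathbf{D}\) is taken as the degree of the augmented matrix \(\mathbf{A} + \lambda\mathbf{I}\) (conventions differ on this detail):

```python
import numpy as np

def sym_normalize_self_loop(A, lam=1.0):
    """D^{-1/2} (A + lam * I) D^{-1/2}, with D the degree of the augmented matrix."""
    A_tilde = A + lam * np.eye(A.shape[0])
    d_inv_sqrt = A_tilde.sum(axis=1) ** -0.5
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

def rw_normalize_self_loop(A, lam=1.0):
    """D^{-1} (A + lam * I), the row-normalized (random-walk style) variant of Zhang et al. (2018)."""
    A_tilde = A + lam * np.eye(A.shape[0])
    return A_tilde / A_tilde.sum(axis=1, keepdims=True)
```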

Heterogeneous Graph with Various Edge Relationships

If the links between nodes represent different relationships (\(R = \{r\}\)), each relationship may have its own kernel \(\mathbf{W}_r^{(l)}\), such that the GCN layer can be formulated as (Schlichtkrull et al. 2017): \[ \mathbf{h}^{(l+1)}_i = f\left(\sum_{r \in R} \sum_{j \in \mathcal{N}^r_i} \frac{1}{c_{ij}} \mathbf{W}_r^{(l)} \mathbf{h}^{(l)}_j + \mathbf{W}_0^{(l)} \mathbf{h}^{(l)}_i \right) \]

where \(c_{ij} = \left| \mathcal{N}_i^r \right|\) is the normalizing factor.
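A dense NumPy sketch of this relational layer, assuming one adjacency matrix per relation (identifiers are my own, not from Schlichtkrull et al. 2017):

```python
import numpy as np

def relational_gcn_layer(A_per_rel, H, W_per_rel, W_self, activation=np.tanh):
    """
    A_per_rel : list of (n, n) adjacency matrices, one per relation r
    H         : (n, d_in) node features of layer l
    W_per_rel : list of (d_in, d_out) kernels W_r, one per relation r
    W_self    : (d_in, d_out) kernel W_0 for the self-loop term
    """
    out = H @ W_self                                              # self-loop message W_0 h_i
    for A_r, W_r in zip(A_per_rel, W_per_rel):
        deg = np.clip(A_r.sum(axis=1, keepdims=True), 1.0, None)  # c_ij = |N_i^r| (clipped to avoid 0-division)
        out += (A_r @ (H @ W_r)) / deg                            # sum_j (1/c_ij) W_r h_j over relation r
    return activation(out)
```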

Heterogeneous Graph with Various Node Types

If the nodes are grouped into different types, i.e., node \(v_i\) has type \(t_i\), then each type can have its own kernel \(\mathbf{W}_{t}^{(l)}\), and the GCN layer can be formulated as: \[ \begin{align} \mathbf{h}^{(l+1)}_i &= f\left( \sum_{j \in \mathcal{N}_{i}} c_{ij} \, g(\mathbf{h}_j^{(l)}) + g(\mathbf{h}_i^{(l)}) \right) \\ g(\mathbf{h}_i^{(l)}) &= \mathbf{W}_{t_i}^{(l)} \mathbf{h}_i^{(l)} \end{align} \] as proposed by Wang et al. (2019). For the normalizing factor \(c_{ij}\), Wang et al. (2019) also proposed to learn the attention score via self-attention: \[ \begin{align} c_{ij}^{(l)} &= \frac{\exp\left( s_{ij} \right)}{\sum_{\tilde{j} \in \mathcal{N}_i^{(l)}} \exp\left( s_{i\tilde{j}} \right)} \\ s_{ij} &= g(\mathbf{h}_i^{(l)})^{\top} \cdot g(\mathbf{h}_j^{(l)}) \end{align} \] where \(s_{ij}\) is the attention score.
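A minimal NumPy sketch of this layer with the attention-based normalization (type-specific kernels and other names are my own):

```python
import numpy as np

def hetero_node_type_layer(A, H, node_types, W_by_type, activation=np.tanh):
    """
    A          : (n, n) adjacency; nonzero entries mark neighbors
    H          : (n, d) node features
    node_types : length-n sequence of node types t_i
    W_by_type  : dict mapping type t -> (d, d_out) kernel W_t
    """
    n = H.shape[0]
    # g(h_i) = W_{t_i} h_i, applied per node according to its type
    G = np.stack([H[i] @ W_by_type[node_types[i]] for i in range(n)])
    out = np.zeros_like(G)
    for i in range(n):
        nbrs = np.nonzero(A[i])[0]
        if len(nbrs) == 0:
            out[i] = G[i]
            continue
        s = G[nbrs] @ G[i]                  # attention scores s_ij = g(h_i)^T g(h_j)
        c = np.exp(s - s.max())
        c /= c.sum()                        # softmax over the neighborhood N_i
        out[i] = c @ G[nbrs] + G[i]         # sum_j c_ij g(h_j) + g(h_i)
    return activation(out)
```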

LSTM Aggregation with Random Permutation of Neighborhood Nodes

Hamilton, Ying, and Leskovec (2017) proposed a candidate neighborhood-information aggregation method based on an LSTM applied to a random permutation of the neighborhood nodes.

Neighborhood Sampling Instead of Summation

Hamilton, Ying, and Leskovec (2017) proposed to use neighborhood sampling to reduce the training computation time.
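A hedged sketch of the idea, using fixed-size uniform neighbor sampling (names are mine; this omits the layer-wise minibatch machinery of the original method):

```python
import numpy as np

def sample_neighbors(A, node, sample_size, rng=np.random.default_rng(0)):
    """Uniformly sample a fixed number of neighbors (with replacement if the neighborhood is small)."""
    nbrs = np.nonzero(A[node])[0]
    if len(nbrs) == 0:
        return nbrs
    return rng.choice(nbrs, size=sample_size, replace=len(nbrs) < sample_size)
```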

Monte-Carlo Approximation Instead of Summation

Chen, Ma, and Xiao (2018) proposed to view the summation over nodes within one GCN layer as an expectation, formulated as: \[ \mathbf{h}^{(l+1)}(v) = f\left( \int \hat{\mathbf{A}}(v,u) \,\mathbf{h}^{(l)}(u)\,\mathbf{W}^{(l)}\,dP(u) \right) \] where \(u\) and \(v\) denote nodes and \(P(u)\) is taken to be the uniform distribution. Based on this formulation, the authors proposed the proposal (importance-sampling) distribution: \[ q(u) = \left\| \hat{\mathbf{A}}(:,u) \right\|^2 \Big/ \sum_{u'} \left\| \hat{\mathbf{A}}(:,u') \right\|^2 \] i.e., the squared norm of the column of \(\hat{\mathbf{A}}\) associated with node \(u\), normalized over all nodes (for an unnormalized 0/1 adjacency this reduces to the out-degree of \(u\)). The proposal distribution is used to sample \(t_l\) i.i.d. nodes for each GCN layer, namely \(u_1^{(l)}, \dots, u_{t_l}^{(l)}\), and \(\mathbf{h}^{(l+1)}\) is estimated by: \[ \mathbf{h}^{(l+1)}(v) \approx f\left( \frac{1}{t_l}\sum_{j=1}^{t_l} \frac{\hat{\mathbf{A}}(v,u_j^{(l)})\,\mathbf{h}^{(l)}(u_j^{(l)})\,\mathbf{W}^{(l)}}{q(u_j^{(l)})} \right) \]
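A NumPy sketch of this importance-sampling estimate for a single layer (my own names; the actual FastGCN implementation differs in detail):

```python
import numpy as np

def fastgcn_layer_estimate(A_hat, H, W, t_l, activation=np.tanh, rng=np.random.default_rng(0)):
    """
    Estimate H^{(l+1)} = f(A_hat H W) by sampling t_l nodes from
    q(u) proportional to ||A_hat[:, u]||^2 instead of summing over all nodes.
    """
    col_sq_norms = (A_hat ** 2).sum(axis=0)
    q = col_sq_norms / col_sq_norms.sum()               # proposal distribution over nodes
    idx = rng.choice(A_hat.shape[1], size=t_l, p=q)     # t_l i.i.d. samples u_1, ..., u_{t_l}
    weights = 1.0 / (t_l * q[idx])                      # importance weights 1 / (t_l q(u_j))
    agg = (A_hat[:, idx] * weights[None, :]) @ H[idx]   # (1/t_l) sum_j A_hat(:, u_j) h(u_j) / q(u_j)
    return activation(agg @ W)
```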

Learning the Weight of Message Passing

Self-Attention

Instead of relying on the fixed adjacency matrix to weight the neighborhood information in graph convolution networks, Veličković et al. (2017) proposed to use a self-attention mechanism to weight the neighborhood information.

The attention coefficients are calculated by: \[ e_{ij} = a\left(\mathbf{W} \mathbf{h}_i, \mathbf{W} \mathbf{h}_j\right) \] where \(a(\cdot)\) is the attention network, instantiated by Veličković et al. (2017) as: \[ e_{ij} = a\left(\mathbf{W} \mathbf{h}_i, \mathbf{W} \mathbf{h}_j\right) = \mathrm{LeakyReLU}\left({\mathbf{a}}^{\top}\left[ \mathbf{W} \mathbf{h}_i \big\| \mathbf{W} \mathbf{h}_j \right]\right) \] This attention score is further normalized as: \[ \alpha_{ij} = \frac{\exp\left(e_{ij}\right)}{\sum_{k \in \mathcal{N}_i} \exp\left(e_{ik}\right)} = \frac{\exp\left(\mathrm{LeakyReLU}\left({\mathbf{a}}^{\top}\left[ \mathbf{W} \mathbf{h}_i \big\| \mathbf{W} \mathbf{h}_j \right]\right)\right)}{\sum_{k \in \mathcal{N}_i} \exp\left(\mathrm{LeakyReLU}\left({\mathbf{a}}^{\top}\left[ \mathbf{W} \mathbf{h}_i \big\| \mathbf{W} \mathbf{h}_k \right]\right)\right)} \] The final output features of this layer are: \[ \mathbf{h}'_i = \sigma\left( \sum_{j \in \mathcal{N}_i} \alpha_{ij} \mathbf{W} \mathbf{h}_j \right) \] With multi-head attention, the output features can be formulated as either of the following (concatenating or averaging the heads): \[ \begin{align} \mathbf{h}'_i &= \mathop{\bigg\|}_{k=1}^K \sigma\left( \sum_{j \in \mathcal{N}_i} \alpha^k_{ij} \mathbf{W}^k \mathbf{h}_j \right) \\ \mathbf{h}'_i &= \sigma\left( \frac{1}{K} \sum_{k=1}^K \sum_{j \in \mathcal{N}_i} \alpha^k_{ij} \mathbf{W}^k \mathbf{h}_j \right) \end{align} \]

where the second (averaging) formulation is adopted for the output layer.
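A single-head dense NumPy sketch of this attention layer (names are mine; non-neighbors are masked out before the softmax, and every node is assumed to have at least one neighbor, e.g. after adding self-loops):

```python
import numpy as np

def leaky_relu(x, negative_slope=0.2):
    return np.where(x > 0, x, negative_slope * x)

def gat_layer_single_head(A, H, W, a, activation=np.tanh):
    """
    A : (n, n) adjacency, nonzero entries mark neighbors
    H : (n, d_in) node features,  W : (d_in, d_out),  a : (2 * d_out,) attention vector
    """
    d_out = W.shape[1]
    WH = H @ W                                         # W h_i for every node
    s_src = WH @ a[:d_out]                             # contribution of W h_i to e_ij
    s_dst = WH @ a[d_out:]                             # contribution of W h_j to e_ij
    e = leaky_relu(s_src[:, None] + s_dst[None, :])    # e_ij = LeakyReLU(a^T [W h_i || W h_j])
    e = np.where(A > 0, e, -1e9)                       # restrict attention to the neighborhood
    alpha = np.exp(e - e.max(axis=1, keepdims=True))   # softmax over N_i
    alpha /= alpha.sum(axis=1, keepdims=True)
    return activation(alpha @ WH)                      # h'_i = sigma(sum_j alpha_ij W h_j)
```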

Node Embedding

The feature vector \(\mathbf{h}^{(L)}_i\) after the \(L\)-th GCN layer can be used as the node embedding (Hamilton, Ying, and Leskovec 2017; Schlichtkrull et al. 2017). The unsupervised objective with negative sampling is: \[ \mathcal{L} = \sum_{v_i \in V} \sum_{v_j \in N(v_i)} \left[ \log \frac{\exp(\mathbf{h}^{(L)}_i \cdot \mathbf{h}^{(L)}_j)}{\exp(\mathbf{h}^{(L)}_i \cdot \mathbf{h}^{(L)}_j) + 1} + \sum_{\substack{n = 1 \dots k \\ \tilde{v}_n \sim q(v)}} \log\frac{1}{\exp(\mathbf{h}^{(L)}_i \cdot \tilde{\mathbf{h}}^{(L)}_n) + 1} \right] \] which is the same form as the negative-sampling objective for node embedding (it is maximized during training, or equivalently its negative is minimized as the loss). This embedding objective can be used alone in a fully unsupervised task, or as an auxiliary/regularization term alongside the main objective of a supervised task (Hamilton, Ying, and Leskovec 2017).
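A NumPy sketch of one term of this negative-sampling objective, for a single positive pair and its \(k\) negative samples (helper names are mine):

```python
import numpy as np

def log_sigmoid(x):
    """Numerically stable log(exp(x) / (exp(x) + 1))."""
    return -np.logaddexp(0.0, -x)

def neg_sampling_term(h_i, h_j, h_neg):
    """
    h_i, h_j : embeddings h^{(L)} of a connected node pair (v_i, v_j)
    h_neg    : (k, d) embeddings of k nodes sampled from the negative distribution q(v)
    Returns log sigma(h_i . h_j) + sum_n log sigma(-h_i . h_neg_n), to be maximized.
    """
    return log_sigmoid(h_i @ h_j) + log_sigmoid(-(h_neg @ h_i)).sum()
```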

Pooling

gPool

Cangea et al. (2018) proposed the projection score to sort the nodes and keep the top \(\lceil kN \rceil\) nodes after pooling, where \(k \in (0, 1]\) is the pooling ratio and \(N\) is the number of input nodes. The projection score \(\mathbf{Z}\) and the pooling operation are formulated as: \[ \mathbf{Z} = \frac{\mathbf{X} \mathbf{p}}{\left\| \mathbf{p} \right\|_2} \qquad \overrightarrow{i} = \mathop{\text{top-rank}}\left( \mathbf{Z}, \lceil kN \rceil \right) \qquad \mathbf{X}' = \left( \mathbf{X} \odot \tanh(\mathbf{Z}) \right)_{\overrightarrow{i}} \qquad \mathbf{A}' = \mathbf{A}_{\overrightarrow{i},\overrightarrow{i}} \] where \(\mathbf{X}\) is the feature matrix of the input nodes, \(\mathbf{p}\) is the learnable projection parameter, and \(\cdot_{\overrightarrow{i}}\) is the indexing operation.

Cangea et al. (2018) also described a similar architecture, named gPool, with \(\tanh(\mathbf{Z})\) in \(\mathbf{X}' = \left( \mathbf{X} \odot \tanh(\mathbf{Z}) \right)_{\overrightarrow{i}}\) replaced by \(\mathrm{sigmoid}(\mathbf{Z})\), mimicking the commonly adopted gating mechanism.

Note: the \(\tanh(\mathbf{Z})\) or \(\mathrm{sigmoid}(\mathbf{Z})\) term enables the gradient to flow through \(\mathbf{p}\). Without it, \(\mathbf{p}\) would only produce a discrete element selection, which cannot be trained by stochastic gradient descent.
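A NumPy sketch of this top-\(\lceil kN \rceil\) pooling (names are mine):

```python
import math
import numpy as np

def topk_pool(X, A, p, ratio=0.5, gate=np.tanh):
    """
    X : (N, d) node features, A : (N, N) adjacency, p : (d,) learnable projection vector
    Keeps the ceil(ratio * N) nodes with the highest projection score.
    """
    z = X @ p / np.linalg.norm(p)                            # projection scores Z
    idx = np.argsort(-z)[: math.ceil(ratio * X.shape[0])]    # top-rank(Z, ceil(kN))
    X_pooled = X[idx] * gate(z[idx])[:, None]                # (X ⊙ tanh(Z)) restricted to the kept nodes
    A_pooled = A[np.ix_(idx, idx)]                           # A indexed by the kept nodes on both axes
    return X_pooled, A_pooled, idx
```

Swapping `gate=np.tanh` for a sigmoid reproduces the gated variant mentioned above.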

Self-Attention Graph Pooling

Lee, Lee, and Kang (2019) proposed to improve the projection-score based pooling by using a self-attention mechanism to calculate the projection score. The new projection score is derived as: \[ \mathbf{Z} = f(\mathop{\mathrm{GNN}}(\mathbf{X},\mathbf{A})) \] where \(f\) is the activation function, \(\mathbf{X}\) is the node feature matrix to be pooled, and \(\mathbf{A}\) is the adjacency matrix, which may be normalized. \(\mathop{\mathrm{GNN}}\) indicates a general graph network, and the authors proposed the following variants of this GNN (a sketch of the first variant follows the list):

  • \(\mathbf{Z} = f\left(\tilde{\mathbf{D}}^{-1/2}\,\tilde{\mathbf{A}}\,\tilde{\mathbf{D}}^{-1/2}\,\mathbf{X}\,\mathbf{p}\right)\), where \(\tilde{\mathbf{A}} = \mathbf{A}+\mathbf{I}\) is the adjacency matrix with self-loops, and \(\tilde{\mathbf{D}}\) is the degree matrix of \(\tilde{\mathbf{A}}\).
  • \(\mathbf{Z} = f\left(\mathop{\mathrm{GNN}}(\mathbf{X},\mathbf{A}+\mathbf{A}^2)\right)\), which considers the second-order neighborhood by augmenting the adjacency matrix \(\mathbf{A}\) with 2-hop connections.
  • \(\mathbf{Z} = f\left(\mathop{\mathrm{GNN}_2}(\mathop{\mathrm{GNN}_1}(\mathbf{X},\mathbf{A}),\mathbf{A})\right)\), which considers the second-order neighborhood by stacking GNN layers.
  • \(\mathbf{Z}=\frac{1}{M}\sum_{m} f\left( \mathop{\mathrm{GNN}_m}(\mathbf{X},\mathbf{A}) \right)\), which uses the average of multiple attention scores.
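A NumPy sketch of the first variant, computing the self-attention score with a one-layer graph convolution (names are mine):

```python
import numpy as np

def sagpool_score(X, A, p, activation=np.tanh):
    """Self-attention pooling score Z = f(D~^{-1/2} A~ D~^{-1/2} X p), with A~ = A + I."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = A_tilde.sum(axis=1) ** -0.5
    A_norm = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]
    return activation(A_norm @ X @ p)              # one score per node
```

The resulting scores can then be fed into the same top-\(\lceil kN \rceil\) selection sketched for gPool above.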

Hierarchical Pooling vs Global Pooling

Figure 1: Global Pooling vs Hierarchical Pooling

Lee, Lee, and Kang (2019) found that hierarchical pooling tends to work better on large graphs, while global pooling works better on small graphs.

Readout

Combine the Global Average and Max Pooling

Figure 2: Combine the Global Average and Max Pooling

Cangea et al. (2018) proposed a graph readout layer that combines the global average and global max pooling. Specifically, for the output graph \((\mathbf{X}^{(l)}, \mathbf{A}^{(l)})\) of the \(l\)-th pooling layer, the readout layer produces the summarized output: \[ \mathbf{s}^{(l)} = \mathop{\mathrm{Concat}}\left(\frac{1}{N^{(l)}} \sum_{i=1}^{N^{(l)}} \mathbf{x}^{(l)}_i \Bigg\|\max_{i=1}^{N^{(l)}} \mathbf{x}^{(l)}_i\right) \] The summarized outputs of all pooling layers \(\mathbf{s}^{(1)}, \dots, \mathbf{s}^{(L)}\) are then summed together: \[ \mathbf{s} = \sum_{l=1}^L \mathbf{s}^{(l)} \] as the final aggregated feature of the graph. The whole structure is illustrated in fig. 2.
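A NumPy sketch of this readout (names are mine):

```python
import numpy as np

def readout(X):
    """Concatenate global average and global max pooling over the nodes: s^{(l)}."""
    return np.concatenate([X.mean(axis=0), X.max(axis=0)])

def aggregate_readouts(X_per_level):
    """Sum the readouts of all pooling levels into the final graph-level feature s."""
    return np.sum([readout(X_l) for X_l in X_per_level], axis=0)
```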

Practical GCN Architectures

Modeling Relational Data with Graph Convolutional Networks [2017]

Schlichtkrull et al. (2017) proposed the following formulation for their GCN layer with regard to the node relationships: \[ \mathbf{h}^{(l+1)}_i = f\left(\sum_r \sum_{j \in \mathcal{N}^r_i} \frac{1}{c_{ij}} \mathbf{W}_r^{(l)} \mathbf{h}^{(l)}_j + \mathbf{W}_0^{(l)} \mathbf{h}^{(l)}_i \right) \] where the term \(\mathbf{W}_0^{(l)} \mathbf{h}^{(l)}_i\) represents the information passed along the self-loop from the \(l\)-th layer to the \((l+1)\)-th layer. \(c_{ij}\) is the normalizing factor, chosen in their paper to be \(\left| \mathcal{N}_i^r \right|\) (i.e., the in-coming degree of node \(i\) under relation \(r\)).

To avoid having too many parameters, which may be prone to over-fitting, Schlichtkrull et al. (2017) also proposed two methods to regularize the weights \(\mathbf{W}^{(l)}_r\) of the different relationships: basis decomposition and block-diagonal decomposition.

For link prediction, Schlichtkrull et al. (2017) proposed to use the GCN output of the \(L\)-th layer \(\mathbf{h}_i^{(L)}\) as the embedding \(\mathbf{e}_i\) of each node, with the score function: \[ f(i,r,j) = \mathbf{e}_i^{\top}\mathbf{R}_r\mathbf{e}_j \] where \(\mathbf{R}_r\) is a learned diagonal matrix representing the "dot product" between \(\mathbf{e}_i\) and \(\mathbf{e}_j\) under relationship \(r\). The whole model is then trained by negative sampling (see the negative-sampling loss in the Node Embedding section above).
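A small NumPy sketch of this scoring function, with the diagonal of \(\mathbf{R}_r\) stored as a vector (names are mine):

```python
import numpy as np

def link_score(e_i, e_j, r_diag):
    """f(i, r, j) = e_i^T R_r e_j with diagonal R_r; r_diag holds the diagonal entries."""
    return np.sum(e_i * r_diag * e_j)
```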

References

Cangea, Cătălina, Petar Veličković, Nikola Jovanović, Thomas Kipf, and Pietro Liò. 2018. “Towards Sparse Hierarchical Graph Classifiers.” arXiv Preprint arXiv:1811.01287.

Chen, Jie, Tengfei Ma, and Cao Xiao. 2018. “FastGCN: Fast Learning with Graph Convolutional Networks via Importance Sampling.” arXiv:1801.10247 [Cs], January. http://arxiv.org/abs/1801.10247.

Hamilton, Will, Zhitao Ying, and Jure Leskovec. 2017. “Inductive Representation Learning on Large Graphs.” In Advances in Neural Information Processing Systems, 1024–34.

Lee, Junhyun, Inyeop Lee, and Jaewoo Kang. 2019. “Self-Attention Graph Pooling.” arXiv Preprint arXiv:1904.08082.

Schlichtkrull, Michael, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2017. “Modeling Relational Data with Graph Convolutional Networks.” arXiv:1703.06103 [Cs, Stat], October. http://arxiv.org/abs/1703.06103.

Veličković, Petar, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. “Graph Attention Networks.” arXiv Preprint arXiv:1710.10903.

Wang, Yueyang, Ziheng Duan, Binbing Liao, Fei Wu, and Yueting Zhuang. 2019. “Heterogeneous Attributed Network Embedding with Graph Convolutional Networks.” In Proceedings of the AAAI Conference on Artificial Intelligence, 33:10061–2.

Wu, Zonghan, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. 2019. “A Comprehensive Survey on Graph Neural Networks.” arXiv:1901.00596 [Cs, Stat], August. http://arxiv.org/abs/1901.00596.

Yao, Liang, Chengsheng Mao, and Yuan Luo. 2019. “Graph Convolutional Networks for Text Classification.” In Proceedings of the AAAI Conference on Artificial Intelligence, 33:7370–7.

Zhang, Muhan, Zhicheng Cui, Marion Neumann, and Yixin Chen. 2018. “An End-to-End Deep Learning Architecture for Graph Classification.” In Thirty-Second AAAI Conference on Artificial Intelligence.