## Overview

Graph convolution networks (GCNs) spread node features across the graph according to the graph structure (typically the local connectivity among the nodes).

The features after applying the \(l\)-th graph convolution layer are denoted \(\mathbf{H}^{(l)}\), an \(n \times k\) matrix, where \(n\) is the number of nodes in the graph and \(k\) is the number of feature dimensions. \(\mathbf{h}^{(l)}_i\), the \(i\)-th row of \(\mathbf{H}^{(l)}\), denotes the feature of the \(i\)-th node after applying the \(l\)-th layer.

The connectivity among the nodes can be represented by the adjacency matrix \(\mathbf{A}\), an \(n \times n\) matrix, where \(\mathbf{A}_{ij}\) represents the connectivity from node \(j\) to node \(i\). For undirected graphs, \(\mathbf{A}\) is symmetric.

## Feature Matrix

Every node in the graph requires an initial feature matrix, for GCN layers to propagate along the connections between nodes. Besides using the task-specific features, one may also use:

- Node degree information (Cangea et al. 2018): one-hot encoding the node degree for all degrees up to a given upper bound.
- One-hot index vector (Yao, Mao, and Luo 2019): one-hot encoding the index of nodes.
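As a minimal sketch of the first option above (not the cited papers' code; the function name is illustrative), one can one-hot encode node degrees clipped to an upper bound:

```python
import numpy as np

def degree_features(A, max_degree):
    """One-hot encode each node's degree, capping at max_degree."""
    deg = A.sum(axis=1).astype(int)        # degree of each node
    deg = np.minimum(deg, max_degree)      # clip to the given upper bound
    X = np.zeros((A.shape[0], max_degree + 1))
    X[np.arange(A.shape[0]), deg] = 1.0
    return X

# Triangle graph: every node has degree 2.
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]])
X = degree_features(A, max_degree=4)
```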

## Graph Convolution

### Basic Formulation

The GCN layer can be formulated as: \[ \mathbf{H}^{(l + 1)} = f(\hat{\mathbf{A}} \mathbf{H}^{(l)}\mathbf{W}^{(l)}) \] where \(\hat{\mathbf{A}}\) is the normalized adjacency matrix. A popular choice for undirected graphs is (Wu et al. 2019): \[ \hat{\mathbf{A}} = \mathbf{D}^{-1/2} \mathbf{A} \,\mathbf{D}^{-1/2} \] where \(\mathbf{D}_{ii} = \sum_j \mathbf{A}_{ij}\) is the degree of node \(i\).
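This formulation can be sketched in a few lines of numpy (an illustrative sketch, not any paper's reference implementation; zero-degree nodes are simply left at zero):

```python
import numpy as np

def sym_normalize(A):
    """Compute D^{-1/2} A D^{-1/2}; zero-degree nodes stay at zero."""
    d = A.sum(axis=1).astype(float)
    d_inv_sqrt = np.zeros_like(d)
    nz = d > 0
    d_inv_sqrt[nz] = d[nz] ** -0.5
    return d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def gcn_layer(A_hat, H, W, f=np.tanh):
    """One graph convolution: H^{(l+1)} = f(A_hat H^{(l)} W^{(l)})."""
    return f(A_hat @ H @ W)

# Triangle graph: symmetric normalization halves each edge weight.
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]])
A_hat = sym_normalize(A)
```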

Another choice for \(\hat{\mathbf{A}}\), which applies to directed as well as undirected graphs, is: \[ \hat{\mathbf{A}} = \mathbf{D}^{-1} \mathbf{A} \] Some works add auxiliary self-loops to the adjacency matrix, such that the normalized adjacency matrix becomes: \[ \hat{\mathbf{A}} = \mathbf{D}^{-1/2} (\mathbf{A} + \lambda\,\mathbf{I}) \,\mathbf{D}^{-1/2} \] or (used by Zhang et al. (2018)): \[ \hat{\mathbf{A}} = \mathbf{D}^{-1} (\mathbf{A} + \lambda\,\mathbf{I}) \]
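The last variant admits a random-walk interpretation, sketched below (an assumption in this sketch: \(\mathbf{D}\) is taken as the degree matrix of \(\mathbf{A} + \lambda\mathbf{I}\), which makes each row sum to one):

```python
import numpy as np

def rw_normalize_with_self_loops(A, lam=1.0):
    """D^{-1}(A + lam*I), taking D as the degree matrix of (A + lam*I),
    so every row of the result sums to 1 (a random-walk view)."""
    A_tilde = A + lam * np.eye(A.shape[0])
    return A_tilde / A_tilde.sum(axis=1, keepdims=True)
```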

### Heterogeneous Graph with Various Edge Relationships

If the links between nodes represent different relationships (\(R = \{r\}\)), each relationship may have its own kernel \(\mathbf{W}_r^{(l)}\), such that the GCN layer can be formulated as (Schlichtkrull et al. 2017): \[ \mathbf{h}^{(l+1)}_i = f\left(\sum_{r \in R} \sum_{j \in \mathcal{N}^r_i} \frac{1}{c_{ij}} \mathbf{W}_r^{(l)} \mathbf{h}^{(l)}_j + \mathbf{W}_0^{(l)} \mathbf{h}^{(l)}_i \right) \]

where \(c_{ij} = \left| \mathcal{N}_i^r \right|\) is the normalizing factor.
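A minimal dense sketch of this relational layer (illustrative, not the R-GCN reference implementation; one adjacency matrix and one kernel per relation are assumed):

```python
import numpy as np

def rgcn_layer(A_r, H, W_r, W0, f=np.tanh):
    """A_r: list of (n, n) 0/1 adjacency matrices, one per relation r.
       W_r: list of matching (k_in, k_out) kernels; W0 for the self-loop."""
    out = H @ W0                                  # self-connection term
    for A, W in zip(A_r, W_r):
        c = A.sum(axis=1, keepdims=True)          # c_i = |N_i^r|
        c[c == 0] = 1.0                           # avoid division by zero
        out += (A @ (H @ W)) / c                  # normalized sum over r-neighbors
    return f(out)
```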

### Heterogeneous Graph with Various Node Types

If the nodes are grouped into different types, i.e., node \(v_i\) has type \(t_i\), then each type can have its own kernel \(\mathbf{W}_{t}^{(l)}\), and the GCN layer can be formulated as: \[ \begin{align} \mathbf{h}^{(l+1)}_i &= f\left( \sum_{j \in \mathcal{N}_{i}} c_{ij} \, g(\mathbf{h}_j^{(l)}) + g(\mathbf{h}_i^{(l)}) \right) \\ g(\mathbf{h}_i^{(l)}) &= \mathbf{W}_{t_i}^{(l)} \mathbf{h}_i^{(l)} \end{align} \] as proposed by Wang et al. (2019). For the normalizing factor \(c_{ij}\), Wang et al. (2019) also proposed to learn an attention score via self-attention: \[ \begin{align} c_{ij}^{(l)} &= \frac{\exp\left( s_{ij} \right)}{\sum_{\tilde{j} \in \mathcal{N}_i^{(l)}} \exp\left( s_{i\tilde{j}} \right)} \\ s_{ij} &= g(\mathbf{h}_i^{(l)})^{\top} \cdot g(\mathbf{h}_j^{(l)}) \end{align} \] where \(s_{ij}\) is the attention score.
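A dense sketch of one such layer (illustrative only, assuming square per-type kernels so dimensions match; not the authors' implementation):

```python
import numpy as np

def hetero_layer(A, H, types, W_t, f=np.tanh):
    """types[i]: type index of node i; W_t[t]: square kernel for type t.
       c_ij is a softmax over the scores s_ij = g(h_i)^T g(h_j)."""
    G = np.stack([W_t[types[i]] @ H[i] for i in range(len(H))])  # g(h_i)
    out = np.zeros_like(G)
    for i in range(len(H)):
        nbrs = np.nonzero(A[i])[0]
        if nbrs.size == 0:                 # isolated node: only the self term
            out[i] = G[i]
            continue
        s = G[nbrs] @ G[i]                           # attention scores s_ij
        c = np.exp(s - s.max()); c /= c.sum()        # softmax over neighbors
        out[i] = c @ G[nbrs] + G[i]
    return f(out)
```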

### LSTM Aggregation with Random Permutation of Neighborhood Nodes

Hamilton, Ying, and Leskovec (2017) proposed an alternative neighborhood aggregation method, based on an LSTM applied to random permutations of the neighborhood nodes.

### Neighborhood Sampling Instead of Summation

Hamilton, Ying, and Leskovec (2017) proposed to use neighborhood sampling to reduce the training computation time.

### Monte-Carlo Approximation Instead of Summation

Chen, Ma, and Xiao (2018) proposed to view the summation over nodes within one GCN layer as an expectation, formulated as: \[ \mathbf{h}^{(l+1)}(v) = f\left( \int \hat{\mathbf{A}}(v,u) \,\mathbf{h}^{(l)}(u)\,\mathbf{W}^{(l)}\,dP(u) \right) \] where \(u\) and \(v\) denote nodes, and \(P(u)\) is taken to be the uniform distribution. Based on this formulation, the authors proposed the importance-sampling proposal distribution: \[ q(u) = \left\| \hat{\mathbf{A}}(:,u) \right\|^2 \bigg/ \sum_{u'} \left\| \hat{\mathbf{A}}(:,u') \right\|^2 \] i.e., the squared norm of the \(u\)-th column of \(\hat{\mathbf{A}}\), normalized over all nodes. This proposal distribution is used to sample \(t_l\) i.i.d. nodes for each GCN layer, namely \(u_1^{(l)}, \dots, u_{t_l}^{(l)}\), and estimate \(\mathbf{h}^{(l+1)}\) by: \[ \mathbf{h}^{(l+1)}(v) \approx f\left( \frac{1}{t_l}\sum_{j=1}^{t_l} \frac{\hat{\mathbf{A}}(v,u_j^{(l)})\,\mathbf{h}^{(l)}(u_j^{(l)})\,\mathbf{W}^{(l)}}{q(u_j^{(l)})} \right) \]
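The estimator can be sketched as follows (an illustrative dense version of the idea, not the FastGCN codebase):

```python
import numpy as np

def fastgcn_layer(A_hat, H, W, t, rng, f=np.tanh):
    """Monte-Carlo estimate of f(A_hat H W): sample t columns of A_hat
       with probability q(u) ~ ||A_hat[:, u]||^2, reweight by 1/(t q(u))."""
    col_norms = (A_hat ** 2).sum(axis=0)
    q = col_norms / col_norms.sum()               # proposal distribution
    idx = rng.choice(A_hat.shape[1], size=t, p=q) # i.i.d. sampled nodes
    # Importance-weighted sum over the sampled nodes u_1, ..., u_t.
    S = (A_hat[:, idx] / (t * q[idx])) @ H[idx]
    return f(S @ W)
```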

## Learning the Weight of Message Passing

### Self-Attention

Instead of relying on the fixed adjacency matrix to weight the neighborhood information in graph convolution networks, Veličković et al. (2017) proposed to use a self-attention mechanism to weight the neighborhood information.

The *attention coefficients* are calculated by: \[ e_{ij} = a\left(\mathbf{W} \mathbf{h}_i, \mathbf{W} \mathbf{h}_j\right) \] where \(a(\cdot)\) is the attention network, formulated by Veličković et al. (2017) as: \[ e_{ij} = a\left(\mathbf{W} \mathbf{h}_i, \mathbf{W} \mathbf{h}_j\right) = \mathrm{LeakyReLU}\left({\mathbf{a}}^{\top}\left[ \mathbf{W} \mathbf{h}_i \big\| \mathbf{W} \mathbf{h}_j \right]\right) \] This attention score is further normalized as: \[ \alpha_{ij} = \frac{\exp\left(e_{ij}\right)}{\sum_{k \in \mathcal{N}_i} \exp\left(e_{ik}\right)} = \frac{\exp\left(\mathrm{LeakyReLU}\left({\mathbf{a}}^{\top}\left[ \mathbf{W} \mathbf{h}_i \big\| \mathbf{W} \mathbf{h}_j \right]\right)\right)}{\sum_{k \in \mathcal{N}_i} \exp\left(\mathrm{LeakyReLU}\left({\mathbf{a}}^{\top}\left[ \mathbf{W} \mathbf{h}_i \big\| \mathbf{W} \mathbf{h}_k \right]\right)\right)} \] The final output features of this layer are: \[ \mathbf{h}'_i = \sigma\left( \sum_{j \in \mathcal{N}_i} \alpha_{ij} \mathbf{W} \mathbf{h}_j \right) \] With *multi-head attention*, the output features can be formulated as either of the following (i.e., *concat* and *average* multi-head attention): \[ \begin{align} \mathbf{h}'_i &= \mathop{\bigg\|}_{k=1}^K \sigma\left( \sum_{j \in \mathcal{N}_i} \alpha^k_{ij} \mathbf{W}^k \mathbf{h}_j \right) \\ \mathbf{h}'_i &= \sigma\left( \frac{1}{K} \sum_{k=1}^K \sum_{j \in \mathcal{N}_i} \alpha^k_{ij} \mathbf{W}^k \mathbf{h}_j \right) \end{align} \]

where the second formulation is adopted as the output layer.
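A single-head sketch of this attention mechanism (illustrative only; it assumes every node has at least one neighbor, and that self-loops are already in \(\mathbf{A}\) if node \(i\) should attend to itself):

```python
import numpy as np

def leaky_relu(x, alpha=0.2):
    return np.where(x > 0, x, alpha * x)

def gat_layer(A, H, W, a, f=np.tanh):
    """Single-head attention layer: a has shape (2*k_out,) and scores
       the concatenation [W h_i || W h_j]."""
    Z = H @ W                                            # (n, k_out)
    out = np.zeros_like(Z)
    for i in range(A.shape[0]):
        nbrs = np.nonzero(A[i])[0]
        e = leaky_relu(np.array(
            [a @ np.concatenate([Z[i], Z[j]]) for j in nbrs]))
        alpha_ = np.exp(e - e.max()); alpha_ /= alpha_.sum()  # softmax
        out[i] = alpha_ @ Z[nbrs]                        # weighted sum
    return f(out)
```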

## Node Embedding

The feature vector \(\mathbf{h}^{(L)}\) after the \(L\)-th GCN layer can be used as the node embedding (Hamilton, Ying, and Leskovec 2017; Schlichtkrull et al. 2017). The unsupervised loss, using negative sampling, is: \[ \mathcal{L} = \sum_{v_i \in V} \sum_{v_j \in N(v_i)} \left[ \log \frac{\exp(\mathbf{h}^{(L)}_i \cdot \mathbf{h}^{(L)}_j)}{\exp(\mathbf{h}^{(L)}_i \cdot \mathbf{h}^{(L)}_j) + 1} + \sum_{\substack{\tilde{v}_n \sim q(v) \\ n = 1 \dots k}} \log\frac{1}{\exp(\mathbf{h}^{(L)}_i \cdot \tilde{\mathbf{h}}^{(L)}_n) + 1} \right] \] the same as the negative sampling method for node embedding. This embedding loss can be used alone in a fully unsupervised task, or as an augmentation/regularization term for a supervised task, alongside the main objective of the specific task (Hamilton, Ying, and Leskovec 2017).
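Noting that the two bracketed terms are \(\log\sigma(\mathbf{h}_i \cdot \mathbf{h}_j)\) and \(\log\sigma(-\mathbf{h}_i \cdot \tilde{\mathbf{h}}_n)\), the loss can be sketched as (illustrative; the pair lists stand in for neighbor sampling and the negative distribution \(q(v)\)):

```python
import numpy as np

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x)) = -log(1 + exp(-x))."""
    return -np.logaddexp(0.0, -x)

def embedding_loss(E, pos_pairs, neg_pairs):
    """E: (n, k) node embeddings; pos_pairs: neighbor pairs (i, j);
       neg_pairs: pairs (i, n) with n drawn from the negative distribution."""
    loss = 0.0
    for i, j in pos_pairs:
        loss += log_sigmoid(E[i] @ E[j])       # log sigma(h_i . h_j)
    for i, n_ in neg_pairs:
        loss += log_sigmoid(-(E[i] @ E[n_]))   # log sigma(-h_i . h~_n)
    return loss
```

This is a log-likelihood to be maximized (or its negation minimized by gradient descent).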

## Pooling

### gPool

Cangea et al. (2018) proposed the *projection score* to sort the nodes, keeping the top \(\lceil kN \rceil\) of them after pooling, where \(k \in (0, 1]\) is the *pooling ratio* and \(N\) is the number of input nodes. The projection score \(\mathbf{Z}\) and the pooling method are formulated as: \[ \mathbf{Z} = \frac{\mathbf{X} \mathbf{p}}{\left\| \mathbf{p} \right\|_2} \qquad \overrightarrow{i} = \mathop{\text{top-rank}}\left( \mathbf{Z}, \lceil kN \rceil \right) \qquad \mathbf{X}' = \left( \mathbf{X} \odot \tanh(\mathbf{Z}) \right)_{\overrightarrow{i}} \qquad \mathbf{A}' = \mathbf{A}_{\overrightarrow{i},\overrightarrow{i}} \] where \(\mathbf{X}\) is the feature matrix of the input nodes, \(\mathbf{p}\) is the learnable projection parameter, and \(\cdot_{\overrightarrow{i}}\) is the indexing operation.
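A minimal sketch of this top-\(k\) pooling (illustrative; `ratio` plays the role of the pooling ratio \(k\) above, renamed to avoid clashing with the top-\(k\) count):

```python
import numpy as np

def topk_pool(X, A, p, ratio, gate=np.tanh):
    """Score nodes by projection onto p, keep the top ceil(ratio*N)."""
    z = X @ p / np.linalg.norm(p)             # projection scores Z
    k = int(np.ceil(ratio * X.shape[0]))
    idx = np.argsort(z)[-k:]                  # indices of the top-k scores
    X_new = X[idx] * gate(z[idx])[:, None]    # gated feature selection
    A_new = A[np.ix_(idx, idx)]               # induced subgraph
    return X_new, A_new, idx
```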

The *gPool* architecture is similar, with \(\tanh(\mathbf{Z})\) in \(\mathbf{X}' = \left( \mathbf{X} \odot \tanh(\mathbf{Z}) \right)_{\overrightarrow{i}}\) replaced by \(\mathrm{sigmoid}(\mathbf{Z})\), mimicking the commonly adopted gating mechanism.

**Note**: the \(\tanh(\mathbf{Z})\) or \(\mathrm{sigmoid}(\mathbf{Z})\) factor enables the gradient to flow back to \(\mathbf{p}\). Without this term, \(\mathbf{p}\) would only produce a discrete node selection, which cannot be trained by stochastic gradient descent.

### Self-Attention Graph Pooling

Lee, Lee, and Kang (2019) proposed to improve the projection-score-based pooling by using a self-attention mechanism to calculate the projection score. The new projection score is derived as: \[ \mathbf{Z} = f(\mathop{\mathrm{GNN}}(\mathbf{X},\mathbf{A})) \] where \(f\) is the activation function, \(\mathbf{X}\) is the node feature matrix to be pooled, and \(\mathbf{A}\) is the (possibly normalized) adjacency matrix. \(\mathop{\mathrm{GNN}}\) denotes a general graph network, and the authors proposed the following variants:

- \(\mathbf{Z} = f\left(\tilde{\mathbf{D}}^{-1/2}\,\tilde{\mathbf{A}}\,\tilde{\mathbf{D}}^{-1/2}\,\mathbf{X}\,\mathbf{p}\right)\), where \(\tilde{\mathbf{A}} = \mathbf{A}+\mathbf{I}\) is the adjacency matrix with self-loops, and \(\tilde{\mathbf{D}}\) is the degree matrix of \(\tilde{\mathbf{A}}\).
- \(\mathbf{Z} = f\left(\mathop{\mathrm{GNN}}(\mathbf{X},\mathbf{A}+\mathbf{A}^2)\right)\), which considers the second-order neighborhood by augmenting the adjacency matrix \(\mathbf{A}\) with 2-hop connections.
- \(\mathbf{Z} = f\left(\mathop{\mathrm{GNN}_2}(\mathop{\mathrm{GNN}_1}(\mathbf{X},\mathbf{A}),\mathbf{A})\right)\), which considers the second-order neighborhood by stacking GNN layers.
- \(\mathbf{Z}=\frac{1}{M}\sum_{m} f\left( \mathop{\mathrm{GNN}_m}(\mathbf{X},\mathbf{A}) \right)\), which uses an average of multiple attention scores.
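The first variant above can be sketched as (illustrative; with self-loops every degree is at least one, so no zero-division guard is needed):

```python
import numpy as np

def sagpool_scores(X, A, p, f=np.tanh):
    """Variant 1: Z = f(D~^{-1/2} A~ D~^{-1/2} X p), with A~ = A + I."""
    A_tilde = A + np.eye(A.shape[0])            # add self-loops
    d_inv_sqrt = A_tilde.sum(axis=1) ** -0.5
    A_hat = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]
    return f(A_hat @ X @ p)                     # one score per node
```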

### Hierarchical Pooling vs Global Pooling

Lee, Lee, and Kang (2019) found that hierarchical pooling tends to work better on large graphs, while global pooling works better on small graphs.

## Readout

### Combine the Global Average and Max Pooling

Cangea et al. (2018) proposed a graph readout layer that combines both the *global average* and *global max* pooling techniques. Specifically, for the output graph of every \(l\)-th pooling layer \((\mathbf{X}^{(l)}, \mathbf{A}^{(l)})\), the readout layer produces the summarized output by: \[ \mathbf{s}^{(l)} = \mathop{\mathrm{Concat}}\left(\frac{1}{N^{(l)}} \sum_{i=1}^{N^{(l)}} \mathbf{x}^{(l)}_i \Bigg\|\max_{i=1}^{N^{(l)}} \mathbf{x}^{(l)}_i\right) \] The summarized outputs of all pooling layers, \(\mathbf{s}^{(1)}, \dots, \mathbf{s}^{(L)}\), are then summed: \[ \mathbf{s} = \sum_{l=1}^L \mathbf{s}^{(l)} \] as the final aggregated feature of the graph. The whole structure is shown in fig. 2.
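This readout can be sketched in two lines of numpy (an illustrative sketch, not the authors' code):

```python
import numpy as np

def readout(X):
    """Concatenate global average and global max pooling of node features."""
    return np.concatenate([X.mean(axis=0), X.max(axis=0)])

def aggregate(layer_outputs):
    """Sum the per-layer readouts s^(1), ..., s^(L)."""
    return np.sum([readout(X) for X in layer_outputs], axis=0)
```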

## Practical GCN Architectures

### Modeling Relational Data with Graph Convolutional Networks [2017]

Schlichtkrull et al. (2017) proposed the following formulation for their GCN layer with regard to node relationships: \[ \mathbf{h}^{(l+1)}_i = f\left(\sum_r \sum_{j \in \mathcal{N}^r_i} \frac{1}{c_{ij}} \mathbf{W}_r^{(l)} \mathbf{h}^{(l)}_j + \mathbf{W}_0^{(l)} \mathbf{h}^{(l)}_i \right) \] where the term \(\mathbf{W}_0^{(l)} \mathbf{h}^{(l)}_i\) represents the information passed along the self-loop from the \(l\)-th layer to the \((l+1)\)-th layer. \(c_{ij}\) is the normalizing factor, chosen to be \(\left| \mathcal{N}_i^r \right|\) (i.e., the number of in-coming \(r\)-neighbors of node \(i\)) in their paper.

To avoid having too many parameters, which may be potentially prone to over-fitting, Schlichtkrull et al. (2017) also proposed two methods to regularize the weights \(\mathbf{W}^{(l)}_r\) of different relationships.

For link prediction, Schlichtkrull et al. (2017) proposed to use the GCN output at the \(L\)-th layer, \(\mathbf{h}_i^{(L)}\), as the embedding \(\mathbf{e}_i\) of each node, deriving the score function: \[ f(i,r,j) = \mathbf{e}_i^{\top}\mathbf{R}_r\mathbf{e}_j \] where \(\mathbf{R}_r\) is a learned diagonal matrix representing the "dot product" between \(\mathbf{e}_i\) and \(\mathbf{e}_j\) under relationship \(r\). The whole graph is then trained by negative sampling, with a loss of the same form as the node-embedding loss above.
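Since \(\mathbf{R}_r\) is diagonal, the score reduces to an element-wise weighted dot product, as in this small sketch (illustrative; `r_diag` holds the diagonal of \(\mathbf{R}_r\)):

```python
import numpy as np

def score(e_i, r_diag, e_j):
    """Relation-specific score e_i^T R_r e_j with diagonal R_r = diag(r_diag)."""
    return float(e_i @ (r_diag * e_j))
```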

# References

Cangea, Cătălina, Petar Veličković, Nikola Jovanović, Thomas Kipf, and Pietro Liò. 2018. “Towards Sparse Hierarchical Graph Classifiers.” *arXiv Preprint arXiv:1811.01287*.

Chen, Jie, Tengfei Ma, and Cao Xiao. 2018. “FastGCN: Fast Learning with Graph Convolutional Networks via Importance Sampling.” *arXiv:1801.10247 [Cs]*, January. http://arxiv.org/abs/1801.10247.

Hamilton, Will, Zhitao Ying, and Jure Leskovec. 2017. “Inductive Representation Learning on Large Graphs.” In *Advances in Neural Information Processing Systems*, 1024–34.

Lee, Junhyun, Inyeop Lee, and Jaewoo Kang. 2019. “Self-Attention Graph Pooling.” *arXiv Preprint arXiv:1904.08082*.

Schlichtkrull, Michael, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2017. “Modeling Relational Data with Graph Convolutional Networks.” *arXiv:1703.06103 [Cs, Stat]*, October. http://arxiv.org/abs/1703.06103.

Veličković, Petar, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. “Graph Attention Networks.” *arXiv Preprint arXiv:1710.10903*.

Wang, Yueyang, Ziheng Duan, Binbing Liao, Fei Wu, and Yueting Zhuang. 2019. “Heterogeneous Attributed Network Embedding with Graph Convolutional Networks.” In *Proceedings of the AAAI Conference on Artificial Intelligence*, 33:10061–2.

Wu, Zonghan, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. 2019. “A Comprehensive Survey on Graph Neural Networks.” *arXiv:1901.00596 [Cs, Stat]*, August. http://arxiv.org/abs/1901.00596.

Yao, Liang, Chengsheng Mao, and Yuan Luo. 2019. “Graph Convolutional Networks for Text Classification.” In *Proceedings of the AAAI Conference on Artificial Intelligence*, 33:7370–7.

Zhang, Muhan, Zhicheng Cui, Marion Neumann, and Yixin Chen. 2018. “An End-to-End Deep Learning Architecture for Graph Classification.” In *Thirty-Second AAAI Conference on Artificial Intelligence*.