# Graph Convolution Network

## Overview

Graph convolution networks (GCNs) spread node features across the graph according to the graph structure, typically the local connectivity among the nodes.

The features after applying the $$l$$-th graph convolution layer are denoted $$\mathbf{H}^{(l)}$$, an $$n \times k$$ matrix, where $$n$$ is the number of nodes in the graph and $$k$$ is the number of feature dimensions. $$\mathbf{h}^{(l)}_i$$ refers to the $$i$$-th row of $$\mathbf{H}^{(l)}$$, i.e., the feature of the $$i$$-th node after applying the $$l$$-th layer.

The connectivity among the nodes can be represented by the adjacency matrix $$\mathbf{A}$$, an $$n \times n$$ matrix where $$\mathbf{A}_{ij}$$ represents the connectivity from node $$j$$ to node $$i$$. For an undirected graph, $$\mathbf{A}$$ is symmetric.

## Feature Matrix

Every node in the graph requires an initial feature vector for the GCN layers to propagate along the connections between nodes. Besides task-specific features, one may also use:

• Node degree information (Cangea et al. 2018): one-hot encoding the node degree for all degrees up to a given upper bound.
• One-hot index vector (Yao, Mao, and Luo 2019): one-hot encoding the index of nodes.

## Graph Convolution

### Basic Formulation

The GCN layer can be formulated as: $\mathbf{H}^{(l + 1)} = f(\hat{\mathbf{A}} \mathbf{H}^{(l)}\mathbf{W}^{(l)})$ where $$\hat{\mathbf{A}}$$ is the normalized adjacency matrix. A popular choice for undirected graphs is (Wu et al. 2019): $\hat{\mathbf{A}} = \mathbf{D}^{-1/2} \mathbf{A} \,\mathbf{D}^{-1/2}$ where $$\mathbf{D}_{ii} = \sum_j \mathbf{A}_{ij}$$ is the degree of node $$i$$.

Another choice for $$\hat{\mathbf{A}}$$, which applies to directed as well as undirected graphs, is: $\hat{\mathbf{A}} = \mathbf{D}^{-1} \mathbf{A}$ Some works add an auxiliary self-loop to the adjacency matrix, so that the normalized adjacency matrix becomes: $\hat{\mathbf{A}} = \mathbf{D}^{-1/2} (\mathbf{A} + \lambda\,\mathbf{I}) \,\mathbf{D}^{-1/2}$ or (used by Zhang et al. (2018)): $\hat{\mathbf{A}} = \mathbf{D}^{-1} (\mathbf{A} + \lambda\,\mathbf{I})$
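As a concrete sketch, the normalization variants above and the basic layer can be written in plain NumPy (function names are illustrative, not from any of the cited papers; here the degree used in the self-loop variant is taken from $$\mathbf{A} + \lambda\mathbf{I}$$ so that rows sum to one):

```python
import numpy as np

def sym_norm(A):
    """Symmetric normalization: D^{-1/2} A D^{-1/2}."""
    d = A.sum(axis=1)  # D_ii = sum_j A_ij
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(np.maximum(d, 1e-12)), 0.0)
    return d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def rw_norm_with_self_loop(A, lam=1.0):
    """Random-walk normalization with self-loops: D^{-1} (A + lam * I)."""
    A_hat = A + lam * np.eye(A.shape[0])
    return A_hat / A_hat.sum(axis=1, keepdims=True)

def gcn_layer(A_hat, H, W, f=np.tanh):
    """One GCN layer: H^{(l+1)} = f(A_hat @ H^{(l)} @ W^{(l)})."""
    return f(A_hat @ H @ W)
```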

### Heterogeneous Graph with Various Edge Relationships

If the links between nodes represent different relationships ($$R = \{r\}$$), each relationship may have its own kernel $$\mathbf{W}_r^{(l)}$$, such that the GCN layer can be formulated as (Schlichtkrull et al. 2017): $\mathbf{h}^{(l+1)}_i = f\left(\sum_{r \in R} \sum_{j \in \mathcal{N}^r_i} \frac{1}{c_{ij}} \mathbf{W}_r^{(l)} \mathbf{h}^{(l)}_j + \mathbf{W}_0^{(l)} \mathbf{h}^{(l)}_i \right)$

where $$c_{ij} = \left| \mathcal{N}_i^r \right|$$ is the normalizing factor.
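A minimal dense sketch of this relational layer, assuming one binary adjacency matrix and one kernel per relation (names are illustrative):

```python
import numpy as np

def rgcn_layer(H, A_r, W_r, W0, f=np.tanh):
    """Relational GCN layer (after Schlichtkrull et al. 2017), dense sketch.

    H   : (n, k) node features
    A_r : list of (n, n) binary adjacency matrices, one per relation r
    W_r : list of (k, k') kernels, one per relation r
    W0  : (k, k') self-loop kernel
    """
    out = H @ W0  # self-loop term W0 h_i
    for A, W in zip(A_r, W_r):
        deg = A.sum(axis=1, keepdims=True)                # c_ij = |N_i^r|
        norm = np.where(deg > 0, 1.0 / np.maximum(deg, 1.0), 0.0)
        out = out + norm * (A @ (H @ W))                  # (1/c_ij) sum_j W_r h_j
    return f(out)
```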

### Heterogeneous Graph with Various Node Types

If the nodes are grouped into different types, i.e., node $$v_i$$ has type $$t_i$$, then each type can have its own kernel $$\mathbf{W}_{t}^{(l)}$$, and the GCN layer can be formulated as: \begin{align} \mathbf{h}^{(l+1)}_i &= f\left( \sum_{j \in \mathcal{N}_{i}} c_{ij} \, g(\mathbf{h}_j^{(l)}) + g(\mathbf{h}_i^{(l)}) \right) \\ g(\mathbf{h}_i^{(l)}) &= \mathbf{W}_{t_i}^{(l)} \mathbf{h}_i^{(l)} \end{align} as proposed by Wang et al. (2019). For the normalizing factor $$c_{ij}$$, Wang et al. (2019) also proposed to learn an attention score by self-attention: \begin{align} c_{ij}^{(l)} &= \frac{\exp\left( s_{ij} \right)}{\sum_{\tilde{j} \in \mathcal{N}_i^{(l)}} \exp\left( s_{i\tilde{j}} \right)} \\ s_{ij} &= g(\mathbf{h}_i^{(l)})^{\top} \cdot g(\mathbf{h}_j^{(l)}) \end{align} where $$s_{ij}$$ is the attention score.
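A dense sketch of the type-specific layer with the dot-product attention above (hypothetical helper names; `W_t` maps each type id to a square kernel so feature shapes stay fixed):

```python
import numpy as np

def hetero_layer(H, A, types, W_t, f=np.tanh):
    """Type-specific GCN layer with dot-product attention (after Wang et al. 2019)."""
    n = H.shape[0]
    # g(h_i) = W_{t_i} h_i, one type-specific kernel per node
    G = np.stack([W_t[types[i]] @ H[i] for i in range(n)])
    S = G @ G.T                                  # s_ij = g(h_i)^T g(h_j)
    out = np.empty_like(G)
    for i in range(n):
        nbrs = np.flatnonzero(A[i])
        msg = G[i].copy()                        # self term g(h_i)
        if nbrs.size:
            a = np.exp(S[i, nbrs] - S[i, nbrs].max())
            a = a / a.sum()                      # softmax attention c_ij
            msg = msg + a @ G[nbrs]
        out[i] = f(msg)
    return out
```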

### LSTM Aggregation with Random Permutation of Neighborhood Nodes

Hamilton, Ying, and Leskovec (2017) proposed a candidate neighborhood-information aggregation method, based on an LSTM applied to a random permutation of the neighborhood nodes.

### Neighborhood Sampling Instead of Summation

Hamilton, Ying, and Leskovec (2017) proposed to use neighborhood sampling to reduce the training computation time.

### Monte-Carlo Approximation Instead of Summation

Chen, Ma, and Xiao (2018) proposed to view the summation over nodes within one GCN layer as an expectation, formulated as: $\mathbf{h}^{(l+1)}(v) = f\left( \int \hat{\mathbf{A}}(v,u) \,\mathbf{h}^{(l)}(u)\,\mathbf{W}^{(l)}\,dP(u) \right)$ where $$u$$ and $$v$$ represent nodes, and $$P(u)$$ is taken to be a uniform distribution. Based on this formulation, the authors proposed the proposal distribution: $q(u) = \left\| \hat{\mathbf{A}}(:,u) \right\|^2 / \sum_{u'} \left\| \hat{\mathbf{A}}(:,u') \right\|^2$ i.e., the squared norm of the column of $$\hat{\mathbf{A}}$$ associated with node $$u$$ (reflecting node $$u$$'s out-going connectivity), normalized over all nodes. The proposal distribution is used to sample $$t_l$$ i.i.d. nodes for each GCN layer, namely $$u_1^{(l)}, \dots, u_{t_l}^{(l)}$$, and $$\mathbf{h}^{(l+1)}$$ is estimated by: $\mathbf{h}^{(l+1)}(v) \approx f\left( \frac{1}{t_l}\sum_{j=1}^{t_l} \frac{\hat{\mathbf{A}}(v,u_j^{(l)})\,\mathbf{h}^{(l)}(u_j^{(l)})\,\mathbf{W}^{(l)}}{q(u_j^{(l)})} \right)$
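The importance-sampled estimator can be sketched as follows (a toy dense version for one layer; in FastGCN proper this is applied per mini-batch, and the function name is illustrative):

```python
import numpy as np

def fastgcn_layer(A_hat, H, W, t, f=np.tanh, rng=None):
    """Monte-Carlo estimate of f(A_hat @ H @ W) via importance sampling."""
    rng = np.random.default_rng() if rng is None else rng
    n = A_hat.shape[0]
    q = np.linalg.norm(A_hat, axis=0) ** 2   # ||A_hat(:, u)||^2 per column
    q = q / q.sum()                          # proposal distribution q(u)
    idx = rng.choice(n, size=t, p=q)         # t i.i.d. sampled nodes
    # (1/t) * sum_j A_hat(:, u_j) h(u_j) W / q(u_j)
    return f((A_hat[:, idx] / q[idx]) @ (H[idx] @ W) / t)
```

Before the activation, the estimator is unbiased: averaging over many samples recovers $$\hat{\mathbf{A}} \mathbf{H} \mathbf{W}$$.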

## Learning the Weight of Message Passing

### Self-Attention

Instead of relying on the fixed adjacency matrix to weight neighborhood information in graph convolution networks, Veličković et al. (2017) proposed a self-attention mechanism to learn these weights.

The attention coefficients are calculated by: $e_{ij} = a\left(\mathbf{W} \mathbf{h}_i, \mathbf{W} \mathbf{h}_j\right)$ where $$a(\cdot)$$ is the attention network. $$e_{ij}$$ is formulated by Veličković et al. (2017) as follows: $e_{ij} = a\left(\mathbf{W} \mathbf{h}_i, \mathbf{W} \mathbf{h}_j\right) = \mathrm{LeakyReLU}\left({\mathbf{a}}^{\top}\left[ \mathbf{W} \mathbf{h}_i \big\| \mathbf{W} \mathbf{h}_j \right]\right)$ This attention score is further normalized as: $\alpha_{ij} = \frac{\exp\left(e_{ij}\right)}{\sum_{k \in \mathcal{N}_i} \exp\left(e_{ik}\right)} = \frac{\exp\left(\mathrm{LeakyReLU}\left({\mathbf{a}}^{\top}\left[ \mathbf{W} \mathbf{h}_i \big\| \mathbf{W} \mathbf{h}_j \right]\right)\right)}{\sum_{k \in \mathcal{N}_i} \exp\left(\mathrm{LeakyReLU}\left({\mathbf{a}}^{\top}\left[ \mathbf{W} \mathbf{h}_i \big\| \mathbf{W} \mathbf{h}_k \right]\right)\right)}$ The final output features of this layer are: $\mathbf{h}' = \sigma\left( \sum_{j \in \mathcal{N}_i} \alpha_{ij} \mathbf{W} \mathbf{h}_j \right)$ If using multi-head attention, the output features can be formulated as either one of the following (i.e., concatenated or averaged multi-head attention): \begin{align} \mathbf{h}' &= \mathop{\bigg\|}_{k=1}^K \sigma\left( \sum_{j \in \mathcal{N}_i} \alpha^k_{ij} \mathbf{W}^k \mathbf{h}_j \right) \\ \mathbf{h}' &= \sigma\left( \frac{1}{K} \sum_{k=1}^K \sum_{j \in \mathcal{N}_i} \alpha^k_{ij} \mathbf{W}^k \mathbf{h}_j \right) \end{align}

where the second formulation is adopted as the output layer.
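A single-head sketch of the attention computation, splitting $$\mathbf{a}$$ into source and destination halves so that $$\mathbf{a}^{\top}[\mathbf{W}\mathbf{h}_i \| \mathbf{W}\mathbf{h}_j]$$ can be evaluated for all pairs at once (assumes every node has at least one neighbor):

```python
import numpy as np

def leaky_relu(x, alpha=0.2):
    return np.where(x > 0, x, alpha * x)

def gat_layer(H, A, W, a, sigma=np.tanh):
    """Single-head graph attention layer (after Veličković et al. 2017)."""
    Z = H @ W                                             # rows are W h_i
    kp = Z.shape[1]
    # e_ij = LeakyReLU(a^T [W h_i || W h_j]) for all pairs (i, j)
    e = leaky_relu((Z @ a[:kp])[:, None] + (Z @ a[kp:])[None, :])
    e = np.where(A > 0, e, -np.inf)                       # restrict to N_i
    alpha = np.exp(e - e.max(axis=1, keepdims=True))      # stable softmax
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return sigma(alpha @ Z), alpha
```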

## Node Embedding

The feature vector $$\mathbf{h}^{(L)}$$ after the $$L$$-th GCN layer can be used as the node embedding (Hamilton, Ying, and Leskovec 2017; Schlichtkrull et al. 2017). The unsupervised loss, using negative sampling, should be: $\mathcal{L} = \sum_{v_i \in V} \sum_{v_j \in N(v_i)} \left[ \log \frac{\exp(\mathbf{h}^{(L)}_i \cdot \mathbf{h}^{(L)}_j)}{\exp(\mathbf{h}^{(L)}_i \cdot \mathbf{h}^{(L)}_j) + 1} + \sum_{\tilde{v}_n \sim q(v) \\ n = 1 \dots k} \log\frac{1}{\exp(\mathbf{h}^{(L)}_i \cdot \tilde{\mathbf{h}}^{(L)}_n) + 1} \right]$ the same as the negative sampling method for node embedding. This embedding loss can be used alone in a fully unsupervised task, or as an auxiliary regularization term for a supervised task, alongside the main objective of the specific task (Hamilton, Ying, and Leskovec 2017).
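Note that $$\log \frac{\exp(x)}{\exp(x)+1} = \log \sigma(x)$$ and $$\log \frac{1}{\exp(x)+1} = \log \sigma(-x)$$, so the objective is the familiar log-sigmoid form. Written as a loss to minimize (the negation of the objective above), a sketch over precomputed embeddings, with `pos_pairs` and `neg_samples` as assumed inputs:

```python
import numpy as np

def embedding_loss(H, pos_pairs, neg_samples):
    """Negative-sampling embedding loss over final-layer embeddings H.

    pos_pairs   : list of (i, j) with v_j in N(v_i)
    neg_samples : dict i -> list of negative node indices drawn from q(v)
    """
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    loss = 0.0
    for i, j in pos_pairs:
        loss -= np.log(sigmoid(H[i] @ H[j]))          # -log sigma(h_i . h_j)
        for n in neg_samples[i]:
            loss -= np.log(sigmoid(-(H[i] @ H[n])))   # -log sigma(-h_i . h_n)
    return loss
```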

## Pooling

### gPool

Cangea et al. (2018) proposed using a projection score to sort the nodes and keep the top-$$\lceil kN \rceil$$ nodes after pooling, where $$k \in (0, 1]$$ is the pooling ratio and $$N$$ is the number of input nodes. The projection score $$\mathbf{Z}$$ and the pooling method are formulated as: $\mathbf{Z} = \frac{\mathbf{X} \mathbf{p}}{\left\| \mathbf{p} \right\|_2} \qquad \overrightarrow{i} = \mathop{\text{top-rank}}\left( \mathbf{Z}, \lceil kN \rceil \right) \qquad \mathbf{X}' = \left( \mathbf{X} \odot \tanh(\mathbf{Z}) \right)_{\overrightarrow{i}} \qquad \mathbf{A}' = \mathbf{A}_{\overrightarrow{i},\overrightarrow{i}}$ where $$\mathbf{X}$$ is the feature matrix of the input nodes, $$\mathbf{p}$$ is the learnable projection parameter, and $$\cdot_{\overrightarrow{i}}$$ is the indexing operation.

Cangea et al. (2018) proposed a similar architecture, named gPool, with $$\tanh(\mathbf{Z})$$ in $$\mathbf{X}' = \left( \mathbf{X} \odot \tanh(\mathbf{Z}) \right)_{\overrightarrow{i}}$$ replaced by $$\text{sigmoid}(\mathbf{Z})$$, mimicking the commonly adopted gating mechanism.

Note: $$\tanh(\mathbf{Z})$$ or $$\text{sigmoid}(\mathbf{Z})$$ enables the gradient to flow into $$\mathbf{p}$$. Without this term, $$\mathbf{p}$$ would only perform a discrete element selection, which cannot be trained by stochastic gradient descent.
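A NumPy sketch of the whole pooling step (names illustrative; the `gate` argument selects between the tanh and sigmoid variants):

```python
import numpy as np

def topk_pool(X, A, p, k=0.5, gate=np.tanh):
    """Projection-score top-k pooling: keep the ceil(kN) highest-scoring nodes."""
    z = X @ p / np.linalg.norm(p)           # projection scores Z
    n_keep = int(np.ceil(k * X.shape[0]))
    idx = np.argsort(-z)[:n_keep]           # top-rank(Z, ceil(kN))
    X_new = X[idx] * gate(z[idx])[:, None]  # (X ⊙ gate(Z)) indexed by i
    A_new = A[np.ix_(idx, idx)]             # A restricted to the kept nodes
    return X_new, A_new, idx
```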

### Self-Attention Graph Pooling

Lee, Lee, and Kang (2019) proposed to improve projection-score-based pooling by using a self-attention mechanism to calculate the projection score. The new projection score is: $\mathbf{Z} = f(\mathop{\mathrm{GNN}}(\mathbf{X},\mathbf{A}))$ where $$f$$ is the activation function, $$\mathbf{X}$$ is the node feature matrix to be pooled, and $$\mathbf{A}$$ is the adjacency matrix, which may be normalized. $$\mathop{\mathrm{GNN}}$$ denotes a generic graph network layer, and the authors proposed the following variants:

• $$\mathbf{Z} = f\left(\tilde{\mathbf{D}}^{-1/2}\,\tilde{\mathbf{A}}\,\tilde{\mathbf{D}}^{-1/2}\,\mathbf{X}\,\mathbf{p}\right)$$, where $$\tilde{\mathbf{A}} = \mathbf{A}+\mathbf{I}$$ is the adjacency matrix with self-loops and $$\tilde{\mathbf{D}}$$ is the degree matrix of $$\tilde{\mathbf{A}}$$.
• $$\mathbf{Z} = f\left(\mathop{\mathrm{GNN}}(\mathbf{X},\mathbf{A}+\mathbf{A}^2)\right)$$, which considers the second-order neighborhood by augmenting the adjacency matrix $$\mathbf{A}$$ with 2-hop connections.
• $$\mathbf{Z} = f\left(\mathop{\mathrm{GNN}_2}(\mathop{\mathrm{GNN}_1}(\mathbf{X},\mathbf{A}),\mathbf{A})\right)$$, which considers the second-order neighborhood by stacking GNN layers.
• $$\mathbf{Z}=\frac{1}{M}\sum_{m} f\left( \mathop{\mathrm{GNN}_m}(\mathbf{X},\mathbf{A}) \right)$$, which uses an average of multiple attention scores.
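The first variant can be sketched as a one-layer GCN over $$\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$$ producing a scalar score per node (function name is illustrative):

```python
import numpy as np

def sagpool_scores(X, A, p, f=np.tanh):
    """Self-attention pooling scores, first GNN variant: Z = f(D^-1/2 Ã D^-1/2 X p)."""
    A_t = A + np.eye(A.shape[0])             # Ã = A + I
    d_is = 1.0 / np.sqrt(A_t.sum(axis=1))    # D̃^{-1/2}; d > 0 thanks to self-loops
    A_hat = d_is[:, None] * A_t * d_is[None, :]
    return f(A_hat @ (X @ p))                # one score per node
```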

### Hierarchical Pooling vs Global Pooling

Lee, Lee, and Kang (2019) found that hierarchical pooling seems to work better on large graphs, while global pooling works better on small graphs.

### Combine the Global Average and Max Pooling

Cangea et al. (2018) proposed a graph readout layer that combines both the global average and global max pooling techniques. Specifically, for the output graph of every $$l$$-th pooling layer $$(\mathbf{X}^{(l)}, \mathbf{A}^{(l)})$$, the readout layer produces the summarized output by: $\mathbf{s}^{(l)} = \mathop{\mathrm{Concat}}\left(\frac{1}{N^{(l)}} \sum_{i=1}^{N^{(l)}} \mathbf{x}^{(l)}_i \Bigg\|\max_{i=1}^{N^{(l)}} \mathbf{x}^{(l)}_i\right)$ The summarized outputs of all pooling layers $$\mathbf{s}^{(1)}, \dots, \mathbf{s}^{(L)}$$ are then summed together: $\mathbf{s} = \sum_{l=1}^L \mathbf{s}^{(l)}$ as the final aggregated feature of the graph. The whole structure is shown in fig. 2.
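The readout and aggregation steps reduce to a few lines (assuming a list of per-level feature matrices; names are illustrative):

```python
import numpy as np

def readout(X):
    """Per-level summary s^(l): concat of global mean and global max over nodes."""
    return np.concatenate([X.mean(axis=0), X.max(axis=0)])

def aggregate(readouts):
    """Sum the per-level summaries s^(1..L) into the final graph feature s."""
    return np.sum(readouts, axis=0)
```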

## Practical GCN Architectures

### Modeling Relational Data with Graph Convolutional Networks [2017]

Schlichtkrull et al. (2017) proposed the following formulation for their GCN layer with regard to node relationships: $\mathbf{h}^{(l+1)}_i = f\left(\sum_r \sum_{j \in \mathcal{N}^r_i} \frac{1}{c_{ij}} \mathbf{W}_r^{(l)} \mathbf{h}^{(l)}_j + \mathbf{W}_0^{(l)} \mathbf{h}^{(l)}_i \right)$ where the term $$\mathbf{W}_0^{(l)} \mathbf{h}^{(l)}_i$$ represents the information passed along the self-loop from the $$l$$-th layer to the $$(l+1)$$-th layer. $$c_{ij}$$ is the normalizing factor, chosen to be $$\left| \mathcal{N}_i^r \right|$$ (i.e., the incoming degree of node $$i$$ under relation $$r$$) in their paper.

To avoid having too many parameters, which may be prone to over-fitting, Schlichtkrull et al. (2017) also proposed two methods (basis decomposition and block-diagonal decomposition) to regularize the weights $$\mathbf{W}^{(l)}_r$$ of different relationships.

For link prediction, Schlichtkrull et al. (2017) proposed to use the GCN output at the $$L$$-th layer $$\mathbf{h}_i^{(L)}$$ as the embedding $$\mathbf{e}_i$$ of nodes, deriving the score function as: $f(i,r,j) = \mathbf{e}_i^{\top}\mathbf{R}_r\mathbf{e}_j$ where $$\mathbf{R}_r$$ is a learned diagonal matrix, representing the "dot-product" between $$\mathbf{e}_i$$ and $$\mathbf{e}_j$$ under the relationship $$r$$. The whole graph is then trained by negative sampling, using a loss of the form given in the Node Embedding section.
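With a diagonal $$\mathbf{R}_r$$ (a DistMult-style factorization), the score function reduces to an elementwise-weighted dot product:

```python
import numpy as np

def link_score(e_i, r_diag, e_j):
    """f(i, r, j) = e_i^T diag(r) e_j = sum_d e_i[d] * r[d] * e_j[d]."""
    return float(e_i @ (r_diag * e_j))
```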

# References

Cangea, Cătălina, Petar Veličković, Nikola Jovanović, Thomas Kipf, and Pietro Liò. 2018. “Towards Sparse Hierarchical Graph Classifiers.” arXiv Preprint arXiv:1811.01287.

Chen, Jie, Tengfei Ma, and Cao Xiao. 2018. “FastGCN: Fast Learning with Graph Convolutional Networks via Importance Sampling.” arXiv:1801.10247 [Cs], January. http://arxiv.org/abs/1801.10247.

Hamilton, Will, Zhitao Ying, and Jure Leskovec. 2017. “Inductive Representation Learning on Large Graphs.” In Advances in Neural Information Processing Systems, 1024–34.

Lee, Junhyun, Inyeop Lee, and Jaewoo Kang. 2019. “Self-Attention Graph Pooling.” arXiv Preprint arXiv:1904.08082.

Schlichtkrull, Michael, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2017. “Modeling Relational Data with Graph Convolutional Networks.” arXiv:1703.06103 [Cs, Stat], October. http://arxiv.org/abs/1703.06103.

Veličković, Petar, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. “Graph Attention Networks.” arXiv Preprint arXiv:1710.10903.

Wang, Yueyang, Ziheng Duan, Binbing Liao, Fei Wu, and Yueting Zhuang. 2019. “Heterogeneous Attributed Network Embedding with Graph Convolutional Networks.” In Proceedings of the AAAI Conference on Artificial Intelligence, 33:10061–2.

Wu, Zonghan, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. 2019. “A Comprehensive Survey on Graph Neural Networks.” arXiv:1901.00596 [Cs, Stat], August. http://arxiv.org/abs/1901.00596.

Yao, Liang, Chengsheng Mao, and Yuan Luo. 2019. “Graph Convolutional Networks for Text Classification.” In Proceedings of the AAAI Conference on Artificial Intelligence, 33:7370–7.

Zhang, Muhan, Zhicheng Cui, Marion Neumann, and Yixin Chen. 2018. “An End-to-End Deep Learning Architecture for Graph Classification.” In Thirty-Second AAAI Conference on Artificial Intelligence.