Gradient Estimators for Variational Inference

In complex Bayesian networks, especially for deep Bayesian networks, it is often computational intractable to compute some “posterior distributions” (typically along the opposite directions of links, for example, \(p_{\theta}(\mathbf{z}|\mathbf{x})\) in the main network of the right figure). These posteriors are often required in both training and testing. In such cases, variational inference techniques are often adopted, to approximate the intractable posterior by another network (e.g., \(q_{\phi}(\mathbf{z}|\mathbf{x})\) to approximate \(p_{\theta}(\mathbf{z}|\mathbf{x})\) in the right figure). Such “other networks” are often called the “variational networks” or “inference networks”.

Note that also we only present a very basic form of Bayesian networks in this page (i.e., having just one visible variable \(\mathbf{x}\) and one latent variable \(\mathbf{z}\)), the formula of this page can be extended to multiple visible and latent variables easily, by treating all visible variables as \(\mathbf{x}\) and all latent variables as \(\mathbf{z}\).

Variational Lower Bounds

When training a Bayesian network using variational inference, the state-of-the-arts technique is to construct a lower bound \(\mathcal{L}(\mathbf{x};\theta,\phi)\) for \(\log p_{\theta}(\mathbf{x})\). When maximizing \(\mathbb{E}_{\mathbf{x}\sim p_{data}(\mathbf{x})}\left[\mathcal{L}(\mathbf{x};\theta,\phi)\right]\) , it simultaneously do the following two things: (1) maximize the joint log-likelihood \(\mathbb{E}_{\mathbf{x} \sim p_{data}(\mathbf{x}),\mathbf{z}\sim p_{\theta}(\mathbf{z})}\left[\log p_{\theta}(\mathbf{x},\mathbf{z})\right]\), and (2) let \(q_{\phi}(\mathbf{z}|\mathbf{x})\) to approximate \(p_{\theta}(\mathbf{z}|\mathbf{x})\).

Evidence Lower Bound (ELBO)

The “Evidence Lower Bound” \(\mathcal{L}(\mathbf{x};\theta,\phi)\) is deduced by:

\[ \begin{aligned} \log p_{\theta}(\mathbf{x}) &\geq \log p_{\theta}(\mathbf{x}) - \operatorname{D}_{KL}\big[ q_{\phi}(\mathbf{z}|\mathbf{x})\|p_{\theta}(\mathbf{z}|\mathbf{x}) \big] \\ &= \mathbb{E}_{\mathbf{z} \sim q_{\phi}(\mathbf{z}|\mathbf{x})}\big[ \log p_{\theta}(\mathbf{x}) + \log p_{\theta}(\mathbf{z}|\mathbf{x}) - \log q_{\phi}(\mathbf{z}|\mathbf{x}) \big] \\ &= \mathbb{E}_{\mathbf{z} \sim q_{\phi}(\mathbf{z}|\mathbf{x})}\big[ \log p_{\theta}(\mathbf{x},\mathbf{z}) - \log q_{\phi}(\mathbf{z}|\mathbf{x}) \big] \\ &= \mathcal{L}(\mathbf{x};\theta,\phi) \end{aligned} \]

Monte Carlo Objective

The “Monto Carlo Objective” is an importance sampling based variational lower-bound, given by:

\[ \mathcal{L}_{K}(\mathbf{x};\theta,\phi) = \mathbb{E}_{\mathbf{z}^{(1:K)} \sim q_{\phi}(\mathbf{z}|\mathbf{x})}\Bigg[ \log \frac{1}{K} \sum_{k=1}^K { \frac{p_{\theta}(\mathbf{x},\mathbf{z}^{(k)})} {q_{\phi}(\mathbf{z}^{(k)}|\mathbf{x})} } \Bigg] \]

where \(\mathbf{z}^{(1:K)}\) are \(K\) independent samples of \(\mathbf{z}\) from \(q_{\phi}(\mathbf{z}|\mathbf{x})\). It is proven by Burda, Grosse, and Salakhutdinov (2015) that:

\[ \log p_{\theta}(\mathbf{x}) \geq \mathcal{L}_K(\theta,\phi) \geq \mathcal{L}_M(\theta,\phi)$ for $K \geq M$, and $\lim_{K \to \infty} \mathcal{L}_K (\theta,\phi) = \log p_{\theta}(\mathbf{x}) \]

Gradient Estimators

To optimize a variational lower bound, especially for deep Bayesian networks, stochastic gradient descent is often adopted. However, it is not straightforward to compute the gradient of an expectation. The gradient estimators thus serve to compute the gradients for the variational lower bounds.

For simplicity, we shall omit the subscripts in this section, thus the gradient operator \(\nabla\) should be applied on both \(\theta\) and \(\phi\), and \(f(\mathbf{x},\mathbf{z})\) should have both two sets of parameters.

SGVB

Kingma and Welling (2014) proposes that, if \(\mathbf{z}\) can be reparameterized by \(\mathbf{z} = \mathbf{z}(\epsilon)\), where \(\mathbf{\epsilon}\) is another random variable independent of \(\theta\) and \(\phi\), and \(\mathbf{z}(\mathbf{\epsilon})\) is a continuous, differentiable mapping, then the gradient estimator for \(\mathcal{L}(\mathbf{x};\theta,\phi)=\mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\big[f(\mathbf{x},\mathbf{z})\big]\) is given by:

\[ \nabla \, \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\big[f(\mathbf{x},\mathbf{z})\big] = \nabla \, \mathbb{E}_{q(\mathbf{\epsilon})}\big[f(\mathbf{x},\mathbf{z}(\mathbf{\epsilon}))\big] = \mathbb{E}_{q(\mathbf{\epsilon})}\big[\nabla f(\mathbf{x},\mathbf{z}(\mathbf{\epsilon}))\big] \]

IWAE

Burda, Grosse, and Salakhutdinov (2015) extends SGVB to the Monte Carlo objective. Let \(w_k = f\big(\mathbf{x},\mathbf{z}(\mathbf{\epsilon}^{(k)})\big)\), and \(\widetilde{w}_k = w_k / \sum_{i=1}^K w_i\), the gradient estimator for \(\mathcal{L}_K(\mathbf{x};\theta,\phi)=\mathbb{E}_{q(\mathbf{z}^{(1:K)}|\mathbf{x})}\Big[\log \frac{1}{K} \sum_{k=1}^K f\big(\mathbf{x},\mathbf{z}^{(k)}\big)\Big]\) is deduced by:

\[ \begin{aligned} &\nabla\,\mathbb{E}_{q(\mathbf{z}^{(1:K)}|\mathbf{x})}\Big[\log \frac{1}{K} \sum_{k=1}^K f\big(\mathbf{x},\mathbf{z}^{(k)}\big)\Big] = \nabla \, \mathbb{E}_{q(\mathbf{\epsilon}^{(1:K)})}\Bigg[\log \frac{1}{K} \sum_{k=1}^K w_k\Bigg] = \mathbb{E}_{q(\mathbf{\epsilon}^{(1:K)})}\Bigg[\nabla \log \frac{1}{K} \sum_{k=1}^K w_k\Bigg] = \\ & \quad \mathbb{E}_{q(\mathbf{\epsilon}^{(1:K)})}\Bigg[\frac{\nabla \frac{1}{K} \sum_{k=1}^K w_k}{\frac{1}{K} \sum_{i=1}^K w_i}\Bigg] = \mathbb{E}_{q(\mathbf{\epsilon}^{(1:K)})}\Bigg[\frac{\sum_{k=1}^K w_k \nabla \log w_k}{\sum_{i=1}^K w_i}\Bigg] = \mathbb{E}_{q(\mathbf{\epsilon}^{(1:K)})}\Bigg[\sum_{k=1}^K \widetilde{w}_k \nabla \log w_k\Bigg] \end{aligned} \]

NVIL

Mnih and Gregor (2014) proposes a variant of REINFORCE, choosing the gradient estimator for \(\mathcal{L}(\mathbf{x};\theta,\phi)=\mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\big[f(\mathbf{x},\mathbf{z})\big]\) as:

\[ \begin{aligned} \nabla \, \mathbb{E}_{q(\mathbf{z}|\mathbf{x})} \big[f(\mathbf{x},\mathbf{z})\big] &= \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\Big[ \nabla f(\mathbf{x},\mathbf{z}) + f(\mathbf{x},\mathbf{z})\,\nabla\log q(\mathbf{z}|\mathbf{x})\Big] \\ &= \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\Big[ \nabla f(\mathbf{x},\mathbf{z}) + \big(f(\mathbf{x},\mathbf{z}) - C_{\psi}(\mathbf{x})-c\big)\,\nabla\log q(\mathbf{z}|\mathbf{x})\Big] \end{aligned} \]

\(C_{\psi}(\mathbf{x})\) is a learnable network with parameter \(\psi\), and \(c\) is a learnable constant. They should be learnt by minimizing \(\mathbb{E}_{ q(\mathbf{z}|\mathbf{x}) }\Big[\big(f(\mathbf{x},\mathbf{z}) - C_{\psi}(\mathbf{x})-c\big)^2 \Big]\), leading to the gradient estimator for \(\psi\):

\[ \nabla_{\psi} \, \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\Big[\big(f(\mathbf{x},\mathbf{z})-C_{\psi}(\mathbf{x})-c\big)^2\Big] = \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\Big[-2\, \big(f(\mathbf{x},\mathbf{z})-C_{\psi}(\mathbf{x})-c\big) \, \nabla_{\psi} \, C_{\psi}(\mathbf{x})\Big] \]

The NVIL algorithm glues these two objectives together with a proper coefficient, and performs SGD on them together.

VIMCO

Mnih and Rezende (2016) suggests to use the unrelated samples of the Monte Carlo objective, instead of the dedicated learnable \(C_{\psi}(\mathbf{x})\) and \(c\), to centralize the learning signals. The VIMCO gradient estimator for \(\mathcal{L}_K(\mathbf{x};\theta,\phi)=\mathbb{E}_{q(\mathbf{z}^{(1:K)}|\mathbf{x})}\Big[\log \frac{1}{K} \sum_{k=1}^K f\big(\mathbf{x},\mathbf{z}^{(k)}\big)\Big]\) is deduced by:

\[ \begin{aligned} &\nabla\,\mathbb{E}_{q(\mathbf{z}^{(1:K)}|\mathbf{x})}\Big[\log \frac{1}{K} \sum_{k=1}^K f\big(\mathbf{x},\mathbf{z}^{(k)}\big)\Big] \\ &\quad = \mathbb{E}_{q(\mathbf{z}^{(1:K)}|\mathbf{x})}\bigg[{\sum_{k=1}^K \hat{L}(\mathbf{z}^{(k)}|\mathbf{z}^{(-k)}) \, \nabla \log q(\mathbf{z}^{(k)}|\mathbf{x})}\bigg] + \mathbb{E}_{q(\mathbf{z}^{(1:K)}|\mathbf{x})}\bigg[{\sum_{k=1}^K \widetilde{w}_k\,\nabla\log f(\mathbf{x},\mathbf{z}^{(k)})}\bigg] \end{aligned} \]

where \(w_k = f\big(\mathbf{x},\mathbf{z}^{(k)}\big)\), \(\widetilde{w}_k = w_k / \sum_{i=1}^K w_i\), and:

\[ \begin{aligned} \hat{L}(\mathbf{z}^{(k)}|\mathbf{z}^{(-k)}) &= \hat{L}(\mathbf{z}^{(1:K)}) - \log \frac{1}{K} \bigg(\hat{f}(\mathbf{x},\mathbf{z}^{(-k)})+\sum_{i \neq k} f(\mathbf{x},\mathbf{z}^{(i)})\bigg) \\ \hat{L}(\mathbf{z}^{(1:K)}) &= \log \frac{1}{K} \sum_{k=1}^K f(\mathbf{x},\mathbf{z}^{(k)}) \\ \hat{f}(\mathbf{x},\mathbf{z}^{(-k)}) &= \exp\big(\frac{1}{K-1} \sum_{i \neq k} \log f(\mathbf{x},\mathbf{z}^{(i)})\big) \end{aligned} \]

References

Burda, Yuri, Roger Grosse, and Ruslan Salakhutdinov. 2015. “Importance Weighted Autoencoders.” arXiv Preprint arXiv:1509.00519.

Kingma, Diederik P, and Max Welling. 2014. “Auto-Encoding Variational Bayes.” In Proceedings of the International Conference on Learning Representations.

Mnih, Andriy, and Karol Gregor. 2014. “Neural Variational Inference and Learning in Belief Networks.” arXiv Preprint arXiv:1402.0030.

Mnih, Andriy, and Danilo Rezende. 2016. “Variational Inference for Monte Carlo Objectives.” In PMLR, 2188–96.