Energy Function in Probabilistic Models

This post summarizes the relationship between an energy function and the probabilistic model it induces.

Common Formulation

The probabilistic model derived from an energy function can be formulated as follows.

Gibbs Distribution

Given an energy function \(U(\mathbf{x};\theta)\) with parameters \(\theta\), the corresponding probability distribution can be derived as: \[ \begin{align} p(\mathbf{x};\theta) &= \frac{1}{Z(\theta)}\,\exp\left( -U(\mathbf{x};\theta) \right) \\ Z(\theta) &= \int \exp\left( -U(\mathbf{x};\theta) \right)\,\mathrm{d}\mathbf{x} \end{align} \] The gradient of \(\mathbb{E}_{p_D(\mathbf{x})}\left[ -\log p(\mathbf{x};\theta) \right]\) (i.e., the expectation of the negative log-likelihood \(-\log p(\mathbf{x};\theta)\) over the data distribution \(p_D(\mathbf{x})\)) is then derived as: \[ \begin{align} \nabla \mathbb{E}_{p_D(\mathbf{x})}\left[ -\log p(\mathbf{x};\theta) \right] &= \mathbb{E}_{p_D(\mathbf{x})}\left[ -\nabla \log p(\mathbf{x};\theta) \right] \\ &= \mathbb{E}_{p_D(\mathbf{x})} \left[ \nabla U(\mathbf{x};\theta) + \nabla \log Z(\theta) \right] \\ &= \mathbb{E}_{p_D(\mathbf{x})} \left[ \nabla U(\mathbf{x};\theta) \right] + \nabla \log Z(\theta) \end{align} \]

where \(\nabla \log Z(\theta)\) is: \[ \begin{align} \nabla \log Z(\theta) &= \frac{\nabla Z(\theta)}{Z(\theta)} \\ &= \frac{1}{Z(\theta)} \int \nabla \exp\left( -U(\mathbf{x};\theta) \right)\,\mathrm{d}\mathbf{x} \\ &= \int \frac{\exp\left( -U(\mathbf{x};\theta) \right) / Z(\theta)}{\exp\left( -U(\mathbf{x};\theta) \right)} \nabla \exp\left( -U(\mathbf{x};\theta) \right) \,\mathrm{d}\mathbf{x} \\ &= \int p(\mathbf{x};\theta) \, \nabla \log \exp\left( -U(\mathbf{x};\theta) \right) \,\mathrm{d}\mathbf{x} \\ &= -\int p(\mathbf{x};\theta) \, \nabla U(\mathbf{x};\theta) \,\mathrm{d}\mathbf{x} \\ &= -\mathbb{E}_{p(\mathbf{x};\theta)} \left[ \nabla U(\mathbf{x};\theta) \right] \end{align} \] Thus the final gradient is: \[ \nabla \mathbb{E}_{p_D(\mathbf{x})}\left[ -\log p(\mathbf{x};\theta) \right] = \mathbb{E}_{p_D(\mathbf{x})} \left[ \nabla U(\mathbf{x};\theta) \right] - \mathbb{E}_{p(\mathbf{x};\theta)} \left[ \nabla U(\mathbf{x};\theta) \right] \]
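As a quick sanity check, the following NumPy snippet (my own illustration, not part of any reference) verifies this identity for the one-dimensional energy \(U(x;\theta) = (x-\theta)^2/2\), whose Gibbs distribution is \(\mathcal{N}(\theta, 1)\); for this energy the exact gradient of the expected negative log-likelihood is \(\theta - \mathbb{E}_{p_D}[x]\).

```python
# Numerical check of
#   d/dθ E_{p_D}[-log p(x;θ)] = E_{p_D}[dU/dθ] - E_{p(x;θ)}[dU/dθ]
# for the illustrative 1-D energy U(x;θ) = (x - θ)^2 / 2, whose Gibbs
# distribution is the unit-variance Gaussian N(θ, 1).
import numpy as np

rng = np.random.default_rng(0)
theta = 0.7
data = rng.normal(loc=2.0, scale=1.5, size=200_000)      # samples from p_D(x)
model = rng.normal(loc=theta, scale=1.0, size=200_000)   # samples from p(x;θ)

dU_dtheta = lambda x: theta - x                           # ∂U/∂θ = θ - x

positive_phase = dU_dtheta(data).mean()    # E_{p_D}[∂U/∂θ]
negative_phase = dU_dtheta(model).mean()   # E_{p(x;θ)}[∂U/∂θ]

# Exact gradient of the expected NLL for N(θ, 1): θ - E_{p_D}[x].
exact = theta - data.mean()

# The two numbers agree up to Monte Carlo error.
print(positive_phase - negative_phase, exact)
```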

Positive and Negative Phase

The above gradient consists of the positive phase term \(\mathbb{E}_{p_D(\mathbf{x})} \left[ \nabla U(\mathbf{x};\theta) \right]\) and the negative phase term \(\mathbb{E}_{p(\mathbf{x};\theta)} \left[ \nabla U(\mathbf{x};\theta) \right]\). The gradient vanishes (indicating a stationary point of the expected negative log-likelihood) when these two terms are equal.

If the gradient is not propagated through the samples drawn from \(p(\mathbf{x};\theta)\) (this condition is my own addition, not from the cited paper), then the positive phase term can be seen as minimizing the energy of "positive samples" from the data distribution, and the negative phase term as maximizing the energy of "negative samples" from the model distribution. (Kim and Bengio 2016)

Estimating the negative phase term requires sampling from \(p(\mathbf{x};\theta)\), which often relies on MCMC techniques, for example the Contrastive Divergence algorithm; a sketch of such a training step is given below.
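Below is a minimal, purely illustrative PyTorch sketch of one such training step (my own, not taken from the cited paper). It draws approximate negative samples with a few steps of Langevin dynamics rather than Contrastive Divergence proper, and the names `energy_net`, `langevin_sample`, `train_step` and all hyperparameters are hypothetical. Detaching the sampled \(\mathbf{x}\) is what "blocks the gradient path" through the negative samples, so the surrogate loss \(\mathbb{E}_{p_D}[U] - \mathbb{E}_{p}[U]\) reproduces the positive-minus-negative-phase gradient derived above.

```python
# Sketch of one energy-based-model update with Langevin-sampled negative phase.
import torch

energy_net = torch.nn.Sequential(          # hypothetical energy U(x;θ), x ∈ R^2
    torch.nn.Linear(2, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))
optimizer = torch.optim.Adam(energy_net.parameters(), lr=1e-4)

def langevin_sample(n_steps=30, step_size=0.01, batch=128):
    """Approximate samples from p(x;θ) ∝ exp(-U(x;θ)) via Langevin dynamics."""
    x = torch.randn(batch, 2)
    for _ in range(n_steps):
        x = x.detach().requires_grad_(True)
        grad = torch.autograd.grad(energy_net(x).sum(), x)[0]   # ∇_x U(x;θ)
        x = x - 0.5 * step_size * grad + step_size ** 0.5 * torch.randn_like(x)
    return x.detach()          # block the gradient path through the samples

def train_step(x_data):
    x_model = langevin_sample(batch=x_data.shape[0])
    pos = energy_net(x_data).mean()      # positive phase: energy on data samples
    neg = energy_net(x_model).mean()     # negative phase: energy on model samples
    loss = pos - neg                     # gradient matches the derived identity
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```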

Conditioning and Independence

Definition

The distribution \(p(\mathbf{x},\mathbf{y},\mathbf{z})\) derived from the energy function \(U(\mathbf{x},\mathbf{y},\mathbf{z})\) is: \[ p(\mathbf{x},\mathbf{y},\mathbf{z}) = \frac{\exp\left( -U(\mathbf{x},\mathbf{y},\mathbf{z}) \right)}{\iiint \exp\left( -U(\mathbf{x}^*,\mathbf{y}^*,\mathbf{z}^*) \right) \, \mathrm{d}\mathbf{z}^* \,\mathrm{d}\mathbf{y}^* \,\mathrm{d}\mathbf{x}^*} \] Also, the conditional distribution \(p(\mathbf{y},\mathbf{z}|\mathbf{x})\) is defined as: \[ p(\mathbf{y},\mathbf{z}|\mathbf{x}) = \frac{\exp\left( -U(\mathbf{x},\mathbf{y},\mathbf{z}) \right)}{\iint \exp\left( -U(\mathbf{x},\mathbf{y}^*,\mathbf{z}^*) \right) \, \mathrm{d}\mathbf{z}^* \,\mathrm{d}\mathbf{y}^*} \] This definition is consistent with the usual conditional \(p(\mathbf{y},\mathbf{z}|\mathbf{x}) = p(\mathbf{x},\mathbf{y},\mathbf{z}) / p(\mathbf{x})\), since the normalizer of the joint cancels; a tiny discrete check follows.
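The snippet below (my own illustration, with a random energy table) replaces the integrals with sums over a discrete grid and checks that normalizing \(\exp(-U(\mathbf{x},\cdot,\cdot))\) with \(\mathbf{x}\) held fixed equals the joint divided by the marginal.

```python
# Discrete check that the conditional defined above equals joint / marginal.
import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(size=(3, 4, 5))                       # U[x, y, z], illustrative

joint = np.exp(-U) / np.exp(-U).sum()                                # p(x, y, z)
cond_def = np.exp(-U) / np.exp(-U).sum(axis=(1, 2), keepdims=True)   # definition above
cond_bayes = joint / joint.sum(axis=(1, 2), keepdims=True)           # p(x,y,z) / p(x)

print(np.allclose(cond_def, cond_bayes))             # True
```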

Theorem 1

If \(U(\mathbf{x},\mathbf{y},\mathbf{z}) = f(\mathbf{x},\mathbf{y}) + g(\mathbf{x},\mathbf{z}) + h(\mathbf{x})\), then: \[ \begin{align} p(\mathbf{y}|\mathbf{x}) &= \frac{\exp\left( -f(\mathbf{x},\mathbf{y}) \right)}{\int \exp\left( -f(\mathbf{x},\mathbf{y}^*) \right) \,\mathrm{d}\mathbf{y}^*} \\ p(\mathbf{z}|\mathbf{x}) &= \frac{\exp\left( -g(\mathbf{x},\mathbf{z}) \right)}{\int \exp\left( -g(\mathbf{x},\mathbf{z}^*) \right) \,\mathrm{d}\mathbf{z}^*} \end{align} \] Proof: \[ \begin{align} p(\mathbf{y}|\mathbf{x}) &= \int p(\mathbf{y},\mathbf{z}|\mathbf{x})\,\mathrm{d}\mathbf{z} \\ &= \int \frac{\exp\left( -U(\mathbf{x},\mathbf{y},\mathbf{z}) \right)}{\iint \exp\left( -U(\mathbf{x},\mathbf{y}^*,\mathbf{z}^*) \right) \, \mathrm{d}\mathbf{z}^* \,\mathrm{d}\mathbf{y}^*}\,\mathrm{d}\mathbf{z} \\ &= \frac{\exp\left( -h(\mathbf{x}) \right)\cdot\exp\left( -f(\mathbf{x},\mathbf{y}) \right) \int \exp\left( -g(\mathbf{x},\mathbf{z}) \right)\,\mathrm{d}\mathbf{z}}{\iint \exp\left( -h(\mathbf{x}) \right)\cdot\exp\left( -f(\mathbf{x},\mathbf{y}^*)\right) \cdot\exp\left( -g(\mathbf{x},\mathbf{z}^*) \right)\,\mathrm{d}\mathbf{z}^*\,\mathrm{d}\mathbf{y}^*} \\ &= \frac{\exp\left( -h(\mathbf{x}) \right)\cdot\exp\left( -f(\mathbf{x},\mathbf{y}) \right) \int \exp\left( -g(\mathbf{x},\mathbf{z}) \right)\,\mathrm{d}\mathbf{z}}{\exp\left( -h(\mathbf{x}) \right)\cdot\left( \int \exp\left( -f(\mathbf{x},\mathbf{y}^*)\right)\,\mathrm{d}\mathbf{y}^* \right) \cdot \left( \int \exp\left( -g(\mathbf{x},\mathbf{z}^*) \right)\,\mathrm{d}\mathbf{z}^* \right)} \\ &= \frac{\exp\left( -f(\mathbf{x},\mathbf{y}) \right)}{\int \exp\left( -f(\mathbf{x},\mathbf{y}^*)\right)\,\mathrm{d}\mathbf{y}^*} \end{align} \] \(p(\mathbf{z}|\mathbf{x}) = \frac{\exp\left( -g(\mathbf{x},\mathbf{z}) \right)}{\int \exp\left( -g(\mathbf{x},\mathbf{z}^*) \right) \,\mathrm{d}\mathbf{z}^*}\) can be proven in the same way.

Corollary 1

If \(U(\mathbf{x},\mathbf{y},\mathbf{z}) = f(\mathbf{x},\mathbf{y}) + g(\mathbf{x},\mathbf{z}) + h(\mathbf{x})\), then \(\mathbf{y} \perp\!\!\!\perp \mathbf{z} \mid \mathbf{x}\).

Proof: \[ \begin{align} p(\mathbf{y},\mathbf{z}|\mathbf{x}) &= \frac{\exp\left( -U(\mathbf{x},\mathbf{y},\mathbf{z}) \right)}{\iint \exp\left( -U(\mathbf{x},\mathbf{y}^*,\mathbf{z}^*) \right) \, \mathrm{d}\mathbf{z}^* \,\mathrm{d}\mathbf{y}^*} \\ &= \frac{\exp\left( -f(\mathbf{x},\mathbf{y}) \right)}{\int \exp\left( -f(\mathbf{x},\mathbf{y}^*) \right) \,\mathrm{d}\mathbf{y}^*} \cdot \frac{\exp\left( -g(\mathbf{x},\mathbf{z}) \right)}{\int \exp\left( -g(\mathbf{x},\mathbf{z}^*) \right) \,\mathrm{d}\mathbf{z}^*} \\ &= p(\mathbf{y}|\mathbf{x}) \cdot p(\mathbf{z}|\mathbf{x}) \end{align} \] which implies \(\mathbf{y} \perp\!\!\!\perp \mathbf{z} \mid \mathbf{x}\).
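A small discrete sanity check (my own, with sums in place of the integrals and random illustrative tables for \(f\), \(g\), \(h\)) confirms both the conditional formulas of Theorem 1 and the factorization of Corollary 1:

```python
# Discrete verification of Theorem 1 and Corollary 1 for U = f(x,y) + g(x,z) + h(x).
import numpy as np

rng = np.random.default_rng(1)
nx, ny, nz = 3, 4, 5
f = rng.normal(size=(nx, ny))
g = rng.normal(size=(nx, nz))
h = rng.normal(size=(nx,))

U = f[:, :, None] + g[:, None, :] + h[:, None, None]    # U[x, y, z]
joint = np.exp(-U) / np.exp(-U).sum()                    # p(x, y, z)
p_x = joint.sum(axis=(1, 2))                             # p(x)
cond = joint / p_x[:, None, None]                        # p(y, z | x)

p_y_given_x = np.exp(-f) / np.exp(-f).sum(axis=1, keepdims=True)  # Theorem 1 claim
p_z_given_x = np.exp(-g) / np.exp(-g).sum(axis=1, keepdims=True)

print(np.allclose(cond.sum(axis=2), p_y_given_x),        # Theorem 1 for y
      np.allclose(cond.sum(axis=1), p_z_given_x),        # Theorem 1 for z
      np.allclose(cond, p_y_given_x[:, :, None] * p_z_given_x[:, None, :]))  # Corollary 1
```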

Corollary 2

If \(U(\mathbf{y},\mathbf{z}) = f(\mathbf{y}) + g(\mathbf{z})\), then: \[ \begin{align} p(\mathbf{y}) &= \frac{\exp\left( -f(\mathbf{y}) \right)}{\int \exp\left( -f(\mathbf{y}^*) \right) \,\mathrm{d}\mathbf{y}^*} \\ p(\mathbf{z}) &= \frac{\exp\left( -g(\mathbf{z}) \right)}{\int \exp\left( -g(\mathbf{z}^*) \right) \,\mathrm{d}\mathbf{z}^*} \end{align} \] Proof: similar to Theorem 1.

Corollary 3

If \(U(\mathbf{y},\mathbf{z}) = f(\mathbf{y}) + g(\mathbf{z})\), then \(\mathbf{y} \perp\!\!\!\perp \mathbf{z}\).

Proof: apply Corollary 2, then proceed as in Corollary 1.
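The same kind of discrete check (again my own, with random illustrative tables), now without the conditioning variable, confirms Corollaries 2 and 3:

```python
# Discrete verification of Corollaries 2 and 3 for U(y,z) = f(y) + g(z).
import numpy as np

rng = np.random.default_rng(2)
f, g = rng.normal(size=6), rng.normal(size=7)

joint = np.exp(-(f[:, None] + g[None, :]))
joint /= joint.sum()                                    # p(y, z)

p_y = np.exp(-f) / np.exp(-f).sum()                     # claimed p(y)
p_z = np.exp(-g) / np.exp(-g).sum()                     # claimed p(z)

print(np.allclose(joint.sum(axis=1), p_y),              # Corollary 2
      np.allclose(joint, p_y[:, None] * p_z[None, :]))  # Corollary 3: y ⊥ z
```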

References

Kim, Taesup, and Yoshua Bengio. 2016. “Deep Directed Generative Models with Energy-Based Probability Estimation.” arXiv Preprint arXiv:1606.03439.