# Contrastive Divergence

## Problem Statement

The goal is to find the optimal $$\theta$$ that minimizes $$U(\mathbf{X};\theta)$$ over the given dataset $$\mathbf{X} = \{\mathbf{x}_1,\dots,\mathbf{x}_N\}$$, without estimating $$Z(\theta)$$:

\begin{align} U(\mathbf{X};\theta) &= -\frac{1}{N} \sum_{i=1}^N \log p_m(\mathbf{x}_i;\theta) = \log Z(\theta) - \frac{1}{N} \sum_{i=1}^N \log \tilde{p}_m(\mathbf{x}_i;\theta) \\ p_m(\mathbf{x};\theta) &= \frac{1}{Z(\theta)} \, \tilde{p}_m(\mathbf{x};\theta) \\ Z(\theta) &= \int \tilde{p}_m(\mathbf{x};\theta)\,\mathrm{d}\mathbf{x} \end{align}
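As a concrete (illustrative, not from the original text) instance, take a one-dimensional zero-mean Gaussian with $$\theta = \sigma$$; here $$Z(\sigma)$$ happens to be tractable, which is generally not the case:

\begin{align} \tilde{p}_m(x;\sigma) &= \exp\left(-\frac{x^2}{2\sigma^2}\right), \qquad Z(\sigma) = \int \exp\left(-\frac{x^2}{2\sigma^2}\right)\mathrm{d}x = \sqrt{2\pi\sigma^2} \\ U(\mathbf{X};\sigma) &= \frac{1}{2}\log(2\pi\sigma^2) + \frac{1}{2N\sigma^2}\sum_{i=1}^N x_i^2 \end{align}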

## Contrastive Divergence

The gradient for $$U(\mathbf{X};\theta)$$ is: \begin{align} \frac{\partial U(\mathbf{X};\theta)}{\partial \theta} &= \frac{\partial \log Z(\theta)}{\partial \theta} - \frac{1}{N} \sum_{i=1}^N \frac{\partial \log \tilde{p}_m(\mathbf{x}_i;\theta)}{\partial \theta} \end{align}

The first term, $$\frac{\partial \log Z(\theta)}{\partial \theta}$$, can be derived as:

\begin{align} \frac{\partial \log Z(\theta)}{\partial \theta} &= \frac{1}{Z(\theta)} \cdot \frac{\partial Z(\theta)}{\partial \theta} = \frac{1}{Z(\theta)} \cdot \int \frac{\partial \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}\,\mathrm{d}\mathbf{x} = \frac{1}{Z(\theta)} \cdot \int \tilde{p}_m(\mathbf{x};\theta)\,\frac{\partial \log \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}\,\mathrm{d}\mathbf{x} \\ &= \int p_m(\mathbf{x};\theta)\,\frac{\partial \log \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}\,\mathrm{d}\mathbf{x} = \mathbb{E}_{p_m(\mathbf{x};\theta)}\left[\frac{\partial \log \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}\right] = \left<\frac{\partial \log \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}\right>_{\mathbf{X}^{\infty}} \end{align}

where $$\left<\frac{\partial \log \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}\right> _ {\mathbf{X}^{\infty}}$$ denotes the expectation of $$\frac{\partial \log \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}$$ over $$p_m(\mathbf{x};\theta)$$, i.e., the average over (ideal) samples $$\mathbf{X}^{\infty}$$ drawn from the model distribution. Also see Energy Function in Probabilistic Models for an alternative derivation.

Unfortunately, directly sampling from $$p_m(\mathbf{x};\theta)$$ is generally not possible. Hinton therefore suggested starting from the empirical distribution (i.e., the dataset $$\mathbf{X}$$) and running an MCMC chain toward $$p_m(\mathbf{x};\theta)$$. Let $$\mathbf{X}^0 \equiv \mathbf{X}$$ be the samples from the empirical distribution $$p_d(\mathbf{x})$$, $$\mathbf{X}^k$$ be the samples obtained after $$k$$ cycles of the MCMC chain, and $$\mathbf{X}^{\infty}$$ be the samples from the limit distribution of the chain (i.e., $$p_m(\mathbf{x};\theta)$$). We then have: \begin{align} \frac{\partial U(\mathbf{X};\theta)}{\partial \theta} &= \left<\frac{\partial \log \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}\right>_{\mathbf{X}^{\infty}} - \left<\frac{\partial \log \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}\right>_{\mathbf{X}^0} \\ &\approx \left<\frac{\partial \log \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}\right>_{\mathbf{X}^k} - \left<\frac{\partial \log \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}\right>_{\mathbf{X}^0} \end{align}

This algorithm is called the CD-k algorithm when a $$k$$-cycle MCMC chain is used. For further material, see Woodford (n.d.).
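As a concrete sketch (not from the original text), the approximation above can be instantiated for a binary restricted Boltzmann machine, the model family studied by Hinton (2002), where the MCMC chain is block Gibbs sampling. The code below is a minimal NumPy version; all names (`cd_k`, `sample_h`, `sample_v`) are illustrative:

```python
# Minimal CD-k sketch for a binary RBM, assuming NumPy only.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h(v, W, b_h):
    # P(h=1 | v) for binary hidden units, plus a Bernoulli sample
    p = sigmoid(v @ W + b_h)
    return p, (rng.random(p.shape) < p).astype(float)

def sample_v(h, W, b_v):
    # P(v=1 | h) for binary visible units, plus a Bernoulli sample
    p = sigmoid(h @ W.T + b_v)
    return p, (rng.random(p.shape) < p).astype(float)

def cd_k(v0, W, b_v, b_h, k=1):
    """CD-k estimate of dU/dtheta: <.>_{X^k} - <.>_{X^0}."""
    ph0, h = sample_h(v0, W, b_h)       # positive phase, from data X^0
    for _ in range(k):                  # k cycles of block Gibbs sampling
        _, vk = sample_v(h, W, b_v)
        ph, h = sample_h(vk, W, b_h)
    n = v0.shape[0]
    # negative phase uses the chain samples X^k
    dW = (vk.T @ ph - v0.T @ ph0) / n
    db_v = (vk - v0).mean(axis=0)
    db_h = (ph - ph0).mean(axis=0)
    return dW, db_v, db_h

# usage: one gradient-descent step on U with random binary data
n_vis, n_hid = 6, 4
W = 0.01 * rng.standard_normal((n_vis, n_hid))
b_v, b_h = np.zeros(n_vis), np.zeros(n_hid)
v0 = (rng.random((32, n_vis)) < 0.5).astype(float)
dW, db_v, db_h = cd_k(v0, W, b_v, b_h, k=1)
W -= 0.1 * dW                           # descend the CD-k gradient estimate
```

Note that the chain is re-initialized from the training mini-batch `v0` on every update, which is exactly what Persistent Contrastive Divergence later changes.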

Hinton (2002) found that even a single MCMC cycle (i.e., CD-1) is often sufficient for training to converge.

Interestingly, although the 1-cycle (or few-cycle) MCMC chain is not expected to be a good estimator of the true $$p_m(\mathbf{x};\theta)$$, its samples can be viewed as lying in a neighborhood of a given training point $$\mathbf{x}$$ with relatively high log-likelihood, since by design the chain starts from that training point. If we write $$p_{m}(\mathbf{x};\theta) = \frac{\exp(-\beta E(\mathbf{x};\theta))}{\int \exp(-\beta E(\mathbf{x'};\theta))\,dx'}$$ for some energy function $$E(\mathbf{x};\theta)$$, then optimizing the contrastive divergence loss can be viewed as "pulling down" the energy at the given training data and "pulling up" the energy at the sampled neighborhood data. In this perspective, the CD-k algorithm is a form of energy-based learning with approximated contrastive samples (see Energy Based Models).
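Concretely, substituting $$\log \tilde{p}_m(\mathbf{x};\theta) = -\beta E(\mathbf{x};\theta)$$ into the CD-k gradient estimate gives:

\begin{align} \frac{\partial U(\mathbf{X};\theta)}{\partial \theta} \approx \beta \left( \left<\frac{\partial E(\mathbf{x};\theta)}{\partial \theta}\right>_{\mathbf{X}^0} - \left<\frac{\partial E(\mathbf{x};\theta)}{\partial \theta}\right>_{\mathbf{X}^k} \right) \end{align}

so gradient descent on $$U$$ decreases the energy at the training points $$\mathbf{X}^0$$ and increases it at the chain samples $$\mathbf{X}^k$$.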

## Persistent Contrastive Divergence

Tieleman (2008) proposed persistent contrastive divergence (PCD): instead of restarting the MCMC chain from the training points at every mini-batch, the chain at each mini-batch is initialized from the final samples of the previous mini-batch's chain.
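A minimal sketch of the PCD bookkeeping, again for a binary RBM in NumPy (names such as `pcd_step` and `v_pers` are illustrative): the only change from CD-k is that the negative-phase chain state is carried across updates instead of being reset to the data.

```python
# Minimal PCD sketch for a binary RBM, assuming NumPy only.
import numpy as np

rng = np.random.default_rng(1)
sig = lambda x: 1.0 / (1.0 + np.exp(-x))

n_vis, n_hid, n_chains = 6, 4, 32
W = 0.01 * rng.standard_normal((n_vis, n_hid))
b_v, b_h = np.zeros(n_vis), np.zeros(n_hid)

# persistent "fantasy particles": kept across mini-batches instead of
# being re-initialized from the training points
v_pers = (rng.random((n_chains, n_vis)) < 0.5).astype(float)

def pcd_step(v0, v_pers, lr=0.1, k=1):
    """One PCD update: positive phase from the data v0, negative phase
    by continuing the persistent chain for k more Gibbs cycles."""
    ph0 = sig(v0 @ W + b_h)                     # positive phase (data)
    v = v_pers
    for _ in range(k):                          # continue the old chain
        h = (rng.random((v.shape[0], n_hid)) < sig(v @ W + b_h)).astype(float)
        v = (rng.random((h.shape[0], n_vis)) < sig(h @ W.T + b_v)).astype(float)
    ph = sig(v @ W + b_h)                       # negative phase (chain)
    W[...] -= lr * (v.T @ ph / v.shape[0] - v0.T @ ph0 / v0.shape[0])
    b_v[...] -= lr * (v.mean(axis=0) - v0.mean(axis=0))
    b_h[...] -= lr * (ph.mean(axis=0) - ph0.mean(axis=0))
    return v                                    # next mini-batch starts here

# usage: the returned chain state becomes the next initial state
v0 = (rng.random((32, n_vis)) < 0.5).astype(float)
v_pers = pcd_step(v0, v_pers)
```

Because the persistent chain is never reset, it can mix toward $$p_m(\mathbf{x};\theta)$$ over many updates, at the cost of lagging slightly behind the current parameters.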

# References

Hinton, Geoffrey E. 2002. “Training Products of Experts by Minimizing Contrastive Divergence.” Neural Computation 14 (8): 1771–1800.

Tieleman, Tijmen. 2008. “Training Restricted Boltzmann Machines Using Approximations to the Likelihood Gradient.” In Proceedings of the 25th International Conference on Machine Learning - ICML ’08, 1064–71. Helsinki, Finland: ACM Press. https://doi.org/10.1145/1390156.1390290.

Woodford, Oliver. n.d. “Notes on Contrastive Divergence.”