# Contrastive Divergence

## Problem Statement

The goal is to find the optimal $$\theta$$ that minimizes $$U(\mathbf{X};\theta)$$, the negative average log-likelihood of the given dataset $$\mathbf{X} = \{\mathbf{x}_1,\dots,\mathbf{x}_N\}$$, without ever estimating the partition function $$Z(\theta)$$:

\begin{align} U(\mathbf{X};\theta) &= -\frac{1}{N} \sum_{i=1}^N \log p_m(\mathbf{x}_i;\theta) = \log Z(\theta) - \frac{1}{N} \sum_{i=1}^N \log \tilde{p}_m(\mathbf{x}_i;\theta) \\ p_m(\mathbf{x};\theta) &= \frac{1}{Z(\theta)} \, \tilde{p}_m(\mathbf{x};\theta) \\ Z(\theta) &= \int \tilde{p}_m(\mathbf{x};\theta)\,\mathrm{d}\mathbf{x} \end{align}
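To make the setup concrete, here is a minimal sketch (not from the original references) using a hypothetical 1-D toy model $$\tilde{p}_m(x;\theta) = e^{-\theta x^2}$$, chosen only because its partition function $$Z(\theta) = \int e^{-\theta x^2}\,\mathrm{d}x = \sqrt{\pi/\theta}$$ happens to be known in closed form, so $$U$$ can be evaluated exactly; for most models of interest, no such closed form exists:

```python
import numpy as np

def log_p_tilde(x, theta):
    """Unnormalized log-density of the toy model: log p~(x; theta) = -theta * x^2."""
    return -theta * x**2

def log_Z(theta):
    """Closed form for this toy model only: Z(theta) = sqrt(pi / theta)."""
    return 0.5 * np.log(np.pi / theta)

def U(X, theta):
    """Negative average log-likelihood: log Z(theta) - (1/N) sum_i log p~(x_i; theta)."""
    return log_Z(theta) - np.mean(log_p_tilde(X, theta))

rng = np.random.default_rng(0)
X = rng.normal(scale=0.5, size=1000)   # toy dataset drawn from N(0, 0.25)
print(U(X, 1.0), U(X, 2.0))            # U is lower near the optimum theta = 2
```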

## Contrastive Divergence

The gradient of $$U(\mathbf{X};\theta)$$ with respect to $$\theta$$ is: \begin{align} \frac{\partial U(\mathbf{X};\theta)}{\partial \theta} &= \frac{\partial \log Z(\theta)}{\partial \theta} - \frac{1}{N} \sum_{i=1}^N \frac{\partial \log \tilde{p}_m(\mathbf{x}_i;\theta)}{\partial \theta} \end{align}

The first term, $$\frac{\partial \log Z(\theta)}{\partial \theta}$$, can be rewritten as an expectation under the model distribution (using the log-derivative identity $$\frac{\partial \tilde{p}_m}{\partial \theta} = \tilde{p}_m \, \frac{\partial \log \tilde{p}_m}{\partial \theta}$$):

\begin{align} \frac{\partial \log Z(\theta)}{\partial \theta} &= \frac{1}{Z(\theta)} \cdot \frac{\partial Z(\theta)}{\partial \theta} = \frac{1}{Z(\theta)} \cdot \int \frac{\partial \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}\,\mathrm{d}\mathbf{x} = \frac{1}{Z(\theta)} \cdot \int \tilde{p}_m(\mathbf{x};\theta)\,\frac{\partial \log \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}\,\mathrm{d}\mathbf{x} \\ &= \int p_m(\mathbf{x};\theta)\,\frac{\partial \log \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}\,\mathrm{d}\mathbf{x} = \mathbb{E}_{p_m(\mathbf{x};\theta)}\left[\frac{\partial \log \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}\right] \end{align}
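As a quick sanity check of this identity, the sketch below (same hypothetical toy model as above) compares the analytic value $$\frac{\partial \log Z(\theta)}{\partial \theta} = -\frac{1}{2\theta}$$ against a Monte Carlo estimate of $$\mathbb{E}_{p_m}\left[\frac{\partial \log \tilde{p}_m}{\partial \theta}\right] = \mathbb{E}_{p_m}[-x^2]$$, exploiting the fact that this particular $$p_m$$ is $$\mathcal{N}\!\left(0, \frac{1}{2\theta}\right)$$:

```python
import numpy as np

theta = 2.0
rng = np.random.default_rng(0)

# Analytic value for the toy model: log Z = 0.5*log(pi/theta),
# hence d(log Z)/dtheta = -1/(2*theta).
exact = -1.0 / (2.0 * theta)

# Monte Carlo estimate of E_{p_m}[d(log p~)/dtheta] = E_{p_m}[-x^2],
# sampling from p_m = N(0, 1/(2*theta)) directly (possible only
# because this toy model is a Gaussian in disguise).
samples = rng.normal(scale=np.sqrt(1.0 / (2.0 * theta)), size=100_000)
estimate = np.mean(-samples**2)

print(exact, estimate)   # both close to -0.25
```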

This expectation is taken over the model distribution $$p_m(\mathbf{x};\theta)$$, from which direct sampling is generally not possible. Hinton therefore suggested starting from the empirical distribution (i.e., the dataset $$\mathbf{X}$$) and running an MCMC chain whose stationary distribution is $$p_m(\mathbf{x};\theta)$$. Let $$\mathbf{X}^0 \equiv \mathbf{X}$$ denote the samples from the empirical distribution $$p_d(\mathbf{x})$$, $$\mathbf{X}^k$$ the samples obtained after $$k$$ MCMC steps, and $$\mathbf{X}^{\infty}$$ the samples from the chain's stationary distribution (i.e., $$p_m(\mathbf{x};\theta)$$). Writing $$\left<\frac{\partial \log \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}\right> _ {\mathbf{X}^k}$$ for the average of $$\frac{\partial \log \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}$$ over the sample set $$\mathbf{X}^k$$, we then have:

\begin{align} \frac{\partial U(\mathbf{X};\theta)}{\partial \theta} &= \left<\frac{\partial \log \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}\right>_{\mathbf{X}^{\infty}} - \left<\frac{\partial \log \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}\right>_{\mathbf{X}^0} \\ &\approx \left<\frac{\partial \log \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}\right>_{\mathbf{X}^k} - \left<\frac{\partial \log \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}\right>_{\mathbf{X}^0} \end{align}

This algorithm is called CD-$$k$$ when a $$k$$-step MCMC chain is used. Hinton (2002) found that even a single MCMC step (i.e., CD-1) is often sufficient for training to converge.
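Putting the pieces together, here is a sketch of a CD-1 training loop for the same hypothetical toy model, using a Metropolis step as the MCMC transition. Note that the acceptance ratio involves only $$\tilde{p}_m$$, which is precisely why CD never needs $$Z(\theta)$$:

```python
import numpy as np

def dlog_p_tilde(x, theta):
    """d/dtheta of log p~(x; theta) = -theta * x^2, i.e., -x^2 for the toy model."""
    return -x**2

def mcmc_step(x, theta, rng, step=0.5):
    """One Metropolis step targeting p_m(x; theta); the acceptance
    ratio uses only the unnormalized density, so Z never appears."""
    prop = x + step * rng.standard_normal(x.shape)
    log_ratio = -theta * prop**2 + theta * x**2   # log p~(prop) - log p~(x)
    accept = np.log(rng.uniform(size=x.shape)) < log_ratio
    return np.where(accept, prop, x)

def cd_k_grad(X0, theta, k, rng):
    """CD-k gradient estimate: <dlog p~>_{X^k} - <dlog p~>_{X^0}."""
    Xk = X0
    for _ in range(k):
        Xk = mcmc_step(Xk, theta, rng)
    return np.mean(dlog_p_tilde(Xk, theta)) - np.mean(dlog_p_tilde(X0, theta))

rng = np.random.default_rng(0)
X = rng.normal(scale=0.5, size=2000)   # data ~ N(0, 0.25), so the optimum is theta = 2
theta, lr = 0.5, 1.0
for _ in range(500):
    theta -= lr * cd_k_grad(X, theta, k=1, rng=rng)
print(theta)   # should drift toward roughly 2.0
```

Because the chain is initialized at the data, the CD-1 estimate is biased toward the empirical distribution, but in this sketch its fixed point (where $$\mathbf{X}^1$$ and $$\mathbf{X}^0$$ agree in distribution) coincides with the maximum-likelihood solution.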

For further reading, see Woodford (n.d.).

## References

Hinton, Geoffrey E. 2002. “Training Products of Experts by Minimizing Contrastive Divergence.” Neural Computation 14 (8): 1771–1800.

Woodford, Oliver. n.d. “Notes on Contrastive Divergence.”