Contrastive Divergence

Problem Statement

The goal is to find the optimal \(\theta\) that minimizes the negative average log-likelihood \(U(\mathbf{X};\theta)\) of a given dataset \(\mathbf{X} = \{\mathbf{x}_1,\dots,\mathbf{x}_N\}\), without estimating the partition function \(Z(\theta)\):

\[ \begin{align} U(\mathbf{X};\theta) &= -\frac{1}{N} \sum_{i=1}^N \log p_m(\mathbf{x}_i;\theta) = \log Z(\theta) - \frac{1}{N} \sum_{i=1}^N \log \tilde{p}_m(\mathbf{x}_i;\theta) \\ p_m(\mathbf{x};\theta) &= \frac{1}{Z(\theta)} \, \tilde{p}_m(\mathbf{x};\theta) \\ Z(\theta) &= \int \tilde{p}_m(\mathbf{x};\theta)\,\mathrm{d}\mathbf{x} \end{align} \]
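
As a concrete illustration of this setup, the sketch below uses a hypothetical 1-D toy model \(\tilde{p}_m(x;\theta) = e^{-\theta x^2}\) (not from the original text), for which \(Z(\theta)\) can still be evaluated by brute-force numerical integration. In realistic models this integral is intractable, which is exactly what contrastive divergence avoids. All function names here are illustrative.

```python
import numpy as np

# Hypothetical 1-D toy model (not from the text): p_tilde(x; theta) = exp(-theta * x^2).
def log_p_tilde(x, theta):
    """log p_tilde(x; theta), the unnormalized log-density."""
    return -theta * x ** 2

def log_Z(theta, lo=-10.0, hi=10.0, n=20001):
    """Brute-force log Z(theta) via a Riemann sum (only feasible in this 1-D toy)."""
    grid = np.linspace(lo, hi, n)
    dx = grid[1] - grid[0]
    return np.log(np.sum(np.exp(log_p_tilde(grid, theta))) * dx)

def U(X, theta):
    """Negative average log-likelihood: U(X; theta) = log Z(theta) - mean log p_tilde(x_i; theta)."""
    return log_Z(theta) - np.mean(log_p_tilde(X, theta))

rng = np.random.default_rng(0)
X = rng.normal(0.0, np.sqrt(0.5), size=1000)   # data drawn from the model with theta = 1
print(U(X, 1.0), U(X, 0.3))                    # U is smaller near the data-generating theta
```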

Contrastive Divergence

The gradient of \(U(\mathbf{X};\theta)\) with respect to \(\theta\) is: \[ \begin{align} \frac{\partial U(\mathbf{X};\theta)}{\partial \theta} &= \frac{\partial \log Z(\theta)}{\partial \theta} - \frac{1}{N} \sum_{i=1}^N \frac{\partial \log \tilde{p}_m(\mathbf{x}_i;\theta)}{\partial \theta} \end{align} \]

The first term, \(\frac{\partial \log Z(\theta)}{\partial \theta}\), can be derived as:

\[ \begin{align} \frac{\partial \log Z(\theta)}{\partial \theta} &= \frac{1}{Z(\theta)} \cdot \frac{\partial Z(\theta)}{\partial \theta} = \frac{1}{Z(\theta)} \cdot \int \frac{\partial \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}\,\mathrm{d}\mathbf{x} = \frac{1}{Z(\theta)} \cdot \int \tilde{p}_m(\mathbf{x};\theta)\,\frac{\partial \log \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}\,\mathrm{d}\mathbf{x} \\ &= \int p_m(\mathbf{x};\theta)\,\frac{\partial \log \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}\,\mathrm{d}\mathbf{x} = \mathbb{E}_{p_m(\mathbf{x};\theta)}\left[\frac{\partial \log \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}\right] \end{align} \]
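
As a quick sanity check of this identity (a toy example, not from the original text), consider the 1-D model \(\tilde{p}_m(x;\theta) = e^{-\theta x^2}\), for which \(Z(\theta) = \sqrt{\pi/\theta}\) and \(p_m(x;\theta)\) is a zero-mean Gaussian with variance \(1/(2\theta)\):

\[ \begin{align} \frac{\partial \log Z(\theta)}{\partial \theta} &= \frac{\partial}{\partial \theta}\left(\tfrac{1}{2}\log\pi - \tfrac{1}{2}\log\theta\right) = -\frac{1}{2\theta} \\ \mathbb{E}_{p_m(x;\theta)}\left[\frac{\partial \log \tilde{p}_m(x;\theta)}{\partial \theta}\right] &= \mathbb{E}_{p_m(x;\theta)}\left[-x^2\right] = -\frac{1}{2\theta} \end{align} \]

and the two sides indeed agree.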

Unfortunately, sampling directly from \(p_m(\mathbf{x};\theta)\) is generally intractable. Hinton (2002) therefore suggested starting from the empirical distribution (i.e., the dataset \(\mathbf{X}\)) and running an MCMC chain toward \(p_m(\mathbf{x};\theta)\). Let \(\mathbf{X}^0 \equiv \mathbf{X}\) denote the samples from the empirical distribution \(p_d(\mathbf{x})\), \(\mathbf{X}^k\) the samples obtained after \(k\) MCMC cycles, and \(\mathbf{X}^{\infty}\) the samples from the limiting distribution of the chain (i.e., \(p_m(\mathbf{x};\theta)\)). Writing \(\left<\cdot\right>_{\mathbf{S}}\) for the average over a sample set \(\mathbf{S}\), the expectation above becomes \(\left<\frac{\partial \log \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}\right>_{\mathbf{X}^{\infty}}\) and the data term of the gradient is \(\left<\frac{\partial \log \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}\right>_{\mathbf{X}^0}\), so:

\[ \begin{align} \frac{\partial U(\mathbf{X};\theta)}{\partial \theta} &= \left<\frac{\partial \log \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}\right>_{\mathbf{X}^{\infty}} - \left<\frac{\partial \log \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}\right>_{\mathbf{X}^0} \\ &\approx \left<\frac{\partial \log \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}\right>_{\mathbf{X}^k} - \left<\frac{\partial \log \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}\right>_{\mathbf{X}^0} \end{align} \]
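
In practice, the parameters are updated by descending this approximate gradient. With a learning rate \(\eta\) (an assumed hyperparameter, not specified above), a single update step reads:

\[ \begin{align} \theta \leftarrow \theta - \eta \left( \left<\frac{\partial \log \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}\right>_{\mathbf{X}^k} - \left<\frac{\partial \log \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}\right>_{\mathbf{X}^0} \right) \end{align} \]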

When a \(k\)-cycle MCMC chain is used, this algorithm is called the CD-\(k\) algorithm. Hinton (2002) found that even a 1-cycle MCMC chain (CD-1) is often sufficient for training to converge.
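
The following sketch puts the pieces together for the hypothetical 1-D toy model introduced after the problem statement. The Metropolis transition kernel, proposal width, learning rate, and all function names are assumptions made for illustration; Hinton's original setting uses Gibbs sampling in a product-of-experts model.

```python
import numpy as np

rng = np.random.default_rng(0)

def dlog_p_tilde_dtheta(x, theta):
    """d log p_tilde(x; theta) / d theta = -x^2 for the toy model p_tilde = exp(-theta * x^2)."""
    return -x ** 2

def mcmc_step(x, theta, step=0.5):
    """One Metropolis step targeting p_m(x; theta) (an assumed transition kernel)."""
    proposal = x + step * rng.standard_normal(x.shape)
    log_accept = -theta * proposal ** 2 - (-theta * x ** 2)
    accept = np.log(rng.random(x.shape)) < log_accept
    return np.where(accept, proposal, x)

def cd_k_grad(X0, theta, k=1):
    """CD-k gradient estimate: <d log p_tilde / d theta>_{X^k} - <d log p_tilde / d theta>_{X^0}."""
    Xk = X0.copy()
    for _ in range(k):               # k-cycle MCMC chain started at the data X^0
        Xk = mcmc_step(Xk, theta)
    return dlog_p_tilde_dtheta(Xk, theta).mean() - dlog_p_tilde_dtheta(X0, theta).mean()

# Training loop: data generated with theta* = 1, model initialized at theta = 0.2.
X = rng.normal(0.0, np.sqrt(0.5), size=2000)   # N(0, 1/(2*theta*)) with theta* = 1
theta, lr = 0.2, 0.1
for it in range(500):
    theta -= lr * cd_k_grad(X, theta, k=1)     # CD-1 update
print(theta)                                   # should move toward theta* = 1
```

Note that the chain is restarted at the data \(\mathbf{X}^0\) for every gradient evaluation, matching the \(\mathbf{X}^0 \to \mathbf{X}^k\) construction above.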

For further materials, see Woodford (n.d.).

References

Hinton, Geoffrey E. 2002. “Training Products of Experts by Minimizing Contrastive Divergence.” Neural Computation 14 (8): 1771–1800.

Woodford, Oliver. n.d. “Notes on Contrastive Divergence.”