Contrastive Divergence

Problem Statement

The goal is to find the optimal \(\theta\) that minimizes \(U(\mathbf{X};\theta)\) for the given dataset \(\mathbf{X} = \{\mathbf{x}_1,\dots,\mathbf{x}_N\}\), without estimating \(Z(\theta)\):

\[ \begin{align} U(\mathbf{X};\theta) &= -\frac{1}{N} \sum_{i=1}^N \log p_m(\mathbf{x}_i;\theta) = \log Z(\theta) - \frac{1}{N} \sum_{i=1}^N \log \tilde{p}_m(\mathbf{x}_i;\theta) \\ p_m(\mathbf{x};\theta) &= \frac{1}{Z(\theta)} \, \tilde{p}_m(\mathbf{x};\theta) \\ Z(\theta) &= \int \tilde{p}_m(\mathbf{x};\theta)\,\mathrm{d}\mathbf{x} \end{align} \]
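
For concreteness, here is a minimal sketch in Python of the quantities above, using a hypothetical toy model \(\tilde{p}_m(x;\theta) = \exp(-\theta x^2)\) (an illustrative assumption, not from the original text). In one dimension \(Z(\theta)\) can still be computed on a grid; it is exactly this computation that becomes intractable in high dimensions.

```python
import numpy as np

# Hypothetical toy unnormalized model: p_tilde(x; theta) = exp(-theta * x^2).
def p_tilde(x, theta):
    return np.exp(-theta * x**2)

theta = 0.5
X = np.random.default_rng(0).normal(0.0, 1.0, size=1000)  # toy dataset

# In 1-D, Z(theta) can still be computed on a grid; the cost of such a
# grid grows exponentially with dim(x), which is why Z(theta) is
# treated as intractable in general.
xs, dx = np.linspace(-10.0, 10.0, 100_001, retstep=True)
Z = np.sum(p_tilde(xs, theta)) * dx        # should be close to sqrt(pi/theta)

# U(X; theta) = log Z(theta) - (1/N) * sum_i log p_tilde(x_i; theta)
U = np.log(Z) - np.mean(np.log(p_tilde(X, theta)))
print(Z, np.sqrt(np.pi / theta), U)
```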

Contrastive Divergence

The gradient for \(U(\mathbf{X};\theta)\) is: \[ \begin{align} \frac{\partial U(\mathbf{X};\theta)}{\partial \theta} &= \frac{\partial \log Z(\theta)}{\partial \theta} - \frac{1}{N} \sum_{i=1}^N \frac{\partial \log \tilde{p}_m(\mathbf{x}_i;\theta)}{\partial \theta} \end{align} \]

The first term, \(\frac{\partial \log Z(\theta)}{\partial \theta}\), can be derived as:

\[ \begin{align} \frac{\partial \log Z(\theta)}{\partial \theta} &= \frac{1}{Z(\theta)} \cdot \frac{\partial Z(\theta)}{\partial \theta} = \frac{1}{Z(\theta)} \cdot \int \frac{\partial \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}\,\mathrm{d}\mathbf{x} = \frac{1}{Z(\theta)} \cdot \int \tilde{p}_m(\mathbf{x};\theta)\,\frac{\partial \log \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}\,\mathrm{d}\mathbf{x} \\ &= \int p_m(\mathbf{x};\theta)\,\frac{\partial \log \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}\,\mathrm{d}\mathbf{x} = \mathbb{E}_{p_m(\mathbf{x};\theta)}\left[\frac{\partial \log \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}\right] \end{align} \]

In words, the gradient of \(\log Z(\theta)\) is the expectation of \(\frac{\partial \log \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}\) under the model distribution \(p_m(\mathbf{x};\theta)\). Also see Energy Function in Probabilistic Models for another derivation.
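
As a sanity check, this identity can be verified numerically on the toy model above (again an illustrative assumption): with \(\tilde{p}_m(x;\theta) = \exp(-\theta x^2)\) we have \(\frac{\partial \log \tilde{p}_m}{\partial \theta} = -x^2\), \(Z(\theta) = \sqrt{\pi/\theta}\), hence \(\frac{\partial \log Z}{\partial \theta} = -\frac{1}{2\theta}\), and \(p_m(x;\theta)\) is a zero-mean Gaussian with variance \(1/(2\theta)\).

```python
import numpy as np

theta = 0.5
rng = np.random.default_rng(0)

# p_m(x; theta) is a Gaussian with variance 1/(2*theta) for this toy model.
samples = rng.normal(0.0, np.sqrt(1.0 / (2 * theta)), size=1_000_000)

lhs = -1.0 / (2 * theta)      # d(log Z)/d(theta), computed analytically
rhs = np.mean(-samples**2)    # Monte Carlo estimate of E_{p_m}[d(log p_tilde)/d(theta)]
print(lhs, rhs)               # the two should agree to ~3 decimal places
```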

Unfortunately, directly sampling from \(p_m(\mathbf{x};\theta)\) is generally not possible. Hinton therefore suggested starting from the empirical distribution (i.e., the dataset \(\mathbf{X}\)) and running an MCMC chain toward \(p_m(\mathbf{x};\theta)\). Let \(\mathbf{X}^0 \equiv \mathbf{X}\) be the samples from the empirical distribution \(p_d(\mathbf{x})\), let \(\mathbf{X}^k\) be the samples obtained after \(k\) steps of the MCMC chain, and let \(\mathbf{X}^{\infty}\) be the samples from the limiting distribution of the chain (i.e., \(p_m(\mathbf{x};\theta)\)). Writing \(\left<f(\mathbf{x})\right>_{\mathbf{X}^k}\) for the average of \(f(\mathbf{x})\) over the sample set \(\mathbf{X}^k\), we then have: \[ \begin{align} \frac{\partial U(\mathbf{X};\theta)}{\partial \theta} &= \left<\frac{\partial \log \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}\right>_{\mathbf{X}^{\infty}} - \left<\frac{\partial \log \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}\right>_{\mathbf{X}^0} \\ &\approx \left<\frac{\partial \log \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}\right>_{\mathbf{X}^k} - \left<\frac{\partial \log \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}\right>_{\mathbf{X}^0} \end{align} \]

This algorithm is called the CD-\(k\) algorithm when a \(k\)-step MCMC chain is used. For further material, see Woodford (n.d.).
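
Below is a minimal CD-\(k\) sketch in Python, continuing the toy model \(\tilde{p}_m(x;\theta) = \exp(-\theta x^2)\) from above and using a random-walk Metropolis sampler as the MCMC transition (both are illustrative assumptions; the CD-\(k\) recipe itself does not prescribe a particular chain). Since the data below have unit variance and the model variance is \(1/(2\theta)\), the fitted \(\theta\) should approach \(0.5\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model from above: log p_tilde(x; theta) = -theta * x^2,
# so d(log p_tilde)/d(theta) = -x^2.
def log_p_tilde(x, theta):
    return -theta * x**2

def dlogp_dtheta(x, theta):
    return -x**2

def mcmc_step(x, theta, step=0.5):
    """One random-walk Metropolis step targeting p_m(x; theta)
    (an illustrative choice of MCMC transition)."""
    proposal = x + step * rng.normal(size=x.shape)
    log_accept = log_p_tilde(proposal, theta) - log_p_tilde(x, theta)
    accept = np.log(rng.uniform(size=x.shape)) < log_accept
    return np.where(accept, proposal, x)

def cd_k_grad(X0, theta, k=1):
    """CD-k estimate of dU/dtheta: start the chain at the data X0,
    run k MCMC steps to get Xk, and take the difference of averages."""
    Xk = X0.copy()
    for _ in range(k):
        Xk = mcmc_step(Xk, theta)
    return dlogp_dtheta(Xk, theta).mean() - dlogp_dtheta(X0, theta).mean()

X = rng.normal(0.0, 1.0, size=2000)   # data with unit variance
theta, lr = 2.0, 0.05
for _ in range(500):
    theta -= lr * cd_k_grad(X, theta, k=1)
print(theta)                          # should move toward ~0.5
```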

Hinton (2002) found that even a single MCMC step (i.e., CD-1) is sufficient for training to converge in practice.

Interestingly, although a 1-step (or few-step) MCMC chain is not expected to be a good sampler of the true \(p_m(\mathbf{x};\theta)\), its output can be viewed as a neighborhood of a given training point \(\mathbf{x}\) with relatively high likelihood under the model, since by design the chain starts from that training point. If we can write \(p_{m}(\mathbf{x};\theta) = \frac{\exp(-\beta E(\mathbf{x};\theta))}{\int \exp(-\beta E(\mathbf{x}';\theta))\,\mathrm{d}\mathbf{x}'}\) for some energy function \(E(\mathbf{x};\theta)\), then optimizing the contrastive divergence loss can be viewed as "pulling down" the energy at the given training data and "pulling up" the energy at the sampled neighborhood data. In this perspective, the CD-\(k\) algorithm is a form of energy-based learning with approximate contrastive samples (see Energy Based Models).
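
To make the pull-down/pull-up reading explicit, substitute \(\tilde{p}_m(\mathbf{x};\theta) = \exp(-\beta E(\mathbf{x};\theta))\) into the CD-\(k\) gradient:

\[ \begin{align} \frac{\partial U(\mathbf{X};\theta)}{\partial \theta} \approx \left<\frac{\partial \log \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}\right>_{\mathbf{X}^k} - \left<\frac{\partial \log \tilde{p}_m(\mathbf{x};\theta)}{\partial \theta}\right>_{\mathbf{X}^0} = \beta \left( \left<\frac{\partial E(\mathbf{x};\theta)}{\partial \theta}\right>_{\mathbf{X}^0} - \left<\frac{\partial E(\mathbf{x};\theta)}{\partial \theta}\right>_{\mathbf{X}^k} \right) \end{align} \]

so a gradient-descent step on \(U\) decreases the energy at the training points \(\mathbf{X}^0\) and increases it at the contrastive samples \(\mathbf{X}^k\).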

Persistent Contrastive Divergence

Tieleman (2008) proposed initializing the MCMC chain of each mini-batch from the final samples of the previous mini-batch's chain, rather than from the training points. The chains thus persist across mini-batches and, over training, can drift closer to the model distribution than chains restarted at the data, as sketched below.
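
A minimal PCD sketch in Python, again on the illustrative toy model and Metropolis sampler used above; the only change relative to the CD-\(k\) sketch is that the chain state is carried across mini-batches instead of being reset to the data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Same toy model as in the CD-k sketch: log p_tilde(x; theta) = -theta * x^2.
def log_p_tilde(x, theta):
    return -theta * x**2

def mcmc_step(x, theta, step=0.5):
    proposal = x + step * rng.normal(size=x.shape)
    accept = np.log(rng.uniform(size=x.shape)) < (
        log_p_tilde(proposal, theta) - log_p_tilde(x, theta))
    return np.where(accept, proposal, x)

X = rng.normal(0.0, 1.0, size=2000)   # data with unit variance
theta, lr, k = 2.0, 0.05, 1
chain = X[:200].copy()                # persistent chain: initialized once

for _ in range(500):
    batch = X[rng.choice(len(X), size=200, replace=False)]
    for _ in range(k):                # advance the persistent chain;
        chain = mcmc_step(chain, theta)   # it is NOT reset to the batch
    grad = np.mean(-chain**2) - np.mean(-batch**2)   # d(log p_tilde)/d(theta) = -x^2
    theta -= lr * grad
print(theta)                          # should again approach ~0.5
```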

References

Hinton, Geoffrey E. 2002. “Training Products of Experts by Minimizing Contrastive Divergence.” Neural Computation 14 (8): 1771–1800.

Tieleman, Tijmen. 2008. “Training Restricted Boltzmann Machines Using Approximations to the Likelihood Gradient.” In Proceedings of the 25th International Conference on Machine Learning - ICML ’08, 1064–71. Helsinki, Finland: ACM Press. https://doi.org/10.1145/1390156.1390290.

Woodford, Oliver. n.d. “Notes on Contrastive Divergence.”