Restricted Boltzmann Machine

Standard RBM

The formulation of a standard Restricted Boltzmann Machine (RBM) consists of an observed binary variable \(\mathbf{v}\) and a latent binary variable \(\mathbf{h}\), together with an energy function defined as: \[ E(\mathbf{v},\mathbf{h}) = -\mathbf{a}^{\top} \mathbf{v} - \mathbf{b}^{\top}\mathbf{h}-\mathbf{v}^{\top}\mathbf{W}\,\mathbf{h} \] The joint probability of the model is defined as: \[ P(\mathbf{v},\mathbf{h}) = \frac{1}{Z} \, \exp\left( -E(\mathbf{v},\mathbf{h}) \right) \] where \(Z\) is the partition function.
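As a concrete sketch of this energy function in NumPy (the dimensions and parameter values below are hypothetical, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions and parameters, for illustration only
n_visible, n_hidden = 6, 4
a = rng.normal(size=n_visible)                       # visible bias a
b = rng.normal(size=n_hidden)                        # hidden bias b
W = rng.normal(size=(n_visible, n_hidden), scale=0.1)  # weight matrix W

def energy(v, h):
    """E(v, h) = -a^T v - b^T h - v^T W h."""
    return -a @ v - b @ h - v @ W @ h

v = rng.integers(0, 2, size=n_visible).astype(float)
h = rng.integers(0, 2, size=n_hidden).astype(float)
print(energy(v, h))
```

The unnormalized joint probability of a configuration is then `np.exp(-energy(v, h))`; lower energy means higher probability.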

The conditional distributions are: \[ \begin{align} P(\mathbf{v}|\mathbf{h}) &= \frac{1}{Z(\mathbf{h})} \, \exp\left(\left( \mathbf{a} + \mathbf{W}\mathbf{h} \right)^{\top}\,\mathbf{v}\right) \\ P(\mathbf{h}|\mathbf{v}) &= \frac{1}{Z(\mathbf{v})} \, \exp\left(\left( \mathbf{b}^{\top} + \mathbf{v}^{\top}\mathbf{W} \right)\mathbf{h}\right) \end{align} \]

It is easy to verify that \(h_i\) is independent of \(h_j\), for \(i \neq j\), given an observed \(\mathbf{v}\); the same holds for \(v_i\), \(v_j\) given \(\mathbf{h}\). Thus sampling from \(P(\mathbf{v},\mathbf{h})\) can be achieved by sampling from the two conditional distributions alternately (i.e., a block Gibbs sampler).
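The block Gibbs sampler can be sketched as follows; for binary units the conditionals factorize into per-unit sigmoids, \(P(h_j=1\mid\mathbf{v})=\sigma(b_j+\mathbf{v}^{\top}\mathbf{W}_j)\) and \(P(v_i=1\mid\mathbf{h})=\sigma(a_i+\mathbf{W}_{i,:}\mathbf{h})\) (parameter values below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical small model, for illustration only
n_visible, n_hidden = 6, 4
a = np.zeros(n_visible)
b = np.zeros(n_hidden)
W = rng.normal(size=(n_visible, n_hidden), scale=0.1)

def gibbs_step(v):
    """One block-Gibbs sweep: sample h | v, then v | h."""
    p_h = sigmoid(b + v @ W)                       # P(h_j = 1 | v)
    h = (rng.random(n_hidden) < p_h).astype(float)
    p_v = sigmoid(a + W @ h)                       # P(v_i = 1 | h)
    v = (rng.random(n_visible) < p_v).astype(float)
    return v, h

v = rng.integers(0, 2, size=n_visible).astype(float)
for _ in range(100):                               # run the chain for a while
    v, h = gibbs_step(v)
```

After enough sweeps, \((\mathbf{v},\mathbf{h})\) is approximately a sample from \(P(\mathbf{v},\mathbf{h})\).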

The parameters \(\mathbf{a},\mathbf{b},\mathbf{W}\) of \(E(\mathbf{v},\mathbf{h})\) can be optimized by the Contrastive Divergence algorithm, using an energy function \(U(\mathbf{v})\) for \(\mathbf{v}\) alone, satisfying \(U(\mathbf{v}) - \log Z = \log P(\mathbf{v})\). While it is simpler to derive \(U(\mathbf{v})\) by invoking the conditional independence of \(h_i\), \(h_j\) beforehand and working with element-wise notation, I provide here the derivation using vector notation.
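A minimal sketch of a CD-1 parameter update (one Gibbs sweep for the negative phase; the learning rate and dimensions are hypothetical, and a practical implementation would batch over many training vectors):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical model and learning rate, for illustration only
n_visible, n_hidden, lr = 6, 4, 0.1
a = np.zeros(n_visible)
b = np.zeros(n_hidden)
W = rng.normal(size=(n_visible, n_hidden), scale=0.1)

def cd1_update(v0):
    """One CD-1 step: positive phase from data v0, negative phase from one Gibbs sweep."""
    global a, b, W
    ph0 = sigmoid(b + v0 @ W)                      # P(h = 1 | v0), positive phase
    h0 = (rng.random(n_hidden) < ph0).astype(float)
    pv1 = sigmoid(a + W @ h0)                      # reconstruction P(v = 1 | h0)
    v1 = (rng.random(n_visible) < pv1).astype(float)
    ph1 = sigmoid(b + v1 @ W)                      # P(h = 1 | v1), negative phase
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    a += lr * (v0 - v1)
    b += lr * (ph0 - ph1)

v0 = rng.integers(0, 2, size=n_visible).astype(float)
for _ in range(10):
    cd1_update(v0)
```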

Derivation of \(U(\mathbf{v})\) Using Vector Notation

The marginal distribution \(P(\mathbf{v})\) is: \[ P(\mathbf{v}) = \sum_{\mathbf{h}} P(\mathbf{v},\mathbf{h}) = \frac{1}{Z} \, \exp\left( \mathbf{a}^{\top}\mathbf{v} \right) \cdot \sum_{\mathbf{h}} \exp\left( \left( \mathbf{b}^{\top}+\mathbf{v}^{\top}\mathbf{W} \right)\mathbf{h} \right) \] which gives the following energy function for \(\mathbf{v}\): \[ U(\mathbf{v}) = \log P(\mathbf{v}) + \log Z = \mathbf{a}^{\top}\mathbf{v} + \log\sum_{\mathbf{h}}\exp\left( \left( \mathbf{b}^{\top}+\mathbf{v}^{\top}\mathbf{W} \right)\mathbf{h} \right) \] If the vector \(\mathbf{h}\) has \(k\) elements, we can further write: \[ \begin{align} \sum_{\mathbf{h}}\exp\left( \left( \mathbf{b}^{\top}+\mathbf{v}^{\top}\mathbf{W} \right)\mathbf{h} \right) &= \sum_{h_1,h_2,\dots,h_k} \exp\left( \sum_{j=1}^k(b_j + \mathbf{v}^{\top} \mathbf{W}_j)\,h_j \right) \\ &= \sum_{h_1,h_2,\dots,h_k} \prod_{j=1}^k \exp\left( (b_j+\mathbf{v}^{\top} \mathbf{W}_j) \,h_j \right) \\ &= \prod_{j=1}^k \sum_{h_j} \exp\left( (b_j+\mathbf{v}^{\top} \mathbf{W}_j) \,h_j \right) \end{align} \] where \(\mathbf{W}_j\) is the \(j\)-th column of the matrix \(\mathbf{W}\). We then have: \[ U(\mathbf{v}) = \mathbf{a}^{\top}\mathbf{v} + \sum_{j=1}^k \log \sum_{h_j} \exp\left( (b_j+\mathbf{v}^{\top} \mathbf{W}_j) \,h_j \right) \] Given that \(h_j\) is a binary variable, the inner sum \(\sum_{h_j}\) runs over \(h_j \in \{0,1\}\) and can be evaluated explicitly, yielding: \[ U(\mathbf{v}) = \mathbf{a}^{\top}\mathbf{v} + \sum_{j=1}^k \log \left( 1 + \exp\left( b_j+\mathbf{v}^{\top} \mathbf{W}_j \right) \right) \]
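The closed form can be checked numerically against the brute-force sum over all \(2^k\) hidden configurations; in the sketch below (hypothetical dimensions and parameters), the two computations of \(U(\mathbf{v})\) agree:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model, for illustration only
n_visible, n_hidden = 6, 4
a = rng.normal(size=n_visible)
b = rng.normal(size=n_hidden)
W = rng.normal(size=(n_visible, n_hidden), scale=0.1)

def U(v):
    """Closed form: U(v) = a^T v + sum_j log(1 + exp(b_j + v^T W_j))."""
    return a @ v + np.sum(np.logaddexp(0.0, b + v @ W))

def U_brute(v):
    """Brute force: log of the sum over all 2^k hidden configurations of exp(-E(v, h))."""
    total = 0.0
    for bits in range(2 ** n_hidden):
        h = np.array([(bits >> j) & 1 for j in range(n_hidden)], dtype=float)
        total += np.exp(a @ v + b @ h + v @ W @ h)
    return np.log(total)

v = rng.integers(0, 2, size=n_visible).astype(float)
print(np.isclose(U(v), U_brute(v)))  # the two evaluations agree
```

Here `np.logaddexp(0.0, x)` computes \(\log(1+e^{x})\) in a numerically stable way, avoiding overflow for large \(x\).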