# Restricted Boltzmann Machine

## Standard RBM

The formulation of a standard Restricted Boltzmann Machine (RBM) consists of an observed binary variable $$\mathbf{v}$$ and a latent binary variable $$\mathbf{h}$$, together with an energy function defined as: $E(\mathbf{v},\mathbf{h}) = -\mathbf{a}^{\top} \mathbf{v} - \mathbf{b}^{\top}\mathbf{h}-\mathbf{v}^{\top}\mathbf{W}\,\mathbf{h}$ The joint probability of the model is defined as: $P(\mathbf{v},\mathbf{h}) = \frac{1}{Z} \, \exp\left( -E(\mathbf{v},\mathbf{h}) \right)$ where $$Z$$ is the partition function.
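As a concrete illustration, the energy of a configuration can be computed directly from the definition. The sketch below is a minimal NumPy version; the layer sizes and random parameter values are arbitrary assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 4              # illustrative sizes, not prescribed

a = rng.normal(size=n_visible)          # visible bias a
b = rng.normal(size=n_hidden)           # hidden bias b
W = rng.normal(size=(n_visible, n_hidden))

def energy(v, h):
    """E(v, h) = -a^T v - b^T h - v^T W h."""
    return -a @ v - b @ h - v @ W @ h

v = rng.integers(0, 2, size=n_visible).astype(float)
h = rng.integers(0, 2, size=n_hidden).astype(float)
e = energy(v, h)                        # -e is the unnormalized log-probability
```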

The conditional distributions are: \begin{align} P(\mathbf{v}|\mathbf{h}) &= \frac{1}{Z(\mathbf{h})} \, \exp\left(\left( \mathbf{a} + \mathbf{W}\mathbf{h} \right)^{\top}\,\mathbf{v}\right) \\ P(\mathbf{h}|\mathbf{v}) &= \frac{1}{Z(\mathbf{v})} \, \exp\left(\left( \mathbf{b}^{\top} + \mathbf{v}^{\top}\mathbf{W} \right)\mathbf{h}\right) \end{align}

It is easy to verify that $$h_i$$ is independent of $$h_j$$, for $$i \neq j$$, given an observed $$\mathbf{v}$$; the same holds for $$v_i$$ and $$v_j$$ given $$\mathbf{h}$$. Thus, sampling from $$P(\mathbf{v},\mathbf{h})$$ can be achieved by sampling from the two conditional distributions alternately (i.e., a block Gibbs sampler).
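Because each conditional factorizes over units, a full Gibbs sweep needs only two vectorized sampling steps: $$P(h_j = 1 \mid \mathbf{v}) = \sigma(b_j + \mathbf{v}^{\top}\mathbf{W}_j)$$ and $$P(v_i = 1 \mid \mathbf{h}) = \sigma(a_i + (\mathbf{W}\mathbf{h})_i)$$, where $$\sigma$$ is the logistic sigmoid. A minimal NumPy sketch (the function names and shapes are my own illustrative choices):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, a, b, W, rng):
    """One block-Gibbs step: sample all of h given v, then all of v given h.

    Conditional independence lets every hidden unit be sampled at once:
      P(h_j = 1 | v) = sigmoid(b_j + v^T W_j)
      P(v_i = 1 | h) = sigmoid(a_i + (W h)_i)
    """
    p_h = sigmoid(b + v @ W)                          # all hidden units at once
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = sigmoid(a + W @ h)                          # all visible units at once
    v_new = (rng.random(p_v.shape) < p_v).astype(float)
    return v_new, h
```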

The parameters $$\mathbf{a},\mathbf{b},\mathbf{W}$$ of $$E(\mathbf{v},\mathbf{h})$$ can be optimized by the Contrastive Divergence algorithm, using an energy function $$U(\mathbf{v})$$ for $$\mathbf{v}$$ that satisfies $$U(\mathbf{v}) - \log Z = \log P(\mathbf{v})$$. While it is simpler to derive $$U(\mathbf{v})$$ by exploiting the conditional independence of $$h_i$$ and $$h_j$$ beforehand and working with element-wise notation, I provide here the derivation using vector notation.
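A single Contrastive Divergence step (CD-1) can be sketched as below. The learning rate and the common trick of using hidden probabilities rather than samples in the gradient terms are illustrative choices on my part, not prescribed by the text:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, a, b, W, lr=0.1, rng=None):
    """One CD-1 update on a single binary training vector v0 (in place).

    Positive phase uses the data v0; negative phase uses one
    block-Gibbs reconstruction.
    """
    if rng is None:
        rng = np.random.default_rng()
    # positive phase: hidden activations driven by the data
    p_h0 = sigmoid(b + v0 @ W)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # negative phase: one reconstruction step
    p_v1 = sigmoid(a + W @ h0)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    p_h1 = sigmoid(b + v1 @ W)
    # approximate gradient of log P(v0): positive minus negative statistics
    W += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    a += lr * (v0 - v1)
    b += lr * (p_h0 - p_h1)
    return a, b, W
```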

### Derivation of $$U(\mathbf{v})$$ using Vector Notation

The marginal distribution $$P(\mathbf{v})$$ for the training data is: $P(\mathbf{v}) = \sum_{\mathbf{h}} P(\mathbf{v},\mathbf{h}) = \frac{1}{Z} \, \exp\left( \mathbf{a}^{\top}\mathbf{v} \right) \cdot \sum_{\mathbf{h}} \exp\left( \left( \mathbf{b}^{\top}+\mathbf{v}^{\top}\mathbf{W} \right)\mathbf{h} \right)$ which results in the following energy function for $$\mathbf{v}$$: $U(\mathbf{v}) = \log P(\mathbf{v}) + \log Z = \mathbf{a}^{\top}\mathbf{v} + \log\sum_{\mathbf{h}}\exp\left( \left( \mathbf{b}^{\top}+\mathbf{v}^{\top}\mathbf{W} \right)\mathbf{h} \right)$

If the vector $$\mathbf{h}$$ has $$k$$ elements, we can further expand: \begin{align} \sum_{\mathbf{h}}\exp\left( \left( \mathbf{b}^{\top}+\mathbf{v}^{\top}\mathbf{W} \right)\mathbf{h} \right) &= \sum_{h_1,h_2,\dots,h_k} \exp\left( \sum_{j=1}^k(b_j + \mathbf{v}^{\top} \mathbf{W}_j)\,h_j \right) \\ &= \sum_{h_1,h_2,\dots,h_k} \prod_{j=1}^k \exp\left( (b_j+\mathbf{v}^{\top} \mathbf{W}_j) \,h_j \right) \\ &= \prod_{j=1}^k \sum_{h_j} \exp\left( (b_j+\mathbf{v}^{\top} \mathbf{W}_j) \,h_j \right) \end{align} where $$\mathbf{W}_j$$ is the $$j$$-th column of the matrix $$\mathbf{W}$$. We then have: $U(\mathbf{v}) = \mathbf{a}^{\top}\mathbf{v} + \sum_{j=1}^k \log \sum_{h_j} \exp\left( (b_j+\mathbf{v}^{\top} \mathbf{W}_j) \,h_j \right)$

Given that $$h_j$$ is a binary variable, the sum over $$h_j$$ reduces to two terms ($$h_j = 0$$ and $$h_j = 1$$), giving: $U(\mathbf{v}) = \mathbf{a}^{\top}\mathbf{v} + \sum_{j=1}^k \log \left( 1 + \exp\left( b_j+\mathbf{v}^{\top} \mathbf{W}_j \right) \right)$
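The closed form translates directly into code; `np.logaddexp(0, x)` evaluates $$\log(1 + e^{x})$$ in a numerically stable way. A minimal sketch under the notation above:

```python
import numpy as np

def U(v, a, b, W):
    """U(v) = a^T v + sum_j log(1 + exp(b_j + v^T W_j)).

    b + v @ W gives the vector of hidden pre-activations b_j + v^T W_j;
    logaddexp(0, x) = log(1 + exp(x)) avoids overflow for large x.
    """
    return a @ v + np.logaddexp(0.0, b + v @ W).sum()
```

For a small hidden layer the result can be checked against brute-force marginalization over all $$2^k$$ hidden configurations, which is a useful sanity test when implementing this.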