Softmax Speedup

Problem Statement

To optimize \(\theta\) for a softmax classification model \(p(y|\mathbf{x};\theta)\) (often written as \(p(w|\mathbf{c};\theta)\) in the NLP literature) without explicitly computing the partition function \(Z(\mathbf{x};\theta)\):

\[ \begin{align} p(y=i|\mathbf{x};\theta) &= \frac{\tilde{p}(y=i|\mathbf{x};\theta)}{Z(\mathbf{x};\theta)} \\ Z(\mathbf{x};\theta) &= \sum_{j=1}^m \tilde{p}(y=j|\mathbf{x};\theta) \end{align} \] where the unnormalized \(\tilde{p}(y|\mathbf{x};\theta)\) is typically formulated as \(\tilde{p}(y=i|\mathbf{x};\theta)=\exp\left( u_i(\mathbf{x},y;\theta) \right)\), and \(i=1,\dots,m\) indexes the \(m\) classes.
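A minimal numpy sketch (my own illustration; the scores and dimensions are hypothetical) of why the partition function is the bottleneck: computing \(Z(\mathbf{x};\theta)\) touches every one of the \(m\) classes.

```python
import numpy as np

def softmax_probability(scores, i):
    """Normalized p(y=i | x) from unnormalized scores u_1..u_m.

    scores: shape (m,), the scores u_j(x; theta) for all m classes.
    The partition function Z(x; theta) = sum_j exp(u_j) sums over every
    class, which is the O(m) cost the methods below try to avoid.
    """
    scores = scores - scores.max()     # subtract max for numerical stability
    unnormalized = np.exp(scores)      # tilde{p}(y=j | x) for all j
    Z = unnormalized.sum()             # the O(m) partition function
    return unnormalized[i] / Z

# toy usage with m = 5 classes and arbitrary scores
print(softmax_probability(np.array([1.0, 0.2, -0.5, 2.0, 0.1]), i=3))
```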

Hierarchical Softmax

Encode each class \(y\) by a binary sequence \(b_1(y), b_2(y), \dots, b_k(y)\), where \(b_i(y) \in \{0, 1\}\), and decompose \(p(y|\mathbf{x};\theta)\) as: \[ p(y|\mathbf{x};\theta) = \prod_{i=1}^k p(b_i(y)|b_{i-1}(y),\dots,b_1(y),\mathbf{x};\theta) \] such that each decomposed term is a binary classifier. Evaluating \(p(y|\mathbf{x};\theta)\) then takes only \(k\) (roughly \(\log_2 m\) for a balanced tree) binary decisions instead of an \(O(m)\) normalization.
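A minimal sketch of this decomposition, assuming a word2vec-style parameterization in which each internal tree node has its own parameter vector; the names (`h`, `path`, `code`, `node_vecs`) and the specific parameterization are my own illustration, not prescribed by the formula above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hierarchical_softmax_prob(h, path, code, node_vecs):
    """p(y | x) as a product of binary decisions along the tree path.

    h         : context/feature vector for x, shape (d,)
    path      : indices of the internal nodes on the root-to-leaf path of y
    code      : the binary code b_1(y), ..., b_k(y) for y (0 or 1 per node)
    node_vecs : one parameter vector per internal node, shape (num_nodes, d)
    """
    prob = 1.0
    for node, bit in zip(path, code):
        p_one = sigmoid(node_vecs[node] @ h)          # p(b_i = 1 | parents, x)
        prob *= p_one if bit == 1 else (1.0 - p_one)  # pick the branch taken by y
    return prob

# toy usage: 3 internal nodes, d = 4, code length 2
rng = np.random.default_rng(0)
h = rng.normal(size=4)
node_vecs = rng.normal(size=(3, 4))
print(hierarchical_softmax_prob(h, path=[0, 2], code=[1, 0], node_vecs=node_vecs))
```

Each evaluation touches only the \(k\) nodes on the path, which is the source of the speedup.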

Further Speedup

Huffman coding can be used to construct the binary tree, so that more frequent classes \(y\) get shorter binary sequences.
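A minimal sketch (my own illustration) of building such codes with a heap; `freqs` is a hypothetical class-frequency table.

```python
import heapq

def huffman_codes(freqs):
    """Build binary codes from class frequencies; frequent classes get
    shorter codes, so their p(y | x) needs fewer binary decisions."""
    # heap entries: (subtree frequency, tie-breaker, {class: partial code})
    heap = [(f, i, {y: ""}) for i, (y, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, codes1 = heapq.heappop(heap)   # two least-frequent subtrees
        f2, _, codes2 = heapq.heappop(heap)
        merged = {y: "0" + c for y, c in codes1.items()}
        merged.update({y: "1" + c for y, c in codes2.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

# the most frequent class ("the") receives the shortest code
print(huffman_codes({"the": 500, "cat": 50, "sat": 40, "zygote": 1}))
```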

Noise Contrastive Estimation

The NCE method starts by choosing a noise distribution \(q(y)\), and recasts the original \(m\)-class classification problem as a binary classification problem: distinguishing samples drawn from the data distribution from samples drawn from the noise distribution.

The binary classifier is defined as: \[ \begin{align} p(D=1|\mathbf{x},y;\theta) &= \frac{p(y|\mathbf{x};\theta)}{p(y|\mathbf{x};\theta) + k\, q(y)} = \frac{\tilde{p}(y|\mathbf{x};\theta) / Z(\mathbf{x};\theta)}{\tilde{p}(y|\mathbf{x};\theta) / Z(\mathbf{x};\theta) + k\, q(y)} \\ p(D=0|\mathbf{x},y;\theta) &= \frac{k\, q(y)}{p(y|\mathbf{x};\theta) + k\, q(y)} = \frac{k\, q(y)}{\tilde{p}(y|\mathbf{x};\theta)/Z(\mathbf{x};\theta) + k\, q(y)} \end{align} \] and the NCE objective (which is maximized) is defined as: \[ \begin{align} \mathcal{L}_{\mathrm{NCE}} &= \mathbb{E}_{(\mathbf{x},y) \sim p_d(\mathbf{x},y)} \left[ \log p(D=1|\mathbf{x},y;\theta) + k\, \mathbb{E}_{\bar{y} \sim q(y)}\left[ \log p(D=0|\mathbf{x},\bar{y};\theta) \right] \right] \\ &\approx \sum_{\mathbf{x},y } \left[ \log p(D=1|\mathbf{x},y;\theta) + k\cdot\frac{1}{k}\sum_{\substack{j = 1 \dots k \\ \bar{y}_j \sim q(y)}} \log p(D=0|\mathbf{x},\bar{y}_j;\theta) \right] \\ &= \sum_{\mathbf{x},y } \left[ \log p(D=1|\mathbf{x},y;\theta) + \sum_{j = 1}^k \log p(D=0|\mathbf{x},\bar{y}_j;\theta) \right] \end{align} \] where \(\sum_{\mathbf{x},y}\) is a summation over the training data in a mini-batch, and \(\frac{1}{k}\sum_{\substack{j = 1 \dots k \\ \bar{y}_j \sim q(y)}}\) is a Monte Carlo estimator of \(\mathbb{E}_{\bar{y} \sim q(y)}\).

A necessary condition for NCE to work is that \(q(y) \neq 0\) wherever \(p_d(y|\mathbf{x}) \neq 0\), i.e. the noise distribution must cover the support of the data distribution.
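A sketch of the per-example NCE loss above, with \(Z(\mathbf{x};\theta)\) fixed to 1 (as discussed in the section on \(Z(\mathbf{x};\theta)\) below, following Mnih and Teh 2012); the score function and all names here are hypothetical.

```python
import numpy as np

def nce_loss(score_true, scores_noise, q_true, q_noise, k):
    """Negative of the per-example NCE objective.

    score_true  : u(x, y) for the observed class y
    scores_noise: u(x, y_bar_j) for the k noise samples, shape (k,)
    q_true      : q(y) for the observed class
    q_noise     : q(y_bar_j) for the noise samples, shape (k,)
    Z(x; theta) is fixed to 1, so tilde{p} = exp(u) is used directly.
    """
    p_true = np.exp(score_true)
    p_noise = np.exp(scores_noise)
    # log p(D=1 | x, y): the data sample should be classified as "data"
    log_pos = np.log(p_true) - np.log(p_true + k * q_true)
    # log p(D=0 | x, y_bar_j): the noise samples should be classified as "noise"
    log_neg = np.log(k * q_noise) - np.log(p_noise + k * q_noise)
    return -(log_pos + log_neg.sum())

# toy usage: k = 4 noise samples drawn from a uniform q over m = 10 classes
k, m = 4, 10
rng = np.random.default_rng(0)
print(nce_loss(score_true=1.3,
               scores_noise=rng.normal(size=k),
               q_true=1.0 / m,
               q_noise=np.full(k, 1.0 / m),
               k=k))
```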

The Gradient of NCE Loss

\[ \begin{align} \frac{\partial \mathcal{L}_{\mathrm{NCE}}}{\partial \theta} &= \frac{\partial}{\partial \theta} \mathbb{E}_{p_d(\mathbf{x},y)} \left[ \log \frac{p(y|\mathbf{x};\theta)}{p(y|\mathbf{x};\theta) + k\, q(y)} + k\, \mathbb{E}_{q(y)}\left[ \log \frac{k\, q(y)}{p(y|\mathbf{x};\theta) + k\, q(y)} \right] \right] \\ &= \frac{\partial}{\partial \theta} \mathbb{E}_{p_d(\mathbf{x})} \left[ \mathbb{E}_{p_d(y|\mathbf{x})} \left[ \log \frac{p(y|\mathbf{x};\theta)}{p(y|\mathbf{x};\theta) + k\, q(y)} \right] + k\, \mathbb{E}_{q(y)}\left[ \log \frac{k\, q(y)}{p(y|\mathbf{x};\theta) + k\, q(y)} \right] \right] \\ &= \mathbb{E}_{p_d(\mathbf{x})} \left[ \mathbb{E}_{p_d(y|\mathbf{x})} \left[ \frac{1}{p(y|\mathbf{x};\theta)} \cdot \frac{k\,q(y)}{p(y|\mathbf{x};\theta) + k\, q(y)} \cdot \frac{\partial p(y|\mathbf{x};\theta)}{\partial \theta} \right] - k\, \mathbb{E}_{q(y)}\left[ \frac{1}{p(y|\mathbf{x};\theta) + k\, q(y)} \cdot \frac{\partial p(y|\mathbf{x};\theta)}{\partial \theta} \right] \right] \\ &= \mathbb{E}_{p_d(\mathbf{x})} \left[ \mathbb{E}_{p_d(y|\mathbf{x})} \left[ \frac{k\,q(y)}{p(y|\mathbf{x};\theta) + k\, q(y)} \cdot \frac{\partial \log p(y|\mathbf{x};\theta)}{\partial \theta} \right] - k\, \mathbb{E}_{q(y)}\left[ \frac{p(y|\mathbf{x};\theta)}{p(y|\mathbf{x};\theta) + k\, q(y)} \cdot \frac{\partial \log p(y|\mathbf{x};\theta)}{\partial \theta} \right] \right] \\ &= \mathbb{E}_{p_d(\mathbf{x})} \sum_y \left[ \frac{k\,q(y)}{p(y|\mathbf{x};\theta) + k\, q(y)} \cdot \bigg( p_d(y|\mathbf{x}) - p(y|\mathbf{x};\theta) \bigg) \cdot \frac{\partial \log p(y|\mathbf{x};\theta)}{\partial \theta} \right] \end{align} \]

Note that:

\[ \sum_y \left[ p(y|\mathbf{x};\theta) \cdot \frac{\partial \log p(y|\mathbf{x};\theta)}{\partial \theta} \right] = \sum_y \frac{\partial p(y|\mathbf{x};\theta)}{\partial \theta} = \frac{\partial}{\partial \theta} \sum_y p(y|\mathbf{x};\theta) = \frac{\partial 1}{\partial \theta} = 0 \]

Thus, as \(k \to \infty\), \(\frac{\partial \mathcal{L}_{\mathrm{NCE}}}{\partial \theta}\) approaches \(\frac{\partial}{\partial \theta} \mathbb{E}_{p_d(\mathbf{x},y)}\left[ \log p(y|\mathbf{x};\theta) \right]\), which is the gradient of maximum-likelihood training for the original classification problem.
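Spelling out the limit (my own expansion of the argument): as \(k \to \infty\), the weight \(\frac{k\,q(y)}{p(y|\mathbf{x};\theta) + k\,q(y)}\) tends to \(1\) for every \(y\) with \(q(y) > 0\), so

\[ \frac{\partial \mathcal{L}_{\mathrm{NCE}}}{\partial \theta} \to \mathbb{E}_{p_d(\mathbf{x})} \sum_y \bigg( p_d(y|\mathbf{x}) - p(y|\mathbf{x};\theta) \bigg) \frac{\partial \log p(y|\mathbf{x};\theta)}{\partial \theta} = \mathbb{E}_{p_d(\mathbf{x},y)} \left[ \frac{\partial \log p(y|\mathbf{x};\theta)}{\partial \theta} \right] \]

where the term involving \(p(y|\mathbf{x};\theta)\) vanishes by the identity above.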

The Global Optimum of NCE Loss

(This section is added by me. The original NCE paper (Gutmann and Hyvärinen 2010) has a similar theorem, but does not give the proof. I use the Euler–Lagrange equation from the calculus of variations, although its conditions are not strictly satisfied here, so this proof should only be taken as a discussion.) \[ \begin{align} \frac{\partial \mathcal{L}_{\mathrm{NCE}}}{\partial p(y|\mathbf{x};\theta)} &= \mathbb{E}_{p_d(\mathbf{x})} \sum_y \left[ \frac{k\,q(y)}{p(y|\mathbf{x};\theta) + k\,q(y)} \cdot \bigg( p_d(y|\mathbf{x}) - p(y|\mathbf{x};\theta) \bigg) \cdot \frac{1}{p(y|\mathbf{x};\theta)} \right] \end{align} \] According to the Euler–Lagrange equation: \[ \begin{align} \frac{\partial \mathcal{L}_{\mathrm{NCE}}}{\partial p(y|\mathbf{x};\theta)} = 0 &\Leftrightarrow \frac{k\,q(y)}{p(y|\mathbf{x};\theta) + k\,q(y)} \cdot \bigg( p_d(y|\mathbf{x}) - p(y|\mathbf{x};\theta) \bigg) \cdot \frac{1}{p(y|\mathbf{x};\theta)} = 0 \end{align} \] Since \(q(y)\) is arbitrary (and nonzero wherever \(p_d(y|\mathbf{x})\) is), the only solution at which \(\mathcal{L}_{\mathrm{NCE}}\) attains its global optimum is: \[ p(y|\mathbf{x};\theta) \equiv p_d(y|\mathbf{x}) \]

Dealing with \(Z(\mathbf{x};\theta)\)

For NCE, \(Z(\mathbf{x};\theta)\) can be learned as a parameter instead of computed. To avoid introducing extra free parameters, Mnih and Teh (2012) suggested fixing \(Z(\mathbf{x};\theta) \equiv 1\), and found that this works well. Fixing \(Z(\mathbf{x};\theta) \equiv 1\) is likely to induce an approximately self-normalized \(p(y|\mathbf{x};\theta)\).

Negative Sampling

Negative sampling (Mikolov et al. 2013) can be seen as a simplified version of noise contrastive estimation. The binary classifier is defined as: \[ \begin{align} p(D=1|\mathbf{x},y;\theta) &= \frac{p(y|\mathbf{x};\theta)}{p(y|\mathbf{x};\theta) + 1} \\ p(D=0|\mathbf{x},y;\theta) &= \frac{1}{p(y|\mathbf{x};\theta) + 1} \end{align} \] that is, \(k\,q(y) \equiv 1\), i.e. \(q(y) \equiv \frac{1}{k}\). The loss is derived as: \[ \begin{align} \mathcal{L}_{\mathrm{NEG}} &= \mathbb{E}_{(\mathbf{x},y) \sim p_d(\mathbf{x},y)} \left[ \log p(D=1|\mathbf{x},y;\theta) + k\,\mathbb{E}_{\bar{y} \sim q(y)}\left[ \log p(D=0|\mathbf{x},\bar{y};\theta) \right] \right] \\ &= \mathbb{E}_{p_d(\mathbf{x})} \left[ \mathbb{E}_{ p_d(y|\mathbf{x})}\left[ \log \frac{p(y|\mathbf{x};\theta)}{p(y|\mathbf{x};\theta) + 1} \right] + k\,\mathbb{E}_{q(y)}\left[ \log \frac{1}{p(y|\mathbf{x};\theta) + 1} \right] \right] \\ &\approx \frac{1}{N} \sum_{\mathbf{x},y } \left[ \log \frac{p(y|\mathbf{x};\theta)}{p(y|\mathbf{x};\theta) + 1} + \sum_{j = 1}^k \log \frac{1}{p(\bar{y}_j|\mathbf{x};\theta) + 1} \right] \end{align} \]
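With \(Z(\mathbf{x};\theta) \equiv 1\) as above, \(p(y|\mathbf{x};\theta) = \exp(u_y)\) and the classifier reduces to sigmoids of the score: \(p(D=1|\mathbf{x},y;\theta) = \sigma(u_y)\) and \(p(D=0|\mathbf{x},y;\theta) = \sigma(-u_y)\), the familiar word2vec form. A minimal sketch of the per-example loss (all names hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_sampling_loss(score_true, scores_noise):
    """Negative of the per-example negative-sampling objective.

    score_true  : u(x, y) for the observed class
    scores_noise: u(x, y_bar_j) for the k noise samples, shape (k,)
    With tilde{p} = exp(u) and Z = 1:
      p(D=1 | x, y) = exp(u) / (exp(u) + 1) = sigmoid(u)
      p(D=0 | x, y) = 1 / (exp(u) + 1)      = sigmoid(-u)
    """
    return -(np.log(sigmoid(score_true)) +
             np.log(sigmoid(-scores_noise)).sum())

# toy usage with k = 4 noise scores
rng = np.random.default_rng(0)
print(negative_sampling_loss(score_true=1.3, scores_noise=rng.normal(size=4)))
```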

The Normalizing Factor

Negative sampling does not train a properly normalized \(p(y|\mathbf{x};\theta)\) (Dyer 2014), unless \(k = m\), in which case \(q(y) \equiv \frac{1}{m}\) is a normalized uniform distribution, since: \[ \begin{align} \mathcal{L}_{\mathrm{NEG}} &= \mathbb{E}_{p_d(\mathbf{x})} \left[ \mathbb{E}_{ p_d(y|\mathbf{x})}\left[ \log \frac{p(y|\mathbf{x};\theta)}{p(y|\mathbf{x};\theta) + 1} \right] + k\,\mathbb{E}_{q(y)}\left[ \log \frac{1}{p(y|\mathbf{x};\theta) + 1} \right] \right] \\ &= \mathbb{E}_{p_d(\mathbf{x})} \left[ \mathbb{E}_{ p_d(y|\mathbf{x}) }\left[ \log \frac{p(y|\mathbf{x};\theta) / \frac{m}{k}}{p(y|\mathbf{x};\theta) / \frac{m}{k} + k \cdot \left( \frac{1}{k} \cdot \frac{k}{m} \right)} \right] + k\,\mathbb{E}_{q(y)}\left[ \log \frac{k \cdot \left( \frac{1}{k} \cdot \frac{k}{m} \right)}{p(y|\mathbf{x};\theta) / \frac{m}{k} + k \cdot \left( \frac{1}{k} \cdot \frac{k}{m} \right)} \right] \right] \end{align} \] If we take \(q(y) \equiv \frac{1}{m}\) (which is indeed often the case when the noise samples are drawn uniformly), then this is a standard NCE loss, and the learned "probability distribution" \(p(y|\mathbf{x};\theta)\) could potentially be converted into a truly normalized distribution by dividing it by the normalizing factor \(Z(\mathbf{x};\theta) \equiv \frac{m}{k}\). (This assertion is made by me.)
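To restate the identity above compactly (my own summary): the negative-sampling loss with uniform noise coincides with an NCE loss whose model and noise distribution are \[ \tilde{p}'(y|\mathbf{x};\theta) = \frac{p(y|\mathbf{x};\theta)}{m/k}, \qquad q'(y) = \frac{1}{m}, \qquad Z(\mathbf{x};\theta) \equiv \frac{m}{k}, \] so if NCE drives \(\tilde{p}'(y|\mathbf{x};\theta)\) toward \(p_d(y|\mathbf{x})\), the learned \(p(y|\mathbf{x};\theta)\) becomes approximately normalized after division by \(\frac{m}{k}\).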

References

Dyer, Chris. 2014. “Notes on Noise Contrastive Estimation and Negative Sampling.” arXiv preprint arXiv:1410.8251.

Gutmann, Michael, and Aapo Hyvärinen. 2010. “Noise-Contrastive Estimation: A New Estimation Principle for Unnormalized Statistical Models.” In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 297–304.

Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. “Distributed Representations of Words and Phrases and Their Compositionality.” In Advances in Neural Information Processing Systems, 3111–9.

Mnih, Andriy, and Yee Whye Teh. 2012. “A Fast and Simple Algorithm for Training Neural Probabilistic Language Models.” arXiv:1206.6426 [cs], June. http://arxiv.org/abs/1206.6426.