Score Matching

Problem Statement

The goal is to find the optimal \(\theta\) that minimizes the negative log-likelihood \(U(\mathbf{X};\theta)\) over a given dataset \(\mathbf{X} = \{\mathbf{x}_1,\dots,\mathbf{x}_N\}\), without having to estimate the partition function \(Z(\theta)\):

\[ \begin{align} U(\mathbf{X};\theta) &= -\frac{1}{N} \sum_{i=1}^N \log p_m(\mathbf{x}_i;\theta) = \log Z(\theta) - \frac{1}{N} \sum_{i=1}^N \log \tilde{p}_m(\mathbf{x}_i;\theta) \\ p_m(\mathbf{x};\theta) &= \frac{1}{Z(\theta)} \, \tilde{p}_m(\mathbf{x};\theta) \\ Z(\theta) &= \int \tilde{p}_m(\mathbf{x};\theta)\,\mathrm{d}\mathbf{x} \end{align} \]
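To make the difficulty concrete, here is a minimal sketch (not from the original text) of evaluating \(U(\mathbf{X};\theta)\) for a toy one-dimensional unnormalized model, where \(Z(\theta)\) can still be obtained by numerical quadrature; it is exactly this step that becomes infeasible once \(\mathbf{x}\) is high-dimensional. The model \(\log\tilde{p}_m(x;\theta) = -\theta x^4\) and the grid bounds are illustrative assumptions.

```python
import jax.numpy as jnp

def log_p_tilde(x, theta):
    # toy unnormalized log-density: log p~_m(x; theta) = -theta * x^4
    return -theta * x**4

def U(X, theta):
    # negative log-likelihood; needs Z(theta), obtained here by 1-D quadrature only
    grid = jnp.linspace(-10.0, 10.0, 20001)
    dx = grid[1] - grid[0]
    Z = jnp.sum(jnp.exp(log_p_tilde(grid, theta))) * dx
    return jnp.log(Z) - jnp.mean(log_p_tilde(X, theta))
```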

Score Matching

Suppose each \(\mathbf{x}\) is a \(k\)-dimensional vector. The score function of the model \(p_m(\mathbf{x};\theta)\) is defined as: \[ \mathbf{s}(\mathbf{x};\theta) = \begin{pmatrix} \frac{\partial \log p_m(\mathbf{x};\theta)}{\partial x_1} \\ \vdots \\ \frac{\partial \log p_m(\mathbf{x};\theta)}{\partial x_k} \end{pmatrix} = \begin{pmatrix} \mathbf{s}_1(\mathbf{x};\theta) \\ \vdots \\ \mathbf{s}_k(\mathbf{x};\theta) \end{pmatrix} = \nabla_{\mathbf{x}} \log p_m(\mathbf{x};\theta) \] It is easy to see that \(\mathbf{s}(\mathbf{x};\theta) = \nabla_{\mathbf{x}} \log \tilde{p}_m(\mathbf{x};\theta)\), since: \[ \begin{align} \mathbf{s}(\mathbf{x};\theta) &= \nabla_{\mathbf{x}} \log p_m(\mathbf{x};\theta) = \frac{1}{p_m(\mathbf{x};\theta)} \nabla_{\mathbf{x}} p_m(\mathbf{x};\theta) \\ &= \frac{1}{\tilde{p}_m(\mathbf{x};\theta) / Z(\theta)} \cdot \nabla_{\mathbf{x}} \frac{\tilde{p}_m(\mathbf{x};\theta)}{Z(\theta)} = \frac{1}{\tilde{p}_m(\mathbf{x};\theta)} \nabla_{\mathbf{x}} \tilde{p}_m(\mathbf{x};\theta) = \nabla_{\mathbf{x}} \log \tilde{p}_m(\mathbf{x};\theta) \end{align} \]
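The invariance of the score to \(Z(\theta)\) is easy to check numerically. The sketch below is a small sanity check, assuming a Gaussian model whose normalizer happens to be known in closed form; the function names are illustrative.

```python
import jax
import jax.numpy as jnp

def log_p_tilde(x, sigma):
    # unnormalized Gaussian: log p~_m(x; sigma) = -||x||^2 / (2 sigma^2)
    return -0.5 * jnp.sum(x**2) / sigma**2

def log_p(x, sigma):
    # normalized version; log Z(sigma) = (k/2) log(2 pi sigma^2)
    k = x.shape[0]
    return log_p_tilde(x, sigma) - 0.5 * k * jnp.log(2 * jnp.pi * sigma**2)

x, sigma = jnp.array([0.3, -1.2, 0.7]), 1.5
assert jnp.allclose(jax.grad(log_p_tilde)(x, sigma),
                    jax.grad(log_p)(x, sigma))   # Z(theta) drops out of the score
```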

If we denote the score function of the data distribution \(p_d(\mathbf{x})\) by \(\mathbf{s}_d(\mathbf{x})\), the score matching objective is to match \(\mathbf{s}(\mathbf{x};\theta)\) to \(\mathbf{s}_d(\mathbf{x})\) under a squared loss: \[ J(\theta) = \frac{1}{2} \int p_d(\mathbf{x}) \, \left\| \mathbf{s}(\mathbf{x};\theta) - \mathbf{s}_d(\mathbf{x}) \right\|_2^2 \,d\mathbf{x} \]

Under some weak regularity conditions, \(J(\theta)\) can be rewritten (via integration by parts) so that the unknown data score \(\mathbf{s}_d(\mathbf{x})\) drops out: \[ \begin{align} & J(\theta) = \int p_d(\mathbf{x})\sum_{i=1}^k \left[ \frac{\partial \mathbf{s}_i(\mathbf{x};\theta)}{\partial x_i} + \frac{1}{2} \mathbf{s}_i(\mathbf{x};\theta)^2 \right]\,d\mathbf{x} + const \\ & \text{where} \;\, \frac{\partial \mathbf{s}_i(\mathbf{x};\theta)}{\partial x_i} = \frac{\partial^2 \log \tilde{p}_m(\mathbf{x};\theta)}{\partial x_i^2} \end{align} \] or equivalently, using expectation and matrix notation (up to the same additive constant): \[ \begin{align} & J(\theta) = \mathbb{E}_{p_d(\mathbf{x})}\left[ \operatorname{tr}\left( \nabla_{\mathbf{x}} \mathbf{s}(\mathbf{x};\theta) \right) + \frac{1}{2} \left\| \mathbf{s}(\mathbf{x};\theta) \right\|^2_2 \right] \\ & \text{where} \;\, \nabla_{\mathbf{x}} \mathbf{s}(\mathbf{x};\theta) = \nabla^2_{\mathbf{x}} \log \tilde{p}_m(\mathbf{x};\theta) \;\, \text{is the Hessian matrix of} \; \log \tilde{p}_m(\mathbf{x};\theta) \end{align} \]
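In this form \(J(\theta)\) depends only on derivatives of \(\log \tilde{p}_m\) and can be estimated directly from samples. Below is a minimal sketch of the empirical objective, assuming some differentiable unnormalized log-density; the toy model inside is only a placeholder.

```python
import jax
import jax.numpy as jnp

def log_p_tilde(x, theta):
    # placeholder unnormalized log-density log p~_m(x; theta)
    return -theta * jnp.sum(x**2)

score = jax.grad(log_p_tilde)                    # s(x; theta) = grad_x log p~_m

def j_per_sample(x, theta):
    hess = jax.jacfwd(score)(x, theta)           # Hessian of log p~_m at x
    s = score(x, theta)
    return jnp.trace(hess) + 0.5 * jnp.dot(s, s)

def J(X, theta):
    # empirical expectation over the rows of X (the dataset)
    return jnp.mean(jax.vmap(j_per_sample, in_axes=(0, None))(X, theta))
```

Minimizing this with `jax.grad(J, argnums=1)` then fits \(\theta\) without ever touching \(Z(\theta)\).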

For further details, see Hyvärinen (2005).

Score Estimation in Implicit Generative Models

In implicit generative models (e.g., GANs), the sampling process is specified explicitly by:

\[ \mathbf{x} = g(\boldsymbol{\epsilon};\theta) \]

where \(\boldsymbol{\epsilon}\) is a random noise variable whose distribution does not depend on \(\theta\). The induced density \(p_m(\mathbf{x};\theta)\) is intractable.

In this case, the score \(\nabla_{\mathbf{x}} \log p_m(\mathbf{x};\theta)\) can be estimated by a dedicated score network \(\mathbf{s}(\mathbf{x};\phi)\), trained by minimizing the score-matching objective on samples drawn from the model: \[ J(\phi) = \mathbb{E}_{p_m(\mathbf{x};\theta)}\left[ \operatorname{tr}\left( \nabla_{\mathbf{x}} \mathbf{s}(\mathbf{x};\phi) \right) + \frac{1}{2} \left\| \mathbf{s}(\mathbf{x};\phi) \right\|^2_2 \right] \] The learned score network can then be used to estimate quantities that involve the score \(\nabla_{\mathbf{x}} \log p_m(\mathbf{x};\theta)\), for example the gradient of the entropy of \(p_m(\mathbf{x};\theta)\), i.e., \(\nabla_{\theta} H\left(p_m(\mathbf{x};\theta)\right)\): \[ \begin{align} \nabla_{\theta} H\left( p_m(\mathbf{x};\theta) \right) &= -\nabla_{\theta} \mathbb{E}_{p_m(\mathbf{x};\theta)}\left[ \log p_m(\mathbf{x};\theta) \right] = -\nabla_{\theta} \mathbb{E}_{p(\boldsymbol{\epsilon})}\left[ \log p_m(g(\boldsymbol{\epsilon};\theta);\theta) \right] \\ &= -\mathbb{E}_{p(\boldsymbol{\epsilon})}\left[ \nabla_{\theta} \log p_m(\mathbf{x};\theta)\big|_{\mathbf{x}=g(\boldsymbol{\epsilon};\theta)} + \nabla_{\mathbf{x}} \log p_m(\mathbf{x};\theta)\big|_{\mathbf{x}=g(\boldsymbol{\epsilon};\theta)} \cdot \nabla_{\theta} g(\boldsymbol{\epsilon};\theta) \right] \\ &= -\mathbb{E}_{p_m(\mathbf{x};\theta)}\left[ \nabla_{\theta} \log p_m(\mathbf{x};\theta) \right] - \mathbb{E}_{p(\boldsymbol{\epsilon})}\left[ \nabla_{\mathbf{x}} \log p_m(\mathbf{x};\theta)\big|_{\mathbf{x}=g(\boldsymbol{\epsilon};\theta)} \cdot \nabla_{\theta} g(\boldsymbol{\epsilon};\theta) \right] \\ &= - \mathbb{E}_{p(\boldsymbol{\epsilon})}\left[ \nabla_{\mathbf{x}} \log p_m(\mathbf{x};\theta)\big|_{\mathbf{x}=g(\boldsymbol{\epsilon};\theta)} \cdot \nabla_{\theta} g(\boldsymbol{\epsilon};\theta) \right] \end{align} \] where the first term vanishes, \(\mathbb{E}_{p_m(\mathbf{x};\theta)}\left[ \nabla_{\theta} \log p_m(\mathbf{x};\theta) \right]=0\), since: \[ \begin{align} \mathbb{E}_{p_m(\mathbf{x};\theta)}\left[ \nabla_{\theta} \log p_m(\mathbf{x};\theta) \right] &= \int p_m(\mathbf{x};\theta) \cdot \frac{\nabla_{\theta} p_m(\mathbf{x};\theta)}{p_m(\mathbf{x};\theta)} \,d\mathbf{x} = \int \nabla_{\theta} p_m(\mathbf{x};\theta)\,d\mathbf{x} \\ &= \nabla_{\theta} \int p_m(\mathbf{x};\theta)\,d\mathbf{x} = \nabla_{\theta} 1 = 0 \end{align} \] The remaining expectation can be approximated with samples of \(\boldsymbol{\epsilon}\), with \(\nabla_{\mathbf{x}} \log p_m(\mathbf{x};\theta)\) replaced by the learned score network \(\mathbf{s}(\mathbf{x};\phi)\).
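A hedged sketch of how the two pieces fit together is below: a small score network \(\mathbf{s}(\mathbf{x};\phi)\) is fit on model samples with the objective \(J(\phi)\) above, and then plugged into the last line of the entropy-gradient derivation. The generator, network architecture, and initialization here are illustrative assumptions, not the construction used by Li and Turner (2017), and the optimization loop for \(\phi\) is omitted.

```python
import jax
import jax.numpy as jnp

def g(eps, theta):
    # toy implicit generator x = g(eps; theta): shift-and-scale of noise
    return theta[0] + theta[1] * eps

def score_net(x, phi):
    # tiny one-hidden-layer network standing in for s(x; phi)
    h = jnp.tanh(x @ phi["W1"] + phi["b1"])
    return h @ phi["W2"] + phi["b2"]

def sm_loss(phi, xs):
    # J(phi) = E_{p_m(x;theta)}[ tr(grad_x s) + 0.5 ||s||^2 ] over model samples xs
    def per_sample(x):
        s = score_net(x, phi)
        jac = jax.jacfwd(score_net)(x, phi)
        return jnp.trace(jac) + 0.5 * jnp.dot(s, s)
    return jnp.mean(jax.vmap(per_sample)(xs))

def entropy_grad(theta, phi, eps_batch):
    # grad_theta H ~= -E_eps[ s(g(eps;theta); phi) . grad_theta g(eps;theta) ]
    def per_sample(eps):
        x = g(eps, theta)
        s = score_net(x, phi)                               # learned score, phi held fixed
        jac_theta = jax.jacobian(g, argnums=1)(eps, theta)  # shape (k, dim(theta))
        return s @ jac_theta
    return -jnp.mean(jax.vmap(per_sample)(eps_batch), axis=0)

# illustrative shapes and parameters
key = jax.random.PRNGKey(0)
k, hdim = 2, 32
phi = {"W1": 0.1 * jax.random.normal(key, (k, hdim)), "b1": jnp.zeros(hdim),
       "W2": 0.1 * jax.random.normal(key, (hdim, k)), "b2": jnp.zeros(k)}
theta = jnp.array([0.0, 1.0])
eps_batch = jax.random.normal(key, (128, k))
xs = jax.vmap(g, in_axes=(0, None))(eps_batch, theta)       # model samples for sm_loss
# (fit phi by gradient descent on sm_loss(phi, xs) before calling entropy_grad)
print(entropy_grad(theta, phi, eps_batch))
```

Note that the score network is trained on samples from the current \(p_m(\mathbf{x};\theta)\), so it has to be re-fit (or at least warm-started) whenever \(\theta\) changes.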

For further details, see Li and Turner (2017).

Reducing Computation Cost

\(\nabla_{\mathbf{x}} \mathbf{s}(\mathbf{x};\theta)\) is the Hessian matrix of \(\log \tilde{p}_m(\mathbf{x};\theta)\), so evaluating the \(\operatorname{tr}(\cdot)\) term in \(J(\theta)\) requires on the order of \(k\) extra gradient computations per data point, which becomes expensive when \(\mathbf{x}\) is high-dimensional.
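The section breaks off here without naming a specific remedy. As one common way to make this term cheaper (an assumption on my part, not from the text above), the trace can be estimated with a single Hessian-vector product against a random probe vector \(\mathbf{v}\) with \(\mathbb{E}[\mathbf{v}\mathbf{v}^\top]=\mathbf{I}\), instead of building the full \(k \times k\) Hessian:

```python
import jax
import jax.numpy as jnp

def log_p_tilde(x, theta):
    # same toy unnormalized log-density as in the earlier sketches
    return -theta * jnp.sum(x**2)

def trace_exact(x, theta):
    # full k x k Hessian of log p~_m, i.e. O(k) gradient evaluations
    return jnp.trace(jax.jacfwd(jax.grad(log_p_tilde))(x, theta))

def trace_estimate(x, theta, key):
    # unbiased single-probe estimate: v^T (Hessian) v with one HVP
    v = jax.random.normal(key, x.shape)
    score_fn = lambda x_: jax.grad(log_p_tilde)(x_, theta)
    hvp = jax.jvp(score_fn, (x,), (v,))[1]
    return jnp.dot(v, hvp)
```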

References

Hyvärinen, Aapo. 2005. “Estimation of Non-Normalized Statistical Models by Score Matching.” Journal of Machine Learning Research 6 (Apr): 695–709.

Li, Yingzhen, and Richard E. Turner. 2017. “Gradient Estimators for Implicit Models.” arXiv:1705.07107 [cs, stat]. http://arxiv.org/abs/1705.07107.