Evaluation Metrics


Negative Log-Likelihood

The negative log-likelihood (NLL) of a model \(p_{\theta}(\mathbf{x})\) under the data distribution \(p_d(\mathbf{x})\) is defined as:

\[ \begin{align} \text{NLL} &= \mathbb{E}_{p_d(\mathbf{x})} \left[ -\log p_{\theta}(\mathbf{x}) \right] \end{align} \]
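
In practice the expectation is usually approximated by an average over data samples. The following is a minimal sketch of that estimate, assuming a univariate unit Gaussian stands in for \(p_{\theta}\); the helper names `gaussian_log_pdf` and `estimate_nll` and the three sample values are illustrative, not part of the original notes.

```python
import math


def gaussian_log_pdf(x, mean=0.0, std=1.0):
    """Log-density of a univariate Gaussian, standing in for log p_theta(x)."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def estimate_nll(samples, log_prob):
    """Monte Carlo estimate of E_{p_d}[-log p_theta(x)] from data samples."""
    return -sum(log_prob(x) for x in samples) / len(samples)


# Hypothetical data samples drawn from p_d (here just three fixed values).
samples = [-1.0, 0.0, 1.0]
nll = estimate_nll(samples, gaussian_log_pdf)
print(f"estimated NLL: {nll:.4f} nats")
```

The result is in nats because the natural logarithm is used; dividing by \(\log 2\) would convert it to bits.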

Find the Original NLL using Scaled Data

If the original data \(\mathbf{x}\) is \(k\)-dimensional, and each of its dimensions is scaled by \(\frac{1}{\sigma}\), such that the data fed into the model is \(\tilde{\mathbf{x}} = \frac{1}{\sigma} \mathbf{x}\), then by the change-of-variables formula:

\[ \begin{align} p_d(\mathbf{x}) &= \tilde{p}_d(\tilde{\mathbf{x}}) \left| \det\left( \frac{\mathrm{d}\tilde{\mathbf{x}}}{\mathrm{d}\mathbf{x}} \right) \right| = \tilde{p}_d(\tilde{\mathbf{x}})\left| \det\left( \frac{\mathrm{d}(\mathbf{x} / \sigma)}{\mathrm{d}\mathbf{x}} \right) \right| = \frac{1}{\sigma^k}\,\tilde{p}_d(\tilde{\mathbf{x}}) \\ p_{\theta}(\mathbf{x}) &= \frac{1}{\sigma^k}\, \tilde{p}_{\theta}(\tilde{\mathbf{x}}) \end{align} \]

Thus the NLL computed on \(\mathbf{x}\) and the NLL computed on \(\tilde{\mathbf{x}}\) are related as follows, where \(\left| \det\left( \mathrm{d}\mathbf{x} / \mathrm{d}\tilde{\mathbf{x}} \right) \right| = \sigma^k\) cancels the factor \(\frac{1}{\sigma^k}\) in front of \(\tilde{p}_d\):

\[ \begin{align} \text{NLL} &= \mathbb{E}_{p_d(\mathbf{x})} \left[ -\log p_{\theta}(\mathbf{x}) \right] \\ &= -\int p_d(\mathbf{x}) \log p_{\theta}(\mathbf{x})\,\mathrm{d}\mathbf{x} \\ &= -\int \frac{1}{\sigma^k}\,\tilde{p}_{d}(\tilde{\mathbf{x}}) \log \left( \frac{1}{\sigma^k}\, \tilde{p}_{\theta}(\tilde{\mathbf{x}}) \right)\left| \det\left( \frac{\mathrm{d}\mathbf{x}}{\mathrm{d}\tilde{\mathbf{x}}} \right) \right|\mathrm{d}\tilde{\mathbf{x}} \\ &= -\int \tilde{p}_{d}(\tilde{\mathbf{x}}) \left[ \log \tilde{p}_{\theta}(\tilde{\mathbf{x}}) - k\log \sigma \right]\mathrm{d}\tilde{\mathbf{x}} \\ &= \mathbb{E}_{\tilde{p}_d(\tilde{\mathbf{x}})} \left[ -\log \tilde{p}_{\theta}(\tilde{\mathbf{x}}) \right] + k\log \sigma \\ &= \widetilde{\text{NLL}} + k\log\sigma \end{align} \]
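
The relationship can be checked numerically. The sketch below assumes a diagonal Gaussian as the model on the scaled data (an illustrative choice, not something the notes prescribe), with hypothetical values for \(\sigma\), the data, and the model parameters. A Gaussian with mean \(m\) and standard deviation \(s\) in the scaled space corresponds to mean \(\sigma m\) and standard deviation \(\sigma s\) in the original space, so the NLL can be computed both ways and compared.

```python
import math


def diag_gaussian_log_pdf(x, mean, std):
    """Log-density of a Gaussian with independent, identically parameterized dims."""
    return sum(
        -0.5 * ((xi - mean) / std) ** 2
        - math.log(std)
        - 0.5 * math.log(2.0 * math.pi)
        for xi in x
    )


def nll(data, mean, std):
    """Average negative log-likelihood of the data under the Gaussian model."""
    return -sum(diag_gaussian_log_pdf(x, mean, std) for x in data) / len(data)


sigma = 255.0  # assumed scaling constant
k = 2          # data dimensionality
data = [[0.0, 255.0], [51.0, 102.0], [204.0, 153.0]]  # hypothetical original data
scaled = [[xi / sigma for xi in x] for x in data]       # what the model sees

# Hypothetical model parameters in the scaled space.
scaled_mean, scaled_std = 0.5, 0.3

nll_scaled = nll(scaled, scaled_mean, scaled_std)
nll_recovered = nll_scaled + k * math.log(sigma)  # NLL-tilde + k log(sigma)

# The same model expressed directly in the original data space.
nll_direct = nll(data, sigma * scaled_mean, sigma * scaled_std)

print(nll_recovered, nll_direct)
```

Both printed values should agree up to floating-point error, which is the content of the final line of the derivation.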

Continuous NLL as an Upper-Bound of Discrete NLL

To train a continuous model on discrete data (e.g., images), one may add uniform noise to the data. The NLL of the continuous model on the noise-augmented data is then an upper bound on the NLL of the discrete data.

For integer-valued pixels \(\mathbf{x}\) ranging from 0 to 255, adding uniform noise \(\mathbf{u} \sim \mathcal{U}[0, 1)\) independently to each dimension, such that \(\tilde{\mathbf{x}} = \mathbf{x} + \mathbf{u}\), we have (Theis, Oord, and Bethge 2015):

\[ \begin{align} -\int \tilde{p}_d(\tilde{\mathbf{x}}) \log \tilde{p}_{\theta}(\tilde{\mathbf{x}}) \,\mathrm{d}\tilde{\mathbf{x}} &= -\sum_{\mathbf{x}} P_d(\mathbf{x}) \int \log \tilde{p}_{\theta}(\mathbf{x} + \mathbf{u}) \,\mathrm{d}\mathbf{u} \\ &\geq -\sum_{\mathbf{x}} P_d(\mathbf{x}) \log \int \tilde{p}_{\theta}(\mathbf{x} + \mathbf{u}) \,\mathrm{d}\mathbf{u} \\ &= -\sum_{\mathbf{x}} P_d(\mathbf{x}) \log P_{\theta}(\mathbf{x}) \end{align} \]

Here the integrals over \(\mathbf{u}\) run over the unit hypercube, the inequality follows from Jensen's inequality (the logarithm is concave and \(\mathbf{u}\) has unit density on the hypercube), and the probability that the model assigns to the discrete data is defined as:

\[ P_{\theta}(\mathbf{x}) = \int \tilde{p}_{\theta}(\mathbf{x} + \mathbf{u}) \,\mathrm{d}\mathbf{u} \]

That is to say, the NLL of the augmented continuous random variable \(\tilde{\mathbf{x}}\) is an upper bound on the NLL of the true discrete data.
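
The bound can be illustrated in one dimension. The sketch below assumes a small discrete distribution `P_d` over the values 0 to 3 and a Gaussian as the continuous model \(\tilde{p}_{\theta}\); both are hypothetical choices made only for this illustration. The continuous NLL is approximated with a midpoint rule over each unit bin, and \(P_{\theta}(x)\) is obtained exactly from the Gaussian CDF.

```python
import math

MEAN, STD = 1.5, 1.0  # hypothetical parameters of the continuous model


def log_pdf(x):
    """Log-density of the continuous model p_theta-tilde (a Gaussian here)."""
    z = (x - MEAN) / STD
    return -0.5 * z * z - math.log(STD) - 0.5 * math.log(2.0 * math.pi)


def cdf(x):
    """Gaussian CDF, used to integrate the density over a unit bin exactly."""
    return 0.5 * (1.0 + math.erf((x - MEAN) / (STD * math.sqrt(2.0))))


# Hypothetical discrete data distribution P_d over integer values.
P_d = {0: 0.1, 1: 0.4, 2: 0.4, 3: 0.1}

N_STEPS = 1000  # midpoint-rule resolution for the integral over u in [0, 1)


def mean_log_pdf_over_bin(x):
    """Approximate the integral of log p_theta-tilde(x + u) over u in [0, 1)."""
    return sum(log_pdf(x + (i + 0.5) / N_STEPS) for i in range(N_STEPS)) / N_STEPS


# NLL of the continuous model on the dequantized data (left-hand side).
continuous_nll = -sum(p * mean_log_pdf_over_bin(x) for x, p in P_d.items())

# NLL of the induced discrete model P_theta(x) = integral of the density over the bin.
discrete_nll = -sum(p * math.log(cdf(x + 1) - cdf(x)) for x, p in P_d.items())

print(continuous_nll, discrete_nll)
```

The first printed value should be no smaller than the second. The gap is the Jensen gap, which shrinks when the model density is nearly flat within each bin.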

Image Quality


(Davis and Goadrich 2006)

  1. For a fixed dataset, there is a one-to-one correspondence between the points on the ROC curve and the points on the precision-recall (PR) curve (for nonzero recall), since both are determined by the same confusion matrices.

  2. A curve dominates in ROC space (FPR-TPR curve) \(\Leftrightarrow\) it dominates in PR space (recall-precision curve).

  3. Interpolation between two points \(A\) and \(B\):

    1. On ROC: linear interpolation.

    2. On PR: for each integer \(x\) with \(1 \leq x \leq TP_B - TP_A\), the interpolated (recall, precision) point is \[ \left( \frac{TP_A + x}{\text{Total Pos}}, \frac{TP_A + x}{TP_A + x + FP_A + \frac{FP_B - FP_A}{TP_B - TP_A} x} \right) \]

  4. Compute the area: include the interpolated points and use the composite trapezoidal method.

    1. Incorrect (e.g., linear) interpolation in PR space causes AUC-PR to be overestimated.
  5. Optimizing the area under the ROC curve and optimizing the area under the PR curve are not exactly the same (especially when no single algorithm dominates the curve?).
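
The PR-space interpolation and the trapezoidal area from items 3 and 4 can be sketched as follows. The function names and the confusion-matrix counts are hypothetical, chosen only for illustration, and the sketch assumes \(TP_B > TP_A\) so that the slope is well defined.

```python
def interpolate_pr(tp_a, fp_a, tp_b, fp_b, total_pos):
    """Return (recall, precision) points from A to B, one per extra true positive.

    Assumes tp_b > tp_a. The point for x = 0 is A itself and the last point is B.
    """
    slope = (fp_b - fp_a) / (tp_b - tp_a)  # false positives added per true positive
    points = []
    for x in range(tp_b - tp_a + 1):
        tp = tp_a + x
        fp = fp_a + slope * x
        points.append((tp / total_pos, tp / (tp + fp)))
    return points


def trapezoid_area(points):
    """Composite trapezoidal rule over (recall, precision) points sorted by recall."""
    return sum(
        (r1 - r0) * (p0 + p1) / 2.0
        for (r0, p0), (r1, p1) in zip(points, points[1:])
    )


# Hypothetical counts: 20 positives in total; A has TP=5, FP=5; B has TP=10, FP=30.
points = interpolate_pr(tp_a=5, fp_a=5, tp_b=10, fp_b=30, total_pos=20)

area_interpolated = trapezoid_area(points)            # with PR-space interpolation
area_linear = trapezoid_area([points[0], points[-1]])  # straight line from A to B

print(points)
print(area_interpolated, area_linear)
```

With these counts the straight-line area comes out larger than the interpolated one, which matches the overestimate noted in item 4.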


Davis, Jesse, and Mark Goadrich. 2006. “The Relationship Between Precision-Recall and ROC Curves.” In Proceedings of the 23rd International Conference on Machine Learning - ICML ’06, 233–40. Pittsburgh, Pennsylvania: ACM Press. https://doi.org/10.1145/1143844.1143874.

Theis, Lucas, Aäron van den Oord, and Matthias Bethge. 2015. “A Note on the Evaluation of Generative Models.” arXiv Preprint arXiv:1511.01844.