Overview

Energy-based models (EBMs) consist of four ingredients:

  1. The energy function: assigns a scalar energy to each configuration of the variables.
  2. Inference: given a set of observed variables, find the values of the remaining variables that minimize the energy.
  3. Learning: find an energy function that assigns low energies to correct configurations of the variables and high energies to incorrect ones.
  4. The loss function: minimized during training, it measures the quality of the energy function.

Loss functions

The general form of the training objective of an energy-based model can be formulated as (LeCun et al. 2006): \[ \mathcal{L}(E,\mathcal{S})=\frac{1}{P}\sum_{i=1}^P L(Y^i, E(W,\mathcal{Y},X^i)) + R(W) \]

where \(L\) is the per-sample loss, which should "pull down" the energy at "correct" configurations (i.e., the training data \((X^i, Y^i)\)) and "pull up" the energy at incorrect configurations. \(R\) is a regularizer on the model parameters \(W\), which may effectively restrict the VC dimension of the model.
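
As a minimal sketch of this objective (the function and variable names, and the choice of an L2 penalty for \(R(W)\), are mine, not from the tutorial):

    import torch

    def objective(loss_fn, W, data, weight_decay=1e-4):
        # data: list of (X_i, Y_i) pairs; loss_fn(W, Y_i, X_i) returns the per-sample loss L.
        data_term = sum(loss_fn(W, Y_i, X_i) for X_i, Y_i in data) / len(data)
        # One simple choice for R(W): a weighted L2 penalty over the parameter list W.
        reg_term = weight_decay * sum((w ** 2).sum() for w in W)
        return data_term + reg_term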

Here are some popular choices of loss functions for energy-based models:

Energy loss

\[ L=E(W,Y^i,X^i) \]

which requires the architecture (rather than the loss itself) to have the property that pulling down the energy at correct configurations automatically pulls up the energy at incorrect configurations.
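
Concretely, with `e_pos` standing for \(E(W,Y^i,X^i)\) as a tensor (my naming), the energy loss is a one-liner:

    def energy_loss(e_pos):
        # The loss is simply the energy of the correct configuration E(W, Y^i, X^i).
        return e_pos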

Generalized Perceptron Loss

\[ L = E(W,Y^i,X^i) - \min_{Y \in \mathcal{Y}} E(W,Y,X^i) \]

There is no mechanism for creating an energy gap between the correct configurations and the incorrect ones, which may potentially produce (almost) flat energy surfaces if the architecture allows it.
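
A sketch for a finite answer space, where `energy(W, Y, X)` returns a scalar tensor and `Y_candidates` enumerates \(\mathcal{Y}\) (both names are assumptions of mine):

    import torch

    def perceptron_loss(energy, W, X_i, Y_i, Y_candidates):
        # E at the correct answer minus the minimum energy over the whole answer space.
        e_pos = energy(W, Y_i, X_i)
        e_all = torch.stack([energy(W, Y, X_i) for Y in Y_candidates])
        return e_pos - e_all.min()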

Generalized Margin Losses

Here are some definitions common to this section:

  • \(\bar{Y}^i = \arg\min_{Y\in \mathcal{Y},\left\| Y-Y^i \right\| > \epsilon} E(W,Y,X^i)\), the "most offending incorrect answer": the lowest-energy configuration that is at least \(\epsilon\) away from the correct answer \(Y^i\) (a brute-force sketch follows below).
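
For a finite answer space, \(\bar{Y}^i\) can be found by brute force; the helper below is an illustrative sketch with names of my own choosing:

    import torch

    def most_offending_incorrect(energy, W, X_i, Y_i, Y_candidates, eps=1e-6):
        # Lowest-energy candidate that is farther than eps from the correct answer Y_i.
        wrong = [Y for Y in Y_candidates if torch.norm(Y - Y_i) > eps]
        energies = torch.stack([energy(W, Y, X_i) for Y in wrong])
        return wrong[int(energies.argmin())]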

Hinge loss

\[ L=\max\left(0, m+E(W,Y^i,X^i)-E(W,\bar{Y}^i,X^i)\right) \]
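
In the sketches for the margin losses below, `e_pos` and `e_neg` denote \(E(W,Y^i,X^i)\) and \(E(W,\bar{Y}^i,X^i)\) as PyTorch tensors (my naming). The hinge loss then reads:

    import torch

    def hinge_loss(e_pos, e_neg, m=1.0):
        # Penalize whenever the correct energy is not at least m below the incorrect one.
        return torch.clamp(m + e_pos - e_neg, min=0.0)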

Log Loss

\[ L = \log\left( 1 + \exp\left(E(W,Y^i,X^i) - E(W,\bar{Y}^i,X^i)\right) \right) \]
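
A numerically stable way to evaluate this is via softplus, since \(\operatorname{softplus}(x) = \log(1 + e^{x})\):

    import torch.nn.functional as F

    def log_loss(e_pos, e_neg):
        # softplus of the energy difference, a smooth version of the hinge loss.
        return F.softplus(e_pos - e_neg)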

LVQ2 Loss

\[ L = \min\left( M, \max\left( 0, \frac{E(W,Y^i,X^i) - E(W,\bar{Y}^i,X^i)}{\delta\, E(W,\bar{Y}^i,X^i)} \right) \right) \]

where \(\delta\) is a positive scale parameter and \(M\) caps the loss.
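
With `e_pos`/`e_neg` as above and the parameters \(\delta\) and \(M\) (the defaults are arbitrary):

    import torch

    def lvq2_loss(e_pos, e_neg, delta=1.0, M=1.0):
        # Margin violation measured relative to the incorrect energy, capped at M.
        return torch.clamp((e_pos - e_neg) / (delta * e_neg), min=0.0, max=M)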

MCE Loss

\[ L = \sigma\left( E(W,Y^i,X^i) - E(W,\bar{Y}^i,X^i) \right) \]

where \(\sigma\) is the logistic sigmoid.
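
With `e_pos`/`e_neg` as above:

    import torch

    def mce_loss(e_pos, e_neg):
        # A smooth 0-1 "classification error": sigmoid of the energy difference.
        return torch.sigmoid(e_pos - e_neg)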

Square-Square Loss

\[ L = E(W,Y^i,X^i)^2 + \left( \max\left(0, m-E(W,\bar{Y}^i,X^i)\right) \right)^2 \]
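
With `e_pos`/`e_neg` as above and margin `m` (default arbitrary):

    import torch

    def square_square_loss(e_pos, e_neg, m=1.0):
        # Quadratic pull-down on the correct energy, quadratic push-up below margin m.
        return e_pos ** 2 + torch.clamp(m - e_neg, min=0.0) ** 2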

Square-Exponential Loss

\[ L = E(W,Y^i,X^i)^2 + \gamma \exp\left( -E(W,\bar{Y}^i,X^i) \right) \]
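
With `e_pos`/`e_neg` as above and coefficient `gamma` (default arbitrary):

    import torch

    def square_exponential_loss(e_pos, e_neg, gamma=1.0):
        # The exponential term keeps pushing the incorrect energy up at every value.
        return e_pos ** 2 + gamma * torch.exp(-e_neg)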

Negative Log-Likelihood Loss (NLL Loss)

\[ \begin{align} L &= E(W,Y^i,X^i) + \mathcal{F}_{\beta}(W,\mathcal{Y},X^i) \\ \mathcal{F}_{\beta}(W,\mathcal{Y},X^i) &= \frac{1}{\beta} \log\left( \int_{y\in \mathcal{Y}}\exp\left( -\beta E(W,y,X^i)\right) \,dy \right) \end{align} \]

This corresponds to the Gibbs distribution \[ P(Y|X^i,W) = \frac{\exp\left( -\beta E(W,Y,X^i) \right)}{\int_{y\in \mathcal{Y}}\exp\left( -\beta E(W,y,X^i)\right) \, dy} \] so minimizing the NLL loss maximizes the conditional likelihood of the training data. Some authors have argued that the NLL loss puts too much emphasis on mistakes, which motivates the MEE loss.
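
For a finite answer space the integral becomes a sum and \(\mathcal{F}_{\beta}\) is a scaled logsumexp; `e_all` below is a tensor holding \(E(W,y,X^i)\) for every candidate \(y\) (my notation):

    import torch

    def nll_loss(e_pos, e_all, beta=1.0):
        # Free energy over a finite answer space: (1/beta) * logsumexp(-beta * E).
        free_energy = torch.logsumexp(-beta * e_all, dim=0) / beta
        return e_pos + free_energy

As \(\beta \to \infty\) the free energy approaches \(-\min_Y E(W,Y,X^i)\), and this reduces to the generalized perceptron loss.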

Minimum Empirical Error Loss (MEE Loss)

\[ L = 1 - P(Y^i|X^i,W) = 1 - \frac{\exp\left( -\beta E(W,Y^i,X^i) \right)}{\int_{y\in \mathcal{Y}}\exp\left( -\beta E(W,y,X^i)\right) \, dy} \]
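
Using the same `e_pos`/`e_all` notation, \(P(Y^i|X^i,W)\) is a softmax over negative energies:

    import torch

    def mee_loss(e_pos, e_all, beta=1.0):
        # 1 - P(Y^i | X^i, W), with the normalizer computed via logsumexp for stability.
        log_p_correct = -beta * e_pos - torch.logsumexp(-beta * e_all, dim=0)
        return 1.0 - torch.exp(log_p_correct)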

Learning with Approximate Inference

Many of the losses or algorithms used by energy-based models do not guarantee that the energy \(E(W,Y,X^i)\) is pulled up properly at all \(Y \in \mathcal{Y}, Y \neq Y^i\), i.e., that \(E(W,Y^i,X^i)\) becomes a global minimum.

LeCun et al. (2006) argue that, if learning is driven by approximate inference, so that all the contrastive samples found by the approximate inference algorithm are pulled up, then there is no need to worry about the far-away samples that cannot be found by the inference algorithm (I personally disagree with this, since it might cause "out-of-distribution" problems).

An example of such an approximate-inference-driven learning algorithm is Contrastive Divergence:

Contrastive Divergence

\[ W \leftarrow W - \eta \left( \frac{\partial E(W,Y^i,X^i)}{\partial W} - \frac{\partial E(W,\bar{Y}^i,X^i)}{\partial W} \right) \]
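
A sketch of one update step using autograd; `energy(W, Y, X)` is assumed differentiable in the parameter list `W`, and `y_bar` stands for the contrastive sample \(\bar{Y}^i\) produced by approximate inference (e.g., a short MCMC chain started at \(Y^i\)). All names are mine:

    import torch

    def cd_step(energy, W, X_i, Y_i, y_bar, lr=1e-2):
        # Gradient of the energy at the training answer minus at the contrastive sample.
        loss = energy(W, Y_i, X_i) - energy(W, y_bar, X_i)
        grads = torch.autograd.grad(loss, W)
        with torch.no_grad():
            for w, g in zip(W, grads):
                w -= lr * g  # W <- W - eta * (dE_pos/dW - dE_neg/dW)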

References

LeCun, Yann, Sumit Chopra, Raia Hadsell, M. Ranzato, and F. Huang. 2006. “A Tutorial on Energy-Based Learning.” In Predicting Structured Data. MIT Press.