Energy-based models (EBMs) consist of four ingredients:

- The energy function: assigns a scalar to each configuration of the variables.
- Inference: given a set of observed variables, find the values of the remaining variables that minimize the energy.
- Learning: find an energy function that assigns low energy values to correct configurations of the variables, and high energy values to incorrect ones.
- The loss function: minimized during training; it measures the quality of the energy function.

## Loss functions

The general form of the training objective of an energy-based model can be formulated as (LeCun et al. 2006): \[ \mathcal{L}(E,\mathcal{S})=\frac{1}{P}\sum_{i=1}^P L(Y^i, E(W,\mathcal{Y},X^i)) + R(W) \]

where \(L\) is the per-sample loss function that should "pull down" the energy at "correct" configurations (i.e., the training data \((X^i, Y^i)\)) and "pull up" the energy at incorrect configurations. \(R\) is a regularizer on the model parameters \(W\), which may effectively restrict the VC dimension of the model.
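
As a concrete toy illustration, assume a finite label set \(\mathcal{Y}\) and a linear energy \(E(W,y,x) = -w_y \cdot x\); the names `energy` and `training_objective`, and the L2 choice of \(R(W)\), are assumptions made for this sketch, not something prescribed by the tutorial:

```python
import numpy as np

def energy(W, y, x):
    # Illustrative linear energy E(W, y, x) = -w_y . x over a finite label set;
    # any parametric scalar-valued function of (y, x) would do.
    return -W[y] @ x

def training_objective(W, X, Y, per_sample_loss, lam=1e-3):
    # L(E, S) = (1/P) * sum_i L(Y^i, E(W, ., X^i)) + R(W),
    # with R(W) taken here to be an L2 penalty lam * ||W||^2.
    P = len(Y)
    data_term = sum(per_sample_loss(W, Y[i], X[i]) for i in range(P)) / P
    return data_term + lam * np.sum(W ** 2)
```

Here `per_sample_loss(W, y, x)` can be any of the per-sample losses listed below, e.g. simply `energy` itself for the energy loss.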

Here are some popular choices of loss functions for energy-based models:

### Energy loss

\[ L=E(W,Y^i,X^i) \]

This only works when the architecture of the energy function guarantees that pulling down the energy at correct configurations will automatically pull up the energy at incorrect configurations.
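
A common case where this holds is a regression-style energy \(E(W,Y,X) = \tfrac{1}{2}\lVert G_W(X)-Y\rVert^2\): the quadratic bowl in \(Y\) means that lowering the energy at \(Y^i\) keeps every other \(Y\) at higher energy, and the energy loss reduces to squared error. A minimal sketch with a hypothetical linear predictor:

```python
import numpy as np

def regression_energy(W, y, x):
    # E(W, Y, X) = 0.5 * ||G_W(X) - Y||^2 with a linear predictor G_W(X) = W @ x.
    # The quadratic shape in Y is what makes the plain energy loss sufficient.
    return 0.5 * np.sum((W @ x - y) ** 2)

def energy_loss(W, y, x):
    # L = E(W, Y^i, X^i); for this architecture it is just the squared error.
    return regression_energy(W, y, x)
```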

### Generalized Perceptron Loss

\[ L = E(W,Y^i,X^i) - \min_{Y \in \mathcal{Y}} E(W,Y,X^i) \]

There is no mechanism for creating an energy gap between the correct configurations and the incorrect ones, which may produce (almost) flat energy surfaces if the architecture allows it.
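
For a finite label set the minimum is a simple reduction over the per-label energies; a minimal sketch (the array `energies` holding \(E(W,k,X^i)\) for every label \(k\) is an assumption of this illustration):

```python
import numpy as np

def perceptron_loss(energies, y_true):
    # L = E(W, Y^i, X^i) - min_Y E(W, Y, X^i); zero as soon as the correct
    # label already has the lowest energy, so no explicit margin is enforced.
    return energies[y_true] - np.min(energies)
```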

### Generalized Margin Losses

Here are some definitions common to the losses in this section:

- \(\bar{Y}^i = \arg\min_{Y\in \mathcal{Y},\left\| Y-Y^i \right\| > \epsilon} E(W,Y,X^i)\), the "most offending incorrect answer": the lowest-energy configuration that is more than \(\epsilon\) away from the correct one (see the sketch after this definition for the discrete case).
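
In the discrete case the constraint \(\left\| Y-Y^i \right\| > \epsilon\) just means "any label other than \(Y^i\)", so \(\bar{Y}^i\) can be found by masking out the correct answer; a minimal sketch, again assuming a precomputed `energies` array:

```python
import numpy as np

def most_offending_incorrect(energies, y_true):
    # Ybar^i = argmin over Y != Y^i of E(W, Y, X^i): the lowest-energy
    # incorrect answer, used by all the margin losses below.
    masked = energies.astype(float).copy()
    masked[y_true] = np.inf   # exclude the correct answer
    return int(np.argmin(masked))
```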

#### Hinge loss

\[ L=\max\left(0, m+E(W,Y^i,X^i)-E(W,\bar{Y}^i,X^i)\right) \]
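
A minimal sketch, assuming the scalar energies of the correct answer and the most offending incorrect answer (`e_correct`, `e_offending`) have already been computed:

```python
import numpy as np

def hinge_loss(e_correct, e_offending, m=1.0):
    # max(0, m + E(W, Y^i, X^i) - E(W, Ybar^i, X^i)); m is the positive margin.
    return np.maximum(0.0, m + e_correct - e_offending)
```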

#### Log Loss

\[ L = \log\left( 1 + \exp\left(E(W,Y^i,X^i) - E(W,\bar{Y}^i,X^i)\right) \right) \]
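
A minimal sketch under the same assumptions; `np.logaddexp` computes \(\log(1+e^{x})\) without overflow:

```python
import numpy as np

def log_loss(e_correct, e_offending):
    # log(1 + exp(E(W, Y^i, X^i) - E(W, Ybar^i, X^i))): a soft, margin-free hinge.
    return np.logaddexp(0.0, e_correct - e_offending)
```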

#### LVQ2 Loss

\[ L = \min\left( M, \max\left( 0, E(W,Y^i,X^i) - E(W,\bar{Y}^i, X^i) \right) \right) \]
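
A minimal sketch under the same assumptions; the clipping at \(M\) keeps any single example from dominating the update:

```python
import numpy as np

def lvq2_loss(e_correct, e_offending, M=1.0):
    # min(M, max(0, E(W, Y^i, X^i) - E(W, Ybar^i, X^i))): a margin-free hinge
    # saturated at M.
    return np.minimum(M, np.maximum(0.0, e_correct - e_offending))
```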

#### MCE Loss

\[ L = \sigma\left( E(W,Y^i,X^i) - E(W,\bar{Y}^i,X^i) \right) \]
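
Here \(\sigma\) is the logistic sigmoid. A minimal sketch under the same assumptions:

```python
from scipy.special import expit

def mce_loss(e_correct, e_offending):
    # sigma(E(W, Y^i, X^i) - E(W, Ybar^i, X^i)); expit is the logistic sigmoid.
    return expit(e_correct - e_offending)
```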

#### Square-Square Loss

\[ L = E(W,Y^i,X^i)^2 + \left( \max\left(0, m-E(W,\bar{Y}^i,X^i)\right) \right)^2 \]
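
A minimal sketch under the same assumptions (this loss presumes the energies are non-negative):

```python
import numpy as np

def square_square_loss(e_correct, e_offending, m=1.0):
    # E(W, Y^i, X^i)^2 + max(0, m - E(W, Ybar^i, X^i))^2
    return e_correct ** 2 + np.maximum(0.0, m - e_offending) ** 2
```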

#### Square-Exponential Loss

\[ L = E(W,Y^i,X^i)^2 + \gamma \exp\left( -E(W,\bar{Y}^i,X^i) \right) \]
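
A minimal sketch under the same assumptions (again for non-negative energies); unlike the margin-based variant, the pull-up term never fully saturates:

```python
import numpy as np

def square_exp_loss(e_correct, e_offending, gamma=1.0):
    # E(W, Y^i, X^i)^2 + gamma * exp(-E(W, Ybar^i, X^i))
    return e_correct ** 2 + gamma * np.exp(-e_offending)
```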

### Negative Log-Likelihood Loss (NLL Loss)

\[ \begin{align} L &= E(W,Y^i,X^i) + \mathcal{F}_{\beta}(W,\mathcal{Y},X^i) \\ \mathcal{F}_{\beta}(W,\mathcal{Y},X^i) &= \frac{1}{\beta} \log\left( \int_{y\in \mathcal{Y}}\exp\left( -\beta E(W,y,X^i)\right) \,dy \right) \end{align} \]

where \[ P(Y|X^i,W) = \frac{\exp\left( -\beta E(W,Y,X^i) \right)}{\int_{y\in \mathcal{Y}}\exp\left( -\beta E(W,y,X^i)\right) \, dy} \] Some authors have argued that the NLL loss puts too much emphasis on mistakes, which inspires the MEE loss.
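
For the NLL loss itself, with a finite label set the integral becomes a sum and the free energy is a log-sum-exp; a minimal sketch (assuming `scipy` and a precomputed `energies` array):

```python
import numpy as np
from scipy.special import logsumexp

def nll_loss(energies, y_true, beta=1.0):
    # L = E(W, Y^i, X^i) + (1/beta) * log sum_y exp(-beta * E(W, y, X^i)),
    # i.e. -log P(Y^i | X^i, W) under the Gibbs distribution above.
    free_energy = logsumexp(-beta * energies) / beta
    return energies[y_true] + free_energy
```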

#### Minimum Empirical Error Loss (MEE Loss)

\[ L = 1 - P(Y^i|X^i,W) = 1 - \frac{\exp\left( -\beta E(W,Y^i,X^i) \right)}{\int_{y\in \mathcal{Y}}\exp\left( -\beta E(W,y,X^i)\right) \, dy} \]
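
A minimal sketch under the same finite-label assumptions as the NLL sketch above:

```python
import numpy as np
from scipy.special import logsumexp

def mee_loss(energies, y_true, beta=1.0):
    # L = 1 - P(Y^i | X^i, W) under the same Gibbs distribution as the NLL loss.
    log_p = -beta * energies[y_true] - logsumexp(-beta * energies)
    return 1.0 - np.exp(log_p)
```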

## Learning with Approximate Inference

Many of the losses or algorithms used by energy-based models do not guarantee that the energy \(E(W,Y,X^i)\) is properly pulled up at every \(Y \in \mathcal{Y}, Y \neq Y^i\), so that \(E(W,Y^i,X^i)\) becomes a global minimum.

LeCun et al. (2006) argue that, if learning is driven by approximate inference, so that the energy is pulled up at all the *contrastive samples* found by the *approximate inference* algorithm, then there is no need to worry about configurations far away that the inference algorithm cannot reach (I personally disagree with this: it might cause some "out-of-distribution" problems).

An example of such an approximate-inference-driven learning algorithm is Contrastive Divergence:

### Contrastive Divergence

\[ W \leftarrow W - \eta \left( \frac{\partial E(W,Y^i,X^i)}{\partial W} - \frac{\partial E(W,\bar{Y}^i,X^i)}{\partial W} \right) \]
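
A minimal sketch of one such update; the hooks `energy_grad(W, y, x)` (gradient of \(E\) with respect to \(W\)) and `sample_near(W, y, x)` (a short sampling chain started at \(Y^i\) that returns the contrastive sample \(\bar{Y}^i\)) are hypothetical placeholders for whatever approximate inference procedure is used:

```python
def cd_update(W, x, y_true, energy_grad, sample_near, eta=0.01):
    # One contrastive-divergence step: pull the energy down at the training
    # point and up at a contrastive sample found by approximate inference.
    # W and the returned gradients are assumed to be numpy arrays.
    y_bar = sample_near(W, y_true, x)
    return W - eta * (energy_grad(W, y_true, x) - energy_grad(W, y_bar, x))
```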

## References

LeCun, Yann, Sumit Chopra, Raia Hadsell, Marc'Aurelio Ranzato, and Fu Jie Huang. 2006. "A Tutorial on Energy-Based Learning." In *Predicting Structured Data*. MIT Press.