# Overview

Energy-based models (EBMs) consist of four ingredients:

1. The energy function: to assign a scalar to each configuration of the variables.
2. Inference: given a set of observed variables, to find values of the remaining variables that minimize the energy (see the sketch after this list).
3. Learning: to find an energy function that assigns low energy values to correct configurations of the variables, and high energy values to incorrect configurations of the variables.
4. Loss function: minimized during training, which measures the quality of the energy function.
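
As a toy illustration of the first two ingredients, the sketch below assumes a finite label set and a simple linear energy $$E(W,Y,X) = -W_Y \cdot X$$ (my own choice, not from the paper); inference is then just exhaustive minimization over $$Y$$:

```python
import numpy as np

def energy(W, y, x):
    """Toy energy: negative linear score of label y for input x (illustrative assumption)."""
    return -W[y] @ x

def infer(W, x, labels):
    """Inference: given the observed X, pick the Y that minimizes the energy."""
    return min(labels, key=lambda y: energy(W, y, x))

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))        # 3 labels, 5 input features
x = rng.normal(size=5)
print(infer(W, x, range(3)))       # label with the lowest energy
```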

## Loss functions

The general form of the training objective of an energy-based model can be formulated as (LeCun et al. 2006): $\mathcal{L}(E,\mathcal{S})=\frac{1}{P}\sum_{i=1}^P L(Y^i, E(W,\mathcal{Y},X^i)) + R(W)$

where $$L$$ is the loss function, which should "pull down" the energy at "correct" configurations (i.e., the training data $$(X^i, Y^i)$$) and "pull up" the energy at incorrect configurations. $$R$$ is a regularizer on the model parameters $$W$$, which may effectively restrict the VC dimension of the model.
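
Structurally, the objective is just an average of per-sample losses plus a regularizer on $$W$$; a minimal sketch (the L2 penalty standing in for $$R$$ is my own choice):

```python
import numpy as np

def training_objective(W, data, per_sample_loss, lam=1e-3):
    """L(E, S) = (1/P) * sum_i loss(W, Y^i, X^i) + R(W), with R an L2 penalty (assumption)."""
    losses = [per_sample_loss(W, y_i, x_i) for x_i, y_i in data]
    return np.mean(losses) + lam * np.sum(W ** 2)
```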

Here are some popular choices of loss functions for energy-based models:

### Energy loss

$L=E(W,Y^i,X^i)$

which only works when the energy function is designed such that pulling down the energy at correct configurations automatically pulls up the energy at incorrect configurations.
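
One architecture where this property does hold is a regression-style energy $$E(W,Y,X) = \tfrac{1}{2}\lVert Y - G_W(X)\rVert^2$$: as a function of $$Y$$ it is a quadratic bowl of fixed shape, so pulling the minimum toward $$Y^i$$ necessarily raises the energy elsewhere, and the energy loss reduces to ordinary squared error. A sketch with a linear $$G_W$$ (my choice):

```python
import numpy as np

def energy_loss(W, y, x):
    """Energy loss = the energy itself; here E(W, Y, X) = 0.5 * ||Y - W x||^2
    (a regression architecture chosen for illustration)."""
    return 0.5 * np.sum((y - W @ x) ** 2)
```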

### Generalized Perceptron Loss

$L = E(W,Y^i,X^i) - \min_{Y \in \mathcal{Y}} E(W,Y,X^i)$

There is no mechanism for creating an energy gap between the correct configurations and the incorrect ones, so this loss may produce (almost) flat energy surfaces if the architecture allows it.
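
For a finite $$\mathcal{Y}$$ the inner minimization can be done by enumeration; a sketch, reusing the toy linear energy from above:

```python
import numpy as np

def perceptron_loss(W, y_true, x, labels):
    """Generalized perceptron loss: E at the correct label minus the minimum E over all labels.
    It is zero whenever the correct label already achieves the lowest energy."""
    E = lambda y: -W[y] @ x                      # toy linear energy (assumption)
    return E(y_true) - min(E(y) for y in labels)
```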

### Generalized Margin Losses

Here is a common definition used throughout this section:

• $$\bar{Y}^i = \arg\min_{Y\in \mathcal{Y},\left\| Y-Y^i \right\| > \epsilon} E(W,Y,X^i)$$ is the "most offending incorrect answer": the configuration with the lowest energy among all configurations more than $$\epsilon$$ away from the correct answer $$Y^i$$.
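
For a finite label set the $$\epsilon$$-ball constraint simply means $$Y \neq Y^i$$, so $$\bar{Y}^i$$ can be found by enumeration; a sketch with the same toy linear energy:

```python
import numpy as np

def most_offending_incorrect(W, y_true, x, labels):
    """Return Y-bar: the incorrect label with the lowest energy."""
    E = lambda y: -W[y] @ x                      # toy linear energy (assumption)
    return min((y for y in labels if y != y_true), key=E)
```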

#### Hinge loss

$L=\max\left(0, m+E(W,Y^i,X^i)-E(W,\bar{Y}^i,X^i)\right)$

#### Log Loss

$L = \log\left( 1 + \exp\left(E(W,Y^i,X^i) - E(W,\bar{Y}^i,X^i)\right) \right)$

#### LVQ2 Loss

$L = \min\left( M, \max\left( 0, E(W,Y^i,X^i) - E(W,\bar{Y}^i, X^i) \right) \right)$

where $$M$$ is a positive constant at which the loss saturates.

#### MCE Loss

$L = \sigma\left( E(W,Y^i,X^i) - E(W,\bar{Y}^i,X^i) \right)$

where $$\sigma$$ is the logistic (sigmoid) function $$\sigma(z) = \left(1 + e^{-z}\right)^{-1}$$.

#### Square-Square Loss

$L = E(W,Y^i,X^i)^2 + \left( \max\left(0, m-E(W,\bar{Y}^i,X^i)\right) \right)^2$

#### Square-Exponential Loss

$L = E(W,Y^i,X^i)^2 + \gamma \exp\left( -E(W,\bar{Y}^i,X^i) \right)$
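
All of the margin losses above consume the same two scalars, $$E(W,Y^i,X^i)$$ and $$E(W,\bar{Y}^i,X^i)$$. A sketch implementing them directly from the formulas (function names and default constants are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hinge_loss(e_correct, e_incorrect, m=1.0):
    return max(0.0, m + e_correct - e_incorrect)

def log_loss(e_correct, e_incorrect):
    return np.log1p(np.exp(e_correct - e_incorrect))   # log(1 + exp(.))

def lvq2_loss(e_correct, e_incorrect, M=1.0):
    return min(M, max(0.0, e_correct - e_incorrect))   # hinge-type gap, saturated at M

def mce_loss(e_correct, e_incorrect):
    return sigmoid(e_correct - e_incorrect)

def square_square_loss(e_correct, e_incorrect, m=1.0):
    return e_correct ** 2 + max(0.0, m - e_incorrect) ** 2

def square_exp_loss(e_correct, e_incorrect, gamma=1.0):
    return e_correct ** 2 + gamma * np.exp(-e_incorrect)
```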

### Negative Log-Likelihood Loss (NLL Loss)

\begin{align} L &= E(W,Y^i,X^i) + \mathcal{F}_{\beta}(W,\mathcal{Y},X^i) \\ \mathcal{F}_{\beta}(W,\mathcal{Y},X^i) &= \frac{1}{\beta} \log\left( \int_{y\in \mathcal{Y}}\exp\left( -\beta E(W,y,X^i)\right) \,dy \right) \end{align}

where $P(Y|X^i,W) = \frac{\exp\left( -\beta E(W,Y,X^i) \right)}{\int_{y\in \mathcal{Y}}\exp\left( -\beta E(W,y,X^i)\right) \, dy}$ is the Gibbs distribution induced by the energy.

Some authors have argued that the NLL loss puts too much emphasis on mistakes, which motivates the MEE loss.

#### Minimum Empirical Error Loss (MEE Loss)

$L = 1 - P(Y^i|X^i,W) = 1 - \frac{\exp\left( -\beta E(W,Y^i,X^i) \right)}{\int_{y\in \mathcal{Y}}\exp\left( -\beta E(W,y,X^i)\right) \, dy}$
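
For a finite $$\mathcal{Y}$$ the integral becomes a sum and the free energy is a scaled log-sum-exp, so both the NLL and MEE losses can be computed from the vector of energies; a sketch under that assumption:

```python
import numpy as np

def nll_and_mee_losses(energies, i_correct, beta=1.0):
    """energies: 1-D array of E(W, y, X^i) for every y in a finite Y.
    Returns (NLL loss, MEE loss) for the correct index i_correct."""
    energies = np.asarray(energies, dtype=float)
    logits = -beta * energies
    log_Z = logits.max() + np.log(np.exp(logits - logits.max()).sum())  # stable log-sum-exp
    free_energy = log_Z / beta                      # F_beta = (1/beta) log Z
    nll = energies[i_correct] + free_energy         # = -(1/beta) log P(Y^i | X^i, W)
    p_correct = np.exp(logits[i_correct] - log_Z)   # Gibbs distribution P(Y^i | X^i, W)
    mee = 1.0 - p_correct
    return nll, mee
```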

## Learning with Approximate Inference

Many of the losses or algorithms used by energy-based models do not guarantee that the energy $$E(W,Y,X^i)$$ is pulled up properly at all $$Y \in \mathcal{Y}, Y \neq Y^i$$, so that $$E(W,Y^i,X^i)$$ ends up being a global minimum.

LeCun et al. (2006) argue that, if learning is driven by approximate inference, so that all the contrastive samples found by the approximate inference algorithm are pulled up, then there is no need to worry about faraway samples that the inference algorithm cannot find (I personally disagree with this, since it might cause some "out-of-distribution" problems).

An example of such an approximate-inference-driven learning algorithm is Contrastive Divergence:

### Contrastive Divergence

$W \leftarrow W - \eta \left( \frac{\partial E(W,Y^i,X^i)}{\partial W} - \frac{\partial E(W,\bar{Y}^i,X^i)}{\partial W} \right)$

where $$\bar{Y}^i$$ is obtained by running a few steps of an MCMC chain (e.g., Gibbs sampling or Langevin dynamics) started at $$Y^i$$, and $$\eta$$ is the learning rate.
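
A sketch of this update for a toy quadratic energy $$E(W,Y,X) = \tfrac{1}{2}\lVert Y - WX\rVert^2$$ (my choice, so the gradients are available in closed form), with $$\bar{Y}^i$$ drawn by a few noisy gradient (Langevin-style) steps in $$Y$$ started at $$Y^i$$:

```python
import numpy as np

def cd_update(W, x, y, eta=0.1, k=5, step=0.1, noise=0.1, rng=None):
    """One contrastive-divergence-style update for E(W, Y, X) = 0.5 * ||Y - W x||^2,
    where dE/dW = -(Y - W x) x^T and dE/dY = (Y - W x)."""
    rng = np.random.default_rng() if rng is None else rng
    y_bar = y.copy()
    for _ in range(k):                              # short sampling chain started at the data point
        y_bar = y_bar - step * (y_bar - W @ x) + noise * rng.normal(size=y_bar.shape)
    grad_pos = -np.outer(y - W @ x, x)              # dE/dW at the training sample (pull down)
    grad_neg = -np.outer(y_bar - W @ x, x)          # dE/dW at the sampled y_bar (pull up)
    return W - eta * (grad_pos - grad_neg)
```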

# References

LeCun, Yann, Sumit Chopra, Raia Hadsell, Marc'Aurelio Ranzato, and Fu Jie Huang. 2006. "A Tutorial on Energy-Based Learning." In Predicting Structured Data. MIT Press.