Popular Image-Generation Metrics: FID, FD, IS
FID vs FD vs IS
This note compares the following metrics:
- Fréchet inception distance, FID
- Fréchet distance, FD
- Inception Score, IS
Fréchet Inception Distance - FID
- see the Wikipedia article on the Fréchet inception distance
The Fréchet inception distance (FID) is a metric used to assess the quality of images created by a generative model, like a generative adversarial network (GAN) or a diffusion model.
The FID compares the distribution of generated images with the distribution of a set of real images (a “ground truth” set). Rather than comparing individual images, mean and covariance statistics of many images generated by the model are compared with the same statistics generated from images in the ground truth or reference set. A convolutional neural network such as an inception
architecture is used to produce higher-level features describing the images, thus leading to the name Fréchet inception distance.
The FID is inspired by the earlier inception score (IS) metric, which evaluates only the distribution of generated images. The FID metric does not replace the IS metric; generative models that achieve the best (lowest) FID score tend to have greater sample variety, while models achieving the best (highest) IS score tend to have better quality within individual images.
The FID metric was introduced in 2017, and is the current standard metric for assessing the quality of models that generate synthetic images as of 2024. It has been used to measure the quality of many recent models including the high-resolution StyleGAN1 and StyleGAN2 networks, and diffusion models.
The FID attempts to compare images visually through deep layers of an inception network. More recent works take this further by instead comparing CLIP embeddings
of the images.
How it Works
- Feature Extraction:
  - Both real and generated images are passed through the Inception network, a pre-trained deep learning model (e.g., Inception-v3).
  - Features are extracted from a specific layer (usually the `pool3` layer).
- Fréchet Distance Computation:
  - Treat the extracted features as samples from two multivariate Gaussian distributions (one for real images and one for generated images).
  - Calculate the Fréchet distance (also called the Wasserstein-2 distance) between these two distributions:

\(\text{FID} = \| \mu_r - \mu_g \|^2 + \text{Tr}\left(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right)\)

where:
- $\mu_r$, $\mu_g$: means of the real and generated feature distributions.
- $\Sigma_r$, $\Sigma_g$: covariance matrices of the real and generated feature distributions.
- $\text{Tr}$: trace of a matrix.
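The two steps above can be sketched in a few lines of NumPy/SciPy, assuming the `pool3` features have already been extracted into arrays (the function and variable names here are illustrative, not from any particular library):

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real, feats_gen):
    """FID between two feature arrays of shape (n_samples, n_features)."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    # Matrix square root of the covariance product; numerical noise can
    # introduce a tiny imaginary component, which we discard.
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2)
                 + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```

As a sanity check, identical feature sets give an FID near zero, and shifting the generated features away from the real ones increases the score through the mean term.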
Interpretation
- Lower FID indicates closer similarity between real and generated image distributions, implying better quality of the generated images.
- It combines both feature-level mean and variance, making it more robust than metrics like Inception Score (IS).
Applications
- Evaluating GANs and other generative models.
- Comparing the fidelity and diversity of generated data against real-world samples.
Fréchet distance - FD
The Fréchet Distance (FD) is a mathematical metric used to measure the similarity between two curves or shapes, taking into account both the location and the order of points along the curves. It is often described as the minimum “leash length” required for a person and a dog to traverse two paths, where both may control their speed but can only move forward along their respective paths.
Formal Definition
For any two probability distributions $\mu$ and $\nu$ over $\mathbb{R}^{n}$ having finite means and variances, their Fréchet distance (equivalently, the 2-Wasserstein or earth-mover’s distance) is
\[d_{F}(\mu ,\nu ):=\left(\inf_{\gamma \in \Gamma (\mu ,\nu )}\int_{\mathbb{R}^{n}\times \mathbb{R}^{n}}\|x-y\|^{2}\,\mathrm{d}\gamma (x,y)\right)^{1/2},\]
where $\Gamma (\mu ,\nu )$ is the set of all measures on $\mathbb{R}^{n}\times \mathbb{R}^{n}$ with marginals $\mu$ and $\nu$ on the first and second factors respectively. The set $\Gamma (\mu ,\nu )$ is also called the set of all couplings of $\mu$ and $\nu$.
For two multidimensional Gaussian distributions $\mathcal{N}(\mu ,\Sigma )$ and $\mathcal{N}(\mu',\Sigma')$, it is expressed in closed form as
\[d_{F}({\mathcal{N}}(\mu ,\Sigma ),{\mathcal{N}}(\mu',\Sigma'))^{2} = \| \mu -\mu' \|_{2}^{2}+\text{Tr}\left(\Sigma +\Sigma'- 2 \left( \Sigma \Sigma' \right)^{\frac{1}{2}}\right)\]
This allows us to define the FID in pseudocode form:
INPUT a function $f:\Omega_{X}\to \mathbb{R}^{n}$.
INPUT two datasets $S, S'\subset \Omega_{X}$.
Compute $f(S), f(S')\subset \mathbb{R}^{n}$.
Fit two Gaussian distributions $\mathcal{N}(\mu, \Sigma)$ and $\mathcal{N}(\mu',\Sigma')$, respectively, to $f(S)$ and $f(S')$.
RETURN $d_{F}({\mathcal{N}}(\mu, \Sigma ), {\mathcal{N}}(\mu', \Sigma'))^{2}$.
In most practical uses of the FID, $\Omega_{X}$ is the space of images, and $f$ is an Inception v3 model trained on ImageNet, but without its final classification layer. Technically, $f(x)$ is the 2048-dimensional activation vector of the model's last pooling layer. Of the two datasets $S$, $S'$, one is a reference dataset, which could be ImageNet itself, and the other is a set of images generated by a generative model, such as a GAN or a diffusion model.
Intuition
- Imagine a person walking along curve $P$ and a dog walking along curve $Q$, connected by a leash.
- Both the person and the dog can control their pace but cannot move backward.
- The Fréchet Distance is the shortest possible leash length required for them to traverse their respective curves.
Key Properties
- Symmetry: $d_F(P, Q) = d_F(Q, P)$.
- Triangle Inequality: It satisfies the triangle inequality, making it a proper metric.
- Order Sensitivity: Unlike some other distance metrics, the Fréchet Distance takes into account the sequential order of points along the curves.
Applications
- Trajectory analysis (e.g., comparing GPS tracks).
- Shape matching in computational geometry.
- Pattern recognition and image analysis.
This metric is particularly useful when the order of points matters, as it captures the temporal and spatial alignment between two paths.
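For polygonal curves, the continuous Fréchet distance is usually approximated by the discrete Fréchet distance, computed with a simple dynamic program over all pairs of vertices. A minimal sketch (the function name `discrete_frechet` is my own choosing):

```python
import math

def discrete_frechet(P, Q):
    """Discrete Frechet distance between point sequences P and Q."""
    n, m = len(P), len(Q)
    # ca[i][j] = minimum over forward-only traversals of the maximum
    # leash length needed to reach the pair (P[i], Q[j]).
    ca = [[0.0] * m for _ in range(n)]
    ca[0][0] = math.dist(P[0], Q[0])
    for j in range(1, m):  # dog advances while the person waits
        ca[0][j] = max(ca[0][j - 1], math.dist(P[0], Q[j]))
    for i in range(1, n):  # person advances while the dog waits
        ca[i][0] = max(ca[i - 1][0], math.dist(P[i], Q[0]))
    for i in range(1, n):
        for j in range(1, m):
            ca[i][j] = max(min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1]),
                           math.dist(P[i], Q[j]))
    return ca[-1][-1]
```

Two parallel horizontal segments at vertical offset 1 give distance 1; reversing one of the curves increases the distance, illustrating the order sensitivity noted above.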
Inception Score - IS
The Inception Score (IS) is an algorithm used to assess the quality of images created by a generative image model such as a generative adversarial network (GAN). The score is calculated based on the output of a separate, pretrained Inception v3
image classification model applied to a sample of (typically around 30,000) images generated by the generative model. The Inception Score is maximized when the following conditions are true:
- The entropy of the distribution of labels predicted by the Inception v3 model for the generated images is minimized. In other words, the classification model confidently predicts a single label for each image. Intuitively, this corresponds to the desideratum that generated images be "sharp" or "distinct".
- The predictions of the classification model are evenly distributed across all possible labels. This corresponds to the desideratum that the output of the generative model be "diverse".
It has been somewhat superseded by the related Fréchet inception distance. While the Inception Score only evaluates the distribution of generated images, the FID compares the distribution of generated images with the distribution of a set of real images (“ground truth”).
Definition
Let there be two spaces, the space of images $\Omega_{X}$ and the space of labels $\Omega_{Y}$. The space of labels is finite.
Let $p_{\text{gen}}$ be a probability distribution over $\Omega_{X}$ that we wish to judge.
Let a discriminator be a function of type \(p_{\text{dis}} : \Omega_{X} \to M(\Omega_{Y})\)
where $M(\Omega_{Y})$ is the set of all probability distributions on $\Omega_{Y}$.
- For any image $x$, and any label $y$, let $p_{\text{dis}}(y \vert x)$ be the probability that image $x$ has label $y$, according to the discriminator.
- It is usually implemented as an
Inception-v3
network trained on ImageNet.
The Inception Score of $p_{\text{gen}}$ relative to $p_{\text{dis}}$ is
\[IS(p_{\text{gen}}, p_{\text{dis}}):=\exp \left(\mathbb{E}_{x\sim p_{\text{gen}}}\left[D_{KL}\left(p_{\text{dis}}(\cdot \vert x)\,\middle\|\,\int p_{\text{dis}}(\cdot \vert x')\, p_{\text{gen}}(x')\,dx' \right) \right] \right)\]
Equivalent rewrites include
\(\ln IS(p_{\text{gen}},p_{\text{dis}}) = \mathbb{E}_{x\sim p_{\text{gen}}} \left[D_{KL} \left(p_{\text{dis}}(\cdot \vert x) \,\middle\|\, \mathbb{E}_{x'\sim p_{\text{gen}}}[p_{\text{dis}}(\cdot \vert x')] \right) \right]\)
or
\(\ln IS(p_{\text{gen}},p_{\text{dis}}) = H\left[\mathbb{E}_{x\sim p_{\text{gen}}}[p_{\text{dis}}(\cdot \vert x)]\right]-\mathbb{E}_{x\sim p_{\text{gen}}}\left[H[p_{\text{dis}}(\cdot \vert x)]\right]\)
$\ln \text{IS}$ is nonnegative by Jensen’s inequality.
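The equivalence of the KL form and the entropy form can be checked by expanding the KL divergence, writing $\hat{p}(y) := \mathbb{E}_{x\sim p_{\text{gen}}}[p_{\text{dis}}(y \vert x)]$ for the marginal label distribution:
\[\ln IS = \mathbb{E}_{x}\left[\sum_{y} p_{\text{dis}}(y \vert x)\ln \frac{p_{\text{dis}}(y \vert x)}{\hat{p}(y)}\right] = \mathbb{E}_{x}\left[\sum_{y} p_{\text{dis}}(y \vert x)\ln p_{\text{dis}}(y \vert x)\right]-\sum_{y}\hat{p}(y)\ln \hat{p}(y) = H[\hat{p}]-\mathbb{E}_{x}\left[H[p_{\text{dis}}(\cdot \vert x)]\right]\]
where the cross term simplifies because $\mathbb{E}_{x}[p_{\text{dis}}(y \vert x)]=\hat{p}(y)$.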
Pseudocode:
INPUT discriminator $p_{\text{dis}}$.
INPUT generator $g$.
Sample images $x_{i}$ from generator.
Compute $p_{\text{dis}}(\cdot \vert x_{i})$, the probability distribution over labels conditional on image $x_{i}$.
Average the results to obtain $\hat{p}$, an empirical estimate of the marginal $\int p_{\text{dis}}(\cdot \vert x)\, p_{\text{gen}}(x)\,dx$.
Sample more images $x_{i}$ from the generator, and for each, compute $D_{KL}\left(p_{\text{dis}}(\cdot \vert x_{i}) \,\middle\|\, \hat{p}\right)$.
Average the results, and take its exponential.
RETURN the result.
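The pseudocode above can be sketched in NumPy, assuming the classifier outputs are already collected in an `(n_images, n_classes)` array of probabilities. For brevity this sketch reuses a single batch both for estimating $\hat{p}$ and for averaging the KL terms (common in practice, though the pseudocode draws two separate samples); all names are illustrative:

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """IS from an array probs where probs[i, y] = p_dis(y | x_i)."""
    p_hat = probs.mean(axis=0)  # empirical marginal over labels
    # Per-image KL divergence D_KL(p_dis(. | x_i) || p_hat).
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_hat + eps)), axis=1)
    return float(np.exp(kl.mean()))
```

A completely "indistinct" model (every image receives the same prediction) scores 1, while perfectly sharp predictions spread evenly over $N$ classes score $N$, matching the bounds discussed below.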
Interpretation
A higher inception score is interpreted as “better”, as it means that $p_{\text{gen}}$ is a “sharp and distinct” collection of pictures.
$\ln IS(p_{gen}, p_{dis})\in [0,\ln N]$, where $N$ is the total number of possible labels.
$\ln IS(p_{gen}, p_{dis})=0$ iff, for almost all $x\sim p_{gen}$,
\[p_{dis}(\cdot \vert x)=\int p_{dis}(\cdot \vert x')\,p_{gen}(x')\,dx'\]
That means $p_{\text{gen}}$ is completely “indistinct”: for every image $x$ sampled from $p_{\text{gen}}$, the discriminator returns exactly the same label predictions $p_{dis}(\cdot \vert x)$.
The highest inception score $N$ is achieved if and only if the two conditions are both true:
- For almost all $x\sim p_{gen}$, the distribution $p_{dis}(y \vert x)$ is concentrated on one label. That is, $H_{y}[p_{dis}(y \vert x)]= 0$. That is, every image sampled from $p_{\text{gen}}$ is exactly classified by the discriminator.
- For every label $y$, the proportion of generated images labeled as $y$ is exactly $\frac{1}{N}$. That is, the generated images are equally distributed over all labels.
See the following for a simple version.
ChatGPT’s Answer
The Inception Score (IS) is a widely used metric to evaluate the performance of generative models, particularly Generative Adversarial Networks (GANs). It assesses the quality and diversity of generated images by leveraging a pre-trained classifier, usually the Inception network.
Definition
The Inception Score is defined as:
\[\text{IS} = \exp \left( \mathbb{E}_{x \sim P_g} \left[ \text{KL}(p(y \vert x) \,\|\, p(y)) \right] \right)\]
where:
- $P_g$: Distribution of the generated images.
- $p(y \vert x)$: Probability distribution over class labels $y$ for an image $x$, as predicted by the Inception model.
- $p(y)$: Marginal distribution of predicted labels, i.e., $p(y) = \mathbb{E}_{x \sim P_g} [p(y \vert x)]$.
- $\text{KL}$: Kullback-Leibler (KL) divergence.
How It Works
- Use a Pre-trained Model:
- A pre-trained Inception-v3 network is used as the classifier to predict $p(y \vert x)$, the class probabilities for an input image $x$.
- Evaluate Quality:
  - High-quality images should result in confident predictions, meaning $p(y \vert x)$ has low entropy (peaked at a specific class).
- Evaluate Diversity:
- The generated dataset should cover a wide range of classes, resulting in a high-entropy marginal distribution $p(y)$.
- Compute the Score:
- The KL divergence $\text{KL}(p(y \vert x) \,\|\, p(y))$ measures how much $p(y \vert x)$ differs from the marginal $p(y)$.
- High IS values indicate that:
- Individual images are of high quality (confident predictions).
- The set of images is diverse (covers many classes).
Interpretation
- Higher IS: Better performance, indicating that the model generates:
- High-quality images that look realistic.
- Diverse images spanning various categories.
- Lower IS: Poorer performance, possibly due to:
- Low-quality images (blurry or nonsensical).
- Mode collapse (generating similar images repeatedly).
Advantages
- Simple to Compute:
- Easy to implement using a pre-trained Inception network.
- Captures Diversity and Quality:
- Combines class probability confidence (quality) with class coverage (diversity).
Limitations
- Ignores Ground Truth:
- IS (Inception Score) only evaluates generated images relative to the classifier’s predictions, without direct comparison to real data.
- Vulnerability to Mode Collapse:
- IS may fail to detect subtle mode collapse if the generated images still cover multiple classes but lack internal diversity.
- Class Bias:
- Performance depends on the pre-trained model, which may introduce biases based on the training dataset.
Applications
- Evaluating the performance of GANs and other generative models.
- Comparing different models or architectures.
- Monitoring training progress in generative tasks.
Example
For a GAN generating images of animals:
- High IS: Includes realistic, diverse images of cats, dogs, birds, etc., each confidently classified by the Inception model.
- Low IS: Outputs unrealistic or repetitive images, leading to uncertain or similar predictions across all images.
The Inception Score remains a popular metric despite its limitations, often used alongside others like the Fréchet Inception Distance (FID) for a more comprehensive evaluation.
Comparison
| Aspect | Fréchet Distance | Fréchet Inception Distance (FID) | Inception Score (IS) |
| --- | --- | --- | --- |
| Domain | Geometry, trajectories, paths | Generative model evaluation | Generative model evaluation |
| Input | Curves or point sets | Image datasets | Image datasets |
| Features | Geometry of curves | High-dimensional features (Inception model) | Class probabilities (Inception model) |
| Purpose | Measure similarity between curves | Assess generative model quality | Assess quality and diversity of generated images |
| Output | Scalar value (distance; lower is better) | Scalar value (distance; lower is better) | Scalar value (higher is better) |
| Robustness | Sensitive to geometry | Considers both mean and covariance of features | Sensitive to mode collapse; no comparison to real data |
Summary:
- Fréchet Distance: Geometric similarity measure for curves or paths, widely used in computational geometry.
- Fréchet Inception Distance (FID): Evaluates generative model performance by comparing distributions of real and generated images in feature space, with a focus on fidelity and diversity.
- Inception Score (IS): Measures generative model performance based on the quality and diversity of generated images, using class probability distributions from a pre-trained Inception network.
Both FID and IS are commonly used for evaluating GANs, with FID generally preferred due to its robustness and reliance on both real and generated data distributions.