Notes on Variational Autoencoders

Sondre Wold

2025-04-08

This is an attempt at deriving the key properties of the Variational Autoencoder (VAE).

Problem statement

We have a dataset \(\mathbf{X} = \{\mathbf{x}^{(i)}\}^N_{i=1}\) of \(N\) i.i.d. samples of a continuous (or discrete) variable \(\mathbf{x}\). We assume that \(\mathbf{X}\) is generated by a random process involving an unobserved continuous random variable \(\mathbf{z}\). The process has two steps:

  1. \(\mathbf{z}^{(i)}\) is generated from a prior distribution \(\mathop{\mathrm{p_\theta}}(\mathbf{z})\).

  2. \(\mathbf{x}^{(i)}\) is generated from the conditional distribution \(\mathop{\mathrm{p_\theta}}(\mathbf{x}\mid \mathbf{z})\).

We want to learn the distribution \(\mathop{\mathrm{p_\theta}}(\mathbf{x})\) so that we can sample from it to generate new datapoints. However, we know neither \(\theta\) nor \(\mathbf{z}\). The most direct way to obtain \(\mathop{\mathrm{p_\theta}}(\mathbf{x})\) would be to marginalize over the joint distribution: \[\mathop{\mathrm{p_\theta}}(\mathbf{x}) = \int \mathop{\mathrm{p_\theta}}(\mathbf{x}, \mathbf{z})\, \mathrm{d}\mathbf{z}.\] This, however, is intractable, as integrating over all possible values of \(\mathbf{z}\) is not feasible. If we instead rewrite \(\mathop{\mathrm{p_\theta}}(\mathbf{x}, \mathbf{z})\) using the chain rule of probability, \[\mathop{\mathrm{p_\theta}}(\mathbf{x}, \mathbf{z}) = \mathop{\mathrm{p_\theta}}(\mathbf{z}\mid \mathbf{x})\mathop{\mathrm{p_\theta}}(\mathbf{x}),\] and solve for \(\mathop{\mathrm{p_\theta}}(\mathbf{x})\), we get: \[\mathop{\mathrm{p_\theta}}(\mathbf{x}) = \frac{\mathop{\mathrm{p_\theta}}(\mathbf{x}, \mathbf{z})}{\mathop{\mathrm{p_\theta}}(\mathbf{z}\mid \mathbf{x})}.\] We see that we need the posterior \(\mathop{\mathrm{p_\theta}}(\mathbf{z}\mid \mathbf{x})\) in order to get \(\mathop{\mathrm{p_\theta}}(\mathbf{x})\), and that we need \(\mathop{\mathrm{p_\theta}}(\mathbf{x})\) to get the posterior (if we solve for the conditional instead). This is a chicken-and-egg problem.
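To see what the marginalization actually computes, here is a small numerical sketch on a hypothetical one-dimensional toy model (the specific distributions and numbers are illustrative assumptions, not part of the setup above). With a single latent dimension the integral can be brute-forced on a grid and matches the known closed-form marginal; the same grid approach needs exponentially many points as the dimensionality of \(\mathbf{z}\) grows, which is exactly why the integral is intractable in practice.

```python
# A hypothetical toy model used only for illustration:
#   p(z)     = N(0, 1)
#   p(x | z) = N(z, 0.25)        (i.e. standard deviation 0.5)
# The marginal p(x) = \int p(x | z) p(z) dz is approximated on a grid.
import numpy as np
from scipy.stats import norm

x = 0.7                                  # a single "observed" datapoint
z_grid = np.linspace(-8.0, 8.0, 4001)    # quadrature grid over the latent
dz = z_grid[1] - z_grid[0]

joint = norm.pdf(x, loc=z_grid, scale=0.5) * norm.pdf(z_grid)  # p(x | z) p(z)
p_x_numeric = np.sum(joint) * dz         # Riemann-sum approximation of the integral

# For this Gaussian toy model the marginal is available in closed form: N(0, 1.25).
p_x_exact = norm.pdf(x, loc=0.0, scale=np.sqrt(1.25))
print(p_x_numeric, p_x_exact)            # the two values agree to several decimals

# A d-dimensional latent would need 4001**d grid points, which is why
# this brute-force marginalization is infeasible for realistic models.
```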

The solution is to approximate the true posterior. That is, we want to learn an approximate posterior \(\mathop{\mathrm{q_\varphi}}(\mathbf{z}\mid \mathbf{x})\) such that \(\mathop{\mathrm{q_\varphi}}(\mathbf{z}\mid \mathbf{x}) \approx \mathop{\mathrm{p_\theta}}(\mathbf{z}\mid \mathbf{x})\).

Deriving the ELBO

We now express \(\mathop{\mathrm{p_\theta}}(\mathbf{x})\) using the approximate posterior we just introduced (the first step below multiplies by \(\int \mathop{\mathrm{q_\varphi}}(\mathbf{z}\mid \mathbf{x})\, \mathrm{d}\mathbf{z} = 1\)):

\[\begin{aligned} \log \mathop{\mathrm{p_\theta}}(\mathbf{x}) & = \log \mathop{\mathrm{p_\theta}}(\mathbf{x}) \int \mathop{\mathrm{q_\varphi}}(\mathbf{z}\mid \mathbf{x})\, \mathrm{d}\mathbf{z} \\ & = \int \log \mathop{\mathrm{p_\theta}}(\mathbf{x})\, \mathop{\mathrm{q_\varphi}}(\mathbf{z}\mid \mathbf{x})\, \mathrm{d}\mathbf{z} \\ & = \mathbb{E}_{\mathop{\mathrm{q_\varphi}}(\mathbf{z}\mid \mathbf{x})} \left[ \log \mathop{\mathrm{p_\theta}}(\mathbf{x}) \right] \\ & = \mathbb{E}_{\mathop{\mathrm{q_\varphi}}(\mathbf{z}\mid \mathbf{x})} \left[ \log \frac{\mathop{\mathrm{p_\theta}}(\mathbf{x}, \mathbf{z})}{\mathop{\mathrm{p_\theta}}(\mathbf{z}\mid \mathbf{x})} \right] \\ & = \mathbb{E}_{\mathop{\mathrm{q_\varphi}}(\mathbf{z}\mid \mathbf{x})} \left[ \log \frac{\mathop{\mathrm{p_\theta}}(\mathbf{x}, \mathbf{z})\mathop{\mathrm{q_\varphi}}(\mathbf{z}\mid \mathbf{x})}{\mathop{\mathrm{q_\varphi}}(\mathbf{z}\mid \mathbf{x})\mathop{\mathrm{p_\theta}}(\mathbf{z}\mid \mathbf{x})} \right] \\ & = \mathbb{E}_{\mathop{\mathrm{q_\varphi}}(\mathbf{z}\mid \mathbf{x})} \left[ \log \frac{\mathop{\mathrm{p_\theta}}(\mathbf{x}, \mathbf{z})}{\mathop{\mathrm{q_\varphi}}(\mathbf{z}\mid \mathbf{x})} + \log \frac{\mathop{\mathrm{q_\varphi}}(\mathbf{z}\mid \mathbf{x})}{\mathop{\mathrm{p_\theta}}(\mathbf{z}\mid \mathbf{x})} \right] \\ & = \mathbb{E}_{\mathop{\mathrm{q_\varphi}}(\mathbf{z}\mid \mathbf{x})} \left[ \log \frac{\mathop{\mathrm{p_\theta}}(\mathbf{x}, \mathbf{z})}{\mathop{\mathrm{q_\varphi}}(\mathbf{z}\mid \mathbf{x})}\right] + \mathbb{E}_{\mathop{\mathrm{q_\varphi}}(\mathbf{z}\mid \mathbf{x})} \left[\log \frac{\mathop{\mathrm{q_\varphi}}(\mathbf{z}\mid \mathbf{x})}{\mathop{\mathrm{p_\theta}}(\mathbf{z}\mid \mathbf{x})} \right] \\ & = \mathbb{E}_{\mathop{\mathrm{q_\varphi}}(\mathbf{z}\mid \mathbf{x})} \left[ \log \frac{\mathop{\mathrm{p_\theta}}(\mathbf{x}, \mathbf{z})}{\mathop{\mathrm{q_\varphi}}(\mathbf{z}\mid \mathbf{x})}\right] + \underbrace{\operatorname{D}_{\text{KL}}(\mathop{\mathrm{q_\varphi}}(\mathbf{z}\mid \mathbf{x}) \mid \mid \mathop{\mathrm{p_\theta}}(\mathbf{z}\mid \mathbf{x}))}_\text{Approximation error} \end{aligned}\]
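Before interpreting the two terms on the right-hand side, here is a quick numerical sanity check of this identity on a hypothetical conjugate Gaussian toy model. The specific model, the datapoint \(x\) and the parameters of \(q\) are arbitrary assumptions chosen only so that the marginal and the true posterior are available in closed form.

```python
# Monte Carlo check of the identity  log p(x) = ELBO + D_KL(q(z|x) || p(z|x))
# on a hypothetical conjugate Gaussian model, chosen only because every
# quantity is available in closed form:
#   p(z) = N(0, 1),  p(x|z) = N(z, 1)  =>  p(x) = N(0, 2),  p(z|x) = N(x/2, 1/2)
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = 1.3                              # observed datapoint (arbitrary)
m, s = 0.4, 0.9                      # a deliberately imperfect q(z|x) = N(m, s^2)

# ELBO = E_q[ log p(x, z) - log q(z|x) ], estimated with samples from q
z = rng.normal(m, s, size=200_000)
log_joint = norm.logpdf(x, loc=z, scale=1.0) + norm.logpdf(z, loc=0.0, scale=1.0)
elbo = np.mean(log_joint - norm.logpdf(z, loc=m, scale=s))

# KL between the Gaussian q and the Gaussian true posterior, in closed form
mu_post, sd_post = x / 2.0, np.sqrt(0.5)
kl = np.log(sd_post / s) + (s**2 + (m - mu_post) ** 2) / (2.0 * sd_post**2) - 0.5

print(elbo + kl)                                    # ~ -1.688 (up to Monte Carlo noise)
print(norm.logpdf(x, loc=0.0, scale=np.sqrt(2.0)))  # log p(x) = -1.688...
```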

We can see that the second term measures the divergence between our approximation and the true posterior. Since the KL divergence is always non-negative, \(\log \mathop{\mathrm{p_\theta}}(\mathbf{x})\) is lower-bounded by the first term: the log-likelihood of our data is at least as large as the first term! Maximizing the first term therefore maximizes a lower bound on the likelihood, which is what we want. Since the rightmost term contains the true posterior, we cannot compute it anyway, so we drop it and continue with the first term, which is known as the Evidence Lower Bound (ELBO):

\[\begin{aligned} \log \mathop{\mathrm{p_\theta}}(\mathbf{x}) & \geq \mathbb{E}_{\mathop{\mathrm{q_\varphi}}(\mathbf{z}\mid \mathbf{x})} \left[ \log \frac{\mathop{\mathrm{p_\theta}}(\mathbf{x}, \mathbf{z})}{\mathop{\mathrm{q_\varphi}}(\mathbf{z}\mid \mathbf{x})}\right] \\ & = \mathbb{E}_{\mathop{\mathrm{q_\varphi}}(\mathbf{z}\mid \mathbf{x})} \left[ - \log \mathop{\mathrm{q_\varphi}}(\mathbf{z}\mid \mathbf{x}) + \log \mathop{\mathrm{p_\theta}}(\mathbf{x},\mathbf{z}) \right] \\ & = \mathbb{E}_{\mathop{\mathrm{q_\varphi}}(\mathbf{z}\mid \mathbf{x})} \left[ - \log \mathop{\mathrm{q_\varphi}}(\mathbf{z}\mid \mathbf{x}) + \log \mathop{\mathrm{p_\theta}}(\mathbf{x}\mid \mathbf{z}) \mathop{\mathrm{p_\theta}}(\mathbf{z}) \right] \\ & = \mathbb{E}_{\mathop{\mathrm{q_\varphi}}(\mathbf{z}\mid \mathbf{x})} \left[ - \log \mathop{\mathrm{q_\varphi}}(\mathbf{z}\mid \mathbf{x}) + \log \mathop{\mathrm{p_\theta}}(\mathbf{x}\mid \mathbf{z}) + \log \mathop{\mathrm{p_\theta}}(\mathbf{z}) \right] \\ & = \mathbb{E}_{\mathop{\mathrm{q_\varphi}}(\mathbf{z}\mid \mathbf{x})} \left[ \log \frac{\mathop{\mathrm{p_\theta}}(\mathbf{z})}{\mathop{\mathrm{q_\varphi}}(\mathbf{z}\mid \mathbf{x})} + \log \mathop{\mathrm{p_\theta}}(\mathbf{x}\mid \mathbf{z}) \right] \\ & = \mathbb{E}_{\mathop{\mathrm{q_\varphi}}(\mathbf{z}\mid \mathbf{x})} \left[ \log \frac{\mathop{\mathrm{p_\theta}}(\mathbf{z})}{\mathop{\mathrm{q_\varphi}}(\mathbf{z}\mid \mathbf{x})}\right] + \mathbb{E}_{\mathop{\mathrm{q_\varphi}}(\mathbf{z}\mid \mathbf{x})} \left[ \log \mathop{\mathrm{p_\theta}}(\mathbf{x}\mid \mathbf{z}) \right] \\ & = - \underbrace{\operatorname{D}_{\text{KL}}(\mathop{\mathrm{q_\varphi}}(\mathbf{z}\mid \mathbf{x}) \mid \mid \mathop{\mathrm{p_\theta}}(\mathbf{z}))}_\text{Prior matching term} + \underbrace{\mathbb{E}_{\mathop{\mathrm{q_\varphi}}(\mathbf{z}\mid \mathbf{x})} \left[ \log \mathop{\mathrm{p_\theta}}(\mathbf{x}\mid \mathbf{z}) \right]}_\text{Reconstruction term} \end{aligned}\]

We want the prior matching term to be small, meaning that the divergence between our learned latent distribution and our chosen prior should be as small as possible; in practice, this prior is typically a standard Gaussian. Simultaneously, we want to maximize the reconstruction term, so that the decoder assigns high probability to the data given latents sampled from \(\mathop{\mathrm{q_\varphi}}\). Maximizing the ELBO thereby also pushes our learned posterior as close as possible to the true posterior.
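For the common choice of a diagonal Gaussian approximate posterior \(\mathop{\mathrm{q_\varphi}}(\mathbf{z}\mid \mathbf{x}) = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))\) and a standard normal prior \(\mathop{\mathrm{p_\theta}}(\mathbf{z}) = \mathcal{N}(0, \mathbf{I})\) (an assumption about the parameterization, not something derived above), the prior matching term has a closed form and can be computed directly from the encoder outputs, without sampling:

\[\operatorname{D}_{\text{KL}}(\mathop{\mathrm{q_\varphi}}(\mathbf{z}\mid \mathbf{x}) \mid \mid \mathop{\mathrm{p_\theta}}(\mathbf{z})) = \frac{1}{2} \sum_{j} \left( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \right),\]

where the sum runs over the dimensions of the latent space.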

The Reparameterization Trick

Optimizing the ELBO requires sampling the latent variable \(\mathbf{z}\) from \(\mathop{\mathrm{q_\varphi}}\): \(\mathbf{z}\sim \mathop{\mathrm{q_\varphi}}(\mathbf{z}\mid \mathbf{x})\). This sampling operation, however, is not differentiable with respect to \(\varphi\), which prohibits backpropagation during training. We therefore need a little trick to get something differentiable: we introduce a noise variable \(\epsilon\) and reformulate the sample as a deterministic function of the distribution parameters \(\mu\) and \(\sigma^2\) together with this noise:

\[\mathbf{z}= \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \mathbf{I}),\] so that \(\mathbf{z}\sim \mathcal{N}(\mu, \sigma^2)\) still holds, but the randomness now enters only through \(\epsilon\), and gradients can flow through \(\mu\) and \(\sigma\).
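In code this is just a shift and scale of standard normal noise. A minimal sketch in PyTorch, assuming the encoder outputs the mean and the log-variance (parameterizing \(\log \sigma^2\) is a common convention for numerical stability, not something required by the derivation):

```python
import torch

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Draw z ~ N(mu, sigma^2) as a differentiable function of (mu, logvar)."""
    std = torch.exp(0.5 * logvar)   # sigma = exp(log(sigma^2) / 2)
    eps = torch.randn_like(std)     # eps ~ N(0, I); the randomness lives here
    return mu + eps * std           # gradients flow through mu and std
```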

Model

The VAE has two major components: the Encoder and the Decoder. The Encoder embeds the input in a latent space by predicting the distribution parameters \(\mu\) and \(\sigma^2\) of the approximate posterior. The Decoder takes a latent variable sampled from this approximate posterior (via the reparameterization trick) and maps it back to data space, generating a new datapoint.
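To make the two components concrete, here is a minimal sketch in PyTorch. The flattened 784-dimensional input, the hidden and latent sizes, and the Bernoulli (binary cross-entropy) reconstruction term are illustrative assumptions, not choices prescribed by the notes above; the structure follows the ELBO derived earlier, with the closed-form KL for a Gaussian posterior and standard normal prior.

```python
import torch
from torch import nn
from torch.nn import functional as F

class VAE(nn.Module):
    """Minimal VAE with a Gaussian encoder and a Bernoulli decoder."""

    def __init__(self, x_dim: int = 784, h_dim: int = 400, z_dim: int = 20):
        super().__init__()
        # Encoder: maps x to the parameters (mu, log sigma^2) of q(z|x)
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.fc_mu = nn.Linear(h_dim, z_dim)
        self.fc_logvar = nn.Linear(h_dim, z_dim)
        # Decoder: maps z to the parameters of p(x|z) (pixel probabilities here)
        self.dec = nn.Sequential(
            nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim), nn.Sigmoid()
        )

    def forward(self, x: torch.Tensor):
        h = self.enc(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        std = torch.exp(0.5 * logvar)
        z = mu + torch.randn_like(std) * std          # reparameterization trick
        return self.dec(z), mu, logvar

def negative_elbo(x_hat, x, mu, logvar):
    # Reconstruction term: -E_q[log p(x|z)] for a Bernoulli decoder
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
    # Prior matching term: closed-form KL between N(mu, sigma^2) and N(0, I)
    kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0)
    return recon + kl
```

Training then amounts to minimizing negative_elbo over mini-batches with any standard optimizer; generating a new datapoint afterwards only requires a latent sample \(\mathbf{z}\) passed through the decoder.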