Chapter 4: Deep Learning for Synthetic Data

Harnessing Generative Models to Create Realistic Data at Scale

1. The Deep Generative Revolution

For decades, synthetic data generation relied primarily on statistical models—distributions, copulas, Bayesian networks—that worked well when assumptions held. These approaches scaled gracefully and remained interpretable. But they struggled with complexity: the multimodal distributions in real-world data, the intricate dependencies across high-dimensional spaces, and the subtle patterns that define authentic data.

The rise of deep learning changed everything. Beginning around 2014–2016, researchers discovered that neural networks could learn to generate data that rivals reality in fidelity, not through explicit distributional assumptions but through learned feature hierarchies that capture structure automatically. This shift opened pathways to synthetic data that even dedicated classifiers struggle to distinguish from the real thing, that preserves privacy while maintaining utility, and that can be generated at previously unimaginable scales.

Four architectures dominate this landscape: Generative Adversarial Networks (GANs), which pit generator against discriminator in a min-max game; Variational Autoencoders (VAEs), which learn compressed latent representations and probabilistic mappings; Normalizing Flows, which compose invertible transformations to push simple distributions into complex ones; and Diffusion Models, which gradually denoise random noise into structured data.

Key Insight: Deep generative models succeed not by explicitly modeling data distributions, but by learning implicit representations through gradient descent. This flexibility allows them to capture complex, non-Gaussian patterns that classical methods miss—and explains why a well-trained GAN can synthesize photorealistic images or coherent time series from scratch.

Each architecture trades off different dimensions: training stability, sampling speed, theoretical guarantees, and sample quality. Understanding these trade-offs and when to apply each method is essential for practitioners.

2. Generative Adversarial Networks (GANs)

Architecture: The Two-Player Game

A GAN consists of two neural networks locked in competition. The generator G takes random noise and transforms it into synthetic data. The discriminator D attempts to distinguish real data from fake. During training, G improves at fooling D, while D sharpens its ability to detect fakes. At equilibrium, G produces indistinguishable samples and D achieves 50% accuracy (random guessing).

Mathematically, they optimize the minimax objective:

min_G max_D V(G, D) = E_x[log D(x)] + E_z[log(1 - D(G(z)))]

Where x is real data, z is random noise, D(x) outputs the probability that x is real, and the two terms reflect the discriminator's dual goal: maximize the log-likelihood of correctly classifying real samples (first term) and fake samples (second term).

Training Dynamics and Mode Collapse

The theoretical elegance of this formulation belies practical challenges. In early training, D easily distinguishes real from fake, so D(G(z)) sits near zero and log(1 − D(G(z))) saturates: its gradient with respect to the generator's parameters becomes vanishingly small, causing G's updates to stall. This is why practitioners use the non-saturating objective max_G log D(G(z)) (equivalently, minimize −log D(G(z))), which provides strong gradients even when the generator is losing badly.
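
A quick numerical check makes the saturation concrete. This is a self-contained sketch (the discriminator outputs are made-up values for illustration) comparing the gradient each objective sends back when D confidently rejects the fakes:

import torch

# Early in training, D spots fakes easily: D(G(z)) is close to 0
d_fake = torch.tensor([0.01, 0.02, 0.05], requires_grad=True)

# Saturating objective: min_G log(1 - D(G(z)))
grad_sat = torch.autograd.grad(torch.log(1 - d_fake).mean(), d_fake)[0]

# Non-saturating objective: min_G -log D(G(z))
d_fake2 = d_fake.detach().requires_grad_(True)
grad_ns = torch.autograd.grad(-torch.log(d_fake2).mean(), d_fake2)[0]

print(grad_sat)  # ~ -0.34 per sample: barely any learning signal
print(grad_ns)   # ~ -33 where D(G(z)) = 0.01: strong signal exactly where G is worst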

More insidiously, G may discover a small set of samples that fool D, then collapse toward that mode, ignoring the rest of the data distribution. If the real data is bimodal or multimodal, the generator produces only from one cluster. Diagnosing mode collapse requires monitoring the diversity of generated samples—a challenge that occupied GAN researchers for years.

Wasserstein GANs (WGAN)

In 2017, Arjovsky et al. proposed a breakthrough: replace the Jensen-Shannon divergence (implicit in the original GAN objective) with the Wasserstein distance, or "earth mover's distance." This metric measures how much "earth" you must move to transform one distribution into another, and it remains informative even when distributions are disjoint—unlike JS divergence, which plateaus.

The WGAN objective becomes:

min_G max_D { E_x[D(x)] - E_z[D(G(z))] }

Where D is 1-Lipschitz (its gradients are bounded). The original WGAN enforced this with weight clipping; the widely used WGAN-GP variant replaces clipping with a gradient penalty, which is usually more stable in practice. The payoff: more stable convergence, meaningful loss values that correlate with sample quality, and reduced mode collapse. WGAN-style objectives became a practical standard in many applications.

Common Pitfall: Training GANs is notoriously unstable. Discriminators can become too powerful, leaving the generator no useful gradient. Or generators collapse to a few modes. Always monitor both losses during training; a sudden drop in discriminator loss often signals generator failure. Use techniques like spectral normalization, gradient penalties, and careful learning rate scheduling.

PyTorch Implementation: A Simple 2D GAN

Figure 4.1 — Generative Adversarial Network architecture. The Generator transforms random noise into synthetic samples; the Discriminator learns to distinguish real from fake. The adversarial feedback loop drives both networks toward equilibrium, producing increasingly realistic synthetic data.

Let's implement a minimal WGAN to generate data from a 2D Gaussian mixture. This example illustrates the core mechanics and common pitfalls:

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from torch.utils.data import DataLoader, TensorDataset

# Synthetic data: mixture of two Gaussians
def create_2d_data(n_samples=5000):
    cluster1 = np.random.randn(n_samples // 2, 2) + np.array([2, 2])
    cluster2 = np.random.randn(n_samples // 2, 2) + np.array([-2, -2])
    data = np.vstack([cluster1, cluster2]).astype(np.float32)
    return torch.FloatTensor(data)

# Generator: maps z (latent) to data space
class Generator(nn.Module):
    def __init__(self, latent_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, 2)
        )
        self.latent_dim = latent_dim

    def forward(self, z):
        return self.net(z)

# Discriminator: maps data to scalar (1-Lipschitz critic)
class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, 128),
            nn.LeakyReLU(0.2),
            nn.Linear(128, 128),
            nn.LeakyReLU(0.2),
            nn.Linear(128, 1)  # No sigmoid for WGAN
        )

    def forward(self, x):
        return self.net(x)

# Gradient penalty for Lipschitz constraint
def gradient_penalty(discriminator, real_data, fake_data, device, lambda_gp=10):
    batch_size = real_data.size(0)
    alpha = torch.rand(batch_size, 1, device=device)

    interpolates = (alpha * real_data + (1 - alpha) * fake_data).requires_grad_(True)
    d_interpolates = discriminator(interpolates)

    grad_ones = torch.ones(batch_size, 1, device=device)  # weights for the vector-Jacobian product
    gradients = torch.autograd.grad(
        outputs=d_interpolates,
        inputs=interpolates,
        grad_outputs=grad_ones,
        create_graph=True,
        retain_graph=True,
    )[0]

    gradients = gradients.view(batch_size, -1)
    penalty = ((gradients.norm(2, dim=1) - 1) ** 2).mean() * lambda_gp
    return penalty

# Training loop
def train_wgan(epochs=50, batch_size=64, latent_dim=2, device='cpu'):
    data = create_2d_data()
    dataloader = DataLoader(TensorDataset(data), batch_size=batch_size, shuffle=True)

    G = Generator(latent_dim).to(device)
    D = Discriminator().to(device)

    opt_G = optim.Adam(G.parameters(), lr=1e-4, betas=(0.5, 0.9))
    opt_D = optim.Adam(D.parameters(), lr=1e-4, betas=(0.5, 0.9))

    for epoch in range(epochs):
        for real_data, in dataloader:
            real_data = real_data.to(device)
            batch_size = real_data.size(0)

            # Train discriminator (critic)
            for _ in range(5):  # More D updates per G update
                z = torch.randn(batch_size, latent_dim, device=device)
                fake_data = G(z).detach()

                d_real = D(real_data).mean()
                d_fake = D(fake_data).mean()
                gp = gradient_penalty(D, real_data, fake_data, device)

                d_loss = -d_real + d_fake + gp

                opt_D.zero_grad()
                d_loss.backward()
                opt_D.step()

            # Train generator
            z = torch.randn(batch_size, latent_dim, device=device)
            fake_data = G(z)
            d_fake = D(fake_data).mean()
            g_loss = -d_fake  # Maximize D(G(z))

            opt_G.zero_grad()
            g_loss.backward()
            opt_G.step()

        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch+1}/{epochs} | D loss: {d_loss.item():.4f} | G loss: {g_loss.item():.4f}")

    return G, D

# Generate samples
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
G, D = train_wgan(epochs=100, device=device)

z_test = torch.randn(1000, 2, device=device)
synthetic_samples = G(z_test).detach().cpu().numpy()
print(f"Generated {synthetic_samples.shape[0]} samples with shape {synthetic_samples.shape[1]}")

This code trains a WGAN to learn the bimodal Gaussian distribution. Key details:

  1. The critic has no sigmoid on its output; WGAN critics produce unbounded scores, not probabilities.
  2. The critic is updated five times per generator update, keeping the Wasserstein estimate accurate before G takes a step.
  3. The gradient penalty is computed on random interpolations between real and fake samples, enforcing the 1-Lipschitz constraint where it matters.
  4. Adam with betas=(0.5, 0.9) and a low learning rate follows common WGAN-GP settings.

3. Variational Autoencoders (VAEs)

The Encoder-Decoder Framework

Unlike GANs, which work with noise directly, VAEs learn to compress data into a latent space and reconstruct it. An encoder maps data x to a latent representation z; a decoder reconstructs from z. But VAEs do something crucial: they learn distributions over latent codes, not point estimates. The encoder outputs mean and log-variance, defining a Gaussian posterior q(z|x).

This probabilistic view connects VAEs to the broader framework of variational inference. We seek to maximize the evidence lower bound (ELBO):

ELBO = E_q(z|x)[log p(x|z)] - KL(q(z|x) || p(z))

The first term is the reconstruction loss (how well the decoder recovers the original), and the second is a regularizer pushing the posterior toward a prior (usually standard normal). The tension between these terms is central to VAE behavior: strong reconstruction pressure encourages the model to memorize training data at the expense of a well-structured latent space; strong regularization pushes the latent space toward the simple prior, enabling smooth interpolation and generation but blurring outputs.

The Reparameterization Trick

To backpropagate through stochastic sampling, Kingma and Welling introduced the reparameterization trick. Instead of sampling z ~ q(z|x) directly (which breaks the gradient flow), we sample noise ε ~ N(0, I) and compute z = μ + σ ⊙ ε, where μ and log σ are encoder outputs. This makes z a deterministic function of inputs and noise, preserving gradients through the encoder.

VAE Advantages and Trade-offs

VAEs offer stability compared to GANs: the ELBO loss is tractable, training is straightforward, and mode coverage is built-in (the prior prevents collapse). They excel at learning meaningful latent spaces—interpolation between encoded samples is smooth and interpretable. However, they often produce blurrier outputs than GANs, especially for images, because the reconstruction loss encourages averaging over uncertainty.

Key Insight: VAEs make a probabilistic commitment: the model learns p(x|z), the conditional distribution of data given latent code. This enables principled sampling (draw z from the prior, decode) and provides a framework for semi-supervised learning, conditional generation, and other extensions. The price is assumed independence between output dimensions when using Gaussian decoders—a limitation for high-dimensional, discrete data.

PyTorch Implementation: A Simple VAE

Figure 4.2 — Variational Autoencoder architecture. The Encoder maps input data to a distribution in latent space (μ, σ); the reparameterization trick enables gradient flow; the Decoder reconstructs the input. The loss combines reconstruction accuracy with KL divergence regularization.

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

class VAE(nn.Module):
    def __init__(self, input_dim=2, hidden_dim=128, latent_dim=2):
        super().__init__()
        self.latent_dim = latent_dim

        # Encoder: x -> [mu, log_var]
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)

        # Decoder: z -> x_recon
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, input_dim)
        )

    def encode(self, x):
        h = self.encoder(x)
        mu = self.fc_mu(h)
        logvar = self.fc_logvar(h)
        return mu, logvar

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        z = mu + eps * std
        return z

    def decode(self, z):
        return self.decoder(z)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        recon = self.decode(z)
        return recon, mu, logvar, z

def vae_loss(recon, x, mu, logvar, beta=1.0):
    # Reconstruction loss (MSE for continuous data)
    recon_loss = nn.functional.mse_loss(recon, x, reduction='mean')

    # KL divergence: KL(q(z|x) || p(z)) with p(z) = N(0, I)
    # (mean over batch and latent dims; the textbook ELBO sums over dims,
    # but mean-reduction keeps both loss terms on comparable scales here)
    kl_loss = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

    # Total ELBO (we minimize -ELBO)
    return recon_loss + beta * kl_loss, recon_loss, kl_loss

def train_vae(data, epochs=50, batch_size=64, latent_dim=2, device='cpu'):
    dataloader = DataLoader(TensorDataset(data), batch_size=batch_size, shuffle=True)

    vae = VAE(input_dim=2, hidden_dim=128, latent_dim=latent_dim).to(device)
    optimizer = optim.Adam(vae.parameters(), lr=1e-3)

    # Annealing beta to balance reconstruction and KL (helps training stability)
    for epoch in range(epochs):
        beta = min(1.0, epoch / max(1, epochs // 2))  # Linearly anneal from 0 to 1

        for x, in dataloader:
            x = x.to(device)

            recon, mu, logvar, z = vae(x)
            loss, recon_loss, kl_loss = vae_loss(recon, x, mu, logvar, beta=beta)

            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(vae.parameters(), max_norm=1.0)
            optimizer.step()

        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch+1}/{epochs} | Loss: {loss.item():.4f} | "
                  f"Recon: {recon_loss.item():.4f} | KL: {kl_loss.item():.4f}")

    return vae

# Training
data = create_2d_data()
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
vae = train_vae(data, epochs=100, device=device)

# Generation: sample z from standard normal, decode
z_sample = torch.randn(1000, 2, device=device)
with torch.no_grad():
    synthetic_data = vae.decode(z_sample).cpu().numpy()
print(f"Generated {synthetic_data.shape[0]} samples")

Key VAE implementation details:

  1. The encoder outputs log-variance rather than variance, so σ = exp(0.5 · logvar) stays positive without constraints.
  2. The reparameterization trick (z = μ + σ ⊙ ε) keeps sampling differentiable, preserving gradients through the encoder.
  3. KL annealing ramps β from 0 to 1 over the first half of training, giving the model time to learn reconstructions before the prior is enforced.
  4. Gradient clipping (max_norm=1.0) guards against occasional large gradients from the KL term.

4. Normalizing Flows

Invertible Transformations and Change of Variables

Normalizing flows take a different approach: rather than sampling from a latent space or training a generator, they transform a simple base distribution (e.g., standard normal) via a sequence of invertible functions. If each transformation is invertible and we know its Jacobian determinant, we can compute the density of the resulting distribution exactly.

Starting with z₀ ~ p₀(z₀), we apply transformations z₁ = f₁(z₀), z₂ = f₂(z₁), ..., zₖ = fₖ(zₖ₋₁). The density of the final sample follows from the change of variables formula:

log p(zₖ) = log p₀(z₀) - Σᵢ log |det(∂fᵢ/∂zᵢ₋₁)|

This is exact, not approximate, so we can optimize the likelihood directly. The constraint is that each f must be invertible and its Jacobian determinant must be tractable to compute.

RealNVP

A practical realization is RealNVP (Real-valued Non-Volume Preserving), which uses coupling layers. At each layer, we split the dimensions into a "frozen" part and a "transformed" part. The transformation of the transformed part is a function of the frozen part alone, which guarantees invertibility and a tractable (triangular) Jacobian:

y₁ = x₁
y₂ = x₂ ⊙ exp(s(x₁)) + t(x₁)

Where s (scale) and t (translation) are neural networks taking x₁ as input. Inversion is straightforward: recover x₁ from y₁, then compute x₂ = (y₂ - t(x₁)) ⊙ exp(-s(x₁)). Stacking layers while alternating which dimensions are frozen yields an expressive bijection.
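
A minimal sketch of one such coupling layer for the 2D data used throughout this chapter (the class name, hidden width, and tanh-stabilized scale are illustrative choices, not the exact RealNVP configuration). A full flow stacks several of these, alternating which dimension passes through unchanged:

import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    # The first dimension passes through untouched; the second is
    # scaled and shifted by functions of the first
    def __init__(self, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2)  # outputs [s, t]
        )

    def forward(self, x):
        x1, x2 = x[:, :1], x[:, 1:]
        s, t = self.net(x1).chunk(2, dim=1)
        s = torch.tanh(s)                   # keep scales bounded for stability
        y2 = x2 * torch.exp(s) + t
        log_det = s.sum(dim=1)              # log|det J| of this layer
        return torch.cat([x1, y2], dim=1), log_det

    def inverse(self, y):
        y1, y2 = y[:, :1], y[:, 1:]
        s, t = self.net(y1).chunk(2, dim=1)
        s = torch.tanh(s)
        x2 = (y2 - t) * torch.exp(-s)
        return torch.cat([y1, x2], dim=1)

With forward mapping data toward the base distribution, the maximum-likelihood training loss is −[log p₀(f(x)) + log_det] averaged over the batch, with log-determinants summed across stacked layers.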

Advantages and Limitations

Flows offer exact likelihood computation and fast, stable training. They're excellent for density estimation and importance sampling. However, they require invertible architectures, which constrains expressiveness, and stacking many layers increases computational cost. For very high-dimensional data (e.g., images), flows are less popular than GANs or diffusion models, though recent work (e.g., Glow) has scaled them effectively.

Key Insight: Normalizing flows embed sampling and density evaluation in a single framework with theoretical guarantees. This makes them ideal when you need both to generate samples and to compute exact log-likelihoods—a combination GANs and VAEs cannot easily provide. The trade-off is architectural complexity and the need for invertible layers.

5. Diffusion Models

Forward and Reverse Diffusion

Diffusion models approach generation from a different angle: they learn to reverse a noise corruption process. The forward process gradually adds Gaussian noise to real data over T timesteps:

q(x_t | x_0) = N(x_t | √ᾱ_t x_0, (1 - ᾱ_t) I)

Where ᾱ_t = Π_{s≤t}(1 − β_s) is the cumulative product of the noise schedule {β_s} and decreases monotonically from 1 toward 0 as t goes from 0 to T. At t = 0, ᾱ_t ≈ 1 and the sample is essentially clean data; at t = T, ᾱ_t ≈ 0 and the sample is pure Gaussian noise. The reverse process learns to denoise: starting from noise, it iteratively removes Gaussian noise to recover data:

p(x_{t-1} | x_t) = N(x_{t-1} | μ_θ(x_t, t), σ_t² I)

A neural network μ_θ (parameterized by θ) predicts the mean at each reverse step.

Training and the Score Matching Objective

Rather than directly predicting the mean, modern implementations take a score-matching view: the network learns the score (the gradient of the log-density), which for Gaussian corruption is equivalent, up to scaling, to predicting the noise that was added. Training minimizes the expected squared difference between predicted and actual noise:

L = E_t E_x₀ E_ε [ || ε_θ(x_t, t) - ε ||² ]

Where ε_θ predicts the noise added in the forward process, and ε ~ N(0, I). This is simpler than direct mean prediction and has better empirical properties. The model is conditioned on timestep t, typically via sinusoidal positional embeddings (borrowed from Transformer architectures).

DDPM: Denoising Diffusion Probabilistic Models

DDPM formalized this framework, showing that diffusion models could match or exceed GAN quality on image generation. The algorithm:

  1. Sample x₀ from training data, a timestep t ~ Uniform(1, T), and noise ε ~ N(0, I).
  2. Compute noisy sample x_t via forward process.
  3. Train network to predict ε from (x_t, t).
  4. Sample via reverse process: iteratively apply the learned denoiser.

Sampling is slower than GANs (requires T forward passes, often 50–1000), but the model is stable, mode-covering, and produces high-quality samples. Recent advances (DDIM, latent diffusion) speed up sampling with fewer steps or operate in compressed latent spaces.
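
The training step is short enough to sketch directly. Below is an illustrative implementation of steps 1–3 on the chapter's 2D data (the linear schedule, T = 200, and the crude scalar timestep feature are stand-ins; real implementations use sinusoidal timestep embeddings and deeper networks):

import torch
import torch.nn as nn

T = 200
betas = torch.linspace(1e-4, 0.02, T)                # linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)   # cumulative product ᾱ_t

# eps_theta predicts the added noise from (x_t, t); here t enters as a scalar feature
eps_theta = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 2))

def ddpm_loss(x0):
    t = torch.randint(0, T, (x0.size(0),))           # step 1: sample timesteps and noise
    eps = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].unsqueeze(1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps  # step 2: closed-form forward process
    t_feat = (t.float() / T).unsqueeze(1)
    pred = eps_theta(torch.cat([x_t, t_feat], dim=1))
    return ((pred - eps) ** 2).mean()                # step 3: noise-prediction MSE

# One optimization step on a batch of real 2D samples
opt = torch.optim.Adam(eps_theta.parameters(), lr=1e-3)
loss = ddpm_loss(create_2d_data()[:64])
opt.zero_grad(); loss.backward(); opt.step()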

Key Insight: Diffusion models reframe generation as denoising. This simple, intuitive objective sidesteps adversarial training instabilities. The price is sampling speed, but the stability and quality gains have made diffusion the dominant approach for high-resolution image synthesis (DALL-E 3, Stable Diffusion, Midjourney all use diffusion at their core).

Figure 4.3 — The diffusion process. Forward: Gaussian noise is progressively added to clean data over T steps, eventually reaching pure noise. Reverse: a neural network learns to denoise step by step. At generation time, sampling starts from random noise and iteratively denoises to produce synthetic data.

6. Comparing Approaches: When to Use Each Method

Each architecture has distinct strengths and weaknesses. The following table summarizes key trade-offs:

Aspect                    | GAN                        | VAE                                      | Normalizing Flow                           | Diffusion
Sample Quality            | Excellent (sharp)          | Good (blurry)                            | Very good                                  | Excellent
Sampling Speed            | Very fast                  | Fast                                     | Moderate                                   | Slow (50–1000 steps)
Training Stability        | Unstable                   | Stable                                   | Stable                                     | Very stable
Likelihood Evaluation     | No (implicit)              | Lower bound                              | Exact                                      | Lower bound
Mode Coverage             | Often poor                 | Good                                     | Good                                       | Excellent
Implementation Complexity | Medium                     | Low-medium                               | High                                       | Medium
Best Use Case             | Low-latency image synthesis | Representation learning, semi-supervised | Density estimation, likelihood-based tasks | High-quality synthesis, diverse outputs

Figure 4.4 — Deep generative models at a glance. GANs produce excellent samples but suffer from training instability and mode collapse. VAEs are stable and fast but often produce blurrier outputs. Normalizing Flows offer exact likelihood but are architecturally constrained. Diffusion models achieve the best quality and diversity at the cost of slow sampling.

Decision Framework

Use GANs if: You need real-time synthesis (e.g., game assets, live video synthesis) and can afford dedicated engineering to stabilize training. GANs are less popular in academic settings now but remain useful for latency-critical applications.

Use VAEs if: You need both a generative model and good latent representations (e.g., for downstream classification, clustering, or semi-supervised learning). VAEs also train quickly with fewer hyperparameter surprises.

Use Normalizing Flows if: Exact likelihood evaluation is critical (e.g., density estimation, anomaly detection via log-likelihood thresholding). Flows are less commonly used for synthesis alone but excel at tasks requiring both generation and density estimation.

Use Diffusion Models if: Sample quality and mode coverage are paramount, and sampling latency is acceptable. Diffusion models are the current state-of-the-art for image synthesis and have proven effective for audio, video, and structured data.

7. Training Tips and Common Pitfalls

Mode Collapse in GANs: The generator produces only a narrow subset of the true distribution. Indicators include low diversity in generated samples and high discriminator accuracy. Mitigation: gradient penalty (WGAN), spectral normalization, instance noise, or architectural choices (self-attention, progressive growing). Always track sample diversity during training.
Posterior Collapse in VAEs: The KL term shrinks toward zero and the model ignores the latent code, reducing to a deterministic autoencoder: the posterior q(z|x) matches the prior, and generation becomes poor. Mitigation: KL annealing (as in the code above), free bits (enforcing a minimum KL per dimension; see the sketch after these pitfalls), or a weaker decoder that forces the model to rely on the latent code. Monitor KL loss separately from reconstruction loss.
Training Instability in Flows: Coupling layers can become overly flexible, and the network may fit to noise in small datasets. Mitigation: use appropriate architectural constraints, regularize weight magnitudes, and use held-out validation sets to detect overfitting early.
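
The free-bits fix mentioned above is only a few lines. A minimal sketch, assuming the same mu/logvar encoder outputs as the VAE code earlier (the 0.5-nat floor per dimension is an illustrative default to tune):

import torch

def kl_free_bits(mu, logvar, free_bits=0.5):
    # KL(q(z|x) || N(0, I)) per latent dimension, averaged over the batch
    kl_per_dim = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean(dim=0)
    # Clamping at the floor removes the incentive to push any dimension's
    # KL below the threshold, so the latent code keeps carrying information
    return torch.clamp(kl_per_dim, min=free_bits).sum()

This drops in as a replacement for kl_loss in the vae_loss function above.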

General Best Practices

Across all four architectures, a few habits pay off: monitor more than one signal (sample diversity for GANs, the KL term for VAEs, held-out log-likelihood for flows); fix random seeds so runs are comparable; validate synthetic data against the real distribution with both visual checks and quantitative metrics (KS, Wasserstein, MMD); and hold out a validation set to catch overfitting early.

8. From Images to Structured Data

While deep generative models were developed for images, they apply broadly to other modalities. The key is adapting the architecture and loss function to the data type.

Tabular Data

Tabular data is mixed-type (continuous, categorical, ordinal) and often sparse. Deep generative models adapted for tabular data (e.g., CTGAN and the VAE-based TVAE) use tricks like:

  1. Mode-specific normalization: each continuous column is modeled as a Gaussian mixture, and values are encoded relative to their mode.
  2. Conditional vectors with training-by-sampling, so rare categories in imbalanced columns are seen often enough during training.
  3. Gumbel-Softmax outputs, so gradients flow through discrete categorical samples (see the sketch below).

VAEs for tabular data use similar techniques but naturally handle mixed types through conditional distributions in the decoder (Gaussian for continuous, categorical for discrete).
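
To make the Gumbel-Softmax trick concrete, here is a minimal sketch of a mixed-type output head (the class name and single continuous/categorical column layout are hypothetical; CTGAN-style models handle many columns and mixture modes):

import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedTypeHead(nn.Module):
    # Emits one continuous column and one one-hot categorical column
    def __init__(self, hidden_dim=128, n_categories=4):
        super().__init__()
        self.continuous = nn.Linear(hidden_dim, 1)
        self.categorical = nn.Linear(hidden_dim, n_categories)  # logits

    def forward(self, h, tau=0.5):
        cont = torch.tanh(self.continuous(h))        # normalized continuous value
        # Gumbel-Softmax keeps the discrete sample differentiable,
        # so the generator still receives gradients from the critic
        cat = F.gumbel_softmax(self.categorical(h), tau=tau, hard=True)
        return torch.cat([cont, cat], dim=-1)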

Time Series

Time series require temporal structure. Recurrent networks (LSTM, GRU) or Transformers replace fully-connected layers. A common approach:

  1. Slice long series into fixed-length training windows.
  2. Replace the MLP generator with a recurrent network that emits one timestep at a time, conditioned on its hidden state (see the sketch below).
  3. Use a recurrent discriminator (or critic) that scores whole sequences rather than individual points.

Recent diffusion models for time series (e.g., DiffWave, TimeGrad) treat time series as corrupted by noise and iteratively denoise, achieving strong results.
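
As a concrete starting point, this sketch unrolls a noise vector into a short multivariate sequence with a GRU (all names and sizes are illustrative; published models such as TimeGAN add embedding networks and supervised stepwise losses):

import torch
import torch.nn as nn

class RNNGenerator(nn.Module):
    def __init__(self, latent_dim=16, hidden_dim=64, n_features=3, seq_len=24):
        super().__init__()
        self.seq_len = seq_len
        self.init = nn.Linear(latent_dim, hidden_dim)        # noise -> initial hidden state
        self.gru = nn.GRU(n_features, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_features)

    def forward(self, z):
        h = torch.tanh(self.init(z)).unsqueeze(0)            # (1, batch, hidden)
        x_t = torch.zeros(z.size(0), 1, self.out.out_features, device=z.device)
        steps = []
        for _ in range(self.seq_len):                        # emit one timestep at a time
            o, h = self.gru(x_t, h)
            x_t = self.out(o)
            steps.append(x_t)
        return torch.cat(steps, dim=1)                       # (batch, seq_len, n_features)

# Usage: gen = RNNGenerator(); fake_series = gen(torch.randn(8, 16))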

Graphs

Graph-structured data requires different networks. Graph neural networks (GNNs) embed nodes and edges jointly. Generative models for graphs often use:

  1. Autoregressive generation: build the graph node by node and edge by edge (e.g., GraphRNN).
  2. Latent-variable models: VAEs that encode and decode adjacency matrices and node features (e.g., GraphVAE).
  3. Diffusion over adjacency matrices, denoising a random graph into a structured one.

Graph generation is active research and more complex than images, but the core principles (latent representations, denoising, adversarial training) transfer directly.

Key Insight: Deep generative models are modular. The same GAN or VAE framework applies to any data type if you choose the right encoder/decoder architecture (CNN for images, RNN for sequences, GNN for graphs) and loss function (MSE, cross-entropy, custom domain losses). This modularity is why deep learning generative models are so powerful: you inherit decades of progress in supervised learning architectures.

9. Hands-On: Training a Simple GAN from Scratch

Let's build a complete, minimal example that generates synthetic data from a known distribution. This walkthrough covers setup, training, and evaluation.

Complete Training Script

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader, TensorDataset

# Step 1: Create synthetic target data
# Goal: Generator learns to produce samples from a bimodal Gaussian mixture
np.random.seed(42)
torch.manual_seed(42)

n_samples = 5000
cluster1 = np.random.randn(n_samples // 2, 2) * 0.5 + np.array([2, 0])
cluster2 = np.random.randn(n_samples // 2, 2) * 0.5 + np.array([-2, 0])
real_data = np.vstack([cluster1, cluster2]).astype(np.float32)
real_data_tensor = torch.FloatTensor(real_data)

# Step 2: Define generator and discriminator
class SimpleGenerator(nn.Module):
    def __init__(self, latent_dim=2):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(latent_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, 2)
        )

    def forward(self, z):
        return self.model(z)

class SimpleDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(2, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 1)
        )

    def forward(self, x):
        return self.model(x)

# Step 3: Initialize models and optimizers
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
generator = SimpleGenerator(latent_dim=2).to(device)
discriminator = SimpleDiscriminator().to(device)

lr = 0.0002
beta1, beta2 = 0.5, 0.999
opt_G = optim.Adam(generator.parameters(), lr=lr, betas=(beta1, beta2))
opt_D = optim.Adam(discriminator.parameters(), lr=lr, betas=(beta1, beta2))

criterion = nn.BCEWithLogitsLoss()

# Step 4: Training loop
num_epochs = 200
batch_size = 64
dataloader = DataLoader(TensorDataset(real_data_tensor), batch_size=batch_size, shuffle=True)

g_losses, d_losses = [], []

for epoch in range(num_epochs):
    for batch_idx, (real_batch,) in enumerate(dataloader):
        real_batch = real_batch.to(device)
        batch_size_actual = real_batch.size(0)

        # Labels
        real_labels = torch.ones(batch_size_actual, 1, device=device)
        fake_labels = torch.zeros(batch_size_actual, 1, device=device)

        # Train Discriminator
        # Real data
        d_real_output = discriminator(real_batch)
        d_real_loss = criterion(d_real_output, real_labels)

        # Fake data
        z = torch.randn(batch_size_actual, 2, device=device)
        fake_batch = generator(z).detach()
        d_fake_output = discriminator(fake_batch)
        d_fake_loss = criterion(d_fake_output, fake_labels)

        d_loss = d_real_loss + d_fake_loss
        opt_D.zero_grad()
        d_loss.backward()
        opt_D.step()

        # Train Generator
        z = torch.randn(batch_size_actual, 2, device=device)
        fake_batch = generator(z)
        d_fake_output = discriminator(fake_batch)

        # Generator tries to fool discriminator (fake samples should look real)
        g_loss = criterion(d_fake_output, real_labels)

        opt_G.zero_grad()
        g_loss.backward()
        opt_G.step()

    g_losses.append(g_loss.item())
    d_losses.append(d_loss.item())

    if (epoch + 1) % 50 == 0:
        print(f"Epoch [{epoch+1}/{num_epochs}] | D Loss: {d_loss.item():.4f} | G Loss: {g_loss.item():.4f}")

# Step 5: Generate and visualize samples
generator.eval()
with torch.no_grad():
    z_test = torch.randn(2000, 2, device=device)
    fake_samples = generator(z_test).cpu().numpy()

# Plot results
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Real data
axes[0].scatter(real_data[:, 0], real_data[:, 1], alpha=0.5, s=10)
axes[0].set_title('Real Data')
axes[0].set_xlim(-4, 4)
axes[0].set_ylim(-3, 3)

# Generated data
axes[1].scatter(fake_samples[:, 0], fake_samples[:, 1], alpha=0.5, s=10, color='orange')
axes[1].set_title('Generated Data')
axes[1].set_xlim(-4, 4)
axes[1].set_ylim(-3, 3)

# Training curves
axes[2].plot(g_losses, label='Generator Loss')
axes[2].plot(d_losses, label='Discriminator Loss')
axes[2].set_xlabel('Epoch')
axes[2].set_ylabel('Loss')
axes[2].legend()
axes[2].set_title('Training Curves')

plt.tight_layout()
plt.savefig('gan_results.png', dpi=150)
plt.show()

# Step 6: Quantitative evaluation
from scipy.stats import ks_2samp

# Kolmogorov-Smirnov test for each dimension
ks_dim0 = ks_2samp(real_data[:, 0], fake_samples[:, 0])
ks_dim1 = ks_2samp(real_data[:, 1], fake_samples[:, 1])

print(f"\nKS Test (Dimension 0): statistic={ks_dim0.statistic:.4f}, p-value={ks_dim0.pvalue:.4f}")
print(f"KS Test (Dimension 1): statistic={ks_dim1.statistic:.4f}, p-value={ks_dim1.pvalue:.4f}")
print("(KS is per-dimension only; a small statistic does not prove that the "
      "joint 2-D distributions match — always inspect the scatter plot too.)")

Interpreting KS: The two-sample KS test is a marginal test: it compares one-dimensional CDFs. Running it independently on each dimension will miss joint-distribution failures such as mode collapse or a generator that has swapped the two clusters. Treat per-dimension KS as a necessary but not sufficient check, and complement it with multivariate metrics (Wasserstein, MMD, classifier two-sample tests) or direct visual inspection of the joint support.
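
One such multivariate check, the squared maximum mean discrepancy (MMD) with an RBF kernel, fits in a few lines. A minimal sketch (this is the biased V-statistic estimator; the bandwidth sigma=1.0 is an arbitrary default worth tuning to the data scale):

import torch

def rbf_mmd2(x, y, sigma=1.0):
    # Squared MMD between samples x and y under an RBF kernel:
    # E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]
    def k(a, b):
        d2 = torch.cdist(a, b).pow(2)
        return torch.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

print(rbf_mmd2(torch.tensor(real_data), torch.tensor(fake_samples)).item())

Values near zero indicate the joint distributions are close under this kernel; unlike per-dimension KS, MMD compares the samples jointly.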

Walkthrough and Key Observations

  1. Data Creation: We define a target distribution (bimodal Gaussian). The generator must learn this.
  2. Architecture: Simple MLPs suffice for 2D data. In practice, use convolutions for images or other domain-specific layers.
  3. Training Loop: Standard GAN training: alternate discriminator and generator updates. We use BCE loss with logits for stability.
  4. Visualization: Scatter plots show whether the generator captures both modes. If it collapses to one cluster, mode collapse has occurred.
  5. Evaluation: Kolmogorov-Smirnov test quantifies distributional similarity. For real applications, use Wasserstein distance, FID (Fréchet Inception Distance for images), or task-based metrics (downstream classifier accuracy).

Running this script should show the generator gradually learning to produce samples from both clusters. The generator loss should decrease, and the scatter plot of generated samples should eventually resemble the real data distribution.

Debugging: If the generator loss stagnates or oscillates wildly, try: reducing learning rate, using spectral normalization, or adding gradient penalty (as in the WGAN example). If discriminator loss goes to zero immediately, the generator might be producing obviously fake samples; let training continue, or reduce discriminator learning rate.

Conclusion

Deep generative models have revolutionized synthetic data generation, moving from hand-crafted statistical models to learned, implicit representations. Each approach—GANs, VAEs, Normalizing Flows, and Diffusion Models—embodies a different philosophy and trade-off:

  1. GANs trade training stability for sharp samples and fast sampling.
  2. VAEs trade sample sharpness for stable training and structured latent spaces.
  3. Normalizing Flows trade architectural freedom for exact likelihoods.
  4. Diffusion Models trade sampling speed for quality and mode coverage.

No single method dominates all scenarios. Practitioners must understand the strengths and weaknesses of each, choose appropriately for their domain (images, tabular data, sequences, graphs), and invest in careful implementation and hyperparameter tuning. The code examples in this chapter provide starting points; extending them to real-world data, larger scales, and domain-specific variations is the next step on your journey.

References and Further Reading

  1. Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative Adversarial Nets. Advances in Neural Information Processing Systems. papers.nips.cc/paper/5423-generative-adversarial-nets
  2. Kingma, D. P., & Welling, M. (2014). Auto-Encoding Variational Bayes. International Conference on Learning Representations. arxiv.org/abs/1312.6114
  3. Rezende, D. J., & Mohamed, S. (2015). Variational Inference with Normalizing Flows. International Conference on Machine Learning. arxiv.org/abs/1505.05770
  4. Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2017). Density Estimation using Real NVP. International Conference on Learning Representations. arxiv.org/abs/1605.08803
  5. Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein GAN. International Conference on Machine Learning. arxiv.org/abs/1701.07875
  6. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., & Courville, A. (2017). Improved Training of Wasserstein GANs. Advances in Neural Information Processing Systems. arxiv.org/abs/1704.00028
  7. Sohl-Dickstein, J., Weiss, E. A., Maheswaranathan, N., & Ganguli, S. (2015). Deep Unsupervised Learning using Nonequilibrium Thermodynamics. International Conference on Machine Learning. proceedings.mlr.press/v37/sohl-dickstein15.html
  8. Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. Advances in Neural Information Processing Systems. arxiv.org/abs/2006.11239
  9. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. IEEE/CVF Conference on Computer Vision and Pattern Recognition. openaccess.thecvf.com