1. Why Statistics Matter for Synthetic Data
Synthetic data is not magic. It is, at its core, a mathematical statement about patterns found in real data. To generate synthetic data effectively, you must first understand what you are modeling—the distributions, correlations, and edge cases present in your source dataset. Without this understanding, you risk creating data that looks reasonable on the surface but fails to capture the underlying reality, leading to poor model performance when synthetic data is used for training or testing.
Statistics provides the framework for this understanding. It answers fundamental questions:
- What is the typical value of a feature? (central tendency)
- How spread out are the values? (dispersion)
- Are features independent or correlated? (joint distributions)
- What are the tail behaviors and extreme cases? (distributional shape)
- How confident are we in our estimates? (uncertainty quantification)
The approach taken in this chapter is deliberately model-first: we start with the assumption that real-world data follows (or approximates) known statistical distributions, and we use techniques to estimate those distributions from observed data. This is the foundation for rule-based and simulation-based synthetic data generation methods covered in Chapter 3. Later chapters will introduce more sophisticated, data-driven approaches like deep learning that can learn complex distributions implicitly.
2. Probability Distributions
A probability distribution describes how likely different outcomes are. In the context of synthetic data, knowing the shape and parameters of the distributions of your features allows you to sample new synthetic values that preserve the statistical properties of the original data.
2.1 Common Univariate Distributions
Normal (Gaussian) Distribution: The most famous distribution, characterized by a bell curve. Defined by mean μ and standard deviation σ. Many real-world phenomena (e.g., human height, measurement errors) approximate normality. Probability density function: f(x) = (1/(σ√(2π))) * exp(-(x-μ)²/(2σ²)).
Poisson Distribution: Used for count data (number of events in a fixed interval). Defined by a single parameter λ (rate). Useful for modeling discrete events like customer arrivals or error counts.
Exponential Distribution: Models the time between events in a Poisson process. Common in reliability engineering and queueing theory. Has a characteristic "memoryless" property; a quick numerical check follows the distribution summaries below.
Categorical Distribution: Discrete distribution over a finite set of categories. Each category has a probability. Used for modeling features like color, region, or customer segment.
Beta Distribution: Flexible distribution on [0,1], useful for proportions and percentages. Controlled by two shape parameters (α, β).
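The memoryless property deserves a brief check: for an exponential random variable X, P(X > s + t | X > s) = P(X > t), so having already waited s units tells you nothing about the remaining wait. Below is a minimal numerical sketch (the scale and the values of s and t are arbitrary choices for illustration):
import numpy as np
rng = np.random.default_rng(0)
waits = rng.exponential(scale=2.0, size=1_000_000)  # mean wait of 2 time units
s, t = 1.5, 3.0
p_uncond = np.mean(waits > t)                         # P(X > t)
p_cond = np.mean(waits > s + t) / np.mean(waits > s)  # P(X > s + t | X > s)
print(f"P(X > {t}) = {p_uncond:.4f}")
print(f"P(X > {s} + {t} | X > {s}) = {p_cond:.4f}  # approximately equal: memoryless")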
2.2 Sampling from Distributions in Python
Here is a practical example using NumPy and SciPy to sample from various distributions:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
# Set seed for reproducibility
np.random.seed(42)
# Normal distribution
normal_samples = np.random.normal(loc=100, scale=15, size=10000)
# Poisson distribution (count data, e.g., customer visits per day)
poisson_samples = np.random.poisson(lam=5, size=10000)
# Exponential distribution (e.g., time until next event)
exponential_samples = np.random.exponential(scale=2.0, size=10000)
# Categorical distribution (e.g., product category)
categories = ['Electronics', 'Clothing', 'Food', 'Books']
probabilities = [0.3, 0.35, 0.25, 0.1]
categorical_samples = np.random.choice(categories, size=10000, p=probabilities)
# Beta distribution (e.g., user satisfaction on [0,1])
beta_samples = np.random.beta(a=2, b=5, size=10000)
# Visualize
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
axes[0, 0].hist(normal_samples, bins=50, density=True, alpha=0.7, edgecolor='black')
axes[0, 0].set_title('Normal Distribution (μ=100, σ=15)')
axes[0, 0].set_xlabel('Value')
axes[0, 0].set_ylabel('Density')
axes[0, 1].hist(poisson_samples, bins=range(0, max(poisson_samples)+2), density=True, alpha=0.7, edgecolor='black')
axes[0, 1].set_title('Poisson Distribution (λ=5)')
axes[0, 1].set_xlabel('Count')
axes[0, 1].set_ylabel('Probability')
axes[1, 0].hist(exponential_samples, bins=50, density=True, alpha=0.7, edgecolor='black')
axes[1, 0].set_title('Exponential Distribution (scale=2)')
axes[1, 0].set_xlabel('Time')
axes[1, 0].set_ylabel('Density')
unique, counts = np.unique(categorical_samples, return_counts=True)
axes[1, 1].bar(unique, counts/len(categorical_samples), alpha=0.7, edgecolor='black')
axes[1, 1].set_title('Categorical Distribution')
axes[1, 1].set_xlabel('Category')
axes[1, 1].set_ylabel('Probability')
plt.tight_layout()
plt.savefig('distributions.png', dpi=150, bbox_inches='tight')
plt.show()
print(f"Normal: mean={normal_samples.mean():.2f}, std={normal_samples.std():.2f}")
print(f"Poisson: mean={poisson_samples.mean():.2f}")
print(f"Exponential: mean={exponential_samples.mean():.2f}")
print(f"Beta: mean={beta_samples.mean():.2f}, min={beta_samples.min():.2f}, max={beta_samples.max():.2f}")
print(f"\nCategorical counts:\n{dict(zip(unique, counts))}")
3. Joint Distributions & Correlations
Real-world datasets rarely consist of independent features. Customer age correlates with purchase frequency; income correlates with spending; product quality correlates with return rate. Ignoring these correlations when generating synthetic data produces unrealistic records that fail to preserve the multivariate structure of the original dataset.
A joint distribution describes the probability of multiple variables occurring together. The marginal distribution of a single variable is obtained by "integrating out" all other variables. The key challenge in synthetic data generation is to preserve both marginal distributions (individual feature distributions) and joint relationships (correlations and dependencies).
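For discrete variables, marginalization is easy to see directly: summing a joint probability table over one variable yields the marginal distribution of the other. A small sketch (the joint table below is invented purely for illustration):
import numpy as np
# Joint probability table: rows = customer segment (A, B), columns = product (X, Y, Z)
joint = np.array([
    [0.10, 0.25, 0.15],   # segment A
    [0.20, 0.10, 0.20],   # segment B
])
marginal_segment = joint.sum(axis=1)  # "integrate out" (sum over) the product variable
marginal_product = joint.sum(axis=0)  # "integrate out" the segment variable
print("P(segment):", marginal_segment)   # [0.5  0.5 ]
print("P(product):", marginal_product)   # [0.3  0.35 0.35]
# The variables are independent only if the joint factorizes into its marginals:
print("Independent?", np.allclose(joint, np.outer(marginal_segment, marginal_product)))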
3.1 Correlation and Covariance
Covariance measures how two variables move together. If one tends to increase when the other increases, covariance is positive. The covariance matrix Σ for multiple variables encodes all pairwise relationships.
Pearson correlation r normalizes covariance to [-1, 1], where r=1 means perfect positive correlation, r=-1 means perfect negative correlation, and r=0 means no linear relationship.
However, correlation only captures linear relationships. Rank correlations (Spearman, Kendall) are more robust to outliers and nonlinear monotonic relationships.
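A quick illustration of the difference: an exponential transformation of a variable is monotonic but nonlinear, so the Pearson correlation falls well below 1 while the Spearman correlation stays at 1. This is only a sketch; the transformation is an arbitrary choice.
import numpy as np
from scipy import stats
rng = np.random.default_rng(42)
x = rng.normal(size=2000)
y = np.exp(2 * x)  # monotone but strongly nonlinear function of x
pearson_r, _ = stats.pearsonr(x, y)
spearman_rho, _ = stats.spearmanr(x, y)
print(f"Pearson r:    {pearson_r:.3f}")    # well below 1: measures linear association only
print(f"Spearman rho: {spearman_rho:.3f}")  # 1.000: ranks are perfectly preserved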
3.2 Multivariate Normal Distribution
The multivariate normal (MVN) distribution generalizes the univariate normal to multiple dimensions. It is fully characterized by a mean vector μ and covariance matrix Σ. A key property: any subset of MVN variables is also MVN.
Here is an example of sampling from a multivariate normal distribution and examining correlations:
import numpy as np
from scipy import stats
import pandas as pd
import matplotlib.pyplot as plt
# Define parameters for a 3-dimensional normal distribution
mean = np.array([50, 100, 0.5]) # age, income (thousands), fraction
# Define correlation structure
correlation_matrix = np.array([
    [1.0,  0.6, -0.3],   # age: r=0.6 with income, r=-0.3 with fraction
    [0.6,  1.0, -0.2],   # income: r=-0.2 with fraction
    [-0.3, -0.2, 1.0]    # fraction
])
# Convert correlation matrix to covariance matrix
# Cov(X,Y) = Corr(X,Y) * σ_X * σ_Y
stds = np.array([15, 30, 0.15]) # standard deviations
cov_matrix = np.outer(stds, stds) * correlation_matrix
# Sample from the distribution
n_samples = 5000
mvn = stats.multivariate_normal(mean=mean, cov=cov_matrix)
samples = mvn.rvs(size=n_samples, random_state=42)
# Convert to DataFrame for easier inspection
df = pd.DataFrame(samples, columns=['Age', 'Income_k', 'Fraction'])
# Compute realized correlations
print("Correlation Matrix (Realized):")
print(df.corr())
print("\nDescriptive Statistics:")
print(df.describe())
# Visualize pairwise relationships
fig, axes = plt.subplots(1, 3, figsize=(14, 4))
axes[0].scatter(df['Age'], df['Income_k'], alpha=0.3, s=10)
axes[0].set_xlabel('Age')
axes[0].set_ylabel('Income (thousands)')
axes[0].set_title(f"Age vs Income (r={df['Age'].corr(df['Income_k']):.2f})")
axes[1].scatter(df['Age'], df['Fraction'], alpha=0.3, s=10)
axes[1].set_xlabel('Age')
axes[1].set_ylabel('Fraction')
axes[1].set_title(f"Age vs Fraction (r={df['Age'].corr(df['Fraction']):.2f})")
axes[2].scatter(df['Income_k'], df['Fraction'], alpha=0.3, s=10)
axes[2].set_xlabel('Income (thousands)')
axes[2].set_ylabel('Fraction')
axes[2].set_title(f"Income vs Fraction (r={df['Income_k'].corr(df['Fraction']):.2f})")
plt.tight_layout()
plt.savefig('multivariate_normal.png', dpi=150, bbox_inches='tight')
plt.show()
4. Copulas: Decoupling Marginals from Dependence
A copula is a mathematical function that couples (hence the name) the marginal distributions of multiple variables while allowing you to specify the dependence structure independently. This is powerful: you can take a normal distribution, an exponential distribution, and a categorical distribution—three different marginals—and link them with any dependence structure you choose.
Formally, a copula C is a multivariate cumulative distribution function (CDF) on [0,1]^d such that the marginal distributions are uniform on [0,1]. By Sklar's theorem, any multivariate distribution can be decomposed as:
F(x₁, x₂, ..., x_d) = C(F₁(x₁), F₂(x₂), ..., F_d(x_d))
where F is the joint CDF, Fᵢ are the marginal CDFs, and C is the copula. The power of this decomposition is that you can:
- Estimate or specify each marginal Fᵢ independently
- Estimate or specify the copula C independently
- Combine them to generate synthetic samples
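The bridge between each marginal Fᵢ and the uniform scale the copula lives on is the probability integral transform: if X has continuous CDF F, then F(X) is uniform on [0,1], and conversely F⁻¹(U) has distribution F when U is uniform. A minimal sketch of both directions, using an exponential marginal as an arbitrary example:
import numpy as np
from scipy import stats
rng = np.random.default_rng(42)
expon_dist = stats.expon(scale=3.0)
# Forward: applying the CDF to samples from F gives (approximately) Uniform(0, 1)
x = expon_dist.rvs(size=100_000, random_state=rng)
u = expon_dist.cdf(x)
print(f"CDF-transformed samples: mean={u.mean():.3f} (theory: 0.500)")
# Inverse: pushing uniforms through the quantile function (ppf) recovers the marginal
x_new = expon_dist.ppf(rng.uniform(size=100_000))
print(f"Reconstructed mean={x_new.mean():.2f} vs. theoretical mean={expon_dist.mean():.2f}")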
4.1 Gaussian Copula
The Gaussian copula is the most commonly used copula in practice. It derives a dependence structure from a multivariate normal distribution. The procedure:
1. Sample from a multivariate normal distribution with a specified correlation matrix
2. Transform each component to uniform [0,1] using the normal CDF (Φ)
3. Transform each uniform value to the desired marginal using the inverse CDF (quantile function)
4.2 Gaussian Copula Example in Python
Here we generate synthetic data with a normal marginal, an exponential marginal, and a discrete marginal, while preserving a specified correlation structure:
import numpy as np
from scipy import stats
import pandas as pd
def gaussian_copula_sample(marginal_rvs, correlation_matrix, n_samples, random_state=None):
"""
Generate samples using Gaussian copula.
Parameters:
-----------
marginal_rvs : list of scipy.stats rv_frozen objects
Frozen scipy distributions for each marginal (e.g., norm(), expon(), etc.)
correlation_matrix : ndarray
Correlation matrix for the Gaussian copula
n_samples : int
Number of samples to generate
random_state : int
Seed for reproducibility
Returns:
--------
ndarray of shape (n_samples, len(marginal_rvs))
"""
rng = np.random.default_rng(random_state)
d = len(marginal_rvs)
# Step 1: Sample from multivariate normal (local RNG — no global seed mutation)
mvn = stats.multivariate_normal(
mean=np.zeros(d), cov=correlation_matrix, seed=rng
)
normal_samples = mvn.rvs(size=n_samples)
# Step 2: Transform to uniform [0,1] using normal CDF
uniform_samples = stats.norm.cdf(normal_samples)
# Step 3: Transform to target marginals using inverse CDF (ppf = percent point function)
synthetic = np.zeros((n_samples, d))
for i, marginal in enumerate(marginal_rvs):
synthetic[:, i] = marginal.ppf(uniform_samples[:, i])
return synthetic
# Define target marginals
marginal_1 = stats.norm(loc=50, scale=15) # Normal: Age
marginal_2 = stats.expon(loc=20, scale=30) # Exponential: Days since signup
marginal_3 = stats.binom(n=1, p=0.4) # Binomial: Churned or not
marginals = [marginal_1, marginal_2, marginal_3]
# Define correlation structure (for the Gaussian copula latent space)
correlation_matrix = np.array([
[1.0, 0.5, -0.3],
[0.5, 1.0, -0.4],
[-0.3, -0.4, 1.0]
])
# Generate synthetic samples
n_synthetic = 5000
synthetic_data = gaussian_copula_sample(
marginals,
correlation_matrix,
n_synthetic,
random_state=42
)
# Package as DataFrame
df_synthetic = pd.DataFrame(
synthetic_data,
columns=['Age', 'Days_Since_Signup', 'Churned']
)
df_synthetic['Churned'] = df_synthetic['Churned'].astype(int)
print("Synthetic Data Summary:")
print(df_synthetic.describe())
print("\nRealized Correlation Matrix:")
print(df_synthetic.corr())
# Visualization
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 3, figsize=(14, 4))
axes[0].hist(df_synthetic['Age'], bins=30, alpha=0.7, edgecolor='black')
axes[0].set_title('Age Distribution')
axes[0].set_xlabel('Age')
axes[1].hist(df_synthetic['Days_Since_Signup'], bins=30, alpha=0.7, edgecolor='black')
axes[1].set_title('Days Since Signup (Exponential)')
axes[1].set_xlabel('Days')
df_synthetic.groupby('Churned').size().plot(kind='bar', ax=axes[2], alpha=0.7, edgecolor='black')
axes[2].set_title('Churn Distribution')
axes[2].set_xlabel('Churned')
axes[2].set_ylabel('Count')
plt.tight_layout()
plt.savefig('gaussian_copula.png', dpi=150, bbox_inches='tight')
plt.show()
5. Kernel Density Estimation (KDE)
Parametric methods (like fitting a normal distribution) assume a specific functional form. Non-parametric methods make fewer assumptions about the underlying distribution, learning the density directly from data.
Kernel Density Estimation (KDE) is a non-parametric approach that represents each data point as a "bump" (kernel) and sums them to estimate the overall density. The density estimate at point x is:
f_KDE(x) = (1/n) * Σᵢ₌₁ⁿ K_h(x - xᵢ)
where K_h is a kernel function (often Gaussian) with bandwidth h, and n is the number of observations. The bandwidth h controls smoothness: small h creates a wiggly estimate, large h creates a smooth one.
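Before reaching for scikit-learn, note that the formula above is only a few lines of NumPy. The sketch below evaluates a Gaussian KDE directly from the definition and cross-checks it against SciPy; the data and bandwidth are arbitrary illustrations:
import numpy as np
from scipy import stats
def gaussian_kde_pdf(x_eval, data, h):
    """Evaluate f_KDE(x) = (1/n) * sum_i K_h(x - x_i) with a Gaussian kernel,
    where K_h(u) = (1/h) * phi(u / h)."""
    diffs = (x_eval[:, None] - data[None, :]) / h   # shape (n_eval, n_data)
    return stats.norm.pdf(diffs).mean(axis=1) / h   # average the kernels, rescale by h
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=1.5, size=200)
x_eval = np.linspace(0.0, 10.0, 5)
print(gaussian_kde_pdf(x_eval, data, h=0.5))
# Cross-check against SciPy (its bandwidth is a factor of the sample standard deviation)
print(stats.gaussian_kde(data, bw_method=0.5 / data.std(ddof=1))(x_eval))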
5.1 KDE for Univariate Density Estimation
import numpy as np
from scipy import stats
from sklearn.neighbors import KernelDensity
import matplotlib.pyplot as plt
# Generate some realistic data with a bimodal pattern
# (e.g., two customer segments with different spending patterns)
np.random.seed(42)
segment_1 = np.random.normal(loc=30, scale=10, size=300)
segment_2 = np.random.normal(loc=80, scale=15, size=700)
real_data = np.concatenate([segment_1, segment_2])
# Fit KDE with different bandwidths
kde_narrow = KernelDensity(bandwidth=2, kernel='gaussian')
kde_moderate = KernelDensity(bandwidth=5, kernel='gaussian')
kde_wide = KernelDensity(bandwidth=10, kernel='gaussian')
kde_narrow.fit(real_data.reshape(-1, 1))
kde_moderate.fit(real_data.reshape(-1, 1))
kde_wide.fit(real_data.reshape(-1, 1))
# Generate points for evaluation
x_eval = np.linspace(real_data.min() - 10, real_data.max() + 10, 500)
# Evaluate densities
dens_narrow = np.exp(kde_narrow.score_samples(x_eval.reshape(-1, 1)))
dens_moderate = np.exp(kde_moderate.score_samples(x_eval.reshape(-1, 1)))
dens_wide = np.exp(kde_wide.score_samples(x_eval.reshape(-1, 1)))
# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Left: Effect of bandwidth
axes[0].hist(real_data, bins=40, density=True, alpha=0.4, label='Histogram', edgecolor='black')
axes[0].plot(x_eval, dens_narrow, label='h=2 (narrow, overfits)', linewidth=2)
axes[0].plot(x_eval, dens_moderate, label='h=5 (moderate)', linewidth=2)
axes[0].plot(x_eval, dens_wide, label='h=10 (wide, oversmooths)', linewidth=2)
axes[0].set_xlabel('Spending ($)')
axes[0].set_ylabel('Density')
axes[0].set_title('Effect of Bandwidth on KDE')
axes[0].legend()
# Right: Sampling from KDE
samples_kde = kde_moderate.sample(n_samples=5000, random_state=42)
axes[1].hist(real_data, bins=40, density=True, alpha=0.4, label='Real Data', edgecolor='black')
axes[1].hist(samples_kde, bins=40, density=True, alpha=0.4, label='KDE Samples', edgecolor='black')
axes[1].set_xlabel('Spending ($)')
axes[1].set_ylabel('Density')
axes[1].set_title('Real Data vs. KDE-Sampled Synthetic Data')
axes[1].legend()
plt.tight_layout()
plt.savefig('kde_example.png', dpi=150, bbox_inches='tight')
plt.show()
print(f"Real data: mean={real_data.mean():.2f}, std={real_data.std():.2f}")
print(f"KDE samples: mean={samples_kde.mean():.2f}, std={samples_kde.std():.2f}")
print(f"KDE can be sampled directly, preserving the empirical distribution!")
6. Maximum Likelihood Estimation (MLE)
Once you have chosen a parametric distribution (e.g., normal, exponential), you need to fit its parameters to your data. Maximum Likelihood Estimation is a principled statistical method that finds the parameters most likely to have generated the observed data.
Given data x₁, x₂, ..., xₙ and a distribution with parameters θ, the likelihood is:
L(θ) = Πᵢ₌₁ⁿ f(xᵢ | θ)
MLE finds θ* = argmax L(θ). In practice, we maximize the log-likelihood (to avoid numerical underflow) and often use optimization algorithms like gradient descent or Newton-Raphson.
6.1 MLE Example: Fitting a Normal Distribution
import numpy as np
from scipy import stats, optimize
import matplotlib.pyplot as plt
# Simulate real data from unknown distribution
np.random.seed(42)
real_data = np.random.normal(loc=100, scale=20, size=200)
# Manual MLE for Normal Distribution
def negative_log_likelihood(params, data):
"""Compute negative log-likelihood for normal distribution.
log L(mu, sigma | data) = -n * log(sigma * sqrt(2*pi))
- sum((x - mu)^2) / (2 * sigma^2)
We return -log L so scipy.optimize.minimize finds the MLE.
"""
mu, sigma = params
if sigma <= 0:
return np.inf # Invalid parameter
n = len(data)
nll = (0.5 * n * np.log(2 * np.pi)
+ n * np.log(sigma)
+ np.sum((data - mu) ** 2) / (2 * sigma ** 2))
return nll
# Initial guess (method-of-moments)
mu_init = real_data.mean()
sigma_init = real_data.std()
initial_params = [mu_init, sigma_init]
# Optimize
result = optimize.minimize(
negative_log_likelihood,
initial_params,
args=(real_data,),
method='Nelder-Mead'
)
mu_mle, sigma_mle = result.x
print(f"True parameters: μ=100, σ=20")
print(f"MLE estimates: μ={mu_mle:.2f}, σ={sigma_mle:.2f}")
print(f"Sample statistics: mean={real_data.mean():.2f}, std={real_data.std():.2f}")
# Alternatively, use scipy.stats built-in fitting
mu_scipy, sigma_scipy = stats.norm.fit(real_data)
print(f"SciPy fit: μ={mu_scipy:.2f}, σ={sigma_scipy:.2f}")
# Generate synthetic data from fitted distribution
synthetic_normal = np.random.normal(loc=mu_mle, scale=sigma_mle, size=1000)
# Visualization
fig, axes = plt.subplots(1, 3, figsize=(14, 4))
# Left: Histogram with fitted curve
axes[0].hist(real_data, bins=20, density=True, alpha=0.6, label='Real Data', edgecolor='black')
x_range = np.linspace(real_data.min(), real_data.max(), 100)
axes[0].plot(x_range, stats.norm.pdf(x_range, mu_mle, sigma_mle), 'r-', linewidth=2, label='Fitted Normal')
axes[0].set_xlabel('Value')
axes[0].set_ylabel('Density')
axes[0].set_title('MLE: Fitting Normal to Real Data')
axes[0].legend()
# Center: Q-Q plot to assess fit
stats.probplot(real_data, dist="norm", plot=axes[1])
axes[1].set_title('Q-Q Plot: Normal Assumption Check')
# Right: Synthetic samples from fitted model
axes[2].hist(real_data, bins=20, density=True, alpha=0.6,
label='Real Data', edgecolor='black')
axes[2].hist(synthetic_normal, bins=20, density=True, alpha=0.6,
label='Synthetic', edgecolor='black')
ks_stat, ks_pvalue = stats.ks_2samp(real_data, synthetic_normal)
axes[2].set_title(f'Real vs Synthetic\nKS p={ks_pvalue:.4f}')
axes[2].legend()
plt.tight_layout()
plt.savefig('mle_normal.png', dpi=150, bbox_inches='tight')
plt.show()
7. Bootstrapping & Resampling
The bootstrap is a non-parametric resampling technique that generates new samples by repeatedly drawing from the existing data with replacement. It requires no distributional assumptions and is extremely flexible.
The bootstrap procedure is simple:
1. Take a random sample of size n (with replacement) from the original data of size n
2. Compute a statistic (mean, median, standard deviation, etc.) on this resample
3. Repeat steps 1–2 B times (e.g., B=1000)
4. The distribution of the statistic across resamples estimates its sampling distribution
7.1 Bootstrap for Confidence Intervals and Synthetic Sampling
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Real data: customer purchases (skewed distribution)
np.random.seed(42)
real_purchases = np.concatenate([
np.random.exponential(scale=50, size=800),
np.random.exponential(scale=200, size=200) # High spenders
])
print(f"Real data: n={len(real_purchases)}, mean=${real_purchases.mean():.2f}, median=${np.median(real_purchases):.2f}")
# Bootstrap approach 1: Estimate sampling distribution of the mean
n_bootstrap = 1000
bootstrap_means = []
np.random.seed(42)
for _ in range(n_bootstrap):
bootstrap_sample = np.random.choice(real_purchases, size=len(real_purchases), replace=True)
bootstrap_means.append(bootstrap_sample.mean())
bootstrap_means = np.array(bootstrap_means)
# Compute confidence interval
ci_lower = np.percentile(bootstrap_means, 2.5)
ci_upper = np.percentile(bootstrap_means, 97.5)
print(f"\nBootstrap 95% CI for mean: [${ci_lower:.2f}, ${ci_upper:.2f}]")
print(f"Bootstrap mean of means: ${bootstrap_means.mean():.2f}")
# Bootstrap approach 2: Generate synthetic data by resampling
synthetic_bootstrap = np.random.choice(real_purchases, size=5000, replace=True)
# Visualization
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
# Original data
axes[0, 0].hist(real_purchases, bins=50, alpha=0.7, edgecolor='black')
axes[0, 0].set_title('Real Purchase Data')
axes[0, 0].set_xlabel('Purchase Amount ($)')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_xlim(0, 500)
# Bootstrap synthetic
axes[0, 1].hist(synthetic_bootstrap, bins=50, alpha=0.7, edgecolor='black')
axes[0, 1].set_title('Bootstrap Synthetic Samples')
axes[0, 1].set_xlabel('Purchase Amount ($)')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_xlim(0, 500)
# Bootstrap distribution of means
axes[1, 0].hist(bootstrap_means, bins=30, alpha=0.7, edgecolor='black')
axes[1, 0].axvline(ci_lower, color='r', linestyle='--', linewidth=2, label=f'95% CI: [{ci_lower:.0f}, {ci_upper:.0f}]')
axes[1, 0].axvline(ci_upper, color='r', linestyle='--', linewidth=2)
axes[1, 0].axvline(real_purchases.mean(), color='g', linestyle='-', linewidth=2, label=f'Real Mean: {real_purchases.mean():.0f}')
axes[1, 0].set_title('Bootstrap Distribution of Mean')
axes[1, 0].set_xlabel('Sample Mean ($)')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].legend()
# Q-Q plot: real vs synthetic
axes[1, 1].scatter(np.percentile(real_purchases, np.linspace(1, 99, 50)),
np.percentile(synthetic_bootstrap, np.linspace(1, 99, 50)), alpha=0.6)
axes[1, 1].plot([0, real_purchases.max()], [0, real_purchases.max()], 'r--', linewidth=2)
axes[1, 1].set_xlabel('Real Data Quantiles')
axes[1, 1].set_ylabel('Synthetic Data Quantiles')
axes[1, 1].set_title('Quantile-Quantile Plot')
plt.tight_layout()
plt.savefig('bootstrap.png', dpi=150, bbox_inches='tight')
plt.show()
8. Dimensionality Reduction: Understanding Data Structure
High-dimensional data is difficult to visualize, model, and synthesize. Dimensionality reduction techniques project data into lower-dimensional spaces while preserving important structure. These techniques are valuable for exploratory data analysis before generating synthetic data.
8.1 Principal Component Analysis (PCA)
PCA finds orthogonal directions (principal components) that maximize variance in the data. The first component captures the direction of greatest variation, the second captures the next orthogonal direction of greatest variation, and so on. PCA is linear, fast, and interpretable.
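The mechanics can be made concrete in a few lines: center the data, take the eigenvectors of the covariance matrix, and project onto the leading ones. The sketch below (with made-up data) should match scikit-learn's PCA up to a sign flip per component:
import numpy as np
from sklearn.decomposition import PCA
rng = np.random.default_rng(42)
X = rng.multivariate_normal(mean=[0, 0, 0],
                            cov=[[3.0, 1.2, 0.5], [1.2, 2.0, 0.3], [0.5, 0.3, 1.0]],
                            size=500)
# PCA "by hand": eigenvectors of the covariance matrix, sorted by decreasing eigenvalue
X_centered = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(X_centered, rowvar=False))
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:2]]        # the two directions of greatest variance
X_proj_manual = X_centered @ components
# Same projection via scikit-learn; each column may differ by a sign flip
X_proj_sklearn = PCA(n_components=2).fit_transform(X)
print(np.allclose(np.abs(X_proj_manual), np.abs(X_proj_sklearn), atol=1e-6))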
8.2 t-SNE (t-Distributed Stochastic Neighbor Embedding)
t-SNE is a nonlinear technique designed for visualization. It preserves local structure—nearby points in high dimensions remain nearby in low dimensions—but does not preserve global distances. Excellent for spotting clusters and outliers, but the coordinates are not easily invertible.
8.3 Example: PCA and t-SNE on Synthetic Customer Data
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# Generate synthetic customer data with clusters
np.random.seed(42)
# Segment 1: Budget-conscious customers
# (off-diagonal covariances chosen so each matrix is positive semi-definite,
#  i.e., the implied correlations are roughly 0.3-0.5)
seg1 = np.random.multivariate_normal(
    mean=[30, 2000, 0.8],  # age, annual_spend, loyalty_score
    cov=[[50, 2500, 0.2], [2500, 500000, 35], [0.2, 35, 0.01]],
    size=300
)
# Segment 2: Premium customers
seg2 = np.random.multivariate_normal(
    mean=[45, 15000, 0.95],
    cov=[[80, 10000, 0.08], [10000, 5000000, 35], [0.08, 35, 0.001]],
    size=200
)
# Segment 3: Mid-range customers
seg3 = np.random.multivariate_normal(
    mean=[38, 6000, 0.6],
    cov=[[60, 3800, 0.5], [3800, 1000000, 110], [0.5, 110, 0.05]],
    size=500
)
# Combine and label
data = np.vstack([seg1, seg2, seg3])
labels = np.array([0]*300 + [1]*200 + [2]*500)
# Standardize
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
# Apply PCA
pca = PCA(n_components=2)
data_pca = pca.fit_transform(data_scaled)
# Apply t-SNE (slower, but often reveals structure better)
print("Computing t-SNE... (this may take a moment)")
tsne = TSNE(n_components=2, perplexity=30, max_iter=1000, random_state=42)
data_tsne = tsne.fit_transform(data_scaled)
# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
colors = ['red', 'blue', 'green']
labels_name = ['Budget', 'Premium', 'Mid-range']
# PCA
for i, (color, label) in enumerate(zip(colors, labels_name)):
mask = labels == i
axes[0].scatter(data_pca[mask, 0], data_pca[mask, 1],
c=color, label=label, alpha=0.6, s=30)
axes[0].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
axes[0].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)')
axes[0].set_title('PCA: Preserves Global Structure')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# t-SNE
for i, (color, label) in enumerate(zip(colors, labels_name)):
mask = labels == i
axes[1].scatter(data_tsne[mask, 0], data_tsne[mask, 1],
c=color, label=label, alpha=0.6, s=30)
axes[1].set_xlabel('t-SNE Dimension 1')
axes[1].set_ylabel('t-SNE Dimension 2')
axes[1].set_title('t-SNE: Preserves Local Structure')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('dimensionality_reduction.png', dpi=150, bbox_inches='tight')
plt.show()
print(f"\nPCA Explained Variance Ratio: {pca.explained_variance_ratio_}")
print(f"Cumulative variance explained by first 2 PCs: {pca.explained_variance_ratio_.sum():.1%}")
9. Putting It Together: A Complete Statistical Pipeline
This section demonstrates a practical end-to-end workflow: load real data, analyze its structure, fit distributions, model correlations, and generate synthetic records.
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# STEP 1: Load or simulate real data
print("=" * 60)
print("STEP 1: Load Real Data")
print("=" * 60)
np.random.seed(42)
# Simulate a realistic customer dataset
n_real = 500
real_data = {
'age': np.random.normal(40, 15, n_real),
'income': np.random.lognormal(10.5, 0.5, n_real),
'tenure_months': np.random.exponential(scale=24, size=n_real),
'is_premium': np.random.binomial(1, 0.3, n_real),
}
df_real = pd.DataFrame(real_data)
df_real['age'] = df_real['age'].clip(18, 85) # Realistic bounds
print(f"Loaded {len(df_real)} real records")
print(df_real.describe())
# STEP 2: Analyze marginal distributions
print("\n" + "=" * 60)
print("STEP 2: Analyze Marginal Distributions")
print("=" * 60)
# Test normality with Shapiro-Wilk test
for col in ['age', 'income', 'tenure_months']:
stat, p_value = stats.shapiro(df_real[col])
print(f"{col}: Shapiro-Wilk p-value = {p_value:.4f}")
# Treat the p-value as a screening signal, not a final verdict.
# With large samples, tiny harmless deviations can reject normality; with small
# samples, meaningful non-normality may go undetected. Always inspect plots too.
# STEP 3: Fit distributions and estimate parameters
print("\n" + "=" * 60)
print("STEP 3: Fit Distributions to Marginals")
print("=" * 60)
fitted_dists = {}
# Age: Normal
mu_age, sigma_age = stats.norm.fit(df_real['age'])
fitted_dists['age'] = ('norm', {'loc': mu_age, 'scale': sigma_age})
print(f"Age ~ Normal(μ={mu_age:.1f}, σ={sigma_age:.1f})")
# Income: Lognormal (skewed). Fix location at 0 — income is non-negative.
shape, loc, scale = stats.lognorm.fit(df_real['income'], floc=0)
fitted_dists['income'] = ('lognorm', {'s': shape, 'loc': loc, 'scale': scale})
print(f"Income ~ Lognormal(shape={shape:.2f}, scale={scale:.0f})")
# Tenure: Exponential. Fix location at 0 (tenure is non-negative, measured from signup).
scale_tenure = stats.expon.fit(df_real['tenure_months'], floc=0)[1]
fitted_dists['tenure_months'] = ('expon', {'scale': scale_tenure})
print(f"Tenure ~ Exponential(scale={scale_tenure:.1f})")
# Premium flag: Bernoulli
p_premium = df_real['is_premium'].mean()
fitted_dists['is_premium'] = ('binom', {'n': 1, 'p': p_premium})
print(f"Premium ~ Bernoulli(p={p_premium:.2f})")
# STEP 4: Estimate correlation structure
print("\n" + "=" * 60)
print("STEP 4: Estimate Correlation Structure")
print("=" * 60)
corr_matrix = df_real[['age', 'income', 'tenure_months']].corr()
print("Correlation Matrix:")
print(corr_matrix)
# STEP 5: Generate synthetic data using Gaussian copula
print("\n" + "=" * 60)
print("STEP 5: Generate Synthetic Data (Gaussian Copula)")
print("=" * 60)
# Step 5a: Sample from copula
n_synthetic = 5000
mvn = stats.multivariate_normal(
mean=np.zeros(3),
cov=corr_matrix.values
)
copula_samples = stats.norm.cdf(mvn.rvs(size=n_synthetic, random_state=42))
# Step 5b: Transform to marginals via inverse-CDF of each fitted distribution.
# is_premium is treated as independent here (we did not include it in the copula
# above). Sampling it from a fresh uniform keeps its marginal rate at p_premium
# while avoiding a spurious coupling to age.
rng_is_premium = np.random.default_rng(seed=42)
synthetic_data = {
'age': stats.norm.ppf(copula_samples[:, 0], loc=mu_age, scale=sigma_age),
'income': stats.lognorm.ppf(copula_samples[:, 1], s=shape, loc=loc, scale=scale),
'tenure_months': stats.expon.ppf(copula_samples[:, 2], scale=scale_tenure),
'is_premium': (rng_is_premium.random(n_synthetic) < p_premium).astype(int),
}
df_synthetic = pd.DataFrame(synthetic_data)
df_synthetic['age'] = df_synthetic['age'].clip(18, 85)
print(f"Generated {len(df_synthetic)} synthetic records")
print(df_synthetic.describe())
# STEP 6: Compare real vs. synthetic
print("\n" + "=" * 60)
print("STEP 6: Validation")
print("=" * 60)
print("\nCORRELATION COMPARISON:")
print("Real correlation matrix:")
print(df_real[['age', 'income', 'tenure_months']].corr())
print("\nSynthetic correlation matrix:")
print(df_synthetic[['age', 'income', 'tenure_months']].corr())
# Visualization
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
# Row 1: Marginal distributions
axes[0, 0].hist(df_real['age'], bins=30, alpha=0.6, label='Real', edgecolor='black')
axes[0, 0].hist(df_synthetic['age'], bins=30, alpha=0.6, label='Synthetic', edgecolor='black')
axes[0, 0].set_title('Age')
axes[0, 0].legend()
axes[0, 1].hist(np.log1p(df_real['income']), bins=30, alpha=0.6, label='Real', edgecolor='black')
axes[0, 1].hist(np.log1p(df_synthetic['income']), bins=30, alpha=0.6, label='Synthetic', edgecolor='black')
axes[0, 1].set_title('Income (log scale)')
axes[0, 1].legend()
axes[0, 2].hist(df_real['tenure_months'], bins=30, alpha=0.6, label='Real', edgecolor='black')
axes[0, 2].hist(df_synthetic['tenure_months'], bins=30, alpha=0.6, label='Synthetic', edgecolor='black')
axes[0, 2].set_title('Tenure (months)')
axes[0, 2].legend()
# Compare proportions rather than raw counts (real has 500 rows, synthetic has 5000)
df_real['is_premium'].value_counts(normalize=True).plot(kind='bar', ax=axes[0, 3], alpha=0.6, label='Real')
df_synthetic['is_premium'].value_counts(normalize=True).plot(kind='bar', ax=axes[0, 3], alpha=0.6, label='Synthetic')
axes[0, 3].set_ylabel('Proportion')
axes[0, 3].set_title('Premium Flag')
axes[0, 3].legend()
# Row 2: Scatter plots of key relationships
axes[1, 0].scatter(df_real['age'], np.log1p(df_real['income']), alpha=0.3, s=10, label='Real')
axes[1, 0].scatter(df_synthetic['age'], np.log1p(df_synthetic['income']), alpha=0.3, s=10, label='Synthetic')
axes[1, 0].set_xlabel('Age')
axes[1, 0].set_ylabel('log(Income)')
axes[1, 0].set_title('Age vs Income')
axes[1, 0].legend()
axes[1, 1].scatter(df_real['age'], df_real['tenure_months'], alpha=0.3, s=10, label='Real')
axes[1, 1].scatter(df_synthetic['age'], df_synthetic['tenure_months'], alpha=0.3, s=10, label='Synthetic')
axes[1, 1].set_xlabel('Age')
axes[1, 1].set_ylabel('Tenure')
axes[1, 1].set_title('Age vs Tenure')
axes[1, 1].legend()
axes[1, 2].scatter(np.log1p(df_real['income']), df_real['tenure_months'], alpha=0.3, s=10, label='Real')
axes[1, 2].scatter(np.log1p(df_synthetic['income']), df_synthetic['tenure_months'], alpha=0.3, s=10, label='Synthetic')
axes[1, 2].set_xlabel('log(Income)')
axes[1, 2].set_ylabel('Tenure')
axes[1, 2].set_title('Income vs Tenure')
axes[1, 2].legend()
axes[1, 3].axis('off')
plt.suptitle('Real vs. Synthetic Data Comparison', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('pipeline_comparison.png', dpi=150, bbox_inches='tight')
plt.show()
print("\nSynthetic data generated and validated successfully!")
print("Next step: Use synthetic data for model training, testing, or augmentation.")
Conclusion
Statistical foundations are essential to synthetic data generation. By understanding probability distributions, correlations, and estimation techniques, you can build principled synthetic data pipelines that preserve the essential properties of real data.
The techniques in this chapter—parametric distributions, Gaussian copulas, kernel density estimation, and maximum likelihood estimation—form the toolkit for Chapters 3 and beyond. Some practitioners will use these methods directly; others will combine them with machine learning and deep learning approaches to handle more complex, high-dimensional data.
Key takeaways:
- Always analyze your data's marginal distributions before generation
- Correlations matter: use copulas or other dependence models
- Validate synthetic data against real data on both univariate and multivariate statistics
- Trade off complexity and interpretability: simpler models are easier to debug
- No one-size-fits-all solution: the best approach depends on your data, use case, and constraints
In Chapter 3, we will apply these statistical foundations to rule-based and simulation-based synthetic data generation, showing how to encode domain knowledge and business logic directly into the generation process.
References and Further Reading
- Casella, G., & Berger, R. L. (2002). Statistical Inference. Duxbury, 2nd Edition.
- Nelsen, R. B. (2006). An Introduction to Copulas. Springer, 2nd Edition.
- Sklar, A. (1959). Fonctions de répartition à n dimensions et leurs marges. Publications de l'Institut de Statistique de l'Université de Paris, 8, 229-231.
- Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall.
- Scott, D. W. (2015). Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley, 2nd Edition.
- Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife. Annals of Statistics, 7(1), 1-26. projecteuclid.org/journals/annals-of-statistics/volume-7/issue-1
- Massey, F. J. (1951). The Kolmogorov-Smirnov Test for Goodness of Fit. Journal of the American Statistical Association, 46(253), 68-78.