Chapter 9: Evaluation & Quality Metrics


9.1 Why Evaluation Matters

You have generated synthetic data. The generator ran without errors, sampled thousands of records, and the output looks plausible at first glance. But can you trust it?

The answer depends entirely on rigorous evaluation. Synthetic data quality is not self-evident. A GAN trained on financial transactions might produce structurally sound records that violate statistical invariants. A VAE fine-tuned on medical data might capture marginal distributions but fail to preserve disease correlations. A rule-based simulator might be reproducible and auditable but cover so narrow a slice of the real distribution that it has limited utility for model training.

Evaluation is the bridge between generation and deployment. It answers the critical question: does this synthetic data actually serve the purpose for which it was created? Without systematic evaluation, you risk deploying synthetic data that appears to work but silently undermines downstream decisions, or that reveals more about real data than you realized.

This chapter provides a complete framework for evaluating synthetic data across three complementary dimensions: fidelity (how similar is the synthetic data to the real data?), utility (how useful is it for downstream tasks?), and privacy (how well are real individuals protected?). We'll move from theory to implementation, showing how to compute metrics, interpret them, and build automated evaluation pipelines.

Core Principle: You cannot trust synthetic data you have not measured. Evaluation is not optional—it is mandatory before any production deployment. A complete evaluation suite should assess fidelity, utility, and privacy simultaneously, and results should be documented in a quality scorecard.

9.2 The Three Pillars of Synthetic Data Quality

Synthetic data quality rests on three foundational pillars, each answering a different question:

Pillar 1: Fidelity

The Question: How well does the synthetic data resemble the real data in terms of statistical properties?

Fidelity measures the degree to which synthetic data preserves the statistical characteristics of the real data. This includes univariate distributions (marginals), bivariate and multivariate relationships, temporal patterns, and structural constraints. High fidelity means that if you analyze the synthetic data, you observe similar statistical conclusions as you would from the real data.

Why it matters: If synthetic data has poor fidelity, it cannot serve as a drop-in replacement for real data. Statistical analyses will be misleading, regression coefficients will be biased, and domain-specific patterns will be lost. Fidelity is the floor—without it, nothing else matters.

Example: A synthetic dataset is created to simulate patient demographics. Real data shows that age is normally distributed with mean 55 and std 18, and that older patients have higher hypertension prevalence (70% at age 70+, 30% at age 40-50). If synthetic age is uniformly distributed or the age-hypertension correlation is absent, fidelity is compromised.

Pillar 2: Utility

The Question: How well does the synthetic data enable downstream machine learning tasks?

Utility measures whether models trained on synthetic data generalize to real data, and whether insights derived from synthetic data are actionable on real data. A dataset can have perfect fidelity in marginal distributions but fail utility if it lacks important interaction effects or rare subgroups that are critical for model performance.

Why it matters: The ultimate purpose of synthetic data is often to train models or support analysis. If a model trained on synthetic data fails to generalize to real data, or if analysis conclusions don't hold, synthetic data has failed its primary mission regardless of how statistically perfect it appears.

Example: A synthetic loan dataset has perfect correlation structure but is missing the rare "defaulting subprime borrower" subgroup that dominates risk in the real loan book. A classifier trained on synthetic data will overestimate precision and miss true default cases in production.

Pillar 3: Privacy

The Question: How well are real individuals protected from re-identification and attribute inference?

Privacy measures the extent to which synthetic data leaks information about real individuals. This is particularly critical when synthetic data is shared externally or used for collaborative research. Privacy attacks (membership inference, attribute inference, linkage) can succeed even if synthetic data has high fidelity.

Why it matters: Privacy breaches can expose individuals, violate regulations (GDPR, HIPAA), destroy trust, and create legal liability. Synthetic data that is utility-rich but privacy-poor is worse than useless—it's dangerous. Privacy must be assessed alongside fidelity and utility.

Example: A synthetic medical dataset reproduces disease prevalences and comorbidities perfectly but, when combined with zip code and age, uniquely identifies individuals in the real cohort through linkage attack. High fidelity and utility do not compensate for failed privacy.

The Trade-off Triangle: Fidelity, utility, and privacy are often in tension. Improving privacy often reduces fidelity (e.g., via differential privacy noise). Improving utility may require preserving sensitive attributes that increase privacy risk. Evaluation must be holistic, reporting all three dimensions and making explicit the trade-offs chosen.
Figure 9.1 — The three pillars of synthetic data quality. A comprehensive evaluation must assess fidelity (statistical similarity), utility (downstream task performance), and privacy (resistance to re-identification attacks).
Figure 9.3 — The complete synthetic data evaluation workflow. Real and synthetic datasets are fed through three parallel assessment branches — statistical fidelity tests, machine learning utility benchmarks, and privacy attack simulations — producing a unified quality scorecard.

9.3 Statistical Fidelity Metrics

Fidelity metrics quantify how closely synthetic data mirrors the statistical properties of real data. We begin with univariate (column-wise) metrics, then move to bivariate and multivariate comparisons.

9.3.1 Univariate Metrics: Kolmogorov-Smirnov Test

The Kolmogorov-Smirnov (KS) test measures the maximum distance between the empirical cumulative distribution functions (CDFs) of real and synthetic data for a continuous variable.

Interpretation: The KS statistic ranges from 0 (identical distributions) to 1 (maximally different). A p-value below 0.05 indicates a statistically significant difference at the 5% level. A small KS statistic (e.g., 0.02) suggests good univariate fidelity.

Limitation: KS test is sensitive to sample size. With large samples, even tiny practical differences become significant. It also doesn't capture multivariate structure.
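
To see this sensitivity concretely, here is a short standalone sketch (a toy example, not tied to the patient dataset): two large samples are drawn from nearly identical normal distributions, and the KS statistic stays tiny while the p-value still falls below 0.05.

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Two nearly identical normals: a 0.05 mean shift is negligible in practice.
a = rng.normal(loc=0.00, scale=1.0, size=100_000)
b = rng.normal(loc=0.05, scale=1.0, size=100_000)

ks_stat, p_value = ks_2samp(a, b)
# With 100,000 samples per side the KS statistic stays around 0.02, yet the
# p-value will almost certainly be far below 0.05, so "significant" here does
# not mean "practically different". At large n, judge the statistic itself.
print(f"KS statistic: {ks_stat:.4f}, p-value: {p_value:.2e}")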

9.3.2 Univariate Metrics: Chi-Square Test

For categorical variables, the chi-square test compares observed vs. expected frequencies across categories.

Formula: χ² = Σ((observed - expected)² / expected)

Interpretation: A small chi-square statistic and high p-value indicate that categorical distributions are not distinguishable by this test. Treat 0.05 as a conventional screening threshold, not a proof that the distributions are the same.

9.3.3 Univariate Metrics: Jensen-Shannon Divergence

Jensen-Shannon (JS) divergence is a symmetrised version of Kullback-Leibler divergence between two probability distributions. It is well-defined when the supports overlap imperfectly (the KL divergence blows up in that case), making it a common fidelity metric for discrete or binned distributions.

Formula: JS(P ∥ Q) = 0.5 · KL(P ∥ M) + 0.5 · KL(Q ∥ M), where M = 0.5 · (P + Q).

Units and range. With natural log, JS divergence ∈ [0, ln 2] ≈ [0, 0.693]; with log base 2, JS divergence ∈ [0, 1]. Note that scipy.spatial.distance.jensenshannon returns the Jensen-Shannon distance (the square root of the divergence), not the divergence itself, so its values lie in [0, √ln 2] ≈ [0, 0.832] by default — or [0, 1] when base=2 is passed. The code below is consistent with the distance convention; square the result if you want the divergence in nats (or pass base=2 for bits).

Code: Computing Univariate Metrics


import pandas as pd
import numpy as np
from scipy.stats import ks_2samp, chi2_contingency
from scipy.spatial.distance import jensenshannon

def compute_univariate_fidelity(real_df, synthetic_df):
    """
    Compute KS, chi-square, and JS divergence for all columns.

    Args:
        real_df: Real data (pd.DataFrame)
        synthetic_df: Synthetic data (pd.DataFrame)

    Returns:
        dict: Metrics per column
    """
    results = {}

    for col in real_df.columns:
        real_col = real_df[col].dropna()
        synth_col = synthetic_df[col].dropna()

        if real_col.dtype in ['int64', 'float64']:
            # Continuous: KS test
            ks_stat, ks_pval = ks_2samp(real_col, synth_col)
            results[col] = {
                'type': 'continuous',
                'ks_stat': ks_stat,
                'ks_pval': ks_pval,
            }

            # JS divergence (with histogram binning)
            bins = min(50, int(np.sqrt(len(real_col))))
            hist_real, bin_edges = np.histogram(real_col, bins=bins, density=True)
            hist_synth, _ = np.histogram(synth_col, bins=bin_edges, density=True)
            # Normalize
            hist_real = hist_real / hist_real.sum()
            hist_synth = hist_synth / hist_synth.sum()
            # scipy returns the JS *distance*; square it for the divergence in nats.
            js_distance = jensenshannon(hist_real, hist_synth)
            results[col]['js_distance'] = js_distance
            results[col]['js_divergence_nats'] = js_distance ** 2
        else:
            # Categorical: chi-square test
            real_counts = real_col.value_counts()
            synth_counts = synth_col.value_counts()

            # Align categories (a sorted list keeps the reindex deterministic)
            all_cats = sorted(set(real_counts.index) | set(synth_counts.index))
            real_counts = real_counts.reindex(all_cats, fill_value=0)
            synth_counts = synth_counts.reindex(all_cats, fill_value=0)

            # Create contingency table
            contingency = np.array([real_counts.values, synth_counts.values])
            chi2, pval, dof, expected = chi2_contingency(contingency)

            results[col] = {
                'type': 'categorical',
                'chi2_stat': chi2,
                'chi2_pval': pval,
                'categories': len(all_cats),
            }

    return results

# Example usage
real_data = pd.read_csv('real_patients.csv')
synthetic_data = pd.read_csv('synthetic_patients.csv')

fidelity_metrics = compute_univariate_fidelity(real_data, synthetic_data)

# Print summary
print("\n=== Univariate Fidelity Metrics ===")
for col, metrics in fidelity_metrics.items():
    if metrics['type'] == 'continuous':
        print(f"{col}: KS={metrics['ks_stat']:.4f} (p={metrics['ks_pval']:.4f}), "
              f"JS_dist={metrics['js_distance']:.4f}")
    else:
        print(f"{col}: χ²={metrics['chi2_stat']:.2f} (p={metrics['chi2_pval']:.4f})")
      

9.3.4 Multivariate Fidelity: Propensity-Score MSE (pMSE)

Per-column KS and JS scores can all look great while the joint distribution is badly distorted. A widely used multivariate fidelity summary is the propensity-score MSE (pMSE) of Woo et al. (2009). Pool real and synthetic rows, label them 0 and 1, fit a classifier (logistic regression or gradient boosting) to tell them apart, and record how close its predicted probabilities stay to 0.5.

Formula: pMSE = (1/N) · Σᵢ (p̂ᵢ − c)², where p̂ᵢ is the classifier's estimated probability that row i is synthetic, c = N_synth / N is the class balance, and N is the total pool size. A perfect generator gives p̂ᵢ ≈ c for every row and pMSE → 0; an easily distinguishable one drives pMSE up toward c(1 − c).


from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def pmse(real_df, synth_df, numeric_only=True, seed=0):
    """Propensity-score MSE (Woo et al., 2009). Lower is better; 0 = perfect."""
    if numeric_only:
        real_df = real_df.select_dtypes(include=[np.number])
        synth_df = synth_df.select_dtypes(include=[np.number])
    X = np.vstack([real_df.values, synth_df.values])
    y = np.concatenate([np.zeros(len(real_df)), np.ones(len(synth_df))])
    c = y.mean()  # fraction of synthetic rows in the pool

    # Use out-of-fold probabilities so we don't grade the classifier on training data.
    probs = cross_val_predict(
        LogisticRegression(max_iter=1000, random_state=seed),
        X, y, cv=5, method='predict_proba',
    )[:, 1]
    return float(np.mean((probs - c) ** 2))

print(f"pMSE: {pmse(real_data, synthetic_data):.5f}")
      

Interpretation: pMSE is model-dependent. A flexible classifier (gradient boosting) gives a stricter test than logistic regression, so always report which model you used alongside the score.
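
As a sketch of that stricter variant, the hypothetical helper below reuses the pooling logic of pmse but swaps in scikit-learn's HistGradientBoostingClassifier; everything else is unchanged. Reporting both scores shows how much extra structure a flexible model can exploit.

from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_predict

def pmse_strict(real_df, synth_df, seed=0):
    """pMSE with a flexible classifier; typically a harsher score than logistic regression."""
    real_num = real_df.select_dtypes(include=[np.number])
    synth_num = synth_df.select_dtypes(include=[np.number])
    X = np.vstack([real_num.values, synth_num.values])
    y = np.concatenate([np.zeros(len(real_num)), np.ones(len(synth_num))])
    c = y.mean()  # fraction of synthetic rows in the pool

    probs = cross_val_predict(
        HistGradientBoostingClassifier(random_state=seed),
        X, y, cv=5, method='predict_proba',
    )[:, 1]
    return float(np.mean((probs - c) ** 2))

print(f"pMSE (logistic regression): {pmse(real_data, synthetic_data):.5f}")
print(f"pMSE (gradient boosting):   {pmse_strict(real_data, synthetic_data):.5f}")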

9.3.5 Pairwise Metrics: Correlation Matrix Comparison

Univariate metrics miss multivariate structure. A key relationship is the correlation between variables. Comparing correlation matrices between real and synthetic data reveals whether the generator preserved pairwise dependencies.

Approach: Compute Pearson or Spearman correlation matrices for both datasets, then measure distance between them using Frobenius norm or sum of absolute differences.

Metric: Correlation difference = ||R_real - R_synthetic||_F / ||R_real||_F

Values close to 0 indicate preserved correlations. Values > 0.1 suggest significant structural differences.

Code: Correlation Matrix Comparison



def compare_correlation_matrices(real_df, synthetic_df):
    """
    Compare correlation matrices between real and synthetic data.
    """
    # Select numeric columns only
    numeric_real = real_df.select_dtypes(include=[np.number])
    numeric_synth = synthetic_df.select_dtypes(include=[np.number])

    # Compute correlation matrices
    corr_real = numeric_real.corr()
    corr_synth = numeric_synth.corr()

    # Frobenius norm (matrix distance)
    diff_norm = np.linalg.norm(corr_real.values - corr_synth.values, 'fro')
    real_norm = np.linalg.norm(corr_real.values, 'fro')
    normalized_diff = diff_norm / real_norm if real_norm > 0 else 0

    # Per-pair absolute differences
    corr_diff = (corr_real - corr_synth).abs()
    max_pair_diff = corr_diff.values[np.triu_indices_from(
        corr_diff.values, k=1)].max()
    mean_pair_diff = corr_diff.values[np.triu_indices_from(
        corr_diff.values, k=1)].mean()

    return {
        'frobenius_norm_diff': diff_norm,
        'normalized_frobenius_diff': normalized_diff,
        'max_pair_difference': max_pair_diff,
        'mean_pair_difference': mean_pair_diff,
        'corr_real': corr_real,
        'corr_synth': corr_synth,
    }

# Example
corr_comparison = compare_correlation_matrices(real_data, synthetic_data)
print(f"Normalized Correlation Difference: {corr_comparison['normalized_frobenius_diff']:.4f}")
print(f"Max Pairwise Correlation Diff: {corr_comparison['max_pair_difference']:.4f}")
      

9.4 Distribution Comparison Techniques

Beyond point-estimate tests, visual and quantitative distribution comparisons provide deeper insight into fidelity.

9.4.1 QQ Plots and Marginal Overlap

A QQ (quantile-quantile) plot compares quantiles of two distributions. If both distributions are identical, the QQ plot lies on the diagonal y=x. Deviations indicate distributional differences.

Interpretation: Points below the diagonal indicate that synthetic quantiles are lower than real quantiles at corresponding probability levels. This reveals where distributions diverge (tails vs. center).

Complementary approach: overlay histograms or kernel density estimates (KDE) of real and synthetic data. Areas of non-overlap highlight regions where the generator failed.

9.4.2 Earth Mover's Distance (Wasserstein Distance)

Earth Mover's Distance (EMD), also called Wasserstein distance, is the minimum cost to transport one distribution to another. For one-dimensional continuous data, it equals the area between the two CDFs.

Advantage: EMD is a true metric (satisfies triangle inequality), interpretable in the original data units, and sensitive to distributional shifts across all regions.

Interpretation: An EMD of 0 means identical distributions. As a rough rule of thumb, an EMD below about 1% of the data range indicates "good" fidelity, though domain context matters.

Code: EMD and Distribution Plots


from scipy.stats import wasserstein_distance
import matplotlib.pyplot as plt

def compute_emd_and_plot(real_col, synth_col, column_name):
    """
    Compute Earth Mover's Distance and visualize distributions.
    """
    emd = wasserstein_distance(real_col.dropna(), synth_col.dropna())

    fig, axes = plt.subplots(1, 2, figsize=(12, 4))

    # Histogram + KDE
    ax = axes[0]
    ax.hist(real_col, bins=30, alpha=0.6, label='Real', density=True, color='blue')
    ax.hist(synth_col, bins=30, alpha=0.6, label='Synthetic', density=True, color='orange')
    ax.set_xlabel(column_name)
    ax.set_ylabel('Density')
    ax.legend()
    ax.set_title(f'{column_name} Distribution (EMD={emd:.4f})')

    # QQ plot: synthetic quantiles vs. real quantiles
    ax = axes[1]
    quantiles = np.linspace(0.01, 0.99, 99)
    real_q = np.quantile(real_col.dropna(), quantiles)
    synth_q = np.quantile(synth_col.dropna(), quantiles)
    ax.scatter(real_q, synth_q, s=10, color='orange')
    lims = [min(real_q.min(), synth_q.min()), max(real_q.max(), synth_q.max())]
    ax.plot(lims, lims, linestyle='--', color='gray')  # y = x reference line
    ax.set_xlabel('Real quantiles')
    ax.set_ylabel('Synthetic quantiles')
    ax.set_title(f'QQ Plot: {column_name}')

    plt.tight_layout()
    return emd, fig

# Compute EMD for all numeric columns
numeric_cols = real_data.select_dtypes(include=[np.number]).columns
emd_results = {}
for col in numeric_cols:
    emd, fig = compute_emd_and_plot(real_data[col], synthetic_data[col], col)
    emd_results[col] = emd
    plt.savefig(f'emd_{col}.png')
    plt.close()

print("Earth Mover's Distance per column:")
for col, emd in emd_results.items():
    print(f"  {col}: {emd:.4f}")
      

9.5 Machine Learning Utility Evaluation

The ultimate test of synthetic data is its utility for machine learning tasks. High fidelity is necessary but not sufficient—synthetic data must enable models to generalize to real data.

9.5.1 Train-on-Synthetic, Test-on-Real (TSTR)

TSTR is the canonical utility paradigm: train a model on synthetic data, evaluate it on real data (held-out test set). Compare TSTR performance to the baseline (train-on-real, test-on-real, TRTR).

Interpretation: If TSTR accuracy is within 5% of TRTR, synthetic data has good utility. Larger gaps (>10%) suggest that the synthetic data is missing patterns critical for the downstream task.

Why this works: TSTR directly measures whether a model trained on synthetic data can generalize to real data, the actual use case for synthetic data in production.

Figure 9.2 — The Train-on-Synthetic, Test-on-Real (TSTR) paradigm. Model A is trained on real data as the baseline; Model B is trained on synthetic data. Both are evaluated on the same held-out real test set. The gap between their accuracies quantifies synthetic data utility.

9.5.2 Reverse Scenario: Train-on-Real, Test-on-Synthetic (TRTS)

TRTS trains on real data and tests on synthetic data. If TRTS performance is significantly lower than TRTR, the synthetic distribution has shifted outside the decision boundaries learned by the model. This flags utility problems.

Use case: TRTS is useful for detecting when synthetic data is too different from real data despite good univariate fidelity (e.g., data with shifted correlations or rare subgroup omissions).

9.5.3 Feature Importance Comparison

Train models on both real and synthetic data, extract feature importances, and compare. If feature rankings differ significantly, the synthetic data is teaching the model different decision rules.

Metric: Spearman rank correlation of feature importances, or sum of absolute rank differences. Values close to 1 (or small rank differences) indicate consistency.

Code: Complete ML Utility Pipeline


from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import scipy.stats as stats

class MLUtilityEvaluator:
    """
    Comprehensive ML utility evaluation for synthetic data.
    Implements TSTR, TRTS, and feature importance comparison.
    """

    def __init__(self, real_df, synthetic_df, target_col, test_size=0.2, random_state=42):
        self.real_df = real_df
        self.synthetic_df = synthetic_df
        self.target_col = target_col
        self.test_size = test_size
        self.random_state = random_state
        self.results = {}

    def prepare_data(self, df):
        """Separate features and target, handle categorical variables."""
        X = df.drop(columns=[self.target_col])
        y = df[self.target_col]

        # One-hot encode categorical variables
        X = pd.get_dummies(X, drop_first=True)

        return X, y

    def train_and_evaluate(self, X_train, y_train, X_test, y_test, model_name='RF'):
        """Train a model and compute metrics."""
        if model_name == 'RF':
            model = RandomForestClassifier(n_estimators=100, random_state=self.random_state)
        elif model_name == 'LR':
            model = LogisticRegression(max_iter=1000)
        else:
            raise ValueError(f"Unsupported model_name: {model_name}")

        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)

        metrics = {
            'accuracy': accuracy_score(y_test, y_pred),
            'precision': precision_score(y_test, y_pred, average='binary', zero_division=0),
            'recall': recall_score(y_test, y_pred, average='binary', zero_division=0),
            'f1': f1_score(y_test, y_pred, average='binary', zero_division=0),
        }

        return model, metrics

    def run_tstr(self, model_name='RF'):
        """Train on Synthetic, Test on Real (TSTR)."""
        X_real, y_real = self.prepare_data(self.real_df)
        X_synth, y_synth = self.prepare_data(self.synthetic_df)

        # Align feature columns
        common_cols = X_real.columns.intersection(X_synth.columns)
        X_real = X_real[common_cols]
        X_synth = X_synth[common_cols]

        X_real_train, X_real_test, y_real_train, y_real_test = train_test_split(
            X_real, y_real, test_size=self.test_size, random_state=self.random_state
        )

        # Train on synthetic (full set), test on real test set
        model, metrics = self.train_and_evaluate(
            X_synth, y_synth, X_real_test, y_real_test, model_name
        )

        self.results['TSTR'] = metrics
        self.results['TSTR_model'] = model
        return metrics

    def run_trtr(self, model_name='RF'):
        """Train on Real, Test on Real (TRTR) - baseline."""
        X_real, y_real = self.prepare_data(self.real_df)

        X_train, X_test, y_train, y_test = train_test_split(
            X_real, y_real, test_size=self.test_size, random_state=self.random_state
        )

        model, metrics = self.train_and_evaluate(
            X_train, y_train, X_test, y_test, model_name
        )

        self.results['TRTR'] = metrics
        self.results['TRTR_model'] = model
        return metrics

    def run_trts(self, model_name='RF'):
        """Train on Real, Test on Synthetic (TRTS)."""
        X_real, y_real = self.prepare_data(self.real_df)
        X_synth, y_synth = self.prepare_data(self.synthetic_df)

        # Align feature columns
        common_cols = X_real.columns.intersection(X_synth.columns)
        X_real = X_real[common_cols]
        X_synth = X_synth[common_cols]

        X_real_train, _, y_real_train, _ = train_test_split(
            X_real, y_real, test_size=self.test_size, random_state=self.random_state
        )

        model, metrics = self.train_and_evaluate(
            X_real_train, y_real_train, X_synth, y_synth, model_name
        )

        self.results['TRTS'] = metrics
        self.results['TRTS_model'] = model
        return metrics

    def compare_feature_importance(self):
        """Compare feature importances from TRTR vs TSTR models."""
        trtr_model = self.results.get('TRTR_model')
        tstr_model = self.results.get('TSTR_model')

        if trtr_model is None or tstr_model is None:
            return None

        # Extract feature importances
        X_real, _ = self.prepare_data(self.real_df)
        X_synth, _ = self.prepare_data(self.synthetic_df)
        common_cols = X_real.columns.intersection(X_synth.columns)

        if not hasattr(trtr_model, 'feature_importances_'):
            return None  # Not applicable to LR

        fi_trtr = trtr_model.feature_importances_
        fi_tstr = tstr_model.feature_importances_

        if len(fi_trtr) != len(fi_tstr):
            # The TRTR and TSTR models saw different one-hot feature sets
            # (e.g. categories present in only one dataset); the importance
            # vectors are not comparable in that case.
            return None

        # Rank and compare
        rank_trtr = stats.rankdata(fi_trtr)
        rank_tstr = stats.rankdata(fi_tstr)

        rank_corr, rank_pval = stats.spearmanr(rank_trtr, rank_tstr)

        return {
            'spearman_rank_corr': rank_corr,
            'rank_pval': rank_pval,
            'feature_importances_trtr': fi_trtr,
            'feature_importances_tstr': fi_tstr,
            'features': list(common_cols),
        }

    def summary(self):
        """Print a summary report of utility evaluation."""
        print("\n=== ML Utility Evaluation Report ===")
        print("\nTRTR (Real→Real, Baseline):")
        for metric, value in self.results.get('TRTR', {}).items():
            print(f"  {metric}: {value:.4f}")

        print("\nTSTR (Synthetic→Real):")
        for metric, value in self.results.get('TSTR', {}).items():
            print(f"  {metric}: {value:.4f}")

        if 'TSTR' in self.results and 'TRTR' in self.results:
            print("\nTSTR vs TRTR Performance Gap (%):")
            for metric in ['accuracy', 'f1']:
                gap = (self.results['TRTR'][metric] -
                       self.results['TSTR'][metric]) * 100
                print(f"  {metric}: {gap:.2f}%")

        print("\nTRTS (Real→Synthetic):")
        for metric, value in self.results.get('TRTS', {}).items():
            print(f"  {metric}: {value:.4f}")

        fi_comparison = self.compare_feature_importance()
        if fi_comparison:
            print(f"\nFeature Importance Rank Correlation: {fi_comparison['spearman_rank_corr']:.4f}")

# Example usage
evaluator = MLUtilityEvaluator(
    real_data, synthetic_data, target_col='diagnosis', random_state=42
)

evaluator.run_trtr(model_name='RF')
evaluator.run_tstr(model_name='RF')
evaluator.run_trts(model_name='RF')
evaluator.summary()
      
Best Practice: Always run TSTR on a held-out real test set, not on the training set. TRTR and TSTR should use the same train-test split of the real data for fair comparison. Report all three scenarios (TSTR, TRTR, TRTS) to provide a complete picture of utility.

9.6 Privacy Metrics and Attacks

Privacy evaluation is distinct from fidelity and utility. Even synthetic data with perfect fidelity can leak sensitive information. Privacy must be actively measured through attack simulations.

9.6.1 Distance to Closest Record (DCR)

DCR measures the minimum distance from each real record to any synthetic record in the feature space. If a real record is very close to a synthetic record, the synthetic record may be a near-copy or reconstruction of the real record.

Interpretation: Small DCR values (< 10th percentile of inter-record distances) suggest potential privacy leakage. The proportion of real records with DCR below a threshold indicates re-identification risk.

Typical threshold: If >5% of real records have DCR in the bottom 10th percentile of synthetic-to-synthetic distances, privacy may be compromised.

9.6.2 Membership Inference Attack (MIA)

A membership inference attack attempts to determine whether a specific record was in the training data. The attacker trains a classifier to predict "member" vs. "non-member" based on properties of the generator or synthetic data.

Simple approach: For each real record, compute its likelihood under the synthetic data distribution (e.g., kernel density estimate). Members of the training set often have higher likelihood. If the attacker can separate real training records from non-members by likelihood alone, membership is inferable.

Metric: MIA accuracy = accuracy of a classifier predicting membership given likelihood. Values near 50% indicate robustness (no signal); values > 60% indicate privacy leakage.

Code: Privacy Metrics


from scipy.spatial.distance import cdist
from sklearn.neighbors import KernelDensity
from sklearn.metrics import accuracy_score, roc_auc_score

def compute_dcr(real_df, synthetic_df, exclude_cols=None):
    """
    Compute Distance to Closest Record (DCR).
    """
    if exclude_cols is None:
        exclude_cols = []

    # Numeric columns only
    numeric_cols = [c for c in real_df.columns
                    if c not in exclude_cols and real_df[c].dtype in ['int64', 'float64']]

    real_num = real_df[numeric_cols].fillna(0).values
    synth_num = synthetic_df[numeric_cols].fillna(0).values

    # Normalize to [0,1]
    from sklearn.preprocessing import MinMaxScaler
    scaler = MinMaxScaler()
    real_num = scaler.fit_transform(real_num)
    synth_num = scaler.transform(synth_num)

    # Compute distances: each real record to closest synthetic record
    distances = cdist(real_num, synth_num, metric='euclidean')
    dcr = distances.min(axis=1)

    # Compute baseline: distance between random synthetic records
    synth_distances = cdist(synth_num, synth_num, metric='euclidean')
    np.fill_diagonal(synth_distances, np.inf)  # Exclude self-distance
    baseline_dcr_percentiles = synth_distances.min(axis=1)

    threshold_10pct = np.percentile(baseline_dcr_percentiles, 10)
    leak_ratio = (dcr < threshold_10pct).sum() / len(dcr)

    return {
        'dcr': dcr,
        'mean_dcr': dcr.mean(),
        'min_dcr': dcr.min(),
        'percentile_10_dcr': np.percentile(dcr, 10),
        'leak_ratio': leak_ratio,
        'baseline_threshold': threshold_10pct,
    }

def membership_inference_attack(members_df, non_members_df, synthetic_df,
                                exclude_cols=None):
    """
    Likelihood-based membership inference attack.

    Args
    ----
    members_df:     real records that WERE seen by the generator during fitting.
    non_members_df: real records that were HELD OUT from the generator.
    synthetic_df:   synthetic records produced by the generator.

    The attack assumes that, if the generator leaks, training members will sit
    in higher-density regions of the synthetic distribution than held-out
    records. We fit a KDE on the synthetic data and use its log-likelihood as
    the attacker's score; AUC measurably above 0.5 signals leakage.
    """
    if exclude_cols is None:
        exclude_cols = []

    numeric_cols = [c for c in members_df.columns
                    if c not in exclude_cols
                    and members_df[c].dtype in ['int64', 'float64']]

    M = members_df[numeric_cols].fillna(0).values
    N = non_members_df[numeric_cols].fillna(0).values
    S = synthetic_df[numeric_cols].fillna(0).values

    # Fit scaler on synthetic data so the attack does not peek at real data.
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler().fit(S)
    M, N, S = scaler.transform(M), scaler.transform(N), scaler.transform(S)

    kde = KernelDensity(bandwidth=0.5).fit(S)
    ll_members = kde.score_samples(M)
    ll_non_members = kde.score_samples(N)

    scores = np.concatenate([ll_members, ll_non_members])
    y_true = np.concatenate([np.ones(len(ll_members)),
                             np.zeros(len(ll_non_members))])

    # Simple attack: threshold the scores at their median. A stronger attacker
    # would tune this cutoff on a separate calibration split; the AUC below is
    # threshold-free and usually the more informative number.
    threshold = np.median(scores)
    y_pred = (scores > threshold).astype(int)

    mia_acc = accuracy_score(y_true, y_pred)
    try:
        mia_auc = roc_auc_score(y_true, scores)
    except ValueError:
        mia_auc = None

    return {
        'mia_accuracy': mia_acc,   # 0.5 = perfect privacy, > 0.6 = leakage
        'mia_auc': mia_auc,        # 0.5 = perfect privacy, > 0.6 = leakage
        'll_member_mean': ll_members.mean(),
        'll_nonmember_mean': ll_non_members.mean(),
    }

# Run privacy metrics. For the MIA you MUST supply the member / hold-out split
# explicitly: members_df holds real records the generator was fitted on, and
# non_members_df holds real records it never saw. Without a genuine hold-out,
# the attack has no baseline and its score is meaningless.
dcr_results = compute_dcr(real_data, synthetic_data)
print(f"\nDistance to Closest Record (DCR):")
print(f"  Mean DCR: {dcr_results['mean_dcr']:.4f}")
print(f"  Min DCR: {dcr_results['min_dcr']:.4f}")
print(f"  Leak Ratio (DCR < 10th percentile): {dcr_results['leak_ratio']:.4f}")

# `members_df` is the real data used to fit the generator;
# `non_members_df` is a disjoint real hold-out set. You set these up
# yourself when you split your training data.
mia_results = membership_inference_attack(members_df, non_members_df, synthetic_data)
print(f"\nMembership Inference Attack (MIA):")
print(f"  MIA Accuracy: {mia_results['mia_accuracy']:.4f}")
if mia_results['mia_auc']:
    print(f"  MIA AUC: {mia_results['mia_auc']:.4f}")
      
Privacy Testing Caveat: These are simple privacy metrics and pedagogical attacks. A robust privacy evaluation should include more sophisticated attacks (model inversion, attribute inference with auxiliary information, linkage attacks, shadow-model MIAs). Consult the privacy literature and specialized tooling, such as SDV's privacy metrics, commercial evaluators from vendors like Anonos and MOSTLY AI, OpenDP, and the ML Privacy Meter toolkit, for comprehensive privacy assessment.

9.7 Visualization Techniques

Visualizations are critical for communicating evaluation results to non-technical stakeholders and for exploratory debugging.

Figure 9.4 — Essential visual diagnostics for synthetic data evaluation. Left: overlaid marginal distributions reveal per-column fidelity. Center: joint scatter plots expose whether multivariate relationships are preserved. Right: side-by-side correlation matrices provide a compact summary of structural fidelity.

9.7.1 PCA and t-SNE Projection Plots

Project high-dimensional real and synthetic data into 2D using PCA (linear) or t-SNE (nonlinear). If clusters overlap well, synthetic data occupies similar regions of feature space as real data, suggesting good fidelity.

Interpretation: If synthetic data clusters far from real data in the projection, distributional mismatch is evident. If synthetic data has holes or gaps, the generator may have missed rare subgroups.
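
The visualization suite later in this section covers the PCA view; the sketch below adds the t-SNE counterpart. Because t-SNE is stochastic and scales poorly with sample size, this version subsamples both datasets and fixes the seed (the function name and defaults are illustrative choices).

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

def tsne_projection_plot(real_df, synthetic_df, max_rows=2000, seed=42):
    """Overlay real and synthetic rows in a shared 2D t-SNE embedding."""
    numeric_cols = real_df.select_dtypes(include=[np.number]).columns
    real = real_df[numeric_cols].dropna()
    synth = synthetic_df[numeric_cols].dropna()
    real = real.sample(n=min(max_rows, len(real)), random_state=seed)
    synth = synth.sample(n=min(max_rows, len(synth)), random_state=seed)

    # Fit t-SNE on the pooled, standardized rows so both datasets share one embedding.
    X = StandardScaler().fit_transform(np.vstack([real.values, synth.values]))
    emb = TSNE(n_components=2, perplexity=30, random_state=seed).fit_transform(X)

    n_real = len(real)
    plt.figure(figsize=(6, 5))
    plt.scatter(emb[:n_real, 0], emb[:n_real, 1], s=8, alpha=0.5, label='Real')
    plt.scatter(emb[n_real:, 0], emb[n_real:, 1], s=8, alpha=0.5, label='Synthetic')
    plt.legend()
    plt.title('t-SNE projection: real vs. synthetic')
    plt.savefig('tsne_projection.png', dpi=150)
    plt.close()

tsne_projection_plot(real_data, synthetic_data)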

9.7.2 Correlation Heatmaps

Side-by-side heatmaps of real vs. synthetic correlation matrices provide a quick visual check of whether pairwise relationships are preserved. Large color differences indicate correlation mismatches.

9.7.3 Pair Plots

For low-dimensional datasets (3-6 numeric variables), pairwise scatter plots reveal bivariate and multivariate relationships. Real and synthetic data can be overlaid with different colors/alpha values.
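
A minimal pair-plot sketch with seaborn follows; the column names in the usage line are illustrative placeholders, so substitute three to six numeric columns from your own data. The combined visualization suite below then covers the PCA projection and correlation heatmaps.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def pair_plot(real_df, synthetic_df, columns, max_rows=1000, seed=42):
    """Overlaid pairwise scatter plots for a handful of numeric columns."""
    real = real_df[columns].dropna()
    synth = synthetic_df[columns].dropna()
    real = real.sample(n=min(max_rows, len(real)), random_state=seed).assign(source='Real')
    synth = synth.sample(n=min(max_rows, len(synth)), random_state=seed).assign(source='Synthetic')
    combined = pd.concat([real, synth], ignore_index=True)

    grid = sns.pairplot(combined, hue='source', plot_kws={'alpha': 0.4, 's': 10})
    grid.savefig('pair_plot.png', dpi=150)
    plt.close('all')

# Illustrative column names; replace with columns that exist in your dataset.
pair_plot(real_data, synthetic_data, columns=['age', 'bmi', 'blood_pressure'])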

Code: Visualization Suite


import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def visualize_synthetic_data(real_df, synthetic_df):
    """
    Create a comprehensive visualization suite comparing real and synthetic data.
    """
    numeric_cols = real_df.select_dtypes(include=[np.number]).columns

    # === PCA Projection ===
    X_real = real_df[numeric_cols].fillna(0)
    X_synth = synthetic_df[numeric_cols].fillna(0)

    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    X_real_scaled = scaler.fit_transform(X_real)
    X_synth_scaled = scaler.transform(X_synth)

    pca = PCA(n_components=2)
    X_real_pca = pca.fit_transform(X_real_scaled)
    X_synth_pca = pca.transform(X_synth_scaled)

    fig, axes = plt.subplots(2, 2, figsize=(14, 12))

    # PCA
    ax = axes[0, 0]
    ax.scatter(X_real_pca[:, 0], X_real_pca[:, 1], alpha=0.5, label='Real', s=10)
    ax.scatter(X_synth_pca[:, 0], X_synth_pca[:, 1], alpha=0.5, label='Synthetic', s=10)
    ax.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%})')
    ax.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%})')
    ax.set_title('PCA Projection')
    ax.legend()

    # Correlation heatmap - Real
    ax = axes[0, 1]
    corr_real = real_df[numeric_cols].corr()
    sns.heatmap(corr_real, ax=ax, cmap='coolwarm', vmin=-1, vmax=1, square=True, cbar=False)
    ax.set_title('Real Data Correlations')

    # Correlation heatmap - Synthetic
    ax = axes[1, 0]
    corr_synth = synthetic_df[numeric_cols].corr()
    sns.heatmap(corr_synth, ax=ax, cmap='coolwarm', vmin=-1, vmax=1, square=True, cbar=False)
    ax.set_title('Synthetic Data Correlations')

    # Difference heatmap
    ax = axes[1, 1]
    corr_diff = (corr_real - corr_synth).abs()
    sns.heatmap(corr_diff, ax=ax, cmap='YlOrRd', vmin=0, vmax=1, square=True)
    ax.set_title('Abs Correlation Difference')

    plt.tight_layout()
    return fig

# Run visualization
fig = visualize_synthetic_data(real_data, synthetic_data)
plt.savefig('evaluation_viz.png', dpi=150)
plt.close()
      

9.8 Automated Evaluation Frameworks

While custom evaluation is powerful, open-source frameworks automate much of the work. SDMetrics is a popular library for evaluating synthetic data.

9.8.1 SDMetrics Overview

SDMetrics is an open-source library for evaluating synthetic data. It is part of the Synthetic Data Vault (SDV) ecosystem — originally incubated at MIT's Data-to-AI Lab and now maintained by DataCebo — and supports single-table, multi-table, and sequential synthetic data. Its QualityReport focuses on statistical quality/fidelity; task utility and privacy should still be measured with separate, purpose-specific evaluations.

Key metrics:

  • Column Shapes: whether each synthetic column matches the real marginal distribution
  • Column Pair Trends: whether pairwise relationships are preserved
  • Cardinality and intertable trends: additional properties for multi-table datasets

Code: SDMetrics Quick Start


# Install: pip install sdv sdmetrics

from sdv.datasets.demo import download_demo
from sdv.single_table import GaussianCopulaSynthesizer
from sdmetrics.reports.single_table import QualityReport

# Load demo data and metadata
real_data, metadata = download_demo(
    modality='single_table',
    dataset_name='fake_hotel_guests'
)

# Train a simple synthesizer to create synthetic data
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)
synthetic_data = synthesizer.sample(num_rows=len(real_data))

# Generate quality report
report = QualityReport()
# Depending on your SDV / SDMetrics versions, generate() may expect a plain
# metadata dict (metadata.to_dict()) rather than the metadata object.
report.generate(real_data, synthetic_data, metadata, verbose=False)

# View overall score
print(f"Overall Quality Score: {report.get_score():.2%}")

# View evaluated quality properties
properties = report.get_properties()
print(properties)

# Drill into a specific property
column_shapes = report.get_details(property_name='Column Shapes')
print(column_shapes.head())

# Save the report object for reproducibility
report.save('quality_report.pkl')
      
SDMetrics Advantage: Automated evaluation is reproducible and consistent. However, it should complement custom evaluation tailored to domain-specific requirements. Always validate SDMetrics scores with custom code for high-stakes applications.

9.9 Benchmarking and Reporting

A complete evaluation is only useful if communicated clearly. Quality scorecards and benchmark reports must be understandable to stakeholders with varying technical backgrounds.

9.9.1 Quality Scorecard Design

A quality scorecard synthesizes fidelity, utility, and privacy into actionable summary scores. Key principles:

  • Simplicity: Reduce 50+ metrics to 3-5 key scores
  • Transparency: Show which metrics contributed to each score
  • Context: Include domain-specific benchmarks (e.g., "acceptable utility gap is 5%")
  • Traceability: Link scores to test procedures and code

9.9.2 Reporting Best Practices

For executives: One-page summary with overall pass/fail status and key risks. Example: "Synthetic data achieves 92% utility, 0.8% DCR leakage risk, acceptable for model training but not for direct external sharing."

For data scientists: Detailed metric table with confidence intervals, visualizations, and per-column breakdowns. Include TSTR gap, DCR distribution, feature importance rank correlation.

For privacy experts: Attack simulation results, differential privacy parameters (if used), linkage risk estimates, recommendations for access controls.

Code: Quality Scorecard Template


class QualityScorecardGenerator:
    """
    Generate a comprehensive quality scorecard for synthetic data.
    """

    def __init__(self, fidelity_metrics, utility_metrics, privacy_metrics):
        self.fidelity = fidelity_metrics
        self.utility = utility_metrics
        self.privacy = privacy_metrics

    def compute_fidelity_score(self):
        """Aggregate fidelity metrics into a 0-100 score."""
        # Use KS, chi-square, correlation difference
        ks_scores = []
        chi2_scores = []
        corr_score = 100 - (self.fidelity['normalized_correlation_diff'] * 100)

        # Convert KS p-values to scores (higher p-value = higher score)
        for col, metrics in self.fidelity['univariate'].items():
            if 'ks_pval' in metrics:
                ks_scores.append(min(100, metrics['ks_pval'] * 1000))
            if 'chi2_pval' in metrics:
                chi2_scores.append(min(100, metrics['chi2_pval'] * 1000))

        mean_ks = np.mean(ks_scores) if ks_scores else 80
        mean_chi2 = np.mean(chi2_scores) if chi2_scores else 80

        fidelity_score = (mean_ks + mean_chi2 + corr_score) / 3
        return min(100, max(0, fidelity_score))

    def compute_utility_score(self):
        """Aggregate utility metrics."""
        # TSTR gap should be < 5% for "good" utility
        tstr_f1 = self.utility['TSTR']['f1']
        trtr_f1 = self.utility['TRTR']['f1']
        gap_pct = abs(tstr_f1 - trtr_f1) * 100

        if gap_pct < 5:
            utility_score = 100
        elif gap_pct < 10:
            utility_score = 90
        elif gap_pct < 15:
            utility_score = 75
        else:
            utility_score = 50

        # Adjust for feature importance rank correlation
        if 'feature_rank_corr' in self.utility:
            rank_corr = self.utility['feature_rank_corr']
            rank_bonus = (rank_corr + 1) / 2 * 20  # -1 to 1 → 0 to 20
            utility_score = (utility_score + rank_bonus) / 2

        return min(100, max(0, utility_score))

    def compute_privacy_score(self):
        """Aggregate privacy metrics."""
        leak_ratio = self.privacy['dcr_leak_ratio']
        mia_acc = self.privacy['mia_accuracy']

        # DCR: < 1% leak is perfect, > 5% is poor
        if leak_ratio < 0.01:
            dcr_score = 100
        elif leak_ratio < 0.05:
            dcr_score = 80
        else:
            dcr_score = 50

        # MIA: accuracy near 50% is perfect (no signal), > 60% is poor
        if mia_acc < 0.55:
            mia_score = 100
        elif mia_acc < 0.60:
            mia_score = 80
        else:
            mia_score = 50

        privacy_score = (dcr_score + mia_score) / 2
        return min(100, max(0, privacy_score))

    def overall_score(self):
        """Weighted average of fidelity, utility, privacy."""
        # Weights depend on use case; adjust as needed
        weights = {'fidelity': 0.35, 'utility': 0.40, 'privacy': 0.25}

        f = self.compute_fidelity_score()
        u = self.compute_utility_score()
        p = self.compute_privacy_score()

        return weights['fidelity'] * f + weights['utility'] * u + weights['privacy'] * p

    def generate_report(self):
        """Generate text report."""
        f = self.compute_fidelity_score()
        u = self.compute_utility_score()
        p = self.compute_privacy_score()
        overall = self.overall_score()

        status = "PASS" if overall >= 75 else "CAUTION" if overall >= 60 else "FAIL"

        report = f"""
========== SYNTHETIC DATA QUALITY SCORECARD ==========

Overall Status: {status}
Overall Score: {overall:.1f}/100

Fidelity Score:      {f:.1f}/100
Utility Score:       {u:.1f}/100
Privacy Score:       {p:.1f}/100

RECOMMENDATIONS:
"""
        if f < 70:
            report += "- FIDELITY: Synthetic data distributions diverge from real data. Retrain generator with better hyperparameters.\n"
        if u < 70:
            report += "- UTILITY: Models trained on synthetic data underperform on real data. Check for missing interactions or rare subgroups.\n"
        if p < 70:
            report += "- PRIVACY: Re-identification risk detected. Consider applying differential privacy or other privacy-enhancing techniques.\n"

        report += "\nFOR DEPLOYMENT:\n"
        if overall >= 80:
            report += "✓ Synthetic data is suitable for model training and internal analysis.\n"
        if overall >= 85 and p >= 80:
            report += "✓ Synthetic data can be shared externally with confidence.\n"
        if overall < 60:
            report += "✗ Synthetic data requires significant improvement before deployment.\n"

        return report

# Example usage
scorecard = QualityScorecardGenerator(
    fidelity_metrics=fidelity_results,
    utility_metrics={'TSTR': {...}, 'TRTR': {...}, 'feature_rank_corr': 0.85},
    privacy_metrics={'dcr_leak_ratio': 0.02, 'mia_accuracy': 0.52}
)

print(scorecard.generate_report())
      

9.10 Hands-On: Complete Evaluation Suite

Let's build a reusable class that encapsulates the entire evaluation pipeline: fidelity, utility, privacy, visualization, and reporting.

Code: Comprehensive Evaluator Class


class SyntheticDataEvaluator:
    """
    Complete, production-ready evaluation pipeline for synthetic data.
    Computes fidelity, utility, privacy metrics and generates reports.
    """

    def __init__(self, real_df, synthetic_df, target_col=None, random_state=42):
        self.real = real_df.copy()
        self.synthetic = synthetic_df.copy()
        self.target_col = target_col
        self.random_state = random_state
        self.results = {}

    def evaluate_fidelity(self):
        """Run complete fidelity evaluation."""
        print("Evaluating Fidelity...")

        # Univariate metrics
        univariate = compute_univariate_fidelity(self.real, self.synthetic)

        # Correlation comparison
        corr_cmp = compare_correlation_matrices(self.real, self.synthetic)

        # EMD
        numeric_cols = self.real.select_dtypes(include=[np.number]).columns
        emd_scores = {}
        for col in numeric_cols:
            emd = wasserstein_distance(
                self.real[col].dropna(),
                self.synthetic[col].dropna()
            )
            emd_scores[col] = emd

        self.results['fidelity'] = {
            'univariate': univariate,
            'correlation': corr_cmp,
            'emd': emd_scores,
        }
        print(f"  ✓ Computed fidelity for {len(numeric_cols)} numeric columns")

    def evaluate_utility(self):
        """Run complete ML utility evaluation."""
        if self.target_col is None:
            print("Skipping utility evaluation (no target_col specified)")
            return

        print("Evaluating Utility...")
        evaluator = MLUtilityEvaluator(
            self.real, self.synthetic, self.target_col, random_state=self.random_state
        )

        evaluator.run_trtr(model_name='RF')
        evaluator.run_tstr(model_name='RF')
        evaluator.run_trts(model_name='RF')

        fi_cmp = evaluator.compare_feature_importance()

        self.results['utility'] = evaluator.results
        self.results['utility']['fi_comparison'] = fi_cmp
        print(f"  ✓ Utility evaluation complete")

    def evaluate_privacy(self, members_df=None, non_members_df=None):
        """Run privacy evaluation. The MIA needs an explicit member / hold-out split."""
        print("Evaluating Privacy...")

        privacy = {'dcr': compute_dcr(self.real, self.synthetic)}

        if members_df is not None and non_members_df is not None:
            privacy['mia'] = membership_inference_attack(
                members_df, non_members_df, self.synthetic
            )
        else:
            print("  ! Skipping MIA: pass members_df / non_members_df to run the attack")

        self.results['privacy'] = privacy
        print(f"  ✓ Privacy evaluation complete")

    def generate_summary(self):
        """Print human-readable summary."""
        print("\n" + "="*60)
        print("SYNTHETIC DATA EVALUATION SUMMARY")
        print("="*60)

        if 'fidelity' in self.results:
            print("\nFIDELITY:")
            corr_norm_diff = self.results['fidelity']['correlation']['normalized_frobenius_diff']
            print(f"  Normalized Correlation Diff: {corr_norm_diff:.4f}")

            mean_emd = np.mean(list(self.results['fidelity']['emd'].values()))
            print(f"  Mean EMD: {mean_emd:.4f}")

        if 'utility' in self.results and 'TSTR' in self.results['utility']:
            print("\nUTILITY (F1 Score):")
            tstr_f1 = self.results['utility']['TSTR']['f1']
            trtr_f1 = self.results['utility']['TRTR']['f1']
            gap = (trtr_f1 - tstr_f1) * 100
            print(f"  TRTR: {trtr_f1:.4f} | TSTR: {tstr_f1:.4f} | Gap: {gap:.2f}%")

        if 'privacy' in self.results:
            print("\nPRIVACY:")
            leak_ratio = self.results['privacy']['dcr']['leak_ratio']
            print(f"  DCR Leak Ratio: {leak_ratio:.4f}")
            if 'mia' in self.results['privacy']:
                mia_acc = self.results['privacy']['mia']['mia_accuracy']
                print(f"  MIA Accuracy: {mia_acc:.4f}")

        print("\n" + "="*60)

    def run_all(self, members_df=None, non_members_df=None):
        """Run complete evaluation pipeline. Pass a member / hold-out split to include the MIA."""
        self.evaluate_fidelity()
        self.evaluate_utility()
        self.evaluate_privacy(members_df, non_members_df)
        self.generate_summary()
        return self.results

# Usage
evaluator = SyntheticDataEvaluator(
    real_data,
    synthetic_data,
    target_col='diagnosis' if 'diagnosis' in real_data.columns else None
)

results = evaluator.run_all()

# Access individual results for reporting
fidelity_score = evaluator.results['fidelity']['correlation']['normalized_frobenius_diff']
utility_gap = abs(evaluator.results['utility']['TRTR']['f1'] -
                  evaluator.results['utility']['TSTR']['f1']) * 100
privacy_leak = evaluator.results['privacy']['dcr']['leak_ratio']
      
Next Steps: Integrate this evaluator into your synthetic data pipeline. Run it automatically after each generation run. Version evaluation results alongside synthetic datasets. Create dashboards tracking quality metrics over time as generation algorithms improve.
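
One lightweight way to version results is a sketch like the following, which flattens the headline numbers from run_all() into a timestamped JSON file (the reports/ directory and field names are arbitrary choices, not part of the evaluator above).

import json
from datetime import datetime, timezone
from pathlib import Path

def save_scorecard(results, out_dir='reports'):
    """Persist headline metrics so quality can be tracked across generation runs."""
    summary = {
        'timestamp': datetime.now(timezone.utc).isoformat(),
        'fidelity_corr_diff': float(results['fidelity']['correlation']['normalized_frobenius_diff']),
        'fidelity_mean_emd': float(np.mean(list(results['fidelity']['emd'].values()))),
        'utility_tstr_f1': results.get('utility', {}).get('TSTR', {}).get('f1'),
        'utility_trtr_f1': results.get('utility', {}).get('TRTR', {}).get('f1'),
        'privacy_dcr_leak_ratio': float(results['privacy']['dcr']['leak_ratio']),
    }
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    fname = out_path / f"scorecard_{summary['timestamp'].replace(':', '-')}.json"
    fname.write_text(json.dumps(summary, indent=2))
    return fname

print(f"Saved scorecard to {save_scorecard(results)}")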

Summary

Evaluation is not a final step but a continuous process embedded in synthetic data generation workflows. The three pillars—fidelity, utility, and privacy—must be measured rigorously and reported transparently.

Key takeaways:

  • Fidelity metrics (KS test, JS divergence, correlation comparison, EMD) assess statistical similarity but miss downstream task performance
  • Utility evaluation (TSTR, TRTS, feature importance) directly measures whether synthetic data enables effective model training
  • Privacy metrics (DCR, membership inference) reveal re-identification risks that can coexist with high fidelity
  • Visualizations (PCA, correlation heatmaps, pair plots) communicate evaluation results to non-technical stakeholders
  • Automated frameworks like SDMetrics accelerate evaluation but should be complemented by custom analysis
  • Quality scorecards synthesize metrics into actionable pass/fail recommendations and deployment guidelines

In the next chapter, we'll see how these evaluation methods apply across diverse industry use cases: healthcare, finance, e-commerce, and more.

References and Further Reading

  1. Jordon, J., Szpruch, L., Houssiau, F., Bottarelli, M., Cherubin, G., Maple, C., Cohen, S. N., & Weller, A. (2022). Synthetic Data - What, Why and How? arXiv preprint. arxiv.org/abs/2205.03257
  2. Gretton, A., Borgwardt, K. M., Rasch, M. J., Schoelkopf, B., & Smola, A. (2012). A Kernel Two-Sample Test. Journal of Machine Learning Research, 13, 723-773. jmlr.csail.mit.edu/papers/v13/gretton12a.html
  3. Shokri, R., Stronati, M., Song, C., & Shmatikov, V. (2017). Membership Inference Attacks Against Machine Learning Models. IEEE Symposium on Security and Privacy. arxiv.org/abs/1610.05820
  4. Carlini, N., Liu, C., Erlingsson, U., Kos, J., & Song, D. (2019). The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks. USENIX Security Symposium. usenix.org/conference/usenixsecurity19/presentation/carlini
  5. Synthetic Data Vault Contributors. (2026). Quality Report. SDMetrics Documentation. docs.sdv.dev/sdmetrics/reports/quality-report
  6. Synthetic Data Vault Contributors. (2026). Quality Report Single Table API. SDMetrics Documentation. docs.sdv.dev/sdmetrics/data-metrics/quality/quality-report/single-table-api