Chapter 5: Large Language Models & Synthetic Text

Advanced Text & Language LLMs

Large language models have fundamentally transformed how we generate synthetic data. Unlike traditional statistical methods or even GANs, LLMs bring a new capability: the ability to understand intent, maintain coherence over long sequences, and generate realistic text that reflects complex linguistic patterns. In this chapter, we explore the mechanics of using LLMs for synthetic data generation, from prompt engineering fundamentals to production-scale pipelines.

1. The LLM Revolution for Data Generation

The emergence of transformer-based language models—starting with BERT, then GPT-2, GPT-3, and now Claude, Llama, and Mistral—has opened new frontiers in synthetic data generation. These models excel at tasks that require semantic understanding, structural reasoning, and consistency across long contexts.

Why LLMs Change the Game

Traditional approaches to generating synthetic text (templates, rule-based systems, older sequence models) struggle with three core challenges:

LLMs address all three. They've been trained on billions of tokens of diverse text, learning deep patterns about language, logic, and domain-specific knowledge. When you ask an LLM to generate a customer review, patient intake form, or code snippet, it draws on this learned representation to produce realistic, coherent outputs.

Key Insight: LLMs are particularly powerful for generating structured, multi-field data (like JSON records) because they understand the relationships between fields and can enforce constraints through in-context learning. A model can learn from a few examples that "purchase_date" should always be before "delivery_date" without explicit rule-based constraints.
LLM synthetic data pipeline: Schema → Prompt → Generate → Validate → Filter → Dataset
Figure 5.1 — The LLM-based synthetic data pipeline. From schema definition through prompt engineering, batch generation, validation, and filtering to produce a clean synthetic dataset.

Open-Source vs. API-Based Models

Today's landscape offers two main paths:

Dimension API-Based (GPT-4, Claude) Open-Source (Llama, Mistral)
Cost per token $0.01–0.05 per 1K tokens One-time, then free (self-hosted)
Performance State-of-the-art, closed-source training Competitive, improving rapidly (Llama 3.1, Mistral)
Latency 100–500ms per request (network overhead) 10–100ms on GPU (single hop)
Data privacy Shared with provider (unless enterprise plan) Full control, on-prem deployment
Customization Limited (fine-tuning available at scale) Full fine-tuning, quantization, adaptation

2. Prompt Engineering for Data Generation

Three prompting strategies: Zero-Shot, Few-Shot, Chain-of-Thought with quality progression
Figure 5.2 — Prompting strategies for synthetic data generation. Zero-shot relies on instructions alone; few-shot provides examples that define the output format; chain-of-thought adds reasoning steps for complex, multi-field records. Quality and consistency improve with more structured prompts.

The art of generating high-quality synthetic data with LLMs lies in prompt design. A well-crafted prompt provides the model with context, constraints, and examples—turning a general-purpose language model into a specialized data generator.

Core Principles

1. Explicit schema definition: Tell the model exactly what you want—field names, types, constraints.

2. Few-shot examples: Show 1–3 real examples to establish the pattern. The model will infer structure from these samples.

3. Constraint specification: Clearly list any rules (e.g., "age must be between 18 and 65", "product_price must be greater than cost").

4. Output format: Specify exactly how the output should be formatted (JSON, CSV, plain text with delimiters).

Example: Structured Customer Record Generation

Here's a prompt that generates realistic customer records in JSON format:

"""
You are a synthetic data generator for an e-commerce platform.
Generate diverse, realistic customer records in JSON format.

Schema:
{
  "customer_id": "string (unique identifier, format: CUST_XXXXX)",
  "name": "string (full name)",
  "email": "string (valid email)",
  "signup_date": "ISO 8601 date (between 2020 and 2024)",
  "country": "string (country code, ISO 3166)",
  "lifetime_value": "float (in USD, range 10 to 50000)",
  "subscription_tier": "enum (free, basic, premium, enterprise)",
  "account_status": "enum (active, inactive, suspended)"
}

Constraints:
- If subscription_tier is 'enterprise', lifetime_value must be > 5000
- If account_status is 'inactive', no purchases in last 180 days
- All emails must have valid domains (@gmail.com, @company.com, etc.)
- Names should reflect diverse geographies and cultural backgrounds

Examples (these are real patterns, mimic diversity):
{"customer_id": "CUST_00152", "name": "Aisha Mohammed", "email": "aisha.m@gmail.com", "signup_date": "2021-03-15", "country": "AE", "lifetime_value": 1250.50, "subscription_tier": "premium", "account_status": "active"}
{"customer_id": "CUST_00289", "name": "Chen Wei", "email": "chen.wei.dev@company.com", "signup_date": "2022-07-22", "country": "CN", "lifetime_value": 15000.00, "subscription_tier": "enterprise", "account_status": "active"}

Generate 5 diverse customer records following the schema and constraints. Output only valid JSON objects, one per line.
"""
Pro Tip: Use constraint anchoring—explicitly state that constraints apply across fields. For example, "If subscription_tier is 'enterprise', then lifetime_value must be > 5000". LLMs excel at multi-field reasoning when you make relationships explicit.

Chain-of-Thought for Complex Records

For highly constrained data, add a "reasoning" step before generation. This is known as chain-of-thought prompting:

"""
Task: Generate a realistic medical appointment record.

Reasoning step: Before generating, think through:
1. What is a reasonable age for this patient?
2. What conditions are common for that age?
3. What medications are appropriate?
4. When should follow-up happen?

Schema: {...appointment fields...}

Constraints:
- appointment_date must be in the future
- If condition is 'diabetes', should have medication 'metformin' or 'insulin'
- follow_up_days should be 0–90 based on severity

Now generate a single appointment record with step-by-step reasoning, then output JSON.
"""

Chain-of-thought prompting increases both quality and consistency, especially for complex or multi-step constraints. The model "reasons through" the problem before generating, leading to fewer constraint violations.

3. Generating Labeled Training Data for NLP

One of the most practical applications of LLMs is creating labeled datasets for NLP tasks. Rather than manually annotating thousands of examples, you can use a capable LLM to generate labeled samples at scale.

Sentiment Analysis Dataset

Here's a complete Python example using OpenAI-compatible APIs (works with OpenAI, Anthropic, or open-source models via vLLM):

#!/usr/bin/env python3
import json
import os
from typing import Any
import httpx

# Configuration (adjust for your provider)
API_BASE = os.getenv("LLM_API_BASE", "https://api.openai.com/v1")
API_KEY = os.getenv("LLM_API_KEY")
MODEL = os.getenv("LLM_MODEL", "gpt-5-mini")

def generate_sentiment_dataset(num_samples: int = 100) -> list[dict[str, Any]]:
    """
    Generate labeled sentiment analysis examples.
    """
    prompt = f"""You are generating training data for a sentiment analysis model.
Create diverse product reviews with ground-truth sentiment labels.

Schema:
{{
  "review_text": "string (product review, 20-100 words)",
  "sentiment": "enum (positive, negative, neutral)",
  "aspect": "string (what aspect of product, e.g. 'quality', 'price', 'shipping')"
}}

Constraints:
- Positive: feedback about value, quality, delivery, or features (includes words like "great", "excellent", "love")
- Negative: complaints about defects, delays, or poor quality (includes words like "terrible", "broken", "useless")
- Neutral: factual statements without strong emotion (e.g. "The box was blue and square")
- Each sentiment should appear roughly equally

Generate {num_samples} distinct review examples. Output only valid JSON, one object per line."""

    # Synchronous client — this function is plain `def`, not `async def`.
    client = httpx.Client(
        base_url=API_BASE,
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=60.0,
    )

    response = client.post(
        "/chat/completions",
        json={
            "model": MODEL,
            "temperature": 0.8,  # Higher temp for diversity
            "max_tokens": 4000,
            "messages": [
                {"role": "user", "content": prompt}
            ],
        },
    )

    if response.status_code != 200:
        raise Exception(f"API error: {response.status_code} {response.text}")

    data = response.json()
    content = data["choices"][0]["message"]["content"]

    # Parse JSONL output
    examples = []
    for line in content.strip().split("\n"):
        if line.strip():
            try:
                obj = json.loads(line)
                examples.append(obj)
            except json.JSONDecodeError:
                print(f"Warning: Could not parse line: {line}")

    return examples

if __name__ == "__main__":
    print("Generating 50 labeled sentiment examples...")
    examples = generate_sentiment_dataset(50)

    print(f"Generated {len(examples)} examples:")
    for ex in examples[:3]:
        print(json.dumps(ex, indent=2))

    # Save to file
    with open("sentiment_data.jsonl", "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")
    print("Saved to sentiment_data.jsonl")

Named Entity Recognition (NER) Dataset

For NER, you need both text and token-level annotations. This is trickier but LLMs can handle it:

#!/usr/bin/env python3
import json
from typing import Any

def generate_ner_dataset(num_samples: int = 50) -> list[dict[str, Any]]:
    """
    Generate NER training data with BIO tags.
    Uses LLM to create text, then annotate it.
    """
    prompt = f"""Generate {num_samples} diverse sentences about technology companies, products, and people.
For each sentence, provide:
1. The sentence text
2. Token-level BIO tags: B-ORG, I-ORG, B-PRODUCT, I-PRODUCT, B-PERSON, I-PERSON, O (other)

Schema:
{{
  "text": "string (full sentence)",
  "tokens": ["string array of words"],
  "tags": ["string array of BIO tags, same length as tokens"]
}}

Example:
{{
  "text": "Apple released the iPhone 15 last month.",
  "tokens": ["Apple", "released", "the", "iPhone", "15", "last", "month", "."],
  "tags": ["B-ORG", "O", "O", "B-PRODUCT", "I-PRODUCT", "O", "O", "O"]
}}

Generate diverse, realistic examples. Output valid JSON, one per line."""

    # (In real code, call LLM API similar to above)
    # For brevity, showing structure only
    return []
Important Caveat: When generating labeled data with LLMs, quality can vary. Always validate a sample of generated labels with human experts. Use the synthetic data as a starting point, then have humans review and correct 5–10% of it. This hybrid approach is faster than pure manual annotation while maintaining quality.

4. Synthetic Instruction Tuning Data

Modern language models are often fine-tuned on instruction-following examples. Creating such data synthetically is both cost-effective and scalable. Three popular approaches:

Self-Instruct

Self-Instruct loop: seed instructions → generate → filter → expand pool iteratively
Figure 5.3 — The Self-Instruct bootstrapping loop. Starting from a small seed set of instructions, an LLM generates new instructions and input-output pairs, which are filtered for quality and added back to the pool. This iterative process can expand 175 seed instructions to over 52,000 training examples.

Start with a small seed set of manually written instructions. Use an LLM to generate new instructions and responses. Iteratively filter for quality:

def self_instruct_pipeline(seed_instructions: list[str], num_iterations: int = 3):
    """
    Self-instruct: Use LLM to expand instruction set.
    """
    pool = seed_instructions.copy()

    for iteration in range(num_iterations):
        # Sample a batch from pool
        batch = random.sample(pool, min(10, len(pool)))

        prompt = f"""Based on these existing instructions:
{chr(10).join(batch)}

Generate 5 new, diverse instructions that are similar in style but cover different tasks.
For each instruction, provide:
1. The instruction
2. A realistic example input
3. The expected output

Ensure diversity: mix question-answering, text generation, reasoning, coding, and creative tasks.
Output JSON objects, one per line."""

        # Call LLM
        new_instructions = call_llm(prompt)

        # Filter by quality: remove duplicates, check length, semantic diversity
        filtered = filter_and_deduplicate(new_instructions)

        # Add to pool
        pool.extend(filtered)

    return pool

Alpaca-Style Expansion

The Alpaca model (Stanford) demonstrated generating 52K instruction-tuning examples from just 175 manually written examples by prompt morphing:

def alpaca_expansion(seed_set: list[dict], target_count: int = 52000):
    """
    Alpaca-style: Use seed examples to expand via prompt morphing.

    Each seed example has: instruction, input, output.
    """
    expanded = seed_set.copy()

    while len(expanded) < target_count:
        # Sample a seed example
        seed = random.choice(seed_set)

        prompt = f"""Given this instruction and output:
Instruction: {seed['instruction']}
Input: {seed.get('input', '')}
Output: {seed['output']}

Generate a similar but distinct instruction-input-output triplet.
Vary the task type, domain, or difficulty. Output as JSON."""

        new_example = call_llm(prompt)

        # Check for duplicates and quality
        if is_diverse_and_novel(new_example, expanded):
            expanded.append(new_example)

    return expanded

Evol-Instruct (Evolutionary Prompting)

A more sophisticated approach that iteratively increases instruction complexity:

import random

# Pseudocode helper: replace with your provider-specific LLM call.
def call_llm(prompt: str) -> str:
    raise NotImplementedError

def evol_instruct(seed_instructions: list[str], depth: int = 5) -> list[dict]:
    """
    Evol-instruct: Increase complexity through mutations.
    Mutations: add constraints, increase depth, combine tasks, etc.
    """
    instructions = [{"text": s, "generation": 0} for s in seed_instructions]

    for gen in range(depth):
        new_batch = []
        for instr in instructions:
            if instr["generation"] == gen:
                # Apply random mutation
                mutation = random.choice([
                    "add_constraint",   # Add a condition or requirement
                    "increase_depth",   # Make it multi-step
                    "broaden",          # Generalize to related domain
                    "combine"           # Merge with another task
                ])

                prompt = f"""Instruction: {instr['text']}

Apply this mutation: {mutation}

For 'add_constraint': add a specific requirement or rule.
For 'increase_depth': make it a multi-step or more complex task.
For 'broaden': generalize to similar domains.
For 'combine': merge with a related task.

Output the mutated instruction as a single line."""

                mutated = call_llm(prompt)
                new_batch.append({
                    "text": mutated,
                    "generation": gen + 1,
                    "parent": instr["text"]
                })

        instructions.extend(new_batch)

    return instructions

5. Generating Tabular Data with LLMs

LLMs aren't just for text—they're surprisingly effective at generating structured tabular data (CSV, JSON) with schema constraints. The key is to make constraints explicit and leverage in-context learning.

Customer Transaction Records

#!/usr/bin/env python3
import json
from datetime import datetime, timedelta
import random

def generate_transaction_batch(count: int = 100) -> list[dict]:
    """
    Use LLM to generate realistic transaction records with cross-field constraints.
    """

    # Define schema and examples upfront
    schema = {
        "transaction_id": "unique UUID",
        "customer_id": "CUST_XXXXX format",
        "transaction_date": "ISO 8601 date",
        "amount": "float, USD, 0.01 to 100000",
        "merchant_category": "enum: groceries, electronics, travel, dining, other",
        "payment_method": "enum: credit_card, debit_card, digital_wallet, bank_transfer",
        "status": "enum: completed, pending, failed, refunded",
        "is_international": "boolean",
        "fraud_score": "float 0.0-1.0 (estimated by model)"
    }

    examples = [
        {
            "transaction_id": "TX_92f3c18e",
            "customer_id": "CUST_00152",
            "transaction_date": "2024-01-15",
            "amount": 45.99,
            "merchant_category": "groceries",
            "payment_method": "debit_card",
            "status": "completed",
            "is_international": False,
            "fraud_score": 0.02
        },
        {
            "transaction_id": "TX_18a92d7f",
            "customer_id": "CUST_00289",
            "transaction_date": "2024-01-15",
            "amount": 1200.00,
            "merchant_category": "travel",
            "payment_method": "credit_card",
            "status": "completed",
            "is_international": True,
            "fraud_score": 0.05
        }
    ]

    prompt = f"""You are generating synthetic transaction data for fraud detection training.
Create diverse, realistic transaction records.

Schema:
{json.dumps(schema, indent=2)}

Constraints:
- If is_international=True, amount is likely > 100 and fraud_score > 0.05
- If merchant_category='travel', status is usually 'completed' or 'pending'
- If fraud_score > 0.7, status might be 'failed' or 'refunded'
- If payment_method='digital_wallet', is_international is less common
- Transaction dates should span the last 90 days

Real examples (mimic this style):
{json.dumps(examples, indent=2)}

Generate {count} diverse transaction records. Output only valid JSON, one per line."""

    # (Call your LLM API here)
    response_text = call_llm_api(prompt, temperature=0.7, max_tokens=8000)

    records = []
    for line in response_text.strip().split("\n"):
        if line.strip():
            try:
                record = json.loads(line)
                # Validate before adding
                if validate_transaction(record):
                    records.append(record)
            except json.JSONDecodeError:
                pass

    return records

def validate_transaction(record: dict) -> bool:
    """Validate transaction against schema and constraints."""
    required_fields = ["transaction_id", "customer_id", "amount", "status"]
    if not all(f in record for f in required_fields):
        return False

    # Check types and ranges
    try:
        amount = float(record["amount"])
        if amount <= 0 or amount > 100000:
            return False
    except (ValueError, TypeError):
        return False

    # Check cross-field constraints
    if record.get("is_international") and record.get("amount", 0) < 100:
        # This is OK (weak constraint), allow it
        pass

    return True

6. Quality Control & Filtering

Quality funnel: 10,000 raw outputs filtered through validation, constraints, deduplication to 7,800 clean records
Figure 5.4 — The quality control funnel. Raw LLM outputs pass through successive validation stages — schema conformance, business-rule constraints, semantic deduplication — with each stage removing malformed or duplicate records. A well-designed pipeline typically retains 70–85% of generated records.

Generated synthetic data is only as good as your filtering pipeline. Here's a practical quality assurance framework:

Multi-Stage Validation

class SyntheticDataValidator:
    """
    Multi-stage validation pipeline for LLM-generated data.
    """

    def __init__(self):
        self.validation_results = {
            "schema_valid": 0,
            "constraint_valid": 0,
            "dedup_kept": 0,
            "semantic_valid": 0,
            "total_rejected": 0
        }

    def validate_schema(self, record: dict, schema: dict) -> bool:
        """Check that record matches schema (types, required fields)."""
        for field, field_type in schema.items():
            if field not in record:
                return False
            # Type checking logic
        return True

    def validate_constraints(self, record: dict, constraints: list[callable]) -> bool:
        """Check cross-field constraints."""
        for constraint_fn in constraints:
            if not constraint_fn(record):
                return False
        return True

    def deduplicate(self, records: list[dict]) -> list[dict]:
        """Remove duplicates and near-duplicates."""
        seen = {}
        unique = []
        for record in records:
            # Use hash of key fields
            key = self._record_hash(record)
            if key not in seen:
                seen[key] = True
                unique.append(record)
        self.validation_results["dedup_kept"] = len(unique)
        return unique

    def _record_hash(self, record: dict) -> str:
        """Create a stable hash of record (ignoring ID fields)."""
        key_fields = {k: v for k, v in record.items() if k != "id"}
        return hash(json.dumps(key_fields, sort_keys=True, default=str))

    def semantic_check(self, text: str, model: Any) -> float:
        """
        Use a classifier or embedding model to check coherence/quality.
        For example, if generating reviews, check sentiment matches labels.
        """
        # Pseudocode: compare text sentiment to label
        predicted_sentiment = model.predict(text)
        return predicted_sentiment

    def filter_pipeline(self, records: list[dict], schema: dict,
                        constraints: list[callable]) -> list[dict]:
        """End-to-end validation."""
        valid = []
        for record in records:
            if not self.validate_schema(record, schema):
                self.validation_results["total_rejected"] += 1
                continue

            if not self.validate_constraints(record, constraints):
                self.validation_results["total_rejected"] += 1
                continue

            self.validation_results["schema_valid"] += 1
            self.validation_results["constraint_valid"] += 1
            valid.append(record)

        # Deduplicate
        valid = self.deduplicate(valid)

        return valid
Best Practice: Plan for a 30–50% rejection rate on initial generation. This is normal. Use a staging pipeline: generate 2–3x your target count, filter aggressively, and you'll land on high-quality data. As your prompt and constraints improve, rejection rates drop.

7. Cost & Scale Considerations

Generating data at scale with LLMs introduces practical constraints: cost, latency, and rate limits.

Cost Analysis

Approach Cost per 1M records Time to generate Best for
Frontier API model Provider-dependent; check live pricing 10–20 hours (batch) High quality, small-medium datasets
Small API model Often cheapest managed option; check live pricing 10–20 hours (batch) Large datasets, cost-conscious
Open-weight model (self-hosted) Infrastructure only; depends on GPU and utilization 2–5 hours (4x A100 GPU) Very large datasets, private data
Hosted open-weight model Provider-dependent; usually between self-hosted and frontier APIs 5–10 hours Large datasets, balanced cost/quality

Batching and Optimization

To minimize latency and cost when generating large datasets:

#!/usr/bin/env python3
import asyncio
from concurrent.futures import ThreadPoolExecutor
import time

class BatchDataGenerator:
    def __init__(self, api_key: str, batch_size: int = 10):
        self.api_key = api_key
        self.batch_size = batch_size
        self.request_queue = []

    async def generate_batched(self, num_records: int, prompt_template: str):
        """
        Generate data in batches with rate limiting.
        """
        all_records = []
        batches = (num_records + self.batch_size - 1) // self.batch_size

        for batch_num in range(batches):
            # Create batch prompt
            batch_prompt = prompt_template.replace(
                "{count}",
                str(self.batch_size)
            )

            # Use exponential backoff for rate limits
            retry_count = 0
            while retry_count < 3:
                try:
                    response = await self._call_llm_async(batch_prompt)
                    batch_records = self._parse_response(response)
                    all_records.extend(batch_records)
                    break
                except RateLimitError:
                    retry_count += 1
                    wait_time = 2 ** retry_count
                    print(f"Rate limited. Waiting {wait_time}s...")
                    await asyncio.sleep(wait_time)

            # Progress tracking
            print(f"Generated batch {batch_num + 1}/{batches}")
            time.sleep(0.1)  # Small delay between batches

        return all_records

    async def _call_llm_async(self, prompt: str):
        """Async call to LLM API."""
        # Implementation uses aiohttp or similar
        pass

    def _parse_response(self, response: str) -> list[dict]:
        """Parse JSONL response."""
        records = []
        for line in response.strip().split("\n"):
            if line.strip():
                records.append(json.loads(line))
        return records
Token Budgeting: Before generating at scale, estimate token usage. If you generate 1M records at 100 tokens each, that's 100M tokens. Prices change quickly, so compute the estimate from the provider's current input/output token rates instead of relying on a fixed table. Always set a token budget and monitor spending.

8. Distillation & Self-Training

A powerful technique: use synthetic data to train smaller, faster, cheaper models. This is called knowledge distillation when the student learns from a teacher model.

Distillation Pipeline

Step 1: Generate synthetic labeled data with a large teacher model (GPT-4).

Step 2: Train a smaller student model (DistilBERT, Llama-7B) on the synthetic data.

Step 3: Evaluate the student against the teacher on held-out real data.

def distillation_pipeline(task: str, num_synthetic: int = 100000):
    """
    Knowledge distillation: Use large model to label data for small model.
    """

    # Step 1: Generate synthetic data with teacher
    print("Step 1: Generating synthetic training data with teacher model...")
    teacher_prompt = f"""Generate diverse {task} examples. Output JSON."""
    synthetic_data = call_llm_api(teacher_prompt, model="gpt-4")

    # Step 2: Train student on synthetic data
    print("Step 2: Training student model on synthetic data...")
    student_model = load_pretrained("distilbert-base-uncased")

    train_dataset = load_dataset_from_synthetic(synthetic_data)
    trainer = Trainer(
        model=student_model,
        train_dataset=train_dataset,
        learning_rate=2e-5,
        num_train_epochs=3
    )
    trainer.train()

    # Step 3: Evaluate
    print("Step 3: Evaluating student on real validation data...")
    real_validation = load_real_dataset("validation.jsonl")
    metrics = evaluate(student_model, real_validation)
    print(f"Accuracy: {metrics['accuracy']:.4f}")
    print(f"F1: {metrics['f1']:.4f}")

    return student_model, metrics

9. Ethical Considerations

Generating synthetic data with LLMs introduces unique ethical challenges:

Bias Amplification

If the teacher LLM was trained on biased data (as most were), synthetic data can amplify those biases. For example:

Mitigation: Audit synthetic data for demographic parity, stratify generation by protected attributes, and include diverse examples in few-shot prompts.

Attribution and Provenance

Synthetic text, especially from instruction-tuning, may reflect patterns from copyrighted material. While the generated text is new, it encodes knowledge from training data. Best practice: document the source models and training data used, and be transparent with users who consume synthetic data.

Ensuring Diversity

Explicitly prompt for diversity. Instead of "generate 100 customer names", try:

"""Generate 100 diverse customer names representing multiple geographies,
cultures, and naming traditions. Ensure representation from:
- East Asia, South Asia, Middle East, Africa, Europe, Americas
- Both traditionally 'male' and 'female' names
- Various name lengths and phonetic patterns
"""

10. Hands-On: Building a Synthetic NLP Dataset

Let's build an end-to-end pipeline: schema definition, generation, parsing, validation, and export.

#!/usr/bin/env python3
"""
Complete pipeline: Generate, validate, and export a synthetic NLP dataset.
Use case: Customer support ticket classification.
"""

import json
import sys
from typing import Any
from dataclasses import dataclass
import httpx

@dataclass
class Ticket:
    ticket_id: str
    customer_message: str
    category: str  # billing, technical, shipping, account
    priority: str  # low, medium, high
    sentiment: str  # positive, negative, neutral

class TicketDatasetPipeline:
    def __init__(self, api_key: str, model: str = "gpt-5-mini"):
        self.api_key = api_key
        self.model = model
        self.client = httpx.Client(
            headers={"Authorization": f"Bearer {api_key}"}
        )

    def generate(self, count: int = 500) -> list[dict]:
        """Generate synthetic tickets."""
        prompt = f"""Generate {count} realistic customer support tickets.
Each ticket represents a customer contacting support about an issue.

Schema:
{{
  "ticket_id": "TICK_XXXXX (unique)",
  "customer_message": "string (customer's question/complaint, 20-150 words)",
  "category": "enum (billing, technical, shipping, account)",
  "priority": "enum (low, medium, high)",
  "sentiment": "enum (positive, negative, neutral)"
}}

Constraints:
- If category='billing', priority is often 'high'
- If sentiment='positive', priority is often 'low'
- If category='technical', message should mention errors/bugs
- Messages should reflect real support interactions

Examples:
{{"ticket_id": "TICK_00001", "customer_message": "I was charged twice for my order. Can you refund the duplicate charge?", "category": "billing", "priority": "high", "sentiment": "negative"}}
{{"ticket_id": "TICK_00002", "customer_message": "Thanks for the quick shipping! I received my item as expected.", "category": "shipping", "priority": "low", "sentiment": "positive"}}

Generate {count} diverse, realistic tickets. Output only JSON, one per line."""

        response = self.client.post(
            "https://api.openai.com/v1/chat/completions",
            json={
                "model": self.model,
                "temperature": 0.8,
                "max_tokens": 10000,
                "messages": [{"role": "user", "content": prompt}],
            },
        )

        if response.status_code != 200:
            raise Exception(f"API error: {response.status_code}")

        data = response.json()
        content = data["choices"][0]["message"]["content"]

        tickets = []
        for line in content.strip().split("\n"):
            if line.strip():
                try:
                    tickets.append(json.loads(line))
                except json.JSONDecodeError:
                    print(f"Warning: Could not parse: {line}", file=sys.stderr)

        return tickets

    def validate(self, tickets: list[dict]) -> list[Ticket]:
        """Validate and convert to dataclass."""
        valid_tickets = []
        for ticket in tickets:
            try:
                # Type checks
                assert isinstance(ticket.get("ticket_id"), str)
                assert isinstance(ticket.get("customer_message"), str)
                assert ticket.get("category") in ["billing", "technical", "shipping", "account"]
                assert ticket.get("priority") in ["low", "medium", "high"]
                assert ticket.get("sentiment") in ["positive", "negative", "neutral"]

                # Length check
                msg_len = len(ticket["customer_message"].split())
                assert 20 <= msg_len <= 150, f"Message too short/long: {msg_len} words"

                valid_tickets.append(Ticket(**ticket))
            except (AssertionError, KeyError, TypeError) as e:
                print(f"Invalid ticket: {e}", file=sys.stderr)

        return valid_tickets

    def deduplicate(self, tickets: list[Ticket]) -> list[Ticket]:
        """Remove duplicates based on message text."""
        seen = set()
        unique = []
        for ticket in tickets:
            msg_hash = hash(ticket.customer_message)
            if msg_hash not in seen:
                seen.add(msg_hash)
                unique.append(ticket)
        return unique

    def export_jsonl(self, tickets: list[Ticket], filepath: str):
        """Export to JSONL format."""
        with open(filepath, "w") as f:
            for ticket in tickets:
                f.write(json.dumps({
                    "ticket_id": ticket.ticket_id,
                    "customer_message": ticket.customer_message,
                    "category": ticket.category,
                    "priority": ticket.priority,
                    "sentiment": ticket.sentiment
                }) + "\n")
        print(f"Exported {len(tickets)} tickets to {filepath}")

    def export_csv(self, tickets: list[Ticket], filepath: str):
        """Export to CSV format."""
        import csv
        with open(filepath, "w", newline="") as f:
            writer = csv.DictWriter(f,
                fieldnames=["ticket_id", "customer_message", "category", "priority", "sentiment"])
            writer.writeheader()
            for ticket in tickets:
                writer.writerow({
                    "ticket_id": ticket.ticket_id,
                    "customer_message": ticket.customer_message,
                    "category": ticket.category,
                    "priority": ticket.priority,
                    "sentiment": ticket.sentiment
                })
        print(f"Exported {len(tickets)} tickets to {filepath}")

    def run_pipeline(self, count: int = 500) -> tuple[list[Ticket], dict]:
        """Run complete pipeline: generate -> validate -> deduplicate -> export."""
        print(f"Generating {count} synthetic tickets...")
        raw_tickets = self.generate(count)
        print(f"Generated {len(raw_tickets)} raw tickets")

        print("Validating...")
        valid_tickets = self.validate(raw_tickets)
        print(f"Valid: {len(valid_tickets)} tickets")

        print("Deduplicating...")
        unique_tickets = self.deduplicate(valid_tickets)
        print(f"After dedup: {len(unique_tickets)} tickets")

        # Export
        self.export_jsonl(unique_tickets, "tickets.jsonl")
        self.export_csv(unique_tickets, "tickets.csv")

        # Summary stats
        stats = {
            "total_generated": len(raw_tickets),
            "total_valid": len(unique_tickets),
            "category_dist": self._category_distribution(unique_tickets),
            "priority_dist": self._priority_distribution(unique_tickets),
            "sentiment_dist": self._sentiment_distribution(unique_tickets)
        }

        return unique_tickets, stats

    def _category_distribution(self, tickets: list[Ticket]) -> dict:
        dist = {}
        for ticket in tickets:
            dist[ticket.category] = dist.get(ticket.category, 0) + 1
        return dist

    def _priority_distribution(self, tickets: list[Ticket]) -> dict:
        dist = {}
        for ticket in tickets:
            dist[ticket.priority] = dist.get(ticket.priority, 0) + 1
        return dist

    def _sentiment_distribution(self, tickets: list[Ticket]) -> dict:
        dist = {}
        for ticket in tickets:
            dist[ticket.sentiment] = dist.get(ticket.sentiment, 0) + 1
        return dist

# Main execution
if __name__ == "__main__":
    import os
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("Set OPENAI_API_KEY environment variable")

    pipeline = TicketDatasetPipeline(api_key)
    tickets, stats = pipeline.run_pipeline(count=500)

    print("\n=== SUMMARY STATS ===")
    print(f"Category distribution: {stats['category_dist']}")
    print(f"Priority distribution: {stats['priority_dist']}")
    print(f"Sentiment distribution: {stats['sentiment_dist']}")
Pro Tip: Always include validation and deduplication in your pipeline. Even with strict constraints, LLMs sometimes generate invalid JSON or duplicates. A robust pipeline catches these before export.

References and Further Reading

  1. Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D., & Hajishirzi, H. (2023). Self-Instruct: Aligning Language Models with Self-Generated Instructions. Association for Computational Linguistics. arxiv.org/abs/2212.10560
  2. Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., & Hashimoto, T. B. (2023). Stanford Alpaca: An Instruction-following LLaMA Model. Stanford Center for Research on Foundation Models. crfm.stanford.edu/2023/03/13/alpaca
  3. Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., Lin, Q., & Jiang, D. (2023). WizardLM: Empowering Large Language Models to Follow Complex Instructions. International Conference on Learning Representations. arxiv.org/abs/2304.12244
  4. Gunasekar, S., Zhang, Y., Aneja, J., Mendes, C. C. T., Del Giorno, A., Gopi, S., Javaheripi, M., Kauffmann, P., de Rosa, G., Saarikivi, O., Salim, A., Shah, S., Behl, H. S., Wang, X., Bubeck, S., Eldan, R., Kalai, A. T., Lee, Y. T., & Li, Y. (2023). Textbooks Are All You Need. arXiv preprint. arxiv.org/abs/2306.11644
  5. OpenAI. (2026). Text Generation and the Responses API. OpenAI API Documentation. platform.openai.com/docs/guides/text
  6. OpenAI. (2026). Structured Model Outputs. OpenAI API Documentation. platform.openai.com/docs/guides/structured-outputs
  7. OpenAI. (2026). API Pricing. OpenAI. openai.com/api/pricing

Conclusion

LLMs represent a paradigm shift in synthetic data generation. They excel at creating diverse, coherent, and realistic data at scale—especially for text and structured records with complex constraints. The key to success is thoughtful prompt engineering, multi-stage validation, and clear awareness of ethical implications.

As you implement these techniques, remember: synthetic data is a tool to accelerate development and testing, not a replacement for real data in production systems. Use synthetic data to prototype, train models, and validate pipelines. Validate early and often. And always maintain a human-in-the-loop component for critical applications.