
Building Pipelines

Compose training-data sources, evolvers, and generators into complete workflows

This guide shows how to combine the synthesis components into end-to-end pipelines for different scenarios. Each pipeline addresses a specific use case for generating Text2SQL training data.

Pipeline Fundamentals

Every pipeline follows a common pattern:

  1. Source - Where pairs come from (schema, history, SQL logs)
  2. Transform - How pairs are modified (evolution, validation)
  3. Output - Final dataset format (JSONL, CSV, etc.)

The building blocks for each stage are all exported from @deepagents/text2sql/synthesis:
import {
  // Sources
  SchemaSynthesizer,
  MessageExtractor,
  SqlExtractor,
  // Evolution
  BreadthEvolver,
  DepthEvolver,
  // Generators
  PersonaGenerator,
  TeachingsGenerator,
  // Decorators
  ValidatedProducer,
  FilteredProducer,
  DeduplicatedProducer,
  // Helper
  toPairs,
} from '@deepagents/text2sql/synthesis';
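
Concretely, even the smallest pipeline chains these three stages. A minimal sketch, assuming a source such as MessageExtractor can be wrapped directly by the decorators and that loadChatHistory is your own loader:

// Source: extract candidate pairs from existing chat messages
const source = new MessageExtractor(await loadChatHistory());

// Transform: validate each pair against the live schema, then drop duplicate SQL
const pipeline = new DeduplicatedProducer(
  new ValidatedProducer(source, adapter),
  { mode: 'sql' }
);

// Output: materialize the stream into an array, ready for export
const pairs = await toPairs(pipeline);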

Scenario A: Bootstrap from Schema

Use case: Starting fresh with no existing data. Generate training pairs directly from your database schema.

Basic Bootstrap

import { SchemaSynthesizer, toPairs } from '@deepagents/text2sql/synthesis';

const pairs = await toPairs(
  new SchemaSynthesizer(adapter, {
    count: 100,
    complexity: ['low', 'medium', 'hard'],
  })
);

Enhanced Bootstrap with Personas

Generate more realistic questions by using personas that match your actual users:

import {
  SchemaSynthesizer,
  PersonaGenerator,
  toPairs,
} from '@deepagents/text2sql/synthesis';

// Step 1: Generate personas from schema
const personas = await new PersonaGenerator(adapter, {
  count: 5,
}).generate();

// Step 2: Generate pairs using personas
const pairs = await toPairs(
  new SchemaSynthesizer(adapter, {
    count: 100,
    complexity: ['low', 'medium', 'hard'],
    personas,
  })
);

Full Bootstrap Pipeline

Complete pipeline with personas, teachings, evolution, and validation:

import {
  SchemaSynthesizer,
  PersonaGenerator,
  TeachingsGenerator,
  BreadthEvolver,
  DepthEvolver,
  ValidatedProducer,
  DeduplicatedProducer,
  toPairs,
} from '@deepagents/text2sql/synthesis';

// Step 1: Generate supporting artifacts
const [personas, teachings] = await Promise.all([
  new PersonaGenerator(adapter, { count: 5 }).generate(),
  new TeachingsGenerator(adapter).generate(),
]);

// Step 2: Generate baseline pairs
const baseline = await toPairs(
  new SchemaSynthesizer(adapter, {
    count: 50,
    complexity: 'low',
    personas,
    teachings,
  })
);
console.log(`Baseline: ${baseline.length} pairs`);

// Step 3: Evolve in depth (more complex queries)
const complex = await toPairs(
  new DepthEvolver(baseline, adapter, {
    count: 2,
    techniques: ['add-aggregation', 'add-join', 'add-filter'],
  })
);
console.log(`After depth evolution: ${complex.length} pairs`);

// Step 4: Evolve in breadth (paraphrased variations)
const varied = await toPairs(
  new BreadthEvolver([...baseline, ...complex], {
    count: 3,
  })
);
console.log(`After breadth evolution: ${varied.length} pairs`);

// Step 5: Validate and deduplicate
const final = await toPairs(
  new DeduplicatedProducer(
    new ValidatedProducer(
      { produce: async function* () { yield varied; } },
      adapter
    ),
    { mode: 'sql' }
  )
);
console.log(`Final dataset: ${final.length} validated pairs`);

Scenario B: Augment Chat History

Use case: You have existing chat logs with SQL queries. Extract and augment them.

Basic Extraction

import { MessageExtractor, toPairs } from '@deepagents/text2sql/synthesis';

const messages = await loadChatHistory(); // Your data source

const pairs = await toPairs(new MessageExtractor(messages));

Extraction with Context

For multi-turn conversations where context matters:

import {
  WindowedContextExtractor,
  toPairs,
} from '@deepagents/text2sql/synthesis';

const pairs = await toPairs(
  new WindowedContextExtractor(messages, {
    windowSize: 3, // Include 3 preceding messages as context
  })
);

// Each pair includes context array
// { question, sql, context: ['prev msg 1', 'prev msg 2', 'prev msg 3'], success }

Full History Augmentation Pipeline

import {
  MessageExtractor,
  BreadthEvolver,
  ValidatedProducer,
  FilteredProducer,
  DeduplicatedProducer,
  toPairs,
} from '@deepagents/text2sql/synthesis';

// Step 1: Extract from history
const messages = await loadChatHistory();
const extracted = await toPairs(new MessageExtractor(messages));
console.log(`Extracted: ${extracted.length} pairs`);

// Step 2: Validate against current schema
const validated = await toPairs(
  new ValidatedProducer(
    { produce: async function* () { yield extracted; } },
    adapter
  )
);
console.log(`Valid: ${validated.filter(p => p.success).length} pairs`);

// Step 3: Filter to successful only
const successful = validated.filter(p => p.success);

// Step 4: Generate variations
const augmented = await toPairs(
  new BreadthEvolver(successful, { count: 3 })
);
console.log(`Augmented: ${augmented.length} pairs`);

// Step 5: Deduplicate
const final = await toPairs(
  new DeduplicatedProducer(
    { produce: async function* () { yield augmented; } },
    { mode: 'exact' }
  )
);
console.log(`Final: ${final.length} unique pairs`);

Scenario C: Reverse-Engineer SQL Logs

Use case: You have SQL query logs but no natural language questions. Generate questions for existing SQL.

Basic SQL Extraction

import { SqlExtractor, toPairs } from '@deepagents/text2sql/synthesis';

const sqlLogs = await loadQueryLogs(); // Your SQL logs
const sqls = sqlLogs.map(log => log.sql);

const pairs = await toPairs(
  new SqlExtractor(sqls, adapter, {
    validateSql: true,
    skipInvalid: true,
  })
);

Full SQL Log Pipeline

import {
  SqlExtractor,
  BreadthEvolver,
  DepthEvolver,
  DeduplicatedProducer,
  toPairs,
} from '@deepagents/text2sql/synthesis';

// Step 1: Load and deduplicate SQL logs
const sqlLogs = await loadQueryLogs();
const uniqueSqls = [...new Set(sqlLogs.map(log => log.sql))];
console.log(`Unique SQLs: ${uniqueSqls.length}`);

// Step 2: Generate questions for each SQL
const extracted = await toPairs(
  new SqlExtractor(uniqueSqls, adapter, {
    validateSql: true,
    skipInvalid: true,
  })
);
console.log(`Extracted: ${extracted.length} pairs`);

// Step 3: Generate question variations (breadth)
const varied = await toPairs(
  new BreadthEvolver(extracted, { count: 2 })
);
console.log(`With variations: ${varied.length} pairs`);

// Step 4: Remove exact duplicate pairs
const final = await toPairs(
  new DeduplicatedProducer(
    { produce: async function* () { yield varied; } },
    { mode: 'exact' }
  )
);
console.log(`Final: ${final.length} pairs`);

Scenario D: Multi-Source Synthesis

Use case: Combine multiple data sources into a comprehensive dataset.

import {
  SchemaSynthesizer,
  MessageExtractor,
  SqlExtractor,
  PersonaGenerator,
  BreadthEvolver,
  DepthEvolver,
  ValidatedProducer,
  DeduplicatedProducer,
  toPairs,
} from '@deepagents/text2sql/synthesis';

// Generate personas for realistic questions
const personas = await new PersonaGenerator(adapter, { count: 5 }).generate();

// Source 1: Schema-based synthesis
const fromSchema = await toPairs(
  new SchemaSynthesizer(adapter, {
    count: 50,
    complexity: ['low', 'medium'],
    personas,
  })
);
console.log(`From schema: ${fromSchema.length} pairs`);

// Source 2: Chat history
const messages = await loadChatHistory();
const fromHistory = await toPairs(new MessageExtractor(messages));
console.log(`From history: ${fromHistory.length} pairs`);

// Source 3: SQL logs
const sqlLogs = await loadQueryLogs();
const fromLogs = await toPairs(
  new SqlExtractor(sqlLogs.map(l => l.sql), adapter, { skipInvalid: true })
);
console.log(`From logs: ${fromLogs.length} pairs`);

// Combine all sources
const combined = [...fromSchema, ...fromHistory, ...fromLogs];
console.log(`Combined: ${combined.length} pairs`);

// Evolve for complexity
const evolved = await toPairs(
  new DepthEvolver(combined, adapter, {
    count: 1,
    techniques: ['add-aggregation', 'add-filter'],
  })
);

// Evolve for variety
const varied = await toPairs(
  new BreadthEvolver([...combined, ...evolved], { count: 2 })
);

// Validate and deduplicate
const final = await toPairs(
  new DeduplicatedProducer(
    new ValidatedProducer(
      { produce: async function* () { yield varied; } },
      adapter
    ),
    { mode: 'sql' }
  )
);

console.log(`Final dataset: ${final.length} pairs`);

Scenario E: Production Pipeline with Quality Gates

Use case: Production-ready pipeline with validation, filtering, and quality checks.

import {
  SchemaSynthesizer,
  PersonaGenerator,
  TeachingsGenerator,
  BreadthEvolver,
  DepthEvolver,
  ValidatedProducer,
  FilteredProducer,
  DeduplicatedProducer,
  toPairs,
} from '@deepagents/text2sql/synthesis';

async function buildProductionDataset(adapter, options: {
  targetCount: number;
  tables?: string[];
}) {
  const { targetCount, tables } = options;

  // Phase 1: Setup
  console.log('Phase 1: Generating personas and teachings...');
  const [personas, teachings] = await Promise.all([
    new PersonaGenerator(adapter, { count: 5 }).generate(),
    new TeachingsGenerator(adapter).generate(),
  ]);

  // Phase 2: Baseline generation
  console.log('Phase 2: Generating baseline pairs...');
  const baseline = await toPairs(
    new SchemaSynthesizer(adapter, {
      count: Math.ceil(targetCount / 4),
      complexity: ['low', 'medium', 'hard'],
      personas,
      teachings,
    })
  );

  // Phase 3: Depth evolution (complexity)
  console.log('Phase 3: Depth evolution...');
  const complex = await toPairs(
    new DepthEvolver(baseline, adapter, {
      count: 2,
      techniques: ['add-aggregation', 'add-join', 'add-reasoning'],
      concurrency: 4,
    })
  );

  // Phase 4: Breadth evolution (variety)
  console.log('Phase 4: Breadth evolution...');
  const allPairs = [...baseline, ...complex];
  const varied = await toPairs(
    new BreadthEvolver(allPairs, {
      count: 2,
      concurrency: 4,
    })
  );

  // Phase 5: Quality gates
  console.log('Phase 5: Applying quality gates...');

  // Gate 1: Validation
  const validated = await toPairs(
    new ValidatedProducer(
      { produce: async function* () { yield varied; } },
      adapter,
      { execute: true } // Actually run queries
    )
  );

  // Gate 2: Filter by tables (if specified)
  let filtered = validated;
  if (tables?.length) {
    filtered = await toPairs(
      new FilteredProducer(
        { produce: async function* () { yield validated; } },
        { successOnly: true, tables }
      )
    );
  } else {
    filtered = validated.filter(p => p.success);
  }

  // Gate 3: Deduplication
  const final = await toPairs(
    new DeduplicatedProducer(
      { produce: async function* () { yield filtered; } },
      { mode: 'sql' }
    )
  );

  // Report
  console.log('\n=== Pipeline Complete ===');
  console.log(`Baseline:    ${baseline.length} pairs`);
  console.log(`After depth: ${complex.length} pairs`);
  console.log(`After breadth: ${varied.length} pairs`);
  console.log(`Validated:   ${validated.filter(p => p.success).length} pairs`);
  console.log(`Final:       ${final.length} pairs`);

  return final;
}

// Usage
const dataset = await buildProductionDataset(adapter, {
  targetCount: 500,
  tables: ['orders', 'customers', 'products'],
});

Component Decision Guide

Use this guide to choose components:

When to use each source

Source            | Use When                        | Output
SchemaSynthesizer | No existing data, bootstrapping | New pairs from schema
MessageExtractor  | Have chat history with SQL      | Pairs from conversation
SqlExtractor      | Have SQL logs, need questions   | Questions for existing SQL

When to use each evolver

Evolver        | Use When              | Effect
BreadthEvolver | Need phrasing variety | Same SQL, different questions
DepthEvolver   | Need query complexity | More complex SQL and questions

When to use each generator

Generator          | Use When                      | Effect
PersonaGenerator   | Want realistic user questions | Personas for SchemaSynthesizer
TeachingsGenerator | Want domain-aware questions   | Business rules for synthesis

When to use each decorator

Decorator            | Use When            | Effect
ValidatedProducer    | Need valid SQL      | Validates/executes queries
FilteredProducer     | Need subset of data | Filters by tables, success
DeduplicatedProducer | Have duplicates     | Removes duplicate pairs
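
A note on the decorators: each one wraps any object exposing an async-generator produce() method, which is why the scenarios above re-wrap in-memory arrays with the inline produce/yield pattern. If that idiom gets repetitive, a small helper (hypothetical, not part of the library) keeps pipelines readable:

// Hypothetical helper: expose an in-memory array of pairs as a producer,
// equivalent to writing { produce: async function* () { yield pairs; } } inline
function fromArray<T>(pairs: T[]) {
  return { produce: async function* () { yield pairs; } };
}

// Usage mirrors the inline form used in the scenarios above
const deduped = await toPairs(
  new DeduplicatedProducer(fromArray(varied), { mode: 'sql' })
);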

Exporting Datasets

Once you have pairs, export for training:

import fs from 'node:fs/promises';

// JSONL format (common for fine-tuning)
const jsonl = pairs
  .filter(p => p.success)
  .map(p => JSON.stringify({
    messages: [
      { role: 'user', content: p.question },
      { role: 'assistant', content: p.sql },
    ],
  }))
  .join('\n');

await fs.writeFile('training.jsonl', jsonl);

// CSV format
const csv = [
  'question,sql',
  ...pairs.map(p => `"${p.question.replace(/"/g, '""')}","${p.sql.replace(/"/g, '""')}"`)
].join('\n');

await fs.writeFile('training.csv', csv);

// With context (for multi-turn)
const withContext = pairs
  .filter(p => p.success)
  .map(p => ({
    context: p.context || [],
    question: p.question,
    sql: p.sql,
  }));

await fs.writeFile('training.json', JSON.stringify(withContext, null, 2));

Best Practices

  1. Start with validation - Always validate pairs against your current schema before using them
  2. Deduplicate early - Remove duplicates before evolution to save LLM calls (see the sketch after this list)
  3. Balance complexity - Mix simple and complex queries for robust training
  4. Use personas - Personas make questions more realistic and varied
  5. Iterate on quality - Review samples at each stage, adjust parameters
  6. Monitor success rates - Low success rates indicate schema mismatches
  7. Combine evolvers - Use DepthEvolver for complexity, BreadthEvolver for variety
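
For example, best practice 2 as a sketch (assuming extracted holds pairs from one of the sources above): deduplicate before evolving, so the LLM-backed evolver never sees duplicate inputs.

// Deduplicate first (cheap), then evolve (LLM-backed, expensive)
const unique = await toPairs(
  new DeduplicatedProducer(
    { produce: async function* () { yield extracted; } },
    { mode: 'exact' }
  )
);

// Evolution now runs only on unique pairs
const evolved = await toPairs(new BreadthEvolver(unique, { count: 2 }));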