Deep Agents

From Database Schema

Synthesize training pairs from your database schema

The SchemaSynthesizer generates question/SQL pairs directly from your database schema. This is useful for bootstrapping training data when you don't have existing queries or chat history.

Basic Usage

import { SchemaSynthesizer } from '@deepagents/text2sql/synthesis';

const pairs = await new SchemaSynthesizer(adapter, {
  count: 50,
  complexity: 'medium',
}).toPairs();

How It Works

SchemaSynthesizer inspects your database schema and generates realistic pairs via an AsyncGenerator pattern:

  1. Loads schema via the adapter (tables, columns, relationships)
  2. Analyzes table structures and relationships
  3. Generates natural language questions at specified complexity
  4. Produces SQL that answers each question
  5. Yields pairs in chunks via async *produce()

The synthesizer implements an async generator that yields pairs incrementally, allowing for efficient processing of large datasets.
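The chunked async-generator flow described above can be sketched in plain TypeScript. This is an illustrative mock, not the library's internals: the names `produce` and `toPairs` mirror the docs, but the bodies here just fabricate placeholder pairs to show how yielding in chunks and draining into an array fit together.

```typescript
// Minimal sketch of the chunked async-generator pattern (illustrative only).
type Pair = { question: string; sql: string };

async function* produce(count: number, chunkSize = 2): AsyncGenerator<Pair[]> {
  for (let i = 0; i < count; i += chunkSize) {
    const chunk: Pair[] = [];
    for (let j = i; j < Math.min(i + chunkSize, count); j++) {
      // A real synthesizer would call an LLM here; we fabricate placeholders.
      chunk.push({ question: `question ${j}`, sql: `SELECT ${j}` });
    }
    yield chunk; // each yield hands back one generated batch
  }
}

// toPairs() drains the generator into a flat array.
async function toPairs(count: number): Promise<Pair[]> {
  const pairs: Pair[] = [];
  for await (const chunk of produce(count)) pairs.push(...chunk);
  return pairs;
}
```

Because batches are yielded incrementally, a consumer can also `for await` over `produce()` directly and start validating or persisting pairs before the full run completes.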

Configuration Options

interface SchemaSynthesizerOptions {
  /** Number of pairs to generate */
  count: number;
  /** Complexity level(s) to generate */
  complexity?: QuestionComplexity | QuestionComplexity[];
  /** Personas to guide question generation */
  personas?: Persona[];
  /** Domain knowledge to incorporate */
  teachings?: Teachables[];
  /** Model to use for generation */
  model?: AgentModel;
  /** Number of concurrent generations (default: 5) */
  concurrency?: number;
}

type QuestionComplexity = 'low' | 'medium' | 'hard' | 'window';

interface Persona {
  role: string;
  perspective: string;
}

Complexity Levels

Control the sophistication of generated queries:

// Low complexity (single table, basic filters)
const low = await new SchemaSynthesizer(adapter, {
  count: 20,
  complexity: 'low',
}).toPairs();
// "How many customers are there?"
// "Show all products"

// Medium complexity (joins, aggregations)
const medium = await new SchemaSynthesizer(adapter, {
  count: 20,
  complexity: 'medium',
}).toPairs();
// "What is the total revenue by product category?"
// "Which customers have placed more than 5 orders?"

// Hard complexity (multi-join, subqueries, CTEs)
const hard = await new SchemaSynthesizer(adapter, {
  count: 20,
  complexity: 'hard',
}).toPairs();
// "What is the month-over-month growth rate for each product category?"
// "Which customers have a higher than average order value?"

// Window functions (advanced SQL with window functions)
const window = await new SchemaSynthesizer(adapter, {
  count: 20,
  complexity: 'window',
}).toPairs();
// "Show running total of sales for each product"
// "Rank customers by their total order value"

Mixed Complexity

Generate a balanced dataset:

const pairs = await new SchemaSynthesizer(adapter, {
  count: 80,
  complexity: ['low', 'medium', 'hard', 'window'], // 20 of each
}).toPairs();

Working with Personas

Personas guide question generation to match different user perspectives:

import { SchemaSynthesizer, PersonaGenerator } from '@deepagents/text2sql/synthesis';

// Generate personas first
const personaGen = new PersonaGenerator(adapter, { count: 5 });
const personas = await personaGen.generate();

// Use personas in synthesis
const pairs = await new SchemaSynthesizer(adapter, {
  count: 10,
  complexity: ['low', 'medium', 'hard'],
  personas,
}).toPairs();

You can also define personas manually:

const customPersonas = [
  { role: 'Sales Manager', perspective: 'Focus on revenue, sales trends, and customer acquisition' },
  { role: 'Data Analyst', perspective: 'Focus on data quality, statistical analysis, and patterns' },
  { role: 'Product Manager', perspective: 'Focus on product performance, user engagement, and features' },
];

const pairs = await new SchemaSynthesizer(adapter, {
  count: 30,
  personas: customPersonas,
}).toPairs();

Integration with Teachings

Teachings provide domain-specific knowledge to improve question quality:

import { TeachingsGenerator, SchemaSynthesizer } from '@deepagents/text2sql/synthesis';

const teachGen = new TeachingsGenerator(adapter);
const teachings = await teachGen.generate();

const pairs = await new SchemaSynthesizer(adapter, {
  count: 50,
  teachings,
}).toPairs();

Teachings help the synthesizer understand:

  • Business logic and domain terminology
  • Common patterns and workflows
  • Important metrics and KPIs
  • Typical user questions and use cases
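If you want to see what domain knowledge entries might look like before wiring up `TeachingsGenerator`, here is a hand-written sketch. The `Teaching` shape below is hypothetical — the real `Teachables` type is defined in `@deepagents/text2sql/synthesis`, so check the package's exported types for the actual fields.

```typescript
// Hypothetical shape for illustration; the real Teachables type may differ.
type Teaching = { topic: string; guidance: string };

const teachings: Teaching[] = [
  {
    topic: 'revenue',
    guidance: '"Revenue" means SUM(order_items.price * quantity), net of refunds.',
  },
  {
    topic: 'active customer',
    guidance: 'A customer counts as active if they placed an order in the last 90 days.',
  },
];
```

Encoding business definitions like these keeps generated questions and SQL consistent with how your organization actually measures things.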

Concurrency Control

Control how many pairs are generated in parallel:

// Default concurrency (5)
const pairs = await new SchemaSynthesizer(adapter, {
  count: 100,
}).toPairs();

// Higher concurrency for faster generation
const fastPairs = await new SchemaSynthesizer(adapter, {
  count: 100,
  concurrency: 10,
}).toPairs();

// Lower concurrency to stay under API rate limits
const slowPairs = await new SchemaSynthesizer(adapter, {
  count: 100,
  concurrency: 2,
}).toPairs();
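To make the concurrency knob concrete, here is a self-contained sketch of the worker-pool pattern such a limit typically implies: at most `limit` generation tasks are in flight at once. This is an assumption about the mechanism, not the synthesizer's actual implementation.

```typescript
// Run fn over items with at most `limit` tasks in flight (illustrative sketch).
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker() {
    while (next < items.length) {
      const i = next++;       // claim the next index
      results[i] = await fn(items[i]);
    }
  }
  // Spawn min(limit, items.length) workers that share the queue.
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, worker),
  );
  return results;
}
```

Raising the limit speeds up wall-clock time but multiplies simultaneous model calls, which is why lower values help when you are bumping into provider rate limits.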

Custom Model Selection

Specify a custom model for generation:

import { SchemaSynthesizer } from '@deepagents/text2sql/synthesis';
import { groq } from '@deepagents/agent';

const pairs = await new SchemaSynthesizer(adapter, {
  count: 50,
  model: groq('gpt-oss-20b'),
}).toPairs();

Example: Bootstrap Training Set

import { SchemaSynthesizer } from '@deepagents/text2sql/synthesis';
import fs from 'node:fs/promises';

// Generate diverse training data
const pairs = await new SchemaSynthesizer(adapter, {
  count: 100,
  complexity: ['low', 'medium', 'hard'],
  concurrency: 8,
}).toPairs();

// Export as JSONL for fine-tuning
const jsonl = pairs
  .filter(p => p.success)
  .map(p => JSON.stringify({ question: p.question, sql: p.sql }))
  .join('\n');

await fs.writeFile('training-data.jsonl', jsonl);

Example: Persona-Driven Generation

Generate questions from different user perspectives:

import { SchemaSynthesizer, PersonaGenerator } from '@deepagents/text2sql/synthesis';

// Generate personas
const personaGen = new PersonaGenerator(adapter, { count: 3 });
const personas = await personaGen.generate();

// Generate pairs with personas
const pairs = await new SchemaSynthesizer(adapter, {
  count: 60,
  complexity: ['low', 'medium', 'hard'],
  personas,
  concurrency: 10,
}).toPairs();

// Questions will reflect different user perspectives
// e.g., "As a sales manager, show me top performing products this quarter"

Example: Domain-Aware Generation

Incorporate domain knowledge:

import {
  SchemaSynthesizer,
  TeachingsGenerator,
  PersonaGenerator,
} from '@deepagents/text2sql/synthesis';

// Generate domain knowledge and personas
const teachGen = new TeachingsGenerator(adapter);
const teachings = await teachGen.generate();

const personaGen = new PersonaGenerator(adapter, { count: 5 });
const personas = await personaGen.generate();

// Generate pairs with full context
const pairs = await new SchemaSynthesizer(adapter, {
  count: 100,
  complexity: ['low', 'medium', 'hard', 'window'],
  personas,
  teachings,
  concurrency: 8,
}).toPairs();

Combining with Breadth Evolution

Expand synthetic data with paraphrases:

import { SchemaSynthesizer, BreadthEvolver } from '@deepagents/text2sql/synthesis';

// Generate base pairs
const basePairs = await new SchemaSynthesizer(adapter, {
  count: 30,
  complexity: ['low', 'medium', 'hard'],
}).toPairs();

// Expand with variations
const expanded = await new BreadthEvolver(basePairs, { count: 3 }).toPairs();

// Now have ~90 variations (30 base × 3 paraphrases each)

Best Practices

  1. Start with schema quality - Better schema documentation (column comments, constraints) leads to better questions
  2. Mix complexity levels - A balanced dataset trains better than all-simple or all-complex
  3. Use personas - Generate personas to create diverse, realistic questions from different user perspectives
  4. Incorporate teachings - Add domain knowledge to improve question relevance and accuracy
  5. Adjust concurrency - Balance generation speed with API rate limits and resource constraints
  6. Validate output - Review a sample of generated pairs for quality
  7. Iterate - Generate, review, adjust complexity/personas/teachings, regenerate
  8. Supplement with real data - Synthetic data bootstraps; real user queries improve accuracy
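For best practice #6 (validate output), a quick way to review quality is to pull a random sample of generated pairs and eyeball them. The helper below is a generic sketch, not part of the library:

```typescript
// Pull n random pairs for manual review (illustrative helper).
type Pair = { question: string; sql: string };

function samplePairs(pairs: Pair[], n: number): Pair[] {
  // Cheap shuffle; slight bias is fine for an eyeball sample.
  const shuffled = [...pairs].sort(() => Math.random() - 0.5);
  return shuffled.slice(0, n);
}
```

Reviewing even 10-20 sampled pairs per run usually surfaces systematic issues (wrong join keys, misread column semantics) before they propagate into training data.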