Deep Agents

From Database Schema

Synthesize training pairs from your database schema

The SchemaSynthesizer generates question/SQL pairs directly from your database schema. This is useful for bootstrapping training data when you don't have existing queries or chat history.

Basic Usage

import { SchemaSynthesizer } from '@deepagents/text2sql/synthesis';

const pairs = await new SchemaSynthesizer(adapter, {
  count: 50,
  complexity: 'medium',
}).toPairs();

How It Works

SchemaSynthesizer inspects your database schema and generates realistic pairs via an AsyncGenerator pattern:

  1. Loads schema via the adapter (tables, columns, relationships)
  2. Analyzes table structures and relationships
  3. Generates natural language questions at specified complexity
  4. Produces SQL that answers each question
  5. Yields pairs in chunks via async *produce()

The synthesizer implements an async generator that yields pairs incrementally, allowing for efficient processing of large datasets.
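The chunked async-generator flow described above can be sketched in plain TypeScript. This is an illustrative mock, not the library's internals: the names `produce` and `toPairs` mirror the docs, but the bodies here just fabricate placeholder pairs to show how yielding in chunks and draining into an array fit together.

```typescript
// Minimal sketch of the chunked async-generator pattern (illustrative only).
type Pair = { question: string; sql: string };

async function* produce(count: number, chunkSize = 2): AsyncGenerator<Pair[]> {
  for (let i = 0; i < count; i += chunkSize) {
    const chunk: Pair[] = [];
    for (let j = i; j < Math.min(i + chunkSize, count); j++) {
      // A real synthesizer would call an LLM here; we fabricate placeholders.
      chunk.push({ question: `question ${j}`, sql: `SELECT ${j}` });
    }
    yield chunk; // each yield hands back one generated batch
  }
}

// toPairs() drains the generator into a flat array.
async function toPairs(count: number): Promise<Pair[]> {
  const pairs: Pair[] = [];
  for await (const chunk of produce(count)) pairs.push(...chunk);
  return pairs;
}
```

Because batches are yielded incrementally, a consumer can also `for await` over `produce()` directly and start validating or persisting pairs before the full run completes.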

Configuration Options

interface SchemaSynthesizerOptions {
  /** Number of pairs to generate */
  count: number;
  /** Complexity level(s) to generate */
  complexity?: QuestionComplexity | QuestionComplexity[];
  /** Personas to guide question generation */
  personas?: Persona[];
  /** Domain knowledge to incorporate */
  teachings?: Teachables[];
  /** Model to use for generation */
  model?: AgentModel;
  /** Number of concurrent generations (default: 5) */
  concurrency?: number;
}

type QuestionComplexity = 'low' | 'medium' | 'hard' | 'window';

interface Persona {
  role: string;
  perspective: string;
}

Complexity Levels

Control the sophistication of generated queries:

// Low complexity (single table, basic filters)
const low = await new SchemaSynthesizer(adapter, {
  count: 20,
  complexity: 'low',
}).toPairs();
// "How many customers are there?"
// "Show all products"

// Medium complexity (joins, aggregations)
const medium = await new SchemaSynthesizer(adapter, {
  count: 20,
  complexity: 'medium',
}).toPairs();
// "What is the total revenue by product category?"
// "Which customers have placed more than 5 orders?"

// Hard complexity (multi-join, subqueries, CTEs)
const hard = await new SchemaSynthesizer(adapter, {
  count: 20,
  complexity: 'hard',
}).toPairs();
// "What is the month-over-month growth rate for each product category?"
// "Which customers have a higher than average order value?"

// Window functions (advanced SQL with window functions)
const window = await new SchemaSynthesizer(adapter, {
  count: 20,
  complexity: 'window',
}).toPairs();
// "Show running total of sales for each product"
// "Rank customers by their total order value"

Mixed Complexity

Generate a balanced dataset:

const pairs = await new SchemaSynthesizer(adapter, {
  count: 80,
  complexity: ['low', 'medium', 'hard', 'window'], // 20 of each
}).toPairs();

Working with Personas

Personas guide question generation to match different user perspectives:

import { SchemaSynthesizer, PersonaGenerator } from '@deepagents/text2sql/synthesis';

// Generate personas first
const personaGen = new PersonaGenerator(adapter, { count: 5 });
const personas = await personaGen.generate();

// Use personas in synthesis
const pairs = await new SchemaSynthesizer(adapter, {
  count: 10,
  complexity: ['low', 'medium', 'hard'],
  personas,
}).toPairs();

You can also define personas manually:

const customPersonas = [
  { role: 'Sales Manager', perspective: 'Focus on revenue, sales trends, and customer acquisition' },
  { role: 'Data Analyst', perspective: 'Focus on data quality, statistical analysis, and patterns' },
  { role: 'Product Manager', perspective: 'Focus on product performance, user engagement, and features' },
];

const pairs = await new SchemaSynthesizer(adapter, {
  count: 30,
  personas: customPersonas,
}).toPairs();

Integration with Teachings

Teachings provide domain-specific knowledge to improve question quality:

import { TeachingsGenerator, SchemaSynthesizer } from '@deepagents/text2sql/synthesis';

const teachGen = new TeachingsGenerator(adapter);
const teachings = await teachGen.generate();

const pairs = await new SchemaSynthesizer(adapter, {
  count: 50,
  teachings,
}).toPairs();

Teachings help the synthesizer understand:

  • Business logic and domain terminology
  • Common patterns and workflows
  • Important metrics and KPIs
  • Typical user questions and use cases
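If you want to see what domain knowledge entries might look like before wiring up `TeachingsGenerator`, here is a hand-written sketch. The `Teaching` shape below is hypothetical — the real `Teachables` type is defined in `@deepagents/text2sql/synthesis`, so check the package's exported types for the actual fields.

```typescript
// Hypothetical shape for illustration; the real Teachables type may differ.
type Teaching = { topic: string; guidance: string };

const teachings: Teaching[] = [
  {
    topic: 'revenue',
    guidance: '"Revenue" means SUM(order_items.price * quantity), net of refunds.',
  },
  {
    topic: 'active customer',
    guidance: 'A customer counts as active if they placed an order in the last 90 days.',
  },
];
```

Encoding business definitions like these keeps generated questions and SQL consistent with how your organization actually measures things.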

Concurrency Control

Control how many pairs are generated in parallel:

// Default concurrency (5)
const pairs = await new SchemaSynthesizer(adapter, {
  count: 100,
}).toPairs();

// Higher concurrency for faster generation
const fastPairs = await new SchemaSynthesizer(adapter, {
  count: 100,
  concurrency: 10,
}).toPairs();

// Lower concurrency to stay under API rate limits
const slowPairs = await new SchemaSynthesizer(adapter, {
  count: 100,
  concurrency: 2,
}).toPairs();
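To make the concurrency knob concrete, here is a self-contained sketch of the worker-pool pattern such a limit typically implies: at most `limit` generation tasks are in flight at once. This is an assumption about the mechanism, not the synthesizer's actual implementation.

```typescript
// Run fn over items with at most `limit` tasks in flight (illustrative sketch).
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker() {
    while (next < items.length) {
      const i = next++;       // claim the next index
      results[i] = await fn(items[i]);
    }
  }
  // Spawn min(limit, items.length) workers that share the queue.
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, worker),
  );
  return results;
}
```

Raising the limit speeds up wall-clock time but multiplies simultaneous model calls, which is why lower values help when you are bumping into provider rate limits.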

Custom Model Selection

Specify a custom model for generation:

import { SchemaSynthesizer } from '@deepagents/text2sql/synthesis';
import { groq } from '@deepagents/agent';

const pairs = await new SchemaSynthesizer(adapter, {
  count: 50,
  model: groq('gpt-oss-20b'),
}).toPairs();

Example: Bootstrap Training Set

import { SchemaSynthesizer } from '@deepagents/text2sql/synthesis';
import fs from 'node:fs/promises';

// Generate diverse training data
const pairs = await new SchemaSynthesizer(adapter, {
  count: 100,
  complexity: ['low', 'medium', 'hard'],
  concurrency: 8,
}).toPairs();

// Export as JSONL for fine-tuning
const jsonl = pairs
  .filter(p => p.success)
  .map(p => JSON.stringify({ question: p.question, sql: p.sql }))
  .join('\n');

await fs.writeFile('training-data.jsonl', jsonl);

Example: Persona-Driven Generation

Generate questions from different user perspectives:

import { SchemaSynthesizer, PersonaGenerator } from '@deepagents/text2sql/synthesis';

// Generate personas
const personaGen = new PersonaGenerator(adapter, { count: 3 });
const personas = await personaGen.generate();

// Generate pairs with personas
const pairs = await new SchemaSynthesizer(adapter, {
  count: 60,
  complexity: ['low', 'medium', 'hard'],
  personas,
  concurrency: 10,
}).toPairs();

// Questions will reflect different user perspectives
// e.g., "As a sales manager, show me top performing products this quarter"

Example: Domain-Aware Generation

Incorporate domain knowledge:

import {
  SchemaSynthesizer,
  TeachingsGenerator,
  PersonaGenerator,
} from '@deepagents/text2sql/synthesis';

// Generate domain knowledge and personas
const teachGen = new TeachingsGenerator(adapter);
const teachings = await teachGen.generate();

const personaGen = new PersonaGenerator(adapter, { count: 5 });
const personas = await personaGen.generate();

// Generate pairs with full context
const pairs = await new SchemaSynthesizer(adapter, {
  count: 100,
  complexity: ['low', 'medium', 'hard', 'window'],
  personas,
  teachings,
  concurrency: 8,
}).toPairs();

Combining with Breadth Evolution

Expand synthetic data with paraphrases:

import { SchemaSynthesizer, BreadthEvolver } from '@deepagents/text2sql/synthesis';

// Generate base pairs
const basePairs = await new SchemaSynthesizer(adapter, {
  count: 30,
  complexity: ['low', 'medium', 'hard'],
}).toPairs();

// Expand with variations
const expanded = await new BreadthEvolver(basePairs, { count: 3 }).toPairs();

// Now have ~90 variations (30 base × 3 paraphrases each)

Best Practices

  1. Start with schema quality - Better schema documentation (column comments, constraints) leads to better questions
  2. Mix complexity levels - A balanced dataset trains better than all-simple or all-complex
  3. Use personas - Generate personas to create diverse, realistic questions from different user perspectives
  4. Incorporate teachings - Add domain knowledge to improve question relevance and accuracy
  5. Adjust concurrency - Balance generation speed with API rate limits and resource constraints
  6. Validate output - Review a sample of generated pairs for quality
  7. Iterate - Generate, review, adjust complexity/personas/teachings, regenerate
  8. Supplement with real data - Synthetic data bootstraps; real user queries improve accuracy
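For best practice #6 (validate output), a quick way to review quality is to pull a random sample of generated pairs and eyeball them. The helper below is a generic sketch, not part of the library:

```typescript
// Pull n random pairs for manual review (illustrative helper).
type Pair = { question: string; sql: string };

function samplePairs(pairs: Pair[], n: number): Pair[] {
  // Cheap shuffle; slight bias is fine for an eyeball sample.
  const shuffled = [...pairs].sort(() => Math.random() - 0.5);
  return shuffled.slice(0, n);
}
```

Reviewing even 10-20 sampled pairs per run usually surfaces systematic issues (wrong join keys, misread column semantics) before they propagate into training data.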