From Database Schema
Synthesize training pairs from your database schema
The SchemaSynthesizer generates question/SQL pairs directly from your database schema. This is useful for bootstrapping training data when you don't have existing queries or chat history.
Basic Usage
import { SchemaSynthesizer } from '@deepagents/text2sql/synthesis';
const pairs = await new SchemaSynthesizer(adapter, {
  count: 50,
  complexity: 'medium',
}).toPairs();
How It Works
SchemaSynthesizer reads your database schema and generates realistic pairs from it, exposing them through an async generator:
- Loads schema via the adapter (tables, columns, relationships)
- Analyzes table structures and relationships
- Generates natural language questions at specified complexity
- Produces SQL that answers each question
- Yields pairs in chunks via async *produce()
The synthesizer implements an async generator that yields pairs incrementally, allowing for efficient processing of large datasets.
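The chunked, incremental delivery described above can be sketched as follows. This is a minimal illustration of the pattern, not SchemaSynthesizer's actual implementation: produceInChunks, collect, and the Pair shape are hypothetical names, and the pairs are placeholders rather than model output.

```typescript
// Illustrative only: a stand-in for the synthesizer's generator, not the library's code.
interface Pair {
  question: string;
  sql: string;
}

// Yields `count` placeholder pairs in chunks of at most `chunkSize`.
async function* produceInChunks(
  count: number,
  chunkSize: number,
): AsyncGenerator<Pair[]> {
  if (chunkSize < 1) {
    throw new RangeError('chunkSize must be at least 1');
  }
  for (let start = 0; start < count; start += chunkSize) {
    const size = Math.min(chunkSize, count - start);
    const chunk: Pair[] = [];
    for (let i = 0; i < size; i++) {
      const n = start + i + 1;
      chunk.push({ question: `Question ${n}`, sql: `SELECT ${n}` });
    }
    // The consumer can process this chunk before the next one is built.
    yield chunk;
  }
}

// Drains any chunked async iterable into one flat array
// (presumably similar in spirit to what a toPairs()-style helper does).
async function collect(chunks: AsyncIterable<Pair[]>): Promise<Pair[]> {
  const all: Pair[] = [];
  for await (const chunk of chunks) {
    all.push(...chunk);
  }
  return all;
}
```

Iterating the generator directly with for await lets each chunk be handled as it arrives, for example by appending it to a file, while a collector such as the one above is simpler when the whole result fits in memory.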
Configuration Options
interface SchemaSynthesizerOptions {
  /** Number of pairs to generate */
  count: number;
  /** Complexity level(s) to generate */
  complexity?: QuestionComplexity | QuestionComplexity[];
  /** Personas to guide question generation */
  personas?: Persona[];
  /** Domain knowledge to incorporate */
  teachings?: Teachables[];
  /** Model to use for generation */
  model?: AgentModel;
  /** Number of concurrent generations (default: 5) */
  concurrency?: number;
}
type QuestionComplexity = 'low' | 'medium' | 'hard' | 'window';
interface Persona {
  role: string;
  perspective: string;
}
Complexity Levels
Control the sophistication of generated queries:
// Low complexity (single table, basic filters)
const low = await new SchemaSynthesizer(adapter, {
  count: 20,
  complexity: 'low',
}).toPairs();
// "How many customers are there?"
// "Show all products"
// Medium complexity (joins, aggregations)
const medium = await new SchemaSynthesizer(adapter, {
  count: 20,
  complexity: 'medium',
}).toPairs();
// "What is the total revenue by product category?"
// "Which customers have placed more than 5 orders?"
// Hard complexity (multi-join, subqueries, CTEs)
const hard = await new SchemaSynthesizer(adapter, {
  count: 20,
  complexity: 'hard',
}).toPairs();
// "What is the month-over-month growth rate for each product category?"
// "Which customers have a higher than average order value?"
// Window functions (advanced SQL with window functions)
const window = await new SchemaSynthesizer(adapter, {
  count: 20,
  complexity: 'window',
}).toPairs();
// "Show running total of sales for each product"
// "Rank customers by their total order value"
Mixed Complexity
Generate a balanced dataset:
const pairs = await new SchemaSynthesizer(adapter, {
  count: 80,
  complexity: ['low', 'medium', 'hard', 'window'], // 20 of each
}).toPairs();
Working with Personas
Personas guide question generation to match different user perspectives:
import { SchemaSynthesizer, PersonaGenerator } from '@deepagents/text2sql/synthesis';
// Generate personas first
const personaGen = new PersonaGenerator(adapter, { count: 5 });
const personas = await personaGen.generate();
// Use personas in synthesis
const pairs = await new SchemaSynthesizer(adapter, {
  count: 10,
  complexity: ['low', 'medium', 'hard'],
  personas,
}).toPairs();
You can also define personas manually:
const customPersonas = [
  { role: 'Sales Manager', perspective: 'Focus on revenue, sales trends, and customer acquisition' },
  { role: 'Data Analyst', perspective: 'Focus on data quality, statistical analysis, and patterns' },
  { role: 'Product Manager', perspective: 'Focus on product performance, user engagement, and features' },
];
const pairs = await new SchemaSynthesizer(adapter, {
  count: 30,
  personas: customPersonas,
}).toPairs();
Integration with Teachings
Teachings provide domain-specific knowledge to improve question quality:
import { TeachingsGenerator, SchemaSynthesizer } from '@deepagents/text2sql/synthesis';
const teachGen = new TeachingsGenerator(adapter);
const teachings = await teachGen.generate();
const pairs = await new SchemaSynthesizer(adapter, {
  count: 50,
  teachings,
}).toPairs();
Teachings help the synthesizer understand:
- Business logic and domain terminology
- Common patterns and workflows
- Important metrics and KPIs
- Typical user questions and use cases
Concurrency Control
Control how many pairs are generated in parallel:
// Default concurrency (5)
const pairs = await new SchemaSynthesizer(adapter, {
  count: 100,
}).toPairs();
// Higher concurrency for faster generation
const fastPairs = await new SchemaSynthesizer(adapter, {
  count: 100,
  concurrency: 10,
}).toPairs();
// Lower concurrency to stay within API rate limits
const slowPairs = await new SchemaSynthesizer(adapter, {
  count: 100,
  concurrency: 2,
}).toPairs();
Custom Model Selection
Specify a custom model for generation:
import { SchemaSynthesizer } from '@deepagents/text2sql/synthesis';
import { groq } from '@deepagents/agent';
const pairs = await new SchemaSynthesizer(adapter, {
  count: 50,
  model: groq('gpt-oss-20b'),
}).toPairs();
Example: Bootstrap Training Set
import fs from 'node:fs/promises';
import { SchemaSynthesizer } from '@deepagents/text2sql/synthesis';
// Generate diverse training data
const pairs = await new SchemaSynthesizer(adapter, {
  count: 100,
  complexity: ['low', 'medium', 'hard'],
  concurrency: 8,
}).toPairs();
// Export as JSONL for fine-tuning (one JSON object per line, successful pairs only)
const jsonl = pairs
  .filter(p => p.success)
  .map(p => JSON.stringify({ question: p.question, sql: p.sql }))
  .join('\n');
await fs.writeFile('training-data.jsonl', jsonl);
Example: Persona-Driven Generation
Generate questions from different user perspectives:
import { SchemaSynthesizer, PersonaGenerator } from '@deepagents/text2sql/synthesis';
// Generate personas
const personaGen = new PersonaGenerator(adapter, { count: 3 });
const personas = await personaGen.generate();
// Generate pairs with personas
const pairs = await new SchemaSynthesizer(adapter, {
  count: 60,
  complexity: ['low', 'medium', 'hard'],
  personas,
  concurrency: 10,
}).toPairs();
// Questions will reflect different user perspectives
// e.g., "As a sales manager, show me top performing products this quarter"
Example: Domain-Aware Generation
Incorporate domain knowledge:
import {
  SchemaSynthesizer,
  TeachingsGenerator,
  PersonaGenerator,
} from '@deepagents/text2sql/synthesis';
// Generate domain knowledge and personas
const teachGen = new TeachingsGenerator(adapter);
const teachings = await teachGen.generate();
const personaGen = new PersonaGenerator(adapter, { count: 5 });
const personas = await personaGen.generate();
// Generate pairs with full context
const pairs = await new SchemaSynthesizer(adapter, {
  count: 100,
  complexity: ['low', 'medium', 'hard', 'window'],
  personas,
  teachings,
  concurrency: 8,
}).toPairs();
Combining with Breadth Evolution
Expand synthetic data with paraphrases:
import { SchemaSynthesizer, BreadthEvolver } from '@deepagents/text2sql/synthesis';
// Generate base pairs
const basePairs = await new SchemaSynthesizer(adapter, {
  count: 30,
  complexity: ['low', 'medium', 'hard'],
}).toPairs();
// Expand with variations
const expanded = await new BreadthEvolver(basePairs, { count: 3 }).toPairs();
// You now have ~90 variations (30 base × 3 paraphrases each)
Best Practices
- Start with schema quality - Better schema documentation (column comments, constraints) leads to better questions
- Mix complexity levels - A balanced dataset trains better than all-simple or all-complex
- Use personas - Generate personas to create diverse, realistic questions from different user perspectives
- Incorporate teachings - Add domain knowledge to improve question relevance and accuracy
- Adjust concurrency - Balance generation speed with API rate limits and resource constraints
- Validate output - Review a sample of generated pairs for quality
- Iterate - Generate, review, adjust complexity/personas/teachings, regenerate
- Supplement with real data - Synthetic data bootstraps; real user queries improve accuracy
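For the "Validate output" practice, a small helper can narrow a large run down to a sample that is practical to review by hand. The sketch below assumes pairs carry the question, sql, and success fields used in the export example; dedupeSuccessful and reviewSample are hypothetical helpers written for illustration, not part of the library.

```typescript
// Assumed pair shape, based on the fields used in the JSONL export example.
interface GeneratedPair {
  question: string;
  sql: string;
  success: boolean;
}

// Keeps successful pairs and drops questions that differ only by case or whitespace.
function dedupeSuccessful(pairs: GeneratedPair[]): GeneratedPair[] {
  const seen = new Set<string>();
  const kept: GeneratedPair[] = [];
  for (const pair of pairs) {
    if (!pair.success) continue;
    const key = pair.question.trim().toLowerCase().replace(/\s+/g, ' ');
    if (seen.has(key)) continue;
    seen.add(key);
    kept.push(pair);
  }
  return kept;
}

// Picks evenly spaced pairs so the sample spans the whole run.
// Deterministic on purpose: the same input always gives the same sample.
function reviewSample(pairs: GeneratedPair[], sampleSize: number): GeneratedPair[] {
  if (sampleSize <= 0 || pairs.length === 0) return [];
  if (sampleSize >= pairs.length) return [...pairs];
  const step = pairs.length / sampleSize;
  const sample: GeneratedPair[] = [];
  for (let i = 0; i < sampleSize; i++) {
    sample.push(pairs[Math.floor(i * step)]);
  }
  return sample;
}
```

Running reviewSample(dedupeSuccessful(pairs), 20) before exporting gives a fixed-size set to read through. Evenly spaced selection is used here instead of random sampling so that repeated reviews of the same run look at the same pairs.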