Training Data Overview
Generate question/SQL pairs for fine-tuning and evaluation
The synthesis module helps you generate training data for fine-tuning models or evaluating Text2SQL performance. It produces ExtractedPair objects containing natural language questions paired with their SQL queries.
Use Cases
- Fine-tuning - Generate training pairs to improve model accuracy on your schema
- Evaluation - Create test sets to measure query generation quality
- Data augmentation - Expand existing pairs with variations and paraphrases
Core Concepts
ExtractedPair
Every producer outputs the same structure:
interface ExtractedPair {
  question: string; // Natural language question
  sql: string; // SQL query that answers the question
  context?: string[]; // Preceding messages (for multi-turn)
  success: boolean; // Whether the query is valid/executed successfully
}
PairProducer
All extractors and synthesizers implement this abstract class:
abstract class PairProducer<T extends ExtractedPair> {
  abstract produce(): AsyncGenerator<T[], void, unknown>;
}
The streaming pattern yields chunks of pairs, allowing for efficient memory usage and real-time processing.
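To make the streaming pattern concrete, here is a minimal sketch of a custom producer. The ExtractedPair and PairProducer declarations are local stand-ins that mirror the definitions above so the snippet runs on its own. StaticProducer, its chunkSize parameter, and the chunkSizes helper are hypothetical names used only for illustration and are not part of the library.

```typescript
// Local stand-ins that mirror the definitions above, so the sketch runs on its own.
interface ExtractedPair {
  question: string;
  sql: string;
  context?: string[];
  success: boolean;
}

abstract class PairProducer<T extends ExtractedPair> {
  abstract produce(): AsyncGenerator<T[], void, unknown>;
}

// Hypothetical producer: streams an in-memory list of pairs in fixed-size chunks.
class StaticProducer extends PairProducer<ExtractedPair> {
  private readonly pairs: ExtractedPair[];
  private readonly chunkSize: number;

  constructor(pairs: ExtractedPair[], chunkSize = 2) {
    super();
    this.pairs = pairs;
    this.chunkSize = Math.max(1, chunkSize);
  }

  async *produce(): AsyncGenerator<ExtractedPair[], void, unknown> {
    for (let i = 0; i < this.pairs.length; i += this.chunkSize) {
      // Each chunk is sliced only when the consumer asks for the next one.
      yield this.pairs.slice(i, i + this.chunkSize);
    }
  }
}

// Consume chunk by chunk and record how many pairs arrived in each chunk.
async function chunkSizes(producer: PairProducer<ExtractedPair>): Promise<number[]> {
  const sizes: number[] = [];
  for await (const chunk of producer.produce()) {
    sizes.push(chunk.length);
  }
  return sizes;
}

// Illustrative data only; the questions and SQL are made up.
const demoPairs: ExtractedPair[] = [
  { question: 'How many users are there?', sql: 'SELECT COUNT(*) FROM users', success: true },
  { question: 'List all orders', sql: 'SELECT * FROM orders', success: true },
  { question: 'Show every product', sql: 'SELECT * FROM products', success: true },
];
```

Because produce() is an async generator, a consumer that stops iterating never causes the remaining chunks to be built.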
toPairs()
Convenience method to collect all pairs from a producer's AsyncGenerator:
import { SqlExtractor } from '@deepagents/text2sql/synthesis';
const pairs = await new SqlExtractor(sqls, adapter).toPairs();
Implementation (shared helper used by .toPairs()):
export async function toPairs<T extends ExtractedPair>(
  producer: PairProducer<T>,
): Promise<T[]> {
  const pairs: T[] = [];
  for await (const chunk of producer.produce()) {
    pairs.push(...chunk);
  }
  return pairs;
}
The .toPairs() method delegates to the helper above, so you can use the method consistently across producers.
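Collecting everything is convenient, but the generator returned by produce() can also be consumed directly when only part of the output is needed. The sketch below assumes nothing beyond the produce() signature shown earlier. takePairs, ChunkSource, and demoProducer are hypothetical names introduced for illustration, not library exports.

```typescript
// Local stand-in for the pair shape described above.
interface ExtractedPair {
  question: string;
  sql: string;
  context?: string[];
  success: boolean;
}

// Anything exposing the same produce() signature as a producer.
interface ChunkSource<T extends ExtractedPair> {
  produce(): AsyncGenerator<T[], void, unknown>;
}

// Hypothetical helper: collect at most `limit` pairs, then stop iterating.
async function takePairs<T extends ExtractedPair>(
  producer: ChunkSource<T>,
  limit: number,
): Promise<T[]> {
  const taken: T[] = [];
  if (limit <= 0) return taken;
  for await (const chunk of producer.produce()) {
    for (const pair of chunk) {
      taken.push(pair);
      // Returning from inside for-await closes the generator,
      // so no further chunks are requested from the producer.
      if (taken.length >= limit) return taken;
    }
  }
  return taken;
}

// Demo source that counts how many chunks it was actually asked to build.
let chunksProduced = 0;
const demoProducer: ChunkSource<ExtractedPair> = {
  async *produce() {
    for (let i = 0; i < 5; i++) {
      chunksProduced++;
      yield [
        { question: `Question ${2 * i}`, sql: 'SELECT 1', success: true },
        { question: `Question ${2 * i + 1}`, sql: 'SELECT 1', success: true },
      ];
    }
  },
};
```

Stopping early like this can matter when each chunk is expensive to produce, since the remaining chunks are never requested.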
Extractors vs Synthesizers vs Generators
| Type | Input | Output | Use Case |
|---|---|---|---|
| Extractors | Chat history, SQL logs | Pairs harvested from real usage | Production data mining |
| Synthesizers | Schema or seed pairs | Newly generated pairs | Bootstrapping, augmentation |
| Generators | Schema | Personas or Teachings | Pipeline setup |
Extractors
Harvest pairs from existing artifacts:
- MessageExtractor - Parse chat history for db_query tool calls
- SqlExtractor - Generate questions for existing SQL queries
- Contextual extractors - Handle multi-turn conversations with context
Synthesizers
Generate new pairs from scratch:
- SchemaSynthesizer - Generate pairs from database schema
- BreadthEvolver - Paraphrase questions (in-breadth evolution, keeps same SQL)
- DepthEvolver - Evolve to more complex questions (in-depth evolution, changes both Q and SQL)
Generators
Generate supporting artifacts for the pipeline:
- PersonaGenerator - Generate user personas from schema
- Teachings Generator - Current guidance for building manual fragment bundles
Evol-Instruct Methodology
The synthesis module implements Microsoft's Evol-Instruct methodology for instruction tuning:
- BreadthEvolver - In-breadth evolution: Creates paraphrased variations of questions while keeping the same SQL query. This increases dataset diversity without changing the underlying query complexity.
- DepthEvolver - In-depth evolution: Transforms questions into more complex versions that require changes to both the question and SQL query. This progressively raises dataset difficulty, which is intended to improve the model's handling of harder queries.
This two-pronged approach ensures both breadth (variety) and depth (complexity) in your training data.
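The difference between the two evolution styles shows up in the data itself: a breadth variant changes only the question, while a depth variant changes both the question and the SQL. The sketch below is a sanity check one might run over evolved pairs. It is not how the evolvers are implemented. classifyEvolution, EvolutionKind, and the example questions and queries are hypothetical.

```typescript
// Local stand-in for the pair shape described above.
interface ExtractedPair {
  question: string;
  sql: string;
  context?: string[];
  success: boolean;
}

type EvolutionKind = 'breadth' | 'depth' | 'unchanged' | 'inconsistent';

// Hypothetical check: compare an evolved pair against the seed it came from.
function classifyEvolution(seed: ExtractedPair, evolved: ExtractedPair): EvolutionKind {
  const sameQuestion = seed.question.trim() === evolved.question.trim();
  const sameSql = seed.sql.trim() === evolved.sql.trim();
  if (sameQuestion && sameSql) return 'unchanged';
  if (!sameQuestion && sameSql) return 'breadth'; // paraphrase, same query
  if (!sameQuestion && !sameSql) return 'depth'; // harder question, new query
  return 'inconsistent'; // same question but different SQL
}

// Made-up example data for illustration.
const seed: ExtractedPair = {
  question: 'How many orders were placed in 2024?',
  sql: "SELECT COUNT(*) FROM orders WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01'",
  success: true,
};

const breadthVariant: ExtractedPair = {
  question: 'What was the total number of orders in 2024?',
  sql: seed.sql,
  success: true,
};

const depthVariant: ExtractedPair = {
  question: 'How many orders did each customer place in 2024?',
  sql: "SELECT customer_id, COUNT(*) FROM orders WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01' GROUP BY customer_id",
  success: true,
};
```

A check like this compares raw strings, so it treats formatting-only SQL differences as a change. Treat it as a rough filter, not a semantic comparison.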
Decorators
Decorators wrap producers to add functionality:
ValidatedProducer
Validate SQL and optionally execute queries:
import { ValidatedProducer, MessageExtractor } from '@deepagents/text2sql/synthesis';
const validated = new ValidatedProducer(
  new MessageExtractor(messages),
  adapter,
  { execute: true }
);
const pairs = await validated.toPairs();
// Only includes pairs where SQL validated/executed successfully
FilteredProducer
Filter pairs by criteria:
import { FilteredProducer, SqlExtractor } from '@deepagents/text2sql/synthesis';
const filtered = new FilteredProducer(
  new SqlExtractor(sqls, adapter),
  {
    successOnly: true,
    tables: ['orders', 'customers'], // Only pairs using these tables
    dialect: 'postgresql', // SQL dialect for table extraction via node-sql-parser
  }
);
| Option | Type | Default | Description |
|---|---|---|---|
| successOnly | boolean | undefined | Only include pairs where success is true |
| tables | string[] | undefined | Only include pairs whose SQL references these tables |
| dialect | string | undefined | SQL dialect passed to node-sql-parser for table extraction (e.g. 'postgresql', 'mysql', 'transactsql') |
| filter | (pair: ExtractedPair) => boolean | undefined | Custom filter predicate |
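The filter option takes an ordinary predicate over a pair. The sketch below shows one way a filtering wrapper around a chunked producer could work, using local stand-ins only. PredicateFilter, ChunkSource, collect, and the sample data are hypothetical, and the library's FilteredProducer may be implemented differently.

```typescript
// Local stand-in for the pair shape described above.
interface ExtractedPair {
  question: string;
  sql: string;
  context?: string[];
  success: boolean;
}

// Anything exposing the same produce() signature as a producer.
interface ChunkSource {
  produce(): AsyncGenerator<ExtractedPair[], void, unknown>;
}

// Hypothetical wrapper: re-yields each chunk with non-matching pairs removed.
class PredicateFilter implements ChunkSource {
  private readonly inner: ChunkSource;
  private readonly keep: (pair: ExtractedPair) => boolean;

  constructor(inner: ChunkSource, keep: (pair: ExtractedPair) => boolean) {
    this.inner = inner;
    this.keep = keep;
  }

  async *produce(): AsyncGenerator<ExtractedPair[], void, unknown> {
    for await (const chunk of this.inner.produce()) {
      const kept = chunk.filter((pair) => this.keep(pair));
      if (kept.length > 0) yield kept; // drop chunks that were filtered out entirely
    }
  }
}

// Example predicate: keep successful pairs whose SQL starts with SELECT.
const successfulSelect = (pair: ExtractedPair): boolean =>
  pair.success && /^\s*select\b/i.test(pair.sql);

// Made-up source data for the demo.
const source: ChunkSource = {
  async *produce() {
    yield [
      { question: 'How many users are there?', sql: 'SELECT COUNT(*) FROM users', success: true },
      { question: 'Remove old sessions', sql: 'DELETE FROM sessions', success: true },
    ];
    yield [{ question: 'List orders', sql: 'SELEC * FROM orders', success: false }];
  },
};

// Flatten all chunks into one array.
async function collect(producer: ChunkSource): Promise<ExtractedPair[]> {
  const all: ExtractedPair[] = [];
  for await (const chunk of producer.produce()) all.push(...chunk);
  return all;
}
```

Because the wrapper exposes the same produce() shape as the producer it wraps, wrappers of this kind can be nested in any order.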
DeduplicatedProducer
Remove duplicate pairs:
import { DeduplicatedProducer, MessageExtractor } from '@deepagents/text2sql/synthesis';
const deduped = new DeduplicatedProducer(
  new MessageExtractor(messages),
  {
    strategy: 'sql-only', // Dedupe by SQL only (ignore question variations)
    dialect: 'postgresql', // SQL dialect for node-sql-parser normalization
  }
);
| Option | Type | Default | Description |
|---|---|---|---|
| strategy | 'exact' \| 'sql-only' \| 'question-only' | 'exact' | Deduplication strategy: 'exact' compares both question and SQL, 'sql-only' compares SQL only, 'question-only' compares question only |
| dialect | string | undefined | SQL dialect passed to node-sql-parser for SQL normalization during deduplication (e.g. 'postgresql', 'mysql', 'transactsql') |
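The three strategies differ only in which fields form the comparison key. The sketch below illustrates that idea with local stand-ins. It assumes a simple whitespace-based normalization in place of the parser-based SQL normalization the library performs, so results may differ from the real DeduplicatedProducer. dedupeKey, dedupe, normalizeSql, and normalizeQuestion are hypothetical names.

```typescript
// Local stand-in for the pair shape described above.
interface ExtractedPair {
  question: string;
  sql: string;
  context?: string[];
  success: boolean;
}

type DedupeStrategy = 'exact' | 'sql-only' | 'question-only';

// Assumption: collapse whitespace and drop a trailing semicolon.
// This is a simplified stand-in for parser-based normalization.
function normalizeSql(sql: string): string {
  return sql.replace(/\s+/g, ' ').trim().replace(/\s*;+$/, '');
}

// Assumption: questions are compared case-insensitively with collapsed whitespace.
function normalizeQuestion(question: string): string {
  return question.replace(/\s+/g, ' ').trim().toLowerCase();
}

// Build the comparison key for a pair under the chosen strategy.
function dedupeKey(pair: ExtractedPair, strategy: DedupeStrategy): string {
  const question = normalizeQuestion(pair.question);
  const sql = normalizeSql(pair.sql);
  if (strategy === 'sql-only') return sql;
  if (strategy === 'question-only') return question;
  return JSON.stringify([question, sql]); // 'exact': both fields must match
}

// Keep the first pair seen for each key, preserving input order.
function dedupe(pairs: ExtractedPair[], strategy: DedupeStrategy = 'exact'): ExtractedPair[] {
  const seen = new Set<string>();
  const unique: ExtractedPair[] = [];
  for (const pair of pairs) {
    const key = dedupeKey(pair, strategy);
    if (!seen.has(key)) {
      seen.add(key);
      unique.push(pair);
    }
  }
  return unique;
}

// Made-up sample data for the demo.
const samples: ExtractedPair[] = [
  { question: 'How many users?', sql: 'SELECT COUNT(*) FROM users', success: true },
  { question: 'Count the users', sql: 'SELECT  COUNT(*)  FROM users;', success: true },
  { question: 'How many users?', sql: 'SELECT COUNT(id) FROM users', success: true },
];
```

With 'sql-only', the first two samples collapse into one because their SQL differs only in spacing and a trailing semicolon. With 'question-only', the first and third collapse instead.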
Composing Producers
Decorators can be chained:
const producer = new DeduplicatedProducer(
  new FilteredProducer(
    new ValidatedProducer(
      new MessageExtractor(messages),
      adapter
    ),
    { successOnly: true }
  ),
  { strategy: 'exact' }
);
const pairs = await producer.toPairs();
Output Format
Export pairs for training:
const pairs = await new SqlExtractor(sqls, adapter).toPairs();
// JSONL format (common for fine-tuning)
const jsonl = pairs
  .filter(p => p.success)
  .map(p => JSON.stringify({
    question: p.question,
    sql: p.sql
  }))
  .join('\n');
// CSV format (embedded double quotes are doubled so each field stays valid)
const csvField = (value: string) => `"${value.replace(/"/g, '""')}"`;
const csv = pairs
  .map(p => `${csvField(p.question)},${csvField(p.sql)}`)
  .join('\n');
Next Steps
- From History - Extract pairs from chat logs
- From SQL - Generate questions for existing queries
- From Schema - Synthesize pairs from database schema
- Breadth Evolution - Create paraphrased variations
- Depth Evolution - Generate more complex questions
- Persona Generator - Generate user personas
- Teachings Generator - Generate domain teachings