Training Data Overview
Generate question/SQL pairs for fine-tuning and evaluation
The synthesis module helps you generate training data for fine-tuning models or evaluating Text2SQL performance. It produces ExtractedPair objects containing natural language questions paired with their SQL queries.
Use Cases
- Fine-tuning - Generate training pairs to improve model accuracy on your schema
- Evaluation - Create test sets to measure query generation quality
- Data augmentation - Expand existing pairs with variations and paraphrases
Core Concepts
ExtractedPair
Every producer outputs the same structure:
interface ExtractedPair {
  question: string;   // Natural language question
  sql: string;        // SQL query that answers the question
  context?: string[]; // Preceding messages (for multi-turn)
  success: boolean;   // Whether the query is valid/executed successfully
}
PairProducer
All extractors and synthesizers implement this abstract class:
abstract class PairProducer<T extends ExtractedPair> {
  abstract produce(): AsyncGenerator<T[], void, unknown>;
}
The streaming pattern yields pairs in chunks, which keeps memory usage low and lets consumers process pairs as they arrive.
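The streaming contract can be sketched with a small in-memory producer. The `ExtractedPair` and `PairProducer` declarations below are local stand-ins that mirror the definitions above so the snippet runs on its own. `StaticProducer`, its `chunkSize` parameter, and `countSuccessful` are hypothetical names used for illustration and are not part of the library.

```typescript
// Local stand-ins mirroring the documented shapes (not imported from the library).
interface ExtractedPair {
  question: string;
  sql: string;
  context?: string[];
  success: boolean;
}

abstract class PairProducer<T extends ExtractedPair> {
  abstract produce(): AsyncGenerator<T[], void, unknown>;
}

// Hypothetical producer: yields a fixed list of pairs in chunks of `chunkSize`.
class StaticProducer extends PairProducer<ExtractedPair> {
  private readonly pairs: ExtractedPair[];
  private readonly chunkSize: number;

  constructor(pairs: ExtractedPair[], chunkSize = 2) {
    super();
    this.pairs = pairs;
    this.chunkSize = chunkSize;
  }

  async *produce(): AsyncGenerator<ExtractedPair[], void, unknown> {
    for (let i = 0; i < this.pairs.length; i += this.chunkSize) {
      yield this.pairs.slice(i, i + this.chunkSize);
    }
  }
}

// Consume chunk by chunk: only the current chunk is held at any time.
async function countSuccessful(
  producer: PairProducer<ExtractedPair>,
): Promise<number> {
  let count = 0;
  for await (const chunk of producer.produce()) {
    count += chunk.filter((pair) => pair.success).length;
  }
  return count;
}
```

A consumer like `countSuccessful` never materializes the full list, which is the main difference from collecting everything into one array first.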
toPairs()
Helper function that collects every chunk from a producer's AsyncGenerator into a single array:
import { toPairs, SqlExtractor } from '@deepagents/text2sql/synthesis';
const pairs = await toPairs(new SqlExtractor(sqls, adapter));
Implementation:
export async function toPairs<T extends ExtractedPair>(
  producer: PairProducer<T>,
): Promise<T[]> {
  const pairs: T[] = [];
  for await (const chunk of producer.produce()) {
    pairs.push(...chunk);
  }
  return pairs;
}
Extractors vs Synthesizers vs Generators
| Type | Input | Output | Use Case |
|---|---|---|---|
| Extractors | Chat history, SQL logs | Pairs harvested from real usage | Production data mining |
| Synthesizers | Schema or seed pairs | Newly generated pairs | Bootstrapping, augmentation |
| Generators | Schema | Personas or Teachings | Pipeline setup |
Extractors
Harvest pairs from existing artifacts:
- MessageExtractor - Parse chat history for db_query tool calls
- SqlExtractor - Generate questions for existing SQL queries
- Contextual extractors - Handle multi-turn conversations with context
Synthesizers
Generate new pairs from scratch:
- SchemaSynthesizer - Generate pairs from database schema
- BreadthEvolver - Paraphrase questions (in-breadth evolution, keeps same SQL)
- DepthEvolver - Evolve to more complex questions (in-depth evolution, changes both Q and SQL)
Generators
Generate supporting artifacts for the pipeline:
- PersonaGenerator - Generate user personas from schema
- TeachingsGenerator - Generate domain teachings from schema
Evol-Instruct Methodology
The synthesis module implements Microsoft's Evol-Instruct methodology for instruction tuning:
- BreadthEvolver - In-breadth evolution: Creates paraphrased variations of questions while keeping the same SQL query. This increases dataset diversity without changing the underlying query complexity.
- DepthEvolver - In-depth evolution: Transforms questions into more complex versions that require changes to both the question and SQL query. This progressively increases dataset difficulty and model capabilities.
This two-pronged approach ensures both breadth (variety) and depth (complexity) in your training data.
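The difference between the two directions shows up in the shape of the resulting pairs. The following is a minimal sketch in which hand-written rewrites stand in for the model's output. `breadthEvolve` and `depthEvolve` are hypothetical helpers for illustration only and do not reflect the signatures of `BreadthEvolver` or `DepthEvolver`. Resetting `success` after a depth rewrite is an assumption of this sketch.

```typescript
// Local stand-in mirroring the documented ExtractedPair shape.
interface ExtractedPair {
  question: string;
  sql: string;
  context?: string[];
  success: boolean;
}

// In-breadth: new wording, identical SQL. One seed fans out into many pairs.
function breadthEvolve(
  seed: ExtractedPair,
  paraphrases: string[],
): ExtractedPair[] {
  return paraphrases.map((question) => ({ ...seed, question }));
}

// In-depth: question and SQL change together.
// Assumption: `success` is reset because the new SQL has not been validated yet.
function depthEvolve(
  seed: ExtractedPair,
  harder: { question: string; sql: string },
): ExtractedPair {
  return { ...seed, question: harder.question, sql: harder.sql, success: false };
}

const seed: ExtractedPair = {
  question: 'How many orders were placed?',
  sql: 'SELECT COUNT(*) FROM orders',
  success: true,
};

// Two paraphrases that share the seed's SQL.
const variations = breadthEvolve(seed, [
  'What is the total number of orders?',
  'Count all orders.',
]);

// A harder question that needs a different query.
const evolved = depthEvolve(seed, {
  question: 'How many orders did each customer place?',
  sql: 'SELECT customer_id, COUNT(*) FROM orders GROUP BY customer_id',
});
```

Because depth evolution produces SQL that has never been run, it pairs naturally with a validation step before the results are used for training.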
Decorators
Decorators wrap producers to add functionality:
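To make "wrapping" concrete, here is a minimal sketch of the decorator pattern over the streaming interface. The types are local stand-ins mirroring the definitions above. `ArrayProducer`, `SuccessOnly`, `DedupeBySql`, and `buildPipeline` are hypothetical names written for this example, and the library's own decorators may be implemented differently.

```typescript
// Local stand-ins mirroring the documented shapes (not imported from the library).
interface ExtractedPair {
  question: string;
  sql: string;
  context?: string[];
  success: boolean;
}

abstract class PairProducer<T extends ExtractedPair> {
  abstract produce(): AsyncGenerator<T[], void, unknown>;
}

// Hypothetical source: emits each inner array as one chunk.
class ArrayProducer extends PairProducer<ExtractedPair> {
  private readonly chunks: ExtractedPair[][];

  constructor(chunks: ExtractedPair[][]) {
    super();
    this.chunks = chunks;
  }

  async *produce(): AsyncGenerator<ExtractedPair[], void, unknown> {
    for (const chunk of this.chunks) {
      yield chunk;
    }
  }
}

// Hypothetical decorator: forwards only pairs marked successful.
class SuccessOnly extends PairProducer<ExtractedPair> {
  private readonly inner: PairProducer<ExtractedPair>;

  constructor(inner: PairProducer<ExtractedPair>) {
    super();
    this.inner = inner;
  }

  async *produce(): AsyncGenerator<ExtractedPair[], void, unknown> {
    for await (const chunk of this.inner.produce()) {
      const kept = chunk.filter((pair) => pair.success);
      if (kept.length > 0) yield kept; // skip chunks that end up empty
    }
  }
}

// Hypothetical decorator: drops pairs whose SQL text was already seen,
// remembering queries across chunk boundaries.
class DedupeBySql extends PairProducer<ExtractedPair> {
  private readonly inner: PairProducer<ExtractedPair>;

  constructor(inner: PairProducer<ExtractedPair>) {
    super();
    this.inner = inner;
  }

  async *produce(): AsyncGenerator<ExtractedPair[], void, unknown> {
    const seen = new Set<string>();
    for await (const chunk of this.inner.produce()) {
      const fresh = chunk.filter((pair) => {
        const key = pair.sql.trim();
        if (seen.has(key)) return false;
        seen.add(key);
        return true;
      });
      if (fresh.length > 0) yield fresh;
    }
  }
}

// Decorators nest because each one is itself a PairProducer.
function buildPipeline(chunks: ExtractedPair[][]): PairProducer<ExtractedPair> {
  return new DedupeBySql(new SuccessOnly(new ArrayProducer(chunks)));
}
```

Each wrapper exposes the same `produce()` method as the producer it wraps, so callers do not need to know how many layers are involved.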
ValidatedProducer
Validate SQL and optionally execute queries:
import { toPairs, ValidatedProducer, MessageExtractor } from '@deepagents/text2sql/synthesis';
const validated = new ValidatedProducer(
  new MessageExtractor(messages),
  adapter,
  { execute: true }
);
const pairs = await toPairs(validated);
// Only includes pairs where SQL validated/executed successfully
FilteredProducer
Filter pairs by criteria:
import { FilteredProducer, SqlExtractor } from '@deepagents/text2sql/synthesis';
const filtered = new FilteredProducer(
  new SqlExtractor(sqls, adapter),
  {
    successOnly: true,
    tables: ['orders', 'customers'], // Only pairs using these tables
  }
);
DeduplicatedProducer
Remove duplicate pairs:
import { DeduplicatedProducer, MessageExtractor } from '@deepagents/text2sql/synthesis';
const deduped = new DeduplicatedProducer(
  new MessageExtractor(messages),
  { mode: 'sql' } // Dedupe by SQL only (ignore question variations)
);
Composing Producers
Decorators can be chained:
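import {
  toPairs,
  DeduplicatedProducer,
  FilteredProducer,
  ValidatedProducer,
  MessageExtractor,
} from '@deepagents/text2sql/synthesis';
// Innermost runs first: extract, then validate, then filter, then dedupe
const producer = new DeduplicatedProducer(
  new FilteredProducer(
    new ValidatedProducer(
      new MessageExtractor(messages),
      adapter
    ),
    { successOnly: true }
  ),
  { mode: 'exact' }
);
const pairs = await toPairs(producer);
Output Format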
Export pairs for training:
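import { toPairs, SqlExtractor } from '@deepagents/text2sql/synthesis';
const pairs = await toPairs(new SqlExtractor(sqls, adapter));
// JSONL format (common for fine-tuning)
const jsonl = pairs
  .filter(p => p.success)
  .map(p => JSON.stringify({
    question: p.question,
    sql: p.sql
  }))
  .join('\n');
// CSV format: quote every field and double any embedded quotes
const csvField = (value: string) => `"${value.replace(/"/g, '""')}"`;
const csv = pairs
  .map(p => `${csvField(p.question)},${csvField(p.sql)}`)
  .join('\n');
Next Steps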
- From History - Extract pairs from chat logs
- From SQL - Generate questions for existing queries
- From Schema - Synthesize pairs from database schema
- Breadth Evolution - Create paraphrased variations
- Depth Evolution - Generate more complex questions
- Persona Generator - Generate user personas
- Teachings Generator - Generate domain teachings