Breadth Evolution
Generate paraphrased variations using in-breadth evolution
The BreadthEvolver generates paraphrased variations of questions while keeping the SQL unchanged. This creates training data diversity where many different phrasings map to the same SQL query.
Based on Microsoft's Evol-Instruct methodology for in-breadth evolution, this synthesizer helps improve model robustness to different question phrasings without changing the underlying query logic.
Basic Usage
import { BreadthEvolver, toPairs } from '@deepagents/text2sql/synthesis';
const existingPairs = [
{ question: 'Show top customers', sql: 'SELECT ...', success: true },
{ question: 'Monthly revenue', sql: 'SELECT ...', success: true },
];
const expanded = await toPairs(new BreadthEvolver(existingPairs, { count: 3 }));
// Original: "Show top customers"
// Variations:
// - "Who are our best customers?"
// - "List the highest-value customers"
// - "Display our top-performing customers"
How It Works
BreadthEvolver uses an LLM to generate semantically equivalent questions:
- Takes existing question/SQL pairs as input
- For each pair, generates N paraphrased questions
- Keeps the original SQL unchanged for each variation
- All variations must be answerable by the exact same SQL query
- Returns variations with the same SQL and context
The key principle: same SQL, different question phrasings.
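The steps above can be sketched without the library. This is a minimal illustration and not the BreadthEvolver implementation: `evolveBreadth`, `Paraphraser`, and the canned lookup are hypothetical names, and the paraphraser is a synchronous stand-in for what would really be an asynchronous LLM call.

```typescript
// Shape of a question/SQL pair, mirroring the objects used on this page.
interface Pair {
  question: string;
  sql: string;
  success: boolean;
}

// Hypothetical stand-in for the LLM step: returns rephrasings of a question.
type Paraphraser = (question: string, count: number) => string[];

function evolveBreadth(pairs: Pair[], count: number, paraphrase: Paraphraser): Pair[] {
  const variations: Pair[] = [];
  for (const pair of pairs) {
    // Never keep more than `count` paraphrases per source pair.
    for (const question of paraphrase(pair.question, count).slice(0, count)) {
      // Only the question changes; SQL and other fields are copied through.
      variations.push({ ...pair, question });
    }
  }
  return variations;
}

// Canned paraphrases for illustration only; a real paraphraser would prompt a model.
const canned: Record<string, string[]> = {
  'Show customers': ['Display all customers', 'List our customer base', 'Who are our customers?'],
};
const cannedParaphrase: Paraphraser = (question, count) =>
  (canned[question] ?? []).slice(0, count);

const seedPairs: Pair[] = [
  { question: 'Show customers', sql: 'SELECT * FROM customers', success: true },
];
const evolved = evolveBreadth(seedPairs, 3, cannedParaphrase);
console.log(evolved.map((pair) => `${pair.question} -> ${pair.sql}`));
```

Every returned pair carries the source SQL, which is the invariant the rest of this page relies on.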
Configuration Options
interface BreadthEvolverOptions {
/** Number of variations to generate per pair (required) */
count: number;
/** Optional persona to style the paraphrases */
persona?: Persona;
/** Custom model override (defaults to gpt-oss-20b) */
model?: AgentModel;
/** Parallel processing limit (default: 4) */
concurrency?: number;
}
Number of Variations
Control how many paraphrases to generate per pair:
// 3 variations per pair
const threePerPair = await toPairs(
new BreadthEvolver(existingPairs, { count: 3 })
);
// 10 original pairs → 30 variations
// 5 variations per pair
const fivePerPair = await toPairs(
new BreadthEvolver(existingPairs, { count: 5 })
);
// 10 original pairs → 50 variations
Persona-Driven Paraphrasing
Generate variations that match a specific user persona:
import { BreadthEvolver, toPairs } from '@deepagents/text2sql/synthesis';
const persona = {
role: 'Sales Manager',
perspective: 'Focus on revenue metrics and customer relationships. Use business terminology.',
};
const expanded = await toPairs(
new BreadthEvolver(existingPairs, {
count: 3,
persona
})
);
// Original: "Show customer orders"
// With persona:
// - "What's the deal pipeline for this account?"
// - "Pull up the purchase history for this client"
// - "Show me the order book for this customer"
Concurrency Control
Process pairs in parallel for faster generation:
// Default concurrency (4)
const withDefaultConcurrency = await toPairs(
new BreadthEvolver(existingPairs, { count: 3 })
);
// Higher concurrency for large datasets
const withHigherConcurrency = await toPairs(
new BreadthEvolver(existingPairs, {
count: 3,
concurrency: 10
})
);
Custom Model
Override the default model:
import { groq } from '@ai-sdk/groq';
const pairs = await toPairs(
new BreadthEvolver(existingPairs, {
count: 3,
model: groq('openai/gpt-oss-20b')
})
);
Example: Augment Extracted Data
Expand pairs extracted from chat history:
import {
MessageExtractor,
BreadthEvolver,
toPairs,
} from '@deepagents/text2sql/synthesis';
// Extract from history
const extracted = await toPairs(new MessageExtractor(messages));
console.log(`Extracted ${extracted.length} pairs`);
// Augment with variations
const augmented = await toPairs(
new BreadthEvolver(extracted, { count: 3 })
);
console.log(`Augmented to ${augmented.length} pairs`);
Example: Pipeline with Validation
import {
MessageExtractor,
BreadthEvolver,
ValidatedProducer,
DeduplicatedProducer,
toPairs,
} from '@deepagents/text2sql/synthesis';
// 1. Extract from history
const extracted = await toPairs(new MessageExtractor(messages));
// 2. Generate variations
const variations = await toPairs(
new BreadthEvolver(extracted, { count: 3 })
);
// 3. Validate and deduplicate
const final = await toPairs(
new DeduplicatedProducer(
new ValidatedProducer(
{ produce: async function*() { yield variations; } },
adapter
),
{ mode: 'exact' }
)
);
Example: With Persona-Based Variations
Generate variations for different user types:
import { BreadthEvolver, toPairs } from '@deepagents/text2sql/synthesis';
const personas = [
{ role: 'Sales Manager', perspective: 'Revenue and customer focus' },
{ role: 'Data Analyst', perspective: 'Technical and precise queries' },
{ role: 'Executive', perspective: 'High-level metrics and trends' },
];
const allVariations = [];
for (const persona of personas) {
const variations = await toPairs(
new BreadthEvolver(basePairs, { count: 2, persona })
);
allVariations.push(...variations);
}
console.log(`Generated ${allVariations.length} persona-specific variations`);
Variation Quality
Generated variations maintain semantic equivalence while varying phrasing:
| Original | Variation 1 | Variation 2 | Variation 3 |
|---|---|---|---|
| "How many orders last month?" | "What was our order count for last month?" | "Show me total orders from the previous month" | "Orders from last month - how many?" |
| "Show revenue by product" | "Break down our revenue by product" | "What's the revenue split across products?" | "Product revenue breakdown" |
| "List active customers" | "Who are our current active customers?" | "Show me all customers with active status" | "Display the active customer list" |
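Part of this quality bar can be checked mechanically before any human review. The sketch below is a hypothetical helper (`checkVariations` is not a library function) that flags variations whose SQL differs from the original or whose question is not actually a new phrasing. The SQL strings are illustrative assumptions, and the check cannot judge semantic equivalence, which still needs spot-checking.

```typescript
interface Pair {
  question: string;
  sql: string;
}

// Returns human-readable problems; an empty array means no structural issue was found.
function checkVariations(original: Pair, variations: Pair[]): string[] {
  const problems: string[] = [];
  // Case and whitespace differences do not count as a new phrasing.
  const normalize = (text: string) => text.trim().toLowerCase().replace(/\s+/g, ' ');
  const seen = new Set<string>([normalize(original.question)]);

  for (const variation of variations) {
    if (variation.sql !== original.sql) {
      problems.push(`SQL changed for "${variation.question}"`);
    }
    const key = normalize(variation.question);
    if (seen.has(key)) {
      problems.push(`Not a new phrasing: "${variation.question}"`);
    }
    seen.add(key);
  }
  return problems;
}

// Illustrative data; the SQL text is an assumption made for this example.
const activeSql = "SELECT * FROM customers WHERE status = 'active'";
const problems = checkVariations(
  { question: 'List active customers', sql: activeSql },
  [
    { question: 'Who are our current active customers?', sql: activeSql },
    { question: 'list  active   customers', sql: activeSql }, // same phrasing as the original
    { question: 'Show me all customers', sql: 'SELECT * FROM customers' }, // SQL drifted
  ],
);
console.log(problems);
```

An empty result only means the batch is structurally consistent; whether each paraphrase really asks for the same rows is still a judgment call.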
BreadthEvolver vs DepthEvolver
Understanding the difference:
BreadthEvolver (This Page)
- Same SQL, different phrasings
- Generates paraphrased questions
- SQL remains identical
- Increases dataset diversity
- Helps model handle varied user language
// Example: Breadth evolution
const breadth = await toPairs(
new BreadthEvolver(
[{ question: 'Show customers', sql: 'SELECT * FROM customers', success: true }],
{ count: 3 }
)
);
// Results:
// - "Display all customers" → SELECT * FROM customers
// - "List our customer base" → SELECT * FROM customers
// - "Who are our customers?" → SELECT * FROM customers
DepthEvolver
- Different SQL, more complex questions
- Evolves questions to be more sophisticated
- SQL becomes more complex
- Increases query complexity
- Helps model handle advanced queries
// Example: Depth evolution
const depth = await toPairs(
new DepthEvolver(
[{ question: 'Show customers', sql: 'SELECT * FROM customers', success: true }],
adapter,
{ count: 2, techniques: ['add-aggregation', 'add-filter'] }
)
);
// Results:
// - "How many customers per region?" → SELECT region, COUNT(*) FROM customers GROUP BY region
// - "Show customers who joined this year" → SELECT * FROM customers WHERE YEAR(joined_at) = YEAR(NOW())
Combining Sources
Build a comprehensive training set:
import {
MessageExtractor,
SqlExtractor,
SchemaSynthesizer,
BreadthEvolver,
toPairs,
} from '@deepagents/text2sql/synthesis';
// From multiple sources
const fromHistory = await toPairs(new MessageExtractor(messages));
const fromLogs = await toPairs(new SqlExtractor(sqlLogs, adapter));
const synthetic = await toPairs(new SchemaSynthesizer(adapter, { count: 50 }));
// Combine
const allPairs = [...fromHistory, ...fromLogs, ...synthetic];
// Augment everything with breadth variations
const augmented = await toPairs(
new BreadthEvolver(allPairs, { count: 2 })
);
console.log(`Final dataset: ${augmented.length} pairs`);
Best Practices
- Start with quality pairs - Variations inherit quality from originals; poor questions lead to poor variations
- Use personas strategically - Generate variations for your actual user personas to match real usage patterns
- Deduplicate after generation - Some variations may be similar; deduplicate the final set
- Review samples - Spot-check that variations preserve exact semantic meaning
- Balance your dataset - Don't over-augment simple queries relative to complex ones
- Adjust concurrency - Higher concurrency for large datasets, lower for rate-limited models
- Combine with DepthEvolver - Use BreadthEvolver for diversity, DepthEvolver for complexity
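Two of these practices, deduplicating and balancing, are easy to prototype on plain arrays. The following is a rough sketch under stated assumptions: `dedupePairs` and `countBySql` are hypothetical helpers, not part of the library, and the normalization rule (trim, lowercase, collapse whitespace) is only one possible choice.

```typescript
interface Pair {
  question: string;
  sql: string;
}

// Keep the first pair seen for each (normalized question, SQL) combination.
function dedupePairs(pairs: Pair[]): Pair[] {
  const seen = new Set<string>();
  const kept: Pair[] = [];
  for (const pair of pairs) {
    const normalized = pair.question.trim().toLowerCase().replace(/\s+/g, ' ');
    const key = JSON.stringify([normalized, pair.sql]);
    if (!seen.has(key)) {
      seen.add(key);
      kept.push(pair);
    }
  }
  return kept;
}

// Count pairs per SQL string to spot queries that were augmented too heavily.
function countBySql(pairs: Pair[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const pair of pairs) {
    counts.set(pair.sql, (counts.get(pair.sql) ?? 0) + 1);
  }
  return counts;
}

const generated: Pair[] = [
  { question: 'Display all customers', sql: 'SELECT * FROM customers' },
  { question: 'display all  customers ', sql: 'SELECT * FROM customers' }, // near-identical repeat
  { question: 'List our customer base', sql: 'SELECT * FROM customers' },
  {
    question: 'How many customers per region?',
    sql: 'SELECT region, COUNT(*) FROM customers GROUP BY region',
  },
];

const uniquePairs = dedupePairs(generated);
const perSql = countBySql(uniquePairs);
console.log(uniquePairs.length, perSql.get('SELECT * FROM customers'));
```

If one SQL string dominates the counts, lowering `count` for those source pairs is one way to rebalance before training.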
Related Pages
- Depth Evolution - Evolve questions into more complex versions
- From Database Schema - Generate synthetic pairs from schema
- From Chat History - Extract pairs from conversation history