
Training Data Overview

Generate question/SQL pairs for fine-tuning and evaluation

The synthesis module helps you generate training data for fine-tuning models or evaluating Text2SQL performance. It produces ExtractedPair objects containing natural language questions paired with their SQL queries.

Use Cases

  • Fine-tuning - Generate training pairs to improve model accuracy on your schema
  • Evaluation - Create test sets to measure query generation quality
  • Data augmentation - Expand existing pairs with variations and paraphrases

Core Concepts

ExtractedPair

Every producer outputs the same structure:

interface ExtractedPair {
  question: string;    // Natural language question
  sql: string;         // SQL query that answers the question
  context?: string[];  // Preceding messages (for multi-turn)
  success: boolean;    // Whether the query is valid/executed successfully
}
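
For illustration, a single-turn pair and a multi-turn follow-up might look like the following. The table names, the questions, and the isMultiTurn helper are made up for this sketch and are not part of the library.

```typescript
// Local copy of the interface above so the sketch runs standalone.
interface ExtractedPair {
  question: string;
  sql: string;
  context?: string[];
  success: boolean;
}

// Single-turn pair: the question stands on its own, so no context is needed.
const singleTurn: ExtractedPair = {
  question: 'How many orders were placed in 2024?',
  sql: "SELECT COUNT(*) FROM orders WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01'",
  success: true,
};

// Multi-turn pair: the question only makes sense given the preceding message.
const followUp: ExtractedPair = {
  question: 'And how many of those were cancelled?',
  sql: "SELECT COUNT(*) FROM orders WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01' AND status = 'cancelled'",
  context: ['How many orders were placed in 2024?'],
  success: true,
};

// Hypothetical helper: treat a pair as multi-turn when it carries context messages.
function isMultiTurn(pair: ExtractedPair): boolean {
  return (pair.context?.length ?? 0) > 0;
}
```

Keeping context optional means single-turn and multi-turn pairs can share one type and flow through the same producers.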

PairProducer

All extractors and synthesizers implement this abstract class:

abstract class PairProducer<T extends ExtractedPair> {
  abstract produce(): AsyncGenerator<T[], void, unknown>;
}

Because produce() yields pairs in chunks (arrays of pairs) rather than one complete array, consumers can process results as they arrive and avoid holding the whole dataset in memory.
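
As a rough sketch of the chunked pattern, the example below defines a toy producer and consumes it one chunk at a time. ArrayProducer and countSuccessful are hypothetical names invented for this illustration; the interface and abstract class are local copies of the shapes shown above so the snippet runs without the package installed.

```typescript
// Local copies of the shapes shown above so the sketch runs standalone.
interface ExtractedPair {
  question: string;
  sql: string;
  context?: string[];
  success: boolean;
}

abstract class PairProducer<T extends ExtractedPair> {
  abstract produce(): AsyncGenerator<T[], void, unknown>;
}

// Hypothetical producer: re-emits an in-memory array in fixed-size chunks.
class ArrayProducer extends PairProducer<ExtractedPair> {
  private readonly pairs: ExtractedPair[];
  private readonly chunkSize: number;

  constructor(pairs: ExtractedPair[], chunkSize = 2) {
    super();
    this.pairs = pairs;
    this.chunkSize = Math.max(1, chunkSize); // guard against a zero or negative step
  }

  async *produce(): AsyncGenerator<ExtractedPair[], void, unknown> {
    for (let i = 0; i < this.pairs.length; i += this.chunkSize) {
      yield this.pairs.slice(i, i + this.chunkSize);
    }
  }
}

// Consume chunk by chunk: only the current chunk is referenced at any time.
async function countSuccessful(producer: PairProducer<ExtractedPair>): Promise<number> {
  let count = 0;
  for await (const chunk of producer.produce()) {
    count += chunk.filter((pair) => pair.success).length;
  }
  return count;
}
```

A consumer written this way never needs the full result set, which matters when the producer is backed by a large log file or a slow model call.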

toPairs()

Helper that drains any producer's AsyncGenerator and collects every chunk into a single array:

import { toPairs, SqlExtractor } from '@deepagents/text2sql/synthesis';

const pairs = await toPairs(new SqlExtractor(sqls, adapter));

Implementation:

export async function toPairs<T extends ExtractedPair>(
  producer: PairProducer<T>,
): Promise<T[]> {
  const pairs: T[] = [];
  for await (const chunk of producer.produce()) {
    pairs.push(...chunk);
  }
  return pairs;
}
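
Since toPairs() drains the generator completely, it may be more than you need when you only want a sample. A minimal sketch of an early-stopping variant follows; takePairs and ProducerLike are hypothetical names for this example and are not exported by the library.

```typescript
// Local copy of the pair shape so the sketch runs standalone.
interface ExtractedPair {
  question: string;
  sql: string;
  context?: string[];
  success: boolean;
}

// Structural stand-in for PairProducer: anything with a chunked produce().
interface ProducerLike<T extends ExtractedPair> {
  produce(): AsyncGenerator<T[], void, unknown>;
}

// Hypothetical helper: collect at most `limit` pairs, then stop pulling chunks.
async function takePairs<T extends ExtractedPair>(
  producer: ProducerLike<T>,
  limit: number,
): Promise<T[]> {
  const pairs: T[] = [];
  if (limit <= 0) return pairs;
  for await (const chunk of producer.produce()) {
    // Take only as many pairs from this chunk as are still needed.
    pairs.push(...chunk.slice(0, limit - pairs.length));
    if (pairs.length >= limit) break; // leaving the loop closes the generator
  }
  return pairs;
}
```

Breaking out of a for await loop calls the generator's return(), so an upstream producer gets a chance to stop work it has not started yet.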

Extractors vs Synthesizers vs Generators

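Type         | Input                  | Output                          | Use Case
-------------|------------------------|---------------------------------|----------------------------
Extractors   | Chat history, SQL logs | Pairs harvested from real usage | Production data mining
Synthesizers | Schema or seed pairs   | Newly generated pairs           | Bootstrapping, augmentation
Generators   | Schema                 | Personas or Teachings           | Pipeline setup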

Extractors

Harvest pairs from existing artifacts:

  • MessageExtractor - Parse chat history for db_query tool calls
  • SqlExtractor - Generate questions for existing SQL queries
  • Contextual extractors - Handle multi-turn conversations with context

Synthesizers

Generate new pairs from scratch:

  • SchemaSynthesizer - Generate pairs from database schema
  • BreadthEvolver - Paraphrase questions (in-breadth evolution, keeps same SQL)
  • DepthEvolver - Evolve to more complex questions (in-depth evolution, changes both Q and SQL)

Generators

Generate supporting artifacts for the pipeline, such as personas or teachings derived from the schema.

Evol-Instruct Methodology

The synthesis module implements Microsoft's Evol-Instruct methodology for instruction tuning:

  • BreadthEvolver - In-breadth evolution: Creates paraphrased variations of questions while keeping the same SQL query. This increases dataset diversity without changing the underlying query complexity.

  • DepthEvolver - In-depth evolution: Transforms questions into more complex versions that require changes to both the question and the SQL query. This progressively raises dataset difficulty, with the aim of improving the model's handling of complex queries.

This two-pronged approach ensures both breadth (variety) and depth (complexity) in your training data.
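
The difference between the two evolution directions can be seen in the data they produce. The sketch below is only an illustration under assumptions: breadthVariants and isDepthEvolution are hypothetical helpers, the paraphrases are hand-written here (in practice a model would generate them), and the table names are made up.

```typescript
// Local copy of the pair shape so the sketch runs standalone.
interface ExtractedPair {
  question: string;
  sql: string;
  context?: string[];
  success: boolean;
}

// Hypothetical: in-breadth evolution pairs each paraphrase with the seed's SQL, unchanged.
function breadthVariants(seed: ExtractedPair, paraphrases: string[]): ExtractedPair[] {
  return paraphrases.map((question) => ({ ...seed, question }));
}

// Hypothetical check: an in-depth result changes both the question and the SQL.
function isDepthEvolution(seed: ExtractedPair, evolved: ExtractedPair): boolean {
  return evolved.question !== seed.question && evolved.sql !== seed.sql;
}

const seed: ExtractedPair = {
  question: 'How many customers do we have?',
  sql: 'SELECT COUNT(*) FROM customers',
  success: true,
};

// Breadth: two new questions, same SQL.
const variants = breadthVariants(seed, [
  'What is the total number of customers?',
  'Count all customers.',
]);

// Depth: a harder question that needs a different query
// (success is assumed here; a new query would normally be validated first).
const deeper: ExtractedPair = {
  question: 'How many customers placed more than five orders?',
  sql: 'SELECT COUNT(*) FROM (SELECT customer_id FROM orders GROUP BY customer_id HAVING COUNT(*) > 5) AS frequent',
  success: true,
};
```

Because depth evolution writes new SQL, its output is a natural candidate for validation before it enters a training set, whereas breadth variants inherit SQL that was already checked.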

Decorators

Decorators wrap producers to add functionality:

ValidatedProducer

Validate SQL and optionally execute queries:

import { ValidatedProducer, MessageExtractor, toPairs } from '@deepagents/text2sql/synthesis';

const validated = new ValidatedProducer(
  new MessageExtractor(messages),
  adapter,
  { execute: true }
);

const pairs = await toPairs(validated);
// Only includes pairs where SQL validated/executed successfully
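
To make the decorator idea concrete, here is one plausible shape for a validating wrapper. It is a sketch under assumptions, not the library's implementation: ValidatingDecorator, ProducerLike, and the SqlValidator callback are hypothetical, and a real validator would consult the database adapter instead of a callback.

```typescript
// Local copy of the pair shape so the sketch runs standalone.
interface ExtractedPair {
  question: string;
  sql: string;
  context?: string[];
  success: boolean;
}

// Structural stand-in for PairProducer.
interface ProducerLike {
  produce(): AsyncGenerator<ExtractedPair[], void, unknown>;
}

// Hypothetical validator signature: resolves to true when the SQL is acceptable.
type SqlValidator = (sql: string) => Promise<boolean>;

// Wraps another producer and keeps only pairs whose SQL passes the validator.
class ValidatingDecorator implements ProducerLike {
  private readonly inner: ProducerLike;
  private readonly validate: SqlValidator;

  constructor(inner: ProducerLike, validate: SqlValidator) {
    this.inner = inner;
    this.validate = validate;
  }

  async *produce(): AsyncGenerator<ExtractedPair[], void, unknown> {
    for await (const chunk of this.inner.produce()) {
      const kept: ExtractedPair[] = [];
      for (const pair of chunk) {
        if (await this.validate(pair.sql)) {
          kept.push({ ...pair, success: true });
        }
      }
      if (kept.length > 0) yield kept; // skip chunks that end up empty
    }
  }
}
```

Because the wrapper exposes the same produce() method as the producer it wraps, it can itself be wrapped again, which is what makes chaining possible.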

FilteredProducer

Filter pairs by criteria:

import { FilteredProducer, SqlExtractor } from '@deepagents/text2sql/synthesis';

const filtered = new FilteredProducer(
  new SqlExtractor(sqls, adapter),
  {
    successOnly: true,
    tables: ['orders', 'customers'], // Only pairs using these tables
  }
);
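
The filtering criteria can be pictured as a predicate applied to each pair. The sketch below is a guess at the idea, not the library's logic: referencedTables and matchesFilter are hypothetical, the table detection is a naive regex over FROM and JOIN clauses (it ignores quoted and schema-qualified names), and it assumes that "using these tables" means referencing at least one of them.

```typescript
// Local copy of the pair shape so the sketch runs standalone.
interface ExtractedPair {
  question: string;
  sql: string;
  context?: string[];
  success: boolean;
}

// Option names mirror the example above; their exact semantics are assumed.
interface FilterOptions {
  successOnly?: boolean;
  tables?: string[];
}

// Naive table detection: identifiers that directly follow FROM or JOIN.
function referencedTables(sql: string): string[] {
  const pattern = /\b(?:from|join)\s+([A-Za-z_][A-Za-z0-9_]*)/gi;
  const tables: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(sql)) !== null) {
    tables.push(match[1].toLowerCase());
  }
  return tables;
}

// Hypothetical predicate: true when the pair passes every configured criterion.
function matchesFilter(pair: ExtractedPair, options: FilterOptions): boolean {
  if (options.successOnly && !pair.success) return false;
  if (options.tables && options.tables.length > 0) {
    const allowed = options.tables.map((t) => t.toLowerCase());
    // Assumption: one referenced table from the list is enough to keep the pair.
    return referencedTables(pair.sql).some((t) => allowed.indexOf(t) !== -1);
  }
  return true;
}
```

A production filter would more likely rely on a SQL parser than on a regex, since subqueries, CTEs, and quoted identifiers quickly defeat pattern matching.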

DeduplicatedProducer

Remove duplicate pairs:

import { DeduplicatedProducer, MessageExtractor } from '@deepagents/text2sql/synthesis';

const deduped = new DeduplicatedProducer(
  new MessageExtractor(messages),
  { mode: 'sql' } // Dedupe by SQL only (ignore question variations)
);
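
The two modes that appear on this page, 'sql' and 'exact', can be thought of as two ways of computing a deduplication key. The following sketch assumes that reading; dedupeKey, dedupe, and the whitespace normalization in normalizeSql are illustrative choices of this example, not documented behavior.

```typescript
// Local copy of the pair shape so the sketch runs standalone.
interface ExtractedPair {
  question: string;
  sql: string;
  context?: string[];
  success: boolean;
}

type DedupeMode = 'sql' | 'exact';

// Assumed normalization: trim, drop trailing semicolons, collapse whitespace.
// Case is left alone so string literals are not altered.
function normalizeSql(sql: string): string {
  return sql.trim().replace(/;+\s*$/, '').replace(/\s+/g, ' ');
}

// 'sql' keys on the query alone; 'exact' keys on question plus query.
function dedupeKey(pair: ExtractedPair, mode: DedupeMode): string {
  const sql = normalizeSql(pair.sql);
  return mode === 'sql' ? sql : JSON.stringify([pair.question.trim(), sql]);
}

// Keep the first pair seen for each key, preserving input order.
function dedupe(pairs: ExtractedPair[], mode: DedupeMode): ExtractedPair[] {
  const seen = new Set<string>();
  const unique: ExtractedPair[] = [];
  for (const pair of pairs) {
    const key = dedupeKey(pair, mode);
    if (!seen.has(key)) {
      seen.add(key);
      unique.push(pair);
    }
  }
  return unique;
}
```

Deduplicating by SQL alone discards paraphrased questions, so 'exact' is likely the better fit after a breadth-evolution step, where question variety is the point.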

Composing Producers

Decorators can be chained:

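import { DeduplicatedProducer, FilteredProducer, ValidatedProducer, MessageExtractor, toPairs } from '@deepagents/text2sql/synthesis';

const producer = new DeduplicatedProducer(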
  new FilteredProducer(
    new ValidatedProducer(
      new MessageExtractor(messages),
      adapter
    ),
    { successOnly: true }
  ),
  { mode: 'exact' }
);

const pairs = await toPairs(producer);

Output Format

Export pairs for training:

const pairs = await toPairs(new SqlExtractor(sqls, adapter));

// JSONL format (common for fine-tuning)
const jsonl = pairs
  .filter(p => p.success)
  .map(p => JSON.stringify({
    question: p.question,
    sql: p.sql
  }))
  .join('\n');

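// CSV format (double any embedded quotes so each field stays intact)
const csvField = (value: string) => `"${value.replace(/"/g, '""')}"`;
const csv = pairs
  .map(p => `${csvField(p.question)},${csvField(p.sql)}`)
  .join('\n');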
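
One property of the JSONL export worth checking is that multi-line SQL still occupies a single line, because JSON.stringify escapes newlines inside strings. The round-trip sketch below illustrates this; toJsonl, fromJsonl, and TrainingRow are hypothetical names for this example.

```typescript
// Local copy of the pair shape so the sketch runs standalone.
interface ExtractedPair {
  question: string;
  sql: string;
  context?: string[];
  success: boolean;
}

// Shape of one exported line, matching the JSONL example above.
interface TrainingRow {
  question: string;
  sql: string;
}

// Serialize successful pairs, one JSON object per line.
function toJsonl(pairs: ExtractedPair[]): string {
  return pairs
    .filter((p) => p.success)
    .map((p) => JSON.stringify({ question: p.question, sql: p.sql }))
    .join('\n');
}

// Parse the export back, skipping blank lines.
function fromJsonl(text: string): TrainingRow[] {
  return text
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line) as TrainingRow);
}

const samplePairs: ExtractedPair[] = [
  {
    question: 'Which customers ordered in 2024?',
    sql: 'SELECT DISTINCT customer_id\nFROM orders\nWHERE order_year = 2024',
    success: true,
  },
  { question: 'Broken example', sql: 'SELEC oops', success: false },
];

const exported = toJsonl(samplePairs);
const restored = fromJsonl(exported);
```

Reading the file back before starting a training run is a cheap way to catch malformed lines early.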

Next Steps