Text2SQL - Natural Language to SQL

The TeachingsGenerator analyzes your database schema to automatically generate domain-specific teachings - structured knowledge items that improve SQL generation accuracy. These teachings include vocabulary, patterns, guardrails, and examples grounded in your actual schema.

Basic Usage

import { TeachingsGenerator } from '@deepagents/text2sql/synthesis';

const generator = new TeachingsGenerator(adapter, {
  context: 'E-commerce database tracking orders and inventory'
});

const teachings = await generator.generate();
// Returns 3-10 high-value teachables (terms, hints, guardrails, etc.)

How It Works

TeachingsGenerator examines your database schema to create relevant, high-impact teachables:

Introspects schema (tables, columns, relationships, constraints)
Analyzes table/column names to infer domain vocabulary
Identifies potential guardrails (large tables, sensitive data)
Detects common patterns and relationships
Generates 3-10 teachables prioritizing:
- Guardrails for safety and performance
- Clarifications for ambiguous terms
- Hints for best practices
- Domain terminology

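The generator returns the teachings as an array. Calling text2sql.teach() generates and applies them to your Text2SQL instance automatically; when you run TeachingsGenerator yourself, pass the result to text2sql.instruct() (see Integration with Text2SQL).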
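The prioritization step can be pictured as a sort followed by a cap. The sketch below is only an illustration, not the library's implementation: the Candidate shape, the PRIORITY table, and selectTeachables are hypothetical names, and the ranking of 'example' and 'quirk' after the four listed categories is an assumption.

```typescript
// Hypothetical shape: only the `type` values come from the documented output.
type TeachableType =
  | 'guardrail'
  | 'clarification'
  | 'hint'
  | 'term'
  | 'example'
  | 'quirk';

interface Candidate {
  type: TeachableType;
  summary: string;
}

// Lower number = higher priority. The first four follow the list above;
// the positions of 'example' and 'quirk' are an assumption.
const PRIORITY: Record<TeachableType, number> = {
  guardrail: 0,
  clarification: 1,
  hint: 2,
  term: 3,
  example: 4,
  quirk: 5,
};

// Sort a copy of the candidates by priority and keep at most `max` of them.
function selectTeachables(candidates: Candidate[], max = 10): Candidate[] {
  return [...candidates]
    .sort((a, b) => PRIORITY[a.type] - PRIORITY[b.type])
    .slice(0, max);
}

const selected = selectTeachables(
  [
    { type: 'term', summary: 'active customer' },
    { type: 'guardrail', summary: 'date filter on orders' },
    { type: 'hint', summary: 'exclude test accounts' },
  ],
  2,
);
console.log(selected.map((c) => c.type)); // [ 'guardrail', 'hint' ]
```

With a cap of 2, the lower-priority term is dropped while the guardrail and hint survive.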

Configuration Options

interface TeachingsGeneratorOptions {
  /** Additional domain context to guide generation */
  context?: string;
  /** Model to use for generation */
  model?: AgentModel;
}

Domain Context

Provide business context to improve teaching quality:

// Without context - relies purely on schema
const schemaOnly = await new TeachingsGenerator(adapter).generate();

// With context - generates more relevant teachings
const withContext = await new TeachingsGenerator(adapter, {
  context: `Our database tracks e-commerce orders.
  - Active customers are those who ordered in the last 90 days
  - We exclude test accounts with email ending in @test.com
  - Revenue should exclude cancelled and refunded orders`
}).generate();

Custom Model

Override the default model:

import { groq } from '@ai-sdk/groq';

const teachings = await new TeachingsGenerator(adapter, {
  context: 'Healthcare patient records',
  model: groq('llama-3.3-70b-versatile')
}).generate();

Retry Handling

The generate() method includes automatic retry logic for transient errors:

// Retries up to 3 times (default)
const teachings = await generator.generate();

// Custom retry count
const teachingsWithMoreRetries = await generator.generate(5);

Retryable errors include:

Parse errors (malformed output)
Schema validation failures
"No object generated" errors
AI provider errors (rate limits, timeouts)

Non-retryable errors (thrown immediately):

Authentication failures
Invalid schema format
Network connection errors
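To make the retryable/non-retryable split concrete, here is a minimal sketch of a retry loop that applies it. It is not the code behind generate(): ErrorKind, isRetryable, and withRetry are hypothetical names, errors are assumed to carry a `kind` field, and the retry count is treated as the total number of attempts, which is also an assumption.

```typescript
// Hypothetical error categories mirroring the two lists above.
type ErrorKind =
  | 'parse'
  | 'validation'
  | 'no-object'
  | 'provider'
  | 'auth'
  | 'invalid-schema'
  | 'network';

interface KindedError {
  kind: ErrorKind;
  message: string;
}

const RETRYABLE = new Set<ErrorKind>([
  'parse',
  'validation',
  'no-object',
  'provider',
]);

function isRetryable(kind: ErrorKind): boolean {
  return RETRYABLE.has(kind);
}

// Run `task` up to `maxAttempts` times. Retryable errors trigger another
// attempt; anything else (or an error without a `kind`) is rethrown at once.
async function withRetry<T>(
  task: () => Promise<T>,
  maxAttempts = 3,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await task();
    } catch (err) {
      const kind = (err as Partial<KindedError> | null)?.kind;
      if (kind === undefined || !isRetryable(kind)) throw err;
      lastError = err;
    }
  }
  throw lastError;
}

// Usage: a task that fails twice with a parse error, then succeeds.
let calls = 0;
const flaky = async (): Promise<string> => {
  calls += 1;
  if (calls < 3) throw { kind: 'parse', message: 'malformed output' };
  return 'ok';
};
withRetry(flaky).then((value) => console.log(value, calls)); // logs: ok 3
```

An authentication failure in the same loop would surface on the first attempt, since retrying cannot fix bad credentials.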

Using with SchemaSynthesizer

Generate training pairs with domain knowledge:

import {
  TeachingsGenerator,
  SchemaSynthesizer,
  PersonaGenerator,
  toPairs
} from '@deepagents/text2sql/synthesis';

// Generate teachings
const teachingsGen = new TeachingsGenerator(adapter, {
  context: 'SaaS subscription and usage tracking database'
});
const teachings = await teachingsGen.generate();

// Generate personas
const personaGen = new PersonaGenerator(adapter, { count: 5 });
const personas = await personaGen.generate();

// Generate pairs with teachings and personas
const pairs = await toPairs(new SchemaSynthesizer(adapter, {
  count: 10,
  complexity: 'medium',
  personas: personas,
  teachings: teachings  // Guides SQL generation
}));

Example Output

For an e-commerce database, generated teachings might include:

[
  {
    type: 'guardrail',
    rule: 'Avoid unbounded scans on orders table - always include date range filter',
    reason: 'Performance - orders table contains millions of rows',
    action: 'Ask user for timeframe if not specified'
  },
  {
    type: 'hint',
    text: 'Exclude test accounts with email ending in @test.com from all metrics'
  },
  {
    type: 'term',
    name: 'active customer',
    definition: 'customer who placed an order in the last 90 days'
  },
  {
    type: 'clarification',
    when: 'user asks for "revenue"',
    ask: 'Do you want gross revenue or net revenue (excluding refunds/cancellations)?',
    reason: 'Revenue can mean different things for different analyses'
  },
  {
    type: 'quirk',
    issue: 'order_total includes shipping and tax',
    workaround: 'Use order_subtotal for product revenue only'
  },
  {
    type: 'example',
    question: 'show top selling products',
    answer: `SELECT p.product_name, COUNT(oi.order_id) as order_count
FROM products p
JOIN order_items oi ON p.product_id = oi.product_id
JOIN orders o ON oi.order_id = o.order_id
WHERE o.status NOT IN ('cancelled', 'refunded')
  AND o.created_at >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY p.product_id, p.product_name
ORDER BY order_count DESC
LIMIT 10`,
    note: 'Excludes cancelled/refunded orders and focuses on recent history'
  }
]
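When reviewing output like this, it can help to model the items as a discriminated union keyed on `type`. The sketch below is an assumption-laden illustration: the field names are copied from the example above, but which fields are optional is a guess, and the package may export its own types that should be preferred. The summarize helper is hypothetical.

```typescript
// Field names taken from the example output; optionality is an assumption.
type Teachable =
  | { type: 'guardrail'; rule: string; reason?: string; action?: string }
  | { type: 'hint'; text: string }
  | { type: 'term'; name: string; definition: string }
  | { type: 'clarification'; when: string; ask: string; reason?: string }
  | { type: 'quirk'; issue: string; workaround: string }
  | { type: 'example'; question: string; answer: string; note?: string };

// Produce a one-line summary per teachable for quick manual review.
// The switch narrows `t` so each branch only sees its own fields.
function summarize(t: Teachable): string {
  switch (t.type) {
    case 'guardrail':
      return `Guardrail: ${t.rule}`;
    case 'hint':
      return `Hint: ${t.text}`;
    case 'term':
      return `Term: ${t.name} = ${t.definition}`;
    case 'clarification':
      return `Clarify when ${t.when}: ${t.ask}`;
    case 'quirk':
      return `Quirk: ${t.issue} (workaround: ${t.workaround})`;
    case 'example':
      return `Example: ${t.question}`;
  }
}

const sample: Teachable[] = [
  { type: 'hint', text: 'Exclude test accounts' },
  { type: 'term', name: 'active customer', definition: 'ordered in the last 90 days' },
];
for (const t of sample) console.log(summarize(t));
// Hint: Exclude test accounts
// Term: active customer = ordered in the last 90 days
```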

Integration with Text2SQL

Use text2sql.teach() for automatic teaching (recommended):

// Automatic: generates and applies teachings
await text2sql.teach(
  'Our database tracks e-commerce orders. Active customers ordered in last 90 days.'
);

// Manual: generate teachings separately
const generator = new TeachingsGenerator(adapter, {
  context: 'E-commerce database'
});
const teachings = await generator.generate();

// Apply manually
text2sql.instruct(...teachings);

The automatic teach() method handles:

Schema introspection
Teaching generation
Application to Text2SQL instance
Formatting for system prompt

Full Pipeline Example

Generate a comprehensive training dataset with teachings:

import {
  TeachingsGenerator,
  PersonaGenerator,
  SchemaSynthesizer,
  DepthEvolver,
  BreadthEvolver,
  toPairs
} from '@deepagents/text2sql/synthesis';

// 1. Generate teachings
const teachingsGen = new TeachingsGenerator(adapter, {
  context: `Financial services database.
  - NPL = non-performing loan (90+ days past due)
  - Always exclude test accounts (account_type = 'test')
  - Basis points: 1% = 100 bps`
});
const teachings = await teachingsGen.generate();

// 2. Generate personas
const personaGen = new PersonaGenerator(adapter, { count: 8 });
const personas = await personaGen.generate();

// 3. Generate base pairs
const basePairs = await toPairs(new SchemaSynthesizer(adapter, {
  count: 5,
  complexity: ['low', 'medium', 'hard'],
  personas: personas,
  teachings: teachings
}));
console.log(`Base pairs: ${basePairs.length}`);

// 4. Evolve in depth (make questions harder)
const depthEvolved = await toPairs(
  new DepthEvolver(basePairs, adapter, {
    count: 2  // 2 harder versions per question
  })
);
console.log(`Depth evolved: ${depthEvolved.length}`);

// 5. Evolve in breadth (paraphrase)
const breadthEvolved = await toPairs(
  new BreadthEvolver([...basePairs, ...depthEvolved], {
    count: 3  // 3 paraphrases per question
  })
);
console.log(`Breadth evolved: ${breadthEvolved.length}`);

const allPairs = [...basePairs, ...depthEvolved, ...breadthEvolved];
console.log(`Total dataset: ${allPairs.length} pairs`);
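As a rough size check for the pipeline above, the arithmetic can be worked out ahead of time. This sketch assumes each evolver yields exactly `count` new pairs per input pair and that nothing is filtered out; real runs may differ. The expectedSizes helper is hypothetical, and the base size of 5 is only an illustrative input, since the number of base pairs depends on how SchemaSynthesizer combines count, complexity, and personas.

```typescript
interface PipelineSizes {
  base: number;
  depth: number;
  breadth: number;
  total: number;
}

// Depth evolution runs on the base pairs; breadth evolution runs on
// base + depth-evolved pairs, matching steps 4 and 5 above.
function expectedSizes(
  base: number,
  depthCount: number,
  breadthCount: number,
): PipelineSizes {
  const depth = base * depthCount;
  const breadth = (base + depth) * breadthCount;
  return { base, depth, breadth, total: base + depth + breadth };
}

console.log(expectedSizes(5, 2, 3));
// { base: 5, depth: 10, breadth: 45, total: 60 }
```

Under these assumptions the dataset grows by a factor of (1 + depthCount) * (1 + breadthCount), here 3 * 4 = 12.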

Teaching Types Generated

The generator produces these teachable types (in priority order):

Guardrails - Safety and performance boundaries
- Large table scan warnings
- Sensitive data protection
- Query complexity limits
Hints - Always-apply rules
- Test account exclusions
- Default date ranges
- Preferred join patterns
Clarifications - When to ask users for more info
- Ambiguous metrics (revenue, active, conversion)
- Time period specifications
- Aggregation level preferences
Terms - Domain vocabulary
- Business acronyms and jargon
- Domain-specific concepts
- Canonical values and their meanings
Examples - Common query patterns
- Frequently asked questions
- Complex join patterns
- Aggregation examples
Quirks - Data edge cases and workarounds
- Column format issues
- Calculation gotchas
- Schema anomalies

Best Practices

Provide rich context - The more domain context you provide, the better the teachings
Review generated teachings - Spot-check output to ensure relevance and accuracy
Combine with manual teachings - Auto-generated + manually curated = comprehensive
Iterate - Regenerate teachings as your schema evolves
Use the retry parameter - Set a higher retry count for production pipelines

Related Pages

Teach the System - Comprehensive guide to all teachable types
Persona Generator - Generate user personas for diverse questions
From Schema - Using teachings with SchemaSynthesizer

Comparison: Manual vs Generated

Aspect	Manual Teaching	Generated Teaching
Precision	High - you define exact rules	Medium - inferred from schema
Coverage	Low - only what you write	Medium - covers common patterns
Effort	High - requires domain expertise	Low - automated
Best For	Business-critical rules, edge cases	Bootstrapping, domain vocabulary

Recommendation: Use both. Generate teachings for baseline knowledge, then supplement with manual teachings for critical business rules and edge cases.
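One way to combine the two sources is to let manual teachings override generated ones that cover the same item. The sketch below is an illustration under assumptions, not a library feature: mergeTeachings and keyOf are hypothetical helpers, only 'term' and 'hint' are modeled for brevity, and matching by case-insensitive name or text is a design choice you may want to change.

```typescript
interface Term {
  type: 'term';
  name: string;
  definition: string;
}
interface Hint {
  type: 'hint';
  text: string;
}
type Teachable = Term | Hint;

// Identify a teachable by its type plus a normalized name or text.
function keyOf(t: Teachable): string {
  const id = t.type === 'term' ? t.name : t.text;
  return `${t.type}:${id.trim().toLowerCase()}`;
}

// Start from generated teachings, then let manual ones overwrite
// any entry that shares a key. Unmatched manual items are appended.
function mergeTeachings(
  generated: Teachable[],
  manual: Teachable[],
): Teachable[] {
  const merged = new Map<string, Teachable>();
  for (const t of generated) merged.set(keyOf(t), t);
  for (const t of manual) merged.set(keyOf(t), t); // manual wins on conflict
  return Array.from(merged.values());
}

const combined = mergeTeachings(
  [
    { type: 'term', name: 'active customer', definition: 'ordered in the last 90 days' },
    { type: 'hint', text: 'Exclude test accounts' },
  ],
  [
    { type: 'term', name: 'Active Customer', definition: 'paid order in the last 30 days' },
  ],
);
console.log(combined.length); // 2
```

The merged array could then be passed to text2sql.instruct(...combined) in the same way as the manual flow shown earlier.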
