Deep Agents

From Chat History

Extract training pairs from chat conversations

The MessageExtractor parses chat history to extract question/SQL pairs from actual user interactions. This is the primary way to harvest training data from production usage.

Basic Usage

import { MessageExtractor, toPairs } from '@deepagents/text2sql/synthesis';

const pairs = await toPairs(new MessageExtractor(messages));

console.log(pairs);
// [
//   { question: 'Show top customers', sql: 'SELECT ...', success: true },
//   { question: 'Revenue by month', sql: 'SELECT ...', success: true },
// ]

How It Works

MessageExtractor scans chat messages for db_query tool calls:

  1. Iterates through messages in order
  2. Tracks the most recent user message
  3. When it finds an assistant message with a db_query tool call:
    • Extracts the SQL from the tool input
    • Pairs it with the preceding user message
    • Records success/failure based on tool state
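The scan loop above can be sketched roughly as follows. This is a simplified illustration, not the library's implementation; the `extractPairs` helper and its slimmed-down types are hypothetical:

```typescript
// Minimal types mirroring the message shape the extractor reads.
interface Part {
  type: string;
  text?: string;
  toolName?: string;
  state?: string;
  input?: { sql?: string };
}
interface Message { role: 'user' | 'assistant' | 'system'; parts: Part[]; }
interface Pair { question: string; sql: string; success: boolean; }

function extractPairs(messages: Message[], toolName = 'db_query'): Pair[] {
  const pairs: Pair[] = [];
  let lastQuestion = '';
  for (const msg of messages) {
    if (msg.role === 'user') {
      // Track the most recent user message text.
      const text = msg.parts.find(p => p.type === 'text')?.text;
      if (text) lastQuestion = text;
    } else if (msg.role === 'assistant') {
      for (const part of msg.parts) {
        if (
          part.type === 'tool-invocation' &&
          part.toolName === toolName &&
          part.input?.sql
        ) {
          // Pair the SQL with the preceding question; derive success
          // from the tool state.
          pairs.push({
            question: lastQuestion,
            sql: part.input.sql,
            success: part.state === 'output-available',
          });
        }
      }
    }
  }
  return pairs;
}
```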

Configuration Options

interface MessageExtractorOptions {
  /** Include failed queries in output (default: false) */
  includeFailures?: boolean;
  /** Tool name to extract SQL from (default: 'db_query') */
  toolName?: string;
}

Include Failures

By default, only successful queries are extracted:

// Only successful queries
const pairs = await toPairs(new MessageExtractor(messages));

// Include failures for error analysis
const allPairs = await toPairs(
  new MessageExtractor(messages, { includeFailures: true })
);

// Filter later
const failed = allPairs.filter(p => !p.success);

Custom Tool Name

If you use a different tool name for queries:

const pairs = await toPairs(
  new MessageExtractor(messages, { toolName: 'execute_sql' })
);

Message Format

Messages must follow the Vercel AI SDK UIMessage format:

interface UIMessage {
  role: 'user' | 'assistant' | 'system';
  parts: UIPart[];
}

The extractor looks for tool parts in assistant messages:

const messages: UIMessage[] = [
  {
    role: 'user',
    parts: [{ type: 'text', text: 'Show top 10 customers' }],
  },
  {
    role: 'assistant',
    parts: [
      {
        type: 'tool-invocation',
        toolInvocationId: '123',
        toolName: 'db_query',
        state: 'output-available',
        input: { sql: 'SELECT * FROM customers LIMIT 10' },
        output: { rows: [...] },
      },
    ],
  },
];

Contextual Extractors

For multi-turn conversations where context matters, use contextual extractors. These extractors extend BaseContextualExtractor (an abstract base class using the Template Method pattern) to add conversation context to extracted pairs.

FullContextExtractor

Keeps all preceding messages as context:

import { FullContextExtractor, toPairs } from '@deepagents/text2sql/synthesis';

const extractor = new FullContextExtractor(messages, adapter);
const pairs = await toPairs(extractor);

// pairs[0].context = ['User: Show me sales data']
// pairs[0].question = 'Now filter to California'
// pairs[0].sql = "SELECT ... WHERE state = 'CA'"

WindowedContextExtractor

Keeps only the last N messages in a sliding window:

import { WindowedContextExtractor, toPairs } from '@deepagents/text2sql/synthesis';

const extractor = new WindowedContextExtractor(messages, adapter, {
  windowSize: 5, // Keep last 5 messages as context
});
const pairs = await toPairs(extractor);
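The windowing itself amounts to a slice over the running transcript. A hypothetical sketch (`windowContext` is not part of the library API):

```typescript
// Keep only the last `windowSize` lines of the running transcript
// as context for the next extracted pair.
function windowContext(transcript: string[], windowSize: number): string[] {
  return transcript.slice(-windowSize);
}
```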

SegmentedContextExtractor

Uses an LLM to detect topic changes and reset context. When a topic change is detected, the context is reset and the triggering message is first resolved into a standalone form:

import { SegmentedContextExtractor, toPairs } from '@deepagents/text2sql/synthesis';

const extractor = new SegmentedContextExtractor(messages, adapter);
const pairs = await toPairs(extractor);

// Context resets when the user asks about a completely different topic

LastQueryExtractor

Extracts only the last db_query from a conversation and uses the full conversation context to generate a standalone question via LLM:

import { LastQueryExtractor, toPairs } from '@deepagents/text2sql/synthesis';

const extractor = new LastQueryExtractor(messages, adapter);
const pairs = await toPairs(extractor);

// Returns at most 1 pair - the last query with resolved question
// pairs[0].question = 'Show customers in NY sorted by revenue'
// pairs[0].sql = "SELECT * FROM customers WHERE region = 'NY' ORDER BY revenue DESC"
// pairs[0].context = ['User: Show me customers', 'Assistant: ...', 'User: Filter to NY', ...]

Use this when you need to extract a standalone question for the most recent SQL query in a conversation. It collects all user and assistant messages leading up to the last query and asks the LLM to generate a question that captures the full intent without needing prior context.

Example: Extracting from History Store

import { MessageExtractor, toPairs } from '@deepagents/text2sql/synthesis';

// Load chat history
const chat = await history.getChat('chat-123');
const messages = chat.messages.map(m => m.content);

// Extract pairs
const pairs = await toPairs(new MessageExtractor(messages));

// Export for training
console.log(`Extracted ${pairs.length} pairs`);
for (const pair of pairs) {
  console.log(`Q: ${pair.question}`);
  console.log(`SQL: ${pair.sql}`);
  console.log(`Success: ${pair.success}`);
  console.log('---');
}

Example: Batch Processing Multiple Chats

import { MessageExtractor, toPairs, type ExtractedPair } from '@deepagents/text2sql/synthesis';

const allPairs: ExtractedPair[] = [];

// Process all chats for a user
const chats = await history.listChats('user-456');

for (const chatSummary of chats) {
  const chat = await history.getChat(chatSummary.id);
  const messages = chat.messages.map(m => m.content);

  const pairs = await toPairs(new MessageExtractor(messages));
  allPairs.push(...pairs);
}

console.log(`Total pairs extracted: ${allPairs.length}`);

Validating Extracted Pairs

Combine with ValidatedProducer to re-validate SQL:

import {
  MessageExtractor,
  ValidatedProducer,
  toPairs
} from '@deepagents/text2sql/synthesis';

const pairs = await toPairs(
  new ValidatedProducer(
    new MessageExtractor(messages),
    adapter,
    { execute: false } // Just validate syntax
  )
);

Best Practices

  1. Filter production data - Remove test accounts and internal users before extraction
  2. Validate extracted SQL - Schema may have changed since the query was run
  3. Include context for multi-turn - Use contextual extractors when questions reference previous turns
  4. Deduplicate - Users often ask similar questions; use DeduplicatedProducer
  5. Sample strategically - Balance simple and complex queries in your training set
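For point 4, a question-level dedupe can be sketched like this. This is an illustration only; check DeduplicatedProducer's documentation for how the library actually keys duplicates:

```typescript
// Drop pairs whose normalized question has been seen before.
// Normalization here (trim + lowercase) is a naive assumption.
function dedupeByQuestion<T extends { question: string }>(pairs: T[]): T[] {
  const seen = new Set<string>();
  return pairs.filter(p => {
    const key = p.question.trim().toLowerCase();
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```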