AI Generated SQL for  BigQuery

How to create the best, most accurate, AI-generated SQL for BigQuery data exploration.

Tips to create the best, most accurate, AI-generated SQL for BigQuery data exploration.

I’m not talking about Gemini hear, because oddly enough the built-in recommendation are often the lowest quality in my experience.

If you’ve ever tried to get ChatGPT or Claude to write BigQuery SQL for your data warehouse, you’ve probably been frustrated by the results. Generic queries that don’t understand your schema, syntax errors, or SQL that technically runs but returns meaningless data.

The problem isn’t that AI is bad at SQL, it’s that you’re not giving it the right information about your data. AI tools are incredible at generating accurate BigQuery queries, but only when they fully understand your specific tables, columns, and data types.

Structured Schema Specification

The most effective way to get accurate SQL is feeding AI your complete Data Definition Language (DDL). BigQuery makes this easy with the INFORMATION_SCHEMA tables:

SELECT 
  table_catalog,
  table_schema,
  table_name,
  table_type,
  ddl
FROM bastille-450915.analytics_401818627.INFORMATION_SCHEMA.TABLES
WHERE table_schema NOT IN ('INFORMATION_SCHEMA', 'information_schema')
ORDER BY table_schema, table_name;

This query returns everything AI needs to understand your data structure. Export the results as JSON format rather than CSV. While CSV gets recommended everywhere, it truncates long DDL statements and has escaping issues with complex SQL syntax. JSON preserves the complete DDL exactly as stored in BigQuery.

Provide Sample Data

DDL tells AI what your tables look like, but sample data shows what’s actually in them. Generate a small sample dataset using:

SELECT *
FROM `bastille-450915.analytics_401818627.your-table`
TABLESAMPLE SYSTEM (1 PERCENT)
LIMIT 500;

For tables with consistent data distribution, use:

SELECT *
FROM `bastille-450915.analytics_401818627.your-table`
ORDER BY 1
LIMIT 500;

Export this as JSON too. Sample data gives AI concrete examples of your data values, formats, and relationships instead of forcing it to guess.

Provide Better Context

Beyond schema and sample data, you can dramatically improve AI-generated SQL by being more specific about what you need:

Be explicit about constraints

Tell AI about runtime or cost concerns so it avoids expensive full table scans. Mention if you can use scripting, temporary tables, or need to avoid hardcoded dataset names.

Clarify your permissions

If you only have access to certain datasets or can’t query INFORMATION_SCHEMA tables, say so upfront. This prevents “Access Denied” errors.

Mention data quirks

If you have field names that are inconsistent across tables, columns stored as strings but containing numeric data, or JSON-structured columns, call this out.

Use precise language

Instead of “I want a list of all tables,” say “I want table names, row counts, and last modified dates across all datasets in project my_project, ordered by last modified date descending.”

Specify output preferences

Whether you want a single query or broken into CTEs, uppercase or lowercase keywords, comments removed.

The difference between generic AI-generated SQL and perfectly accurate queries comes down to context. Give AI complete information about your data structure and requirements, and you’ll get SQL that runs correctly on the first try.

Conclusion

When you prompt ChatGPT, Claude, or any other AI tool be sure to include the following for the most accurate results.

  1. Complete DDL: Paste the JSON export from the INFORMATION_SCHEMA query
  2. Sample Data: Include the 500-row sample in JSON format
  3. Specific Questions: Instead of “analyze my data,” ask “what are the top 5 products by revenue in Q4 2024?”
  4. BigQuery Syntax Requirements: Mention you need BigQuery Standard SQL

This approach works because AI understands your exact schema, data types, and actual data patterns rather than making assumptions about generic table structures.

Getting AI to write accurate BigQuery SQL isn’t about finding the perfect prompt or using the right AI tool. It’s about giving AI the same information a human SQL expert would need: your complete schema, sample data, and clear requirements.

The two-step approach of exporting your DDL from INFORMATION_SCHEMA and providing sample data eliminates 90% of the guesswork that leads to bad AI-generated queries. Combined with specific context about constraints, permissions, and output preferences, you’ll get production-ready SQL that understands your actual data structure.

This method works consistently across ChatGPT, Claude, and other AI tools because you’re solving the fundamental problem: AI can’t read your mind about what your data looks like.

References