Using Reflection to Help LLMs Write Better SQL

Getting LLMs to write good SQL can be tricky. Sure, they can generate syntactically correct queries, but do those queries actually answer the question you asked? Sometimes an LLM might give you technically valid SQL that doesn’t quite capture what you’re really looking for.

I wanted to experiment with the reflection pattern to see if we could get better results. The idea is simple: after the LLM generates SQL and executes it, have it reflect on whether the query actually answers the original question. If not, let it try again with the benefit of seeing both the question and the initial results.

Let me show you how this works.

Setting up the database

I used an Oracle Autonomous Database on Oracle Cloud for this experiment. First, I created a user with the necessary permissions. Connect as ADMIN and run this:

create user moviestream identified by <password>;
grant connect, resource, unlimited tablespace to moviestream;
grant execute on dbms_cloud to moviestream;
grant execute on dbms_cloud_repo to moviestream;
grant create table to moviestream;
grant create view to moviestream;
grant all on directory data_pump_dir to moviestream;
grant create procedure to moviestream;
grant create sequence to moviestream;
grant create job to moviestream;

Next, let’s load the sample dataset. Still as ADMIN, run this:

declare 
    l_uri varchar2(500) := 'https://objectstorage.us-ashburn-1.oraclecloud.com/n/c4u04/b/building_blocks_utilities/o/setup/workshop-setup.sql';
begin
    dbms_cloud_repo.install_sql(
        content => to_clob(dbms_cloud.get_object(object_uri => l_uri))
    );
end;
/

Then connect as the moviestream user and run this to load the rest of the dataset:

BEGIN
    workshop.add_dataset(tag => 'end-to-end');
END;
/

This takes a few minutes to complete, after which we have a database with customer and sales data to work with.

The approach

The reflection pattern works like this:

  1. Give the LLM the database schema and a natural language question
  2. LLM generates SQL (v1)
  3. Execute the SQL and get results
  4. LLM reflects: “Does this SQL actually answer the question?”
  5. Generate improved SQL (v2) based on the reflection
  6. Execute v2 and provide the final answer

The key insight here is that by seeing the actual results, the LLM can judge whether it interpreted the question correctly. For example, if you ask “who are our top customers?”, the LLM might initially think “highest income” when you actually meant “highest spending”. Seeing the results helps it course-correct.

Setting up the Python environment

I used a Jupyter notebook for this experiment. First, let’s install the libraries we need:

%pip install aisuite oracledb 

I’m using Andrew Ng’s aisuite for a unified interface to different LLM providers, and oracledb to connect to the database.

Now let’s import aisuite:

import aisuite as ai

Connecting to Oracle Autonomous Database

For Oracle Autonomous Database, you’ll need to download the wallet and set up the connection. Here’s how I connected:

import oracledb

username = "moviestream"
password = "<password>"
dsn = "<connection_string>"
wallet = '<path_to_wallet>'

try:
    connection = oracledb.connect(
        user=username, 
        password=password, 
        dsn=dsn,
        config_dir=wallet,
        wallet_location=wallet,
        wallet_password='<wallet_password>')
    print("Connection successful!")
except Exception as e:
    print(e)
    print("Connection failed!")

And set the TNS_ADMIN environment variable:

import os
os.environ['TNS_ADMIN'] = wallet
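
Before going any further, it's worth a quick sanity check that the session is usable. This little check is my addition, not part of the original notebook:

cursor = connection.cursor()
cursor.execute("select sysdate from dual")  # any trivial query will do
print(cursor.fetchone())
cursor.close()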

Configuring the LLM client

Let’s set up the AI client. I used GPT-4o for this experiment:

client = ai.Client()
os.environ['OPENAI_API_KEY']='<your_api_key>'

models = ['openai:gpt-4o']
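
If you want to confirm the API key and model name are wired up correctly before involving the database, a throwaway test call through aisuite looks something like this (just a smoke test, not part of the main flow):

response = client.chat.completions.create(
    model=models[0],
    messages=[{"role": "user", "content": "Reply with the single word OK"}],
)
print(response.choices[0].message.content)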

Getting the database schema

For the LLM to write good SQL, it needs to know what tables and columns are available. Let’s write a function to introspect the schema:

def get_schema():
    stmt = f'''
    SELECT 
        utc.table_name,
        utc.column_name,
        utc.data_type,
        utc.data_length,
        utc.nullable,
        utc.column_id,
        ucc.comments AS column_comment,
        utab.comments AS table_comment
    FROM 
        user_tab_columns utc
    LEFT JOIN 
        user_col_comments ucc 
        ON utc.table_name = ucc.table_name 
        AND utc.column_name = ucc.column_name
    LEFT JOIN 
        user_tab_comments utab 
        ON utc.table_name = utab.table_name
    ORDER BY 
        utc.table_name, 
        utc.column_id -- no trailing semicolon: python-oracledb rejects it in SQL statements
    '''

    cursor = connection.cursor()
    cursor.execute(stmt)
    rows = cursor.fetchall()

    # Convert to one long string
    result_string = '\n'.join([str(row) for row in rows])

    cursor.close()

    return result_string

This function queries the Oracle data dictionary to get information about all tables and columns, including any comments. It returns everything as a single string that we can pass to the LLM.
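
The MovieStream schema is large enough that this raw tuple dump makes for a fairly long prompt. If you want something more compact, here is a rough sketch of a variation (my own idea, not what I used for the rest of this post) that groups columns by table:

from collections import defaultdict

def get_schema_compact():
    cursor = connection.cursor()
    cursor.execute('''
        SELECT table_name, column_name, data_type
        FROM user_tab_columns
        ORDER BY table_name, column_id''')
    tables = defaultdict(list)
    for table_name, column_name, data_type in cursor:
        tables[table_name].append(f"{column_name} {data_type}")
    cursor.close()
    # One line per table: TABLE_NAME(COL1 TYPE, COL2 TYPE, ...)
    return '\n'.join(f"{name}({', '.join(cols)})" for name, cols in tables.items())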

Generating SQL from natural language

Now let’s write the function that takes a natural language question and generates SQL:

def generate_sql(question: str, schema: str, model: str):
    prompt = f'''
    You are an SQL assistant for Oracle Database.
    You create Oracle SQL statements to help answer user questions.
    Given the user's question and the schema information, write an SQL
    query to answer the question.

    Schema:
    {schema}

    User question:
    {question}

    Respond with the SQL only.  Do not add any extra characters or delimiters.
    '''
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

This function takes the question, the schema information, and the model to use. It constructs a prompt that gives the LLM the context it needs and asks for just the SQL query.
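
Called on its own, it looks like this (the question here is just an arbitrary example):

schema = get_schema()
print(generate_sql('how many customers do we have?', schema, models[0]))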

Executing SQL queries

We need a function to actually run the generated SQL:

def execute_sql(stmt):
    cursor = connection.cursor()
    cursor.execute(stmt)
    rows = cursor.fetchall()

    # Convert to one long string
    result_string = '\n'.join([str(row) for row in rows])

    cursor.close()

    return result_string

This executes the query and returns the results as a string.
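
One caveat: if the LLM produces SQL that Oracle rejects, this function will raise an exception and stop the whole flow. A small defensive variant (my addition; you could swap it in for execute_sql) returns the error text instead, so the reflection step gets to see the failure and try to correct it:

def execute_sql_safe(stmt):
    cursor = connection.cursor()
    try:
        cursor.execute(stmt)
        rows = cursor.fetchall()
        return '\n'.join(str(row) for row in rows)
    except oracledb.DatabaseError as e:
        # Return the Oracle error instead of raising, so refine_sql can react to it
        return f"SQL ERROR: {e}"
    finally:
        cursor.close()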

The reflection step

Here’s where it gets interesting – the function that reviews the SQL and results, and potentially generates improved SQL:

import json

def refine_sql(question, sql_query, output, schema, model):
    prompt = f'''
    You are a SQL reviewer and refiner. 

    User asked:
    {question}

    Original SQL:
    {sql_query}

    SQL Output:
    {output}

    Schema:
    {schema}

    Step 1: Evaluate if the SQL OUTPUT fully answers the user's question.
    Step 2: If improvement is needed, provide a refined SQL query for Oracle.
    If the original SQL is already correct, return it unchanged.

    Return a strict JSON object with two fields:
    - "feedback": brief evaluation and suggestions
    - "refined_sql": the final SQL to run

    Return ONLY the actual JSON document.
    Do NOT add any extra characters or delimiters outside of the actual JSON itself.
    In particular do NOT include backticks before and after the JSON document.
    '''

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )

    content = response.choices[0].message.content
    try:
        obj = json.loads(content)
        feedback = str(obj.get("feedback", "")).strip()
        refined_sql = str(obj.get("refined_sql", sql_query)).strip()
        if not refined_sql:
            refined_sql = sql_query
    except Exception:
        # Fallback if model doesn't return valid JSON
        feedback = content.strip()
        refined_sql = sql_query

    return feedback, refined_sql

This is the heart of the reflection pattern. The function:

  1. Shows the LLM the original question, the generated SQL, and the actual results
  2. Asks it to evaluate whether the SQL output really answers the question
  3. If not, asks for an improved query
  4. Parses the JSON response and returns both the feedback and the refined SQL

The JSON format makes it easy to parse the response and extract both pieces of information. I had to be fairly pedantic to get gpt-4o to give me just JSON!
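
Even with those instructions, models sometimes wrap the JSON in markdown fences anyway. A small helper like this (hypothetical, not in my notebook) could be applied to content before the json.loads call as an extra safety net:

def strip_fences(text: str) -> str:
    # Remove ```json ... ``` style fences if the model added them despite the prompt
    text = text.strip()
    if text.startswith("```"):
        text = text.split("\n", 1)[1] if "\n" in text else ""
        if text.rstrip().endswith("```"):
            text = text.rstrip()[:-3]
    return text.strip()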

Providing a final answer

Finally, let’s write a function to convert the query results into a natural language answer:

def provide_final_answer(question, output, model):
    prompt = f'''
    You are a helpful assistant.
    Given a user's question, and the results of a database query
    which has been created, evaluated, improved and executed already
    in order to get the provided output, you should provide an
    answer to the user's question.

    User question:
    {question}

    Query results:
    {output}
    '''
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

This takes the final query results and turns them into a friendly, natural language response for the user.

Putting it all together

Now let’s create the main function that orchestrates the entire process:

def generate_and_reflect(question: str):
    
    schema = get_schema()
    print('SCHEMA')
    print(schema)
    print()

    sql_v1 = generate_sql(question, schema, models[0])
    print("SQL V1")
    print(sql_v1)
    print()

    output_v1 = execute_sql(sql_v1)
    print("SQL V1 output")
    print(output_v1)
    print()

    feedback, sql_v2 = refine_sql(question, sql_v1, output_v1, schema, models[0])
    print("FEEDBACK")
    print(feedback)
    print()
    print("SQL V2")
    print(sql_v2)
    print()

    output_v2 = execute_sql(sql_v2)
    print("SQL V2 output")
    print(output_v2)
    print()

    final_answer = provide_final_answer(question, output_v2, models[0])
    print("FINAL ANSWER")
    print(final_answer)
    print()

This function:

  1. Gets the database schema
  2. Generates the first SQL query
  3. Executes it and prints the results
  4. Sends everything to the reflection function for evaluation
  5. Generates and executes the refined SQL
  6. Converts the final results into a natural language answer

Running the experiment

Let’s try it out with a question that could be interpreted multiple ways:

generate_and_reflect('who are our top customers?')

The results

Here’s what happened when I ran this:

First attempt (SQL V1):

SELECT CUST_ID, FIRST_NAME, LAST_NAME, EMAIL, INCOME
FROM CUSTOMER
ORDER BY INCOME DESC
FETCH FIRST 10 ROWS ONLY;

The LLM interpreted “top customers” as customers with the highest income. It returned folks with incomes around $187,000:

(1138797, 'Haruru', 'Takahashi', 'haruru.takahashi6@oraclemail.com', 187168.8)
(1007335, 'Eddie', 'Crawford', 'eddie.crawford@oraclemail.com', 187145.4)
(1404002, 'Yuuto', 'Arai', 'yuuto.arai3@oraclemail.com', 187136.04)
...

Reflection:

The original SQL query retrieves the top 10 customers based on income, which may not 
fully answer the question of 'top customers' as it could be interpreted in terms of 
sales or transactions. To better answer the question, we should consider the total 
sales or transactions made by each customer.

Great! The LLM recognized that “top customers” probably means customers who spend the most, not customers who earn the most.

Second attempt (SQL V2):

SELECT C.CUST_ID, C.FIRST_NAME, C.LAST_NAME, C.EMAIL, SUM(S.ACTUAL_PRICE) AS TOTAL_SALES 
FROM CUSTOMER C 
JOIN CUSTSALES S ON C.CUST_ID = S.CUST_ID 
GROUP BY C.CUST_ID, C.FIRST_NAME, C.LAST_NAME, C.EMAIL 
ORDER BY TOTAL_SALES DESC 
FETCH FIRST 10 ROWS ONLY;

Much better! Now it’s joining with the sales data and calculating total spending per customer.

Final results:

(1234517, 'Tsubasa', 'Nakajima', 'tsubasa.nakajima2@oraclemail.com', 2356.049999999997)
(1280887, 'Steffi', 'Bielvenstram', 'steffi.bielvenstram@oraclemail.com', 2334.7299999999996)
(1017254, 'Guadalupe', 'Zamora', 'guadalupe.zamora@oraclemail.com', 2329.7599999999998)
...

The top customer is Tsubasa Nakajima with $2,356.05 in total sales, followed by Steffi Bielvenstram with $2,334.73, and so on. These are very different customers from the high-income list we got in the first attempt!

Natural language answer:

Our top customers, based on the provided data, are:

1. Tsubasa Nakajima - Email: tsubasa.nakajima2@oraclemail.com, Total: $2356.05
2. Steffi Bielvenstram - Email: steffi.bielvenstram@oraclemail.com, Total: $2334.73
3. Guadalupe Zamora - Email: guadalupe.zamora@oraclemail.com, Total: $2329.76
...

These customers have the highest total amounts associated with them.

What I learned

This reflection approach really does help. The LLM is pretty good at recognizing when its initial SQL doesn’t quite match the intent of the question – especially when it can see the actual results.

The pattern of generate → execute → reflect → regenerate is more expensive (two LLM calls instead of one for generation, plus one more for the final answer), but the quality improvement is noticeable. For production use, you might want to:

  • Cache schema information instead of fetching it every time (see the sketch after this list)
  • Add more sophisticated error handling for SQL errors
  • Consider running both queries in parallel and comparing results
  • Track which types of questions benefit most from reflection
  • Use the reflection feedback to build a dataset for fine-tuning
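
The schema caching point is nearly free to implement. A minimal sketch, assuming the schema doesn't change during a session:

from functools import lru_cache

@lru_cache(maxsize=1)
def get_schema_cached():
    # Dictionary queries are cheap, but there's no need to repeat them for every question
    return get_schema()

generate_and_reflect would then call get_schema_cached() instead of get_schema().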

The approach is straightforward to implement and the results speak for themselves – the reflection step caught a subtle but important misinterpretation that would have given technically correct but unhelpful results.

Give it a try with your own database and questions – I think you’ll find the reflection step catches a lot of these subtle misinterpretations that would otherwise lead to valid but wrong answers.

What next? I am going to experiment with some more complex questions, and then compare the performance of a number of different LLMs to see how they go with and without reflection. Stay tuned 🙂

About Mark Nelson

Mark Nelson is a Developer Evangelist at Oracle, focusing on microservices and messaging. Before this role, Mark was an Architect in the Enterprise Cloud-Native Java Team, worked on the Verrazzano Enterprise Container Platform project, Wercker, and WebLogic, was a senior member of the A-Team from 2010, worked in Sales Consulting at Oracle from 2006, and held various roles at IBM from 1994.