Getting LLMs to write good SQL can be tricky. Sure, they can generate syntactically correct queries, but do those queries actually answer the question you asked? Sometimes an LLM might give you technically valid SQL that doesn’t quite capture what you’re really looking for.
I wanted to experiment with the reflection pattern to see if we could get better results. The idea is simple: after the LLM generates SQL and executes it, have it reflect on whether the query actually answers the original question. If not, let it try again with the benefit of seeing both the question and the initial results.
Let me show you how this works.
Setting up the database
I used an Oracle Autonomous Database on Oracle Cloud for this experiment. First, I created a user with the necessary permissions. Connect as ADMIN and run this:
create user moviestream identified by <password>;
grant connect, resource, unlimited tablespace to moviestream;
grant execute on dbms_cloud to moviestream;
grant execute on dbms_cloud_repo to moviestream;
grant create table to moviestream;
grant create view to moviestream;
grant all on directory data_pump_dir to moviestream;
grant create procedure to moviestream;
grant create sequence to moviestream;
grant create job to moviestream;
Next, let’s load the sample dataset. Still as ADMIN, run this:
declare
    l_uri varchar2(500) := 'https://objectstorage.us-ashburn-1.oraclecloud.com/n/c4u04/b/building_blocks_utilities/o/setup/workshop-setup.sql';
begin
    dbms_cloud_repo.install_sql(
        content => to_clob(dbms_cloud.get_object(object_uri => l_uri))
    );
end;
/
Then connect as the moviestream user and run this to load the rest of the dataset:
BEGIN
    workshop.add_dataset(tag => 'end-to-end');
END;
/
This takes a few minutes to complete, after which we have a database with customer and sales data to work with.
The approach
The reflection pattern works like this:
- Give the LLM the database schema and a natural language question
- LLM generates SQL (v1)
- Execute the SQL and get results
- LLM reflects: “Does this SQL actually answer the question?”
- Generate improved SQL (v2) based on the reflection
- Execute v2 and provide the final answer
The key insight here is that by seeing the actual results, the LLM can judge whether it interpreted the question correctly. For example, if you ask “who are our top customers?”, the LLM might initially think “highest income” when you actually meant “highest spending”. Seeing the results helps it course-correct.
Setting up the Python environment
I used a Jupyter notebook for this experiment. First, let’s install the libraries we need:
%pip install aisuite[openai] oracledb
I’m using Andrew Ng’s aisuite for a unified interface to different LLM providers (the openai extra pulls in the OpenAI SDK that aisuite’s OpenAI provider needs), and oracledb to connect to the database.
Now let’s import aisuite:
import aisuite as ai
Connecting to Oracle Autonomous Database
For Oracle Autonomous Database, you’ll need to download the wallet and set up the connection. Here’s how I connected:
import oracledb

username = "moviestream"
password = "<password>"
dsn = "<connection_string>"
wallet = '<path_to_wallet>'

try:
    connection = oracledb.connect(
        user=username,
        password=password,
        dsn=dsn,
        config_dir=wallet,
        wallet_location=wallet,
        wallet_password='<wallet_password>')
    print("Connection successful!")
except Exception as e:
    print(e)
    print("Connection failed!")
And set the TNS_ADMIN environment variable:
import os
os.environ['TNS_ADMIN'] = wallet
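At this point it’s worth a quick sanity check that queries actually run; any lightweight query against dual will do:
with connection.cursor() as cursor:
    # If this prints the database time, the wallet and credentials are set up correctly
    cursor.execute("select sysdate from dual")
    print(cursor.fetchone())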
Configuring the LLM client
Let’s set up the AI client. I used GPT-4o for this experiment:
client = ai.Client()
os.environ['OPENAI_API_KEY']='<your_api_key>'
models = ['openai:gpt-4o']
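Since aisuite identifies models with provider:model strings, the list can hold entries from several providers if you want to compare them later, each with its own API key environment variable. For example, something like this should work, assuming you have Anthropic access (the second model name is just illustrative):
os.environ['ANTHROPIC_API_KEY'] = '<your_anthropic_api_key>'  # only needed for Anthropic models
models = ['openai:gpt-4o', 'anthropic:claude-3-5-sonnet-20240620']
The rest of this post only uses models[0], so everything below still runs against GPT-4o.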
Getting the database schema
For the LLM to write good SQL, it needs to know what tables and columns are available. Let’s write a function to introspect the schema:
def get_schema():
    # Pull table, column and comment metadata from the Oracle data dictionary
    stmt = '''
        SELECT
            utc.table_name,
            utc.column_name,
            utc.data_type,
            utc.data_length,
            utc.nullable,
            utc.column_id,
            ucc.comments AS column_comment,
            utab.comments AS table_comment
        FROM
            user_tab_columns utc
        LEFT JOIN
            user_col_comments ucc
            ON utc.table_name = ucc.table_name
            AND utc.column_name = ucc.column_name
        LEFT JOIN
            user_tab_comments utab
            ON utc.table_name = utab.table_name
        ORDER BY
            utc.table_name,
            utc.column_id
    '''
    cursor = connection.cursor()
    cursor.execute(stmt)
    rows = cursor.fetchall()
    # Convert to one long string
    result_string = '\n'.join([str(row) for row in rows])
    cursor.close()
    return result_string
This function queries the Oracle data dictionary to get information about all tables and columns, including any comments. It returns everything as a single string that we can pass to the LLM.
Generating SQL from natural language
Now let’s write the function that takes a natural language question and generates SQL:
def generate_sql(question: str, schema: str, model: str):
    prompt = f'''
    You are an SQL assistant for Oracle Database.
    You create Oracle SQL statements to help answer user questions.
    Given the user's question and the schema information, write an SQL
    query to answer the question.
    Schema:
    {schema}
    User question:
    {question}
    Respond with the SQL only. Do not add any extra characters or delimiters.
    '''
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
This function takes the question, the schema information, and the model to use. It constructs a prompt that gives the LLM the context it needs and asks for just the SQL query.
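As a quick standalone check before wiring everything together, you can call it directly; something like this should print a first-cut query for the question we’ll use later:
schema = get_schema()
print(generate_sql('who are our top customers?', schema, models[0]))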
Executing SQL queries
We need a function to actually run the generated SQL:
def execute_sql(stmt):
    cursor = connection.cursor()
    cursor.execute(stmt)
    rows = cursor.fetchall()
    # Convert to one long string
    result_string = '\n'.join([str(row) for row in rows])
    cursor.close()
    return result_string
This executes the query and returns the results as a string.
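One caveat: even with the “SQL only” instruction in the prompt, a model will occasionally wrap the query in markdown fences or add a trailing semicolon, and cursor.execute() will then fail with an Oracle error. If that happens, a small defensive clean-up helper (my own addition, not needed when the model behaves) is easy to slot in before execute_sql():
def clean_sql(stmt: str) -> str:
    # Strip markdown code fences the model may add despite the prompt instructions
    stmt = stmt.strip()
    if stmt.startswith('```'):
        stmt = stmt.removeprefix('```sql').removeprefix('```').removesuffix('```')
    # Oracle rejects a trailing semicolon on statements run through cursor.execute()
    return stmt.strip().rstrip(';').strip()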
The reflection step
Here’s where it gets interesting – the function that reviews the SQL and results, and potentially generates improved SQL:
import json

def refine_sql(question, sql_query, output, schema, model):
    prompt = f'''
    You are a SQL reviewer and refiner.
    User asked:
    {question}
    Original SQL:
    {sql_query}
    SQL Output:
    {output}
    Schema:
    {schema}
    Step 1: Evaluate if the SQL OUTPUT fully answers the user's question.
    Step 2: If improvement is needed, provide a refined SQL query for Oracle.
    If the original SQL is already correct, return it unchanged.
    Return a strict JSON object with two fields:
    - "feedback": brief evaluation and suggestions
    - "refined_sql": the final SQL to run
    Return ONLY the actual JSON document.
    Do NOT add any extra characters or delimiters outside of the actual JSON itself.
    In particular do NOT include backticks before and after the JSON document.
    '''
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    content = response.choices[0].message.content
    try:
        obj = json.loads(content)
        feedback = str(obj.get("feedback", "")).strip()
        refined_sql = str(obj.get("refined_sql", sql_query)).strip()
        if not refined_sql:
            refined_sql = sql_query
    except Exception:
        # Fallback if model doesn't return valid JSON
        feedback = content.strip()
        refined_sql = sql_query
    return feedback, refined_sql
This is the heart of the reflection pattern. The function:
- Shows the LLM the original question, the generated SQL, and the actual results
- Asks it to evaluate whether the SQL output really answers the question
- If not, asks for an improved query
- Returns both the feedback and the refined SQL as JSON
The JSON format makes it easy to parse the response and extract both pieces of information. I had to be fairly pedantic to get gpt-4o to give me just JSON!
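For reference, a well-behaved response looks something like this (illustrative only, paraphrasing the actual run shown later):
{
  "feedback": "The query ranks customers by income, but 'top customers' more likely means highest total spend; join to the sales data and aggregate instead.",
  "refined_sql": "SELECT c.cust_id, c.first_name, c.last_name, c.email, SUM(s.actual_price) AS total_sales FROM customer c JOIN custsales s ON c.cust_id = s.cust_id GROUP BY c.cust_id, c.first_name, c.last_name, c.email ORDER BY total_sales DESC FETCH FIRST 10 ROWS ONLY"
}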
Providing a final answer
Finally, let’s write a function to convert the query results into a natural language answer:
def provide_final_answer(question, output, model):
    prompt = f'''
    You are a helpful assistant.
    Given a user's question, and the results of a database query
    which has been created, evaluated, improved and executed already
    in order to get the provided output, you should provide an
    answer to the user's question.
    User question:
    {question}
    Query results:
    {output}
    '''
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
This takes the final query results and turns them into a friendly, natural language response for the user.
Putting it all together
Now let’s create the main function that orchestrates the entire process:
def generate_and_reflect(question: str):
    schema = get_schema()
    print('SCHEMA')
    print(schema)
    print()

    sql_v1 = generate_sql(question, schema, models[0])
    print("SQL V1")
    print(sql_v1)
    print()

    output_v1 = execute_sql(sql_v1)
    print("SQL V1 output")
    print(output_v1)
    print()

    feedback, sql_v2 = refine_sql(question, sql_v1, output_v1, schema, models[0])
    print("FEEDBACK")
    print(feedback)
    print()

    print("SQL V2")
    print(sql_v2)
    print()

    output_v2 = execute_sql(sql_v2)
    print("SQL V2 output")
    print(output_v2)
    print()

    final_answer = provide_final_answer(question, output_v2, models[0])
    print("FINAL ANSWER")
    print(final_answer)
    print()
This function:
- Gets the database schema
- Generates the first SQL query
- Executes it and prints the results
- Sends everything to the reflection function for evaluation
- Generates and executes the refined SQL
- Converts the final results into a natural language answer
Running the experiment
Let’s try it out with a question that could be interpreted multiple ways:
generate_and_reflect('who are our top customers?')
The results
Here’s what happened when I ran this:
First attempt (SQL V1):
SELECT CUST_ID, FIRST_NAME, LAST_NAME, EMAIL, INCOME
FROM CUSTOMER
ORDER BY INCOME DESC
FETCH FIRST 10 ROWS ONLY;
The LLM interpreted “top customers” as customers with the highest income. It returned folks with incomes around $187,000:
(1138797, 'Haruru', 'Takahashi', 'haruru.takahashi6@oraclemail.com', 187168.8)
(1007335, 'Eddie', 'Crawford', 'eddie.crawford@oraclemail.com', 187145.4)
(1404002, 'Yuuto', 'Arai', 'yuuto.arai3@oraclemail.com', 187136.04)
...
Reflection:
The original SQL query retrieves the top 10 customers based on income, which may not
fully answer the question of 'top customers' as it could be interpreted in terms of
sales or transactions. To better answer the question, we should consider the total
sales or transactions made by each customer.
Great! The LLM recognized that “top customers” probably means customers who spend the most, not customers who earn the most.
Second attempt (SQL V2):
SELECT C.CUST_ID, C.FIRST_NAME, C.LAST_NAME, C.EMAIL, SUM(S.ACTUAL_PRICE) AS TOTAL_SALES
FROM CUSTOMER C
JOIN CUSTSALES S ON C.CUST_ID = S.CUST_ID
GROUP BY C.CUST_ID, C.FIRST_NAME, C.LAST_NAME, C.EMAIL
ORDER BY TOTAL_SALES DESC
FETCH FIRST 10 ROWS ONLY;
Much better! Now it’s joining with the sales data and calculating total spending per customer.
Final results:
(1234517, 'Tsubasa', 'Nakajima', 'tsubasa.nakajima2@oraclemail.com', 2356.049999999997)
(1280887, 'Steffi', 'Bielvenstram', 'steffi.bielvenstram@oraclemail.com', 2334.7299999999996)
(1017254, 'Guadalupe', 'Zamora', 'guadalupe.zamora@oraclemail.com', 2329.7599999999998)
...
The top customer is Tsubasa Nakajima with $2,356.05 in total sales, followed by Steffi Bielvenstram with $2,334.73, and so on. These are very different customers from the high-income list we got in the first attempt!
Natural language answer:
Our top customers, based on the provided data, are:
1. Tsubasa Nakajima - Email: tsubasa.nakajima2@oraclemail.com, Total: $2356.05
2. Steffi Bielvenstram - Email: steffi.bielvenstram@oraclemail.com, Total: $2334.73
3. Guadalupe Zamora - Email: guadalupe.zamora@oraclemail.com, Total: $2329.76
...
These customers have the highest total amounts associated with them.
What I learned
This reflection approach really does help. The LLM is pretty good at recognizing when its initial SQL doesn’t quite match the intent of the question – especially when it can see the actual results.
The pattern of generate → execute → reflect → regenerate is more expensive (two LLM calls instead of one for generation, plus one more for the final answer), but the quality improvement is noticeable. For production use, you might want to:
- Cache schema information instead of fetching it every time (see the sketch just after this list)
- Add more sophisticated error handling for SQL errors
- Consider running both queries in parallel and comparing results
- Track which types of questions benefit most from reflection
- Use the reflection feedback to build a dataset for fine-tuning
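For the caching point, here’s a minimal sketch using functools.lru_cache, assuming the schema doesn’t change during a session:
from functools import lru_cache

@lru_cache(maxsize=1)
def get_schema_cached():
    # Only the first call hits the data dictionary; later calls reuse the cached string
    return get_schema()
generate_and_reflect() could then call get_schema_cached() instead of get_schema().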
The approach is straightforward to implement, and the results speak for themselves: the reflection step caught a subtle but important misinterpretation that would have produced a technically correct but unhelpful answer.
Give it a try with your own database and questions. I suspect you’ll find it catches plenty of these valid-but-wrong results too.
What next? I am going to experiment with some more complex questions, and then compare the performance of a number of different LLMs to see how they go with and without reflection. Stay tuned 🙂