Natural Language Queries for any Database Table with Zero-Shot RoBERTa-based SQL Query Generation
I recently came across an interesting paper entitled “Data Agnostic RoBERTa-based Natural Language to SQL Query Generation” by Debaditya Pal, Harsh Sharma and Kaustubh Chaudhuri, which promises SQL generation for any tabular dataset. In fact, the dataset itself is not even needed by the algorithm, which only takes as input a list of table fields and their data types (see Figure 2). I thought this would be incredible if it actually worked, so I decided to have a play with the provided Google Colab notebook. Within 5 minutes I was able to generate valid SQL just by asking questions, and I was quite impressed.
In this blog post, I describe the inner workings of the algorithm, discuss both the advantages and shortcomings of the model, and detail how I’ve gone about integrating the algorithm into a Colab Notebook that allows a tabular data CSV to be uploaded and executes the generated SQL against an in-memory SQLite database. Essentially, this blog post contains all the ingredients you need to implement a Natural Language to SQL generator in your own app.
How does the Neural Architecture work?
The Neural Architecture combines the RoBERTa pre-trained transformer embedding with a series of bi-directional LSTM models. Two knowledge vectors, called the Question Mark and Header Mark vectors, are created, concatenated and passed as the input to the first bi-directional LSTM. The knowledge vectors encode which field names (columns) and which words in the natural language question may be important to the model. The Question Mark vector is true for each word in the natural language question that matches a table column name. The Header Mark vector is true for each table header that appears in the natural language question, allowing for partial and fuzzy matches as well.
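To make this concrete, here is a minimal sketch of how the two mark vectors could be computed. This is my simplified reading of the paper: the whitespace tokenisation, the fuzzy-matching rule and its 0.8 threshold are all assumptions on my part, not the authors’ exact procedure.

```python
from difflib import SequenceMatcher

def fuzzy_match(a: str, b: str, threshold: float = 0.8) -> bool:
    # The threshold is my assumption; the paper's matching rule isn't spelled out here.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def mark_vectors(question_tokens, headers):
    """Build the two knowledge vectors (simplified sketch)."""
    # Question Mark: one flag per question word, set when it matches a column name.
    question_mark = [any(tok.lower() == h.lower() for h in headers)
                     for tok in question_tokens]
    # Header Mark: one flag per column, set when the column name appears among
    # the question words, allowing partial/fuzzy matches.
    header_mark = [any(fuzzy_match(h, tok) for tok in question_tokens)
                   for h in headers]
    return question_mark, header_mark

question = "What is the average score in Quiz 1".split()
print(mark_vectors(question, ["Student", "Quiz 1", "Score"]))
```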
Each of the bi-directional LSTM models takes the RoBERTa pre-trained transformer embedding as input. The bi-directional LSTMs then predict each clause of the SQL statement (SELECT column, SELECT aggregate function, WHERE column and criteria, and WHERE operation). The model does not predict the full SQL statement; rather, each clause is predicted separately and has its own cost function.
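The sketch below illustrates this clause-per-head idea in PyTorch. It is a simplification of the paper’s architecture, not a reproduction of it: the hidden sizes, output sizes and the use of the final LSTM state are placeholder choices of mine.

```python
import torch.nn as nn

class ClauseHead(nn.Module):
    """Bi-LSTM head that scores the candidates for one SQL clause."""
    def __init__(self, emb_dim, hidden, n_out):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, n_out)

    def forward(self, x):
        out, _ = self.lstm(x)        # (batch, seq_len, 2 * hidden)
        return self.proj(out[:, -1]) # score clause choices from the final state

class NL2SQLSketch(nn.Module):
    def __init__(self, emb_dim=768, hidden=128, n_cols=10, n_aggs=6, n_ops=4):
        super().__init__()
        self.heads = nn.ModuleDict({
            "select_col": ClauseHead(emb_dim, hidden, n_cols),  # SELECT column
            "select_agg": ClauseHead(emb_dim, hidden, n_aggs),  # aggregate fn
            "where_col":  ClauseHead(emb_dim, hidden, n_cols),  # WHERE column
            "where_op":   ClauseHead(emb_dim, hidden, n_ops),   # WHERE operation
        })

    def forward(self, roberta_emb):
        # roberta_emb: (batch, seq_len, emb_dim) token embeddings from RoBERTa.
        # Each clause is predicted separately; in training, each head gets its
        # own cross-entropy loss and the losses are combined.
        return {name: head(roberta_emb) for name, head in self.heads.items()}
```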
A Quick Evaluation with Sample English Queries
The use of the RoBERTa embedding makes the approach robust to a variety of phrasings. For example, “How many students received a ‘Quiz 1’ score greater than 50 and less than 70?” returns a count (i.e., ‘SELECT count()’), whereas “Find students that received a ‘Quiz 1’ score greater than 50 and less than 70?” returns the matching records. “What is the mean value in ‘Quiz 1’?” and “What is the average value in ‘Quiz 1’?” both return ‘SELECT avg()’.
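For illustration, here is roughly how those clause predictions assemble into complete statements. The table name (“data”), the column quoting and the selected column in the second example are my assumptions; the notebook’s exact output formatting may differ.

```python
# Illustrative question -> generated SQL pairs, assembled from the clause
# predictions described above (table name "data" is a placeholder).
examples = [
    ("How many students received a 'Quiz 1' score greater than 50 and less than 70?",
     "SELECT count(`Quiz 1`) FROM data WHERE `Quiz 1` > 50 AND `Quiz 1` < 70"),
    ("Find students that received a 'Quiz 1' score greater than 50 and less than 70?",
     "SELECT `Quiz 1` FROM data WHERE `Quiz 1` > 50 AND `Quiz 1` < 70"),
    ("What is the average value in 'Quiz 1'?",
     "SELECT avg(`Quiz 1`) FROM data"),
]
```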
Shortcomings of the Approach
Some shortcomings can be addressed via user interface design. For example, only the field being queried is returned in the ‘SELECT’ clause, and no ‘ORDER BY’ clause is included. It would make sense to include UI options that allow a user to add more fields and select the field to order by. In the practical implementation (detailed below) I have used the Google Colab DataTable extension, which allows column re-ordering, and I post-process the returned SQL to add all the columns (i.e., SELECT *). Conversational bot UIs offer additional solutions, as they provide the ability to ask follow-up questions such as “Which fields would you like displayed?” and “How should the data be ordered?”.
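As a sketch of that post-processing step, the following helper (my own code, not from the paper) rewrites the single-column SELECT into SELECT * while leaving aggregate queries untouched:

```python
import re

def select_all_columns(sql: str) -> str:
    """Rewrite the model's single-column SELECT into SELECT * (minimal sketch)."""
    # Aggregate queries (count/avg/sum/min/max) keep their column.
    if re.match(r"(?i)^\s*select\s+(count|avg|sum|min|max)\s*\(", sql):
        return sql
    return re.sub(r"(?i)^\s*select\s+.*?\s+from\s+", "SELECT * FROM ", sql, count=1)

print(select_all_columns("SELECT `Quiz 1` FROM data WHERE `Quiz 1` > 50"))
# SELECT * FROM data WHERE `Quiz 1` > 50
```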
Multiple tables and ‘GROUP BY’ are not yet supported, but I believe the model can be adapted and trained to support these in the future.
The biggest shortcomings I have found stem from the fact that no data is included in training. For example, given an English question such as “How many female students?” on a table that has a “Gender” column, the model does not know that a WHERE clause is needed (e.g., WHERE Gender = ‘female’), because it has never seen the values the column contains. I think it would be important to include a fine-tuning step to adapt the model to domain-specific datasets. Fine-tuning would be conducted by the implementing developer or data scientist and would therefore preserve the objective of complete data privacy.
A Practical Implementation
I have released a Google Colab Notebook with all the code needed to implement the technique in Python. The Notebook includes code to upload a CSV file, infer the field types (via the TableSchema library), load the CSV into an in-memory SQLite database, load the pre-trained NL to SQL converter, execute the generated SQL against the SQLite database and display the returned data. Some post-processing of the generated SQL is also included, to support queries with ‘distinct’ and to return all the fields in the dataset (i.e., SELECT *).
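The core of the Notebook boils down to the following sketch. The CSV file name and the question are placeholders, `generate_sql` is a stand-in stub for the authors’ pre-trained converter, and the `Table(...).infer()` call reflects my reading of the tableschema API, which may vary between library versions.

```python
import sqlite3
import pandas as pd
from tableschema import Table  # the notebook uses TableSchema for type inference

CSV_PATH = "students.csv"  # placeholder; the notebook uploads via google.colab.files

def generate_sql(question, headers):
    # Stand-in for the pre-trained NL-to-SQL converter; returns a fixed query
    # here so the sketch runs end to end.
    return "SELECT avg(`Quiz 1`) FROM data"

# 1. Infer field names and types from the CSV; these (not the data itself)
#    are all the model consumes.
descriptor = Table(CSV_PATH).infer()
headers = [(f["name"], f["type"]) for f in descriptor["fields"]]

# 2. Load the CSV into an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
pd.read_csv(CSV_PATH).to_sql("data", conn, index=False)

# 3. Generate SQL from a natural-language question, then execute and display it.
sql = generate_sql("What is the average value in 'Quiz 1'?", headers)
print(pd.read_sql_query(sql, conn))
```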
Future Directions
As the output of the algorithm is a data structure describing each query clause, this internal representation could easily be converted to the syntax of numerous ORMs, or to Pandas. Look out for a Streamlit version and an NL to Pandas conversion tool on this blog soon!
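For example, a clause structure along the following lines (the exact field names the model emits are an assumption on my part) maps almost one-to-one onto Pandas operations:

```python
import pandas as pd

# Hypothetical clause structure; the real model's field names may differ.
clauses = {"select": "Quiz 1", "agg": "avg",
           "conds": [("Quiz 1", ">", 50), ("Quiz 1", "<", 70)]}

AGG = {"avg": "mean", "count": "count", "sum": "sum", "min": "min", "max": "max"}

def to_pandas(df, q):
    """Translate the clause structure into Pandas (only > and < handled here)."""
    mask = pd.Series(True, index=df.index)
    for col, op, val in q["conds"]:
        mask &= (df[col] > val) if op == ">" else (df[col] < val)
    series = df.loc[mask, q["select"]]
    return getattr(series, AGG[q["agg"]])() if q["agg"] else series

df = pd.DataFrame({"Quiz 1": [45, 55, 65, 75]})
print(to_pandas(df, clauses))  # mean of 55 and 65 -> 60.0
```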
Overall, I’m really impressed with the potential of transformer-based NLP models and how they can be used in a plug-and-play manner to create new intelligent tools. I hope to explore many more practical ways to incorporate recent advances in NLP on this blog over the next few months.
Enjoy, and please leave a comment below with details of your use of Natural Language to SQL generation.