When evaluating the accuracy of NL2SQL methods, especially deep learning-based ones, it is necessary to use various NL2SQL benchmarks. Each benchmark is relatively small and addresses a limited-scope problem under several explicit or implicit assumptions about the natural language questions, SQL queries, or databases. Because of these assumptions, each benchmark evaluates NL2SQL methods only partially, and a method is often optimized for a particular benchmark without considering the general problem. We therefore evaluate NL2SQL methods on 13 NL2SQL benchmarks, including our newly released one, WTQ (Table 1).
Table 1: Statistics of the NL2SQL benchmarks.
Benchmark | Total queries | Training queries | Validation queries | Test queries | Tables | Rows | Size (MB) |
---|---|---|---|---|---|---|---|
WikiSQL | 80,654 | 56,355 | 8,421 | 15,878 | 26,531 | 459K | 420 |
ATIS | 5,317 | 4,379 | 491 | 447 | 25 | 162K | 39.2 |
Advising (query split) | 4,387 | 2,040 | 515 | 1,832 | 15 | 332K | 43.8 |
Advising (question split) | 4,387 | 3,585 | 229 | 573 | 15 | 332K | 43.8 |
GeoQuery | 880 | 550 | 50 | 280 | 7 | 937 | 0.14 |
Scholar | 816 | 498 | 100 | 218 | 10 | 144M | 8,776 |
Patients | 342 | 214 | 19 | 109 | 1 | 100 | 0.016 |
Restaurant | 251 | 157 | 14 | 80 | 3 | 18.7K | 3.05 |
MAS | 196 | 123 | 11 | 62 | 17 | 54.3M | 4,270 |
IMDB | 131 | 82 | 7 | 42 | 16 | 39.7M | 1,812 |
YELP | 128 | 80 | 7 | 41 | 7 | 4.48M | 2,232 |
Spider | 9,693 | 8,659 | 1,034 | - | 873 | 1.57M | 184 |
WTQ (Ours) | 9,287 | 5,804 | 528 | 2,955 | 2,102 | 58.0K | 35.6 |
WTQ
The WTQ benchmark consists of 9,287 questions randomly sampled from the 22,033 questions in WikiTableQuestions. Since WikiTableQuestions is a question answering dataset that contains only natural language questions without gold SQL queries, we collected gold SQL queries through crowdsourcing.
We chose WikiTableQuestions because it has a salient feature that the existing NL2SQL benchmarks lack: complex questions over tables from many different domains. It contains 2,108 web tables drawn from various domains in Wikipedia, and its questions translate into complex queries involving ORDER BY, GROUP BY, and nesting, as illustrated below.
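As a hedged illustration of this complexity (the question, table, and column names below are our own and are not taken from the dataset), a WTQ-style question such as "Which nation won the second-highest number of gold medals?" requires grouping, ordering, and a nested query:

```sql
-- Hypothetical WTQ-style question and gold query (illustrative schema):
-- "Which nation won the second-highest number of gold medals?"
SELECT nation
FROM medals
GROUP BY nation
HAVING SUM(gold) < (SELECT MAX(total_gold)
                    FROM (SELECT SUM(gold) AS total_gold
                          FROM medals
                          GROUP BY nation) AS t)
ORDER BY SUM(gold) DESC
LIMIT 1;
```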
We have released WTQ's training and validation datasets, and operate a challenge for developing NL2SQL methods on WTQ.
Evaluation methodology
One important issue in NL2SQL is how to measure translation accuracy, i.e., the number of correctly translated SQL queries divided by the total number of test queries. Correctness can be judged by comparing each translated SQL query with the gold SQL query in the test dataset. However, existing accuracy measures define correctness in misleading ways: string matching and parse tree matching compare only the syntax; result matching over-estimates accuracy, since two different queries can produce the same results by chance (see the example below); and manual matching requires considerable effort and cannot guarantee reliability.
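For instance, consider the following pair of queries over a hypothetical players(id, name, age, team) table (an illustration of ours, not from any benchmark). The two queries are not equivalent in general, but result matching accepts the translation on any instance where the row with the largest id happens to hold the minimum age:

```sql
-- Gold query: the minimum age over all players.
SELECT MIN(age) FROM players;

-- Translated query: the age of the most recently inserted row.
-- On an instance where the newest row happens to hold the youngest
-- player, both queries return the same value, so result matching
-- wrongly counts this translation as correct.
SELECT age FROM players ORDER BY id DESC LIMIT 1;
```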
To overcome these limitations, we judge correctness by semantic equivalence. That is, two SQL queries are regarded as the same if and only if they have the same meaning, i.e., they always generate the same results on any database instance.
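For example, the following two queries over the same hypothetical players table differ both as strings and as parse trees, so string matching and parse tree matching reject the translation. Semantic equivalence correctly accepts it, since the queries return the same result on every database instance:

```sql
-- Gold query.
SELECT name FROM players WHERE age >= 30 AND team = 'A';

-- Translated query: reordered predicates and a rewritten comparison.
-- Under SQL's three-valued logic, rows where age or team is NULL are
-- excluded by both queries, so the two are equivalent on every instance.
SELECT name FROM players WHERE team = 'A' AND NOT (age < 30);
```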
Validation tool based on semantic equivalence
We release a new, automatic validation tool based on semantic equivalence on GitHub. Please refer to our paper for a detailed explanation.