When evaluating the accuracy of NL2SQL methods, especially deep learning techniques, it is necessary to use a variety of NL2SQL benchmarks; each benchmark is relatively small and addresses a limited-scope problem under several explicit or implicit assumptions about the natural language questions, the SQL queries, or the databases. Due to these assumptions, each benchmark evaluates NL2SQL methods only partially, and an NL2SQL method is often optimized for a particular benchmark without considering the general problem. Therefore, we evaluate NL2SQL methods using 13 NL2SQL benchmarks, including our newly released benchmark, WTQ (Table 1).

Table 1: The statistics for the NL2SQL benchmarks.

| Benchmark | Total queries | Training queries | Validation queries | Test queries | Tables | Rows | Size (MB) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| WikiSQL | 80654 | 56355 | 8421 | 15878 | 26531 | 459K | 420 |
| ATIS | 5317 | 4379 | 491 | 447 | 25 | 162K | 39.2 |
| Advising (query split) | 4387 | 2040 | 515 | 1832 | 15 | 332K | 43.8 |
| Advising (question split) | 4387 | 3585 | 229 | 573 | 15 | 332K | 43.8 |
| GeoQuery | 880 | 550 | 50 | 280 | 7 | 937 | 0.14 |
| Scholar | 816 | 498 | 100 | 218 | 10 | 144M | 8776 |
| Patients | 342 | 214 | 19 | 109 | 1 | 100 | 0.016 |
| Restaurant | 251 | 157 | 14 | 80 | 3 | 18.7K | 3.05 |
| MAS | 196 | 123 | 11 | 62 | 17 | 54.3M | 4270 |
| IMDB | 131 | 82 | 7 | 42 | 16 | 39.7M | 1812 |
| YELP | 128 | 80 | 7 | 41 | 7 | 4.48M | 2232 |
| Spider | 9693 | 8659 | 1034 | - | 873 | 1.57M | 184 |
| WTQ (Ours) | 9287 | 5804 | 528 | 2955 | 2102 | 58.0K | 35.6 |


WTQ

The WTQ benchmark consists of 9,287 questions randomly sampled from the 22,033 questions in WikiTableQuestions. Since WikiTableQuestions is a question answering dataset that contains only natural language questions without gold SQL queries, we collected the gold SQL queries through crowdsourcing.

We chose WikiTableQuestions because it has a salient feature that the existing NL2SQL benchmarks lack: complex questions in diverse domains. It contains 2,108 web tables covering various domains in Wikipedia, and its questions correspond to complex SQL queries involving ORDER BY, GROUP BY, and nested subqueries, as the sketch below illustrates.
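
To make this concrete, the following hypothetical question-query pairs are written in the style of WTQ; the table name, columns, and questions are our own illustrations, not entries from the dataset.

```sql
-- Hypothetical WTQ-style examples (illustrative only; not taken from
-- the dataset). Assumes a web table medal_table(nation, gold).

-- "Which nation won the most gold medals?"
-- Requires GROUP BY, ORDER BY, and LIMIT.
SELECT nation
FROM medal_table
GROUP BY nation
ORDER BY SUM(gold) DESC
LIMIT 1;

-- "How many nations won more gold medals than the average?"
-- Requires a nested subquery with aggregation.
SELECT COUNT(*)
FROM medal_table
WHERE gold > (SELECT AVG(gold) FROM medal_table);
```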

We have released WTQ’s training and validation datasets, and we operate a challenge for developing NL2SQL methods on the WTQ dataset.

Evaluation methodology

One important issue in NL2SQL is how to measure the translation accuracy, defined as the number of correctly translated SQL queries over the total number of test queries. Correctness can be judged by comparing each translated SQL query with the gold SQL query in the test dataset. However, existing accuracy measures define correctness in misleading ways: string matching and parse tree matching compare only the syntax; result matching overestimates the accuracy, since different queries can produce the same results by chance; and manual matching requires considerable effort and cannot guarantee reliability.
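
As a concrete illustration of the result-matching pitfall, consider a hypothetical table employee(name, dept, salary) whose current instance happens to contain salaries above 50000 only in the sales department. The two queries below then return identical results on that instance although they are not equivalent in general:

```sql
-- Hypothetical example: on an instance where exactly the employees
-- in 'sales' earn over 50000, both queries return the same rows,
-- yet they clearly differ in meaning.
SELECT name FROM employee WHERE salary > 50000;   -- gold query
SELECT name FROM employee WHERE dept = 'sales';   -- translated query
```

Result matching would accept the translated query on this test database, even though it fails on any instance where a non-sales employee earns more than 50000.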

To overcome these limitations, we judge the correctness by semantic equivalence. That is, two SQL queries are regarded as the same if and only if they have the same meaning, always generating the same results on any database instance.
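
For example, under this criterion the two hypothetical queries below are judged equivalent even though string matching and parse tree matching would reject the pair: an IN predicate and the corresponding correlated EXISTS predicate select exactly the same rows on every database instance.

```sql
-- Hypothetical schema: employee(name, dept), department(dept, city).
-- Both queries list employees whose department is located in 'Chicago'.
SELECT name
FROM employee
WHERE dept IN (SELECT dept FROM department WHERE city = 'Chicago');

SELECT name
FROM employee e
WHERE EXISTS (SELECT 1
              FROM department d
              WHERE d.dept = e.dept AND d.city = 'Chicago');
```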

Validation tool based on semantic equivalence

We release a new, automatic validation tool based on semantic equivalence on GitHub. Please refer to our paper for a detailed explanation.