When evaluating the accuracy of NL2SQL methods, especially deep learning techniques, it is necessary to use a variety of NL2SQL benchmarks; each benchmark is relatively small and addresses a limited-scope problem under several explicit or implicit assumptions about the natural language questions, the SQL queries, or the databases. Due to these assumptions, each benchmark evaluates NL2SQL methods only partially, and an NL2SQL method is often optimized for a particular benchmark without considering the general problem. Therefore, we evaluate NL2SQL methods using 13 NL2SQL benchmarks, including our newly released benchmark, WTQ (Table 1).

Table 1: The statistics for the NL2SQL benchmarks.

| Benchmark | Total queries | Training queries | Validation queries | Test queries | Tables | Rows | Size (MB) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| WikiSQL | 80654 | 56355 | 8421 | 15878 | 26531 | 459K | 420 |
| ATIS | 5317 | 4379 | 491 | 447 | 25 | 162K | 39.2 |
| Advising (query split) | 4387 | 2040 | 515 | 1832 | 15 | 332K | 43.8 |
| Advising (question split) | 4387 | 3585 | 229 | 573 | 15 | 332K | 43.8 |
| GeoQuery | 880 | 550 | 50 | 280 | 7 | 937 | 0.14 |
| Scholar | 816 | 498 | 100 | 218 | 10 | 144M | 8776 |
| Patients | 342 | 214 | 19 | 109 | 1 | 100 | 0.016 |
| Restaurant | 251 | 157 | 14 | 80 | 3 | 18.7K | 3.05 |
| MAS | 196 | 123 | 11 | 62 | 17 | 54.3M | 4270 |
| IMDB | 131 | 82 | 7 | 42 | 16 | 39.7M | 1812 |
| YELP | 128 | 80 | 7 | 41 | 7 | 4.48M | 2232 |
| Spider | 9693 | 8659 | 1034 | - | 873 | 1.57M | 184 |
| WTQ (Ours) | 9287 | 5804 | 528 | 2955 | 2102 | 58.0K | 35.6 |


WTQ

The WTQ benchmark consists of 9,287 questions randomly sampled from the 22,033 questions in WikiTableQuestions. Since WikiTableQuestions is a question answering dataset that contains only natural language questions without gold SQL queries, we collected the gold SQL queries through crowdsourcing.

We chose WikiTableQuestions because it has a salient feature that the existing NL2SQL benchmarks lack: complex questions in diverse domains. It contains 2,108 web tables covering various domains in Wikipedia, and its questions correspond to complex SQL queries involving ORDER BY, GROUP BY, and nested subqueries, as the sketch below illustrates.
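
To make this concrete, the following hypothetical question-query pairs are written in the style of WTQ; the table name, columns, and questions are our own illustrations, not entries from the dataset.

```sql
-- Hypothetical WTQ-style examples (illustrative only; not taken from
-- the dataset). Assumes a web table medal_table(nation, gold).

-- "Which nation won the most gold medals?"
-- Requires GROUP BY, ORDER BY, and LIMIT.
SELECT nation
FROM medal_table
GROUP BY nation
ORDER BY SUM(gold) DESC
LIMIT 1;

-- "How many nations won more gold medals than the average?"
-- Requires a nested subquery with aggregation.
SELECT COUNT(*)
FROM medal_table
WHERE gold > (SELECT AVG(gold) FROM medal_table);
```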

We have released WTQ’s training and validation datasets, and we operate a challenge for developing NL2SQL methods on the WTQ dataset.

Evaluation methodology

One important issue in NL2SQL is how to measure the translation accuracy, defined as the number of correctly translated SQL queries over the total number of test queries. Correctness can be judged by comparing each translated SQL query with the gold SQL query in the test dataset. However, existing accuracy measures define correctness in misleading ways: string matching and parse tree matching compare only the syntax; result matching overestimates the accuracy, since different queries can produce the same results by chance; and manual matching requires considerable effort and cannot guarantee reliability.
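
As a concrete illustration of the result-matching pitfall, consider a hypothetical table employee(name, dept, salary) whose current instance happens to contain salaries above 50000 only in the sales department. The two queries below then return identical results on that instance although they are not equivalent in general:

```sql
-- Hypothetical example: on an instance where exactly the employees
-- in 'sales' earn over 50000, both queries return the same rows,
-- yet they clearly differ in meaning.
SELECT name FROM employee WHERE salary > 50000;   -- gold query
SELECT name FROM employee WHERE dept = 'sales';   -- translated query
```

Result matching would accept the translated query on this test database, even though it fails on any instance where a non-sales employee earns more than 50000.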

To overcome these limitations, we judge the correctness by semantic equivalence. That is, two SQL queries are regarded as the same if and only if they have the same meaning, always generating the same results on any database instance.
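
For example, under this criterion the two hypothetical queries below are judged equivalent even though string matching and parse tree matching would reject the pair: an IN predicate and the corresponding correlated EXISTS predicate select exactly the same rows on every database instance.

```sql
-- Hypothetical schema: employee(name, dept), department(dept, city).
-- Both queries list employees whose department is located in 'Chicago'.
SELECT name
FROM employee
WHERE dept IN (SELECT dept FROM department WHERE city = 'Chicago');

SELECT name
FROM employee e
WHERE EXISTS (SELECT 1
              FROM department d
              WHERE d.dept = e.dept AND d.city = 'Chicago');
```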

Validation tool based on semantic equivalence

We release a new, automatic validation tool based on semantic equivalence on GitHub. Please refer to our paper for a detailed explanation.