Semantic Parsing English to GraphQL | Andre Carerra | OpenAI Scholars Demo Day 2020 | OpenAI


Transcript

I'd like to talk about semantic parsing of English to GraphQL.

First of all, a little bit about GraphQL for those that don't have much background. GraphQL is basically a query language for your API. It can be compared to SQL, which is a query language for databases; GraphQL covers a whole API, which means it can cover a broad set of business logic, databases, and data types. One of the strengths of GraphQL is its ease of developer use: it lends itself well to nested relations, it provides a schema which serves as an API contract, and it can make the development experience for software engineers a little bit easier.

So here we can see a little bit of what GraphQL looks like. GraphQL lets you describe your data in a schema, and here you have a small example of that; it can show you the relationships between different types of data. With that schema we can send a query, and that query gives us a predictable result, exactly what we expect to receive from this interface.
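To make that concrete, here is a minimal sketch using the graphql-core Python library; the Song type, its fields, and the query are hypothetical stand-ins for the schema shown on the slide:

```python
from graphql import build_schema, parse, validate

# A small GraphQL schema describing songs (hypothetical example,
# loosely modeled on the music database used later in the talk).
schema = build_schema("""
    type Song {
        name: String
        release_date: String
    }

    type Query {
        songs(limit: Int, order_by: String): [Song]
    }
""")

# A query against that schema: because the schema is an API contract,
# the shape of the response is predictable from the query itself.
query = parse("""
    query {
        songs(limit: 1, order_by: "release_date") {
            name
            release_date
        }
    }
""")

# validate() returns a list of errors; an empty list means the query
# conforms to the schema.
errors = validate(schema, query)
print(errors or "query is valid against the schema")
```

Because the schema acts as a contract, a query can be checked against it before it ever reaches a server.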

Semantic parsing is the task of converting a natural language utterance to a logical form, a machine-understandable representation of its meaning. In this case we're converting English natural language to GraphQL, which is the logical form, the machine-understandable representation of its meaning.

Here we can see a little bit more of what this looks like. For example, if we have an English natural language prompt such as "What is the name and date of the song released most recently?" and some GraphQL schema that defines that data, we can find the corresponding GraphQL query, such as this one.

So why did I want to go through this project? There are a few reasons. The first is that I wanted to understand the limits of general language models for semantic parsing.

This task is a little bit similar to machine translation, where we take some input language and output a target language; in our case we're inputting English and outputting GraphQL.

Second, I wanted to see how it could potentially ease the learning curve for developers. If a developer has the ability to interact with a model that generates queries for them, so they can see how these queries are structured, it might make things easier for them. And finally, this could be potential tooling for non-technical data users. An example would be a manager that uses Salesforce: instead of reaching out to an engineer to generate a custom query for them, they could type the question out in English and get a response.

Previously there has been some work on semantic parsing over a broad range of languages and domains. For my specific use case I was really interested to see what the SQL datasets looked like. As you can see here, there are several SQL datasets with a broad range of domains and query complexities. Of these, the one that stood out the most, and I'll cover why in a second, is Spider. The problem, though, was that there are no GraphQL datasets, and I wanted to train a model to generate GraphQL, which would be very difficult without a GraphQL dataset.

So I looked into Spider specifically. Spider is semantic parsing of natural language text to SQL. The dataset has 10,000 questions and around 5,000 unique complex SQL queries covering 200 different databases with 138 different domains. Between the train and test (or validation) sets of the corpus, the queries and databases are different, so for a system to perform well it must generalize to new queries, new database schemas, and new questions, which is a difficult task. This specific task has been tackled for the last few years and has achieved pretty good results.

On this slide you can see the top five leaders in exact match accuracy: on the test set there's around 60 percent accuracy on this SQL Spider task. For my task I had to be able to create a GraphQL dataset, and the Spider dataset served as a great starting point.

First off was being able to convert SQL to GraphQL. I started out not even knowing whether this was possible, and it took a little bit of time to understand what tools I could use to accomplish this first task. Two very important tools that I used were Hasura and pgloader. Hasura generates a GraphQL schema based off of a database, and pgloader converted SQLite to Postgres, which allowed me to run Hasura on top of it.

Then the bulk of the work was converting SQL abstract syntax trees to GraphQL abstract syntax trees, and I'll cover what that means right now. Here's a simple example of a SQL abstract syntax tree converted to a GraphQL abstract syntax tree. On the left side we see SQL; the query here is SELECT COUNT(*) FROM songs. We take this query and parse it into a tree, and like I said before, this is a very simple example. That tree is then converted to a GraphQL tree, as you can see on the right side, and that tree is in turn converted to a raw GraphQL query.
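As a rough illustration of that tree-to-tree step, here is a minimal sketch that handles only this one COUNT case, with both ASTs represented as plain Python dicts and Hasura-style aggregate fields assumed for the output (the real conversion covered far more of SQL):

```python
# Simplified SQL AST for "SELECT COUNT(*) FROM songs"
# (a real SQL parser would produce a much richer tree).
sql_ast = {
    "select": [{"agg": "count", "arg": "*"}],
    "from": "songs",
}

def sql_ast_to_graphql_ast(ast):
    """Map a simple aggregate SELECT onto a Hasura-style GraphQL tree."""
    table = ast["from"]
    fields = []
    for item in ast["select"]:
        if item.get("agg") == "count":
            # Hasura exposes counts as <table>_aggregate { aggregate { count } }
            fields.append({
                "name": f"{table}_aggregate",
                "children": [{"name": "aggregate",
                              "children": [{"name": "count", "children": []}]}],
            })
    return {"name": "query", "children": fields}

def graphql_ast_to_string(node, indent=0):
    """Serialize the GraphQL tree back into a raw query string."""
    pad = "  " * indent
    if not node["children"]:
        return pad + node["name"]
    body = "\n".join(graphql_ast_to_string(c, indent + 1) for c in node["children"])
    return f"{pad}{node['name']} {{\n{body}\n{pad}}}"

print(graphql_ast_to_string(sql_ast_to_graphql_ast(sql_ast)))
# query {
#   songs_aggregate {
#     aggregate {
#       count
#     }
#   }
# }
```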

Part of generating this dataset required the use of validation scripts, and there were a few things I looked out for in them. When verifying the dataset, I covered the syntax, making sure that the queries that were formed were actual GraphQL queries. I also validated the syntax against the schema, making sure that the keywords used in the queries were valid for the schema we were looking at. And then I executed those queries against an endpoint to make sure that they were valid.
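Those three checks might look roughly like this; a minimal sketch with graphql-core and requests, where the endpoint URL and the schema file are hypothetical:

```python
import requests
from graphql import GraphQLSyntaxError, build_schema, parse, validate

def check_query(query_text, schema_sdl, endpoint_url):
    """Run the three validation passes described above."""
    # 1. Syntax: does the text parse as GraphQL at all?
    try:
        document = parse(query_text)
    except GraphQLSyntaxError as err:
        return f"syntax error: {err}"

    # 2. Schema: are all fields and keywords valid for this schema?
    schema = build_schema(schema_sdl)
    errors = validate(schema, document)
    if errors:
        return f"schema errors: {errors}"

    # 3. Execution: does a live endpoint (e.g. Hasura) accept it?
    resp = requests.post(endpoint_url, json={"query": query_text})
    body = resp.json()
    if "errors" in body:
        return f"execution errors: {body['errors']}"
    return "ok"

# Hypothetical usage against a local Hasura instance:
# print(check_query(query_text, open("music_1.graphql").read(),
#                   "http://localhost:8080/v1/graphql"))
```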

What this whole process resulted in is that around half of the queries were transferred. The big obstacle was that Hasura doesn't include a GROUP BY clause. GROUP BY is used very often in SQL, and Hasura didn't have a good way to transfer it over. This could have been done manually, but I wasn't able to do that within the time limits of the program. In the end, though, because of the validation scripts I was very confident in the dataset. Diving into the details, it comes to 160 schemas across 138 different domains, around 4,300 unique English prompts, and around 2,400 unique GraphQL queries.

After that, the next step was experimenting on that dataset and seeing what kind of results I could get. I experimented with a few different models; the ones that stood out the most were BART and T5, two models that researchers from different groups have come up with. Both of them are encoder-decoder transformer models, and they're very similar; where they vary is that BART uses a bidirectional encoder. Both lend themselves very well to translation tasks, so I thought they could lend themselves to my task as well.

As I mentioned before, the process looks a little bit like this. We input an English prompt, in this case "What is the name and date of the song released most recently?", concatenated with the GraphQL schema, so it looks a little like what you see on the left here. That is passed through to our model, T5, and our output, the target, is the GraphQL query we see on the right. This is done with an autoregressive objective.
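A minimal sketch of that setup with the Hugging Face transformers library, assuming a t5-base checkpoint; the exact prompt/schema concatenation format shown here is illustrative:

```python
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Input: English prompt concatenated with the GraphQL schema.
source = (
    "What is the name and date of the song released most recently? "
    "schema: type Song { name: String release_date: String } ..."
)
# Target: the corresponding GraphQL query.
target = 'query { songs(order_by: "release_date", limit: 1) { name release_date } }'

inputs = tokenizer(source, return_tensors="pt", truncation=True)
labels = tokenizer(target, return_tensors="pt", truncation=True).input_ids

# Passing labels gives the standard autoregressive (teacher-forced)
# cross-entropy loss used for fine-tuning.
loss = model(**inputs, labels=labels).loss
loss.backward()  # an optimizer step would follow in a real training loop

# At inference time, decoding is autoregressive as well:
generated = model.generate(**inputs, max_length=128)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```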

Part of this required the use of a validation metric to make sure that my results looked the way I expected them to; I wanted my outputs to look like my targets. The validation metric that I came up with was exact set matching accuracy. What this means is that since I could parse my GraphQL queries, both the target and the output, into abstract syntax trees, I could compare those two trees to see how accurate they were. In these abstract syntax trees the order of the child nodes doesn't matter, so two trees with a different order of child nodes can be equivalent. For example, in these two little trees here, the left one has its green child node on the left and the right one has its green child node on the right, yet these two are equivalent queries, and I wanted to make sure this validation metric could handle that. A good example is this query here: the song name and release date could be switched, but the query would still be equivalent.
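A minimal sketch of that order-insensitive comparison, reducing each AST node to its name plus an unordered set of children; this simplifies the real metric, which also has to account for arguments and other node types:

```python
from graphql import parse
from graphql.language.ast import FieldNode, OperationDefinitionNode

def canonical(node):
    """Reduce a GraphQL AST node to (name, frozenset of children),
    so that child order no longer matters."""
    if isinstance(node, FieldNode):
        name = node.name.value
    elif isinstance(node, OperationDefinitionNode):
        name = node.operation.value  # "query", "mutation", ...
    else:
        name = type(node).__name__
    children = node.selection_set.selections if node.selection_set else []
    return (name, frozenset(canonical(c) for c in children))

def exact_set_match(query_a, query_b):
    defs_a = parse(query_a).definitions
    defs_b = parse(query_b).definitions
    return {canonical(d) for d in defs_a} == {canonical(d) for d in defs_b}

# Same fields in a different order: counted as a match.
print(exact_set_match(
    "query { songs { name release_date } }",
    "query { songs { release_date name } }",
))  # True
```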

So what did the results look like? With these two models, specifically using T5, which performed the best, I got 46 to 50 percent exact set matching accuracy on the GraphQL validation set. This is in comparison to the 20 percent SQL exact set matching accuracy that I got with these same models, and it's an interesting result for a couple of reasons. The range of 46 to 50 percent is actually the same model: the 46 percent model was trained only on the GraphQL queries, while the 50 percent model was trained on the GraphQL queries and the SQL queries, and for some reason it was able to perform better. My guess is that this happened because the model learned which keywords were important across schemas. And as I mentioned before, the best existing Spider models got around 65 percent exact match accuracy, as we can see here.

So why did my model perform worse? Because the models on this leaderboard use architectures that are specific to SQL: these models can only produce SQL, and wouldn't be able to produce GraphQL as well. My model, by contrast, was able to produce both SQL and GraphQL, and that sets it up for future work, where we could find a way to increase the accuracy on both, and maybe even across other query languages as well.

As I mentioned before, since this dataset was based on the Spider dataset, for a model to perform well it must be able to generalize over new schemas, new questions, and new queries, and to me a 50 percent accuracy says that the model is able to generalize pretty well. What I failed to mention a little earlier is that this exact set matching accuracy is more of a lower bound, because there are multiple queries that could display the same information, but it's a little more difficult to parse those into comparable trees as well.

So what does this look like? It'll be helpful to see it in action. Here we have a database called music_1, which has information about different music genres and artists, and we can ask a question such as "What is the country of the artist named Enrique?" We generate a GraphQL query, as we can see below on the left side, then we send that query to a server and get a response. Here the response looks like this, and the answer to the question is that the country is the USA.
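That demo loop might be sketched like this, reusing a fine-tuned model; the checkpoint path and endpoint URL are hypothetical:

```python
import requests
from transformers import T5ForConditionalGeneration, T5TokenizerFast

# Hypothetical path to the fine-tuned English-to-GraphQL checkpoint.
tokenizer = T5TokenizerFast.from_pretrained("./t5-english-to-graphql")
model = T5ForConditionalGeneration.from_pretrained("./t5-english-to-graphql")

def ask(question, schema_sdl, endpoint_url):
    """English question -> generated GraphQL query -> live answer."""
    source = f"{question} schema: {schema_sdl}"
    inputs = tokenizer(source, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_length=128)
    query = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    response = requests.post(endpoint_url, json={"query": query})
    return query, response.json()

# Hypothetical usage against the music_1 database served by Hasura:
# query, answer = ask("What is the country of the artist named Enrique?",
#                     music_schema, "http://localhost:8080/v1/graphql")
```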

Here's another example, and this one shows that the model is able to generalize to a different database; remember, there are 160 different databases. Instead of selecting music_1, we're going to look at flight_2, which has information about different airlines and flights. Now we ask a new question: "Give the airline with the abbreviation UAL." We generate a GraphQL query, send that query, and get a response, which in this case says United Airlines. So this is just a small example of what it can do.

As for the future: in the next couple of days I want to release these models and code so that everybody is free to use and improve them. I've been working on a paper to submit to arXiv, and within the next week I'll be submitting that as well. Then, a little longer term, there are a couple more tasks. One is to add more examples that take advantage of GraphQL: the current examples only exercise the simpler aspects of GraphQL, but it wouldn't be too difficult to add more complex ones. Another task is to test on an enterprise schema. Salesforce and GitHub both have GraphQL APIs, and I'd like to see what kind of results I could get semantic parsing English to those GraphQL endpoints.

I just wanted to say thanks to a few communities and people. I want to thank OpenAI for the opportunity to be part of the Scholars program; I've learned a lot from the people at OpenAI, and I enjoyed having the flexibility to work on this project. Also, thanks to the Hugging Face and PyTorch Lightning communities, who made themselves very available to answer my questions on how to use their tooling. I want to thank my mentor, Melanie, for taking the time to work with me; as Sam said, it felt more like somebody working alongside me, getting me through the more difficult parts of getting into the field. I also want to thank Christina and Mariah, who were in charge of the Scholars program, were very helpful, and made themselves very available to all of the scholars whenever we needed help. And I also want to thank my wife, Noel, who has been very supportive throughout this whole program.

Now I'll open it up for any questions that I can see here in the Q&A.

Our first question: are your results using models trained from scratch, or using fine-tuning of pre-trained models? Also, what are the model sizes?

Good question. The models I was using were pre-trained on large corpora; in the case of T5, for example, it was trained on a corpus generated from the internet, so it has learned a lot about language in general. What I did was fine-tune those models for my specific task, in a couple of different ways, but the main one was converting English to GraphQL. As for model sizes, these models weren't extremely big. I was able to fit both models onto the GPUs available on Google Colab, which gives around 16 gigabytes of memory for these models, and run them perfectly fine with some smaller batch sizes. Anybody should be able to do that with a Google account using Google Colab.

The second question: how often does this model generate syntactically invalid outputs? That's a good question as well. As I mentioned before, the validation metric also covers examples that are not valid, so any invalid example is counted as wrong. Fifty percent accuracy means that 50 percent were valid and the correct query that I expected, which also means that 50 percent probably were not; so the upper bound on syntactically invalid outputs would be 50 percent. But in practice, when I looked at it, it tended to be a lot smaller than that; I would say something like five percent of the outputs were syntactically invalid, and that's because this model has the flexibility to output different types of query languages.

Next question: what was the most challenging part of the project? I think the most challenging part was definitely converting those SQL queries to GraphQL queries. It was a lot of parsing, a lot of using trees and graphs to figure out what the right GraphQL query would be. All in all, that process probably took a month by itself to work out, and at the very beginning I didn't even know if it was possible, but as we realized, about half of the queries were possible to transfer over.

Let's see if there are any more. Yes, okay, I have another question, from Alec Radford: do you think general-purpose architectures like T5 are sufficient, or is there still a need for domain-specific architectures like the SQL-specific ones you mentioned?

I've been thinking about this question, actually, and I think it depends on the difficulty, because there is a difference between a model whose whole architecture is specifically tied to SQL and a model where just the output heads are tied to SQL. If we could, and this could be future work obviously, replace just the SQL head and put a GraphQL head on there, that would be great; but obviously this comes down to the time required to create those heads. I think we probably need to explore both a little better and compare them, so that's a good place to look in the future.

Let's see. The next question is how often the model generates syntactically invalid outputs; I already covered that, probably around five percent of the time. And what's the main metric I've used for evaluating my model? That's the metric I mentioned previously, exact set matching accuracy: I would convert the GraphQL queries into abstract syntax trees and then compare the two trees to each other, and that way I could evaluate my model after every epoch.


And then from Christina: really cool work; how do you imagine the model architecture differing for GraphQL as opposed to SQL? Is the problem space different in any way?

On the model architecture, I feel like most of the model could remain the same, because as we've seen with other examples, these general architectures can be used for translation tasks, text generation, classification, different things like that. My intuition is that where the work will go is in preparing the right types of heads, and what that requires is somebody who understands how to output these abstract syntax trees directly. I think that as long as the general architecture, or in other words the middle part of the model, is able to understand the English, these general models will be able to perform well.