Transcript
So I'd like to talk about semantic parsing of English to GraphQL.

First, a little background on GraphQL for those who don't have much context. GraphQL is basically a query language for your API. It can be compared to SQL, which is a query language for databases; GraphQL, though, can cover a whole API, which means it can cover a broad set of business logic, databases, and data types. One of GraphQL's strengths is its ease of developer use: it lends itself well to nested relations, it provides a schema which serves as an API contract, and it can make the development experience for software engineers a little bit easier.

Here we can see a little of what GraphQL looks like. GraphQL lets you describe your data in a schema, and here you have a small example of that. It can show you the relationships between different types of data. With that schema we can send a query, and that query gives us a predictable result: exactly what we expect to receive from this interface.
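To make the schema/query relationship concrete, here is a minimal sketch (the types and fields are hypothetical, not the slide's exact example); note how the response shape mirrors the query's nesting:

```python
# Hypothetical GraphQL schema and query illustrating the idea: the schema
# is the API contract, and the response mirrors the query's shape.
# (Toy example; not the exact schema shown on the slide.)

schema_sdl = """
type Song {
  name: String!
  releaseDate: String
  artist: Artist
}

type Artist {
  name: String!
  songs: [Song!]!
}

type Query {
  songs: [Song!]!
}
"""

query = """
{
  songs {
    name
    artist {
      name
    }
  }
}
"""

# The response nests exactly the way the query does:
expected_shape = {"songs": [{"name": "...", "artist": {"name": "..."}}]}
```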
Semantic parsing is the task of converting a natural language utterance to a logical form: a machine-understandable representation of its meaning. In this case we're converting English natural language to GraphQL, which is the logical form, the machine-understandable representation of its meaning.

Here we can see a little more of what this looks like. For example, if we have some English prompt, such as "what is the name and date of the song released most recently," and some GraphQL schema that defines that data, we can find the corresponding GraphQL query.

So why did I want to go through this project?
There are a few reasons. The first is that I wanted to understand the limits of general language models for semantic parsing. This task is a little similar to machine translation, where we take some input language and output a target language; in our case we're inputting English and outputting GraphQL. I also wanted to see how it could potentially ease the learning curve for developers: if a developer can interact with a model that generates queries for them, so they can see how these queries are structured, it might make things easier for them. And finally, this could be potential tooling for non-technical data users. An example would be a manager who uses Salesforce: instead of reaching out to an engineer to generate a custom query for them, they could type the question out in English and get a response.
Previously there has been some work on semantic parsing over a broad range of languages and domains. For my specific use case I was really interested in what the SQL datasets looked like. As you can see here, there are several SQL datasets covering a broad range of domains and query complexities. Of these, the one that stood out the most (and I'll cover why in a second) is Spider. The problem, though, was that there were no GraphQL datasets. I wanted to train a model to learn to create GraphQL, and that would be very difficult without a GraphQL dataset.
So I looked into Spider specifically. Spider is semantic parsing of natural language text to SQL. The dataset has 10,000 questions and around 5,000 unique complex SQL queries covering 200 different databases across 138 different domains. Between the train and test (validation) splits of the corpus, the queries and databases are different, so for a system to perform well it must generalize to new queries, new database schemas, and new questions, which is a difficult task. This specific task has been tackled for the last few years and has achieved pretty good results.
Here on this slide you can see the top five leaders ranked by exact match accuracy: on the test set we see around 60 percent accuracy on this Spider text-to-SQL task.
For my task I had to be able to create a GraphQL dataset, and the Spider dataset served as a great starting point. First off was being able to convert SQL to GraphQL. I started out not even knowing whether this was possible, so it took a little time to understand what tools were available and what I could use to accomplish this first task. Two very important tools that I used were Hasura and pgloader. Hasura generates a GraphQL schema from a database, and pgloader converted SQLite to Postgres, which allowed me to run Hasura on top of it.
Then the bulk of the work was converting SQL abstract syntax trees to GraphQL abstract syntax trees, and I'll cover what that means right now. Here's a simple example of a SQL abstract syntax tree next to a GraphQL abstract syntax tree. On the left side we see SQL; the query here is SELECT count(*) FROM songs. We take this query and parse it into a tree, and, like I said, this is a very simple example. This tree is then converted to a GraphQL tree, as you can see on the right side, and that tree is then rendered as a raw GraphQL query.
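A toy version of this conversion, handling only the single COUNT(*) pattern above, might look like the following. Hasura exposes aggregates under a `<table>_aggregate` field; everything else here (the dict-based ASTs, the renderer) is a simplified stand-in for the real converter:

```python
# Toy sketch of the AST-to-AST conversion for one pattern:
#   SELECT count(*) FROM songs  ->  { songs_aggregate { aggregate { count } } }
# The real converter handles far more SQL shapes than this.

def sql_count_ast(table):
    """A minimal SQL AST for SELECT count(*) FROM <table>."""
    return {"select": [{"func": "count", "arg": "*"}], "from": table}

def to_graphql_ast(sql_ast):
    """Map the COUNT(*) pattern onto Hasura's aggregate field layout."""
    table = sql_ast["from"]
    return {"field": f"{table}_aggregate",
            "children": [{"field": "aggregate",
                          "children": [{"field": "count", "children": []}]}]}

def render(node, indent=0):
    """Render a GraphQL AST node back into raw query text."""
    pad = "  " * indent
    if not node["children"]:
        return pad + node["field"]
    inner = "\n".join(render(c, indent + 1) for c in node["children"])
    return f'{pad}{node["field"]} {{\n{inner}\n{pad}}}'

gql_ast = to_graphql_ast(sql_count_ast("songs"))
query = "{\n" + render(gql_ast, 1) + "\n}"
```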
Generating this dataset required validation scripts, and there were a few things I looked out for in them. When I was verifying the dataset, I covered syntax, making sure the queries that were formed were actual GraphQL queries. I also validated the syntax against the schema, making sure the keywords used in the queries were valid for the schema we were looking at. And then I executed those queries against an endpoint to make sure they were valid.
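The three validation layers can be sketched roughly like this. A real implementation would use a GraphQL parser (for example the graphql-core package); this stdlib-only stand-in just shows the structure of the checks, and the endpoint URL is hypothetical:

```python
import json
import re
import urllib.request

def check_syntax(query):
    """Layer 1: crude well-formedness check (balanced, non-empty braces).
    A real script would fully parse the query instead."""
    return query.count("{") == query.count("}") and query.count("{") > 0

def check_against_schema(query, schema_fields):
    """Layer 2: every field keyword in the query must exist in the schema."""
    fields = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", query)
    return all(f in schema_fields for f in fields)

def check_endpoint(query, url):
    """Layer 3: execute the query; GraphQL errors come back under "errors".
    (Hypothetical endpoint URL; shown for shape only.)"""
    payload = json.dumps({"query": query}).encode()
    req = urllib.request.Request(url, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return "errors" not in json.load(resp)
```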
What this whole process resulted in is that around half of the queries were transferred. The big obstacle was that Hasura doesn't include a GROUP BY clause. GROUP BY is used very often in SQL, and Hasura didn't have a good way to transfer it over. This could have been done manually, but I wasn't able to do that within the time limits of the program. In the end, though, because of the validation scripts, I was very confident in the dataset. Diving into the details, it contains 160 schemas across 138 different domains, around 4,300 unique English prompts, and around 2,400 unique GraphQL queries.
After that, the next step was experimenting on that dataset and seeing what kind of results I could get. I experimented with a few different models; the ones that stood out the most were BART and T5, two models from different research groups. Both of them are encoder-decoder Transformer models, and they're very similar; where they vary is that BART uses a bidirectional encoder. Both lend themselves very well to translation tasks, so I thought they could lend themselves to my task as well.
As I mentioned before, the process looks a little like this. We input an English prompt (in this case, "what is the name and date of the song released most recently") concatenated with the GraphQL schema, so it looks a little like what you see on the left here. That is passed through our model, T5, and the output/target is the GraphQL query we see on the right. This is done with an autoregressive objective.
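One way to picture how a training example is assembled is the sketch below. The separator, the exact serialization, and the schema/field names are assumptions for illustration, not necessarily the format used in the actual experiments:

```python
# Sketch of one seq2seq training example: the encoder input is the English
# prompt concatenated with the GraphQL schema, and the decoder target is
# the GraphQL query, trained autoregressively.
# (Separator token and field names here are hypothetical.)

def make_example(prompt, schema, target_query, sep=" | "):
    """Assemble a single source/target pair for fine-tuning."""
    source = prompt + sep + schema
    return {"source": source, "target": target_query}

example = make_example(
    "what is the name and date of the song released most recently",
    "type song { song_name: String releases: String }",  # hypothetical, truncated schema
    "{ song(order_by: {releases: desc}, limit: 1) { song_name releases } }",
)
```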
Part of this required a validation metric to make sure my results looked the way I expected them to: I wanted my outputs to look like my targets. The validation metric I came up with was exact set matching accuracy. What this means is that since I could parse my GraphQL queries, both the target and the output, into abstract syntax trees, I could compare the two trees to see how accurate the output was. In these abstract syntax trees the order of the child nodes doesn't matter, so two trees with a different order of child nodes can be equivalent. For example, take these two little trees here: on the left side the green child node is on the left, and on the right side the green child node is on the right, yet these are equivalent queries. I wanted to make sure this validation metric could handle that. A good example would be in this query here: the song name and release date could be switched, but the query would still be equivalent.
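The order-insensitive comparison can be sketched as follows, with ASTs as simple (name, children) tuples; canonicalizing by sorting children makes equivalent trees compare equal:

```python
# Sketch of order-insensitive tree equality: two query ASTs count as an
# exact-set match if they contain the same fields, where sibling order
# does not matter ("song_name, releases" vs "releases, song_name").

def canonical(node):
    """Recursively sort children so equivalent trees render identically."""
    name, children = node
    return (name, tuple(sorted(canonical(c) for c in children)))

def exact_set_match(a, b):
    return canonical(a) == canonical(b)

# Two trees that differ only in sibling order:
left = ("songs", [("song_name", []), ("releases", [])])
right = ("songs", [("releases", []), ("song_name", [])])
```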
So what did the results look like? With these two models, specifically using T5, which performed the best, I got 46% to 50% exact set matching accuracy on the GraphQL validation set, in comparison to the 20% SQL exact set matching accuracy I got with these same models. This is an interesting result for a couple of reasons. The range of 46% to 50% is actually the same model: the 46% model was trained only on the GraphQL queries, while the 50% model was trained on both the GraphQL and the SQL queries, and for some reason it was able to perform better. My guess is that this happened because the model learned which keywords were important across schemas. And as I mentioned before, the best existing Spider models got around 65 percent exact match accuracy, as we can see here.
So why did my model perform worse? It's because the models on this leaderboard use architectures that are specific to SQL: those models can only produce SQL, and they wouldn't be able to produce GraphQL as well. My model, on the other hand, was able to produce both SQL and GraphQL. This sets it up for future work, where we could find ways to increase the accuracy of both, and maybe even extend across other query languages as well.
As I mentioned before, since this dataset was based on the Spider dataset, for a model to perform well it must be able to generalize over new schemas, new questions, and new queries. To me, 50% accuracy says that the model is able to generalize pretty well. What I failed to mention a little earlier is that this exact set matching accuracy is more of a lower bound, because there are multiple queries that could display the same information, but it's more difficult to account for those when comparing trees.
So what does this look like? It'll be helpful to see it in action. Here we have a database called music_1, which has information about different music genres and artists, and we can ask a question such as "what is the country of the artist named Enrique." Here we generate a GraphQL query, as we can see below on the left side; then we can send that query to a server and get a response. Here our response looks like this, and the answer to the question is that the country is the USA.
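The round trip in this demo can be sketched as follows. The field names and Hasura-style filter are assumptions based on the talk's description of the music_1 database, not the demo's exact output:

```python
import json

# Sketch of the demo round trip: the generated query goes out as a standard
# GraphQL POST body ({"query": ...}), and the answer is read from the JSON
# response. Field names ("artist", "artist_name", "country") and the
# where-filter syntax are hypothetical stand-ins.

generated_query = '{ artist(where: {artist_name: {_eq: "enrique"}}) { country } }'
request_body = json.dumps({"query": generated_query})

# A Hasura-shaped response for this query would look something like:
response = {"data": {"artist": [{"country": "USA"}]}}
answer = response["data"]["artist"][0]["country"]
```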
Here's another example, and this shows that the model is able to generalize to a different database. There are 160 different databases, and instead of selecting music_1, we're going to look at flight_2, which has information about different airlines and flights. Now we ask a new question: "give the airline with the abbreviation UAL." We generate a GraphQL query, send that query, and get a response; in this case it says United Airlines. This is just a small example of what it can do.
In the future, in the next couple of days, I want to release these models and code so that everybody is free to use and improve them. I've been working on a paper to submit to arXiv, and within the next week I'll be submitting that as well. A little longer term, there are a couple more tasks. One is to add more examples that take advantage of GraphQL: these examples only exercise the simpler aspects of GraphQL, but it wouldn't be too difficult to add more complex examples. Another task is to test on an enterprise schema. Salesforce and GitHub both have GraphQL APIs, and I'd like to see what kind of results I could get by semantically parsing English to these GraphQL endpoints.
I just wanted to say thanks to a few communities and people. I want to thank OpenAI for the opportunity to be part of the Scholars program; I've learned a lot from the people at OpenAI, and I enjoyed having the flexibility to work on this project. Thanks also to the Hugging Face and PyTorch Lightning communities, who made themselves very available to answer any questions on how to use their tooling. I want to thank my mentor, Melanie, for taking the time to work with me; as Sam said, it did feel like somebody working alongside me and getting me through the more difficult parts of entering the field. I also want to thank Christina and Mariah, who were in charge of the Scholars program; they were very helpful and made themselves very available to all of the scholars whenever we needed any help. And I also want to thank my wife, Noel, who has been very supportive throughout this whole program.

Now I'll open it up for any questions that I can see here in the Q&A.
Our first question: are your results using models trained from scratch, or using fine-tuned pre-trained models? Also, what are the model sizes?

Good question. The models I was using were pre-trained on large corpora; in the case of T5, for example, it was trained on a corpus generated from the internet, so it has learned a lot about language in general. What I did was fine-tune those models for my specific task, in a couple of different ways, but the main one was converting English to GraphQL. As for model sizes, these models weren't extremely big: I was able to fit both onto the GPUs available on Google Colab, which gives around 16 gigabytes of memory, so I could run them perfectly fine there with some smaller batch sizes. Anybody should be able to do that with a Google account using Google Colab.
The second question: how often does this model generate syntactically invalid outputs?

That's a good question as well. As I mentioned before, the validation metric also covers examples that are not valid, and any invalid example is counted as wrong. So 50% accuracy means that 50% were valid and matched the query I expected, and that also means 50% probably were not. The upper bound on syntactically invalid outputs would therefore be 50%, but in practice, when I looked at it, it tended to be a lot smaller: I would say something like five percent of the outputs were syntactically invalid, and that's because this model has the flexibility to output different types of query languages.
Next question, from mris: what was the most challenging part of the project?

I think the most challenging part was definitely converting the SQL queries to GraphQL queries. It was a lot of parsing, a lot of working with trees and graphs to figure out what the right GraphQL query would be. All in all, that process probably took a month by itself to work out, and at the very beginning I didn't even know if it was possible, but as we saw, about half of the queries were possible to transfer over.
Let's see if there are any more. Okay, I have another question, from Alec Radford: do you think general-purpose architectures like T5 are sufficient, or is there still a need for domain-specific architectures, like the SQL-specific ones you mentioned?

I've actually been thinking about this question, and I think it depends on the difficulty, because there is a difference between a model whose whole architecture is specifically tied to SQL and a model where just the output heads are tied to SQL. If we could (and this could be future work, obviously) replace the SQL head, and just the SQL head, with a GraphQL head, that would be great, but obviously that runs into the time required to create those heads. I think we probably need to explore both a little better and compare them, so that's a good place to look in the future.
Let's see. The next question is how often the model generates syntactically invalid outputs; I already covered that: probably around five percent of the time.

And what's the main metric you used for evaluating your model? That's the metric I mentioned previously, exact set matching accuracy. I would convert the GraphQL queries into abstract syntax trees and then compare the two trees to each other, and that way I could evaluate my model at every epoch.
Then from Christina: really cool work! How do you imagine the model architecture differing for GraphQL as opposed to SQL? Is the problem space different in any way?

I feel like most of the model architecture could remain the same, because as we've seen with other examples, these general architectures can be used for translation tasks, text generation, classification, and other things like that. My intuition is that the work will go into preparing the right types of output heads, and what that requires is somebody who understands how to output these abstract syntax trees directly. But I think that as long as the general architecture (in other words, the middle part of the model) is able to understand the English, these general models will be able to perform well.