Transcript
Hi, I'm Alethea Power, and I'm pretty new to the field of deep learning; I've been in it for about five months now, through the course of the Scholars Program. I'm getting a warning that I have bad network quality, so if I'm not coming through clearly, somebody let me know in the background. So anyway, at the end of last year... sorry, distracted by the network quality. My background is in software engineering and site reliability engineering, and I've always been interested in AI, but at the end of last year I decided to try to make the switch to a new career. To that end I applied to the Scholars Program, and I was incredibly grateful to get in; it's been an amazing start to a new career. I want to thank OpenAI. I particularly want to thank my mentor and the other mentors, who have been very helpful, and the other scholars; it's been a fantastic cohort to go through all of this with.
During the course of the program I got very interested in interpretability. Interpretability is basically mind-reading for AI: it's about tearing open neural networks and looking at how they represent and process information. It's difficult to do, because AI, and deep learning in particular, is very different from traditional software engineering. There's a picture that almost everyone in the field has seen: in software engineering, a human being writes some software, and the software takes inputs and gives outputs. Those could be questions and answers, like a search engine, or whatever. But in deep learning, a human being creates math and gives it some data to train on, and that is what writes the software that takes inputs and gives outputs. It turns out that software written by math and by a computer is much harder to understand than software written by a human being. But it really matters, because AI is everywhere; it impacts us in tremendous ways throughout our lives.
I'm a transgender person, and that means that for a lot of my life my body has been a different shape than cisgender people's bodies, and that means that scanners at airports usually flag me for needing a pat-down. It's humiliating, it's embarrassing. It's not the end of the world, but it's not cool. And AI impacts other people in worse ways: there are systems where, you know, self-driving cars are more likely to hit people of color, and there are all sorts of biases and injustices that can come in. So if we understand how these systems work, then we can reduce their bias. In addition, if we understand how they work, we can improve their efficiency: we can find smaller networks that do the same sort of job and take a lot less electricity, a lot less time, a lot less resources, and a lot less money. And finally, if we understand how neural networks represent information, then we have a better chance of actually being able to understand human thought, which to me is the most interesting question of all.
So I decided to dig into interpretability by analyzing GPT-2. This was a state-of-the-art language-generation, language-modeling network that OpenAI released about a year and a half ago, and the way this network works is that you give it some input, some text, and it generates output. This is an actual example: I fed the phrase "My talk is about" into GPT-2, and it said "the future of education." You can give it the beginning of a sentence and get an ending; you can give it a paragraph and get an essay. It's very good at generating text, and a lot of what it generates is indistinguishable from text written by human beings. This is pretty powerful, and pretty dangerous. You could do something like train GPT-2 on some sort of subreddit and get it to generate political text, and then you could use it to make it look like there are a bunch of people on the internet who all have the same idea, when really it's just software. That's pretty dangerous, so we need to understand it. We need to dig into it and know how it works, how to combat things that are generated by it, and how to make sure that it's used in safe ways.
I had a certain amount of time to do this project, so I decided I would bite off a tractable part of this problem: the first thing I would do is just try to understand how GPT-2 understands English grammar. To explain how I figured that out, I need to give a little bit of background on how GPT-2 works. Some of the people on this call will know all about this and are literally world experts; I think the lead author on the GPT-2 paper is on this call. Also, my mom is on this call. Hi, Mom! So I want to make sure to give some background that's applicable to a wide variety of audiences, and try not to leave anybody behind based on a lack of prior knowledge of how this works. I also think that's a core part of interpretability: trying to democratize this information and spread it around, so that people outside the field can actually have an understanding of what's going on. So I'm going to spend a second talking about the transformer architecture, and then I'll get into what I built on top of it. GPT-2 is a transformer, but I'll get into that in a minute.
When I feed in this beginning of a sentence, "My talk is about", the first thing it does is split that string into tokens. Tokens can be words, they can be punctuation marks, they can be collections of bytes in the string; basically, sub-parts of the string. I restricted myself to sentences that had a one-to-one mapping between the tokens and the words and punctuation marks, because that made it a little bit easier for me to analyze. GPT-2 has a slightly subtle way of doing this, but I kind of circumvented it.
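For anyone who wants to see that tokenization concretely, here's a minimal sketch using the Hugging Face transformers library. This isn't my project code, just an illustration of how GPT-2's byte-pair tokenizer splits a string:

```python
# Minimal sketch of GPT-2 tokenization (illustrative, not the project code).
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

text = "My talk is about"
token_ids = tokenizer.encode(text)
tokens = tokenizer.convert_ids_to_tokens(token_ids)

# Likely prints something like ['My', 'Ġtalk', 'Ġis', 'Ġabout'];
# the 'Ġ' marks a leading space, so " talk" is a different token than "talk".
print(tokens)
print(token_ids)
```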
These tokens (oops, I'm clicking the wrong button here) get converted into vectors. The word "My" always converts into this vector here, and this is actually " talk", with a space in front of it, which always converts into this vector. So I end up with four vectors, excuse me, and they get fed into GPT-2 and flow through the network along these positions. If I put four tokens in, I get four tokens out; in this particular diagram there are four of them flowing through it.
So what are they flowing through? The first part here is an embedding layer; that's what turns the tokens into vectors. Then it has a bunch of decoder blocks. GPT-2 comes in a variety of sizes; I looked at GPT-2 small, which is what would fit on my home graphics card, and even it is huge: it has over a hundred million parameters, variables, and so I knew that I needed to try to break it up to tackle this problem. Most of these parameters are here in these decoder blocks. Finally, it has a language modeling layer. Each decoder block takes in vectors at each position and outputs vectors at each position, and then this language modeling layer takes the final set of vectors that come out of the top decoder block and produces probabilities for what the next word might be. I'll get into that in a second.
Inside these decoder blocks are what are called attention heads. Attention heads mix and match information between the different positions to feed out into the new positions; they kind of collect the information that's spread across the input and gather it into focus areas. You can think of this as being like a sushi boat restaurant, the kind with a little stream and little boats that float along next to your table with pieces of sushi on them. You can imagine each of these positions flowing through the network as being like a sushi boat path, and the tokens, the vectors going through there, are like sushi boats. An attention head might look at all of these positions, take all the cucumber out of all the sushi, and put it into only the one in position one. Well, actually it wouldn't do that; only into the one in the last position. Attention heads in GPT-2 are not allowed to take information from future tokens and feed it into past positions: the information can only flow this way, and it can't flow that way. So anyway, you can imagine these attention heads mixing and matching little bits of the sushi together and feeding them forward, trying to get a more organized picture of what's going on for the task it's trying to perform. Each of these layers, each of these decoder blocks here, has 12 attention heads, and they can all operate independently. Then at the top of each layer there's a linear layer that puts all their outputs together and organizes them into the output for that whole layer. Okay, that's a whirlwind tour of the transformer architecture.
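To make the "information can only flow forward" constraint concrete, here is a minimal sketch of masked (causal) self-attention for a single head. The function name and shapes are my own, not GPT-2's actual implementation:

```python
# Sketch of one causal self-attention head (illustrative only).
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    """q, k, v: tensors of shape (seq_len, head_dim)."""
    seq_len, head_dim = q.shape
    scores = q @ k.T / head_dim ** 0.5                      # (seq_len, seq_len)
    # Mask out future positions so information cannot flow backwards.
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    attn = F.softmax(scores, dim=-1)                         # the "attention matrix"
    return attn @ v, attn                                    # mixed values, plus the weights
```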
So what is GPT-2 actually doing? At each position, the goal is for it to output the next word, and like I said, this top language modeling layer outputs probabilities. Ideally, you want the word "talk" to have a higher probability than the others here, and here you want the word "is" to have a higher probability; because the next word here was "talk", you want this position to generate "talk", and the next word here is "is", so you want that position to generate "is". It goes through this all the way to the end, and here it's going to generate some word that you haven't had in your input, which you can then feed back in to generate future words. This is how GPT-2 comes up with a completion of a sentence, or a paragraph, or whatever. This is called autoregression.
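As a rough illustration of that autoregressive loop, here is a sketch using the Hugging Face transformers implementation of GPT-2, with simple greedy decoding just to show the feed-the-output-back-in idea:

```python
# Sketch of autoregressive generation with GPT-2 (greedy decoding).
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

ids = tokenizer.encode("My talk is about", return_tensors="pt")
with torch.no_grad():
    for _ in range(10):
        logits = model(ids).logits              # (1, seq_len, vocab_size)
        next_id = logits[0, -1].argmax()        # most probable next token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))
```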
So, okay: what I did here, in order to understand how grammar is represented inside the network, was strip off this language modeling linear layer and replace it with a grammar modeling layer. What this means is that instead of having it output probabilities of English words, or byte-pair encodings of English words, which is how GPT-2 tokenizes, I had it output probabilities of parts of speech. I looked at three different kinds of grammar: simple part of speech, detailed part of speech, and syntactic dependencies. Simple part of speech is things like pronoun, verb, et cetera. Syntactic dependencies are things like object of a preposition. And detailed part of speech is just more fine-grained about what each word is doing.
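Here's a minimal sketch of what I mean by a grammar modeling layer: a small classification head sitting on top of GPT-2's hidden states, predicting a part-of-speech tag at each position instead of the next token. The class name, tag count, and freezing choice are illustrative, not the exact project code:

```python
# Sketch of a grammar-probe head on top of GPT-2 hidden states.
# NUM_TAGS and the wiring are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import GPT2Model

NUM_TAGS = 17  # e.g. the number of simple part-of-speech tags

class GrammarProbe(nn.Module):
    def __init__(self):
        super().__init__()
        self.gpt2 = GPT2Model.from_pretrained("gpt2")
        for p in self.gpt2.parameters():
            p.requires_grad = False            # freeze GPT-2, train only the probe
        self.head = nn.Linear(self.gpt2.config.hidden_size, NUM_TAGS)

    def forward(self, input_ids):
        hidden = self.gpt2(input_ids).last_hidden_state   # (batch, seq, hidden)
        return self.head(hidden)                           # (batch, seq, NUM_TAGS)
```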
So anyway, I put this grammar modeling layer on top, and I trained it. I built three datasets, one for each of these different types of grammatical structure: huge datasets, 300,000 sentences, and I used spaCy, which is a natural language processing tool out in the wild, to tag all these sentences with their grammatical structures. Please note here that the goal of this project was not to produce a grammatical tagger, because spaCy already does that, and does it better than the thing I built. My goal was to use a grammatical tagger on top of GPT-2 as a way of measuring information inside of GPT-2. So you can see here, this shows it outputting parts of speech.
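For reference, this is roughly how spaCy produces those labels; output like this is what I used as training targets (the example sentence is just for illustration):

```python
# Sketch of tagging a sentence with spaCy to build training targets.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("My talk is about the future of education")

for token in doc:
    # token.pos_ = simple part of speech, token.tag_ = detailed part of
    # speech, token.dep_ = syntactic dependency label
    print(token.text, token.pos_, token.tag_, token.dep_)
```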
Once I had this grammatical tagger in place, I also looked at what are called entropies. I'm not going to explain the technical details of this, since I'm short on time, but the gist is that I looked at the entropies of the attention matrices coming out of the attention heads, for sentences in each of these different structures. The entropy of an attention matrix basically tells you how complicated the mixing and matching is that that head is doing. If all the head is doing is taking all of the cucumber out of all the sushi and putting it in position one, that's a relatively low-entropy operation; it's not that complicated. But if the head is mixing and matching a whole bunch of things in complicated ways, then the entropy will be higher.
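Concretely, the quantity I'm describing is something like the Shannon entropy of each row of an attention matrix, averaged over rows. Here is a small sketch of one way to compute it (my own formulation, just to pin down the idea):

```python
# Sketch: entropy of an attention matrix, one way to measure how
# "spread out" a head's mixing is (higher = more complicated mixing).
import torch

def attention_entropy(attn, eps=1e-12):
    """attn: (seq_len, seq_len); each row is a probability distribution."""
    row_entropy = -(attn * (attn + eps).log()).sum(dim=-1)   # (seq_len,)
    return row_entropy.mean()                                 # average over positions
```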
These are pictures of the attention matrix entropies, organized so that these are attention heads, and this is layer one of the network, layer two of the network, and so on. The diagram I had before only showed three layers, but GPT-2 small has 12 layers. Ah, I've shown you the wrong one and given away a little bit of the future; I was supposed to show you one with 12 layers here instead of 11. Ignore the man behind the curtain; I'll get to that in a moment. What's interesting to note here, though, is that the entropies are much higher at the lower layers of the network. What that tells us is that the network is doing a lot more restructuring, and looking at the relationships between words, in these first four layers for this grammatical task than in the upper layers. Interesting!
So maybe grammatical comprehension lives at the lower layers of the network. To test that, I took my grammatical classifier and ran it on top of each layer of GPT-2, and looked at how hard it was to train and how good a score it could get, basically how low the loss was. I've got a video here of what that looked like.
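The mechanics of "running the classifier on top of each layer" look roughly like this sketch: ask the model for all of its intermediate hidden states and train a separate probe on each one. The helper train_probe(), and the input_ids, pos_tags, and NUM_TAGS names, are hypothetical placeholders:

```python
# Sketch: probing each layer of GPT-2 with a separate grammar classifier.
import torch
from transformers import GPT2Model

model = GPT2Model.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

with torch.no_grad():
    outputs = model(input_ids)     # input_ids: a batch of token ids, prepared elsewhere

# hidden_states[0] is the embedding output ("layer zero");
# hidden_states[i] is the output of decoder block i, for i = 1..12.
hidden_states = outputs.hidden_states

losses = []
for layer, states in enumerate(hidden_states):
    probe = torch.nn.Linear(states.shape[-1], NUM_TAGS)
    losses.append(train_probe(probe, states, pos_tags))   # hypothetical training helper
```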
You can see here that layer zero means I ran it right on top of the embedding, before any of the layers of GPT-2 ran. It trained for up to two hundred epochs; I actually trained it longer, but I cut the graph off at two hundred. It kept going for another two hundred and fifty or so, and it did not learn a ton. This particular one was for syntactic dependencies. You can see that at layer one it did a bit better; at layer two it did better still; at layer four it did pretty great; and at layer five it did excellently.
So this shows how well this grammar classifier trained on top of each of these layers of the network, and it's really interesting. It did a much better job at layers five and six; you can see it actually got its best score on layer five, that's where it did the very best. It did a much better job at layers 5 and 6 than at the layers before and at the layers after. That means this grammatical information came into view through these attention heads manipulating it in the first four layers, and then it started to go back out of view. This led me to the question: is that because the later half of the network is trying to generate future words? That's what it was trained to do, so maybe it's more focused on the future than on the past. So I trained it for syntactic tagging of what the expected output token should be, instead of just the input tokens, and you can see that it peaked up here at layer eight. If we just look: this is incoming, and that's outgoing; incoming, and outgoing. This grammar classifier is basically a tool to measure where the information lives in the network, and how much information is easily accessible for this grammatical task at different layers. You can see that the information for understanding the grammar of the incoming sentence, the incoming tokens, is much more accessible at lower layers, and for the outgoing tokens it's much more accessible at higher layers. Cool.
So what we're actually seeing here (and sorry, I've got my slides out of order and given away another thing I was going to say) is that these heads are rotating this information into view of these positions, in a kind of abstract informational space. Here's an example of what I mean by that. I laid a bunch of markers on a table. Looking at them from this angle, you can't tell how many markers there are, because you're looking from the wrong angle. If I rotate them slightly, you can tell there's more than one, but not really how many or what colors they are. If I rotate them a bit further, you can tell there are a few, but it's not clear how many greens there are. And if I rotate them yet further, you can see exactly how many markers there are and exactly what colors they are. That's what I mean by rotating information. This is kind of an abstract version of the same thing: the grammatical information is being rotated, and not just rotated but stretched and compressed and warped and other things, so that it comes into view of these positions that are flowing through the network. I also did the same thing for simple part of speech and detailed part of speech, and you can see those both coalesce at layer 3, which makes sense; those are simpler to figure out.
So once I had this, I took my grammar classifier, chopped off the top half of GPT-2, and just ran it on top of layer 5. Here I decided to look at how important each head, each attention head in the remaining network, was for this classification, and I tried a couple of strategies. For the first strategy, I followed a paper called "Are Sixteen Heads Really Better than One?". I'm not even going to bother trying to make this interpretable to non-technical people: I fed in a mask tensor, a ones tensor, multiplied it by the output of each attention head, and then used backpropagation to find the Jacobian of the grammatical classification loss with respect to the coefficient of each head. That gives you an at least locally linear interpretation of how important each head is for grammatical classification. But it turned out that strategy didn't actually work that well; it had worked pretty well in the paper for BERT, but it didn't work that well for GPT-2.
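For the technically inclined, that gradient-based importance score can be sketched with the Hugging Face head_mask argument; this is an approximation of the idea I just described, not my exact setup, and grammar_loss, probe, input_ids, and pos_tags are hypothetical placeholders:

```python
# Sketch: gradient-based head importance, in the spirit of
# "Are Sixteen Heads Really Better than One?" (Michel et al., 2019).
import torch
from transformers import GPT2Model

model = GPT2Model.from_pretrained("gpt2")
n_layers, n_heads = model.config.n_layer, model.config.n_head

# One "gate" per head; gradients w.r.t. these gates approximate importance.
head_mask = torch.ones(n_layers, n_heads, requires_grad=True)

hidden = model(input_ids, head_mask=head_mask).last_hidden_state
loss = grammar_loss(probe(hidden), pos_tags)     # hypothetical probe and labels
loss.backward()

importance = head_mask.grad.abs()                # (n_layers, n_heads)
```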
So instead I tried a slower, more computationally intensive strategy, where I just chopped out each head individually and looked at its impact on the grammatical classification. If it had a big impact, then that attention head mattered, and that was a place where grammar was being learned. Using that, I was able to pull out a lot of the heads in here. For this particular grammatical structure, the very best loss I could get came from cutting out almost every head in the network. The black here is where I removed a head, and the white is the heads remaining. This grammatical structure needed a few more heads. This one needed almost no heads; in fact, it didn't need heads at all in some of these layers, which is kind of amazing.
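The ablation itself can be sketched with the same head_mask mechanism: zero out one head at a time and measure how much the grammar loss degrades. Here evaluate_grammar_loss() is a placeholder for running the frozen model plus probe on a held-out set:

```python
# Sketch: ablating one attention head at a time and measuring the
# effect on the grammar-classification loss.
import torch

n_layers, n_heads = 12, 12      # GPT-2 small

baseline = evaluate_grammar_loss(head_mask=torch.ones(n_layers, n_heads))

impact = torch.zeros(n_layers, n_heads)
for layer in range(n_layers):
    for head in range(n_heads):
        mask = torch.ones(n_layers, n_heads)
        mask[layer, head] = 0.0    # chop out this one head
        impact[layer, head] = evaluate_grammar_loss(head_mask=mask) - baseline

# Heads with small impact are candidates for pruning away entirely.
```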
So anyway, in the future I would like to take these maps of which heads matter for different grammatical structures and dig into them, and figure out what's going on in these individual heads, now that I've reduced GPT-2 to a much smaller collection of sub-networks that are practical to analyze. I'd also like to compare and contrast how these maps relate between structures. For example, here you can see these three heads are not needed for this structure, or that structure, or this structure, so there are relationships in here, and I think we can find sub-networks of GPT-2 that relate to different grammatical structures. Hopefully that will one day, down the road, get us to the point where we can better tear open these language models and have a much deeper understanding of what's going on in them.
Okay, hopefully I'm under my time. Anyway, time for Q&A. I know we're all running a little bit long, so I don't know if there's time for Q&A, but we'll see. Anybody got questions? I'm looking over here because I have a separate monitor with the Q&A. Oh, here we go: "From papers like Image GPT, we know that transformers have great representations in the middle of the network. How far is the grammar loss predictive of useful representations for other tasks, and not just grammar detection?" That's a great question. I haven't read the Image GPT paper; like I said, I've been in the field of deep learning for about five months, during a pandemic and a revolution, and I also had a bunch of medical problems, so I don't actually know the results of that paper, but it sounds cool and I would love to read it. I think it's a good question.
"How is the grammar loss predictive of useful representations for other tasks, and not just grammar detection?" I think it probably generalizes pretty well. You're going to need some way of classifying whatever it is you're looking for. In this particular case I had a good, easy way to generate a large dataset that I could tag with grammatical structures, so I had a good, concrete mechanism for measuring the presence of that information. I think for situations where you can easily or plausibly produce a dataset and train a classifier that actually measures the kind of information you're looking for, this is pretty generalizable to other things. For more abstract types of questions, it's going to be a lot harder. It's all about math, and if you can't find a good way to numerically measure something, it's going to be hard to do. Some things you can just brute-force visualize, but I don't have the compute power to do that yet; hopefully I will in the not-too-distant future.
Okay: "Do you think the number of heads that are needed is correlated with the complexity of the sentence structure, or did you notice any specific repeated patterns?" You know, I was actually really surprised that some sentence structures needed so few heads, and it makes me want to dig into how much information is in these linear sub-layers of the transformer blocks, because clearly they're doing something; like you saw before, some of these layers didn't need any heads at all, which is kind of shocking. I do think there's clearly a correlation between the complexity of the network that's needed and the complexity of the sentence structure that's coming in. I don't know that it's a perfect correlation, and I haven't gone and done a calculation; for instance, I would like to do some analysis with a way of measuring the complexity of a sentence, compare that directly to the number of heads, and give a mathematical answer to this question. I haven't done that yet, but just visually it does look like there's some correlation there, and it does look like sentences that have similar structures to one another have similarities in the heads that are important, which is a validation that this strategy makes some sense. Yeah, okay. Any other questions? All right, I think that might be it for questions.