Looking For Grammar In All The Right Places | Alethea Power | OpenAI Scholars Demo Day 2020 | OpenAI

Transcript

Hi, I'm Alethea Power, and I'm pretty new to the field of deep learning; I've been in it for about five months now, through the course of the Scholars Program. I'm getting a warning that I have bad network quality, so if I'm not coming through clearly, somebody let me know in the background.

Anyway, my background is in software engineering and site reliability engineering, and I've always been interested in AI. At the end of last year I decided to try to make the switch to a new career, and to that end I applied to the Scholars Program. I was incredibly grateful to get in, and it's been an amazing start to a new career. I want to thank OpenAI; I particularly want to thank my mentor and the other mentors who have been so helpful, and the other scholars. It's been a fantastic cohort to go through all of this with.

During the course of the program I got very interested in interpretability. Interpretability is basically mind-reading for AI: it's about tearing open neural networks and looking at how they represent and process information. It's difficult to do because AI, and deep learning in particular, is very different from traditional software engineering.

There's a picture that almost everyone in the field has seen. In software engineering, a human being writes some software; the software takes inputs and gives outputs. They could be questions and answers, like a search engine, or whatever. But in deep learning, a human being creates math and gives it some data to train on, and that's what writes the software that takes inputs and gives outputs. It turns out that software written by math and by a computer is much harder to understand than software written by a human being. But it really matters, because AI is everywhere; it impacts us in tremendous ways throughout our lives.

I'm a transgender person, and that means that for a lot of my life my body has been a different shape than cisgender people's bodies, which means that scanners at airports usually flag me for a pat-down. It's humiliating, it's embarrassing. It's not the end of the world, but it's not cool. And AI impacts other people in worse ways: self-driving cars are more likely to hit people of color, and all sorts of biases and injustices can come in. So if we understand how these systems work, we can reduce their bias. In addition, if we understand how they work, we can improve their efficiency; we can find smaller networks that do the same sort of job and take a lot less electricity, a lot less time, a lot less resources, and a lot less money. And finally, if we understand how neural networks represent information, we have a better chance of actually being able to understand human thought, which to me is the most interesting question of all.

So I decided to dig into interpretability by analyzing GPT-2. This was a state-of-the-art language modeling network that OpenAI released about a year and a half ago. The way this network works is you give it some input text and it generates output. Here's an actual example: I fed the phrase "my talk is about" into GPT-2, and it said "the future of education." You can give it the beginning of a sentence and get the rest of it; you can give it a paragraph and get an essay. It's very good at generating text, and a lot of what it generates is indistinguishable from what human beings write. That's pretty powerful and pretty dangerous. You could do something like train GPT-2 on some sort of subreddit, get it to generate political text, and then use it to make it look like there's a bunch of people on the internet who all have the same idea, when it's really just software. That's pretty dangerous. So we need to understand it; we need to dig into it and know how it works, how to combat things that are generated by it, and how to make sure it's used in safe ways.

I had a certain amount of time to do this project, so I decided to bite off a tractable part of the problem: the first thing I would do is just try to understand how GPT-2 understands English grammar. To explain how I figured that out, I need to give a little bit of background on how GPT-2 works. Some of the people on this call will know all about this and are literally world experts; I think the lead author on the GPT-2 paper is on this call. Also, my mom is on this call. Hi, Mom! So I want to make sure to give some background that's applicable to a wide variety of audiences and try not to leave anybody behind for lack of already having a full knowledge of how this works. I also think that's a core part of interpretability: trying to democratize this information and spread it around so that people outside the field can actually have an understanding of what's going on. So I'm going to spend a second talking about the transformer architecture, and then I'll get into what I built on top of it. GPT-2 is a transformer, but I'll get into that in a minute.

When I feed in the beginning of the sentence, "my talk is about," the first thing GPT-2 does is split that string into tokens. Tokens can be words, they can be punctuation marks, they can be collections of bytes in the string; basically, sub-parts of the string. I restricted myself to sentences where there was a one-to-one mapping between the tokens and the words and punctuation marks, because that made it a little bit easier for me to analyze. GPT-2 has a slightly subtler way of doing this, but I kind of circumvented it. These tokens get converted into vectors: the word "my" always converts into this vector here, and this one is actually "talk" with a space in front of it, which always converts into this vector. So I end up with four vectors, and they get fed into GPT-2 and flow through the network along these positions. If I put four tokens in, I get four tokens out; in this particular diagram, there are four flowing through it.
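As a concrete illustration of this tokenization step, here is a minimal sketch using the Hugging Face transformers implementation of the GPT-2 tokenizer; the library and the "gpt2" checkpoint name are my assumptions for illustration, since the talk doesn't specify tooling.

```python
# A minimal sketch of GPT-2's byte-pair-encoding tokenization step.
# Assumes the Hugging Face `transformers` package; "gpt2" is the small model.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

text = "My talk is about"
token_ids = tokenizer.encode(text)                    # one integer id per token
tokens = tokenizer.convert_ids_to_tokens(token_ids)

# "Ġ" marks a leading space, so " talk" and "talk" are different tokens.
print(tokens)      # something like ['My', 'Ġtalk', 'Ġis', 'Ġabout']
print(token_ids)   # the integer ids that index into the embedding matrix
```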

So what are they flowing through? The first part here is an embedding layer; that's what turns the tokens into vectors. Then it has a bunch of decoder blocks. GPT-2 comes in a variety of sizes; I looked at GPT-2 small, which is what would fit on my home graphics card, and even it is huge: it has over a hundred million parameters. So I knew I needed to break the problem up to tackle it, and most of those parameters are here in these decoder blocks. Finally, it has a language modeling layer. Each decoder block takes in vectors at each position and outputs vectors at each position, and then this language modeling layer takes the final set of vectors that come out of the top decoder block and produces probabilities for what the next word might be; I'll get into that in a second.
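To give a sense of scale, here is a small sketch (again assuming the Hugging Face GPT-2 small checkpoint) that prints the layer count, head count, and total parameter count mentioned above.

```python
# Inspect GPT-2 small's size; assumes the Hugging Face `transformers` package.
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")   # the "small" variant

print(model.config.n_layer)   # 12 decoder blocks
print(model.config.n_head)    # 12 attention heads per block
print(model.config.n_embd)    # 768-dimensional vectors at each position

total_params = sum(p.numel() for p in model.parameters())
print(f"{total_params:,}")    # well over a hundred million parameters
```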

Inside these decoder blocks are what are called attention heads. Attention heads mix and match information between the different positions to feed into the new positions; they kind of collect the information that's spread across the input and gather it into focus areas. You can think of it like this: if you've ever been to a sushi boat restaurant, there's a little stream with little boats that float along next to your table with pieces of sushi on them. You can imagine each of these positions flowing through the network as being like a sushi boat path, and the vectors going through them are like sushi boats. An attention head might look at all of these positions, take all the cucumber out of all the sushi, and put it into the one in position one; well, actually it wouldn't do that, only into the one in the last position. Attention heads in GPT-2 are not allowed to take information from future tokens and feed it into past positions; information can only flow forward, not backward. So anyway, you can imagine these attention heads mixing and matching little bits of the sushi together and feeding them forward, trying to get a more organized picture of what's going on for the task the network is trying to perform. Each of these decoder blocks has 12 attention heads, and they can all operate independently. Then, at the top of each block, there's a linear layer that puts all their outputs together and organizes them into the output for that whole block. Okay, that's a whirlwind tour of the transformer architecture.
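To make the "12 heads per block" picture concrete, here is a hedged sketch (assuming a recent version of the Hugging Face transformers library) that pulls the attention matrices out of GPT-2 small and shows their shape, including the causal structure where a position can only attend backward.

```python
# Pull attention matrices out of GPT-2 small; assumes Hugging Face `transformers`.
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

input_ids = torch.tensor([tokenizer.encode("My talk is about")])
with torch.no_grad():
    outputs = model(input_ids, output_attentions=True)

# One attention tensor per decoder block: (batch, heads, positions, positions).
print(len(outputs.attentions))        # 12 blocks
print(outputs.attentions[0].shape)    # torch.Size([1, 12, 4, 4])

# The matrices are lower-triangular: each position attends only to itself
# and to earlier positions, never to future tokens.
print(outputs.attentions[0][0, 0])
```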

So what is GPT-2 actually doing? At each position, the goal is for it to output the next word. Like I said, the top language modeling layer outputs probabilities, and ideally you want the word "talk" to have a higher probability than the others here, and here you want the word "is" to have a higher probability, because the next word here was "talk," so you want it to generate "talk," and the next word here is "is," so you want it to generate "is." It goes through this all the way to the end, and at the last position it generates some word that you haven't had in your input, which you can then feed back in to generate future words. This is how GPT-2 comes up with a completion of a sentence or a paragraph or whatever. This is called autoregression.
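Here is a minimal sketch of that autoregressive loop, assuming the Hugging Face GPT-2 small checkpoint and simple greedy decoding (always taking the most probable next token); real sampling strategies differ, but the feed-the-output-back-in structure is the same.

```python
# Greedy autoregressive generation with GPT-2 small.
# Assumes the Hugging Face `transformers` package.
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

input_ids = torch.tensor([tokenizer.encode("My talk is about")])

with torch.no_grad():
    for _ in range(10):                      # generate ten more tokens
        logits = model(input_ids).logits     # (batch, positions, vocab size)
        next_id = logits[0, -1].argmax()     # most probable next token
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))
```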

So what I did here, in order to understand how grammar is represented inside the network, is strip off this language modeling layer and replace it with a grammar modeling layer. What that means is that instead of having it output probabilities of English words (or byte pair encodings of English words, which is how GPT-2 tokenizes), I had it output probabilities of parts of speech. I looked at three different kinds of grammatical structure: simple part of speech, detailed part of speech, and syntactic dependencies. Simple part of speech is things like pronoun, verb, and so on. Syntactic dependencies are things like "object of the preposition," and detailed part of speech is just more fine-grained: what exactly is each word doing?
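Here is a hedged sketch of what such a grammar modeling layer could look like: a linear classification head over GPT-2's hidden states, predicting grammatical tags instead of words. The class name, the tag count, and the choice to keep GPT-2 frozen are my assumptions for illustration; the talk doesn't spell out those details.

```python
# A grammar "probe": a linear layer that maps GPT-2 hidden states to
# part-of-speech (or dependency-label) probabilities instead of word
# probabilities. Illustrative sketch; tag count and freezing are assumptions.
import torch
import torch.nn as nn
from transformers import GPT2Model

class GrammarProbe(nn.Module):
    def __init__(self, num_tags: int, model_name: str = "gpt2"):
        super().__init__()
        self.gpt2 = GPT2Model.from_pretrained(model_name)
        for p in self.gpt2.parameters():    # keep GPT-2 frozen; only the
            p.requires_grad = False         # tag head on top is trained
        self.tag_head = nn.Linear(self.gpt2.config.n_embd, num_tags)

    def forward(self, input_ids):
        hidden = self.gpt2(input_ids).last_hidden_state   # (batch, pos, 768)
        return self.tag_head(hidden)                      # (batch, pos, num_tags)

# Training would minimize cross-entropy between these logits and
# per-token grammatical tags, e.g. with nn.CrossEntropyLoss().
```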

So anyway, I put this grammar modeling layer on top and trained it. I built three datasets, one for each of these different types of grammatical structure: huge datasets, 300,000 sentences, and I used spaCy, which is a natural language processing tool out in the wild, to tag all these sentences with their grammatical structures. Please note that the goal of this project was not to produce a grammatical tagger, because spaCy already does that, and does it better than the thing I built. My goal was to use a grammatical tagger on top of GPT-2 as a way of measuring information inside of GPT-2. You can see here that it outputs parts of speech.
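For reference, this is roughly what tagging sentences with spaCy looks like; the specific spaCy model name ("en_core_web_sm") is my assumption.

```python
# Tag sentences with their grammatical structure using spaCy.
# The model name "en_core_web_sm" is an illustrative choice.
import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("My talk is about the future of education.")
for token in doc:
    # pos_: simple part of speech, tag_: detailed part of speech,
    # dep_: syntactic dependency label (e.g. "pobj" = object of a preposition)
    print(token.text, token.pos_, token.tag_, token.dep_)
```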

Once I had this grammatical tagger in place, I also looked at what are called entropies. I'm not going to explain the technical details, since I'm short on time, but the gist is that I looked at the entropies of the attention matrices coming out of the attention heads for sentences with each of these different structures. The entropy of an attention matrix basically tells you how complicated the mixing and matching that head is doing. If all the head is doing is taking all of the cucumber out of all the sushi and putting it in a single position, that's a relatively low-entropy operation; it's not that complicated. But if the head is mixing and matching a whole bunch of things in complicated ways, then the entropy will be higher.
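As a sketch of what "entropy of an attention matrix" could mean here: my reading is the Shannon entropy of each position's attention distribution, summarized per head; the exact aggregation used in the project isn't specified in the talk, so the averaging below is an assumption.

```python
# Shannon entropy of GPT-2's attention distributions, per layer and head.
# The averaging over positions is an assumption; the talk doesn't specify it.
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

input_ids = torch.tensor([tokenizer.encode("My talk is about the future.")])
with torch.no_grad():
    attentions = model(input_ids, output_attentions=True).attentions

for layer_idx, attn in enumerate(attentions):      # attn: (1, heads, pos, pos)
    probs = attn.clamp_min(1e-12)                  # avoid log(0) on masked entries
    entropy = -(probs * probs.log()).sum(dim=-1)   # entropy per query position
    per_head = entropy.mean(dim=-1).squeeze(0)     # average over positions
    print(layer_idx, per_head)                     # one value per head
```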

So these are pictures of the attention matrix entropies, organized so that these are attention heads, and this is layer one of the network, layer two of the network, and so on. The diagram I had before only showed three layers, but GPT-2 small has 12 layers. Ah, I've shown you the wrong one and given away a little bit of the future; I was supposed to show you one with 12 layers here instead of 11. Ignore the man behind the curtain; I'll get to that in a moment. What's interesting to note here, though, is that the entropies are much higher at the lower layers of the network. What that tells us is that the network is doing a lot more restructuring, looking at the relationships between words, in these first four layers for this grammatical task than in the upper layers. Interesting.

So maybe grammatical comprehension lives at the lower layers of the network. To test that, I took my grammatical classifier and ran it on top of each layer of GPT-2, and looked at how hard it was to train and how good a score it could get, basically how low the loss was. I've got a video here of what that looked like.
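Here is a hedged sketch of that per-layer setup: with the Hugging Face implementation you can ask GPT-2 for every intermediate hidden state and fit a separate linear probe on each one. The dataset, tag set size, and training details below are placeholders, not the project's actual ones.

```python
# Evaluate one linear grammar probe per layer of GPT-2 small and compare losses.
# Illustrative sketch; labels, tag set, and training loop are assumptions.
import torch
import torch.nn as nn
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2 = GPT2Model.from_pretrained("gpt2").eval()

def hidden_states_for(sentence):
    ids = torch.tensor([tokenizer.encode(sentence)])
    with torch.no_grad():
        out = gpt2(ids, output_hidden_states=True)
    # out.hidden_states[0] is the embedding output ("layer 0" in the talk);
    # out.hidden_states[i] is the output of decoder block i, for i = 1..12.
    return out.hidden_states

num_tags = 45                                   # hypothetical tag-set size
probes = [nn.Linear(gpt2.config.n_embd, num_tags) for _ in range(13)]

states = hidden_states_for("My talk is about the future of education.")
tags = torch.zeros(states[0].shape[1], dtype=torch.long)   # placeholder labels

loss_fn = nn.CrossEntropyLoss()
for layer, (probe, h) in enumerate(zip(probes, states)):
    logits = probe(h.squeeze(0))                 # (positions, num_tags)
    print(layer, loss_fn(logits, tags).item())   # lower loss = grammatical info
                                                 # more accessible at this layer
```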

So you can see here: layer zero means I ran it right on top of the embedding, before any of the layers of GPT-2 ran, and trained it for up to 200 epochs. I actually trained it longer, but I cut the graph off at 200; it kept going for another 250 or so, and it did not learn a ton. This particular one was for syntactic dependencies. You can see that at layer one it did a bit better, at layer two it did better still, at layer four it did pretty great, and at layer five it did excellently. So this shows how well this grammar classifier trained on top of each of these layers of the network. This is really interesting: it did a much better job at layers five and six, and it actually got its best score on layer five; it did much better there than at the layers before and the layers after. That means this grammatical information came into view, through these attention heads manipulating it in the first four layers, and then it started to go back out of view. So this led me to a question: is it because the later half of the network is trying to generate future words? That's what it was trained to do, so maybe it's more focused on the future than it is on the past.

Actually, yeah: so I trained it for syntactic tagging of what the expected output token should be, instead of just the input tokens, and you can see that it peaked up here at layer eight. If we just look: this is incoming, and that's outgoing; incoming and outgoing. So this grammar classifier is basically a tool to measure where information lives in the network, and how much information is easily accessible for this grammatical task at different layers. You can see that the information for understanding the grammar of the incoming sentence, the incoming tokens, is much more accessible at the lower layers, and for the outgoing tokens it's much more accessible at the higher layers. Cool.

So what we're actually seeing here (and sorry, I've got my slides out of order, so I've given away another thing I was going to say) is that these heads are rotating this information into view of these positions, in a kind of abstract informational space. Here's an example of what I mean by that. I laid a bunch of markers on a table. Looking at them from this angle, you can't tell how many markers there are, because you're looking from the wrong angle. If I rotate them slightly, you can tell there's more than one, but not really how many or what colors they are. If I rotate them a bit further, you can tell there are a few, but it's not clear how many greens there are. And if I rotate them yet further, you can see exactly how many markers there are and exactly what colors they are. That's what I mean by rotating information. This is kind of an abstract version of the same thing: the grammatical information is being rotated, and not just rotated but stretched and compressed and warped and so on, so that it comes into view of these positions that are flowing through the network.

I also did the same thing for simple part of speech and detailed part of speech, and you can see those both coalesce at layer three, which makes sense; those are simpler to figure out. So once I had this, I took my grammar classifier, chopped off the top half of GPT-2, and just ran it on top of layer five. In here, I decided to look at how important each attention head in the remaining network was for this classification, and I tried a couple of strategies.

For the first strategy, I followed a paper called "Are Sixteen Heads Really Better than One?" (I'm not even going to bother trying to make this part interpretable to non-technical people.) I fed in a mask tensor, a ones tensor, multiplied it by the output of each attention head, and then used backpropagation to find the Jacobian of the grammatical classification loss with respect to the coefficient of each head. That gives at least a locally linear interpretation of how important each head is for grammatical classification. But it turned out that strategy didn't actually work that well; it had worked pretty well in the paper for BERT, but it didn't work that well for GPT-2.
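Here is a hedged sketch of that gradient-based head-importance idea, leaning on the `head_mask` argument the Hugging Face GPT-2 implementation exposes; the probe, labels, and single-sentence loss are stand-ins, and the paper's procedure accumulates these gradients over a whole dataset.

```python
# Gradient-based head importance, in the spirit of
# "Are Sixteen Heads Really Better than One?". Illustrative sketch only:
# the probe, labels, and loss are placeholders.
import torch
import torch.nn as nn
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2 = GPT2Model.from_pretrained("gpt2").eval()

num_tags = 45
probe = nn.Linear(gpt2.config.n_embd, num_tags)     # hypothetical grammar probe

ids = torch.tensor([tokenizer.encode("My talk is about the future.")])
tags = torch.zeros(ids.shape[1], dtype=torch.long)  # placeholder labels

# One coefficient per (layer, head); the gradient of the loss with respect to
# these coefficients approximates how much each head matters.
head_mask = torch.ones(gpt2.config.n_layer, gpt2.config.n_head, requires_grad=True)

hidden = gpt2(ids, head_mask=head_mask).last_hidden_state
loss = nn.functional.cross_entropy(probe(hidden.squeeze(0)), tags)
loss.backward()

importance = head_mask.grad.abs()    # shape (12, 12): layers x heads
print(importance)
```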

So instead I tried a slower, more computationally intensive strategy, where I just chopped out each head individually and looked at its impact on the grammatical classification. If removing a head had a big impact, then that attention head mattered, and that was a place where grammar was being learned. Using that, I was able to pull out a lot of the heads. For this particular grammatical structure, the very best loss I could get came from cutting out almost every head in the network; the black here is where I removed a head, and the white is the heads remaining. This grammatical structure needed a bit more, a few more heads. And this one needed almost no heads; in fact, it didn't need any heads at all in some of these layers, which is kind of amazing.
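A minimal sketch of that ablation loop, again using the `head_mask` argument to zero out one head at a time; `probe_loss` here stands in for evaluating the grammar classifier on a held-out dataset, and the probe and labels are placeholders.

```python
# Ablate one attention head at a time and measure the change in probe loss.
# Illustrative sketch; the probe, labels, and evaluation data are placeholders.
import torch
import torch.nn as nn
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2 = GPT2Model.from_pretrained("gpt2").eval()

num_tags = 45
probe = nn.Linear(gpt2.config.n_embd, num_tags)     # hypothetical grammar probe

ids = torch.tensor([tokenizer.encode("My talk is about the future.")])
tags = torch.zeros(ids.shape[1], dtype=torch.long)  # placeholder labels

def probe_loss(head_mask):
    with torch.no_grad():
        hidden = gpt2(ids, head_mask=head_mask).last_hidden_state
        return nn.functional.cross_entropy(probe(hidden.squeeze(0)), tags).item()

baseline = probe_loss(torch.ones(gpt2.config.n_layer, gpt2.config.n_head))

impact = torch.zeros(gpt2.config.n_layer, gpt2.config.n_head)
for layer in range(gpt2.config.n_layer):
    for head in range(gpt2.config.n_head):
        mask = torch.ones(gpt2.config.n_layer, gpt2.config.n_head)
        mask[layer, head] = 0.0                  # chop out this one head
        impact[layer, head] = probe_loss(mask) - baseline

print(impact)    # large positive values = heads that matter for grammar
```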

So anyway, in the future I would like to take these maps of which heads matter for different grammatical structures and dig into them, and figure out what's going on in these individual heads, now that I've reduced GPT-2 to a much smaller collection of subnetworks that are practical to analyze. I'd also like to compare and contrast how these maps relate between structures. For example, here you can see that these three heads are not needed for this structure, or that structure, or this structure, so there are relationships in here, and I think we can find subnetworks of GPT-2 that relate to different grammatical structures. Hopefully that will one day, down the road, get us to the point where we can better tear open these language models and have a much deeper understanding of what's going on in them. Okay, hopefully I'm under my time. Anyway, time for Q&A. I know we're all running a little bit long, so I don't know if there's time for Q&A, but we'll see. Anybody got questions?

I'm looking over here because I have a separate monitor with the Q&A. Oh, here we go: "From papers like Image GPT, we know that transformers have great representations in the middle of the network. How far is the grammar loss predictive of useful representations for other tasks, and not just grammar detection?" That's a great question. I haven't read the Image GPT paper; like I said, I've been in the field of deep learning for about five months, during a pandemic and a revolution, and I also had a bunch of medical problems, so I don't actually know the results of that paper, but it sounds cool and I would love to read it. I think it's a good question: how far is the grammar loss predictive of useful representations for other tasks, and not just grammar detection? I think it probably generalizes pretty well. You're going to need some way of classifying whatever it is you're looking for. In this particular case, I had a good, easy way to generate a large dataset that I could tag with grammatical structures, so I had a concrete understanding of, and a concrete mechanism for, measuring the presence of that information. For situations where you can plausibly produce a dataset and train a classifier that actually measures the kind of information you're looking for, this approach is pretty generalizable. For more abstract types of questions it's going to be a lot harder. It's all about math, and if you can't find a good way to measure something numerically, it's going to be hard to do. Some things you can just brute-force visualize, but I don't have the compute power to do that yet; hopefully I will in the not-too-distant future.

Okay: "Do you think the number of heads that are needed is correlated with the complexity of the sentence structure, or did you notice any specific repeated patterns?" You know, I was actually really surprised that some sentence structures needed so few heads, and it makes me want to dig into how much information lives in the linear sublayers of the transformer blocks, because clearly they're doing something; like you saw before, some of these layers didn't need any heads at all, which is kind of shocking. I do think there's clearly a correlation between the complexity of the network that's needed and the complexity of the sentence structure that's coming in. I don't know that it's a perfect correlation, and I haven't gone and done a calculation. For instance, I would like to do some analysis with a way of measuring the complexity of a sentence and compare that directly to the number of heads, to give a mathematical answer to this question. I haven't done that yet, but just visually it does look like there's some correlation there, and it does look like sentences that have similar structures to one another have similarities in the heads that are important, which is a validation that this strategy makes some sense. Yeah. Okay, any other questions?

All right, I think that might be it for questions.