Transcript
Hi, my name is Jota Jota, and I'm here to talk about quantifying the interpretability of models trained on CoinRun.

For anyone who might be new to interpretability, I just want to clarify that phrase. As Alethea Power said yesterday in their presentation, interpretability is essentially the field of mind reading for neural networks. With humans, when someone makes a decision, such as looking at an image and deciding what it shows, and says "this image is of a dog," you can ask them why they think it's a dog. With neural networks, you can't ask the network why it classified an image a certain way. Instead, we use interpretability as a means of breaking down neural networks and understanding why they make the choices they do.

I also want to introduce myself before I get into the details: I'm a software engineer, I joined the Scholars program back in February under my mentor Jacob Hilton, and I learned reinforcement learning for two months, after which I worked on this project, which is what I'm here to present today.
so it’s good start involved alright so
the whole goal of my project is testing
the diversity hypothesis now the
diversity hypothesis that’s proposing
the original paper which is currently
unpublished for written by Jacob Felton
and Chris Ola to engineers and
researchers here at open AI and by the
way if you have access to this to their
draft I highly recommend checking it out
be a prett the diversity I promise this
is as follows interpretable features
tend to arise at a given level of
abstraction if and only if the training
distribution is diverse enough at that
level of up stretch so let’s clarify a
few terms here one that diversity in
this context is in the in our use gonna
be just the amount of distinct input
that our neural network gets to train on
so for example with an image classifier
if you give a neural network 100
examples of images
train
and understand how to classify that
would be less diverse than a network
that’s been given a hundred thousand
images to train on and understand them
and classify and when we speak of levels
When we speak of levels of abstraction, we're mostly speaking about the different layers of our neural network. Neural networks are divided into layers: early layers tend to catch simple patterns like lines and shapes, and later layers tend to catch more complex, abstract patterns like dogs or cats; this is again in the context of image classification. So essentially, as a model trains on more diverse data, we expect it to be easier for us to understand why the model makes the choices it does.

Do we have any evidence of this? Yes, because the unpublished paper ran an experiment specifically in the context of CoinRun, which I'll go into in more detail later; for now, picture it as a game similar to Mario Brothers, where an agent controls a little character that moves through a level. As you can see on this slide, the lower bar shows models with a specific architecture trained on differing numbers of levels. The left side represents models trained on very little data, about 100 distinct training levels, whereas the right side shows the same model trained on much more training data. We can see that, as expected, a model performs well when tested against data it hasn't seen if it has been trained on a large amount of data; as you train on more data, you expect the model to perform better at test time.

Just as the performance improves, so does the interpretability of the model's features. Models trained on very little data, in this CoinRun instance about 100 levels, have only about 1 in every 5 features that is even understandable to humans; that is, if a human breaks down the network, they can only make sense of roughly one in every five features, or individual components of the network, and what they're doing. As the model is trained on more and more training levels, the fraction of features that are interpretable by humans goes up to about four out of five.
This of course indicates that the diversity hypothesis is valid, and my goal here is essentially to test the same thing. The only difference is that the researchers who generated this graph did so with a human in the loop. That process takes a while: roughly an hour and a half per researcher to go through what these models are doing, match the features with their results, and try to understand what they're looking at. It's a long process. I'd like to make a definition of interpretability that is algorithmic, one a computer can use, so that we don't need a human in the loop to test this idea of interpretability. That would help us scale up the experiment.

So let's first break down the tools used in the previous experiment, in particular CoinRun and attribution. CoinRun, the domain this training is done on, is the Mario-style platformer I mentioned before, and I'm going to show an example of it. Here we see the player jumping across platforms, avoiding a little enemy (like that buzzsaw right there), and grabbing the gold coin at the end. The player jumps through and avoids enemies; if it lands on an enemy, it fails the level and gets no reward, and if it reaches the gold coin, it successfully completes the level. I want to emphasize that CoinRun has different assets, and different textures for those assets, so as we go through levels there is diversity not just in the placement of platforms but also in what the platforms look like, what the background looks like, and what the player looks like. So there's a lot of potential diversity in how the assets are rendered and in how a network can come to understand the domain.
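To make "trained on 100 distinct levels" versus "100,000 distinct levels" concrete, here is a minimal sketch using the open-source Procgen build of CoinRun, which exposes a num_levels parameter. This is an assumption for illustration, not necessarily the exact environment setup used in the original experiments.

```python
import gym

# A minimal sketch, assuming the Procgen release of CoinRun; the original
# experiments may have used a different build of the environment.
# num_levels caps how many distinct procedurally generated levels the agent
# ever sees, which is the notion of "diversity" used in this talk.
low_diversity_env = gym.make("procgen:procgen-coinrun-v0",
                             num_levels=100, start_level=0)
high_diversity_env = gym.make("procgen:procgen-coinrun-v0",
                              num_levels=100_000, start_level=0)

obs = low_diversity_env.reset()
obs, reward, done, info = low_diversity_env.step(
    low_diversity_env.action_space.sample())
```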
Now, attribution. How do we know what a network is looking at, or what it's paying attention to? This is what attribution lets us see. From a technical point of view, attribution is when we take the output of a network and compute its derivative with respect to the input. We can abstract this away as: it lets us see what a network is paying the most attention to when it's classifying, or taking some action on, an input.

Let's imagine what that looks like in the domain of classifying images. In this first image we have a picture of a bird. If we run this image through an image classifier, we would expect the classifier to output "bird" as the classification. In the second frame, we run attribution on that network with respect to this image, and it gives us an output with exactly the same dimensions as the image, where the brighter pixels, the ones with higher values, are the ones the network is paying the most attention to. In this case we can see that the network is paying attention to the eye of the bird, the beak, the feathers along the bird's head, and even the plumage closer to the base. Interestingly, it doesn't pay attention to the grass at all. If you map the most salient pixels back onto the original image, you get this kind of result: an image of just the bird. Essentially, the grass is not weighing in on the network's decision to classify this image as a bird. So attribution allows us to see what the network is attending to.
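To make "take the derivative of the output with respect to the input" concrete, here is a minimal gradient-saliency sketch in PyTorch. The pretrained classifier and the random stand-in input are my own assumptions for illustration, not the network from the bird example.

```python
import torch
from torchvision import models

# Minimal gradient-based attribution (saliency map) sketch.
model = models.resnet18(weights="IMAGENET1K_V1").eval()

# Stand-in input; in the example above this would be the photo of the bird.
image = torch.rand(1, 3, 224, 224, requires_grad=True)

logits = model(image)                    # forward pass
top_class = logits.argmax(dim=1).item()  # predicted class, e.g. "bird"
logits[0, top_class].backward()          # d(class score) / d(input pixels)

# One value per pixel, same spatial dimensions as the input: large magnitudes
# mark the pixels the prediction is most sensitive to, i.e. the bright
# regions shown in the attribution frame.
saliency = image.grad.abs().max(dim=1).values.squeeze(0)
```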
Can we use this on CoinRun? We can, and I'm going to show an example. We do a very similar process, except that when a model is trained and running on CoinRun, it specifically outputs two things: the controls of the player, which is how it plays, and something called a value function, which is just the model's estimate of its own performance; it will be high if the model thinks it's doing well and low if not.
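For clarity, here is a rough sketch of a network with those two outputs: the action logits (the controls) and the scalar value estimate. The architecture and the 15-action count are illustrative assumptions, not the actual CoinRun model from the experiments.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Illustrative actor-critic head; attribution is run against both outputs."""

    def __init__(self, num_actions: int = 15):  # 15 is assumed, not from the talk
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.policy_head = nn.LazyLinear(num_actions)  # controls for the player
        self.value_head = nn.LazyLinear(1)             # estimate of how well it is doing

    def forward(self, obs: torch.Tensor):
        features = self.encoder(obs)
        return self.policy_head(features), self.value_head(features)

# Example: a batch of one 64x64 RGB observation.
logits, value = ActorCritic()(torch.rand(1, 3, 64, 64))
```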
With these attribution results, let's see what the model is looking at during a specific CoinRun playthrough. As the player jumps through the game, you can actually see that the model is paying attention to the buzzsaws: it really is seeing "there's an enemy, I have to avoid it." It's easy for a human to see exactly what the model is looking at.
this context we’re not running
attribution on the entire network but on
a section this allows us to see only
what the model thinks of in terms of
abstract ideas and more like developed
assets and where they are in the image
and it also allows us to have multiple
results which is why you see these
multiple colors that there’s multiple
features that a network can use to
interpret the image now let’s see what a
port interpreter will example would be
essentially a player runs through and as
jumps you can see there’s these little
purple kind of shapes in the background
that’s an feature that is picking up
random artifacts in the background it
has no idea what is like irrelevant to
itself and what’s not whereas the other
model we can tell easily here human
would it be able to tell you what’s this
model paying attention to we have no
idea it’s just like these random shapes
in the background there’s some aspects
of it that maybe you can guess what the
models looking at but old smelly I would
classify this feature as uninterpreted
So, the process I just described is a human-in-the-loop process. Can we turn it into something a computer can do for us instead? We can, or at least we can try, and this is my little attempt to add something new here. I'm going to define interpretability, for this CoinRun context, as a quantity equal to the area of intersection between the attribution and the objects of interest, divided by the total area of the attribution. That's a bit of a mouthful, so let's break down what it looks like.

As the character jumps through the game, take this one example frame. We can separate out what I call the objects of interest, which are really just the objects in the frame, and put them in an image where everything has a value of one except for the background. That's what this looks like: everything is white except the background, which is black. With this mask of assets, we can overlay the attribution results. Say we have these two little spots of attribution, roughly ten pixels each, and ten of those pixels sit right on top of a wall, so they intersect an object of interest, while the other ten land on the background. With our quantity for interpretability, the numerator, the intersection, is 10 pixels, and it is divided by the total area of attribution, which is 10 plus 10, or 20 pixels. Ten divided by twenty gives a 50% interpretability score for this specific frame.

If we run this process not just for one frame but for 512, and not just for one feature but for all the features of the neural network, and take the average of all those results, we get what I claim to be an interpretability score for the model.
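Written out as code, a minimal sketch of that score might look like the following; the function names, the threshold, and the array shapes are placeholders of my own, not the exact pipeline.

```python
import numpy as np

def frame_feature_score(attribution: np.ndarray, object_mask: np.ndarray,
                        threshold: float = 0.0) -> float:
    """Fraction of one feature's attribution (on one frame) that overlaps assets.

    attribution: per-pixel attribution values for a single feature.
    object_mask: 1 where an object of interest is (walls, enemies, coin, ...),
                 0 for the background.
    """
    attributed = np.abs(attribution) > threshold           # pixels attended to
    intersection = np.logical_and(attributed, object_mask > 0).sum()
    total = attributed.sum()
    return float(intersection) / float(total) if total > 0 else 0.0

def model_interpretability(attributions: np.ndarray, masks: np.ndarray) -> float:
    """Average the score over all frames (e.g. 512) and all features.

    attributions: shape (num_frames, num_features, H, W)
    masks:        shape (num_frames, H, W)
    """
    num_frames, num_features = attributions.shape[:2]
    scores = [frame_feature_score(attributions[f, k], masks[f])
              for f in range(num_frames) for k in range(num_features)]
    return float(np.mean(scores))
```

On the worked example above, 10 intersecting pixels out of 20 attributed pixels gives 10 / 20 = 0.5, the 50% score for that frame.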
Now, what does that look like? Well, it looks about the same for all the models: from models trained on 100 levels, which in this context is very little data, to models trained on 100,000 levels, which is a lot of data and which we would expect to have more interpretable features if the hypothesis is valid. What does this show? Because I have myself gone through the human-in-the-loop experiment, and I trust that those human-in-the-loop results are valid and do reflect true interpretability, this means my definition of interpretability currently isn't working out: the models all score roughly 35 to 40 percent. Why is that? Largely because a lot of my attribution results are not as small as in the example I showed you. Often the attribution results are much larger, on the order of 20 by 20 pixels, and they take up a lot of the screen; and especially when the assets themselves take up the majority of the screen, the attribution ends up covering about the same fraction of the mask every time. It's just not a good method as it stands.
I can narrow down the cause: this is partially due to a step called using the receptive field, where you essentially translate the results of attribution back to the input domain. That translation currently isn't working well, in particular in how it weighs the more strongly connected parts of the network against the less connected parts. So some refinement and fairly obvious improvements can be made, but currently the measurement does not work.
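For context, the receptive-field step is the part that maps attribution computed at a hidden layer back onto input pixels. A naive version of that mapping, which also hints at why the resulting blobs end up so large, simply spreads each hidden-layer cell's value over the input patch it can see; the field size and stride below are made-up numbers, and the real method additionally needs to weight positions by how strongly they are connected.

```python
import numpy as np

def upsample_attribution(layer_attr: np.ndarray, input_hw=(64, 64),
                         field=20, stride=8) -> np.ndarray:
    """Spread hidden-layer attribution over each cell's (assumed) receptive field."""
    out = np.zeros(input_hw, dtype=float)
    height, width = layer_attr.shape
    for i in range(height):
        for j in range(width):
            top, left = i * stride, j * stride
            # Each cell "sees" roughly a field x field patch of the input,
            # so even a single active cell lights up a large region.
            out[top:top + field, left:left + field] += layer_attr[i, j]
    return out
```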
So what are my conclusions? I still think interpretability is something that can be calculated with an algorithmic process, without a human in the loop; I think it can be done by a computer, but we have to refine the current definition I have, and this definition lives only in the CoinRun domain and would have to be extended to other domains to be useful. Also, my experiment doesn't actually prove or disprove the diversity hypothesis, but I still think it's important to experiment further and see whether the hypothesis is valid, because the hypothesis is very powerful: if interpretable features really do increase with the diversity a model is trained on and with its generalization, that's a new axis along which we can improve models, and that's very exciting. So those are my conclusions.
I don’t like technology very quickly
there’s a little surance I’m just my
mentor called caboose fantastic during
this program and honestly this work
wouldn’t have been possible or I
wouldn’t been able to do this work
without him
chicken healthy and crystal all the
original writers of the paper who did a
fantastic job and I highly recommend
looking at it if you have access this is
how you published would I yank a scene I
who ran the Scholars Program and have
done a great job doing it were very
supportive mario and francis for making
this presentation possible Alethea andre
Kathy come on look at that Pamela the
other scholars and there were fantastic
Greg and Brockman smallman for since
you’re making everything here possible
and virtual Yang my fiancee who was also
a great support during this work thank
you very much
I will check now if there are any questions.

"How can you tell if a distribution is diverse? Is there a metric to quantify the diversity property of a distribution or a model?" Excellent question. In this case we're thinking of diversity in the context of the input distribution, and I'm sure there are better ways to do this in other domains, but for CoinRun I'm defining diversity simply as the number of distinct levels that the agent is trained on. So this definition of diversity isn't trying to distinguish how diverse two different levels are from each other; it just goes with the easy first solution: the number of distinct levels the model is trained on is its level of diversity. Thank you for that question, by the way.
"Since you are using a dynamic image, a video, how is the attribution model able to perform consistent identification and tracking of the objects it is attending to?" Let me digest that for a second; it's a great question. Ultimately, the model operates on a frame-by-frame basis; there's no persistence along the lines of "this is what I used for attribution last time, so this is what I'll use this time." So there is a lot of flickering, and it's not that consistent. But I think with the examples I showed, there was one that was very interpretable, and that shows just how well the model is actually generalizing, and how human-interpretable it is, in that it consistently recognizes features from one frame to the next. Even in a dynamic image or video it is still able to make consistent sense to humans without having to go frame by frame. For the less interpretable features, though, it doesn't look consistent at all; if you remember the less interpretable model, I had to pause multiple times just for you to see the purple haze and what it was picking up. So I hope that answers your question; feel free to ask another.
"How is attribution different from saliency?" Excellent question: attribution is saliency maps. I don't know why the term attribution caught on in some specific subfields relative to others, but I want to show something real quick and give credit. Here we go: Simonyan and collaborators wrote the paper I'm referencing here, "Deep Inside Convolutional Networks," which coined the term saliency maps; I believe they did this before the term attribution caught on. I don't know why the usage differs across the field, and if anyone knows why, or whether there is actually a difference in the concepts, I would love to hear it, but essentially they're the same thing.
"How do you think the interpretability of the model, as you define it, would scale to other games besides CoinRun, like bouncy ball?" Excellent question. My stretch goal when I started was actually to try this experiment, with the interpretability definition I laid out for CoinRun, on another game. I have to admit that in the heat of the moment of the presentation I don't remember which game bouncy ball is, but there are a few things to keep in mind. One is that CoinRun is a game in which the assets of interest take up about 50% or less of the screen at a time. If you're playing a game like checkers, the assets of interest take up essentially the entire screen, so the definition doesn't transfer very well to that kind of game. So for bouncy ball, if it's a game where assets are relatively sparse, or at least cover less than 50% of the screen, I think it could work; there are a few other nuances about which games the interpretability definition would work for. I hope that answers the question; also, make sure to check out bouncy ball later.
Another one: "Why does good attribution (saliency) imply interpretability? Isn't this mostly an accidental, though useful, correlation and not a causal implication?" Let me digest this one real quick. Is the purpose of testing the diversity hypothesis to test whether this is true, whether good attribution/saliency does imply interpretability? No, not exactly. I would say that "good attribution/saliency" isn't really a phrase by itself; what makes it good here is how we're using it. Attribution is just the tool: we're saying that if the attribution aligns with what we consider to be an object of importance in the game, then it is good. So it comes back to this idea of what it means to be human-interpretable. For the context of this experiment, it's saying that we humans pay attention to things: when you look with your eyes, you tend to focus on specific objects to narrow down your domain in order to digest, or interpret, an image. In the same way, we're expecting that there have to be certain parts of the image that the network is most sensitive to in order for us to understand that it is using those parts of the image, that it is interpretable. If you're not satisfied with this answer, reach out to me, I'd love to talk about this more; I also think Jacob Hilton would have a much better response than me, because he and Chris Olah wrote a lot of the original work in the paper. Thank you for the question. Is there another one? I don't think so, and it looks like the screen is being taken over, so thank you very much.