hello everybody, we're going to go ahead and get started.
Fantastic. Welcome to OpenAI Scholars Demo Day, and thank you all for being here this evening. We have eight presentations tonight as a result of our Scholars program. We had 550 people apply and eight scholars, so that makes us a bit more competitive than Harvard, but we won't dwell on that. The scholars have spent the past three months studying machine learning full-time. Two of those months were spent on a self-selected curriculum, going through the different skills they need to complete a final presentation, which they will be showing to you today. This is a project they've completed in one month, which I'm told is a very short amount of time to complete an experiment in machine learning when you are just a beginner. However, they've been assisted by our really awesome mentors, who are a combination of OpenAI and external folks, and we're really excited to have everyone here. So I'm going to go ahead and pass it off to Ilya, who's going to say a few words before we get started.
Thanks. Hello everyone, and welcome to the Scholars Demo Day. It's really exciting to see the projects that the scholars were able to accomplish in just one month. One thing about machine learning is that it's not the easiest field to enter without mentorship, and working closely with a good mentor can really narrow down the very big surface area of machine learning into manageable pieces, which makes progress far more rapid. I can say for myself that I definitely wouldn't be where I am without the mentorship I received. So with this, I want to express real gratitude to all the mentors who helped bring the scholars to where they are right now. And finally, doing a project in just one month is not an easy feat at all, so congratulations to the scholars who've done this, and let's see the projects.
[Applause]
Thank you for coming to our demo day. Today I'm going to talk about exploring gamma: the discount of the future, or the weight of the past. Here's a bit of background for those who are unfamiliar with reinforcement learning: it's the framework where an agent takes actions in an environment to maximize cumulative reward. The measure we care about is the expected sum of total rewards, but the measure we actually optimize is the expected sum of total discounted rewards. The difference here is the discount: not only does it prevent the sum from exploding to infinity, but it also injects some preferences, which I'm going to talk about.
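To make the two quantities concrete, here is a minimal sketch in Python (the reward list is a made-up example, not from the talk):

    def undiscounted_return(rewards):
        # The measure we care about: the plain sum of rewards.
        return sum(rewards)

    def discounted_return(rewards, gamma=0.99):
        # The measure we actually optimize: gamma < 1 keeps the
        # infinite-horizon sum bounded and prefers earlier rewards.
        return sum(gamma ** t * r for t, r in enumerate(rewards))

    rewards = [0.0, 0.0, 1.0, 0.0, 5.0]    # a made-up episode
    print(undiscounted_return(rewards))     # 6.0
    print(discounted_return(rewards, 0.5))  # 0.25*1 + 0.0625*5 = 0.5625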
So what is the discount factor? From economics, the discount factor gamma specifies intertemporal preferences. What I mean by preferences: imagine a thought experiment where I hand you either a full apple tomorrow or some gamma fraction of an apple today, and I tweak gamma until we find the single gamma at which you are indifferent between the two; that gamma reveals your preference. With this notion of preference in mind, let's look at a toy example in a simple grid world. Here we have two agents. One has a low discount factor, so it prefers immediate rewards, such as the coins on the left, which have lower value; the agent below has a high discount factor, meaning it prefers the long-term future reward, which is the diamond. So we can see that different preferences result in distinctive behaviors the agent can express. As a beginner to RL algorithms, we may just ask which discount factor gamma to use. Some will tell you 0.99, others will say something close to one (but not one, for boundedness), and some will say just try a set of gammas and pick the best. None of these answers seemed entirely satisfactory.
So I found the Blackwell optimality principle, which says that in every environment there exists an optimal policy that is simultaneously optimal for all gammas above some threshold. Looking at this principle, we may find it quite intuitive: in a real-world environment we always want to maximize cumulative reward, which implies gamma equal to 1, and although in practice we cannot use gamma equal to 1, a gamma of 0.99 results in a very similar intertemporal preference. With a gamma of 0.99 in the grid-world diamond case, the agent will still pick the diamond over the coin despite the slightly lower discount factor. So the question I'm interested in is: do deep RL algorithms always find the Blackwell optimal policy for gammas above the threshold? Well, let's go back to the toy example, this time using low exploration instead of the high exploration in the previous case. We see the behavior is the same for the low-discount agent; however, the high-discount agent cannot obtain any reward, so it's not necessarily the case that a deep RL algorithm can find this kind of optimal policy. In this work we demonstrate this issue and propose methods to repair it. To do that, I set up two grid-world environments, one with sparse reward and the other with dense reward. The agent, in red, is trying to collect all the yellow coins, which give it positive reward; it wants to avoid poison, which gives it negative reward; and lastly, it doesn't want to get trapped, otherwise it will die.
The algorithm I use is DQN from OpenAI Baselines, and the experimental setup is that I only vary the discount factor, keeping everything else constant. Here are the initial results on the sparse environment. I picked a set of gammas: 0.1, 0.2, 0.5, 0.8, and 0.99. As you can see on the right curve, the highest gamma, 0.99, yields the best performance, and it seems there might be a threshold between gamma 0.5 and 0.8 such that above it, all high gammas yield optimal performance. This is quite consistent with the Blackwell optimality principle. Let's look at how this works in the dense environment. Here we see that the highest gamma, 0.99, in pink, is actually not doing well; it's not even doing as well as gamma 0.5 in this case, so we see some inconsistency with the Blackwell optimality principle. Let's dig further. After thinking about it, my hypothesis for this behavior is that the discount factor may play a dual role in the DQN update: not only does it explicitly specify the intertemporal preferences, which is discounting the future, but implicitly it also encodes some confidence in bootstrapping from the function approximator, which is weighing the past.
So I propose a time-variant discount, gamma of t. We specify a myopic fraction, that is, some fraction of the total time steps, and during that period we vary gamma from 0.1 up to the final gamma we specified, on a linear schedule, so that earlier experiences are weighted less; gamma is kept fixed after the myopic fraction.
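A minimal sketch of this linear initial-myopia schedule in Python (the 0.1 starting value follows the talk's description; the function name and signature are just illustrative):

    def myopic_gamma(t, total_steps, myopic_fraction, final_gamma, start_gamma=0.1):
        # Linearly anneal gamma from start_gamma up to final_gamma over the
        # first myopic_fraction of training, then hold it fixed.
        warmup_steps = myopic_fraction * total_steps
        if t >= warmup_steps:
            return final_gamma
        return start_gamma + (final_gamma - start_gamma) * t / warmup_steps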
Using this simple scheme with the DQN algorithm on the dense environment, let's compare the experiments. The pink curve is still the fixed gamma in the dense environment, and the other colorful curves are the different myopic fractions. To see this more clearly, I grouped all the myopic-fraction setups into a single block in blue, and you can see that any myopic fraction, no matter how large, outperforms the baseline algorithm with fixed gamma. Gamma 0.99 wasn't doing well before, but with this initial myopia it can actually become optimal, so this is a really good result. Let's see if it also works with the best gamma, which is 0.8 in the dense environment. I do the same thing: the pink curve is the fixed gamma 0.8, which was the best gamma in the dense environment, and all the myopic fractions are in blue on the right side. We see that any myopic fraction eventually achieves the same level of performance as the original fixed gamma; it just takes longer. So what about the sparse environment? I tried the same thing: we have the fixed gammas, gamma 0.8 on the left and gamma 0.99 on the right, and you can see that the larger the myopic fraction, the longer it takes to reach optimal, but they reach optimal nonetheless. So fixed gamma yields the best performance in this sparse environment, but all myopic fractions converge to optimal eventually, so it doesn't really hurt in the long run.
Summing up all these initial-myopia results: we see that a high discount gamma can become optimal with initial myopia, and the benefit of this simple scheme is that we don't need to fine-tune gamma, because it improves learning in dense-reward environments and doesn't harm learning in sparse-reward environments. However, you may object: our hypothesis is that this initial myopia mitigates bias and hence gives better performance, but there's a competing hypothesis, namely that by introducing this initial myopia you get more exploration, which might be what leads to better performance. So I tried to see whether this benefit is a result of bias reduction or of exploration.
To do that, I set up three experiments where I fix the discount factor. For the baseline setup I have zero myopia and low exploration; for the myopia setup I have low myopia and low exploration; and for the exploration setup I have zero myopia and high exploration. For the results, let's look at the plot on the left: here gamma is fixed to 0.8 across the three setups. The baseline is in pink, and we see it has similar performance to the myopia setup, in orange; exploration actually did worse than both of them, because it takes longer. On the right we have gamma 0.99, which was the poor gamma in the dense environment, and we can see that with myopia it actually outperforms both the baseline and exploration; with exploration, although it eventually surpasses the baseline a bit, it still takes a long time. So we can roughly conclude that exploration helps, but not significantly, and it takes longer training time, whereas with myopia performance improves significantly and converges faster.
As part of future directions, we'll try to formalize this dual-role intuition and run more experiments on standard testbeds, and I also want to compare a myopia schedule on lambda versus gamma in PPO; I have some initial results and I'm happy to discuss them offline after the presentation. So, just to recap, here are some related works, and the final takeaways: the discount factor matters in deep reinforcement learning; it has a dual role, specifying intertemporal preferences and also encoding confidence in bootstrapping from the function approximator; and a simple myopic schedule is a robust and effective way to improve performance. The same logic may work beyond DQN and the discrete state-action framework. Thank you, I'm happy to take questions.
[Applause]
Yeah, so in the Baselines DQN they have a similar linear schedule for exploration, where, sorry, it starts high, becomes low, and stays low for the rest of training, yeah.
Hello, I was wondering, I really liked this idea of seeing gamma as something that encodes your confidence; do you think there's a way to take uncertainty into account more directly? It seems you're currently doing it over time, but is there a way to make it more explicit by, I don't know, getting some uncertainty estimate for your Q function? — Thank you, that's a really good question. I'm thinking of first trying it with the generalized advantage estimator, because it basically separates out lambda and gamma in the equation, and I feel lambda plays precisely this role of confidence in bootstrapping. So I guess varying lambda, perhaps in PPO, could help with this issue, yeah.
Any other questions?
[Applause]
Okay, so I'd like to begin by thanking everyone at OpenAI who organized this event (you can't hear me? how about now? okay), and I'd also like to thank all of you for attending. Previously I was a PhD student at the University of Chicago, where I studied cell biology, but these days what excites me most is the prospect of getting general-purpose robots deployed in the real world and then getting them to do useful things. One challenge associated with this is that robots will have to learn how to solve new tasks with little to no external feedback, so for the past month I've been working on a project that leverages a robot's internal motivation in order to overcome this lack of supervision. I'll start by placing intrinsic motivation in the context of reinforcement learning. In the reinforcement learning setting, you have an agent that interacts with an environment via its policy: the policy takes in observations and outputs actions. At every time step the agent takes an action and receives a reward from the environment, and by learning to maximize its total expected reward, the agent can find a good policy and eventually learn how to solve a given task. But finding a good policy can be very difficult when the rewards from the environment are sparse, so what we really need are reward functions that are intrinsic to the agent, instead of relying solely on sparse extrinsic rewards. Okay, so how do you get dense intrinsic rewards? There's actually been a lot of work in this area, and a lot of approaches center on some notion of novelty: by seeking novelty, the agent is driven to explore, and in doing so it learns new skills that might help it solve a given task. This approach has had a lot of success recently in solving challenging Atari games like Montezuma's Revenge.
However, these ideas haven't really been applied to robotics, so in this project I'll show you how a simple formulation of intrinsic rewards leads to nice solutions for challenging robotics problems. Before I discuss the method, I'll talk about the Fetch robotics environments, which were developed in-house at OpenAI by Matthias, sitting in the back. The observations include the position and velocity of the gripper, as well as the pose, linear velocity, and angular velocity of any object in the scene. The action space is continuous and four-dimensional: the first three dimensions correspond to changes in position along x, y, and z, and the fourth corresponds to opening and closing the gripper. Again, we care about the sparse-reward setting, so here the agent receives, at every time step, a reward of 0 if it solves the task and negative one otherwise. I just want to point out that for all of the environments I'm going to talk about, which include reaching, pushing, sliding, and pick-and-place, the agent has at most 50 time steps to solve the task. Okay, here's an outline of the method I use. In the diagram on the left, you can see it's very similar to the reinforcement learning setting from a few slides ago, but here, in addition to having a policy, the agent also has a dynamics model. The dynamics model takes as input the agent's current state and action and makes a prediction for the next state; I also want to note that we could instead set up the dynamics model to predict the change in state. And here's the intrinsic reward I'm using. You'll notice it's simply the prediction error of the dynamics model, so larger prediction errors lead to larger intrinsic rewards. The idea is that the agent shouldn't get stuck in regions it has already explored, and instead will be encouraged to explore elsewhere.
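A minimal sketch of this reward in Python/NumPy (the dynamics_model.predict call and the beta scaling coefficient are stand-ins I'm assuming, not the exact implementation from the talk):

    import numpy as np

    def intrinsic_reward(dynamics_model, state, action, next_state):
        # The bonus is the prediction error of the learned dynamics model:
        # poorly predicted (novel) transitions yield larger rewards.
        predicted = dynamics_model.predict(state, action)  # hypothetical API
        return float(np.sum((predicted - next_state) ** 2))

    def total_reward(extrinsic, intrinsic, beta=1.0):
        # Sparse extrinsic reward (0 or -1 here) plus the dense bonus;
        # beta is an assumed scaling coefficient.
        return extrinsic + beta * intrinsic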
Okay, so for training, I use simple fully connected networks for both the policy and the dynamics model. I train the policy using PPO, an actor-critic method that was developed here by John Schulman, and briefly, since PPO is on-policy, I use a number of actors in parallel to collect a large and diverse set of data that can be used to update the parameters of both the policy and the dynamics model. Okay, here's my first set of results. I'm applying this method first to the simplest of the tasks, reaching, and here I'm comparing a baseline PPO implementation with no intrinsic rewards against intrinsic rewards generated either by predicting the full next state or by predicting the change in state. For a simple task like reaching, it turns out you don't actually need the intrinsic rewards, but when you add them, you can see that the agent learns to solve the task much more quickly, and you can see in the bottom panel that as the agent learns, the intrinsic reward goes down. Okay, I next looked at the pushing task, which is quite a bit more complicated than reaching. Unlike reaching, the baseline policy is not able to solve this task, but when you add in the dense intrinsic rewards, the agent quickly learns how to solve it, and near the end of training it's solving nearly 100 percent of the episodes. When I approached the pushing task, I looked at a number of different hyperparameters, and I'll talk about three here briefly. I looked at the size of the individual layers within the network, and, as you might expect, larger networks tend to perform better than smaller ones. I also looked at the learning rate for the dynamics model, and this turned out to be very important: as you can see here, smaller learning rates tend to outperform larger ones. The last thing I looked at was whether resetting the environment early, after the agent has already solved the task, leads to better performance; that's what you can see here: doing an early reset once the agent has solved the task leads to a significant boost in performance. Okay, I next looked at the pick-and-place task, which is a step up in difficulty from pushing, but here the story is more or less the same: the baseline policy cannot solve the task, but adding either intrinsic reward leads to the agent learning how to solve it. For reasons I don't quite understand yet, predicting the full next state tends to outperform predicting the change in state, so that's something I want to look at a little more in the future. Okay, the last thing I looked at was the sliding task. It turns out that, by comparison, the sliding task is probably more difficult than the others; here I actually needed a larger network than was required for the other tasks, and I also needed twice as many environment interactions for the agent to learn how to solve it.
Okay, to summarize: consistent with previous work, my results demonstrate that adding intrinsic rewards can be useful for solving challenging tasks with sparse rewards. Although unnecessary for solving the reaching task, adding intrinsic rewards as an exploration bonus leads to improved performance there, and intrinsic rewards were actually necessary for solving the more difficult tasks. In the future, I plan to look at slightly more complicated tasks such as tool use and block stacking, and I also plan to look at different kinds of inputs, both for the policy and for the dynamics model; in particular, I'm really interested in combining different sensor modalities: images with depth maps with contact information. Okay, with that, I'd like to thank Maddie for organizing the Scholars Program; I'd like to show my appreciation to scholars past and present; I'd like to give a special thank-you to Matthias, Alex, and Lilian, who gave me really good feedback that I used to make modifications to the Fetch environments; I'd like to thank Yura, Harri, Igor, and Wojciech, with all of whom I've had really enlightening conversations, and give a special thank-you to Yura and Harri, whose previous work was the inspiration for this project; and last but not least, I'd like to thank my mentor, Feryal, who provided me with a lot of encouragement and support over the last couple of months. With that, I'd be happy to take any questions.
[Applause]
Have you tried LSTMs? — That's a good question. Probably when I start to incorporate different sensory modalities, I'll have to move to using LSTMs. — It may also help with the delta version of the inputs. — It could. I wanted to try it originally, but the simple method I used already had such good performance that I'd sort of put that off. Any other questions?
Was there any stochasticity in the environment? — Yes, every time the environment was reset, the target location would change, and if there was a block in the scene, the starting location of the block would also change. — I see, but was there any stochasticity in the transitions, from one point in time to the next, like in this kind of movement? — There was no stochasticity; the movements were completely deterministic. — I see.
How important was it to trade off the goal-directed reward versus the intrinsic reward? Did that require some tuning, or not so much? — I think most of that tuning came from tuning the learning rate of the dynamics model, because if that learning rate was too big, the agent would very quickly learn how the environment worked, and the intrinsic rewards it got would be really small, whereas if you delayed that, it would get larger rewards early, encouraging exploration. But if you're referring to the size of the intrinsic reward relative to the extrinsic reward of negative one: I did tune that quite a bit, and it turns out a value of one works really nicely, which offsets the reward of negative one from the environment. Larger intrinsic rewards don't work, and smaller intrinsic rewards don't work. Any other questions? Great, thank you.
[Applause]
Hi everyone, thank you for being here tonight. (Louder? Okay, is it any better? Okay, thank you.) So, welcome. Tonight I will be talking about my experiment with fine-tuning the GPT-2 small model for question answering. Before I talk about the details of my experiment, I'd like to mention the question I was interested in, which guided my decision to work on this problem. One of the biggest challenges we have in natural language understanding today is the ability to create systems that have common-sense reasoning, which is the ability of an intelligent system to come up with common-sense knowledge and reason about a given text. This is still a notoriously difficult task, and although we have very high-performing language models and systems today, we are still struggling to do better at it.
In this chart, you'll notice an increase in benchmarks and datasets in recent years that are targeted exclusively at common-sense reasoning; in 2018 alone, a number of datasets were specifically designed for common-sense reasoning, and you'll see that most of them target the task of question answering. Reasoning is necessary for performing better in most NLU tasks, because most of the time the solution will not be given in the linguistic context, and there will be a lot of ambiguities in the language that our intelligent systems need to be able to figure out. This is one of the reasons why I chose the task of question answering for my fine-tuning. First of all, QA is one of the most important natural language understanding tasks, and it allows us to measure how a system is doing in terms of common-sense reasoning. In addition, QA requires a mix of language-processing and reasoning skills within a single task, which makes it more practical than dealing with other, more complicated and less well-defined tasks. Beyond that, better reasoning achieved in QA systems could be applied to a variety of systems not limited to natural language processing or understanding, such as vision and robotics, examples of which we have started to see lately.
The approach I take in my project is to analyze the patterns that a fine-tuned GPT-2 small's performance on QA tasks can reveal about how a language model attains and performs reasoning. For this, I experimented with the small model and fine-tuned it on the Stanford Question Answering Dataset; for those of you who are not familiar with the dataset, I will give some information about it. Oh, sorry, I meant to go back.
These are the two model architectures I worked on. The first, on the left, is a linear classifier that I put on top of the GPT-2 small model, the first publicly released one, and the one on the right is a BiLSTM, a naive attempt to circumvent the unidirectionality of GPT-2. Most of my results will be based on the linear model, because after a lot of hyperparameter tuning, I noticed that the linear model actually performed better.
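As a rough illustration of the first architecture, here is a minimal sketch in PyTorch (the start/end span head is the standard extractive-QA recipe and a Hugging Face-style backbone API is assumed; these are my assumptions, not confirmed details of the talk):

    import torch.nn as nn

    class GPT2SpanQA(nn.Module):
        def __init__(self, gpt2, hidden_size=768):
            super().__init__()
            self.gpt2 = gpt2  # pretrained GPT-2 small backbone
            # One linear layer producing start/end logits per token,
            # as in standard extractive-QA heads.
            self.span_head = nn.Linear(hidden_size, 2)

        def forward(self, input_ids):
            hidden = self.gpt2(input_ids).last_hidden_state  # (batch, seq, hidden)
            logits = self.span_head(hidden)                  # (batch, seq, 2)
            start_logits, end_logits = logits.split(1, dim=-1)
            return start_logits.squeeze(-1), end_logits.squeeze(-1)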
For those of you not familiar with the SQuAD dataset: SQuAD contains over a hundred thousand question-and-answer pairs which have answers, and the 2.0 version also has over fifty thousand unanswerable questions with plausible answers added; whether your model comes up with those plausible answers is still very informative about how it is doing. In these two examples taken from the dataset, you'll see that the question with an answer is a very factoid-style, simple question, while the one on the other side has a plausible answer, but you cannot actually retrieve the answer from the passage given for that question. These are representative of some of the questions my model was able to answer correctly or abstain from answering because there was no answer in the dataset.
In its development set, SQuAD has over 11,000 questions, 5,000 of which are unanswerable, and because I thought focusing on the unanswerable questions, and the performance of the model on that kind of question, would be informative about how reasoning and common sense are working in the model, most of the numbers and reports I present today are based on those types of questions. Although there are other datasets specifically designed for common-sense reasoning, I used SQuAD because it gives me a good start for understanding how such a powerful language model (even though I'm using the smallest version) performs on common sense, and what mistakes, successes, and failures the model makes, and that will guide me in working on more complex, sophisticated datasets going forward.
The linear model I implemented worked relatively better on the unanswerable questions, where it was able to pick up that the question did not have an answer and abstain from answering. In the plausible-answer section, the model was often able to abstain, recognizing that the question does not have an answer, but when it got things wrong, it came up with plausible answers that were not extremely unreasonable. This next example is representative of a strategy the model learned from training: the model attends to the first few tokens in the question text when trying to answer a question. This repeated so many times that I concluded it is definitely paying particular attention to the first few tokens of the question text and then trying to pull the answer out of the passage. As you can see in this example, the model predicted that the answer contained the little phrase at the bottom, but the real, correct answer was the word 'concrete'.
If we go back to the original GPT-2 paper, some of the heuristics mentioned there, such as how, in the unsupervised zero-shot setting, the model was picking up on question words like who, what, where, and when, showed up as the same kind of pattern in the model I trained. Some of my observations include: the model performs comparatively better on questions that are unanswerable; partial matches mainly consist of initial tokens from the question; and when an answer is expressed in words different from the question (you can think of it as paraphrasing), the model often fails to recognize it: if the order of the words is different, or synonyms or antonyms are used to express the same idea, the model does not perform very well.
For future directions, I would like to experiment more with the bigger model, which has been very recently released, and work more on the common-sense reasoning datasets that are specifically designed for this task. In addition, I would like to work more on natural language understanding through unsupervised learning, because the main idea behind GPT-2 was also to eliminate the need to fine-tune models and create task-specific architectures, and through common-sense reasoning I feel there's a lot that can be done in that department, to see if we can actually achieve that. As a long-term goal, I would like to explore the interactions between natural language understanding and other deep learning research. Thank you.
[Applause]
I'm happy to answer questions. No questions? Okay, thank you.
[Applause]
No? Sorry. Hello, okay. So, good evening everyone. Today I'm going to talk about my project during the three-month Scholars Program, which is about sentiment analysis using reinforcement learning. Before I go into the details of my project, I want to first thank my mentor, Azalia, who guided me through the process over three months with great passion, and also thank OpenAI for providing me with such a good opportunity and rich resources to learn and to develop my project.
Okay, let me introduce more details about my project. First, I want to talk about the motivation. With the development of neural networks, NLP has become a very hot topic, because NLP builds computational algorithms that let the computer learn, analyze, and represent human language. There are a lot of NLP tasks, and among them, sentiment analysis has achieved very good performance, with a lot of well-tuned methods available online. For this project, we propose some novel models that combine reinforcement learning with supervised NLP methods to predict the sentiment of a sentence. As mentioned, there are already a lot of well-tuned and helpful supervised learning methods for sentiment analysis, so in considering this project, we wondered whether RL might self-learn and capture some information that current models miss. So let me talk about the details.
As mentioned before, we propose two novel models that combine reinforcement learning with supervised sentiment-analysis methods. The first model I call sentence restructuring: in simple words, we select the crucial words in a sentence that are useful for predicting the sentence's sentiment. You can see the model here; it consists of two parts, a policy network and a classification network. In the graph example, the top rectangle represents the policy network: for each word in a sentence, we sample, based on the policy, whether to keep or delete this word for predicting the sentence sentiment. After the policy network, we pass the selected sub-sentence to the classification network. For example, given a sentence with a lot of words like 'such', we expect we can remove those words when predicting the sentiment, and the prediction will not get worse. For the classification network, we use traditional supervised NLP methods: long short-term memory, the transformer, and pre-trained BERT. Some details on the right part: we can consider the sentiment-analysis task as a sequential decision process, where the current decision, whether to keep or remove this word, will affect the following decisions and the following predictions, and this sequential process can be naturally handled by the policy-gradient method. Here we use a delayed reward: we do not get the reward until we reach the very end of the sentence; once we have the prediction of the sentence sentiment, we get a reward of value 1 if we predict correctly and value negative 1 if we predict wrongly. For the action, as mentioned, for each word we decide whether to keep it in the sentence or delete it, and we use the resulting sub-sentence to predict the sentiment. So this is our first model.
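A minimal sketch of this first model in PyTorch (the network sizes and the Bernoulli keep/delete parameterization are my assumptions about the general recipe, not the exact implementation):

    import torch
    import torch.nn as nn

    class WordSelectionPolicy(nn.Module):
        def __init__(self, embed_dim=128, hidden=128):
            super().__init__()
            self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
            self.keep_logit = nn.Linear(hidden, 1)  # keep-vs-delete per word

        def forward(self, word_embeddings):
            h, _ = self.lstm(word_embeddings)           # (batch, seq, hidden)
            probs = torch.sigmoid(self.keep_logit(h))   # keep probability per word
            actions = torch.bernoulli(probs)            # sample keep (1) / delete (0)
            return actions, probs

    # Training loop sketch: feed only the kept words to the sentiment
    # classifier, receive a delayed reward of +1 for a correct prediction
    # and -1 otherwise, and update the policy with a policy-gradient
    # method (vanilla PG / actor-critic / PPO, as tried in the talk).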
Our second model I call word probabilities: we use the probabilities of the words in the sentence to predict the sentence sentiment. Instead of two networks, we use only one network: we predict the probability of each word and use the sum of the word log-probabilities to train the model with a PPO loss function (we also tried other policy-gradient loss functions). You can see the model structure in detail here. For each word, we get an output tensor of word probabilities from the pre-trained BERT network, and we again use a reward function with value 1 if we predict correctly and value 0 if we predict wrongly. One thing we tried differently here is to use not only the forward sentence order but also the backward sentence order; in this case we have two probabilities for each word, and we can define the loss function with only the forward-order probabilities, or combine both the forward and backward order probabilities.
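A minimal sketch of the training signal for this second model (the PPO clipping details are omitted; this simplified REINFORCE-style form is my assumption):

    def sentence_log_prob(word_log_probs):
        # Sum of per-word log probabilities for one reading order.
        return word_log_probs.sum()

    def policy_gradient_loss(fwd_log_probs, bwd_log_probs, reward,
                             use_backward=True):
        # reward is 1 for a correct sentiment prediction, 0 otherwise.
        logp = sentence_log_prob(fwd_log_probs)
        if use_backward:
            # Optionally add the backward reading order as well.
            logp = logp + sentence_log_prob(bwd_log_probs)
        # REINFORCE: maximize the reward-weighted log-likelihood.
        return -(reward * logp)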
So let's look at our dataset and the experiments.
Basically, we evaluate our models on the Stanford Sentiment Treebank, a public dataset originally with five classes; for our experiments we adopt binary labels, using one for a positive sentence and zero for a negative sentence. Here are the detailed layers we tried for the two models. For our first model, sentence restructuring: for the policy network we use a long short-term memory that outputs the action-sequence tensor and a state-value tensor; for the classification network that receives the selected sub-sentence, we tried the transformer and pre-trained BERT as the supervised methods (for pre-trained BERT we just use the simplest version, the BERT-base-cased model); and to train the policy network, we tried vanilla policy gradient, actor-critic, and proximal policy optimization. For our second model, word probabilities, we again adopt pre-trained BERT, the simplest BERT model, to output the probabilities of each word, and then define the PPO loss to train the model.
Let's see the results. We tried two evaluation metrics. One is the accuracy metric, which we can consider a binary or discrete evaluation metric, because we have to set a threshold on the probabilities from the softmax layer; the other is the AUC metric, which we can consider a continuous evaluation metric. First, in the top-left figure, based on the transformer, adding the reinforcement learning algorithm with PPO gives better performance than the transformer alone, and with BERT we produce comparable results. In the right sub-figure, word probabilities with PPO, we are using both the forward and backward word probabilities. One likely reason for the lower AUC is that the BERT embeddings include positional embeddings, so when we take the backward order we can sometimes get poor performance; but that's our guess, and we haven't figured out why it has somewhat low performance.
Now let me talk about the conclusions and takeaways for this project. First, for the sentence-restructuring model, we find that adding the reinforcement learning method can improve performance over the transformer model and produce comparable results with pre-trained BERT. I've also collected the deleted words from the test set under our sentence-restructuring model, and most are words like 'such', as we expected; these words are not very important for predicting sentence sentiment in a common-sense view. On AUC, our models sometimes show an improvement, so a direction for future work could be further hyperparameter tuning on the probabilities if we want to use the accuracy metric.
Another takeaway for the whole project: I think there are many possible combinations of reinforcement learning and NLP tasks, but for sentiment analysis, we should admit this is not a complex task, and it is a very concrete task: we just need to predict which class a sentence belongs to. For such concrete language tasks, using reinforcement learning may not give better performance than supervised learning alone, but for more complex language tasks, like text generation and summarization, reinforcement learning may do better; indeed, reading papers during the project, I found a lot of implementations of reinforcement learning in text generation and summarization. The last benefit for me is that during this project I got the chance to handle all the supervised learning algorithms for sentiment analysis, like long short-term memory, attention, the transformer, and pre-trained BERT, and this gave me a good opportunity to build a pipeline with all of them for sentiment analysis. Thank you.
[Applause]
Did you find any specific reasons why RL could be a better method for text summarization? — I'm not quite sure, but my understanding is that we cannot get good performance for sentiment analysis because we have very concrete actions: we just have to predict whether the sentence is positive or negative. For very concrete language tasks, it's very hard to define a good reward function, but for text translation and text summarization we can actually define a very good reward function, and with a good reward it might be trainable using reinforcement learning algorithms. Yeah, that's my understanding: maybe more complex language tasks can get better performance using reinforcement learning. Thank you.
Thank you. So, good news: we're running right on schedule. We're going to go ahead and take a quick 15-minute break and start the next presentation at 6:45, so plan to be back in your seats a minute or two before then. Thank you all so much.
...besides therapeutic treatment for disease. For example, in this situation, at each time step the doctor needs to decide whether to do mechanical ventilation or give a vasopressor, and the reward is whether the patient is discharged from the hospital back to full health.
(This just doesn't work.)
In drug discovery, researchers need to develop new drug structures in order to cure a disease: they start with the simplest structure, and at each step they must decide how to grow the structure toward the desired properties of the compound, so in this situation the exploration-exploitation trade-off is really important. And in some less risky environments, doctors need to decide when and what kind of lab tests to schedule; in some cases the tests are very costly and may also have side effects on the patient, so it's important to decide an optimal sequence of actions there too. Recently there has been some literature working in this direction. Most of it is model-based: with simulated environments, they can get many samples. Relatively fewer papers focus on the model-free setting, working with purely observational data; in that case we have a limited sample size, the current status of this kind of research is really preliminary, and there are a lot of challenges associated with observational data, which I will talk about at the end of the talk.
My project focuses on sepsis treatment in the intensive care unit. The main reference is the recent Nature Medicine paper by Komorowski et al. I used the same MIMIC-III dataset, which is a very large electronic health record dataset, the same definition of the sepsis cohort, and the same state and action spaces. Different from their paper, I use a different reward design: in the original paper, they assign a reward of positive 100 to patients who were successfully discharged and minus 100 to those who deceased in the hospital or within 90 days after discharge. For the RL algorithm, they use policy iteration; I use policy iteration as well as Q-learning. They also use the eICU database, a much larger electronic health record dataset, but I didn't, so we differ a bit in sample size.
This is the structure of my project; I call it the AI physician. Basically, the input is the electronic health record and the output is the optimal policy suggested by our RL algorithms. The first step works with the raw dataset and extracts a regularly sampled time series; from this regular time series I extract Markov decision process tuples, and based on them I apply several different RL algorithms. Because observational data are measured irregularly and always contain some missing data, I used two methods to deal with this. The first is heuristic: for example, body temperature is measured in Celsius or Fahrenheit, and there is a basic transformation between them, which can help deal with some of the missing data. I also used k-means clustering: states within the same cluster should have similar measurements, so I fill in some of the missing data using values from observations in the same cluster.
For interpolation, I use a sample-and-hold method, which basically approximates the time series with a step function; a more sophisticated Gaussian process could be used instead. In the end, I'm interested in the 24 hours before and the 48 hours after the onset of sepsis, and I extract a sample every four hours, so we have at most 18 steps for each patient. Once we have the regular time series, we can use the data to fill in the environment and the replay buffer, which is a container of Markov decision process tuples that we can feed into the algorithm.
tuples that we can feed into the
algorithm this one example is the tag
tabular q the environment is used as a
folder that policy iteration algorithm
so we need to estimate from the samples
the transition matrix and the reward
matrix and will be playback buffer
actually did something like transform
the regular time series into M if it
happens and this structure follows the
open area team designed so it’s very
week later if we want to try most more
sophisticated algorithm we can just
replace this tabular here with some
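As an illustration of the tabular learner in this pipeline, here is a minimal tabular Q-learning update over the discretized states and actions (Python/NumPy; the alpha and gamma values are placeholders, not the talk's settings):

    import numpy as np

    N_STATES, N_ACTIONS = 750, 25
    Q = np.zeros((N_STATES, N_ACTIONS))

    def q_update(s, a, r, s_next, done, alpha=0.1, gamma=0.99):
        # Standard tabular Q-learning update on one (s, a, r, s') tuple
        # drawn from the replay buffer of observed transitions.
        target = r if done else r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])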
Once we have the regular time series, we want to build our MDP tuples. The variables that characterize the state have 47 dimensions, basically the patient's vital measurements and lab measurements. These are continuous, but I use a clustering method to cluster them into 750 discrete states. For the actions, we care about two: the IV fluid and the vasopressor dose administered over four hours. These two are also continuous, but I put each of them into five discrete bins, so there are 25 actions in total.
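A minimal sketch of this discretization (scikit-learn-style Python; the patient_features array and the bin edges are assumed to be built already, e.g. from dose quantiles):

    import numpy as np
    from sklearn.cluster import KMeans

    # patient_features: (n_samples, 47) array of vitals and lab values,
    # assumed already assembled from the regularized time series.
    kmeans = KMeans(n_clusters=750).fit(patient_features)
    state_ids = kmeans.predict(patient_features)

    def action_id(iv_fluid, vasopressor, iv_edges, vaso_edges):
        # Each continuous dose falls into one of 5 bins (4 edges,
        # assumed precomputed), giving 5 x 5 = 25 discrete actions.
        iv = np.digitize(iv_fluid, iv_edges)         # 0..4
        vaso = np.digitize(vasopressor, vaso_edges)  # 0..4
        return 5 * iv + vaso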
For the reward design, I care about three things. The first is whether the vitals or lab measurements stay within the desired range: looking at the first graph, if the values stay in the desired range, we assign zero reward, but if they fall outside the normal range, we give some negative reward, basically a penalty. The second thing is whether there is a sharp change between consecutive measurements: if the change is smaller than twenty percent, the reward is zero, but if it's larger than twenty percent, a negative reward is given. And we also consider whether the patient successfully survived sepsis or deceased within 90 days of discharge: that's a positive 50 or a negative 50.
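A minimal sketch combining these three components (plain Python; the penalty magnitudes for the first two terms are placeholders, since the talk only states their sign):

    def reward(vitals_in_range, pct_change, terminal, survived,
               range_penalty=-1.0, change_penalty=-1.0):
        r = 0.0
        # 1) Penalize vitals/labs outside the desired range.
        if not vitals_in_range:
            r += range_penalty
        # 2) Penalize sharp (>20%) changes between consecutive measurements.
        if pct_change > 0.20:
            r += change_penalty
        # 3) Terminal outcome: +50 for survival, -50 for death
        #    within 90 days of discharge.
        if terminal:
            r += 50.0 if survived else -50.0
        return r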
In the end, we have 278,000-plus tuples. With this whole dataset, we run 400 times: each time we cut it into a training set and a testing set, learn a policy on the training set, and evaluate the policy on the testing set. Because this is observational data, we need off-policy evaluation, and I use weighted importance sampling with bootstrapping. At the end, we have a hundred policies with their associated mean rewards, and we choose the best policy, the one with the best mean reward.
The policies we compare are: the physicians' policy, the dynamic programming policy computed with the estimated transition matrix and reward matrix, and the tabular Q-learning policy. Weighted importance sampling is used for the off-policy evaluation: we have the behavior policy pi_0, which generated the samples we have, and the learned policy pi_1, and we want to estimate the value of pi_1 from the data trajectories generated by pi_0. For each trajectory in the sample data, we define the likelihood ratio as the ratio between how likely this sample is to appear when following the learned policy and how likely it is to show up under the behavior policy. Then we define the weighted importance sampling estimator as shown: this part is the true return of the trajectory, and this is the likelihood ratio between the learned policy and the behavior policy; that gives the estimator for one trajectory, and since there are n trajectories in the entire dataset, we take the weighted average, which gives the value of the learned policy.
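A minimal sketch of this estimator (plain Python; pi_e and pi_b are assumed callables giving action probabilities under the evaluated and behavior policies):

    import numpy as np

    def wis_value(trajectories, pi_e, pi_b):
        # Each trajectory is (steps, G): a list of (state, action) pairs
        # plus the trajectory's observed return G.
        weights, returns = [], []
        for steps, G in trajectories:
            # Per-trajectory likelihood ratio: product over time steps of
            # pi_e(a|s) / pi_b(a|s).
            w = np.prod([pi_e(s, a) / pi_b(s, a) for s, a in steps])
            weights.append(w)
            returns.append(G)
        weights = np.array(weights)
        # Weighted importance sampling: normalize by the sum of the
        # weights (lower variance than ordinary IS, at the cost of bias).
        return np.sum(weights * np.array(returns)) / np.sum(weights)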
Now, this is the convergence of Q-learning: we can see that after a few steps, the difference between successive Q values decreases, and the variance decreases as well. And this is the Q table we learned for the different policies. We can see that the physicians' policy concentrates on five actions. The x-axis represents the actions taken and the y-axis corresponds to the different states, and actions 0, 5, 10, 15 (this is the action code) correspond to a vasopressor dose of zero, i.e., the lowest vasopressor dosage. It shows that the physicians' policy prefers low vasopressor values, while the DP policy and Q-learning are more diffuse over the different actions.
This plot shows the distribution of the different actions over the 750 states. Again, looking at the vasopressor dosage, the physicians' policy mainly concentrates on the zero vasopressor dose, while the DP policy and Q-learning also suggest higher vasopressor doses and lower doses of IV fluids. We also compared the values of the different policies' trajectories: the mean reward for the DP policy is this value, and for the Q policy it's this value, so the Q policy gives a higher reward on average. Here the x-axis corresponds to 50 different trajectories, so we can compare each trajectory individually; basically, it shows that Q-learning gives a much better result.
The fundamental challenges arise from the fact that we are working with observational data: we have a limited sample size, and we have to work with off-policy algorithms, estimating a policy from samples generated by a different policy. In medical applications, there is also partial observation, there may exist confounding factors, and there's the question of how to design reward functions so that we can encode domain knowledge to help guide the AI physician to better decisions. Future work relates to these problems. An immediate step is more sophisticated data interpolation, for example with a Gaussian process; in that case we can sample more frequently, get more data points, and then use more sophisticated methods. There's also model-based RL: if we can estimate the environment, we can simulate more samples and then use deep RL algorithms. The other two directions are related to off-policy evaluation and reward design. That's all, thank you.
Any questions?
Hi, thank you for your talk, this is very interesting. Can you go back to the example you gave with regard to vasopressors? Okay, great. Do you know if there's anything you've seen in the MIMIC literature about this behavior by the doctors? I'm just curious whether this has been mentioned, either in the original paper you were citing, the Nature Medicine one, or somewhere else, something about this kind of behavior. I'm assuming that even with a different reward, when you look at the actions taken, you might see something related to this; I'm just kind of curious how you would interpret this sort of behavior. — Right, so in the Nature paper they actually mention that, according to some medical research, doctors tend to use less vasopressor dosage, and this is also an active research area in the medical field.
How did you come up with the rewards? You had chosen positive and negative 50; if you choose other values, will the policy change? — Yes, I think so, because we encode three different things: if we increase the magnitude of the reward for the outcome at the final step, then the weight of the other terms, the normal range of vitals and the sharp changes, will decrease. — So how do you know whether it's a good reward or not? — Right, that's a very good question. I think for now it mainly depends on the doctors' domain knowledge; going forward, we can experiment with the observational data, doing some back-testing on it to choose the best reward design, so it's a combination of domain knowledge and testing via cross-validation. — Awesome.
I went to an AI-in-medicine conference at Stanford some time ago, and there was a researcher arguing to doctors against using black-box models in medicine. I saw some visualizations here, so is this less of a black box, since you're actually able to see some values inside the model? — Right, for the reward design you can actually put some domain knowledge in, so it's not that black-box, and because this works with tabular Q-learning, which I think is the easiest algorithm in RL, it can offer some insights into the data in this situation. But if we want to use something like policy gradients or DDPG, then it may not be able to offer such intuitive interpretations; this is also an active research area.
[Applause]
[Music]
Can everyone hear me? Great. Well, welcome everyone, and thank you all for coming this evening. My name is Edgar, and I'm going to be talking about knowledge distillation for transformer language models. To give you an overview of what I'll be talking about: I'll introduce the transformer, its successes and limitations; give you some background on knowledge distillation; give you an interpretation of what knowledge distillation can mean in this context; describe the approach I took; outline some future work; and leave some time for questions.
questions so the transformer model is
the latest and greatest one of the
latest and greatest in neural networks
that I’ve just come out it’s very
powerful it can generate language like
human being it can answer questions he
can summarize texts and so much more
this was very powerful and very exciting
to me because I wanted to make resources
for people who might not necessarily
have resources and one of the ways to
provide resources people as is through
human language and so when I was
starting to play with these models I
went into some trouble and that’s that
they’re huge
my little MacBook was huffing and
puffing and crashing with memory and so
I couldn’t imagine what would happened
if I try to run one of these on my phone
and so I was hoping to try to compress
these models and make them smaller while
still having similar levels of
performance and it’s going to give you
give you guys a look like introduction
to the transformer but my peers have
done such a good job I’ll skip past that
and tell you guys about what knowledge
distillation is so knowledge
distillation is getting a larger well
trained teacher neural network to teach
smaller and untrained student neural
network by getting the student to mimic
the outputs of the teacher if we say
that we give them both models the same
input and they give about the same
output you say that
student model is performing about just
as good as the teacher even though it’s
smaller and in this specific context
something that what happens when we do
knowledge distillation is we give the
neural network a sequence of words and
we mask some of them and usually we have
a loss function because and when the
transformer spits out its output it
gives a probability distribution over
words and the goal is to try to guess
the word that is being masked and you
want to minimize this so you want to get
the word right but in knowledge
distillation we’re able to provide the
student Network so much more information
by giving it the output of the teacher
so in this case we the output of the
teacher is a distribution over words
probably the ability distribution as
opposed to just a label which is one for
the correct word and zero for the others
and so what this distribution does is
give a probability over reasonable words
which will distill some knowledge into
the smaller Network so the man went to
the blank the right word is store but if
the larger neural network is able to
give the student the information that
another possible reasonable word is
groceries or bakery or and any other
synonym the student will be able to
learn a lot more and hopefully be able
to spit out the similar same
distribution for other words and some
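A minimal sketch of a distillation loss of this kind in PyTorch (the temperature and mixing weight are common choices from the distillation literature, not values given in the talk):

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, hard_labels,
                          T=2.0, alpha=0.5):
        # student_logits/teacher_logits: (num_masked, vocab_size);
        # hard_labels: (num_masked,) true token ids.
        # Soft-target term: match the teacher's full distribution over
        # the vocabulary, softened by temperature T.
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        # Hard-label term: the usual masked-word cross-entropy.
        hard = F.cross_entropy(student_logits, hard_labels)
        return alpha * soft + (1 - alpha) * hard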
to give some more interpretation as to what this means we have this idea that neural networks have a certain solution space and if we believe that we can distill knowledge into the small network we believe that the smaller network can inhabit a similar solution space to the teacher and sometimes that happens and sometimes it doesn't to give you an example of a case where a function can inhabit a similar solution space to another function but doesn't necessarily do so consider this simple case of curve fitting the green curve which is a much higher order polynomial can encompass every single solution and represent this function and so much more because it has all these higher-order terms but it usually doesn't even though its solution space is much larger and that's because it doesn't have the right inductive biases to find the solution that we're looking for under the current way that we train that model and so the hope for the student and the teacher transformer models is that the larger transformer can represent a larger solution space than the smaller one the larger transformer converges to a particular region of solution space and the smaller transformer converges to another different region and by training the smaller network on the larger network we pull the smaller network's region of convergence in solution space closer to where the larger model is so that it can represent a similar if not the same transformer by spitting out the same outputs despite being a different model with different weights
and different sizes and so my approach to this was taking BERT the bidirectional transformer and instead of having either 24 layers or 12 layers going with 9 layers and I wanted to reduce the dimension of the vectors that it works with to 576 as opposed to 768 or 1024
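as a rough sketch of what such a shrunken student might look like with the HuggingFace transformers library; the head count and intermediate size here are my assumptions, though 576 does conveniently stay divisible by 12 heads

```python
from transformers import BertConfig, BertForMaskedLM

# Student: 9 layers and 576-dim hidden vectors, versus BERT-base's
# 12 layers / 768 dims or BERT-large's 24 layers / 1024 dims.
student_config = BertConfig(
    num_hidden_layers=9,
    hidden_size=576,            # must divide evenly by attention heads
    num_attention_heads=12,     # assumed; 576 / 12 = 48 dims per head
    intermediate_size=4 * 576,  # assumed, following the usual 4x ratio
)
student = BertForMaskedLM(student_config)
```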
and this initial approach proved to be awful actually initially I had an accuracy of about 5% even though I trained for 32 GPU days and upon closer inspection it kind of makes sense because I ended up with about 72 terabytes of data Wikipedia was initially 12 gigabytes of text but by saving the outputs of the teacher network I ended up with a ridiculous amount of data and too much data for the GPU it was not able to handle long sequences and that's because as opposed to each word having a label I ended up with this huge vector giving a probability distribution over words which is a 30,522-size vector and so I had to think of something different because it wasn't going too well and so I thought about the outputs of that teacher neural network and realized that most of the information in that distribution is not very useful if we think about a masked word and consider a 30,522-word vocabulary most of those words are not going to be synonyms most of them are not going to be important and upon looking at them most of them had a probability of like 2 to the negative 8 or something very small not useful so I decided to instead truncate the output of that teacher neural network and only consider the top 10 words which left me with 384 gigabytes of data instead
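a minimal sketch of that top-10 truncation, assuming the teacher is a PyTorch masked language model; storing k index and probability pairs per masked token instead of a full 30,522-wide vector is what brings the storage down

```python
import torch

def truncate_teacher_output(teacher_logits, k=10):
    # Keep only the k most probable vocabulary entries per masked token.
    probs = torch.softmax(teacher_logits, dim=-1)
    top_probs, top_idx = probs.topk(k, dim=-1)
    # Renormalize so the truncated distribution still sums to 1.
    top_probs = top_probs / top_probs.sum(dim=-1, keepdim=True)
    return top_idx, top_probs
```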
and with the time that I had which was about eight GPU days I was able to get an increase to about 7% and so this leaves exciting work in scaling this up the big dogs some of them sitting right here train their neural networks for 2048 TPU days or 240 GPU days which is a lot and so it leaves a very exciting opportunity to scale up this task and see if we can get similar performance so I'd like to thank my mentor Susan Zhang for wonderful guidance the rest of the OpenAI scholars who provided me with a lot of support and Maddy Hall who was able to organize all of us despite juggling like 20 other things at the same time thank you
[Applause]
no questions no questions okay
thank you so much for the support everyone here listening to us okay and thank you so much to OpenAI for this opportunity and especially to my mentor Akuma Ram and so before I start I'm going to talk about myself so I was a support engineer in my previous life I was doing network security and after that I decided to move to education and I have been doing tons of things for education right now actually I started a program as soon as I joined OpenAI to try to convince 60-plus-year-old people that AI is important it's just for Spanish speakers and that's a picture of the group right now one of the guys Rodrigo is taking the lead teaching them TensorFlow through Udacity so I'm very proud I've also done other things I work with kids and teachers that's me at Columbia showing the GPT-2 unicorn example to some very interested teachers
and so my motivation behind this project is that I think education is completely broken I think schools were badly designed and the best way for us to learn and this is what science is telling us is to do projects right the way that we're doing this program is to dive deep into a subject and do something that is meaningful for us and apply knowledge but the schools cannot use this why well I'm actually on the founding team of a school in New York and there's no curriculum for project-based learning there is no time for the teachers to learn how to do projects themselves and if you have many projects in a classroom many kids each with their own project it's a mess in the classroom and the poor teachers who are actually implementing these things in the classroom spend tons of time looking for resources on the web adapting them and trying to customize them for a classroom so right now AI in education is just tracking it's just trying to see how we can personalize learning but what if we could actually unlock the educational content on the web so this is a proof of concept
for that so this is my model it's called Creature so what Creature does is it was trained on data from the Exploratorium here in San Francisco actually they have a great data set of activities which are kind of projects that are used just for learning and inquiry and we have 3700 of these activities so I trained BERT oh I'm not going to explain BERT because the earlier presenters already did that so I trained BERT to use another data set that has 280,000 DIY projects meaning like how to build a bicycle how to cook something or how to create a robot that will tell me if I need to water my plants to tell you what projects you need to do to learn something in particular like in this case biology so why did I choose Instructables well you know I had to collect my own data and the thing is that Instructables is a marvelous website with very well documented projects but the goal of those is not to learn it's to make something that is functional so if you put in the search something like I want to do biology Instructables returns 12 projects and three of them are actually not related to biology but they have great great projects so with Creature we can actually get 200 projects and because of the way I trained it we can actually get projects that have explanations in them so I'm just going to read the first paragraph of this project it says what does your heart sound like how fast is it beating what's the pattern a few students may be able to detect the lub-dub part of the beat which you've probably heard in movies too this has to do with the fact that our heart has four chambers which we will get to soon this comes from two pairs of valves in the chambers of the heart so this Instructable was actually meant to be done with students and we have many of these in the search so right now the model is doing 30 to 50 percent depending on the label and the projects with explanations range from a couple to hundreds so how did I do this
so the way that I did it is I took all the text data from the activities and divided it into four things descriptions instructions explanations and tools and materials and with that I trained BERT to predict what the learning topic would be so each of these Exploratorium projects has some labels for what you would learn and I ended up having 50,000 examples I trained BERT for like two epochs at a 9e-5 learning rate and I took 20% of that for validation and I got very good accuracy so I decided to try it on the Instructables data so on the Instructables data again I took the text chunked it into 512-token pieces and then predicted with that what you would learn to do and the result was kind of good
but then I developed this interface to do active learning so what I do with this is I get the chunk of the text that actually matched and then I can say whether it's positive or negative and with this I improved it a lot and got a bit more validation data I used the UMAP library which is a dimensionality reduction library to try to see what's happening under the hood so this is a visualization of the topic physics on the purple cluster that's mechanics on the blue one that's electricity and magnetism and between them sound is kind of bridging those two clusters the orange is light and there is a tiny little bit of green there intersecting with sound which is waves so waves somehow are very far from sound but it's very interesting this is an interesting avenue for exploring what's happening with these embeddings
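a minimal sketch of that kind of embedding map, assuming the umap-learn library and one BERT vector per text chunk; coloring by topic label is what makes the mechanics, electricity-and-magnetism, sound, and light clusters visible

```python
import umap
import matplotlib.pyplot as plt

def plot_topic_map(embeddings, topic_ids):
    # embeddings: (n_chunks, 768) array of BERT chunk vectors
    # topic_ids: one integer topic label per chunk
    xy = umap.UMAP(n_components=2, metric="cosine").fit_transform(embeddings)
    plt.scatter(xy[:, 0], xy[:, 1], c=topic_ids, cmap="tab10", s=4)
    plt.title("UMAP of project-chunk embeddings")
    plt.show()
```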
so I hope that by now I have convinced you that AI can be used for something else besides tracking in education I would be very interested in trying to do a web app where you can also filter these projects by the tools and time that you have this was my eighth experiment the rest didn't work very well but one of them that was promising was trying to see what the expertise level is on these projects and I used another data set from Hackster so I'm very interested in developing levels of expertise on these kinds of things and I'm now looking into something to keep feeding my Creature so if you know about any grant or anything like that to do with AI in education I would be very interested to hear it that's it thank you
[Applause]
did you ever think about using video yeah yeah so actually that was one of my first options trying to use video especially from like TEDx they have very cool content but I started using pictures actually and it wasn't very promising so then I got a little bit scared about trying video but I think that's one of the things that I would like to try
from a paper by Borji that looks at all of the different GAN metrics that we have he evaluates over 24 quantitative measures and five qualitative measures all of which are at times used to evaluate GANs none of which are optimal and none of which are the one golden metric that can tell us everything we need to know so the two most quoted scores you'll find around GANs are the Inception Score and the Fréchet Inception Distance both of them are based on state-of-the-art classifiers Inception by Google in the Inception Score's case they look at the entropy of the marginal distribution of labels in the Fréchet Inception Distance case they take real and fake samples put them through Inception layers up to the pool3 layer of v3 and look at the difference between means and covariances it gives us an idea but you come out with a scalar value so if I said to you I have a GAN that's got a scalar value of eight and one with 40 I'm not sure you could tell me what that means other than that a lower score is better
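for reference here is a minimal sketch of the Fréchet Inception Distance computation just described, assuming you already have pooled Inception activations for the real and generated samples

```python
import numpy as np
from scipy import linalg

def fid(real_acts, fake_acts):
    # Fréchet distance between Gaussians fitted to Inception activations:
    # ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 (C_r C_f)^(1/2))
    mu_r, mu_f = real_acts.mean(axis=0), fake_acts.mean(axis=0)
    cov_r = np.cov(real_acts, rowvar=False)
    cov_f = np.cov(fake_acts, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary numerical noise
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2 * covmean))
```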
then similarly in March researchers from here at OpenAI and Google released the Activation Atlas which gives us an unprecedented ability to look inside neural networks and visualize what's going on inside the black box so for my project I wanted to look at what would happen if we tried to bring these two things together and evaluate GANs using the Activation Atlas
so in terms of the project I needed a whole ton of samples thankfully BigGAN the mammoth and current state-of-the-art GAN released their fully trained ImageNet generators and so I was able to use BigGAN as one set of samples and then SN-GAN which is the spectral normalization GAN created by Miyato and others was state of the art a few years ago and brought a lot to the field especially around normalization and it had an FID score of around 27 and they also open sourced their fully trained ImageNet generator so we're considering here real images from ImageNet GAN samples from BigGAN with an FID score around 8.7 and GAN samples from SN-GAN with an FID score around 27 to see what we can find I then took the Inception v1 network I used v1 to be consistent with the work of the original Activation Atlas paper for clarity I should note that the Fréchet Inception Distance is most often calculated on the v3 network however BigGAN actually used v2 for their calculations and SN-GAN used only 5000 samples for theirs so it gives you a sense of how difficult metrics are in this space
having done that I passed all of these samples through the Inception v1 network across nine layers and captured activations you're talking hundreds of thousands of activations right massive amounts of dimensions to bring that down to something that we can comprehend at least that I can I used UMAP to dimensionality reduce to 2D and having done that we can then apply a grid over our dense scatter plot to find meaningful grid clusters having done that we then need to generate icons for each of these grid cells so a grid cell will contain a number of related activations and the part that really brings the Activation Atlas to life is creating these feature visualizations which give you a visual view of a direction in activation space having done that we can then also calculate the KL divergence between distributions because these grid cells give us a sense of density between the different distributions and then lastly we can create some atlases and in this case in contrast to the original Activation Atlas each grid cell in the GAN atlas will not only have the feature visualization of the grid cell but also a colored border which gives you an indication of the log-likelihood ratio between the two distributions so this particular example is comparing BigGAN to SN-GAN and the dark blue around the border means that in this square BigGAN has a much higher density it also has at the top the count of the samples in the square so you can see how many instances of samples from BigGAN versus SN-GAN are in that square and on the bottom we have an indication of the maximum logit which can give you an indication of what that grid cell is doing in regard to softmax at an indicative level lastly you've also got a grid cell reference which I find really helpful when we're trying to talk about these squares and identify them
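a minimal sketch of the density bookkeeping behind those borders and the layer-by-layer comparison, assuming each distribution's activations have already been UMAP-reduced to 2D; the smoothing epsilon is an illustrative choice to keep empty cells from dividing by zero

```python
import numpy as np

def grid_density(points_2d, bins=40, extent=None):
    # Histogram the 2D UMAP coordinates into a bins x bins grid and
    # normalize to a probability distribution over grid cells.
    hist, _, _ = np.histogram2d(points_2d[:, 0], points_2d[:, 1],
                                bins=bins, range=extent)
    return hist / hist.sum()

def cell_log_ratio(p, q, eps=1e-8):
    # Per-cell log-likelihood ratio, used here to color each cell border.
    return np.log((p + eps) / (q + eps))

def kl_divergence(p, q, eps=1e-8):
    # KL(p || q) over grid cells, for comparing distributions layer by layer.
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))
```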
so just very briefly I've implemented visualization of these icons very much in keeping with the original paper and like the original authors I found that in regularizing these icons it was really whitening as you can see on the far right of the slide that did the most to ensure that we had as much detail and visual fidelity as possible
so what did I find in terms of the difference between all of these distributions and again don't worry about reading the numbers on here but broadly what we found that was really interesting is if you compare BigGAN to ImageNet layer by layer the biggest divergence between those two distributions actually occurs right at the beginning of the network in mixed 3a and 3b which is kind of surprising in a way to me because early on in an Inception network the network is looking at very low level detail right it's looking for shapes texture kind of brightness and so forth and one would presume that that might be very very similar no matter which distribution you're putting through and I'll talk a little bit about what that may indicate then if you wanted to look at where we see the biggest difference between BigGAN and SN-GAN if we're trying to compare how good those two distributions are where do we see that well actually if we change the heat map to remove those first layers we can better see the differentiation that's not being watered down by the first layers and actually it's right there at mixed 5b at the end in bright red SN-GAN in the later layers really doesn't match the ImageNet distribution nearly as well as BigGAN and that's where we can see some of the differentiation so what I'd like to do now is just give you a few examples from mixed 3a and mixed 5b to bring it to life so this is the Activation Atlas 40 by 40 BigGAN versus ImageNet for the first layer in the network and what you can see is that each of the icons is not an image that you might recognize right we're not seeing dogs or cats at this level of the layer this is where we see texture but what you see is on one side of the map we have that blood-red area right a huge divergence and so if I look at say cell 1511 in mixed 3a what we saw there is there are 1302 samples from ImageNet that fit into that grid square but zero samples from BigGAN actually go into the grid square we're seeing already a huge divergence
and when you look at the data set examples of what the network is picking up in that grid square it's looking for highly detailed textural areas you can see it's not just grass up the top with a very cute puppy that wasn't actually the spatial activation the grass is like long fronds of grass right at the base you can see on the far right a placemat and a manhole cover with very detailed intricate texture and actually the two labels most represented in this grid cell were hay and manhole covers manhole covers surprised me at first but when you look at them over there right column second from the bottom manhole covers actually have a lot of texture in the asphalt around the manhole cover itself and on the metal of the manhole cover so we're seeing a lot of texture so the question for me is well then where did all the BigGAN hay and manhole covers go right where would they be and it actually led me to go all the way to the other side of the map to cell 2032 and that's where you start to see data set examples from BigGAN and a divergence between BigGAN and ImageNet so you've got way more BigGAN examples in this cell than ImageNet and what we can see are a lot of the same themes so we've got hay we have grass at the bottom you even see a gray sweater which is kind of evoking that same idea of really detailed gray texture but the fidelity of the texture here and the level of detail is much less and so already the Activation Atlas has shown how the network is massively separating these and I think at 3a you see both BigGAN and SN-GAN exhibit this same kind of quality and I think it's telling us more about the difference in the fidelity of the images especially since the GAN samples I was putting through were at 128x128 resolution and if you took say the 512 BigGAN I'd be interested to see how this may change
see how this may change so then the
question I think was really interesting
is can the activation atlas tell us
something more about how big Dan differs
from SN Gunn where where do we see that
and for that we need to look at layer 5b
which is right at the end of the network
before you hit a soft max before it’s
finally classifying and so for this one
I’m showing the activation Atlas at
mixed 5b 40 by 40 grid the coloring is
showing you in
red cells where SN Gann has a much
higher density in blue cells where big
down has a much higher density right so
what’s this going to tell us well the
first thing that sent me to was cell 223
which has a very big density divergence
and what’s interesting about this cell
is it’s looking for the Intel Busha dog
which I didn’t know but is like a swiss
mountain dog very brown white and black
and it has a very specific look and in
this cell there’s only about 20 to 24
examples of both imagenet and began but
they’re very well-structured well-formed
shapes with very clear details and
snouts an interesting in this cell SN
Gynt Dan didn’t manage to have any of
its samples recognized and I think what
we’ll see as we go through some of these
examples is the same thing playing out
BigGAN has both a self-attention mechanism which allows it to maintain a much better sense of global structure as well as the massive scale of the batch sizes it put through which allows it to maintain not just good texture in its samples but also global structure so if you go to cell 3520 and we look at cocktail shakers something that I don't think we talk about nearly enough when we look at GANs this is a really interesting cell for me because there are 328 BigGAN samples in here in this cell ImageNet hits a similar kind of density around 269 but SN-GAN only has 61 samples picked up and if you look to the far right of the slide which are the SN-GAN samples you can see that it's struggling to maintain the outline shape that structure and is only picking up on some of the shape cues and if you go to cell 2122 now I'm not here to tell you that BigGAN has solved the uncanny valley but this cell is looking at maillots or one-piece bathing suits and what I find fascinating about how this has been picked up by the atlas is that both ImageNet and BigGAN in this cell have around about 150 samples which all represent the bikini or one-piece bathing suit right and if you think about it in ImageNet terms what it's looking for is some sand water a lot of human skin in the image right and maybe a piece of material SN-GAN manages to hit those themes but they're very textural with very little shape and distinction in contrast with BigGAN you can actually see it's actually hit the shape of a triangle bikini and a one-piece maillot which in comparison to early GANs which could only really do skies and broad textural things the ability to pick out individual shapes in such a complex area is pretty phenomenal and you're seeing that play out
then had the ever ending one is well
where on earth did all the SN Ganz
apples go right if we’ve got all of
these cells where began is that the same
sort of density as imagenet and doing
well in a distribution term where did SN
going go and that sent me to the bottom
of the map to cell 316 weasel cell so
the weasel cell here is looking for
things that are kind of white or caramel
and fluffy and don’t have sharp edges or
very defined kind of structure you see
things like polar bears turning up the
second image there on image net might be
a little confusing because it’s a
scooter but the part of the image that
was spatially activated was actually a
fluffy and sheepskin rug at the back of
the scooter and that’s what it picked up
on so that’s the sort of thing you see
in this cell and both began an image net
have you know densities there SN gaining
contrast has two x they’re not the
density of image net in this cell and
while at first you see some of the
similar sorts of themes polar bears and
fluffiness if you actually look through
the rest of the data set examples you
start to see and so it’s quite small
here but what you actually see are a
whole lot of examples from non angora
fluffy classes whether it’s dogs or even
inanimate objects like buildings that
have so much artifact and kind of
overexposure in the image that they’re
lumped in here and picked up by the
classifier in this area and so for me
what’s what’s really interesting here is
that it helped for me characterize a
couple of things one what on earth does
a fydd score of eight versus a fydd
score of 27 mean right at one level I
know that the mean and the covariance
are the distributions getting closer at
seven but well it’s also kind of helping
me see here is okay if I look across the
activation Atlas well began and its
shape capacity and fidelity is allowing
it to hit many of the same grid squares
in a similar issue density to image net
SN gown is hitting some
and then having these clumps of clusters
in kind of amorphous areas and that’s
helped helping to do it the other thing
I found really fascinating about using
the Atlas to understand it is when I
first created all the samples for SN
Gann honestly I freaked out a bit
because I look to them he’s not good
enough this doesn’t look like SN Gann
and what I realized is I was used to
looking at SN Gann samples in contact
sheets that showed me ten images all
from one class at a time and my human
eye when seeing ten images all from the
one class at once would infer structure
and infer a kind of meaning in those
images that the classifier may not well
pick up and so that’s something I
thought was really interesting about
using this as a way to guide which
samples to investigate and so forth
because there’s always an element of
going back to the samples in Ganz and so
this can help accelerate that so lastly
so lastly future directions obviously there are about 81 possible activation atlases if you think of the number of grid sizes by the pairs of distributions by options so I'm working on an interface that would allow us to explore that we can extend the samples to some non-GANs I'd love to put the Sparse Transformer through this there'd also be an opportunity I think to keep the samples the same but actually pull the discriminator out of a GAN and use that and sort of track how different samples are being evaluated by a discriminator and similar to the MIT CSAIL work on GAN Dissection it would be great to take a generator and look layer by layer at how that generator is doing what it does so lastly I do want to thank everybody here at the end of my talk thank you to OpenAI Christy and Maddy who were amazing and helpful and to Chris Olah and Ludwig Schubert from the Clarity team who have been amazingly open with not just their time but their knowledge and wisdom which has been phenomenal and it's been a blast my fellow scholars so thank you
[Applause]
how do you pick cell 223 yeah so I tried as much as possible not to cherry-pick and to just look at the thing so the way I looked at it was to take the log likelihood of each cell and literally rank it and go for the ones with the greatest divergence the greatest difference and interrogate those first from both ends both from where BigGAN was really overweight and then where SN-GAN was really overweight and compare and come at it from those angles having done that once you're inside a cell and you start seeing something like you know all the hay labels seem to be here then I would sort of say okay where are all the hay labels for the alternate distribution because that can give us an idea for mixed 3a at least I always think about it as twin grid cells right like how are these related and trying to look into it that way does that help
so with BigGAN yeah I know that in the paper they have the truncation trick yeah the truncation trick is very cool with BigGAN you can truncate the values of the normal distribution that's used for the noise vector before they generate the samples and in effect if it's very very low you get a small variety of samples coming out but amazing fidelity the authors themselves say it's almost like cherry-picking your very best options or if the truncation is very high you end up with a large variety of not so great samples these were all done at 0.5 truncation a midpoint
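a minimal sketch of that sampling scheme, mirroring the publicly released BigGAN demo code: draw from a standard normal truncated to [-2, 2] and scale by the truncation value; the 128-dim latent in the usage line is an assumption

```python
from scipy.stats import truncnorm

def truncated_noise(batch_size, dim_z, truncation=0.5, seed=None):
    # Standard normal truncated to [-2, 2], scaled by the truncation value;
    # lower truncation -> less variety but higher-fidelity samples.
    values = truncnorm.rvs(-2, 2, size=(batch_size, dim_z),
                           random_state=seed)
    return truncation * values

# e.g. z = truncated_noise(16, 128, truncation=0.5) for a 128-dim latent
```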
I did some early work on looking at the truncation and you kind of saw gratifyingly what you might expect which is as the truncation blew out everything just kind of got exacerbated right things that were not going so well got worse and so on and so forth but that's in that sample-extension bucket and I want to do some more that's fine any other questions
oh you've got a question sorry yeah I'm wondering independent of the comparisons between the two GANs I'm wondering just what were your overall thoughts on how well BigGAN compared to just the ImageNet distribution yeah so right let me see if I can do this without making people sick apologies guys if you look at the KL divergence by layer if we can get there so on this slide what you can see is BigGAN versus ImageNet is our first one right so if you discount layers 3a and 3b for a moment actually all the way through you've got pretty good tracking between ImageNet and BigGAN and you only see a lot of the divergence at 5b so I really want to rerun this with a much higher fidelity BigGAN sample because a lot of the best BigGAN stuff that you see those quintessential examples are coming out of their 512 resolution and this was only done at 128 so I think that you would see an improvement there and you know it's only in those later layers and it's very specific unlike SN-GAN where you're seeing kind of broad issues with things being lumped into the grid cells with BigGAN it seems to be specific to categories those categories that require more structure and so forth does that help
so if everything were blue it would mean that your KL divergence is close to zero right which means that the distributions are quite close the issue with using the Inception classifier network is that it's saying in the way in which the classifier network understands your images for the purpose of classifying them they are very very close so you have tipped over the threshold where your swimming suit is recognizable as a swimming suit but having a classifier give you a completely confident score on it being a swimming suit does not mean that it is photorealistic to human eyes so it's getting you a part of the way there but if you want to do a comparison for photorealistic images a classifier alone doesn't do that any other questions
[Applause]
so for people who might have thought of
questions during the break or just want
to chat a little bit more informally
with some of the scholars as well as the
mentors we will be here till around 8:30
and we can just mix and mingle over in
the living room as well as here but one
more round of applause for all the
scholars and thank you all for coming
and watching the presentations with us