Transcript
Hi everyone, I'm Satya. I'm an undergrad at MIT, and I'm doing my internship here at OpenAI. Broadly, my internship is about generative modeling. So what are generative models? They're models that generate data within some kind of distribution. For example, if you train your model on dogs, then you want your model to generate images of previously unseen dogs. Why is this useful? It's not immediately clear from my explanation, but when you're able to generate images of dogs, you hope that your model has learned something more useful about exactly what is in those images of dogs, that it has learned some kind of structure that's important.

There are many applications for these types of models. The first one I'll be talking about is world modeling, and world modeling is useful for sample-efficient, generalizable reinforcement learning. So what exactly is world modeling? Well, in our hopes and dreams, we want our agents to first learn basic movements. For example, when you're learning how to walk, you learn how to put one foot in front of the other, then you learn how to turn without falling, and maybe even how to go backwards. Once you learn these basic movements, you no longer need specific instructions on how to get from the couch to the TV, or from your house to school, because now you have a sense for how basic movements cause interactions in your world. So the idea is, if we can train our agents this way, then hopefully they're able to do more complex tasks that they haven't necessarily been trained to do, and we can train them a lot more easily using transfer. We want to use less data to train our agents, because it's expensive to collect this data from our games, specifically episodes and things like that.

In this example here, I have a basic game called Car Racing. In this game, as you can imagine, when you're learning how to drive a car, you learn how to go forward faster, how to slow down, and how to turn. You can see in the video that the car is able to go around this track, and it's never actually seen this track before. It's learning how to turn, and when it messes up, it's able to recover, because it knows exactly how its movements will interact with its environment. This is why model-based reinforcement learning agents would be really cool, and these are our hopes and dreams for them.
But alas, there are some shortcomings. First of all, a lot of the implementations out there right now for model-based reinforcement learning are game-specific. They're also difficult to train, kind of finicky. And one of the biggest problems is that if you have a small artifact when you generate a frame of a game, then as you keep rolling out and trying different actions within that world model, you just keep building on your artifacts; you're never able to fix or correct for them. These small errors can propagate really quickly and cause a lot of problems when you're rolling out. And for some combination of these reasons, we're not able to get model-based agents to do as well as model-free agents.

In the example I show on the right, I'm showing the exact same model that trained the Car Racing video, but now on a game of Pong. In the ground truth you have two paddles and the ball, and you want to keep the ball in play. But when you reconstruct the frame after encoding it, you no longer see the ball, and you no longer see the other paddle. What I mean by reconstruction I'll explain in a second, but the key part here is that you're not able to see the important parts of the game: you don't know where the ball is, and so it's hard for the model to learn.
So how exactly was that Car Racing video trained? You take a frame and you run it through a variational autoencoder: you run a series of convolutions to get some latents, which I represent with z here, and then you run a series of deconvolutions to get a reconstruction of your frame. You compare that reconstruction to the original frame in order to get some kind of error that you train with. You also give these latents, which you hope provide some kind of context of time-dependent state in your game, to an RNN, and your RNN is going to keep track of things like where you are in the game, how far away your obstacles are, and what exactly your specific game state is.
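The encode/sample/decode loop just described can be sketched in plain NumPy. This is a toy illustration, not the actual model from the talk: the dense layers standing in for the (de)convolutions, the sizes, and the `encode`/`decode`/`vae_loss` names are all assumptions made for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
D, Z = 64, 8                                     # flattened frame size, latent size (toy choices)
W_enc = rng.normal(scale=0.1, size=(2 * Z, D))   # dense stand-in for the conv encoder
W_dec = rng.normal(scale=0.1, size=(D, Z))       # dense stand-in for the deconv decoder

def encode(x):
    h = W_enc @ x
    return h[:Z], h[Z:]                          # mean and log-variance of q(z | x)

def reparameterize(mu, logvar):
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * logvar) * eps       # sample a latent z from q(z | x)

def decode(z):
    return W_dec @ z                             # reconstruction of the frame

def vae_loss(x):
    mu, logvar = encode(x)
    z = reparameterize(mu, logvar)
    recon = decode(z)
    recon_err = np.mean((recon - x) ** 2)                      # compare to the original frame
    kl = -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar))    # keep q(z|x) close to the prior
    return recon_err + kl, z      # z is what would also be fed to the RNN for state tracking

x = rng.normal(size=D)            # a stand-in "frame"
loss, z = vae_loss(x)
```

The same latent z that enters the reconstruction loss is the vector handed to the RNN, which is why it matters so much what the latents choose to encode.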
The key here is that you don't want your latents to be wasted encoding something that's true of all game states. For example, if you're running through a grassy meadow, you don't want your latents to only be encoding "grassy meadow"; you want them to encode something like "there is a rock there that I could trip on."

In this video that I show, it's again the Car Racing game: on the left you can see the ground truth, and on the right you can see what the model actually reconstructs. You can see it's pretty accurate. It doesn't have those red bumpers, which aren't really that important. But the key thing to notice, the reason this worked on Car Racing and not on Pong, is that there are so few elements in Car Racing: there's only the track and the green around it. If you can roughly model those, you're able to train your agent pretty well, but it's not generalizable to more complex games, or to games where you have components at multiple scales.
So how do we seek to improve this? Well, when we pair our VAE encoder with a strong decoder, our latents are forced to encode only things that depend on the time step, which is what we want. For example, in this case we use a PixelCNN, and when we condition our PixelCNN on the time state of our game, we're able to force our latents to just be something like "there is an obstacle at (x, y)." What does that obstacle look like? Well, that's for the PixelCNN to generate. By doing this, we're able to create a more robust model of how we actually learn and think: I know that I could trip over this table, but I also know I could trip over that couch, and the details of what I'm tripping over aren't that important; it's just that I shouldn't trip. We also know that PixelCNNs can generate really high-resolution images, so we hope they can generalize to games that look a lot richer than the Car Racing game, or even the Pong game.
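As a toy illustration of "latents as conditioning, details left to the autoregressive decoder," here is a strictly causal linear predictor in NumPy: the prediction for pixel t sees only the pixels before t plus the latent code z. The triangular masking is the same trick PixelCNN uses in its convolutions; everything else (linear maps, the sizes, the `ar_predict` name) is a simplifying assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
T, Z = 8, 4                              # number of "pixels", latent size (toy choices)
W = rng.normal(size=(T, T))              # weights over previous pixels
V = rng.normal(size=(T, Z))              # weights on the conditioning latent
mask = np.tril(np.ones((T, T)), k=-1)    # strictly lower-triangular: only the past is visible

def ar_predict(x, z):
    """Prediction for pixel t depends only on x[:t] and on the latent z."""
    return (W * mask) @ x + V @ z

x = rng.normal(size=T)
z = rng.normal(size=Z)
base = ar_predict(x, z)

# Causality check: changing pixel 5 cannot affect predictions for pixels 0..5,
# but it does affect the predictions for later pixels.
x2 = x.copy()
x2[5] += 10.0
out = ar_predict(x2, z)
```

Because z enters every position's prediction, it only needs to carry the time-dependent facts ("obstacle at (x, y)"); the masked weights over past pixels are what fill in the appearance.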
So this is the work that I've been doing with world modeling. Now I'm going to transition to talking about how I've been using generative models in the context of invertible flow models. Invertible flow models are known for being expressive models with efficient sampling. Before I talk about the architecture, I'll show exactly what I mean when something's expressive. This is a video of my friend Prafulla (he wasn't going to be here, but he's here), and basically what you can see here is Prafulla's face morphing into different celebrities' faces. Now, why is this impressive? Because in order for our eye to look at that and say "wow, that's a realistic morphing," every single one of the images generated along the way has to look like a person's face. If something looks grotesquely wrong, we'll notice it right away. So what this is telling us is that when you use an invertible flow model, you learn the latent space so richly that you can interpolate between the encodings of different images and still get a reasonable interpolation. In my mind, if I were to imagine Prafulla becoming Neil deGrasse Tyson, this is how I would imagine it, but I don't do that often.

So what are our expectations of flow-based generative models? What we want to do is take some random noise, as you can see in the image on the far right: you take some noise that's sampled from some prior, usually one with a tractable density, and then you warp it via your model into what you want to see, maybe faces, maybe cats and dogs. Basically, what these equations are saying is that you sample a z from your prior and then you run that z through your invertible network to get a generated output. Now, f needs to be invertible, our model needs to be invertible, and we can use the change of variables formula to write exactly how the log-likelihood of the input can be expressed in terms of this model. When we do generative modeling, we always want to maximize the likelihood of generating whatever input we get, because that's our best bet at making something realistic. Here, what you see is that the likelihood can be written as the likelihood of that random noise under the prior, minus the log-determinant of the transformation between the latents and the inputs: for x = f(z), log p(x) = log p(z) - log |det(∂x/∂z)|.

In reality, it takes a lot of effort to design a fully invertible model. There are very specific components that can be used, and each of those components tends not to be very expressive. So when you want to build an actual model that's able to turn noise into faces, you need to use an extremely large model, which requires a lot of memory. As you can see, I've placed this meme here to demonstrate my frustrations: you build a large model and you want it to be tractable, but no matter how much memory you have, you always run out of memory. The types of transformations you can do are things like affine transformations or one-by-one convolutions, and I can talk more about that when I talk about the specific architecture. The first flow model was called RealNVP, and on top of that, people at OpenAI refined it to make what's called Glow.
There aren't that many differences between the two, but the basic idea is that you take an input and you have a multi-scale architecture. Because you're only able to do one-by-one convolutions, you can only pick up features at a particular scale, so you need to warp your inputs and reshape them so that you can pick up features at different scales. So you squeeze, which just means you change the dimensions, and then you go through a step of flow, which is shown on the left. A step of flow is just a normalization, a one-by-one convolution, and then a coupling layer, where you split your input: you run half of it through a deep network to get a transformation that you apply to the other half, and you let that first half pass through unchanged. The coupling layer is what takes up the most memory in our flow models. So when I set out to work with these models, I shared the parameters of the coupling layers across all of the different steps of flow. You would think, well, that's a lot of parameters you're losing; probably only something like one out of every 32 flow steps is going to be actually expressive. But what I found was that you can generate very realistic-looking samples and match the bits per dimension using 3% of the memory you did before, and it goes 33% faster in terms of wall-clock time.
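A minimal sketch of the parameter-sharing idea: every flow step reuses the same affine coupling weights. This assumes NumPy, a dense network in place of the real convolutional one, and no permutation between steps (Glow's one-by-one convolutions would normally mix the halves), so it's an illustration of the mechanism, not the actual architecture.

```python
import numpy as np

rng = np.random.default_rng(2)
D = 8                                    # dimensionality (even, so it splits in half)
H = D // 2
# One shared parameter set reused by every coupling layer: this is the memory saving.
W_shared = rng.normal(scale=0.5, size=(D, H))

def net(x1):
    """Shared deep-network stand-in: maps one half to a scale and shift for the other."""
    h = W_shared @ x1
    return np.tanh(h[:H]), h[H:]         # bounded log-scale, and shift

def coupling_forward(x):
    x1, x2 = x[:H], x[H:]
    log_s, t = net(x1)
    y2 = x2 * np.exp(log_s) + t          # transform one half, conditioned on the other
    logdet = np.sum(log_s)               # log|det| of this triangular transformation
    return np.concatenate([x1, y2]), logdet

def coupling_inverse(y):
    y1, y2 = y[:H], y[H:]
    log_s, t = net(y1)                   # y1 == x1, so the same scale/shift is recovered
    x2 = (y2 - t) * np.exp(-log_s)
    return np.concatenate([y1, x2])

x = rng.normal(size=D)
y = x
total_logdet = 0.0
for _ in range(4):                       # several flow steps, all reusing W_shared
    y, ld = coupling_forward(y)
    total_logdet += ld

x_rec = y                                # invert the whole stack, step by step
for _ in range(4):
    x_rec = coupling_inverse(x_rec)
```

Because the first half passes through unchanged, the inverse can recompute exactly the same scale and shift from it, which is what makes coupling layers cheap to invert; the log-determinant term of the change-of-variables formula falls out as the sum of the log-scales.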
These are just CIFAR-10 samples, and as you can see, on both sides they're pretty realistic. So that's pretty much it for my presentation. Will is going to come up here to talk a bit about the work we've been doing together on improving flow models in other ways.

Oh, yes. So in our new architecture, we only encode: we only go from the inputs to the latents, and then we use the latents as a conditioning vector for the PixelCNN. The raw input also goes to the PixelCNN, and what comes out of the PixelCNN is what you actually train against the true inputs. Did that make sense? Any other questions?
There are actually some troubles that come with invertible models. The biggest pro of using an invertible model is that you can sample really efficiently: drawing random noise is easy, and running it backwards through the model is also easy. However, because you're constrained to these invertible components, you're forced to accept a different level of expressivity, and they can also be finicky to train, because you can go into regions where the model is no longer invertible, or you're forcing it to be invertible and it's not able to express what it needs to. This is part of what my work with Will has been addressing.