Transcript
Hello, and good morning everyone! I'm Josh Achiam, a safety researcher here at OpenAI and the main author of Spinning Up in Deep RL. Thank you all so much for being here today at OpenAI's first Spinning Up Workshop.

For people tuning in on the livestream: there's a minor technical difficulty, so we won't be able to broadcast the slides directly from my computer into the livestream video, and you'll be seeing the screen through the camera. In case that's not clear enough for you, I just open-sourced the repo that has the PDFs for these slides. Go to github.com/openai/spinningup-workshop, and in the rl_intro folder you'll find the PDF for the presentation I'm about to give. Hopefully that makes it easier to follow along.
Since this is a new thing we're doing, I'd like to start today by talking about what it is, why we're doing it, and what we hope you get out of being here. Education at OpenAI is the idea that, as part of our mission, we want to provide for the public good and help foster a global community around AGI, which is the thing OpenAI cares about most and is trying to make happen in a way that's safe and beneficial for all of humanity. For those who aren't already familiar: AGI is artificial general intelligence. The idea is that this will be some very powerful AI technology with the ability to change pretty much everything about how we do anything, something that could potentially do most economically valuable work and solve tasks that currently only human intelligence can solve. We think it's really important to help people become aware of what AGI is and what technology will likely underlie it, so that you can think critically about issues that might come up in the future, and also, if you're interested, participate, because we really need people to step up and help make sure this technology is safe, does what we want it to do, and doesn't cause anything harmful or detrimental to the world.

Spinning Up is the first thing we're launching under this Education at OpenAI initiative, and the goal is to help people acquire technical skills in the research topics we care about. Spinning Up in Deep RL is
a resource that hopefully all of you have seen by now. It contains several pieces. First, a short intro to reinforcement learning: what is this thing we do so much research on at OpenAI? Second, an essay about how to go about becoming a researcher, if you're interested in joining. Third, a curated list of important papers in the field. This is particularly important because, since this is an emerging field, there isn't really a clear consensus on the best way to learn it, or a textbook that completely illuminates the way from start to finish; a lot of the important knowledge is still in research papers, so if you want to learn the most you have to go digging, and hopefully this helps you figure out where to look. Fourth, a code repo of key algorithms. For any of you who have tried hacking in this field before, I'm sure you found a lot of very confusing resources out there, really excellent ones, but nonetheless ones that made non-obvious choices and didn't clearly connect what they were doing to why they were doing it. We hope the repo we provide in Spinning Up in Deep RL helps bridge that gap. And of course, some exercises: if you want to try coding something up, there are ideas for getting familiar with some of the key pieces of math, the algorithms, and the kinds of bugs to expect.

So why are we having workshops? In addition to
putting these resources online, we think it will really help people if we work with you one-on-one, see you face-to-face, and have the kinds of conversations and share the ideas that just don't come up in the open-loop process of putting information on the Internet. Today, we'd like you to come away with a better sense of the current capabilities and limitations of deep RL. We'll tell you a bit about what kind of research is out there, so that if you want to follow some line of thinking, you know what's been done and what hasn't. And we'd like you to actually try building and running deep reinforcement learning algorithms, possibly for the first time, and to show you how to be confident doing that, so that if you want to keep doing it afterwards, you're able to.

All right, so what is deep reinforcement learning? Why do we need it? Why do we care about it?
Deep reinforcement learning is the combination of reinforcement learning with deep learning. Reinforcement learning (RL) is about solving problems by trial and error, and deep learning is about using very powerful function approximators called deep neural networks to solve problems. Deep reinforcement learning is straightforwardly the combination: we have something learning by trial and error, and the thing being learned is a deep neural network that will make some kind of decision or evaluate some situation, and use that, in some environment, to make decisions that lead to rewards, where a reward is just some measure of how good or bad an outcome was. So when would you want to
use RL? RL is useful when, one, there's a sequential decision-making problem, and two, you don't already know the right thing to do in each situation. If you have the optimal behavior, say from having watched human experts enough, and you have a ton of data on exactly what to do in every situation, then you can use the standard tools of supervised learning to get a machine learning system to duplicate that behavior. But when you don't have access to that, or when you suspect that what appears to be expert human behavior is in fact suboptimal, you may want to try reinforcement learning instead, because it can discover things that wouldn't otherwise have been known. You also have to be able to evaluate whether a behavior or an outcome was good or bad; this is pretty critical. So RL is good when it's easier to evaluate behaviors than to generate them or exactly solve for them. And when
would you use deep learning? The typical paradigm for deep learning is that you want to approximate some very complicated function, one that usually requires some amount of intelligence. For instance, a human can look at a picture of a bird and know what species it is; that's something you can't really write down a simple mathematical rule for. If you want a machine to do it, you have to teach it from data. Other problems you'd want this for typically have inputs or outputs that are very high-dimensional, because it's quite hard to go from an image, a video stream, or an audio stream to a decision rule without some sort of learning in the middle. You also typically want lots and lots of data, because getting machine learning systems to behave in any reasonable way requires giving them sufficient examples. There are tons of problems where this is exactly what you have, and in those domains deep learning has been very successful, exceeding whatever was previously the state of the art from any other methods and creating things that are now standard consumer products. Things that were magic ten years ago are completely normal now: super excellent image recognition and facial classification, or the fact that you can talk to your phone and it will know what you said instead of coming up with some completely random gobbledygook. This keeps getting better because we're able to leverage the very powerful technology of deep learning for these problems. And so deep RL is when
you have some very hard, high-dimensional problem where you can evaluate behaviors, and you want a machine to learn how to do it, because you can't write down how it should in fact behave. Some very simple examples: video games, where you want to go from a computer looking at an image of the screen, just raw pixels, to a decision rule that scores the most possible points, or behaves in a way that's cool, interesting, or exciting. Or perhaps a really sophisticated strategy game like Go, where really deep thinking, intuition, and creativity are necessary to make progress: you can't write down a simple rule for that, but you can learn it with reinforcement learning. Or perhaps you want to control some complex humanoid robot to run around and do stuff. Or maybe something a little less silly and a little more real: getting robots in a factory to quickly learn a new task. (When the robot uprising happens, it's because of this. We're very sorry for this research.) That humanoid, by the way, was trained with an algorithm developed here at OpenAI called Proximal Policy Optimization; it's one of the algorithms in Spinning Up. If you haven't had any experience with it, we won't get into it in this lecture, but at any point in the afternoon during the hackathon I'm happy to go into detail. So before we proceed into the RL-
specific stuff: this is a crowd with a pretty wide range of backgrounds, so I want to do a very brief recap of some of the patterns from deep learning. What do you expect when you set up a deep learning problem? What does it look like, and what do you have to think about? We typically talk in the language of finding a model that is able to give the right outputs for certain inputs. The model is some function of the inputs and parameters, and the parameters are adjustable: we control them, and we want to change them in a way that makes the model behave according to some design specification. The way we provide the design specification, and get the parameters to satisfy it, is by setting up a loss function. This tells you, in a nutshell, how good the model is at doing the thing you want it to do, usually some measure of how close the model's output is to the true output. The critical thing about this loss function is that it has to be differentiable with respect to the parameters in the model. And of course there's data as well: you have a bunch of different examples of inputs and outputs, and your loss function reflects how well your model performs across all of them, typically as an average over per-data-point losses. With this setup you can then proceed to find the
optimal model through gradient descent. The idea is that the gradient is a mathematical object that tells you how much the loss changes in response to a change in the parameters. Knowing that, you change the parameters in a way that is fruitful, that is, it reduces the loss, the measure of error.
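To make that loop concrete, here is a minimal sketch (my own illustration, not from the talk) in plain NumPy: gradient descent fitting a two-parameter linear model under a mean squared error loss, with made-up data.

```python
import numpy as np

# Made-up data: the "true" function is y = 3x + 0.5 plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 0.5 + rng.normal(scale=0.01, size=100)

w, b = 0.0, 0.0   # adjustable parameters of the model f(x) = w*x + b
lr = 0.1          # learning rate

for _ in range(500):
    err = (w * x + b) - y          # model output minus true output
    grad_w = np.mean(2 * err * x)  # d(loss)/dw for the mean squared error
    grad_b = np.mean(2 * err)      # d(loss)/db
    w -= lr * grad_w               # step in the direction that reduces the loss
    b -= lr * grad_b

print(round(w, 1), round(b, 1))  # recovers roughly 3.0 and 0.5
```

The loss here is differentiable in both parameters, which is exactly what lets the gradient tell us which way to move them.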
So what makes deep learning deep? What is the deep part? It's the idea that function composition is at the core of the models we consider. Function composition just means you have a bunch of different parametrized functions, where the outputs of one are the inputs to the next, and you can arrange them in many different topologies; we call these architectures for neural networks. The very simplest kind has an input layer, then a matrix multiplying it, then maybe a bias added to that vector, then a nonlinear activation function. Typically this squashes the outputs of that first linear transformation into something in the range from 0 to 1, or 0 to infinity, something relatively simple, but that nonlinearity happens to do a lot of work. When you have successive layers, what it ultimately allows the model to do is represent successively more complex features internally. You might think of each layer's output as a new representation of the original input, one that has maybe rearranged the information in a way that makes it easier for some final decision-making procedure at the end of the network to make the right decision.
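A tiny NumPy aside (my own illustration, not from the slides) on why that nonlinearity "does a lot of work": two stacked linear layers collapse into a single linear map, while inserting a tanh between them does not.

```python
import numpy as np

# Two stacked linear layers with no nonlinearity are just one linear map.
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 5)), rng.normal(size=(5, 2))
x = rng.normal(size=3)

two_linear = (x @ W1) @ W2    # composition of two linear layers
one_linear = x @ (W1 @ W2)    # exactly equivalent single layer
assert np.allclose(two_linear, one_linear)

# With a tanh in between, the composition is no longer expressible as a
# single linear layer; successive layers can build more complex features.
deep = np.tanh(x @ W1) @ W2
print(np.allclose(deep, one_linear))  # False
```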
Aside from that very simple architecture, there are also substantially more complex ones. The other two diagrams on this slide are an LSTM network (lower left) and the Transformer network (right). An LSTM network is a recurrent neural network: the kind of network that can accept a time series of inputs and produce a time series of outputs, with some very complicated internal mechanisms for making sure information gets propagated effectively across time steps in a hidden state, so that when you make a decision somewhere in the future you can remember something you saw in the past, and so that you can update the network in a way that's stable and reasonable. The Transformer network is substantially more complicated, and it allows networks to do something called attending over their various inputs. Attention is a concept we can all kind of relate to: when we look at the world, we don't actually process literally every piece of data we take in concurrently; we particularly attend to whatever happens to be in the center of our field of view, or whatever we're thinking about at the moment, whatever is most urgent. Attention-based neural networks are able to basically do that: when they make some decision on the basis of a lot of data, they can select out the pieces of the data most important for making particular kinds of decisions, and that turns out to be very helpful in practice. A few other
things about deep learning; this is mostly me checking off some boxes, and if you want depth on this, I strongly recommend the Spinning Up essay, which has a bunch of links to papers and other resources with detailed information. First, regularizers. The idea is that sometimes optimizing your loss function, picking the model that gives the literally lowest value of your loss, may not be the best thing to do. You may wind up with a phenomenon called overfitting, where your model behaves perfectly with respect to the data you showed it, but does a terrible job when given any other data, because it learned a decision rule that was entirely too specific. With regularization, you trade off the loss against something that has nothing to do with performance on the particular task, but just kind of says: hey, cool your jets a little bit, don't be so avid about satisfying that objective. And it turns out that regularization actually leads to models that do a better job of generalizing to unseen data.
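A minimal sketch of that trade-off (illustrative only; the value of `lam` is a made-up tuning knob, chosen per problem in practice): adding an L2 penalty on the weight to the data loss pulls the fitted weight toward zero.

```python
import numpy as np

# Made-up data for a one-weight model y ≈ w*x, with w = 2 underneath.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=20)
y = 2.0 * x + rng.normal(scale=0.1, size=20)

lam = 0.05   # assumed regularization strength
w = 0.0
for _ in range(500):
    err = w * x - y
    # gradient of the mean squared error, plus gradient of the penalty lam * w^2
    grad = np.mean(2 * err * x) + 2 * lam * w
    w -= 0.1 * grad

# The penalty pulls w below the unregularized fit of about 2.0: the model is
# discouraged from chasing the training data too avidly.
print(round(w, 2))
```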
Then there are a couple of things that make the optimization process smoother and easier. You might use some kind of normalization technique, where there's some output in the middle of the network that it's good to adjustably rescale and shift around, and that works better than just letting the network do whatever it would have done without this kind of normalization. It's sort of spooky, and there are some legitimate complaints inside the community about whether we really understand why this helps, but it seems to, so it's worth knowing about. You might also use a more powerful optimizer than standard gradient descent; this comes up in reinforcement learning too. Actually, many of the things in these past few slides show up in deep reinforcement learning, which is why I'm bringing them up. Adaptive optimizers do something special in figuring out how to tune the learning rate, the amount by which you change each parameter at each update step, in a way that typically leads to faster convergence, so you get to the optimum a little sooner or a little more easily. There's also the reparameterization trick, but that's quite complicated, so we won't actually talk about it; it's on the slide so that you know where to look. All right, that's all the material from deep learning
that I wanted to talk about. Now, on to reinforcement learning. First and foremost, we have to talk about how you formulate a reinforcement learning problem. What does that mean? What are the pieces, and how do they fit together? We typically use the language of saying there's an agent that interacts with an environment. The agent is whatever thing is making some kind of decision; the environment is wherever those decisions are happening, and the thing that creates the consequences of those decisions. There's a loop: the environment has some state, and some measure of how good it is to be in that state, which is a reward. The agent gets to observe the state and possibly the reward. (It uses the reward for learning; whether or not it directly observes it is a subtle technical detail.) Anyway: the agent gets a state observation and a reward, makes some kind of decision about what action to take, picks the action, and executes it in the environment. Then the state of the environment changes, there's a new state, the agent perceives it, the agent acts, and so on. The goal of the agent is to figure out what decisions will maximize the sum total of rewards it will ever get. Actually, it's slightly more specific than this, and there are a couple of different formulations we can choose from, which we'll talk about momentarily, but that's basically it in a nutshell: we want to maximize this sum of rewards, and the agent is going to figure out how to attain that goal through trial and error. You just don't know in advance what the right thing to do is, so you have to try things, see what happens, see how much reward you get, and then adjust your decisions on the basis of that. Reinforcement learning is about algorithms for doing precisely that.
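That agent-environment loop can be sketched in a few lines of Python (the environment here is a toy of my own invention, purely for illustration, not anything from the talk):

```python
import random

# A toy environment: the state is a position on a number line, actions are
# -1 or +1, and reaching +3 yields a reward of 1 before the episode ends.
class LineWorld:
    def reset(self):
        self.pos = 0
        return self.pos                      # initial state observation

    def step(self, action):
        self.pos += action                   # the environment changes state
        done = abs(self.pos) >= 3            # episode ends at either edge
        reward = 1.0 if self.pos >= 3 else 0.0
        return self.pos, reward, done

random.seed(0)
env = LineWorld()
obs, total_reward, done = env.reset(), 0.0, False
while not done:
    action = random.choice([-1, 1])          # a placeholder "policy"
    obs, reward, done = env.step(action)     # execute the action
    total_reward += reward                   # sum of rewards the agent collects
print(total_reward)                          # 1.0 if it reached +3, else 0.0
```

A learning agent would replace the random choice with a policy it improves using the rewards it collects.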
Before we can talk about the algorithms, we have to introduce a bunch of terminology. For those of you who have done the work of going through the Spinning Up material online, this will probably be quite familiar, and I'm mostly going through it for the benefit of the audience that I expect might watch this in the future as a starting point, so bear with me; I'll try to go through it reasonably quickly. We have to talk about observations and actions; policies; trajectories; rewards and returns; what the RL optimization problem actually is and how we formalize it; and then value and action-value functions, and also advantage functions. There's a whole lot of stuff you have to know and unpack in order to really fruitfully progress in reinforcement learning, and these are the central pieces.

First, observations and actions. A state is something that tells you absolutely everything about the environment. The agent usually doesn't get access to the state; there's usually some stuff that's hidden from it, and what the agent perceives is called an observation. If the observation contains all the information in the state, we call the environment fully observed; if it doesn't, we call it partially observed. States, observations, and actions can be continuous or discrete. For all the problems we care about in deep RL, the observations are continuous, and the actions might be discrete or continuous.
A policy is a rule for selecting actions. There are a couple of ways you can get to this kind of rule, and we typically classify them as one of two kinds: stochastic or deterministic. A stochastic policy is a rule for randomly selecting an action on the basis of the most recent observation, or possibly preceding observations as well. A deterministic policy is just a map directly from observation to action, with no randomness involved at all. You may be wondering why a random policy would be useful at all, because randomness might seem dangerous, but it can actually be quite helpful: there are some very principled ways of optimizing stochastic policies, and it's a little bit harder to optimize completely deterministic ones. There may also be a matter of robustness, in that a little bit of randomness can sometimes make you more robust to perturbation than having learned a brittle, specific deterministic policy.

Now some concrete examples in TensorFlow, since I assume most of you met TensorFlow as your first deep learning library, and if not, PyTorch. (For those of you stuck with TensorFlow: I'm so sorry, you probably should have picked PyTorch. I know I should have. But here we are.) In TensorFlow, for a stochastic policy over discrete actions, we might first set up a placeholder for loading in observations, and then set up a multi-layer perceptron (MLP) network. This is just the most basic kind of feedforward neural network, the thing I talked about earlier: a succession of linear transforms of the inputs followed by nonlinear transforms. In this case, the linear transforms take you to something of size 64, there are two of them, and the activation is tanh, which gets you into the range of minus one to one in a nice smooth way. Then we produce logits from the output of that piece of the network. Logits are basically what precedes probabilities for particular actions: you take the softmax of the logits (if that's not a function you're familiar with, I recommend looking it up; it just exponentiates all the logits and then divides by the sum of those exponentiated logits, so it normalizes the distribution into a probability distribution, where all the entries are greater than zero and sum up to 1). So we get logits, and then we get actions by using tf.multinomial to sample stochastically, with probabilities given by the softmax of those logits. (You can ignore the squeeze; it's just there to make certain shapes work out.) In the deterministic-policy case, say with continuous actions where we want to output a vector of actions whose entries can be any real-valued number, we just go from observation to network to a final layer which is just the actions.
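The stochastic-policy snippet just described can be sketched framework-free in NumPy (the weights here are untrained random placeholders, and the input/output sizes are my assumptions; the two-hidden-layers-of-64-with-tanh structure follows the description above):

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, n_actions = 8, 4   # assumed sizes for illustration

# Untrained placeholder weights: two hidden layers of 64, then logits.
W1 = rng.normal(scale=0.1, size=(obs_dim, 64)); b1 = np.zeros(64)
W2 = rng.normal(scale=0.1, size=(64, 64));      b2 = np.zeros(64)
W3 = rng.normal(scale=0.1, size=(64, n_actions)); b3 = np.zeros(n_actions)

def sample_action(obs):
    h = np.tanh(obs @ W1 + b1)             # tanh squashes to (-1, 1)
    h = np.tanh(h @ W2 + b2)
    logits = h @ W3 + b3
    # softmax: exponentiate, then normalize, so entries are > 0 and sum to 1
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(n_actions, p=probs)  # sample an action stochastically

action = sample_action(rng.normal(size=obs_dim))
print(0 <= action < n_actions)  # True: an integer action index
```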
All right, so that's policies. Next, trajectories. A trajectory is a complete sequence of states and actions through the history of an environment: the agent starts in a state, takes an action, then there's a next state, a next action, and so on. The first state in the environment is sampled from some distribution over starting states, and afterwards the state transitions may be deterministic or stochastic, but there's some rule in the environment that, given the current state and whatever action the agent took, picks what the next state is. A trajectory is also sometimes called an episode or a rollout; you'll see this terminology used completely interchangeably, so just be aware it's out there. I'm so sorry: in every new-ish field there's a lot of terminology confusion, where different people in different areas of academia worked on it for a while and used different terms, and in the end we're left with a weird mishmash. Notation too: you'll see notation where states and actions are written s and a, and then in code you'll see places where it's x and u, because of the ancient, eternal conflict between the control theorists and the reinforcement learning theorists, and we're just stuck with it now. That aside, let's talk about
rewards and returns. A reward function maps from the states and actions (or states, actions, and possibly next states) to some number that tells you good or bad: positive is good, negative is bad, and the more positive the better. If you're a designer setting up a reinforcement learning problem, you have to pick what that reward function is going to be, so you want to make sure you incentivize the stuff you want to happen and disincentivize the stuff you don't. As a very simple example, suppose you want a robot to run forward, but you don't want it to waste a ton of energy. Maybe you give it a reward proportional to its forward velocity, but penalize it proportionally to the action magnitude, so you discourage superfluous actions.
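A sketch of such a reward function (the coefficient `energy_cost` is a made-up tuning knob, and I've used a squared action penalty as one common choice):

```python
# Hypothetical reward for the running robot: reward forward velocity,
# penalize squared action magnitude to discourage superfluous actions.
def reward(forward_velocity, action, energy_cost=0.05):
    return forward_velocity - energy_cost * sum(a * a for a in action)

# The same speed with small actions beats the same speed with large ones.
print(reward(1.0, [0.1, 0.1]))  # 0.999
print(reward(1.0, [2.0, 2.0]))  # 0.6
```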
The return of a trajectory is some cumulative reward along it. We have two ways of formulating this, and what you'll find in deep reinforcement learning implementations is that we completely conflate which problem we're trying to solve with the other. The finite-horizon undiscounted sum of rewards works when you have a finite horizon; it doesn't work when you have an infinite horizon, because an infinite sum of things might diverge unless you do some kind of discounting. In the other case, the infinite-horizon discounted sum of rewards, you have a discount factor gamma between 0 and 1, and that's how you down-weight things that happen in the future; this makes sure the quantity is reasonably well-defined. And why would it make sense to discount? Well, you'd probably rather someone tell you they're going to give you $100 today than $100 in a hundred years, right? It's just good to get it up front. Then there's the reward-to-go, which is closely related: it's a measure of return starting from a particular time step or state, i.e., the sum of rewards that will happen after that point in time.
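These quantities are easy to sketch in code (plain Python, with an illustrative made-up reward sequence):

```python
# Returns and reward-to-go for an illustrative reward sequence.
def discounted_return(rewards, gamma=1.0):
    # gamma = 1 gives the finite-horizon undiscounted sum
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def rewards_to_go(rewards):
    # rewards_to_go[t] = sum of rewards from time t onward
    out, running = [], 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    return out[::-1]

rewards = [1.0, 0.0, 2.0, 3.0]
print(discounted_return(rewards))             # 6.0
print(discounted_return(rewards, gamma=0.9))  # 1 + 0 + 2*0.81 + 3*0.729 ≈ 4.807
print(rewards_to_go(rewards))                 # [6.0, 5.0, 5.0, 3.0]
```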
Now we can talk about the reinforcement learning problem formally. We set up a performance measure for a particular policy pi, J(pi), which is the expected value of the return, for whichever formulation we've picked, under the distribution over trajectories in the environment induced by the choice of policy. What that means is: start states come from a starting distribution, transitions come from the environment's transition distribution P, and actions come from the policy conditioned on the observations of the states. We want to find the optimal policy pi-star, which
maximizes this objective. Now we have to talk about value functions. Value functions are measures of how much reward you expect to get from a particular state or state-action pair, assuming you're going to behave a certain way. We have the on-policy value function and action-value function, V-pi and Q-pi, which respectively tell you how good it is to be in a particular state, and how good it is to be in a particular state-action pair, assuming that forever after being in those places you act according to the policy pi. There are also V-star and Q-star: the same thing, except acting according to the optimal policy. (It's great to know Q-star, as we'll discuss momentarily.) Value and action-value functions are connected: the value is just the expected action value, taking the expectation over which action you might take according to the current policy, and the advantage function, which tells you how much better a given action is than average, is just the difference between Q and V. These value functions satisfy recursive Bellman equations. These are super important and the foundation of a bunch of algorithms, so they're really worth knowing and grappling with. I think they can be particularly tricky at first; I remember the first time I met reinforcement learning, I was so turned around and lost by these, because the notion of a recursive equation where the definition of a thing depends on itself was quite confusing. But it's worth hitting your head on it for a while until it makes sense. What it's saying is that the value of being in a particular place is as good as whatever reward you get for being in that place, plus all the rewards you'll ever get for all the places you'll go afterwards.
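In symbols (a summary I'm adding for reference; the notation follows the talk's, with $\gamma$ the discount factor and $P$ the transition distribution), the definitions and Bellman equations just described are:

```latex
% On-policy value and action-value functions:
V^{\pi}(s) = \mathop{\mathbb{E}}_{\tau \sim \pi}\left[ R(\tau) \,\middle|\, s_0 = s \right],
\qquad
Q^{\pi}(s, a) = \mathop{\mathbb{E}}_{\tau \sim \pi}\left[ R(\tau) \,\middle|\, s_0 = s,\, a_0 = a \right]

% Their connection, and the advantage function:
V^{\pi}(s) = \mathop{\mathbb{E}}_{a \sim \pi}\left[ Q^{\pi}(s, a) \right],
\qquad
A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)

% Bellman equations: value here = reward here + discounted value of where you land next.
V^{\pi}(s) = \mathop{\mathbb{E}}_{a \sim \pi,\; s' \sim P}\left[ r(s, a) + \gamma V^{\pi}(s') \right],
\qquad
Q^{\pi}(s, a) = \mathop{\mathbb{E}}_{s' \sim P}\left[ r(s, a) + \gamma \mathop{\mathbb{E}}_{a' \sim \pi}\left[ Q^{\pi}(s', a') \right] \right]
```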
Now, why is it great to know Q-star? Q-star tells you: if you're going to act according to the optimal policy forever after you started in this state and took this action (and we don't care what policy that action came from), how well will you do? That means that if you want to do the best you possibly can, all you need to know is what action maximizes Q-star in a particular state, and then take that action, because that's the best action in that state, and afterwards you've assumed you'll do the best you can ever possibly do. So if you have Q-star, you basically have the optimal policy. This ultimately leads us to the two different kinds of algorithms in reinforcement learning for control: in one case, we'll try to directly optimize a policy, and in the other, we'll try to find Q-star. If we want to find Q-star, we set up a function approximator for it, Q-theta, represented by some kind of deep neural network, and we want to measure how good it is at approximating Q-star. This is where that recursive Bellman equation is really helpful, because the beautiful thing is that we don't need to have acted according to the optimal policy to check how well Q-theta fits its Bellman equation. We just need a bunch of examples of (state, action, reward, next state) tuples, and if we have enough of them over enough of the environment, we can probably do a pretty good job of fitting Q-theta based on that Bellman equation, maybe via the mean squared Bellman error, and then use it afterwards for control, which means having a decision-making rule. (By the way, I apologize if my using the terminology of control interchangeably with that of reinforcement learning has been confusing; when I say control, I mean having the best policy.)
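A minimal sketch of the mean squared Bellman error (this is my own toy: a made-up two-state, two-action problem with a tabular Q in place of a neural network, to keep it self-contained; deep Q-learning would minimize the same quantity by gradient descent on network weights):

```python
import numpy as np

gamma = 0.9
Q = np.zeros((2, 2))  # tabular Q(s, a) standing in for Q_theta

# A batch of (state, action, reward, next_state) tuples from experience.
batch = [(0, 1, 1.0, 1), (1, 0, 0.0, 0), (0, 0, 0.5, 0)]

def msbe(Q):
    # error per tuple: Q(s, a) minus the target r + gamma * max_a' Q(s', a')
    errs = [Q[s, a] - (r + gamma * Q[s2].max()) for s, a, r, s2 in batch]
    return float(np.mean(np.square(errs)))

# Crude fitting loop: nudge each Q(s, a) toward its Bellman target.
for _ in range(500):
    for s, a, r, s2 in batch:
        target = r + gamma * Q[s2].max()
        Q[s, a] += 0.5 * (target - Q[s, a])

print(msbe(Q) < 1e-6)  # True: the Bellman error on this batch is driven down
```

Note that nothing here required acting optimally: the targets are computed from stored experience tuples alone.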
Now, what kinds of RL algorithms are out there? Behold: a taxonomy, which is much more restrictive than it looks. It looks very pretty and very definitive, but it's actually masking a lot of subtlety and detailed choices, and there's a lot more bleed-over between these categories than you might expect. Still, at a very high level, it's a useful picture to start with. There are two different kinds of RL algorithms: ones where we have access to a model of the environment, and ones where we don't. A model of the environment is something that tells us: if we're in a given state and take a particular action, what's going to happen next? The model predicts what the state of the environment will be afterwards. That's really useful, because if we can forward-simulate the environment, that's extremely helpful for evaluating our current policy, and for figuring out what a better action would be than the one we might want to take. If you don't have a model, you're quite limited: you have to figure out how to do well based only on the experience of your direct interactions with the environment; you don't get any other information. If you do have a model, it's potentially quite powerful, although, as we'll discuss, the methods for model-based reinforcement learning are so far not as mature as the methods for model-free reinforcement learning.

Okay, that last slide was just a ton of acronyms and maybe not that insightful, so let's talk about what these algorithms are doing. There are three key pieces in any reinforcement learning algorithm. One: you run the policy in the environment; you actually try things and get some signal, error or otherwise. Two: you reflect and evaluate whether those decisions were good ones, whether those actions were the right ones; you have to figure out how good your current policy is, so you can use that information to improve it. Three: you improve the policy. Run the policy, evaluate the policy, improve the policy. There are a bunch of different ways of doing that, and we'll go into some depth about how different algorithms go about it.
so let’s start with policy optimization
minor interlude in the chat last night I surveyed people to see what they were
interested in I asked if people were interested in
math there’s gonna be some math so first at a very high level zooming out ten
thousand foot view in policy optimization we’re going to run the
policy by collecting complete trajectories or snippets of trajectories
based on our current stochastic policy and we’re going to explicitly represent
that stochastic policy with a neural network that perhaps gives these
sufficient statistics of the action distribution or something else that we
can use to derive that and sample from it and then we’re going to evaluate the
policy by figuring out the on policy value function and advantage function
and we’re going to evaluate those things for all the states and actions in the
trajectories that we sampled and then we’re going to improve the policy by
making it more likely that we take the actions that led to higher advantage and
making it less likely that we take the actions that led to lower advantage less
likely that we take the bad actions how do we do that we’re going to have to
talk about some math now I realize there’s a chance that most of you maybe
weren’t expecting that we would be doing any kind of deep mathematical excursion
but if there’s one thing that I want you to take away from today aside from just
being excited about deep RL it’s a realization that there are some
limitations to what deep RL can currently do and that this is not really
a hundred percent done as a technology where you can just apply it to a problem
without really thinking about what it’s doing under the hood and get a good
solution it’s not a black box technology yet so if you want to try deep RL on a
problem and grapple with getting it to work you do have to kind of understand
what’s going on under the hood and that means taking a look at some of the gory
mathematical details understanding how they connect and forming an intuition
for how those details will shape the failure
modes of your algorithm so what we’ll talk about we’re just gonna talk about
vanilla policy gradient we’re gonna talk about how you derive the policy gradient
and a bunch of different equivalent expressions for it and then we’ll get to
the pseudocode for the sort of standard version of vanilla policy gradient which
includes maybe a few more tricks and details than the very most basic vanilla
version apologies for the choice of words there but all of this stuff is
critical to understanding more advanced policy optimization algorithms like
TRPO and PPO we won’t be covering them in these slides but again happy to talk
about them offline during the hackathon so in policy gradient algorithms what we
want to do is we want to find some kind of expression for the gradient of the
policy performance with respect to the parameters of the policy and we want to
just directly do gradient ascent on those parameters so we’re going to move the
parameters in the direction that increases performance and is this gonna
be easy or hard well if we just try putting the gradient onto the policy
performance we run into a problem all the parameters are down here in the
distribution they’re not inside here where we would like them if we want to
get something that we can actually use we’ll have to do some messy work to
bring the gradient inside of an expectation which we could then form a
sample estimate of so step one to getting the gradient symbol somewhere
helpful we’re going to recognize that this expectation can be rewritten as an
integral going through all of the events in trajectory space every possible
trajectory of the density the probability mass or density for that
trajectory based on that policy times the return that you would get for being
on that trajectory and now we can bring the gradient in because the limits of
this integral don’t have anything to do with the parameters and then we apply
the log derivative trick so this is a really helpful mathematical trick comes
up all over the place in deep reinforcement learning it’s basically
just this notion that the derivative of the log of something
is one over that something times the derivative of that something and we
rearrange it slightly but it lets us go from the gradient with respect to theta
of P to P times gradient log P this is great because now we have an expectation
again we have an expectation based on trajectories sampled according to the
current policy so if we have that data we can make a sample estimate certainly
so the very nice thing here is that what we did after bringing the gradient
inside the integral and doing this log derivative trick is that we now have
something which is an expectation again because we’re integrating through all
possible trajectories of the probability density associated to that trajectory
times something which is a function of that trajectory so this is an
expectation and we can form a sample estimate of it that we can use in a
practical algorithm but we’re not completely finished yet because we still
have to talk about what’s the gradient of that log probability for a trajectory
how does that depend on the parameters of the policy so let’s go back to the
picture that we had in the beginning there’s a starting state which is drawn
from some distribution based on the environment and then after that
the agent picks an action based on pi theta and it has probability pi theta of a
given s for time step 0 then the environment picks the next state
according to whatever distribution it has over next States given your most
recent action in the most recent state by the way this is something that I glossed
over earlier, slightly more formalism, details that you don’t quite need to
know but this is called the Markov property this notion that picking the
next state only depends on the most recent thing that happened and doesn’t
depend on the past before it that’s the the Markov property and you’ll find a
whole bunch of math if you go digging for it but you don’t have to for for
this at the very least so then what we have is that the probability of the
trajectory is going to be just the probability of that first state x the
probabilities of each transition and action selection that happens afterwards
so we get that expression up there at the top and now if we want to take the
gradient of its log we just pretty straightforwardly compute first
the log of that thing turns that product into a bunch of sums the gradient goes
through the sums and now all the terms that are based on distributions from the
environment have no dependence on the parameters of the policy the environment
doesn’t care what the policy is it’s just going to behave in whatever way it
does so those have no dependence on the parameters those derivatives are zero
and what we’re left with is just something which is a sum over time
steps of gradients of the policy and the beautiful thing is because we control
the policy and we have explicitly represented it as a neural network and
we can compute all of its gradients this is a thing that we can calculate so now
we’re at something where we can in fact calculate a sample estimate of this
gradient of policy performance and use that as the basis for a gradient ascent
algorithm for improving performance but it’s not good enough we’re not done yet
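That sample estimate can be sketched in a few lines; this is a toy illustration with a tabular softmax policy and made-up environment data (all names here are hypothetical, not the Spinning Up implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3
theta = rng.normal(size=(n_states, n_actions))  # logits of a tabular softmax policy

def pi(s):
    """Action probabilities pi_theta(. | s)."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def grad_log_pi(s, a):
    """Gradient of log pi_theta(a | s) w.r.t. theta (softmax case)."""
    g = np.zeros_like(theta)
    g[s] = -pi(s)
    g[s, a] += 1.0
    return g

def rollout(T=5):
    """Fake trajectory from a made-up environment, acting with pi_theta."""
    states = rng.integers(n_states, size=T)
    actions = [rng.choice(n_actions, p=pi(s)) for s in states]
    rewards = rng.normal(size=T)
    return states, actions, rewards

def pg_estimate(trajectories):
    """Sample mean of (sum_t grad log pi(a_t | s_t)) * R(tau) over trajectories."""
    grads = []
    for states, actions, rewards in trajectories:
        ret = rewards.sum()  # finite-horizon undiscounted return R(tau)
        g = sum(grad_log_pi(s, a) for s, a in zip(states, actions))
        grads.append(g * ret)
    return np.mean(grads, axis=0)

ghat = pg_estimate([rollout() for _ in range(100)])
```

Note that for a softmax policy the grad log prob in closed form is one-hot(a) minus pi(.|s), which is why its expectation over actions is exactly zero.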
yes the function capital E, so this capital E is an expectation and if
we want to form an estimate for the expectation so we’re not going to
compute the expectation exactly what we’re going to do is we’re going to see
what happens for a bunch of different trajectories that are sampled according
to the distribution specified in that expectation and then we’re just going to
average them and in the limit as we have an infinite amount of data that sample
average becomes exactly equal to the expectation yes
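That answer, that a sample average converges to the expectation, is easy to demonstrate with a toy example (hypothetical, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E[x^2] for x ~ N(0, 1), whose true value is 1, by a sample average.
samples = rng.normal(size=100_000)
estimate = np.mean(samples ** 2)
```

With more samples the estimate concentrates more tightly around the true expectation.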
absolutely absolutely you can so it is a bunch of derivatives of the final output
with respect to each one of the parameters right because there are many
inputs to this function and we’re going to have a derivative with respect to all
of them yes I’m sorry can you repeat the question yes can we tie this explicitly
to reward so inside the expectation here we have R of tau so that’s the return
measure that we’ve chosen whichever one we picked either the infinite horizon
discounted sum of rewards along the trajectory tau or just the finite
horizon undiscounted sum of rewards so that R of tau is the sum of all the
rewards in a particular trajectory and that’s actually why the the variance of
this is going to be so unnecessarily high they’re going to be a bunch of
terms in this sample expression actually just in that expectation which which
have expectation zero on average they’re zero they don’t contribute anything but
we sample them anyway and the samples will have noise on them and so we’ll
just wind up getting the noise we won’t get much signal from them so can we
eliminate a whole bunch of terms yes we absolutely can the intuition here is
that if I give you a reward in the past and you want to update the action that
you just took really what you care about for figuring out whether or not the
action that you just took was good or bad are the consequences of that action
you don’t care about what preceded it that action and what preceded it are
almost completely uncorrelated so you’re not going to get anything
by updating the likelihood of that action based on an old reward so that in
expectation is going to be zero and knowing that we can now expand out this
return measure and we’re going to get this in the finite horizon case just
for simplicity but this analysis also extends to the infinite horizon case so
we now have a sum of grad log probs of the policy times the sum of rewards
we’re gonna pull the sums out of this expression so that we can just look at a
policy update at a particular time step times a reward from a different time
step and then based on that thing that we asserted above we’re gonna drop all
the terms that are inconsequential all of those are zero and so what we’re left
with after we take away all the ones where T greater than T Prime we’re left
with this sum sum over the time steps for the policy times a sum over time
steps for rewards that goes for all of the time steps after the corresponding
policy time step and then if we bring that back in what we’re seeing now is
that we want to for each time step adjust the probability of the action
from that time step in proportion to the sum of rewards that came afterwards only
the consequences of an action will affect its update yes so it’s not that
you don’t consider past actions the sum over here in the beginning runs over all
time steps so every action is going to get some update it’s just a matter of
which rewards are used in figuring out the update for that action and it should
only be the ones that were consequences of it yes yes
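The reward-to-go quantity described here, the sum of rewards from each time step onward, can be computed with a single backward pass; a minimal sketch (the helper name is hypothetical):

```python
import numpy as np

def reward_to_go(rewards):
    """rtg[t] = r_t + r_{t+1} + ... + r_{T-1}: only the consequences of action t."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg

print(reward_to_go([1.0, 0.0, 2.0]))  # [3. 2. 2.]
```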
um well we do care about the future right because here we have a sum of
rewards after a particular time step all the rewards in the future from that time
step so so that expectation that’s just saying that an action that happens later
shouldn’t be affected by a reward that happened before it it should only only
be affected by the rewards that happen afterwards so in the in the next slide
actually we’ll see how this expression that we have down here at the bottom
connects to the value functions so what we currently have is what I’ll call the
reward to go policy gradient because what we’re doing is we’re adjusting the
probabilities of action proportionally to the reward to go what we’re going to
do now is go from that into an expression that has q pi the action
value on policy for a state action pair instead of that reward to go and this
works because you can break up the expectation so first we’re gonna pull
the sum over time steps out of this and then this expectation over trajectories
this is sort of subtle and maybe a little more math than we can go into
detail on here but I recommend that you go look on the spinning up website in RL
intro part 3 there’s a link separately to a proof about this but if we think
about the average thing that’s going to happen over all trajectories that’s
going to be equivalent to the average thing that happens over all of the cases
of something with the first T time steps of the trajectory we’re inside of the
expectation we’ve moved all the stuff that happens in the future
and we were able to move it inside past this one because this only depends on
time step T, this doesn’t depend on stuff after T, so only this part
is gonna be affected by averaging over the future and then it turns out that
that expression the average sum of rewards that you get starting from a
time step assuming that the state and action for that time step were fixed
that’s exactly equal to the action value that’s exactly saying how good is it to
be in a particular state take a particular action and then forever after
act according to a particular policy and now we have this expression for the
policy gradient at the bottom we’re most of the way through the math okay but
what is a baseline a baseline is a really important thing because it’s
another tool in our Arsenal for taking a policy gradient expression and turning
it into something which is lower variance more likely to be useful for
producing a good update to the policy and it’s also the namesake for OpenAI
Baselines well let’s say one of a
couple of things but we have an expression here at the top which I claim
is basically true which is that the gradient policy gradient is the thing
that we had before but instead of Q we subtract out some function of state some
function B of s t and I claim that in expectation it works out exactly the
same and so there’s a short proof here for that which is that if we look at the
expectation for that part of it what happens if you take the expected
gradient of the log probability of an action in a state times some function b
of st the b doesn’t have anything to do with the action so it’s a constant with
respect to this expectation so we pull it out and then what we’re left with is
an expectation over actions which we’ll rewrite and now we have it in
probability times grad log prob we’re going to reverse the log derivative
trick from earlier so this is now an integral over actions of the gradient of
the probability of that action and we can pull out the gradients
we’re just sort of reversing the procedure from earlier this thing this
integral over all possible actions of the probabilities of those actions
that’s just going to sum up to one that’s just saying probability
distribution is normalized all of the chances together have to come out to
equaling 100% if you sum them up and the derivative of a constant since that’s a
constant is nothing a constant has no rate of change
so we get zero so all of the terms of grad log prob times the baseline in
expectation are zero so we’re free to add this baseline without changing what the
policy gradient is in expectation but we can pick it in ways that are fruitful
and make the estimate better so the typical thing to do is to pick the
baseline to be the value function and this leads us to our final
sort of ultimate form of the policy gradient the form with advantage
functions and why is this good why is this good the advantage function says
how much better in action is than average why would you prefer that over
just how good the action is well let’s say you have two actions one gets you a
hundred dollars one gets you one hundred and one dollars you only sample the one
that gets you one hundred now when you’re trying to update your policy you
can feel really great about that oh man 100 is a big number I feel great
I’m gonna double down on that action. You’re acting sub-optimally: if you had
been picking 50/50, on average you would have gotten a hundred dollars and fifty
cents, and you would have realized that the advantage of taking the action that
you picked is one hundred dollars minus a hundred dollars and fifty cents:
you lost fifty cents, you should pick the other action. So you prefer to use
advantages to figure out which actions to increase the likelihood of as opposed
to just Q values all right summing it up we have these four different forms of
the policy gradient they’re all tightly connected we care about the last one but
to get to the last one we had to go through the pain but now that we’ve all
gone through that pain together you’re stronger you can go and you can
implement this and it’ll work and you’ll know why it works and you’ll feel good
about that and if it breaks you can fix it
all right so then just to sum it up this key concept we want to push up the
probabilities of good actions push down the probabilities of bad ones and also
importantly that expectation requires trajectories sampled from the current
policy so this is the concept of being on policy and reinforcement learning
that if you want to update your policy you have to use data from that policy
you can’t use data from some other policy unless you appropriately
reweight it but reweighting data is complicated and really tricky so it’s
sort of preferred to not do it unless you are trying to build something new
and cool and super sample efficient and you’re willing to spend a lot of time
and effort doing research on making sure that it actually works but ok so the
policy gradient expression gives us the policy improvement step coming back
coming back a bit oh yeah sure the question was how do we know what the
average reward would have been so that we could figure out how to make the
advantage function in the first place do we compute it as we go and and actually
that’s exactly what the next slide is about which is how do we do that
business of policy evaluation how do we find an estimate of the advantage
function which is actually good and reasonable if we just have a bunch of
data where do we get the value function that we might use to subtract out as a
baseline and the idea here is that we’re going to learn it from data and
typically it’s going to be by regression so this will be a subroutine that you’ll
find in most policy optimization algorithms where you’re going to have a
value function approximator another neural network and you’re going to at
each epoch of the policy optimization algorithm update the value network to
try to match the empirical returns that you saw so for a particular state the
value should be more or less the sum of discounted rewards that you saw from
that point on and then when you have the value
function approximator you can use that to estimate advantages and we’ll talk a
bit about estimating advantages from value function approximators on the
next slide but first you may have noticed that I pulled a fast one on you
which is that we went from in all the preceding slides dealing with the finite
horizon undiscounted case and then here in our optimization problem for learning
the value function I’ve dropped in discount factors why is that the answer
is because everyone does it this is where there’s not a particularly good
reason in my opinion that this happens but pretty much every policy
optimization algorithm that I’m aware of every every single implementation uses
discounted value functions and advantage functions but then treats the policy
optimization part as undiscounted it creates some bias it seems to work
shrug it’s perfectly reasonable to do that so it sometimes seems to be helpful
to set the discount factor to something a little smaller than one so keeping it
completely undiscounted would be gamma equals 1 for whatever reason with some
optimization problems there are some RL problems where it’s a little bit harder if
you pick gamma equals 1 than gamma equals 0.95 and I can’t say that there’s a
particularly good reason for this I would speculate that like in the
beginning of training if you pick a very high discount factor those empirical
returns will be very noisy and if you choose a discount factor less than 1
what you’re going to do is you’re going to attenuate some of the noise you’ll
bias that sum of rewards so that whatever happens soonest matters most
and if you happen to see a few positive rewards in a row then you’ll latch on to
that whereas maybe because of noise if you had really paid attention to
everything out to infinity you’d have just gotten a bunch of positives and
negatives and positives and negatives and they would have cancelled out I think
it’s OK to think about it like that yeah
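Putting the two preceding points together, here is a sketch of computing discounted reward-to-go targets and taking one regression step on a value function (a linear V stands in for the neural network; all names are hypothetical):

```python
import numpy as np

def discounted_rtg(rewards, gamma=0.99):
    """Targets for value regression: sum over t' >= t of gamma^(t' - t) * r_t'."""
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

def value_update(phi, states, targets, lr=1e-2):
    """One gradient step on the MSE between V_phi(s) = phi . s and the targets."""
    preds = states @ phi
    grad = 2.0 * states.T @ (preds - targets) / len(targets)
    return phi - lr * grad

rng = np.random.default_rng(0)
states = rng.normal(size=(20, 4))              # fake state features for 20 steps
targets = discounted_rtg(rng.normal(size=20))  # fake rewards -> regression targets
phi = np.zeros(4)
mse_before = np.mean((states @ phi - targets) ** 2)
phi = value_update(phi, states, targets)
mse_after = np.mean((states @ phi - targets) ** 2)
```

With gamma equal to 1 the targets reduce to the plain undiscounted rewards-to-go.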
yes yes after a certain point the trajectory just ends you get to time
step T and then it’s over that’s finite horizon infinite horizon
you go out to infinity alright so then how do we calculate the advantage
function given data from trajectories and a value function approximator so a
thing that I want to introduce here is this notion of n step advantage
estimates so what you’re going to do is you’re going to have a thing over on the
left side that approximates Q pi and a thing over on the right side that
approximates V PI so this thing for Q pi remember that that’s supposed to be an
estimate for how well you’ll ever do if you start in a state take an action and
then act according to the policy forever after you can just use the empirical
return the reward to go from that state as a sample estimate of the expected
value which is the Q value but in an N step advantage estimate what we’re going
to do is we’re not going to go all the way out to the end of the trajectory in
that sample estimate for Q we’re going to go n steps in and then use the value
function approximator to assume what’s going to happen for the rest of it and
this corresponds to a decision about how much bias or variance we find acceptable
in this advantage estimator so if you pick n equals 0 then your advantage
estimator in that case would be just the reward plus gamma times the value
function approximator for the next time step minus the value function
approximator for the current time step and that’s gonna be very high bias
because whatever is wrong with your value function is going to be wrong
with your advantage estimate too but it’ll be really low variance because the only
thing that’s going to have variance to it is the reward and the stochasticity
in the next state transition but if on the other hand you pick n equals
infinity so for the q approximator you just take the exact sum of rewards that
you got in the real trajectory and then at the end you subtract out the value
function at st you’re going to accept all of the variance that’s in the
environment but the nice thing is you don’t have
bias in forming your policy gradient estimator with this because in
expectation the Q part is going to be exactly Q in expectation and the B part
recall that that was a baseline that we added with a guarantee of no bias in the
policy gradient so on expectation that part falls out and it’s fine
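An n-step advantage estimate as just described might look like this (a toy sketch; it bootstraps with V = 0 past the end of the episode, and the names are hypothetical):

```python
def n_step_advantage(rewards, values, t, n, gamma=0.99):
    """A_t estimate: r_t + gamma*r_{t+1} + ... + gamma^(n+1)*V(s_{t+n+1}) - V(s_t).

    n = 0 gives the one-step (high-bias, low-variance) estimate; very large n
    gives the full empirical return minus V(s_t).
    """
    T = len(rewards)
    end = min(t + n + 1, T)       # index where we bootstrap from the value net
    q_est = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if end < T:                   # still inside the trajectory: bootstrap with V
        q_est += gamma ** (end - t) * values[end]
    return q_est - values[t]

r, v = [1.0, 1.0, 1.0], [0.5, 0.5, 0.5]
print(n_step_advantage(r, v, t=0, n=0, gamma=1.0))   # 1.0
print(n_step_advantage(r, v, t=0, n=10, gamma=1.0))  # 2.5
```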
so the bias-variance tradeoff is typically mitigated through what we call
generalized advantage estimation so this is a way of interpolating between all of
those different possible choices of n step advantage estimate where we use a
factor called lambda so this is sort of like another discount factor as the
interpolation variable and it’s a hyper parameter and you choose it in each
implementation that you make and it’s generally good to set it somewhere
between like 0.9 and 0.97 usually it’s set it and forget it in my experience I
can I can’t think of very many cases when I saw a substantial difference in
algorithm performance from adjusting it beyond that kind of narrow range if you
set it equal to one then you’ll get exactly the case of the N equals
infinity and if you set it to zero then you’ll get exactly the N equals zero
case so it’s good to kind of leave it in the range where it’s putting a little
bit more weight on the real empirical returns than the biased value estimator
but not all the way to the extreme okay at long last I give you the pseudocode
for the full vanilla policy gradient algorithm that incorporates everything
that we’ve talked about so far what we’re going to do is collect a set of
trajectories by running the current policy in the environment and then we’ll
compute the rewards to go so that we can use them as targets for the value
function approximator we’ll compute the advantage function estimates with any
method of advantage estimation but typically generalized advantage
estimation and then we’re gonna use those to estimate the policy gradient
with that we take a step of gradient ascent we might use an
adaptive optimizer like Adam to accelerate the rate at which we learn
and then we’re going to do the supervised learning problem of trying to
get the value function approximator to match the empirical returns and that’s
how we learn our value function and then we loop that’s vanilla policy gradient
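The advantage-estimation step in that loop is typically generalized advantage estimation; a minimal sketch for a single finite trajectory (assumes V = 0 past the episode end; names hypothetical):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """GAE over one trajectory: exponentially weighted sum of TD errors.

    lam = 0 recovers the one-step estimate; lam = 1 recovers the full
    empirical return minus the value baseline.
    """
    T = len(rewards)
    adv = np.zeros(T)
    next_v = 0.0          # assume V = 0 past the end of the episode
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_v - values[t]  # one-step TD error
        running = delta + gamma * lam * running
        adv[t] = running
        next_v = values[t]
    return adv
```

The backward recursion is the standard efficient way to accumulate the lambda-weighted sum in one pass.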
yeah absolutely so yes usually you will pick networks
of the same size for policy and value function in cases where the environment
is partially observed you may want to have a single core recurrent neural
network that’s going to be able to remember past information and then give
that core neural network separate outputs for policy and value function and then
you’ll train that jointly and it gets a little bit complicated because I can’t
say that there’s any good work in RL theory that I’m aware of that reasons
about how it alters performance for the final policy to be simultaneously
optimizing with respect to both objectives on the same model but that’s
what you would do in that situation so so yes typically they’ll be about the
same size unless they’re actually sharing parameters and then they’re sort
of the same model yes
does the choice of initial policy affect convergence wonderful question and sadly
in a lot of cases yeah so this is part of what goes into my saying that deep
reinforcement learning is not a technology that’s ready to be used as a
black box yet so when we do experiments in deep reinforcement learning we
typically run the same exact experiment with different choices of seed
for the random number generators and what we find is that the seed which in
the beginning of the algorithm only changes the initialization of the
policies and value functions happens to matter quite significantly some seeds
learn some seeds don’t some seeds learn much more slowly than others
and there’s no particularly good reason for it
we are generally quite heartened when we find an algorithm that appears to be
robust to initial conditions and where the
spread of the learning curves around the average is quite narrow we think that’s great and it
doesn’t quite happen as often as we would hope all right do we have any
other questions about policy gradients so in the bottom right hand corner there
that says 47 out of 63 I may have slightly miscalibrated
how long parts one and Part two were relative to the initial time slots of 45
minutes and 1 hour respectively this is by far the longer one but since we’ve
been at it for an hour I think this is a good point to take a 15-minute break and
we’ll pick back up to discuss q-learning after coffee thank you so much
we will be resuming with Josh Achiam’s introduction to RL in two minutes
hello
hi everyone we’re about to get started for the second part of intro to RL and
just as a heads up I prepared entirely too many slides for
the hour and 45 minutes that I was scheduled to speak please bear with that
because you know this is the first time we’re doing this and so I’m still
getting calibrated on what we can get through in that amount of time but
everything that I don’t cover by 11 a.m. when I hand over the mic to the next
speaker I’m more than happy to share with you later today during the
hackathon so in particular the material that I expect that we won’t quite get to
will involve an overview of what’s been accomplished recently in deep
reinforcement learning and where the challenges and limitations are and what
the research horizons look like on those limitations but before we do any of that
let’s continue our discussion from earlier and talk about the next major
family of algorithms for deep RL for control which is to say Q-learning so
there are a lot of algorithms that fall under this umbrella deep Q learning was
one of the first algorithms that really made deep reinforcement learning viable
and popular speaking from personal experience I just started my graduate
student career in 2014 when I heard about the playing Atari with deep
reinforcement learning paper I was just becoming aware of topics in AI and AI
research and that completely and totally blew my mind it was the most exciting
thing that I had ever seen that a computer could just figure out from
looking at what was happening on a screen how to behave how to play a game
how to do something that I thought required some human spark of
understanding and capability and the computer had it it was
beautiful and amazing and it made me want to study this and participate in
taking this technology all the way from where it was at that point to what it
could be in the future anyway q-learning
so back to this RL loop that we have run policy evaluate policy improve policy in
q-learning you run the policy by taking a step in the environment either
randomly so there’s going to be some stochasticity in what you do or you’re
going to act in a way which is called greedy with respect to your current Q
function approximator so remember what you’re trying to learn is Q star the
optimal action value function and if you happen to have Q star then whatever
action is the maximum or maximizes q theta in a particular state is the best
action to take um but when you don’t in fact have q theta equals Q star then the
maximizing action probably isn’t great so exploring a little bit by
acting randomly is going to help you and then once you’ve taken that step in the
environment so you send an action to it and you get back a reward in the next
state you store that transition state action reward next state in a replay
buffer you save it for later because you’re going to use it for learning how
to evaluate the policy which is to say updating q theta to try to have it fit
that bellman equation and once you have that the policy improvement step is just
looking into Q theta and saying what’s the action that maximizes it policy
improvement is basically implicit in Q-learning and we’re gonna structure our
discussion about Q learning around the original deep Q networks algorithm but
pretty much everything in this discussion is quite general for Q
learning methods because they all kind of share this common DNA of you take a
step in the environment you take some gradient descent steps on your Q
function to minimize a mean squared bellman error and you use the techniques
that we’ll describe in a minute experience replay and target networks to
stabilize the learning procedure so Q learning updates by bootstrapping so
What is that? It's this notion of how we're actually going to fit Q to that Bellman equation. We talked about minimizing mean squared Bellman error, and it's a useful picture to start with, so I'm gonna keep using that terminology, although in a few slides I'm going to tell you something completely different and ask you to ignore this and pretend you never heard it. But this is where all the papers start and this is where all the tutorials start, so it's good for you to be familiar with it. What you're
going to do to update Q is set up this loss function where you average (or sum) over data from your replay buffer D, these transitions of state, action, next state, reward, and you regress Q theta against targets y, where those y's are obtained from that Bellman backup, from that Bellman equation: the reward plus the Q value at the next time step. This is based on the Bellman equation for the optimal action-value function, so it's gonna have that max over next actions, which is to say it assumes that if Q theta were optimal, if it were Q star, then whichever action maximized it in that state would be the best one to take, and that would be the best value there. Interestingly, you don't propagate gradients through y, even though y has a dependence on the parameters of Q theta. The reasons for this are kind of mathy, so we'll get to them in a bit. Okay, getting this to work: there are two main techniques that I mentioned, experience replay and target networks. The idea behind experience replay is just that you want to use a really wide distribution of data for training your Q function. You don't want to fit it really well to a very narrow region of transition space, because if you do, it's not gonna be good anywhere else, and if it's not good anywhere else, you're not going to be able to bootstrap it to the correct values even in the places where you've been trying to fit it; you won't get anything which is actually useful for control. So experience replay helps you broaden that data distribution, fit Q well everywhere, and get something which is good for control. Now, target networks.
So: bootstrapping with a function approximator is super, super unstable. That thing we said on the previous slide, where the y's depend exactly on the current thetas: actually, throw that out. Can't do that; it won't work. If you try to do it, what's gonna happen is typically the Q values will explode. They'll go to something really large or really negative, and that'll happen really fast; you won't be able to control it, and even with reasonably well-tuned learning rates you probably won't be able to stop it. So instead, what we're gonna do is have a target network, Q theta targ, and we're gonna make sure that that network tracks reasonably closely to Q theta, but with a lag, so that it updates more slowly. That way, if you make an update to Q theta which pushes a Q value a little too high or a little too low, that doesn't immediately propagate into Q theta targ, and therefore does not propagate into the bootstrap (this y thing, we're gonna call it the bootstrap), and this tamps down on instability. Granted, you might ask: if Q-learning is so horrifically unstable, why would we want to do it like this in the first place? Why wouldn't we just differentiate through with respect to that bootstrap? And the answer is, if you differentiate all the way through, it tends to not work that well. The reason this kind of bootstrapping approach works well, as long as you take some appropriate precautions, has something to do with the theory underlying Q-learning, and we'll talk about that in a few slides, but not quite yet. You're spared for now. So, also:
another note: in deep Q-networks, the particular algorithm we're talking about right now, the action space matters a lot. In describing that bootstrap, we had a maximization over actions of the Q function. If you have a Q function that accepts as input a continuous state and a continuous action and feeds that into a deep neural network, trying to figure out the action that maximizes the Q function's output is really hard; that would be a non-trivial optimization problem, an expensive subroutine. So if we want to be able to get that max over actions, that's a case where we won't really be able to do it. DQN applies specifically to the discrete-action case, where we're able to use a network architecture that, instead of taking a continuous action as an input at the bottom of the network, emits action values for each possible action at the end of the network. A single observation goes in, and then K action values come out, where K is the number of actions, one for each action. And because there's just a finite number of them, it's very easy to figure out which action maximizes the Q value: we can compare all of them directly. So now we can talk about the pseudocode for deep Q-learning. This is relatively straightforward based on the stuff we just described. There's one thing which is a little more specific than what I mentioned, which is this business of epsilon-greedy exploration. I mentioned before that you're going to explore by sometimes taking a completely random action and sometimes taking the action which is greedy, which maximizes your current Q function approximator. Epsilon-greedy is a strategy for doing that: with probability epsilon, where epsilon is something small, you pick a completely random action, uniform over the K different choices, and with probability 1 minus epsilon, most of the time, you pick the action that's greedy with respect to your current Q function. So that's the run-policy step. Then you store that transition into the replay buffer and anneal epsilon, because over time you want to explore less and exploit more; you want to rely on the policy as it gets better.
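That epsilon-greedy run-policy step, with annealing, might look like the following. This is a minimal sketch; the function names and schedule values are illustrative, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon):
    """Pick a random action with probability epsilon, else the greedy one.

    q_values: array of K action values emitted by the Q network
    for a single observation (the DQN architecture described above).
    """
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # uniform over K actions
    return int(np.argmax(q_values))               # greedy w.r.t. current Q

def anneal(step, eps_start=1.0, eps_end=0.1, anneal_steps=10_000):
    """Linear annealing: explore a lot early, exploit more later."""
    frac = min(step / anneal_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```

With epsilon equal to zero this always returns the argmax action; as the step counter grows, `anneal` slides epsilon from 1.0 down to 0.1.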
After doing that, you're now going to evaluate the policy by learning Q star from the data, by improving Q theta to be a better reflection of Q star. That's exactly the gradient descent step we described: you sample some transitions from your replay buffer, from your experience replay memory, and you compute the bootstraps for those transitions. There's a special case if a transition ended in a terminal state, which is that we don't give it a value after that particular time step. Then we use those y values in our bootstrapped Q-value regression and update the parameters, and every once in a while, with some frequency, we copy the parameters of the main Q network onto the target network. That's the target network lagging the Q network, ensuring stability, and that's deep Q-learning in a nutshell. This algorithm kicked off everything. I mean, a whole bunch of stuff preceded it; you can't really point to any one moment in the history of a field that had no precedent. Before this there was neural fitted Q, before that there was Q-learning with linear function approximation, and there were all kinds of algorithms for trying to get things to work with nonlinear function approximators like deep neural networks. But this was the one that got a lot of people really, really excited. So anyhow, caveat emptor, buyer beware: this can break. This will not work on every problem out of the box. You'll try it in some places and it just won't work; you'll fiddle with hyperparameters and it still won't work; you'll try some tricks to stabilize it, because there are pretty much infinitely many tricks to make deep Q-learning better at this point, and some of the time that still won't work.
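The update loop in the pseudocode (sample from replay, compute bootstrapped targets with the target network, regress, occasionally sync) can be sketched roughly like this. Here a tiny tabular array stands in for the Q network; all names and sizes are mine, not from the slides, but the target logic is the same.

```python
import random
from collections import deque

import numpy as np

# Illustrative sizes for a toy problem (not from the talk).
n_states, n_actions, gamma, lr = 4, 2, 0.99, 0.5
q = np.zeros((n_states, n_actions))   # main "network" Q_theta
q_targ = q.copy()                     # lagging target network
buffer = deque(maxlen=10_000)         # experience replay memory

def update(batch_size=32):
    """One evaluate-policy step: regress Q toward bootstrapped targets."""
    batch = random.sample(list(buffer), min(batch_size, len(buffer)))
    for s, a, r, s2, done in batch:
        # Bootstrap target: no gradient flows through y, and terminal
        # transitions get no value after that time step.
        y = r + gamma * (0.0 if done else q_targ[s2].max())
        # Gradient-style step on the squared error (Q(s,a) - y)^2.
        q[s, a] += lr * (y - q[s, a])

def sync_target():
    """Every once in a while, copy main parameters onto the target net."""
    q_targ[:] = q
```

In a real DQN the table is replaced by a neural network and the per-transition update by a minibatch gradient step, but the shape of the loop is the same.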
This picture here is from a recent paper which I really love and which I strongly recommend you take a look at if you get interested in seeing some analysis of failure modes for algorithms in deep RL. It's called "Deep Reinforcement Learning and the Deadly Triad." The deadly triad is a set of traits that deep reinforcement learning algorithms might have which are known to occasionally cause divergence and to create substantial obstacles for theoreticians who would like to come up with algorithms that have provable convergence guarantees. The deadly triad consists of function approximation, off-policy learning, and bootstrapping, which are exactly the three things deep Q-learning relies on: we have function approximation in the form of neural networks, we have off-policy learning in the form of experience replay, and we have bootstrapping in the form of using the target network with a one-step backup as the regression target for Q. And so deep Q-learning works a whole lot of the time, and then some of the time it just doesn't. In this set of experiments, what the researchers did was examine deep Q-learning and a few variants of it, ablating on whether they would include a target network. So here, this Q does not have a target network; the regression target it uses is based exactly on Q theta, not Q theta targ. Then they tried it with a target network, and then tried a couple of other tricks that relate to how you use the target network to either estimate the value in the bootstrap or select the action in the bootstrap, tricks that are known to potentially help. They looked at all these different cases on many different Atari games as the experimental test bed, and they clipped the rewards in the environments into a certain range so that they knew exactly, mathematically, what the ceiling for possible real Q values would be (they chose it to be a hundred). Then they looked, over all the experiments they ran, at how often the maximum absolute learned Q value in an experiment exceeded the threshold which they knew was the real, true maximum possible Q value. And the answer was: a lot. So this shows that Q-learning without target networks is very unstable, in that a lot of the time you will get this divergence phenomenon, and even as you include tricks that make it progressively more stable, you'll still expect to see divergence every now and then. So we're gonna dive into a little bit of math now, to get maybe some intuition for why this is the case, what deep Q-learning algorithms are really trying to do, and how that translates into the algorithm, or doesn't. We're going to start by taking the operator view of the Bellman equation.
The optimal Bellman operator, T star, is a map from Q functions onto other Q functions, and the value of T star Q for a particular state-action pair is given by the Bellman equation that we saw before. The optimal Q function is the fixed point of T star: Q star equals T star Q star. That's great. And T star has this special thing about it, which is that it's a contraction map on the space of Q functions. Contraction maps have some very special properties that we're gonna talk about now. Yay.
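To make the operator concrete, here is a tiny made-up tabular MDP (my own toy example, not from the slides) where repeatedly applying T star converges to Q star, exactly the fixed-point behavior just claimed.

```python
import numpy as np

# Toy deterministic 2-state, 2-action MDP (purely illustrative):
# action 0 stays in the current state with reward 0; action 1 moves
# to the other state with reward 1.
gamma = 0.9
P = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}     # next state
R = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.0, (1, 1): 1.0}  # reward

def bellman_optimality_operator(q):
    """One application of T*: (T*Q)(s,a) = r(s,a) + gamma * max_a' Q(s',a')."""
    out = np.zeros_like(q)
    for (s, a), s2 in P.items():
        out[s, a] = R[(s, a)] + gamma * q[s2].max()
    return out

q = np.zeros((2, 2))
for _ in range(200):
    q = bellman_optimality_operator(q)   # successive iterates contract to Q*
```

For this toy MDP the fixed point works out analytically to Q star(s, 1) = 10 and Q star(s, 0) = 9 for both states (always moving earns reward 1 per step, so the optimal value is 1 / (1 - gamma) = 10), and the iterates land there to within floating-point tolerance.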
So, the main thing about a contraction map is this idea that if you have two points and you apply the contraction map to both of them, they'll be closer together, with respect to some distance function, after you've applied the map than they were before. Expressed mathematically: we have some norm, some distance, and the norm of f(x) minus f(y) is less than or equal to some constant factor beta times the norm of the difference between x and y, that distance between x and y. When that beta is less than one, we have a contraction: it's getting closer together, it's shrinking. Why do we care about contractions? Because they have unique fixed points, and you can get to them by just repeatedly applying the operator to any initial point. This is something called the Banach fixed-point theorem, if you're interested in going on Wikipedia and finding something more precise than however I've typed this up. But in a nutshell (let's forget about uniqueness for a moment, and show at the very least that repeatedly applying this operator gets you to a fixed point): if we look at a sequence of points x, and we have a contraction map f with modulus beta, and each point in the sequence is just generated by applying f to the previous point, and we look at the distance between successive iterates, what we see is that it's shrinking as a function of the iteration number. So in the limit, as the iteration number goes to infinity, that distance will shrink to zero. It will converge; repeatedly applying it will get you to the fixed point. And T star is a contraction on Q functions. So if
you could represent the entirety of the Q function, that is to say, the Q values for every state-action pair in the entirety of the environment (which, for all the environments we care about in deep reinforcement learning, you cannot easily do; you can only do it with function approximation, which is to say you're going to generalize, and whatever you choose for the value of one state-action pair will have some influence on another, you can't completely separate them; but putting that aside), then if we applied the operator T star to that function, we would get a new function Q which is closer to optimal than the one that went in, and if we applied it over and over and over again, we would eventually get to Q star, the fixed point of T star. This is value iteration, a classic algorithm in reinforcement learning. Before function approximation, before deep, when you had environments with a discrete number of states and a discrete number of actions, and you could represent the Q values in a table with one element for each state-action pair, you could compute this exactly and use it as a way to get to Q star. Now, in the problems that we live in, when you're trying to solve high-dimensional, complex video games and high-dimensional, complex strategy games, you can't use the table; you use a function approximator. And now your problem is that you can't compute all of T star Q, and even if you could, you probably couldn't find a choice of parameters that would allow you to exactly represent it. So if you want to do this kind of value iteration, you have to do it approximately, and this is roughly what Q-learning algorithms with function approximation try to do: they push the parameters of the network in the direction such that you move Q theta towards T star Q theta. Sometimes this works and sometimes it doesn't, because when you go to the function approximation setting, this operation is not necessarily going to be a contraction on the space of Q functions; you might have lost that property. If you did, expect divergence; in fact, expect things to blow up horribly. If you preserved it, or if you've done enough tricks to stabilize it, things will work pretty well. In my experience, Q-learning algorithms and their variants tend to be extremely sample efficient when they work, which is quite desirable, and it's very nice that they can recycle off-policy data, because on-policy methods sadly have to throw away tons of it. But, one last
point on Q-learning. What you normally see in deep learning algorithms and deep RL algorithms is the paradigm of: there's an objective function, you optimize it, and you find the model that optimizes the objective. In Q-learning, don't be misled into believing, however many times you see it, that the mean squared Bellman error is really the thing you're optimizing. You change that function every time you change the target. The thing you're really doing is this sort of approximate value iteration: you're trying to apply an approximate operator which is going to get you to something better; you're not trying to minimize a loss. That's not to say there aren't variants of these kinds of algorithms that do involve well-defined loss functions. There's a whole family of algorithms called gradient temporal difference methods which, if you are theoretically inclined and willing to go down a deep, deep rabbit hole, I recommend you check out; talk to me if you want references. Also, in the Spinning Up key papers doc, I believe there's a book in the bonus section for classic RL papers and review papers: Csaba Szepesvári's book on RL algorithms from 2010, which recaps a lot of this really great old stuff, including gradient temporal difference algorithms. So I recommend you check that out if you're interested. Yes? I'm actually working on some research on that right now; talk to me offline. Yes, so this thing: yes, it's called a temporal difference error, because it is the difference in the Q value based on the next time step versus the current time step. Yeah. Yes, absolutely: what is the
difference between off-policy and on-policy? On-policy algorithms have updates which are based on the expected values of things where the distribution in that expectation depends on the current policy. So if you want to form a sample estimate of the thing in the update equation, you first have to run the current policy, collect interactions with the environment on the current policy, and use those samples for forming that sample estimate. That's on-policy, because all the data you use has to be generated by the policy you're using at the time. In off-policy methods like Q-learning, what you do when you make an update is use experience which might have been generated by older policies, not the current one. The current policy you could think of as being implicitly expressed in the Q function approximator's current value; but many steps ago it was different, and you got whatever data you got from interacting with the environment, you put it in your replay buffer, and then many steps later you still sample those states and actions from that replay buffer to help you form your new update to the current Q function. When the data was generated by a different policy, that's off-policy. Yes?
In what sort of situation would we use deep Q-learning, what's a use case for it? There's a fabulous use case, actually: Facebook recently released a paper on their RL platform, called Horizon, which they used to train neural networks with deep Q-learning for making decisions about when to send you push notifications. So actually, DQN is in your phones right now. Okay, then let's proceed to the next part, which is model-based stuff. I'm going to be pretty brief about model-based stuff. There's a very wide variety of different model-based algorithms, and we're not going to drill down into them the way we drilled down into policy learning and Q-learning, but we will give a relatively brief overview of some of the more salient points and a few algorithms that I think are particularly interesting. So, back to the loop: run policy, evaluate
policy, improve policy. Where do models fit in? Recall that a model of the environment lets you predict what's gonna happen next, and you can use that for pretty much any of these. While you're running your policy, before you take an action, you can stop and imagine what's gonna happen if you try many different things; you can create partial rollouts that you can use to evaluate your different choices, and then you might pick something different than you would have otherwise. So that's where it can appear in running the policy. In evaluating the policy, you can use that same kind of approach of simulating look-ahead data to help you get a maybe more stable backup for your Q function, or use some kind of Monte Carlo tree search-style algorithm where you propagate Q values back and figure out something like an average-case Q value. And for improving the policy, you can regress your policy network, if you have explicitly represented one, towards whatever the outputs were from that look-ahead planning process. So if you have a model, it's very powerful: you can use it a lot of different ways, you can embed it pretty deeply into RL. The problem is that models are very hard to learn, and you usually don't have them. Let's say you have just made a wonderful, brand-new, complex physical robot. Unless you have a lot of hours to spare and control theory expertise, you probably do not know how to fully characterize it and have a simulator model which is going to be accurate in any reasonable way, certainly not accurate enough for training in simulation and then directly applying that simulation-trained policy in the real world. You may want to try learning a model from data, but this can be quite tricky, although there are some really exceptional success cases. Yes? Uh, yes, you could make that argument.
Sorry, the question was: can you make the same argument for value functions? I would say that hardness to learn, in this case, should be interpreted more as "has the research community figured out really robust, reliable, standard methods for doing it yet," not necessarily whether there's some intrinsic quality of hardness. Finding the correct model is a supervised learning problem if you have enough data. Part of the problem in RL is that you usually don't have enough data, and you would have to get it by interacting with the environment, and there may be areas in the environment, very critical to decision making, which you've just never observed yet. So imagine that you are in a giant maze. You can try to learn a model of the maze as you go, but until you've seen the exit, your model is not going to be very helpful for navigating, except perhaps to help you avoid repeating places that you've been to already. But yeah, in practice, models have so far tended to be hard to learn. So let's look at maybe one
case study in ways you can use models. This is the case of planning and/or expert iteration. The basic idea is that you're going to use your model from the current state to look ahead into the future and help guide your decision about what action to take. In planning, you might explicitly base your decision about what action to take on whatever the output from that look-ahead process is, plus your current value function. In expert iteration, you're not only going to do that, but you're also going to have an explicit representation of a policy, which you'll try to improve by regressing it towards the output from the look-ahead process. As a case study, consider AlphaZero. AlphaZero is an algorithm which has succeeded at achieving superhuman performance in a wide variety of complex two-player, fully observed strategy games, particularly chess, go, and shogi. This was a successor to AlphaGo, the algorithm that beat human grandmasters at go, and AlphaZero at the algorithm level is sort of beautifully simple. You have a neural network that emits two things: a probability distribution over moves to play, p, and a value, v, that says basically whether or not you're gonna win or lose. And you learn this with a very simple regression approach: you move the value function to be more like whatever the true outcomes from games were, and you update the policy by using a model-based look-ahead operator to figure out what a better policy would have been, based on your current policy and value function, and you just move your current policy towards that. Then there's also some regularization. Very straightforward. And the look-ahead is done with Monte Carlo tree search, which is just stochastically considering different possible outcomes and then aggregating data, after having done partial rollouts down the game tree, to figure out what would have been the best thing to do. So this is one model-based approach. Now, this required having a perfect model of the environment, and in games like chess or go this is feasible, because you can fully express the rules in a way which is easy to compute and forward-simulate; you don't have to learn anything from data, and you also don't have anything which is partially observed, so your model doesn't have to do anything fancy to keep track of what's going on in the background. Very straightforward.
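The AlphaZero-style objective just described (value toward outcome, policy toward the search distribution, plus regularization) might be sketched like this. The notation and function signature here are my own guesses for illustration, not DeepMind's actual code.

```python
import numpy as np

def alphazero_style_loss(p, v, pi_mcts, z, c=1e-4, theta_sq_norm=0.0):
    """Sketch of the regression targets described above.

    p        : network's move distribution for one position
    v        : network's predicted outcome (scalar)
    pi_mcts  : distribution produced by the MCTS look-ahead
    z        : actual game outcome (e.g. +1 win, -1 loss)
    c, theta_sq_norm : stand-in weight-decay regularization term

    All names are illustrative assumptions, not from the slides.
    """
    value_loss = (z - v) ** 2                             # v toward outcome
    policy_loss = -np.sum(pi_mcts * np.log(p + 1e-12))    # p toward search dist.
    return value_loss + policy_loss + c * theta_sq_norm
```

In a real system this scalar would be averaged over minibatches of self-play positions and minimized by gradient descent on the shared network parameters.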
And this kind of approach can be very, very powerful. But the problem is that most conditions are not quite as ideal as this. So another family of approaches is where you use the model for policy evaluation. Let's say you have learned a model, or perhaps you're given one, but more often than not for these algorithms you're trying to learn it concurrently with experience. You learn some models, and then you have the agent, quote, "dream" in them: the agent samples a bunch of fictitious trajectories inside of the learned simulator and uses those as the basis for a policy improvement step. Algorithms that are like this: there's Model-Ensemble TRPO, and, I want to say, model-based meta-policy optimization. Then, instead of using this for computing advantages and a policy-optimization-style improvement, you could use it for Q-learning as well, where perhaps, instead of forming the target based on the bootstrap, which might be inaccurate in particular regions of state-action space that you haven't visited, you use the model to simulate what the bootstrap might be in those cases and use that as your backup for Q-learning. That's an approach called model-based value expansion. With these algorithms, the gain that you get from doing this is ultimately in sample efficiency. What
happens in normal deep RL is that you use tons and tons of data from interacting with the environment to try to improve your policy or your Q function, and you make progress at whatever pace. When you use the model, and you offload a whole lot of the improvement steps onto experience collected in the model, that frees you from having to collect that amount of experience in the real world, as long as your model is good enough. If your model's not good, this won't be very helpful; but if it is good, and if you only needed a little bit of data to train it, then you can get a lot of mileage out of it, and your overall RL algorithm will have used fewer interactions with the real environment than otherwise. This is great for cases where interacting with the real environment is very expensive. For instance, if you want to train something on a physical robot, that can be an expensive process: the robot might be slow, the robot might break, the robot might have all kinds of issues where it's difficult to get it to act or difficult to reset it, and you probably don't want to spend that many man-hours waiting around for the robot to finish its learning procedure. So if you can offload some of that time into simulation, it makes life better.
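As a rough illustration of offloading experience collection into a model, here is a toy sketch of sampling fictitious trajectories inside one. The "model" below is a made-up deterministic function, not a learned network; in the algorithms above it would be fit to real transitions and the dreamed data would feed a policy-improvement or Q-learning step.

```python
def model(s, a):
    """Stand-in learned dynamics model: predicts (next_state, reward).
    Purely illustrative; a real one would be trained on real transitions."""
    return (s + a) % 4, float(a)

def dream_rollout(s0, policy, horizon=5):
    """Sample a fictitious trajectory inside the model instead of the
    real environment, returning (s, a, r, s') tuples for learning."""
    traj, s = [], s0
    for _ in range(horizon):
        a = policy(s)
        s2, r = model(s, a)
        traj.append((s, a, r, s2))
        s = s2
    return traj
```

For example, `dream_rollout(0, lambda s: 1, horizon=3)` walks the toy state around without touching any real environment, which is exactly the sample-efficiency trade being described.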
Yes? Is that what you would apply for self-driving cars? That's a good question. I'm not actually all that familiar with cases where self-driving cars have fruitfully made use of deep RL. That's not to say they don't; I just don't know. I would imagine that in self-driving cars it's probably more a matter of collecting data from experienced human experts and then using that data as the basis for learning a behavioral policy. But I'm also happy to go through this later and see what we can find in the literature. Yes? Would model-based RL be more geared towards transfer learning? I think it could potentially be quite helpful. Certainly, when we think about trying to get robotics to transfer from, say, simulation to reality, we want to make sure that the model used in simulation is high-fidelity with respect to reality, and if that's the case (you can think about sim-to-real as sort of a model-based approach), then perhaps it's gonna be very helpful.
All right. Then there's this other, completely orthogonal way of using models, which I'm really fond of because it's just sort of weird: you actually take the model and embed it inside of a model-free agent. The model receives inputs from the environment and uses them with some internal process of perhaps imagining some futures, and then transforms whatever representation it has of those futures into something which becomes side information to the model-free agent. So you train the model separately from the agent; the module that provides information based on the model to the agent is sort of decoupled from it, except that how the model-free agent processes that information is based purely on the model-free learning. This is an approach called Imagination-Augmented Agents. I think this is really interesting and really neat. I'm not aware of a whole lot of follow-up work from when this came out, I want to say last year or the year before, but I just think that, because it is so different from the other model-based approaches, that's cool. Whenever there's something different, it's cool. All right, that takes me to what was originally intended to be the end of part one, but is now the end of both parts. Thank you so much.
At this point I would like to turn over the mic and the stage to Matthias Plappert, who is a researcher on the robotics team at OpenAI, and he'll be presenting on the robotics team's work on learning how to do complex manipulation with deep reinforcement learning on a real physical robot. Great, thank you. We have a computer? Sweet. Yay, I think it works. Okay, thank you. Cool. So, hey everybody, my name is Matthias. As Josh mentioned, I'm super excited to be here and talk a little bit about what robotics at OpenAI is doing, and the talk I'm going to present is called "Learning Dexterity." As I mentioned, this is basically the effort of the entire robotics team over many months, so everything I'm talking about is not just my work; this is the robotics team's. Okay, cool.
So let's maybe start by talking a little bit about what robotics at OpenAI is actually trying to do. The ultimate goal, I guess, that robotics at OpenAI has is some form of general-purpose robot, and I think this kind of picture illustrates it very well. We have human-like robots today, and we know that humans can do a very, very large number of different jobs and skills. That can include things like cooking; it can include actual labor in some form of agricultural setting; maybe it's very precise kinds of things like surgery, or building things and putting things together, this kind of stuff. Ideally, we would like to have a robot that has a similar level of dexterity and a similar level of, well, general-purpose-ness, if you will. The way robotics looks right now is very different from that: we have these kinds of very specialized robots. An example I think is good is the Roomba, which is in the upper left corner here. It can clean your house, but it can only clean your house; it can only vacuum your house. Similarly, you have things like self-driving cars, which are to some extent also robots that are very good at one thing, which is driving themselves, but they cannot do anything else. And the robots that are more versatile and more complicated are very often either controlled by humans (an example would be surgery: we have robots that can assist humans in that, but they're always controlled by a human operator, which is the surgeon), or we have more complicated robots in factories, but those are typically just programmed to basically blindly execute a given trajectory. Someone sits with the robot and figures out how to do a certain process in a factory, and the robot is very, very stupid and has no idea what's going on. So the question, of course, is how can we step away from that paradigm, and how can we have robots that work in the actual physical world and are aware of their surroundings? And given that this is the Spinning Up workshop, which is concerned with RL, it's not so
surprising that we think RL may be a good approach to that and we know that
RL works really well in certain domains so I’ve picked out two examples here
that probably most people have seen on the left side we have alphago zero
playing against Lisa at all and a game of Go and as you know alphago zero won
this game in fact I think one almost all games
that it has ever played and the follow-up versions of alphago zero
beyond beyond human capabilities when it comes to playing go similarly we have
dota 2 so this is some of the work that the dota team at opening AI has been
doing for a while we have this door abort called opening at five that is
very very good at playing the game dota 2 which is a 5v5 multiplayer game and it
is approaching like professional levels so it’s it’s consistently winning and
can semi-pros and we are already playing against some pros in fact we’ve done
that last summer at the International unfortunately we have not yet won
against those pros so the question is how does this work in robotics and of
course yes like a lot of work in this in robotics it’s not like we we are the
only ones doing this and I just like to give a bunch of examples that I think
are kind of illustrating what people are typically doing today
The first approach here is somewhat recent, from 2017, and I think it looks really cool: you can see the agent is even able to use certain tools, in this case a hammer; it can open doors, it can do all sorts of things. The unfortunate thing is that all of this, cool as it looks, is only in simulation, and ultimately in robotics it doesn't really count if it's only in simulation, because you want the physical robot to do something; otherwise it's not very useful. So the other approach that people have
been taking is to train on the actual robot itself. This is some work from 2016 where people have been doing dexterous in-hand manipulation: the goal of the robot here is to manipulate this tube filled with coffee beans, for some reason, into a target orientation, and they do all the learning on the actual robot. That, of course, has the advantage of not having to do any form of transfer: because you learn on the robot, you know exactly how the robot is going to behave, and once you have a good policy, you're done. The downside, of course, is that, well, you have to run on the actual robot, so it breaks a lot on you, it's very slow, and you can't really scale this up unless you get a lot of robots, which is actually something that people are doing. This is the approach taken by Google: in typical Google fashion, scale it up, so just get a lot of robots and let them run for two months in parallel, and then you can suddenly train on the robot, because, well, you have 20 of those doing it in parallel, and it can do very meaningful stuff. In this case they have learned to grasp arbitrary objects out of this kind of bin that they have sitting here, and this is actually a very impressive demo: bin picking like this is actually very hard. The thing is still that obviously this does not really scale all that well, because this is a relatively simple task, yet you need 20 robots going for two months, and you also just have to babysit the robots all the time: you'll have to repair them when they break, you'll have to reset the environment when certain objects fall out of the bin, and all of this kind of stuff. It's just a lot of work. So what we're trying to do is combine the benefits
of those two approaches: training in simulation and then transferring to the physical world, which is called sim-to-real, and I'll be talking a lot more about this. But before I do that, I'd like to introduce you to the task that we actually have in mind when we do our research. We decided to do dexterous in-hand manipulation, and the reason for that is that it is, first of all, very hard to do, and second of all, it is something that we're interested in, because we know that our hands are these universal end effectors: human hands are very versatile in what they can do, they can be very dexterous. You can do cooking, or you can operate on a human, if you're a surgeon at least, but you can also do very heavy lifting, and you can use tools made for human hands, and these kinds of things. So this is basically the motivation for why we chose this kind of hand and this kind of task: because it's hard, and because it's also ultimately useful for the general-purpose robot we would like to build.
The reason why it's hard, I think, is summarized relatively well on this slide. We use a hand called the Shadow Dexterous Hand, which is depicted in this picture; it has 24 joints and 20 actuators. What this means is that your policy, at every timestep, has to produce an action for 20 individual actuators, and they actually have to coordinate: you have to have different joints work together to do certain things. So it's a really high-dimensional control problem that's typically well out of reach of what traditional control approaches can solve. As I mentioned, ultimately we want to run this on real hardware, and so we have to work with the real hardware and all its flaws and issues. This includes things like noisy and delayed sensing; that's just a fact of physical hardware systems: they will not have perfect information, and they will have delays and certain quirks that you have to deal with. The other issue
that comes out of this sensing is that you actually have to handle partial observability. In simulation you have perfect knowledge of everything that's going on, because, well, it's your simulation and you can just read out the current state, but on the physical system you can only use what you can actually sense, and obviously certain things, like the friction of the system, cannot be directly observed. And last of all, this is actually super hard to simulate. The reason for that is that you have a lot of contacts going on: if you have something in your hand, you kind of constantly touch it, and contacts are notoriously hard to model accurately, first of all; and then the hand itself is also incredibly complicated: it's tendon-actuated, which means that you have tendons pulling, and that just causes a lot of unmodeled things in your hardware that you have not modeled in simulation. Cool. So, as I
mentioned, we set out to solve this problem with our sim-to-real approach: we train in simulation, and then we transfer to the physical hardware. While this sounds very easy, it is not, because the transfer problem, as you'll see, is actually not easy to overcome. But before we talk about that, let's have a look at what we can do in simulation and what the policy that we train looks like in simulation; I think this also illustrates the task at hand, so that you can later understand what the robot is trying to do. As you can see, you have this block with colored faces, and the task is to rotate this block into the desired target orientation, which is depicted as this semi-transparent additional block on the right-hand side. So now it's trying to bring up the blue face; yeah, it got it, and then it moves on to the next goal. As you can see, it involves coordinating its fingers, it has to use its palm, it's kind of using gravity to let the block roll, and even in simulation this is not super easy to learn. The hardware itself looks like this: this is
the cage, as we call it. It houses all sorts of things: in the middle, of course, you have the Shadow Dexterous Hand, which is the robot itself, and it's surrounded by quite a lot of these PhaseSpace tracking cameras; we have 12 of those in total. What they do is provide you with relatively accurate sensing in Cartesian space: we have LED markers on the hand itself, so we know where the hand is, and we also have LED markers on the object, so we know where the object is. Those cameras basically sense the light of the LEDs, and since multiple cameras can see the same LED marker, you can do triangulation and recover the position in space from that information. We also have an alternative setup,
because, as I mentioned, ultimately we'd like to have something that's more general, and having a motion capture system is not very real-world-like. So we also have RGB cameras, three regular RGB cameras surrounding the scene, and they can also be used for sensing; in fact, they can be used for pose estimation of the object, so you don't even have to have any special kind of sensing on the object itself: the cameras can do it for you. The reason why we have three is, first, so they can recover depth information, and second, so they can also work around occlusions, because when the object is in the hand, from certain angles you sometimes cannot see it, since it's covered by the hand. This is how it looks up close when we run things: as you can see, we have the hand with the block in its palm, and in this case it's the block that we use for PhaseSpace tracking, so you can also see the LEDs on it; this is simply much easier to use when testing these algorithms. So we have these two setups. All right, so the big question, of course, is how do we
do the transfer? I showed you a video of the policy doing its thing in simulation, and I showed you the physical hardware, so we have all the building blocks, but how can we actually transfer it to the physical robot? If you just train it in simulation, it will not work at all; that's the short version. I'll be showing some numbers for that as well, but you can believe me if I say the transfer problem is really the core issue that we're dealing with here. The approach that we're taking is relatively straightforward, actually. We use two main techniques: the first one, of course, being reinforcement learning, to learn the actual control policy, and the second technique being domain randomization, to make sure that the learned control policy actually transfers to the physical system. I'll be speaking about both of those in a little bit more detail, so let's get started
with domain randomization. This is a technique that has been used for a little while; a pretty popular paper when it comes to this is from 2016. In this paper, what they did is they learned to fly a drone, and the way they approached this is they trained only in simulation, using these randomized buildings: you can see they have a lot of different rooms, the textures are very different, so the walls, ceilings, and floors all look different, and they train a drone to fly in all of those rooms. What they then do is take this drone, which has only ever flown inside a simulation, and show that it can actually fly in a completely different, real room, simply because it has seen all of this variation during its training. From its perspective, what happens is that the policy thinks, "oh, this is just another randomization; it's kind of weird, but oh well, I know how to handle it," and so it flies in the actual room. And
people at OpenAI have been using similar approaches as well. This is some work from my colleague Josh Tobin: what he has been doing is using domain randomization for grasping. This uses a robot called the Fetch, and you'll see a better picture in a moment, but it's basically a simple robot arm with a parallel gripper at the end, and what he would like to do is pick up these objects that you see in these randomized scenes, basically using the same approach: he's randomizing all sorts of things, like the looks of the objects, the shape of the objects, the background, the color of the table. As you can see, he can then use this training to transfer to the physical robot, even though it has never seen the actual physical table. What was pretty surprising in this research is that it turns out you don't even need photorealistic rendering: as you can see, this looks not realistic at all, it's pretty basic computer graphics, and still it transfers to the physical world. So the important thing here is that you have
this variety, not necessarily realistic environments. Yeah? So the question is whether the two approaches that I showed are using vision to learn a policy. In this case, I think it's actually not using the vision to learn a policy directly; I think it's instead just predicting the location of the object, and then there's a policy that can grasp it from that. Some other work in this domain, which I
think is equally important, is physics randomization. This has been done by Jason Peng, who was an intern at OpenAI in 2017, and he's basically using the same idea of randomizing, but now for physics instead of visual appearance. It's kind of hard to visualize what's going on, but what the policy sees in training are worlds that are just different: maybe they have different masses, maybe they have different table frictions, maybe the robot itself behaves differently, and so on and so forth. What he was able to show is that this, again, is sufficient to train strictly in simulation and then transfer to the physical robot. The task at hand here is again with the Fetch robot, and it's trying to move this puck to the goal location, which is marked in red. On the left-hand side you see a policy that has been trained with those physics randomizations, and on the right-hand side one that has been trained without. As you can see, the one on the left-hand side does a pretty decent job: it's relatively precise, and it can push the puck where it wants it to go. The one on the right kind of freaks out: it shakes very violently, in fact the building was shaking when he was deploying this, and it cannot really do the job. The reason is that it has overfit to the simulation, which simply is not fully accurate, even though it's calibrated to be close to the robot, and so it doesn't generalize to the actual physical world, whereas the one with physics randomization does. Okay, of course, so that's domain randomization
in a nutshell: both the visual randomization and the physics randomization. Yeah? Yeah, it's not very realistic, honestly. I mean, it's realistic in the sense that it's physical: if you randomize too much, your simulation will become unstable, because you've set certain parameters such that they cannot make sense anymore. But it's not very realistic; the masses will be very high sometimes, so it's hard to move the puck. It's more about diversity, again. Yeah, okay, cool.
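As a rough illustration of the per-episode randomization just described, here is a minimal numpy sketch. The parameter names and ranges are invented for illustration; they are not the values the speaker's system actually uses:

```python
import numpy as np

# Hypothetical randomization ranges -- the names and bounds here are
# illustrative assumptions, not the published Learning Dexterity settings.
RANDOMIZATION_RANGES = {
    "object_mass":     (0.5, 2.0),   # multiplicative factor on nominal mass
    "table_friction":  (0.7, 1.3),   # multiplicative factor on friction
    "joint_damping":   (0.3, 3.0),   # multiplicative factor on damping
    "action_delay_ms": (0.0, 30.0),  # additive actuation delay
}

def sample_randomized_params(rng):
    """Draw one set of physics parameters for a new simulated episode.

    Multiplicative factors are sampled log-uniformly, so that e.g. halving
    and doubling a mass are equally likely; additive terms are uniform.
    """
    params = {}
    for name, (low, high) in RANDOMIZATION_RANGES.items():
        if name.endswith("_ms"):  # additive parameter
            params[name] = rng.uniform(low, high)
        else:                     # multiplicative factor
            params[name] = np.exp(rng.uniform(np.log(low), np.log(high)))
    return params

rng = np.random.default_rng(0)
episode_params = sample_randomized_params(rng)  # resample at every episode reset
```

In a real setup, each rollout worker would apply such a sample to its simulator before every episode, so no two episodes share exactly the same physics.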
So I'll now speak about our approach. What I previously talked about was mostly other people's work, even though some of them are also on the robotics team, but this is the Learning Dexterity approach that we took. Again, remember the goal is to have the Shadow hand rotate an object in hand, and to start off, I think it makes sense to give you the overview of the entire system, and then we'll dive into some of the details after that. As I mentioned, everything we do is only in simulation, so we never see the actual physical robot until we run on it. The way it works is that we collect a lot of data in simulation: we have many, many simulations running in parallel, which is depicted here in box (a), and all of those are randomized, which is visualized by them having different visual appearances, but think physics randomizations as well, so the friction and the masses will also be randomized. Using this collected data, we basically end up training two different networks: one of them is a policy, and the other is a vision network, because we'd ultimately like to run this from vision alone, without the PhaseSpace system. The policy network is what is depicted in box (b) here, and the way it works is that it takes the observed robot state, which is the position of the five fingertips, so 3D coordinates in a space of 15 dimensions in total, so it knows where its fingertips are, and then also the pose of the object, which means the position and the rotation in space. This information is then fed into an LSTM policy, so
it's a recurrent policy, and it produces the next action, and we train this in simulation using reinforcement learning. The second network that we have is actually distinct: they are not end-to-end; these are two networks that we train separately. It's a vision network, and the way the vision network works is that it takes three different images: remember we had these three RGB cameras surrounding the scene, so images rendered from the perspective of those, but again only in simulation. Then, using a convolutional neural network, it predicts the pose of the object from those images, and again, this is only trained in simulation. When it comes to actually deploying this, to transfer, as you can maybe guess, we combine those two systems to get what we ultimately would like: you use the actual cameras to sense the pose of the object using the vision network, and then, having the object pose and the fingertip locations, you use your LSTM policy to produce actions. That allows the robot to basically see what is going on and react accordingly, despite only being trained in simulation. Yeah, potentially; honestly, we
have mostly used this approach because we knew it worked from previous research. It is almost as accurate as PhaseSpace, and PhaseSpace is very, very accurate. I think if you spent a lot of time you could probably develop something with more traditional methods, I don't question it, but we would like to have something that's more general, again, and having a convolutional neural network do it seemed like the most general approach we could take. Yeah? Yeah, it's kind of interesting: ideally you would just use whatever the robot has as joint sensing; it should know what its own joint positions are. As it turns out, the sensors in the Shadow hand use Hall-effect sensing, which is a magnetic kind of sensor, and they interfere quite a lot, so if your fingers are close together, you will actually not know where they are. That's the reason why we don't use it; we would like to use it, but it turned out to be not precise enough for what we ultimately wanted to do, so we couldn't actually rely on it. But yeah, you're right: this is more of a workaround; ideally the robot should just tell us what the joint positions are, and then we wouldn't need the fingertip positions. No, it actually has very limited information; it's very surprising that it works like that. Yeah, very good question; there's a lot of debate
about this. I don't think it does; we have some indication that it doesn't. In fact, it seems to help: the performance seems to improve across the board. We have certain ways of measuring sim-to-real transfer, and when we randomize more, we tend to get better performance in all the environments. So I don't think it's compromising anything; I think it's more of an adaptive policy. But there are people who disagree, so it's currently a little bit unclear. Okay, cool.
So, as I mentioned, we need to randomize, and of course we use appearance randomization. This is only for the vision network, and it's basically what I've described before, just for our setup: you can see we have three different cameras showing the same scene, and we randomize this scene quite heavily. The robot changes its color, the background changes its color; importantly, the block itself stays mostly the same, because it actually has that color, you can't really randomize the dye, but we change the material of the block as well, so it looks slightly different. Then we of course have that vision network, which again is relatively straightforward: it takes those three camera images, then uses convolutions, a ResNet architecture, and a spatial softmax to process them, and then simply produces the final object position and object rotation, so the pose of the object. This is trained with supervised learning, because in simulation you actually have perfect ground truth, which is another very convenient thing: you precisely know where your object is, and you don't have to actually sense it at all.
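The spatial softmax mentioned here is worth a small sketch. This is a minimal, single-channel numpy version of the general idea (a soft argmax over a feature map); a real network applies it per channel on learned convolutional features, and the details here are illustrative:

```python
import numpy as np

def spatial_softmax(feature_map):
    """Collapse an (H, W) feature map to an expected (x, y) position.

    A softmax over all spatial locations turns the activations into a
    probability distribution; the output is the probability-weighted mean
    of a coordinate grid in [-1, 1], i.e. a soft (differentiable) argmax.
    """
    h, w = feature_map.shape
    flat = feature_map.reshape(-1)
    probs = np.exp(flat - flat.max())   # subtract max for numerical stability
    probs /= probs.sum()
    ys, xs = np.meshgrid(np.linspace(-1.0, 1.0, h),
                         np.linspace(-1.0, 1.0, w), indexing="ij")
    x = float((probs * xs.reshape(-1)).sum())
    y = float((probs * ys.reshape(-1)).sum())
    return x, y

# A sharp activation in the top-left corner maps to coordinates near (-1, -1).
fmap = np.zeros((8, 8))
fmap[0, 0] = 50.0
print(spatial_softmax(fmap))
```

The appeal for pose estimation is that the network's spatial feature maps are converted directly into a small set of coordinates, which a final fully connected head can then turn into position and rotation.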
And this is what the model actually sees, which I think is very interesting, because it looks very different from the randomized scenes, and yet the network generalizes to it, simply because it has seen enough variety that it's okay with yet another variation that's kind of weird, but still within distribution, in that sense. When it comes to the physics randomizations that we use, we randomize quite a lot of things as well: we have things like object dimensions, for instance, we have things like masses, obviously, and then mostly things about the robot itself, like the way we actuate the robot, the damping within its joints, and all of this stuff, and the reason for that is that these are actually very hard to measure. Another neat thing is that in this physics randomization you can actually account for your uncertainty: the object dimensions we know with relatively little uncertainty, because we can just measure the dimensions of the block, but things like the actuation we know much less about, and so we widen the randomizations for those. Another kind of cool thing is that we randomize the gravity vector, which may seem a little bit weird, but it basically amounts to this: when you mount the hand, it's not perfectly parallel to the floor, it will be slightly angled because of imperfections, and by randomizing the gravity vector you get this effect as well, that it's sometimes slightly angled, and it turned out to be actually very useful.
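The gravity-vector trick can be sketched in a few lines. The 2-degree bound below is an assumed value for illustration, not the setting the speaker's team used:

```python
import numpy as np

def randomize_gravity(rng, max_tilt_deg=2.0, g=9.81):
    """Return a gravity vector tilted away from straight down by a small
    random angle, mimicking an imperfectly mounted hand.

    The tilt magnitude and azimuth are drawn uniformly; max_tilt_deg is an
    illustrative assumption, not a published parameter.
    """
    tilt = np.radians(rng.uniform(0.0, max_tilt_deg))
    azimuth = rng.uniform(0.0, 2.0 * np.pi)
    direction = np.array([
        np.sin(tilt) * np.cos(azimuth),   # small horizontal component
        np.sin(tilt) * np.sin(azimuth),
        -np.cos(tilt),                    # mostly straight down
    ])
    return g * direction

rng = np.random.default_rng(0)
gravity = randomize_gravity(rng)  # set this in the simulator at episode reset
```

Because the policy never knows the exact tilt, it has to work for any small mounting error, which is exactly the situation on the real rig.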
and then we of course also have noisy observations and noisy actions simply
because it’s a rare reality of the of the physical system the policy is very
very simple so what it gets is the noisy observations so that’s five fingertip
positions and the poles of the object and the goal so it’s knows what it wants
to do and then we normalize a little bit so this is just making sure that things
have a zero mean and unit variance and then use one fully connected value layer
and one lsdm to produce the actual distribution and from that we sample and
then perform perform that on the robot so it’s a relatively shallow and
relatively small network over all the more so yeah
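The normalization and noise injection just described might look like this minimal numpy sketch; the noise scale is an assumed value, and the running statistics use Welford's online algorithm rather than whatever the actual system implements:

```python
import numpy as np

class ObservationNormalizer:
    """Track a running mean and variance (Welford's online algorithm) and
    rescale observations to roughly zero mean and unit variance."""

    def __init__(self, size):
        self.count = 0
        self.mean = np.zeros(size)
        self.m2 = np.zeros(size)  # running sum of squared deviations

    def update(self, obs):
        self.count += 1
        delta = obs - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (obs - self.mean)

    def normalize(self, obs):
        std = np.sqrt(self.m2 / max(self.count - 1, 1)) + 1e-8
        return (obs - self.mean) / std

def add_noise(rng, obs, sigma=0.01):
    """Additive Gaussian observation noise; sigma is an illustrative value."""
    return obs + rng.normal(0.0, sigma, size=obs.shape)

rng = np.random.default_rng(0)
norm = ObservationNormalizer(size=3)
for _ in range(1000):
    raw = rng.normal(5.0, 2.0, size=3)   # raw observations: mean 5, std 2
    norm.update(add_noise(rng, raw))
print(np.round(norm.mean))               # close to [5. 5. 5.]
```

In training, the same noise would also be applied to the sampled actions before they are executed in the simulator, so the policy never sees a perfectly clean signal.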
They only come in through the simulation; they cannot be observed directly. Sorry? They are simply set in the simulation, so the environment has been changed, but the policy cannot sense this directly: it has to infer it, basically, because on the physical robot it also cannot sense it; we don't know what it is on the physical system. So what we think it ultimately ends up doing is some form of system identification: while it's running, it's implicitly inferring certain information about the environment and then using this information to adapt itself accordingly. Yeah? Sorry, I couldn't hear. Yeah, so we add Gaussian noise to the observations and to the actions. All right, so I think I'm running a little bit late, actually; how bad is this?
Okay, then we have to hurry a little bit. Cool. So, the distributed training: let me speak about this, and then I'll show a video. The distributed training, I think, is very interesting, because we use basically the same system that the Dota team uses as well. We have a very large-scale system, and the way it works is that we have rollout workers that generate a lot of experience, and then we have an optimizing machine that uses this information to update its policy; we use Proximal Policy Optimization for that, so an on-policy algorithm, as I think Josh explained earlier today, and I think it's kind of cool that we use the same system as they do. Let me skip over some things, but I think I want to show this: this is when it's running on the physical robot. As you can see, it's using vision, so there are no markers on the actual object; the robot hand is doing all of this; this is not cut in any way, and it is not sped up. Again, the goal is depicted in the right corner here, so it will try to get the E face front and the N face up top, and it will get to 50 successful rotations in this case. So it can do quite a lot of those, and it can run on the physical system. If I have enough time, one final thing
that I think is actually very interesting is that it learns certain strategies that happen to have names. We have things like pivoting, where you use two fingers to create a rotational axis and then rotate around it, and things like finger gaiting, and the reason why they have names is because they are used by humans as well and have been studied very well. They emerge automatically in our case: we have never shown the robot what a human would do; it discovered this itself. The reason why they come up is simply because it has a human-like morphology: it has a human-like hand, and it just turns out that these strategies are equally useful for humans and robots, but they have been "rediscovered," quote-unquote, which I think is a really cool thing, so I wanted to mention that. And we have some qualitative results that show that randomizations are very important: if you don't randomize, you get no successes; if you randomize, you do. It turns out memory is very important, so you need an LSTM; you cannot simply have a feedforward policy. And you need a lot of experience: for the final policy we use a hundred years' worth of data. Imagine doing that on the physical robot; probably not such a good idea. But we can get away with it because we use simulation, so we do all of this in 50 hours. And I think with that I have to close. All right, thank you.
Great, thank you so much, Matthias. We're going to switch out the slides, and then please welcome to the stage the leader of the safety team at OpenAI, Dario Amodei. All right, just a minute to get the slides right.
Very good thing that you're ensuring that computers in the future will not be as malicious. So, I work on a team at OpenAI that thinks about making AI systems do what humans want them to do, which is, you know, very central to OpenAI's mission, and which we think of as something that our focus on distinguishes us from other organizations. We think it's very important, particularly as systems get more capable, to ensure that they, both in a narrow and a broad sense, benefit society. So, this workshop is called Spinning Up in Deep RL, so it's useful to step back and think about what RL has accomplished in the last couple of years and where it's going. This slide is actually out of date, we should add a couple of things to it, but: playing games like Go; from about a year ago, multi-agent behaviors, where you can use RL and self-play to train agents to sumo-wrestle each other off a pad; we are able to play competitively against professional players in Dota 2; the robot results, which you just saw; and we should probably add, just in the last week or two, the results we've seen on StarCraft, which is in some ways similar to Dota but just a different kind of game with different kinds of properties. That shows that these techniques are really pretty general and are advancing pretty quickly. So if
we step back and reflect on where things are going, there are some properties we could point out of these RL agents that are becoming more and more true, that were not true five years ago. We have systems that have extended interactions with complex real-time environments; they have a very high level of autonomy and speed. You can imagine systems like this in the real world being used to make decisions faster than humans can intervene, or in more complex ways than humans could hope to understand: regulating the economy or the financial system, managing large networks of computers; these are the kinds of things that, as RL technology matures, it will be better and better able to do. And these systems, unlike supervised learning systems, and unlike, in any interesting way, the simple RL systems of a few years ago, are able to teach themselves and discover their own strategies, and in many cases they discover non-trivial strategies. Just like we saw with the robot recapitulating a lot of strategies that humans use, we see in Go and Dota and StarCraft a lot of human strategies that have names, which the RL system discovers and recapitulates; but it also sometimes discovers
strategies that a human would never have thought of. So if we look at what these properties mean together, one thing they mean is that the connection between us as designers, specifying what we want the system to do, and what the system actually does, gets weaker: in theory, if everything is done right, the system does what we want, but that rope is longer, it's more frayed, it's more tenuous than for less autonomous systems that we've designed in the past, and there are many ways, relative to simple computer systems or machine learning systems like supervised learning, for these systems to go wrong. So a couple of years ago, several people, most of whom now constitute the OpenAI safety team, started thinking about this. We're worried about current systems, worried about tomorrow's systems; eventually we're worried about building general
intelligence, and what that will mean for the world, and about making sure that those systems are safe. So we wrote kind of a position paper, and this started us thinking about the directions, and how to even think about this problem of: do systems reliably do what we want them to do? The general framework and division we came up with was: okay, let's narrowly scope the problem. We're not speaking about wider societal impacts, although those are also important, but just narrowly: the designer had a clear thing they wanted the system to do, and then the system gets trained, it gets deployed, it goes through some long process, and the actual system fails at this catastrophically. We divided it up into a couple of things. One is that you're
giving the system some direction, some objective function that it learns from, like the reward in RL. There are ways for that to be subtly wrong, and you can get spectacularly wrong behavior if that happens. You might have the right objective function, but your system has problems with robustness: it doesn't generalize well, it exhibits unpredictable behavior as it's learning, it does dangerous things even if the final policy it's going to learn makes sense. And then, as a reminder, all of this exists on top of a software implementation that has bugs in and of itself, so these, (a) and (b), are new, but they're layered on top of the general unreliability of software. A useful way to think about it: let's put (c) aside, because it's not really a machine learning problem, it's just a reminder that this is layered on top of existing problems. But a crude analogy we can make is that it's a bit like the simple statistical concepts of bias and variance: a better objective function, that's about reducing bias, making sure you aim at the right target; robustness is about making sure that you cluster narrowly around the target and always get what you're intending to get. So we're interested in both problems;
because I have limited time, I’m going to talk about our work on getting the objective function right. I think the OpenAI safety team does more of that, relative to other teams that think about these problems, say at Google Brain or DeepMind, so I’ll mostly talk about that; but increasingly, and maybe I’ll have a little bit of time to talk about it at the end, we’re also thinking about the robustness direction and how these two things interact. Just to be clear about what we mean: this video has been widely circulated, so I apologize to
people who are already familiar with it but you know about about a year and a
half ago we you know we were we were training lots of flash games using RL
and you know there there happens to be this boat race game so you know I I just
set lots of lots of games running with with a reward function so the way this
boat race works is supposed to go along the course and you’re supposed to that
you’re supposed to finish the course but the way the reward function works and
it’s hard to reach in and write a different reward function is you get you
get points for you know these markers along the way that are mostly along the
course but it turns out there’s this this little Lagoon in the corner of the
course where you can go around in circles and get more and more power-ups
and that turns out to get you a faster rate of power up to naturally finishing
the course there’s nothing wrong with RL here the system did what it was supposed
to do but it identifies the weakness of the connection between a reward function
in the final behavior the reward function that you specify that you may
think corresponds to some behavior that you want may in fact correspond to very
different behaviors and you get no feedback on that other than just finding
out what the system does right when I first trained this I trained along with
a bunch of other games two days later I looked at this I’m like
what what in the world has what in the world is this doing it doesn’t make any
sense and then I thought about it alone I’m like oh of course that makes sense
and you know so the more powerful the system is the more autonomous it is the
less of human is paying attention to it the more potential there is this is like
you know generate dozens of these examples but
you know robotic system where we forgot to make the table totally fixed it has a
high mass but it’s not fixed turns out to be easy it’s hard to send the send
the puck exactly to the point you want it to be it’s easier to send the puck
observe if it’s gonna be a little to the right or a little to the left and then
nudge the table so that it hits it exactly it’s very it’s very clever it’s
a correct solution to the problem but the problem was not the right problem so
the general approach that we’ve hit on, and we’ve been pursuing this strategy for about a year, a year and a half, is that this training loop is too long. The human at the beginning says, here’s a mathematical reward function, go optimize this; then you look back at the end of training, and you might get the right thing, you might not. If you don’t, you have to go back to the beginning, or maybe the system is already doing something dangerous. So maybe we should have humans be involved interactively in the training process. When we train humans to do things, it’s not just, here’s your goal, go off and tell me what you did two weeks later. So is there a way that we can use a human to decide what the reward function is in a continuous way that’s more reliable, more naturalistic, so that the system ends up imbued with human goals and values, but is able to act faster and bigger than human scale once it’s trained, once it knows what the human wants and does it? One example of this is that instead of RL we can learn from demonstrations, but that kind of has the same problem: a human demonstrates, the AI system copies it, and it’s hard to do better than the human, it’s hard to course-correct, it’s hard for the human to say, you should be doing this instead of that; and traditional RL has a loop that’s too long. So the first effort we did in this direction is what we called deep
RL from human preferences. The idea is: I want this thing to do a backflip, and it’s hard to mathematically specify the reward function for a backflip. We tried, by looking at all the individual joint angles, and it turns out that just gives you something very awkward-looking. What we do instead (and this is now running for the second time) is that a human looks at the behavior of the system and says which of these two is more like a backflip than the other. The system starts out by just acting randomly; it has a random reward function, and the human gives it feedback on what is more like what the human wants. The RL system has a reward predictor, and it tries to fit a reward predictor consistent with what the human says the human prefers; in the background it’s running a whole bunch of copies of the RL environment, and those copies optimize the reward function that it learns from the human. The human only ever has to give feedback on a very small fraction of the AI system’s behavior; the human doesn’t have to see everything it does, just has to give enough samples to give the policy an idea of what the reward function should be. Another way to put it is that the human trains the reward function, and the reward function trains the RL system. What I just said can be pictured like this: the grey part is the standard setup for reinforcement learning, where you have an RL algorithm and the environment, they exchange observations and actions, and there’s a reward that comes from the ether, that was ultimately specified by a designer but isn’t thought about as being part of the problem. Here, what we have is that the reward starts out being completely random, and the human sees examples of the agent’s behavior and feeds them to a reward predictor. So the reward predictor is changing and improving and adapting over time, and the RL system is both learning from the existing reward function and adapting to changes in the reward function.
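As a rough sketch of that loop: the snippet below fits a small ensemble of reward predictors to pairwise human preferences with a Bradley-Terry style logistic loss, and picks which pair to ask about by ensemble disagreement. Everything here is an illustrative assumption rather than the actual implementation from the paper: the linear reward model, the toy featurization, and the simulated "human" that simply prefers the higher-mean segment.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(segment):
    # Toy featurization of a trajectory segment (here, just an array).
    return np.array([segment.mean(), segment.std()])

class RewardPredictor:
    """Linear reward model r(s) = w . phi(s), trained on pairwise
    preferences with a Bradley-Terry (logistic) loss."""
    def __init__(self, dim=2):
        self.w = 0.1 * rng.normal(size=dim)

    def reward(self, segment):
        return features(segment) @ self.w

    def update(self, seg_a, seg_b, pref, lr=0.1):
        # pref = 1.0 if the human preferred seg_a, 0.0 for seg_b,
        # 0.5 for "these look about the same" (weighted equally).
        p_a = 1.0 / (1.0 + np.exp(self.reward(seg_b) - self.reward(seg_a)))
        grad = (p_a - pref) * (features(seg_a) - features(seg_b))
        self.w -= lr * grad

# An ensemble trained on overlapping subsets of the feedback gives
# semi-independent predictors; we query the human on the pair where
# the ensemble disagrees the most (a crude uncertainty estimate).
ensemble = [RewardPredictor() for _ in range(5)]

def disagreement(seg_a, seg_b):
    return np.var([m.reward(seg_a) - m.reward(seg_b) for m in ensemble])

# Candidate segment pairs; the simulated "human" prefers higher means.
candidates = [(rng.normal(size=8), rng.normal(size=8)) for _ in range(50)]
for _ in range(200):
    seg_a, seg_b = max(candidates, key=lambda p: disagreement(*p))
    human_pref = 1.0 if seg_a.mean() > seg_b.mean() else 0.0
    for model in ensemble:
        if rng.random() < 0.7:  # each member sees a subset of the data
            model.update(seg_a, seg_b, human_pref)
```

The key design point is the two nested learners: the human's comparisons train the reward model, and in the real system an RL algorithm optimizes the reward model's output in parallel; the RL half is omitted here to keep the sketch small.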
So we did several versions of it in our paper, and we found that a simple active learning technique helped relative to random selection; it didn’t help by that much, but it helped. The idea is that you train an ensemble of reward predictors on subsets of the data, which gives you semi-independent predictors, and you pick examples where the predictors disagree, meaning those are parts of the space, or situations, where the reward predictor has more uncertainty and would like more feedback from the human. You can go much more sophisticated in that direction: the system could ask the human, what am I doing that’s wrong, what am I doing that’s not clear; the human could say to the system, I’d like you to produce some examples of this; and then it becomes much more like a human-teacher-to-pupil teaching process. A lot of what we’re doing is going in that direction, but we had to start somewhere. So, imitation learning has the following limitations: when you do imitation learning, except for noise reduction, which is usually a small effect, you can’t perform better than the human does. As we’ll see in some future tasks, there are cases where learning from preferences allows you to perform better than the human does. The reason is that with imitation learning you just do what the human does, whereas here you learn what the human wants, and once you learn the reward function, you can do it better than the human. Consider something like Go: even if I don’t know how to play Go well, I can teach you the rules of Go, and then you can do RL on the rules of Go and get much better than me. Or you can just copy my moves; if you’re just copying my moves, you can never do better than me. If I teach you the rules and then you use RL to learn how to play, you can then, in principle, do better.
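That Go analogy can be shown with a toy numerical example: a tiny bandit with a suboptimal "teacher". Pure imitation is capped at the teacher's return, while fitting a score to pairwise preferences (a Bradley-Terry style update, as before) and then optimizing it recovers the best arm. The bandit, the reward values, and the simulated human are all made-up assumptions for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

# A toy 4-armed bandit. The true (hidden) per-arm reward; the "teacher"
# always pulls arm 1, which is decent but not the best.
true_reward = np.array([0.2, 0.7, 1.0, 0.4])
teacher_arm = 1

# 1) Imitation: copy the teacher. Performance is capped at the teacher's.
imitation_return = true_reward[teacher_arm]

# 2) Preference learning: the human never demonstrates the best arm,
#    but can compare pairs of outcomes ("I like arm i's result better").
#    Fit a per-arm score with a simple Bradley-Terry update.
scores = np.zeros(4)
for _ in range(2000):
    i, j = rng.choice(4, size=2, replace=False)
    pref = 1.0 if true_reward[i] > true_reward[j] else 0.0  # simulated human
    p_i = 1.0 / (1.0 + np.exp(scores[j] - scores[i]))
    scores[i] -= 0.05 * (p_i - pref)
    scores[j] += 0.05 * (p_i - pref)

# 3) Optimize the learned reward: pick the arm the model scores highest.
best_arm = int(np.argmax(scores))
preference_return = true_reward[best_arm]
```

Here the "RL" step collapses to an argmax because the toy problem is a bandit; the point is just that the learned objective, unlike the copied behavior, supports doing better than the demonstrator.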
Another difference is that you tend to get better sample efficiency, you can come up with strategies that a human wouldn’t have thought of, and there are many tasks a human just can’t do. Actually, this backflip task is very hard for a human to demonstrate; you’d have to get a VR setup. And if we look at the tasks of the future, say I want to defend a large corporate IT network and respond to threats in real time: that’s something where I can’t get training data from a human. I’m asking the machine to do things that a human can’t do, which is what we ultimately want AI systems to be able to do. Does that kind of answer the question? Yeah. So, we have an option in this paper for basically “I don’t know”; I think we had separate options for “I don’t know,” which just throws out the data, and “these two look about the same,” in which case it weights them equally in the predictor. So that’s easy to incorporate. I think ultimately the communication needs to be in terms of language, not in terms of clicking left or right, and that will make a richer space for doing things; saying “I don’t know,” or “show me some other examples,” or “these things aren’t comparable at all” will become much more common. The nice thing about this is that, given an environment, without changing
the code at all, only changing what the human provides as feedback, you can get totally different behaviors. In about half an hour, a human can train this RL system; this is the simple Atari Enduro game. I can train it to do the usual thing, which is to race ahead of all the other cars, but I can also train it to go at exactly the same speed as the other cars. And when it does that, it’s able to do it very effectively, staying exactly even with the other cars, which isn’t easy; you have to match their speed exactly. Exact same code; the human just provided different feedback. One thing we show is that if we don’t give the system the rewards for Atari games, we just hide them, then humans giving feedback, basically trying to get the system to get the highest score it can, works really well. On the right of each panel, those colored bars that are moving represent how much reward the system thinks it’s getting, or just how good it thinks a given action is. If you look at the Breakout case on the left, when the ball hits the paddle, instead of the ball going to the bottom, it says, yeah, I got a lot of reward from that; same with Pong. And when it surfaces to get oxygen in Seaquest, there’s a very high reward level. So the predictors seem to correspond to what a human would say is good behavior, which is not surprising, because a human is training them. We did a bunch of experiments. With fixed-reward Atari games, your goal is just to do as well as you would if you knew the reward; you’re hiding the reward from yourself and trying to learn the reward from a human. Most of the time it does almost as well, but there are actually cases where it can do better.
In Enduro, the algorithm we used, A3C, has trouble learning because of the sparsity of the reward, but a human actually helps to shape the reward. In Enduro you have to use the controls to get up to a certain speed in order to get any reward at all: you can start to move and the environment doesn’t give you any reward, and then you have to keep moving faster and faster to get reward, and some algorithms never figure that out. But the human will basically say, okay, you went ahead, you made progress, that’s better than when you weren’t moving, and so, little by little, with just a few feedback points, the human can lead the system and shape the reward. And there are actually cases, like the curve for Enduro in the bottom right, where you can do better than a standard RL algorithm did, even though you had less information: instead of knowing the right reward function, you just had a human indicate the reward function. This also works for a bunch of simulated robotics tasks; we haven’t really tried it in the real world. Relevant to the question about demonstrations, we actually followed this up with an effort
combining human feedback with demonstrations. What that does: there are some tasks a human can do, but we’d like the RL system to do them better. We can initialize from human demonstrations, with the AI system copying them, but then, on top of that initialization, we run RL from human preferences. There’s no programmatic reward function anywhere; it’s entirely learning from humans. In the first step the human demonstrates and the AI system copies it; in the second step the human says, it would be better if you could do it this way. And again, the second step allows you to exceed human performance, or do tasks that humans can’t do. The human says, this is as well as I know how to do it; the AI system copies that; then the human says, okay, I wouldn’t be able to do this myself, but if you move back and forth really quickly and shoot those two ships, that will be better than if you didn’t. The AI system is capable of that, and so it can bootstrap itself beyond human capabilities. More recently,
and we don’t have any work out on this yet, but I think we will soon, we’ve started applying this to natural language. In the last year or so there has been a lot of progress on large language models, like OpenAI’s GPT and Google’s BERT, where you just take a big corpus of text and train a big transformer model to predict the next word or the next token; that allows you to generate very coherent text, and it can also be fine-tuned to solve a lot of linguistic tasks. So one idea is: can we fine-tune that via RL from human preferences? I have a language model, it’s seen a lot of text; some of it is happy, some of it is sad (five minutes left? yeah), some of it is formal statements or informal statements, some of it is jokes. The language model maybe has some idea, in its internal representation, of the difference between those things, but if I just sample from the language model, it just gives me random samples of stuff. So can I push this language model in directions, to produce behaviors, that only a human can specify, that can’t be specified programmatically? Things like statements that rhyme, or statements that are in iambic pentameter. Could you make a system that, by the logic of learning from human preferences, is a better poet than any human could be, or something like this? Or that makes very positive-sentiment statements, where it’s hard to find enough positive-sentiment statements to copy from? That’s the direction we’re going in. I think a long-
term vision for it would be a system that has an ongoing dialogue with a human: the human asks it to do something really complicated, like planning and executing a mission to Mars; the system clarifies and asks for instructions while it’s learning and while it’s doing the task; and we make sure that pathological solutions to the problem don’t happen. One way to get to Mars really quickly is to escape from Earth and propel yourself by setting off a bunch of nuclear explosions behind you, back at Earth. That would work; that would get you to Mars. There was a project called Project Orion in the 1950s, although the plan was to detonate the nuclear weapons when they were far away from Earth. But this is not a solution we would favor. How do we make sure that AI systems don’t do things like that? Cool. So, I’ve only talked about a subset
of what the safety team is working on, but we have around 15 members here. Some of these efforts were done in collaboration with DeepMind and various academic groups, and we have a number of interns and faculty affiliates. The safety team is continuing to hire, and we’re interested in further advancing these and other areas. Thank you so much. Sorry, oh,
hello everyone! We are now at the conclusion of today’s morning talks, but before we break for lunch I would like to invite all of the volunteers who are joining us today from OpenAI, Berkeley, and the New Haven school to please come up to the front. As we proceed into the afternoon hackathon and breakout sessions, these will be the faces that will be around to help you, the people you should ask questions to. These people are all talented researchers or contributors or engineers in this space. Many of them are employees of OpenAI, and we also have, I think, the only person here who’s not currently employed by OpenAI but was previously employed by OpenAI. So if you want to pick our brains about what it’s like here, what we do, and why it matters, please feel free. Can we just have everyone take maybe a sentence to introduce themselves? Sure. I’m Daniel; I work on the safety team as an ML engineer, working on the language fine-tuning project from human feedback. Yeah, I’m Matthias, I’m on robotics. I’m Ethan; I’m on the safety team, working on model-based RL and safe exploration with Josh. I’m Carl; I’m on the games team, primarily studying transfer learning and procedurally generated environments. My name is Dylan; I’m a PhD student at UC Berkeley, and I mainly work on preference learning. I’m Amanda, and I’m on the policy team here at OpenAI. I’m Mary; I work on the safety team on safe exploration. All right. Another thing that I want to say: thank you all so much for being here today. Something that I hope we can do is really make this a useful experience for all of you, and I hope that over the course of the day you give us feedback about what you find helpful and not helpful, and what it is that you’re hoping to get out of this experience, so that we can figure out how to help you get there. Thank you so much, and please enjoy lunch!